C++, Unicode and localisation

I’ve been wrestling with some Unicode issues lately, and trawling the internets for answers has revealed to me just how many people don’t really understand what’s going on inside the string classes they’re using. I was one of them.


The first thing you need to read – before going any further – is Joel Spolsky’s post on this:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

If you haven’t read that, go there now. Right now. I’m not kidding – stop reading this and go read that. Come back afterwards. No wait, don’t start reading more Joel Spolsky – come back! Oh, there you are.


Okay, so that was a simplified discussion of the history of text encodings and what they mean for most developers. And it ended with “might as well use std::wstring, it’s native”. Well, here’s the kicker: not necessarily.


C++ is fantastic because it’s cross-platform. It can compile for Windows, Unix, OS X, iPhone, Android, DS, PlayStation – you name it, if it supports native code there’s probably a C++ compiler for it. Unfortunately, not every C++ compiler treats everything the same. There are a lot of ‘holes’ in the specification that let the compiler decide what it wants to do with certain data types and how big it wants to make them (e.g. how big is an int? The standard only promises at least 16 bits; most compilers today make it 32, and types like long and pointers change size between 32-bit and 64-bit platforms). There needs to be room for compilers to pick the sizes that work best on the native hardware and operating system, but this leads to issues when you want one piece of code that compiles for all platforms.
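
If you’re curious what your own toolchain decided, a quick check like this will tell you (the comments show typical desktop values, not guarantees):

    #include <iostream>

    int main()
    {
        std::cout << "int:     " << sizeof(int)     << "\n"   // 4 on most desktop compilers
                  << "long:    " << sizeof(long)    << "\n"   // 4 on 64-bit Windows, 8 on most 64-bit Unix
                  << "wchar_t: " << sizeof(wchar_t) << "\n";  // 2 on Windows, 4 on most Unix
        return 0;
    }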


On Windows a std::string uses a single byte per character, which essentially means on its own it can hold UTF-8 (or a legacy code page) but not the UTF-16 the Win32 API speaks – so you need a std::wstring (wide string), with its 2-byte wchar_t, to get any Unicode-based foreign language support through the native API. On Unix a wide string uses 4 bytes per character (UTF-32), which can lead to strings being memory hogs and requiring conversion to and from the UTF-8 strings used natively by the operating system.
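
A consequence worth seeing for yourself: because each element is a single byte, std::string’s size() counts bytes, not characters. For example, with UTF-8 text (the escapes spell out the two bytes of ‘é’):

    #include <iostream>
    #include <string>

    int main()
    {
        std::string s = "h\xC3\xA9llo";   // "héllo" in UTF-8: é takes two bytes
        std::cout << s.size() << "\n";    // prints 6, even though it's 5 characters
        return 0;
    }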


There are basically three ways around this:

  • use the time-honoured tradition of selective #defines and macros to compile with std::string on some systems, std::wstring on others, etc. The Microsoft header <tchar.h> does something similar (using char or wchar_t arrays instead of STL classes)
  • define your own basic_string<uint16_t> template (or similar) that always uses the same size for a character. String literals become harder to use then, and need a macro hack to work properly (see the sketch after this list). C++0x (if it ever gets ratified) will introduce new Unicode support to make this method much easier. Interoperability with other APIs or libraries can easily lead to issues, though.
  • go ahead and do what Joel Spolsky says: just use std::wstring and allow it to be UTF-16/UCS-2 on one system and UTF-32 on another. So long as your game runs on a single system and isn’t sending text between systems/servers, this should be fine.
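
To make the second option concrete, here’s a rough sketch of the sort of thing I mean. The names (utf16_char, utf16_string, U16) are made up for illustration, and one caveat: the standard only guarantees char_traits for char and wchar_t, so strictly conforming code would supply its own traits class (in practice the generic template most implementations ship happens to work):

    #include <string>

    typedef unsigned short utf16_char;                    // 16 bits on common platforms
    typedef std::basic_string<utf16_char> utf16_string;   // same-sized code units everywhere

    // String literals can't be written directly as utf16_char arrays in
    // C++98/03, so the usual hack is a conversion helper behind a macro.
    // This naive version assumes the literal is plain ASCII.
    inline utf16_string make_utf16(const char* ascii)
    {
        utf16_string out;
        for (; *ascii; ++ascii)
            out.push_back(static_cast<utf16_char>(*ascii));
        return out;
    }

    #define U16(s) make_utf16(s)

    // Usage: utf16_string title = U16("Options");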

Personally I prefer the second option. I use string literals far less than I use everything else, and when writing Objective-C I have to prefix my string literals with an @ anyway, so wrapping them in a macro isn’t an issue for me. Plus it makes it easier to move to the next version of C++ later. Until that gets released, though, hopefully you now know enough about Unicode to decide how to go about supporting (or not supporting) it in your next project.

2 thoughts on “C++, Unicode and localisation”

  1. I’ve been cramming UTF-8 encoded text into C++ “std::string” and ANSI C “char*” for a while now, as recommended by the “UTF-8 Everywhere” proselytizers (http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/).

    Is that going to cause problems for me? If not, perhaps you could add a 4th point to your list of work-arounds, something like “use std::string to hold UTF-8 encoded text, translating as necessary to interoperate with APIs and libraries that use some other encoding”.

    I suppose it is technically true that “On Windows a std::string uses a single byte” per indexed location. But isn’t that also true for every standards-compliant implementation of std::string?

  2. Huh, I haven’t re-read this blog post for a while. But yeah, I definitely learned a lot more about Unicode after writing that as well. Mainly that wide strings are stupid, everyone already uses normal-width strings and you can put Unicode in them by using UTF-8 compliant string libraries. So you’re quite right, and this blog post is utter garbage right now. I’ll make another one about strings and localisation soon and put a header on this one to go read that one instead :)
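
    For anyone who lands here later, the shape of that approach is: keep UTF-8 in std::string everywhere, and convert only at the boundary of an API that wants something else. On Windows that’s roughly this (a quick, untested sketch around the Win32 MultiByteToWideChar call):

        #include <string>
        #include <windows.h>

        // UTF-8 in, UTF-16 out (std::wstring on Windows has 2-byte wchar_t).
        std::wstring utf8_to_wide(const std::string& utf8)
        {
            if (utf8.empty()) return std::wstring();
            int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                          (int)utf8.size(), NULL, 0);  // ask for the size
            std::wstring wide(len, L'\0');
            MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                (int)utf8.size(), &wide[0], len);      // do the conversion
            return wide;
        }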
