Good News About Unicode in Erlang

Bad News (today)

If there is good news, there must be some bad news nearby. The bad news about Unicode support in Erlang is that you cannot use Unicode string literals in source files, because the Erlang compiler assumes source files are Latin-1 encoded. So to write something like "a∘b" in a source file you have to use "a\x{2218}b" or the even uglier [$a, 8728, $b], both of which are equal to the intended string "a∘b". Even if you save the source file as UTF-8, the compiler still assumes it is Latin-1, and so far there is no way to tell it otherwise. Another option is to keep Unicode strings in separate files and read them at runtime with built-in Erlang functions. (But hey, the Swedish alphabet is covered by the Latin-1 charset, and that is definitely better than bare US-ASCII :)).
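As a small sketch of these workarounds: the module below (its name and function names are made up for illustration) checks that the escaped literal and the explicit code point list are the same term, encodes the string as UTF-8, and shows one way to load UTF-8 text from a separate file at runtime with file:read_file/1 and unicode:characters_to_list/2. The path is whatever file you choose to keep your strings in.

```erlang
-module(unicode_literal_demo).
-export([same_string/0, as_utf8/0, read_utf8/1]).

%% Both spellings denote the same list of code points: [97, 8728, 98].
%% The source stays pure ASCII, so the Latin-1 assumption does no harm.
same_string() ->
    "a\x{2218}b" =:= [$a, 8728, $b].

%% Encode the code point list as a UTF-8 binary,
%% for example before writing it to a file or a socket.
as_utf8() ->
    unicode:characters_to_binary([$a, 16#2218, $b]).

%% Read a UTF-8 encoded file (hypothetical path chosen by the caller)
%% and decode it into a list of code points. On malformed input
%% unicode:characters_to_list/2 returns an error or incomplete tuple instead.
read_utf8(Path) ->
    {ok, Bin} = file:read_file(Path),
    unicode:characters_to_list(Bin, utf8).
```

Keeping the strings in a data file costs a file read at runtime, but the source file itself stays pure ASCII.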

Good News (near future)

Now for the good news. All I can do is quote the decisions affecting the upcoming Erlang releases, starting with R16 and R17:

The board decided to go for a solution where comments in the code (in the same way as in Python) informs the tool chain about input file encoding formats. This means that only UTF-8 and ISO-Latin-1 encoding will be supported. All source files can be marked as containing UTF-8 encoded Unicode characters by using the same mechanism (even files read using file:consult/1), namely formalized comments in the beginning of the file.

The change to the file format will be done incrementally, so that the tools will accept Unicode input (meaning that source code can contain Unicode strings, even for binary construction), but restrictions regarding characters in atoms will remain for two releases (due to distribution compatibility). The default file encoding will be ISO-Latin-1 in R16, but will be changed to UTF-8 in R17.

Source code will need no change in R16, but adding a comment denoting ISO-Latin-1 encoding will ensure that the code can be compiled with the R17 compiler. Adding a comment denoting UTF-8 encoding will allow for Unicode characters with code points > 255 in string and character literals in R16. The same comment will allow for atoms containing any Unicode code point in R18. From this follows that function names also can contain any Unicode code point in R18.

UTF-8 BOM’s will not be handled due to their limited use.

Variable names will continue to be limited to Latin characters.
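To make the decision above concrete, here is a sketch of what a UTF-8 source file could look like. The quoted text only promises "formalized comments in the beginning of the file", so the exact spelling of the comment is an assumption: the form below mirrors the Python/Emacs-style coding declaration and should be checked against the release documentation. The module name is hypothetical, and the file itself has to be saved as UTF-8.

```erlang
%% -*- coding: utf-8 -*-
-module(compose).
-export([label/0]).

%% With the encoding comment on the first line, the string literal
%% may contain code points above 255 directly instead of \x{2218}.
label() ->
    "a∘b".
```

If the comment works as described, compose:label() evaluates to the same list as "a\x{2218}b", namely [97, 8728, 98].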

Overall, it looks like the right decision for those who want to use characters outside the Latin-1 character set in string literals.

Awaiting cryptic DSLs in R18 though… :)