== Unicode Guide to Rails

This guide explains some concepts and implementation details you *need to know* to use ActiveSupport::Multibyte. A programmer can't support Unicode in an application without knowing some of the details of the standard.

=== Unicode Primer in One Paragraph

Unicode differs from ordinary character sets and encodings because it provides a complete character-level text processing system. In Unicode every character is represented by a number, which we call a codepoint. Text is a sequence of these codepoints. When text is displayed, the glyphs (images) corresponding to the codepoints are looked up in the font(s) and drawn on the screen.

The Unicode standard describes a number of encodings for storing Unicode text in a computer system, such as UTF-8, UTF-16 and UTF-32. Unicode strings in Ruby are best stored in UTF-8 (this is also the most interoperable encoding), so we will limit our discussion to UTF-8.

Working with Unicode sums up to a handful of rules:

* Never break other people's text
* Always normalize strings before you compare or measure length
* Never stop thinking when working with Unicode data

=== String is for bytestrings

ASCII strings are very straightforward because ASCII defines only 128 characters, which means that every character can be encoded with just one byte. The first 128 Unicode codepoints correspond to the ASCII characters and are encoded as a single byte in UTF-8. This means that ASCII text is also valid UTF-8 text. Characters outside of the ASCII standard take up to 4 bytes (the original UTF-8 design allowed up to 6).

  Letter   ASCII byte   UTF-8 bytes   Unicode name
  ------   ----------   -----------   ------------
  a        97           97            LATIN SMALL LETTER A
  ë        -            195 171       LATIN SMALL LETTER E WITH DIAERESIS

In the table we can see that the letter a is encoded as 97 in both ASCII and UTF-8. The letter e with diaeresis can't be represented in ASCII and is two bytes long in UTF-8.

Because characters can take up more than one byte, normal bytestring operations don't work anymore.

  name = "Martin Müller"
  name.length #=> 14, one too many
  name[0..8]  #=> "Martin M\303", you damaged a codepoint

Ruby splits codepoints in half and doesn't know how long our strings really are. Now think about what this does to languages which are encoded entirely in multiple bytes per character. Putting UTF-8 encoded strings in bytestrings isn't a problem in itself, you just have to make sure you don't break them. This leads to our first Rule Of Thumb:

*ROT: Never break other people's text*

There are some ways to work around the bytestring operations in Ruby using the regular expression engine and other functionality, but they aren't very intuitive and require some knowledge about UTF-8 encoded strings. Certain operations are hard or impossible to do in Ruby right now: Unicode normalization, upcase / downcase and sorting.

Note that UTF-8 can be a little confusing, because as long as you keep working with ASCII strings exclusively you can still use the byte-based operations. However, you risk actively damaging text as soon as some multibyte characters appear in the strings you are processing (which is bound to happen in the context of a globally available Web application). Invalid UTF-8 sequences will make many Web-service backends throw an error, and damaged codepoints will display as ugly question marks or boxes in your HTML!

=== Normalization forms

Not all characters in Unicode are letters. There are also 26 forms of whitespace, text direction characters, a whole array of quotation marks, fraction characters, mathematical figures and so on. The most problematic of these are the combining sequences.

A large group of characters in Unicode exist in two forms: composite form and combining sequence form. Let's take the letter ë as an example again. It can be represented in two ways, either as the single codepoint LATIN SMALL LETTER E WITH DIAERESIS or as LATIN SMALL LETTER E followed by COMBINING DIAERESIS. In Hangul (Korean) all modern characters have a composite and a combining form.
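The two representations of ë described above can be inspected with nothing more than +pack+ and +unpack+. This is a minimal sketch that does not use the plugin; the variable names are only illustrative.

```ruby
# Build both representations of the letter from codepoints, so the source stays ASCII.
composed   = [0xEB].pack('U*')         # LATIN SMALL LETTER E WITH DIAERESIS
decomposed = [0x65, 0x308].pack('U*')  # LATIN SMALL LETTER E + COMBINING DIAERESIS

composed_codepoints   = composed.unpack('U*')    # [235]
decomposed_codepoints = decomposed.unpack('U*')  # [101, 776]
composed_bytes        = composed.unpack('C*')    # [195, 171]
decomposed_bytes      = decomposed.unpack('C*')  # [101, 204, 136]

# Both strings render as the same letter, but they are not equal byte for byte.
same = (composed == decomposed)                  # false

puts composed_codepoints.inspect
puts decomposed_codepoints.inspect
puts same
```

Both strings should look identical on screen, yet a plain comparison treats them as different. This is the situation normalization is meant to resolve.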
Note that most terminals and rendering engines are notoriously poor in their display of decomposed Unicode sequences.

If you want to compare Unicode strings at the codepoint level, you will have to perform normalization. Normalization makes sure that all the combining characters are either in composite or in combining form. There are four forms of normalization in Unicode:

* Decomposed form (normalization form D)
* Composed form (form C)
* Compatibility variants of both (KD and KC).

The compatibility variants are destructive because, besides the normal decompositions, they also decompose some graphically rich characters. For instance the ½ character (VULGAR FRACTION ONE HALF) becomes DIGIT ONE, FRACTION SLASH, DIGIT TWO. From this sequence we can't tell what the original author meant, so we've lost semantic information.

The W3C recommends form C as the baseline model for web services and other Web-based interchange where strings are involved.

The number of characters in a Unicode string can easily change after normalization. The only way to measure the same length for canonically equal strings is to normalize them to the same form. We recommend using Normalization Form KC (NFKC) in your average web application (unless you know that you will need special display characters such as fractions and ligatures).

*ROT: "Always normalize strings before you compare or measure length"*

To enable compulsory normalization for all strings in your CGI parameters use the +normalize_params+ method.

=== Case conversions (upcase and downcase)

Some languages have capital forms for characters, but the codepoints aren't ordered in such a way that the capital form of a character can be computed. The Unicode standard provides a mapping from uppercase to lowercase and the other way around. However, for some languages the operation is irreversible. Take German for instance:

  Lowercase   Uppercase
  =========   =========
  Straße      STRASSE, there is no uppercase ß

Because ß uppercases to SS, downcasing STRASSE cannot recover the original spelling.

=== Sorting

Sorting is a difficult problem. Unlike in ASCII, codepoints aren't ordered in such a way that a higher byte value means a later position in the ordering. There are tables with information on how to sort languages, but they become very large if you want to support a lot of languages. And even with these tables there is the matter of preference, for instance in English: should we sort the capitals before lowercase characters, and where do we put the special characters?

Thus Ruby (and this plugin) will use binary sort order, which might not satisfy some languages using diacritics (diacritics in Swedish and German, among others, are sorted differently with respect to the base Latin subset). Proper Unicode sorting is _always_ locale-dependent, thus most programs and databases implement it separately.

=== How to work with the Unicode plugin

When working with Unicode data, you always have to be on your toes. There are still a lot of choices to be made during development. We will look at some common problems and how to work around them.

*ROT: "Never stop thinking when working with Unicode data"*

The Unicode plugin doesn't perform any Unicode related processing unless the global variable $KCODE is set to 'u' or 'utf-8'. You can set $KCODE anywhere in your code to switch the Unicode processing on or off. If you want to use Unicode all through your application, set it at the start of environment.rb.

  $KCODE = 'u'

If your system primarily handles non-ASCII data, we recommend normalizing all incoming strings. That way you can use all the proxy methods without thinking about normalization first. You can set this either globally in application.rb or per controller.

  class WikiController < ApplicationController
    normalize_params :form => 'KC'
  end

The Unicode plugin defines a new instance method on the String class that returns an ActiveSupport::Multibyte::Chars proxy. The proxy 'wraps' the string so you can use the Unicode safe methods on the bytestring. If you're not using the icu4r library, the +chars+ method is also aliased as +u+. All string methods are defined on the proxy, and we've also defined some extra methods for your convenience.

String length is a problematic notion in Unicode strings. You can count the number of bytes or the number of codepoints, and depending on the normalization form, the length of the string will vary. You can also count characters (called grapheme clusters in the standard), in which case some characters will contain more than one codepoint. *Normalization does not convert all combined grapheme clusters into one codepoint!*

Note that Unicode, like ASCII, can contain some control characters and non-spacing whitespace which you might want to strip out first. The +strip+ method can do this for you, just like with normal strings. *Be aware that you might be getting special Unicode whitespace copy-pasted into your application from programs such as InDesign, where thin and hair spaces are used for layout.*

  name = "Martin Müller"
  name.length                     #=> 14
  name.chars.length               #=> 13
  name.chars.raw_string.length    #=> 14
  name.chars.normalize_D.length   #=> 15
  name.chars.normalize_KC.length  #=> 14

The string operations on the proxy are chainable.

  name.chars.reverse.length #=> 13
  name.chars.reverse.class  #=> ActiveSupport::Multibyte::Chars

The way chaining works is that every method on the proxy that normally returns a String returns a Chars object instead. Most of the time this shouldn't be a problem, because the Chars object works almost identically to a String object. However, some libraries might expect the +length+ method to return the number of bytes, or check explicitly for the String class (bad!). In those cases use +to_s+ to pass a plain string. Expressions in views typically don't have this problem because they get joined with strings, which implicitly calls +to_s+.
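The different notions of string length discussed above (bytes versus codepoints, composed versus decomposed) can be checked without the plugin. The following is only a sketch using +pack+ and +unpack+; it does not go through the Chars proxy, and the variable names are illustrative.

```ruby
# "Martin Müller" with a composed ü (U+00FC), built from ASCII source.
composed   = "Martin M" + [0xFC].pack('U*') + "ller"
# The same name with ü written as u followed by COMBINING DIAERESIS (U+0308).
decomposed = "Martin Mu" + [0x308].pack('U*') + "ller"

composed_bytes        = composed.unpack('C*').length    # 14 bytes
composed_codepoints   = composed.unpack('U*').length    # 13 codepoints
decomposed_bytes      = decomposed.unpack('C*').length  # 15 bytes
decomposed_codepoints = decomposed.unpack('U*').length  # 14 codepoints

puts "composed:   #{composed_bytes} bytes, #{composed_codepoints} codepoints"
puts "decomposed: #{decomposed_bytes} bytes, #{decomposed_codepoints} codepoints"
```

A reader sees thirteen characters in both cases, yet none of the four counts agree across the two forms unless the strings are first normalized to the same form.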
If you have a byte offset and you want to translate it to an offset in codepoints, you can use +translate_offset+, which converts the offset in bytes to a safe offset in Unicode codepoints. Methods like +slice+ and +split+ do this for you.

  name.chars.translate_offset(8) #=> 8
  name.chars.translate_offset(9) #=> 8
  name.chars[0..8]               #=> "Martin Mü"

If you want to normalize the string manually, you can use the normalize methods. This is also a good example of how some characters get decomposed.

  quarter = [0xBC].pack('U*')  # The character VULGAR FRACTION ONE QUARTER, which has codepoint 0xBC
  item = "#{quarter} liter"    # Put it in a string
  item.chars.normalize_D       #=> "¼ liter"
  item.chars.normalize_KD      #=> "1⁄4 liter"
  item.chars.normalize_C       #=> "¼ liter"
  item.chars.normalize_KC      #=> "1⁄4 liter"

The only string operation that is not fully Unicode aware is the +<=>+ method, which takes care of comparing objects for sorting. It just falls back to the way Ruby sorts objects. As we mentioned above, sorting is basically impossible to handle without knowing the language we are targeting (and its alphabet), so it's outside of our scope for now. The easiest solution is to do all the sorting in the database. Almost all database engines have Unicode aware sorting algorithms for a variety of languages. Please consult the documentation of your database and look for *collation*.

For more information about the Chars proxy, look at the API documentation. You can generate the API documentation from the plugin with +rake doc+.

=== But that's not all

In order for Unicode to stay Unicode you need your whole toolchain to keep the bytes intact. That means that from the browser, through proxies, through your webserver, through Rails and into the database, everything has to be either left alone or handled as Unicode. The same goes for data on its way from the database to the browser.

In order to do this we need to do some things, and guess what? The Unicode plugin does most of that for you. It sets the encoding of the database client driver to UTF-8 and adds charset=utf-8 to your headers when you're sending text. So what's left for you to do?

* Make sure your database and tables use UTF-8 encoding throughout (and collation if you're interested)
* Make sure you normalize incoming data before you process it, preferably at the gate (where text gets into your system)
* Make sure you use the Chars proxy when doing string operations

SQLite3 databases are always UTF-8 unless you specify otherwise.

In MySQL you have to alter the database configuration to create UTF-8 tables by default.

  ALTER DATABASE app_dev DEFAULT CHARACTER SET utf8;

In PostgreSQL you have to make sure locale support was enabled during compilation and then set the locale for your postmaster (the user running the server). In most Unix/Linux variants you can do this by setting two variables in your environment.

  export LC_CTYPE=utf-8
  export LC_COLLATE=utf-8

You can optionally alter the collation by using the language subcomponent of the locale, as in +en_US.UTF-8+ or +ru_RU.UTF-8+.

=== Speeding things up

The Unicode plugin implements a number of backends for handling encoded strings. At initialization it tries to load a number of compiled Ruby extensions and falls back on a pure Ruby implementation if none are present. The README of the plugin lists all the extensions you can use.

When you've installed one of the extensions and you want to force a certain handler, you can do so in the Rails configuration. Most of the time this isn't necessary.

  ActiveSupport::Multibyte.handler = :UTF8HandlerPure

=== The payoff

Now you can finally use a skull in your text.

  skull = [0x2620].pack('U*')
  puts skull

=== Further reading and assistance

http://www.joelonsoftware.com/articles/Unicode.html

If you want to discuss multibyte support for Ruby on Rails, please visit IRC #multibyte_rails on irc.freenode.net