Day 21 - Unicode

Handle and process multilingual data properly

NOTE: This article was written in 2006. For more up-to-date information on Unicode, please see http://dev.catalystframework.org/wiki/tutorialsandhowtos/using_unicode. In particular, Catalyst::Plugin::Unicode::Encoding is recommended instead of Catalyst::Plugin::Unicode.

Introduction

As the Internet gains more and more users, it's likely that your site will receive visits by people that don't want to interact with it in English. They might want to use Japanese names, or perhaps tag an entry in Russian. Unfortunately, most apps out there break completely when this happens. While users of Ruby and PHP have to fight with their language to add (poor) support for multiple languages (or more specifically, multiple character sets), Perl has native support for Unicode right in the language.

Terminology

Some of the terminology used in this article might take a bit of getting used to, because when the terms were invented, Unicode didn't exist. The basic unit of a string in Perl is a "character":

    $string =~ /(.)/;

$1, in this example, will contain the first character in the string. This is intuitive if the string is something like "abcde", but it also holds true for a string like 日本語. What you think of as characters, Perl thinks of as characters.

However, problems arise when you confuse a byte (or an octet) with a character. Although 'a' is both a character and an octet, a character encoded as UTF-8 can be made up of more than one octet. This matters because the Internet (and terminal emulators, filesystems, etc.) has no concept of a character -- it only transmits, receives, or stores octets.
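To make the distinction concrete, here is a minimal sketch that counts the same data both ways. It assumes the string 日本語 arrives as UTF-8 octets; the variable names are only illustrative, and the decode call is explained in the next section.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets for the three characters of "日本語", written as
# escapes so the example does not depend on this file's encoding.
my $octets = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Undecoded, Perl sees nine one-octet "characters".
my $octet_count = length $octets;

# Decoded, Perl sees the three characters a human would count.
my $chars      = decode('UTF-8', $octets);
my $char_count = length $chars;

print "octets: $octet_count, characters: $char_count\n";
```

The data is identical in both cases; only Perl's view of it changes.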

Using Unicode Correctly

By default, Perl thinks of everything as an octet. Therefore it won't handle Unicode properly. This is so that Perl can process binary data (a task equally important as handling text). Since binary data isn't made up of characters, you have to explicitly tell Perl that the incoming (or outgoing) data is text. If you know the charset that the text is encoded in, you can decode the "binary" data into textual Perl characters by using the Encode module, and writing:

    $string = decode('the-charset', $octets);

Now $string contains Perl characters, and will work with regular expressions, substr, etc. You can also combine it with other strings correctly. (If you used the data directly, then /./ would match an octet and the result would be garbage data.)

As we said before, the Internet (and files) have no concept of "characters", and so we need to encode characters as octets of some sort before we can output them:

    $octets = encode('the-charset', $string);

That's all there is to correctly handling any language -- decode the data that you get from the user, and then encode it when you send it back. As long as you know what encoding to specify for 'the-charset', everything will work perfectly! Simple!
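The whole round trip might look like the sketch below. It assumes the input is UTF-8 and that the output should be UTF-8 as well; $input stands in for whatever octets your program actually receives.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Pretend these UTF-8 octets came from the network ("日本語").
my $input = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# 1. Decode on the way in.
my $string = decode('UTF-8', $input);

# 2. Work with characters: grab the first one.
my ($first) = $string =~ /(.)/;

# 3. Encode on the way out.
my $output = encode('UTF-8', $first);

# For comparison: without step 1, /(.)/ matches a single octet,
# which is only a fragment of the first character.
my ($broken) = $input =~ /(.)/;

printf "decoded match: %d octets, undecoded match: %d octet\n",
    length($output), length($broken);
```

The decoded match encodes back to the three octets of 日, while the undecoded match is a lone octet that is not valid UTF-8 on its own.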

Technical Detail

A cause of a lot of Unicode problems is the fact that Perl stores character strings internally as UTF-8. Since most web browsers and terminal emulators also use UTF-8 as their native encodings, copying the data from a Perl string to the terminal (via print) or web browser will appear to work (and the reverse -- inputting UTF-8 data from the terminal or web browser into Perl -- will also appear to work).

However, when you do this, you'll get a warning:

    Wide character in print at foo.pl line 42

This is perl's way of telling you that you are ignoring the instructions above, and are likely to get garbage output unless you are very very lucky. In simple programs, it's easy to be lucky, but in complex programs, it's difficult. Therefore, you always need to decode input, and encode output, even if you "know" that the input is UTF-8 and the output is UTF-8. (In this case, a mere while(<>){ print } loop will work fine, but if you do any string manipulations the result will be an unreadable mess.)
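Here is a small sketch of when that warning appears and how encoding the output avoids it. It prints to an in-memory filehandle instead of the terminal and collects warnings in @warnings; both are just conveniences for the demonstration, not something your application needs.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $string = "\x{65E5}\x{672C}\x{8A9E}";    # the characters 日本語

# Collect warnings instead of sending them to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, @_ };

# An in-memory filehandle with no encoding layer stands in for STDOUT.
my $buffer = '';
open my $fh, '>', \$buffer or die $!;

print {$fh} $string;                     # characters: "Wide character in print"
print {$fh} encode('UTF-8', $string);    # octets: no warning

close $fh or die $!;

print scalar(@warnings), " warning(s): @warnings";
```

Only the first print warns, because only the first one hands characters to a filehandle that expects octets.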

To help rid your program of this kind of problem, use the encoding::warnings pragma. When you load it and run your application, it will emit a warning message that shows you where you're misusing Unicode. From there, you can see (in the debugger, etc.) where the invalid data is coming from, and add a decode or encode statement.

Unicode in Catalyst

Catalyst can handle most of the details of this for you. If you load Catalyst::Plugin::Unicode into your application, Catalyst will decode all incoming request data into Perl characters, and encode outgoing characters output to Unicode (UTF-8) octets. Only the body will be encoded -- you should avoid using non-ASCII characters in the headers.

In most cases this will work. However if you're using a legacy character encoding, you will have to do the conversion manually with the Encode module (and explicitly specify what charset to use).

Catalyst::Plugin::Unicode won't handle everything, though. If you're using an external module that reads files or other data sources, verify that it handles the Unicode conversion for you. (XML::RSS, for example, won't; but YAML::Syck will.) Be especially aware that your database might not be storing data as Unicode.

It pays to be careful here, because mixing characters and octets in the same string (such as the body of your page) will result in severe problems. If you never decode anything, and everything is UTF-8, your site will probably work (but don't do this!). If you decode some things, and then mix in un-decoded data, perl will interpret each octet of the un-decoded data as a full character. When you print this out, each such "character" (which is really one octet of a different character) will itself be encoded into UTF-8 octets. This is called double-encoding, and is unfortunately the most frequently occurring Unicode bug that I've seen (even https://metacpan.org/author/JROCKWAY has this problem).
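Double-encoding is easy to reproduce. The sketch below assumes a page body built from one decoded string and one un-decoded string that both contain "é" (U+00E9), then dumps the encoded result as hex so the damage is visible.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets  = "\xC3\xA9";                   # UTF-8 octets for "é"
my $decoded = decode('UTF-8', "\xC3\xA9");  # one character, U+00E9

# The bug: decoded characters concatenated with un-decoded octets.
# Perl treats each raw octet as a character of its own.
my $page = $decoded . $octets;              # 1 + 2 = 3 "characters"

# Encoding the mixture turns the stray octets into UTF-8 again.
my $out = encode('UTF-8', $page);
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $out;

print "$hex\n";    # C3 A9 C3 83 C2 A9
```

The first two octets are a correct "é". The remaining four are the double-encoded copy, which a browser will typically show as "Ã©".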

Perl will do the right thing, but only if you know what the right thing is :)

One more thing...

Perl has a built-in pragma called utf8 that you can use at the top of your source file to tell Perl that the file is encoded in UTF-8. If you do this, then you can use non-ASCII characters in the names of functions, variables, etc., and string literals in the file are read as characters rather than octets. You don't need use utf8; for anything else, though, not even if you want to write:

    my $string = get_utf8_octets();
    utf8::decode($string);
    ...

instead of:

    my $string = get_utf8_octets();
    $string = Encode::decode('utf-8', $string);
    ...

The names of those functions are utf8::encode and utf8::decode. They are always available under those fully qualified names; the utf8 pragma doesn't import them into your namespace.

(As an aside, you might see people use utf8::decode or utf8::encode directly like this. It's best to use Encode, but using utf8::* will result in valid characters or octets, as long as you want to use UTF-8. Just be sure to remember that Encode returns a converted copy of the string, but utf8::* does the conversion in-place.)
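A short sketch of that copy versus in-place difference, assuming well-formed UTF-8 input and the default behaviour of Encode::decode (no CHECK argument). The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xE6\x97\xA5";    # UTF-8 octets for 日

# Encode::decode returns a new string; called like this, it leaves
# $octets as it was.
my $copy = Encode::decode('UTF-8', $octets);

# utf8::decode rewrites its argument and returns true on success,
# so work on a copy if you still need the original octets.
my $in_place = $octets;
my $ok       = utf8::decode($in_place);

printf "original: %d octets, copy: %d character, in-place: %d character\n",
    length($octets), length($copy), length($in_place);
```

Note that utf8::decode is called here without a use utf8; line anywhere in the file.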

Recipes

Now that you know how to use Unicode, here are some things that you can do in Perl that other languages just can't do!

Make the first character of a string BIG

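    my $string = $c->request->param('string');
    # /s lets . match a newline too; do nothing if the string is empty
    if (my ($first, $rest) = $string =~ /^(.)(.*)$/s) {
        $c->response->body(qq{<font size="+100">$first</font>$rest});
    }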

Truncate text

    use Encode;
    my $string  = Encode::decode('utf-8', $some_utf8_octets);
    my $summary = substr $string, 0, 10;
    print Encode::encode('utf-8', $summary . '...');

This will print the first ten characters of the input. YouTube tries to do this, but gets it wrong (Japanese summaries get truncated in the "middle" of a character, leading to garbage output). Feel secure in knowing that Perl can practically extract and report textual summaries :)

Automatically encode and decode files

    open my $in, '<:encoding(UTF-8)', 'a_file.utf8' or die $!;
    open my $out, '>:encoding(euc-jp)', 'a_file.eucjp' or die $!;
    print {$out} $_ while (<$in>);

This converts the input UTF-8 file to EUC-JP by automatically calling the correct Encode function. You can do more than just echo -- any character operations would work correctly inside the while loop.
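As a sketch of doing more than echoing, the version below truncates each line to two characters on its way through. To stay self-contained it writes its own sample input into a temporary directory, and it reads with the stricter :encoding(UTF-8) layer; the file names and the two-character limit are arbitrary choices for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir       = tempdir(CLEANUP => 1);
my $utf8_in   = "$dir/a_file.utf8";
my $eucjp_out = "$dir/a_file.eucjp";

# Create a one-line UTF-8 input file containing "日本語".
open my $setup, '>:raw', $utf8_in or die $!;
print {$setup} "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E\n";
close $setup or die $!;

# The layers decode on read and encode on write, so everything
# inside the loop operates on characters.
open my $in,  '<:encoding(UTF-8)',  $utf8_in   or die $!;
open my $out, '>:encoding(euc-jp)', $eucjp_out or die $!;
while (my $line = <$in>) {
    chomp $line;
    print {$out} substr($line, 0, 2), "\n";    # keep "日本"
}
close $in  or die $!;
close $out or die $!;

# Read the result back as raw octets to see what was written.
open my $check, '<:raw', $eucjp_out or die $!;
my $raw = do { local $/; <$check> };
close $check or die $!;

print join(' ', map { sprintf '%02X', ord($_) } split //, $raw), "\n";
```

Each of the two characters kept by substr becomes two octets in EUC-JP, so nothing is cut in the "middle" of a character.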

AUTHOR

Jonathan Rockway jrockway@cpan.org

Catalyst Advent Calendar