Day 21 - Unicode
Handle and process multilingual data properly
NOTE: This article was written in 2006. For more up-to-date information on Unicode, please see http://dev.catalystframework.org/wiki/tutorialsandhowtos/using_unicode. In particular, Catalyst::Plugin::Unicode::Encoding is recommended instead of Catalyst::Plugin::Unicode.
Introduction
As the Internet gains more and more users, it's likely that your site will receive visits from people who don't want to interact with it in English. They might want to use Japanese names, or perhaps tag an entry in Russian. Unfortunately, most apps out there break completely when this happens. While users of Ruby and PHP have to fight with their language to add (poor) support for multiple languages (or more specifically, multiple character sets), Perl has native support for Unicode right in the language.
Terminology
Some of the terminology used in this article might take a bit of getting used to, because when the terms were invented, Unicode didn't exist. The basic unit of a string in Perl is a "character":
$string =~ /(.)/;
$1, in this example, will contain the first character in the string. This is intuitive if the string is something like "abcde", but it also holds true for a string like 日本語. What you think of as characters, Perl thinks of as characters.
However, the problem arises when you confuse a byte (or an octet) with a character. Although 'a' is both a character and an octet, a UTF-8 character can be made up of more than one octet. This is where the problem begins because the Internet (and terminal emulators, filesystems, etc.) have no concept of a character -- they only transmit, receive, or store octets.
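The gap between octets and characters is easy to see by counting both. The following is a minimal sketch, assuming the text arrives as UTF-8 octets; the variable names are illustrative, and decode (from the Encode module) is explained in the next section.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The nine UTF-8 octets that encode the three characters of 日本語.
my $octets = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";

# Undecoded, Perl counts nine one-octet "characters".
my $octet_count = length $octets;

# Decoded, Perl counts the three characters a reader would count.
my $chars      = decode('UTF-8', $octets);
my $char_count = length $chars;

printf "%d octets, %d characters\n", $octet_count, $char_count;
# prints "9 octets, 3 characters"
```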
Using Unicode Correctly
By default, Perl thinks of everything as an octet. Therefore it won't handle Unicode properly. This is so that Perl can process binary data (a task as important as handling text). Since binary data isn't made up of characters, you have to explicitly tell Perl that the incoming (or outgoing) data is text. If you know the charset that the text is encoded in, you can decode the "binary" data into textual Perl characters by loading the Encode module and writing:
$string = decode('the-charset', $octets);
Now $string contains Perl characters, and will work with regular expressions, substr, etc. You can also combine it with other strings correctly. (If you used the data directly, then /./ would match an octet and the result would be garbage data.)
As we said before, the Internet (and files) have no concept of "characters", and so we need to encode characters as octets of some sort before we can output them:
$octets = encode('the-charset', $string);
That's all there is to correctly handling any language -- decode the data that you get from the user, and then encode it when you send it back. As long as you know what encoding to specify for 'the-charset', everything will work perfectly! Simple!
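Put together, the decode-process-encode cycle might look like the sketch below. It assumes the charset is UTF-8; shout_utf8 is a hypothetical helper invented for this illustration, not part of Encode or Catalyst.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: upper-case text that arrives and leaves as UTF-8 octets.
sub shout_utf8 {
    my ($octets_in) = @_;
    my $string = decode('UTF-8', $octets_in);   # octets -> characters
    $string = uc $string;                       # character-level work
    return encode('UTF-8', $string);            # characters -> octets
}

# "café" as five UTF-8 octets (the é is C3 A9).
my $octets_out = shout_utf8("caf\xC3\xA9");

# Show the resulting octets in hex.
print join(' ', map { sprintf '%02X', ord } split //, $octets_out), "\n";
# prints "43 41 46 C3 89", which is "CAFÉ" in UTF-8
```

Because uc runs on decoded characters, the é is upper-cased as a single character rather than being treated as two unrelated octets.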
Technical Detail
A cause of a lot of Unicode problems is the fact that Perl stores character strings internally as UTF-8. Since most web browsers and terminal emulators also use UTF-8 as their native encodings, copying the data from a Perl string to the terminal (via print) or web browser will appear to work (and the reverse -- inputting UTF-8 data from the terminal or web browser into Perl -- will also appear to work).
However, when you print a string containing characters above code point 255 this way, and warnings are enabled, you'll get a warning:
Wide character in print at foo.pl line 42
This is Perl's way of telling you that you are ignoring the instructions above, and are likely to get garbage output unless you are very very lucky. In simple programs, it's easy to be lucky, but in complex programs, it's difficult. Therefore, you always need to decode input, and encode output, even if you "know" that the input is UTF-8 and the output is UTF-8. (In this case, a mere while(<>){ print } loop will work fine, but if you do any string manipulations the result will be an unreadable mess.)
To help rid your program of this kind of problem, use the encoding::warnings pragma. When you load it and run your application, it will emit a warning message that shows you where you're misusing Unicode. From there, you can see (in the debugger, etc.) where the invalid data is coming from, and add a decode or encode statement.
Unicode in Catalyst
Catalyst can handle most of the details of this for you. If you load Catalyst::Plugin::Unicode into your application, Catalyst will decode all incoming request data into Perl characters, and encode outgoing character output as UTF-8 octets. Only the body will be encoded -- you should avoid using non-ASCII characters in the headers.
In most cases this will work. However if you're using a legacy character encoding, you will have to do the conversion manually with the Encode module (and explicitly specify what charset to use).
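As an illustration of the manual route, the sketch below converts text from a legacy charset to UTF-8. ISO-8859-1 is only a stand-in here, and the variable names are made up for the example; substitute whatever charset your data really uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "café" in ISO-8859-1: the é is the single octet E9.
my $latin1_octets = "caf\xE9";

# Name the legacy charset explicitly when decoding...
my $string = decode('iso-8859-1', $latin1_octets);

# ...and name the target charset explicitly when encoding.
my $utf8_octets = encode('UTF-8', $string);

printf "%d octets in, %d octets out\n",
    length $latin1_octets, length $utf8_octets;
# prints "4 octets in, 5 octets out"
```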
Catalyst::Plugin::Unicode won't handle everything, though. If you're using an external module that reads files or other data sources, verify that it handles the Unicode conversion for you. (XML::RSS, for example, won't; but YAML::Syck will.) Be especially aware that your database might not be storing data as Unicode.
This is important to be careful about, because mixing characters and octets in the same string (the body of your page) will result in severe problems. If you never decode anything, and everything is UTF-8, your site will probably work (but don't do this!). If you decode some things, and then mix in un-decoded data, Perl will interpret each octet of the un-decoded data as a full character. When you print this out, this "character" (which is really an octet that was part of a different character) will itself be encoded into UTF-8 octets. This is called double-encoding, and is unfortunately the most frequently occurring Unicode bug that I've seen (even https://metacpan.org/author/JROCKWAY has this problem).
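Double-encoding can be reproduced in a few lines. The sketch below is an illustration with made-up variable names: it builds one "page" entirely from decoded characters and another that mixes decoded characters with raw UTF-8 octets, then encodes both.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets  = "caf\xC3\xA9";              # "café" as five UTF-8 octets
my $decoded = decode('UTF-8', $octets);   # "café" as four characters

# Correct: everything in the page is characters before it is encoded.
my $good_page = encode('UTF-8', $decoded . $decoded);

# Buggy: raw octets are mixed in, so C3 and A9 are each treated as a
# character of their own and encoded a second time.
my $bad_page = encode('UTF-8', $decoded . $octets);

printf "good: %d octets, bad: %d octets\n",
    length $good_page, length $bad_page;
# prints "good: 10 octets, bad: 12 octets"
```

The two extra octets in the bad page are the telltale sign: the single é has become the two characters Ã and ©.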
Perl will do the right thing, but only if you know what the right thing is :)
One more thing...
Perl has a built-in pragma called utf8 that you can use at the top of your source file to tell Perl that the file is encoded in UTF-8. If you do this, then string literals in the file are read as characters, and you can use any character in the name of a function, variable, etc. There's no other reason to write "use utf8;", though, not even if you want to write:
my $string = get_utf8_octets();
utf8::decode($string);
...
instead of:
my $string = get_utf8_octets();
my $output = Encode::decode('utf-8', $string);
...
The names of those functions are utf8::encode and utf8::decode; they aren't imported into your namespace by the utf8 pragma.
(As an aside, you might see people use utf8::decode or utf8::encode directly like this. It's best to use Encode, but using utf8::* will result in valid characters or octets, as long as you want to use UTF-8. Just be sure to remember that Encode copies the string, but utf8::* does the conversion in-place.)
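The copy versus in-place distinction is easy to trip over, so here is a small sketch of it. The variable names are illustrative; note that the script never says "use utf8;", yet utf8::decode is still available.

```perl
use strict;
use warnings;
use Encode ();

my $octets = "\xE6\x97\xA5";   # 日 as three UTF-8 octets

# Encode::decode returns a new string; $octets itself is left alone.
my $copy = Encode::decode('UTF-8', $octets);

# utf8::decode rewrites its argument, so work on a scratch copy here.
my $buffer = $octets;
utf8::decode($buffer);

printf "octets=%d copy=%d buffer=%d\n",
    length $octets, length $copy, length $buffer;
# prints "octets=3 copy=1 buffer=1"
```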
Recipes
Now that you know how to use Unicode, here are some things that you can do in Perl that other languages just can't do!
Make the first character of a string BIG
my $string = $c->request->params->{string};
$string =~ /^(.)(.*)$/;
$c->response->body(qq{<font size="+100">$1</font>$2});
Truncate text
use Encode;
my $string = Encode::decode('utf-8', $some_utf8_octets);
my $summary = substr $string, 0, 10;
print Encode::encode('utf-8', $summary . '...');
This will print the first ten characters of the input. YouTube tries to do this, but gets it wrong (Japanese summaries get truncated in the "middle" of a character, leading to garbage output). Feel secure in knowing that Perl can practically extract and report textual summaries :)
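To see the failure mode described above, compare cutting octets with cutting characters. This is a sketch with illustrative variable names; it relies on Encode's default behaviour of substituting U+FFFD (the replacement character) for malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E";   # 日本語 in UTF-8

# Wrong: cutting after four octets slices through the middle of 本,
# so decoding the fragment yields a replacement character.
my $broken = decode('UTF-8', substr($octets, 0, 4));

# Right: decode first, then cut after two whole characters.
my $summary = substr(decode('UTF-8', $octets), 0, 2);

print "broken summary contains U+FFFD\n" if $broken =~ /\x{FFFD}/;
printf "correct summary has %d characters\n", length $summary;
```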
Automatically encode and decode files
open my $in, '<:encoding(UTF-8)', 'a_file.utf8' or die $!;
open my $out, '>:encoding(euc-jp)', 'a_file.eucjp' or die $!;
print {$out} $_ while (<$in>);
This converts the input UTF-8 file to EUC-JP by automatically calling the correct Encode function. You can do more than just echo -- any character operations would work correctly inside the while loop.
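As a sketch of doing real work inside the loop, the example below reverses each line on its way from UTF-8 to EUC-JP. To stay self-contained it uses in-memory filehandles (references to scalars) in place of the files named above; that substitution, and the variable names, are assumptions made for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $utf8_octets  = "\xE6\x97\xA5\xE6\x9C\xAC\xE8\xAA\x9E\n";   # 日本語 plus newline
my $eucjp_octets = '';

# The layers decode on read and encode on write, as with real files.
open my $in,  '<:encoding(UTF-8)',  \$utf8_octets  or die $!;
open my $out, '>:encoding(euc-jp)', \$eucjp_octets or die $!;

while (my $line = <$in>) {
    chomp $line;
    # $line holds characters, so reverse works on whole characters.
    print {$out} scalar(reverse $line), "\n";
}

close $in;
close $out or die $!;   # closing flushes the encoded octets into the scalar

# Decode the result again to check what was written.
my $round_trip = decode('euc-jp', $eucjp_octets);
printf "wrote %d octets, %d characters\n",
    length $eucjp_octets, length $round_trip;
```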
SEE ALSO
The perlunicode and perllocale manual pages contain even more information about Unicode.
AUTHOR
Jonathan Rockway jrockway@cpan.org
COPYRIGHT
Copyright (c) 2006, Jonathan Rockway. This article may be redistributed under the same terms of Perl -- GPLv2, or Artistic 1.0. Enjoy.