[Templates] TT and Perl's UTF8 problem again

Ivan Kurmanov kurmanov@openlib.org
Sat, 1 Mar 2003 05:30:28 +0200



Hi everyone.

Remember that problem with non-latin-1 characters being painfully
distorted when added to a UTF-8 encoded string (in perl 5.6.x)?
Although slightly off-topic, we discussed it here in November last year
under the subject "UTF8 support and issues".  Then a magic workaround
was proposed:

  $var = pack( 'U*', unpack( 'U*', $var ) );

If applied to all template variables, it was supposed to make template
output clean.  Now I ran into this problem again.
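
A minimal sketch of what "applied to all template variables" could look
like, assuming the variables live in a flat hash of plain strings.  The
helper name mark_utf8 and the hash %vars are hypothetical, not anything
from TT or from the earlier thread.  The behaviour of pack/unpack with
'U' changed in later perl releases, so on a newer perl this may not
reproduce the 5.6.x effect; it only shows the shape of the idiom.

```perl
use strict;
use warnings;

# Hypothetical helper: round-trip a string through pack/unpack 'U*',
# i.e. the workaround quoted above.
sub mark_utf8 {
    my ($s) = @_;
    return pack( 'U*', unpack( 'U*', $s ) );
}

# Hypothetical template variables (a flat hash of plain strings).
my %vars = ( title => 'hello', author => 'Ivan' );

# Apply the workaround to every value before handing %vars to TT.
# "values" returns aliases, so assigning to $_ updates the hash in place.
$_ = mark_utf8($_) for values %vars;
```

Nested data structures would need a recursive walk instead of a single
loop over the values.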

Let me start from the beginning.  I use perl 5.6.1, XML::XPath and
Template-Toolkit (2.07) to read in an XML file (content) and write out
a bunch of HTML files (in presentation).  All XML-parsed data in Perl
is correctly in UTF-8 encoding.  That is perfectly fine with me: I
write HTML in UTF-8.

My XML file has non-Latin content (Russian).  To write out correct
UTF-8, I use the pack/unpack workaround mentioned above to make my
data clean.  This worked well until recently.

Then I introduced some Russian words into one of my templates.  Now
those got f##ked up when output by Template-Toolkit.  Before that there
were only pure Latin characters in the templates and it worked fine.

As you may guess, the templates are in UTF8, although this doesn't
make much difference.  There is text in a template and it gets output
in a distorted way.  To make my point clear:

1- this only happens if you output it together with some
   internally-marked UTF8 data, e.g. XML-originating data;
2- it is Perl's bug, not TT's.
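
The condition in point 1 can still be seen on current perls with a few
lines and no modules.  This is only an illustrative sketch, not the
attached demo: the two strings are made-up stand-ins for an XML-parsed
value (internally marked as UTF-8) and raw bytes read from a template
file.

```perl
use strict;
use warnings;

# Stand-in for XML-parsed data: one character, U+043F, marked as UTF-8.
my $from_xml = "\x{43f}";

# Stand-in for template text read from disk: the same letter as two raw
# UTF-8 bytes, not marked as UTF-8.
my $from_file = "\xd0\xbf";

# Joining them makes perl treat each raw byte as a Latin-1 character.
my $mixed = $from_xml . $from_file;

# Encode the result to see the bytes that would be written out.
my $out = $mixed;
utf8::encode($out);

# Print the output bytes in hex.
printf "%s\n", join ' ', map { sprintf '%02x', ord } split //, $out;
```

This prints "d0 bf c3 90 c2 bf": the first two bytes are the intact
letter from the marked string, the remaining four are the unmarked
bytes encoded a second time.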

To make my point even clearer, I attach a bug-reproducing demo script
with one small file it needs.  (In addition to TT, it also needs
XML::XPath to run.)  It tries to output the same non-Latin UTF8
string, once read from XML and once read from a template file, in
several different ways, showing you the result of each.


So I had no other option but to dig into the TT sources a little to
add that funny pack/unpack hack where it might help.  After a number
of attempts, I stuck it into certain points of Template::Provider and
Template::Directive.  That's exactly two lines changed and two lines
added.

Now my XML-to-HTML thing works fine, as long as I put changed versions
of these modules in PERLLIB before the original ones.  So, the
question:

Should this fix be included in official TT or should it stay as it
is -- a hack?

Or maybe we can and should make it optional?  I don't know.  But I
certainly think there should be an official TT way to work around this
Perl problem, 'cause it is serious.  It's not specific to Russian;
it will happen every time there is some XML data and some non-Latin
data.

I ran the simplest possible timing tests.  At least on my system, with
my program (which does quite a bit of template processing, but not
only that), it doesn't show any significant performance hit.

But if we do make it official, then we probably need to do the same
thing to every variable included in TT output as well.  (I mean the
original problem discussed in November).  So this probably needs to be
a more general solution.

What do you think?


BTW, is this problem fixed in perl 5.8?


Cheers,

Kurmanov

Attachment: utf8bugdemo.pl

#!/usr/local/bin/perl
use strict;
use XML::XPath;

my $file = \*DATA;
my $xp = XML::XPath -> new( ioref => $file );
my $xmltext = $xp -> findvalue( '/doc/text()' );

use Template;
my $tt = Template->new( { INCLUDE_PATH => "." } );


my $template;

print "--- through tt INSERT ---\n";

$template =
qq!xml var text: [% xmltext %]
    template: [% INSERT footer %]
!;

$tt -> process( \$template, { xmltext => $xmltext } )
  or die "Error: ", $tt->error();


print "--- through tt PROCESS ---\n";

$template =
qq!xml var text: [% xmltext %]
    template: [% PROCESS footer %]
!;

$tt -> process( \$template, { xmltext => $xmltext } )
  or die "Error: ", $tt->error();


print "--- through tt INCLUDE ---\n";

$template =
qq!xml var text: [% xmltext %]
    template: [% INCLUDE footer %]
!;

$tt -> process( \$template, { xmltext => $xmltext } )
  or die "Error: ", $tt->error();


my $footer;

print "--- manually without pack/unpack 'U*' ---\n";

open FH, 'footer';
$footer = join( '', <FH> );
close FH;

print
qq!xml var text: $xmltext
    template: $footer
!;


print "--- manually with pack/unpack 'U*' ---\n";

open FH, 'footer';
$footer = join( '', <FH> );
close FH;

$footer = pack('U*', unpack('U*', $footer ) );

print
qq!xml var text: $xmltext
    template: $footer
!;


print "--- the end ---\n";

__DATA__
<?xml version='1.0' encoding='utf-8'?>
<doc>привет</doc>

Attachment: footer

привет
