Non-ASCII encoding (was: Re: proposed time zone package changes: Brazil; Mauritius; URL fixes)

July 1, 2008

      On 01.07.2008 01:26, Rodrigo Severo wrote:
...
The text by Paul Schulze about Brazilian timezones is missing all 
accented characters. Here is the text with the proper characters:

I would like to use the opportunity to clarify the question of the 
encoding of non-ASCII characters in the tzdata files. This is only a 
minor point, because such characters occur only in comments, but I think 
the encoding should at least be defined.
In tzdata2008c there seems to be only one non-ASCII character, the 
accented e in the name José Miguel Garrido in the file southamerica. It 
is evidently encoded in ISO 8859-1 (Latin-1).
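
A check of this kind can be sketched in a few lines of Python. This is 
only an illustration: the function names and the sample comment line are 
made up for the example and are not taken from the tzdata distribution. 
It lists every byte outside 7-bit ASCII and tests whether the data would 
also be valid UTF-8.

```python
def find_non_ascii(data: bytes):
    """Return (line_number, byte_offset_in_line, byte_value) for each byte >= 0x80."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line):
            if byte >= 0x80:
                hits.append((lineno, col, byte))
    return hits


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample line (not copied from tzdata): an accented e
# stored as the single byte 0xE9, as ISO 8859-1 would encode it.
sample = "# From José Miguel Garrido\n".encode("iso-8859-1")

print(find_non_ascii(sample))   # [(1, 10, 233)]
print(is_valid_utf8(sample))    # False: a lone 0xE9 is not valid UTF-8
```

A lone byte 0xE9 that is not followed by UTF-8 continuation bytes fails 
the UTF-8 check, which is consistent with the file being in an 8-bit 
encoding such as ISO 8859-1 rather than UTF-8.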

If more non-ASCII characters are going to be included in the tzdata 
files, I would like to propose defining UTF-8 as the official encoding 
of the tzdata files. UTF-8 is widely supported and is a true superset of 
7-bit ASCII, so it does not change the encoding of the actual data. I 
think it is only a question of time until the name of a contributor, a 
location, or an official publication cannot be properly represented in 
any single 8-bit encoding. For example, the letter "r" in my surname 
should really be "ř", "Latin Small Letter R With Caron" (U+0159), which 
is not part of ISO 8859-1.
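
Both claims can be illustrated with a short Python sketch. The helper 
name `encodable` and the sample strings are assumptions chosen for the 
example. It shows that pure ASCII text has identical bytes in ASCII and 
UTF-8, and that a text mixing "ř" with a letter such as "ñ" fits neither 
ISO 8859-1 nor ISO 8859-2, while UTF-8 handles it.

```python
def encodable(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# Pure ASCII text (hypothetical sample) is byte-for-byte identical in UTF-8.
ascii_text = "# plain ASCII comment\n"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))  # True

r_caron = "\u0159"  # Latin Small Letter R With Caron
print(encodable(r_caron, "iso-8859-1"))  # False
print(encodable(r_caron, "iso-8859-2"))  # True
print(r_caron.encode("utf-8"))           # b'\xc5\x99'

# Two letters that no single one of these 8-bit encodings covers together.
mixed = "\u00f1" + r_caron  # n with tilde, then r with caron
print(encodable(mixed, "iso-8859-1"))  # False
print(encodable(mixed, "iso-8859-2"))  # False
print(encodable(mixed, "utf-8"))       # True
```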

Best regards
Martin Jerabek