Non-ASCII outside comments?
Prompted by the introduction of non-ASCII into comments, I checked how zic handles a... Zone America/San_José ...line and discovered that, at least on the test system, it seems to do a right thing. Should zic permit non-ASCII characters in zone names? In time zone abbreviations? If they are permitted, should zic warn about them? --ado
On 25 June 2014 17:17, Arthur David Olson <arthurdavidolson@gmail.com> wrote:
Prompted by the introduction of non-ASCII into comments, I checked how zic handles a... Zone America/San_José ...line and discovered that, at least on the test system, it seems to do a right thing.
Should zic permit non-ASCII characters in zone names? In time zone abbreviations?
I think that non-ASCII should, for now, be kept out of things that turn into file names at some point. Browsers, email clients, and editors have pretty much caught up with UTF-8 by now, but file managers are, at least in some cases, still ‘getting there’, and may not be able to handle files whose names contain non-ASCII characters properly, especially on Windows systems if the character is not present in the system code page. (So a ‘ø’ would be a little bit OK on a Western European Windows but a ‘ř’ would not, while on a Central European Windows the opposite might be true.) Cheers, Philip -- Philip Newton <philip.newton@gmail.com>
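The code-page point can be illustrated with a small Python sketch. It assumes that "Western European Windows" corresponds to code page 1252 and "Central European Windows" to code page 1250; the helper name `encodable` is made up for this illustration and is not part of any tz tooling.

```python
def encodable(ch: str, codepage: str) -> bool:
    """Return True if `ch` has a representation in the named code page."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False


# Assumption: cp1252 stands in for a Western European system code page,
# cp1250 for a Central European one.
for ch in ("ø", "ř"):
    for codepage in ("cp1252", "cp1250"):
        status = "representable" if encodable(ch, codepage) else "not representable"
        print(f"{ch!r} in {codepage}: {status}")
```

This only shows whether a character exists in a code page; what a given file manager does with an unrepresentable name is a separate matter.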
On Jun 25, 2014, at 11:17 AM, Arthur David Olson <arthurdavidolson@gmail.com> wrote:
Prompted by the introduction of non-ASCII into comments, I checked how zic handles a... Zone America/San_José ...line and discovered that, at least on the test system, it seems to do a right thing.
Should zic permit non-ASCII characters in zone names? In time zone abbreviations? If they are permitted, should zic warn about them?
Zone names need to be ASCII, nothing more. Isn’t this specified by POSIX? In any case, zone names turn into path names, and while some operating systems support Unicode file names, a lot do not. paul
On Jun 25, 2014, at 8:47 AM, <Paul_Koning@dell.com> wrote:
Zone names need to be ASCII, nothing more. Isn’t this specified by POSIX?
What the latest version of the Single UNIX Specification says of the TZ environment variable in 8.3 "Other Environment Variables" is

    This variable shall represent timezone information. The contents of the environment variable named TZ shall be used by the ctime(), ctime_r(), localtime(), localtime_r(), strftime(), mktime() functions, and by various utilities, to override the default timezone. The value of TZ has one of the two forms (spaces inserted for clarity):

        :characters

    or:

        std offset dst offset, rule

    If TZ is of the first format (that is, if the first character is a <colon>), the characters following the <colon> are handled in an implementation-defined manner.

so it doesn't explicitly say that tz database zone names must be ASCII there. However, it says in 8.1 "Environment Variable Definition" that:

    For values to be portable across systems conforming to POSIX.1-2008, the value shall be composed of characters from the portable character set (except NUL and as indicated below).

so, in theory, a value could have other characters, but such a setting isn't guaranteed to work on all systems.
In any case, zone names turn into path names, and while some operating systems support Unicode file names, a lot do not.
...and some of the ones that do support them actually support arbitrary sequences of bytes, except for NUL (string terminator) and the pathname separator, as file names, with the interpretation of non-ASCII sequences being based on the setting of an environment variable, so that é might be a single byte of 0xE9 (ISO 8859-1) or two bytes of 0xC3 0xA9 (UTF-8) or.... So, whilst non-ASCII characters might happen to work in some cases, there are probably other situations where they don't work as one might want them to work.
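A short Python sketch of the byte-level ambiguity described above: the same name yields different byte strings depending on the encoding in effect, and bytes written under one interpretation do not read back cleanly under the other. The variable names are illustrative only, and the zone name is the hypothetical one from the start of the thread.

```python
name = "America/San_José"  # hypothetical zone name from the original question

# The same character sequence as two different byte sequences.
LATIN1_BYTES = name.encode("iso-8859-1")  # é becomes the single byte 0xE9
UTF8_BYTES = name.encode("utf-8")         # é becomes the two bytes 0xC3 0xA9

print(LATIN1_BYTES)
print(UTF8_BYTES)

# Bytes produced as UTF-8 but interpreted as ISO 8859-1 decode without
# error, yet show two characters where one was intended.
print(UTF8_BYTES.decode("iso-8859-1"))

# Bytes produced as ISO 8859-1 are not valid UTF-8 at all.
try:
    LATIN1_BYTES.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not valid UTF-8:", exc.reason)
```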
On Wed, 25 Jun 2014, Arthur David Olson wrote:
Should zic permit non-ASCII characters in zone names? In time zone abbreviations? If they are permitted, should zic warn about them?
Please don't permit characters outside the portable subset of ASCII in zone names. Comments in text files are non-critical, and if they appear mangled due to incorrect settings (such as disagreement between different software layers about what locale is in use), or due to absence of support in my display font, or if they are incomprehensible due to my unfamiliarity with the language or character set involved, then it doesn't matter too much. On the other hand, if I am unable to enter a zone name due to incorrect settings, or due to missing support in my keyboard driver, or merely due to personal unfamiliarity with the "foreign" characters involved, then I won't be able to select the correct zone. This problem gets worse, for me, as character sets get larger: I might be able to figure out how to coerce my keyboard driver into emitting the codes for <latin small letter e with grave>, but I have very little chance of telling whether two somewhat-similar glyphs are different renderings of the same Chinese character, or different characters entirely, and I certainly don't know how to make my keyboard emit the right codes for them. This attitude is unfair to people for whom the letters A to Z and the digits 0 to 9 are foreign, unfamiliar, difficult to read, or difficult to input. I recognise that unfairness, but do not have a solution for them. --apb (Alan Barrett)
Arthur David Olson wrote:
Should zic permit non-ASCII characters in zone names? In time zone abbreviations? If they are permitted, should zic warn about them?
Currently zic allows non-ASCII characters in both places, no? Or more precisely, zic allows any byte except for ", newline, and the null byte. So the question is whether zic should stop allowing nearly-arbitrary byte strings, even byte strings that are not properly encoded characters. The simplest thing to do is to leave zic alone. I don't see much harm in that, though perhaps I'm missing something. PS to others: This is not a question about the tz database, as it will stick to a small subset of ASCII characters in zone names, as documented in "Theory". It's a question about what happens when people use 'zic' privately on their own databases.
On Jun 25, 2014, at 12:17 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Arthur David Olson wrote:
Should zic permit non-ASCII characters in zone names? In time zone abbreviations? If they are permitted, should zic warn about them?
Currently zic allows non-ASCII characters in both places, no? Or more precisely, zic allows any byte except for ", newline, and the null byte. So the question is whether zic should stop allowing nearly-arbitrary byte strings, even byte strings that are not properly encoded characters.
The simplest thing to do is to leave zic alone. I don't see much harm in that, though perhaps I'm missing something.
+1
But perhaps the documentation should indicate that:
the byte strings for zone names will be used, as is, in OS calls to create files, and we don't guarantee what effect that will have (for example, on at least one UNIX(R), with the default file system - regardless of whether it's running case-sensitive or case-insensitive - some amount of processing is done on file names to convert them to UTF-16 on disk);
the byte strings for abbreviations will be copied over to the file, without interpretation;
and that you use non-ASCII characters at your own risk.
Guys, please don't forget that zic is not the only usage of the tzdata. Countless platforms, languages, libraries, and applications use TZ identifiers. Putting a non-ASCII character in a time zone name is likely going to break many things. For example, if I have an app that asks the user for their time zone, I might save that time zone into a SQL database varchar column. One could argue that nvarchar is better, but surely we don't want to require folks to update their application database schemas just to stay current with the latest time zone updates. I'm sure there are many other scenarios that would be affected as well. -Matt
From: guy@alum.mit.edu Date: Wed, 25 Jun 2014 12:39:02 -0700 To: eggert@cs.ucla.edu CC: tz@iana.org Subject: Re: [tz] Non-ASCII outside comments?
On Jun 25, 2014, at 12:17 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Arthur David Olson wrote:
Should zic permit non-ASCII characters in zone names? In time zone abbreviations? If they are permitted, should zic warn about them?
Currently zic allows non-ASCII characters in both places, no? Or more precisely, zic allows any byte except for ", newline, and the null byte. So the question is whether zic should stop allowing nearly-arbitrary byte strings, even byte strings that are not properly encoded characters.
The simplest thing to do is to leave zic alone. I don't see much harm in that, though perhaps I'm missing something.
+1
But perhaps the documentation should indicate that:
the byte strings for zone names will be used, as is, in OS calls to create files, and we don't guarantee what effect that will have (for example, on at least one UNIX(R), with the default file system - regardless of whether it's running case-sensitive or case-insensitive - some amount of processing is done on file names to convert them to UTF-16 on disk);
the byte strings for abbreviations will be copied over to the file, without interpretation;
and that you use non-ASCII characters at your own risk.
On Jun 26, 2014, at 4:56 PM, Matt Johnson <mj1856@hotmail.com> wrote:
Guys, please don't forget that zic is not the only usage of the tzdata. Countless platforms, languages, libraries, and applications use TZ identifiers. Putting a non-ASCII character in a time zone name is likely going to break many things.
If by "the tzdata" you mean "the files that come with the IANA time zone distribution", then all we can do is 1) have the Theory file indicate that official tz files will have only characters in a given subset of ASCII in the zone names and 2) follow the rules of the Theory file when putting out tz data releases.

If by "the tzdata" you mean "any data that anyone ever constructs for time zone processing", then all we can do is 1) nothing, because somebody who constructs a file with tz syntax is not obliged to run it through zic at all, much less run it through zic with the "-v" flag.

The Theory file currently says

    Use only valid POSIX file name components (i.e., the parts of names other than '/'). Do not use the file name components '.' and '..'. Within a file name component, use only ASCII letters, '.', '-' and '_'. Do not use digits, as that might create an ambiguity with POSIX TZ strings. A file name component must not exceed 14 characters or start with '-'. E.g., prefer 'Brunei' to 'Bandar_Seri_Begawan'.

and I didn't see a patch from Paul that would *remove* any of those requirements, so, as long as we follow the Theory file rules, no official tz file will have a zone name that includes non-ASCII characters - or even ASCII characters other than a-z, A-Z, ".", "-", and "_".

So all Paul would be obliged to do would be to continue to follow the Theory rules; I have no reason to imagine that he would do anything other than that. He is *not* obliged to make zic unconditionally reject files that have non-ASCII characters; if some third party creates a tz file with non-ASCII characters in a zone name, and hands it to software that parses tz files, and Something Bad Happens, that's a matter for the third party and the developer of the software in question to resolve.
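For anyone who wants to check third-party zone names against the Theory rules quoted above, here is a minimal Python sketch. It implements only the rules as quoted in this message; the function name `follows_theory_rules` and the constant `MAX_COMPONENT_LEN` are invented for this illustration, and the code is not what zic itself does.

```python
import re

MAX_COMPONENT_LEN = 14  # "must not exceed 14 characters", per the quoted text

# ASCII letters, '.', '-' and '_' only; digits are deliberately excluded.
_ALLOWED = re.compile(r"[A-Za-z._-]+")


def follows_theory_rules(zone_name: str) -> bool:
    """Check a zone name against the Theory rules quoted in this thread."""
    for component in zone_name.split("/"):
        # Empty components, '.' and '..' are not acceptable.
        if component in ("", ".", ".."):
            return False
        if len(component) > MAX_COMPONENT_LEN:
            return False
        if component.startswith("-"):
            return False
        if not _ALLOWED.fullmatch(component):
            return False
    return True


for candidate in ("Asia/Brunei", "Asia/Bandar_Seri_Begawan", "America/San_José"):
    print(candidate, follows_theory_rules(candidate))
```

A check like this sits outside zic, so it leaves zic's current permissive behaviour untouched while still letting a data maintainer enforce the stricter convention.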
participants (7)
- Alan Barrett
- Arthur David Olson
- Guy Harris
- Matt Johnson
- Paul Eggert
- Paul_Koning@dell.com
- Philip Newton