My automatic repackager for CPAN has just barfed on the zone.tab in tzdata2012a. I should have checked the diff more carefully, I guess. The fault it finds is that the new America/Creston line has *two* tab characters between "America/Creston" and the region description. The comments at the top of the file include "Columns are separated by a single tab.". Previously the file adhered to the single tab rule, and I made my parser correspondingly strict. Obviously I'll manually intervene to get 2012a repackaged, but I think the file should be fixed for 2012b. -zefram
    Date:        Thu, 1 Mar 2012 12:17:15 +0000
    From:        Zefram <zefram@fysh.org>
    Message-ID:  <20120301121715.GE3007@lake.fysh.org>

  | The fault it finds is that the new America/Creston line has *two*
  | tab characters between "America/Creston" and the region description.

Oh, sorry about that - I have no actual user of that file, so that
kind of error just slips past.

  | I think the file should be fixed for 2012b.

My copy of the file has it corrected....

kre
    Date:        Thu, 1 Mar 2012 12:17:15 +0000
    From:        Zefram <zefram@fysh.org>
    Message-ID:  <20120301121715.GE3007@lake.fysh.org>

  | The comments at the top of the file include "Columns are separated by a
  | single tab.".

The extra tab is gone from 2012b, so the immediate problem is gone.

However, that said, I believe your parser might be incorrect. The comments
in zone.tab also say ...

    # This file contains a table with the following columns:
    # 1.  ISO 3166 2-character country code.  See the file `iso3166.tab'.
    # 2.  Latitude and longitude of the zone's principal location
    #     in ISO 6709 sign-degrees-minutes-seconds format,
    #     either +-DDMM+-DDDMM or +-DDMMSS+-DDDMMSS,
    #     first latitude (+ is north), then longitude (+ is east).
    # 3.  Zone name used in value of TZ environment variable.
    # 4.  Comments; present if and only if the country has multiple rows.

That is, there are exactly 4 columns, separated by a single tab (3 tabs).

The final column is "comments" - there's no stated restriction on the
characters that can be used in comments (the first three columns all
have a format that implicitly defines their content, the first is always
2 alpha chars, the second signs and digits, the third something that is
reasonable as an environment var value (letters, digits, underscores,
hyphens, slashes ... (but no white space generally) - but the last is
just comments. I'd typically allow anything in comments, including
white space, including tabs, wherever they may occur, including at the
start of the field.

So, while it was certainly not intentional, it is reasonable to argue that
the entry

    CA +4906-11631 America/Creston Mountain Standard Time - Creston, British Columbia

that was in 2012a contained

    column 1 "CA"
    column 2 "+4906-11631"
    column 3 "America/Creston"
    column 4 " Mountain Standard Time - Creston, British Columbia"

and I'd suggest that's how a parser should interpret it. It might not
make much sense, but I don't actually see it as a syntax error.

kre
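The reading kre describes (exactly three separators, with everything after the third tab treated as comment text) can be sketched in Python as below. This is only an illustration of that interpretation, not code from any of the parsers discussed; the name `split_columns` is hypothetical, and the sample line is rebuilt from the description in the thread, with two tab characters after "America/Creston".

```python
def split_columns(line):
    """Split a zone.tab data line on single tabs into at most 4 columns.

    Only the first three tabs act as separators; any further tabs are
    kept as part of the fourth (comments) column.
    """
    return line.rstrip("\n").split("\t", 3)


# The 2012a entry as described in the thread: two tabs before the comment.
creston_2012a = (
    "CA\t+4906-11631\tAmerica/Creston\t"
    "\tMountain Standard Time - Creston, British Columbia\n"
)

columns = split_columns(creston_2012a)
for number, value in enumerate(columns, start=1):
    print(f"column {number} {value!r}")
```

Under this reading the fourth column starts with a tab character, which matches the leading whitespace shown in column 4 above.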
Robert Elz wrote:
> The final column is "comments" - there's no stated restriction on the
> characters that can be used in comments
True, and your interpretation with the comments beginning with a tab
is possible. (I was half expecting DateTime::TimeZone (another
repackaging of the database for CPAN), which went out promptly after
2012a, to have ended up with this interpretation, but it turns out it
went with the two tabs being a single separator. I don't know how
automated Dave Rolsky has that release process.) However,
tab-separated-value tables conventionally don't allow tabs to be part
of the data, each tab being a separator.

Anyway, I was aware of some ambiguity here when I wrote my parser.
Quite apart from the tab issue, there's no stated restriction of the
comments to ASCII, but there's also no indication of which encoding
would be used for non-ASCII characters. So I made the parser as strict
as possible based on the partial statement of the file format and the
(admirably regular) data actually seen. This includes a restriction
that the comments contain only printable ASCII, and neither begin nor
end with whitespace.

On its face this isn't in accord with receiving half of the Postel
principle, but the failure mode here isn't a total failure of
operation, it's to kick the issue up for conscious human attention.
(It emailed me.) The design is conservative in that I've told the
parser not to guess the meaning of anything irregular.

Rather than argue about what the current syntax definition means, when
it's plainly unclear on some of the details, I'd rather resolve this by
making the definition more detailed. I suggest that it should be
defined to match the strict syntax to which the data has heretofore
adhered, and which my parser expects. For reference, these are the
Perl regexps that I use to parse zone.tab (in Time::OlsonTZ::Download):

    $line =~ /\A([A-Z]{2})
              \t([-+][0-9]{4}(?:[0-9]{2})?[-+][0-9]{5}(?:[0-9]{2})?)
              \t([!-~]+)
              (?:\t([!-~][ -~]*[!-~]))?
              \n\z/x;
    $line =~ /\A#[^\n]*\n\z/;

We should also have an automated test, as part of tzcode, that checks
that the file matches whatever detailed syntax is decided, and that its
content is semantically sane (refers only to defined zones, for
example). I'm happy to translate my regexps, or equivalents for
whatever other syntax we agree on, into C for this purpose. The same
goes for iso3166.tab.

-zefram
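As a rough sketch of what such a syntax check could look like, the two Perl regexps above translate fairly directly to Python. This is a translation made for illustration, not the Time::OlsonTZ::Download code, and the names `DATA_LINE`, `COMMENT_LINE`, `check_line` and `bad_line_numbers` are hypothetical. Perl's `\A ... \z` anchoring is approximated with `fullmatch`.

```python
import re

# Data line: country code, ISO 6709 coordinates, zone name, optional comment.
DATA_LINE = re.compile(
    r"""
    ([A-Z]{2})
    \t([-+][0-9]{4}(?:[0-9]{2})?[-+][0-9]{5}(?:[0-9]{2})?)
    \t([!-~]+)
    (?:\t([!-~][ -~]*[!-~]))?
    \n
    """,
    re.VERBOSE,
)
# Comment line: starts with '#', runs to the newline.
COMMENT_LINE = re.compile(r"#[^\n]*\n")


def check_line(line):
    """Return "comment", a tuple of the four columns, or None if irregular."""
    if COMMENT_LINE.fullmatch(line):
        return "comment"
    match = DATA_LINE.fullmatch(line)
    return match.groups() if match else None


def bad_line_numbers(lines):
    """Return the 1-based numbers of lines that fit neither pattern."""
    return [n for n, line in enumerate(lines, start=1) if check_line(line) is None]


sample = [
    "# example comment line\n",
    "AD\t+4230+00131\tEurope/Andorra\n",
    "CA\t+4906-11631\tAmerica/Creston\tMountain Standard Time - Creston, British Columbia\n",
    "CA\t+4906-11631\tAmerica/Creston\t\tMountain Standard Time - Creston, British Columbia\n",
]
print(bad_line_numbers(sample))
```

With these patterns the doubled tab is rejected rather than guessed at, which is the "kick it up for human attention" behaviour described above. A semantic check (for example, that each zone name is actually defined) would need the rest of the database and is not attempted here.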
On Fri, 2 Mar 2012, Zefram wrote:
> Robert Elz wrote:
>> The final column is "comments" - there's no stated restriction on the
>> characters that can be used in comments
>
> True, and your interpretation with the comments beginning with a tab
> is possible. (I was half expecting DateTime::TimeZone (another
> repackaging of the database for CPAN), which went out promptly after
> 2012a, to have ended up with this interpretation, but it turns out it
> went with the two tabs being a single separator. I don't know how
> automated Dave Rolsky has that release process.) However,
> tab-separated-value tables conventionally don't allow tabs to be part
> of the data, each tab being a separator.
I noticed the issue and adjusted my parser to expect 1+ tab.

-dave

/*============================================================
http://VegGuide.org               http://blog.urth.org
Your guide to all that's veg      House Absolute(ly Pointless)
============================================================*/
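For comparison with the two stricter readings earlier in the thread, a "1+ tab" separator might look like the Python sketch below. This is a guess at the general approach only; it is not taken from DateTime::TimeZone, and `split_lenient` is a hypothetical name.

```python
import re


def split_lenient(line):
    """Split a zone.tab data line, treating any run of tabs as one separator."""
    return re.split(r"\t+", line.rstrip("\n"), maxsplit=3)


# The 2012a entry as described in the thread: two tabs before the comment.
creston_2012a = (
    "CA\t+4906-11631\tAmerica/Creston\t"
    "\tMountain Standard Time - Creston, British Columbia\n"
)
fields = split_lenient(creston_2012a)
print(fields)
```

Here the doubled tab collapses into a single separator, so the comment column comes out without leading whitespace.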
Participants (3): Dave Rolsky, Robert Elz, Zefram