[Patch] Make it slightly easier to parse tzdata

Hi there,
Right now I am experimenting with trying to integrate the tzdata dataset into an embedded setup where I am not able to use zic(8) + /usr/share/zoneinfo. To summarize, my plan is to translate the entire tzdata into a set of C array/structure initializers (~50 KB for all timezones) and perform operations on those structures directly.
While writing a Python script to translate the data, I noticed that there is a very small number of directives that would require quite a lot of additional code to parse properly. For example, "lastSun" makes a lot of sense as a special keyword in "Rule" directives, but it provides no functional gain in "Zone" directives.
The same holds for the use of the "Dec 29 24:00" time used in Samoa's timezone. We should be able to use "Dec 30" instead, right?
Anyway, I've attached a patch that should make it slightly easier to parse the tzdata files by removing these constructs. This should make it easier for people to reuse the dataset for other purposes.
Thanks,
--
Ed Schouten <ed@80386.nl>

On 10/31/2014 07:56 AM, Ed Schouten wrote:
there is a very small number of directives that would require quite a lot of additional code to parse properly. For example, "lastSun" makes a lot of sense as a special keyword in "Rule" directives, but it provides no functional gain in "Zone" directives.
The same holds for the use of the "Dec 29 24:00" time used in Samoa's timezone. We should be able to use "Dec 30" instead, right?
We could, but if the original announcement said the equivalent of "Dec 29 24:00" it's helpful if the corresponding zone line matches the announcement. Similarly, if the Zone transition is supposed to be at the same time and date as a Rule transition, it simplifies maintenance a bit to use the same string for both. As I understand it, in zic the same code is used to parse dates regardless of whether they appear in Rule or Zone lines. I assume the same thing could be done in a Python parser....
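To illustrate the point about sharing one date parser between Rule and Zone lines, here is a minimal Python sketch. It is not zic's code: the function name `resolve_day` and the `WEEKDAYS` table are hypothetical, and it assumes three-letter English weekday names and only the day forms `5`, `lastSun`, `Sun>=8` and `Sun<=25`. A Rule line would call it once per year the rule applies to; a Zone line would call it once with the year written on that line.

```python
import calendar
import datetime

# Hypothetical helper table: three-letter weekday names mapped to
# Python's weekday numbering (Monday == 0).
WEEKDAYS = {"Mon": 0, "Tue": 1, "Wed": 2, "Thu": 3, "Fri": 4, "Sat": 5, "Sun": 6}


def resolve_day(year, month, on):
    """Resolve a day field such as '5', 'lastSun' or 'Sun>=8' to a date."""
    if on.isdigit():
        return datetime.date(year, month, int(on))
    if on.startswith("last"):
        target = WEEKDAYS[on[4:]]
        days_in_month = calendar.monthrange(year, month)[1]
        last = datetime.date(year, month, days_in_month)
        # Step back from the last day of the month to the wanted weekday.
        return last - datetime.timedelta(days=(last.weekday() - target) % 7)
    if ">=" in on:
        name, day = on.split(">=")
        start = datetime.date(year, month, int(day))
        # Step forward to the first matching weekday on or after 'day'.
        return start + datetime.timedelta(days=(WEEKDAYS[name] - start.weekday()) % 7)
    if "<=" in on:
        name, day = on.split("<=")
        start = datetime.date(year, month, int(day))
        # Step back to the last matching weekday on or before 'day'.
        return start - datetime.timedelta(days=(start.weekday() - WEEKDAYS[name]) % 7)
    raise ValueError("unsupported day field: " + on)


print(resolve_day(1980, 3, "lastSun"))
```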

2014-10-31 21:18 GMT+01:00 Paul Eggert <eggert@cs.ucla.edu>:
On 10/31/2014 07:56 AM, Ed Schouten wrote:
there is a very small number of directives that would require quite a lot of additional code to parse properly. For example, "lastSun" makes a lot of sense as a special keyword in "Rule" directives, but it provides no functional gain in "Zone" directives.
The same holds for the use of the "Dec 29 24:00" time used in Samoa's timezone. We should be able to use "Dec 30" instead, right?
We could, but if the original announcement said the equivalent of "Dec 29 24:00" it's helpful if the corresponding zone line matches the announcement. Similarly, if the Zone transition is supposed to be at the same time and date as a Rule transition, it simplifies maintenance a bit to use the same string for both.
As I understand it, in zic the same code is used to parse dates regardless of whether they appear in Rule or Zone lines. I assume the same thing could be done in a Python parser....
My idea was to just make it easier for the next person. Adding support for it to my script is of course not infeasible. In fact, taking into account that it's only used in a couple of places, it will even be shorter to have a fixed map for these irregularly shaped entries:
    last_sun_map = {('1980', 'Mar'): 30, ...}
You could argue that the dates in Zone and Rule entries are simply not the same thing. Dates in Zone entries are absolute. They indicate an end date of a ruleset. "lastSun" would need to be applied to the year used in the statement itself. "lastSun" in Rule entries is not applied to a specific year. It is merely copied into the compiled timezone. I'd say that requiring the same parsing logic may be too demanding.
Though I agree that it would be nice to have the definitions matching up with original announcements, in the end they will need to be processed by machines. If we are afraid that people get confused between "Dec 29 24:00" and "Dec 30", there is still the possibility to add a comment to clarify.
If a feature is used so rarely in the datasets that it's easier to use a lookup table to translate it to the proper value than it is to actually parse it properly, we might be sacrificing reusability.
Best regards,
--
Ed Schouten <ed@80386.nl>
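As a side note on the fixed map, such a table could probably be generated rather than typed by hand. The sketch below is only an assumption about how that might look: `last_sunday`, `MONTHS` and `zone_entries` are hypothetical names, and the single entry is the one quoted in the message; the other entries elided there are not filled in.

```python
import calendar

# Explicit month table, so the result does not depend on the locale.
MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def last_sunday(year, month):
    """Return the day of the month of the last Sunday."""
    # monthcalendar() yields Monday-first weeks, with 0 for days that fall
    # outside the month, so the largest Sunday entry is the last Sunday.
    weeks = calendar.monthcalendar(year, month)
    return max(week[calendar.SUNDAY] for week in weeks)


# Hypothetical input: the (year, month) pairs of Zone lines using "lastSun".
zone_entries = [("1980", "Mar")]

last_sun_map = {(year, month): last_sunday(int(year), MONTHS[month])
                for year, month in zone_entries}
print(last_sun_map)
```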

On Fri, Oct 31, 2014, at 10:56, Ed Schouten wrote:
Hi there,
Right now I am experimenting with trying to integrate the tzdata dataset into an embedded setup where I am not able to use zic(8) + /usr/share/zoneinfo.
I'm not sure I understand why you can't use the zoneinfo files, built elsewhere?

<<On Fri, 31 Oct 2014 17:02:55 -0400, random832@fastmail.us said:
On Fri, Oct 31, 2014, at 10:56, Ed Schouten wrote:
Hi there,
Right now I am experimenting with trying to integrate the tzdata dataset into an embedded setup where I am not able to use zic(8) + /usr/share/zoneinfo.
I'm not sure I understand why you can't use the zoneinfo files, built elsewhere?
Especially since, even if you have no filesystem, they can be compiled into your executable. (And the compiled format is architecture-independent, so the same data files can be used on any platform.)
It is really unfortunate that so many people insist on reinventing the wheel.
-GAWollman
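One way to picture "compiled into your executable" is a build-time step that turns a zoneinfo file, produced by zic on another machine, into a C array initializer. The Python sketch below is an illustration under assumptions, not an existing tool: the function name `to_c_array` and the generated identifiers are hypothetical, and the four bytes in the usage line are only the "TZif" magic that compiled files start with, standing in for real file contents.

```python
def to_c_array(name, data, per_line=12):
    """Render bytes as a C array definition plus a length constant."""
    lines = []
    for offset in range(0, len(data), per_line):
        chunk = data[offset:offset + per_line]
        lines.append("    " + ", ".join(f"0x{byte:02x}" for byte in chunk) + ",")
    body = "\n".join(lines)
    return (f"static const unsigned char {name}[] = {{\n{body}\n}};\n"
            f"static const unsigned int {name}_len = {len(data)};\n")


# Stand-in data; a real build step would pass the bytes read from a
# compiled zoneinfo file instead.
print(to_c_array("tz_example", b"TZif"))
```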

On 2014-10-31 15:12, Garrett Wollman wrote:
<<On Fri, 31 Oct 2014 17:02:55 -0400, random832@fastmail.us said:
On Fri, Oct 31, 2014, at 10:56, Ed Schouten wrote:
Hi there,
Right now I am experimenting with trying to integrate the tzdata dataset into an embedded setup where I am not able to use zic(8) + /usr/share/zoneinfo.
I'm not sure I understand why you can't use the zoneinfo files, built elsewhere?
Especially since, even if you have no filesystem, they can be compiled into your executable. (And the compiled format is architecture-independent, so the same data files can be used on any platform.)
It is really unfortunate that so many people insist on reinventing the wheel.
If space is a concern with all the sources using nearly 700KB and the binaries nearly 800KB, you could take a leaf from the web services and new media embedded systems areas, and use a compressed binary archive of the directory structure for the binaries (a la .jar, .war, .epub) and modify only the tz loader to use the required decompression code to load only the single required time zone info.
This would require a tradeoff between the compiled sizes of the decompressors on your platform and the compressed data:
     94K tz.tar.xz
    132K tz.tar.bz
    189K tz.tar.gz
    471K tz.zip
If you decided to go this route, the tradeoffs you make would be interesting to this group/list and other embedded folks, and code changes to support this would be useful for other embedded folks and projects such as tzdist.
--
Take care. Thanks, Brian Inglis

Brian Inglis wrote:
This would require a tradeoff between the compiled sizes of the decompressors on your platform and the compressed data: 94K tz.tar.xz
We can shrink it even further than that, as follows:
    files='
      africa antarctica asia australasia europe northamerica southamerica
    '
    sed '
      s/#.*//
      s/^/ /
      s/$/ /
      s/\([^0-9]\)0*\([0-9]\)/\1\2/g
      s/:0\([^0-9:]\)/\1/g
      s/[[:blank:]]\{1,\}/ /g
      s/ $//
      s/^ //
      /^$/d
    ' $files | lzip
That is, before compressing, omit backward-compatibility material, commentary, and unnecessary white space and zeros. On my platform, the above command outputs 21,831 bytes total.
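For readers working in Python, as the original poster is, the same substitutions can be sketched with the `re` module. This is a port under assumptions, not the command from the message: `minify_line` and `minify` are hypothetical names, and the standard-library `lzma` module (xz format) stands in for lzip, so the compressed size will not match the figure quoted above. The sample line is made up for illustration.

```python
import lzma
import re


def minify_line(line):
    """Apply the substitutions of the sed script to a single line."""
    line = re.sub(r"#.*", "", line.rstrip("\n"))        # drop commentary
    line = " " + line + " "                             # pad both ends
    line = re.sub(r"([^0-9])0*([0-9])", r"\1\2", line)  # drop leading zeros
    line = re.sub(r":0([^0-9:])", r"\1", line)          # drop a trailing ":0"
    line = re.sub(r"[ \t]+", " ", line)                 # collapse blanks
    return line.strip(" ")


def minify(text):
    """Minify every line and drop the lines that end up empty."""
    kept = [line for line in map(minify_line, text.splitlines()) if line]
    return "".join(line + "\n" for line in kept)


# Made-up input, only to show the effect of the substitutions.
sample = "# a comment\nZone  Foo  02:00  # trailing comment\n\n"
small = minify(sample)
packed = lzma.compress(small.encode("ascii"))
print(repr(small), len(packed))
```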

On Fri, 31 Oct 2014, Ed Schouten wrote:
While writing a Python script to translate the data, I noticed that there is a very small number of directives that would require quite a lot of additional code to parse properly. For example, "lastSun" makes a lot of sense as a special keyword in "Rule" directives, but it provides no functional gain in "Zone" directives.
You can translate "lastSun" to a specific date during the conversion from tzdata's text format to your binary format. Then you would not need to do it in the embedded code that parses your binary format.
The same holds for the use of the "Dec 29 24:00" time used in Samoa's timezone. We should be able to use "Dec 30" instead, right?
Even if you can get it to work, I would advise against trying to write "Dec 30" in any rules for Samoa in 2011. There was no 30 December 2011 in Samoa, because at midnight at the end of 29 December 2011, their clocks jumped ahead 24 hours to the midnight at the beginning of 31 December.
Ordinarily (but not in Samoa in 2011), "Dec 29 24:00" and "Dec 30 00:00" would mean the same thing, and you could translate between them when you create your binary data. However, "Dec 30" with an unspecified time is supposed to mean "an unknown time on 30 December", and tzdata uses this style when the exact time of a transition is unknown.
--apb (Alan Barrett)
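The distinction drawn above can be kept by a converter with very little code. The following Python sketch is an assumption about how a conversion script might do it, not part of tzdata or zic: `parse_until` and `MONTHS` are hypothetical names, and only plain "H:MM" times are handled (any suffix letters on the time are ignored here). Since `datetime` does not accept an hour of 24, the time is added as an offset from midnight, and a flag records whether a time was written at all.

```python
import datetime

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def parse_until(year, month, day, time=None):
    """Return (wall-clock datetime, time_was_given) for a Zone end date.

    The datetime is expressed in the local time in force before the
    transition; "24:00" rolls over to 00:00 on the following day.
    """
    midnight = datetime.datetime(year, MONTHS[month], day)
    if time is None:
        # No time written: keep the date, but flag the time as unknown.
        return midnight, False
    hours, minutes = (int(part) for part in time.split(":"))
    return midnight + datetime.timedelta(hours=hours, minutes=minutes), True


print(parse_until(2011, "Dec", 29, "24:00"))
print(parse_until(2011, "Dec", 30))
```

Both calls yield the same wall-clock value, but only the first is marked as having an exact time, which is the information a plain "Dec 30" would lose.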
participants (6)
- Alan Barrett
- Brian Inglis
- Ed Schouten
- Garrett Wollman
- Paul Eggert
- random832@fastmail.us