...or America/Los_Angeles the old file is 1017 bytes, whereas the new file is 9107 bytes. Why did it grow by a factor of 9?
This is just about what I'd expect. There's the data in the old format (a factor of one) plus the data in the new format, where two considerations apply:

1. The transition times are 64 bits rather than 32 bits, doubling the size.
2. About 400 years of transitions are recorded rather than about 100, quadrupling the size.

Together, the two considerations mean that the new data takes about 8 times as much space as the old, and the total is about 9 times as much as the old. (Recording 400 years of data lets us extrapolate into the far future.)
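The estimate above can be checked against the two file sizes quoted in the question. This is only a rough sketch: the factors of 2 and 4 are the approximations given in the message, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope check of the "factor of 9" estimate.
# The byte counts are the America/Los_Angeles sizes quoted in the question;
# every name below is illustrative, not part of the tz code.
OLD_SIZE_BYTES = 1017
ACTUAL_NEW_SIZE_BYTES = 9107

WIDTH_FACTOR = 64 // 32   # 64-bit instead of 32-bit transition times -> 2
SPAN_FACTOR = 400 // 100  # about 400 instead of about 100 years -> 4


def estimated_new_size(old_size: int) -> int:
    """Old-format copy (1x) plus new-format copy (width factor x span factor)."""
    return old_size * (1 + WIDTH_FACTOR * SPAN_FACTOR)


estimate = estimated_new_size(OLD_SIZE_BYTES)
actual_ratio = ACTUAL_NEW_SIZE_BYTES / OLD_SIZE_BYTES

print(f"estimated new size: {estimate} bytes")
print(f"actual growth factor: {actual_ratio:.2f}")
```

The estimate lands within about one percent of the observed size, which is as close as a rule of thumb that ignores headers and abbreviation tables can be expected to get.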
...how about if we modify the format so that the second copy contains only the information that's not in the first copy? This should help prevent bloat.
Back of the envelope, we'd end up with...

	100 years in the old format = 1x the old file size
	300 years in the new format = 6x the old file size

...leaving us with new files that were seven times the size of the old ones (rather than nine times, as is the case now)--some savings, but not as much as we might like. Having duplicate information in the files does minimize changes to the existing code, and it does allow other file users to simply ignore the old-format information.

--ado
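The comparison between the two layouts can be written out as a small calculation. This is a sketch under the same rough assumptions as the message (100 years of 32-bit data as the unit of size, 64-bit entries twice as wide); the function name and parameters are hypothetical.

```python
# Size of a combined file, in multiples of the old 100-year, 32-bit file.
# Purely illustrative arithmetic; not derived from the actual file format.
def size_multiple(old_years: int, new_years: int,
                  width_factor: int = 2, base_years: int = 100) -> float:
    old_part = old_years / base_years                 # 32-bit section
    new_part = width_factor * new_years / base_years  # 64-bit section
    return old_part + new_part


current = size_multiple(old_years=100, new_years=400)   # full duplicate copy
proposed = size_multiple(old_years=100, new_years=300)  # only the extra years
saving = 1 - proposed / current

print(f"current: {current:.0f}x, proposed: {proposed:.0f}x, saving: {saving:.0%}")
```

Under these assumptions the non-duplicating layout saves a little over a fifth of the file, which matches the "some savings, but not as much as we might like" assessment.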
"Olson, Arthur David (NIH/NCI)" <olsona@dc37a.nci.nih.gov> writes:
1. The transition times are 64 bits rather than 32 bits, doubling the size.
2. About 400 years of transitions are recorded rather than about 100, quadrupling the size.

Together, the two considerations mean that the new data takes about 8 times as much space as the old, and the total is about 9 times as much as the old.
Ah, thanks, that explains it. I didn't know about (2). How about if we document this? Here's a proposed patch to the Theory file that explains this, along with some other issues that I noticed when I reread that file: * Update references to POSIX, etc.. * The tz code does not yet support the quoted time zone abbreviation syntax required by POSIX starting in 2001. * Add an example of a POSIX TZ setting. --- Theory 2004/05/27 16:00:30 2004.1 +++ Theory 2005/04/25 18:55:19 2004.1.0.1 @@ -12,26 +12,31 @@ ----- Time and date functions ----- -These time and date functions are upwards compatible with POSIX.1, +These time and date functions are mostly upwards compatible with POSIX, an international standard for UNIX-like systems. -As of this writing, the current edition of POSIX.1 is: +As of this writing, the current edition of POSIX is: - Information technology --Portable Operating System Interface (POSIX (R)) - -- Part 1: System Application Program Interface (API) [C Language] - ISO/IEC 9945-1:1996 - ANSI/IEEE Std 1003.1, 1996 Edition - 1996-07-12 + Standard for Information technology + -- Portable Operating System Interface (POSIX (R)) + -- System Interfaces + IEEE Std 1003.1, 2004 Edition + <http://www.opengroup.org/online-pubs?DOC=7999959899> + <http://www.opengroup.org/pubs/catalog/t041.htm> + +Currently the only POSIX feature not implemented is quoted time zone +abbreviations, e.g., TZ='<UTC-10>10' for a time zone 10 hours behind +UTC whose abbreviation is "UTC-10". -POSIX.1 has the following properties and limitations. +POSIX has the following properties and limitations. -* In POSIX.1, time display in a process is controlled by the - environment variable TZ. Unfortunately, the POSIX.1 TZ string takes +* In POSIX, time display in a process is controlled by the + environment variable TZ. Unfortunately, the POSIX TZ string takes a form that is hard to describe and is error-prone in practice. 
- Also, POSIX.1 TZ strings can't deal with other (for example, Israeli) + Also, POSIX TZ strings can't deal with other (for example, Israeli) daylight saving time rules, or situations where more than two time zone abbreviations are used in an area. - The POSIX.1 TZ string takes the following form: + The POSIX TZ string takes the following form: stdoffset[dst[offset],date[/time],date[/time]] @@ -40,6 +45,9 @@ POSIX.1 has the following properties and std and dst are 3 or more characters specifying the standard and daylight saving time (DST) zone names. + Starting with POSIX.1-2001, std and dst may also be + in a quoted form like "<UTC+10>"; this allows + "+" and "-" in the names. offset is of the form `[-]hh:[mm[:ss]]' and specifies the offset west of UTC. The default DST offset is one hour @@ -61,15 +69,25 @@ POSIX.1 has the following properties and where week 1 is the first week in which day d appears, and `5' stands for the last week in which day d appears (which may be either the 4th or 5th week). + + Here is an example POSIX TZ string, for US Pacific time using rules + appropriate from 1987 through at least 2005: -* In POSIX.1, when a TZ value like "EST5EDT" is parsed, - typically the current US DST rules are used, + TZ='PST8PDT,M4.1.0/02:00,M10.5.0/02:00' + + This POSIX TZ string is hard to remember, and mishandles time stamps + before 1987. With this package you can use this instead: + + TZ='America/Los_Angeles' + +* POSIX does not define the exact meaning of TZ values like "EST5EDT". + Typically the current US DST rules are used to interpret such values, but this means that the US DST rules are compiled into each program that does time conversion. This means that when US time conversion rules change (as in the United States in 1987), all programs that do time conversion must be recompiled to ensure proper results. 
-* In POSIX.1, there's no tamper-proof way for a process to learn the +* In POSIX, there's no tamper-proof way for a process to learn the system's best idea of local wall clock. (This is important for applications that an administrator wants used only at certain times-- without regard to whether the user has fiddled the "TZ" environment @@ -78,9 +96,9 @@ POSIX.1 has the following properties and daylight saving time shifts--as might be required to limit phone calls to off-peak hours.) -* POSIX.1 requires that systems ignore leap seconds. +* POSIX requires that systems ignore leap seconds. -These are the extensions that have been made to the POSIX.1 functions: +These are the extensions that have been made to the POSIX functions: * The "TZ" environment variable is used in generating the name of a file from which time zone information is read (or is interpreted a la @@ -108,7 +126,7 @@ These are the extensions that have been * To handle places where more than two time zone abbreviations are used, the functions "localtime" and "gmtime" set tzname[tmp->tm_isdst] (where "tmp" is the value the function returns) to the time zone - abbreviation to be used. This differs from POSIX.1, where the elements + abbreviation to be used. This differs from POSIX, where the elements of tzname are only changed as a result of calls to tzset. * Since the "TZ" environment variable can now be used to control time @@ -136,6 +154,18 @@ These are the extensions that have been Points of interest to folks with other systems: +* In 2005 this package started generating time zone information files + containing two sets of data. The first set uses 32-bit time stamps + and covers times from 1901-12-13 20:45:52 through 2038-01-19 + 03:14:07 UTC; it is for backward compatibility with older versions of + this and other libraries. 
The second set uses 64-bit time stamps + and contains about 400 years of transition times, which are + extrapolated into the indefinite future; it is for newer libraries, + typically on hosts with 64-bit time stamps. New files are + approximately nine times the size of the old, because the added data + set contains about four times as many transitions, and its time + stamps are twice as wide. + * This package is already part of many POSIX-compliant hosts, including BSD, HP, Linux, Network Appliance, SCO, SGI, and Sun. On such hosts, the primary use of this package @@ -173,9 +203,9 @@ Hewlett Packard, offer a wider selection beyond those provided here. The absence of such functions from this package is not meant to discourage the development, standardization, or use of such functions. Rather, their absence reflects the decision to make this package -contain valid extensions to POSIX.1, to ensure its broad -acceptability. If more powerful time conversion functions can be standardized, -so much the better. +contain valid extensions to POSIX, to ensure its broad acceptability. If +more powerful time conversion functions can be standardized, so much the +better. ----- Names of time zone rule files ----- @@ -277,7 +307,7 @@ and `Factory' (see the file `factory'). ----- Time zone abbreviations ----- When this package is installed, it generates time zone abbreviations -like `EST' to be compatible with human tradition and POSIX.1. +like `EST' to be compatible with human tradition and POSIX. Here are the general rules used for choosing time zone abbreviations, in decreasing order of importance: @@ -292,17 +322,16 @@ in decreasing order of importance: preferred "ChST", so the rule has been relaxed. This rule guarantees that all abbreviations could have - been specified by a POSIX.1 TZ string. POSIX.1 + been specified by a POSIX TZ string. POSIX requires at least three characters for an - abbreviation. POSIX.1-1996 says that an abbreviation + abbreviation. 
POSIX through 2000 says that an abbreviation cannot start with ':', and cannot contain ',', '-', - '+', NUL, or a digit. Draft 7 of POSIX 1003.1-200x - changes this rule to say that an abbreviation can - contain only '-', '+', and alphanumeric characters in - the current locale. To be portable to both sets of + '+', NUL, or a digit. POSIX from 2001 on changes this + rule to say that an abbreviation can contain only '-', '+', + and alphanumeric characters from the portable character set + in the current locale. To be portable to both sets of rules, an abbreviation must therefore use only ASCII - letters, as these are the only letters that are - alphabetic in all locales. + letters. Use abbreviations that are in common use among English-speakers, e.g. `EST' for Eastern Standard Time in North America. @@ -343,10 +372,10 @@ abbreviations like `EST'; this avoids th Calendrical issues are a bit out of scope for a time zone database, but they indicate the sort of problems that we would run into if we extended the time zone database further into the past. An excellent -resource in this area is Nachum Dershowitz and Edward M. Reingold, -<a href="http://emr.cs.uiuc.edu/home/reingold/calendar-book/index.shtml"> -Calendrical Calculations -</a>, Cambridge University Press (1997). Other information and +resource in this area is Edward M. Reingold and Nachum Dershowitz, +<a href="http://emr.cs.uiuc.edu/home/reingold/calendar-book/second-edition/"> +Calendrical Calculations: The Millennium Edition +</a>, Cambridge University Press (2001). Other information and sources are given below. They sometimes disagree. @@ -546,7 +575,7 @@ Sources: Michael Allison and Robert Schmunk, "Technical Notes on Mars Solar Time as Adopted by the Mars24 Sunclock" -<http://www.giss.nasa.gov/tools/mars24/help/notes.html> (2004-03-15). +<http://www.giss.nasa.gov/tools/mars24/help/notes.html> (2004-07-30). Jia-Rui Chong, "Workdays Fit for a Martian", Los Angeles Times (2004-01-14), pp A1, A20-A21.
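Two details in the patch above are easy to verify by hand: the range covered by 32-bit time stamps, and the dates selected by the Mm.w.d rules in the example TZ string. The following is a minimal sketch using only the Python standard library; the helper names are invented for this illustration and have nothing to do with how the tz code itself computes these values.

```python
import calendar
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_time_t(seconds: int) -> datetime:
    """UTC date and time for a count of seconds since the epoch (no leap seconds)."""
    return EPOCH + timedelta(seconds=seconds)


# Limits of a signed 32-bit time stamp.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1


def posix_m_rule(year: int, month: int, week: int, weekday: int) -> date:
    """Date chosen by a POSIX Mm.w.d rule (weekday 0 = Sunday, week 5 = last)."""
    py_weekday = (weekday - 1) % 7  # Python counts Monday as 0
    matches = [d for d in calendar.Calendar().itermonthdates(year, month)
               if d.month == month and d.weekday() == py_weekday]
    return matches[-1] if week == 5 else matches[week - 1]


print(from_time_t(INT32_MIN))        # earliest 32-bit time stamp
print(from_time_t(INT32_MAX))        # latest 32-bit time stamp
print(posix_m_rule(2005, 4, 1, 0))   # M4.1.0: first Sunday of April
print(posix_m_rule(2005, 10, 5, 0))  # M10.5.0: last Sunday of October
```

The first two values reproduce the 1901-12-13 20:45:52 and 2038-01-19 03:14:07 UTC limits named in the patch, and the last two give the 2005 US transition dates implied by 'PST8PDT,M4.1.0/02:00,M10.5.0/02:00'.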
    Date:        Mon, 25 Apr 2005 11:57:44 -0700
    From:        Paul Eggert <eggert@CS.UCLA.EDU>
    Message-ID:  <87ekcyr8ev.fsf@penguin.cs.ucla.edu>

  | * The tz code does not yet support the quoted time zone abbreviation
  | syntax required by POSIX starting in 2001.

Rather than documenting the deficiency, wouldn't it be just as easy to fix it? It doesn't look as if it should be too hard.

I haven't contributed any tz code in years now, so I'm likely to be a bit rusty on the details, but if no-one else wants to handle this, I'm willing to take a look.

If the syntax isn't as trivial as "Anything enclosed in a <> pair with no internal escape" (ie: no method to include '>') could someone with access to the spec please send the exact definition.

kre
Robert Elz <kre@munnari.oz.au> writes:
| * The tz code does not yet support the quoted time zone abbreviation | syntax required by POSIX starting in 2001.
Rather than documenting the deficiency, wouldn't it be just as easy to fix it? It doesn't look as if it should be too hard.
Yes, that would be nice. Thanks.
If the syntax isn't as trivial as "Anything enclosed in a <> pair with no internal escape" (ie: no method to include '>') could someone with access to the spec please send the exact definition.
The current POSIX spec for TZ is here (look at the end of the page): http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html POSIX is a bit more restrictive than the simple spec you gave, but it'd be fine to implement it the simple way, as an extension to POSIX.
    Date:        Tue, 26 Apr 2005 12:07:29 -0700
    From:        Paul Eggert <eggert@CS.UCLA.EDU>
    Message-ID:  <87pswhxspa.fsf@penguin.cs.ucla.edu>

  | Yes, that would be nice. Thanks.

Patch appended. (Patch is to localtime.c in tzcode2005h.)

  | The current POSIX spec for TZ is here (look at the end of the page):

Thanks. Bradley White sent me the text in private mail, which was fortunate, as shortly after that I lost net connectivity for a while.

  | POSIX is a bit more restrictive than the simple spec you gave,

Not really - it just says that lots of things are undefined. That's fine, what we do with undefined data is up to us - including when we treat it rationally...

  | but it'd be fine to implement it the simple way, as an extension to POSIX.

That's what I did. That was by analogy with the current (non quoted form) parsing, where posix says 3 or more alphabetic chars, and tzcode allows 3 or more "anything but + - , and digits". The new quoted code allows 3 or more "anything but >", which is definitely more than "alphanumeric plus '+' and '-'". (All forms disallow \0 of course.)

However, I'm not real sure that we shouldn't be a little more cautious with this. Do we really want to allow \n in a zone name abbreviation? Might not that cause "unanticipated" problems for some applications (I doubt many bother to sanitise the value of TZ before allowing it to be used)? There may be a few other characters that should perhaps be considered "bad" here, without going all the way to alphabetic (current locale alphabetic) or alphanumeric (and + & -) (current locale).

I'll leave that for someone else to think about - and perhaps also to ponder whether there are any programs that could be convinced to do "bad things" by putting "weird stuff" in the zone name abbreviation.

kre
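To make the two parsing rules described above concrete, here is a small sketch of splitting the leading abbreviation off a TZ string. It is written in Python for illustration only and is not the patch to localtime.c; the function name, the `strict` flag, and the exact error handling are assumptions. The lenient mode follows the "anything but >" rule for the quoted form, and the strict mode applies the narrower "alphanumeric plus '+' and '-'" rule, here restricted to ASCII.

```python
import string

# Characters that end an unquoted abbreviation ("anything but + - , and digits").
UNQUOTED_STOP = set("+-,") | set(string.digits)
# Characters allowed inside <...> when strict=True (ASCII-only assumption).
STRICT_QUOTED_OK = set(string.ascii_letters + string.digits + "+-")


def split_abbreviation(tz: str, strict: bool = False) -> tuple[str, str]:
    """Return (abbreviation, remainder) from the start of a POSIX-style TZ string."""
    if tz.startswith("<"):
        end = tz.find(">")
        if end == -1:
            raise ValueError("unterminated '<' in TZ string")
        name, rest = tz[1:end], tz[end + 1:]
        if strict and not set(name) <= STRICT_QUOTED_OK:
            raise ValueError(f"disallowed character in abbreviation {name!r}")
    else:
        i = 0
        while i < len(tz) and tz[i] not in UNQUOTED_STOP:
            i += 1
        name, rest = tz[:i], tz[i:]
    if len(name) < 3:
        raise ValueError("abbreviation must be at least 3 characters")
    return name, rest


print(split_abbreviation("<UTC-10>10"))
print(split_abbreviation("PST8PDT,M4.1.0/02:00,M10.5.0/02:00"))
```

With `strict=False`, a value such as "<A\nB>5" is accepted and yields an abbreviation containing a newline, which is the kind of input the caution above is about; with `strict=True` the same value is rejected.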
participants (3)
- Olson, Arthur David (NIH/NCI)
- Paul Eggert
- Robert Elz