Hello,

I am currently developing a C++ time-zone database parser, building on top of Howard Hinnant's work. I am writing thorough checks. One Zone check that I am experimenting with is:

*If the FORMAT field contains either a "%s" or a '/' then the RULES field must contain a named rule.*

However, line 3840 of the asia file in https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz fails this check, as highlighted in yellow in the table below.

# Tajikistan
# From Shanks & Pottenger.
# Zone  NAME            STDOFF    RULES       FORMAT   [UNTIL]
Zone    Asia/Dushanbe   04:35:12  -           LMT      1924 May  2
                        5:00      -           +05      1930 Jun 21
                        6:00      RussiaAsia  +06/+07  1991 Mar 31 2:00s
                        5:00      1:00        +05/+06  1991 Sep  9 2:00s
                        05:00     -           +05

My guess is that '+05/+06' should be '+05'. I may well be wrong. Could you let me know either way?

If you are interested, I have found a few other bits and pieces where the data files are in slight variance with the documentation but are not show-stoppers. Most of these occur in the very early releases. (I am testing every database available on https://data.iana.org/time-zones/releases/)

Regards
Nick Deguillaume
+44 75 2828 6473
On Fri, 13 May 2022 at 12:20, Nick Deguillaume via tz <tz@iana.org> wrote:
My guess is that '+05/+06' should be '+05'. I may well be wrong. Could you let me know either way?
zic.8 in the development repository says, of the RULES field, "When an amount of time is given, only the sum of standard time and this amount matters."

https://github.com/eggert/tz/blob/65f616d2ed2784c97240266c9510ced135478d7e/z...

So a STDOFF of 5:00 and RULES of 1:00 will sum to an offset of UT+06:00. Since daylight saving is applied by the rule, the portion of FORMAT after the slash (+06) is used.

I'm not sure where your "must contain a named rule" quote is coming from; in this case the "rule" is not named, *per se*, but is rather a constant 1:00. The relevant thing is that the RULES field is not "-".

If you are interested, I have found a few other bits and pieces where the data files are in slight variance with the documentation but are not show-stoppers.

Certainly, if you find discrepancies in recent releases of our own documentation that could be cleared up or clarified, please send those along.

-- Tim Parenti
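As an illustration of the point above, here is a minimal C++ sketch of how a parser might treat a Zone line whose RULES column is a fixed amount. The names total_offset, pick_abbreviation and dushanbe_example are hypothetical and are not taken from zic or from Howard Hinnant's library; the sketch assumes C++17 and covers only FORMAT values that are either plain or contain a single '/'.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <string_view>

// Hypothetical helper: total UT offset for a Zone line whose RULES column
// is a fixed amount. Only the sum of STDOFF and that amount matters.
constexpr std::chrono::seconds total_offset(std::chrono::seconds stdoff,
                                            std::chrono::seconds save) {
    return stdoff + save;
}

// Hypothetical helper: choose the abbreviation from a FORMAT such as
// "+05/+06". The part before '/' is used for standard time and the part
// after it when daylight saving is in effect. A FORMAT without '/' is
// returned unchanged.
inline std::string pick_abbreviation(std::string_view format,
                                     bool dst_in_effect) {
    const auto slash = format.find('/');
    if (slash == std::string_view::npos) return std::string(format);
    return std::string(dst_in_effect ? format.substr(slash + 1)
                                     : format.substr(0, slash));
}

// Worked example for the Asia/Dushanbe line: STDOFF 5:00, RULES 1:00.
inline void dushanbe_example() {
    using namespace std::chrono;
    const seconds save = hours{1};
    assert(total_offset(hours{5}, save) == hours{6});
    assert(pick_abbreviation("+05/+06", save != seconds{0}) == "+06");
}
```

With these assumptions the 1991 line yields UT+06:00 and the abbreviation "+06", which matches the explanation given above.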
Thank you for the explanation. It makes sense. I will send through a list of other issues once I have completed my full analysis.
On 5/13/22 09:35, Tim Parenti via tz wrote:
I'm not sure where your "must contain a named rule" quote is coming from
I imagine this was a style rule of Nick's invention. Violating the rule might issue a warning but it shouldn't be a fatal error, as the 'asia' file was correct as-is.
in this case the "rule" is not named, *per se*, but is rather a constant 1:00. The relevant thing is that the RULES field is not "-".
Or more precisely it's that the RULES column is neither "-", nor a suffixless zero offset, nor an offset with an "s" suffix. We don't use any of these more-obscure features in TZDB data but they're in the .zi format.

Since TZDB consistently avoids '/' in the many other places where this situation arises, it should avoid '/' here for stylistic consistency. So I installed the attached proposed patches. The 1st patch omits the '/' in question; the 2nd documents that STDOFF columns don't have suffixes (this wasn't clear in the man page, and I discovered this while looking into the 3rd patch), and the 3rd adds a style check for this.

None of these patches affect the TZif output files.
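The condition described above could be sketched in C++ roughly as follows. This is an illustration only: the function names are hypothetical, the test for "fixed amount versus rule name" (first character is a digit or a sign) is an assumption of this sketch, and slash_style_warning is not the check added by the patch mentioned above, just one way a parser might emit a warning rather than a fatal error.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string_view>

// Assumption of this sketch: a RULES value starting with a digit or a sign
// is a fixed amount; "-" alone means no rules; anything else is a rule name.
inline bool is_fixed_amount(std::string_view rules) {
    if (rules.empty() || rules == "-") return false;
    const unsigned char c = static_cast<unsigned char>(rules.front());
    return std::isdigit(c) != 0 || ((c == '+' || c == '-') && rules.size() > 1);
}

// True when a fixed amount is neither a suffixless zero offset nor an
// offset with an "s" suffix, i.e. when it denotes daylight saving time.
inline bool fixed_amount_is_dst(std::string_view rules) {
    if (!is_fixed_amount(rules)) return false;
    if (rules.back() == 's') return false;  // explicitly standard time
    const bool has_suffix =
        std::isalpha(static_cast<unsigned char>(rules.back())) != 0;
    const bool is_zero = std::none_of(rules.begin(), rules.end(),
                                      [](char ch) { return ch >= '1' && ch <= '9'; });
    return !(is_zero && !has_suffix);
}

// Hypothetical style warning: a '/' in FORMAT next to a fixed amount.
inline bool slash_style_warning(std::string_view rules, std::string_view format) {
    return is_fixed_amount(rules) && format.find('/') != std::string_view::npos;
}

// Example taken from the Asia/Dushanbe line discussed in this thread.
inline void rules_column_example() {
    assert(fixed_amount_is_dst("1:00"));
    assert(slash_style_warning("1:00", "+05/+06"));
}
```

A warning from slash_style_warning would flag the 1991 Asia/Dushanbe line while still accepting it, since the data were valid as written.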
Thanks for the message. Yes, you were correct in thinking that *If the FORMAT field contains either a "%s" or a '/' then the RULES field must contain a named rule.* was a trial rule of my own invention for my own parser implementation. I saw a pattern in the historic data and was curious as to whether it applied everywhere. I have now modified my checks so that all files pass.

To be clear, I am not suggesting that the zic compiler mishandles any old data files. Neither am I suggesting that there are any errors in the zic documentation. When I was referring to data being at slight variance to the documentation, the documentation I was referring to was:

https://data.iana.org/time-zones/tz-link.html
and
https://data.iana.org/time-zones/tz-how-to.html

I now recognise that I would have been better off using the zic documentation as my primary source. Nonetheless, here are a few things I have found:

1. tz_link.html states that: *Sources for the tz database are UTF-8 text files...* Some of the comments in some of the old files contain non-UTF-8 single-byte representations of accented letters. Since such occurrences are in the comments this will not affect anything.

2. The tz_how-to.html states that: *Prior to the 2020b release, it was called the TYPE field, though it was never used in the main data ...* However, some of the old data in https://data.iana.org/time-zones/releases/ contains "even" and "odd" to account for the Adelaide festival. (I got round this by excluding the versions of Australia/Adelaide exhibiting "even" and "odd".)

3. The tz_how-to.html states that: *The FORMAT column specifies the usual abbreviation of the time zone name. It can have one of three forms: a string of three or more characters that are either ASCII alphanumerics, “+”, or “-”, in which case that’s the abbreviation ...* I had to allow an underscore and space to allow all the files to pass. In the case of St. Helena I also had to allow a '?' as the first character. Further, I had to allow an abbreviation in a '/' separated format to be only two characters. (I recognise that this is not technically in violation of the statement above.)

4. I can see that some of the older files use a '?' where the more modern files use '%s'. This is not mentioned in the tz_how-to.html documentation; I recognise that putting such obscurities in the document may not be a good idea.

As you can see these are all very minor things. I appreciate your quick responses.

Regards
Nick
On 5/13/22 13:51, Nick Deguillaume wrote:
I now recognise that I would have been better off using the zic documentation as my primary source.
Yes, that's a better approach. In the attached patch (which I've installed in the development version) I added a note to that effect in the how-to document.
1. tz_link.html states that: *Sources for the tz database are UTF-8 text files... * Some of the comments in some of the old files contain non UTF-8 single byte representations of accented letters.
Thanks, fixed in the attached proposed patch by saying it's been UTF-8 since 2013a.
2. The tz_how-to.html states that: *Prior to the 2020b release, it was called the TYPE field, though it was never used in the main data ...* However, some of the old data in https://data.iana.org/time-zones/releases/ contains "even" and "odd" to account for the Adelaide festival.
Fixed in the attached by saying it's been the practice since 2000e.
3. The tz_how-to.html states that:
*The FORMAT column specifies the usual abbreviation of the time zone name. It can have one of three forms: a string of three or more characters that are either ASCII alphanumerics, “+”, or “-”, in which case that’s the abbreviation ...* I had to allow an underscore and space to allow all the files to pass.
Fixed in the attached by saying "should" instead of "can". Older data sometimes violated that advice, but this meant the resulting abbreviations didn't conform to POSIX. I also added a mention of %z there.
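For a parser that wants to report (rather than reject) abbreviations that depart from this advice, a check along the following lines may be enough. It is a sketch, assuming C++17; the name is_recommended_abbreviation is hypothetical, and the function looks at a single abbreviation, not at a whole FORMAT value containing "%s", "%z" or '/'.

```cpp
#include <algorithm>
#include <cassert>
#include <string_view>

// Hypothetical check: true when the abbreviation has three or more
// characters, each an ASCII alphanumeric, '+' or '-'. Older releases
// contain abbreviations that fail this test, so a parser would probably
// treat a false result as a warning and not as an error.
inline bool is_recommended_abbreviation(std::string_view abbr) {
    if (abbr.size() < 3) return false;
    return std::all_of(abbr.begin(), abbr.end(), [](char ch) {
        const bool ascii_alnum = (ch >= '0' && ch <= '9') ||
                                 (ch >= 'A' && ch <= 'Z') ||
                                 (ch >= 'a' && ch <= 'z');
        return ascii_alnum || ch == '+' || ch == '-';
    });
}

// Examples: "LMT" and "+05" appear in the Asia/Dushanbe data quoted in
// this thread; "A_T" is a made-up string containing an underscore.
inline void abbreviation_examples() {
    assert(is_recommended_abbreviation("LMT"));
    assert(is_recommended_abbreviation("+05"));
    assert(!is_recommended_abbreviation("A_T"));
}
```

Explicit character ranges are used instead of std::isalnum so that the result does not depend on the current locale.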
Thanks for the effortful response. This motivates me to contribute more to the project in the future.
On 5/13/22 01:55, Nick Deguillaume via tz wrote:
I have found a few other bits and pieces where the data files are in slight variance with the documentation but are not show-stoppers. Most of these occur in the very early releases.
Although the .zi format has evolved, I don't recall any place where current zic mishandles old data files. Obviously we can't change old TZDB releases, but it would be good to know (if only for documentation purposes) if zic rejects or mishandles ancient data files.