Proposal: validation text file with releases

Background: I'm the primary developer for Noda Time <http://nodatime.org>, which consumes the tz data. I'm currently refactoring the code that does this... and I've come across some code (originally ported from Joda Time) which I now understand in terms of what it's doing, but not exactly why.

For a little while now, the Noda Time source repo has included a text dump file <https://github.com/nodatime/nodatime/blob/master/src/NodaTime.Test/TestData/...>, containing a text dump of every transition (up to 2100, at the moment) for every time zone. It looks like this, picking just one example:

Zone: Africa/Maseru
LMT:  [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00)
SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00)
SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00)
SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01)
SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00)
SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01)
SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)

I use this file for confidence when refactoring my time zone handling code - if the new code comes up with the same set of transitions as the old code, it's probably okay. (This is just one line of defence, of course - there are unit tests, though not as many as I'd like.)

It strikes me that having a similar file (I'm not wedded to the format, but it should have all the same information, one way or another) released alongside the main data files would be really handy for *all* implementors - it would be a good way of validating consistency across multiple platforms, with the release data being canonical. For any platforms which didn't want to actually consume the rules as rules, but just wanted a list of transitions, it could even effectively replace their use of the data.

One other benefit: diffing the dump between two releases would make it clear what had changed in *effect*, rather than just in terms of rules.

One sticking point is size. The current file for Noda Time is about 4MB, although it zips down to about 300K. Some thoughts around this:

- We wouldn't need to distribute it in the same file as the data - just as we have data and code files, there could be a "textdump" file or whatever we'd want to call it. These could be retroactively generated for previous releases, too.
- As you can see, there's redundancy in the format above, in that it's a list of "zone intervals" (as I call them in Noda Time) rather than a list of transitions - the end of each interval is always the start of the next interval.
- For zones which settle into an infinite daylight saving pattern, I currently generate from the start of time to 2100 (and then a single zone interval for the end of time as Noda Time understands it; we'd need to work out what form that would take, if any). If we decided that "year of release + 30 years" was enough, that would cut down the size considerably.

Any thoughts? If the feeling is broadly positive, the next step would be to nail down the text format, then find a willing victim/volunteer to write the C code. (You really don't want me writing C...)

Jon
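(For illustration only - a minimal sketch, not a proposal: roughly how such a dump could be generated on a java.time-based platform, Java 8+. It lists transitions rather than zone intervals, omits abbreviations, and only covers the explicitly-defined transitions; zones with recurring future rules would additionally need expanding via ZoneRules.getTransitionRules() up to some cutoff year such as 2100.)

    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.zone.ZoneOffsetTransition;
    import java.time.zone.ZoneRules;

    public class DumpTransitions {
        public static void main(String[] args) {
            // Sort the IDs so the output is stable and diffable.
            for (String id : new java.util.TreeSet<>(ZoneId.getAvailableZoneIds())) {
                ZoneRules rules = ZoneId.of(id).getRules();
                System.out.println("Zone: " + id);
                for (ZoneOffsetTransition t : rules.getTransitions()) {
                    Instant at = t.getInstant();
                    // UTC instant, total wall offset from that instant on,
                    // and the DST component of that offset.
                    System.out.println(at + " " + t.getOffsetAfter()
                            + " (" + rules.getDaylightSavings(at) + ")");
                }
            }
        }
    }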

A possibility here is to store the output of a "zdump -v" command; I've used "zdump -v" output for regression testing, and I believe Paul Eggert has done so as well. The "-c" option of zdump could be used to limit the range of the output.

@dashdashado

That's certainly a good starting point, but I have three issues with it:

- The format is verbose and somewhat harder to parse (IMO) than an ISO-8601-based numeric format. It's easier to get things wrong when they involve cultures :)
- It indicates the final wall offset and whether or not it's in DST, but not how much DST there is. This only matters in a few cases like Antarctica/Troll which have a non-one-hour saving, but it would still be worth indicating, IMO.
- The "second before the transition" line for each transition feels redundant to me... we could potentially put *three* values for each transition: the UTC instant, the local time one second before, and the local time "at" the transition (sketched below).

Given how widely used zdump is, we can't really change the format by default - but we could perhaps add a new flag to indicate a new format, assuming I'm not alone in my objections above? (Having said that, a dump with the data in a format I'm not ecstatic about would be better for me than no dump at all. I'm certainly not going to start chest-thumping about formats.)

Jon
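(To make the "three values per transition" idea concrete - a small java.time sketch, purely illustrative, using Africa/Maseru only because it's the example above:)

    import java.time.OffsetDateTime;
    import java.time.ZoneId;
    import java.time.zone.ZoneOffsetTransition;

    public class ThreeValues {
        public static void main(String[] args) {
            for (ZoneOffsetTransition t : ZoneId.of("Africa/Maseru").getRules().getTransitions()) {
                OffsetDateTime before = t.getInstant().minusSeconds(1).atOffset(t.getOffsetBefore());
                OffsetDateTime at = t.getInstant().atOffset(t.getOffsetAfter());
                System.out.println(t.getInstant()          // UTC instant
                        + " " + before.toLocalDateTime()   // local time one second before
                        + " " + at.toLocalDateTime());     // local wall time "at" the transition
            }
        }
    }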

Jon Skeet <skeet@pobox.com> wrote:
 |Given how widely used zdump is, we can't really change the format by
 |default - but we could perhaps add a new flag to indicate a new format,
 |assuming I'm not alone in my objections above?
 |(Having said that, a dump with the data in a format I'm not ecstatic about
 |would be better for me than no dump at all. I'm certainly not going to
 |start chest-thumping about formats.)

I think Ed Schouten of FreeBSD has done something similar -- based on zdump(1), that is -- for his CloudABI framework (in FreeBSD and on that hub). If I understood the presentation correctly, he doesn't parse TZDATA itself but uses the pre-prepared output of zdump(1) to generate his data; maybe you can simply use that in turn.

--steffen

Well, as I mentioned before, the zdump data is incomplete - it doesn't show what portion of the total wall offset is due to DST. (So for Antarctica/Troll, for example, there's no way of knowing that the savings are 2 hours.)

I haven't looked at the C code implementation yet to see whether that's because zdump chooses not to expose that information, or whether the output of zic doesn't actually include it (in which case zic would presumably need to be cloned or changed).

Jon

FWIW, I think such a format would be very useful. Effectively, it is a unit test for others to confirm that they interpret the rules the same way as intended.

It is similar to what I produced when trying to demonstrate the amount of change being caused by apparently "minor" changes to the data: https://github.com/jodastephen/tzdiff/commits/master

Any output of this type should indeed just consist of a simple text file with ISO-8601 format timestamps.

Stephen

Given that I've already found discrepancies (see "Discrepancies in time zone data interpretation"), I'm going to go ahead and hack on this in purely pragmatic (read: short-term) ways. I'll create a GitHub repo just for this purpose and dump code in there - this is explicitly with the aim of encouraging a more permanent solution by proving value.

I'll post another message here when there's something worth looking at - I'll initially be looking at zdump output, Joda Time, standard Java, and Noda Time. Contributions from others for other languages/platforms will be very welcome.

Jon

Okay, I've created https://github.com/nodatime/tzvalidate

It allows you (well, someone who's got everything set up...) to compare and contrast:

- Joda Time
- Noda Time
- Java 8
- zdump

Only Joda Time and Noda Time allow (and in fact require) a data version to be specified. Obviously, in order to compare data meaningfully, one has to be using the same data in all places. That's the next thing to look at... but they're all using the same output format, and the results are already interesting in terms of some unexpected discrepancies. I haven't had a chance to look into them yet.

Jon

I've expanded this a bit - we now have implementations for:

- Joda Time
- Noda Time
- Java 7 (well, Java pre-8)
- Java 8
- ICU4J
- zdump
- Ruby's tzinfo gem

I'd really appreciate any input at this point. There are still a few issues with the data collection - it's not the pristine file diff we'd like to end up with - but it's enough to highlight some discrepancies, which I'll probably write up as a blog post and cc here.

I think the fact that it *is* showing up these differences is evidence that this could provide a lot of value with the support of the rest of the community (and with a better implementation of my zdump munging - ideally something in zic itself, I suspect).

Who do I need to persuade? (Paul, I guess...)

Jon

On Tuesday, July 14, 2015, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
(and with a better implementation of my zdump munging - ideally something in zic itself, I suspect)
Why would zic need to be involved? zdump uses only the standard POSIX API and should work even on platforms that don't use zic or the tz database at all.
Unfortunately zdump doesn't have all the information I'd wish it to - namely the split between standard and daylight offsets. It indicates whether a zone is in daylight savings or not, but not how much that contributes to the overall offset.

As for platforms that don't use the tz database at all - as the purpose of this is to validate the use of the tz database, I'm not sure that's much of an advantage. I'd anticipate the output being distributed alongside (but not within) the data file, so prospective users still wouldn't need to be using a tz-based platform themselves.

I haven't yet got a feeling for your thoughts on the proposal... is this something you can see any future in? I suppose the canonical file wouldn't *have* to come from IANA, but that does feel like the best option.

Jon

Jon Skeet wrote:
Unfortunately zdump doesn't have all the information I'd wish it to - namely the split between standard and daylight offsets. It indicates whether a zone is in daylight savings or not, but not how much that contributes to the overall offset.
That shouldn't matter. No software should care about that. Any software that does care is delving into undocumented and unsupported areas.
As for platforms that don't use the tz database at all - as the purpose of this is to validate the use of the tz database, I'm not sure that's much of an advantage.
It can be helpful, for example, when comparing a simple tz database entry to the should-be-equivalent POSIX TZ string (which does not require the tz database).
I haven't yet got a feeling for your thoughts on the proposal... is this something you can see any future in? I suppose the canonical file wouldn't *have* to come from IANA, but that does feel like the best option.
I can see something along these lines working, yes. That is, we could distribute extra tarballs containing zdump output (with a better format perhaps), zic output, and tzdist-format output (if we ever get around to doing that).

On 15 July 2015 at 00:51, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
Unfortunately zdump doesn't have all the information I'd wish it to - namely the split between standard and daylight offsets. It indicates whether a zone is in daylight savings or not, but not how much that contributes to the overall offset.
That shouldn't matter. No software should care about that. Any software that does care is delving into undocumented and unsupported areas.
The published API of Joda-Time and JSR-310 exposes the difference between the base offset and the current offset (including DST). It's this kind of detail that makes them much more useful. To say that software should not care and that it is unsupported is .... er .... rather worrying.

I think I've indicated before that I'm not sure this project fully appreciates or understands the downstream impacts of changes on systems other than zic. I think Jon's proposals would help to make the impacts much clearer.

Stephen
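(As a concrete illustration of the split those APIs expose - a minimal java.time/JSR-310 sketch; the Antarctica/Troll values in the comments assume a JRE carrying reasonably recent tzdata, and are examples rather than guarantees:)

    import java.time.Duration;
    import java.time.Instant;
    import java.time.ZoneId;
    import java.time.ZoneOffset;
    import java.time.zone.ZoneRules;

    public class SplitOffset {
        public static void main(String[] args) {
            ZoneRules rules = ZoneId.of("Antarctica/Troll").getRules();
            Instant when = Instant.parse("2015-07-01T00:00:00Z");   // northern summer
            ZoneOffset total = rules.getOffset(when);                // e.g. +02:00
            ZoneOffset standard = rules.getStandardOffset(when);     // e.g. Z (i.e. +00)
            Duration saving = rules.getDaylightSavings(when);        // e.g. PT2H
            System.out.println(total + " = " + standard + " + " + saving);
        }
    }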

On Jul 14, 2015, at 9:05 PM, Stephen Colebourne <scolebourne@joda.org> wrote:
I think I've indicated before that I'm not sure this project fully appreciates or understands the downstream impacts of changes on systems other than zic.
To what extent did systems that chose to independently process the zone text files, rather than just using the output of zic, notify the tzdb project that they were doing this?

Stephen Colebourne wrote:
To say that software should not care and that it is unsupported is .... er .... rather worrying.
Although it is an issue, the DST-vs-STD offsets are implementation details that are neither exposed by the reference API nor exported to zic's output files. Any values they internally have were not intended to be visible when the tzdata entries were written. Of course other implementations are free to process tzdata sources in other ways -- to take an extreme example, implementations could export tzdata comments to their APIs. However, this sort of thing is not part of the reference tz API, and any regression suite based on the reference API shouldn't worry about it.
I'm not sure this project fully appreciates or understands the downstream impacts of changes on systems other than zic.
It's helpful to mention those impacts on this list, if only to clarify issues like these in the documentation. Proposed patch attached. This patch doesn't change zic's behavior; it just documents the way zic has always behaved.

On 15 July 2015 at 06:09, Paul Eggert <eggert@cs.ucla.edu> wrote:
Stephen Colebourne wrote:
To say that software should not care and that it is unsupported is .... er .... rather worrying.

Although it is an issue, the DST-vs-STD offsets are implementation details that are neither exposed by the reference API nor exported to zic's output files. Any values they internally have were not intended to be visible when the tzdata entries were written.
I think what Jon is asking, and I'm confirming, is that this data *should* be considered part of the public data exposed by the tzdb project. Noda, Joda and JSR-310 all use it; no doubt others do too. It is very reasonable and useful info. The key point being that we don't just care about what the offset is, but also how that offset is calculated.

Stephen

Stephen Colebourne <scolebourne@joda.org> wrote:
 |On 15 July 2015 at 06:09, Paul Eggert <eggert@cs.ucla.edu> wrote:
 |> Stephen Colebourne wrote:
 |> Although it is an issue, the DST-vs-STD offsets are implementation details
 |> that are neither exposed by the reference API nor exported to zic's output
 |> files. Any values they internally have were not intended to be visible when
 |> the tzdata entries were written.
 |
 |I think what Jon is asking, and I'm confirming, is that this data
 |*should* be considered part of the public data exposed by the tzdb
 |project. Noda, Joda and JSR-310 all use it; no doubt others do too. It
 |is very reasonable and useful info. The key point being that we don't
 |just care about what the offset is, but also how that offset is
 |calculated.

We were using this information, as it has been a visible part of the TZ data for one and a half decades now: a signed save_secs field to be added to the signed utf_offsec [1]. The information is still in there, I am not ._.; otherwise an official statement on that data I would appreciate, too.

What would be an issue to me instead is that TZDIST doesn't offer a binary representation but only (iCalendar,) JSON and XML, which require parser libraries, whereas with CBOR the IETF standardizes a wonderful, extensible and otherwise future-proof binary format that has also been designed with easy JSON mapping in mind. And especially the more simple JSON/xy libraries read everything into memory before text parsing starts.

Just my one cent.

Ciao,

[1] http://mm.icann.org/pipermail/tz/2012-June/018002.html

--steffen

On 15 July 2015 at 08:45, Steffen Nurpmeso <sdaoden@yandex.com> wrote:
What would be an issue to me instead is that TZDIST doesn't offer a binary representation but only (iCalendar,) JSON and XML, which require parser libraries, whereas with CBOR the IETF standardizes a wonderful, extensible and otherwise future-proof binary format that has also been designed with easy JSON mapping in mind.

Not that this should become a tzdist discussion, but it is my understanding that tzdist has left the distribution of compiled binaries for zones to future work. In particular, this would require writing up a spec describing the binaries compiled by zic. Certainly not insurmountable, but also not a priority at this time.

http://www.ietf.org/mail-archive/web/tzdist/current/msg01207.html
http://www.ietf.org/mail-archive/web/tzdist/current/msg01218.html

§4.1.2 of the latest draft (draft-ietf-tzdist-service-09) <https://tools.ietf.org/html/draft-ietf-tzdist-service-09#section-4.1.2> says that "Clients use the HTTP Accept header field (see Section 5.3.2 of [RFC7231]) to indicate their preference for the returned data format. Servers indicate the available formats that they support via the 'capabilities' action response (Section 5.1)." So, as I understand it, there's definitely room for tzdist implementations to support this even without the file format being formalized.

--
Tim Parenti

Tim Parenti <tim@timtimeonline.com> wrote:
 |On 15 July 2015 at 08:45, Steffen Nurpmeso <sdaoden@yandex.com> wrote:
 |> What would be an issue to me instead is that TZDIST doesn't offer
 |> a binary representation but only (iCalendar,) JSON and XML, which
 |> require parser libraries, whereas with CBOR the IETF standardizes
 |> a wonderful, extensible and otherwise future-proof binary format
 |> that has also been designed with easy JSON mapping in mind.
 |
 |Not that this should become a tzdist discussion, but it is my understanding
 |that tzdist has left the distribution of compiled binaries for zones to
 |future work. In particular, this would require writing up a spec
 |describing the binaries compiled by zic. Certainly not insurmountable, but
 |also not a priority at this time.
 |
 |http://www.ietf.org/mail-archive/web/tzdist/current/msg01207.html
 |http://www.ietf.org/mail-archive/web/tzdist/current/msg01218.html

Ah, (oh,) I am not on this list; I have only read the draft once, when it came up on the leapsecond list (version -05). I have never used any TZ code (we had a single large DB, but with the possibility to include one (see how weird) timezone in the library binary), and if you would ask me then I don't think that including anything zic is a good option (but don't dig deeper here). I would simply take the developed TZDIST protocol and instead of

  GET /capabilities HTTP/1.1
  Host: tz.example.com

  Response <<

  HTTP/1.1 200 OK
  Date: Wed, 4 Jun 2008 09:32:12 GMT
  Content-Type: application/json; charset="utf-8"
  Content-Length: xxxx

I'd return

  HTTP/1.1 200 OK
  Date: Wed, 4 Jun 2008 09:32:12 GMT
  Content-Type: application/cbor; charset=binary
  Content-Length: xxxx

which should be the most simple and cheap and 1:1 possibility on the server and the client side. This is what CBOR is for, among others.

 |§4.1.2 of the latest draft (draft-ietf-tzdist-service-09) says
 |<https://tools.ietf.org/html/draft-ietf-tzdist-service-09#section-4.1.2>
 |that "Clients use the HTTP Accept header field (see Section 5.3.2 of
 |[RFC7231]) to indicate their preference for the returned data format.
 |Servers indicate the available formats that they support via the
 |'capabilities' action response (Section 5.1)." So, as I understand it,
 |there's definitely room for tzdist implementations to support this even
 |without the file format being formalized.

Yes; but like I said, why reinvent the wheel if there is a standardized, very easy to parse (I really like it, it is rocking) binary format that is capable of interacting 1:1 with the JSON that is used by the upcoming TZDIST standard? That would be my thought on that. Because JSON is still too expensive, especially if the cost is for nothing at all (and if you want to present data to the user you can still do a mapping easily).

--steffen
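(For a sense of how little machinery CBOR needs - a minimal sketch that hand-encodes the one-entry map {"version": "2015e"} per RFC 7049; the key and value are made up for the example, and it only handles text strings shorter than 24 bytes:)

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;

    public class MiniCbor {
        // Short UTF-8 text string: major type 3 in the high 3 bits,
        // the length in the low 5 bits, then the raw bytes.
        static void writeText(ByteArrayOutputStream out, String s) throws IOException {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            out.write(0x60 | utf8.length);
            out.write(utf8);
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            out.write(0xA1);               // major type 5 (map) with one key/value pair
            writeText(out, "version");     // key
            writeText(out, "2015e");       // value
            for (byte b : out.toByteArray()) {
                System.out.printf("%02x ", b);
            }
            System.out.println();
        }
    }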

Stephen Colebourne wrote:
I think what Jon is asking, and I'm confirming, is that this data *should* be considered part of the public data exposed by the tzdb project.
When I wrote that data, I didn't worry about the DST-vs-STD offset. I cared only about the correctness of the sum of the offset and standard time, because I knew that only the sum mattered. At this point it would be a nontrivial project to verify that these offsets actually make sense everywhere in tzdata. (They usually do, but in some cases I vaguely recall that there were issues.) Until such a project is done (and I'm not volunteering :-) it would be misleading to document that the offsets are meaningful.

There are many ways in which tzdata source code can differ in appearance while meaning the same thing. For example, although a time can be written "1:00" or "01:00" or "1:00:00" or "+1:00", none of this matters in zic output. This sort of thing is intended to be obvious, but unfortunately it was not obvious in this case.

On Wed, Jul 15, 2015 at 2:06 AM, Stephen Colebourne <scolebourne@joda.org> wrote:
On 15 July 2015 at 06:09, Paul Eggert <eggert@cs.ucla.edu> wrote:
Although it is an issue, the DST-vs-STD offsets are implementation details that are neither exposed by the reference API nor exported to zic's output files. Any values they internally have were not intended to be visible when the tzdata entries were written.
I think what Jon is asking, and I'm confirming, is that this data *should* be considered part of the public data exposed by the tzdb project.
Then I'll confirm Paul's point: subdividing the offset should not be considered part of a user API. Indeed, if we could change history, we should get rid of tm_isdst too. The only place a UTC offset should be meaningful is when converting an absolute timestamp to/from a civil YMDHMS+O, and that should only happen within the date/time library.

Why distinguish between {YMDHMS, stdoff=Xh, dstoff=1h, is_dst=true} and {YMDHMS, stdoff=X+1h, dstoff=0, is_dst=*}? They are the same time.

There may be some library with a rich TimeZone concept that exposes much of the information found in the tz files (and they are welcome to try to extract it), but I would claim that such information fills a much-needed gap in a core date/time API.

On 15/07/15 06:09, Paul Eggert wrote:
Stephen Colebourne wrote:
I'm not sure this project fully appreciates or understands the downstream impacts of changes on systems other than zic.

It's helpful to mention those impacts on this list, if only to clarify issues like these in the documentation. Proposed patch attached. This patch doesn't change zic's behavior; it just documents the way zic has always behaved.
I think that the only element of tzdist that is 'revolutionary' is that in theory it should flag when an entry in tz has changed since the previous enquiry was made. That a rule may have changed does not necessarily mean that a particular enquiry has, so, for instance, adding historic changes will not flag a change if the client system is only asking for data, say, post-2000 - or, more critically for day-to-day use, if the transitions in the next 12 months have been subject to a change.

Picking up a previous quote ...
and tzdist-format output (if we ever get around to doing that).
tzdist NEEDS a fully populated set of data which maintains every published change, in order that users can validate what they have previously normalized against the current published data. That the tzdist workgroup has no charter to provide that does beg the question "What is the point?" - without a freely available, fully validated source, tzdist is of little use.

--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

If the tzdata isn't really intended to be consumed - if we should *really* only consume the zic output, and anything else is somewhat questionable - then why distribute the tzdata at all? Why not just distribute the zic output files?

As for DST vs STD time not being relevant in software - while Windows doesn't (mostly) use tzdata, it *does* allow you to specify a time zone and exclude DST from that. Anyone wanting to mimic that behaviour but using the tz data can *only* do it if they know the DST component of the overall offset.

I would really like to proceed pragmatically: I have a personal real-world need for validation, and given the discrepancies I've found so far, I think it would be useful to other people as well. I would much rather have an imperfect but widely-used solution that has scope to be improved later than a centithread here with no practical outcome.

Some observations I hope we can all agree on:

- Some platforms *do* consume tzdata directly, and expose the STD/DST components of the overall offset, and are likely to continue doing so.
- zic and the reference API do not expose the STD/DST components of the overall offset, and are unlikely to start doing so.
- If two implementations agree on overall offsets, it's highly likely they agree on STD/DST components, whether or not that can be verified.
- It is desirable that all implementations agree on as much as possible, and this is more likely to happen if it's easy to regularly validate.
- It is also informative to be able to easily tell the differences between different data versions (whether released or not).

Now, a few more debatable thoughts:

- Tools:
  - I think it would be reasonably logical - and probably fairly simple - to modify zdump to output whatever format we want, assuming we don't intend to include the STD/DST components. I'm not sure how easy it is to get a complete list of all time zones to make zdump dump them all.
  - Modifying zic would give us more information to dump, but wouldn't make as much sense logically.
  - Writing a brand new tool with the logic of zic, but just for dumping, would probably involve code duplication, which introduces the possibility of the codebases diverging accidentally.
- Format:
  - If we were to use a format such as XML, JSON or YAML to represent the data, it would be easier to make it extensible, and it would also be more easily processed by modern platforms (Java, .NET etc.) without fiddly line-based parsing. On the other hand, that sometimes requires extra dependencies - and may well be annoying from C. (I'm aware that zdump and zic are pretty simple to build - and need to be built on a very wide variety of platforms.)
  - A line-based format is easier to diff with common tools than a more structured format.
  - Writing a tool to convert between formats would be near-trivial if we bear it in mind from the beginning.
  - If the format allows the included fields to be specified, it will allow STD/DST-component-aware platforms to compare against each other for differences, even if they can only compare total offsets against zic output.
  - Despite the contents of my current github repo, I'm certainly not proposing actually using Windows line breaks for the "real" format.
- Process:
  - It should be very easy to add a commit hook to github to generate a new dump file per commit, making it really easy to diff any pair (e.g. between two releases, or to see the effect a recent commit had, whether of code or data); a sketch of such a per-zone diff follows after this message.
  - While I believe it would be beneficial to ship a dump file alongside code/data releases, with the previous bullet in place we wouldn't *need* that to start with. We could view the whole thing as experimental until we're happy with the format etc.
- zic documentation:
  - All of this has come from me struggling to implement a tzdata parser which is in line with zic. The man page for zic documents the tz data format, but not in enough detail for a compliant implementation, in my view. I volunteer to at least attempt some more detailed documentation if others feel it's useful. (If no-one else does, I'll probably keep it under the Noda Time documentation anyway, but with a suitable "this is entirely unofficial" warning.)

So, where do we go from here? Does anyone believe this would actually be a bad thing to have? (That might come with the position of "only use zic output".) How are we best to decide the format? If modifying zdump to add an extra flag is deemed an appropriate course of action, do we have any volunteers to do so? I'm happy to host a github hook to publish the dump files at each commit when all the rest of the machinery is in place.

Jon
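(A rough sketch of the kind of per-zone diff such a commit hook could drive, assuming the - still hypothetical - dump format keeps the "Zone: " header lines from the example at the top of the thread; the old and new dump files are given on the command line:)

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class DumpDiff {
        // Group a dump file into zone-name -> section text, keyed on "Zone: " header lines.
        static Map<String, String> byZone(String path) throws IOException {
            Map<String, String> zones = new LinkedHashMap<>();
            String current = null;
            StringBuilder body = new StringBuilder();
            for (String line : Files.readAllLines(Paths.get(path))) {
                if (line.startsWith("Zone: ")) {
                    if (current != null) zones.put(current, body.toString());
                    current = line.substring("Zone: ".length());
                    body.setLength(0);
                } else if (current != null) {
                    body.append(line).append('\n');
                }
            }
            if (current != null) zones.put(current, body.toString());
            return zones;
        }

        public static void main(String[] args) throws IOException {
            Map<String, String> oldDump = byZone(args[0]);
            Map<String, String> newDump = byZone(args[1]);
            for (Map.Entry<String, String> e : newDump.entrySet()) {
                String before = oldDump.get(e.getKey());
                if (before == null) {
                    System.out.println("Added zone: " + e.getKey());
                } else if (!before.equals(e.getValue())) {
                    System.out.println("Changed in effect: " + e.getKey());
                }
            }
            for (String zone : oldDump.keySet()) {
                if (!newDump.containsKey(zone)) {
                    System.out.println("Removed zone: " + zone);
                }
            }
        }
    }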

On Jul 15, 2015, at 3:10 PM, Jon Skeet <skeet@pobox.com> wrote:
If the tzdata isn't really intended to be consumed - if we should really only consume the zic output, and anything else is somewhat questionable, then why distribute the tzdata at all? Why not just distribute the zic output files?

For the same reason that people distribute source files and not compiled binaries.

paul

On 15 July 2015 at 20:23, <Paul_Koning@dell.com> wrote:
On Jul 15, 2015, at 3:10 PM, Jon Skeet <skeet@pobox.com> wrote:
If the tzdata isn't really intended to be consumed - if we should *really* only consume the zic output, and anything else is somewhat questionable, then why distribute the tzdata at all? Why not just distribute the zic output files?
For the same reason that people distribute source files and not compiled binaries.
Where distribution of program binaries is feasible, it tends to make life simpler for those consuming them. In my experience, most end users of software *don't* build it from scratch these days, even on platforms where that used to be the norm, such as Linux. While source code tends to compile to different binaries on different platforms, there's no reason for the zic output not to be made entirely portable (if indeed it isn't already), such that the binaries *can* just be distributed, with the source of the data also available straight from a git repository (or as a separate, but less widely used, download).

Aside from reacting to that first sentence, any thoughts on the rest? I perhaps shouldn't have included that aspect at all in my previous mail, if it will detract from making more positive progress :(

Jon

On Jul 15, 2015, at 1:00 PM, Jon Skeet <skeet@pobox.com> wrote:
On 15 July 2015 at 20:23, <Paul_Koning@dell.com> wrote:
On Jul 15, 2015, at 3:10 PM, Jon Skeet <skeet@pobox.com> wrote:
If the tzdata isn't really intended to be consumed - if we should really only consume the zic output, and anything else is somewhat questionable, then why distribute the tzdata at all? Why not just distribute the zic output files?
For the same reason that people distribute source files and not compiled binaries.
Where distribution of program binaries is feasible, it tends to make life simpler for those consuming it. In my experience, most end users of software don't build it from scratch these days, even on platforms where that used to be the norm such as Linux. While source code tends to compile to different binaries on different platforms, there's no reason for the zic output not to be made entirely portable (if indeed it isn't already)
The data is stored in big-endian form (done so that it could be shared over NFS by big-endian Sun-2, Sun-3, and Sun-4 machines and little-endian Sun386i machines), and uses standard lengths (the tzfile(5) man page uses "long" in structure definitions, but that's more informative than normative; it also explicitly speaks of "four-byte" values, and that's what's normative), so it should be portable to all 8-bit-byte two's-complement machines (other machines are on their own reading them).

(I'll update the man page to be less C-on-ILP32-platform-oriented.)

On Jul 15, 2015, at 1:19 PM, Guy Harris <guy@alum.mit.edu> wrote:
The data is stored in big-endian form (done so that it could be shared over NFS by big-endian Sun-2, Sun-3, and Sun-4 machines and little-endian Sun386i machines), and uses standard lengths (the tzfile(5) man page uses "long" in structure definitions, but that's more informative than normative; it also explicitly speaks of "four-byte" values, and that's what's normative), so it should be portable to all 8-bit-byte two's-complement machines (other machines are on their own reading them).
(I'll update the man page to be less C-on-ILP32-platform-oriented.)
Actually, the *current* version in the GitHub repository uses "int32_t" rather than "long", so, whilst it does use C structures, it no longer assumes an ILP32 platform. No update should be necessary.
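(To illustrate the portability point - a minimal sketch that reads just the header counts of a zic output file with plain big-endian byte assembly; the field names follow the tzfile(5) layout as generally documented, and this is nowhere near a full TZif reader:)

    import java.io.DataInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;

    public class TzifHeader {
        public static void main(String[] args) throws IOException {
            try (DataInputStream in = new DataInputStream(new FileInputStream(args[0]))) {
                byte[] magic = new byte[4];
                in.readFully(magic);          // "TZif"
                int version = in.read();      // '\0', '2' or '3'
                in.skipBytes(15);             // reserved
                // Six four-byte counts, big-endian by definition; DataInputStream
                // reads big-endian regardless of the host machine's byte order.
                int isutcnt = in.readInt();   // UT/local indicators
                int isstdcnt = in.readInt();  // standard/wall indicators
                int leapcnt = in.readInt();   // leap second records
                int timecnt = in.readInt();   // transition times
                int typecnt = in.readInt();   // local time types
                int charcnt = in.readInt();   // abbreviation characters
                System.out.printf("version=%c transitions=%d types=%d abbrevs=%d leap=%d%n",
                        version == 0 ? '1' : (char) version, timecnt, typecnt, charcnt, leapcnt);
            }
        }
    }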

On Jul 15, 2015, at 12:10 PM, Jon Skeet <skeet@pobox.com> wrote:
If the tzdata isn't really intended to be consumed - if we should really only consume the zic output, and anything else is somewhat questionable, then why distribute the tzdata at all?
So that people can edit them to add updates/fixes, run them through zic for their own purposes, and send them back to the maintainers. So that people can see the comments giving history, sources, etc.
Why not just distribute the zic output files?
See above.
As for DST vs STD time not being relevant in software - while Windows (mostly) doesn't use tzdata, it does allow you to specify a time zone and exclude DST from that. Anyone wanting to mimic that behaviour while using the tz data can only do it if they know the DST component of the overall offset.
So how is that behavior useful? Yes, if you're living in Arizona, you *could* say "Mountain Time, without DST", but, with tzdata, you could also say "America/Phoenix" (or, even better, say "I live here" and have the system figure out that "here" is in Arizona and give you "America/Phoenix", without having to deal with "butbutbutbutbut I live in Scottsdale!"). I.e., is the only reason for that behavior that Microsoft decided to handle "time zones" in the conventional sense of the term, plus a "daylight saving time shifts occur" flag that can be applied to an arbitrary zone, rather than in the tzdata sense of "a region that has kept its clocks the same way over time since 1970", DST rules included (so that if region A is at the same offset from UTC as region B, but didn't follow DST in the same fashion as region B at some point between 1970 and now, they're in separate tzdata time zones)?
• zic and the reference API do not expose the STD/DST components of the overall offset, and are unlikely to start doing so
Perhaps, but it's not as if the reference API is constrained to be the ANSI C/POSIX API, so, *if* it were deemed useful to expose those two components separately, it could be extended to do so. Note, however, that it's not as if, within a given tzdata time zone, the STD component will be the same over time. Regions can decide to change which time zone, in the conventional sense, they're in; this does not require them to change what tzdata time zone they're in, unless part of a region covered by a given tzdata time zone changes which conventional time zone they're in but another part doesn't, in which case we need to split the tzdata time zone. I.e., programs that are given access to the STD and DST components of the overall offset should not misuse them. ("Assuming that STD represents some value that has remained constant since 1970" counts as "misuse".) In what fashion *are* they used by software in the real world?

Jon Skeet wrote:
Windows ... *does* allow you to specify a time zone and exclude DST from that. Anyone wanting to mimic that behaviour but using the tz data can *only* do it if they know the DST component of the overall offset.
No, the existing API suffices for that. For example, you can scan forward to the next time that doesn't have tm_isdst set, and use its UTC offset. I've done that sort of thing. Although this is just a heuristic and might not match user expectations in rare cases, that's also true for what Microsoft Windows does.
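For illustration, a rough C sketch of that heuristic - assuming the common (non-standard) tm_gmtoff extension, and not presented as the reference implementation:

    #include <time.h>

    /* Rough sketch: scan forward from "start" in one-hour steps for up to
       a year, looking for a moment the C library reports as standard time
       (tm_isdst == 0), and return its UTC offset in seconds via *offset.
       Returns 0 if no standard-time moment is found in the window.
       tm_gmtoff is a common extension, not ISO C. */
    static int standard_offset_near(time_t start, long *offset)
    {
        struct tm tm;
        for (long hour = 0; hour < 366L * 24; hour++) {
            time_t t = start + hour * 3600;
            if (localtime_r(&t, &tm) && tm.tm_isdst == 0) {
                *offset = tm.tm_gmtoff;
                return 1;
            }
        }
        return 0;
    }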
- Some platforms *do* consume tzdata directly, and expose the STD/DST components of the overall offset, and are likely to continue doing so.
That's fine, as long as they don't care whether the STD-vs-DST offset is reliable -- it appears they don't, or there'd have been complaints already. And in that case developers of these platforms shouldn't worry much about any changes that involve only the STD-vs-DST offset. It's the sort of change that a regression-tester might need to deal with, but users won't care about and so should be low priority. If the above analysis is wrong and users really care about that offset even though it's not supported by tzdata, then the platforms in question can maintain their own specialized databases and regression suites. It's not something the tz project itself needs to spend its limited resources on.
The man page for zic documents the tz data format, but not in enough detail for a compliant implementation
It'd be helpful to fix that. Any ambiguities should be clarified (or documented as being explicitly ambiguous), as was done in commit 2fab66aa164365209e47af24b2337b7c2ffdbe5c. This shouldn't require a complete rewrite.
If modifying zdump to add an extra flag is deemed an appropriate course of action, do we have any volunteers to do so?
I can volunteer to change zdump but it's not clear yet what needs to be changed. First we'd need to design a good format for regression testing, and document that format. This hasn't been done yet, and will take some thinking. Existing format proposals haven't ensured that the zdump output contains everything visible to the API.

On 15 July 2015 at 21:29, Paul Eggert <eggert@cs.ucla.edu> wrote: <snip> The man page for zic documents the tz data format, but not in enough detail for a compliant implementation
It'd be helpful to fix that. Any ambiguities should be clarified (or documented as being explicitly ambiguous), as was done in commit 2fab66aa164365209e47af24b2337b7c2ffdbe5c. This shouldn't require a complete rewrite.
Righto - I wasn't sure whether the intention was that the man page was complete in a specification sense, or more about providing general guidance. I think it's clear in the vast majority of cases - it's only corner cases where I've found myself wondering how to interpret the rules. I brought it up mostly because it was the motivation for needing regression tests, but I think we should separate the work. Obviously the burden is on me to be clear about exactly where and how I'm having problems. Would you prefer that I report my confusion on the list, or by direct mail?
If modifying zdump to add an extra flag is deemed an appropriate course of action, do we have any volunteers to do so?
I can volunteer to change zdump but it's not clear yet what needs to be changed. First we'd need to design a good format for regression testing, and document that format. This hasn't been done yet, and will take some thinking. Existing format proposals haven't ensured that the zdump output contains everything visible to the API.
Right - as a non-API user, I'm mostly in the dark here. The proposal I've implemented on my github repo is very much to scratch my particular itch, and was mostly to make a start so that we could improve it. Is there any information missing on an individual transition basis, or is it more global matters (links, start/end points, range of transitions) that is missing information? Jon

Jon Skeet wrote:
Would you prefer that I report my confusion on the list, or by direct mail?
On the list, please.
Is there any information missing on an individual transition basis, or is it more global matters (links, start/end points, range of transitions) that is missing information?
I don't recall exactly the formats you're using, but the thing I most remember is not being able to handle all the data. zdump should be able to dump everything, not just transitions for a limited number of years. At some point, perhaps far in the future, there should be a pattern and zdump should be able to deduce it and output it and stop. zic already does this sort of thing.

On 16 July 2015 at 04:20, Paul Eggert <eggert@cs.ucla.edu> wrote:
Is there any information missing on an individual transition basis, or is
it more global matters (links, start/end points, range of transitions) that is missing information?
I don't recall exactly the formats you're using, but the thing I most remember is not being able to handle all the data. zdump should be able to dump everything, not just transitions for a limited number of years.
I'm fine with that. Given that zdump already has the ability to limit the number of years reported, I'd expect the new option to respect the existing one.
At some point, perhaps far in the future, there should be a pattern and zdump should be able to deduce it and output it and stop. zic already does this sort of thing.
Right - if that ever happens, I think we'd probably want to be able to disable that behaviour though... the dumber regression tests are, the happier I am, if you see what I mean. Jon

On 2015-07-16 02:30, Jon Skeet wrote:
On 16 July 2015 at 04:20, Paul Eggert <eggert@cs.ucla.edu <mailto:eggert@cs.ucla.edu>> wrote: Is there any information missing on an individual transition basis, or is it more global matters (links, start/end points, range of transitions) that is missing information? I don't recall exactly the formats you're using, but the thing I most remember is not being able to handle all the data. zdump should be able to dump everything, not just transitions for a limited number of years. I'm fine with that. Given that zdump already has the ability to limit the number of years reported, I'd expect the new option to respect the existing one. At some point, perhaps far in the future, there should be a pattern and zdump should be able to deduce it and output it and stop. zic already does this sort of thing. Right - if that ever happens, I think we'd probably want to be able to be able to disable that behaviour though... the dumber regression tests are, the happier I am, if you see what I mean.
What about adding the date +FORMAT argument or the GNU ls --time-style options to zdump?

  --time-style=STYLE   with -l, show times using style STYLE: full-iso, long-iso, iso, locale, +FORMAT. FORMAT is interpreted like 'date'

The former requires only a call to strftime(3), while the latter requires some format selection.

-- Take care. Thanks, Brian Inglis
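For what it's worth, the strftime(3) side of that is tiny; a hypothetical +FORMAT-style option for zdump might reduce to something like this C fragment (the ISO-8601-ish format string is just an example):

    #include <stdio.h>
    #include <time.h>

    /* Format an instant as an ISO-8601 UTC timestamp, e.g.
       "2015-07-16T02:30:00Z", much as a user-supplied +FORMAT might. */
    static void print_iso(time_t t)
    {
        struct tm tm;
        char buf[32];

        if (gmtime_r(&t, &tm) &&
            strftime(buf, sizeof buf, "%Y-%m-%dT%H:%M:%SZ", &tm))
            puts(buf);
    }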

Stephen Colebourne wrote:
On 15 July 2015 at 00:51, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
Unfortunately dump doesn't have all the information I'd wish it to - namely the split between standard and daylight offsets. It indicates whether a zone is in daylight savings or not, but not how much that contributes to the overall offset.
That shouldn't matter. No software should care about that. Any software that does care is delving into undocumented and unsupported areas.
The published API of Joda-Time and JSR-310 exposes the difference between the base offset and the current offset (including DST). It's this kind of detail that makes them much more useful. To say that software should not care and that it is unsupported is .... er .... rather worrying.
I'd also appreciate it if we could get more details out of the TZ db in an easy way. We (Meinberg) also have embedded devices (not POSIX, not even an OS, just a microcontroller) which do conversion from UTC to local time, and it would be a very nice feature if a program could easily extract all information for a given year (i.e. standard and DST offsets, beginning and end of DST, if any) from a TZ file after this has been updated.
I think I've indicated before that I'm not sure this project fully appreciates or understands the downstream impacts of changes on systems other than zic. I think Jon's proposals would help to make the impacts much clearer.
Again, I fully agree and also support Jon's proposal. Martin -- Martin Burnicki Senior Software Engineer MEINBERG Funkuhren GmbH & Co. KG Email: martin.burnicki@meinberg.de Phone: +49 (0)5281 9309-14 Fax: +49 (0)5281 9309-30 Lange Wand 9, 31812 Bad Pyrmont, Germany Amtsgericht Hannover 17HRA 100322 Geschäftsführer/Managing Directors: Günter Meinberg, Werner Meinberg, Andre Hartmann, Heiko Gerstung Web: http://www.meinberg.de

On 15/07/15 08:34, Martin Burnicki wrote:
I'd also appreciate if we could get more details out of the TZ db in an easy way. We (Meinberg) also have embedded devices (not POSIX, not even an OS, just a microcontroller) which also does conversion from UTC to local time, and it would be a very nice feature if a program could easily extract all information for a given year (i.e. standard and DST offsets, beginning and end of DST, if any) from a TZ file after this has been updated.
That is EXACTLY what tzdist is designed to address! -- Lester Caine - G8HFL ----------------------------- Contact - http://lsces.co.uk/wiki/?page=contact L.S.Caine Electronic Services - http://lsces.co.uk EnquirySolve - http://enquirysolve.com/ Model Engineers Digital Workshop - http://medw.co.uk Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

Next update: I've improved the zdump-based generation of the data, and put the data in the current format for all the tz data releases I can find (from 1996 onwards) at http://nodatime.org/tzvalidate/

None of this is in any way meant to imply that I'm trying to freeze the format - I appreciate Paul's point about it being an incomplete representation of the zic data - but I wanted to get the data such as it is already out there.

Noda Time 2.0 alpha now correctly generates the 2012e, 2013e, 2014e and 2015e data - I want to do a bit of work to make it easier to consume the source data as a tgz directly before I then check it against all of the rest of the zdump output files.

Jon

On 14 July 2015 at 21:12, Jon Skeet <skeet@pobox.com> wrote:
I've expanded this a bit - we now have implementations for:
- Joda Time
- Noda Time
- Java 7 (well, Java pre-8)
- Java 8
- ICU4J
- zdump
- Ruby's tzinfo gem
I'd really appreciate any input at this point. There are still a few issues with the data collection - it's not the pristine file diff we'd like to end up with - but it's enough to highlight some discrepancies, which I'll probably write up as a blog post and cc here. I think the fact that it *is* showing up these differences is evidence that this could provide a lot of value with the support of the rest of the community (and with a better implementation of my zdump munging - ideally something in zic itself, I suspect). Who do I need to persuade? (Paul, I guess...)
Jon
On 13 July 2015 at 21:43, Jon Skeet <skeet@pobox.com> wrote:
Okay, I've created https://github.com/nodatime/tzvalidate
It allows you (well, someone who's got everything set up...) to compare and contrast:
- Joda Time
- Noda Time
- Java 8
- zdump
Only Joda Time and Noda Time allow (and in fact require) a data version to be specified. Obviously in order to compare data meaningfully, one has to be using the same data in all places. That's the next thing to look at... but they're all using the same output format, and the results are already interesting in terms of some unexpected discrepancies. I haven't had a chance to look into them yet.
Jon
On 13 July 2015 at 16:06, Jon Skeet <skeet@pobox.com> wrote:
Given that I've already found discrepancies (see "Discrepancies in time zone data interpretation") I'm going to go ahead and hack on this in purely pragmatic (read: short term) ways. I'll create a github repo just for this purpose and dump code in there - this is explicitly with the aim of encouraging a more permanent solution by proving value.
Will post another message here when there's something worth looking at - I'll be initially looking at zdump output, Joda Time, standard Java, and Noda Time. Contributions from others for other languages/platforms will be very welcome.
Jon
On 13 July 2015 at 14:46, Stephen Colebourne <scolebourne@joda.org> wrote:
FWIW, I think such a format would be very useful. Effectively, it is a unit test for others to confirm that they interpret the rules the same way as intended.
It is similar to what I produced when trying to demonstrate the amount of change being caused by apparently "minor" changes to the data: https://github.com/jodastephen/tzdiff/commits/master
Any output of this type should indeed just consist of a simple text file with ISO-8601 format timestamps.
Stephen
On 11 July 2015 at 11:35, Jon Skeet <skeet@pobox.com> wrote:
Background: I'm the primary developer for Noda Time which consumes the tz data. I'm currently refactoring the code to do this... and I've come across some code (originally ported from Joda Time) which I now understand in terms of what it's doing, but not exactly why.
For a little while now, the Noda Time source repo has included a text dump file, containing a text dump of every transition (up to 2100, at the moment) for every time zone. It looks like this, picking just one example:
Zone: Africa/Maseru LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00) SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00) SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00) SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01) SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00) SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01) SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
I use this file for confidence when refactoring my time zone handling code - if the new code comes up with the same set of transitions as the old code, it's probably okay. (This is just one line of defence, of course - there are unit tests, though not as many as I'd like.)
It strikes me that having a similar file (I'm not wedded to the format, but it should have all the same information, one way or another) released alongside the main data files would be really handy for all implementors - it would be a good way of validating consistency across multiple platforms, with the release data being canonical. For any platforms which didn't want to actually consume the rules as rules, but just wanted a list of transitions, it could even effectively replace their use of the data.
One other benefit: diffing the dump between two releases would make it clear what had changed in effect, rather than just in terms of rules.
One sticking point is size. The current file for Noda Time is about 4MB, although it zips down to about 300K. Some thoughts around this:
We wouldn't need to distribute it in the same file as the data - just as we have data and code file, there could be a "textdump" file or whatever we'd want to call it. These could be retroactively generated for previous releases, too. As you can see, there's redundancy in the format above, in that it's a list of "zone intervals" (as I call them in Noda Time) rather than a list of transitions - the end of each interval is always the start of the next interval. For zones which settle into an infinite daylight saving pattern, I currently generate from the start of time to 2100 (and then a single zone interval for the end of time as Noda Time understands it; we'd need to work out what form that would take, if any). If we decided that "year of release + 30 years" was enough, that would cut down the size considerably.
Any thoughts? If the feeling is broadly positive, the next step would be to nail down the text format, then find a willing victim/volunteer to write the C code. (You really don't want me writing C...)
Jon

On Jul 18, 2015, at 3:40 PM, Jon Skeet <skeet@pobox.com> wrote:
Next update: I've improved the zdump-based generation of the data, and put the data in the current format for all the tz data releases I can find (from 1996 onwards) at http://nodatime.org/tzvalidate/
I’ve generated a version of tzdata2015e-tzvalidate.txt.zip from my code here: http://howardhinnant.github.io/tzdata2015e-tzvalidate.txt.zip

There appear to be two kinds of differences:

1. I appear to start earlier than you, for example I have:

Africa/Algiers
1891-03-14T23:48:48Z +00:09:21 standard PMT

and you do not.

2. This one has me more concerned: when a zone specifies a rule/date combination and the date falls off the beginning of the rule table, I assume a “” variable part, where you appear to assume a “S” variable part. For example, I have:

America/Barbados
1924-01-01T03:58:29Z -03:58:29 standard BMT
1932-01-01T03:58:29Z -04:00:00 standard AT
1977-06-12T06:00:00Z -03:00:00 daylight ADT

And you have:

America/Barbados
1924-01-01T03:58:29Z -03:58:29 standard BMT
1932-01-01T03:58:29Z -04:00:00 standard AST
1977-06-12T06:00:00Z -03:00:00 daylight ADT

The America/Barbados Zone switches to the Barb Rule on 1932-01-01T03:58:29Z, using the format A%sT. But the first Barb Rule is 1977-06-12 2:00. I looked for documentation for what is supposed to happen in a situation like this, but didn’t find anything.

Howard

On 18 July 2015 at 23:01, Howard Hinnant <howard.hinnant@gmail.com> wrote:
On Jul 18, 2015, at 3:40 PM, Jon Skeet <skeet@pobox.com> wrote:
Next update: I've improved the zdump-based generation of the data, and put the data in the current format for all the tz data releases I can find (from 1996 onwards) at http://nodatime.org/tzvalidate/
I’ve generated a version of tzdata2015e-tzvalidate.txt.zip from my code here:
http://howardhinnant.github.io/tzdata2015e-tzvalidate.txt.zip
I saw your earlier message and hoped you were reading this thread too. Supporting code such as yours is precisely the motivation for this endeavour.
There appear to be two kinds of differences:
1. I appear to start earlier than you, for example I have:
Africa/Algiers 1891-03-14T23:48:48Z +00:09:21 standard PMT
and you do not.
That much is simple to explain - the format I'm currently generating explicitly starts in 1905 and ends in 2035. The 1905 part was because an earlier version of zdump I was using was limited to 1900. As per Paul's messages earlier in the thread, eventually we'll want to expose more data - although it's not clear how *late* it's worth going. (I doubt that it's worth extending beyond 2100, for example.)
2. This one has me more concerned: When a zone specifies a rule/date combination and the date falls off the beginning of the rule table, I assume a “” variable part, where you appear to assume a “S” variable part. For example, I have:
America/Barbados 1924-01-01T03:58:29Z -03:58:29 standard BMT 1932-01-01T03:58:29Z -04:00:00 standard AT 1977-06-12T06:00:00Z -03:00:00 daylight ADT
And you have:
America/Barbados 1924-01-01T03:58:29Z -03:58:29 standard BMT 1932-01-01T03:58:29Z -04:00:00 standard AST 1977-06-12T06:00:00Z -03:00:00 daylight ADT
Just to be clear, this isn't "me" so much as "zic and then zdump". It happens that Noda Time (which is more "my" code) does the same thing though :)
The America/Barbados Zone switches to the Barb Rule on 1932-01-01T03:58:29Z, using the format A%sT. But the first Barb Rule is 1977-06-12 2:00. I looked for documentation for what is supposed to happen in a situation like this, but didn’t find anything.
I think AST makes sense here (as it's standard time) but I agree that it's not clearly documented. In Noda Time, if I don't find a rule leading "into" the transition period, I take the name of the first rule with no daylight savings. See https://github.com/nodatime/nodatime/blob/20d57967e04f1b57a10c00910f337a1c3c... for the code involved. zic appears to implement equivalent behaviour, although I wouldn't like to pin down where. I'd be interested in seeing whether your understanding of the data in natural language ties in with the comments expressed in DateTimeZoneBuilder at the link above, by the way. Jon
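To make that fallback concrete, here is a small C-style sketch; the struct and names are hypothetical, not zic's actual data structures:

    /* When a zone line's format contains %s but no named rule has yet
       taken effect at the start of that zone line, substitute the letters
       of the first rule in the rule set whose saving is zero, so "A%sT"
       becomes "AST" rather than "AT". */
    struct rule {
        const char *letters;   /* the %s replacement, e.g. "S", "D" or "" */
        long        save;      /* daylight saving amount, in seconds */
    };

    static const char *initial_letters(const struct rule *rules, int n)
    {
        for (int i = 0; i < n; i++)
            if (rules[i].save == 0)
                return rules[i].letters;
        return "";             /* no standard-time rule at all */
    }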

On Jul 18, 2015, at 6:25 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
In Noda Time, if I don't find a rule leading "into" the transition period, I take the name of the first rule with no daylight savings. zic appears to implement equivalent behaviour
Yes, that's the intent.
I’ve updated my parser with this behavior:
https://github.com/HowardHinnant/date/commit/618cd7be060b02a923af60c0eb866a6...

And here is the updated validation file for 2015e:
http://howardhinnant.github.io/tzdata2015e-tzvalidate.txt.zip

The only difference I’m now seeing from Jon’s:
http://nodatime.org/tzvalidate/tzdata2015e-tzvalidate.zip
is associated with years 1904 and prior, which Jon has already explained.

Thanks Jon and Paul for the explanation of the lookup rules for when one "falls off the front."

I have one suggestion for the validation file: Some timezones have no transition period and are labeled like so:

EST
Fixed: -05:00:00 EST

My recommendation is to change this to:

EST
Initially: -05:00:00 standard EST

And furthermore have every timezone do this. For example:

Europe/Amsterdam
Initially: +00:19:32 standard LMT
1834-12-31T23:40:28Z +00:19:32 standard AMT
1916-04-30T23:40:28Z +01:19:32 daylight NST
...

instead of:

Europe/Amsterdam
1834-12-31T23:40:28Z +00:19:32 standard AMT
1916-04-30T23:40:28Z +01:19:32 daylight NST
...

Howard

I'd be happy with that behaviour, yes. I don't know how to extract it from the current zdump output though (or indeed any transitions before 1900).

Jon

On 19 July 2015 at 01:54, Howard Hinnant <howard.hinnant@gmail.com> wrote:
On Jul 18, 2015, at 6:25 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
In Noda Time, if I don't find a rule leading "into" the transition
period, I take the name of the first rule with no daylight savings.
zic appears to implement equivalent behaviour
Yes, that's the intent.
I’ve updated my parser with this behavior:
https://github.com/HowardHinnant/date/commit/618cd7be060b02a923af60c0eb866a6...
And here is the updated validation file for 2015e:
http://howardhinnant.github.io/tzdata2015e-tzvalidate.txt.zip
The only difference I’m now seeing from Jon’s:
http://nodatime.org/tzvalidate/tzdata2015e-tzvalidate.zip
is associated with years 1904 and prior, which Jon has already explained.
Thanks Jon and Paul for the explanation of the lookup rules for when one "falls off the front.”
I have one suggestion for the validation file:
Some timezones have no transition period and are labeled like so:
EST Fixed: -05:00:00 EST
My recommendation is to change this to:
EST Initially: -05:00:00 standard EST
And furthermore have every timezone do this. For example:
Europe/Amsterdam Initially: +00:19:32 standard LMT 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST ...
instead of:
Europe/Amsterdam 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST ...
Howard

In fact, I've only just realized that there's no reason I need to use zdump here - as the output file format from zic is clearly documented, I can just consume that directly. That should make things a lot simpler (from my C#-oriented perspective, of course - changes to zdump may well still be appropriate...)

Jon

On 19 July 2015 at 06:49, Jon Skeet <skeet@pobox.com> wrote:
I'd be happy with that behaviour, yes. I don't know how to extract it from the current zdump output though (or indeed any transitions before 1900).
Jon
On 19 July 2015 at 01:54, Howard Hinnant <howard.hinnant@gmail.com> wrote:
On Jul 18, 2015, at 6:25 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
In Noda Time, if I don't find a rule leading "into" the transition
period, I take the name of the first rule with no daylight savings.
zic appears to implement equivalent behaviour
Yes, that's the intent.
I’ve updated my parser with this behavior:
https://github.com/HowardHinnant/date/commit/618cd7be060b02a923af60c0eb866a6...
And here is the updated validation file for 2015e:
http://howardhinnant.github.io/tzdata2015e-tzvalidate.txt.zip
The only difference I’m now seeing from Jon’s:
http://nodatime.org/tzvalidate/tzdata2015e-tzvalidate.zip
is associated with years 1904 and prior, which Jon has already explained.
Thanks Jon and Paul for the explanation of the lookup rules for when one "falls off the front.”
I have one suggestion for the validation file:
Some timezones have no transition period and are labeled like so:
EST Fixed: -05:00:00 EST
My recommendation is to change this to:
EST Initially: -05:00:00 standard EST
And furthermore have every timezone do this. For example:
Europe/Amsterdam Initially: +00:19:32 standard LMT 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST ...
instead of:
Europe/Amsterdam 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST ...
Howard

On 19/07/15 01:54, Howard Hinnant wrote:
Europe/Amsterdam Initially: +00:19:32 standard LMT 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST
Howard ...

Historic material in TZ is still something of a Cinderella, and the fact that sections of it simply become ignored because of the 1970 limit is irritating. The problem I have with the above 'rule set' is that until there was a common time agreed in 1834, there was not a previous fixed time for the area over which THAT agreement was made. The LMT tag does at least say that this time is specific to a point and that other adjacent towns MAY have different offsets. That the software needs something to work with is a given, but just as using TZ without the back file produces incorrect results prior to 1970, so do assumptions like these. The current demo tzdist claims to be 'unabridged', but in reality it is only REALLY valid post 1970 and so should flag that limitation.

Paul is totally correct that we will never be able to produce a perfectly correct historic rule set for the world, but we do know when the limited set of data we are using started. All I am asking is that prior to authenticated material, LMT is observed as just that: Local Mean Time as defined by the actual location, with an indication that some other mechanism may be needed to calculate local time. In the UK, while it only applied for a short period of time, towns had their own local time, and that is something that we may well be able to provide via tzdist while not encumbering TZ with the problems of validating that data. tzdist needs a geographic lookup system that anyone can use to confirm their local timezone, something which geonames currently provides with reasonable accuracy, but which is also by no means complete.

-- Lester Caine - G8HFL ----------------------------- Contact - http://lsces.co.uk/wiki/?page=contact L.S.Caine Electronic Services - http://lsces.co.uk EnquirySolve - http://enquirysolve.com/ Model Engineers Digital Workshop - http://medw.co.uk Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

On Jul 19, 2015, at 4:24 AM, Lester Caine <lester@lsces.co.uk> wrote:
On 19/07/15 01:54, Howard Hinnant wrote:
Europe/Amsterdam Initially: +00:19:32 standard LMT 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST
Howard ... Historic material in TZ is still something of a Cinderella and the fact that sections of it simply become ignored because of the 1970 limit is irritating. The problem I have with the above 'rule set' is that until there was a common time agreed in 1834, there was not a previous fixed time for the area over which THAT agreement was made. The LMT tag does at least say that this time is specific to a point and that other adjacent towns MAY have different offsets. That the software needs something to work with is a given, but just as using TZ without the back file produces incorrect results prior to 1970 so do assumptions like these. The current demo tzdist claims to be 'unabridged', but in reality it is only REALLY valid post 1970 and so should flag that limitation. Paul is totally correct that we will never be able to produce a perfectly correct historic rules set for the world, but we do know when the limited set of data we are using started and all I am asking is that prior to authenticated material, LMT is observed as just that Local Mean Time as defined by the actual location, and an indication that some other mechanism maybe needed to calculate local time ... in the UK while it only applied for a short period of time, towns had their own local time and that is something that we may well be able to provide via tzdist while not encumbering TZ with the problems of validating that data. tzdist needs a geographic lookup system so that anyone can use to confirm their local timezone, something which geonames currently provides with a reasonable accuracy, but which is also by no means complete.
Thanks Lester. I understand and agree with everything you’ve said. Please understand that the validation effort is not about historical accuracy. It is about accurately retrieving and representing all of the data in the database. For example I also have:

America/Cambridge_Bay
Initially: +00:00:00 standard zzz
1920-01-01T00:00:00Z -07:00:00 standard MST
1942-02-09T09:00:00Z -06:00:00 daylight MWT
…

And I assert that this is a correct representation of the contents of the database.

# aka Iqaluktuuttiaq
Zone America/Cambridge_Bay 0 - zzz 1920 # trading post est.?
     -7:00 NT_YK M%sT 1999 Oct 31 2:00

But I am not under the impression that the abbreviation zzz was actually used in Cambridge Bay prior to 1920.

Howard

I've now changed the format on github to the one Howard proposed, updating the data on http://nodatime.org/tzvalidate (still uploading) and the various tools. Now that we (by default) include all the data available before the cutoff point (2035 by default) this hopefully removes at least some of the concerns Paul raised earlier, too.

I'm trying to give all the tools a common command line interface:

-s: source (where required)
-f: from year (first year to show transitions for; defaults to "before any transitions are likely to be recorded")
-t: to year (year to stop showing transitions; defaults to 2035)
-z: zone (if specified, only a single zone is shown)

(I don't know exactly who cares about my progress on this, but I thought it wouldn't hurt to keep the list up to date...)

Jon

On 19 July 2015 at 01:54, Howard Hinnant <howard.hinnant@gmail.com> wrote:
On Jul 18, 2015, at 6:25 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
In Noda Time, if I don't find a rule leading "into" the transition
period, I take the name of the first rule with no daylight savings.
zic appears to implement equivalent behaviour
Yes, that's the intent.
I’ve updated my parser with this behavior:
https://github.com/HowardHinnant/date/commit/618cd7be060b02a923af60c0eb866a6...
And here is the updated validation file for 2015e:
http://howardhinnant.github.io/tzdata2015e-tzvalidate.txt.zip
The only difference I’m now seeing from Jon’s:
http://nodatime.org/tzvalidate/tzdata2015e-tzvalidate.zip
is associated with years 1904 and prior, which Jon has already explained.
Thanks Jon and Paul for the explanation of the lookup rules for when one "falls off the front.”
I have one suggestion for the validation file:
Some timezones have no transition period and are labeled like so:
EST Fixed: -05:00:00 EST
My recommendation is to change this to:
EST Initially: -05:00:00 standard EST
And furthermore have every timezone do this. For example:
Europe/Amsterdam Initially: +00:19:32 standard LMT 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST ...
instead of:
Europe/Amsterdam 1834-12-31T23:40:28Z +00:19:32 standard AMT 1916-04-30T23:40:28Z +01:19:32 daylight NST ...
Howard

On Jul 25, 2015, at 9:34 AM, Jon Skeet <skeet@pobox.com> wrote:
I've now changed the format on github to the one Howard proposed, updating the data on http://nodatime.org/tzvalidate (still uploading) and the various tools. Now that we (by default) include all the data available before the cutoff point (2035 by default) this hopefully removes at least some of the concerns Paul raised earlier, too.
<nitpick> The validation file is slightly more human-readable if instead of:

Africa/Accra
Initially: -00:00:52 standard LMT
1918-01-01T00:00:52Z +00:00:00 standard GMT
1920-09-01T00:00:00Z +00:20:00 daylight GHST
1920-12-30T23:40:00Z +00:00:00 standard GMT
…

You have:

Africa/Accra
Initially:           -00:00:52 standard LMT
1918-01-01T00:00:52Z +00:00:00 standard GMT
1920-09-01T00:00:00Z +00:20:00 daylight GHST
1920-12-30T23:40:00Z +00:00:00 standard GMT
…

I.e. put spaces after “Initially:” to line up the offset information.

Howard

On Jul 25, 2015, at 12:15 PM, Howard Hinnant <howard.hinnant@gmail.com> wrote:
On Jul 25, 2015, at 9:34 AM, Jon Skeet <skeet@pobox.com> wrote:
I've now changed the format on github to the one Howard proposed, updating the data on http://nodatime.org/tzvalidate (still uploading) and the various tools. Now that we (by default) include all the data available before the cutoff point (2035 by default) this hopefully removes at least some of the concerns Paul raised earlier, too.
<nitpick> The validation file is slightly more human-readable if instead of:
Africa/Accra Initially: -00:00:52 standard LMT 1918-01-01T00:00:52Z +00:00:00 standard GMT 1920-09-01T00:00:00Z +00:20:00 daylight GHST 1920-12-30T23:40:00Z +00:00:00 standard GMT …
You have:
Africa/Accra Initially: -00:00:52 standard LMT 1918-01-01T00:00:52Z +00:00:00 standard GMT 1920-09-01T00:00:00Z +00:20:00 daylight GHST 1920-12-30T23:40:00Z +00:00:00 standard GMT …
I.e. put spaces after “Initially:” to line up the offset information.
I should’ve added: Aside from this minor formatting issue, I get identical results for tzdata2015e-tzvalidate using http://howardhinnant.github.io/tz.html. Howard

On 25 July 2015 at 17:19, Howard Hinnant <howard.hinnant@gmail.com> wrote:
I should’ve added: Aside from this minor formatting issue, I get identical results for tzdata2015e-tzvalidate using http://howardhinnant.github.io/tz.html.
That's really good to hear. I'd thoroughly encourage you to run your code against older releases again for a more thorough test. I found some interesting corner cases. (There may well be some aspects of old data that you don't want to support of course, but that's a different matter. Noda Time 2.0 is currently in sync with zic from 1996 data releases onwards.) I may add another flag to allow the user to omit the name from each line - I suspect that without the names, the output for Java 8, Joda Time and ICU4J would be *close* to zic. (It may make more sense to just chop each line at the right position, of course - that's an advantage of your scheme...) Jon

Ah, I'd wondered why the spaces were there. (Viewing the mail in a proportional font, it's not obvious that it's meant to line up with anything.) I'll change the code for that, but probably not regenerate the zip files just yet... it's quite a lot to churn in the git repo :)

Jon

On 25 July 2015 at 17:15, Howard Hinnant <howard.hinnant@gmail.com> wrote:
On Jul 25, 2015, at 9:34 AM, Jon Skeet <skeet@pobox.com> wrote:
I've now changed the format on github to the one Howard proposed, updating the data on http://nodatime.org/tzvalidate (still uploading) and the various tools. Now that we (by default) include all the data available before the cutoff point (2035 by default) this hopefully removes at least some of the concerns Paul raised earlier, too.
<nitpick> The validation file is slightly more human-readable if instead of:
Africa/Accra Initially: -00:00:52 standard LMT 1918-01-01T00:00:52Z +00:00:00 standard GMT 1920-09-01T00:00:00Z +00:20:00 daylight GHST 1920-12-30T23:40:00Z +00:00:00 standard GMT …
You have:
Africa/Accra Initially: -00:00:52 standard LMT 1918-01-01T00:00:52Z +00:00:00 standard GMT 1920-09-01T00:00:00Z +00:20:00 daylight GHST 1920-12-30T23:40:00Z +00:00:00 standard GMT …
I.e. put spaces after “Initially:” to line up the offset information.
Howard

On Jul 18, 2015, at 6:16 PM, Jon Skeet <skeet@pobox.com> wrote:
On 18 July 2015 at 23:01, Howard Hinnant <howard.hinnant@gmail.com> wrote:
On Jul 18, 2015, at 3:40 PM, Jon Skeet <skeet@pobox.com> wrote:
Next update: I've improved the zdump-based generation of the data, and put the data in the current format for all the tz data releases I can find (from 1996 onwards) at http://nodatime.org/tzvalidate/
I’ve generated a version of tzdata2015e-tzvalidate.txt.zip from my code here:
http://howardhinnant.github.io/tzdata2015e-tzvalidate.txt.zip
I saw your earlier message and hoped you were reading this thread too. Supporting code such as yours is precisely the motivation for this endeavour.
<nod> The exercise found some bugs in my code already. :-) https://github.com/HowardHinnant/date/commit/a431164fcd79a43ba1f70c9ba70659a...
There appear to be two kinds of differences:
1. I appear to start earlier than you, for example I have:
Africa/Algiers 1891-03-14T23:48:48Z +00:09:21 standard PMT
and you do not.
That much is simple to explain - the format I'm currently generating explicitly starts in 1905 and ends in 2035. The 1905 part was because an earlier version of zdump I was using was limited to 1900. As per Paul's messages earlier in the thread, eventually we'll want to expose more data - although it's not clear how late it's worth going. (I doubt that it's worth extending beyond 2100, for example.)
Fwiw, I started at 1834-01-01T00:00:00 UTC. I doubt going beyond 2038 would be worthwhile.
2. This one has me more concerned: When a zone specifies a rule/date combination and the date falls off the beginning of the rule table, I assume a “” variable part, where you appear to assume a “S” variable part. For example, I have:
America/Barbados 1924-01-01T03:58:29Z -03:58:29 standard BMT 1932-01-01T03:58:29Z -04:00:00 standard AT 1977-06-12T06:00:00Z -03:00:00 daylight ADT
And you have:
America/Barbados 1924-01-01T03:58:29Z -03:58:29 standard BMT 1932-01-01T03:58:29Z -04:00:00 standard AST 1977-06-12T06:00:00Z -03:00:00 daylight ADT
Just to be clear, this isn't "me" so much as "zic and then zdump". It happens that Noda Time (which is more "my" code) does the same thing though :)
Ok.
The America/Barbados Zone switches to the Barb Rule on 1932-01-01T03:58:29Z, using the format A%sT. But the first Barb Rule is 1977-06-12 2:00. I looked for documentation for what is supposed to happen in a situation like this, but didn’t find anything.
I think AST makes sense here (as it's standard time) but I agree that it's not clearly documented.
In Noda Time, if I don't find a rule leading "into" the transition period, I take the name of the first rule with no daylight savings. See https://github.com/nodatime/nodatime/blob/20d57967e04f1b57a10c00910f337a1c3c... for the code involved.
zic appears to implement equivalent behaviour, although I wouldn't like to pin down where.
I'd be interested in seeing whether your understanding of the data in natural language ties in with the comments expressed in DateTimeZoneBuilder at the link above, by the way.
I didn’t really have an understanding, and so guessed at {00:00, “”}. I’m happy to implement whatever rule should be here (including the one you’ve described), and only ask that it be documented somewhere (besides the zic source code). I checked the zic man page, but didn’t see it (I might have missed it). Fwiw, here is the program I used to generate my version of this validation file: http://codepad.org/JethTWsl Howard

For anyone still interested in this, I've now moved the data to http://nodatime.github.io/tzvalidate/ and created a Travis job which lets me update it mostly-automatically. (When there's a new TZDB release, I need to build the Noda Time data file, push that, then manually trigger a Travis build for tzvalidate.)

Of course, if there were any appetite for building and distributing this along with tzcode and tzdata, that would be even better :)

Jon

On 11 July 2015 at 11:35, Jon Skeet <skeet@pobox.com> wrote:
Background: I'm the primary developer for Noda Time <http://nodatime.org> which consumes the tz data. I'm currently refactoring the code to do this... and I've come across some code (originally ported from Joda Time) which I now understand in terms of what it's doing, but not exactly why.
For a little while now, the Noda Time source repo has included a text dump file <https://github.com/nodatime/nodatime/blob/master/src/NodaTime.Test/TestData/...>, containing a text dump of every transition (up to 2100, at the moment) for every time zone. It looks like this, picking just one example:
Zone: Africa/Maseru LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00) SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00) SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00) SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01) SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00) SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01) SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
I use this file for confidence when refactoring my time zone handling code - if the new code comes up with the same set of transitions as the old code, it's probably okay. (This is just one line of defence, of course - there are unit tests, though not as many as I'd like.)
It strikes me that having a similar file (I'm not wedded to the format, but it should have all the same information, one way or another) released alongside the main data files would be really handy for *all* implementors - it would be a good way of validating consistency across multiple platforms, with the release data being canonical. For any platforms which didn't want to actually consume the rules as rules, but just wanted a list of transitions, it could even effectively replace their use of the data.
One other benefit: diffing the dump between two releases would make it clear what had changed in *effect*, rather than just in terms of rules.
One sticking point is size. The current file for Noda Time is about 4MB, although it zips down to about 300K. Some thoughts around this:
- We wouldn't need to distribute it in the same file as the data - just as we have data and code file, there could be a "textdump" file or whatever we'd want to call it. These could be retroactively generated for previous releases, too. - As you can see, there's redundancy in the format above, in that it's a list of "zone intervals" (as I call them in Noda Time) rather than a list of transitions - the end of each interval is always the start of the next interval. - For zones which settle into an infinite daylight saving pattern, I currently generate from the start of time to 2100 (and then a single zone interval for the end of time as Noda Time understands it; we'd need to work out what form that would take, if any). If we decided that "year of release + 30 years" was enough, that would cut down the size considerably.
Any thoughts? If the feeling is broadly positive, the next step would be to nail down the text format, then find a willing victim/volunteer to write the C code. (You really don't want me writing C...)
Jon

I missed this the first time around...
On 11 July 2015 at 11:35, Jon Skeet <skeet@pobox.com> wrote:
Background: I'm the primary developer for Noda Time <http://nodatime.org> which consumes the tz data. I'm currently refactoring the code to do this... and I've come across some code (originally ported from Joda Time) which I now understand in terms of what it's doing, but not exactly why.
For a little while now, the Noda Time source repo has included a text dump file <https://github.com/nodatime/nodatime/blob/master/src/NodaTime.Test/TestData/...>, containing a text dump of every transition (up to 2100, at the moment) for every time zone. It looks like this, picking just one example:
Zone: Africa/Maseru LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00) SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00) SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00) SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01) SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00) SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01) SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
...
Any thoughts? If the feeling is broadly positive, the next step would be to nail down the text format, then find a willing victim/volunteer to write the C code. (You really don't want me writing C...)
What's wrong with zdump's output format?

$ zdump -v Africa/Maseru
Africa/Maseru -9223372036854775808 = NULL
Africa/Maseru -9223372036854689408 = NULL
Africa/Maseru Sun Feb 7 22:07:59 1892 UTC = Sun Feb 7 23:59:59 1892 LMT isdst=0 gmtoff=6720
Africa/Maseru Sun Feb 7 22:08:00 1892 UTC = Sun Feb 7 23:38:00 1892 SAST isdst=0 gmtoff=5400
Africa/Maseru Sat Feb 28 22:29:59 1903 UTC = Sat Feb 28 23:59:59 1903 SAST isdst=0 gmtoff=5400
Africa/Maseru Sat Feb 28 22:30:00 1903 UTC = Sun Mar 1 00:30:00 1903 SAST isdst=0 gmtoff=7200
Africa/Maseru Sat Sep 19 23:59:59 1942 UTC = Sun Sep 20 01:59:59 1942 SAST isdst=0 gmtoff=7200
Africa/Maseru Sun Sep 20 00:00:00 1942 UTC = Sun Sep 20 03:00:00 1942 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 20 22:59:59 1943 UTC = Sun Mar 21 01:59:59 1943 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 20 23:00:00 1943 UTC = Sun Mar 21 01:00:00 1943 SAST isdst=0 gmtoff=7200
Africa/Maseru Sat Sep 18 23:59:59 1943 UTC = Sun Sep 19 01:59:59 1943 SAST isdst=0 gmtoff=7200
Africa/Maseru Sun Sep 19 00:00:00 1943 UTC = Sun Sep 19 03:00:00 1943 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 18 22:59:59 1944 UTC = Sun Mar 19 01:59:59 1944 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 18 23:00:00 1944 UTC = Sun Mar 19 01:00:00 1944 SAST isdst=0 gmtoff=7200
Africa/Maseru 9223372036854689407 = NULL
Africa/Maseru 9223372036854775807 = NULL

Cons:
- A bit verbose - technically uses instants (from before and on each transition) rather than spans.
- The NULLs are a bit mysterious - I'm personally not sure *exactly* how it finds the transitions, and in particular I'm not sure if it will reliably find multiple transitions per day

Pros:
- Already exists
- Is already written in C, and already installed on many systems
- Does not depend on any implementation internals

I'd say those cons are pretty significant - I find it very significantly harder to read than the format I've proposed. I'm also confused by your "pro" that it doesn't depend on any implementation details, but it really *exposes* the implementation details in naming ("isdst" and "gmtoff" for example, along with the mysterious huge numeric values).

The aim is to have a format which is easy and natural to generate on *multiple* platforms, so that authors of other code parsing the data can validate against it. That would be a *very* unnatural format to generate from .NET with either Noda Time or TimeZoneInfo, or from Java 7, Java 8 or Joda Time. I'd certainly be in favour of my "zicdump" C# code being rewritten in C (and maybe even offered as a different format for zdump, based on a command-line flag), but I see very little benefit in adopting a format which doesn't seem to have been designed for the same purpose as the tzvalidate one.

Now having said all that, I'm very happy to tweak the format (and have already done so based on earlier suggestions) - but I wouldn't want to use the zdump -v format just because it already exists, if it's not really fit for purpose.

Jon

On 27 April 2016 at 16:51, Random832 <random832@fastmail.com> wrote:
I missed this the first time around...
On 11 July 2015 at 11:35, Jon Skeet <skeet@pobox.com> wrote:
Background: I'm the primary developer for Noda Time < http://nodatime.org> which consumes the tz data. I'm currently refactoring the code to do this... and I've come across some code (originally ported from Joda Time) which I now understand in terms of what it's doing, but not exactly why.
For a little while now, the Noda Time source repo has included a text dump file < https://github.com/nodatime/nodatime/blob/master/src/NodaTime.Test/TestData/... , containing a text dump of every transition (up to 2100, at the moment) for every time zone. It looks like this, picking just one example:
Zone: Africa/Maseru LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00) SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00) SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00) SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01) SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00) SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01) SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
...
Any thoughts? If the feeling is broadly positive, the next step would be to nail down the text format, then find a willing victim/volunteer to write the C code. (You really don't want me writing C...)
What's wrong with zdump's output format?
$ zdump -v Africa/Maseru Africa/Maseru -9223372036854775808 = NULL Africa/Maseru -9223372036854689408 = NULL Africa/Maseru Sun Feb 7 22:07:59 1892 UTC = Sun Feb 7 23:59:59 1892 LMT isdst=0 gmtoff=6720 Africa/Maseru Sun Feb 7 22:08:00 1892 UTC = Sun Feb 7 23:38:00 1892 SAST isdst=0 gmtoff=5400 Africa/Maseru Sat Feb 28 22:29:59 1903 UTC = Sat Feb 28 23:59:59 1903 SAST isdst=0 gmtoff=5400 Africa/Maseru Sat Feb 28 22:30:00 1903 UTC = Sun Mar 1 00:30:00 1903 SAST isdst=0 gmtoff=7200 Africa/Maseru Sat Sep 19 23:59:59 1942 UTC = Sun Sep 20 01:59:59 1942 SAST isdst=0 gmtoff=7200 Africa/Maseru Sun Sep 20 00:00:00 1942 UTC = Sun Sep 20 03:00:00 1942 SAST isdst=1 gmtoff=10800 Africa/Maseru Sat Mar 20 22:59:59 1943 UTC = Sun Mar 21 01:59:59 1943 SAST isdst=1 gmtoff=10800 Africa/Maseru Sat Mar 20 23:00:00 1943 UTC = Sun Mar 21 01:00:00 1943 SAST isdst=0 gmtoff=7200 Africa/Maseru Sat Sep 18 23:59:59 1943 UTC = Sun Sep 19 01:59:59 1943 SAST isdst=0 gmtoff=7200 Africa/Maseru Sun Sep 19 00:00:00 1943 UTC = Sun Sep 19 03:00:00 1943 SAST isdst=1 gmtoff=10800 Africa/Maseru Sat Mar 18 22:59:59 1944 UTC = Sun Mar 19 01:59:59 1944 SAST isdst=1 gmtoff=10800 Africa/Maseru Sat Mar 18 23:00:00 1944 UTC = Sun Mar 19 01:00:00 1944 SAST isdst=0 gmtoff=7200 Africa/Maseru 9223372036854689407 = NULL Africa/Maseru 9223372036854775807 = NULL
Cons: - A bit verbose - technically uses instants (from before and on each transition) rather than spans. - The NULLs are a bit mysterious - I'm personally not sure *exactly* how it finds the transitions, and in particular I'm not sure if it will reliably find multiple transitions per day
Pros: - Already exists - Is already written in C, and already installed on many systems - Does not depend on any implementation internals

On Wed, Apr 27, 2016, at 12:21, Jon Skeet wrote:
I'd say those cons are pretty significant - I find it very significantly harder to read than the format I've proposed. I'm also confused by your "pro" that it doesn't depend on any implementation details, but it really *exposes* the implementation details in naming ("isdst" and "gmtoff" for example, along with the mysterious huge numeric values).
isdst is standard C. gmtoff is a common extension. My point is that it doesn't actually parse the internal structures of the timezone files; it simply calls localtime over and over with different values, and so it can be used even with a radically different implementation of the C functions, or against POSIX timezone strings, etc.
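A rough C sketch of that probing approach (not zdump's actual code; tm_gmtoff and tm_zone are common extensions rather than standard C):

    #include <string.h>
    #include <time.h>

    /* Report whether the local-time rules differ between two instants,
       using only what localtime() exposes rather than reading TZif data.
       A dump tool can step through time and binary-search any interval
       where this returns nonzero to pin down the exact transition. */
    static int rules_differ(time_t a, time_t b)
    {
        struct tm ta, tb;
        if (!localtime_r(&a, &ta) || !localtime_r(&b, &tb))
            return 1;
        return ta.tm_isdst != tb.tm_isdst
            || ta.tm_gmtoff != tb.tm_gmtoff
            || strcmp(ta.tm_zone, tb.tm_zone) != 0;
    }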
The aim is to have a format which is easy and natural to generate on *multiple* platforms, so that authors of other code parsing the data can validate against it. That would be a *very* unnatural format to generate from .NET with either Noda Time or TimeZoneInfo, or from Java 7, Java 8 or Joda Time.

On 27 April 2016 at 23:34, Random832 <random832@fastmail.com> wrote:
On Wed, Apr 27, 2016, at 12:21, Jon Skeet wrote:
I'd say those cons are pretty significant - I find it very significantly harder to read than the format I've proposed. I'm also confused by your "pro" that it doesn't depend on any implementation details, but it really *exposes* the implementation details in naming ("isdst" and "gmtoff" for example, along with the mysterious huge numeric values).
isdst is standard C. gmtoff is a common extension.
Right, so basically the format is specific to "C-based implementations". I agree that it's a different sort of implementation detail than normal, but it's still far from platform-neutral. The aim of the tzvalidate data is to help people validate that any code parsing the source data from tz does so in the same way - and I don't think the C-centric format helps that.
My point is that it doesn't actually parse the internal structures of the timezone files, it simply calls localtime over and over with different values, and so can be used even with a radically different implementation of the C functions, or against POSIX timezone strings, etc
Okay, so that's an argument for changing the implementation - but it's not an argument for changing the format, IMO. As far as I can see, the only genuine benefit of choosing the zdump format as the output format is that there's already C code for it. Hopefully it would be entirely possible to write code which calls localtime in the same way and outputs my proposed format. On the other hand, I'm not sure whether that's actually a benefit anyway: the whole idea isn't to check whether multiple platforms have the same time zone data, but to check whether they each handle the same input data in the same way... I think it's reasonable to determine how zic handles its input data by looking directly at its output. To be honest, I think there'd be room for two tools in C here - one "white box" tool dealing with the zic format directly, and one "black box" tool more similar to zdump.

Another "con" against zdump: the man pages I've found don't specify the format in very much detail. For example:

    For each zonename on the command line, print the time at the lowest possible time value, the time one day after the lowest possible time value, the times both one second before and exactly at each detected time discontinuity, the time at one day less than the highest possible time value, and the time at the highest possible time value. Each line ends with isdst=1 if the given time is Daylight Saving Time or isdst=0 otherwise.
So what's the format for the time? I can see what it does on my system, but I wouldn't be surprised if there were multiple implementations of zdump doing slightly different things - possibly with some of them using the user's locale for formatting, for example. For tzvalidate to be useful, the format has to be nailed down, ideally to the exact byte. The output of all my tools currently uses \r\n as the line break; for wider adoption it would probably be worth moving to \n. But if the output were specified to the exact byte, then users wouldn't necessarily need to download the whole output file to check it for correctness - they could check the SHA-1 hash of *their* output against the golden SHA-1 hash, and only go looking for differences if the hashes disagree. Indeed, the SHA-1 hash from the zic output could become part of the distributed tzdata, which I'd personally *love*. Discussion on whether that's feasible would be welcome... Jon
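To make that concrete, here is a minimal sketch of the kind of check a consumer could run, assuming a published golden hash and a locally generated dump; both file names below are hypothetical, not part of any actual release.

    # Minimal sketch: compare the SHA-1 of a locally generated dump against a
    # published "golden" hash. File names are illustrative only.
    import hashlib

    def sha1_of_file(path):
        h = hashlib.sha1()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    golden = open("tzvalidate-sha1.txt", encoding="ascii").read().split()[0]
    mine = sha1_of_file("tzvalidate-local.txt")
    print("output matches" if mine == golden else "mismatch: fetch the golden file and diff")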

One way forward here would be to give zdump more options to let the user specify the format. That way, everybody could use their own favorite format. One format might look something like this for America/Los_Angeles, assuming the default zdump window of -500 through 2499 CE UTC:

-501-12-31 16:07:02 -07:52:58 0 LMT
1883-11-18 12:00 -08 0 PST
1918-03-31 03:00 -07 1 PDT
1918-10-27 01:00 -08 0 PST
...
2016-03-13 03:00 -07 1 PDT
2016-11-06 01:00 -08 0 PST
...
2499-03-08 03:00 -07 1 PDT
2499-11-01 01:00 -08 0 PST

(The "1" and "0" are whether DST is in effect.) This completely characterizes the data and is more compact and easier to read than what zdump outputs now. Other formats might be preferable for other reasons. It might also be nice to output "PST8PDT,M3.2.0,M11.1.0" at the end, instead of that repetitive list of lines starting with the year 2007.

On Thu, Apr 28, 2016, at 13:18, Paul Eggert wrote:
One way forward here would be to give zdump more options to let the user specify the format. That way, everybody could use their own favorite format.
You've still got to pick a paradigm: print transitions, print intervals, or print individual moments (which can be regarded as having been selected before or after a transition, or as the start or end of an interval)
It might also be nice to output "PST8PDT,M3.2.0,M11.1.0" at the end, instead of that repetitive list of lines starting with the year 2007.
This may not be available for all timezones, and it's certainly not available (not without a work) if the localtime implementation is blackboxed as it is now in zdump. I suppose you could infer it, technically.

On Thu, Apr 28, 2016, at 13:33, Random832 wrote:
This may not be available for all timezones, and it's certainly not available (not without a work) if the localtime implementation is
Er, typo (or rather an editing error): this was going to be "a lot of work", then I half-deleted it to replace it with the statement about inferring it below.
blackboxed as it is now in zdump. I suppose you could infer it, technically.

I'd be perfectly happy with zdump gaining more display options, but I think there's still huge benefit in deciding on one *canonical* format for validation. It means that single format can be distributed centrally, rather than everyone having to build and run zic themselves just to generate the output. (Even if it isn't part of the IANA distribution, I can host that canonical format - not ideal, but better than nothing.) With a to-the-byte canonical format, I'm also happy to put the SHA-1 hashes up. As one point I suspect we could all agree on: assuming there is to be a canonical format, should we use "\n" as the line separator? (My current format uses \r\n, but I think that's likely to cause more pain than it alleviates.) Jon On 28 April 2016 at 18:18, Paul Eggert <eggert@cs.ucla.edu> wrote:
One way forward here would be to give zdump more options to let the user specify the format. That way, everybody could use their own favorite format. One format might look something like this for America/Los_Angeles, assuming the default zdump window of -500 through 2499 CE UTC:
-501-12-31 16:07:02 -07:52:58 0 LMT
1883-11-18 12:00 -08 0 PST
1918-03-31 03:00 -07 1 PDT
1918-10-27 01:00 -08 0 PST
...
2016-03-13 03:00 -07 1 PDT
2016-11-06 01:00 -08 0 PST
...
2499-03-08 03:00 -07 1 PDT
2499-11-01 01:00 -08 0 PST
(The "1" and "0" are whether DST is in effect.) This completely characterizes the data and is more compact and easier to read than what zdump outputs now. Other formats might be preferable for other reasons.
It might also be nice to output "PST8PDT,M3.2.0,M11.1.0" at the end, instead of that repetitive list of lines starting with the year 2007.

Right. Will modify my code and docs, then update the web site. Will probably add hashes on the web site at the same time. Jon On 29 April 2016 at 17:05, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
As one point I suspect we could all agree on: assuming there is to be a canonical format, should we use "\n" as the line separator?
Oh, yes.

How do you feel about putting the version number at the top of the validation file? This more tightly couples the version number with the validation data, making the name of the validation file less relevant. Howard On Apr 29, 2016, at 12:09 PM, Jon Skeet <skeet@pobox.com> wrote:
Right. Will modify my code and docs, then update the web site. Will probably add hashes on the web site at the same time.
Jon
On 29 April 2016 at 17:05, Paul Eggert <eggert@cs.ucla.edu> wrote: Jon Skeet wrote:
As one point I suspect we could all agree on: assuming there is to be a canonical format, should we use "\n" as the line separator?
Oh, yes.

I'd been wondering about that myself. I think it's great for some use cases, less so for others. (I can see cases where it may not be known.) I suspect it's useful more often than not. Maybe it should be part of the distributed files, but tools don't need to produce it, and it isn't included in the hash? Indeed, we could have the version, then the hash, then the data... On 29 Apr 2016 5:25 p.m., "Howard Hinnant" <howard.hinnant@gmail.com> wrote:
How do you feel about putting the version number at the top of the validation file? This more tightly couples the version number with the validation data, making the name of the validation file less relevant.
Howard
On Apr 29, 2016, at 12:09 PM, Jon Skeet <skeet@pobox.com> wrote:
Right. Will modify my code and docs, then update the web site. Will
probably add hashes on the web site at the same time.
Jon
On 29 April 2016 at 17:05, Paul Eggert <eggert@cs.ucla.edu> wrote: Jon Skeet wrote:
As one point I suspect we could all agree on: assuming there is to be a canonical format, should we use "\n" as the line separator?
Oh, yes.

Jon Skeet wrote:
I'd be perfectly happy with zdump gaining more display options, but I think there's still huge benefit in deciding on one *canonical* format for validation.
I looked into the format you suggested, along with the other comments noted and formats I've seen elsewhere (e.g., Shanks), and came up with the attached proposal for a "canonical" -i format for zdump, with the design goals being a format that is unambiguous, easy to review, and compact. Although this format's columns don't always line up, in general aligning columns appears to be impractical (in the extreme case, year numbers might exceed 9999!), and I found that unaligned columns make it easier to see glitches anyway. The proposed -i format does not contain versioning information as that would complicate regression testing. For what it's worth, the -i format is about 10% the size of -v format, and is about 53% the size of the format you proposed. This proposal is incomplete, for several reasons. First, it doesn't address leap seconds. Second, it doesn't abbreviate predicted futures into POSIX TZ strings; fixing this would make the output significantly shorter. Third, there is no infrastructure for verifying a distribution by checksumming its zdump -i output. So the proposal is documented as being experimental in the attached patch, and I haven't installed it on github yet. Of course zdump -v has all these problems as well, so the proposal format wouldn't make these problems worse. The first attachment consists of the revised man-page output; the second attachment is the change to tzcode.

(To the list this time instead of just to Paul - sorry for the duplicate, Paul.) Thanks for your work on this, Paul. I'm about to go on vacation, so I don't know how popular I'd be with my spouse if I investigated this much further this week, but I'm somewhat nervous about losing readability for the sake of compactness. I'd like to try a bunch of different formats, with examples and sizes - using Noda Time to generate them, just for the sake of making this experimentation easier. I'd personally be willing to sacrifice a certain amount of compactness for the sake of readability, but obviously if we can get the size down a bit *without* losing readability, that would be good. I'll try to come back with a document of options, at least within the next couple of weeks.

I think it's fine for zdump not to bother with version information etc - it should be easy enough to write code to add that later, so that zdump can focus just on the time part of things. I'm *personally* not bothered about the lack of leap second information - my bias here is that Noda Time completely ignores leap seconds. (I view them as rather separate from actual time zones anyway - I think if we were designing systems from scratch I'd separate the two entirely, but that's a different matter.)

Great to see progress on this though! Jon On 29 May 2016 at 19:49, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
I'd be perfectly happy with zdump gaining more display options, but I think there's still huge benefit in deciding on one *canonical* format for validation.
I looked into the format you suggested, along with the other comments noted and formats I've seen elsewhere (e.g., Shanks), and came up with the attached proposal for a "canonical" -i format for zdump, with the design goals being a format that is unambiguous, easy to review, and compact. Although this format's columns don't always line up, in general aligning columns appears to be impractical (in the extreme case, year numbers might exceed 9999!), and I found that unaligned columns make it easier to see glitches anyway. The proposed -i format does not contain versioning information as that would complicate regression testing.
For what it's worth, the -i format is about 10% the size of -v format, and is about 53% the size of the format you proposed.
This proposal is incomplete, for several reasons. First, it doesn't address leap seconds. Second, it doesn't abbreviate predicted futures into POSIX TZ strings; fixing this would make the output significantly shorter. Third, there is no infrastructure for verifying a distribution by checksumming its zdump -i output. So the proposal is documented as being experimental in the attached patch, and I haven't installed it on github yet. Of course zdump -v has all these problems as well, so the proposal format wouldn't make these problems worse.
The first attachment consists of the revised man-page output; the second attachment is the change to tzcode.

Jon Skeet wrote:
I'd personally be willing to sacrifice a certain amount of compactness for the sake of readability, but obviously if we can get the size down a bit *without* losing readability, that would be good.
Yes. Readability is to some extent in the eye of the beholder, and the proposed zdump -i format wasn't my first choice: it evolved over some time as I used it to look at a lot of data. To some extent the format is aimed at my needs, and may be less suited for novices. For example:

TZ="America/Phoenix"
- - -072818 LMT
1883-11-18 12 -07 MST
1918-03-31 03 -06 MDT 1
1918-10-27 01 -07 MST
1919-03-30 03 -06 MDT 1
1919-10-26 01 -07 MST
1942-02-09 03 -06 MWT 1
1943-12-31 23:01 -07 MST
1944-04-01 01:01 -06 MWT 1
1944-09-30 23:01 -07 MST
1967-04-30 03 -06 MDT 1
1967-10-29 01 -07 MST

Here the columns don't line up, and although this may be a bit offputting for some, for me it's a plus, as it causes the unusual WWII non-hour transitions to stand out. Also, it's easier to visually identify the daylight-saving transitions via "1" vs nothing than to scan through a column saying "isdst=1" vs "isdst=0". In contrast:

America/Phoenix
Initially: -07:28:18 standard LMT
1883-11-18 19:00:00Z -07:00:00 standard MST
1918-03-31 09:00:00Z -06:00:00 daylight MDT
1918-10-27 08:00:00Z -07:00:00 standard MST
1919-03-30 09:00:00Z -06:00:00 daylight MDT
1919-10-26 08:00:00Z -07:00:00 standard MST
1942-02-09 09:00:00Z -06:00:00 daylight MWT
1944-01-01 06:01:00Z -07:00:00 standard MST
1944-04-01 07:01:00Z -06:00:00 daylight MWT
1944-10-01 06:01:00Z -07:00:00 standard MST
1967-04-30 09:00:08Z -06:00:00 daylight MDT
1967-10-29 08:00:00Z -07:00:00 standard MST

Although this conveys the same information, it's harder to catch anomalies, as the nicely aligned columns and data tend to blur into each other. For example, it's hard to spot the error that I deliberately introduced into the penultimate line of that data, whereas the same error would have been much easier to see in zdump -i format.

Right, I've now had a chance to do a bit more work on this. The various options are committed in a github branch <https://github.com/jskeet/nodatime/tree/tzvalidate-options> of Noda Time.

I have a few concerns about the proposed format, but I definitely agree that we need to consider the audience and use cases. The use case I'm primarily interested in is validation: diffing a "golden" file with one generated by another tool. For example, to validate that Noda Time is doing the right thing, I'd compare the output of zdump with the output of NodaTime.TzValidate.NodaDump. Ideally, there will be no differences, so nothing to look at. If there *are* differences, I need to be able to understand them easily. Sometimes that will be missing lines on one side or the other indicating a different number of transitions, sometimes it will be differences between two lines (e.g. the transition point). In my use case, one would rarely, if ever, be visually examining a single file to look for anomalies, which is Paul's use case.

In terms of the users themselves - while I'd expect them to be *somewhat* domain experts (people writing date/time libraries) I wouldn't expect them to be dealing with this format every day - so it should really be as clear as possible without having to consult the man page each time. (I'd envisage maybe having to look at the files once every six months or year.)

The other "user" to consider is machine readability: there are some cases where it's very useful to be able to parse the file easily from code. For example, some platforms I've looked at definitely get the abbreviation wrong in many cases, so before diffing I remove the name. That's trivial to do in the current format - but much harder when some parts are optional and everything is variable width.

Regarding compactness: again, this comes down to use cases. I don't particularly mind the file being reasonably large in total, so long as each zone is simple to look at. (I don't want multiple lines per transition, for example.) When zipped, there's not much difference between my original format and the smallest one I've tested (128K vs 106K). If we can make it more compact easily, that's fine - but I personally regard that as a much lower priority than other aspects of the format.

Okay, concerns:

- I don't see why we need the quoted form for the time zone ID. That's going to be a mild pain to generate robustly in terms of escaping, and it's not clear what would happen for non-ASCII characters anyway. Assuming we'll never get a line break as part of a zone ID, I think just including the ID in UTF-8 is the simplest plan. Presumably the benefit of the proposed format is that you can copy/paste it into a Unix shell to use that time zone. That's certainly not a use case that I'd personally find useful, but the quotes and TZ= part are an unnecessary distraction IMO.

- Indicating daylight/standard with an arbitrary positive integer: if this is going to be a canonical format, we need to be more precise than that. Equivalent outputs should be equal. I'd also prefer it not to be an integer at all, given that it's indicating a Boolean value... where there's a number, there's an expectation (IMO) that the numeric value is meaningful. Just changing standard/daylight to s/d makes it a lot more compact, but I'd prefer std/day to be obvious. While we *could* omit the value for standard time, I still think there's a benefit in making every line consistent. Again, this comes down to a difference in use cases.
- I'd *really* like colons in the UT offsets - "-103126" looks like a regular integer to me, whereas "-10:31:26" is fairly obviously 10 hours, 31 minutes and 26 seconds.

- Personally I think it's simpler to think about the transition times in UT, indicated with a Z in the output. In particular, choosing the local time *after* the transition isn't how most people think about transitions in day to day conversation. If I were describing the UK rules, I'd say that in spring we advance our clocks at 1am and in the autumn we move them back at 2am... whereas in this format, that would be shown as advancing the clocks *to* 2am and moving them back *to* 1am. Just the fact that there's ambiguity suggests to me that using UT everywhere is a clearer option. The "Z" on every line is redundant, but IMO it helps with clarity.

- Omitting the abbreviation when it happens to be the same as the UT offset makes the file harder to parse for very little benefit in my view. That's taking compactness further than is useful.

- In terms of omitting 0 minutes and 0 seconds values: for times, I'd favour at least keeping the minutes: "2016-06-05 21:00" still looks like a date and time, whereas "2016-06-05 21" looks like a date and then 21. This isn't as much of a concern with offsets though - "+05" is reasonably clear on its own.

Six sample formats to compare for Honolulu (one of the examples given in Paul's man page), in the order of the commits in the github branch. The number is the size of the file (including headers) for all zones. All of these still represent the transition in UT:

"Original" (currently documented tzvalidate) - 1,735,616 bytes

Pacific/Honolulu
Initially: -10:31:26 standard LMT
1896-01-13 22:31:26Z -10:30:00 standard HST
1933-04-30 12:30:00Z -09:30:00 daylight HDT
1933-05-21 21:30:00Z -10:30:00 standard HST
1942-02-09 12:30:00Z -09:30:00 daylight HDT
1945-09-30 11:30:00Z -10:30:00 standard HST
1947-06-08 12:30:00Z -10:00:00 standard HST

Short daylight and standard indicators - 1,463,421 bytes

Pacific/Honolulu
Initially: -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30:00 s HST
1933-04-30 12:30:00Z -09:30:00 d HDT
1933-05-21 21:30:00Z -10:30:00 s HST
1942-02-09 12:30:00Z -09:30:00 d HDT
1945-09-30 11:30:00Z -10:30:00 s HST
1947-06-08 12:30:00Z -10:00:00 s HST

Shorter offsets, but still with colons - 1,240,377 bytes

Pacific/Honolulu
Initially: -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30 s HST
1933-04-30 12:30:00Z -09:30 d HDT
1933-05-21 21:30:00Z -10:30 s HST
1942-02-09 12:30:00Z -09:30 d HDT
1945-09-30 11:30:00Z -10:30 s HST
1947-06-08 12:30:00Z -10 s HST

Shorter offsets, no colons - 1,236,955 bytes

Pacific/Honolulu
Initially: -103126 s LMT
1896-01-13 22:31:26Z -1030 s HST
1933-04-30 12:30:00Z -0930 d HDT
1933-05-21 21:30:00Z -1030 s HST
1942-02-09 12:30:00Z -0930 d HDT
1945-09-30 11:30:00Z -1030 s HST
1947-06-08 12:30:00Z -10 s HST

Variable transition times, e.g. "21" instead of "21:00:00Z" (and changing Initially to "- -") - 972,361 bytes

Pacific/Honolulu
- - -10:31:26 s LMT
1896-01-13 22:31:26 -10:30 s HST
1933-04-30 12:30 -09:30 d HDT
1933-05-21 21:30 -10:30 s HST
1942-02-09 12:30 -09:30 d HDT
1945-09-30 11:30 -10:30 s HST
1947-06-08 12:30 -10 s HST

Variable transition times, but always keeping minutes - 1,079,278 bytes

Content is the same as the above, due to all the transitions happening on the half hour...
To show the difference between the last two options, here's Pago_Pago:

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11 -11 s BST
1983-11-30 11 -11 s SST

vs

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11:00 -11 s BST
1983-11-30 11:00 -11 s SST

(I'd prefer to keep the Z in there, admittedly - that wasn't an option I happened to code, though. It's easy enough to imagine it...)

With all that in mind, I would *personally* prefer to stick to the currently documented tzvalidate format. For my use cases of diffing and machine parsing, the fixed-width format is useful, as is always specifying both the daylight/standard indicator and the name. I could live with the offset and time shortening, but I'd definitely prefer to have colons in the offset, and to keep minutes in the time part.

Thoughts?

Jon

On 30 May 2016 at 22:59, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
I'd personally be willing to sacrifice a certain amount of compactness for the sake of readability, but obviously if we can get the size down a bit *without* losing readability, that would be good.
Yes. Readability is to some extent in the eye of the beholder, and the proposed zdump -i format wasn't my first choice: it evolved over some time as I used it to look at a lot of data. To some extent the format is aimed at my needs, and may be less suited for novices. For example:
TZ="America/Phoenix" - - -072818 LMT 1883-11-18 12 -07 MST 1918-03-31 03 -06 MDT 1 1918-10-27 01 -07 MST 1919-03-30 03 -06 MDT 1 1919-10-26 01 -07 MST 1942-02-09 03 -06 MWT 1 1943-12-31 23:01 -07 MST 1944-04-01 01:01 -06 MWT 1 1944-09-30 23:01 -07 MST 1967-04-30 03 -06 MDT 1 1967-10-29 01 -07 MST
Here the columns don't line up and although this may be a bit offputting for some, for me it's a plus as it causes the unusual WWII non-hour transitions to stand out. Also, it's easier to visually identify the daylight-saving transitions via "1" vs nothing, than to scan through a column saying "isdst=1" vs "isdst=0". In contrast:
America/Phoenix
Initially: -07:28:18 standard LMT
1883-11-18 19:00:00Z -07:00:00 standard MST
1918-03-31 09:00:00Z -06:00:00 daylight MDT
1918-10-27 08:00:00Z -07:00:00 standard MST
1919-03-30 09:00:00Z -06:00:00 daylight MDT
1919-10-26 08:00:00Z -07:00:00 standard MST
1942-02-09 09:00:00Z -06:00:00 daylight MWT
1944-01-01 06:01:00Z -07:00:00 standard MST
1944-04-01 07:01:00Z -06:00:00 daylight MWT
1944-10-01 06:01:00Z -07:00:00 standard MST
1967-04-30 09:00:08Z -06:00:00 daylight MDT
1967-10-29 08:00:00Z -07:00:00 standard MST
Although this conveys the same information, it's harder to catch anomalies, as the nicely aligned columns and data tend to blur into each other. For example, it's hard to spot the error that I deliberately introduced into the penultimate line of that data, whereas the same error would have been much easier to see in zdump -i format.

On Jun 5, 2016, at 4:26 AM, Jon Skeet <skeet@pobox.com> wrote:
With all that in mind, I would personally prefer to stick to the currently documented tzvalidate format. For my use cases of diffing and machine parsing, the fixed-width format is useful, as is always specifying both the daylight/standard indicator and the name. I could live with the offset and time shortening, but I'd definitely prefer to have colons in the offset, and to keep minutes in the time part.
Thoughts?
I know you weren't asking me, but here's my opinion anyway. I strongly agree with Jon on every point. My position is one of a 3rd party date/time library author wanting validation against both zdump *and* other 3rd party software such as NodaTime. Howard

Jon Skeet wrote:
The use case I'm primarily interested in is validation: diffing a "golden" file with one generated by another tool
Yes, I should have mentioned that. I commonly compare two zdump output files using "diff", for example. zdump -i works well for this, too. However, it does not suffice to merely look at diff output. Sometimes we add new zones, for example, and diff output won't serve to proofread those.
I wouldn't expect them to be dealing with this format every day
True; even I don't do that. Still, there is no need for zdump -i format to be self-explanatory. For example, the format need not use strftime %c format merely because naive users are more likely to understand %c format than ISO 8601 format. As long as the format is reasonably clear without constantly having to refer to the documentation then we should be OK, and zdump -i format clears that relatively-low bar.
- I don't see why we need the quoted form for the time zone ID.
The API allows the TZ environment variable (the time zone ID) to be any finite sequence of non-null bytes. TZ need not be UTF-8 encoded, and the bytes can contain newlines, etc., and zdump output should be unambiguous regardless of how weird TZ's value is.
Presumably the benefit of the proposed format is that you can copy/paste it into a Unix shell to use that time zone.
No, and in general such a cut-and-paste would not work because the quotation scheme is not designed to be shell-compatible. The main goal is to have an unambiguous format that supports any TZ value allowed by the API. Also, to provide some room for future extensions to zdump -i format.
the quotes and TZ= part are an unnecessary distraction IMO.
Some decoration is needed in order to make it easy to distinguish a TZ= line from an ordinary data line. This is because a TZ string can be almost anything: it can look like a data line, for example. Anyway, if this is the worst of zdump -i's problems, we should be OK.
- Indicating daylight/standard with an arbitrary positive integer: if this is going to be a canonical format, we need to be more precise than that. Equivalent outputs should be equal. I'd also prefer it not to be an integer at all, given that it's indicating a Boolean value.
tm_isdst is defined by ISO C11 and by POSIX to be an int value, so if we want zdump to work with all standard-conforming implementations without losing information, it must be able to represent an arbitrary int somehow. The existing zdump -v format can do it, and it would be odd if zdump -i format were to lose that ability.
- I'd *really* like colons in the UT offsets
That is mostly just a style thing. That being said, in my experience most UT offsets that contain hours and minutes omit colons (this includes several examples in the RFC-5322-format header in your email :-).
- I think it's simpler to think about the transition times in UT, indicated with a Z in the output.
That's not my experience. Most of our sources do not base transitions on UT, and I typically think about local time when mulling over transitions and DST rules.
choosing the local time *after* the transition isn't how most people think about transitions in day to day conversation.
True. But it's easy to get used to when looking at zdump -i format. Plus, users most likely prefer localtime to UT when thinking about transitions.
Just the fact that there's ambiguity
The format is documented and if this documentation is understood correctly the zdump -i output has just one interpretation, so there is no ambiguity. A problem might arise if someone attempts to look at zdump -i output without reading the documentation; although such a problem could occur with any format choice, some formats are less confusing than others, and most likely that is what you're referring to. To some extent there is a tradeoff between formats that make typos easy to find, and formats that are more what users typically expect. Within reason I'd rather make typos easy to find, as typos are a real probelm!
- Omitting the abbreviation when it happens to be the same as the UT offset makes the file harder to parse for very little benefit in my view.
First, it's trivial to parse zdump -i lines even when the abbreviation is omitted. For example, here's an awk script that outputs only zdump -i lines that correspond to DST transitions even when abbreviations are omitted:

/^[0-9]/ && NF > 3 && /[0-9]$/ {print}

Compare this to an awk script to do the same thing with tzvalidate format:

/^[0-9]/ && $(NF - 1) == "daylight" {print}

which is not significantly simpler.

Second, I realize the improvement is of little benefit to those who do not read zdump output. But any unambiguous format would do for that case; we could pick JSON format, or XML format, or whatever. Being somewhat old-fashioned I'd like a text format that makes it easy for me to read zdump -i format using an ordinary text editor. And for me, it's quite useful that redundant abbreviations are omitted. Consider, for example, this output:

1981-04-01 01 +07 1
1981-09-30 23 +06
1982-04-01 01 +07 1
1982-09-30 23 +06
1983-04-01 01 +07 +08 1
1983-09-30 23 +06
1984-04-01 01 +07 1
1984-09-30 02 +06

where the (incorrect) 1983-04-01 transition sticks out like a sore thumb. In contrast, if the abbreviation were always output and columns always lined up, and the output looked like this:

1981-04-01 01 +07 +07 1
1981-09-30 23 +06 +06 0
1982-04-01 01 +07 +07 1
1982-09-30 23 +06 +06 0
1983-04-01 01 +07 +08 1
1983-09-30 23 +06 +06 0
1984-04-01 01 +07 +07 1
1984-09-30 02 +06 +06 0

the same typo is *much* harder to spot. So it is not "very little benefit". It's a big deal to someone like me who wants to catch typos and who has to deal with the consequences of typos.
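For anyone more comfortable with Python than awk, rough equivalents of those two one-liners might look like the sketch below; the line shapes it assumes are just the ones discussed in this thread, and the helper names are purely illustrative.

    # Rough Python equivalents of the two awk one-liners above: both keep only
    # the lines that represent daylight-saving transitions.
    import sys

    def dst_lines_zdump_i(lines):
        # zdump -i style: a data line starts with a digit, has more than three
        # fields, and ends with the numeric isdst flag.
        return [l for l in lines
                if l[:1].isdigit() and len(l.split()) > 3 and l.rstrip()[-1:].isdigit()]

    def dst_lines_tzvalidate(lines):
        # tzvalidate style: the penultimate field says "daylight".
        return [l for l in lines
                if l[:1].isdigit() and len(l.split()) >= 2 and l.split()[-2] == "daylight"]

    for line in dst_lines_zdump_i(sys.stdin.readlines()):
        print(line, end="")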
for times, I'd favour at least keeping the minutes
I was tempted by that too, on the grounds that it's what readers typically expect. However, it makes typos harder to catch, which is a significant disadvantage. I hope I've explained the significant technical advantages of zdump -i format for my use case (manually looking at zdump -i output, and looking at diffs of it). I am not surprised that its style is offputting, which is why I'm thinking that we may need a way for people to specify output style more flexibly than zdump -i versus zdump -v versus zdump -V.

On 6 June 2016 at 00:51, Paul Eggert <eggert@cs.ucla.edu> wrote: <snip> I hope I've explained the significant technical advantages of zdump -i
format for my use case (manually looking at zdump -i output, and looking at diffs of it). I am not surprised that its style is offputting, which is why I'm thinking that we may need a way for people to specify output style more flexibly than zdump -i versus zdump -v versus zdump -V.
Yes, it does seem that we're unlikely to come up with a format that both of us are happy with. One alternative is for me to keep the format I've worked up, but post-process the output of zdump with a Python script, to convert it into the tzvalidate format. I suspect that making zdump flexible enough to output both formats (and others) based on command line arguments would be a *huge* amount of work - especially with things like variable-width times, omitting time zone abbreviations if they happen to match the offset etc. The Python code could also generate the headers and hashes necessary. zdump -i + a Python script is something that can be done reasonably quickly (I doubt that it's more than an afternoon's work) and would still be significantly more portable than my current C#-based solution. If we were to go ahead with that, how hard (both technically and in terms of any IANA process required) would it be to start publishing a zip file of the tzvalidate output alongside the code and data files, and include just a hash within the main data zip file (e.g. as a new file such as tzvalidate-sha256.txt)? While I'm happy to keep doing this separately, it would obviously be better if it were integrated into the main process. Jon
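As a rough sketch of what that post-processing could look like (not the actual script - the field layout and omission rules are simplified, and the offsets are assumed to already contain colons or be plain hours):

    # Sketch: convert a simplified "zdump -i"-style transition line into a
    # fixed-width, tzvalidate-style line. Assumed shape: DATE LOCALTIME OFFSET
    # ABBR [DSTFLAG]; a real converter would also have to reconstruct the
    # omitted fields discussed earlier in the thread.
    from datetime import datetime, timedelta

    def parse_hms(text):
        # "03", "01:01" or "22:31:26" -> timedelta
        parts = [int(p) for p in text.split(":")] + [0, 0]
        return timedelta(hours=parts[0], minutes=parts[1], seconds=parts[2])

    def parse_offset(text):
        sign = -1 if text.startswith("-") else 1
        return sign * parse_hms(text.lstrip("+-"))

    def format_offset(delta):
        sign = "-" if delta < timedelta(0) else "+"
        secs = abs(int(delta.total_seconds()))
        return f"{sign}{secs // 3600:02}:{secs % 3600 // 60:02}:{secs % 60:02}"

    def convert(line):
        date, local, offset, abbr, *dst_flag = line.split()
        offset_td = parse_offset(offset)
        # zdump -i shows the local time *after* the transition; subtracting the
        # new offset gives the UT instant that tzvalidate lines use.
        utc = datetime.strptime(date, "%Y-%m-%d") + parse_hms(local) - offset_td
        kind = "daylight" if dst_flag else "standard"
        return f"{utc:%Y-%m-%d %H:%M:%S}Z {format_offset(offset_td)} {kind} {abbr}"

    print(convert("1918-03-31 03 -06 MDT 1"))
    # -> 1918-03-31 09:00:00Z -06:00:00 daylight MDT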

Jon Skeet wrote:
hard (both technically and in terms of any IANA process required) would it be to start publishing a zip file of the tzvalidate output alongside the code and data files, and include just a hash within the main data zip file (e.g. as a new file such as tzvalidate-sha256.txt)? While I'm happy to keep doing this separately, it would obviously be better if it were integrated into the main process.
Something like this sounds like a good idea. I'd be inclined to use tar.gz format (as this format is already used by the distribution) as a container for zdump -i format (whatever that turns out to be) for validation. If we distribute that, and while we're at it distribute the zic binary-file output (which others have requested), then we can use the same mechanism we're already using for checksumming distributions.

On 6 June 2016 at 07:27, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
hard (both technically and in terms of any IANA process required) would it be to start publishing a zip file of the tzvalidate output alongside the code and data files, and include just a hash within the main data zip file (e.g. as a new file such as tzvalidate-sha256.txt)? While I'm happy to keep doing this separately, it would obviously be better if it were integrated into the main process.
Something like this sounds like a good idea. I'd be inclined to use tar.gz format (as this format is already used by the distribution)
Sorry, yes, that was sloppy on my part - I meant "a general compressed container".
as a container for zdump -i format (whatever that turns out to be) for validation.
Well, I was proposing a transformation *from* the zdump -i format into one specifically for validation - in other words, after the "bad data" (typos) use cases have been weeded out, and we're into the "bad code" (in libraries) territory, at which point I still think the format I've proposed is more readable both for humans and machines.
If we distribute that, and while we're at it distribute the zic binary-file output (which others have requested), then we can use the same mechanism we're already using for checksumming distributions.
Sure. (Can't say I've looked at that part of the existing scheme, but I'm sure it makes sense.) Jon

On 5 June 2016 at 19:51, Paul Eggert <eggert@cs.ucla.edu> wrote:
choosing the local time *after* the transition isn't how most people think about transitions in day to day conversation.
True. But it's easy to get used to when looking at zdump -i format. Plus, users most likely prefer localtime to UT when thinking about transitions.
I realize the goal may be to have a single canonical format, but perhaps this could be made conditional on a -z option? Append a "Z" to the time-of-day when it's UT; don't when it's local. One of the good things the existing -v format provides is the ability to confirm that transitions occur at both the correct local time and Universal time.

To some extent there is a tradeoff between formats that make typos easy to find, and formats that are more what users typically expect. Within reason I'd rather make typos easy to find
Just to throw in a potential middle-of-the-road option, would it make sense to space-pad the datetime and offset values instead? Something like the following:

TZ="America/Phoenix"
- - -072818 LMT
1883-11-18 12 -07 MST
1918-03-31 03 -06 MDT 1
1918-10-27 01 -07 MST
1919-03-30 03 -06 MDT 1
1919-10-26 01 -07 MST
1942-02-09 03 -06 MWT 1
1943-12-31 23:01 -07 MST
1944-04-01 01:01 -06 MWT 1
1944-09-30 23:01 -07 MST
1967-04-30 03 -06 MDT 1
1967-10-29 01 -07 MST

TZ="Pacific/Pago_Pago"
- - +123712 LMT
1879-07-04 00 -112248 LMT
1911-01-01 00:22:48 -11 NST
1967-04-01 00 -11 BST
1983-11-30 00 -11 SST

I think that strikes a balance where typos are easy to spot, but the purpose of each field remains quite reasonably clear. The handful of extra bytes seem a worthwhile expense to have a canonical format which can easily serve both humans and machines alike. (I would not, however, suggest padding the later fields.)

typos are a real probelm!

;)

-- Tim Parenti

Tim Parenti wrote:
I realize the goal may be to have a single canonical format, but perhaps this could be made conditional on a -z option?
Yes, or some such option like that. I was thinking more of a strftime-like format in which one could specify UT vs local time.
Just to throw in a potential middle-of-the-road option, would it make sense to space-pad the datetime and offset values instead?
I thought of doing that, but found that it would be a pain, since the amount of padding would be system-dependent. Every field whose extrema depend on machine integer size (the year, the UT offset) would have a width that would depend on the current machine architecture, and this would mean zdump -i would generate different outputs on different machine architectures. Plus, there would be a lot of spaces before the year and the UT offsets.

Alternatively, zdump could look at all output to be generated for this particular zdump run, compute the maximum width needed for the run, and use that width. But this would mean that 'zdump -i A; zdump -i B' would not necessarily output the same thing as 'zdump -i A B', which would not be good at all.

Alternatively, zdump could not bother to align outlandish years outside the range -999,9999 or outlandish UT offsets that are more than 100 hours away from UT. Something like that might work, I suppose, though we'd probably still get bug reports from compulsive aligners wondering why the outlandish cases aren't aligned properly, or why there's all that white space in the columns.

I'm totally in favour of the pragmatic approach of assuming there isn't crazy data. Let's try to design the format to accomplish what we need it to - which to my mind doesn't include years earlier than 1800 or later than 3000 (and probably a narrower range than that - I currently only check up to 2035, but that makes me somewhat nervous). I'd rather have something that does what it's designed for well and doesn't work for tasks it's not designed for than something that copes with everything, but does so in a mediocre way. I quite like Tim's padding idea overall, although I'd still argue for colons in offsets (RFC 5322 uses a horrible format in general; I see no reason to copy mistakes of the past) and "d" instead of "a positive integer" to indicate daylight. (My goal is for this to capture all the relevant information from the original data text files, but the implementation details of is_dst fall outside that scope IMO. That's not part of the source data.) I'm still fine with the idea of transforming some non-ideal-to-me format from zdump into a more canonical format in a simple way though. Jon On 7 June 2016 at 09:44, Paul Eggert <eggert@cs.ucla.edu> wrote:
Tim Parenti wrote:
I realize the goal may be to have a single canonical format, but perhaps
this could be made conditional on a -z option?
Yes, or some such option like that. I was thinking more of a strftime-like format in which one could specify UT vs local time.
Just to throw in a potential middle-of-the-road option, would it make sense
to space-pad the datetime and offset values instead?
I thought of doing that, but found that it would be a pain, since the amount of padding would be system-dependent. Every field whose extrema depend on machine integer size (the year, the UT offset) would have a width that would depend on the current machine architecture, and this would mean zdump -i would generate different outputs on different machine architectures. Plus, there would be a lot of spaces before the year and the UT offsets.
Alternatively, zdump could look at all output to be generated for this particular zdump run, compute the maximum width needed for the run, and use that width. But this would mean that 'zdump -i A; zdump -i B' would not necessarily output the same thing as 'zdump -i A B', which would not be good at all.
Alternatively, zdump could not bother to align outlandish years outside the range -999,9999 or outlandish UT offsets that are more than 100 hours away from UT. Something like that might work, I suppose, though we'd probably still get bug reports from compulsive aligners wondering why the outlandish cases aren't aligned properly, or why there's all that white space in the columns.

Jon Skeet wrote:
I quite like Tim's padding idea overall, although I'd still argue for colons in offsets (RFC 5322 uses a horrible format in general; I see no reason to copy mistakes of the past)
Tim argued for that as well, so let's do that.
and "d" instead of "a positive integer" to indicate daylight. (My goal is for this to capture all the relevant information from the original data text files, but the implementation details of is_dst fall outside that scope IMO. That's not part of the source data.)
The decimal integer does both, no? That is, it captures all the relevant information from the original data text files, and it also captures the POSIX-specified implementation details. Since zdump is supposed to be portable to other implementations where the value of the flag may matter, I'm still inclined to have the -i format output that information, as the -v format already does.
I'm still fine with the idea of transforming some non-ideal-to-me format from zdump into a more canonical format in a simple way though.
I plan to tweak the -i format to add colons to UT offsets, and do a couple of other things to make it easier to view and process (notably, use tabs rather than spaces to separate fields, so that people who want columns to line up can easily do that). Also, I would like to distribute a new file that contains the zdump -i output, so that changes to the tzdata source can easily be tracked by diffing the zdump -i output. This should help with regression testing. With luck, I'll get out a new patch shortly to do all this.
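As an illustration of the tab-separated idea, a reader who prefers aligned columns could pad the fields after the fact with something like the sketch below; the widths are computed per run, and none of this would be part of zdump itself.

    # Read tab-separated lines on stdin and reprint them with padded columns.
    import sys

    rows = [line.rstrip("\n").split("\t") for line in sys.stdin if line.strip()]
    if rows:
        ncols = max(len(r) for r in rows)
        widths = [max(len(r[i]) for r in rows if i < len(r)) for i in range(ncols)]
        for r in rows:
            print("  ".join(cell.ljust(widths[i]) for i, cell in enumerate(r)))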

Paul Eggert wrote:
With luck, I'll get out a new patch shortly to do all this.
Proposed 3 patches attached. These 3 patches simplify regression testing, so I'll propose a patch for that soon. The regression data will bloat the tz distribution somewhat, though, so to help out with that let's distribute a single tarball containing both code and data (this was suggested during the previous millennium, so it's about time :-), and while we're at it, use a better compressor than gzip. (For compatibility we'll still distribute the old-format .tar.gz pair for a while.) I'll propose a patch for that too. Plus, I've used the new regression-testing framework to verify the planned conversion of the ex-Soviet zones from invented to numeric time zone abbreviations, so I'll propose that patch too. I'll push all these patches into the Github experimental version, as they should be stable enough for review now.

<<On Sun, 21 Aug 2016 20:53:08 -0700, Paul Eggert <eggert@cs.ucla.edu> said:
The regression data will bloat the tz distribution somewhat, though, so to help out with that let's distribute a single tarball containing both code and data (this was suggested during the previous millennium, so it's about time :-), and while we're at it, use a better compressor than gzip. (For compatibility we'll still distribute the old-format .tar.gz pair for a while.) I'll propose a patch for that too.
Please no, for the reasons amply stated by others in this thread. -GAWollman

(Snipped.) On 21 August 2016 at 11:42, Paul Eggert <eggert@cs.ucla.edu> wrote:
and "d" instead of "a positive
integer" to indicate daylight. (My goal is for this to capture all the relevant information from the original data text files, but the implementation details of is_dst fall outside that scope IMO. That's not part of the source data.)
The decimal integer does both, no? That is, it captures all the relevant information from the original data text files, and it also captures the POSIX-specified implementation details. Since zdump is supposed to be portable to other implementations where the value of the flag may matter, I'm still inclined to have the -i format output that information, as the -v format already does.
The problem is it isn't canonical at that point - from the perspective of information in the source text. If two valid implementations can give two different outputs, that's a problem for my use case. More below.
I'm still fine with the idea of transforming some non-ideal-to-me format
from zdump into a more canonical format in a simple way though.
I plan to tweak the -i format to add colons to UT offsets, and do a couple of other things to make it easier to view and process (notably, use tabs rather than spaces to separate fields, so that people who want columns to line up can easily do that).
Well, if the output format isn't suitable as a canonical format for validating tz text consumers, the readability doesn't matter as much to me... so long as I can convert it into a format I'm happier with. The question is really whether we want there to be two formats for the two slightly different use cases or not. Now the part about using an unspecified integer or not could be argued to be more theoretical than practical, of course... if the output data in the release will always use 1 and 0, I guess I could canonicalize to that... it doesn't feel nice in terms of relying on anything unspecified, but it may still be better than having two different formats. (I'm personally not a big fan of tabs, but I don't want to waste everyone's time by arguing forever...)
Also, I would like to distribute a new file that contains the zdump -i output, so that changes to the tzdata source can easily be tracked by diffing the zdump -i output. This should help with regression testing.
Yes, that would definitely be helpful, regardless of whether the format is precisely what I'd like :) Jon

Jon Skeet wrote:
it isn't canonical at that point - from the perspective of information in the source text. If two valid implementations can give two different outputs, that's a problem for my use case.
We have that problem anyway. If zdump is run on a platform with 32-bit or unsigned time_t, it will generate different output from platforms with 64-bit signed time_t. This is true even when using the tzcode localtime.c implementation. I plan to generate the reference file on a host with 64-bit signed time_t and to use the reference tzcode implementation. So long as you test on a similar platform we should be OK.
if the output data in the release will always use 1 and 0, I guess I could canonicalize to that
Yes, it should always be 1 or absent.

On 22 August 2016 at 16:30, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
it isn't canonical at that point - from the perspective of
information in the source text. If two valid implementations can give two different outputs, that's a problem for my use case.
We have that problem anyway. If zdump is run on a platform with 32-bit or unsigned time_t, it will generate different output from platforms with 64-bit signed time_t. This is true even when using the tzcode localtime.c implementation.
Is that due to dates past 2038, or something else? (I've deliberately capped my canonical format at 2035 for that purpose, although that's only a reasonably short-term solution, of course.) Are there other differences I should be aware of in terms of zdump -i output?

I plan to generate the reference file on a host with 64-bit signed time_t and to use the reference tzcode implementation. So long as you test on a similar platform we should be OK.
Well, I'd be testing something where time_t doesn't get involved at all, of course - an entirely different, non-C-based representation. That's the point of it, from my perspective.
if the output data in the
release will always use 1 and 0, I guess I could canonicalize to that
Yes, it should always be 1 or absent.
Right. I think I'll at least *start* by writing a small script to convert from one format to the other. We can see whether over time one becomes a clear winner over the other. Jon

Jon Skeet wrote:
Is that due to dates past 2038, or something else?
Also dates before 1901 for 32-bit signed time_t, or before 1970 for unsigned time_t. I want the pre-1901 transitions to be checked, though, so I would rather stick with 64-bit signed time_t when generating the reference file. Plus, on some platforms zdump uses CRLF instead of LF to terminate output lines. There may be other niggling things like that.
I'd be testing something where time_t doesn't get involved at all, of course - an entirely different, non-C-based representation. That's the point of it, from my perspective.
Something with bignums, say? (Because 64-bit signed time_t doesn't suffice for simulations of proton decay in degenerate stars. See: Adams FC. The future history of the universe. Cosmic Update. 2011-07-23. http://dx.doi.org/10.1007/978-1-4419-8294-0_3 :-)

On 22 August 2016 at 18:38, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
Is that due to dates past 2038, or something else?
Also dates before 1901 for 32-bit signed time_t, or before 1970 for unsigned time_t. I want the pre-1901 transitions to be checked, though, so I would rather stick with 64-bit signed time_t when generating the reference file.
Plus, on some platforms zdump uses CRLF instead of LF to terminate output lines. There may be other niggling things like that.
Right. My own format spec <https://github.com/nodatime/tzvalidate/blob/master/format.md> explicitly calls out U+000A, so that's consistent with using a Unix 64-bit version of zdump to generate the canonical file.
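A minimal sketch of that normalization, assuming a dump produced on a platform whose zdump emits CRLF line endings; the file names are illustrative only.

    # Normalize CRLF to the U+000A line endings the format spec requires,
    # before hashing or diffing.
    raw = open("zdump-output.txt", "rb").read()
    with open("zdump-output.lf.txt", "wb") as out:
        out.write(raw.replace(b"\r\n", b"\n"))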
I'd be testing something where time_t doesn't get involved at all, of
course - an entirely different, non-C-based representation. That's the point of it, from my perspective.
Something with bignums, say? (Because 64-bit signed time_t doesn't suffice for simulations of proton decay in degenerate stars. See:
Adams FC. The future history of the universe. Cosmic Update. 2011-07-23. http://dx.doi.org/10.1007/978-1-4419-8294-0_3
If we still have time zones that need this sort of support in even a thousand years' time, then... well, heck, *I* won't be maintaining it :) Given the other reactions around file merging, perhaps the data file should just be hosted as a separate file? Jon

Jon Skeet wrote:
Given the other reactions around file merging, perhaps the data file should just be hosted as a separate file?
Hmm, I think I'd rather not ship *three* tarballs per release for the indefinite future. *Two* tarballs are already too many. As the new .tzs file is closely associated with the data, it belongs in the tzdata tarball when we're talking about old-format distributions. By the way, I picked a 2050 cutoff date for the draft .tzs partly because I wanted to check a few years past the 2038 limit imposed by 32-bit signed time_t. I didn't go too much past 2050 partly to cut down on bloat, and partly because the reference implementation of zdump is too slow.

On 23 August 2016 at 01:12, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
Given the other reactions around file merging, perhaps the data file should just be hosted as a separate file?
Hmm, I think I'd rather not ship *three* tarballs per release for the indefinite future. *Two* tarballs are already too many.
I'd rather you blessed a git repo and stopped calling it 'experimental' so I don't have to worry about tarballs at all :) -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/

On 08/21/2016 03:42 AM, Paul Eggert wrote:
Jon Skeet wrote:
I'd still argue for colons in offsets (RFC 5322 uses a horrible format in general; I see no reason to copy mistakes of the past)
Tim argued for that as well, so let's do that.
After having used it for a while, the colons output by zic are getting in the way of my manual audits of zdump -i output. The tz database omits ":" in numeric time zone abbreviations, not only because of longstanding tradition exemplified by Internet RFC 5322, but also because the POSIX TZ format does not allow ":" there. "zdump -i Asia/Colombo" therefore outputs lines like this:

2006-04-15 00 +05:30 "+0530"

where the quoted string indicates that something is amiss, because the abbreviation is not alphabetic and does not match the computed abbreviation. To fix this, the attached proposed patch causes zic to compute the abbreviation "+0530" instead, causing its output line to look like this:

2006-04-15 00 +0530

which makes it clearer that the situation is expected and does not need further attention.

It may be that we'll need at some point to have a way to add colons to the tz database's time zone abbreviations, for people who prefer that style and are not worried about POSIX compatibility. I suppose this can be a topic for a future patch. It would need to be done carefully, as it would likely require a change to the version number in compiled tzdata files, since it would change the format of their embedded TZ strings.

For anyone else still following this, I have an experimental converter <https://github.com/nodatime/tzvalidate/tree/master/csharp/src/NodaTime.TzVal...> (in C# at the moment) from the "tzs" format (for want of a better name) to the tzvalidate format <https://github.com/nodatime/tzvalidate/blob/master/format.md> we've been using. (I'm pleased to say that the results then concur, which is always reassuring.) I'll amend that code to handle Paul's latest patch when the next version of the time zone data comes out. Jon On 19 December 2016 at 18:03, Paul Eggert <eggert@cs.ucla.edu> wrote:
On 08/21/2016 03:42 AM, Paul Eggert wrote:
Jon Skeet wrote:
I'd still argue for colons in offsets (RFC 5322 uses a horrible format in general; I see no reason to copy mistakes of the past)
Tim argued for that as well, so let's do that.
After having used it for a while, the colons output by zic are getting in the way of my manual audits of zdump -i output. The tz database omits ":" in numeric time zone abbreviations, not only because of longstanding tradition exemplified by Internet RFC 5322, but also because the POSIX TZ format does not allow ":" there. "zdump -i Asia/Colombo" therefore outputs lines like this:
2006-04-15 00 +05:30 "+0530"
where the quoted string indicates that something is amiss because the abbreviation is not alphabetic and does not match the computed abbreviation. To fix this, the attached proposed patch causes zic to compute the abbreviation "+0530" instead, causing its output line to look like this:
2006-04-15 00 +0530
which makes it clearer that the situation is expected and does not need further attention.
It may be that we'll need at some point to have a way to add colons to the tz database's time zone abbreviations, for people who prefer that style and are not worried about POSIX compatibility. I suppose this can be a topic for a future patch. It would need to be done carefully as it would likely require a change to the version number in compiled tzdata files, as it would change the format of their embedded TZ strings.

On 7 June 2016 at 04:44, Paul Eggert <eggert@cs.ucla.edu> wrote:
Alternatively, zdump could not bother to align outlandish years outside the range -999,9999 or outlandish UT offsets that are more than 100 hours away from UT.
Though all of the space-padding alternatives you propose do have their limitations, as you mention, I think this is the one that remains most workable and portable. In fact, if you include colons in the UT offsets, you need only state that the datetime and offset fields are given with second-level precision, and that all terminal sequences of the substring ":00" within those fields are replaced with three spaces to aid readability.
Something like that might work, I suppose, though we'd probably still get bug reports from compulsive aligners wondering why the outlandish cases aren't aligned properly, or why there's all that white space in the columns.
"Compulsive aligners" will always find something to complain about, surely… but maybe if the stated goal isn't so much absolute "alignment" as general "readability", they might pick on something else instead. -- Tim Parenti

On Sun, May 29, 2016 at 1:49 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
I looked into the format you suggested, along with the other comments noted and formats I've seen elsewhere (e.g., Shanks), and came up with the attached proposal for a "canonical" -i format for zdump, with the design goals being a format that is unambiguous, easy to review, and compact.
I encountered an error when building with the proposed patch with GCC 4.9.3, resolved with the below trivial fix.

Chris

-- >8 --
Subject: [PATCH] Fix error with GCC 4.9.3

zdump.c: In function 'istrftime':
zdump.c:1082:3: error: 'for' loop initial declarations are only allowed in C99 or C11 mode
   for (char const *p = f; ; p++)
   ^
zdump.c:1082:3: note: use option -std=c99, -std=gnu99, -std=c11 or -std=gnu11 to compile your code
---
 zdump.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/zdump.c b/zdump.c
index 9ecc02a..d5ed5f0 100644
--- a/zdump.c
+++ b/zdump.c
@@ -1077,9 +1077,9 @@ istrftime(char *buf, size_t size, char const *time_fmt,
 {
   char *b = buf;
   size_t s = size;
-  char const *f = time_fmt;
+  char const *f = time_fmt, *p;
 
-  for (char const *p = f; ; p++)
+  for (p = f; ; p++)
     if (*p == '%' && p[1] == '%')
       p++;
     else if (!*p
-- 
2.8.1

On Apr 27, 2016, at 11:51 AM, Random832 <random832@fastmail.com> wrote:
What's wrong with zdump's output format?
$ zdump -v Africa/Maseru
Africa/Maseru -9223372036854775808 = NULL
Africa/Maseru -9223372036854689408 = NULL
Africa/Maseru Sun Feb 7 22:07:59 1892 UTC = Sun Feb 7 23:59:59 1892 LMT isdst=0 gmtoff=6720
Africa/Maseru Sun Feb 7 22:08:00 1892 UTC = Sun Feb 7 23:38:00 1892 SAST isdst=0 gmtoff=5400
Africa/Maseru Sat Feb 28 22:29:59 1903 UTC = Sat Feb 28 23:59:59 1903 SAST isdst=0 gmtoff=5400
Africa/Maseru Sat Feb 28 22:30:00 1903 UTC = Sun Mar 1 00:30:00 1903 SAST isdst=0 gmtoff=7200
Africa/Maseru Sat Sep 19 23:59:59 1942 UTC = Sun Sep 20 01:59:59 1942 SAST isdst=0 gmtoff=7200
Africa/Maseru Sun Sep 20 00:00:00 1942 UTC = Sun Sep 20 03:00:00 1942 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 20 22:59:59 1943 UTC = Sun Mar 21 01:59:59 1943 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 20 23:00:00 1943 UTC = Sun Mar 21 01:00:00 1943 SAST isdst=0 gmtoff=7200
Africa/Maseru Sat Sep 18 23:59:59 1943 UTC = Sun Sep 19 01:59:59 1943 SAST isdst=0 gmtoff=7200
Africa/Maseru Sun Sep 19 00:00:00 1943 UTC = Sun Sep 19 03:00:00 1943 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 18 22:59:59 1944 UTC = Sun Mar 19 01:59:59 1944 SAST isdst=1 gmtoff=10800
Africa/Maseru Sat Mar 18 23:00:00 1944 UTC = Sun Mar 19 01:00:00 1944 SAST isdst=0 gmtoff=7200
Africa/Maseru 9223372036854689407 = NULL
Africa/Maseru 9223372036854775807 = NULL
Cons:
- A bit verbose
- Technically uses instants (from before and on each transition) rather than spans (a sketch of this instants-to-spans conversion follows the pros list below).
- The NULLs are a bit mysterious - I'm personally not sure *exactly* how it finds the transitions, and in particular I'm not sure if it will reliably find multiple transitions per day
Pros:
- Already exists
- Is already written in C, and already installed on many systems
- Does not depend on any implementation internals
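To make the instants-versus-spans point in the cons above concrete, here is a minimal sketch in C that pairs a list of transition instants into the half-open spans of Jon's format, where each span ends exactly where the next begins. It is not part of zdump or Noda Time; the transition table is a hand-copied extract of the Africa/Maseru data shown above, purely for illustration.

#include <stdio.h>

/* One transition: the UT instant at which a new offset takes effect,
   plus the abbreviation and offset in force from that instant onwards. */
struct transition {
	const char *when;
	const char *abbr;
	const char *offset;
};

int main(void)
{
	/* Hand-copied extract for Africa/Maseru; the final entry is a
	   sentinel marking where this extract stops. */
	static const struct transition t[] = {
		{ "StartOfTime",          "LMT",  "+01:52" },
		{ "1892-02-07T22:08:00Z", "SAST", "+01:30" },
		{ "1903-02-28T22:30:00Z", "SAST", "+02"    },
		{ "1942-09-20T00:00:00Z", "SAST", "+03"    },
		{ "1943-03-20T23:00:00Z", NULL,   NULL     },
	};
	size_t n = sizeof t / sizeof t[0];
	size_t i;

	/* Each span runs from one transition up to, but not including,
	   the next one - the end of each span is the start of the next. */
	for (i = 0; i + 1 < n; i++)
		printf("%s: [%s, %s) %s\n",
		       t[i].abbr, t[i].when, t[i + 1].when, t[i].offset);
	return 0;
}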
One of the use cases I find valuable is comparing the output from two sequential versions of tzdata. Therefore human-readability of the format ranks very high for me. I can read Jon's format far more easily. One entry per transition is inherently easier to read. I see no reason to repeat the timezone name for each entry.

Furthermore some of the zdump output is not referring to data contained in tzdata (that I can tell). Rather it appears to refer to details of tzcode (e.g. 9223372036854775807 == 0x7FFFFFFFFFFFFFFF).

I would not be opposed to tweaking the format of the validation file. However the zdump format does not look like a step forward to me. I wouldn't mind seeing the validation file start out with the tzdata version. And I wouldn't mind seeing the leap second data appended. I would have no objection to subbing in ' ' for 'T' in the transition timestamp. I would like to maintain an empty line between timezones. I would object to any data appearing after the abbreviation unless the abbreviation is padded with spaces to make the subsequent data appear in a consistent column. And generally, I highly value consistent columns for data.

I can sympathize with "Already exists". But from where I sit, so does Jon's format.

Howard

Thank you very much for this service, Jon. This is extremely valuable to me for validating my own IANA timezone database parser: https://github.com/HowardHinnant/date

I do not know, but I suspect, that this would also be valuable for validating both tzcode and tzdata. Any time that multiple independently developed pieces of software come up with exactly the same answer, that is generally a very good sign. If a validation file such as this revealed a difference between tzcode and Noda Time (or my own parser), the bug would not necessarily lie with Noda Time (or my own parser). A latent bug in tzcode might be revealed this way.

And this is also a very convenient way to check the differences between sequential versions of tzdata, to ensure that the intended changes are actually the changes seen in the list of transitions. I consider the creation of this validation file, and checking it against Jon's independently created validation file, a critical test for every single new version of the database: https://github.com/HowardHinnant/date/blob/master/test/tz_test/validate.cpp

And I would be in favor of bundling such a validation file with either the tzdata or tzcode releases, or as a 3rd release alongside these two.

Thanks again, Jon.

Howard

On Apr 27, 2016, at 10:56 AM, Jon Skeet <skeet@pobox.com> wrote:
For anyone still interested in this, I've now moved the data to http://nodatime.github.io/tzvalidate/ and created a Travis job which lets me update it mostly-automatically. (When there's a new TZDB release, I need to build the Noda Time data file, push that, then manually trigger a Travis build for tzvalidate.)
Of course, if there were any appetite for building and distributing this along with tzcode and tzdata, that would be even better :)
Jon
On 11 July 2015 at 11:35, Jon Skeet <skeet@pobox.com> wrote:
Background: I'm the primary developer for Noda Time which consumes the tz data. I'm currently refactoring the code to do this... and I've come across some code (originally ported from Joda Time) which I now understand in terms of what it's doing, but not exactly why.
For a little while now, the Noda Time source repo has included a text dump file, containing a text dump of every transition (up to 2100, at the moment) for every time zone. It looks like this, picking just one example: Zone: Africa/Maseru LMT: [StartOfTime, 1892-02-07T22:08:00Z) +01:52 (+00) SAST: [1892-02-07T22:08:00Z, 1903-02-28T22:30:00Z) +01:30 (+00) SAST: [1903-02-28T22:30:00Z, 1942-09-20T00:00:00Z) +02 (+00) SAST: [1942-09-20T00:00:00Z, 1943-03-20T23:00:00Z) +03 (+01) SAST: [1943-03-20T23:00:00Z, 1943-09-19T00:00:00Z) +02 (+00) SAST: [1943-09-19T00:00:00Z, 1944-03-18T23:00:00Z) +03 (+01) SAST: [1944-03-18T23:00:00Z, EndOfTime) +02 (+00)
I use this file for confidence when refactoring my time zone handling code - if the new code comes up with the same set of transitions as the old code, it's probably okay. (This is just one line of defence, of course - there are unit tests, though not as many as I'd like.)
It strikes me that having a similar file (I'm not wedded to the format, but it should have all the same information, one way or another) released alongside the main data files would be really handy for all implementors - it would be a good way of validating consistency across multiple platforms, with the release data being canonical. For any platforms which didn't want to actually consume the rules as rules, but just wanted a list of transitions, it could even effectively replace their use of the data.
One other benefit: diffing the dump between two releases would make it clear what had changed in effect, rather than just in terms of rules.
One sticking point is size. The current file for Noda Time is about 4MB, although it zips down to about 300K. Some thoughts around this: • We wouldn't need to distribute it in the same file as the data - just as we have data and code file, there could be a "textdump" file or whatever we'd want to call it. These could be retroactively generated for previous releases, too. • As you can see, there's redundancy in the format above, in that it's a list of "zone intervals" (as I call them in Noda Time) rather than a list of transitions - the end of each interval is always the start of the next interval. • For zones which settle into an infinite daylight saving pattern, I currently generate from the start of time to 2100 (and then a single zone interval for the end of time as Noda Time understands it; we'd need to work out what form that would take, if any). If we decided that "year of release + 30 years" was enough, that would cut down the size considerably. Any thoughts? If the feeling is broadly positive, the next step would be to nail down the text format, then find a willing victim/volunteer to write the C code. (You really don't want me writing C...)
Jon
participants (18)
- Arthur David Olson
- Bradley White
- Brian Inglis
- Chris Rorvick
- Garrett Wollman
- Guy Harris
- Howard Hinnant
- Jon Skeet
- Lester Caine
- Martin Burnicki
- Martin Burnicki
- Paul Eggert
- Paul_Koning@dell.com
- Random832
- Steffen Nurpmeso
- Stephen Colebourne
- Stuart Bishop
- Tim Parenti