Proposed reversions, for moving forward


Tim Parenti

July 27, 2014

11:33 p.m.

I propose the attached patches, described as follows, be applied to the master branch of our repository so that we have a stable starting point from which to adopt better Git practices moving forward. To assist in review, I have pushed these commits to https://github.com/timparenti/tz-experimental/commits/revert-eggert-2014-07.

The first patch aims to revert, in their entirety, Paul’s three commits related to zone-linking, dated 2014-07-08, -09, and -15. Further work on this effort should be done on a new and separate branch.

The second patch aims to revert, in their entirety, Paul’s two commits related to the introduction of time.tab, dated 2014-07-18 and -19. Further work on this effort should be done on a new and separate branch.

The third, fourth, and fifth patches reapply changes cherry-picked from the above commits which relate to data and are less controversial.

These are relative to the current state of master at 87163d184d7cc0c704d2b505adcfb23528203951 <https://github.com/eggert/tz/commit/87163d184d7cc0c704d2b505adcfb23528203951>, dated Tue Jul 22 20:54:19 2014 -0700.

-- Tim Parenti

Attachments:

attachment.html (text/html — 1.5 KB)
0001-Revert-overzealous-zone-links.patch (application/octet-stream — 34.6 KB)
0002-Revert-introduction-of-time.tab.patch (application/octet-stream — 34.8 KB)
0003-A-few-more-changes-for-consistency.patch (application/octet-stream — 9.0 KB)
0004-Corrections-for-Hungary-and-a-source-for-Poland.patch (application/octet-stream — 6.5 KB)
0005-iso3166.tab-will-soon-switch-to-UTF-8.patch (application/octet-stream — 2.5 KB)


Paul Eggert

August 2014

5:21 a.m.

Thanks for those patches, which had a lot of thought behind them. I'm inclined to accept most of 0001-Revert-overzealous-zone-links on the grounds that the sheer size of the recent zone-to-link change is unprecedented and that this is off-putting. However, the general principle should remain what it's always been, which is that the database should contain good data and that it's OK to remove data that are questionable (e.g., no reliable sources) and are out of scope anyway. So I plan to keep a small part of the zone-to-link change, namely the part in west Africa, as its size is more in line with previous changes of this kind. We can do the rest of the zone-to-link changes later, as they're not urgent. This compromise solution won't make everybody happy (it certainly doesn't make *me* happy) but it is a reasonable path forward.

I'm not inclined to accept the 0002-Revert-introduction-of-time.tab patch, as that would leave Crozet Islands and the Scattered Islands uncovered by tzselect, and I'm loath to add a zone or link for them. Part of the point of the new table is to avoid the need to add new entries for tiny settlements and enclaves that merely mirror timekeeping elsewhere.

The other three patches (0003-A-few-more-changes-for-consistency, 0004-Corrections-for-Hungary-and-a-source-for-Poland, 0005-iso3166.tab-will-soon-switch-to-UTF-8) are mostly in the experimental version already, but I captured their comment fixes and found a couple more and came up with the attached first patch accordingly. The second attached patch implements the change for west Africa. I've pushed both of these into the experimental version on github.

At this point we're pretty much ready for a new release.

Stephen Colebourne

11:58 a.m.

The latest proposed 2014f version (as of 2014-08-01) can be viewed here: https://github.com/jodastephen/tzdiff/commit/39195c3177b935e856f02993994c4f6...

I note that the data for Africa/Timbuktu remains changed, and is not mentioned in NEWS. This looks like an error.

I have cross-checked the NEWS file to the other changes and they seem OK with the exception of the changes to links in West Africa, which remains a sore point.

Stephen

On 1 August 2014 06:21, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

Thanks for those patches, which had a lot of thought behind them. I'm inclined to accept most of 0001-Revert-overzealous-zone-links on the grounds that the sheer size of the recent zone-to-link change is unprecedented and that this is off-putting. However, the general principle should remain what it's always been, which is that the database should contain good data and that it's OK to remove data that are questionable (e.g., no reliable sources) and are out of scope anyway. So I plan to keep a small part of the zone-to-link change, namely the part in west Africa, as its size is more in line with previous changes of this kind. We can do the rest of the zone-to-link changes later, as they're not urgent. This compromise solution won't make everybody happy (it certainly doesn't make *me* happy) but it is a reasonable path forward.

I'm not inclined to accept the 0002-Revert-introduction-of-time.tab patch, as that would leave Crozet Islands and the Scattered Islands uncovered by tzselect, and I'm loath to add a zone or link for them. Part of the point of the new table is to avoid the need to add new entries for tiny settlements and enclaves that merely mirror timekeeping elsewhere.

The other three patches (0003-A-few-more-changes-for-consistency, 0004-Corrections-for-Hungary-and-a-source-for-Poland, 0005-iso3166.tab-will-soon-switch-to-UTF-8) are mostly in the experimental version already, but I captured their comment fixes and found a couple more and came up with the attached first patch accordingly. The second attached patch implements the change for west Africa. I've pushed both of these into the experimental version on github.

At this point we're pretty much ready for a new release.

Paul Eggert

1:35 a.m.

Stephen Colebourne wrote:

...

I note that the data for Africa/Timbuktu remains changed, and is not mentioned in NEWS.

Thanks, fixed with the attached patch.

...

...
...
- Africa/Accra gains DST between 1920 and 1935

Based on your explanation, this looks like a good change with a good justification. Can any more detail be added to NEWS?

NEWS contains one line about this, which should be enough for a heads-up. Details can be found in the comments in 'africa'.

...

your explanation does not indicate that replacement is right

Of course it's not 100% right. We don't have reliable information about old timestamps in Ghana. It would be better if we didn't have to put this low-quality data into the tz database at all. The only reason it's there is that the format requires it. Much of the pre-1970 data falls into this category, unfortunately. When the quality is this bad, there's nothing wrong with improving the quality even if the result is not perfect, or with removing bad data if this can be done without significantly affecting end-user applications.

Robert Elz

11:32 p.m.

Date: Sat, 02 Aug 2014 18:35:58 -0700
From: Paul Eggert <eggert@cs.ucla.edu>
Message-ID: <53DD91FE.90405@cs.ucla.edu>

| Of course it's not 100% right. We don't have reliable information about
| old timestamps in Ghana. It would be better if we didn't have to put
| this low-quality data into the tz database at all. The only reason it's
| there is that the format requires it.

Not really, since time_t's are (generally) signed, and these days, 64 bits, localtime() can be asked to convert any time going back to (way) before the big bang and is expected to produce some kind of meaningful results.

Of course, our data (currently anyway) assumes that the gregorian calendar existed since (apparently way before) the big bang, and converts dates based upon that assumption - which means we know it is producing utter nonsense for anything earlier than the 15th century (or whatever) and for some parts of the world, up to the early 20th century.

But we have to produce something - nonsense or not nonsense, what matters is that (at least for reasonable dates, say back a few thousand years BC or so) we get some kind of reasonably stable (and comparable) results, not just whatever random value happens to seem convenient today.

| Much of the pre-1970 data falls into this category, unfortunately.

Yes, it does, and if you really wanted to get rid of all unverified data, you'd remove all of it (from all zones) - the format requires that something be there, not any particular transitions. Just removing isolated segments of that unverified data looks wrong.

| When the quality is this bad, there's nothing wrong with improving the
| quality even if the result is not perfect, or with removing bad data if
| this can be done without significantly affecting end-user applications.

No, there wouldn't be if there was known bad data. But that's not what any of this is - no-one has a problem with correcting data that is known to be incorrect.

The problem is that that is not what any of this is. It isn't bad data, it is just data that we do not know is correct, and we guess might not be perfect. That guess might be right - or it might not be, that's the point - it is possible that (just by chance) you're removing some good data and replacing it with bad data. You don't know, I don't know.

I'd suggest just putting everything back, keep the results stable (if they're wrong at least they're the same wrong today as yesterday) and just replace data when it is known incorrect. If you're not going to do that, then at least do it properly, and delete all of the unverified data - ALL of it.

kre
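As a small illustration of the calendar point above, the following sketch uses Python's standard `datetime` module, which (like the conversions kre describes) extends the Gregorian calendar backwards indefinitely. It is only an analogy to what localtime() does with tz data, not a reproduction of it; the variable names are made up for this example.

```python
from datetime import date, timedelta

# Python's date arithmetic uses the proleptic Gregorian calendar:
# the Gregorian rules are applied to every year, including years
# before the calendar was introduced.
day_before = date(1582, 10, 4)
next_label = day_before + timedelta(days=1)   # simply the next label: 5 October

# In the countries that adopted the reform in 1582, the calendar jumped
# from 4 October (Julian) straight to 15 October (Gregorian), so the
# labels 5 to 14 October 1582 never appeared on calendars there.
first_gregorian_label = date(1582, 10, 15)
unused_labels = (first_gregorian_label - next_label).days

print(next_label.isoformat(), unused_labels)
```

The arithmetic is stable and comparable, which is the property kre asks for, even though the labels it produces do not match what anyone wrote on a calendar at the time.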

David Patte ₯

11:02 a.m.

I agree with Robert - totally logical

On 2014-08-03 19:32, Robert Elz wrote:

...

I'd suggest just putting everything back, keep the results stable (if they're wrong at least they're the same wrong today as yesterday) and just replace data when it is known incorrect. If you're not going to do that, then at least do it properly, and delete all of the unverified data - ALL of it.

kre

Eliot Lear

9:20 a.m.

Hi Robert,

On 8/4/14, 1:32 AM, Robert Elz wrote:

...

Yes, it does, and if you really wanted to get rid of all unverified data, you'd remove all of it (from all zones) - the format requires that something be there, not any particular transitions. Just removing isolated segments of that unverified data looks wrong.

What I like about Paul's approach is that he can make changes in perhaps small batches so that people can review them in bite-sized pieces.

Eliot

Paul Eggert

3:10 p.m.

Eliot Lear wrote:

...

On 8/4/14, 1:32 AM, Robert Elz wrote:

...
Yes, it does, and if you really wanted to get rid of all unverified data, you'd remove all of it (from all zones) - the format requires that something be there, not any particular transitions.

What I like about Paul's approach is that he can make changes in perhaps small batches so that people can review them in bite-sized pieces.

Yes, and the recent contretemps started mainly because of a batch that was too large for some in the audience. The latest proposal is a small fraction of the original proposal. I would like to continue to remove data lacking a reliable source -- a process that's been going on for some time -- but I guess it'll be one step at a time.

Stephen Colebourne

3:40 p.m.

On 5 August 2014 16:10, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

Yes, and the recent contretemps started mainly because of a batch that was too large for some in the audience. The latest proposal is a small fraction of the original proposal. I would like to continue to remove data lacking a reliable source -- a process that's been going on for some time -- but I guess it'll be one step at a time.

Er no. The argument has mostly been about the principles, not the size of the change. The size prevented both decent review and correct rollback, but both of those are as much about not using a sensible source code management strategy as the data itself. Simply ploughing on with the changes, just in smaller batches, does not actually make the objectors happy, it merely increases the noise and effort we all have to make. The point remains that replacing bogus data with other bogus data is nothing other than dumb from the perspective of many of those who ultimately consume the data. zic is but one consumer these days.

Stephen

Paul Eggert

4:31 p.m.

Stephen Colebourne wrote:

...

Simply ploughing on with the changes, just in smaller batches, does not actually make the objectors happy

It doesn't make *me* happy either. I'd prefer doing these cleanups all at once to get them over with, but I'm willing to live with doing them more gradually, the way they've been done for many years. No end-user problems have been reported in all that time, and there's no reason to expect significant problems in the future as we continue to muck out the stables. Given the above, if you'd also prefer to get it over with, then we can do that too (but in a later release of course).

Derick Rethans

10:43 a.m.

On Wed, 6 Aug 2014, Paul Eggert wrote:

...

Stephen Colebourne wrote:

...
Simply ploughing on with the changes, just in smaller batches, does not actually make the objectors happy

It doesn't make *me* happy either. I'd prefer doing these cleanups all at once to get them over with, but I'm willing to live with doing them more gradually, the way they've been done for many years.

...

No end-user problems have been reported in all that time, and there's no reason to expect significant problems in the future as we continue to muck out the stables.

I'll give you one here then, where PHP's CI started failing (more) since the upgrade to 2014f: https://travis-ci.org/php/php-src/jobs/31891549#L910

I realize it's because acronyms changed *only* here, but making changes most certainly has effects on other systems.

cheers,
Derick

Paul Eggert

4:40 p.m.

Derick Rethans wrote:

...

I realize it's because acronyms changed *only* here

Yes, and although it was merely a regression test, it helps bring better perspective to the recent discussion about removing questionable old data. I've long expected that 2014f's acronym changes will cause the most disruption to end users, that the Russia changes will be noticeable but not that painful, and that the removal of questionable old data will cause no significant problems in practice. We thought the old acronym entries were not right -- even though we couldn't *prove* this -- and so we improved the data as best we could. Although we didn't make the changes lightly, we valued correctness over stability even when we knew we didn't achieve 100% correctness. This has long been common practice in tz maintenance.

Alan Barrett

6:25 a.m.

On Thu, 07 Aug 2014, Paul Eggert wrote:

...

Although we didn't make the changes lightly, we valued correctness over stability even when we knew we didn't achieve 100% correctness. This has long been common practice in tz maintenance.

Yes, valuing correctness over stability is good, even when the new data is not 100% correct, provided it is more correct than the old data.

The stability-related complaints have been about cases where the "more correct than the old data" condition was not perceived to be satisfied.

I am gradually coming round to the opinion that the new data is probably more correct than the old data, but that is not clear to all observers.

--apb (Alan Barrett)

Eliot Lear

9:35 a.m.

Just for my own edification, it seems to me that there are some simple rules to follow. Let me know if I'm even close:

1. If the data is correct, there is no issue.
2. If the data is known to be false, it should come out.
3. If the data has no basis in fact (e.g., not known to be false but also no basis to believe it is true), it should come out.
4. If the data has conflicting historical viewpoints, it is a judgment call based on the quality of the reports.

Does that seem about right or did I get something wrong or miss something? To me, stability of false or seemingly fabricated information (2 & 3) is actually a bad thing because people cite the database in their works and we don't want to perpetuate that.

Eliot

On 8/8/14, 8:25 AM, Alan Barrett wrote:

...

On Thu, 07 Aug 2014, Paul Eggert wrote:

...
Although we didn't make the changes lightly, we valued correctness over stability even when we knew we didn't achieve 100% correctness. This has long been common practice in tz maintenance.

Yes, valuing correctness over stability is good, even when the new data is not 100% correct, provided it is more correct than the old data.

The stability-related complaints have been about cases where the "more correct than the old data" condition was not perceived to be satisfied.

I am gradually coming round to the opinion that the new data is probably more correct than the old data, but that is not clear to all observers.

--apb (Alan Barrett)


Stephen Colebourne

9:24 p.m.

On 8 August 2014 10:35, Eliot Lear <lear@cisco.com> wrote:

...

Just for my own edification, it seems to me that there are some simple rules to follow. Let me know if I'm even close:

1. If the data is correct, there is no issue.
2. If the data is known to be false, it should come out.
3. If the data has no basis in fact (e.g., not known to be false but also no basis to believe it is true), it should come out.
4. If the data has conflicting historical viewpoints, it is a judgment call based on the quality of the reports.

The problem with 2 & 3 is not that the TZDB maintainer thinks or knows the data to be false, but that there is nothing better to replace it with. In the absence of something better, leaving it alone seems wise.

Personally, I am OK with data such as 2 & 3 being removed so long as

a) LMT is preserved
b) the switch from LMT to a time-zone is chosen sensibly

Recent changes have not respected either of these points, because they have simply involved linking time zone A to some other (fairly randomly selected) time zone B.

While some may argue that LMT is a stupid concept, the reality is that the database format requires it, and it has been widely relied upon by consumers of the data. As such, LMT should be accurate, or technically at least accurate for each zone that differs beyond 1970 and for at least one zone per ISO-defined region.

Stephen
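To make the "linking zone A to zone B" concern concrete, here is a minimal sketch of how the wall-clock answer for one old timestamp changes when a zone's own LMT offset is replaced by that of a link target. The two offsets are illustrative values assumed for this example, not read from the tz database, and the names `ZONE_A_LMT`, `ZONE_B_LMT` and `wall_clock` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Illustrative offsets only (assumed for this sketch, not read from tz data).
ZONE_A_LMT = timezone(timedelta(minutes=-32))              # zone A's own LMT
ZONE_B_LMT = timezone(timedelta(minutes=-16, seconds=-8))  # LMT of link target B

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def wall_clock(unix_seconds: int, tz: timezone) -> str:
    """Render a UNIX timestamp as local wall-clock time under a fixed offset."""
    moment = EPOCH + timedelta(seconds=unix_seconds)
    return moment.astimezone(tz).strftime("%Y-%m-%d %H:%M:%S")


TIMESTAMP = -2_208_988_800  # 1900-01-01 00:00:00 UTC

before_link = wall_clock(TIMESTAMP, ZONE_A_LMT)  # answer while A is its own zone
after_link = wall_clock(TIMESTAMP, ZONE_B_LMT)   # answer once A is a link to B
print(before_link, after_link)
```

The identifier still resolves to a wall-clock time after the link is made; only the answer differs, which is the stability concern raised in this thread.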

Guy Harris

5:52 p.m.

On Aug 8, 2014, at 2:24 PM, Stephen Colebourne <scolebourne@joda.org> wrote:

...

While some may argue that LMT is a stupid concept, the reality is that the database format requires it, and it has been widely relied upon by consumers of the data. As such, LMT should be accurate, or technically at least accurate for each zone that differs beyond 1970 and for at least one zone per ISO-defined region.

Presumably meaning "accurate for some particular location in each zone...", as a sufficiently-wide zone contains locations whose LMT would differ by a second or more.
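Guy's point can be checked with simple arithmetic: mean solar time moves by 24 hours per 360 degrees of longitude, i.e. 240 seconds per degree, so two places 1/240 of a degree apart already differ by one second of LMT. The sketch below assumes that simple longitude rule; the function name is hypothetical and the Accra longitude is an approximate value used only for illustration.

```python
def lmt_offset_seconds(longitude_deg: float) -> int:
    """Local mean time offset from UTC in seconds, east positive.

    24 h per 360 degrees of longitude is 240 s per degree.
    """
    return round(longitude_deg * 240)


# Approximate longitude of Accra, about 0 degrees 13 minutes west (assumed value).
accra_offset = lmt_offset_seconds(-13 / 60)

# Two locations a tenth of a degree apart inside the same zone.
spread = lmt_offset_seconds(0.1) - lmt_offset_seconds(0.0)

print(accra_offset, spread)
```

Under this rule a zone only a fraction of a degree wide already has no single "correct" LMT, so any LMT entry is accurate for one chosen location at best.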

Zoidsoft

9:37 a.m.

On 9 August 2014 00:51, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

This is based not only on our experience with doing these tz changes in the past (we've done 'em, multiple times, for many years, with no problems reported); it's based also on my experience with the few applications that could conceivably use this old data (mostly astrology, but also earthquake records and the like)

Stephen Colebourne wrote:

This is a far too limited view of the usages of the data, perhaps that is part of the problem here. The reality is that millions of developers use this old data, it is just indirect rather than direct.

1) Most developers are not aware of the nuances of coding well using date and time, and especially not times in the past.

-------

Agreed. I used Jean Meeus's "Astronomical Algorithms" text to program date and time conversions using Julian dates which includes calendar conversions as well. If one is calculating sky positions for an observer in a given location then this time zone data is very important. Those who are astronomically sophisticated will generally use UTC, but if one wants to accurately represent sky positions coming from a wall clock from the past, the tzdata is very important.

There are other factors such as delta time which can only be guessed at and become more error prone the farther you go back into the past so tzdata inaccuracy is not the only problem, but one does the best they can. Tossing out data because it isn't authoritative enough defeats this particular purpose. What I have done in the Terran Atlas is highlight those questionable areas and bring up a popup warning so users can make a judgement call.

On Sat, Aug 9, 2014 at 1:52 PM, Guy Harris <guy@alum.mit.edu> wrote:

...

On Aug 8, 2014, at 2:24 PM, Stephen Colebourne <scolebourne@joda.org> wrote:

...
While some may argue that LMT is a stupid concept, the reality is that the database format requires it, and it has been widely relied upon by consumers of the data. As such, LMT should be accurate, or technically at least accurate for each zone that differs beyond 1970 and for at least one zone per ISO-defined region.

Presumably meaning "accurate for some particular location in each zone...", as a sufficiently-wide zone contains locations whose LMT would differ by a second or more.

Stephen Colebourne

10:18 a.m.

On 9 August 2014 18:52, Guy Harris <guy@alum.mit.edu> wrote:

...

On Aug 8, 2014, at 2:24 PM, Stephen Colebourne <scolebourne@joda.org> wrote:

...
While some may argue that LMT is a stupid concept, the reality is that the database format requires it, and it has been widely relied upon by consumers of the data. As such, LMT should be accurate, or technically at least accurate for each zone that differs beyond 1970 and for at least one zone per ISO-defined region.

Presumably meaning "accurate for some particular location in each zone...", as a sufficiently-wide zone contains locations whose LMT would differ by a second or more.

Exactly. See my LMT replacement proposal thread: http://mm.icann.org/pipermail/tz/2014-August/021346.html

Stephen

Tim Parenti

10:22 p.m.

On 8 August 2014 05:35, Eliot Lear <lear@cisco.com> wrote:

...

Just for my own edification, it seems to me that there are some simple rules to follow. Let me know if I'm even close:

1. If the data is correct, there is no issue.
2. If the data is known to be false, it should come out.
3. If the data has no basis in fact (e.g., not known to be false but also no basis to believe it is true), it should come out.
4. If the data has conflicting historical viewpoints, it is a judgment call based on the quality of the reports.

I think points 1, 2, and 4 are pretty much undisputed. Point 3 is the current point of contention, but I think we should be moving to a solution wherein data like this that somehow got into previous releases despite having no basis in fact should be removed from the main files and instead be added to a separate file for data of "dubious" provenance, which users can choose to use or ignore at-will.

-- Tim Parenti
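One way to read Eliot's four rules together with the separate-file suggestion is as a simple routing decision per datum. The sketch below is only an illustration of that reading; the `Status` categories, the `destination` function and the returned labels are hypothetical and do not correspond to any existing tz file or tool.

```python
from enum import Enum


class Status(Enum):
    VERIFIED = 1      # rule 1: believed correct, with a reliable source
    KNOWN_FALSE = 2   # rule 2: contradicted by a reliable source
    UNSOURCED = 3     # rule 3: no basis to believe it true or false
    CONFLICTING = 4   # rule 4: reliable reports disagree


def destination(status: Status) -> str:
    """Where a datum would go under the proposal sketched in this message."""
    if status is Status.VERIFIED:
        return "main"            # stays in the main data files
    if status is Status.KNOWN_FALSE:
        return "remove"          # corrected or taken out
    if status is Status.UNSOURCED:
        return "dubious"         # moved to the separate, optional file
    return "judgment call"       # decided case by case on report quality


for status in Status:
    print(status.name, "->", destination(status))
```

Only the third case differs from Eliot's list: instead of coming out entirely, unsourced data would move to a file that users can include or ignore.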

Lester Caine

8:58 a.m.

On 08/08/14 23:22, Tim Parenti wrote:

...

I think points 1, 2, and 4 are pretty much undisputed. Point 3 is the current point of contention, but I think we should be moving to a solution wherein data like this that somehow got into previous releases despite having no basis in fact should be removed from the main files and instead be added to a separate file for data of "dubious" provenance, which users can choose to use or ignore at-will.

The problem here is simply 'users can choose to use or ignore at-will' if that is done at a distribution level! I've never had a problem that some of the data's accuracy can be disputed, and I do feel that the ignoring of pre-1970 data has been overcome - the Russian data is nice to see - but if the answers returned when checking historic data on a timezone depend on what data has been loaded then we are back with the problem! I've reached the point where I'd rather see a 'not available' rather than a 'this is today's guess' so we KNOW we have to switch to an alternate source?

Paul - Getting things second correct is perhaps where some of the problem is coming from. I'd just be happy with hour correct. Anything better is just a bonus, but while 5 years ago normalizing data was probably a rarity, things like Derick's port of data to PHP has meant that we are using this as a reference today! In the future I CAN see a case for storing the version of tz used along with the timezone on a normalization. That may sound like overkill, but it is the only way to detect that something has changed ...

--
Lester Caine - G8HFL
-----------------------------
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

Paul Eggert

6:50 p.m.

Lester Caine wrote:

...

if the answers returned when checking historic data on a timezone depend on what data has been loaded then we are back with the problem!

This is inherent to the current way the tz database is used. Different software distributions use different versions of the tz database. They install the tz database with different build-time options, getting different sets of names. And distribution-specific patches can greatly affect old time stamps: for example, Android discards all parts of the database before 1901 (and after 2038!).
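The 1901 and 2038 limits mentioned above coincide with the range of a signed 32-bit count of seconds since 1970. The message does not say that this is the reason for Android's cut-offs, so treat that link as an assumption; the sketch below only shows where the two boundaries of such a counter fall.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Smallest and largest values of a signed 32-bit seconds counter.
INT32_MIN = -2**31
INT32_MAX = 2**31 - 1

earliest = EPOCH + timedelta(seconds=INT32_MIN)
latest = EPOCH + timedelta(seconds=INT32_MAX)

print(earliest.isoformat())
print(latest.isoformat())
```

Any transition outside that window cannot be represented in such a counter, so data trimmed to it gives different answers for old (and far-future) time stamps than the full database does.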

...

I'd rather see a 'not available' rather than a 'this is today's guess'

Even for future timestamps? Most of them are "today's guess".

...

I'd just be happy with hour correct.

The changes in question are pretty much all in that category.

...

I CAN see a case for storing the version of tz used along with the timezone

Absolutely, if one is serious about provenance. One should record the tz database version (including local changes and options) as part of the provenance. One way to record this sort of information is OPM <http://openprovenance.org/>. The tz database version is only a part of this problem, as in the typical case one also needs to record the C library version, kernel version, hardware version, etc.
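A minimal sketch of what recording the tz version alongside a normalized time stamp could look like is given below. It is not OPM and not any existing library's API; the `NormalizedTime` record, its fields and `needs_recheck` are hypothetical names, and the version strings are passed in by the caller rather than detected from the system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NormalizedTime:
    """A stored UTC instant plus the provenance of the conversion (hypothetical)."""
    utc_seconds: int                     # the normalized value
    zone: str                            # tz identifier used for the conversion
    tz_version: str                      # tz release in use, e.g. "2014e"
    local_patches: Tuple[str, ...] = ()  # distribution-specific changes, if known


def needs_recheck(record: NormalizedTime, current_version: str,
                  current_patches: Tuple[str, ...] = ()) -> bool:
    """True if the data used for the stored conversion differs from today's."""
    return (record.tz_version != current_version
            or record.local_patches != tuple(current_patches))


stored = NormalizedTime(utc_seconds=1_280_000_000, zone="Europe/Moscow",
                        tz_version="2014e")
print(needs_recheck(stored, "2014e"), needs_recheck(stored, "2014f"))
```

This only detects that the inputs changed, not whether the result changed; a flagged record would still have to be reconverted and compared.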

Lester Caine

9:13 p.m.

On 10/08/14 19:50, Paul Eggert wrote:

...

...
I CAN see a case for storing the version of tz used along with the timezone

Absolutely, if one is serious about provenance. One should record the tz database version (including local changes and options) as part of the provenance. One way to record this sort of information is OPM <http://openprovenance.org/>. The tz database version is only a part of this problem, as in the typical case one also needs to record the C library version, kernel version, hardware version, etc.

The problem is getting distributions to stop messing the version number up :( PHP's getversion is returning garbage on some distributions ...

--
Lester Caine - G8HFL

Lester Caine

7:56 a.m.

On 07/08/14 11:43, Derick Rethans wrote:

...

On Wed, 6 Aug 2014, Paul Eggert wrote:

...
...
Stephen Colebourne wrote:

...
...
Simply ploughing on with the changes, just in smaller batches, does not actually make the objectors happy

It doesn't make *me* happy either. I'd prefer doing these cleanups all at once to get them over with, but I'm willing to live with doing them more gradually, the way they've been done for many years. No end-user problems have been reported in all that time, and there's no reason to expect significant problems in the future as we continue to muck out the stables.

I'll give you one here then, where PHP's CI started failing (more) since the upgrade to 2014f:

https://travis-ci.org/php/php-src/jobs/31891549#L910

I realize it's because acronyms changed *only* here, but making changes most certainly has effects on other systems.

Derick - I still have some trouble marrying up the travis output, but only because I'm used to graphic tools which highlight the exact diffs. ;) All of these regressions are for 2010, so a period when we are well into what should be accurately known data? Certainly in a period when this data may well have been used live in the field. The problem that I see is now documenting that users may see the times they were using only 4 years ago change?

Paul - The whole point of regression tests is to establish why something changed and check that every change is more correct than the last. If not then the change should be reverted. The PHP tests are only covering a small time frame and given the amount of work needed to cross check every failure just on those tests I'd support Derick just replacing the old result set with a new one, but while blanket changes like that are justifiable for pre-1970 data, for 4 years ago?

--
Lester Caine - G8HFL

Paul Eggert

9:58 p.m.

Lester Caine wrote:

...

The whole point of regression tests is to establish why something changed and check that every change is more correct than the last.

Yes, absolutely. I can't decipher that regression-test output either, but Derick said it was due to 2014f's time zone abbreviation changes; these affect past, current and future time stamps, and are independent of the changes that removed questionable old data.

Derick Rethans

10:18 a.m.

On Fri, 8 Aug 2014, Paul Eggert wrote:

...

Lester Caine wrote:

...
The whole point of regression tests is to establish why something changed and check that every change is more correct than the last.

Yes, absolutely. I can't decipher that regression-test output either, but Derick said it was due to 2014f's time zone abbreviation changes; these affect past, current and future time stamps, and are independent of the changes that removed questionable old data.

Indeed. I will have a look whether there have been other bugs reported that were not just because of abbreviation changes.

cheers,
Derick

John Hawkinson

9:07 a.m.

I hesitate to suggest this, because I think it may be a bad idea (which is why I sat on it all week), but:

Paul Eggert <eggert@cs.ucla.edu> wrote on Tue, 5 Aug 2014 at 08:10:16 -0700 in <53E0F3D8.6080008@cs.ucla.edu>:

...

Eliot Lear wrote:

...
What I like about Paul's approach is that he can make changes in perhaps small batches so that people can review them in bite-sized pieces.

Yes, and the recent contretemps started mainly because of a batch that was too large for some in the audience. The latest proposal is a small fraction of the original proposal. I would like to continue to remove data lacking a reliable source -- a process that's been going on for some time -- but I guess it'll be one step at a time.

Perhaps this work should continue in bite-size portions on a branch until it is finally done, and only then should that branch be merged to the trunk and released.

--jhawk@mit.edu
John Hawkinson

Paul Eggert

9:20 p.m.

John Hawkinson wrote:

...

Perhaps this work should continue in bite-size portions on a branch until it is finally done

It's already done: I already did the work, and installed it into the experimental version, before reverting most of the changes. I'm afraid that incorporating these changes in bite-size portions won't overcome the philosophical objections that several people have to them. Fundamentally it's a matter of principle, not size.

Tim Parenti

10:25 p.m.

I'm going to attempt to synthesize a lot of the recent discussion with respect to where I stand, as well as one way we could proceed... On 8 August 2014 02:25, Alan Barrett <apb@cequrux.com> wrote:

...

Yes, valuing correctness over stability is good, even when the new data is not 100% correct, provided it is more correct than the old data.

The stability-related complaints have been about cases where the "more correct than the old data" condition was not perceived to be satisfied.

I am gradually coming round to the opinion that the new data is probably more correct than the old data, but that is not clear to all observers.

I, too, am gradually coming around to this stance, and for the same reasons. Among the reasons I'm not yet fully on board: On 5 August 2014 12:07, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

Marc Lehmann wrote:

...
I haven't seen anybody argue the new data is better.

It appears you overlooked some arguments in that direction; see < http://mm.icann.org/pipermail/tz/2014-August/021283.html> for example.

This post only addresses the changes to a few of the zones. If you assert that the rest of the changes are also better for similar reasons, that's one thing, but to date, I don't think this has been done. Depending on the nature of the assertion(s), they may or may not require fuller documentation to become convincing. On 6 August 2014 14:32, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

Lester Caine wrote:

...
If it is proven wrong because there is a proven correct version then OK, but switching one unproven fact with another ...

Those changes mostly remove dubious data, rather than replacing one dubious datum with another.

I believe the relevant point made by objectors here is that, by doing anything other than deleting the identifiers altogether, there is no such thing as "removing" data from an end user's perspective, again, because the format minimally requires that something be there. Date and time tools using tz will still output a wall clock time for a given tz identifier and historic UNIX timestamp, and the new assertions in this space are the ones which (in many cases) have no more proven merit than the old versions. I believe this distinction of the actual Zone line data (and zic'd binaries) we input from the end user "data" (and wall clock times) that tools output is an important one to make here, as it lies at the heart of much objection to these changes: On 6 August 2014 14:58, Lester Caine <lester@lsces.co.uk> wrote:

...

If the removing of dubious data results in the answers generated changing then that is the stability that is objected to. These changes resulting in new output that is only changing because of two lots of dubious states is the problem

As for the scope of the disruption caused by these changes, in general, I find it difficult to buy either argument that the supposed disruption is (a) so large as to prohibit the change, or (b) so small as to override all other concerns. Due to the age of the timestamps affected, I'm more inclined to lean toward the "small" side of this debate, but due to my cautious nature I'm also inclined to overestimate the impacts by an order-of-magnitude or two. Further, just because no one has complained to us does not mean issues don't exist, or have even been discovered by users. In the end, I think reality is somewhere in the middle, and that these are weak arguments on both sides. * * * On 6 August 2014 14:32, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

In the long run it'd be better to remove dubious data, or at least move it to a "dubious" area optionally available to users who prefer it; but one step at a time.

I think this is the direction we should move toward at this juncture. Perhaps frustratingly, the first task would be to restore the zone data (and associated commentary) removed in 2013e and 2014f to this new area. (I would have suggested "attic" for this file, but "dubious" is more straightforward and also avoids another file starting with "a").

...

From a build procedure standpoint, and to avoid disrupting the main database, I think the simplest approach would be to add a Makefile target which compiles the standard files as usual, then compiles the dubious data with a separate call to zic. If I am not mistaken, this would simply overwrite the binaries created from links in the standard files with binaries representing the more suspect data. It's a bit of unnecessary work for the compiler, yes, but this would simply factor into an administrator's decision whether to include this data.
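As a rough illustration of the two-pass build described above, here is a minimal sketch that only assembles the two zic command lines. The file name "dubious" is the hypothetical file proposed in this message, the list of standard source files is an assumed subset, and `zic_commands` is an invented helper, not part of the tz Makefile. Whether the second pass cleanly replaces link-derived binaries is the same assumption stated above.

```python
# Assumed subset of the standard tz source files.
STANDARD_FILES = ["africa", "antarctica", "asia", "australasia",
                  "europe", "northamerica", "southamerica"]
# Hypothetical file holding the relegated, poorly sourced zones.
DUBIOUS_FILES = ["dubious"]


def zic_commands(out_dir, include_dubious=False):
    """Return the zic invocations for a one-pass or two-pass build."""
    out = str(out_dir)
    commands = [["zic", "-d", out] + STANDARD_FILES]
    if include_dubious:
        # The second pass targets the same directory, so a Zone defined in
        # the dubious file would replace whatever pass one wrote under
        # that name (for example, the result of a Link).
        commands.append(["zic", "-d", out] + DUBIOUS_FILES)
    return commands


for cmd in zic_commands("/tmp/zoneinfo", include_dubious=True):
    # To actually build, hand each list to subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

An administrator who wants only the standard data would simply leave `include_dubious` at its default.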

On 5 August 2014 11:40, Stephen Colebourne <scolebourne@joda.org> wrote:

...

Simply ploughing on with the changes, just in smaller batches, does not actually make the objectors happy

In terms of maintenance procedure, I think we need to be just as cautious to observe due diligence as when we add new data. This takes on a different tone when relegating dubious data, but should still be of importance to the project. On 8 August 2014 05:07, John Hawkinson <jhawk@mit.edu> wrote:

...

Perhaps this work should continue in bite-size portions on a branch until it is finally done, and only then should that branch be merged to the trunk and released.

And I think that, once the new file and build procedure are in place, this is one reasonable way (among many) to move forward with this plan. -- Tim Parenti

Lester Caine

11:21 p.m.

That I think is a nice summary Tim All I would add is perhaps a reference to the amount of work required to fix the regressions in testing third party interfaces like PHP and Java and other platforms. It is the fact that we can not gauge if any of the changes do affect users that is the problem. That someone now gets a different answer in a grey area of the data may well not be important, but not knowing that a change HAS happened means that the effect of a change can not be reviewed? And perhaps the very evidence we are looking for discovered ... -- Lester Caine - G8HFL ----------------------------- Contact - http://lsces.co.uk/wiki/?page=contact L.S.Caine Electronic Services - http://lsces.co.uk EnquirySolve - http://enquirysolve.com/ Model Engineers Digital Workshop - http://medw.co.uk Rainbow Digital Media - http://rainbowdigitalmedia.co.uk

Brian Inglis

4:54 p.m.

New subject: Regression Tests and Revision Controls

On 2014-08-08 17:21, Lester Caine wrote:

...

That I think is a nice summary Tim

All I would add is perhaps a reference to the amount of work required to fix the regressions in testing third party interfaces like PHP and Java and other platforms. It is the fact that we can not gauge if any of the changes do affect users that is the problem. That someone now gets a different answer in a grey area of the data may well not be important, but not knowing that a change HAS happened means that the effect of a change can not be reviewed? And perhaps the very evidence we are looking for discovered ...

Perhaps some comments on the scale and effort required for commercial/professional regression test suites and SOPs would be appropriate. I would expect a regression test suite to run zic to list all transitions in all zones, diff that output against the baseline output, look for release notes explaining why each change has occurred, document that; follow up any unexplained diffs with the maintainer, and document that, then pass the diff and documentation to the product release manager, for review and reappraisal or approval to promote the change to a production release. If someone could contribute such a regression test script and baseline output, it would allow the maintainer to better evaluate the potential impact of any proposed changes on products using the package. I also agree that normal revision control practice is to create a (tagged) branch for each potential change, review the diffs against the trunk, make any other changes indicated and add to the branch, then add the branch to the proposed changes expected to be merged into the trunk in the next release. On release, add the release tag to everything in the trunk. This also has the benefit that downstream consumers of the packages can more easily create their own forks which could exclude some changes or include their own patches. Each branch change would also normally include the output of the regression test suite run against those changes and checked in with the code. -- Take care. Thanks, Brian Inglis
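A minimal sketch of the diff step in the workflow described above, not an existing tz tool. It assumes the baseline and candidate transition listings have already been produced as text (for instance with `zdump -v`, which is an assumption here; the message itself mentions zic) and that every line begins with the zone name. The sample lines and the `Example/...` zone names are invented for illustration.

```python
import difflib


def transition_diff(baseline, candidate):
    """Return unified-diff lines between two transition listings."""
    return list(difflib.unified_diff(
        baseline.splitlines(), candidate.splitlines(),
        fromfile="baseline", tofile="candidate", lineterm="", n=0))


def changed_zones(diff_lines):
    """Collect zone names (first field) from added or removed lines."""
    zones = set()
    # The first two lines of a unified diff are the file headers.
    for line in diff_lines[2:]:
        if line.startswith("@@"):
            continue  # hunk header, carries no zone data
        fields = line[1:].split()
        if fields:
            zones.add(fields[0])
    return sorted(zones)


# Invented sample listings: zone name, transition instant, offset.
baseline = "Example/A 1900-01-01T00:00:00Z +0100\nExample/B 1920-01-01T00:00:00Z +0200\n"
candidate = "Example/A 1900-01-01T00:00:00Z +0100\nExample/B 1920-01-01T00:00:00Z +0230\n"

diff = transition_diff(baseline, candidate)
print("\n".join(diff))
print("zones needing an explanation:", changed_zones(diff))
```

The list of changed zones is what a reviewer would then check against the release notes; anything unexplained goes back to the maintainer.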

Paul Eggert

7:14 p.m.

New subject: Regression Tests and Revision Controls

Brian Inglis wrote:

...

If someone could contribute such a regression test script and baseline output

That's been done in the past (contributed privately to me), but my impression is that it wasn't ready for widespread publication and use, and was meant more for internal private use. I do internal regression testing on an ad-hoc basis, and I've recently been talking with a Large Development Organization to try to do this sort of thing more regularly with the tz data, but we don't have anything publishable yet.

...

normal revision control practice is to create a (tagged) branch for each potential change

That may be normal in some organizations, but it is neither common everywhere nor needed here. That approach inevitably leads to configure-time complexity on downstream installers, developers, and users, complexity that in our case would cause more harm than the practice would cure. Because the tz database is so small and stable that it's feasible for one very-part-time volunteer to merge everything into a single branch that's almost always near production quality, tagged branches wouldn't bring that much to the table. It'd of course be OK if contributors wanted to take the time to make a tagged branch for each potential change, but as far as I know no tz contributor does that now and I'd rather not impose that kind of version-control bureaucracy on future contributors (including myself :-).

Lester Caine

8:39 a.m.

New subject: Regression Tests and Revision Controls

On 10/08/14 20:14, Paul Eggert wrote:

...

Because the tz database is so small and stable that it's feasible for one very-part-time volunteer to merge everything into a single branch that's almost always near production quality, tagged branches wouldn't bring that much to the table. It'd of course be OK if contributors wanted to take the time to make a tagged branch for each potential change, but as far as I know no tz contributor does that now and I'd rather not impose that kind of version-control bureaucracy on future contributors (including myself :-).

Modern tools allow a lot more analysis than used to be possible, and as well as the github view, there are other projects with clones of that. The next step is to provide the data with a different view, and that is one where one can simply plug in a version id and pull out data for that version. I don't think any of the current third-party sources have that capability? This would make managing the historic data a little easier, but in reality that is dwarfed these days by the annual changes going forward ... Recording the tz version with a timestamp only works if one can then decode it and update it if required. My own Hg DVCS view allows me to step through the history of file changes so I can check this manually but I'm back to the point I was at several years ago where I want the data in a relational database with a version tag, and all I need to add IS the diffs each release ... For some historic offsets we do need to indicate a level of uncertainty, which is why I was saying 'unknown' at which point the comfortable offset comes purely from its location. That tz provides a mean offset with perhaps an hour or more window is the point here. If a location is using time based on a central clock rather than LMT then there can be a substantial change in offset. There was a discussion about the sheer volume of data that could be added but recording when a particular location came into a tz entry is something that needs to be recorded outside of tz? It's only the evolution of those tz's themselves which forms part of the data? -- Lester Caine

Lester Caine

9:45 a.m.

New subject: Regression Tests and Revision Controls

On 11/08/14 09:39, Lester Caine wrote:

...

My own Hg DVCS view allows me to step through the history of file changes so I can check this manually but I'm back to the point I was at several years ago where I want the data in a relational database with a version tag, and all I need to add IS the diffs each release ...

OK I have been looking at this from the wrong end. Basically what I'm trying to flag up is when data that has already been normalized needs to be reviewed because of a change in tz. I'd been looking at historic data, but as Paul says ... future offsets are a guess already, and so it's this area I need to be targeting. The historic changes just come out in the wash. This may actually be a discussion that should be on the tzdist list, so I have copied there as well. That we need to know which version of tz was used to normalize a record has been a given, so when an update occurs we can identify which time zone has been updated and where in time the change happened. As a minimum flagging an historic record where it is identified with a tz version that has a change from the current version when it is used. We have been trying to get to a point where the historic tz data is stable and acceptable to everyone. From a tzdist point of view, I would contend that having proven historic data prior to 1970 is essential as not including it leaves a hole in how useful the information is. I presume that the tzdist discussion is based on prompt updates to data going forward, but would not exclude corrections to the historic material as well? Either way, picking up when stored information has changed is important? Having the information that there has been a change is one thing, using it is something else. -- Lester Caine
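The minimum flagging step described above could look something like the following sketch. Everything in it is hypothetical: the per-release change table, the `Example/...` zone names, and the `needs_review` helper are invented, and the release labels are only placeholders in the tz naming style. It answers "did this zone change since the record was normalized" and says nothing about where in time the change falls.

```python
# Placeholder release labels, oldest to newest.
RELEASES = ["2014d", "2014e", "2014f"]

# Hypothetical table: release -> zones whose data changed in that release.
CHANGED = {
    "2014e": {"Example/Alpha"},
    "2014f": {"Example/Beta"},
}


def needs_review(zone, normalized_with, current):
    """True if `zone` changed in any release after `normalized_with`,
    up to and including `current`."""
    start = RELEASES.index(normalized_with) + 1
    end = RELEASES.index(current) + 1
    return any(zone in CHANGED.get(rel, set()) for rel in RELEASES[start:end])


# A record normalized with the oldest release, checked against the newest.
print(needs_review("Example/Beta", "2014d", "2014f"))
# A record normalized after its zone's last change needs no review.
print(needs_review("Example/Alpha", "2014e", "2014f"))
```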

Paul Eggert

5:52 a.m.

New subject: Regression Tests and Revision Controls

Lester Caine wrote:

...

That we need to know which version of tz was used to normalize a record has been a given

Although it's been an assumption I doubt whether it's a given, as (A) the versioning is complicated, and (B) the version info is often not readily available. Here's a bit more about (A). The tz database is typically massaged by downstream distributions before users see it. Android discards all data outside the 32-bit window (roughly, before 1901 or after 2038). Other distributions discard data before 1970, or add their own zones, or cherry-pick upstream changes, or invent their own changes. Here's an example of versions being distributed today. Debian currently sports four distinct tzdata versions which it calls 2014e-0squeeze1, 2014e-0wheezy1, 2014e-1, and 2014f-1, each derived from an upstream release and with Debian-specific changes; see <https://packages.debian.org/search?keywords=tzdata>. Ubuntu, which is downstream from Debian, currently has seven different versions with Ubuntu-specific changes, which it calls 2014e-0ubuntu0.10.04, 2014e-0ubuntu0.12.04, 2013b-1ubuntu1, 2013g-0ubuntu0.13.04, 2014e-0ubuntu0.13.10, 2014e-0ubuntu0.14.04, and 2014f-1; see <http://packages.ubuntu.com/search?keywords=tzdata>. And Linux Mint, which is downstream from Ubuntu, has its own four versions 2010i-1, 2011n-0ubuntu0.11.10, 2013b-1ubuntu1, and 2014b-1; see <http://community.linuxmint.com/software/view/tzdata>. That's just three distributors; there are dozens more. And if you care about Java, PHP, Go, etc., then multiply everything by another factor, as these subsystems have their own database copies with their own idiosyncrasies. Good luck to anybody who wants to keep track of what all those versions actually mean. And that's just (A). I suspect that (B) is a bigger kettle of fish.
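To make point (A) concrete, here is a small sketch that recovers the upstream release from some of the package version strings quoted above and groups the packages by it. It assumes the upstream part is always a four-digit year followed by one or more lowercase letters at the start of the string; that pattern fits the strings listed here but is not a documented guarantee of any distribution. The `upstream_release` helper is invented for illustration.

```python
import re
from collections import defaultdict

# Assumed shape of the upstream prefix: four digits, then letters.
UPSTREAM = re.compile(r"^(\d{4}[a-z]+)")


def upstream_release(package_version):
    """Return the upstream tz release prefix, or None if not recognized."""
    match = UPSTREAM.match(package_version)
    return match.group(1) if match else None


# Version strings taken from the message above.
versions = ["2014e-0squeeze1", "2014e-0wheezy1", "2014e-1", "2014f-1",
            "2013b-1ubuntu1", "2011n-0ubuntu0.11.10"]

by_release = defaultdict(list)
for version in versions:
    by_release[upstream_release(version)].append(version)

for release in sorted(by_release):
    print(release, "<-", ", ".join(by_release[release]))
```

Even with the prefix recovered, the suffix says nothing about which distribution-specific changes were applied, which is the heart of the difficulty.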

Lester Caine

7:10 a.m.

New subject: [Tzdist] Regression Tests and Revision Controls

On 13/08/14 06:52, Paul Eggert wrote:

...

Lester Caine wrote:

...
That we need to know which version of tz was used to normalize a record has been a given

Although it's been an assumption I doubt whether it's a given, as (A) the versioning is complicated, and (B) the version info is often not readily available.

Here's a bit more about (A). The tz database is typically massaged by downstream distributions before users see it. Android discards all data outside the 32-bit window (roughly, before 1901 or after 2038). Other distributions discard data before 1970, or add their own zones, or cherry-pick upstream changes, or invent their own changes.

Here's an example of versions being distributed today. Debian currently sports four distinct tzdata versions which it calls 2014e-0squeeze1, 2014e-0wheezy1, 2014e-1, and 2014f-1, each derived from an upstream release and with Debian-specific changes; see <https://packages.debian.org/search?keywords=tzdata>. Ubuntu, which is downstream from Debian, currently has seven different versions with Ubuntu-specific changes, which it calls 2014e-0ubuntu0.10.04, 2014e-0ubuntu0.12.04, 2013b-1ubuntu1, 2013g-0ubuntu0.13.04, 2014e-0ubuntu0.13.10, 2014e-0ubuntu0.14.04, and 2014f-1; see <http://packages.ubuntu.com/search?keywords=tzdata>. And Linux Mint, which is downstream from Ubuntu, has its own four versions 2010i-1, 2011n-0ubuntu0.11.10, 2013b-1ubuntu1, and 2014b-1; see <http://community.linuxmint.com/software/view/tzdata>.

That's just three distributors; there are dozens more. And if you care about Java, PHP, Go, etc., then multiply everything by another factor, as these subsystems have their own database copies with their own idiosyncrasies. Good luck to anybody who wants to keep track of what all those versions actually mean.

And that's just (A). I suspect that (B) is a bigger kettle of fish.

All that just strengthens the argument for a properly managed distribution system? And perhaps at the same time the major hole caused by only providing a time offset in the browser header rather than a user's actual timezone can be addressed? It's not a lot of use giving correct timezone data if you have no idea of the client's offset in 6 months' time :( -- Lester Caine

Peter Ilieve

5:28 p.m.

New subject: Regression Tests and Revision Controls

On 12 Aug 2014, at 10:45, Lester Caine <lester@lsces.co.uk> wrote:

...

We have been trying to get to a point where the historic tz data is stable and acceptable to everyone. …

I don’t think that is an achievable goal. I write that as someone who broke the stability of UK historical data by finding a lot of the old Summer Time Orders and other regulations, work continued and pretty much finished by Joseph Myers. That sort of thing will probably, and should, continue. Some keen youngster in Elbonia will discover the tz data (maybe her Linux box doesn’t follow a short-notice summer time change) and will remember that her grandfather was a senior official in the Elbonian ministry of the interior, and there is a trunk full of his old stuff in her parents’ attic. Before we know it we will have decades of accurate history for Elbonia, traceable to primary sources, but different from the current tz data. That is a Good Thing, beating stability any day. Peter Ilieve

Lester Caine

6:26 a.m.

New subject: Regression Tests and Revision Controls

On 13/08/14 18:28, Peter Ilieve wrote:

...

...
We have been trying to get to a point where the historic tz data is stable and acceptable to everyone. …

I don’t think that is an achievable goal.

I write that as someone who broke the stability of UK historical data by finding a lot of the old Summer Time Orders and other regulations, work continued and pretty much finished by Joseph Myers.

That sort of thing will probably, and should, continue. Some keen youngster in Elbonia will discover the tz data (maybe her Linux box doesn’t follow a short-notice summer time change) and will remember that her grandfather was a senior official in the Elbonian ministry of the interior, and there is a trunk full of his old stuff in her parents’ attic. Before we know it we will have decades of accurate history for Elbonia, traceable to primary sources, but different from the current tz data. That is a Good Thing, beating stability any day.

That is perhaps my point. There is more and more historic data coming out of the woodwork and some of the less developed countries seem to be doing a better job of scanning and indexing that material so it's freely available. tzdist has to be based on a complete set of available history. Something that has a cutoff of 1970 will not be acceptable, at which point the mechanism for recording and updating the underlying documentation becomes more important. -- Lester Caine

Paul Eggert

5:15 p.m.

New subject: Regression Tests and Revision Controls

Lester Caine wrote:

...

Something that has a cutoff of 1970 will not be acceptable

The proposed 'backzone' file does not have a cutoff of 1970. Please see: http://mm.icann.org/pipermail/tz/2014-August/021388.html

Lester Caine

10:16 p.m.

New subject: Regression Tests and Revision Controls

On 14/08/14 18:15, Paul Eggert wrote:

...

...
Something that has a cutoff of 1970 will not be acceptable

The proposed 'backzone' file does not have a cutoff of 1970. Please see:

http://mm.icann.org/pipermail/tz/2014-August/021388.html

This is probably a discussion for tzdist ... My reading of the paperwork so far is that it is planned to have a full range base, and then sources that are supplying truncated data identify the fact, but what is not clear is how new historic data that identifies that a current timezone now has two historic routes. The planning seems to be based on tz zones, but as with all of these things there is no planning to manage tz identifiers. Can we not just describe 'backzone' as historic data. That the provenance of some material is 'poorly-sourced' just needs to be properly flagged within the file. The source comments forming part of the main files need to be reproduced in the history file to identify where data IS well sourced. I'm looking at the Jersey and Guernsey data there, but a quick scan on other historic entries which currently have no comments seem to have the source material in their main files? Hopefully what is then left is a better identified residue of suspect entries? -- Lester Caine

Paul Eggert

6:46 a.m.

New subject: Regression Tests and Revision Controls

Lester Caine wrote:

...

My reading of the paperwork so far is that it is planned to have a full range base,

Sorry, I don't know what a 'full range base' is, but there are no plans for a full coverage of all civil time stamps in every era. My only plan for backzone now is to be a repository for questionable/out-of-scope (pre-1970) data that we can easily move out of the tz database proper, one step at a time. If people want to donate other data (within reason; see below) that'd be fine, but it's not something I want to spend time on. The backzone file is out of scope for this project and if managing it becomes a significant burden we should spin it out into a separate project.

...

what is not clear is how new historic data that identifies that a current timezone now has two historic routes.

For current backzone entries this can be deduced from pre1970.tab. If backzone gets more complicated I suggest adding commentary to it in case a zone's boundaries are not clear. Or if that's not enough, do a database redesign as a separate project.

...

The planning seems to be based on tz zones, but as with all of these things there is no planning to manage tz identifiers.

The current approach will not scale. It's meant only for relatively modest growth, where we can manage identifiers the same way we've always managed them. Sorry, but a database redesign will be needed to cover appreciable quantities of pre-1970 data.

...

Can we not just describe 'backzone' as historic data.

Sorry, I don't follow. The tz database is almost all historic data.

...

That the provenance of some material is 'poorly-sourced' just needs to be properly flagged within the file.

It is, in commentary within the file. If those comments are not clear please feel free to submit patches; see the CONTRIBUTING file.

Paul Eggert

11:51 p.m.

Tim Parenti wrote:

...

If you assert that the rest of the changes are also better for similar reasons

No, that's too strong. I assert only that the rest of the changes are so small that they won't cause significant problems in practice with real-world time stamps from the era. This is based not only on our experience with doing these tz changes in the past (we've done 'em, multiple times, for many years, with no problems reported); it's based also on my experience with the few applications that could conceivably use this old data (mostly astrology, but also earthquake records and the like), and on my reading of contemporaneous sources. Timekeeping simply wasn't that accurate back then. The changes in question alter timestamps by a few minutes in areas where timekeeping was so sloppy that people at the time wouldn't have noticed or cared about the change. This attitude toward timekeeping still persists in some parts of the world. Last month I talked to someone who recently lived in smaller cities of Ethiopia. Many residents have reasonably high-precision time available on their cell phones. They ignore it, and use Ethiopian time -- some use Arab time, which is equally imprecise -- so that meetings are scheduled to a precision of an hour or three, maybe, if you're lucky. This is the normal state of affairs for most of the timestamps under discussion, except that people back then didn't even have cell phones to ignore. This is why I have no qualms about the experimental post-2014f Russia changes <https://github.com/eggert/tz/commit/3c0c83726c1746c77d9b3ebca085d9a041d6f6e7>. They're small changes to old time stamps, and they're not going to break practical applications, even if they happen to impose Bolshevik timestamps on White Army areas, which some probably do.

...

there is no such thing as "removing" data from an end user's perspective

This objection would have merit if end users cared about this data to 1-second precision. But they don't. And they're right to not care.

...

Perhaps frustratingly, the first task would be to restore the zone data (and associated commentary) removed in 2013e and 2014f to this new area.

I already did that, privately, a few weeks ago. This shouldn't be limited to data removed in the last year or two -- it should contain all dubious data ever removed, going back to the 1990s. (I've done that too.)

...

the simplest approach would be to add a Makefile target which compiles the standard files as usual, then compiles the dubious data with a separate call to zic.

Yes, I've done that too, and I'd be fine with that.

Tim Parenti

12:13 a.m.

On 8 August 2014 19:51, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

Timekeeping simply wasn't that accurate back then. The changes in question alter timestamps by a few minutes in areas where timekeeping was so sloppy that people at the time wouldn't have noticed or cared about the change.

These comments, along with the rest of your message, help greatly in my personal understanding. I can't speak for the others, but I think that this articulates much of what I have been missing. On 8 August 2014 19:51, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

there is no such thing as "removing" data from an end user's perspective

...
This objection would have merit if end users cared about this data to 1-second precision. But they don't. And they're right to not care.

Really, the only "losses" I can think of would be to those who converted ancient timestamps to UNIX time and simply stored them away in a dusty database without a second thought. The losses they incur with these changes may well be their own fault for not adhering to best practices (which we should also try to encourage), but we should at least weigh the effects on even shoddy implementations. Perhaps this was indeed considered, but this wasn't clear. I still think being more proactive in discussing the content of proposed changes before committing them would help, but moving forward with a secondary area for suspect data sidesteps that issue somewhat. I wholeheartedly agree that, for the vast majority of applications, cultures, and traditions, people would be right not to care about this. But I will simply restate for the record that I find any arguments that they do or don't to be pretty weak. Fortunately, from the rest of your response, I am (mostly) satisfied, provided we do indeed aim to move forward with a secondary area for data of dubious provenance. -- Tim Parenti

John Alvord

8:14 p.m.

For those interested enough, there should be a variorum database for a permanent record of all the data and arguments. John Alvord On Aug 9, 2014 1:28 AM, "Tim Parenti" <tim@timtimeonline.com> wrote:

...

On 8 August 2014 19:51, Paul Eggert <eggert@cs.ucla.edu> wrote:

...
Timekeeping simply wasn't that accurate back then. The changes in question alter timestamps by a few minutes in areas where timekeeping was so sloppy that people at the time wouldn't have noticed or cared about the change.

These comments, along with the rest of your message, help greatly in my personal understanding. I can't speak for the others, but I think that this articulates much of what I have been missing.

On 8 August 2014 19:51, Paul Eggert <eggert@cs.ucla.edu> wrote:

...
there is no such thing as "removing" data from an end user's perspective

...
This objection would have merit if end users cared about this data to 1-second precision. But they don't. And they're right to not care.

Really, the only "losses" I can think of would be to those who converted ancient timestamps to UNIX time and simply stored them away in a dusty database without a second thought. The losses they incur with these changes may well be their own fault for not adhering to best practices (which we should also try to encourage), but we should at least weigh the effects on even shoddy implementations. Perhaps this was indeed considered, but this wasn't clear. I still think being more proactive in discussing the content of proposed changes before committing them would help, but moving forward with a secondary area for suspect data sidesteps that issue somewhat.

I wholeheartedly agree that, for the vast majority of applications, cultures, and traditions, people would be right not to care about this. But I will simply restate for the record that I find any arguments that they do or don't to be pretty weak. Fortunately, from the rest of your response, I am (mostly) satisfied, provided we do indeed aim to move forward with a secondary area for data of dubious provenance.

-- Tim Parenti

Tim Parenti

11:41 p.m.

On 9 August 2014 04:58, Lester Caine <lester@lsces.co.uk> wrote:

...

The problem here is simply 'users can choose to use or ignore at-will' if that is done at a distribution level! I've never had a problem that some of the data's accuracy can be disputed, and I do feel that the ignoring of pre-1970 data has been overcome - the Russian data is nice to see - but if the answers returned when checking historic data on a timezone depend on what data has been loaded then we are back with the problem!

Ah, that is a good point. To avoid such a mish-mash, we could strongly encourage those maintaining distributions to use only the standard data. If people are really eager to maintain an additional "tz-extended" distribution alongside that, so be it.

On 9 August 2014 16:14, John Alvord <johngrahamalvord@gmail.com> wrote:

...

For those interested enough, there should be a variorum database for a permanent record of all the data and arguments.

IMHO, that's probably overkill. At some point, it's incumbent upon developers to adhere to best practices, and simply storing a historical UNIX timestamp under the assumption that corresponding wall clock times will never change is naïve at best. Storing UNIX timestamps alongside tz identifiers and ISO-formatted timestamps should be enough to be able to detect any resulting divergence if it's absolutely critical. Then it's up to each developer to decide which they think is more correct for their purposes.

-- Tim Parenti
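A minimal sketch of the storage practice described above, assuming Python 3.9+ and its standard `zoneinfo` module. The names `StoredInstant`, `make_record`, and `divergence` are hypothetical and used only for illustration. Each record keeps the UNIX timestamp, the tz identifier, and the ISO-formatted wall-clock time; re-deriving the wall-clock time under whatever tz data is currently installed and comparing it with the stored string reveals any divergence.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


@dataclass(frozen=True)
class StoredInstant:
    """Hypothetical record layout: all three values are stored together."""
    unix_ts: int     # seconds since 1970-01-01T00:00:00Z (may be negative)
    tz_id: str       # tz identifier, e.g. "Africa/Abidjan"
    local_iso: str   # wall-clock time as it was understood when stored


def make_record(local: datetime, tz_id: str, resolve=ZoneInfo) -> StoredInstant:
    """Build a record from a naive wall-clock time, using the tz rules
    that `resolve` provides at the moment of writing."""
    aware = local.replace(tzinfo=resolve(tz_id))
    return StoredInstant(int(aware.timestamp()), tz_id, local.isoformat())


def divergence(rec: StoredInstant, resolve=ZoneInfo) -> timedelta:
    """Re-derive the wall-clock time from the UNIX timestamp under the
    tz rules available now, and return how far it is from the stored one."""
    utc = UNIX_EPOCH + timedelta(seconds=rec.unix_ts)
    local_now = utc.astimezone(resolve(rec.tz_id)).replace(tzinfo=None)
    return local_now - datetime.fromisoformat(rec.local_iso)
```

A nonzero result from `divergence` only signals that the rules for that identifier have changed since the record was written; which of the two values to trust remains the application's decision. The `resolve` parameter is an assumption added here so that alternative rule sets can be substituted when comparing.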

Stephen Colebourne

9:33 a.m.

On 9 August 2014 00:51, Paul Eggert <eggert@cs.ucla.edu> wrote:

...

This is based not only on our experience with doing these tz changes in the past (we've done 'em, multiple times, for many years, with no problems reported); it's based also on my experience with the few applications that could conceivably use this old data (mostly astrology, but also earthquake records and the like)

This is a far too limited view of the usages of the data; perhaps that is part of the problem here. The reality is that millions of developers use this old data; it is just indirect rather than direct.

1) Most developers are not aware of the nuances of coding well using date and time, and especially not times in the past.
2) Java, and many other languages, have a date/time system that allows any date to be requested, including into the very far past.
3) When requesting a far past date, a time-zone offset is associated with the time in many cases. That offset often will be LMT.
4) When LMT is changed, that change is visible.

Saying that the developers are doing it wrong may be true, but it is also meaningless. We cannot fix the code or mental models of millions of developers. We can say that the developers do not really care about the offset they are given per se; however, it is likely that they will not have explicitly coded for the data to change. Most of the time, it all works out, but sometimes it won't. Any issues will be handled by individual developers, and we will never hear about them here on this list.

This is a big issue as the date that is being used as the cutoff date (1970) is really not that long ago. Many people have birth dates before that date, and so have the potential to be affected; i.e., we are discussing timestamps of living people (recent changes affected up to the 1960s).

What we need is a consistent approach to data before 1970 that will result in stability going forward. If we had that, then a one-off change to reach the stable state becomes desirable, rather than undesirable.

Stephen

Paul Eggert

7:14 a.m.

Stephen Colebourne wrote:

...

This is a big issue as the date that is being used as the cutoff date (1970) is really not that long ago. Many people have birth dates before that date, and so have the potential to be affected.

This is too hypothetical, I'm afraid. People's pre-1970 birthdays are typically recorded as dates, not as POSIX time stamps. (That's true even for post-1970 birthdays.) Although it's conceivable that an application could misuse the data in the way you describe, it'd be pretty weird to do that, and in practice it's not a significant problem. We shouldn't let hypothetical misuse like that get in the way of cleaning out bogus and/or out-of-scope data.

Alan Barrett

9:51 a.m.

On Fri, 08 Aug 2014, Tim Parenti wrote:

...

On 6 August 2014 14:32, Paul Eggert <eggert@cs.ucla.edu> wrote:

...
Those changes mostly remove dubious data, rather than replacing one dubious datum with another.

I believe the relevant point made by objectors here is that, by doing anything other than deleting the identifiers altogether, there is no such thing as "removing" data from an end user's perspective, again, because the format minimally requires that something be there.

This point bears repeating. Removing a line from the zic input might appear to be removing dubious data, but from another viewpoint it's changing, not removing, the results you get from converting timestamps back and forth between UTC and local time.

--apb (Alan Barrett)
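The point can be illustrated with a toy model. This is a sketch only, not how zic or any tz library is implemented: the helper `utc_to_local`, the rule layout, and the offset values are assumptions chosen for illustration, and the cutoffs are expressed in UTC for simplicity. Each "line" gives an offset that applies until a cutoff instant; deleting the earliest line does not make old timestamps unconvertible, it just makes the next line answer for them.

```python
from datetime import datetime, timedelta

# Toy rule set: (until_utc, offset). A cutoff of None means "no end".
# Values are illustrative, not copied from the tz database.
WITH_LMT = [
    (datetime(1912, 1, 1), timedelta(minutes=-16, seconds=-8)),  # LMT-style line
    (None, timedelta(0)),                                        # later standard time
]
WITHOUT_LMT = WITH_LMT[1:]  # the "dubious" first line has been deleted


def utc_to_local(utc: datetime, lines) -> datetime:
    """Return the local wall-clock time using the first line whose
    cutoff has not yet been reached."""
    for until_utc, offset in lines:
        if until_utc is None or utc < until_utc:
            return utc + offset
    raise ValueError("no rule covers this instant")


instant = datetime(1900, 1, 1, 12, 0, 0)      # naive datetime, read as UTC
before = utc_to_local(instant, WITH_LMT)      # 1900-01-01 11:43:52
after = utc_to_local(instant, WITHOUT_LMT)    # 1900-01-01 12:00:00
```

Both rule sets return an answer for the same instant; only the answer differs, which is the sense in which the data has been changed rather than removed.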

46 comments

16 participants

participants (16)

Alan Barrett
Brian Inglis
David Patte ₯
Derick Rethans
Derick Rethans
Eliot Lear
Guy Harris
John Alvord
John Hawkinson
Lester Caine
Paul Eggert
Peter Ilieve
Robert Elz
Stephen Colebourne
Tim Parenti
Zoidsoft