Formal models and the need for storing original source data
---------- Forwarded message ----------
From: Robert Elz <kre@munnari.oz.au>
To: Lester Caine <lester@lsces.uk>
Cc: tz@iana.org
Date: Wed, 03 Jun 2020 19:31:43 +0700
Subject: Re: [tz] How good is pre-1970 time zone history represented in TZ database?

    Date: Wed, 3 Jun 2020 12:22:32 +0100
    From: Lester Caine <lester@lsces.uk>
    Message-ID: <3576438c-e0bd-7188-2153-43d80e9054b8@lsces.uk>
There are two kinds of changes that may be made to a zone. One affects only future timestamps, and is the common garden-variety "government interference", or however you think of it. These are far and away the most common changes. When we receive insufficient notice of such a change, it may have to be applied retrospectively, but when that happens everyone tends to be very aware that there were bad time conversions for a while. In any case, old historic stored data isn't affected at all by this kind of change; it is as valid (or not) after the change as it was before.
You can also have data recorded in the past which refers to the future, so even a change that affects only future timestamps can still invalidate existing stored data.
The second kind of change is a correction to historic data. This happens when we discover an error in what was present (and these days, almost only ever affects pre-1970 timestamps).
In those cases, if someone had stored the UTC-converted form of some local timestamp, then after the correction they wouldn't get back the data that was originally used to produce it.
The problem there is having discarded the original data instead of retaining it. Always retain the original source data. Then by all means, when computing, convert timestamps from their various local values to UTC so they can be more easily correctly ordered (or whatever) but use those converted values only for transient computations. Store the original. Always.
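The store-the-original discipline described above can be sketched in a few lines of Python. This is only an illustration (the record shape and helper names are hypothetical, not anything from tzdata itself): persist the wall-clock time plus a zone identifier, and derive UTC on demand for ordering.

```python
# Sketch (hypothetical schema): keep the original wall-clock time and zone,
# derive UTC only transiently, and never persist the derived value.
from datetime import datetime
from zoneinfo import ZoneInfo

def stored_record(wall: str, zone: str) -> dict:
    """Persist exactly what the source gave us: local time plus zone."""
    return {"wall": wall, "zone": zone}

def as_utc(rec: dict) -> datetime:
    """Transient conversion for ordering/comparison; recomputed on demand,
    so a later tzdata correction automatically takes effect."""
    local = datetime.fromisoformat(rec["wall"]).replace(tzinfo=ZoneInfo(rec["zone"]))
    return local.astimezone(ZoneInfo("UTC"))

recs = [stored_record("2020-06-03T19:31:43", "Asia/Bangkok"),
        stored_record("2020-06-03T12:22:32", "Europe/London")]
recs.sort(key=as_utc)  # order by derived UTC, but store only the originals
```

Because `as_utc` is recomputed from the stored original each time, fixing an error in the zone's history changes only the derived values, never the data.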
If that is done correctly, then after a correction to old data, the results might be different than they were before - but that's only because they were wrong before, and (hopefully) better after the fix.
I agree 100%. I've been struggling for years to really get my head wrapped around this stuff, and finally decided to create a formal model using abstract data types. I'm finding that approach to be quite helpful, so I wrote it up in a couple of essays that folks on this list might find interesting. In fact, the folks on this list may well be the *only* people who would find them interesting. :-)

The first is aimed at the general programmer, but it introduces the fundamental concepts: https://drive.google.com/file/d/1WntAyhIawYtbL2k3fPE71cM9EFkVCJGQ/view?usp=s...

The second is aimed at time geeks and goes much deeper into the underlying theory: https://drive.google.com/file/d/1aOj9YeDFUST0lQFXZiUsCSmxjnzlbIe4/view?usp=s...

These are still in draft form, but I hope others find some value in them. (Teaser: the model makes no reference to years, days, hours, or seconds. How's that for abstract?)

What is rather interesting (and reassuring) is that the formal model ends up telling us pretty much what this thread has been saying: store original data. However, by looking at things in terms of abstract data types, you can easily see *why* you need to do that without having to resort to specific examples, and many seemingly unrelated cases can be seen to be special cases of the same underlying phenomenon. The model also points you to the very specific places in your design and code where you have to watch out for problems.
The only time it makes sense to store timestamps in other than the original form is when we *know* that the conversion is correct (and hence, no later correction will change it). For users of tzdata that really only applies to post-1970 timestamps.
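The hazard of persisting a converted value can be shown with a deliberately toy model. The zone name and offsets below are entirely made up; the point is only the mechanism: a historical correction changes the local-to-UTC mapping, so a UTC value persisted under the old rules no longer reproduces the original local timestamp.

```python
# Toy illustration (zone and offsets hypothetical, in minutes east of UTC):
# the same local timestamp converts differently under two rules versions.
rules_2019 = {"Example/City": 60}   # old belief: UTC+01:00
rules_2020 = {"Example/City": 75}   # corrected pre-1970 offset: UTC+01:15

def to_utc_minutes(local_min, zone, rules):
    return local_min - rules[zone]

def to_local_minutes(utc_min, zone, rules):
    return utc_min + rules[zone]

local = 600                                                     # original datum: 10:00 local
stored_utc = to_utc_minutes(local, "Example/City", rules_2019)  # 540, under the old rules

# After the correction, the persisted UTC no longer round-trips to 10:00:
recovered = to_local_minutes(stored_utc, "Example/City", rules_2020)  # 615

# Whereas the stored original (600) converts correctly under either version.
corrected_utc = to_utc_minutes(local, "Example/City", rules_2020)     # 525
```

Post-1970 timestamps escape this only because their conversions are (by project policy) already known correct, so no later correction will change them.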
When exactly do we *know* anything, I mean really for sure? Imagine yourself back in 1966. UTC has been chugging along nicely for five years, providing a well-behaved predictable time standard. We know it's never going to change. Now it's 1967. Oops, the name just changed. Well, that doesn't really count, a rose by any other name ... Now it's 1972. Oops again, we just added a leap second, and there are more on the way. Sorry unix time_t, your fundamental assumption was just revoked -- and so for the next 50 years, nearly every computer on the planet will stubbornly refuse to accept the existence of leap seconds. Never say never.
---------- Forwarded message ----------
From: Robert Elz <kre@munnari.oz.au>
To: Lester Caine <lester@lsces.uk>
Cc: tz@iana.org
Date: Wed, 03 Jun 2020 23:48:42 +0700
Subject: Re: [tz] How good is pre-1970 time zone history represented in TZ database?

    Date: Wed, 3 Jun 2020 16:02:45 +0100
    From: Lester Caine <lester@lsces.uk>
    Message-ID: <3b7f0c78-4dd7-0f2e-a8e3-2b24401e7e1c@lsces.uk>
| I came into this 20 years ago
I've been involved with it for longer than that - back to my first unix experience, in '76, where the US tz rules were compiled into the code, and most people in AU simply adjusted their computer's clock (their offset from UTC) 4 times a year (when the US switched summer time on and off, and when AU did - and yes, that meant that the generated GMT timestamps were wrong, most of the time). From that (a bit later) I was responsible for the mess that existed until ado invented tzdata (and yes, I mean the 2nd arg to gettimeofday()).
You make me look like a newbie. I first got into this seriously in 1997 while updating legacy systems to handle Y2K.
From all of this I have learned that time is hard. Really hard.
Many people believe that since they learned to tell the time when they were 4 or 5 years old, and have been doing it ever since, they know all there is to know. That's sad...
See the discussion of seduction and polar bears in the first essay linked above.
| now while working with a data archive | which has now been simply dumped because we had no idea what rules were | used to produce the normalised data.
That's a pity, but sometimes past mistakes simply come back to bite, and sometimes bite hard. Note that the error there was normalising the data; if that hadn't been done, none of the rest of it would matter. You'd now have the original data and could manipulate it however seems best for now, regardless of what anyone did with it decades ago (and if you get it all wrong, future generations could cope, because they'd also still have the original data, and can fix any errors).
| Nowadays yes it does make sense to | store both an original time and a normalised time,
No, it doesn't, just the original, plus ...
| along with a location,
yes, something which can be mapped into a timezone - and as accurate a location as possible.
| and a record of which version of rules was used to do the | normalization.
Don't care about that, since the result won't be saved.
| Add to that a flag that indicates if the UTC time is fixed!
If the UTC time is the authoritative one, that is what is stored. No need for extra flags. Just the authoritative time - the one which defines whatever it is that is being recorded.
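The "no extra flags" point can be made concrete with a small sketch. The schema below is hypothetical (nothing from any real system): each event stores exactly one authoritative form, either a local wall time plus zone (e.g. a civil appointment) or a UTC instant (e.g. a physical measurement), and which one is populated *is* the flag.

```python
# Sketch (hypothetical schema): store only the authoritative form of each
# timestamp; everything else is derived on demand.
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional
from zoneinfo import ZoneInfo

@dataclass
class Event:
    # Exactly one representation is set: the form that *defines* the event.
    local_wall: Optional[str] = None   # authoritative local time, ISO format
    zone: Optional[str] = None         # tz identifier for local_wall
    utc: Optional[datetime] = None     # authoritative UTC instant

    def as_utc(self) -> datetime:
        if self.utc is not None:
            return self.utc            # already authoritative; nothing to derive
        local = datetime.fromisoformat(self.local_wall)
        return local.replace(tzinfo=ZoneInfo(self.zone)).astimezone(timezone.utc)
```

A tzdata correction changes what `as_utc` returns for local-authoritative events, exactly as it should, while UTC-authoritative events are untouched.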
I second everything Robert Elz said here. - Michael Kenniston
On Sat 2020-06-20T17:05:34-0700 Michael Kenniston hath writ:
| Imagine yourself back in 1966. UTC has been chugging along nicely for
| five years, providing a well-behaved predictable time standard. We know
| it's never going to change.
This is a Pollyanna view of history, which bursts upon looking at issues of Bulletin Horaire and seeing what was actually being broadcast by various transmitters. There was not uniformity in broadcasting what the BIH had designated "UTC" until 1972.

--
Steve Allen                    <sla@ucolick.org>              WGS-84 (GPS)
UCO/Lick Observatory--ISB 260  Natural Sciences II, Room 165  Lat +36.99855
1156 High Street               Voice: +1 831 459 3046         Lng -122.06015
Santa Cruz, CA 95064           https://www.ucolick.org/~sla/  Hgt +250 m
    Date: Sat, 20 Jun 2020 17:05:34 -0700
    From: Michael Kenniston <tz@michaelkenniston.com>
    Message-ID: <CAF570XH7PjWAbhEBwN2=kU6TP82WHVUnoi2AAGA3eYr++J9N8g@mail.gmail.com>

| You can also have data recorded in the past which refers to the future,
| so even a change that affects only future timestamps can still
| invalidate existing stored data.

Not if the authoritative original data is what is stored. What that means might alter, but the data is still as valid as it ever was. Of course, should the data change (which it can, since it has yet to occur - e.g. a meeting originally scheduled for a future Monday is moved to the following Wednesday) then the stored data will be invalid - but I don't see how anything can fix that in advance.

| When exactly do we *know* anything, I mean really for sure?

I know that the time now is 09:49:43 as I typed that '3' (not any longer) and that that was 02:49:43 UTC (when I typed it). Things might change in the future, but that equivalence never can, since it has passed now, and we cannot alter the past.

| Never say never.

Don't be naive. (sorry, no way to type an i with two dots...)

kre
participants (3)
- Michael Kenniston
- Robert Elz
- Steve Allen