Formal models and the need for storing original source data

June 21, 2020

      ...
---------- Forwarded message ----------
From: Robert Elz <kre@munnari.oz.au>
To: Lester Caine <lester@lsces.uk>
Cc: tz@iana.org
Bcc:
Date: Wed, 03 Jun 2020 19:31:43 +0700
Subject: Re: [tz] How good is pre-1970 time zone history represented in TZ
database?
    Date:        Wed, 3 Jun 2020 12:22:32 +0100
    From:        Lester Caine <lester@lsces.uk>
    Message-ID:  <3576438c-e0bd-7188-2153-43d80e9054b8@lsces.uk>
There are two kinds of
changes that may be made to a zone.   One affects only future timestamps,
and is the common old garden variety "government interference" or however
you think of it.  These are far and away the most common changes.  While
when we have insufficient notification of a change, it may be applied
retrospectively, when that has happened, everyone tends to be very aware
that there were bad time conversions for a while.   In any case, old
historic stored data isn't affected at all by this kind of change, it
is as valid (or not) after the change as it was before.
You can also have data recorded in the past which refers to the future, so
even
a change that affects only future timestamps can still invalidate existing
stored data.
...
The second kind of change is a correction to historic data.  This happens
when
we discover an error in what was present (and these days, almost only ever
affects pre-1970 timestamps).
In those, if someone had stored the UTC converted form of some local
timestamp,
then after the correction they wouldn't get back the data that was
originally
used to produce it.
The problem there is having discarded the original data instead of
retaining
it.   Always retain the original source data.   Then by all means, when
computing, convert timestamps from their various local values to UTC so
they can be more easily correctly ordered (or whatever) but use those
converted values only for transient computations.   Store the original.
Always.
If that is done correctly, then after a correction to old data, the
results might be different than they were before - but that's only because
they were wrong before, and (hopefully) better after the fix.
I agree 100%.  I've been struggling for years to really get my head wrapped
around this stuff, and finally decided to create a formal model using
abstract
data types.  I'm finding that approach to be quite helpful, so I wrote it
up in a
couple of essays that folks on this list might find interesting.  In fact,
the folks
on this list may well be the *only* people who would find them
interesting.  :-)

The first is aimed at the general programmer, but it introduces the
fundamental
concepts:
https://drive.google.com/file/d/1WntAyhIawYtbL2k3fPE71cM9EFkVCJGQ/view?usp=s...

The second is aimed at time geeks and goes much deeper into the underlying
theory:
https://drive.google.com/file/d/1aOj9YeDFUST0lQFXZiUsCSmxjnzlbIe4/view?usp=s...

These are still in draft form, but I hope others find some value in them.
(Teaser:
the model makes no reference to years, days, hours, or seconds.  How's that
for
abstract?)

What is rather interesting (and reassuring) is that the formal model ends
up telling
us pretty much what this thread has been saying:  store original data.
However,
by looking at things in terms of abstract data types, you can easily see
*why* you
need to do that without having to resort to specific examples, and many
seemingly
unrelated cases can be seen to be special cases of the same underlying
phenomenon.
The model also points you to the very specific places in your design and
code
where you have to watch out for problems.
...
The only time it makes sense to store timestamps in other than the original
form is when we *know* that the conversion is correct (and hence, no later
correction will change it).   For users of tzdata that really only applies
to post-1970 timestamps.
When exactly do we *know* anything, I mean really for sure?

Imagine yourself back in 1966.  UTC has been chugging along
nicely for five years, providing a well-behaved predictable time
standard.  We know it's never going to change.

Now it's 1967.  Oops, the name just changed.  Well, that doesn't
really count, a rose by any other name ...

Now it's 1972.  Oops again, we just added a leap second, and
there are more on the way.  Sorry unix time_t, your fundamental
assumption was just revoked -- and so for the next 50 years, nearly
every computer on the planet will stubbornly refuse to accept the
existence of leap seconds.

Never say never.
...
---------- Forwarded message ----------
From: Robert Elz <kre@munnari.oz.au>
To: Lester Caine <lester@lsces.uk>
Cc: tz@iana.org
Bcc:
Date: Wed, 03 Jun 2020 23:48:42 +0700
Subject: Re: [tz] How good is pre-1970 time zone history represented in TZ
database?
    Date:        Wed, 3 Jun 2020 16:02:45 +0100
    From:        Lester Caine <lester@lsces.uk>
    Message-ID:  <3b7f0c78-4dd7-0f2e-a8e3-2b24401e7e1c@lsces.uk>
| I came into this 20 years ago
I've been involved with it for longer than that - back to my first
unix experience, in '76, where the US tz rules were compiled into the
code, and most people in AU simply adjusted their computer's clock
(their offset from UTC) 4 times a year (when the US switched summer time
on and off, and when AU did - and yes, that meant that the generated GMT
timestamps were wrong, most of the time).   From that (a bit later) I was
responsible for the mess that existed until ado invented tzdata (and yes,
I mean the 2nd arg to gettimeofday()).
You make me look like a newbie.  I first got into this seriously in 1997
while
updating legacy systems to handle Y2K.
...
...
From all of this I have learned that time is hard.   Really hard.
Many people believe that since they learned to tell the time when
they were 4 or 5 years old, and have been doing it ever since, they
know all there is to know.   That's sad...
See the discussion of seduction and polar bears in the first essay
linked above.
...
| now while working with a data archive
  | which has now been simply dumped because we had no idea what rules were
  | used to produce the normalised data.
That's a pity, byt sometimes past mistakes simply come back to bite,
and sometimes bite hard.   Note that the error there was normalising the
data, if that hadn't been done, none of the rest of it would matter,
you'd now have the original data and could manipulate it however seems
best, for now, regardless of what anyone did with it decades ago (and if
you get it all wrong, future generations could cope, because they'd also
still have the original data, and can fix any errors).
| Nowadays yes it does make sense to
  | store both an original time and a normalised time,
No, it doesn't, just the original, plus ...
| along with a location,
yes, something which can be mapped into a timezone - and as accurate
a location as possible.
| and a record of which version of rules was used to do the
  | normalization.
Don't care about that, since the result won't be being saved.
| Add to that a flag that indicates if the UTC time is fixed!
If the UTC time is the authoritative one, that is what is stored.
No need for extra flags.   Just the authoritative time - the one
which defines whatever it is that is being recorded.
I second everything Robert Elz said here.

 - Michael Kenniston

Michael Kenniston

Steve Allen

Robert Elz

tags

participants (3)