
Some recent difficulties with tzdb updates have led me to consider how they could have been avoided.

• The fact that two timezones agree since 1970 is not represented independently from the data of each timezone. This leads to "update anomalies" well known from database theory. For instance, before Africa/Khartoum switched to UTC + 02 h in 2017, Africa/Juba was a Link (in the file africa) to Africa/Khartoum. After the switch, new Zone data for Africa/Juba had to appear in the file africa. Actually, Zone data for Africa/Juba had survived in the file backzone, but this was overlooked, so that we had Zone data for Africa/Juba in both files africa and backzone from 2017c until 2018b, and these Zone data did not even agree. And all this happened for Africa/Juba even though civil time in Juba was not changed at all at the time. It appears that this error would not have happened if all African Zone data had been kept in the file africa (rather than in africa and in backzone) -- it was not the Zone data of Africa/Juba that had changed, but only the fact that Africa/Juba was merged with Africa/Khartoum because their Zone data agreed since 1970. Unfortunately, this fact was not (and still is not) representable as an independently changeable item in tzdb.

• Most of the Zone data currently contained in the file backzone were stored in earlier versions in the "continental" files (africa, antarctica, asia, australasia, europe, northamerica, southamerica), and the reason why they are in backzone is that their Zone data agree with other Zone data since 1970. And since this may well change in the future, we have to keep these Zone data in backzone current. The file backzone is an integral part of the tzdb data, not just a container for additional data of lesser quality. The quality of any Zone data (and any Rule data) in tzdb should always be "to the best of our current knowledge" -- it just does not make sense to keep Zone data in tzdb that are not updated when we acquire relevant new information for them. Thus we should get rid of Zone data for Argentina/Rosario etc (or else update them); keeping data that are known to be wrong is not only useless, it is an invitation for consequential errors.

• Letting derived data (such as whether two timezones should be merged because they agree since 1970) decide the storage location of the Zone data for a timezone not only implies unnecessary data moves upon data updates, it may also disrupt commentary text and its references. Easily understandable comments (such as which facts were deduced from which document) are crucial for later updates, where the effect of a newly found document has to be determined, often after several years, and likely by a different contributor.

• The fact that two timezones agree since 1970 has nothing to do with the fact that some timezones have changed their names, with the old names being kept as Links to the new names. Currently, however, Links representing one or both facts are kept in the same file backward, and cannot be distinguished. This leads to information loss and update anomalies: Currently, the file backward has a Link from Africa/Asmera to Africa/Nairobi -- the information that Asmera is an outdated spelling of Asmara can only be found in the file backzone. The name America/Virgin had been replaced by America/St_Thomas in version 95k, and this fact could be seen in the file backward until 2021a, when this information was lost. It reappeared (in backzone) only in version 2021c. Keeping one type of information (spelling changes) in different locations (files backward or backzone) depending on an independent condition (that may even change over time) certainly causes unnecessary maintenance effort.

While some of these points may sound like theoretical claims for normal forms as taught in computer science, my point here is only practical simplicity: each basic fact to be recorded in tzdb should have its obvious place where it is stored and where it can be looked up and updated; and updates of independent facts should be possible without mutual side effects. This appears to be a necessity for a collaborative project.

Last Saturday, Paul Eggert very nicely summarized the history and some of the guiding principles of tzdb. It is largely due to his immense work on the maintenance and evolution of tzdb that the tzdb system was such a tremendous success in its first 30 years. As a means for success over the next 30 years, I propose a simplification of the tzdb schema, so as to avoid the update anomalies described above, and thus decrease the maintenance burden, currently mainly shouldered by Paul. The information schema used in the fork produced by Stephen Colebourne is already much simpler, and it is apparently what is needed by several power users contributing to the widespread success of tzdb.

Michael Deckers.

On 3/25/22 15:36, Michael H Deckers via tz wrote:
• The fact that two timezones agree since 1970 is not represented independently from the data of each timezone.
Yes, and more generally the problem is that the zic input format does not allow one Zone to say something like "before 1966 I was like this other Zone", or "between 1922 and 1945 I was like this other Zone". This problem is not limited to data in the "backzone" file. To fix this more general problem, we could change the zic input format and change the data accordingly. (This has already been proposed, but has not been implemented.) Or we could have some sort of prepass over the data. Or we could do both. I doubt whether it'd be worth the hassle of trying to fix it only for "backzone".
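To make the "prepass" idea a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not a proposal for actual syntax: the ("like", ...) reference, the year-granularity eras, the zone names Example/A and Example/B and their offsets are all hypothetical, and none of this exists in the real zic input format. The prepass replaces each reference by the eras of the other zone, clipped to the stated interval.

```python
# Hypothetical data model (NOT the real zic format):
#   an era is       (start_year, end_year, utc_offset), end exclusive
#   a reference is  ("like", other_zone, start_year, end_year)

def expand(zones, name, _seen=()):
    """Return the eras of `name` with every "like" reference expanded."""
    if name in _seen:
        raise ValueError("circular reference involving " + name)
    result = []
    for entry in zones[name]:
        if entry[0] == "like":
            _, other, start, end = entry
            # Copy the other zone's eras, clipped to [start, end).
            for s, e, offset in expand(zones, other, _seen + (name,)):
                lo, hi = max(s, start), min(e, end)
                if lo < hi:
                    result.append((lo, hi, offset))
        else:
            result.append(entry)
    return result


# Illustrative data only; the names and offsets are made up.
zones = {
    "Example/A": [(1900, 1966, "+02:00"), (1966, 2100, "+03:00")],
    "Example/B": [("like", "Example/A", 1900, 1966), (1966, 2100, "+01:00")],
}

if __name__ == "__main__":
    print(expand(zones, "Example/B"))
```

With such a scheme, "Example/B agreed with Example/A before 1966" is stored exactly once, and a correction to the early data of Example/A would propagate to Example/B without any data having to move between files.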
The file backzone is an integral part of the tzdb data, not just a container for additional data of lesser quality.
Here I'm afraid I'll have to disagree. "backzone" contains a considerable amount of data of lesser quality. (Some of the lack of quality is that we don't record or even know how low the quality is.) I don't have time to maintain "backzone" well and I doubt whether anyone else does either. This need to limit the maintenance burden (much of it political, and some behind the scenes) is not something always appreciated by users. That doesn't make it any less real.
Thus we should get rid of Zone data for Argentina/Rosario etc (or else update them); keeping data that are known to be wrong is not only useless, it is an invitation for consequential errors.
Feel free to propose changes to "backzone" along these lines. Please send them in "git format-patch" form. If there's no objection it wouldn't be much work to install them.
• The fact that two timezones agree since 1970 has nothing to do with the fact that some timezones have changed their names, with the old names being kept as Links to the new names. Currently, however, Links representing one or both facts are kept in the same file backward,
Yes, that's a problem and should be fixed. Much of it is a relic of last year's controversy, which ended up with only nine zones being moved to "backward" instead of the thirty-odd that I originally proposed, under the idea the original proposal was too big a change to install all at once. It's time to move the rest of the zones (or at least, move nine more), and once the move is done we can attack the problem of categorizing Link lines better than they're categorized now, with the goal of making it easier for people like Stephen Colebourne to maintain downstream versions that use a different approach.
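As a rough illustration of what "categorizing Link lines" could mean, here is a small Python sketch. It assumes the usual "Link TARGET LINK-NAME" line layout and a separately maintained table of renames; that table (RENAMES), the category labels and the Example/* names are hypothetical and only serve the illustration. The Africa/Asmera line is the example from the message quoted above. Any Link not listed in the rename table is assumed here to be a merge, which is a simplification.

```python
def parse_links(text):
    """Collect (target, link_name) pairs from 'Link TARGET LINK-NAME' lines."""
    links = []
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments
        if len(fields) >= 3 and fields[0] == "Link":
            links.append((fields[1], fields[2]))
    return links


def classify(links, renames):
    """Label each link name as 'rename', 'merge', or 'rename+merge'.

    `renames` maps an old name to its current spelling; it is a
    hypothetical table, not something tzdb currently provides.
    """
    categories = {}
    for target, name in links:
        if name not in renames:
            categories[name] = "merge"          # assumption: not a rename
        elif renames[name] == target:
            categories[name] = "rename"
        else:
            categories[name] = "rename+merge"   # renamed AND merged elsewhere
    return categories


SAMPLE = """\
# Link TARGET LINK-NAME
Link Africa/Nairobi Africa/Asmera
Link Example/New Example/Old
Link Example/Big Example/Small
"""

# Hypothetical rename table (old name -> current name).
RENAMES = {"Africa/Asmera": "Africa/Asmara", "Example/Old": "Example/New"}

if __name__ == "__main__":
    for name, category in classify(parse_links(SAMPLE), RENAMES).items():
        print(name, category)
```

The point of the sketch is that once renames are recorded in one place, independent of merges, a downstream maintainer can select or undo either kind of Link mechanically.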
participants (2): Michael H Deckers, Paul Eggert