    Date:        Sat, 02 Aug 2014 18:35:58 -0700
    From:        Paul Eggert <eggert@cs.ucla.edu>
    Message-ID:  <53DD91FE.90405@cs.ucla.edu>

  | Of course it's not 100% right. We don't have reliable information about
  | old timestamps in Ghana. It would be better if we didn't have to put
  | this low-quality data into the tz database at all. The only reason it's
  | there is that the format requires it.

Not really. Since time_t's are (generally) signed and, these days, 64 bits,
localtime() can be asked to convert any time going back to (way) before
the big bang, and is expected to produce some kind of meaningful result.

Of course, our data (currently anyway) assumes that the Gregorian calendar
has existed since (apparently way before) the big bang, and converts dates
based upon that assumption - which means we know it is producing utter
nonsense for anything earlier than the 16th century (or whatever) and, for
some parts of the world, up to the early 20th century.

But we have to produce something. Nonsense or not, what matters is that
(at least for reasonable dates, say back a few thousand years BC or so)
we get some kind of reasonably stable (and comparable) results, not just
whatever random value happens to seem convenient today.

  | Much of the pre-1970 data falls into this category, unfortunately.

Yes, it does, and if you really wanted to get rid of all unverified data,
you'd remove all of it (from all zones) - the format requires that
something be there, not any particular transitions. Just removing isolated
segments of that unverified data looks wrong.

  | When the quality is this bad, there's nothing wrong with improving the
  | quality even if the result is not perfect, or with removing bad data if
  | this can be done without significantly affecting end-user applications.

No, there wouldn't be, if there were known bad data. But that's not what
any of this is - no-one has a problem with correcting data that is known
to be incorrect.
The problem is that that is not what any of this is. It isn't bad data,
it is just data that we do not know is correct, and we guess might not be
perfect. That guess might be right - or it might not be, that's the point -
it is possible that (just by chance) you're removing some good data and
replacing it with bad data. You don't know, I don't know.

I'd suggest just putting everything back, keeping the results stable (if
they're wrong, at least they're the same wrong today as yesterday), and
replacing data only when it is known to be incorrect.

If you're not going to do that, then at least do it properly, and delete
all of the unverified data - ALL of it.

kre