Gripe about compressed rule set names in tzdata.zi
Reviewing the changes between 2018d and 2018e crystallized something that had not been clear to me before: the "compressed" ruleset names used in tzdata.zi are chosen as sequential ASCII characters, or pairs of ASCII letters after the single-character names run out. This means that addition or deletion of a ruleset near the beginning of the file creates semantically-insignificant changes throughout the rest of the file, making it hard to review the diffs to get a sense of what changed, or to eyeball-validate the .zi file. It probably also bloats one's repository if storing tzdata.zi in a source code control system.

So I'm wondering if it'd be acceptable to change the compression rule to something that gives more stable results. I've not really experimented to see what would work well, but one simple idea is to take the shortest unique prefix of each original ruleset name. There are other options if shortness is prized well above readability. I wouldn't expect the results to be 100% stable in the face of ruleset additions/deletions, since conflicts could appear or disappear, but 99% would be fine.

Regardless of the specifics, this would surely have the effect of making tzdata.zi a bit bigger (although compression would likely buy back much of the delta). So, if nobody but me is annoyed by the current behavior, I'd expect such a proposal to get rejected. I'm willing to do the legwork to develop a concrete patch if it'd be likely to get accepted.

regards, tom lane
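To make the "shortest unique prefix" idea concrete, here is a minimal Python sketch. It is only an illustration, not anything proposed in the thread as code: the function name is hypothetical, the sample names are the ruleset names mentioned later in this thread, and "Romania" is an invented ruleset used only to show what a conflict does.

```python
def shortest_unique_prefixes(names):
    """Map each name to its shortest prefix that no other name starts with.

    Assumes the names are distinct and non-empty.
    """
    result = {}
    for name in names:
        others = [n for n in names if n != name]
        for length in range(1, len(name) + 1):
            prefix = name[:length]
            if not any(other.startswith(prefix) for other in others):
                result[name] = prefix
                break
        else:
            # The whole name is a prefix of some other name, so no shorter
            # unambiguous form exists; keep the full name.
            result[name] = name
    return result


before = shortest_unique_prefixes(
    ["Russia", "Algeria", "NZ", "SA", "Ecuador", "Iraq"])

# "Romania" is a hypothetical new ruleset, added only for illustration.
after = shortest_unique_prefixes(
    ["Russia", "Algeria", "NZ", "SA", "Ecuador", "Iraq", "Romania"])

# Names whose abbreviation changed because of the addition.
changed = sorted(n for n in before if before[n] != after[n])
```

In this example only "Russia" changes (from "R" to "Ru") when the hypothetical name is added; the other abbreviations stay put, which is the kind of mostly-stable behavior described above.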
On 05/09/2018 11:21 AM, Tom Lane wrote:
addition or deletion of a ruleset near the beginning of the file creates semantically-insignificant changes throughout the rest of the file, making it hard to review the diffs to get a sense of what changed, or to eyeball-validate the .zi file.
I haven't been eyeballing tzdata.zi myself, as it wasn't designed to be readable. Instead, I run "make check_tzs", which compares zdump output, and I eyeball that. We could make tzdata.zi more readable along the lines that you suggest, presumably as a runtime option to zishrink.awk and a corresponding Makefile macro to let builders select whether they want tzdata.zi to be smaller and less-readable, or larger and more-readable.
Paul Eggert <eggert@cs.ucla.edu> writes:
We could make tzdata.zi more readable along the lines that you suggest, presumably as a runtime option to zishrink.awk and a corresponding Makefile macro to let builders select whether they want tzdata.zi to be smaller and less-readable, or larger and more-readable.
Hm, I'm inclined to think that that's overkill. With the patch I propose below, the size of tzdata.zi for 2018e grows from 106908 to 108256 bytes, or a 1.26% increase; it doesn't seem worth complicating builders' lives still more to offer an option to avoid that.

What I did was just to hash the input ruleset names, remove collisions through an open-chaining adjustment, and generate new names by converting the hashes back to strings. The hash function is pretty trivial, but I'm not sure it's worth working harder (or practical to do anything more interesting in awk, anyway). I get three collisions on the current set of names:

Collision between Russia and Algeria at hash 641
Collision between NZ and SA at hash 96
Collision between Ecuador and Iraq at hash 2403

Given that we're trying to map 134 names into a space of 2704 hash values, some collisions are practically inevitable, and so I doubt we'd do better with a different hash.

Note that I'm mapping the names into upper and lower case letters only. We could reduce the probability of a collision a little by also using punctuation as the current code does, but I think that that's not actually a good design: if the ruleset syntax is ever expanded to make punctuation have some other meaning, the existing compression rule is going to cause forward-compatibility problems. Still, if you're convinced that that will never happen, the attached patch can easily be adjusted to restore the larger output alphabet.

With this approach, I estimate that there's at most about a 5% chance of a new ruleset name causing one existing ruleset's abbreviation to change, and a very small chance of it affecting more than one existing ruleset. So that's a great deal better than the existing way as far as the stability of the tzdata.zi representation goes.

regards, tom lane
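The hashing approach and the estimates above can be sketched in Python. This is not the patch from the message (which is in awk and is not reproduced in this thread): the hash function, the linear-probing collision rule, and all names below are assumptions chosen for illustration. The only figures taken from the message are the 134 names and the 2704 (52 x 52) two-letter codes.

```python
import string

# Upper- and lower-case letters only, as described in the message.
ALPHABET = string.ascii_uppercase + string.ascii_lowercase  # 52 letters
SPACE = len(ALPHABET) ** 2                                  # 2704 codes


def toy_hash(name):
    """Hypothetical stand-in hash; the real patch's hash is not shown here."""
    h = 0
    for ch in name:
        h = (h * 31 + ord(ch)) % SPACE
    return h


def encode(slot):
    """Turn a slot number in [0, SPACE) into a two-letter code."""
    return ALPHABET[slot // len(ALPHABET)] + ALPHABET[slot % len(ALPHABET)]


def assign_codes(names):
    """Give each name a distinct code, probing forward on a collision.

    Linear probing is an assumption; the message does not spell out the
    exact collision adjustment used.
    """
    taken = set()
    codes = {}
    for name in names:
        slot = toy_hash(name)
        while slot in taken:          # collision: try the next slot
            slot = (slot + 1) % SPACE
        taken.add(slot)
        codes[name] = encode(slot)
    return codes


codes = assign_codes(["Russia", "Algeria", "NZ", "SA", "Ecuador", "Iraq"])

# Back-of-envelope estimates for a uniformly random hash:
n, m = 134, SPACE
expected_colliding_pairs = n * (n - 1) / (2 * m)  # roughly 3.3
chance_new_name_collides = n / m                  # roughly 0.05
```

Under the uniform-hash assumption, about 3.3 colliding pairs are expected among 134 names, in line with the three collisions reported, and a new name lands on an occupied slot with probability 134/2704, just under 5%, which matches the estimate given in the message.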
Tom Lane wrote:
I doubt we'd do better with a different hash.
We can do a bit better; the attached patches use a hash that shrinks the size of tzdata.zi by about 0.5% compared to the method used in 2018e. This hash should also avoid needless churn during updates.
if the ruleset syntax is ever expanded to make punctuation have some other meaning, the existing compression rule is going to cause forward-compatibility problems.
Good point. The data entries are already using some punctuation characters as Rule names and so these characters are fair game, but we should reserve some of the never-used characters. The attached proposed patches reserve the characters in "!$%&'()*,/:;<=>?@[\]^`{|}~", unless quoted. (However, this restriction is not enforced by zic in the attached patches.) The attached patches also require Rule names to begin with a character that is not a digit, -, +, or white space; zic already rejected the empty string (this was not documented) and there were ambiguities if one of these characters started a Rule name so I added a check for this to zic. If anybody uses unusual Rule names, now's a good time to speak up.
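The naming rules described above can be written down as a small checker. This is a Python sketch of the rules as stated in this message, not the check that was added to zic; the function name is hypothetical, and the sketch ignores the "unless quoted" exception, treating every name as unquoted.

```python
# Characters the message proposes to reserve in unquoted Rule names.
RESERVED = set("!$%&'()*,/:;<=>?@[\\]^`{|}~")


def check_rule_name(name):
    """Return a list of problems with an unquoted Rule name (empty if none).

    Hypothetical helper: per the message, zic itself only rejects the empty
    string and a bad first character; the reserved set is not enforced there.
    """
    if not name:
        return ["name is empty"]
    problems = []
    first = name[0]
    if first in "0123456789" or first in "-+" or first.isspace():
        problems.append("name starts with a digit, '-', '+', or white space")
    bad = sorted(set(name) & RESERVED)
    if bad:
        problems.append("name uses reserved characters: " + "".join(bad))
    return problems
```

For example, `check_rule_name("Russia")` returns an empty list, while `check_rule_name("-x")` and `check_rule_name("a@b")` each report one problem.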
Paul Eggert <eggert@cs.ucla.edu> writes:
Tom Lane wrote:
I doubt we'd do better with a different hash.
We can do a bit better; the attached patches use a hash that shrinks the size of tzdata.zi by about 0.5% compared to the method used in 2018e. This hash should also avoid needless churn during updates.
Looks nice --- obviously, this is more work than what I did, but it seems sensible, and not likely to be an undue pain to maintain. I think 0004 is a bit of a kluge though ...

regards, tom lane