Gripe about compressed rule set names in tzdata.zi

May 9, 2018

Reviewing the changes between 2018d and 2018e crystallized something that
had not been clear to me before: the "compressed" ruleset names used in
tzdata.zi are chosen as sequential ASCII characters, or pairs of ASCII
letters after the single-character names run out.  This means that
adding or deleting a ruleset near the beginning of the file creates
semantically insignificant changes throughout the rest of the file,
making it hard to review the diffs to get a sense of what changed,
or to eyeball-validate the .zi file.  It probably also bloats one's
repository if storing tzdata.zi in a source code control system.
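
To illustrate the churn, here is a rough Python sketch.  It is not the
actual logic used to generate tzdata.zi; the function name, the choice of
alphabet (ASCII letters only), and the sample ruleset list are assumptions
made purely for illustration.

```python
import string

def sequential_names(rulesets):
    """Assign short names in file order (hypothetical model of the scheme).

    Assumption: the pool is the ASCII letters, followed by all
    two-letter pairs once the single letters run out.
    """
    alphabet = string.ascii_letters
    pool = list(alphabet) + [a + b for a in alphabet for b in alphabet]
    return {name: pool[i] for i, name in enumerate(rulesets)}

# Illustrative ruleset lists: the second simply drops the first entry.
before = sequential_names(["Algeria", "Egypt", "Ghana", "Libya"])
after = sequential_names(["Egypt", "Ghana", "Libya"])

# Every ruleset that survives the deletion gets a different short name.
changed = sorted(n for n in after if before[n] != after[n])
print(changed)
```

Under this model, deleting one ruleset at the front renames every later
ruleset, so each line that refers to one of them shows up in the diff.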

So I'm wondering if it'd be acceptable to change the compression rule to
something that gives more stable results.  I've not really experimented
to see what would work well, but one simple idea is to take the shortest
unique prefix of each original ruleset name.  There are other options if
shortness is prized well above readability.  I wouldn't expect the results
to be 100% stable in the face of ruleset additions/deletions, since
conflicts could appear or disappear, but 99% would be fine.
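
As a minimal sketch of the shortest-unique-prefix idea (not a proposed
patch, and untested against the real data), something like the following
Python would do.  The function name and the sample ruleset names are
assumptions for illustration only.

```python
def shortest_unique_prefixes(names):
    """Map each name to its shortest prefix that no other name starts with.

    If a name is itself a prefix of another name, it keeps its full
    spelling, so the resulting abbreviations are still distinct strings.
    """
    ordered = sorted(set(names))
    result = {}
    for i, name in enumerate(ordered):
        need = 1
        # In sorted order, the longest shared prefix is with a neighbor.
        for j in (i - 1, i + 1):
            if 0 <= j < len(ordered):
                common = 0
                for a, b in zip(name, ordered[j]):
                    if a != b:
                        break
                    common += 1
                need = max(need, common + 1)
        result[name] = name[:need]
    return result

# Illustrative ruleset names only.
full = shortest_unique_prefixes(
    ["AN", "Albania", "Algeria", "Arg", "Austria", "EU", "EUAsia"])
# Drop the first ruleset and recompute.
fewer = shortest_unique_prefixes(
    ["Albania", "Algeria", "Arg", "Austria", "EU", "EUAsia"])
print(full)
print(all(full[n] == fewer[n] for n in fewer))
```

In this small example, removing a ruleset leaves every other abbreviation
unchanged.  That will not always hold, since a deletion can remove the
conflict that forced a neighbor's prefix to be longer.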

Regardless of the specifics, this would surely have the effect of making
tzdata.zi a bit bigger (although compression would likely buy back much
of the delta).  So, if nobody but me is annoyed by the current behavior,
I'd expect such a proposal to get rejected.

I'm willing to do the legwork to develop a concrete patch if it'd
be likely to get accepted.

			regards, tom lane
