Right, I've now had a chance to do a bit more work on this. The various options are committed in a GitHub branch of Noda Time.

I have a few concerns about the proposed format, but I definitely agree that we need to consider the audience and use cases. The use case I'm primarily interested in is validation: diffing a "golden" file with one generated by another tool. For example, to validate that Noda Time is doing the right thing, I'd compare the output of zdump with the output of NodaTime.TzValidate.NodaDump. Ideally, there will be no differences, so nothing to look at. If there are differences, I need to be able to understand them easily. Sometimes that will be missing lines on one side or the other indicating a different number of transitions, sometimes it will be differences between two lines (e.g. the transition point). In my use case, one would rarely, if ever, be visually examining a single file to look for anomalies, which is Paul's use case.
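To make the diffing use case concrete, here is a minimal sketch in Python (chosen only for illustration; the function name diff_dumps is made up, and it assumes both dumps have already been read into strings). It uses the standard library's difflib, not any tzvalidate tooling:

```python
import difflib


def diff_dumps(golden_text, candidate_text):
    """Return unified-diff lines between a golden dump and a candidate dump.

    An empty list means the two dumps agree line for line.
    """
    golden = golden_text.splitlines()
    candidate = candidate_text.splitlines()
    return list(
        difflib.unified_diff(
            golden,
            candidate,
            fromfile="golden",
            tofile="candidate",
            lineterm="",  # inputs were split without newlines
        )
    )
```

With one transition per line, a missing transition shows up as a single added or removed line, and a differing transition point shows up as one removed line next to one added line.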

In terms of the users themselves: while I'd expect them to have reasonable domain expertise (they're people writing date/time libraries), I wouldn't expect them to be dealing with this format every day. So it should really be as clear as possible, without having to consult the man page each time. (I'd envisage maybe having to look at the files once every six months or year.)

The other "user" to consider is machine readability: there are some cases where it's very useful to be able to parse the file easily from code. For example, some platforms I've looked at definitely get the abbreviation wrong in many cases, so before diffing I remove the name. That's trivial to do in the current format - but much harder when some parts are optional and everything is variable width.
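As a rough illustration of why the fixed-width format makes this trivial, here is a Python sketch (Python only for illustration; strip_abbreviation and PREFIX_WIDTH are made-up names). It assumes the currently documented layout, where the timestamp, offset and standard/daylight indicator always occupy the first 39 characters, and that only transition lines start with a digit or "Initially:":

```python
# Width of "YYYY-MM-DD hh:mm:ssZ" + " " + "+hh:mm:ss" + " " + "standard",
# i.e. 20 + 1 + 9 + 1 + 8 characters (assumed fixed in the current format).
PREFIX_WIDTH = 39


def strip_abbreviation(line):
    """Drop the trailing abbreviation from a transition line.

    Zone ID lines and anything else are returned unchanged.
    """
    if line.startswith("Initially:") or line[:1].isdigit():
        return line[:PREFIX_WIDTH]
    return line
```

With optional fields and variable-width columns, the same job needs real tokenising and knowledge of which fields may be absent, rather than a single slice.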

Regarding compactness: again, this comes down to use cases. I don't particularly mind the file being reasonably large in total, so long as each zone is simple to look at. (I don't want multiple lines per transition, for example.) When zipped, there's not much difference between my original format and the smallest one I've tested (128K vs 106K). If we can make it more compact easily, that's fine - but I personally regard that as a much lower priority than other aspects of the format.

Okay, concerns:
Six sample formats to compare for Honolulu (one of the examples given in Paul's man page), in the order of the commits in the GitHub branch. The number is the size of the file (including headers) for all zones. All of these still represent the transition in UT:

"Original" (currently documented tzvalidate) - 1,735,616 bytes

Pacific/Honolulu
Initially:           -10:31:26 standard LMT
1896-01-13 22:31:26Z -10:30:00 standard HST
1933-04-30 12:30:00Z -09:30:00 daylight HDT
1933-05-21 21:30:00Z -10:30:00 standard HST
1942-02-09 12:30:00Z -09:30:00 daylight HDT
1945-09-30 11:30:00Z -10:30:00 standard HST
1947-06-08 12:30:00Z -10:00:00 standard HST

Short daylight and standard indicators - 1,463,421 bytes

Pacific/Honolulu
Initially:           -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30:00 s HST
1933-04-30 12:30:00Z -09:30:00 d HDT
1933-05-21 21:30:00Z -10:30:00 s HST
1942-02-09 12:30:00Z -09:30:00 d HDT
1945-09-30 11:30:00Z -10:30:00 s HST
1947-06-08 12:30:00Z -10:00:00 s HST

Shorter offsets, but still with colons - 1,240,377 bytes

Pacific/Honolulu
Initially:           -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30 s HST
1933-04-30 12:30:00Z -09:30 d HDT
1933-05-21 21:30:00Z -10:30 s HST
1942-02-09 12:30:00Z -09:30 d HDT
1945-09-30 11:30:00Z -10:30 s HST
1947-06-08 12:30:00Z -10 s HST

Shorter offsets, no colons - 1,236,955 bytes

Pacific/Honolulu
Initially:           -103126 s LMT
1896-01-13 22:31:26Z -1030 s HST
1933-04-30 12:30:00Z -0930 d HDT
1933-05-21 21:30:00Z -1030 s HST
1942-02-09 12:30:00Z -0930 d HDT
1945-09-30 11:30:00Z -1030 s HST
1947-06-08 12:30:00Z -10 s HST

Variable transition times, e.g. "21" instead of "21:00:00Z" (and changing Initially to - -) - 972,361 bytes

Pacific/Honolulu
- - -10:31:26 s LMT
1896-01-13 22:31:26 -10:30 s HST
1933-04-30 12:30 -09:30 d HDT
1933-05-21 21:30 -10:30 s HST
1942-02-09 12:30 -09:30 d HDT
1945-09-30 11:30 -10:30 s HST
1947-06-08 12:30 -10 s HST

Variable transition times, but always keeping minutes - 1,079,278 bytes

Content is the same as the above, due to all the transitions happening on the half hour...


To show the difference between the last two options, here's Pago_Pago:

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11 -11 s BST
1983-11-30 11 -11 s SST

vs

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11:00 -11 s BST
1983-11-30 11:00 -11 s SST

(I'd prefer to keep the Z in there, admittedly - that wasn't an option I happened to code though. It's easy enough to imagine it...)

With all that in mind, I would personally prefer to stick to the currently documented tzvalidate format. For my use cases of diffing and machine parsing, the fixed-width format is useful, as is always specifying both the daylight/standard indicator and the name. I could live with the offset and time shortening, but I'd definitely prefer to have colons in the offset, and to keep minutes in the time part.
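To show what the colon preference means for parsing code, here is a small Python sketch of reading the shortened offsets in both styles (both function names are made up for this example, and it assumes hours are always two digits, as in the samples above):

```python
def parse_offset_with_colons(text):
    """Parse "-10", "-10:30" or "-10:31:26" into seconds east of UT."""
    sign = -1 if text[0] == "-" else 1
    parts = [int(p) for p in text.lstrip("+-").split(":")]
    parts += [0] * (3 - len(parts))  # missing minutes/seconds default to 0
    hours, minutes, seconds = parts
    return sign * (hours * 3600 + minutes * 60 + seconds)


def parse_offset_without_colons(text):
    """Parse "-10", "-1030" or "-103126" into seconds east of UT."""
    sign = -1 if text[0] == "-" else 1
    digits = text.lstrip("+-")
    hours = int(digits[0:2])
    minutes = int(digits[2:4] or 0)  # empty slice means the field is absent
    seconds = int(digits[4:6] or 0)
    return sign * (hours * 3600 + minutes * 60 + seconds)
```

Neither is hard to write, but the colon form splits on an explicit separator, whereas the colon-free form relies on the reader (human or code) counting digits.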

Thoughts?

Jon


On 30 May 2016 at 22:59, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
I'd personally be willing to sacrifice a
certain amount of compactness for the sake of readability, but obviously if
we can get the size down a bit *without* losing readability, that would be
good.

Yes. Readability is to some extent in the eye of the beholder, and the proposed zdump -i format wasn't my first choice: it evolved over some time as I used it to look at a lot of data. To some extent the format is aimed at my needs, and may be less suited for novices. For example:

TZ="America/Phoenix"
- - -072818 LMT
1883-11-18 12 -07 MST
1918-03-31 03 -06 MDT 1
1918-10-27 01 -07 MST
1919-03-30 03 -06 MDT 1
1919-10-26 01 -07 MST
1942-02-09 03 -06 MWT 1
1943-12-31 23:01 -07 MST
1944-04-01 01:01 -06 MWT 1
1944-09-30 23:01 -07 MST
1967-04-30 03 -06 MDT 1
1967-10-29 01 -07 MST

Here the columns don't line up and although this may be a bit offputting for some, for me it's a plus as it causes the unusual WWII non-hour transitions to stand out. Also, it's easier to visually identify the daylight-saving transitions via "1" vs nothing, than to scan through a column saying "isdst=1" vs "isdst=0". In contrast:

America/Phoenix
Initially:           -07:28:18 standard LMT
1883-11-18 19:00:00Z -07:00:00 standard MST
1918-03-31 09:00:00Z -06:00:00 daylight MDT
1918-10-27 08:00:00Z -07:00:00 standard MST
1919-03-30 09:00:00Z -06:00:00 daylight MDT
1919-10-26 08:00:00Z -07:00:00 standard MST
1942-02-09 09:00:00Z -06:00:00 daylight MWT
1944-01-01 06:01:00Z -07:00:00 standard MST
1944-04-01 07:01:00Z -06:00:00 daylight MWT
1944-10-01 06:01:00Z -07:00:00 standard MST
1967-04-30 09:00:08Z -06:00:00 daylight MDT
1967-10-29 08:00:00Z -07:00:00 standard MST

Although this conveys the same information, it's harder to catch anomalies, as the nicely-aligned columns and data tend to blur into each other. For example, it's hard to spot the error that I deliberately introduced into the penultimate line of that data, whereas the same error would have been much easier to see in zdump -i format.