Right, I've now had a chance to do a bit more work on this. The various options are committed in a GitHub branch of Noda Time.

I have a few concerns about the proposed format, but I definitely agree that we need to consider the audience and use cases. The use case I'm primarily interested in is validation: diffing a "golden" file with one generated by another tool. For example, to validate that Noda Time is doing the right thing, I'd compare the output of zdump with the output of NodaTime.TzValidate.NodaDump. Ideally, there will be no differences, so nothing to look at. If there are differences, I need to be able to understand them easily. Sometimes that will be missing lines on one side or the other indicating a different number of transitions, sometimes it will be differences between two lines (e.g. the transition point). In my use case, one would rarely, if ever, be visually examining a single file to look for anomalies, which is Paul's use case.
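As a rough illustration of that validation workflow, here's a minimal Python sketch. The function name `diff_dumps` and the "golden"/"candidate" labels are made up for the example; it just assumes both dumps have already been read into lists of lines, and uses the standard library's `difflib`.

```python
import difflib


def diff_dumps(golden, candidate):
    """Return unified-diff lines between two dumps (lists of lines).

    An empty result means the two dumps are identical.
    """
    return list(
        difflib.unified_diff(
            golden,
            candidate,
            fromfile="golden",
            tofile="candidate",
            lineterm="",  # input lines carry no trailing newline
            n=0,          # no context lines: show only what differs
        )
    )


if __name__ == "__main__":
    golden = [
        "Pacific/Honolulu",
        "1947-06-08 12:30:00Z -10:00:00 standard HST",
    ]
    # Hypothetical output from another tool with a different transition time.
    candidate = [
        "Pacific/Honolulu",
        "1947-06-08 12:00:00Z -10:00:00 standard HST",
    ]
    for line in diff_dumps(golden, candidate):
        print(line)
```

With one transition per line, a changed transition shows up as a single removed/added pair, and a missing transition as a single removed or added line.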

In terms of the users themselves - while I'd expect them to be somewhat domain experts (people writing date/time libraries) I wouldn't expect them to be dealing with this format every day - so it should really be as clear as possible without having to consult the man page each time. (I'd envisage maybe having to look at the files once every six months or year.)

The other "user" to consider is machine readability: there are some cases where it's very useful to be able to parse the file easily from code. For example, some platforms I've looked at definitely get the abbreviation wrong in many cases, so before diffing I remove the name. That's trivial to do in the current format - but much harder when some parts are optional and everything is variable width.
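To show how little code that takes when every line has the same fields, here's a small Python sketch. The helper name `strip_abbreviation` is hypothetical, and it assumes the currently documented layout, where the abbreviation is always the last field and always follows a "standard" or "daylight" indicator.

```python
INDICATORS = {"standard", "daylight"}


def strip_abbreviation(line):
    """Remove the trailing abbreviation from a transition or "Initially:" line.

    Lines that don't end in "<indicator> <abbreviation>" (zone ID headers,
    blank lines) are returned unchanged apart from trailing whitespace.
    """
    stripped = line.rstrip()
    parts = stripped.rsplit(None, 2)
    if len(parts) == 3 and parts[1] in INDICATORS:
        # Drop only the final whitespace-separated field.
        return stripped.rsplit(None, 1)[0]
    return stripped


if __name__ == "__main__":
    print(strip_abbreviation("1896-01-13 22:31:26Z -10:30:00 standard HST"))
    print(strip_abbreviation("Pacific/Honolulu"))
```

Once fields become optional, the same job needs to work out which field is which before it can drop anything.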

Regarding compactness: again, this comes down to use cases. I don't particularly mind the file being reasonably large in total, so long as each zone is simple to look at. (I don't want multiple lines per transition, for example.) When zipped, there's not much difference between my original format and the smallest one I've tested (128K vs 106K). If we can make it more compact easily, that's fine - but I personally regard that as a much lower priority than other aspects of the format.

Okay, concerns:

I don't see why we need the quoted form for the time zone ID. That's going to be a mild pain to generate robustly in terms of escaping, and it's not clear what would happen for non-ASCII characters anyway. Assuming we'll never get a line break as part of a zone ID, I think just including the ID in UTF-8 is the simplest plan. Presumably the benefit of the proposed format is that you can copy/paste it into a Unix shell to use that time zone. That's certainly not a use case that I'd personally find useful, but the quotes and TZ= part are an unnecessary distraction IMO.
Indicating daylight/standard with an arbitrary positive integer: if this is going to be a canonical format, we need to be more precise than that. Equivalent outputs should be equal. I'd also prefer it not to be an integer at all, given that it's indicating a Boolean value... where there's a number, there's an expectation (IMO) that the numeric value is meaningful. Just changing standard/daylight to s/d makes it a lot more compact, but I'd prefer std/day to be obvious. While we could omit the value for standard time, I still think there's a benefit in making every line consistent. Again, this comes down to a difference in use cases.
I'd really like colons in the UT offsets - "-103126" looks like a regular integer to me, whereas "-10:31:26" is fairly obviously 10 hours, 31 minutes and 26 seconds.
Personally I think it's simpler to think about the transition times in UT, indicated with a Z in the output. In particular, choosing the local time after the transition isn't how most people think about transitions in day-to-day conversation. If I were describing the UK rules, I'd say that in spring we advance our clocks at 1am and in the autumn we move them back at 2am... whereas in this format, that would be shown as advancing the clocks to 2am and moving them back to 1am. Just the fact that there's ambiguity suggests to me that using UT everywhere is a clearer option. The "Z" on every line is redundant, but IMO it helps with clarity.
Omitting the abbreviation when it happens to be the same as the UT offset makes the file harder to parse for very little benefit in my view. That's taking compactness further than is useful.
In terms of omitting 0 minutes and 0 seconds values: for times, I'd favour at least keeping the minutes: "2016-06-05 21:00" still looks like a date and time, whereas "2016-06-05 21" looks like a date and then 21. This isn't as much of a concern with offsets though - "+05" is reasonably clear on its own.
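For the offset points above, here's a minimal Python sketch of formatting an offset (given in seconds) with colons, optionally dropping zero seconds and then zero minutes. The function name `format_offset` and the `shorten` flag are invented for illustration, and showing a zero offset with a "+" sign is an assumption on my part.

```python
def format_offset(total_seconds, shorten=False):
    """Format a UT offset in seconds as [+-]HH:MM:SS.

    With shorten=True, a zero seconds part is dropped, and a zero
    minutes part is dropped as well when the seconds are also zero.
    """
    sign = "-" if total_seconds < 0 else "+"
    hours, remainder = divmod(abs(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    if shorten and seconds == 0:
        if minutes == 0:
            return f"{sign}{hours:02d}"
        return f"{sign}{hours:02d}:{minutes:02d}"
    return f"{sign}{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    lmt = -(10 * 3600 + 31 * 60 + 26)  # Honolulu local mean time
    print(format_offset(lmt))
    print(format_offset(-10 * 3600 - 30 * 60, shorten=True))
    print(format_offset(-10 * 3600, shorten=True))
```

Even in the shortened form the colons stay, so the value still reads as hours, minutes and seconds rather than as a plain integer.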

Six sample formats to compare for Honolulu (one of the examples given in Paul's man page), in the order of the commits in the GitHub branch. The number is the size of the file (including headers) for all zones. All of these still represent the transition in UT:

"Original" (currently documented tzvalidate) - 1,735,616 bytes

Pacific/Honolulu
Initially: -10:31:26 standard LMT
1896-01-13 22:31:26Z -10:30:00 standard HST
1933-04-30 12:30:00Z -09:30:00 daylight HDT
1933-05-21 21:30:00Z -10:30:00 standard HST
1942-02-09 12:30:00Z -09:30:00 daylight HDT
1945-09-30 11:30:00Z -10:30:00 standard HST
1947-06-08 12:30:00Z -10:00:00 standard HST

Short daylight and standard indicators - 1,463,421 bytes

Pacific/Honolulu
Initially: -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30:00 s HST
1933-04-30 12:30:00Z -09:30:00 d HDT
1933-05-21 21:30:00Z -10:30:00 s HST
1942-02-09 12:30:00Z -09:30:00 d HDT
1945-09-30 11:30:00Z -10:30:00 s HST
1947-06-08 12:30:00Z -10:00:00 s HST

Shorter offsets, but still with colons - 1,240,377 bytes

Pacific/Honolulu
Initially: -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30 s HST
1933-04-30 12:30:00Z -09:30 d HDT
1933-05-21 21:30:00Z -10:30 s HST
1942-02-09 12:30:00Z -09:30 d HDT
1945-09-30 11:30:00Z -10:30 s HST
1947-06-08 12:30:00Z -10 s HST

Shorter offsets, no colons - 1,236,955 bytes

Pacific/Honolulu
Initially: -103126 s LMT
1896-01-13 22:31:26Z -1030 s HST
1933-04-30 12:30:00Z -0930 d HDT
1933-05-21 21:30:00Z -1030 s HST
1942-02-09 12:30:00Z -0930 d HDT
1945-09-30 11:30:00Z -1030 s HST
1947-06-08 12:30:00Z -10 s HST

Variable transition times, e.g. "21" instead of "21:00:00Z" (and changing Initially to - -) - 972,361 bytes

Pacific/Honolulu
- - -10:31:26 s LMT
1896-01-13 22:31:26 -10:30 s HST
1933-04-30 12:30 -09:30 d HDT
1933-05-21 21:30 -10:30 s HST
1942-02-09 12:30 -09:30 d HDT
1945-09-30 11:30 -10:30 s HST
1947-06-08 12:30 -10 s HST

Variable transition times, but always keeping minutes - 1,079,278 bytes

Content is the same as the above, due to all the transitions happening on the half hour...

To show the difference between the last two options, here's Pago_Pago:

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11 -11 s BST
1983-11-30 11 -11 s SST

Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11:00 -11 s BST
1983-11-30 11:00 -11 s SST

(I'd prefer to keep the Z in there, admittedly - that wasn't an option I happened to code though. It's easy enough to imagine it...)

With all that in mind, I would personally prefer to stick to the currently documented tzvalidate format. For my use cases of diffing and machine parsing, the fixed-width format is useful, as is always specifying both the daylight/standard indicator and the name. I could live with the offset and time shortening, but I'd definitely prefer to have colons in the offset, and to keep minutes in the time part.

Thoughts?

Jon

On 30 May 2016 at 22:59, Paul Eggert <eggert@cs.ucla.edu> wrote:

Jon Skeet wrote:

I'd personally be willing to sacrifice a
certain amount of compactness for the sake of readability, but obviously if
we can get the size down a bit *without* losing readability, that would be
good.

Yes. Readability is to some extent in the eye of the beholder, and the proposed zgrep -i format wasn't my first choice: it evolved over some time as I used it to look at a lot of data. To some extent the format is aimed at my needs, and may be less suited for novices. For example:

TZ="America/Phoenix"
- - -072818 LMT
1883-11-18 12 -07 MST
1918-03-31 03 -06 MDT 1
1918-10-27 01 -07 MST
1919-03-30 03 -06 MDT 1
1919-10-26 01 -07 MST
1942-02-09 03 -06 MWT 1
1943-12-31 23:01 -07 MST
1944-04-01 01:01 -06 MWT 1
1944-09-30 23:01 -07 MST
1967-04-30 03 -06 MDT 1
1967-10-29 01 -07 MST

Here the columns don't line up, and although this may be a bit off-putting for some, for me it's a plus as it causes the unusual WWII non-hour transitions to stand out. Also, it's easier to visually identify the daylight-saving transitions via "1" vs nothing, than to scan through a column saying "isdst=1" vs "isdst=0". In contrast:

America/Phoenix
Initially: -07:28:18 standard LMT
1883-11-18 19:00:00Z -07:00:00 standard MST
1918-03-31 09:00:00Z -06:00:00 daylight MDT
1918-10-27 08:00:00Z -07:00:00 standard MST
1919-03-30 09:00:00Z -06:00:00 daylight MDT
1919-10-26 08:00:00Z -07:00:00 standard MST
1942-02-09 09:00:00Z -06:00:00 daylight MWT
1944-01-01 06:01:00Z -07:00:00 standard MST
1944-04-01 07:01:00Z -06:00:00 daylight MWT
1944-10-01 06:01:00Z -07:00:00 standard MST
1967-04-30 09:00:08Z -06:00:00 daylight MDT
1967-10-29 08:00:00Z -07:00:00 standard MST

Although this conveys the same information, it's harder to catch anomalies, as the nicely-aligned columns and data tend to blur into each other. For example, it's hard to spot the error that I deliberately introduced into the penultimate line of that data, whereas the same error would have been much easier to see in zgrep -i format.
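The two Phoenix listings express each transition differently: the first gives the local time just after the transition, the second gives the instant in UT. As a rough sketch of how one maps to the other, the Python below adds the new offset to the UT instant. The function name `local_after_transition` is hypothetical, and it assumes the "YYYY-MM-DD HH:MM:SSZ" and "[+-]HH:MM:SS" layouts used in the second listing.

```python
from datetime import datetime, timedelta


def local_after_transition(utc_text, offset_text):
    """Convert a UT transition instant to local time under the new offset.

    utc_text looks like "1944-01-01 06:01:00Z"; offset_text looks like
    "-07:00:00" and must start with an explicit sign.
    """
    utc = datetime.strptime(utc_text, "%Y-%m-%d %H:%M:%SZ")
    sign = -1 if offset_text.startswith("-") else 1
    hours, minutes, seconds = (int(part) for part in offset_text[1:].split(":"))
    offset = sign * timedelta(hours=hours, minutes=minutes, seconds=seconds)
    return (utc + offset).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(local_after_transition("1883-11-18 19:00:00Z", "-07:00:00"))
    print(local_after_transition("1944-01-01 06:01:00Z", "-07:00:00"))
```

Note that the 1944-01-01 UT transition lands on 1943-12-31 in local time, so even the date column can differ between the two representations.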