Right, I've now had a chance to do a bit more work on this. The various options are committed in a GitHub branch of Noda Time.
I have a few concerns about the proposed format, but I definitely agree that we need to consider the audience and use cases. The use case I'm primarily interested in is validation: diffing a "golden" file with one generated by another tool. For example, to validate that Noda Time is doing the right thing, I'd compare the output of zdump with the output of NodaTime.TzValidate.NodaDump. Ideally, there will be no differences, so nothing to look at. If there are differences, I need to be able to understand them easily. Sometimes that will be missing lines on one side or the other indicating a different number of transitions, sometimes it will be differences between two lines (e.g. the transition point). In my use case, one would rarely, if ever, be visually examining a single file to look for anomalies, which is Paul's use case.
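To make the diffing use case concrete, here is a minimal sketch in Python using the standard difflib module. The function name diff_dumps and the "golden"/"candidate" labels are hypothetical; this isn't how zdump or NodaDump do anything, just an illustration of the comparison I have in mind.

```python
import difflib


def diff_dumps(golden_text, candidate_text):
    """Return unified-diff lines between two dump texts (empty list if identical).

    Hypothetical helper for illustration only.
    """
    return list(
        difflib.unified_diff(
            golden_text.splitlines(),
            candidate_text.splitlines(),
            fromfile="golden",
            tofile="candidate",
            lineterm="",
            n=0,  # no context lines: only show what actually differs
        )
    )
```

An empty result means there's nothing to look at; otherwise each "-" or "+" line is either a missing transition or a changed one.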
In terms of the users themselves - while I'd expect them to be domain experts to some degree (people writing date/time libraries) I wouldn't expect them to be dealing with this format every day - so it should really be as clear as possible without having to consult the man page each time. (I'd envisage maybe having to look at the files once every six months or year.)
The other "user" to consider is machine readability: there are some cases where it's very useful to be able to parse the file easily from code. For example, some platforms I've looked at definitely get the abbreviation wrong in many cases, so before diffing I remove the name. That's trivial to do in the current format - but much harder when some parts are optional and everything is variable width.
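As a rough sketch of what I mean by "trivial", here is one way to remove the name in Python when every data line ends with the abbreviation. The function name strip_abbreviation is hypothetical, and it assumes that abbreviations never contain whitespace and that only transition lines start with a digit.

```python
def strip_abbreviation(line):
    """Drop the trailing abbreviation from a transition or 'Initially:' line.

    Assumes the abbreviation is always present as the last
    whitespace-separated field; any other line is returned unchanged.
    """
    stripped = line.rstrip()
    if stripped.startswith("Initially:") or stripped[:1].isdigit():
        # Split once from the right and keep everything before the name.
        return stripped.rsplit(None, 1)[0]
    return stripped
```

This only works because the name is always the last field. Once the indicator or the name becomes optional, the code has to work out which fields are present on each line before it can remove anything.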
Regarding compactness: again, this comes down to use cases. I don't particularly mind the file being reasonably large in total, so long as each zone is simple to look at. (I don't want multiple lines per transition, for example.) When zipped, there's not much difference between my original format and the smallest one I've tested (128K vs 106K). If we can make it more compact easily, that's fine - but I personally regard that as a much lower priority than other aspects of the format.
Okay, concerns:
- I don't see why we need the quoted form for the time zone ID. That's going to be a mild pain to generate robustly in terms of escaping, and it's not clear what would happen for non-ASCII characters anyway. Assuming we'll never get a line break as part of a zone ID, I think just including the ID in UTF-8 is the simplest plan. Presumably the benefit of the proposed format is that you can copy/paste it into a Unix shell to use that time zone. That's certainly not a use case that I'd personally find useful, but the quotes and TZ= part are an unnecessary distraction IMO.
- Indicating daylight/standard with an arbitrary positive integer: if this is going to be a canonical format, we need to be more precise than that. Equivalent outputs should be equal. I'd also prefer it not to be an integer at all, given that it's indicating a Boolean value... where there's a number, there's an expectation (IMO) that the numeric value is meaningful. Just changing standard/daylight to s/d makes it a lot more compact, but I'd prefer std/day to be obvious. While we could omit the value for standard time, I still think there's a benefit in making every line consistent. Again, this comes down to a difference in use cases.
- I'd really like colons in the UT offsets - "-103126" looks like a regular integer to me, whereas "-10:31:26" is fairly obviously 10 hours, 31 minutes and 26 seconds.
- Personally I think it's simpler to think about the transition times in UT, indicated with a Z in the output. In particular, choosing the local time after the transition isn't how most people think about transitions in day to day conversation. If I were describing the UK rules, I'd say that in spring we advance our clocks at 1am and in the autumn we move them back at 2am... whereas in this format, that would be shown as advancing the clocks to 2am and moving them back to 1am. Just the fact that there's ambiguity suggests to me that using UT everywhere is a clearer option. The "Z" on every line is redundant, but IMO it helps with clarity.
- Omitting the abbreviation when it happens to be the same as the UT offset makes the file harder to parse for very little benefit in my view. That's taking compactness further than is useful.
- In terms of omitting 0 minutes and 0 seconds values: for times, I'd favour at least keeping the minutes: "2016-06-05 21:00" still looks like a date and time, whereas "2016-06-05 21" looks like a date and then 21. This isn't as much of a concern with offsets though - "+05" is reasonably clear on its own.
Here are six sample formats to compare for Honolulu (one of the examples given in Paul's man page), in the order of the commits in the GitHub branch. The number is the size of the file (including headers) for all zones. All of these still represent the transition in UT:
"Original" (currently documented tzvalidate) - 1,735,616 bytes
Pacific/Honolulu
Initially: -10:31:26 standard LMT
1896-01-13 22:31:26Z -10:30:00 standard HST
1933-04-30 12:30:00Z -09:30:00 daylight HDT
1933-05-21 21:30:00Z -10:30:00 standard HST
1942-02-09 12:30:00Z -09:30:00 daylight HDT
1945-09-30 11:30:00Z -10:30:00 standard HST
1947-06-08 12:30:00Z -10:00:00 standard HST
Short daylight and standard indicators - 1,463,421 bytes
Pacific/Honolulu
Initially: -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30:00 s HST
1933-04-30 12:30:00Z -09:30:00 d HDT
1933-05-21 21:30:00Z -10:30:00 s HST
1942-02-09 12:30:00Z -09:30:00 d HDT
1945-09-30 11:30:00Z -10:30:00 s HST
1947-06-08 12:30:00Z -10:00:00 s HST
Shorter offsets, but still with colons - 1,240,377 bytes
Pacific/Honolulu
Initially: -10:31:26 s LMT
1896-01-13 22:31:26Z -10:30 s HST
1933-04-30 12:30:00Z -09:30 d HDT
1933-05-21 21:30:00Z -10:30 s HST
1942-02-09 12:30:00Z -09:30 d HDT
1945-09-30 11:30:00Z -10:30 s HST
1947-06-08 12:30:00Z -10 s HST
Shorter offsets, no colons - 1,236,955 bytes
Pacific/Honolulu
Initially: -103126 s LMT
1896-01-13 22:31:26Z -1030 s HST
1933-04-30 12:30:00Z -0930 d HDT
1933-05-21 21:30:00Z -1030 s HST
1942-02-09 12:30:00Z -0930 d HDT
1945-09-30 11:30:00Z -1030 s HST
1947-06-08 12:30:00Z -10 s HST
Variable transition times, e.g. "21" instead of "21:00:00Z" (and changing Initially to - -) - 972,361 bytes
Pacific/Honolulu
- - -10:31:26 s LMT
1896-01-13 22:31:26 -10:30 s HST
1933-04-30 12:30 -09:30 d HDT
1933-05-21 21:30 -10:30 s HST
1942-02-09 12:30 -09:30 d HDT
1945-09-30 11:30 -10:30 s HST
1947-06-08 12:30 -10 s HST
Variable transition times, but always keeping minutes - 1,079,278 bytes
Content is the same as the above, due to all the transitions happening on the half hour...
To show the difference between the last two options, here's Pago_Pago:
Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11 -11 s BST
1983-11-30 11 -11 s SST
vs
Pacific/Pago_Pago
- - +12:37:12 s LMT
1879-07-04 11:22:48 -11:22:48 s LMT
1911-01-01 11:22:48 -11 s NST
1967-04-01 11:00 -11 s BST
1983-11-30 11:00 -11 s SST
(I'd prefer to keep the Z in there, admittedly - that wasn't an option I happened to code though. It's easy enough to imagine it...)
With all that in mind, I would personally prefer to stick to the currently documented tzvalidate format. For my use cases of diffing and machine parsing, the fixed-width format is useful, as is always specifying both the daylight/standard indicator and the name. I could live with the offset and time shortening, but I'd definitely prefer to have colons in the offset, and to keep minutes in the time part.
Thoughts?
Jon