[PROPOSED PATCH 2/2] Use lz format for new tarball
Paul Eggert wrote:
please use a compression format that can be handled easily by Windows users as well. For instance, choose a format from the list that 7Zip can handle: http://www.7-zip.org/
Thanks for mentioning the problem. xz format is on 7-Zip's list; it's a tiny bit larger than lzip format for our data (0.3% larger for the draft tzdb tarball) but I suppose portability trumps this minor advantage.
Please, do not use xz for a new distribution format. The xz format is defective; see for example http://www.nongnu.org/lzip/xz_inadequate.html or the note in the third xz test at the bottom of http://www.nongnu.org/lzip/lzip_benchmark.html. If a format that 7-Zip can handle is desired, I recommend bzip2, which produces a tzdb tarball 25% smaller than gzip and only 9% larger than lzip.

Best regards, Antonio.
Antonio Diaz Diaz wrote:
Please, do not use xz for a new distribution format.
I'm not much of a fan of xz format either, which is why I suggested lz format originally, and we could easily change it back to lz. However, Oscar van Vlijmen asked for something convenient on MS-Windows; see <http://mm.icann.org/pipermail/tz/2016-August/023945.html>. What's the story on that? Why doesn't 7-Zip support lz format? Will 7-Zip support lz format any time soon? Surely it would be an easy thing to add. I'd rather not use bzip2 format, since it doesn't compress the tz tarball nearly as well. bzip2 format is 11% larger than xz format. (lz format is 0.3% smaller.)
Paul Eggert wrote:
Oscar van Vlijmen asked for something convenient on MS-Windows; see <http://mm.icann.org/pipermail/tz/2016-August/023945.html>. What's the story on that? Why doesn't 7-Zip support lz format? Will 7-Zip support lz format any time soon? Surely it would be an easy thing to add.
It is indeed an easy thing to add, and it has been requested a couple of times[1][2], but Igor Pavlov does not consider it a priority. Given that Igor is busy maintaining 7-Zip and is co-author of the xz format[3], it is difficult to tell when (or if) 7-Zip will support the lz format. But he already knows about the defects of xz, so I have hope that he may change his mind.

[1] http://sourceforge.net/p/sevenzip/discussion/45797/thread/9e6de893/
[2] http://sourceforge.net/p/sevenzip/discussion/45797/thread/32771d84/
[3] http://tukaani.org/xz/xz-file-format.txt
I'd rather not use bzip2 format, since it doesn't compress the tz tarball nearly as well. bzip2 format is 11% larger than xz format. (lz format is 0.3% smaller.)
I haven't got a real tzdb tarball, but I have made an approximate one by combining tzcode2016f and tzdata2016f. The sizes of the resulting compressed files are not big, and the size difference between bzip2 and lzip is about 32 kB, which is not a problem even for dial-up users.

-rw-r--r-- 1 1597440 2016-08-28 16:17 tzdb.tar
-rw-r--r-- 1  377682 2016-08-28 16:17 tzdb.tar.bz2
-rw-r--r-- 1  508373 2016-08-28 16:17 tzdb.tar.gz
-rw-r--r-- 1  346119 2016-08-28 16:17 tzdb.tar.lz
-rw-r--r-- 1  346928 2016-08-28 16:17 tzdb.tar.xz

If the lack of support in 7-Zip is an obstacle to the adoption of lzip, then I would consider bzip2 the second-best choice; it decompresses safely on all platforms at the only cost of an unimportant increase in tarball size. IMO gzip is also fine. Xz is the only format that I consider should be avoided.

Best regards, Antonio.
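The size ordering Antonio reports can be reproduced with ordinary library calls. A minimal Python sketch using synthetic stand-in data (the Python standard library offers gzip, bzip2 and xz, but has no lzip module, so lzip is absent from this comparison, and the exact ratios depend heavily on the input):

```python
import bz2
import gzip
import lzma

# Stand-in for the tzdb tarball contents: any large, text-heavy
# input shows roughly the same ordering of compressed sizes.
data = b"Zone America/New_York -4:56:02 - LMT 1883 Nov 18 12:03:58\n" * 2000

sizes = {
    "raw": len(data),
    "gz": len(gzip.compress(data, 9)),
    "bz2": len(bz2.compress(data, 9)),
    "xz": len(lzma.compress(data, preset=9)),
}

for fmt, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{fmt:>4}: {size} bytes")
```

On real tzdb data, Antonio's listing shows bzip2 beating gzip by a wide margin and xz/lzip a little ahead of bzip2; this sketch only demonstrates the mechanics of such a comparison, not those exact figures.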
Antonio Diaz Diaz wrote:
It is indeed an easy thing to add, and it has been requested a couple times[1][2], but Igor Pavlov does not consider it a priority.
If it's that easy to add, perhaps you could do that and send in a patch. Even if it's low priority for the maintainer, if the code and documentation are already written it shouldn't be hard for the maintainer to install a patch. This would help encourage the use of lz format.
I would consider bzip2 the second best choice; it decompresses safely on all platforms at the only cost of an unimportant increase in tarball size. IMO gzip is also fine. Xz is the only format that I consider should be avoided.
bzip2 is about 11% bigger than lzip for our purposes, though. The .bz2 combined file is bigger than the gzipped data file, which is a downer:

$ ls -l tz*.tar.*z*
-rw-r--r-- 1 eggert eggert 202609 Aug 30 14:00 tzcode2016X.tar.gz
-rw-r--r-- 1 eggert eggert 394169 Aug 30 14:00 tzdata2016X.tar.gz
-rw-r--r-- 1 eggert eggert 426667 Aug 30 14:10 tzdb-2016X.tar.bz2
-rw-r--r-- 1 eggert eggert 382991 Aug 30 14:00 tzdb-2016X.tar.lz

As there are multiple free MS-Windows-based utilities that can decompress lzip format, I guess we can ask our MS-Windows users to use one. They can continue to use the existing gzip-based tarballs as well, since they will be distributed for a while.

So, I'm inclined to go back to .lz format despite the lack of current 7-Zip support, as in the attached proposed tz patch.
So between the current distribution scheme and the proposed new scheme we’re saving about 210K total. Do we really need to worry about that in 2016? Is it worth the disruption to existing workflows? Unless some people are still using dialup, I don’t understand why this is important. Far more important is widespread support.

What is the goal for making these changes to the distribution format? It’s going to cause a lot of work for a lot of people. What’s the justification? I think most customers of tz would be quite happy for the distribution to continue in its current form forever. What would you say to convince them this change is worth the effort?

Thanks,
Debbie
On Aug 30, 2016, at 2:18 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Antonio Diaz Diaz wrote:
It is indeed an easy thing to add, and it has been requested a couple times[1][2], but Igor Pavlov does not consider it a priority.
If it's that easy to add, perhaps you could do that and send in a patch. Even if it's low priority for the maintainer, if the code and documentation are already written it shouldn't be hard for the maintainer to install a patch. This would help encourage the use of lz format.
I would consider bzip2 the second best choice; it decompresses safely on all platforms at the only cost of an unimportant increase in tarball size. IMO gzip is also fine. Xz is the only format that I consider should be avoided.
bzip2 is about 11% bigger than lzip for our purposes, though. The .bz2 combined file is bigger than the gzipped data file, which is a downer:
$ ls -l tz*.tar.*z*
-rw-r--r-- 1 eggert eggert 202609 Aug 30 14:00 tzcode2016X.tar.gz
-rw-r--r-- 1 eggert eggert 394169 Aug 30 14:00 tzdata2016X.tar.gz
-rw-r--r-- 1 eggert eggert 426667 Aug 30 14:10 tzdb-2016X.tar.bz2
-rw-r--r-- 1 eggert eggert 382991 Aug 30 14:00 tzdb-2016X.tar.lz
As there are multiple free MS-Windows-based utilities that can decompress lzip format, I guess we can ask our MS-Windows users to use one. They can continue to use the existing gzip-based tarballs as well, since they will be distributed for a while.
So, I'm inclined to go back to .lz format despite the lack of current 7-Zip support, as in the attached proposed tz patch. <0001-Go-back-to-lz-tarball-improve-documentation.patch>
Deborah Goldsmith wrote:
What is the goal for making these changes to the distribution format?
1. Simplifying the distribution process. This can be seen in the proposed changes to tz-link.htm describing how to download and extract the release. The current tz-link.htm uses five shell commands to do that; the proposed tz-link.htm cuts this to two. These simplifications will make configuration errors less likely, as well as make it easier for newcomers.

2. Shrinking the distribution tarball. This matters less, but while we're doing (1) we might as well do (2). Not everyone is as well-off and well-connected as Apple and UCLA.
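The five-to-two simplification comes from replacing the separate code and data archives with one combined tarball, so a single fetch-and-extract step suffices. A small Python sketch of the single-archive workflow, using an in-memory stand-in archive (the member names here are made up, not the real release contents, and gzip stands in for lzip, which the stdlib lacks):

```python
import io
import tarfile

# Build a tiny stand-in for a combined tzdb release tarball in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    for name, text in [("tzdb-2016x/zic.c", "/* code */"),
                       ("tzdb-2016x/europe", "# data")]:
        payload = text.encode()
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# One combined archive means one extract step, versus fetching and
# unpacking separate tzcode/tzdata tarballs into the same directory.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
print(names)
```

Both code and data land in one directory tree from one archive, which is the gist of the proposed two-command recipe.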
It’s going to cause a lot of work for a lot of people.
It'll be a bit of work at the start, to change unpacking scripts. But the changes are small, and it should save some work in the long run. And there's no rush, as the old-format tarballs will continue to be distributed.
most customers of tz would be quite happy for the distribution to continue in its current form forever.
Change is such a pain! :-) That being said, the distribution format has changed over time and it will change in the future. This is inevitable in any successful, long-running project.

Alexander Belopolsky wrote:
If the size of data distribution is a concern, it looks like one can achieve a much better compression by simply discarding comments
But the comments are the best part! :-)
On Tue, Aug 30, 2016 at 05:21:09PM -0700, Paul Eggert <eggert@cs.ucla.edu> wrote:
1. Simplifying the distribution process. This can be seen in the proposed changes to tz-link.htm describing how to download and extract the release. The current tz-link.htm uses five shell commands to do that; the proposed tz-link.htm cuts this to two. These simplifications will make configuration errors less likely, as well as make it easier for newcomers.
At the expense of a rather obscure compression format (lzip), support for which might have to be downloaded, compiled and installed first. For example, openwrt doesn't have lzip, and generally no compiler available, so lzip-only would make it impossible to install new zoneinfo data on these kinds of platforms currently.

By comparison, gzip, bzip2 and (a bit less so) xz are readily available on most platforms - busybox, for example, has support for all three. openwrt might be an extreme example, but most smaller gnu/linux distros also have no lzip package, while all have gzip and bzip2, and most have xz.
2. Shrinking the distribution tarball. This matters less, but while we're doing (1) we might as well do (2). Not everyone is as well-off and well-connected as Apple and UCLA.
Having a data-only (comment-stripped) release might go much further toward that goal - if the aim is large-scale distribution and frequent updates, then distributing a 24 kB file (in addition) is much easier than a 380 kB one. Thinking of many smaller end devices such as phones, a 24 kB download might still be nontrivial, but it is much easier to do as part of some urgent update than a 380 kB one, for example, and much easier to justify.
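The comment-stripped idea can be sketched directly: tz source files mark commentary with leading '#', so a data-only artifact is just the non-comment lines. This Python sketch uses synthetic stand-in text (not the real files, so the sizes are illustrative, not Marc's 24 kB / 380 kB figures):

```python
import bz2

# Hypothetical tz-source-style content: '#' lines are commentary,
# the rest are rule/zone data lines.
source = "\n".join(
    [f"# Historical note {i}: prose about local clocks and sources." for i in range(500)]
    + ["Zone\tEtc/UTC\t0:00\t-\tUTC"] * 50
).encode()

# Keep only the data lines (a real tool would be more careful about
# continuation lines and inline comments).
stripped = b"\n".join(
    line for line in source.splitlines() if not line.startswith(b"#")
)

full_size = len(bz2.compress(source, 9))
data_only_size = len(bz2.compress(stripped, 9))
print(full_size, data_only_size)
```

Even after compression, the commentary dominates, which is why a data-only release would be so much smaller.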
But the comments are the best part! :-)
Sure, but I would say statistically, 100.00%(*) of the people downloading zoneinfo data will never have a look, especially on embedded devices.

(*) some rounding might be necessary

-- Marc Lehmann <schmorp@schmorp.de>
On 31 August 2016 at 08:53, Marc Lehmann <schmorp@schmorp.de> wrote:
On Tue, Aug 30, 2016 at 05:21:09PM -0700, Paul Eggert <eggert@cs.ucla.edu> wrote:
1. Simplifying the distribution process. This can be seen in the proposed changes to tz-link.htm describing how to download and extract the release. The current tz-link.htm uses five shell commands to do that; the proposed tz-link.htm cuts this to two. These simplifications will make configuration errors less likely, as well as make it easier for newcomers.
At the expense of a rather obscure compression format (lzip), support for which might have to be downloaded, compiled and installed first.
... and which may not have as comprehensive programmatic support on other platforms, either. For example, in Noda Time, the tools I've written use common .NET libraries to perform the data extraction. That means without installing anything, I can just point the tools straight at the IANA website. That's definitely feasible for zip, tar, tar.gz and bzip2 files, but may well not be the case for xz or lzip. I don't know how many people this would affect other than myself, of course, but there may well be others.

It feels like there's a pretty simple solution here: host multiple files on IANA, and let users pick which one they want. I very much doubt that storage size is a concern, and that way users on poor network connections but with advanced compression tools can use those, and others with fewer bandwidth concerns but more tooling concerns can use more common formats. I would hope it wouldn't make the release process significantly more complicated - just multiple commands to build the multiple distributable files.

Now, that raises the possibility of *also* keeping the code/data split as they currently are, as well as comprehensive distributable files - I have no particular preference on that front.

Jon
On Aug 31, 2016, at 4:30 AM, Jon Skeet <skeet@pobox.com> wrote:
For example, in Noda Time, the tools I've written use common .NET libraries to perform the data extraction. That means without installing anything, I can just point the tools straight at the IANA website. That's definitely feasible for zip, tar, tar.gz and bzip2 files, but may well not be the case for xz or lzip.
I don't know how many people this would affect other than myself, of course, but there may well be others.
My library (https://github.com/HowardHinnant/date) also automatically downloads and installs the tzdataYYYYV.tar.gz from http://www.iana.org/time-zones:

https://github.com/HowardHinnant/date/blob/master/tz.cpp#L231-L237

We’ve just gotten this working on macOS, Linux, iOS and Windows.

Howard
Mine does as well (Time Zone Master).

On 2016-08-31 10:18, Howard Hinnant wrote:
On Aug 31, 2016, at 4:30 AM, Jon Skeet <skeet@pobox.com> wrote:
For example, in Noda Time, the tools I've written use common .NET libraries to perform the data extraction. That means without installing anything, I can just point the tools straight at the IANA website. That's definitely feasible for zip, tar, tar.gz and bzip2 files, but may well not be the case for xz or lzip.
I don't know how many people this would affect other than myself, of course, but there may well be others.
My library (https://github.com/HowardHinnant/date) also automatically downloads and installs the tzdataYYYYV.tar.gz from http://www.iana.org/time-zones:
https://github.com/HowardHinnant/date/blob/master/tz.cpp#L231-L237
We’ve just gotten this working on macOS, Linux, iOS and Windows.
Howard
On Tue, Aug 30, 2016, at 20:21, Paul Eggert wrote:
Alexander Belopolsky wrote:
If the size of data distribution is a concern, it looks like one can achieve a much better compression by simply discarding comments
But the comments are the best part! :-)
It might make sense, long-term, to transition most of the comments to a conventional documentation format (HTML, maybe) and host it on a website.
Heck, just hosting all the files in an easily navigable (and ideally diffable) format, for all the different versions, would be great, leaving aside any other features. I might look into doing that myself...

On 1 September 2016 at 17:52, Random832 <random832@fastmail.com> wrote:
On Tue, Aug 30, 2016, at 20:21, Paul Eggert wrote:
Alexander Belopolsky wrote:
If the size of data distribution is a concern, it looks like one can achieve a much better compression by simply discarding comments
But the comments are the best part! :-)
It might make sense, long-term, to transition most of the comments to a conventional documentation format (HTML, maybe) and host it on a website.
From: tz-bounces@iana.org [mailto:tz-bounces@iana.org] On Behalf Of Random832
Sent: 01 September 2016 17:52
To: tz@iana.org
Subject: Re: [tz] [PROPOSED PATCH 2/2] Use lz format for new tarball
On Tue, Aug 30, 2016, at 20:21, Paul Eggert wrote:
Alexander Belopolsky wrote:
If the size of data distribution is a concern, it looks like one can achieve a much better compression by simply discarding comments
But the comments are the best part! :-)
It might make sense, long-term, to transition most of the comments to a conventional documentation format (HTML, maybe) and host it on a website.
Please don't consider "transitioning" the comments to anywhere; that would just disconnect the comments from the very thing that they are commenting on and make it harder to use version control logs to see _why_ the data changed.

I know it isn't part of the mainstream purpose of the TZ project, but I find the historical data useful - at the very least, interesting - including how the perceived reliability of sources for that data changes over time. It is rare to find a publicly available real-life data set that illustrates such things - especially one where pretty much anyone can understand what the data represents - and that makes TZ useful as a discussion tool. A very off-topic use-case, I admit, but...

Regards,
Stephen Goudge
Senior Software Engineer, Petards
"Goudge, Stephen" <stephen.goudge@petards.com> wrote: |> From: tz-bounces@iana.org [mailto:tz-bounces@iana.org] On Behalf Of |> Random832 |> Sent: 01 September 2016 17:52 |> To: tz@iana.org |> Subject: Re: [tz] [PROPOSED PATCH 2/2] Use lz format for new tarball |> |> On Tue, Aug 30, 2016, at 20:21, Paul Eggert wrote: |>> Alexander Belopolsky wrote: |>>> If the size of data distribution is a concern, it looks like one |>> can > achieve a much better compression by simply discarding comments |>> |>> But the comments are the best part! :-) |> |> It might make sense, long-term, to transition the most of the comments \ |> to a |> conventional documentation format (HTML, maybe) and host it on a website. | |Please don't consider "transitioning" the comments to anywhere, that \ .. |of the TZ project but I find the historical data useful - at the very \ |least, interesting - including how the perceived reliability of sources \ |for that data change over time. It is rare to find a publically available \ |real-life data set that illustrates such things - especially one where \ |pretty much anyone can understand what the data represents - and that \ |makes TZ useful as a discussion tool. A very off-topic use-case I admit, \ |but... I absolutely concur. It would be nice if these words lead to a draw. \(^_^)/ I for one repeat that i think these are very precious, i was thrilled once i detected them first, that much is for sure. In times where old traditions are lost, and the majority of the western world wouldn't know how to survive if all supermarkets would stay closed, but have huge barbecues if not, i think small and silent, beloved beauties are more worth than ever. Just think of that gigantic «United States of America» episode in «11'09"01 September 11» – just the very same. Rest very off-topic indeed. Yet, the person who brought up the idea is both, mathematician and Python programmer, and these two attributes can't replace the latter two of Power. Beauty. Soul. 
I am personally also in doubt of the first. Because, if you don't split up character set names like "insert whitespace for ![[:alnum:][:space:]]", insert whitespace in between [:alpha:] and [:digit:], and squeeze multiple [:space:] to a single one, you end up with many more entries than you would have if you had split in between [:alpha:] and [:digit:]. If I had been born where the elks live I would possibly even say that only morons don't split in between [:alpha:] and [:digit:]; it simply cannot be understood why you wouldn't if you look at it that way. I admit it is not as bad as I say if you look at it in an interpreted high-level language.

--steffen
Comments below. Thanks, Debbie
On Aug 30, 2016, at 5:21 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
Deborah Goldsmith wrote:
What is the goal for making these changes to the distribution format?
1. Simplifying the distribution process. This can be seen in the proposed changes to tz-link.htm describing how to download and extract the release. The current tz-link.htm uses five shell commands to do that; the proposed tz-link.htm cuts this to two. These simplifications will make configuration errors less likely, as well as make it easier for newcomers.
It’s going to require changing (multiple) existing workflows that are working just fine. If support for the existing format is discontinued, this will be a substantial amount of work for many.
2. Shrinking the distribution tarball. This matters less, but while we're doing (1) we might as well do (2). Not everyone is as well-off and well-connected as Apple and UCLA.
Is there any evidence anyone cares about the minor differences we’re discussing? The difference between bz2 (widely supported) and lz (not as widely supported) is 43K. Why put the onus on consumers of the data to find an implementation of a less-known compression scheme their platform doesn’t support?
It’s going to cause a lot of work for a lot of people.
It'll be a bit of work at the start, to change unpacking scripts. But the changes are small, and it should save some work in the long run. And there's no rush, as the old-format tarballs will continue to be distributed.
You’re making assumptions about how people are consuming these files. Not everyone is using an “unpacking script.” Other people have already chimed in with comments along these lines. There are all sorts of different workflows consuming this data beyond the canonical one.
most customers of tz would be quite happy for the distribution to continue in its current form forever.
Change is such a pain! :-) That being said, the distribution format has changed over time and it will change in the future. This is inevitable in any successful, longrunning project.
RFCs are still plain text. .tar format hasn’t changed in decades. Bytes are still 8 bits. Most of us still type on QWERTY keyboards. Widely-used standards change slowly or not at all, with very good reason. Yes, change is a pain. Change is certainly warranted when the benefits outweigh the costs. I don’t think the costs are being weighed properly, though, and the benefits don’t seem that great. Are there any significant benefits beyond an easier format for newcomers?
Alexander Belopolsky wrote:
If the size of data distribution is a concern, it looks like one can achieve a much better compression by simply discarding comments
But the comments are the best part! :-)
I'd like to second Deborah Goldsmith's comments. Change is hugely disruptive, and the tz package is a core element of many critical systems. We don't have the luxury of making little changes, unlike some other packages. Ideally the current distribution format should last for centuries. Supplementing it is fine, but removing it should be off the table. In short, "No, you don't get to simplify." This all looks like a solution in search of a problem.

----------------------------------------

Further, I remain deeply concerned that we are not pushing changes out to releases fast enough, and not using enough of an enterprise perspective. On Aug. 8, Paul remarked that he expected an October release would be the first to incorporate the leap second changes for this December. That is 2 months; that is not a comfortable amount of time. The strange fiction of the "experimental" github repo does not really help here. The idea that we might all be using something different while thinking we are "current" is disquieting. Sorry to stir up old issues.

----------------------------------------

p.s.: Maybe someone can dig up some RISKS entries of classic times where people "just wanted to simplify" and disaster struck.

--jhawk@mit.edu
John Hawkinson
On Sep 1, 2016, at 3:08 PM, John Hawkinson <jhawk@mit.edu> wrote:
On Aug. 8, Paul remarked that he expected an October release would be the first to incorporate the leap second changes for this December. That is 2 months; that is not a comfortable amount of time.
I second the notion that leap second updates should be accompanied with a greater sense of urgency. My library provides leap-second-aware clocks which should be accurate within a second 5 months into the future. https://howardhinnant.github.io/date/tz.html#utc_clock https://howardhinnant.github.io/date/tz.html#tai_clock https://howardhinnant.github.io/date/tz.html#gps_clock Howard
On 2016-09-01 13:08, John Hawkinson wrote:
I'd like to second Deborah Goldsmith's comments.
Change is hugely disruptive and the tz package is a core element of many critical systems. We don't have the luxury of making little changes, unlike some others packages. Ideally the current distribution format should last for centuries. Supplementing it is fine, but removing it should be off the table.
For a commercial example, the Oracle/Sun Java TZupdater tool documents at http://www.oracle.com/technetwork/java/javase/tzupdater-readme-136440.html that by default it assumes the latest data is available at http://www.iana.org/time-zones/repository/tzdata-latest.tar.gz. I am sure I am not the only one with a script to check daily whether the symlink destination changes, and to download and apply the updates locally to systems with zoneinfo or Java, pending availability of officially updated releases, some of which may lag IANA releases by days, weeks, or months on some platforms. Many organizations have policies that only officially blessed vendor releases may be loaded and tested on systems.
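A daily "did the symlink target change" check like the one described above can boil down to comparing a digest of the fetched file between runs. A hedged Python sketch: the actual URL fetch is omitted, and the file name and contents here are placeholders standing in for successive downloads of tzdata-latest.tar.gz:

```python
import hashlib
import pathlib
import tempfile

def digest(path):
    """SHA-256 of a file's bytes; a change implies a new release was published."""
    return hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()

# Simulate two daily downloads of the 'latest' file. A real script would
# fetch http://www.iana.org/time-zones/repository/tzdata-latest.tar.gz and
# compare against the digest recorded on the previous run.
with tempfile.TemporaryDirectory() as tmp:
    snapshot = pathlib.Path(tmp) / "tzdata-latest.tar.gz"
    snapshot.write_bytes(b"placeholder for release 2016f")
    previous = digest(snapshot)
    snapshot.write_bytes(b"placeholder for release 2016g")  # symlink now points elsewhere
    changed = digest(snapshot) != previous

print(changed)
```

When `changed` is true, the script would download and apply the new release to local zoneinfo/Java installs.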
On Aug. 8, Paul remarked that he expected an October release would be the first to incorporate the leap second changes for this December. That is 2 months; that is not a comfortable amount of time.
I suspect that right zoneinfo is not hugely important to the vast majority of tz users, but it is probably critical to those who do use it for their work, and updates will require work dealing with future times to be redone. Those who require it have to download the IERS file as soon as it is updated, regenerate the right data with every tz release, then rerun all their own dependent processes. -- Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
Brian Inglis wrote:
I suspect that right zoneinfo is not hugely important to the vast majority of tz users, but it is probably critical to those who do use it for their work, and updates will require work dealing with future times to be redone.
I doubt whether there are many practical end-user applications of this sort. The usages I've seen are regression testing and so forth, which need to deal with more-urgent tz updates anyway. Serious astronomical users must deal not only with leap seconds but also with a lot of other things, and I expect that they have longstanding procedures for this stuff in place and do not need or use tz's leap seconds. Any end users who need a tz-based leap-second table accurate months into the future have had to deal with this problem for many years, as we've been reasonably lackadaisical about generating new releases if the only reason was a leap-second that is months in the future. If this were a significant issue, it would be straightforward for a downstream distributor like Ubuntu or FreeBSD to use the latest leapseconds file from NIST or from GitHub. I don't know of any distributor who's done so.
On 2016-09-01 13:08, John Hawkinson wrote:
p.s.: Maybe someone can dig up some RISKS entries of classic times where people "just wanted to simplify" and disaster struck.
http://catless.ncl.ac.uk/Risks/search?query=simplify mostly provides a few examples over the years.
On 2016-09-01 12:56, Deborah Goldsmith wrote:
What is the goal for making these changes to the distribution format?
On Aug 30, 2016, at 5:21 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:

2. Shrinking the distribution tarball. This matters less, but while we're doing (1) we might as well do (2). Not everyone is as well-off and well-connected as Apple and UCLA.

Is there any evidence anyone cares about the minor differences we’re discussing? The difference between bz2 (widely supported) and lz (not as widely supported) is 43K. Why put the onus on consumers of the data to find an implementation of a less-known compression scheme their platform doesn’t support?
GNU now provides .lz with new packages, but they also provide the .gz to support older systems. Lzip appears to have advantages where only poor and/or slow connections are available. Thus there would also be a size advantage in distributing the test data as a separate archive in all the formats desired. A case could be made for also providing .zip archives for those who deal only with .NET and Java on Windows (one of those is now MS for UWP and Store apps).
It’s going to cause a lot of work for a lot of people.

It'll be a bit of work at the start, to change unpacking scripts. But the changes are small, and it should save some work in the long run. And there's no rush, as the old-format tarballs will continue to be distributed.

Alexander Belopolsky wrote:
If the size of data distribution is a concern, it looks like one can achieve a much better compression by simply discarding comments

But the comments are the best part! :-)

+1 junk the test data used by few, over the comments enjoyed by many!
Deborah Goldsmith said:
RFCs are still plain text. .tar format hasn't changed in decades. Bytes are still 8 bits.
Actually, no they aren't. I've spent the last several years programming on a processor where a byte is 16 bits - that is, the C type "unsigned char" can hold values from 0 to 65535 inclusive.

Incidentally, that's something like the third commonest processor in the world. I suspect most of you have owned one at one time or another.

-- Clive D.W. Feather <clive@davros.org>
On Sep 1, 2016, at 2:40 PM, Clive D.W. Feather <clive@davros.org> wrote:
Deborah Goldsmith said:
RFCs are still plain text. .tar format hasn't changed in decades. Bytes are still 8 bits.
Actually, no they aren't. I've spent the last several years programming on a processor where a byte is 16 bits - that is, the C type "unsigned char" can hold values from 0 to 65535 inclusive.
Incidentally, that's something like the third commonest processor in the world. I suspect most of you have owned one at one time or another.
...as part of some larger machine you own, doing digital signal processing inside that machine.
Deborah Goldsmith wrote:
It’s going to require changing (multiple) existing workflows that are working just fine.
It sounds like you're not alone with these concerns. OK, then let's keep supporting the existing distribution format indefinitely. Plus, it sounds like we should stick with the current set of files in the format, rather than add the (somewhat large) .tzs file to the data tarball.

Instead, I plan to call the new-format tarball "experimental" for now. This is probably just as well, as that should give more freedom to try other things out too. The experimental tarball can contain the .tzs file without disrupting the traditional format. And if this experiment doesn't work out, we can stop distributing this new-format tarball in future releases, or change the new format. To do all this, I plan to send out a proposed patchset shortly.

As for choice of compression format, the new format is intended for developers, not for end-user devices. So I'm not concerned whether some embedded platforms lack support for .lz or for .xz or whatever. Developers can easily get support for all the formats discussed, and we might as well pick a format that compresses well. It is an experiment after all.
This sentence concerns me - the rest of the message sounded good:

Developers can easily get support for all the formats discussed
That's just not true. Or rather, it's not true in a convenient fashion. The "tried and tested" compression formats (gzip, zip, even bzip2) have good API support within a broad range of programming languages. It's *much* easier to write code to deal with a tar.gz file than it is to:

a) ensure that the xz tool is installed;
b) create a temporary directory;
c) shell out to run the tool and check that it was successful;
d) use the files;
e) clean up the temporary directory.

If the only developer use of the files was "extract them from the command line and look at them" then I'd be fine with a relatively obscure compression format, but as developers tend to want to write tools to *use* the files - preferably without always having to go through the extraction process first - the benefit of wide interoperability trumps compression sizes, for me at least. And as we're only talking about a relatively small number of machines, a difference in compression size won't affect much network traffic.

Jon
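To make the contrast concrete, here is a minimal Python sketch of what "good API support" means in practice: a .tar.gz built and read back entirely in memory with only the standard library, with no external tool, no temporary directory, and no shelling out. The member name and payload below are invented examples.

```python
import io
import tarfile

# Synthetic stand-in for a data file; any bytes work the same way.
payload = b"Zone\tEurope/London\t0:00\tEU\tGMT/BST\n"

# Write a .tar.gz entirely in memory.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="europe")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# Read one member straight back out, without extracting to disk.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    member = tar.extractfile("europe").read()

assert member == payload
```

(For completeness: CPython's tarfile module has also accepted mode "r:xz" since Python 3.3, while lzip has no standard-library codec in mainstream languages - which is much of what this thread is about.)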
I agree with everything Jon just said. Using an unusual and difficult-to-find compression tool is the wrong engineering trade-off for this project. Sorry, but it really must be zip, gzip or bzip2.

Stephen

On 6 September 2016 at 06:38, Jon Skeet <skeet@pobox.com> wrote:
This sentence concerns me - the rest of the message sounded good:
Developers can easily get support for all the formats discussed
That's just not true. Or rather, it's not true in a convenient fashion. The "tried and tested" compression formats (gzip, zip, even bzip2) have good API support within a broad range of programming languages. It's much easier to write code to deal with a tar.gz file than it is to:

a) ensure that the xz tool is installed;
b) create a temporary directory;
c) shell out to run the tool and check that it was successful;
d) use the files;
e) clean up the temporary directory.
If the only developer use of the files was "extract them from the command line and look at them" then I'd be fine with a relatively obscure compression format, but as developers tend to want to write tools to use the files - preferably without always having to go through the extraction process first - the benefit of wide interoperability trumps compression sizes, for me at least. And as we're only talking about a relatively small number of machines, a difference in compression size won't affect much network traffic.
Jon
I agree as well. Thanks.

On 2016-09-06 03:22, Stephen Colebourne wrote:
I agree with everything Jon just said. Using an unusual and difficult to find compression tool is the wrong engineering trade off for this project. Sorry, but it really must be zip, gzip or bzip2.
Stephen
On 6 September 2016 at 06:38, Jon Skeet <skeet@pobox.com> wrote:
This sentence concerns me - the rest of the message sounded good:
Developers can easily get support for all the formats discussed
That's just not true. Or rather, it's not true in a convenient fashion. The "tried and tested" compression formats (gzip, zip, even bzip2) have good API support within a broad range of programming languages. It's much easier to write code to deal with a tar.gz file than it is to:

a) ensure that the xz tool is installed;
b) create a temporary directory;
c) shell out to run the tool and check that it was successful;
d) use the files;
e) clean up the temporary directory.
If the only developer use of the files was "extract them from the command line and look at them" then I'd be fine with a relatively obscure compression format, but as developers tend to want to write tools to use the files - preferably without always having to go through the extraction process first - the benefit of wide interoperability trumps compression sizes, for me at least. And as we're only talking about a relatively small number of machines, a difference in compression size won't affect much network traffic.
Jon
On 09/05/2016 10:38 PM, Jon Skeet wrote:
The "tried and tested" compression formats (gzip, zip, even bzip2)
If the overriding goal is a compression format that works everywhere now, then gzip is clearly the way to go, as we're already using it and all other formats are therefore more likely to cause a problem on some platform somewhere.

Luckily, though, that's not an absolute goal. Although lzip format is not everybody's preference, this is also true for other compression formats, and lzip is a reasonable choice for the problems that the new distribution attempts to address. Our continuing to distribute gzip-format tarballs will address backwards-compatibility concerns.

Besides, it's not like this is our first rodeo. We formerly used Lempel-Ziv format, and switched to gzip format before gzip was nearly-universally installed and supported.

PS. In looking into this today I found and partly fixed some year-2038 bugs in gzip! (Everything is connected....) See: http://bugs.gnu.org/24385
On 6 September 2016 at 23:31, Paul Eggert <eggert@cs.ucla.edu> wrote:
On 09/05/2016 10:38 PM, Jon Skeet wrote:
The "tried and tested" compression formats (gzip, zip, even bzip2)
If the overriding goal is a compression format that works everywhere now, then gzip is clearly the way to go, as we're already using it and all other formats are therefore more likely to cause a problem on some platform somewhere.
Luckily, though, that's not an absolute goal. Although lzip format is not everybody's preference, this is also true for other compression formats, and lzip is a reasonable choice for problems that the new distribution attempts to address.
So whose preference *is* lzip? You've seen a number of us express our dissatisfaction - who actually benefits from this? Sure, it means downloading slightly less data - but whose priority is that, whose "absolute goal"? We've already agreed that this is aimed at developers rather than end users - and I and others (as developers) have expressed our preference for gzip (or zip or bzip2). Who has expressed a significant preference for lzip, out of the target audience?
Our continuing to distribute gzip-format tarballs will address backwards-compatibility concerns.
Yes, those of us who prefer a compression format which is widely supported within programming languages can use the existing distribution format - but can't take advantage of any of the benefits of the new format. Why not bring the benefits of the new format to more people?
Besides, it's not like this is our first rodeo. We formerly used Lempel-Ziv format, and switched to gzip format before gzip was nearly-universally installed and supported.
I don't see that that's any argument for changing now.

One option I'll repeat as it seems to have been missed: distribute multiple formats. If there's already going to be the backwards-compatible gzip files and the new lz file, why not *also* have a new gzip file?

Jon
Jon Skeet wrote:
So whose preference *is* lzip?
Mine, mostly. Antonio Diaz Diaz also expressed a preference for it. Admittedly he's biased, as he is an lzip maintainer. (I'm biased too, as I'm a gzip maintainer....)
I don't see that that's any argument for changing now.
The point is that after we changed to gzip format, things turned out all right. This sort of change is not as much work as one might fear.
One option ...: distribute multiple formats.
You mean .gz, .lz, .bz2, .zip, etc.? That sounds like it'd be a bit more work for me, for the staff, and for newcomers trying to navigate through the distribution.

If you like, though, you can take on part of that job, and it might be helpful to do so as this is a good time to experiment with distribution formats anyway. You could maintain a downstream server, say, one that delivers other distribution formats. Right now in the experimental GitHub version, for example, 'make tarballs' generates a file tzdb-2016f-41-g6d70eda.tar.lz, corresponding to the 41st Git commit after 2016f, with abbreviated hash 6d70eda. So, your web server could convert that to (say) bzip2 compression format, and redistribute a file named tzdb-2016f-41-g6d70eda.tar.bz2. That might be a nice thing to have.
On 7 September 2016 at 08:46, Paul Eggert <eggert@cs.ucla.edu> wrote:
Jon Skeet wrote:
So whose preference *is* lzip?
Mine, mostly. Antonio Diaz Diaz also expressed a preference for it. Admittedly he's biased, as he is an lzip maintainer. (I'm biased too, as I'm a gzip maintainer....)
I don't see that that's any argument for changing now.
The point is that after we changed to gzip format, things turned out all right. This sort of change is not as much work as one might fear.
Perhaps the situation is not the same as it was before, though? Perhaps the data is being used in different contexts?
One option ...: distribute multiple formats.
You mean .gz, .lz, .bz2, .zip, etc.? That sounds like it'd be a bit more work for me, for the staff, and for newcomers trying to navigate through the distribution.
It sounds like it *should* be a matter of adding a few lines per distribution format to whatever scripts are being used. Helping newcomers should be reasonably straightforward with documentation: "We have two kinds of distribution packages: one has separate code and data (point to the existing gzip files), and one includes (describe new contents here). The latter package is available in zip, gzip and lz formats." There are plenty of other packages which allow a choice of compression format - for example, look at the Apache distribution site (https://httpd.apache.org/download.cgi) which allows the source download as tar.gz or tar.bz2. I do acknowledge that it would be more work, but I suspect it would be less work than the time we've all taken to write all these emails so far, and it should be a one-time job... it shouldn't involve any more *ongoing* maintenance.

If you like, though, you can take on part of that job, and it might be helpful to do so as this is a good time to experiment with distribution formats anyway. You could maintain a downstream server, say, one that delivers other distribution formats. Right now in the experimental GitHub version, for example, 'make tarballs' generates a file tzdb-2016f-41-g6d70eda.tar.lz, corresponding to the 41st Git commit after 2016f, with abbreviated hash 6d70eda. So, your web server could convert that to (say) bzip2 compression format, and redistribute a file named tzdb-2016f-41-g6d70eda.tar.bz2. That might be a nice thing to have.
While that would be easy enough to do, I think there's benefit in having the files hosted on IANA as a single trusted source. (Then there's tzdist of course, but that's a whole other matter...) Jon
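The downstream recompression Paul suggests is mechanical. A minimal Python sketch under the assumption that the tarball bytes have already been fetched; gzip stands in for lzip as the source format here, since Python's standard library has no lzip codec, and the file names are illustrative only.

```python
import bz2
import gzip
import lzma

# Stand-in for the fetched tarball: some bytes, gzip-compressed once.
raw = b"(tar archive bytes would go here)" * 100
source = gzip.compress(raw)

# Decompress once, re-emit in each format a mirror might want to serve.
data = gzip.decompress(source)
variants = {
    "tzdb-2016f.tar.gz": gzip.compress(data),
    "tzdb-2016f.tar.bz2": bz2.compress(data),
    "tzdb-2016f.tar.xz": lzma.compress(data),
}
for name, blob in variants.items():
    print(name, len(blob))
```

Each variant round-trips back to the same archive bytes, so a downstream mirror changes only the compression wrapper, never the content.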
Maybe we should wait until lzip is widely available before adopting it? Meanwhile, bzip2 is already widely adopted, and is smaller than gzip. Debbie
On Sep 7, 2016, at 12:46 AM, Paul Eggert <eggert@CS.UCLA.EDU> wrote:
Jon Skeet wrote:
So whose preference *is* lzip?
Mine, mostly. Antonio Diaz Diaz also expressed a preference for it. Admittedly he's biased, as he is an lzip maintainer. (I'm biased too, as I'm a gzip maintainer....)
I don't see that that's any argument for changing now.
The point is that after we changed to gzip format, things turned out all right. This sort of change is not as much work as one might fear.
One option ...: distribute multiple formats.
You mean .gz, .lz, .bz2, .zip, etc.? That sounds like it'd be a bit more work for me, for the staff, and for newcomers trying to navigate through the distribution.
If you like, though, you can take on part of that job, and it might be helpful to do so as this is a good time to experiment with distribution formats anyway. You could maintain a downstream server, say, one that delivers other distribution formats. Right now in the experimental GitHub version, for example, 'make tarballs' generates a file tzdb-2016f-41-g6d70eda.tar.lz, corresponding to the 41st Git commit after 2016f, with abbreviated hash 6d70eda. So, your web server could convert that to (say) bzip2 compression format, and redistribute a file named tzdb-2016f-41-g6d70eda.tar.bz2. That might be a nice thing to have.
Deborah Goldsmith <goldsmit@apple.com> writes:
Maybe we should wait until lzip is widely available before adopting it?
Meanwhile, bzip2 is already widely adopted, and is smaller than gzip.
My impression is that bzip2 is dying and offers no great advantages at this point. Those who are looking for the best compression are using either xz or lzip. Those who care primarily about backward compatibility are using gzip. bzip2 is falling between those two stools: it's much newer than gzip and not as widely supported, and it's larger (and I think slower to decompress, although I could be wrong) than either xz or lzip. I would not introduce bzip2 into anything that wasn't already using it at this point.

I think the current plan is a good one: stick with gzip for the supported, stable distribution, add an experimental distribution that can play around with compression and contents, and worry later (possibly much later) about when the stable distribution might be worth changing. At some point, gzip will probably go the way of compress, but we're not there yet. (It will probably take longer than the replacement of compress with gzip took, since there aren't patent issues driving the matter the way that there were for compress.)

--
Russ Allbery (eagle@eyrie.org) <http://www.eyrie.org/~eagle/>
As far as I can tell, lzip is only used on Linux. There are no tools that ship with macOS or Windows (out of the box) that can decode it. Adopting lzip as the primary format at this point seems like a statement that only Linux matters. Debbie
On Sep 7, 2016, at 2:01 PM, Russ Allbery <eagle@eyrie.org> wrote:
Deborah Goldsmith <goldsmit@apple.com> writes:
Maybe we should wait until lzip is widely available before adopting it?
Meanwhile, bzip2 is already widely adopted, and is smaller than gzip.
My impression is that bzip2 is dying and offers no great advantages at this point. Those who are looking for the best compression are using either xz or lzip. Those who care primarily about backward compatibility are using gzip. bzip2 is falling between those two stools: it's much newer than gzip and not as widely-supported, and it's larger (and I think slower to decompress although I could be wrong) than either xz or lzip.
I would not introduce bzip2 into anything that wasn't already using it at this point.
I think the current plan is a good one: stick with gzip for the supported, stable distribution, add an experimental distribution that can play around with compression and contents, and worry later (possibly much later) about when the stable distribution might be worth changing. At some point, gzip will probably go the way of compress, but we're not there yet. (It will probably take longer than the replacement of compress with gzip took, since there aren't patent issues driving the matter the way that there were for compress.)
-- Russ Allbery (eagle@eyrie.org) <http://www.eyrie.org/~eagle/>
On Tue, Sep 20, 2016 at 11:24 AM, Deborah Goldsmith <goldsmit@apple.com> wrote:
There are no tools that ship with macOS or Windows (out of the box) that can decode it.
Neither of those systems come with a compiler out of the box either. On MacOS, it takes less than 1.5 seconds to install lzip:

$ time brew install lzip
==> Downloading https://homebrew.bintray.com/bottles/lzip-1.18.el_capitan.bottle.tar.gz
Already downloaded: /Users/a/Library/Caches/Homebrew/lzip-1.18.el_capitan.bottle.tar.gz
==> Pouring lzip-1.18.el_capitan.bottle.tar.gz
🍺 /usr/local/Cellar/lzip/1.18: 9 files, 166.4K

real 0m1.124s
user 0m0.646s
sys 0m0.537s

If you have trouble unpacking .tar.lz files - just point your browser to https://github.com/eggert/tz/tree/2016g and get whatever individual files you need.
On Tue, Sep 20, 2016 at 11:42 AM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
If you have trouble unpacking .tar.lz files - just point your browser to
BTW, $ wget https://github.com/eggert/tz/archive/2016g.zip will download all files in a .zip archive.
On 09/20/2016 08:47 AM, Alexander Belopolsky wrote:
$ wget https://github.com/eggert/tz/archive/2016g.zip
will download all files in a .zip archive.
Thanks, I had forgotten about that. This approach also works with .tar.gz format: https://github.com/eggert/tz/archive/2016g.tar.gz

This is not quite the same as a distribution tarball, though, as it omits automatically-generated files (e.g., 'leapseconds', 'version') and it contains a file that is not distributed ('.gitignore').
On Tue, Sep 20, 2016 at 12:57 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
This is not quite the same as a distribution tarball, though, as it omits automatically-generated files (e.g., 'leapseconds', 'version') and it contains a file that is not distributed ('.gitignore').
You may want to start using Github "release" feature to host semi-official tar-balls: https://help.github.com/articles/creating-releases/
On Tue, Sep 20, 2016 at 11:42 AM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
$ time brew install lzip
==> Downloading https://homebrew.bintray.com/bottles/lzip-1.18.el_capitan.bottle.tar.gz
Already downloaded: /Users/a/Library/Caches/Homebrew/lzip-1.18.el_capitan.bottle.tar.gz
==> Pouring lzip-1.18.el_capitan.bottle.tar.gz
🍺 /usr/local/Cellar/lzip/1.18: 9 files, 166.4K

real 0m1.124s
Sorry, my timing was off because I already had the source cached. With downloading, it takes about 1.8s:

$ time brew install lzip
==> Downloading https://homebrew.bintray.com/bottles/lzip-1.18.el_capitan.bottle.tar.gz
################################################################ 100.0%
==> Pouring lzip-1.18.el_capitan.bottle.tar.gz
🍺 /usr/local/Cellar/lzip/1.18: 9 files, 166.4K

real 0m1.885s
On Sep 20, 2016, at 11:42 AM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
On Tue, Sep 20, 2016 at 11:24 AM, Deborah Goldsmith <goldsmit@apple.com> wrote: There are no tools that ship with macOS or Windows (out of the box) that can decode it.
Neither of those systems come with a compiler out of the box either. On MacOS, it takes less than 1.5 seconds to install lzip:
$ time brew install lzip
That assumes you have Brew installed, and that you were willing to have it to begin with. MacOS out of the box does understand xz format, which seems to be a plausible alternative, if you want a newer and more efficient compression. xz is also readily available on Windows.

paul
On Tue, Sep 20, 2016, at 11:24, Deborah Goldsmith wrote:
As far as I can tell, lzip is only used on Linux. There are no tools that ship with macOS or Windows (out of the box) that can decode it.
Doesn't ship with "Linux" out of the box, either. At least, on my machine it's not present and I've done nothing special to exclude it... Ubuntu has a half dozen packages for different versions of it (most of which seem to be in "universe" rather than the main distribution), with no clear guidance on which one I should want to install. [It's clear enough, to be fair, what the actual differences between them *are*, but it's not clear if these differences have any impact on performance or not, and overall I'm completely mystified at the decision to *package* them.]
Adopting lzip as the primary format at this point seems like a statement that only Linux matters.
While it seems to be a bit bleeding-edge, I don't see how it's less so for Linux than for other operating systems. Also I don't see why a "primary" format means that only platforms supporting that format, 'out of the box' or otherwise, matter, while other distribution formats are still fully supported. Seems like it creates a bit of a chicken-and-egg dilemma towards ever migrating anyone to new formats.
Seems like it creates a bit of a chicken-and-egg dilemma towards ever migrating anyone to new formats.
I would argue that distribution of core internet data should never *drive* the adoption of new formats. Once a new format is widespread for other reasons, it makes perfect sense to adopt it. Important data that everyone needs to use should be as widely available as possible, which means using widely-adopted data formats. Debbie
On Sep 20, 2016, at 8:56 AM, Random832 <random832@fastmail.com> wrote:
On Tue, Sep 20, 2016, at 11:24, Deborah Goldsmith wrote:
As far as I can tell, lzip is only used on Linux. There are no tools that ship with macOS or Windows (out of the box) that can decode it.
Doesn't ship with "Linux" out of the box, either. At least, on my machine it's not present and I've done nothing special to exclude it... Ubuntu has a half dozen packages for different versions of it (most of which seem to be in "universe" rather than the main distribution), with no clear guidance on which one I should want to install. [It's clear enough, to be fair, what the actual differences between them *are*, but it's not clear if these differences have any impact on performance or not, and overall I'm completely mystified at the decision to *package* them.]
Adopting lzip as the primary format at this point seems like a statement that only Linux matters.
While it seems to be a bit bleeding-edge, I don't see how it's less so for Linux than for other operating systems. Also I don't see why a "primary" format means that only platforms supporting that format, 'out of the box' or otherwise, matter, while other distribution formats are still fully supported. Seems like it creates a bit of a chicken-and-egg dilemma towards ever migrating anyone to new formats.
On Tue, Sep 20, 2016 at 12:02 PM, Deborah Goldsmith <goldsmit@apple.com> wrote:
Seems like it creates a bit of a chicken-and-egg dilemma towards ever migrating anyone to new formats.
I would argue that distribution of core internet data should never *drive* the adoption of new formats.

In what sense is tzdb-latest.tar.lz "core internet data"? Even tzdata-latest.tar.gz is not used by the systems on the internet directly. System providers distribute derived binary files with system patches instead. The code and test data in tzdb archives is hardly of interest to users who would have trouble installing lzip.
On 09/20/2016 08:24 AM, Deborah Goldsmith wrote:
There are no tools that ship with macOS or Windows (out of the box) that can decode it
That's OK and there's precedent for it, as the tz project used gzip format before macOS or Microsoft Windows shipped out-of-the-box with decoding tools. And back then, many other operating systems (SunOS, etc.) didn't ship out-of-the-box with gzip either. It was not a big deal. People who didn't have gzip could easily obtain and use its freely-available source code, just as they can with lzip now. This sort of thing is not beyond the capability of our target audience.
Adopting lzip as the primary format at this point

That hasn't happened, as the primary format is still gzip. The lzip format is experimental.
On Tue, Aug 30, 2016 at 5:18 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
$ ls -l tz*.tar.*z*
-rw-r--r-- 1 eggert eggert 202609 Aug 30 14:00 tzcode2016X.tar.gz
-rw-r--r-- 1 eggert eggert 394169 Aug 30 14:00 tzdata2016X.tar.gz
-rw-r--r-- 1 eggert eggert 426667 Aug 30 14:10 tzdb-2016X.tar.bz2
-rw-r--r-- 1 eggert eggert 382991 Aug 30 14:00 tzdb-2016X.tar.lz
If the size of the data distribution is a concern, it looks like one can achieve much better compression by simply discarding comments in the data files:

$ cat africa antarctica asia australasia \
      europe northamerica southamerica | wc -c
647830
$ cat africa antarctica asia australasia \
      europe northamerica southamerica | egrep -v '^\w*(#.*|$)' | wc -c
151231

Given the structured (low entropy) nature of the resulting stream, it compresses very well:

$ cat africa antarctica asia australasia \
      europe northamerica southamerica | egrep -v '^\w*(#.*|$)' | xz -c | wc -c
24600
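The same experiment can be sketched in Python against synthetic tz-like input. The rule lines and comments below are invented stand-ins, so only the shape of the result carries over, not the exact byte counts from the real data files:

```python
import lzma
import re

# Synthetic comment-heavy input shaped like a tz data file.
lines = []
for year in range(1900, 2000):
    lines.append(f"# Commentary about the {year} transition, kept for history.")
    lines.append(f"Rule\tExample\t{year}\tonly\t-\tMar\tlastSun\t1:00u\t1:00\tS")
    lines.append("")
text = "\n".join(lines).encode()

# Equivalent of egrep -v '^\w*(#.*|$)': drop comment and blank lines.
stripped = b"\n".join(
    line for line in text.split(b"\n")
    if not re.match(rb"^\w*(#.*|$)", line)
)

print(len(text), len(stripped))                              # raw sizes
print(len(lzma.compress(text)), len(lzma.compress(stripped)))  # xz sizes
```

On this synthetic input, as on the real data, the stripped stream is both much smaller raw and smaller after xz compression.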
On Fri, Aug 26, 2016, at 19:10, Antonio Diaz Diaz wrote:
Paul Eggert wrote:
please use a compression format that can be handled easily by Windows users as well. For instance, choose a format from the list that 7Zip can handle: http://www.7-zip.org/
Thanks for mentioning the problem. xz format is on 7-Zip's list; it's a tiny bit larger than lzip format for our data (0.3% larger for the draft tzdb tarball) but I suppose portability trumps this minor advantage.
Please, do not use xz for a new distribution format. The xz format is defective. See for example http://www.nongnu.org/lzip/xz_inadequate.html
Seems like a lot of fear, uncertainty, and doubt.

"Xz was designed as a fragmented format. Xz implementations may choose what subset of the format they support. For example the xz-embedded decompressor does not support the optional CRC64 check, and is therefore unable to verify the integrity of the files produced by default by xz-utils. Xz files must be produced specially for the xz-embedded decompressor."

Is this last sentence even true? Does xz-embedded fail to open the files, or merely not run the integrity check? Someone could write an lzip extractor that ignores the CRC; would this be an indictment of your format?

"It has room for 2^63 filters, which can then be combined to make an even larger number of algorithms. Xz reserves less than 0.8% of filter IDs for custom filters, but even this small range provides about 8 million custom filter IDs for each human inhabitant on earth. There is not the slightest justification for such egregious level of extensibility."

This seems like a criticism of data type choice? I'm not sure what the point is.

"The 'file' utility does not provide any help:" "Xz-utils can report the minimum version of xz-utils required to decompress a given file, but it must examine the file contents to find it out,"

How does 'file' work if not by examining the file content?

"Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream."

What are the odds that the bytes found there will coincidentally match the CRC of the short data? And won't a corrupted length field always prevent the successful decoding of any remaining data, regardless of how the CRC is stored relative to it?

----

Anyway, why even use a compressed format? Is the data large enough for it to matter?
Random832 said:
"Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream. "
What are the odds that the bytes found there will coincidentally match the CRC of the short data? And won't a corrupted length field always prevent the successful decoding of any remaining data, regardless of how the CRC is stored relative to it?
Not necessarily. There's a situation with Bluetooth where a 1-bit corruption of the length field results in the wrong bytes being examined for the CRC *but* only one byte of the CRC is actually independent. If that has the right value, it doesn't matter what the rest of the "CRC" is; the CRC calculation comes out right.

--
Clive D.W. Feather          | If you lie to the compiler,
Email: clive@davros.org     | it will get its revenge.
Web: http://www.davros.org  |   - Henry Spencer
Mobile: +44 7973 377646
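The probability question can be checked against a toy record layout. This is a deliberately simplified [length][payload][CRC32] framing invented for illustration, not the actual xz container format: a corrupted length makes the reader checksum the wrong slice and read payload bytes as the "CRC", so an accidental match happens with probability about 2**-32.

```python
import os
import zlib

# Toy record: [4-byte little-endian length][payload][CRC32 of payload].
payload = os.urandom(100)
record = (
    len(payload).to_bytes(4, "little")
    + payload
    + zlib.crc32(payload).to_bytes(4, "little")
)

def check(rec: bytes) -> bool:
    """Re-derive the payload span from the length field and verify its CRC."""
    n = int.from_bytes(rec[:4], "little")
    body = rec[4:4 + n]
    crc = rec[4 + n:4 + n + 4]
    return len(crc) == 4 and zlib.crc32(body).to_bytes(4, "little") == crc

assert check(record)

# Flip one bit of the length field (100 -> 96): the check now runs over a
# 96-byte slice and treats payload bytes 96..99 as the CRC field, which
# matches only by a ~2**-32 coincidence.
bad = bytes([record[0] ^ 0x04]) + record[1:]
assert not check(bad)
```

The toy also shows the second point in the question: whether or not the CRC coincidentally matches, the bytes after the bogus "CRC" are misaligned for any fields that follow.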
On 2016-08-27 13:43, Random832 wrote:
On Fri, Aug 26, 2016, at 19:10, Antonio Diaz Diaz wrote:
Paul Eggert wrote:
please use a compression format that can be handled easily by Windows users as well. For instance, choose a format from the list that 7Zip can handle: http://www.7-zip.org/
Thanks for mentioning the problem. xz format is on 7-Zip's list; it's a tiny bit larger than lzip format for our data (0.3% larger for the draft tzdb tarball) but I suppose portability trumps this minor advantage.
Please, do not use xz for a new distribution format. The xz format is defective. See for example http://www.nongnu.org/lzip/xz_inadequate.html
Seems like a lot of fear, uncertainty, and doubt.
Think slow metered links in third world countries and you will get why a format with recovery is useful. Remember when modems got decent speed, checking and recovery?

--
Take care. Thanks, Brian Inglis, Calgary, Alberta, Canada
participants (18)
- Alexander Belopolsky
- Antonio Diaz Diaz
- Brian Inglis
- Clive D.W. Feather
- David Patte ₯
- Deborah Goldsmith
- Goudge, Stephen
- Guy Harris
- Howard Hinnant
- John Hawkinson
- Jon Skeet
- Marc Lehmann
- Paul Eggert
- Paul.Koning@dell.com
- Random832
- Russ Allbery
- Steffen Nurpmeso
- Stephen Colebourne