
Fun fact: the time zone database is 80% comments by volume. @dashdashado

On Apr 21, 2023, at 1:05 PM, Michael Douglass via tz <tz@iana.org> wrote:
On 4/21/23 15:13, Arthur David Olson via tz wrote:
Fun fact: the time zone database is 80% comments by volume.
What's that by weight?
If stored on paper, probably a much smaller percentage, as most of the weight of the printed-out tzdb source would be paper, not ink/toner impressions on the paper. If stored in uncompressed digital media, it depends on the relative weight of different bit sequences on the storage medium and on whether the comment bit patterns are heavier, or lighter, on average than the data bit patterns. If stored in compressed digital media, that question might not have an answer, as, depending on the compression algorithm, it might not be possible to determine which bits in the compressed form belong to comments and which belong to data.

Guy Harris via tz <tz@iana.org> wrote on Fri, 21 Apr 2023 at 17:29:33 EDT in <59840028-62AD-4567-9D90-DAD3CBEAB025@sonic.net>:
On Apr 21, 2023, at 1:05 PM, Michael Douglass via tz <tz@iana.org> wrote:
On 4/21/23 15:13, Arthur David Olson via tz wrote:
Fun fact: the time zone database is 80% comments by volume.
What's that by weight?
If stored on paper, probably a much smaller percentage, as most of ...
I would say, rather, that the proper metaphorical analogue of percent-by-mass as opposed to percent-by-volume is not actual mass, but effectiveness or efficacy. (Although maybe that is more like molarity or density than it is straight-up %-by-mass? If so, it's a mere arithmetic conversion.)
And so, one might argue, comments are almost exactly equivalent to code here, and everything captured in code is also captured in comments, so tz is 100% "mass" of comments. Or 0% "mass" of comments, because they are fully duplicative of the coded data, just in another form. Or, perhaps, 50%, since that's the mean of 0% and 100%.
However, that assumption isn't right. There are plenty of places where the text of the comments explains the sourcing but doesn't actually enumerate the specifics of the transitions, which are clear enough from the coded data in the subsequent tabular form. So measuring how many bytes of coded data contain information not within the comments would require a line-by-line inquiry and semantic evaluation.
And at some point, there is the question of the value of history. Is an entertaining story of a "derisive offer to erect a sundial" in Detroit "mere dicta," or is it valuable content that is fully itself a part of the database, or somewhere in between? How should it be measured?
Still another question might be the ratio of "tzdata" to "tzcode," which is, at least, easy to calculate.
When we're done answering all those questions, someone can produce some nifty graphs of how this has changed over time. In addition to pure growth, it may be that discussion of time zone abbreviations, and perhaps their local invention in the database, was considered meaty content in prior years and is now viewed differently. So if the semantic standards change over time, that task might become even tougher.
I suppose there's probably some prior art here in the discipline of philosophy of science (or philosophy of history), but I don't know it.
One could also imagine tackling this problem with machine learning.
--
jhawk@alum.mit.edu John Hawkinson

On 2023-04-21 14:05, Michael Douglass wrote:
On 4/21/23 15:13, Arthur David Olson wrote:
Fun fact: the time zone database is 80% comments by volume.
What's that by weight?
Ignoring historical data files, printing 65 lines/page at 10/12pt on cheap 80g/m2 A4 office paper (210mm*297mm) with 10mm end margins gives ~270-272pp*:
$ units `cat ~/tz/data/{[ef]*,*{ica,asia}} | wc -l`*12point/277mm*A4paper*80gsm
1.3475787 kg
$ units `cat ~/tz/data/{[ef]*,*{ica,asia}} | wc -l`*12point/277mm*A4paper*80gsm *.8
1.0780629 kg
so rounding up pages and weights, we have ~1.35kg with ~1.1kg of comments ;^>
Calculation of printing and shipping costs is left as an exercise for readers who still print on paper and send mail! ;^p
*double checked page count with `pr`
--
Take care. Thanks, Brian Inglis
Calgary, Alberta, Canada
La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retirer     but when there is no more to cut
                                -- Antoine de Saint-Exupéry
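For readers without GNU `units` at hand, the arithmetic in the message above can be sketched in Python. This is only an illustration under stated assumptions: the function names are made up for this sketch, a point is taken as 1/72 inch (which appears to reproduce the quoted figures), and the line count in the usage example is back-calculated from the 1.3475787 kg result, since the thread never states the actual output of `wc -l`.

```python
MM_PER_INCH = 25.4
POINT_MM = MM_PER_INCH / 72          # assumption: 1 pt = 1/72 inch
A4_WIDTH_MM, A4_HEIGHT_MM = 210.0, 297.0


def pages_needed(n_lines, leading_pt=12.0, margin_mm=10.0):
    """Fractional page count: lines * line height / printable page height.

    With 10 mm top and bottom margins the printable height is 277 mm,
    matching the `12point/277mm` factor in the units expression.
    """
    text_height_mm = A4_HEIGHT_MM - 2 * margin_mm
    return n_lines * leading_pt * POINT_MM / text_height_mm


def paper_mass_kg(pages, gsm=80.0):
    """Mass of single-sided A4 pages at the given grammage (g/m^2)."""
    area_m2 = (A4_WIDTH_MM / 1000) * (A4_HEIGHT_MM / 1000)
    return pages * area_m2 * gsm / 1000


# Hypothetical line count, chosen so the result matches the thread's figure.
lines = 17_672
total_kg = paper_mass_kg(pages_needed(lines))
comment_kg = total_kg * 0.8          # the "80% comments" claim applied to mass
print(f"{pages_needed(lines):.1f} pages, {total_kg:.2f} kg, "
      f"{comment_kg:.2f} kg of comments")
```

Like the `units` expression, the sketch keeps the page count fractional rather than rounding up to whole sheets, which is why the message rounds pages and weights afterwards.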

On 2023-04-22 18:21, Brian Inglis via tz wrote:
$ units `cat ~/tz/data/{[ef]*,*{ica,asia}} | wc -l`*12point/277mm*A4paper*80gsm 1.3475787 kg
I guess one would print two-sided so that the 1079 pages you computed would fit on 0.68 kg of paper. Which is not so much if I recall the times when the ISO sent out printed committee drafts of SQL from Geneva into the world.
And why did you exclude back*, zone*, etc?
Michael Deckers.

On 2023-04-22 15:07, Michael H Deckers wrote:
On 2023-04-22 18:21, Brian Inglis wrote:
$ units `cat ~/tz/data/{[ef]*,*{ica,asia}} | wc -l`*12point/277mm*A4paper*80gsm 1.3475787 kg
I guess one would print two-sided so that the 1079 pages you computed would fit on 0.68 kg of paper.
My estimate was ~270-272pp; 16 A4 sheets/m^2 == .0625m^2/sheet; * 80g/m^2 == 5g/sheet; * 272sheets == 1.36kg!
Good point - but I have rarely printed anything large enough to benefit from duplex in the last decade - I used to print 4 pages/sheet at home using various imposition packages or print drivers.
And why did you exclude back*, zone*, etc?
Did not consider those core, but for giggles:
$ units `cat ~/tz/data/{[ef]*,*{ica,asia},back*,zone*.tab} | wc -l`*12point/277mm*A4paper*80gsm
1.5766488 kg # on about 316-319pp
$ units `grep -h '^\s*#\|^\s*$' ~/tz/data/{[ef]*,*{ica,asia},back*,zone*.tab} | wc -l`*12point/277mm*A4paper*80gsm
1.1474093 kg # on about 230-232pp
giving ~73% comment or blank lines by line or page count or weight. I am here excluding data line comments, which would require analyzing numbers of characters and/or areas occupied on the line to get comparables.
--
Take care. Thanks, Brian Inglis
Calgary, Alberta, Canada
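The line-based measurement above (lines matching `^\s*#` or `^\s*$` divided by all lines) can be sketched in Python as follows. This is a minimal illustration, not the method anyone in the thread actually ran: `comment_fraction` and the sample lines are invented for the example, and, as in the message above, comments trailing a data line are not counted.

```python
import re

# Same idea as the grep pattern '^\s*#\|^\s*$':
# a line that is blank, or whose first non-space character is '#'.
COMMENT_OR_BLANK = re.compile(r"^\s*(#|$)")


def comment_fraction(lines):
    """Fraction of lines that are whole-line comments or blank."""
    total = 0
    matched = 0
    for line in lines:
        total += 1
        if COMMENT_OR_BLANK.match(line.rstrip("\n")):
            matched += 1
    return matched / total if total else 0.0


# Hypothetical sample in the general style of a tz source file.
sample = [
    "# Rule NAME FROM TO - IN ON AT SAVE LETTER/S",
    "",
    "Rule\tExample\t1970\tonly\t-\tJan\t1\t0:00\t0\t-\t# trailing comment",
    "   # indented comment",
    "Zone Example/Place 0:00 - XXX",
]
print(comment_fraction(sample))

# The two masses quoted in the message imply the ~73% figure directly.
print(1.1474093 / 1.5766488)
```

To apply it to real files, one could pass an open file object (or several chained with `itertools.chain`) as `lines`; counting characters instead of lines would be needed to account for data-line comments.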
participants (7)
- Arthur David Olson
- Brian Inglis
- Guy Harris
- John Hawkinson
- Magnus Fromreide
- Michael Douglass
- Michael H Deckers