Re: [tz] Pre-1970 data

Nov. 5, 2021

On Fri, Nov 5, 2021 at 12:01 PM Brian Park <brian@xparks.net> wrote:
...
I agree that it is conceptually cleaner if the Core TZDB identifiers were
internal only. But I understand that some people would consider ISO-country
identifiers to be out of scope of this project, although there are many ad
hoc ones currently in the database. I think a file like 'countryzone'
should be added only if there are people willing to maintain such a list.
It may need to be a separate project, to avoid forcing the TZ Coordinator
to pick up the slack if those maintainers drop off.

Following up on my own post, I took an initial stab at what this
'countryzone' file would look like, and immediately ran into problems that
convince me that this does *not* belong in the TZDB project. The scope looks
too large, so it seems better as a separate project.

I started from an ISO-3166 CSV file (see
https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes for a human
readable version), and I found:

1) Many country names are too long to fit into 14 characters. Let's say we
relax that constraint because we deprecate support for any old Unix system
that cannot support these longer file names. Even then, there are countries
like "Heard Island and McDonald Islands", "South Georgia and the South
Sandwich Islands", "United States Minor Outlying Islands", and "British
Indian Ocean Territory". Just from an ergonomics perspective, we should find
a way to shorten these very long names.

2) If we shorten some country names, like "Bosnia and Herzegovina" to just
"Bosnia" for convenience, are we going to offend people? I don't know
anyone from Bosnia and Herzegovina, so I have no idea. Each name that we
shorten needs to be researched carefully.

3) At least 5 countries have non-ASCII characters in their ISO names: "Côte
d'Ivoire", "Curaçao", "Åland Islands", "Saint Barthélemy", "Réunion".
Personally, I would like to use only ASCII characters because they are the
lowest common denominator that is guaranteed to work, outside of mainframes
using EBCDIC. If we remove these non-ASCII characters, are we going to
offend the people of those countries, even though these are supposed to be
English versions of their country names?

4) So maybe the solution is to use 2-letter or 3-letter ISO codes, instead
of the shortened, quasi-English versions of the country names. So we get
things like "CA/Eastern" or "CAN/Eastern", instead of "Canada/Eastern". Not
very satisfying for Canadians or for people in many other countries (except
for Americans, whose ISO codes "US" and "USA" match their colloquial usage
perfectly).
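
To make points 1 and 3 concrete, here is a minimal Python sketch of the two
checks, run over only the names quoted in this message rather than the full
ISO 3166 list. The names NAMES, MAX_COMPONENT, too_long, and ascii_fold are
hypothetical helpers for illustration, and dropping accents via Unicode
decomposition is just one possible way to get to ASCII, not a proposal.

```python
import unicodedata

# Sample of ISO 3166 English names quoted in this message (not the full list).
NAMES = [
    "Canada",
    "Bosnia and Herzegovina",
    "Heard Island and McDonald Islands",
    "South Georgia and the South Sandwich Islands",
    "United States Minor Outlying Islands",
    "British Indian Ocean Territory",
    "Côte d'Ivoire",
    "Curaçao",
    "Åland Islands",
    "Saint Barthélemy",
    "Réunion",
]

MAX_COMPONENT = 14  # the 14-character limit discussed in point 1


def too_long(name, limit=MAX_COMPONENT):
    """Return True if the name has more characters than the limit."""
    return len(name) > limit


def ascii_fold(name):
    """Strip combining marks after NFD decomposition, e.g. accents and cedillas."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


long_names = [n for n in NAMES if too_long(n)]
non_ascii = [n for n in NAMES if not n.isascii()]

print("Longer than", MAX_COMPONENT, "characters:")
for n in long_names:
    print("  ", n)

print("Non-ASCII, with one possible ASCII folding:")
for n in non_ascii:
    print("  ", n, "->", ascii_fold(n))
```

Whether a folded spelling such as "Reunion" is acceptable is exactly the
question raised in point 3; the code only shows that the mechanical part is
easy.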

All this before we even get to the work of mapping various ISO countries
(and their subregions if needed) to their corresponding canonical TZDB
identifiers.
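
For a sense of what that mapping might look like, here is a small Python
sketch using the 2-letter-code style from point 4. The table COUNTRYZONE and
the function resolve are hypothetical; no such file or format exists in TZDB
today. The "CA/..." aliases are invented for illustration, while the
"US/..." entries mirror links that already exist in the database.

```python
# Hypothetical 'countryzone' entries: country-prefixed alias -> canonical
# TZDB identifier. The CA/* aliases are made up for this sketch.
COUNTRYZONE = {
    "CA/Eastern": "America/Toronto",
    "CA/Pacific": "America/Vancouver",
    "US/Eastern": "America/New_York",
    "US/Pacific": "America/Los_Angeles",
}


def resolve(identifier):
    """Return the canonical identifier for an alias, or the input unchanged."""
    return COUNTRYZONE.get(identifier, identifier)


print(resolve("CA/Eastern"))        # alias resolved through the table
print(resolve("America/Toronto"))   # canonical identifiers pass through
```

Even in this toy form, every row is a decision someone has to research and
maintain, which is the scope problem described above.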

With regard to pre-1970 data in 'backzone', I'll see if I can do some
exploratory work on the 'backzone'/'mergedzone' pairing next week, and
determine if there are any major problems with the idea.

Brian