[PROPOSED PATCH 1/3] Remove hair from southamerica comment
While using the tz source to test some other program I found an
unwanted hair space (U+200A) in the commentary. Fix this, and adjust
Makefile to catch this sort of thing in the future.
* Makefile (SAFE_CHARSET): Exclude ] and -, as they're now the
invoker's responsibility. Invoker changed.
(NONSYM_CHAR): Remove, replacing with ...
(OK_CHAR): ... this new macro. All uses changed.
(NONSYM_LINE, VALID_LINE): Remove, replacing with ...
(OK_LINE): ... this new macro. All uses changed.
(check_character_set): Simplify test, and report all non-ASCII
non-letters. Remove the exception for Makefile, as it no longer
needs to contain non-OK characters.
* southamerica: Replace an inadvertent hair space with a space.
---
 Makefile     | 30 ++++++++++++++----------------
 southamerica |  2 +-
 2 files changed, 15 insertions(+), 17 deletions(-)

diff --git a/Makefile b/Makefile
index b398727..c3b23c8 100644
--- a/Makefile
+++ b/Makefile
@@ -292,23 +292,24 @@ TAB_CHAR=	'	'
 SAFE_CHARSET1=	$(TAB_CHAR)' !\"'$$sharp'$$%&'\''()*+,./0123456789:;<=>?@'
 SAFE_CHARSET2=	'ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\^_`'
 SAFE_CHARSET3=	'abcdefghijklmnopqrstuvwxyz{|}~'
-SAFE_CHARSET=	]$(SAFE_CHARSET1)$(SAFE_CHARSET2)$(SAFE_CHARSET3)-
-SAFE_CHAR=	'['$(SAFE_CHARSET)']'
-# NONSYM_CHAR is a regular expression that matches any character
-# except for a small number of symbols, where we prefer to stick with
+SAFE_CHARSET=	$(SAFE_CHARSET1)$(SAFE_CHARSET2)$(SAFE_CHARSET3)
+SAFE_CHAR=	'[]'$(SAFE_CHARSET)'-]'
+
+# OK_CHAR matches any character allowed in the distributed files.
+# This is the same as SAFE_CHAR, except that multibyte letters are
+# also allowed so that commentary can contain people's names and quote
+# non-English sources. For non-letters the sources are limited to
 # ASCII renderings for the convenience of maintainers whose text editors
 # mishandle UTF-8 by default (e.g., XEmacs 21.4.22).
-NONSYM_CHAR=	'[^–—°′″≈≠≤≥±−×÷∞←→↔·•§¶«»‘’‚‛“”„‟‹›「」『』〝〞〟]'
+OK_CHAR=	'[][:alpha:]'$(SAFE_CHARSET)'-]'
 
 # SAFE_LINE matches a line of safe characters.
-# SAFE_SHARP_LINE is similar, except any character can follow '#';
+# SAFE_SHARP_LINE is similar, except any OK character can follow '#';
 # this is so that comments can contain non-ASCII characters.
-# NONSYM_LINE matches a line of non-symbols.
-# VALID_LINE matches a line of any validly-encoded characters.
+# OK_LINE matches a line of OK characters.
 SAFE_LINE=	'^'$(SAFE_CHAR)'*$$'
-SAFE_SHARP_LINE='^'$(SAFE_CHAR)'*('$$sharp$(NONSYM_CHAR)'*)?$$'
-NONSYM_LINE=	'^'$(NONSYM_CHAR)'*$$'
-VALID_LINE=	'^.*$$'
+SAFE_SHARP_LINE='^'$(SAFE_CHAR)'*('$$sharp$(OK_CHAR)'*)?$$'
+OK_LINE=	'^'$(OK_CHAR)'*$$'
 
 # Flags to give 'tar' when making a distribution.
 # Try to use flags appropriate for GNU tar.
@@ -475,14 +476,11 @@ check: check_character_set check_white_space check_links check_sorted \
 check_character_set: $(ENCHILADA)
 		LC_ALL=en_US.utf8 && export LC_ALL && \
 		sharp='#' && \
-		! grep -Env $(SAFE_LINE) $(MANS) date.1 $(MANTXTS) \
+		! grep -Env $(SAFE_LINE) Makefile $(MANS) date.1 $(MANTXTS) \
 			$(MISC) $(SOURCES) $(WEB_PAGES) && \
 		! grep -Env $(SAFE_SHARP_LINE) $(TDATA) backzone \
 			iso3166.tab leapseconds yearistype.sh zone.tab && \
-		test $$(grep -Ecv $(SAFE_SHARP_LINE) Makefile) -eq 1 && \
-		! grep -Env $(NONSYM_LINE) CONTRIBUTING NEWS README Theory \
-			$(MANS) date.1 zone1970.tab && \
-		! grep -Env $(VALID_LINE) $(ENCHILADA)
+		! grep -Env $(OK_LINE) $(ENCHILADA)
 
 check_white_space: $(ENCHILADA)
 		! grep -En ' '$(TAB_CHAR)"|$$(printf '[\f\r\v]')" $(ENCHILADA)
diff --git a/southamerica b/southamerica
index be63a88..6bbc2c8 100644
--- a/southamerica
+++ b/southamerica
@@ -30,7 +30,7 @@
 # I suggest the use of _Summer time_ instead of the more cumbersome
 # _daylight-saving time_. _Summer time_ seems to be in general use
 # in Europe and South America.
-# -- E O Cutler, _New York Times_ (1937-02-14), quoted in
+# -- E O Cutler, _New York Times_ (1937-02-14), quoted in
 # H L Mencken, _The American Language: Supplement I_ (1960), p 466
 #
 # Earlier editions of these tables also used the North American style
-- 
2.1.4
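
As a rough illustration of the kind of check the Makefile change above performs, the sketch below reports every line that contains a character outside an allowed set. It is not the Makefile's actual rule: the function name find_bad_chars is hypothetical, and its bracket expression is a simplified stand-in for OK_CHAR. Multibyte letters are accepted only when the current locale is a UTF-8 one.

```shell
# Hypothetical helper, loosely modeled on the OK_LINE test; not taken from tz.
# Prints (with line numbers) every line that contains a character other than
# a letter, a digit, a tab, a space, or ASCII punctuation.  As with plain
# grep, the exit status is 0 if such a line was found and 1 if none was.
find_bad_chars() {
  tab=$(printf '\t')
  # ']' must come first and '-' last to be literal in a bracket expression.
  ok_char='[][:alpha:][:digit:]'"$tab"' !"#$%&'\''()*+,./:;<=>?@[\^_`{|}~-]'
  grep -Env "^${ok_char}*\$" "$@"
}
```

A caller that wants the Makefile's sense of success (no bad lines) would negate the status, for example: ! find_bad_chars southamerica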
* Makefile (check_character_set): Don't require iso3166.tab to be ASCII.
* NEWS: Document this.
* iso3166.tab (AX, CI, RE): Use UTF-8 rather than ASCII approximations.
---
 Makefile    |  2 +-
 NEWS        |  8 ++++++++
 iso3166.tab | 11 +++++------
 3 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/Makefile b/Makefile
index c3b23c8..6f70979 100644
--- a/Makefile
+++ b/Makefile
@@ -479,7 +479,7 @@ check_character_set: $(ENCHILADA)
 		! grep -Env $(SAFE_LINE) Makefile $(MANS) date.1 $(MANTXTS) \
 			$(MISC) $(SOURCES) $(WEB_PAGES) && \
 		! grep -Env $(SAFE_SHARP_LINE) $(TDATA) backzone \
-			iso3166.tab leapseconds yearistype.sh zone.tab && \
+			leapseconds yearistype.sh zone.tab && \
 		! grep -Env $(OK_LINE) $(ENCHILADA)
 
 check_white_space: $(ENCHILADA)
diff --git a/NEWS b/NEWS
index c14df78..407258d 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,13 @@
 News for the tz database
 
+Unreleased, experimental changes
+
+  Changes affecting data format
+
+    The file 'iso3166.tab' now uses UTF-8, so that its entries can better
+    spell the names of Åland Islands, Côte d'Ivoire, and Réunion.
+
+
 Release 2015d - 2015-04-24 08:09:46 -0700
 
   Changes affecting future time stamps
diff --git a/iso3166.tab b/iso3166.tab
index 0b0b842..0548800 100644
--- a/iso3166.tab
+++ b/iso3166.tab
@@ -3,11 +3,10 @@
 # This file is in the public domain, so clarified as of
 # 2009-05-17 by Arthur David Olson.
 #
-# From Paul Eggert (2014-07-18):
+# From Paul Eggert (2015-05-02):
 # This file contains a table of two-letter country codes. Columns are
 # separated by a single tab. Lines beginning with '#' are comments.
-# Although all text currently uses ASCII encoding, this is planned to
-# change to UTF-8 soon. The columns of the table are as follows:
+# All text uses UTF-8 encoding. The columns of the table are as follows:
 #
 # 1. ISO 3166-1 alpha-2 country code, current as of
 # ISO 3166-1 Newsletter VI-16 (2013-07-11). See: Updates on ISO 3166
@@ -38,7 +37,7 @@ AS	Samoa (American)
 AT	Austria
 AU	Australia
 AW	Aruba
-AX	Aaland Islands
+AX	Åland Islands
 AZ	Azerbaijan
 BA	Bosnia & Herzegovina
 BB	Barbados
@@ -67,7 +66,7 @@ CD	Congo (Dem. Rep.)
 CF	Central African Rep.
 CG	Congo (Rep.)
 CH	Switzerland
-CI	Cote d'Ivoire
+CI	Côte d'Ivoire
 CK	Cook Islands
 CL	Chile
 CM	Cameroon
@@ -211,7 +210,7 @@ PT	Portugal
 PW	Palau
 PY	Paraguay
 QA	Qatar
-RE	Reunion
+RE	Réunion
 RO	Romania
 RS	Serbia
 RU	Russia
-- 
2.1.4
* NEWS: Document this.
* tzselect.ksh (utf8_locale): New var. Use it to select a UTF-8
locale if available.
---
 NEWS         |  4 ++++
 tzselect.ksh | 12 ++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/NEWS b/NEWS
index 407258d..b77cd03 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,10 @@ Unreleased, experimental changes
     The file 'iso3166.tab' now uses UTF-8, so that its entries can better
     spell the names of Åland Islands, Côte d'Ivoire, and Réunion.
 
+  Changes affecting code
+
+    tzselect aligns UTF-8 columns better, if a UTF-8 locale is available.
+
 
 Release 2015d - 2015-04-24 08:09:46 -0700
 
diff --git a/tzselect.ksh b/tzselect.ksh
index 3acdebd..b288cdf 100644
--- a/tzselect.ksh
+++ b/tzselect.ksh
@@ -44,6 +44,18 @@ REPORT_BUGS_TO=tz@iana.org
 	exit 1
 }
 
+# Use a UTF-8 locale if available, as the data contain UTF-8,
+# and the shell aligns columns better that way.
+# Check the UTF-8 of U+12345 CUNEIFORM SIGN URU TIMES KI.
+utf8_locale='BEGIN { u12345 = "\360\222\215\205"; exit length(u12345) != 1 }'
+$AWK "$utf8_locale" ||
+  for locale in en_US.utf8 en_US.UTF-8 C.utf8; do
+    (LC_ALL=$locale $AWK "$utf8_locale") 2>/dev/null && {
+      export LC_ALL=$locale
+      break
+    }
+  done
+
 coord=
 location_limit=10
 zonetabtype=zone1970
-- 
2.1.4
Using a UTF-8 locale won't work if the terminal isn't UTF-8. Instead, tzselect should keep the user's locale and pipe its UTF-8 output through 'iconv -f utf-8', so that the text is converted to that locale's encoding. If the user has a UTF-8 terminal but no UTF-8 locale, that's the user's problem.

On Sun, May 3, 2015, at 02:25, Paul Eggert wrote:
* NEWS: Document this.
* tzselect.ksh (utf8_locale): New var. Use it to select a UTF-8
locale if available.
-- Random832
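
The iconv suggestion above could look roughly like the sketch below. The function name to_user_charset is hypothetical and tzselect does not define it. The sketch also assumes an iconv that, like GNU iconv as I understand it, uses the current locale's encoding when -t is omitted; other implementations may need an explicit -t.

```shell
# Hypothetical helper; not part of tzselect.
# Reads UTF-8 text on standard input and writes it to standard output,
# relying on iconv to pick the current locale's encoding as the target
# when no -t option is given (an assumption about the local iconv).
to_user_charset() {
  iconv -f utf-8
}
```

A data file could then be shown with something like: to_user_charset <iso3166.tab. Without further options iconv fails on a character that has no equivalent in the target encoding, so a real script would need a fallback for names such as Réunion in an ASCII-only locale.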
random832@fastmail.us wrote:
Using a UTF-8 locale won't work if the terminal isn't UTF-8. It should be piping utf-8 output through iconv -f utf-8 and using the user's locale.
Thanks, good catch. Also, the 'echo' commands should guard against backslashes in the data. Proposed patches attached.
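
One common way to guard against backslashes is to print data with printf instead of echo, because some echo implementations expand sequences such as '\n' in their arguments. The sketch below shows that idea only; the attached patches are not reproduced in this thread and may do something different, and the name print_line is hypothetical.

```shell
# Hypothetical helper; the patches mentioned above may differ.
# Prints its arguments followed by a newline.  The data is passed as a
# printf argument, not as the format, so backslashes in it stay literal.
print_line() {
  printf '%s\n' "$*"
}

print_line 'a\nb'   # prints the four characters a\nb on one line
```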