Using curved quotes etc. in TZDB HTML
After looking at today’s little change[1] to one of TZDB’s HTML files, I noticed that it cited Morrison et al.’s 2020 paper but changed the paper’s title to use straight quotes 'like this' instead of curved quotes ‘like this’.

Although TZDB’s HTML files mostly use straight quotes, they’re not methodical about it, and anyway straight quotes are to some extent a relic of the 20th century and in HTML files we can now reliably use curved quotes. So I composed a patch (attached) to be more consistent about quoting, and to use curved quotes except in places where the quotes are part of computer code and need to be straight.

I haven’t installed this patch, though, because it’s annoying to edit HTML that looks like this:

  strings like &ldquo;Prague&rdquo;, &ldquo;Praha&rdquo;, &ldquo;Прага&rdquo;, and &ldquo;布拉格&rdquo;.

I’d rather edit HTML that looks like this:

  strings like “Prague”, “Praha”, “Прага”, and “布拉格”.

Since the HTML files are already UTF-8 encoded (otherwise they wouldn’t contain those non-ASCII letters) this shouldn’t be a problem with today’s editors.

In 2014[2] I held off on making such a change because Garrett Wollman wrote that XEmacs 21.4 does not support quotes ‘like this’ or “like this”. However, XEmacs 21.4 has not been updated since 2009 and by now I would think XEmacs users would be using either XEmacs 21.5 (even though it’s still “beta”) or GNU Emacs, and both of these do support curved quotes.

So instead of installing this patch, I’m inclined to install a different one in which HTML files use UTF-8 quotes “like this” rather than harder-to-read HTML entities &ldquo;like this&rdquo;. If this is OK we can also do something similar with other special characters such as “–” instead of “&ndash;”.

Comments welcome. I am ccing this email to Garrett to see whether XEmacs 21.4 is still an issue with him.

[1]: https://lists.iana.org/hyperkitty/list/tz@iana.org/message/BH63ZL5DHEBJPT2CC...
[2]: https://mm.icann.org/pipermail/tz/2014-June/046053.html
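For illustration only, here is a minimal Python sketch of the kind of conversion being proposed: replacing typographic HTML entities with the equivalent UTF-8 characters. The mapping, the function name entities_to_utf8, and the sample string are assumptions made for this example and are not taken from the attached patch. Entities that matter to HTML syntax, such as &lt; and &amp;, are deliberately left alone.

```python
import re

# Hypothetical mapping for this sketch: only typographic entities are
# converted; syntax-significant entities (&lt;, &gt;, &amp;) are not listed
# and therefore stay untouched.
ENTITY_TO_CHAR = {
    "&lsquo;": "\u2018",  # LEFT SINGLE QUOTATION MARK
    "&rsquo;": "\u2019",  # RIGHT SINGLE QUOTATION MARK
    "&ldquo;": "\u201C",  # LEFT DOUBLE QUOTATION MARK
    "&rdquo;": "\u201D",  # RIGHT DOUBLE QUOTATION MARK
    "&ndash;": "\u2013",  # EN DASH
}

# One alternation that matches any entity named in the mapping.
_ENTITY_RE = re.compile("|".join(re.escape(e) for e in ENTITY_TO_CHAR))


def entities_to_utf8(html_text: str) -> str:
    """Replace the listed typographic entities with literal characters."""
    return _ENTITY_RE.sub(lambda m: ENTITY_TO_CHAR[m.group(0)], html_text)


if __name__ == "__main__":
    sample = "strings like &ldquo;Prague&rdquo; &ndash; and <code>a &lt; b</code>"
    print(entities_to_utf8(sample))
```

A blanket call to html.unescape would also decode &lt; and &amp;, which could change the markup itself, so an explicit allow-list seems the safer design here.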
Paul Eggert via tz:
I’d rather edit HTML that looks like this:
strings like “Prague”, “Praha”, “Прага”, and “布拉格”.
Since the HTML files are already UTF-8 encoded (otherwise they wouldn’t contain those non-ASCII letters) this shouldn’t be a problem with today’s editors.
In HTML, you could mark the quotes up semantically using the Q element:

  <q>Prague</q>, <q>Praha</q>, <q>Прага</q>, and <q>布拉格</q>.

All browsers released this side of the turn of the century should support this. Make sure the language is specified in the <html> tag, to get the browser to select the appropriate quoting style.

-- 
\\// Peter - http://www.softwolves.pp.se/
On 2025-08-24 22:37, Peter Krefting via tz wrote:
<q>Prague</q>, <q>Praha</q>, <q>Прага</q>, and <q>布拉格</q>.
Thanks, I had not considered that, and perhaps we should use <q> for ordinary double-quoting in English. However, <q> does not address text like this:

  Astrodienst&rsquo;s Web version of Shanks and Pottenger&rsquo;s

which is not as nice as:

  Astrodienst’s Web version of Shanks and Pottenger’s

Also, <https://html.spec.whatwg.org/#the-q-element> says that <q> is not always a substitute for ordinary double-quoting. For example, in:

  They are not in any sense “standard compatible” – some are

using <q> would be incorrect as this does not quote another source. So even if we use <q> there are places where we will still need curved quotes in the HTML.

For more about issues with <q>, please see:

https://github.com/whatwg/html/issues/10216
Date: Mon, 25 Aug 2025 07:34:25 -0700
From: Paul Eggert via tz <tz@iana.org>
Message-ID: <a5afc9df-feb5-4583-b39d-05a74a1fa110@cs.ucla.edu>

| Astrodienst’s Web version of Shanks and Pottenger’s

That's simply wrong, regardless of questions of how nice it is to type, that character that is being misrepresented there by ’ should be an apostrophe ('), which is an entirely different thing than a closing single quote.

If the apostrophe character doesn't look "right" to you, use a different font where it does, don't just turn it into a quote.

kre
On 2025-08-25 08:45, Robert Elz wrote:
Date: Mon, 25 Aug 2025 07:34:25 -0700 From: Paul Eggert via tz <tz@iana.org> | Astrodienst’s Web version of Shanks and Pottenger’s
That's simply wrong, regardless of questions of how nice it is to type, that character that is being misrepresented there by ’ should be an apostrophe ('), which is an entirely different thing than a closing single quote.
There are two issues here. First, whether to use HTML entities like &apos;/&lsquo;/&rsquo; versus ordinary characters like '/‘/’. Second, whether to use '/&apos; or ’/&rsquo; in English possessives and contractions.

For the first issue, ordinary characters simplify editing HTML.

For the second issue, which is what I think you’re focusing on, there has been considerable confusion and some controversy due to the historical use of ' (U+0027 APOSTROPHE) to mean many things including apostrophe and single quotation marks. On this topic the current Unicode Standard says the following[1]:

  When text is set, U+2019 RIGHT SINGLE QUOTATION MARK is preferred as apostrophe.... U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.” In this latter case, U+2019 is also referred to as a punctuation apostrophe.... The semantics of U+2019 are therefore context dependent. For example, if surrounded by letters or digits on both sides, it behaves as an in-text punctuation character and does not separate words or lines.

Given Unicode’s limitations no approach to this problem is perfect. That being said, following the Unicode Standard is a reasonable way to go.

[1]: https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-6/#G12411
On Aug 25, 2025, at 9:27 AM, Paul Eggert via tz <tz@iana.org> wrote:
There are two issues here. First, whether to use HTML entities like &apos;/&lsquo;/&rsquo; versus ordinary characters like '/‘/’. Second, whether to use '/&apos; or ’/&rsquo; in English possessives and contractions.
...
For the second issue, which is what I think you’re focusing on, there has been considerable confusion and some controversy due to the historical use of ' (U+0027 APOSTROPHE) to mean many things including apostrophe and single quotation marks. On this topic the current Unicode Standard says the following[1]:
When text is set, U+2019 RIGHT SINGLE QUOTATION MARK is preferred as apostrophe.... U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.” In this latter case, U+2019 is also referred to as a punctuation apostrophe.... The semantics of U+2019 are therefore context dependent. For example, if surrounded by letters or digits on both sides, it behaves as an in-text punctuation character and does not separate words or lines.
Some style guides:

MLA:

  https://style.mla.org/apostrophes-three-ways/

  "It should look like a single closing quotation mark, not an opening one."

Federal government of Australia:

  https://www.stylemanual.gov.au/grammar-punctuation-and-conventions/punctuati...

  They don't seem to say what an apostrophe should look like, but the apostrophes in that text are U+2019.
Date: Mon, 25 Aug 2025 09:27:51 -0700
From: Paul Eggert <eggert@cs.ucla.edu>
Message-ID: <4c233b86-6d4e-4158-9232-2881a6203266@cs.ucla.edu>

| For the second issue, which is what I think you’re focusing on,

Yes, of course, which is why I said:

  regardless of questions of how nice it is to type

| there has been considerable confusion and some controversy due to the
| historical use of ' (U+0027 APOSTROPHE) to mean many things including
| apostrophe and single quotation marks.

Yes, and people are ignorant. ASCII at least had an excuse, there were only 128 code points (including control chars) available - not every possible glyph could be represented, not even close. At least they didn't copy (some) ancient typewriters (which might have been tempting when ASCII was being invented) and also omit 0 and 1, using O and l instead.

Unicode has no such excuse.

| On this topic the current Unicode Standard says the following[1]:
| The semantics of U+2019 are therefore context dependent.

That's simply pathetic, there is no excuse for that. If they considered U+0027 too encumbered with confusion to recommend using, they could have easily just added a new code point that is uniquely apostrophe. Apostrophe and closing single quote are about as related as 1 and l (or 0/O). They have a (kind of) similar appearance, but otherwise are completely different things.

| For example, if surrounded by letters or digits on both sides,

And 'twas the night before Christmas when the dogs' fleas' bites were ....

| Given the Unicode’s limitations no approach to this problem is perfect.

Unicode has both U+0027 and U+2019 - there's no reason at all not to use both, using each when that one is required. Particularly for usages like this when no-one is going to be attempting to parse the results (nothing will be wondering whether U+0027 might have just been an ASCII single quote (opening or closing) and trying to guess what to do), where all that matters is what its appearance ends up being.

kre
On 2025-08-25 11:05, Robert Elz wrote:
If they considered U+0027 too encumbered with confusion to recommend using, they could have easily just added a new code point that is uniquely apostrophe.
The Unicode developers did exactly what you suggest. It did not work.

They introduced ʼ (U+02BC MODIFIER LETTER APOSTROPHE), and through Unicode 2.1.9 (1999) this was listed as the preferred character for the apostrophe used in English possessives and plurals.[1] However, this did not work as people in practice largely used ’ (U+2019 RIGHT SINGLE QUOTATION MARK) instead, partly due to longstanding practice in high-quality typography. So starting in Unicode 3.0.0 (1999) the Unicode developers changed their mind and started recommending common practice.[2]

[1]: https://www.unicode.org/Public/2.0-Update/NamesList-1.txt
[2]: https://www.unicode.org/Public/3.0-Update/NamesList-3.0.0.txt
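As a small aside, the three candidate characters discussed in this thread can be inspected with Python's standard unicodedata module. This is only an illustrative sketch; the helper name describe is made up for the example. It shows that U+02BC is classified as a letter (general category Lm) while U+0027 and U+2019 are punctuation (Po and Pf), which is one way to see why the choice of character affects how software treats it.

```python
import unicodedata

# The three characters that have been proposed for the English apostrophe.
CANDIDATES = ["\u0027", "\u2019", "\u02BC"]


def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    for ch in CANDIDATES:
        code_point, name, category = describe(ch)
        print(f"{code_point}  {category}  {name}")
```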
I agree with Paul. In addition to the references Paul already cited, the notes for U+2019 RIGHT SINGLE QUOTATION MARK in the Unicode NamesList.txt file (used in code charts) say rather clearly, “this is the preferred character to use for apostrophe.”

In decent English-language typography, apostrophe and closing single quote don’t merely “have a (kind of) similar appearance”; they are virtually identical in appearance. You would be hard-pressed to find any person, except perhaps a professional type designer, who could describe the difference.

Creating an artificial typographical distinction between “straight” apostrophe and “curly” single quote has no precedent in typography or in any character set, legacy or modern, and we should not invent such a convention here.

If “all that matters is what its appearance ends up being,” then using U+2019 for both is the obviously correct decision.

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
On 2025-08-25 12:05, Robert Elz via tz wrote:
On Mon, 25 Aug 2025 09:27:51 -0700, Paul Eggert wrote:

| For the second issue, which is what I think you're focusing on,

Yes, of course, which is why I said:

  regardless of questions of how nice it is to type

| there has been considerable confusion and some controversy due to the
| historical use of ' (U+0027 APOSTROPHE) to mean many things including
| apostrophe and single quotation marks.

Yes, and people are ignorant. ASCII at least had an excuse, there were only 128 code points (including control chars) available - not every possible glyph could be represented, not even close. At least they didn't copy (some) ancient typewriters (which might have been tempting when ASCII was being invented) and also omit 0 and 1, using O and l instead.

Unicode has no such excuse.

| On this topic the current Unicode Standard says the following[1]:
| The semantics of U+2019 are therefore context dependent.

That's simply pathetic, there is no excuse for that. If they considered U+0027 too encumbered with confusion to recommend using, they could have easily just added a new code point that is uniquely apostrophe. Apostrophe and closing single quote are about as related as 1 and l (or 0/O). They have a (kind of) similar appearance, but otherwise are completely different things.

| For example, if surrounded by letters or digits on both sides,

And 'twas the night before Christmas when the dogs' fleas' bites were ....

| Given the Unicode's limitations no approach to this problem is perfect.

Unicode has both U+0027 and U+2019 - there's no reason at all not to use both, using each when that one is required. Particularly for usages like this when no-one is going to be attempting to parse the results (nothing will be wondering whether U+0027 might have just been an ASCII single quote (opening or closing) and trying to guess what to do), where all that matters is what its appearance ends up being.
We have numerous Unicode glyphs and HTML entity symbols for variants of spaces, dashes, and Mathematical symbols and letters, so Unicode and HTML entities could manage a few extra glyphs and symbols for English word abbreviation mark `…, s possessive mark …'s, and plural s possessive mark …s', where the latter two could perhaps be combined, as with accented and historical variants of s glyphs.

-- 
Take care. Thanks, Brian Inglis              Calgary, Alberta, Canada

La perfection est atteinte                   Perfection is achieved
non pas lorsqu'il n'y a plus rien à ajouter  not when there is no more to add
mais lorsqu'il n'y a plus rien à retrancher  but when there is no more to cut
                                -- Antoine de Saint-Exupéry
Brian Inglis wrote:
We have numerous Unicode glyphs and HTML entity symbols for variants of spaces, dashes, and Mathematical symbols and letters, so Unicode and HTML entities could manage a few extra glyphs and symbols for English word abbreviation mark `…, s possessive mark …'s, and plural s possessive mark …s', where the latter two could perhaps be combined, as with accented and historical variants of s glyphs.
Decisions about whether to encode Unicode characters, especially lookalikes such as these, are always made on their own merits. They are never a matter of “we have these (near-)duplicates, so we might as well have these other (near-)duplicates as well.”

--
Doug Ewell, CC, ALB | Lakewood, CO, US | ewellic.org
On 8/25/25 10:27, Paul Eggert via tz wrote:
... The semantics of U+2019 are therefore context dependent. For example, if surrounded by letters or digits on both sides, it behaves as an in-text punctuation character and does not separate words or lines.
Remembering that the possessive of a singular noun ends with an apostrophe, not a single quote. -- gil
On Mon, 25 Aug 2025 at 10:34, Paul Eggert via tz <tz@iana.org> wrote:
For more about issues with <q>, please see:
Among the issues raised there is that "quotation marks are not included when text is copied." To me, that makes <q> a non-starter as it interferes with proper quoting (heh) of documentation. -- Tim Parenti
On 8/26/25 07:50, Tim Parenti wrote:
Among the issues raised there is that "quotation marks are not included when text is copied." To me, that makes <q> a non-starter
I didn't observe that problem with my usual browser Firefox (currently 142.0): cutting and pasting <q>hello</q> gives me "hello" with the ASCII double-quotes included. However, now that you mention it I do see the problem with Chromium 139.0.7258.138. What a mess. Although there are more complex solutions involving CSS or even JavaScript I'd rather avoid that. So it looks like we should avoid <q>. Thanks for mentioning the issue.
On 2025-08-26 13:45, Paul Eggert wrote:
So it looks like we should avoid <q>.
After reviewing the comments on this topic it seems that plain UTF-8 is the way to go. I went through the HTML and public-facing files like README (visible on GitHub), and used UTF-8 quotes and a few other special characters to make the files easier to edit and/or read. The complete set of non-ASCII characters that now occur directly in *.html, *.txt, or other public-facing text files is now:

  §«°±»½¾×–‘’“”•→−≤★⯪

and this set is now documented and checked in Makefile as UNUSUAL_OK_CHARSET. Although we can extend this character set as needed, I thought it better to not allow arbitrary UTF-8 due to the usual problems with normalization, confusables, etc.

I installed the attached set of proposed patches to implement this. The first patch is mostly a mistake but that is rectified in the last patch.

I considered going further, by using curved quotes and other special characters in commentary in the main (non public facing) data files, e.g., to say “34° 54′ S” instead of the current “34° 54' S” so that comments use U+2032 PRIME instead of the approximation U+0027 APOSTROPHE to denote minutes. However, this would be more work and I was not sure it would be worth the maintenance hassle long term so I left it alone for now.

The only change these patches make to data (as opposed to commentary) consists of changing Côte d'Ivoire to Côte d’Ivoire in iso3166.tab. The patches also change Gur'yev to Gur’yev and Dumont d'Urville to Dumont d’Urville in the comments column of zone1970.tab, and remove some unnecessary "s in the comments column of zonenow.tab.
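To make the idea of an allow-list check concrete, here is a rough Python sketch that reports non-ASCII characters falling outside the set listed above. It is not how the Makefile performs its check; the names UNUSUAL_OK and disallowed_chars are invented for this illustration, and only the character set itself comes from the message.

```python
# Allowed non-ASCII characters, copied from the set listed in the message.
# The variable name is hypothetical; the Makefile calls its own setting
# UNUSUAL_OK_CHARSET and implements the check differently.
UNUSUAL_OK = set("§«°±»½¾×–‘’“”•→−≤★⯪")


def disallowed_chars(text: str) -> list[str]:
    """Return the sorted non-ASCII characters in text that are not allowed."""
    return sorted({ch for ch in text if ord(ch) > 127 and ch not in UNUSUAL_OK})


if __name__ == "__main__":
    # U+2032 PRIME is not in the allowed set, so it is reported.
    for ch in disallowed_chars("34° 54′ S"):
        print(f"unexpected character U+{ord(ch):04X}")
```

Keeping the allowed set small and explicit means that a look-alike or unnormalized character shows up as a reported code point instead of slipping through unnoticed.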
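Hi everyone,

I use Traditional Chinese. When I am asked to select where I am from, I often cannot find "Taiwan" among the choices. Taiwan is also divided into cities and districts. On top of that, Chinese script is broad and profound, and it is hard to convert it with so-called translation or encoding. But I am proud to have been born Taiwanese. Thank you for your guidance.

On Sat, Aug 30, 2025 at 3:28 PM, Paul Eggert via tz <tz@iana.org> wrote: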
On 2025-08-26 13:45, Paul Eggert wrote:
So it looks like we should avoid <q>.
After reviewing the comments on this topic it seems that plain UTF-8 is the way to go. I went through the HTML and public-facing files like README (visible on GitHub), and used UTF-8 quotes and a few other special characters to make the files easier to edit and/or read. The complete set of non-ASCII characters that now occur directly in *.html, *.txt, or other public-facing text files is now:
§«°±»½¾×–‘’“”•→−≤★⯪
and this set is now documented and checked in Makefile as UNUSUAL_OK_CHARSET. Although we can extend this character set as needed, I thought it better to not allow arbitrary UTF-8 due to the usual problems with normalization, confusables, etc.
I installed the attached set of proposed patches to implement this. The first patch is mostly a mistake but that is rectified in the last patch.
I considered going further, by using curved quotes and other special characters in commentary in the main (non public facing) data files, e.g., to say “34° 54′ S” instead of the current “34° 54' S” so that comments use U+2032 PRIME instead of the approximation U+0027 APOSTROPHE to denote minutes. However, this would be more work and I was not sure it would be worth the maintenance hassle long term so I left it alone for now.
The only change these patches make to data (as opposed to commentary) consists of changing Côte d'Ivoire to Côte d’Ivoire in iso3166.tab. The patches also change Gur'yev to Gur’yev and Dumont d'Urville to Dumont d’Urville in the comments column of zone1970.tab, and remove some unnecessary "s in the comments column of zonenow.tab.
(Please translate your messages into English before sending, as this is an English-language mailing list.)

Two things. First, TZDB is an English-language database. If you are having trouble finding Taiwan using a traditional Chinese locale, this is something downstream from us, either CLDR[1] or your operating system's supplier, and you'll need to contact them to fix it.

[1]: https://data.iana.org/time-zones/tz-link.html#notation

Second, I don't observe your problem when using 'tzselect', which is the simple English-language text-based selection interface that TZDB distributes. See the shell transcript below. There is a separate entry for Taiwan, which has a distinct time zone history since 1970. There is no need for subdistricts of Taiwan, as all subdistricts have the same such history.

  $ tzselect
  Please identify a location so that time zone rules can be set correctly.
  Please select a continent, ocean, "coord", "TZ", "time", or "now".
   1) Africa
   2) Americas
   3) Antarctica
   4) Arctic Ocean
   5) Asia
   6) Atlantic Ocean
   7) Australia
   8) Europe
   9) Indian Ocean
  10) Pacific Ocean
  11) coord - I want to use geographical coordinates.
  12) TZ - I want to specify the timezone using a proleptic TZ string.
  13) time - I know local time already.
  14) now - Like "time", but configure only for timestamps from now on.
  #? 5
  Please select a country whose clocks agree with yours.
   1) Afghanistan        20) Israel            39) Qatar
   2) Armenia            21) Japan             40) Réunion
   3) Australia          22) Jordan            41) Russia
   4) Azerbaijan         23) Kazakhstan        42) Saudi Arabia
   5) Bahrain            24) Korea (North)     43) Seychelles
   6) Bangladesh         25) Korea (South)     44) Singapore
   7) Bhutan             26) Kuwait            45) Sri Lanka
   8) Brunei             27) Kyrgyzstan        46) Syria
   9) Cambodia           28) Laos              47) Taiwan
  10) China              29) Lebanon           48) Tajikistan
  11) Cyprus             30) Macau             49) Thailand
  12) East Timor         31) Malaysia          50) Turkey
  13) French S. Terr.    32) Mongolia          51) Turkmenistan
  14) Georgia            33) Myanmar (Burma)   52) United Arab Emirates
  15) Hong Kong          34) Nepal             53) Uzbekistan
  16) India              35) Oman              54) Vietnam
  17) Indonesia          36) Pakistan          55) Yemen
  18) Iran               37) Palestine
  19) Iraq               38) Philippines
  #? 47

  Based on the following information:

          Taiwan

  TZ='Asia/Taipei' will be used.
  Selected time is now:   Sun Aug 31 12:13:58 AM CST 2025.
  Universal Time is now:  Sat Aug 30 04:13:58 PM UTC 2025.
  Is the above information OK?
  1) Yes
  2) No
  #? 1

  You can make this change permanent for yourself by appending the line
          export TZ='Asia/Taipei'
  to the file '.profile' in your home directory; then log out and log in again.

  Here is that TZ value again, this time on standard output so that you
  can use the ./tzselect command in shell scripts:
  Asia/Taipei
participants (9)
- Brian Inglis
- Doug Ewell
- Guy Harris
- M
- Paul Eggert
- Paul Gilmartin
- Peter Krefting
- Robert Elz
- Tim Parenti