Hi all,
Maybe it would be interesting to hear whether Lianna has something to add to it? I am curious to understand how big the impact of this is in Armenia.
Cheers, Roberto
On 11.02.2019, at 23:47, Dusan Stojicevic <dusan@dukes.in.rs> wrote:
http://domainincite.com/23908-right-of-the-colon-idn-getting-killed-over-dot...
@MarkS
Dusan
Hi all,
Thanks, Roberto, for your inquiry about the situation in Armenia. We held a press conference recently, on the Day of using Armenian scripts on the Internet, and talked about the issue of the period mark. The confusion is not about how many dots should be used, or which is the Armenian colon or period, but about the fact that the dot in the Armenian script in Unicode is different from the dot in the Latin script. So when someone types a domain name and then needs to add ․հայ (the TLD extension), if this dot is written in Armenian script (this non-standard dot is the default dot on the Armenian keyboard), servers treat it as an unknown symbol and the domain name doesn't resolve to the intended address.
This means that the user has to switch the keyboard to Latin, type the dot, then switch the keyboard back to write in Armenian. And the confusion is that NOT everyone knows about this problem, which doesn't help the popularization of our IDN TLD. So, to your question about how big the impact is, I'd say it's very big. It's a tremendous problem for the end user and a serious problem for the IDN TLD.
Best, Lianna
Hi Lianna, thanks for clarifying the problem.
This seems similar to (if not the same as) the case of the other “dots” in Unicode. It is the application that can handle the conversion. For example, Chrome recognizes the ideographic full stop used in Chinese (U+3002) as a label separator in the URL address bar: the ideographic full stop is converted to the full stop (U+002E) before the DNS query happens. This conversion does not happen in Safari. I am not sure about Firefox or other browsers.
IDNA2003 established a number of “dot” characters as viable label separators. That definition was dropped in IDNA2008 (or at least I could not find it), but I found it in UTS#46 <http://unicode.org/reports/tr46/#Notation>.
Perhaps we should look into requesting an update of UTS#46 to add the “Armenian dot” or in the protocol itself (e.g. a mapping solution)?
Dennis
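The application-side conversion Dennis describes could look roughly like the sketch below. The three mapped characters are the extra label separators listed in IDNA2003 (RFC 3490, section 3.1). The function name and the `extra` parameter are assumptions made for illustration; in particular, treating U+2024 as a separator is not part of any standard today.

```python
# Dot-like characters that IDNA2003 (RFC 3490, section 3.1) treats as
# label separators in addition to U+002E FULL STOP.
IDNA2003_DOTS = "\u3002\uFF0E\uFF61"  # ideographic, fullwidth, halfwidth ideographic


def normalize_separators(host: str, extra: str = "") -> str:
    """Replace known dot-like separators with U+002E before a lookup.

    `extra` is a hypothetical extension point: further characters that an
    application chooses to accept as separators.
    """
    table = {ord(ch): "." for ch in IDNA2003_DOTS + extra}
    return host.translate(table)


# Chinese example using U+3002 between two labels.
chinese = "\u4f8b\u5b50\u3002\u4e2d\u56fd"
print(normalize_separators(chinese).split("."))

# Armenian letters, then U+2024, then the letters of the TLD.
armenian = "\u0585\u0580\u056b\u0576\u0561\u056f\u2024\u0570\u0561\u0575"
print(normalize_separators(armenian) == armenian)      # U+2024 is left alone
print(normalize_separators(armenian, extra="\u2024"))  # only if opted in
```

With the default table the Armenian input passes through unchanged, which matches the failure Lianna reports; only the opt-in call turns U+2024 into a separator.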
In article <1A1FCA40-9172-4FCF-AC8B-2A4A1FE3E11A@verisign.com> you write:
should look into requesting an update of UTS#46 to add the “Armenian dot” or in the protocol itself (e.g. a mapping solution)?
No -- the problem is that you need different mappings for different input languages. See the message I just sent. R's, John
On 2/12/2019 12:33 PM, John Levine wrote:
No -- the problem is that you need different mappings for different input languages. See the message I just sent.
What about URLs that are in a document or database? There's no "input language" for them. (Or not necessarily one). I think for things like separators, the only thing that works is a generic set of acceptable ones that will be converted, so that no matter from where you access a URL it will work the same. Whether such mapping would be sensitive to the *script* of some character found in the domain name, that's another matter. A./
Dear all,
During the Barcelona meeting, I reported this issue. It is not an issue only in Armenia. There's a similar dot issue in the Arabic script (Sarmad, correct me if I'm wrong); at least that's what I understood during our workshop.
An issue of the same nature is @, which does not exist on all Cyrillic keyboards. Users do the same thing: switch to Latin, type @, and then switch back to the Cyrillic keyboard.
I find both issues crucial to solve, and I alerted Mark to look deeper into them. From my point of view, these issues directly decrease the usage of IDNs.
Regards, Dusan
In article <a956ca35-6264-6bd7-38c3-0816a34f7d98@ix.netcom.com> you write:
What about URLs that are in a document or database?
Their IDNs should be U-labels, not random text that looks sort of like U-labels. The time to clean stuff up is when it's entered and you have some idea of the context.
On 2/12/2019 4:09 PM, John Levine wrote:
Their IDNs should be U-labels, not random text that looks sort of like U-labels. The time to clean stuff up is when it's entered and you have some idea of the context.
That's wishful thinking. (Or, that horse has left the barn). A URL is any string that happens to resolve in the local browser and is then pasted into a document. You can verify that today it even includes uppercase Greek, for example. In a recent discussion someone put it that labels should be valid IDN U-labels once they are "in the system". Now, when you make a DNS lookup the label you submit is "in the system". When it's on the side of a bus, to use the other extreme, it's not. Now, where do you draw the line for where "the system" starts? I don't think you can include all HTML documents or HTML mail messages or any other place a URL can and does exist today. A./
In a recent discussion someone put it that labels should be valid IDN U labels once they are "in the system". Now, when you make a DNS lookup the label you submit is "in the system". When it's on the side of a bus, to use the other extreme, it's not. Now, where do you draw the line for where "the system" starts? I don't think you can include all HTML documents or HTML mail message or any other place a URL can and does exist today.
Who knows? There are standards for URLs, but if you look at three browsers you'll probably find at least four misinterpretations of them. Regards, John Levine, john.levine@standcore.com Standcore LLC
On 2/12/2019 7:12 PM, John Levine wrote:
Who knows? There are standards for URLs, but if you look at three browsers you'll probably find at least four misinterpretations of them.
What I am arguing is that you cannot retroactively apply standards that haven't been followed; particularly for HTML you have a whole internet's worth of legacy data that any system must be able to handle and continue to do so for the indefinite future. A./
What I am arguing is that you cannot retroactively apply standards that haven't been followed; particularly for HTML you have a whole internet's worth of legacy data that any system must be able to handle and continue to do so for the indefinite future.
I suppose we're stuck with UTS46 for IDNs in URLs for now, but as we are seeing there are places where UTS46 does the wrong thing, so there are places we can fix at the front end. After all, if the input is preprocessed into valid U-labels and dots, UTS46 nontransitional shouldn't make things any worse. Regards, John Levine, john.levine@standcore.com Standcore LLC
On 2/12/2019 7:51 PM, John Levine wrote:
I suppose we're stuck with UTS46 for IDNs in URLs for now, but as we are seeing there are places where UTS46 does the wrong thing, so there are places we can fix at the front end.
After all, if the input is preprocessed into valid U-labels and dots, UTS46 nontransitional shouldn't make things any worse.
Best avenue for that is new protocols where you may be able to enforce pure IDNs. A./
In article <af160a09-c0c4-d981-55fd-acf56c5aae3b@isoc.am>, Lianna Galstyan <lianna@isoc.am> wrote:
So that when someone puts the name of the domain name and then needs to add ․հայ (TLD extension), if this dot is written in Armenian script (this non-standard dot is a default dot in Armenian keyboard), it's considered by servers to be an unknown symbol and the domain name doesn't resolve and show the exact address. This means, that the user should change the keyboard to the Latin, put the dot, then change the keyboard again to write in Armenian. ...
This is a well-known problem for IDNs, but not one we have dealt with.
The first version of IDNA (IDNA2003) included a set of character mappings that was supposed to work everywhere. In fact, it worked OK in western Europe and maybe Japan, but people in other places ran into the problems you're seeing: characters that make sense in your language but that don't work in the generic mapping.
The second version of IDNA, IDNA2008, recognized this problem and deliberately removed all the mappings. The idea was that experts in different scripts and languages would create mappings that make sense for people who use those scripts and speak those languages. The mappings would turn the user input into standardized U-labels that the IDN software can then use.
Unfortunately, the Unicode Consortium didn't get the memo and published UTS46, which provides a few more generic input mappings that have the same problems as the IDNA2003 mapping. At this point, all the web browsers use those generic mappings.
If there were an Armenian mapping for IDNs, then when the characters in a domain name are Armenian, it would handle Armenian punctuation, and when the characters are Latin, Latin punctuation. The browser or app could then apply the Armenian mapping when the user's input language is Armenian. Unfortunately, the Armenian mapping does not exist yet.
This is a problem the UASG could work on, by bringing together IDN experts and language and script experts to create mappings that work for languages where the generic mappings don't.
Regards, John Levine, john.levine@standcore.com
Hi, On Tue, Feb 12, 2019 at 03:32:54PM -0500, John Levine wrote:
The second version of IDNA, IDNA2008, recognized this problem and deliberately removed all the mappings. The idea was that experts in different scripts and languages would create mappings that make sense for people who use those scripts and speak those languages. The mappings would turn the user input into standardized U-labels that the IDN software can then use.
This isn't quite correct for the case of the dots in domain names. There are two additional important wrinkles here.
First, IDNA is defined for _labels_, and not for _domain names_. This is perfectly clear in IDNA2008. It is less clear in IDNA2003, because while most of that specification _is_ about labels, there are some places where the whole domain name is implicated. This is particularly true of label separators (the dots).
That brings us, however, to two different problems. First, domain names are distributed in their operation, and that means that there is no way to be sure that the "whole domain name" is in one script. We see this today, quite commonly, where there are IDLs that live under traditional LDH-labels. For most Latin-based languages, this isn't really a problem, but where you have multiple scripts where at least one is not Latin, it's hard to be sure exactly which rules ought to apply.
But more importantly, there is an additional problem with domain names: the label separators we are used to seeing _don't appear_ in the DNS. A domain name like crankycanuck.ca. does not appear, in the DNS, as a series of octets separated by a special character (.), but instead as a series of octets bound by length indicators that also function as label separators. (Conceptually, it's like 12crankycanuck2ca00; the final 0 is a null label to indicate the root. This is, by the way, the reason it is possible to have a label with a . in it in the DNS. You rarely see these, but they sometimes show up in the responsible person field of the SOA record.) Since the separator never actually appears in the DNS and since you're supposed to go label by label, this is a problem.
Now, it _might_ be that an application that is attempting to handle IDNs that are likely to be entered in a given locale should do some sort of mapping of the normal stops in that locale: that's roughly what RFC 5895 suggests.
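The length-prefixed form described above can be sketched as follows. This is a minimal illustration of the wire encoding of a name (no compression, no IDNA processing); the function name and the example mailbox john.doe@example.org are made up for the sketch.

```python
def to_wire(labels: list[bytes]) -> bytes:
    """Encode labels as length-prefixed octets, ending with the root label."""
    out = bytearray()
    for label in labels:
        if not 1 <= len(label) <= 63:
            raise ValueError("each label must be 1..63 octets")
        out.append(len(label))  # the length octet doubles as the separator
        out += label
    out.append(0)  # zero-length label for the root
    if len(out) > 255:
        raise ValueError("encoded name exceeds 255 octets")
    return bytes(out)


# crankycanuck.ca. becomes <12>crankycanuck<2>ca<0>; no 0x2E octet appears.
print(to_wire([b"crankycanuck", b"ca"]))

# A label may itself contain 0x2E, as in the mailbox label of an SOA RNAME
# (hypothetical address john.doe@example.org).
print(to_wire([b"john.doe", b"example", b"org"]))
```

Because the encoder takes a list of labels rather than a dotted string, whatever character the user typed between labels has already been consumed by the application before this step.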
If there were an Armenian mapping for IDNs, then when the characters in a domain name are Armenian, it would handle Armenian punctuation, and when the characters are Latin, Latin punctuation.
That won't, of course, work, because it is possible to have mixed code point repertoires either within or between labels. _Probably_ it would be safe just to map all stops to ".", but nobody knows and the last time we tried that it didn't work out. Best regards, A -- Andrew Sullivan ajs@anvilwalrusden.com
In article <20190213015321.im3xzkmrbn2nsnp5@mx4.yitter.info> you write:
But more importantly, there is an additional problem with domain names: the label separators we are used to seeing _don't appear_ in the DNS.
True.
If there were an Armenian mapping for IDNs, then when the characters in a domain name are Armenian, it would handle Armenian punctuation, and when the characters are Latin, Latin punctuation.
That won't, of course, work, because it is possible to have mixed code point repertoires either within or between labels. _Probably_ it would be safe just to map all stops to ".", but nobody knows and the last time we tried that it didn't work out.
I agree we can't do it perfectly, but the question is whether we can do it better than we're doing it now. We seem to agree that trying to do mapping without context has gone about as far as it can go, which isn't far enough. Context-free dots are particularly horrible since there are at least two kinds of dots (U+00B7 and U+30FB) which can appear in U-labels in some contexts.
My question is whether we can come up with context-sensitive mappings that are not horribly complicated and match what users expect.
For the case of Armenian, it seems like if you have aaa:aaa where aaa is Armenian text and : is the Armenian stop, it makes sense to map the : to an ASCII dot. If you have aaa:lll (Latin text), maybe it does, or maybe since the user is shifting to Latin anyway it's not hard to type a dot instead. Or maybe if you know the input's coming from an Armenian input device, you always treat : as a dot. I don't know which of those, or something else, is best, but the current setup is clearly wrong.
R's, John
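The three options John lists can be compared with a small sketch. Everything here is an assumption for illustration: the policy names are invented, the set of "Armenian stops" (U+0589 ARMENIAN FULL STOP and U+2024 ONE DOT LEADER) is a guess at what users actually type, and script detection by Unicode character name is a shortcut rather than a proper script lookup.

```python
import re
import unicodedata

# Assumed candidate stops: U+0589 ARMENIAN FULL STOP, U+2024 ONE DOT LEADER.
ARMENIAN_STOPS = "\u0589\u2024"


def is_armenian(label: str) -> bool:
    """Rough check: every character's Unicode name starts with ARMENIAN."""
    return bool(label) and all(
        unicodedata.name(ch, "").startswith("ARMENIAN") for ch in label
    )


def map_stops(name: str, policy: str = "both") -> str:
    """Map Armenian stops to '.' according to a (hypothetical) policy.

    both   - only when the labels on both sides are Armenian (aaa:aaa)
    left   - when the label on the left is Armenian (also covers aaa:lll)
    always - input is known to come from an Armenian input device
    """
    # A capturing group keeps the separators at the odd indices.
    parts = re.split(f"([{ARMENIAN_STOPS}.])", name)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0 or part == ".":
            out.append(part)
            continue
        left, right = parts[i - 1], parts[i + 1]
        if policy == "always":
            convert = True
        elif policy == "left":
            convert = is_armenian(left)
        else:
            convert = is_armenian(left) and is_armenian(right)
        out.append("." if convert else part)
    return "".join(out)


hy = "\u0570\u0561\u0575"
print(map_stops(hy + "\u0589" + hy) == hy + "." + hy)
print(map_stops(hy + "\u0589com") == hy + "\u0589com")
print(map_stops(hy + "\u0589com", policy="left") == hy + ".com")
```

The sketch does not settle which policy is right; it only shows that each one is a few lines of code, so the open question is user expectation rather than implementation cost.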
Replies inline On 2/12/19 9:28 PM, John Levine wrote:
I agree we can't do it perfectly, but the question is whether we can do it better than we're doing it now. We seem to agree that trying to do mapping without context has gone about as far as it can go, which isn't far enough. Context free dots are particularly horrible since there are at least two kinds of dots (00b7 and 30fb) which can appear in U-labels in some contexts.
One technical issue that comes up is how the stop is processed. In the DNS protocol itself, the dot represents a separator for domain components (which have a max length of 63 ASCII characters) and is also used as a marker for DNS compression, which is somewhat mandatory to make replies fit within the 512 byte limit. This creates an entire nightmare of problems if a period (specifically 0x2E) is used in an IDN in a context where it doesn't represent a subdomain. I don't know enough about Armenian or any other foreign languages to say specifically whether this is a problem in actuality, but I can easily imagine areas where this causes pain.
To expand on John's comment, context-based processing is a can of worms that should be approached very carefully. As the dot has a very specific meaning within the DNS protocol itself (with a single dot representing the root zone), allowing another character to have this functionality in the U-label could easily lead to unpredictable results; this is especially potent in cases where you have an IDN domain with an ASCII TLD.
My question is whether we can come up with context sensitive mappings that are not horribly complicated and match what users expect.
For the case of Armenian, it seems like if you have aaa:aaa where aaa is Armenian text and : is the Armenian stop, it makes sense to map the : to an ASCII dot. If you have aaa:lll (Latin text), maybe it does, or maybe since the user is shifting to Latin anyway it's not hard to type a dot instead. Or maybe if you know the input's coming from an Armenian input device, you always treat : as a dot. I don't know which of those, or something else, is best, but the current setup is clearly wrong.
Arguably, the character used to encode a domain stop should match the level it represents, i.e., you need a period to denote .com as it's an ASCII TLD, even if the next part. If done in this manner, then the Armenian separator character with an Armenian TLD would at least be straightforward and reduce the number of places where things can go wrong in U->A label generation. This would also make things like the DNS search path more or less work as expected, with the downside that it may be unintuitive for users.
I do feel a better solution is needed here, but I'm not sure I have a solid suggestion on how to handle it. Part of me is wondering if an EDNS extension may be a path forward to help reduce IDN pain in the future by allowing resolution of U-labels directly.
Michael
Michael Casadevall writes:
<SNIPPED>
There is technical issue that comes up is how the stop is processed. In the DNS protocol itself, the dot represents a separator for domain components (which have a max length of 63 ASCII characters) and are also used as markers for DNS compression which is somewhat mandatory to make replies fit within the 512 byte limit.
Strictly speaking, this is not how it works. On the wire, the dots/stops/periods are not used as separators between labels in the protocol. It has always been a custom to use them as separators in human-readable text, and humans have become very used to this. Lots of applications manipulating labels in the protocol have followed this custom. But labels themselves are defined as a count followed by that number of bytes. A tool looking up "example.org" translates this domain name into "\7example\3org". It is the application which takes care of this.
The same is true for IDNA. If the preferred separator is a different character (and I understand from the discussion that in Armenian it is a colon), then with the proper LOCALE "example:org" should also result in "\7example\3org" on the wire.
Regards, jaap
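Jaap's point, that the separator is a presentation choice made by the application, can be sketched like this. The function names are invented, the separator is passed in directly instead of being derived from a real locale database, and the labels are assumed to be plain ASCII already.

```python
def labels_from_text(name: str, separator: str = ".") -> list[str]:
    """Split human-readable text into labels on a caller-chosen separator."""
    if name.endswith(separator):
        name = name[: -len(separator)]  # tolerate one trailing separator
    labels = name.split(separator)
    if any(not label for label in labels):
        raise ValueError("empty label")
    return labels


def to_wire(labels: list[str]) -> bytes:
    """Count octet followed by the label bytes, then the root's zero octet."""
    out = b""
    for label in labels:
        raw = label.encode("ascii")  # ASCII-only in this sketch
        out += bytes([len(raw)]) + raw
    return out + b"\x00"


dotted = to_wire(labels_from_text("example.org"))
coloned = to_wire(labels_from_text("example:org", separator=":"))
print(dotted == coloned)
```

Both inputs produce the same octets, with a trailing zero-length root label that the "\7example\3org" shorthand leaves implicit.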
Initially I looked in the Armenian Unicode block (U+0530-058F) and could not find an Armenian dot. I then examined ․հայ and determined that the Armenian dot used is U+2024 ONE DOT LEADER from the General Punctuation block.
Curiously, the two Armenian virtual keyboards on my OSX produce the standard U+002E and not U+2024 when the dot key is clicked. Additionally, the online Armenian virtual keyboard in Google Translate produces the standard U+002E and not U+2024.
André Schappo
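The kind of inspection André describes can be repeated with a few lines of Python. The string below is written with escapes on the assumption that the text in the thread is U+2024 followed by the three Armenian letters; the helper name is made up.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Return (code point, Unicode name) for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Assumed content of the TLD string as discussed in the thread.
tld = "\u2024\u0570\u0561\u0575"
for code_point, name in describe(tld):
    print(code_point, name)

# For comparison, the separator that resolvers and browsers expect.
print(describe("."))
```

Running the same helper on text typed with a given keyboard layout shows directly whether that layout emits U+002E or U+2024.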
On the same topic, with comments from Armenian colleagues:
http://telecom.arka.am/en/news/telecom/why_hay_domain_is_not_popular_in_arme...
Regards, Natalia
--
Natalia MOCHU
Global Stakeholder Engagement Manager, Eastern Europe and Central Asia
Internet Corporation for Assigned Names and Numbers (ICANN)
natalia.mochu@icann.org
www.icann.org
participants (12)
- Andre Schappo
- Andrew Sullivan
- Asmus Freytag
- Asmus Freytag (c)
- Dusan Stojicevic
- Jaap Akkerhuis
- John Levine
- Lianna Galstyan
- Michael Casadevall
- Natalia Mochu
- Roberto Gaetano
- Tan Tanaka, Dennis