Replies inline.

On 2/12/19 9:28 PM, John Levine wrote:
In article <20190213015321.im3xzkmrbn2nsnp5@mx4.yitter.info> you write:
But more importantly, there is an additional problem with domain names: the label separators we are used to seeing _don't appear_ in the DNS.
True.
If there were an Armenian mapping for IDNs, when the characters in a domain name are Armenian, it handles Armenian punctuation, and when the characters are Latin, Latin punctuation.
That won't, of course, work, because it is possible to have mixed code point repertoires either within or between labels. _Probably_ it would be safe just to map all stops to ".", but nobody knows and the last time we tried that it didn't work out.
I agree we can't do it perfectly, but the question is whether we can do it better than we're doing it now. We seem to agree that trying to do mapping without context has gone about as far as it can go, which isn't far enough. Context-free dots are particularly horrible since there are at least two kinds of dots (U+00B7 and U+30FB) which can appear in U-labels in some contexts.
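To make the context-free problem concrete, here is a minimal sketch of a blanket "map every dot-like character to ." rule. The DOT_LIKE set and the function names are hypothetical, not taken from any standard: the set deliberately lumps the ideographic and fullwidth stops together with the two middle dots mentioned above, to show what goes wrong when context is ignored.

```python
# Hypothetical blanket mapping: every "dot-like" code point becomes ".".
DOT_LIKE = {
    "\u3002",  # IDEOGRAPHIC FULL STOP
    "\uff0e",  # FULLWIDTH FULL STOP
    "\uff61",  # HALFWIDTH IDEOGRAPHIC FULL STOP
    "\u00b7",  # MIDDLE DOT (legitimate inside some labels, e.g. Catalan l·l)
    "\u30fb",  # KATAKANA MIDDLE DOT (legitimate inside some Japanese labels)
}


def naive_map(name: str) -> str:
    """Replace every dot-like character with an ASCII dot, ignoring context."""
    return "".join("." if ch in DOT_LIKE else ch for ch in name)


def split_labels(name: str) -> list[str]:
    """Split a name into labels after applying the naive mapping."""
    return naive_map(name).split(".")
```

With this rule, "example\u3002com" splits into two labels as a user would expect, but "col\u00b7lega.cat" splits into three, because the middle dot inside the first label is wrongly treated as a separator.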
One technical issue that comes up is how the stop is processed. In the DNS protocol itself, the dot represents the separator between domain labels (each of which is at most 63 octets), and label boundaries are also what DNS name compression works on, which is more or less mandatory to make replies fit within the 512-byte limit. This creates a nightmare of problems if a period (specifically 0x2E) is used in an IDN in a context where it doesn't represent a subdomain. I don't know enough about Armenian or other languages to say whether this is a problem in practice, but I can easily imagine areas where it causes pain.

To expand on John's comment, context-based processing is a can of worms that should be approached very carefully. The dot has a very specific meaning within the DNS protocol itself (a single dot represents the root zone), so allowing another character to take on this function in the U-label could easily lead to unpredictable results. This is especially likely in cases where you have an IDN domain with an ASCII TLD.
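For reference, a minimal sketch of how a name ends up on the wire, assuming the input is already an all-ASCII name (A-labels). The function name encode_name is hypothetical. Each label is written as a length octet followed by its bytes, and the root is a single zero octet, so the separator in the textual form decides where the label boundaries fall but is not itself transmitted. Compression pointers are left out of this sketch.

```python
def encode_name(name: str) -> bytes:
    """Encode an ASCII domain name as uncompressed DNS wire format."""
    # A single trailing dot only marks the name as fully qualified.
    if name.endswith("."):
        name = name[:-1]
    labels = name.split(".") if name else []

    out = bytearray()
    for label in labels:
        raw = label.encode("ascii")
        # Each label must be 1 to 63 octets long.
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"label length {len(raw)} out of range: {label!r}")
        out.append(len(raw))  # length octet
        out += raw            # label bytes
    out.append(0)             # zero-length label: the root
    return bytes(out)
```

For example, encode_name("example.com") yields b"\x07example\x03com\x00", and encode_name(".") yields b"\x00". Whatever character the user typed as a separator has to be resolved into label boundaries before this step.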
My question is whether we can come up with context sensitive mappings that are not horribly complicated and match what users expect.
For the case of Armenian, it seems like if you have aaa:aaa where aaa is Armenian text and : is the Armenian stop, it makes sense to map the : to an ASCII dot. If you have aaa:lll (Latin text), maybe it does, or maybe since the user is shifting to Latin anyway it's not hard to type a dot instead. Or maybe if you know the input's coming from an Armenian input device, you always treat : as a dot. I don't know which of those, or something else, is best, but the current setup is clearly wrong.
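Two of the options above can be sketched as follows. This is only an illustration under stated assumptions: the ":" in the example is taken to stand for U+0589 ARMENIAN FULL STOP, "Armenian text" is approximated by checking the Unicode character name (the Python standard library has no script property), and the function names and the always flag are hypothetical.

```python
import unicodedata

ARMENIAN_FULL_STOP = "\u0589"


def is_armenian_letter(ch: str) -> bool:
    """Rough script test: a letter whose Unicode name starts with ARMENIAN."""
    return ch.isalpha() and unicodedata.name(ch, "").startswith("ARMENIAN")


def map_armenian_stops(text: str, always: bool = False) -> str:
    """Map U+0589 to "." when both neighbours are Armenian letters.

    With always=True (e.g. input known to come from an Armenian keyboard),
    every U+0589 is mapped regardless of its neighbours.
    """
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch != ARMENIAN_FULL_STOP:
            continue
        before = text[i - 1] if i > 0 else ""
        after = text[i + 1] if i + 1 < len(text) else ""
        if always or (is_armenian_letter(before) and is_armenian_letter(after)):
            chars[i] = "."
    return "".join(chars)
```

In the default mode, an Armenian stop between two Armenian labels becomes an ASCII dot, while one followed by Latin text is left alone for some later step to reject or handle. Which behaviour users actually expect is exactly the open question.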
Arguably, the separator character should match the level it introduces; i.e., you need a period to denote .com because it's an ASCII TLD, even if the label next to it is not ASCII. If done in this manner, an Armenian separator with an Armenian TLD would at least be straightforward, and it would reduce the number of places where things can go wrong in U-label to A-label conversion. This would also make things like the DNS search path more or less work as expected, with the downside that it may be unintuitive for users.

I do feel a better solution is needed here, but I'm not sure I have a solid suggestion on how to handle it. Part of me wonders whether an EDNS extension might be a path forward to reduce IDN pain in the future by allowing resolution of U-labels directly.

Michael