Store domain in Punycode or Unicode?
Hi,

I work for Microsoft and own the EAI project, mostly for O365 mail flow. We recently announced support for Phase 1 of EAI: we can now send and receive email with EAI addresses.

I would like to bring a question, or rather a challenge, to this group's attention and discuss it here. Today we do not allow customers to enter an IDN in Unicode in our system (O365), so customers can only enter a domain in ASCII, or an IDN in Punycode form. This is already a challenge for us, since mail may arrive in UTF-8 form. More importantly, to prepare for future EAI support, we must design how customers will enter a Unicode domain in O365, and decide in which form we store the domain: Unicode or Punycode?

Our proposal is to store the domain in Punycode. The reasons are as follows:

1. Domains in our system are unique, meaning the domain is a key. A domain shall exist only once and belong to only one customer.
2. We already allow customers to enter domains in Punycode form.
3. At the gateway we need to know whether a domain is in our system. The match logic will be as follows:
   * Is the domain in the system? If so, go ahead and accept.
   * If not, is it in UTF-8 form? If so, convert to Punycode and search again.
4. Every time we display the domain, we will convert it to Unicode.
5. This is how DNS supports IDN. Uniform storage will make implementation a lot easier.

But if we allow domains to be stored in Unicode, then we have to distinguish those in Punycode from those in Unicode and convert back and forth. I understand that we always need conversion. But with a single storage form we know we only ever need to convert to the other form, whereas otherwise we might need to convert in both directions everywhere, which is very costly and very confusing.

Carolyn
Hi, On Tue, Apr 03, 2018 at 06:36:03PM +0000, Carolyn Liu via UA-discuss wrote:
Today we do not allow customers to enter IDN (in Unicode) in our system (O365), so customers can only enter domain in ASCII, or an IDN in Punycode form.
To be clear, this means that domain names with labels of the form xn--[punycode-goes-here] are allowed, but no non-LDH characters are allowed in any domain name label; but, after permitting EAI addresses you will accept UTF-8 in the local-part?
already brings a challenge for us since mail may come in as UTF8 form.
Under EAI, it _will_ come in that form.
allow our customers to enter a Unicode domain in O365, and which form we shall store the domain – Unicode or punycode?
If you attempt to support IDNA2003 or at least some of the compatibility modes of UTS#46, you effectively need to store both. IDNA2003 can lose information in a round trip between Punycode form and Unicode form, so you basically need to know the whole set. This fundamental problem was actually one of the most urgent requirements for IDNA2008, and it's why some of us remain pretty annoyed with UTS#46 as a strategy, since one of its profiles breaks that plan without any suggestion of how it'll eventually wean people from it. (We didn't have a weaning suggestion either in the IDNABIS WG, which is why we decided to break backward compatibility in the few cases, reasoning that pain early in deployment was less bad than pain later.)

If you're restricting your supported domains to IDNA2008, then you don't have to care: every actual U-label corresponds to exactly one A-label, and conversely. So you can store U-labels or A-labels and get the same result. The usual recommendation is that you store U-labels, just because storing A-labels will result in a transformation for every user event, and that might have nasty performance effects.
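The round-trip difference can be seen in a small Python sketch. It assumes Python's built-in `idna` codec, which implements IDNA2003 (RFC 3490) rather than IDNA2008; the helper name `round_trip` and the sample names are illustrative only.

```python
# Sketch only: Python's built-in "idna" codec follows IDNA2003 (RFC 3490),
# not IDNA2008, so it applies the Nameprep mappings before encoding.

def round_trip(domain: str) -> str:
    """Convert a Unicode-form domain to its ASCII form and back again."""
    ascii_form = domain.encode("idna")   # ToASCII applied to each label
    return ascii_form.decode("idna")     # ToUnicode applied to each label


if __name__ == "__main__":
    for name in ("bücher.example", "straße.example"):
        ascii_form = name.encode("idna").decode("ascii")
        back = round_trip(name)
        status = "preserved" if back == name else "changed"
        print(f"{name} -> {ascii_form} -> {back} ({status})")
```

With this codec `bücher.example` survives the round trip, while `straße.example` comes back as `strasse.example`, so the ASCII form alone does not record what was originally entered.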
1. Domains in our system are unique, meaning the domain is a key. A domain shall exist only once and belong to only one customer.
This is true regardless of whether it's a U-label or A-label: since they're DNS names they _must_ be unique globally within the DNS.
3. At the gateway we need to know whether a domain is in our system. The match logic will be as follows: a. Is the domain in the system? If so, go ahead and accept. b. If not, is it in UTF-8 form? If so, convert to Punycode and search again.
This sounds like a round trip plan. Why not just run it through the relevant algorithm and check one time? (LDH-only names will not undergo any transformation. You may need a coalesce function or similar.)
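A minimal sketch of the "normalize once, check once" idea, in Python. The names `to_lookup_key`, `HOSTED` and `is_hosted` are hypothetical, and the built-in `idna` codec used here is IDNA2003, so a production system restricted to IDNA2008 would substitute its own conversion.

```python
# Sketch only: every incoming spelling is normalized to one ASCII key,
# so the gateway performs a single lookup instead of "search, convert, search".
# Assumption: the stdlib "idna" codec (IDNA2003) stands in for the real algorithm.

def to_lookup_key(domain: str) -> str:
    """Return the lowercase ASCII (A-label) form used as the storage key."""
    # LDH-only names pass through the codec unchanged apart from lowercasing.
    return domain.lower().encode("idna").decode("ascii")


# Hypothetical set of hosted domains, stored in a single form.
HOSTED = {to_lookup_key(d) for d in ("example.com", "xn--bcher-kva.example")}


def is_hosted(domain: str) -> bool:
    """Check membership with one normalization and one lookup."""
    try:
        return to_lookup_key(domain) in HOSTED
    except UnicodeError:
        # Names the codec rejects (e.g. empty labels) are treated as unknown.
        return False
```

The same call handles `bücher.example`, `XN--BCHER-KVA.EXAMPLE` and `Example.COM`, which is the role a coalesce-style function would play at the gateway.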
4. Every time when we display, we will always convert the domain to Unicode.
This is a reason to prefer U-label forms: no conversion on display, when the user is waiting.
5. This is how DNS supports IDN. A uniform storage will make implementation a lot easier.
This is true.
But if we allow domains to be stored in Unicode, then we have to distinguish those in Punycode from those in Unicode and convert back and forth. I understand that we always need conversion. But with a single storage form we know we only ever need to convert to the other form, whereas otherwise we might need to convert in both directions everywhere, which is very costly and very confusing.
It is _certainly_ true that you want to pick one, and if you already have A-labels in the system then you might have a migration problem. That might be a reason to use A-labels for storage.

A

--
Andrew Sullivan
ajs@anvilwalrusden.com
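If a store ever ends up holding both forms, picking one form means re-keying the existing entries and checking that no two of them collapse to the same domain. The following Python sketch illustrates that check; `plan_migration` and the sample entries are hypothetical, and the stdlib `idna` codec (IDNA2003) is again only a stand-in for the chosen conversion.

```python
from collections import defaultdict

# Sketch only: re-key a mixed store to A-label form and report clashes,
# i.e. entries that turn out to be the same domain in different spellings.

def to_a_label_form(domain: str) -> str:
    """Lowercase ASCII form of a domain (stdlib IDNA2003 codec, an assumption)."""
    return domain.lower().encode("idna").decode("ascii")


def plan_migration(stored: list[str]) -> tuple[dict[str, str], dict[str, list[str]]]:
    """Map each stored spelling to its single-form key and list any clashes."""
    by_key: dict[str, list[str]] = defaultdict(list)
    for name in stored:
        by_key[to_a_label_form(name)].append(name)
    mapping = {name: key for key, names in by_key.items() for name in names}
    clashes = {key: names for key, names in by_key.items() if len(names) > 1}
    return mapping, clashes


if __name__ == "__main__":
    mapping, clashes = plan_migration(["example.com", "xn--fiqs8s", "中国"])
    print(mapping)
    print(clashes)
```

Any clash reported here would violate the "one domain, one customer" rule and would need to be resolved before the uniform key is enforced.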
I think the Unicode form should be stored. My reasons for recommending this are a little different. Mostly, I use MySQL and phpMyAdmin for my database work. Storing IDNs and/or EAI addresses in Unicode form has advantages:

① I can search by constructing an SQL query in phpMyAdmin, e.g. for all IDNs that contain 食品.
② I can learn a lot by visual inspection, e.g. I can readily identify text as being in Korean, Thai, Sinhala, Chinese, Arabic or Cyrillic script.

I could not do either of the above if only the Punycode form were stored. Basically, people can relate to the Unicode form and not to the Punycode form. So, if it involves people, store the Unicode form.

Actually, there is one Punycode label I always recognise, which is .xn--fiqs8s 😀

xn--fiqs8s = 中国 = China. I recognise it because I have seen it so many times, and I remember when it went live because I posted about it to IDNforums: http://idnforums.com/forums/26659-china-idn-cctlds-are-live.html

That is the only Punycode label I recognise.

André Schappo
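The point about searching stored names by a visible substring such as 食品 can be illustrated in Python rather than SQL. The sample names and helper functions are made up for the illustration, and the stdlib `idna` codec (IDNA2003) is used as an assumed stand-in for whatever conversion the system applies.

```python
# Sketch only: substring search over Unicode-form storage versus A-label storage.

NAMES = ["食品.example", "中国食品.example", "example.com"]   # hypothetical data
A_FORMS = [n.encode("idna").decode("ascii") for n in NAMES]   # same names as A-labels


def search_unicode_store(needle: str) -> list[str]:
    """Unicode storage: a plain substring match works directly."""
    return [n for n in NAMES if needle in n]


def search_a_label_store_naively(needle: str) -> list[str]:
    """A-label storage: the Unicode needle never appears in the ASCII text."""
    return [a for a in A_FORMS if needle in a]


def search_a_label_store(needle: str) -> list[str]:
    """A-label storage: every row must be decoded before it can be matched."""
    return [a for a in A_FORMS if needle in a.encode("ascii").decode("idna")]


if __name__ == "__main__":
    print(search_unicode_store("食品"))
    print(search_a_label_store_naively("食品"))
    print(search_a_label_store("食品"))
```

With A-label storage the match is still possible, but only by converting every stored name first, which a simple `LIKE`-style query cannot do.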
participants (3): Andre Schappo, Andrew Sullivan, Carolyn Liu