IANA IDN Tables

Andre Schappo

Feb. 25, 2018

11:54 a.m.

It must be over a year since I last looked at the IANA IDN tables and they have grown massively in that time ➜ iana.org/domains/idn-tables <http://iana.org/domains/idn-tables>

Some observations: I will use Verisignʼs Japanese .コム (transliteration of .com)

There is .コム Japanese, which is LDH (ASCII Letters, Digits, Hyphen) + Hiragana + Katakana + a heap of Han
There is .コム Hiragana, which is DH + Hiragana
There is .コム Katakana, which is DH + Katakana
There is .コム Han, which is DH + a heap of Han

André Schappo
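The four tables described above can be read as four character repertoires. The following is a minimal sketch of that idea, not Verisign's actual tables: the Unicode block ranges, the lowercase-only ASCII letters, and the names `TABLES` and `tables_admitting` are assumptions made for illustration. The published tables list individual code points and are considerably more selective, especially for Han.

```python
import string

# Illustrative repertoires only. The real IDN tables enumerate individual
# code points; the block ranges below are an assumption for this sketch.
def in_range(lo, hi):
    return lambda ch: lo <= ord(ch) <= hi

is_hiragana = in_range(0x3041, 0x3096)
is_han = in_range(0x4E00, 0x9FFF)  # CJK Unified Ideographs block only

def is_katakana(ch):
    # Katakana letters plus the prolonged sound mark (U+30FC)
    return 0x30A1 <= ord(ch) <= 0x30FA or ch == "\u30FC"

def is_dh(ch):
    # DH: ASCII digits and hyphen
    return ch in string.digits + "-"

def is_ldh(ch):
    # LDH: ASCII letters (lowercase here), digits and hyphen
    return ch in string.ascii_lowercase + string.digits + "-"

# Hypothetical table names mirroring the four tables in the message
TABLES = {
    "Japanese": [is_ldh, is_hiragana, is_katakana, is_han],
    "Hiragana": [is_dh, is_hiragana],
    "Katakana": [is_dh, is_katakana],
    "Han": [is_dh, is_han],
}

def tables_admitting(label):
    """Return the names of the tables whose repertoire covers every character."""
    return [
        name
        for name, tests in TABLES.items()
        if all(any(test(ch) for test in tests) for ch in label)
    ]
```

Under these assumptions, a mixed label such as 東京タワー is admitted only by the "Japanese" table, while a pure Katakana label is admitted by both "Japanese" and "Katakana", which is the superset/subset relationship discussed later in the thread.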

Jim DeLaHunt

February 2018

10:12 p.m.

Thank you! I had not known about the IANA IDN tables, and they are fascinating reading. In particular, they provide a concrete way of talking about registry policies that affect phishing attacks through confusables, as discussed in our "Another difficulty to overcome ..." thread a few days ago.

Jim DeLaHunt, Vancouver, Canada

-- --Jim DeLaHunt, jdlh@jdlh.com http://blog.jdlh.com/ (http://jdlh.com/) multilingual websites consultant 355-1027 Davie St, Vancouver BC V6E 4L2, Canada Canada mobile +1-604-376-8953

Mark Svancarek

5:30 p.m.

コ and ム are katakana characters, so only examples 1 and 3 are valid from a language perspective. Those characters do not exist within hiragana and I do not believe they exist as standalone Chinese characters, either.

Andrew Sullivan

6:44 p.m.

On Mon, Feb 26, 2018 at 05:30:38PM +0000, Mark Svancarek via UA-discuss wrote:

...

コ and ム are katakana characters, so only examples 1 and 3 are valid from a language perspective. Those characters do not exist within hiragana and I do not believe they exist as standalone Chinese characters, either.

That doesn't matter, because the characters _below_ the Katakana _could_ be Han or whatever. Japanese is written in multiple scripts, so knowing that one label is only written with Katakana does not tell you whether another label might have Han characters in it. It doesn't, indeed, even tell you that there might not be a Chinese name lower in the tree, much as com today permits characters outside the ASCII range.

Best regards,

A

-- Andrew Sullivan ajs@anvilwalrusden.com

Mark Svancarek

7:10 p.m.

I misunderstood Andre's original post, sorry; I was thinking about the characters in the TLD, not the characters in the other labels. Oops.

It still seems weird to me to have a superset table (Japanese) as well as subset tables (Hiragana, Katakana) for the same TLD. What's the utility?

Andrew Sullivan

7:18 p.m.

On Mon, Feb 26, 2018 at 07:10:00PM +0000, Mark Svancarek wrote:

...

It still seems weird to me to have a superset table (Japanese) as well as subset tables (Hiragana, Katakana) for the same TLD. What's the utility?

When you're submitting an IDN, there is probably a selection of language tags you can pick. The tables probably align with those tags; at least, that's the point of the distinction in the LGR approach we did for the root. (The effect is supposed to be the same anyway, due to blocking, but I don't know whether Verisign is doing that too.)

A

-- Andrew Sullivan ajs@anvilwalrusden.com

Mark Svancarek

4:58 p.m.

So perhaps the intent is to allow the user to select the LGR under which their submission will be approved?

Andrew Sullivan

5:26 p.m.

On Tue, Feb 27, 2018 at 04:58:00PM +0000, Mark Svancarek wrote:

...

So perhaps the intent is to allow user to select the LGR under which their submission will be approved?

I hesitate to say what someone else's intention is. But the basic idea of having different LGRs for the same script but different language tags is to permit a given label to be evaluated according to the target use case, while still providing a mechanism to create the final LGR out of all the code points permitted.

Imagine a script called "Slobbovian", which has two communities of speakers, "Upper Slobbovian" and "Lower Slobbovian". 80% of the Slobbovian code points are shared, and each language has a further 10% used only by it. Moreover, for every member of the 10% appropriate to Upper, there is a member of the other 10% appropriate to Lower. (This is an artificial example, obviously.) In this case, one would expect two LGRs for application purposes: the Upper and the Lower. The Upper would have its respective 10% plus the 80% as "allocatable", and the other 10% as "blocked", with a 1:1 correspondence for variant generation between the first 10% and the second 10%. The Lower would have a similar LGR, _mutatis mutandis_. The overall LGR for the _zone_ permits all the Slobbovian-range code points, but has different dispositions depending on which language tag was used in the application for the domain.

I hope that makes a little bit of sense. If not, I need to write a longer mail :)

A
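The Slobbovian example can be sketched in a few lines of Python. This is only an illustration of the idea in the message, not an RFC 7940 implementation: the stand-in code points, the tag names "upper" and "lower", and the function `evaluate` are all hypothetical. A label is checked against the repertoire for the language tag it was applied under, and every label reachable through the 1:1 variant mapping is reported as blocked.

```python
from itertools import product

# Stand-in code points (hypothetical): ASCII letters play the role of
# Slobbovian characters so the example stays readable.
SHARED = set("abcdefgh")   # the shared 80%
UPPER_ONLY = "XY"          # the 10% used only by Upper Slobbovian
LOWER_ONLY = "xy"          # the 10% used only by Lower Slobbovian

# 1:1 variant correspondence between the two language-specific sets
VARIANT = {**dict(zip(UPPER_ONLY, LOWER_ONLY)),
           **dict(zip(LOWER_ONLY, UPPER_ONLY))}

# Code points a label may contain, per language tag
ALLOCATABLE = {
    "upper": SHARED | set(UPPER_ONLY),
    "lower": SHARED | set(LOWER_ONLY),
}

def evaluate(label, tag):
    """Return (valid, blocked_variants) for a label applied for under a tag."""
    allowed = ALLOCATABLE[tag]
    if not all(ch in allowed for ch in label):
        return False, set()
    # Each position keeps its code point or swaps to its variant, if it has one
    choices = [(ch, VARIANT[ch]) if ch in VARIANT else (ch,) for ch in label]
    variants = {"".join(combo) for combo in product(*choices)}
    variants.discard(label)  # the applied-for label itself is not a variant
    return True, variants
```

For instance, under these assumptions `evaluate("aXb", "upper")` accepts the label and blocks `"axb"`, while `evaluate("aXb", "lower")` rejects it, because `X` is not in the Lower repertoire. The zone as a whole still covers all the code points; only the disposition changes with the tag.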

Don Hollander

5:30 p.m.

So, below the root level, the applicant needs to not only choose their preferred domain name, but also the expected language?

D

Mark Svancarek

5:46 p.m.

That's the way I am reading it. If different LGRs apply, then different tables and tags apply. Which *still* doesn't answer why one would want to use a Katakana-specific LGR.

I've beaten this horse to death. Thanks to everyone who helped me to better understand!

/marksv

'Andrew Sullivan'

5:47 p.m.

On Wed, Feb 28, 2018 at 06:30:36AM +1300, Don Hollander wrote:

...

So, below the root level, the applicant needs to not only choose their preferred domain name, but also the expected language?

I'm describing the approach being used for the root zone, actually. AFAIK, the processes for names in gTLDs haven't been settled (full disclosure: my employer used to lend me to ICANN to work on that effort, but I'm no longer able to do that).

Think of Arabic script and the many languages that it supports: many code points that are Arabic characters would not be appropriate for Arabic-language (or Urdu, or whatever you like) labels. Such labels may _still_ generate variants, but the disposition of the variants (blocked or allocatable) depends on what the intention of the label is supposed to be.

Best regards,

A

-- Andrew Sullivan ajs@anvilwalrusden.com

Roberto Gaetano

8:04 p.m.

Yes. This is what happened for our .org in Cyrillic: not only Russian, but additional RSEP for other languages using this script.

R

On 27.02.2018, at 18:30, Don Hollander <don.hollander@gmail.com> wrote:

So, below the root level, the applicant needs to not only choose their preferred domain name, but also the expected language?

Asmus Freytag

7:27 p.m.

On 2/26/2018 11:10 AM, Mark Svancarek via UA-discuss wrote:

...

I misunderstood Andre's original post, sorry, my mind was thinking about the characters in the TLD, not the characters in the other labels. Oops.

It still seems weird to me to have a superset table (Japanese) as well as subset tables (Hiragana, Katakana) for the same TLD. What's the utility?

None that I can see.

There is some use for different tables that overlap in more complicated ways, such as allowing both Chinese labels (from a large subset of Han ideographs) and Japanese labels (from a smaller/different subset, but together with Hiragana and Katakana). That's at least the plan for the Root Zone, and a scheme like that would be a reasonable solution for the second level for zones that have a more global audience.

Having two tables limits the combinations based on whether you are registering a "Japanese" vs. a "Chinese" label, even though the namespaces overlap. If, in addition, you align the definition of variants so some existing Japanese labels can block a new Chinese label that would look like a variant to a Chinese user, then you've incrementally reduced the attack surface for spoofing - arguably always a good thing.

A./
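The blocking effect described here, where an existing label registered under one language tag blocks a later variant label applied for under another, can be sketched as a lookup on a canonical key. This is an assumption-laden illustration rather than any registry's implementation: the two variant sets, the `Zone` class and the `index_key` helper are hypothetical, and a real LGR would define variants per code point with explicit dispositions.

```python
# Hypothetical variant sets pairing a simplified/shinjitai form with its
# traditional form. A real LGR would define these per code point.
VARIANT_SETS = [{"国", "國"}, {"学", "學"}]

# Map every member of a variant set to one representative (lowest code point)
CANONICAL = {ch: min(members) for members in VARIANT_SETS for ch in members}

def index_key(label):
    """Collapse a label to a key shared by all of its variant labels."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)

class Zone:
    """Toy zone that blocks any label whose key collides with a registration."""

    def __init__(self):
        self._by_key = {}

    def register(self, label, tag):
        key = index_key(label)
        holder = self._by_key.get(key)
        if holder is not None:
            return f"blocked by {holder[0]} ({holder[1]})"
        self._by_key[key] = (label, tag)
        return "allocated"
```

With this sketch, registering 国学 under a "ja" tag and then applying for 國學 under a "zh" tag yields a blocked result for the second application. Comparing keys also means the variant labels never have to be enumerated to detect the collision.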

Mark Svancarek

4:56 p.m.

Yep. Intersections make sense to me, subsets less so.

Andrew Sullivan

11:39 p.m.

Hi,

On Sun, Feb 25, 2018 at 11:54:06AM +0000, Andre Schappo wrote:

Some observations: I will use Verisignʼs Japanese .コム (transliteration of .com)

There is .コム Japanese, which is LDH (ASCII Letters, Digits, Hyphen) + Hiragana + Katakana + a heap of Han
There is .コム Hiragana, which is DH + Hiragana
There is .コム Katakana, which is DH + Katakana
There is .コム Han, which is DH + a heap of Han

This is because Japanese is written in more than one script. So to do "Japanese" you need to use all of them, but you need the other stuff to be correctly handled so that conflicting variants don't arise.

Those of you who are not familiar with the VIP and root LGR projects might want to read that material, which I think is linked (it used to be anyway) from the UA pages.

A

-- Andrew Sullivan ajs@anvilwalrusden.com

Maxim Alzoba

10:47 a.m.

Hello Andrew,

As I understand it, all IDN tables that passed the 2012 application round remain allowed for use despite any changes in the LGR (this was part of the discussion when LGRs were established).

Sincerely Yours,

Maxim Alzoba
Special projects manager, International Relations Department, FAITID
m. +7 916 6761580(+whatsapp) skype oldfrogger
Current UTC offset: +3.00 (.Moscow)

Andrew Sullivan

3:51 p.m.

On Mon, Feb 26, 2018 at 01:47:56PM +0300, Maxim Alzoba wrote:

...

As I understand all IDN tables, which passed 2012 application round are to be allowed to use despite any changes in LGR (it was the part of the discussion when LGRs were established).

Correct. But some people are updating in line with LGR tools already -- particularly when their communities are sensitive to the issues of general-purpose domains and need conflicting uses of the same script. This applies (for instance) to Han, certainly Latin and Arabic, and probably Cyrillic (and maybe even Latin, Cyrillic, and Greek, though I know of nobody who's been that careful yet).

The previous "variants" approach derived from the JET work made the distinction between "blocked" and "allocatable" less plain than it is now, and the inter-writing-system effects of characters are also now plainer and so easier to represent. The JET approach worked quite well for CJK when used in relative isolation, but has limitations when applied more generally, which is why the new approach was worked out.

A

-- Andrew Sullivan ajs@anvilwalrusden.com

Maxim Alzoba

6:31 p.m.

Andrew,

the issue is that, when they do not use the IDN tables that are allowed, a situation where some symbols are not used correctly might arise over time.

P.S. As for the Russian Cyrillic LGR, we have found no mistakes there.

Sincerely Yours,

Maxim Alzoba

Andrew Sullivan

6:41 p.m.

On Mon, Feb 26, 2018 at 09:31:16PM +0300, Maxim Alzoba wrote:

...

Andrew,

the issue is , when they do not use IDN tables, which are allowed, the situation where some symbols are not used correctly might arise over time.

That situation is going to arise over time one way or the other ;-)

Also, note that the root LGR is for the root zone. Other zones (lower in the tree) might use different LGRs.

A

-- Andrew Sullivan ajs@anvilwalrusden.com

Asmus Freytag

7:48 p.m.

On 2/26/2018 7:51 AM, Andrew Sullivan wrote:

...

On Mon, Feb 26, 2018 at 01:47:56PM +0300, Maxim Alzoba wrote:

...
As I understand all IDN tables, which passed 2012 application round are to be allowed to use despite any changes in LGR (it was the part of the discussion when LGRs were established).

Correct. But some people are updating in line with LGR tools already -- particularly when their communities are sensitive to the issues of general-purpose domains and need conflicting uses of the same script. This applies (for instance) to Han, certainly Latin and Arabic, and probably Cyrillic (and maybe even Latin, Cyrillic, and Greek, though I know of nobody who's been that careful yet).

The previous "variants" approach derived from the JET work made the distinction between "blocked" and "allocatable" less plain than it is now, and the inter-writing-system effects of characters is also now plainer and so easier to represent. The JET approach worked quite well for CJK when used in relative isolation, but has limitations when applied more generally, which is why the new approach was worked out.

One interesting development for Chinese is a clever bit of tweaking of the algorithm that defines the set of "allocatable" variants.

The original JET approach was intended to lead to at most three possible labels: one all-simplified label, one all-traditional label, plus one mixed label (as applied for).

Because some code points have more than one simplified or more than one traditional variant, a simple-minded scheme would allow a combinatorial explosion of allocatable labels in some cases.

The new algorithm is able to limit the number of allocatable labels in these cases to four; fewer in the general case.

This would be a big win, as keeping the number of allocatable variants small has benefits, especially as the number of allocatable FQDNs is the permutation of all allocatable labels on each level.

Embedding the reduction into the algorithm has the advantage of making the set of allocatable labels predictable (by evaluating the label against the LGR). The LGR would fully conform to RFC 7940.

The number of blocked variants is still defined by the permutation of all variants that aren't allocatable. For some labels, the numbers can be formidable, but fortunately, there is no need to enumerate them, even for collision testing.

However, even the largest set of blocked variants pales compared to the immense size of the namespace: (20,000 code points) to the power of (maximal number of code points in a U-label).

I believe the Chinese Generation Panel is planning a presentation of the scheme at ICANN61.

A./
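The arithmetic behind the "combinatorial explosion" remark can be made concrete. The sketch below only counts labels; it does not reproduce the Chinese Generation Panel's algorithm, which is not described in this thread. The function names and the sample numbers are assumptions chosen for illustration; the cap of four allocatable labels and the 20,000-code-point repertoire are the figures quoted in the message.

```python
from math import prod

def variant_label_count(variants_per_position):
    """Labels obtainable by picking one variant at each position.

    Each entry counts the choices at that position, including the
    original code point (so 1 means "no variants").
    """
    return prod(variants_per_position)

def blocked_count(variants_per_position, allocatable_cap=4):
    """Variant labels left blocked once at most `allocatable_cap` are allocatable."""
    total = variant_label_count(variants_per_position)
    return total - min(total, allocatable_cap)

def fqdn_count(allocatable_per_level):
    """Allocatable FQDNs when each level contributes its own allocatable labels."""
    return prod(allocatable_per_level)

def namespace_size(repertoire_size, label_length):
    """Number of possible labels of exactly `label_length` code points."""
    return repertoire_size ** label_length
```

As a worked example under these assumptions: a six-character label in which every position has three choices yields 3^6 = 729 variant labels, of which 725 stay blocked with a cap of four. Three levels with four allocatable labels each give 64 allocatable FQDNs, and even a label length of four over 20,000 code points gives 1.6 × 10^17 possible labels, which dwarfs any blocked set.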

Elaine Pruis

9:06 p.m.

It must be over a year since I last looked at the IANA IDN tables and they have grown massively in that time ➜ iana.org/domains/idn-tables

Actually, all new TLD applicants were required to submit their language tables during the application window. Part of the pre-delegation test is to test the registry function against the rulesets in the tables. The registries have a contractual obligation to ICANN to publish the tables; the publication itself is a task managed by IANA. IANA has finally caught up with the publication of these tables, and that is why there appears to have been massive growth in a short period of time.

As a member of the CSC (tasked with oversight of the IANA function), I can add that we are now monitoring the "speed" at which submitted tables get approved and then published on this site.

Andre Schappo

12:36 p.m.

On 25 Feb 2018, at 23:39, Andrew Sullivan <ajs@anvilwalrusden.com> wrote:

Those of you who are not familiar with the VIP and root LGR projects might want to read that material, which I think is linked (it used to be anyway) from the UA pages.

VIP = Variant IDNs Project?

André Schappo

Andrew Sullivan

2:06 p.m.

On Tue, Feb 27, 2018 at 12:36:05PM +0000, Andre Schappo wrote:

...

VIP = Variant IDNs Project?

Variant Issues Project, but yes. It was the foundation for the current LGR stuff. I am not ambitious enough to go searching the ICANN site for the link :-)

a

-- Andrew Sullivan ajs@anvilwalrusden.com

Sarmad Hussain

5:01 p.m.

Dear all,

You can find information regarding IDN work at ICANN at www.icann.org/idn. Scroll down to the end of that page to find the "Resources" link, which lists links to all the work leading to the Root Zone Label Generation Ruleset (or you can go directly to https://www.icann.org/resources/pages/root-zone-lgr-documentation-2017-12-15...). You will find all the relevant documents there, from the more recent to the earlier ones.

Regards
Sarmad
