Re: [UA-discuss] Latin+Cyrillic — .com .?? .??

April 27, 2017

      On 4/27/2017 11:22 AM, Jothan Frakes wrote:
...
On Thu, Apr 27, 2017 at 7:37 AM, Andre Schappo <A.Schappo@lboro.ac.uk> wrote:
Some thoughts, having now caught up with all the UA emails on
phishing.
①  Over the years there has been much discussion about Cyrillic
being used to masquerade as ASCII domain names. I wonder if the
Russian-speaking community have been having similar discussions
with respect to ASCII being used to masquerade as Cyrillic domain
names.
Quick comments here (mostly for a wider reading audience):
1] Greek needs to be included alongside Cyrillic and Latin (or "ASCII",
as we call it in this discussion): all three scripts are rich in
homographs, with many visually identical or near-identical letters.
Also arguably Armenian: the font used in the Unicode charts is not
representative, and many Armenian font styles look more like Times or
Helvetica, which means there are shapes like "հ ս ո օ" that can pass
for Latin letters.
...
2] I used to believe that there was a bright line between all-Cyrillic
and all-Latin/ASCII labels. Many wise people, like Yuri and Dusan, spent
time evolving my thinking, and I learned that the two may be intermixed
in perfectly normal use, and that this also varies by region. We should
not assume all of one or all of the other.
The limitation in thinking is that the "go-to" solution is to try to ban
some code points, or to ban them in certain contexts. This leads to the
call for single-script labels (which, as we know, reduces but does not
remove the homograph attack surface).

A more robust method is to make homoglyphs mutually exclusive in the 
registry. If a registered label has one code point at a certain 
position, the same label with the homoglyph substituted at the same 
position would be blocked. ("Blocked variant")
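
A minimal sketch of the blocked-variant idea, in Python. Everything here
is an assumption made for illustration: the five Latin/Cyrillic pairs,
the `skeleton` helper, and the `Registry` class are hypothetical names,
not part of any registry's actual implementation. Each homoglyph is
mapped to one representative, and two labels are treated as mutually
exclusive when they map to the same string.

```python
# Hypothetical homoglyph pairs (Latin -> Cyrillic look-alike).
# A real table would come from a reviewed set of Label Generation Rules.
HOMOGLYPHS = {
    "a": "\u0430",  # Latin a / Cyrillic a
    "e": "\u0435",  # Latin e / Cyrillic ie
    "o": "\u043e",  # Latin o / Cyrillic o
    "p": "\u0440",  # Latin p / Cyrillic er
    "c": "\u0441",  # Latin c / Cyrillic es
}

# Map every member of a pair to one representative (here, the Latin one).
CANON = {}
for latin, cyrillic in HOMOGLYPHS.items():
    CANON[latin] = latin
    CANON[cyrillic] = latin


def skeleton(label):
    """Replace each code point by the representative of its variant set."""
    return "".join(CANON.get(ch, ch) for ch in label)


class Registry:
    """Toy registry in which homoglyph variants of a label are blocked."""

    def __init__(self):
        self._by_skeleton = {}  # skeleton -> label actually registered

    def register(self, label):
        key = skeleton(label)
        holder = self._by_skeleton.get(key)
        if holder is None:
            self._by_skeleton[key] = label
            return "registered"
        # Same code points: ordinary collision. Otherwise: blocked variant.
        return "taken" if holder == label else "blocked"


if __name__ == "__main__":
    reg = Registry()
    for candidate in ["cap", "\u0441\u0430\u0440", "c\u0430p", "cat"]:
        print(ascii(candidate), reg.register(candidate))
```

Note that the mixed-script label `c\u0430p` is handled exactly like the
all-Cyrillic one: what matters is the substitution at each position, not
which script the label as a whole belongs to.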

The technology to specify this used to exist in two slightly different 
forms; once for Arabic and once for CJK. These were defined in separate 
RFCs, with mutually incompatible plain-text formats.

With RFC 7940 there is, for the first time, a universal XML schema to 
specify these kinds of relations. This should make it easy to generate 
shared libraries and toolsets that can read/process these definitions. 
As a result, blocked variants are a technique that should become a 
standard methodology for registries.

If you have blocked variants defined, then you can mix not just Cyrillic 
and Latin labels more safely, but also mix Latin and Cyrillic inside a 
single label without opening yourself up to homograph attacks.

RFC 7940 is occasionally misunderstood as a prescription for how to
design Label Generation Rules (aka IDN tables). It is not; it is instead
a description of a universal data format (in XML) that can represent
pretty much anything needed for registration policies (at the code point
level): for example, you can define which code points to allow, which
other code points they may appear next to, and which variants to block.
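
To give a feel for the format, here is a small Python sketch that reads
blocked variants out of an LGR-style XML fragment. The fragment is a
simplified illustration written for this example (no metadata, no rules
section, three code points), not an excerpt from any published table,
and `load_blocked_variants` is a hypothetical helper name. It uses the
`char` and `var` elements with `cp` and `type` attributes and the
`urn:ietf:params:xml:ns:lgr-1.0` namespace.

```python
import xml.etree.ElementTree as ET

LGR_NS = {"lgr": "urn:ietf:params:xml:ns:lgr-1.0"}

# Illustrative fragment only: Latin a and Cyrillic a block each other,
# Latin b has no variants.
SAMPLE = """\
<lgr xmlns="urn:ietf:params:xml:ns:lgr-1.0">
  <data>
    <char cp="0061">
      <var cp="0430" type="blocked"/>
    </char>
    <char cp="0430">
      <var cp="0061" type="blocked"/>
    </char>
    <char cp="0062"/>
  </data>
</lgr>
"""


def cps_to_str(cp_attr):
    """Turn a 'cp' attribute (space-separated hex values) into a string."""
    return "".join(chr(int(cp, 16)) for cp in cp_attr.split())


def load_blocked_variants(xml_text):
    """Return (repertoire, blocked) from an LGR-style document.

    repertoire: set of allowed code points (as strings)
    blocked:    dict mapping a code point to its set of blocked variants
    """
    root = ET.fromstring(xml_text)
    repertoire = set()
    blocked = {}
    for char in root.findall("lgr:data/lgr:char", LGR_NS):
        source = cps_to_str(char.get("cp"))
        repertoire.add(source)
        for var in char.findall("lgr:var", LGR_NS):
            if var.get("type") == "blocked":
                target = cps_to_str(var.get("cp"))
                blocked.setdefault(source, set()).add(target)
    return repertoire, blocked


if __name__ == "__main__":
    repertoire, blocked = load_blocked_variants(SAMPLE)
    print(sorted(ascii(cp) for cp in repertoire))
    print({ascii(k): sorted(ascii(v) for v in vs) for k, vs in blocked.items()})
```

A real tool would also have to handle ranges, context rules, and the
other variant types; this only shows that the blocked-variant relation
is plain data that a few lines of code can read.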

It could use a bit of advertising. Perhaps it could be mentioned in 
comments to the IDN guidelines? (As a co-author, I'm not eligible to 
make such comments myself). Not least because it unifies the description 
of blocked variants, it does have a clear place in the infrastructure 
needed to support universal acceptance.

A./

PS: For the root zone, we are planning to stick to single-script labels, 
but also to implement blocked variants across scripts. Some of the data 
in my cross-script variants collection comes from the relevant drafts 
for that project, other data is derived from Unicode's UTR #39, and 
some is based on my own knowledge of certain scripts.

PPS: I'm attaching an update of my cross-script variants listing. The 
data for that exists in an XML file according to RFC 7940; the HTML 
summary of that data is created by a simple tool. I would appreciate 
comments on the contents and description from anyone.

PPPS: You may have noticed that I'm not writing anything about 
allocatable variants. Their effect on the DNS is very different: they 
may be needed or useful in some contexts, but the motivation is not 
security. RFC 7940 allows you to define them where needed, including 
with the same semantics as in the existing RFCs if desired.

Asmus Freytag