Re: [UA-discuss] FW: I-D Action: draft-klensin-idna-rfc5891bis-00.txt
FWIW (and to coin a phrase), I'd consider "www.TRÛMP.com" more of a "pun" than a "confusable". It's evocative of "Trump" but is visually distinct.
But of course, you can't actually register TRÛMP.com at all, because upper case characters are not allowed in U-labels. So you'd have to register trûmp.com, and any sensible Latin variant strategy would require that the undecorated Latin characters be variants of the decorated Latin characters.
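To make the upper-case point concrete, here is a minimal Python sketch. The helper names `to_a_label` and `from_a_label` are made up for illustration, and the lower-case test is only a rough stand-in for the full IDNA2008 validity rules, not an implementation of them. It uses the standard library's `punycode` codec.

```python
def to_a_label(label: str) -> str:
    """Convert a lower-case label to its ASCII ("xn--") form.

    Assumption: rejecting anything that changes under str.lower() is a
    simplification of the real U-label validity checks.
    """
    if label != label.lower():
        raise ValueError(f"upper-case code points are not allowed: {label!r}")
    if label.isascii():
        return label  # plain ASCII labels need no encoding
    return "xn--" + label.encode("punycode").decode("ascii")


def from_a_label(a_label: str) -> str:
    """Reverse of to_a_label for labels carrying the xn-- prefix."""
    if a_label.startswith("xn--"):
        return a_label[4:].encode("ascii").decode("punycode")
    return a_label


if __name__ == "__main__":
    encoded = to_a_label("tr\u00fbmp")       # the lower-case form is accepted
    print(encoded, from_a_label(encoded))
    try:
        to_a_label("TR\u00dbMP")             # the upper-case form is rejected
    except ValueError as exc:
        print("rejected:", exc)
```

The sketch only shows the lower-case form going through and the upper-case form being refused; a real registry applies many more checks.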
I was only displaying the name in upper case to make it show up better. Of course, it is not just the "decoration" that is of interest. There are quite a few other characters from other scripts that are confusable possibilities. It is an interesting problem. For example, we took one 6-character trademarked business name and ran it through my algorithm, and we came up with over 1 million possible permutations. This is because you can use more than one character look-alike. Lest you think that this doesn't happen, we have already found names registered which use more than one confusable. And, I have only just started my testing. One of the things that I want to do is to see exactly which "variants" or characters are most often used by potential miscreants. I call them "miscreants" because it is difficult for me to believe that someone who registers a variation of "mybank.com" or "apple.com" has something good on their mind.

Nalini
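For readers wondering how a short name reaches such a count: the number of candidates is the product of the number of choices at each position, so six positions with ten choices each already give 10**6. The Python sketch below illustrates that arithmetic only. The `LOOKALIKES` table is hypothetical and tiny, and this is not the algorithm described in the message.

```python
from itertools import product
from math import prod

# Hypothetical look-alike table, for illustration only (not a vetted list).
LOOKALIKES = {
    "a": ["\u0430", "\u00e0", "\u00e1"],  # Cyrillic a, a-grave, a-acute
    "e": ["\u0435", "\u00e8", "\u00e9"],  # Cyrillic ie, e-grave, e-acute
    "p": ["\u0440"],                      # Cyrillic er
    "l": ["1", "\u0142"],                 # digit one, l with stroke
}


def choices(ch):
    """The original character plus any assumed look-alikes."""
    return [ch] + LOOKALIKES.get(ch, [])


def count_variants(label):
    """Number of candidate strings, counting the original label itself."""
    return prod(len(choices(ch)) for ch in label)


def variants(label):
    """Lazily generate every candidate string."""
    for combo in product(*(choices(ch) for ch in label)):
        yield "".join(combo)


if __name__ == "__main__":
    print(count_variants("apple"))  # 4 * 2 * 2 * 3 * 4 = 192
```

Counting with `count_variants` avoids materialising the candidates, which matters once the product reaches the millions.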
On Mon, Mar 13, 2017 at 02:46:42PM +0000, nalini.elkins@insidethestack.com wrote:
Of course, it is not just the "decoration" that is of interest. There are quite a few other characters from other scripts that are confusable possibilities.
In general, cross-script registration is a bad idea. We have known this since at least 2003. The LGR work points out that script-language definitions are generally a good idea for multi-language scripts.
It is an interesting problem. For example, we took one 6-character trademarked business name and ran it through my algorithm, and we came up with over 1 million possible permutations. This is because you can use more than one character look-alike.
Are these all single-script creations? I fear your algorithm is rather more expansive than is reasonable. This is hardly the first time the issue has been studied. The combinatorial explosion problem is a well-known and well-discussed one. The ICANN Variant Issues Project explored an awful lot of this.
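The single-script question can be approximated in a few lines of Python. This is a rough sketch under a stated assumption: the standard `unicodedata` module does not expose the Unicode Script property, so the first word of each character name (for example "LATIN" or "CYRILLIC") stands in for it. The function names are illustrative.

```python
import unicodedata


def scripts_in(label):
    """Approximate the set of scripts used in a label.

    Assumption: the first word of the Unicode character name is used as a
    proxy for the Script property, which is only a heuristic.
    """
    found = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphen are shared across scripts
        found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


def is_single_script(label):
    """True when the heuristic finds at most one script in the label."""
    return len(scripts_in(label)) <= 1


if __name__ == "__main__":
    print(scripts_in("tr\u00fbmp"))   # decorated Latin is still Latin
    print(scripts_in("\u0430pple"))   # Cyrillic a followed by Latin letters
```

Filtering generated candidates through a check like `is_single_script` is one way to see how much of a large count comes from cross-script mixing.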
Lest you think that this doesn't happen, we have already found names registered which use more than one confusable.
Of course it happens. That's actually what all proposals for restrictions are about.
I call them "miscreants" because it is difficult for me to believe that someone who registers a variation of "mybank.com" or "apple.com" has something good on their mind.
It doesn't help us think clearly about the issues to start doing psychological analysis and intention-attribution of the people doing these things. _Regardless_ of the intention, it's an attack vector, and I think that is part of what we need to take into consideration.

Best regards,

A

--
Andrew Sullivan
ajs@anvilwalrusden.com
On 3/13/2017 7:46 AM, nalini.elkins@insidethestack.com wrote:
It is an interesting problem. For example, we took one 6-character trademarked business name and ran it through my algorithm, and we came up with over 1 million possible permutations. This is because you can use more than one character look-alike.
Lest you think that this doesn't happen, we have already found names registered which use more than one confusable. And, I have only just started my testing.
I keep coming back to the concept of "perceptual distance".

When you look at individual code points and call them "confusable", you assert that each such pair has a perceptual distance small enough to fit below a certain threshold. But that threshold is not zero, nor is the actual perceptual distance between most code points that are considered confusable.

There are two interesting issues with this. One is that you may have two pairs of confusables that have one common member, but the other two members are far enough apart in perceptual distance to no longer meet your threshold of confusability. The other is that code points by themselves are really not relevant for this, because the real metric should be the perceptual distance between labels (or even FQDNs).

Just based on labels: if you simultaneously substitute more than one code point in a label with a potential confusable, the result may be that the label is now further apart in perceptual space from the original label than if you had only substituted one code point at a time. The reason is that people read words, and having a single code point altered may not interfere with the process of reading that word, but once you change two or more, the situation is different.

As a result, I would tentatively conclude that your claim that those 1 million permutations are all equally confusable with the original label is likely specious. It is reasonable to suspect that a good portion of those labels would look distinctly "odd", if your substitution is based on ordinary single-code-point confusability thresholds.

That said, there are some labels in certain scripts for which the variant code points are true equivalents (whether visual, phonetic or semantic). In those cases, making multiple substitutions can result in rather large multiplicities of fully equivalent labels.
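A toy model may help make the label-level argument concrete. The Python sketch below assumes that perceptual distance between two labels is the sum of per-position distances. That additive model, the integer scale, the `PAIR_DISTANCE` values and the `THRESHOLD` are all invented for illustration; nothing here is a measured quantity.

```python
# Hypothetical per-substitution distances on an arbitrary integer scale.
PAIR_DISTANCE = {
    ("a", "\u0430"): 0,  # Latin a vs Cyrillic a: assumed to render identically
    ("l", "1"): 3,       # l vs digit one
    ("o", "0"): 3,       # o vs digit zero
}
THRESHOLD = 4            # assumed cut-off for calling two labels confusable
UNRELATED = 10           # assumed distance for pairs missing from the table


def pair_distance(x, y):
    """Distance between two code points, looked up in either order."""
    if x == y:
        return 0
    return PAIR_DISTANCE.get((x, y), PAIR_DISTANCE.get((y, x), UNRELATED))


def label_distance(a, b):
    """Sum of per-position distances (the additive model is an assumption)."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares labels of equal length")
    return sum(pair_distance(x, y) for x, y in zip(a, b))


def confusable(a, b):
    """True when the label-level distance is within the threshold."""
    return label_distance(a, b) <= THRESHOLD


if __name__ == "__main__":
    print(confusable("blog", "b1og"))  # one substitution
    print(confusable("blog", "b10g"))  # two substitutions at once
    print(confusable("papa", "p\u0430p\u0430"))  # zero-distance substitutions
```

In this model each single substitution stays within the threshold while the double substitution exceeds it, whereas zero-distance pairs can be substituted any number of times without leaving the threshold.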
(Note that if one starts with 0 distance, or almost 0 distance, in perceptual space, then even multiple substitutions can be expected to result in negligible perceptual distance between variant labels, but that is not usually the case for the kinds of instances considered under "confusable".)

Finally, in considering labels, you'll pick up the 'rn' vs. 'm' issue: that is, confusables are not 1:1 in code point space; they may be 1:n or even n:m. (The same is true for variants that represent true equivalence.)

A./

PS: a test of whether variants are true equivalences is whether they satisfy not only symmetry but also transitivity. Anything with a measurable perceptual distance is likely not transitive; just think of two labels that are (barely) not confusable and now imagine an "average shape" label. The latter would be confusable with both, and therefore the three together would not obey the transitivity constraint.
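The transitivity test in the PS can be sketched in Python. The three labels are placed on a hypothetical one-dimensional "perceptual axis"; the positions and the threshold are invented numbers chosen so that the two outer labels are just beyond the threshold while the "average shape" sits between them.

```python
from itertools import permutations

# Hypothetical positions on a 1-D perceptual axis (illustrative values only).
POSITION = {"label_a": 0, "average_shape": 3, "label_b": 6}
THRESHOLD = 4  # assumed confusability cut-off


def confusable(x, y):
    """Two labels are confusable when their distance is within the threshold."""
    return abs(POSITION[x] - POSITION[y]) <= THRESHOLD


def is_symmetric(items, rel):
    """rel(x, y) must agree with rel(y, x) for every pair."""
    return all(rel(x, y) == rel(y, x) for x, y in permutations(items, 2))


def is_transitive(items, rel):
    """Whenever rel(x, y) and rel(y, z) hold, rel(x, z) must hold too."""
    return all(
        rel(x, z)
        for x, y, z in permutations(items, 3)
        if rel(x, y) and rel(y, z)
    )


if __name__ == "__main__":
    labels = list(POSITION)
    print(is_symmetric(labels, confusable))   # symmetric
    print(is_transitive(labels, confusable))  # not transitive
```

A threshold relation of this kind is symmetric but fails transitivity, so it does not partition labels into equivalence classes the way true variants do.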
participants (3)
- Andrew Sullivan
- Asmus Freytag
- nalini.elkins@insidethestack.com