Re: [UA-discuss] SAC095 - SSAC Advisory on the Use of Emoji in Domain

May 29, 2017

Hi,

On Mon, May 29, 2017 at 04:12:35PM +0000, Stuart Stuple wrote:
...
I don't mean to be arguing. I'm simply sharing my perspective as to why these guidelines are unworkable given typical customer goals and customer understanding of text.
I don't mean to be arguing either, in the sense of being disagreeable,
but I think that those of us interested in this need to work out what
we think the goal of acceptance of domain names is.  Because if the
goal is the same as "rendering of text the same as running text for
readers", then I'm pretty sure we have some deep disagreements.
...
I understand the perspective that emoji are more variable than text as the emotional impact of the variations of emoji.
No, this has nothing to do with the emotional impact of the variations
of emoji.  It has to do with what is considered to be "the same."
...
However, as someone involved in font development, I am confident that the guidelines for Emoji are as well developed as those for character glyphs.
But it's not just _glyphs_ that count here.  If you run U+0061, LATIN
SMALL LETTER A, through NFC, _no matter what the font is_ you get
U+0061.  If you put a grave accent on it, _no matter how you get that
on there_, when you're done with NFC you get the precomposed form,
U+00E0 LATIN SMALL LETTER A WITH GRAVE.
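
As a minimal sketch of that claim, assuming Python's standard unicodedata module (the helper name nfc is just for illustration), both spellings of the accented letter come out of NFC as the same single code point:

```python
import unicodedata

def nfc(s: str) -> str:
    """Return s in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)

plain_a = "\u0061"            # LATIN SMALL LETTER A
decomposed = "\u0061\u0300"   # a followed by COMBINING GRAVE ACCENT
precomposed = "\u00e0"        # LATIN SMALL LETTER A WITH GRAVE

# Plain "a" is unchanged by NFC, whatever font later renders it.
assert nfc(plain_a) == plain_a

# Either way of writing a-grave ends up as the precomposed form.
assert nfc(decomposed) == precomposed
assert nfc(precomposed) == precomposed
```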

But U+1F466 BOY is Emoji_Modifier_Base, so it takes the skin tone
modifiers.  So if you add U+1F3FF to U+1F466, you get a new sequence
(strictly an emoji modifier sequence, not a combining sequence).  But
U+1F466 is already NFC, and normalization does nothing to relate it to
the sequences it forms with the various modifiers, which means that if
someone just reads "BOY" when reading that sequence, information is
lost.  So either we have to train everyone in the world to see
racialized emojis, which at the very least seems like a rather
contentious idea, or else we need to create normalization rules, which
will break Unicode's promises about normalization stability.
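
The contrast with the accented letter can be sketched the same way, again assuming Python's unicodedata module (the constant and helper names are illustrative only). No normalization form touches the BOY-plus-modifier sequence, so nothing at that layer treats it as "the same" as bare BOY:

```python
import unicodedata

BOY = "\U0001F466"        # BOY, an Emoji_Modifier_Base character
TYPE_6 = "\U0001F3FF"     # EMOJI MODIFIER FITZPATRICK TYPE-6
boy_type_6 = BOY + TYPE_6

def unchanged_by_all_forms(s: str) -> bool:
    """True if NFC, NFD, NFKC and NFKD all leave s exactly as it is."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(unicodedata.normalize(f, s) == s for f in forms)

# The sequence stays two code points under every normalization form.
assert unchanged_by_all_forms(boy_type_6)
assert len(boy_type_6) == 2

# So normalization never makes the sequence compare equal to bare BOY.
assert unicodedata.normalize("NFC", boy_type_6) != unicodedata.normalize("NFC", BOY)

# The modifier is not a combining mark: its combining class is 0.
assert unicodedata.combining(TYPE_6) == 0
```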
...
Emojis do indeed have normalization as much as any other Unicode combination.
This is either trivially true (in that all code points have NF*
properties) or it misses the point of the concern.
...
presumes that the same normalization routine is being used. The classic case that we encounter in our software is normalization of casing -- works great, right up until you have plain text that includes Turkish i or I.
That's not "normalization" in Unicode terms, it's case folding or
maybe downcasing.  The _reason_ the IDNA rules are so complicated
around this is precisely because of the kinds of rules you're
highlighting: localized software actually can do more interesting
things with case folding if it knows the locale.  Since domain names
don't carry locale information with them, you're in rather big trouble
here, which is why IDNA2003's approach to this turned out not to work.
So IDNA2008 suggests local case handling, requires stability under
caseFold and NFKC, and also requires strings to be in NFC.
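
A rough sketch of both halves of that point, assuming Python's str.casefold (locale-independent full case folding) and unicodedata. The helpers is_stable and is_nfc are hypothetical names, loosely modeled on the stability idea and not a conformant IDNA2008 implementation:

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

def is_stable(s: str) -> bool:
    """True if s survives NFKC -> case fold -> NFKC unchanged (illustrative check)."""
    return nfkc(nfkc(s).casefold()) == s

def is_nfc(s: str) -> bool:
    return unicodedata.normalize("NFC", s) == s

DOTTED_CAPITAL_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # LATIN SMALL LETTER DOTLESS I

# Without a locale, "I" folds to "i"; Turkish casing would pair it with dotless i.
assert "I".casefold() == "i"
assert "I".casefold() != DOTLESS_SMALL_I

# The dotted capital folds to "i" plus COMBINING DOT ABOVE, not to plain "i".
assert DOTTED_CAPITAL_I.casefold() == "i\u0307"

# Uppercase input is not stable under folding; lowercase input is.
assert not is_stable("Example")
assert is_stable("example")
assert not is_stable(DOTTED_CAPITAL_I)

# The NFC requirement is a separate test from the folding one.
assert is_nfc("\u00e0") and not is_nfc("a\u0300")
```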
...
I agree that the simplification of thinking of a "smiley face" is part of the problem. As I mention, each Unicode emoji has a Unicode character name -- they are not at all ambiguous.
And literally no human who is not familiar with Unicode uses those
character names.  For instance, did you know that U+1F623 is
PERSEVERING FACE and U+1F616 is CONFOUNDED FACE?  How would you know?
I'd say maybe "squinty eyed" and "upset squinty eyed", but I have no
idea -- it'd probably depend on context.  And it is _context_ that is
precisely the problematically missing thing in free-floating network
identifiers, which is what domain names are intended to be.  It is
that ambiguity of meaning that has people using peaches and eggplants
to represent things that are probably not what's for dinner.  This is
perfectly useful for casual communications, and disastrous for
supposedly unique and minimally ambiguous identifiers.  That's what
the SSAC report is about.
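
For anyone who wants to check those two names, a short sketch assuming Python's unicodedata module (whose name table follows the Unicode version bundled with the interpreter; char_name is an illustrative helper):

```python
import unicodedata

def char_name(code_point: int) -> str:
    """Look up the formal Unicode character name for a code point."""
    return unicodedata.name(chr(code_point))

# The formal names are unambiguous, but they are not what a reader sees.
for cp in (0x1F623, 0x1F616):
    print(f"U+{cp:04X} {char_name(cp)}")

assert char_name(0x1F623) == "PERSEVERING FACE"
assert char_name(0x1F616) == "CONFOUNDED FACE"
```

The name is a stable label for the code point; it says nothing about how a given reader will interpret the rendered image.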
...
My point is less that emojis are good (though I think customers clearly expect and want them as identifiers) but rather that the same problems outlined already exist for the encoding of many living languages.
But living languages have not been encoded by Unicode as symbols.
They've been encoded by Unicode as letters.  That's the difference.

People want lots of things.  They might, for instance, want
apostrophes or mixed scripts in domain names, too.  But they're a bad
idea because they break stuff.

Best regards,

A

-- 
Andrew Sullivan
ajs@anvilwalrusden.com