Hi,

On Mon, May 29, 2017 at 04:12:35PM +0000, Stuart Stuple wrote:
> I don't mean to be arguing. I'm simply sharing my perspective as to why these guidelines are unworkable given typical customer goals and customer understanding of text.
I don't mean to be arguing either, in the sense of being disagreeable, but I think that those of us interested in this need to work out what we think the goal of acceptance of domain names is. Because if the goal is the same as "rendering of text the same as running text for readers", then I'm pretty sure we have some deep disagreements.
> I understand the perspective that emoji are more variable than text as the emotional impact of the variations of emoji.
No, this has nothing to do with the emotional impact of the variations of emoji. It has to do with what is considered to be "the same."
> However, as someone involved in font development, I am confident that the guidelines for Emoji are as well developed as those for character glyphs.
But it's not just _glyphs_ that count here. If you run U+0061, LATIN SMALL LETTER A, through NFC, _no matter what the font is_ you get U+0061. If you put a grave accent on it, _no matter how you get that on there_, when you're done with NFC you get the precomposed form, U+00E0 LATIN SMALL LETTER A WITH GRAVE. But U+1F466 BOY is Emoji_Modifier_Base, so it takes the skin tone modifiers. So if you add U+1F3FF to U+1F466, you get a new combining sequence. But U+1F466 is NFC, so it doesn't normalize with other modifiers of Emoji_Modifier_Base characters, which means that if someone just reads "BOY" when reading that sequence, information is lost. So either we have to train everyone in the world to see racialized emojis, which at the very least seems like a rather contentious idea, or else we need to create normalization rules, which will break Unicode's promises about normalization stability.
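For anyone who wants to check the NFC behaviour described above, here is a minimal sketch using Python's standard unicodedata module. The variable names are only illustrative, and the results depend on the Unicode version shipped with the interpreter, though these particular code points have been stable for a long time.

```python
import unicodedata


def nfc(s: str) -> str:
    """Return the NFC form of s."""
    return unicodedata.normalize("NFC", s)


# Letters: different input sequences converge on one normalized form.
a_plain = "\u0061"            # LATIN SMALL LETTER A
a_combining = "\u0061\u0300"  # a + COMBINING GRAVE ACCENT
a_precomposed = "\u00e0"      # LATIN SMALL LETTER A WITH GRAVE

# Emoji: base plus skin tone modifier stays a two-code-point sequence.
boy = "\U0001F466"                   # BOY
boy_dark = "\U0001F466\U0001F3FF"    # BOY + EMOJI MODIFIER FITZPATRICK TYPE-6

if __name__ == "__main__":
    # ascii() keeps the output readable on terminals without emoji support.
    print(ascii(nfc(a_plain)))
    print(ascii(nfc(a_combining)))
    print(ascii(nfc(boy)))
    print(ascii(nfc(boy_dark)))
    print(nfc(boy) == nfc(boy_dark))
```

NFC folds the two spellings of "a with grave" into one, but it does nothing to relate the modified BOY to the plain one, so a comparison after normalization treats them as distinct strings.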
> Emojis do indeed have normalization as much as any other Unicode combination.
This is either trivially true (in that all code points have NF* properties) or it misses the point of the concern.
> presumes that the same normalization routine is being used. The classic case that we encounter in our software is normalization of casing -- works great, right up until you have plain text that includes Turkish i or I.
That's not "normalization" in Unicode terms, it's case folding or maybe downcasing. The _reason_ the IDNA rules are so complicated around this is precisely because of the kinds of rules you're highlighting: localized software actually can do more interesting things with case folding if you know the locale. Since domain names don't have locale information with them, you're in rather big trouble here, which is why IDNA2003's approach to this turned out not to work. So in IDNA2008 the specification suggests local case handling and requires stability under caseFold and NFKC and also requires strings to be in NFC.
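To make the distinction between normalization and case folding concrete, here is a small Python sketch. It uses only str.casefold() and unicodedata.normalize(), both locale-independent, which is exactly the situation a domain name is in. The is_stable() helper is a hypothetical name and only loosely modelled on the "Unstable" category in RFC 5892; it is not a full IDNA2008 validity check.

```python
import unicodedata


def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)


def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)


DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I

# Normalization leaves case alone; only case folding changes it.
normalized_capital_i = nfc("I")

# Locale-independent folding: "I" becomes dotted "i", which is not what a
# Turkish reader expects (they would expect the dotless form).
folded_ascii_i = "I".casefold()
folded_dotted_i = DOTTED_CAPITAL_I.casefold()
folded_dotless_i = DOTLESS_SMALL_I.casefold()


def is_stable(label: str) -> bool:
    """Hypothetical helper: True if label survives NFKC + case folding + NFKC unchanged."""
    return label == nfkc(nfkc(label).casefold())


if __name__ == "__main__":
    print(ascii(normalized_capital_i))
    print(ascii(folded_ascii_i))
    print(ascii(folded_dotted_i))
    print(ascii(folded_dotless_i))
    print(is_stable("example"), is_stable("Example"))
```

Without a locale there is no way for the folding step to know that a Turkish user meant the dotless form, which is the trouble described above.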
> I agree that the simplification of thinking of a "smiley face" is part of the problem. As I mention, each Unicode emoji has a Unicode character name -- they are not at all ambiguous.
And literally no human who is not familiar with Unicode uses those character names. For instance, did you know that U+1F623 is PERSEVERING FACE and U+1F616 is CONFOUNDED FACE? How would you know? I'd say maybe "squinty eyed" and "upset squinty eyed", but I have no idea -- it'd probably depend on context. And it is _context_ that is precisely the problematically missing thing in free-floating network identifiers, which is what domain names are intended to be. It is that ambiguity of meaning that has people using peaches and eggplants to represent things that are probably not what's for dinner. This is perfectly useful for casual communications, and disastrous for supposedly unique and minimally ambiguous identifiers. That's what the SSAC report is about.
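The formal names are easy to look up mechanically, which is rather different from anyone knowing them. A minimal sketch with Python's unicodedata module (the dictionary name is only illustrative):

```python
import unicodedata

# Code points mentioned above, plus the two food emoji.
code_points = [0x1F623, 0x1F616, 0x1F351, 0x1F346]

# Map each code point to its formal Unicode character name.
formal_names = {cp: unicodedata.name(chr(cp)) for cp in code_points}

if __name__ == "__main__":
    for cp, name in formal_names.items():
        print(f"U+{cp:04X} {name}")
```

Note that the formal name of U+1F346 is AUBERGINE, not "eggplant": even the unambiguous name is not the word many people would reach for.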
> My point is less that emojis are good (though I think customers clearly expect and want them as identifiers) but rather that the same problems outlined already exist for the encoding of many living languages.
But living languages have not been encoded by Unicode as symbols. They've been encoded by Unicode as letters. That's the difference.

People want lots of things. They might, for instance, want apostrophes or mixed scripts in domain names, too. But they're a bad idea because they break stuff.

Best regards,

A

-- 
Andrew Sullivan
ajs@anvilwalrusden.com