On 01.07.2008 10:53, Julian Cable wrote:
So I have a practical preference for the 7-bit subset of UTF-8 with no BOM (of course I would never dream of calling this ASCII ;)
Well, the 7-bit subset of UTF-8 with no BOM *is* ASCII, so we might as well call it ASCII. ;-) Pure 7-bit ASCII would of course be the most portable encoding, but in 2008 we should no longer have to deny non-English [1] speakers and countries the correct spelling of their names and places.
If we go for UTF-8 can we be very firm about whether a BOM is required or prohibited, and please make sure it's not permitted.
Yes, definitely. One of the biggest advantages of UTF-8 is that programs which do not support UTF-8 can usually still process UTF-8-encoded files. There are no embedded zero bytes, and the bytes of a multi-byte character are never equal to 7-bit ASCII characters. If a tzdata file suddenly started with hex EF BB BF, the parser would try to interpret these bytes as the start of a rule, and fail.

I understand the tendency to use an encoding mark for Unicode files in the Microsoft world, and it is very useful for UTF-16 and UTF-32, but (1) UTF-8 has only one byte order, and (2) adding it would cause more problems than it is worth. I assume that Windows editors which support UTF-8 can also be manually switched to UTF-8 without the need for a BOM.

Best regards
Martin Jerabek

[1] Yes, there are a few languages other than English whose script only needs 7-bit ASCII.
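
P.S. To make the points about UTF-8 concrete, here is a minimal Python sketch. The sample line is invented for illustration and is not real tzdata, and first_field() is only a hypothetical stand-in for a byte-oriented parser, not the actual zic code. It checks that the encoded text contains no zero bytes, that every byte of a multi-byte character is 0x80 or above, and that a leading EF BB BF ends up glued to the first keyword.

```python
BOM = b"\xef\xbb\xbf"  # UTF-8 encoding of U+FEFF


def multibyte_bytes_are_non_ascii(text: str) -> bool:
    """Return True if every byte of every multi-byte character is >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


def first_field(line: bytes) -> bytes:
    """Hypothetical parser step: first field, split on ASCII whitespace."""
    return line.split()[0]


# Invented sample line (an assumption, not taken from any tzdata file).
sample = "Rule Zürich\n"
data = sample.encode("utf-8")

no_zero_bytes = b"\x00" not in data
keyword_without_bom = first_field(data)
keyword_with_bom = first_field(BOM + data)

print(no_zero_bytes, multibyte_bytes_are_non_ascii(sample))
print(keyword_without_bom, keyword_with_bom)
```

Without the BOM the byte-oriented split still finds the keyword, even though the line contains a non-ASCII character. With the BOM, the three extra bytes become part of the first field, so a parser comparing it against "Rule" would not match.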