In the southamerica data file we have several Latin-1 encoded non-ASCII characters. The other data files are 100% ASCII.

Here they are:

$ grep -nP "[\\x80-\\xff]" *
southamerica:384:# There's also a note in only one of the major national papers (La Nación) at
southamerica:390:# (...) anunció que el próximo domingo a las 00:00 los puntanos deberán
southamerica:393:# A partir de entonces, San Luis establecerá el huso horario propio de
southamerica:395:# 2009, el cambio horario quedará comprendido entre las 00:00 del tercer
southamerica:396:# domingo de marzo y las 24:00 del segundo sábado de octubre.
southamerica:815:# I just send a e-mail to Zulmira Brandão at

I think it's fine to allow non-ASCII in comments, but would strongly request that the files be UTF-8 encoded. Anything else leads to immense confusion over what charset is in use.

-- Andy
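For illustration, re-encoding Latin-1 bytes as UTF-8, as requested above, might look like the following minimal Python sketch. The function name `latin1_to_utf8` and the sample bytes are hypothetical, and the sketch assumes the input really is Latin-1; it is not taken from any patch in this thread.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Re-encode bytes that are assumed to be Latin-1 as UTF-8."""
    # Every byte value is a valid Latin-1 code point, so decode() cannot fail;
    # the result is only meaningful if the input really was Latin-1.
    return raw.decode("latin-1").encode("utf-8")


# Hypothetical sample: 0xF3 is the Latin-1 byte for o with acute accent.
sample = b"# (La Naci\xf3n)"
converted = latin1_to_utf8(sample)
print(converted.decode("utf-8"))
```

Pure ASCII input passes through unchanged, since ASCII is a subset of both encodings.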
Thanks for catching these problems. Two proposed patches are attached. The first fixes the problems by going back to ASCII. The second puts in a check for this problem, so that non-ASCII bytes like that don't slip into future releases. At some point we may well want to add non-ASCII characters, but they should be UTF-8 I expect. I've pushed these proposed patches into the unofficial experimental repository at github.
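As a rough sketch of what a check for stray non-ASCII bytes could look like (this is not the attached patch, whose contents are not shown here), the Python below mirrors the grep above. The function name `find_non_ascii_lines` and the sample data are hypothetical.

```python
def find_non_ascii_lines(data: bytes):
    """Return (line_number, line) pairs for lines containing a byte >= 0x80."""
    hits = []
    for lineno, line in enumerate(data.splitlines(), start=1):
        # Iterating over a bytes object yields integers in the range 0..255.
        if any(byte >= 0x80 for byte in line):
            hits.append((lineno, line))
    return hits


# Hypothetical sample: the second line carries one Latin-1 byte (0xF3).
sample = b"# plain comment\n# (La Naci\xf3n)\n# another plain line\n"
for lineno, line in find_non_ascii_lines(sample):
    print(f"{lineno}: {line!r}")
```

Note that this flags well-formed UTF-8 as well, because UTF-8 multibyte sequences also use bytes of 0x80 and above. It therefore fits an ASCII-only policy; a move to UTF-8 would need a validity check instead.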
On Thu, Dec 13, 2012 at 10:23 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
> Thanks for catching these problems. Two proposed patches are attached. The first fixes the problems by going back to ASCII. The second puts in a check for this problem, so that non-ASCII bytes like that don't slip into future releases. At some point we may well want to add non-ASCII characters, but they should be UTF-8 I expect. I've pushed these proposed patches into the unofficial experimental repository at github.
FWIW, as a Spanish speaker (and possibly the author of the mail messages in Spanish which got converted into Latin-1 encoded comments) I vote for UTF-8 encoding rather than mutilation.

-- Mariano Absatz - El Baby www.clueless.com.ar
Agreed, UTF-8 would be preferable to ASCII.

Deborah Goldsmith
Apple Inc.

On Dec 27, 2012, at 10:28 AM, Mariano Absatz - el Baby <baby@baby.com.ar> wrote:
> On Thu, Dec 13, 2012 at 10:23 PM, Paul Eggert <eggert@cs.ucla.edu> wrote:
> > Thanks for catching these problems. Two proposed patches are attached. The first fixes the problems by going back to ASCII. The second puts in a check for this problem, so that non-ASCII bytes like that don't slip into future releases. At some point we may well want to add non-ASCII characters, but they should be UTF-8 I expect. I've pushed these proposed patches into the unofficial experimental repository at github.
>
> FWIW, as a Spanish speaker (and possibly the author of the mail messages in Spanish which got converted into Latin-1 encoded comments) I vote for UTF-8 encoding rather than mutilation.
>
> -- Mariano Absatz - El Baby www.clueless.com.ar
Participants (4): Andy Heninger, Deborah Goldsmith, Mariano Absatz - el Baby, Paul Eggert