[PATCH] More spelling and accent fixes.

In reply to: [tz] [PATCH] More spelling and accent fixes.
Paul Eggert eggert at CS.UCLA.EDU
Fri Jun 27 23:06:47 UTC 2014
A small problem regarding the transition to UTF-8. Let me refer to a copy of some text from the diff, sent through the mailing list:
@@ -383,7 +383,7 @@ Zone Asia/Chongqing 7:06:20 - LMT 1928 # or Chungking
 # Wusu, Qiemo, Xinyan, Wulanwusu, Jinghe, Yumin, Tacheng, Tuoli, Emin,
 # Shihezi, Changji, Yanqi, Heshuo, Tuokexun, Tulufan, Shanshan, Hami,
 # Fukang, Kuitun, Kumukuli, Miquan, Qitai, and Turfan.
-Zone Asia/Urumqi 5:50:20 - LMT 1928 # Ürümqi or Urumchi
+Zone Asia/Urumqi 5:50:20 - LMT 1928 # Ürümqi or Ürümchi
The correct spelling of the Asian city at hand is Ürümchi, with two u-diaereses. The GitHub asia file has the correct spelling and the correct encoding. The diff file however gets mangled by some publishing process. Hence, the diff published through the mailing list is rather useless.
I have a lot of text encoding conversion filters, but none is able to decode the mangled "Ürümqi". It looks to me like a doubly UTF-8 encoded piece of text. Perhaps something to look into?
Oscar van Vlijmen

On 2014-06-30 11:40, vanadovv@hetnet.nl wrote:
In reply to: [tz] [PATCH] More spelling and accent fixes.
Paul Eggert eggert at CS.UCLA.EDU
Fri Jun 27 23:06:47 UTC 2014
A small problem regarding the transition to UTF-8. Let me refer to a copy of some text from the diff, sent through the mailing list:
@@ -383,7 +383,7 @@ Zone Asia/Chongqing 7:06:20 - LMT 1928 # or Chungking
 # Wusu, Qiemo, Xinyan, Wulanwusu, Jinghe, Yumin, Tacheng, Tuoli, Emin,
 # Shihezi, Changji, Yanqi, Heshuo, Tuokexun, Tulufan, Shanshan, Hami,
 # Fukang, Kuitun, Kumukuli, Miquan, Qitai, and Turfan.
-Zone Asia/Urumqi 5:50:20 - LMT 1928 # Ürümqi or Urumchi
+Zone Asia/Urumqi 5:50:20 - LMT 1928 # Ürümqi or Ürümchi
The correct spelling of the Asian city at hand is Ürümchi, with two u-diaereses. The GitHub asia file has the correct spelling and the correct encoding. The diff file however gets mangled by some publishing process. Hence, the diff published through the mailing list is rather useless.
I have a lot of text encoding conversion filters, but none is able to decode the mangled "Ürümqi". It looks to me like a doubly UTF-8 encoded piece of text. Perhaps something to look into?
Oscar van Vlijmen
I think the lack of MIME headers in Paul's email is the reason why. When I tried emailing the same patch to myself using git send-email, I got prompted to specify an 8-bit encoding as follows:
The following files are 8bit, but do not declare a Content-Transfer-Encoding.
    /tmp/8jzhF_MOTQ/0001-More-spelling-and-accent-fixes.patch
Which 8bit encoding should I declare [UTF-8]?
It can also be specified on the command line of git send-email using the '--8bit-encoding=<encoding>' option, or specified in the git configuration in the 'sendmail.assume8bitEncoding' option.
--
-=( Ian Abbott @ MEV Ltd. E-mail: <abbotti@mev.co.uk> )=-
-=( Tel: +44 (0)161 477 1898 FAX: +44 (0)161 718 3587 )=-

Ian Abbott wrote:
I think the lack of MIME headers in Paul's email is the reason why.
Yes, for some reason it's not working for me. I had already noticed the problem, and filed a bug report here:
http://article.gmane.org/gmane.comp.version-control.git/252660
a few hours ago, but no response yet. git format-patch never prompts me, and sometimes it labels the patch as UTF-8, sometimes it doesn't. For example:
$ git format-patch 5be5ee3dd453c5b575f6336eada9390fb205717a^!
0001-Mention-more-JavaScript-libraries.patch
$ git format-patch c25e1180cf3ec34d6c731d5ec16739d6d2ca8fc2^!
0001-More-spelling-and-accent-fixes.patch
$ grep UTF-8 0*
0001-Mention-more-JavaScript-libraries.patch: <meta http-equiv="Content-type" content='text/html; charset="UTF-8"'>
The first patch is labeled as UTF-8 even though the patch is entirely ASCII; the second patch is not labeled even though it contains non-ASCII characters!
My .gitconfig is simple:
[user]
    name = Paul Eggert
    email = eggert@cs.ucla.edu
[push]
    default = simple
It seems crazy to me that I would need to specify an obscure option to have 'git format-patch' do the right thing. I run either git 1.9.3 (Fedora 20) or git 1.9.1 (Ubuntu 14.04) and neither version documents sendmail.assume8bitEncoding or --8bit-encoding in its man pages. Perhaps the git folks have been hacking around in this area, and the natural default doesn't work any more? Could you try the above shell commands and see what they output for you? Also, which git version are you running?

Paul Eggert <eggert@cs.ucla.edu> writes:
$ git format-patch 5be5ee3dd453c5b575f6336eada9390fb205717a^!
0001-Mention-more-JavaScript-libraries.patch
$ git format-patch c25e1180cf3ec34d6c731d5ec16739d6d2ca8fc2^!
0001-More-spelling-and-accent-fixes.patch
$ grep UTF-8 0*
0001-Mention-more-JavaScript-libraries.patch: <meta http-equiv="Content-type" content='text/html; charset="UTF-8"'>
The first patch is labeled as UTF-8
No, it isn't.
It seems crazy to me that I would need to specify an obscure option to have 'git format-patch' do the right thing. I run either git 1.9.3 (Fedora 20) or git 1.9.1 (Ubuntu 14.04) and neither version documents sendmail.assume8bitEncoding or --8bit-encoding in its man pages.
See git-send-email(1).
Andreas.
--
Andreas Schwab, schwab@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."

Andreas Schwab wrote:
No, it isn't.
Sorry, my examples were wrong (I was confusing data with headers), and getting the examples right has helped me to understand the problem better. Here are the corrected examples:
$ git format-patch a9da35214bd3a27d84458b8fee19d19c37aa67f0^!
0001-Further-updates-to-commentary-mostly-un-ASCII-fying-.patch
$ git format-patch c25e1180cf3ec34d6c731d5ec16739d6d2ca8fc2^!
0001-More-spelling-and-accent-fixes.patch
$ grep UTF-8 0*
0001-Further-updates-to-commentary-mostly-un-ASCII-fying-.patch:Content-Type: text/plain; charset=UTF-8
The reason the first patch has a proper Content-Type header and the second patch doesn't is that git outputs the header only if the commit log message itself contains non-ASCII UTF-8 text.
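A patch in this state can be spotted before it is mailed by checking for 8-bit bytes that come without a charset declaration. The following is only a rough sketch in Python: the function name is made up for illustration, and the header test is a simplification rather than anything git does internally. Unlike `grep UTF-8`, it looks only at the header block, so a `charset="UTF-8"` string inside the diff itself is not mistaken for a label.

```python
def is_unlabelled_8bit(patch: bytes) -> bool:
    """Return True if the patch has non-ASCII bytes but no Content-Type header."""
    # The header block ends at the first blank line, as in any mail message.
    headers, _, _ = patch.partition(b"\n\n")
    has_content_type = any(
        line.lower().startswith(b"content-type:")
        for line in headers.splitlines()
    )
    # Any byte above 0x7F means the file is not pure ASCII.
    has_8bit = any(byte > 0x7F for byte in patch)
    return has_8bit and not has_content_type


# Hypothetical miniature patches, not real format-patch output.
labelled = b"Subject: x\nContent-Type: text/plain; charset=UTF-8\n\n# \xc3\x9cr\xc3\xbcmqi\n"
unlabelled = b"Subject: x\n\n# \xc3\x9cr\xc3\xbcmqi\n"
print(is_unlabelled_8bit(labelled), is_unlabelled_8bit(unlabelled))
```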
See git-send-email(1).
The machine I email patches from is old (Solaris 10!) and slow and does not share files with my desktop. And my desktop does not use a mail transfer agent, for security reasons. So I've been running 'git format-patch' on my desktop, using scp to copy the output file to the email-sending machine, and feeding that file directly to /usr/lib/sendmail. Or, I've been using git format-patch to generate a file and then attaching the patch via Thunderbird. Apparently neither use case works with UTF-8 data, alas. What a pain.
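For the scp-and-sendmail workflow, one conceivable workaround is to add the missing MIME headers to the file before handing it to sendmail. The sketch below is in Python and rests on several assumptions: the file holds a single message, its header block ends at the first blank line, and its content really is UTF-8. `add_mime_headers` is a hypothetical helper, not part of git, and it does nothing for the Thunderbird attachment case.

```python
# Headers assumed sufficient to label an 8-bit UTF-8 text message.
MIME_HEADERS = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=UTF-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
)


def add_mime_headers(patch: bytes) -> bytes:
    """Insert MIME headers into the header block unless one is already declared."""
    headers, sep, body = patch.partition(b"\n\n")
    declared = any(
        line.lower().startswith(b"content-type:")
        for line in headers.splitlines()
    )
    if declared or not sep:
        # Already labelled, or no header/body boundary found: leave untouched.
        return patch
    return headers + b"\n" + MIME_HEADERS + b"\n" + body
```

Whether every hop of the mail path handles 8bit content cleanly is not something this sketch checks.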

On 2014-06-30 17:49, Paul Eggert wrote:
Ian Abbott wrote:
I think the lack of MIME headers in Paul's email is the reason why.
Yes, for some reason it's not working for me. I had already noticed the problem, and filed a bug report here:
http://article.gmane.org/gmane.comp.version-control.git/252660
a few hours ago, but no response yet. git format-patch never prompts me, and sometimes it labels the patch as UTF-8, sometimes it doesn't. For example:
$ git format-patch 5be5ee3dd453c5b575f6336eada9390fb205717a^!
0001-Mention-more-JavaScript-libraries.patch
$ git format-patch c25e1180cf3ec34d6c731d5ec16739d6d2ca8fc2^!
0001-More-spelling-and-accent-fixes.patch
$ grep UTF-8 0*
0001-Mention-more-JavaScript-libraries.patch: <meta http-equiv="Content-type" content='text/html; charset="UTF-8"'>
The first patch is labeled as UTF-8 even though the patch is entirely ASCII; the second patch is not labeled even though it contains non-ASCII characters!
You're mistaken about the "Mention more JavaScript libraries" patch. Grep caught the wrong line! I get the same result in git 2.0.0. It seems git format-patch only adds the MIME headers to the patch if the commit message contains 8-bit characters, regardless of the patch data. However, that doesn't stop git send-email from detecting the lack of Content-Transfer-Encoding and prompting for one when necessary.
My .gitconfig is simple:
[user]
    name = Paul Eggert
    email = eggert@cs.ucla.edu
[push]
    default = simple
It seems crazy to me that I would need to specify an obscure option to have 'git format-patch' do the right thing. I run either git 1.9.3 (Fedora 20) or git 1.9.1 (Ubuntu 14.04) and neither version documents sendmail.assume8bitEncoding or --8bit-encoding in its man pages. Perhaps the git folks have been hacking around in this area, and the natural default doesn't work any more? Could you try the above shell commands and see what they output for you? Also, which git version are you running?
Check the man page for git-send-email - the option is described there, although I got it slightly wrong. It should be sendemail.assume8bitEncoding, not sendmail.assume8bitEncoding. I get the same result as you in git 2.0.0.
You can set the option in either your global ~/.gitconfig or your local .git/config using either
$ git config --global sendemail.assume8bitEncoding UTF-8
or
$ git config --local sendemail.assume8bitEncoding UTF-8
(you can omit the --local as that's the default)
In either case, the option looks like this in the global or local git config:
[sendemail]
    assume8bitEncoding = UTF-8
--
-=( Ian Abbott @ MEV Ltd. E-mail: <abbotti@mev.co.uk> )=-
-=( Tel: +44 (0)161 477 1898 FAX: +44 (0)161 718 3587 )=-

Ian Abbott wrote:
[sendemail]
    assume8bitEncoding = UTF-8
This assumes I can run 'git send-email', right? So if I attach the patch via Thunderbird then it won't work. Also, I would have to figure out how to use 'git send-email', which I expect will not be trivial for me due to security considerations.

On 2014-06-30 19:19, Paul Eggert wrote:
Ian Abbott wrote:
[sendemail]
    assume8bitEncoding = UTF-8
This assumes I can run 'git send-email', right? So if I attach the patch via Thunderbird then it won't work. Also, I would have to figure out how to use 'git send-email', which I expect will not be trivial for me due to security considerations.
git send-email's --smtp-server option (or the sendemail.smtpServer git config variable) can be either a hostname or an absolute path to a sendmail-like program. In fact, it defaults to /usr/sbin/sendmail, /usr/lib/sendmail, or localhost. You could set it to your own script that sends the message to /usr/lib/sendmail on another host via ssh.
--
-=( Ian Abbott @ MEV Ltd. E-mail: <abbotti@mev.co.uk> )=-
-=( Tel: +44 (0)161 477 1898 FAX: +44 (0)161 718 3587 )=-
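Such a script could be quite small. The sketch below is in Python. The host name `mailhost.example.org` is a placeholder that does not come from this thread, and the code assumes key-based ssh access and that git calls the program with sendmail-style arguments and the message on standard input.

```python
#!/usr/bin/env python3
import shlex
import subprocess
import sys

REMOTE_HOST = "mailhost.example.org"   # placeholder host, an assumption
REMOTE_SENDMAIL = "/usr/lib/sendmail"  # path mentioned earlier in the thread


def build_command(args):
    """Build the ssh command line that runs sendmail on the remote host."""
    # ssh joins its arguments into one string for the remote shell,
    # so each argument is quoted to survive that shell unchanged.
    remote = " ".join(shlex.quote(a) for a in [REMOTE_SENDMAIL, *args])
    return ["ssh", REMOTE_HOST, remote]


def main():
    # The message arrives on stdin and is inherited by ssh untouched.
    return subprocess.run(build_command(sys.argv[1:])).returncode
```

Saved as an executable file that ends with `sys.exit(main())`, it could then be selected with `git config sendemail.smtpServer` followed by the absolute path to that file.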

On 30 June 2014 at 06:40, "vanadovv@hetnet.nl" <vanadovv@hetnet.nl> wrote:
-Zone Asia/Urumqi 5:50:20 - LMT 1928 # Ürümqi or Urumchi
+Zone Asia/Urumqi 5:50:20 - LMT 1928 # Ürümqi or Ürümchi
The correct spelling of the Asian city at hand is Ürümchi, with two u-diaereses. The GitHub asia file has the correct spelling and the correct encoding.
The diff file however gets mangled by some publishing process. Hence, the diff published through the mailing list is rather useless.
I have a lot of text encoding conversion filters, but none is able to decode the mangled "Ürümqi". It looks to me like a doubly UTF-8 encoded piece of text.
Classic mojibake. It's what you get if you encode Ürümqi as UTF-8, and then read the bytes as if they were Windows Latin-1 (Windows-1252).
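The effect is easy to reproduce. Here is a minimal sketch in Python; the function names are illustrative, and `cp1252` is assumed to be the codec meant by "Windows Latin-1".

```python
def mangle(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def unmangle(text: str) -> str:
    """Undo one round of the misreading done by mangle()."""
    return text.encode("cp1252").decode("utf-8")


original = "Ürümqi"
garbled = mangle(original)
print(garbled)            # Ürümqi
print(unmangle(garbled))  # Ürümqi
```

Text that had gone through the misreading twice would need `unmangle` applied twice.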
participants (5)
- Andreas Schwab
- Ian Abbott
- J Andrew Lipscomb
- Paul Eggert
- vanadovv@hetnet.nl