From 100a7709130ef43afaf17debc637db846e12efbd Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Thu, 26 Jun 2014 14:23:53 -0700
Subject: [PATCH] * zic.8, NEWS: Document character encoding issues better.

(Thanks to Guy Harris for reporting the problem.)
---
 NEWS  |  5 +++--
 zic.8 | 18 +++++++++++++++++-
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index e87e2b7..2983cc0 100644
--- a/NEWS
+++ b/NEWS
@@ -28,8 +28,9 @@ Unreleased, experimental changes
 
     Documentation and commentary now prefer UTF-8 to US-ASCII,
     allowing the use of proper accents in foreign words and names.
-    Code and data have not changed because of this.
-    (Thanks to Garrett Wollman and Ian Abbott for helping to debug this.)
+    Code and data have not changed because of this. (Thanks to
+    Garrett Wollman, Ian Abbott, and Guy Harris for helping to debug
+    this.)
 
     Non-HTML documentation and commentary now use plain-text URLs instead of
     HTML insertions, and are more consistent about bracketing URLs when they
diff --git a/zic.8 b/zic.8
index e22e6cd..2a1d29e 100644
--- a/zic.8
+++ b/zic.8
@@ -113,7 +113,7 @@ before 1970 or after the start of 2038.
 A time zone abbreviation has fewer than 3 characters.
 POSIX requires at least 3.
 .PP
-An output file name contains a byte that is not an ASCII letter, digit,
+An output file name contains a byte that is not an ASCII letter,
 .q "-" ,
 .q "/" ,
 or
@@ -135,8 +135,24 @@ rather than
 .B yearistype
 when checking year types (see below).
 .PP
+Input files should be text files, that is, they should be a series of
+zero or more lines, each ending in a newline byte and containing at
+most 511 bytes, and without any NUL bytes. The input text's encoding
+is typically UTF-8 or ASCII; it should have a unibyte representation
+for the POSIX Portable Character Set (PPCS)
+
+and the encoding's non-unibyte characters should consist entirely of
+non-PPCS bytes. Non-PPCS characters typically occur only in comments:
+although output file names and time zone abbreviations can contain
+nearly any character, other software will work better if these are
+limited to the restricted syntax described under the
+.B \-v
+option.
+.PP
 Input lines are made up of fields.
 Fields are separated from one another by one or more white space characters.
+The white space characters are space, form feed, carriage return, newline,
+tab, and vertical tab.
 Leading and trailing white space on input lines is ignored.
 An unquoted sharp character (#) in the input introduces a comment which
 extends to the end of the line the sharp character appears on.
-- 
1.9.1
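
Reviewer note: the input-file rules this patch adds to zic.8 (newline-terminated lines, at most 511 bytes per line, no NUL bytes, six white space separators) can be sketched as a small checker. This is only an illustration, not code from zic: the names `check_input` and `split_fields` are hypothetical, the sketch assumes the 511-byte limit counts the trailing newline, and it treats every `#` as a comment start even though the man page only says that for an unquoted one.

```python
# Hypothetical helpers, not part of zic: they restate the text-file rules
# that the zic.8 wording in this patch describes.

MAX_LINE_BYTES = 511  # assumption: the limit includes the trailing newline


def check_input(data: bytes) -> list[str]:
    """Return a list of problems found in the raw bytes of an input file."""
    problems = []
    if b"\0" in data:
        problems.append("contains a NUL byte")

    # Split on the newline byte only; carriage return is a field separator,
    # not a line terminator, in the wording of the patch.
    lines = data.split(b"\n")
    tail = lines.pop()  # whatever follows the final newline
    if tail:
        problems.append("last line does not end in a newline")

    for number, line in enumerate(lines, start=1):
        if len(line) + 1 > MAX_LINE_BYTES:  # +1 for the newline byte
            problems.append(f"line {number} is longer than {MAX_LINE_BYTES} bytes")
    return problems


def split_fields(line: bytes) -> list[bytes]:
    """Split one line into fields, dropping a trailing comment.

    Simplification: quoting is ignored, so any '#' starts a comment.
    """
    line = line.split(b"#", 1)[0]
    # bytes.split() with no argument splits on runs of space, form feed,
    # carriage return, newline, tab, and vertical tab, and discards
    # leading and trailing white space.
    return line.split()
```

For example, `check_input(b"")` returns an empty list, since a series of zero lines is allowed, while a file whose last line lacks a newline yields one problem string.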