[debian-devel:13101] inside of coming groff
This is Sano in Hamamatsu.
# Changing the subject.
In <200010171136.UAA09096@xxxxxxxxxxxxxxxxxxxxxxxx>,
on "Tue, 17 Oct 2000 20:36:39 +0900",
with "[debian-devel:13079] Re: jgroff patch handling onlatin1(Re:linuxdoc-tools: no Korean .txt output!)",
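Nozomi Ytow <nozomi@xxxxxxxxxxxxxxxxxx> wrote:
> If the source is non-UTF-8, would language tags be inserted into the
> UTF-8 depending on the source locale? Would roff then change its
> behavior according to the language tag? In other words, would that
> effectively create a multilingual roff? Including BiDi and vertical
> text? Hard to imagine right away. It would certainly be nice to have.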
I was going to look at this via <http://lists.debian.org/debian-i18n-0010/>,
but the message has not been published there yet.
So I will quote from the copy that reached my mailbox:
Message-Id: <20001018.004646.101498877.wl@xxxxxxx>
X-Mailing-List: <debian-i18n@lists.debian.org> archive/latest/669
| Subject: Re: [Groff] Re: groff: radical re-implementation
| From: Werner LEMBERG <wl@xxxxxxx>
| Date: Wed, 18 Oct 2000 00:46:46 +0200 (CEST)
| X-Mailer: Mew version 1.95b54 on Emacs 20.6 / Mule 4.0 (HANANOEN)
# Come to think of it, it looks like Werner uses Mew. Not that it matters.
| > - Groff handles glyph, not character.
| > I don't understand relationship between these two. UTF-8 is a code
| > for character, not glyph. ISO8859-1 and EUC-JP are also codes for
| > character. No difference among UTF-8, ISO8859-1, and EUC-JP.
|
| Well, this is *very* important. The most famous example is that the
| character `f', followed by character `i' will be translated into a
| single glyph `fi' (which has incidentally a Unicode number for
| historical reasons). A lot of other ligatures don't have a character
| code. Or consider a font which has 10 or more versions of the `&'
| character (such a font really exists). Do you see the difference? A
| font can have multiple glyphs for a single character. For other
| scripts like Arabic it is necessary to do a lot of contextual analysis
| to get the right glyphs. Indic scripts like Tamil have about 50 input
| character codes which map up to 3000 glyphs!
|
| Consider the CJK part of Unicode. A lot of Chinese, Korean, Japanese,
| and Vietnamese glyphs have been unified, but you have to select a
| proper locale to get the right glyph -- many Japanese people have been
| misled because a lot of glyphs in the Unicode book which have a JIS
| character code don't look `right' for Japanese.
|
| For me, groff is primarily a text processing tool, and such a program
| works with glyphs to be printed on paper. A `character' is an
| abstract concept, basically. Your point of view, I think, is
| completely different: You treat groff as a filter which just
| inserts/removes some spaces, newline characters etc.
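This seems to be the key point. In Werner's design, groff exists to
handle "glyphs", so it does not fixate on "characters".
So the job groff (in practice, the troff command) has to do is to
decide how to select the appropriate "glyphs" from the "characters"
contained in the input data. Is that the idea?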
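To make the character/glyph distinction concrete for myself, here is a
minimal sketch in Python of the `f' + `i' example. It is only an
illustration, not how troff does it: the LIGATURES table and the
function name chars_to_glyphs are made up here, and a real font
definition file would supply the ligature list.

```python
# Hypothetical ligature table: (char, char) -> glyph name.
# In groff this information would come from the font definition file.
LIGATURES = {("f", "i"): "fi", ("f", "l"): "fl"}


def chars_to_glyphs(text):
    """Map a string of characters to a list of glyph names.

    Two adjacent characters that form a known ligature become one glyph;
    every other character maps to a glyph named after itself.
    """
    glyphs = []
    i = 0
    while i < len(text):
        pair = tuple(text[i:i + 2])
        if pair in LIGATURES:
            glyphs.append(LIGATURES[pair])
            i += 2
        else:
            glyphs.append(text[i])
            i += 1
    return glyphs


print(chars_to_glyphs("file"))
```

Four input characters come out as three glyphs, which is the sense in
which the formatter counts and measures glyphs rather than characters.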
| > However, I won't stick to wchar_t or ucs-4 for internal code, though
| > I have no idea about your '31bit glyph code'. (Maybe I have to
| > study Omega...)
|
| A `glyph code' is just an arbitrary registration number for a glyph
| specified in the font definition file. It is invariable from the
| input encoding. Adobe has `official' glyph lists like `Adobe
| standard' or `Adobe Japan1'. CID encoded PostScript fonts use CMaps
| to map the input encoding to these glyph IDs.
|
| > The name '--locale' is confusing since it has no relation to locale,
| > i.e., a term which refer to a certain standard technology.
|
| I welcome any suggestions for better names...
|
| > - Japanese and Chinese text contains few whitespace characters.
| > (Japanese and Chinese words are not separated by whitespace).
| > Therefore, different line-breaking algorithm should be used.
| > (Hyphen character is not used when a word is broken into lines.)
| > (Modern Korean language contains whitespace characters between
| > words --- though not words, strictly speaking.)
|
| Not really a different line breaking algorithm but more glyph
| properties (to be set with `.cflags'): disallowing breaks after or
| before a glyph for implementing kinsoku shori; for implementing
| shibuaki properly we probably need to extend the .cflags syntax to set
| glyph properties for whole glyph classes.
On kinsoku shori, he seems to be thinking along the same lines as
what Ukai-san wrote.
| For the non-CJK experts: `kinsoku shori' means that some CJK glyphs
| must not start a line (for example, an ideographic comma or closing
| bracket) resp. end a line (opening brackets). `shibuaki' means
| `quarter space'; this is the space between CJK characters and Latin
| characters -- there are Japanese standards which defines all these
| things in great detail.
|
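A rough sketch of kinsoku shori expressed as per-glyph properties rather
than as a separate line-breaking algorithm. This is Python, not groff
syntax; the two sets stand in for what `.cflags' would set, and the set
contents and function names are my own choices for illustration.

```python
# Hypothetical glyph properties (stand-ins for .cflags settings).
NO_LINE_START = set("、。）」")  # e.g. ideographic comma, closing brackets
NO_LINE_END = set("（「")        # e.g. opening brackets


def can_break_between(left, right):
    """A break is allowed unless it would strand a forbidden glyph."""
    return left not in NO_LINE_END and right not in NO_LINE_START


def break_points(text):
    """Return the indices i where a line may be broken before text[i]."""
    return [i for i in range(1, len(text))
            if can_break_between(text[i - 1], text[i])]


print(break_points("彼は「はい」と"))
```

The breaking loop itself stays the same as for any other text. Only the
properties attached to each glyph differ, which I take to be Werner's
point.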
| > - Hyphenation algorithm differs from language to language.
# Kubota-san, thank you. Werner did understand this properly :)
# That is a relief.
| What exactly do you mean? The only real difficult language which
| could be easily supported with groff is Thai (and similar languages).
| You need at least a dictionary to find word breaks. All other
| languages can easily be managed with the current algorithm, I believe.
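By "easily be managed" he means that it is enough to provide
definition files for each language. That is easy if existing files
can be reused. But even work like gettext message translation, where
"the framework is already there and all that remains is to translate",
still takes a fair amount of people and time.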
| > - Almost CJK characters (ideographics, hiragana, katakana, hangul,
| > and so on) have double width on tty. Since you won't use wchar_t,
| > you cannot use wcwidth() to get the width for characters.
|
| This is not a problem. Just give the proper glyph width in the tty
| font definition files.
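That sounds easy for fixed-width fonts. What happens with
proportional ones? If the number of glyphs becomes huge, won't
this turn into a very laborious job?
Maybe the plan is to handle only fixed-width fonts for now.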
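For the fixed-width tty case, the width entries would not necessarily
have to be written by hand. As an assumption on my part (nothing Werner
proposed), they could be generated from the Unicode East Asian Width
property. A sketch in Python using the standard unicodedata module; the
function names are made up, and combining characters are ignored.

```python
import unicodedata


def tty_width(ch):
    """Return 2 for East Asian wide/fullwidth characters, otherwise 1.

    Simplification: combining and control characters are not handled.
    """
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def width_table(chars):
    """Build a {character: width} table, as a tty font file might list it."""
    return {ch: tty_width(ch) for ch in chars}


def line_width(text):
    """Total number of tty columns occupied by text."""
    return sum(tty_width(ch) for ch in text)


print(width_table("aあ漢"))
print(line_width("漢字ab"))
```

This does nothing for proportional fonts, where the metrics have to come
from the font itself.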
| > - Latin-1 people may use 0xa9 for '\(co'. However, this character
| > cannot be read in other encodings. The current Groff convert
| > '\(co' to 0xa9 in latin1 device and to '(C)' in ascii device.
| > How it works for future Groff? Use u+00a9? The postprocessor
| > (see below) cannot convert u+00a9 to '(C)' because the width is
| > different and typesetting is broken. It is very difficult to
| > design to avoid this problem...
|
| For tty devices, the route is as follows. Let's assume that the input
| encoding is Latin-1. Then the input character code `0xa9' will be
| converted to Unicode character `U+00a9' (by the preprocessor). A
| hard-coded table maps this character code to a glyph with the name
| `co'. Now troff looks up the metric info in the font definition file.
| If the target device is an ASCII-capable terminal, the width is three
| characters (the glyph `co' is defined with the .char request to be
| equal to `(C)'); if it is a Unicode-capable terminal, the width is one
| character. After formatting, a hard-coded table maps the glyphs back
| to Unicode.
|
| Note that the last step may fail for glyphs which have no
| corresponding Unicode value.
I wonder how large this hard-coded mapping table would be.
If it all has to be read into memory at once, processing could be painful.
Is there an established mapping from character code to glyph name
somewhere?
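To check my reading of the route Werner describes for `\(co', here is a
small Python sketch. The three tables are tiny stand-ins for the
hard-coded tables and the `.char' definition; their names, the function
name format_for_tty, and the device strings are hypothetical.

```python
# Hypothetical stand-ins for the tables in Werner's description.
CHAR_TO_GLYPH = {0x00A9: "co"}   # Unicode code point -> glyph name
GLYPH_TO_CHAR = {"co": 0x00A9}   # glyph name -> Unicode, used after formatting
ASCII_FALLBACK = {"co": "(C)"}   # like a .char definition on an ASCII device


def format_for_tty(data, device):
    """Convert Latin-1 bytes to tty output text and its width in columns.

    device is "ascii" or "utf8". Every output character counts as one
    column, which is a simplification.
    """
    out = []
    width = 0
    for ch in data.decode("latin-1"):          # preprocessor: Latin-1 -> Unicode
        glyph = CHAR_TO_GLYPH.get(ord(ch), ch)  # character -> glyph name
        if device == "ascii" and glyph in ASCII_FALLBACK:
            rendered = ASCII_FALLBACK[glyph]    # three columns for `co'
        elif glyph in GLYPH_TO_CHAR:
            rendered = chr(GLYPH_TO_CHAR[glyph])  # map glyph back to Unicode
        else:
            rendered = glyph                    # plain character, unchanged
        out.append(rendered)
        width += len(rendered)
    return "".join(out), width


print(format_for_tty(b"\xa9 2000", "ascii"))
print(format_for_tty(b"\xa9 2000", "utf8"))
```

The width is decided per device before output, so replacing `co' with
`(C)' does not break the typesetting the way a late substitution in a
postprocessor would.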
| > > . Finally, we need to divide the -T option into a --device and
| > > --output-encoding.
| >
| > What is the default encoding for tty? I suggest this should be
| > locale-sensible. (Or, this can be UTF-8 and Groff can invoke a
| > postprocessor.)
|
| I favor UTF-8 + postprocessor. Terminal capabilities should be
| selected with macro packages; for example, an ASCII terminal would get
| the options
|
| -m ascii --device=tty --output-encoding=ascii
|
| the tmac.ascii file would be very similar to tmac.tty + tmac.tty-char.
|
| A latin-2 terminal would be
|
| -m latin2 --device=tty --output-encoding=latin2
|
| A Unicode terminal emulating an ASCII terminal would be
|
| -m ascii --device=tty --output-encoding=utf8
|
| etc.
|
| Using a postprocessor we need only a single font definition file for
| tty devices.
Is this "single font definition file for tty devices" the one that is
used at the step "After formatting, a hard-coded table maps the glyphs
back to Unicode"?
| > > Yes. The `iconv' preprocessor would then do some trivial, hard-coded
| > > conversion.
| >
| > You mean, the preprocessor is iconv(1) ?
|
| Basically yes, with some adaptations to groff.
|
| > The preprocessor, provisional name 'gpreconv', will be designed as:
| > - includes hard-coded converter for latin1, ebcdic, and utf8.
| > - uses iconv(3) if possible (compiled within internationalized OS).
| > - parses --input-encoding option.
| > - default input is latin1 if compiled within non-internationalized
| > OS.
| > - default input is locale-sensible if compiled within
| > internationalized OS.
|
| Exactly.
|
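The `gpreconv' design listed above can be sketched in a few lines of
Python. This is only an illustration of the defaulting rules, not the
proposed implementation: the function names and the `internationalized'
flag (standing in for the compile-time distinction) are made up, and
Python's codecs stand in for iconv(3).

```python
import locale


def default_input_encoding(internationalized=True):
    """Pick the default input encoding.

    Latin-1 on a non-internationalized system, otherwise whatever the
    current locale prefers.
    """
    if not internationalized:
        return "latin-1"
    return locale.getpreferredencoding(False) or "latin-1"


def preconv(data, input_encoding=None, internationalized=True):
    """Convert input bytes to UTF-8.

    input_encoding plays the role of the --input-encoding option; when it
    is not given, the default above is used.
    """
    encoding = input_encoding or default_input_encoding(internationalized)
    return data.decode(encoding).encode("utf-8")


print(preconv(b"\xa9 2000", "latin-1"))
```

A `gpostconv' would be the same thing in the other direction: decode
UTF-8, then encode to the requested output encoding.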
| > Thus I designed the above 'gpreconv'. Oh, I have to design
| > 'gpostconv' also.
|
| It should be very similar to the preprocessor.
|
|
| Werner
--
# (My home is Hamamatsu City, famous for its "sweets for the night".)
<kgh12351@xxxxxxxxxxx> : Taketoshi Sano (佐野 武俊)