[debian-devel:10372] Draft 3
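This is Kubota.
Starting with this draft, some parts were written directly in English.
* The table of contents has changed slightly.
* Much more of the text is now in English.
* The chapter on Japanese (3.3.1.) has grown considerably.
* 3.1.* is nearly complete, except for the UNICODE-related parts.
  What remains is to find primary sources and add pointers to them.
* The last chapters are still in a pre-draft state.
By the way, do ISO standards documents cost money?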
/***********************************************************
* 久保田智広 Tomohiro KUBOTA
* tkubota@xxxxxxxxxxx / kubota@debian.or.jp
***********************************************************/
INTRODUCTION TO I18N
CONTENTS
1. About This Document
2. Introduction
3. Character Coding Systems
3.1. General Discussion
3.1.1. Character / Character Set / Coded Character Set (or Codeset)
3.1.2. Stateless and Stateful
3.1.3. Number of Bytes, Number of Characters, and Number of Columns
3.2. Standards for Character Codes
3.2.1. ASCII and ISO 646
3.2.2. ISO 8859
3.2.3. ISO 2022
3.2.4. ISO/IEC 10646 (UCS-4, UCS-2), UNICODE, UTF-8, UTF-16
3.3. Current Situation in Each Country
3.3.1. Japanese
4. Output to Display
4.1. Console Software
4.2. X Clients
5. Input from Keyboard
5.1. Console Software
5.1.1. Invoked in the Console and Kon
5.1.2. Invoked in an X Terminal Emulator
5.2. X Clients
6. Internal Processing and File I/O
7. Other Special Topics
7.1. Tcl/Tk Programs
7.2. Perl Scripts
7.3. Shell Scripts
8. Examples of I18N
1. About This Document
1.1. Scope
This document describes the basic ideas of I18N for
programmers and package maintainers of Debian GNU/Linux.
Its aim is to introduce the basic concepts, the character
codes, and the points that need care when one writes I18N-ed
software or an I18N patch for existing software. This document
also tries to describe the actual situation and the existing
problems for each language and country.
This document does not describe the details of programming,
except for the last chapter where instances of I18N are
collected.
Though this document is strongly related to programming
languages such as C and standardized I18N methods such as
gettext and LOCALE, this document does not supply a
detailed explanation of them.
1.2. New Versions of This Document
The current version of this document is always accessible
at http://surfchem0.riken.go.jp/~kubota/linuxwork/i18ndoc.html.
1.3. Feedback and Contributions
This document needs contributions, especially for the
chapter on each language and the chapter on instances of I18N.
These chapters consist of contributions.
Otherwise, this will be a mere document on Japanization only,
because the original author Tomohiro KUBOTA <kubota@debian.or.jp>
speaks Japanese and lives in Japan.
Discussions are held on the debian-devel@lists.debian.org mailing list.
2. Introduction
The Debian system includes a lot of software. Though much of it
can process, output, and input text data, some of these programs
assume that text is written in English (ASCII).
For people who use non-English languages these programs are
hardly usable.
So far people who use non-English languages have given up
and accepted computers as such. However, we should throw away
such a wrong idea now. It is nonsense that a person who
wants to use a computer has to learn English in advance.
There are a few approaches by which software can handle
non-English languages. What we need to do first is to know
the differences between these approaches and to choose one
approach for each case.
a. L10N (localization)
This approach is to support two languages or character sets,
English (ASCII) and one other specific language. An example is
Nemacs (Nihongo Emacs, an ancestor of Mule, the MULtilingual
Enhancement to GNU Emacs). Since every programmer has his/her
own mother tongue, there are numerous L10N patches and L10N
programs, each written to satisfy its author's own needs.
b. I18N (internationalization)
This approach is to support many languages, but only two
of them, English (ASCII) and one other, at the same time.
One has to specify the other language with the LANG environment
variable or the like. LOCALE and gettext are categorized
as I18N.
c. M17N (multilingualization)
This approach is to support many languages at the same time.
For example, Mule (MULtilingual Enhancement to GNU Emacs)
can treat a text file which contains multiple languages,
for example, a paper on the differences between Korean and
Chinese whose main text is written in Finnish. GNU Emacs 20
and XEmacs now include Mule.
Generally speaking, the I18N approach is better than L10N, and M17N
is better than I18N. In other words, text-processing software which
can treat many languages at the same time is 'better' than software
which can treat only two languages (English and one other).
Now let me classify the approaches to supporting non-English
languages from another viewpoint.
A. Implementation without Knowledge of Each Language
This approach is possible by utilizing standardized methods
supplied by the kernel or libraries, such as LOCALE, wchar_t,
and gettext. The advantages of this approach are (1) that when
the kernel or libraries are upgraded, the software may support
additional languages and (2) that programmers need not know
each language. The disadvantage is that there are categories or
fields where no standardized method is available. So far,
standardized methods such as LOCALE and gettext are available
in the field of I18N, but no standards are established for the
M17N approach. Furthermore, there is no standard for the number
of columns a character occupies, nor for methods of inputting
non-English languages on the console (that is, an interface to
an input library).
B. Implementation Using Knowledge of Each Language
This approach is to directly implement information about
each language, based on the knowledge of the programmers and
contributors. L10N almost always uses this approach.
The advantage of this approach is that a detailed and strict
implementation is possible even beyond the fields where
standardized methods are available. Language-specific
problems can be solved perfectly (of course it depends on
the skill of the programmer). The disadvantages are
(1) that the number of supported languages is restricted
by the skill or the interest of the programmers or the
contributors and (2) that labor which should be united and
concentrated on upgrading the kernel or libraries is dispersed
over many programs, that is, the wheel is re-invented.
However, a majestic M17N software such as Mule can be
built by strongly pushing this approach forward.
Using this classification, let me consider L10N, I18N, and M17N
from the programmer's point of view.
L10N can be realized using only the programmer's own knowledge
of his/her language. For example, all you have to do is to
implement your knowledge of the SHIFT-JIS coding system. Since
the motivation for L10N is usually to satisfy the programmer's
own needs, extensibility to a third language is often ignored.
Thus approach B, not A, is taken.
Though L10N-ed software is basically useful only for people who
speak the same language as the programmer, it is sometimes
useful for other people whose coding system is similar to
the programmer's. For example, software which does not
recognize EUC-JP but does not break EUC-JP either will not
break EUC-KR.
In the case of C programs, the main part of I18N is achieved using
standardized methods such as LOCALE, wchar_t, and gettext.
The LOCALE approach is classified as I18N because functions
related to LOCALE change their behavior according to a parameter
of setlocale() or environment variables such as LANG.
Namely, approach A is emphasized for I18N. For fields where
standardized methods are not available, however, approach B
cannot be avoided. Even in such a case, the interface and the
support for each language should be designed to be separate
so that support for new languages can be added easily.
Unfortunately there are no standardized methods for M17N so far.
The exceptions are the ISO-2022-INT-* and UNICODE codesets, which
can express many languages at the same time. However, ISO-2022-INT-*
is stateful and thus implementation may be difficult, while
UNICODE lacks compatibility with East Asian standards
and has many variants of its own (UCS-* and UTF-*), though
they can easily be converted into one another. Of course M17N-ed
software cannot be written with an M17N-ed codeset alone.
Thus approach B cannot be avoided for M17N so far.
Efforts for standardization in the various fields of M17N should
be made. Mule is the only software which has achieved M17N.
This document is focused on I18N. Note that I18N-ed software
cannot process a text file which contains three or more languages,
for example, Finnish, Chinese, and Korean (a paper written in
Finnish comparing Chinese and Korean). M17N is needed
for such a case. Don't forget that the true goal is M17N and
I18N is a compromise.
For people using non-Latin letters, I18N as discussed here does
not include messages written in their languages nor file names
written in their languages. Yes, it is true that these should be
achieved. However, considering our current state, we can say such
requirements are too much of a luxury. What we truly need is, for
example, that characters in our languages are displayed using the
correct font without destroying the screen, that a way to input
our characters is supplied, and that our languages can be
inputted correctly. It would be fine if text-processing
software such as perl and grep processed our languages correctly.
Regarding the circumstances in which we stand, the author
concentrates on the problems which truly need solving rather
than on right and ideal I18N/M17N. In other words, this document
concentrates on how characters should be displayed,
inputted, and processed without destroying them, not on the
time-displaying format, currency symbols, and so on.
3. Character Coding Systems
Here the major character sets and codesets are introduced.
The last section of this chapter contains information
on each language. Contributions to that section for
many languages are especially welcome, though contributions
to the whole text are of course welcome.
3.1. General Discussion
3.1.1. Character / Character Set / Coded Character Set (or Codeset)
A 'CHARACTER CODE' is a set of combinations of bits used to
treat characters in computers. To determine a character code,
one first has to determine the set of characters to be encoded.
This set of characters is called a 'CHARACTER SET'. There are many
standards for character sets in the world. For example, JIS X 0208
contains the main characters used in Japanese. Usually a character
set is not a mere collection of characters; each character in the
set has its own number. Usually the numbering is done so that the set
is consistent with international standards.
Then one selects a character set or multiple character sets and
assigns codes to the characters included in the character set(s).
This way of assigning codes is called 'ENCODING'.
The set of encoded characters is called a 'CODED CHARACTER SET'
or 'CODESET'. For example, the ISO-2022-JP 'codeset' contains
the 'character sets' ASCII, JIS X 0201 Katakana, and JIS X 0208 Kanji.
Encoding for a codeset including multiple character sets
is usually done in two stages, first within each character
set and then for the combination of character sets.
For a codeset including only one character set, we don't
have to distinguish 'character set' and 'codeset'.
For example, ASCII is a character set and a codeset at the
same time.
3.1.2. Stateless and Stateful
For a codeset including multiple character sets, one has to
determine how to combine these character sets when encoding.
There are two ways to do that. One is to make all characters
in all the character sets have unique codes. The other is to
allow characters from different character sets to have the same
code and to have a code, such as an escape sequence, which switches
the 'SHIFT STATE', that is, selects one character set.
A codeset with shift states is called 'STATEFUL' and
one without shift states is called 'STATELESS'.
Generally stateful codesets can contain more characters than
stateless ones. However, implementing a stateful codeset
is much more difficult than implementing a stateless one.
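The difference can be seen by encoding the same short text both ways.
The following is only a minimal sketch in Python. It assumes the
"euc_jp" (stateless) and "iso2022_jp" (stateful) codecs bundled with
Python, and the variable and function names are made up for this
illustration.

```python
# Minimal sketch: the same text in a stateless and a stateful codeset.
# Assumes Python's bundled "euc_jp" and "iso2022_jp" codecs.

TEXT = "A\u3042B"  # 'A', HIRAGANA LETTER A, 'B'

# Stateless: every character has a unique code, no escape sequences.
stateless = TEXT.encode("euc_jp")

# Stateful: escape sequences switch the shift state before and after
# the Japanese character.
stateful = TEXT.encode("iso2022_jp")


def has_escape(data):
    """Return True if the byte string contains an ESC (0x1b) byte."""
    return b"\x1b" in data


# In the stateful codeset the two bytes 0x24 0x22 stand for different
# characters depending on the current shift state.
in_ascii_state = b"\x24\x22".decode("iso2022_jp")
in_kanji_state = b"\x1b$B\x24\x22\x1b(B".decode("iso2022_jp")

print(stateless)
print(stateful)
```

A program that cuts the stateful byte string at an arbitrary position
loses the shift state of the remainder, which is one reason why
stateful codesets are harder to implement.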
3.1.3. Number of Bytes, Number of Characters, and Number of Columns
One ASCII character is always expressed by one byte
and occupies one column on the console or in a fixed-width font on X.
One must not make such an assumption in I18N programming
and has to clearly distinguish the number of bytes, the number of
characters, and the number of columns.
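The three numbers can be compared side by side as in the sketch
below. It is an illustration only: it assumes that characters whose
Unicode East Asian Width property is Wide or Fullwidth occupy two
columns and that every other character occupies one (combining
characters are not handled), and the function names are hypothetical.

```python
import unicodedata


def column_width(text):
    """Columns on a fixed-width display, under the assumption that
    East Asian Wide/Fullwidth characters take 2 columns and all other
    characters take 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )


def measure(text, codec):
    """Return (number of bytes, number of characters, number of columns)."""
    return len(text.encode(codec)), len(text), column_width(text)


sample = "A\u3042"  # 'A' followed by HIRAGANA LETTER A
for codec in ("euc_jp", "iso2022_jp", "utf-8"):
    print(codec, measure(sample, codec))
```

The number of characters and columns stays the same for all three
codecs, while the number of bytes changes with the codeset.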
3.2. Standards for Character Codes
3.2.1. ASCII and ISO 646
ASCII is a character set and also a codeset at the same time.
ASCII is 7bit and contains 94 printable characters which are
encoded in the region of 0x21 - 0x7e.
ISO 646 is the international standard corresponding to ASCII.
The following 12 code positions
  0x23 (number),
  0x24 (dollar),
  0x40 (at),
  0x5b (left square bracket),
  0x5c (backslash),
  0x5d (right square bracket),
  0x5e (caret),
  0x60 (backquote),
  0x7b (left curly brace),
  0x7c (vertical line),
  0x7d (right curly brace), and
  0x7e (tilde)
are open to national variation, and the other
82 (94 - 12 = 82) characters are called BCT (Basic Code Table).
The IRV (International Reference Version) is the default
assignment for these 12 positions when no national version applies.
Characters at these 12 positions can differ between countries.
For example, the UK version of ISO 646 has the pound currency
symbol at 0x23 and the Japanese version has the yen currency
symbol at 0x5c. The US version of ISO 646 is the same as ASCII.
As far as I know, all codesets in the world contain
the ISO 646 character set.
Codes 0x00 - 0x1f and 0x7f are control characters, and 0x20 is
the space.
3.2.2. ISO 8859
ISO 8859 is an expansion of ASCII using all 8 bits.
An additional 96 printable characters encoded in 0xa0 - 0xff are
available besides the 94 ASCII printable characters.
There are 10 variants of ISO 8859 (in 1997).
  ISO-8859-1   Latin alphabet No.1 (1987)
               characters for western European languages
  ISO-8859-2   Latin alphabet No.2 (1987)
               characters for central European languages
  ISO-8859-3   Latin alphabet No.3 (1988)
  ISO-8859-4   Latin alphabet No.4 (1988)
               characters for northern European languages
  ISO-8859-5   Latin/Cyrillic alphabet (1988)
  ISO-8859-6   Latin/Arabic alphabet (1987)
  ISO-8859-7   Latin/Greek alphabet (1987)
  ISO-8859-8   Latin/Hebrew alphabet (1988)
  ISO-8859-9   Latin alphabet No.5 (1989)
               same as ISO-8859-1 except for Turkish instead of Icelandic
  ISO-8859-10  Latin alphabet No.6 (1993)
               adds Inuit (Greenlandic) and Sami (Lappish) letters
               to ISO-8859-4
Reference: http://park.kiev.ua/mutliling/ml-docs/iso-8859.html
3.2.3. ISO 2022
ISO 2022 is a very powerful codeset in which multiple
character sets, both 1-byte and multibyte, can be
expressed at the same time. It is stateful.
There are many codesets that are subsets of ISO 2022, for example,
ISO-2022-JP, EUC, and compound-text. ISO-2022-*
is widely used for mail/news. EUC has several variants,
for example, EUC-JP and EUC-KR, and is widely used on
UNIX(-like) systems. Compound-text is the standard
codeset for the X Window System.
ISO 2022 has two versions, 7bit and 8bit. The 8bit version
is explained first. The 7bit version is a subset
of the 8bit version.
The 8bit code space is divided into four regions.
(1) 0x00 - 0x1f: C0 (Control Characters 0)
(2) 0x20 - 0x7f: GL (Graphic Characters Left)
(3) 0x80 - 0x9f: C1 (Control Characters 1)
(4) 0xa0 - 0xff: GR (Graphic Characters Right)
GL and GR are the spaces where (printable) character sets are mapped.
All character sets, for example, ASCII, ISO 646-UK,
and JIS X 0208, are classified into four categories.
(1) character set with 1-byte 94-character
(2) character set with 1-byte 96-character
(3) character set with multibyte 94-character
(4) character set with multibyte 96-character
Characters in 94-character sets are mapped
into 0x21 - 0x7e. Characters in 96-character sets are
mapped into 0x20 - 0x7f.
For example, ASCII, ISO 646-UK, and JIS X 0201 Katakana
are classified into (1), ISO 8859-* are classified into (2),
and JIS X 0208 Japanese Kanji, KS C 5601 Korean, and
GB 2312-80 Chinese are classified into (3).
The mechanism to map these character sets into GL and GR is
a bit complex. There are four buffers, G0, G1, G2, and G3.
A character set is 'designated' into one of these buffers
and then a buffer is 'invoked' into GL or GR.
The control sequences to 'designate' a character set into a
buffer are defined as below.
A sequence to designate a character set with 1-byte 94-character
  into G0 set is: ESC 0x28 F,
  into G1 set is: ESC 0x29 F,
  into G2 set is: ESC 0x2a F, and
  into G3 set is: ESC 0x2b F.
A sequence to designate a character set with 1-byte 96-character
  into G1 set is: ESC 0x2d F,
  into G2 set is: ESC 0x2e F, and
  into G3 set is: ESC 0x2f F.
A sequence to designate a character set with multibyte 94-character
  into G0 set is: ESC 0x24 0x28 F,
                  (exception: 'ESC 0x24 F' for F=0x40,0x41,0x42.)
  into G1 set is: ESC 0x24 0x29 F,
  into G2 set is: ESC 0x24 0x2a F, and
  into G3 set is: ESC 0x24 0x2b F.
A sequence to designate a character set with multibyte 96-character
  into G1 set is: ESC 0x24 0x2d F,
  into G2 set is: ESC 0x24 0x2e F, and
  into G3 set is: ESC 0x24 0x2f F.
where 'F' is determined for each character set:
  character set with 1-byte 94-character
    ISO 646 IRV: 1983                         F=0x40
    BS 4730 (UK)                              F=0x41
    ANSI X3.4-1968 (ASCII)                    F=0x42
    NATS Primary Set for Finland and Sweden   F=0x43
      :
    JIS X 0201 Katakana                       F=0x49
    JIS X 0201 Latin                          F=0x4a
      :
  character set with 1-byte 96-character
    ISO 8859-1 Latin-1                        F=0x41
    ISO 8859-2 Latin-2                        F=0x42
    ISO 8859-3 Latin-3                        F=0x43
    ISO 8859-4 Latin-4                        F=0x44
    ISO 8859-7 Latin/Greek                    F=0x46
    ISO 8859-6 Latin/Arabic                   F=0x47
    ISO 8859-8 Latin/Hebrew                   F=0x48
    ISO 8859-5 Latin/Cyrillic                 F=0x4c
      :
  character set with multibyte 94-character
    JIS X 0208-1978 Japanese                  F=0x40
    GB 2312-80 Chinese                        F=0x41
    JIS X 0208-1983 Japanese                  F=0x42
    KS C 5601 Korean                          F=0x43
    JIS X 0212-1990 Japanese                  F=0x44
    CCITT Extended GB (ISO-IR-165)            F=0x45
    CNS 11643-1992 Set 1 (Taiwan)             F=0x47
    CNS 11643-1992 Set 2 (Taiwan)             F=0x48
    CNS 11643-1992 Set 3 (Taiwan)             F=0x49
    CNS 11643-1992 Set 4 (Taiwan)             F=0x4a
    CNS 11643-1992 Set 5 (Taiwan)             F=0x4b
    CNS 11643-1992 Set 6 (Taiwan)             F=0x4c
    CNS 11643-1992 Set 7 (Taiwan)             F=0x4d
      :
** WHERE CAN I FIND THE COMPLETE AND AUTHORITATIVE TABLE OF THIS? **
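The designation rules above can be turned into a small function. The
following Python sketch only restates the rules listed in this
section; the function name designate() and its parameters are
hypothetical, and it is not a complete ISO 2022 implementation.

```python
ESC = b"\x1b"

# Intermediate byte for each buffer (G0..G3), as listed above.
INTERMEDIATE_94 = {0: b"\x28", 1: b"\x29", 2: b"\x2a", 3: b"\x2b"}
INTERMEDIATE_96 = {1: b"\x2d", 2: b"\x2e", 3: b"\x2f"}  # no G0 for 96-sets


def designate(buffer, final, size=94, multibyte=False):
    """Build the escape sequence which designates a character set
    (identified by its final byte F) into buffer G0..G3."""
    if size == 94:
        table = INTERMEDIATE_94
    elif size == 96:
        table = INTERMEDIATE_96
    else:
        raise ValueError("size must be 94 or 96")
    if buffer not in table:
        raise ValueError("this kind of set cannot be designated into G%d" % buffer)
    f = bytes([final])
    if not multibyte:
        return ESC + table[buffer] + f
    # Exception from the list above: 'ESC 0x24 F' for F=0x40,0x41,0x42
    # when a multibyte 94-character set is designated into G0.
    if size == 94 and buffer == 0 and final in (0x40, 0x41, 0x42):
        return ESC + b"\x24" + f
    return ESC + b"\x24" + table[buffer] + f


print(designate(0, 0x42))                  # ASCII into G0
print(designate(0, 0x42, multibyte=True))  # JIS X 0208-1983 into G0
print(designate(1, 0x41, size=96))         # ISO 8859-1 into G1
```

As a cross-check, the sequence built for JIS X 0208-1983 in G0 is the
one with which Python's "iso2022_jp" codec starts when it encodes a
Hiragana character.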
The control codes to 'invoke' one of G{0123} into GL or GR
are defined as below.
A control code to invoke G0 into GL is: SI (Shift In), also called LS0
                         G1      GL   : SO (Shift Out), also called LS1
                         G2      GL   : LS2 (Locking Shift 2)
                         G3      GL   : LS3 (Locking Shift 3)
A control code to invoke one character
                   in G2 into GL is: SS2 (Single Shift 2)
                      G3      GL   : SS3 (Single Shift 3)
A control code to invoke G1 into GR is: LS1R (Locking Shift 1 Right)
                         G2      GR   : LS2R (Locking Shift 2 Right)
                         G3      GR   : LS3R (Locking Shift 3 Right)
** WHAT IS THE VALUE OF THESE CONTROL CODES? **
Note that a character code in a character set invoked into GR is
or-ed with 0x80.
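As an illustration of this rule, the sketch below takes the 2-byte
JIS X 0208 code of one character (a GL code in 0x21 - 0x7e) and sets
the most significant bit of each byte. The result is compared with
EUC-JP, which carries JIS X 0208 in GR. The function name is
hypothetical and the example relies on Python's "euc_jp" codec.

```python
def invoke_into_gr(gl_code):
    """Set the MSB of each byte, i.e. or each byte with 0x80."""
    return bytes(b | 0x80 for b in gl_code)


# JIS X 0208 code of HIRAGANA LETTER A is 0x24 0x22.
jis_x_0208_code = b"\x24\x22"
gr_code = invoke_into_gr(jis_x_0208_code)

print(gr_code.hex())                    # bytes after or-ing with 0x80
print("\u3042".encode("euc_jp").hex())  # the same character in EUC-JP
```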
ISO 2022 also defines 'announcer' codes. For example,
'ESC 0x20 0x41' means 'Only the G0 buffer is used. G0 is already
invoked into GL'. This simplifies the coding system. Even this
announcer can be omitted if the people who exchange data agree.
The 7bit version of ISO 2022 is a subset of the 8bit version. It does
not use C1 and GR.
An explanation of C0 and C1 is omitted here.
3.2.4. ISO/IEC 10646 (UCS-4, UCS-2), UNICODE, UTF-8, UTF-16
3.3. Current Situation in Each Country
This section describes specific information for each language.
Contributions from people speaking each language are welcome.
If you are going to write a section on your language, please include
these points:
 1. kinds and number of characters used in the language,
 2. explanation of the character set(s) which is (are) standardized,
 3. explanation of the codeset(s) which is (are) standardized,
 4. usage and popularity of each codeset,
 5. de-facto standard, if any, on the number of columns of characters,
 6. writing direction and combined characters,
 7. widely used values for the LANG environment variable,
 8. the way to input characters from the keyboard and whether
    you want to input yes/no (and so on) in your language
    or in English,
 9. information needed for beautiful displaying, for example,
    where to break a line, hyphenation, word wrapping, and so on, and
10. other topics.
Writers whose languages are written in a different direction
from European languages or need combined characters
(I heard that these are used in Thai) are encouraged to explain
how to treat such languages.
3.3.1. Japanese language / used in Japan
Japanese is the only official language used in Japan.
People in the Okinawa islands and the Ainu ethnic group in the
Hokkaido region have their own languages, though these are used by
a small number of people and have no letters of their own.
Japan is the only region where the Japanese language is widely used.
The author of this section is Tomohiro KUBOTA <kubota@debian.or.jp>.
3.3.1.1. Characters used in Japanese
There are three kinds of characters used in Japan:
Hiragana, Katakana, and Kanji.
Hiragana and Katakana are phonograms derived from Kanji.
Hiragana and Katakana characters have a one-to-one correspondence
with each other, like the upper and lower case of the Latin alphabet.
However, toupper() and tolower() should not convert Hiragana
and Katakana into each other.
Hiragana contains about 100 characters and of course so does Katakana.
(FYI: about 50 regular characters, about 30 characters with a voiced
consonant symbol, and 9 small characters.)
Kanji are ideograms imported from China about two thousand years ago.
Nobody knows the total number of Kanji, and most adult Japanese
people know several thousand Kanji characters.
Though Kanji originate from Chinese characters, their shapes have
changed from the original ancient Chinese characters.
Almost all Kanji have several readings, depending on the
word in which the Kanji is contained.
Arabic numerals (the same as in European languages) are
widely used in Japanese, though we also have Kanji numerals.
Though the Latin alphabet is not a part of the Japanese characters,
it is widely used for proper nouns and so on.
3.3.1.2. Character Sets
JIS (Japanese Industrial Standards) is the body of standards which
defines the character sets and codesets used in Japan.
The major character sets in Japan are:
  JIS X 0201 (about 60 characters including KATAKANA),
  JIS X 0208 (about 7000 characters including HIRAGANA, KATAKANA,
              and KANJI), and
  JIS X 0212 (about 6000 characters including KANJI).
JIS X 0201 is a product of the era of 8-bit computers, and since
JIS X 0208 includes all characters of JIS X 0201, it is obsolete.
JIS X 0212 is hardly used. Therefore, supporting only JIS X 0208
is sufficient for most purposes.
Note that JIS X 0208 contains not only HIRAGANA, KATAKANA, and KANJI
but also non-literal symbols, English alphabets, Greek characters,
and Russian characters. However, there should rarely be any need
to take this into account when programming.
To fit the ISO-2022 framework, each JIS X 0208 character consists
of two bytes in the range 0x21 - 0x7e.
Like JIS X 0208, each JIS X 0212 character also consists of two
bytes in the range 0x21 - 0x7e.
3.3.1.3. Codesets
There are three popular codesets. First, the characteristics of each
are described.
* ISO-2022-JP (aka JIS code or JUNET code)
  - stateful
  - subset of ISO-2022
  - 7bit
  - ASCII, JIS X 0201, JIS X 0208, and JIS X 0212 can be used.
  - used for e-mail and net-news and preferred for HTML.
  - the same as the code defined in RFC 1468.
* EUC-JP (Extended UNIX Code)
  - stateless
  - subset of ISO-2022
  - 8bit
  - ASCII, JIS X 0201, JIS X 0208, and JIS X 0212 can be used.
  - preferred code for UNIX. For example, almost all Japanese
    message catalogs for gettext are written in EUC-JP.
  - Japanese code is mapped into 0xa0 - 0xff. This is important
    for programmers because one does not need to worry about
    fake '\' or '/' bytes (which are treated specially in
    many contexts) inside the Japanese code.
* SHIFT-JIS (aka Microsoft Kanji Code)
  - stateless
  - NOT a subset of ISO-2022
  - 8bit
  - ASCII, JIS X 0201, and JIS X 0208 can be expressed, but
    JIS X 0212 cannot.
  - Windows/Macintosh support this code only. This makes
    SHIFT-JIS the most popular code in Japan. Though MS
    is thinking about a transition to UNICODE, it is doubtful
    whether it can be done successfully.
*** The concrete encoding of each codeset is to be written here ***
UNICODE is not popular in Japan at all.
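The remark about fake '\' and '/' can be checked with a few lines of
Python. This is only a sketch which relies on Python's "euc_jp" and
"shift_jis" codecs, and the function name fake_specials() is made up
for this illustration. It uses the Kanji U+8868, whose SHIFT-JIS code
is 0x95 0x5c, so the second byte has the same value as ASCII '\'.

```python
def fake_specials(text, codec, specials=b"\\/"):
    """Count bytes in the encoded text which have the value of an ASCII
    special character although the original text contains no such
    character at that place."""
    encoded = text.encode(codec)
    found = sum(encoded.count(bytes([s])) for s in specials)
    real = sum(text.count(chr(s)) for s in specials)
    return found - real


path = "a/b\u8868"  # 'a', '/', 'b', and one Kanji (U+8868)
for codec in ("euc_jp", "shift_jis"):
    print(codec, path.encode(codec), fake_specials(path, codec))
```

A program which scans bytes for '\' or '/' without knowing the
codeset works on the EUC-JP form but can cut the SHIFT-JIS form in
the middle of a character.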
3.3.1.4. How These Codesets Are Used --- Information for Programmers
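EUC-JP should be supported, with the exceptions listed below.
Of course this does not mean that one should code using knowledge
of EUC-JP directly. If one can code without knowledge of any
specific code by using wchar_t and so on, that is preferable.
Needless to say, even in cases not covered by the exceptions
below, it is better to also support Japanese codes other than
EUC-JP (ISO-2022-JP and SHIFT-JIS).
* Programs which handle mail and net-news must handle ISO-2022-JP.
* The de-facto standard for ICQ clients is SHIFT-JIS.
* The rendering engine of a WWW browser should be able to handle
  all codes. (I18N is far from sufficient here; M17N should be
  the goal.)
* Programs which exchange data with Windows/Macintosh should
  use SHIFT-JIS.
* SHIFT-JIS is widely used on BBSes.
* On Joliet-format CD-ROMs, which are used by Windows,
  file names are written in UNICODE.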
3.3.1.5. Columns
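On consoles which can display Japanese (kon, kterm, krxvt),
a JIS X 0201 character occupies 1 column and a JIS X 0208 or
JIS X 0212 character occupies 2 columns.
As in European languages, text is written from left to right and
from top to bottom. However, writing from top to bottom and from
right to left is the traditional style, and it is desirable that
word processors running on X support it.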
3.3.1.6. LANG variable
EUC-JP
  LANG=ja_JP.ujis (major for Linux)
  LANG=ja_JP.eucJP (major for *BSD)
  LANG=ja_JP
  LANG=ja
ISO-2022-JP
  LANG=ja_JP.jis
SHIFT-JIS
  LANG=ja_JP.sjis
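A program which has to choose a codeset from LANG could look at the
part after the dot. The sketch below is hypothetical: the table covers
only the codeset names listed in this section, the right-hand names are
Python codec names, and treating a LANG value without a codeset part as
EUC-JP follows the list above rather than any standard.

```python
# Codeset part of LANG (lowercased) -> Python codec name.
# Only the names from the list above are covered.
CODESET_TO_CODEC = {
    "ujis": "euc_jp",
    "eucjp": "euc_jp",
    "jis": "iso2022_jp",
    "sjis": "shift_jis",
}


def codec_for_lang(lang, default="euc_jp"):
    """Return a codec name for a Japanese LANG value, or None if the
    codeset part is not in the table."""
    _, _, codeset = lang.partition(".")
    codeset = codeset.partition("@")[0].lower()  # drop any @modifier
    if not codeset:
        return default  # 'ja_JP' and 'ja' are listed under EUC-JP
    return CODESET_TO_CODEC.get(codeset)


for value in ("ja_JP.ujis", "ja_JP.eucJP", "ja_JP", "ja_JP.jis", "ja_JP.sjis"):
    print(value, codec_for_lang(value))
```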
3.3.1.7. Input from Keyboard
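Since Japanese characters cannot be input directly from the
keyboard, software which converts input in the English alphabet
into Japanese is needed. Free ones include Wnn and Canna.
They adopt a server/client model and implement their own
protocols. On the X Window System, software called kinput2 is
used, which mediates between these protocols and
XIM (X Input Method).
Most Hiragana, which are phonograms, express a consonant plus a
vowel with one character, so one Hiragana character is input by
typing two letters of the English alphabet. For some Hiragana,
one or three letters of the alphabet correspond to
one Hiragana character.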
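The first conversion step, from alphabet to Hiragana, can be sketched
as a greedy longest-match lookup. The table below is a tiny
hypothetical excerpt chosen for illustration, not the table of Wnn,
Canna, or any other real conversion software.

```python
# A small excerpt of an alphabet-to-Hiragana table (illustration only).
ROMAJI_TO_HIRAGANA = {
    # one letter
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    # two letters
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "sa": "さ", "su": "す", "se": "せ", "so": "そ",
    "ta": "た", "te": "て", "to": "と",
    # three letters
    "shi": "し", "chi": "ち", "tsu": "つ",
}


def to_hiragana(romaji):
    """Convert alphabet input to Hiragana, trying the longest match
    (3, then 2, then 1 letters) at each position. Letters which match
    nothing in the table are passed through unchanged."""
    out = []
    i = 0
    while i < len(romaji):
        for length in (3, 2, 1):
            chunk = romaji[i:i + length]
            if chunk in ROMAJI_TO_HIRAGANA:
                out.append(ROMAJI_TO_HIRAGANA[chunk])
                i += len(chunk)
                break
        else:
            out.append(romaji[i])
            i += 1
    return "".join(out)


print(to_hiragana("sushi"))
print(to_hiragana("tsukue"))
```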
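Kanji are obtained by further converting Hiragana. Many Japanese
words consist of several Kanji in a row, and usually a whole word
can be converted at once. With conversion software which has a
good dictionary and grammar analysis, longer phrases or whole
sentences can be converted at once. However, Japanese has many
words which have the same sound but use different Kanji (and have
different meanings), so it is often necessary to choose from
several candidates.
This conversion from Hiragana to Kanji needs a dictionary which
contains the readings, the Kanji, and the parts of speech
(conjugation types) of frequently used words. Since building such
a dictionary takes a lot of work, proprietary software is
currently far better than free software.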
3.3.1.8. Layout of Characters
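This section gives information which is useful when writing
software such as the rendering engine of a browser.
In Japanese, unlike English, words are not separated by spaces.
Also, a line may be broken in the middle of a word.
However, some care is needed for symbols. In English, an
open parenthesis never appears at the end of a line and a close
parenthesis never appears at the beginning of a line. Japanese
has symbols equivalent to parentheses and the same rule applies.
Also, in English a comma or period never appears at the beginning
of a line. Japanese has symbols equivalent to the comma and the
period and, as in English, they never appear at the beginning
of a line.
In English, lines can be broken only between words, so the rules
on parentheses, commas, and periods are followed automatically,
but in Japanese they need special consideration.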
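A line-breaking routine which follows these rules can be sketched as
below. It is a simplified illustration: the width is counted in
characters (all characters are assumed to have the same width), the
two symbol sets are small hypothetical samples, and the break point is
simply moved backward when a rule would be violated.

```python
# Symbols which must not start a line and symbols which must not end
# a line. Small samples for illustration only.
NO_LINE_START = set("、。）」")
NO_LINE_END = set("（「")


def wrap(text, width):
    """Break text into lines of at most `width` characters. A line may
    be broken anywhere except before a NO_LINE_START symbol or after a
    NO_LINE_END symbol."""
    if width < 1:
        raise ValueError("width must be at least 1")
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Move the break point backward until both rules hold, but
        # keep at least one character on the line.
        while end > start + 1 and (
            text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END
        ):
            end -= 1
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


for line in wrap("これは（例）です。", 4):
    print(line)
```

Moving the break point backward is only one possible policy. Software
for typesetting may instead let the symbol hang beyond the line end or
adjust the spacing.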
3.3.1.9. More Detailed Discussion
3.3.1.9.1. Width of characters
Unlike European languages, Japanese characters should be
written in a fixed width. Exceptions arise when two symbols
such as parentheses, periods, and commas follow each other. Kerning
should be done in such cases if the software is a word processor.
A text editor need not do so.
3.3.1.9.2. Writing direction
Japanese can be written in the vertical direction. A line goes
downward and the row of lines goes from right to left. This direction
is the traditional style. For example, most Japanese books, magazines,
and newspapers, except those in the field of natural science, are
written in the vertical direction.
A few Japanese characters need different glyphs for the vertical
direction. These are, reasonably enough, parentheses and the
'long syllable' symbol, which resembles a dash in English. The symbols
equivalent to the period and comma also have different styles for the
horizontal and vertical directions.
In Japan, Arabic numerals are widely used, as in European
languages, though we have Kanji numerals. Latin characters
can also appear in Japanese texts. If a run of 1 to 3 (or 4) Arabic
or Latin characters appears in Japanese vertical text, these characters
are crowded side by side into the space of one character. If more
characters appear (large numbers or long words), the paper is rotated
90 degrees anticlockwise and the characters are written in the European
way. This is not a very strong custom. Arabic and Latin characters can
always be written in either the normal or the rotated way in vertical
text.
Word processors should be able to write text in the vertical direction.
A version of Japanized TeX can use the vertical direction. It is a nice
piece of software and can do everything I wrote here.
3.3.1.9.3. Rubi
Rubi are small characters (usually 1/2 in length and 1/4 in area, or a
bit smaller) written above (in the horizontal direction) or to the
right of (in the vertical direction) the main text. They are usually
used to show the reading of difficult Kanji.
Japanized TeX can produce Rubi by using an extra macro. Word processors
should have a Rubi facility.
3.3.1.9.4. upper and lower case
Japanese characters do not have upper and lower case, although
there are two sets of phonograms, Hiragana and Katakana.
Hiragana is used for usual text. Katakana is used mainly to
express foreign or imported words, for example, KONPYU-TA
for computer, MAIKUROSOFUTO for Microsoft, and AINSYUTAIN for Einstein.
3.3.1.9.5. sorting
Phonograms (Hiragana and Katakana) have a sorting order.
The order is the same as the one defined in JIS X 0208, with a few
exceptions.
Sorting ideograms (Kanji) is difficult. They should be sorted
by their readings, but almost all Kanji have several readings depending
on the context. So if you want to sort Japanese text, you will need
a dictionary of all Japanese Kanji words. Moreover, a few
Japanese words written in Kanji have different readings for
exactly the same Kanji; this occurs especially with personal names.
So it is usual for address book databases to have two 'name' columns,
one for the Kanji expression and the other for Hiragana.
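The two-column approach can be sketched as follows. The records are
made-up sample names, and sorting the Hiragana column by code point is
an assumption which only approximates the proper order (the 'few
exceptions' mentioned above are not handled).

```python
# Each record holds the Kanji expression and its Hiragana reading.
# Sample data for illustration only.
records = [
    ("山田", "やまだ"),
    ("青木", "あおき"),
    ("田中", "たなか"),
]

# Sorting by the reading column gives the order a Japanese reader
# expects for this sample.
by_reading = sorted(records, key=lambda record: record[1])

# Sorting by the Kanji column only follows the character codes, which
# has nothing to do with the readings.
by_kanji = sorted(records, key=lambda record: record[0])

print([name for name, _ in by_reading])
print([name for name, _ in by_kanji])
```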
3.3.1.9.6. Ro-ma ji (Alphabetic expression of Japanese)
We have a phonetic alphabetic expression of Japanese, Ro-ma ji.
It has an almost one-to-one correspondence to the Japanese phonograms.
It can be used to display Japanese text on the Linux console and
so on. Since Japanese has many homophones, text in this expression
can be hard to read.
4. Output to Display
Here 'Output to Display' does not mean I18N of messages using gettext.
I am concerned with whether characters are outputted correctly so that
we can read them. For example, install the libcanna1g package and display
/usr/doc/libcanna1g/README.jp.gz on the console or in xterm (of course
after ungzipping). This text file is written in Japanese, but even
Japanese people cannot read such a row of strange characters. If you
were Japanese, which would you prefer: an English message which can be
read with a dictionary, or a row of strange characters which is
the result of gettext-ization? (Yes, there IS a way to display Japanese
characters correctly, kon for the console and kterm for X, and Japanese
people are happy with gettext-ized Japanese messages.)
Problems in displaying non-English characters are discussed below.
Since the mother tongue of the author is Japanese, the content may
be biased toward Japanese.
4.1. Console Software
Software running on the console is not responsible for displaying.
The console itself is responsible. There are terminal emulators
which can display non-English languages, such as kterm (Japanese),
krxvt, grxvt, and crxvt (Japanese, Greek, and Chinese, included
in the rxvt-ml package), cxterm (Chinese, Korean, and Japanese,
non-free), and so on, and there is software with which non-English
characters can be displayed on the console, such as kon2 (Japanese).
All that software on the console (where 'console' includes terminal
emulators and so on) has to do is to output the correct codes to the
console.
First, it is important not to destroy string data.
Sometimes this can be achieved merely by making the software
8bit-clean. '8bit-clean' means that the software does not destroy
the most significant bit (MSB) of the data it treats.
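What happens in software which is not 8bit-clean can be simulated in
a few lines. The sketch below is an illustration which assumes
Python's "euc_jp" codec; the function strip_msb() is hypothetical and
stands for any code path which masks bytes with 0x7f.

```python
def strip_msb(data):
    """Simulate software which is not 8bit-clean: clear the MSB of
    every byte."""
    return bytes(b & 0x7F for b in data)


original = "日本語".encode("euc_jp")  # all bytes have the MSB set
damaged = strip_msb(original)

# The damaged bytes are valid ASCII, so nothing signals an error, but
# the Japanese text has turned into unrelated ASCII characters.
print(original)
print(damaged.decode("ascii"))
```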
Next, be careful with software which sends control codes, such
as cursor positioning codes, every time it outputs 1 byte. Such
codes destroy the continuity of multibyte characters.
Be careful also about the destruction of multicolumn characters.
For example, when a string exceeds the width of the console,
the string is divided at the end of the line. Terminal emulators
should have a facility to avoid this 'excess of line width' type
of character destruction, but so far no terminal emulator
has such a facility. (There is only one exception, the shell mode
of Emacs. Unfortunately, however, the shell mode of Emacs is a dumb
terminal and much software cannot be run on it.) Thus each program
running on the console should be careful.
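One way for a program to be careful is to fold long strings itself, so
that a 2-column character never straddles the end of a line. The
following is a minimal sketch under the assumption that characters with
the Unicode East Asian Width property Wide or Fullwidth occupy 2 columns
and all others occupy 1. The function names are hypothetical.

```python
import unicodedata


def char_columns(ch):
    """Assumed column width: 2 for Wide/Fullwidth characters, else 1."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fold(text, width):
    """Split text into lines of at most `width` columns without cutting
    a multicolumn character in two."""
    if width < 2:
        raise ValueError("width must be at least 2 columns")
    lines = []
    current = ""
    used = 0
    for ch in text:
        w = char_columns(ch)
        if used + w > width:
            # The character does not fit: start a new line instead of
            # letting it straddle the line end.
            lines.append(current)
            current = ""
            used = 0
        current += ch
        used += w
    lines.append(current)
    return lines


print(fold("aあいう", 4))
```

With a width of 4 columns, the first line holds only 3 columns because
the next 2-column character would not fit. Cutting the EUC-JP bytes of
the same string after 4 bytes would instead split that character in
half.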
There is another reason to destroy multicolumn characters.
When a message is overwritten on another string, a part
of a character which is a part of a previous string can be
left not overwritten. This may be more troublesome than many
people would think because multicolumn character can be
written at every columns, not only at the multiple of the
width of the character.
Such destruction of a multibyte character may in turn
destroy the rest of the line following the character.
Whether this occurs depends on the internal implementation
of the console program. It can occur if the terminal emulator
does not keep columns, bytes, and characters properly distinct.
The shell mode of Emacs is the only example that does keep them
distinct, but there is no chance to overwrite characters in
the shell mode of Emacs, because it is a dumb terminal.
There is no standard for the number of columns a character occupies.
This can be a large problem for software that uses ncurses.
There is no 'right' way to solve this. Each piece of software has
to have this information for each character set. Consult section 2.6
for each language. Take care to distinguish between the numbers
of columns, bytes, and characters. For a subset of EUC-JP
(ASCII characters and JIS X 0208 kanji), the numbers of bytes and
columns are equal (a 1-byte character occupies 1 column and a
2-byte character occupies 2 columns).
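For that EUC-JP subset, the three numbers can be counted as in the
following sketch. The struct text_size and the function
measure_eucjp are hypothetical names; other codesets need their own
tables or rules.

```c
#include <assert.h>
#include <stddef.h>

struct text_size {
    size_t bytes;    /* bytes consumed */
    size_t chars;    /* characters */
    size_t columns;  /* columns occupied on the console */
};

/* Measures the first n bytes of s.
 * Assumption: s holds only ASCII and JIS X 0208 characters in EUC-JP.
 * Half-width katakana (2 bytes, 1 column) and JIS X 0212 (3 bytes,
 * 2 columns) are outside this subset and are not handled here. */
struct text_size measure_eucjp(const unsigned char *s, size_t n)
{
    struct text_size r = {0, 0, 0};

    assert(s != NULL);
    while (r.bytes < n) {
        size_t len = (s[r.bytes] & 0x80) ? 2 : 1;
        if (r.bytes + len > n)
            break;              /* truncated 2-byte character */
        r.bytes += len;
        r.columns += len;       /* bytes == columns in this subset */
        r.chars++;
    }
    return r;
}
```

For the 4-byte string 'a', 0xC6, 0xFC, 'b', this gives 4 bytes,
3 characters, and 4 columns.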
Another important point is that the string has to be converted
into a codeset that the console understands. So far no
console understands Unicode.
4.2. X Clients
X itself is already internationalized, so many languages can
be displayed if the fonts are properly prepared. Preparing the
fonts is the user's responsibility; all that software has
to do is select fonts carefully.
Though codesets other than ASCII often contain multiple character sets,
fonts for X are prepared for each character set. So a fontset, that is,
a set of fonts covering the set of character sets, should be used
instead of a single font.
For example, C programs using Xlib should use the functions
related to the XFontSet structure instead of those for the
XFontStruct structure.
Font              | FontSet
==================+====================
XFontStruct       | XFontSet
------------------+--------------------
XLoadFont()       | XCreateFontSet()
------------------+--------------------
XUnloadFont()     | XFreeFontSet()
------------------+--------------------
XQueryFont()      | XFontsOfFontSet()
------------------+--------------------
XDrawString()     | XmbDrawString()
XDrawString16()   | XwcDrawString()
------------------+--------------------
XDrawText()       | XmbDrawText()
XDrawText16()     | XwcDrawText()
------------------+--------------------
If software uses the left-hand functions, it has to be rewritten
to use the corresponding right-hand functions in the table. Note that
this table is not complete; it is only an example. Since the
right-hand functions use wide characters and multibyte characters
in C, setlocale() has to be called in advance.
The same problem exists for software using toolkits such as
Athena, GTK+, Qt, and so on.
5. Input from Keyboard
I18N of display is a prerequisite for I18N of keyboard input.
I18N of input is not necessary just for answering Yes/No. Most
Japanese-speaking people find it too troublesome to invoke the
input method, type the alphabetical representation of Japanese,
and convert it to Japanese characters merely to answer Y/N.
The same is probably true for Korean and Chinese. On the other hand,
software such as text editors, word processors, terminal emulators,
and shells should have I18N-ed input support.
5.1. Console Softwares
5.1.1. Invoked in the Console and Kon
Canna and Wnn are client/server-type Japanese input methods.
Wnn has variants for Korean and Chinese.
Each has its own protocol; there is no standard.
There is software that adds Japanese input to the console
by connecting the console to these input methods,
but these programs (canuum for Canna and uum for Wnn) are
not Debianized yet. A few programs can talk the Canna or
Wnn protocol directly, for example nvi-m17n-canna.
In a Debian system, such software 'depends' on the libcanna or wnn
packages.
GNU Emacs offers methods for inputting many languages,
such as Japanese, Chinese, Korean, Latin-{12345}, Russian,
Greek, Hebrew, Thai, Vietnamese, Czech, and so on,
in the console environment. XEmacs offers a similar
mechanism, but the set of supported languages is different.
We would be very happy if the input facility of (X)Emacs
became a library that other software could use. The author
does not know whether this can be achieved.
5.1.2. Invoked in an X Terminal Emulator
X has a standard for inputting various languages: XIM.
Kinput2 is software that connects Canna and/or Wnn to the XIM
protocol. Moreover, terminal emulators such as kterm and krxvt
can connect to XIM. So a way to input various languages
is available.
All that software running in a terminal emulator has to do is
accept the input properly.
First, it has to be made 8bit-clean.
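The X Window System has the XIM standard, and the program kinput2
mediates between XIM and individual conversion engines such as
Canna and Wnn. So for software used inside a terminal emulator,
making it 8bit-clean is enough as a quick hack. In this sense,
even bash (libreadlineg2) (2.02.1-1.6) and tcsh (6.08.01-3)
accept input of 2-byte characters.
At this stage, however, the software is not aware of 2-byte
characters, so the user has to perform the operation twice to
move the cursor across a 2-byte character or to erase one. If the
user gets this wrong, the input string is corrupted and cannot
be recovered.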
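A line editor that is aware of 2-byte characters can erase a whole
character in one operation. The following is a minimal sketch, not
the code of bash or tcsh. It assumes that the buffer holds valid
EUC-JP text consisting only of ASCII and JIS X 0208 characters, where
both bytes of a 2-byte character have the MSB set. The function name
delete_last_char is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Erases the last character of the NUL-terminated buffer buf, whose
 * current length is len bytes, and returns the new length.
 * Assumption: buf holds valid EUC-JP text made of ASCII and
 * JIS X 0208 characters only. */
size_t delete_last_char(unsigned char *buf, size_t len)
{
    size_t drop;

    assert(buf != NULL);
    if (len == 0)
        return 0;
    /* In this subset both bytes of a 2-byte character have the MSB set. */
    drop = (len >= 2 && (buf[len - 1] & 0x80) && (buf[len - 2] & 0x80))
           ? 2 : 1;
    len -= drop;
    buf[len] = '\0';
    return len;
}
```

Cursor movement can be handled in the same way, by stepping 2 bytes
(and 2 columns) over a 2-byte character.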
5.2. X Clients
What is required is:
* To accept input from XIM. Accepting On-the-Spot conversion
is desirable but not mandatory.
* To accept pasted text that contains Compound Text.
6. Internal Processing and File I/O
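From the user's point of view, any kind of internal processing
is acceptable as long as input and output are done correctly.
From the programmer's point of view, using C wide characters or
UNICODE gives the following advantages without knowledge of
any specific codeset:
* The number of characters can be counted correctly.
* There is no danger of splitting a multibyte character in the middle.
The encoding of wide characters (the contents of wchar_t) is
implementation-dependent, so the character set cannot be determined
by examining the value of a wchar_t variable. Since the character set
has to be known in order to find, for example, the number of columns,
information on the character set has to be stored separately
in such cases.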
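Counting characters without codeset-specific knowledge can be
sketched with the standard C function mbrtowc(). The function name
count_chars is hypothetical. The caller is assumed to have called
setlocale(LC_CTYPE, "") beforehand, so that the multibyte encoding
is the one of the user's locale.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Counts the characters in the multibyte string s according to the
 * current LC_CTYPE locale. Returns (size_t)-1 if s contains an
 * invalid or incomplete multibyte sequence. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t count = 0;
    size_t left;

    assert(s != NULL);
    left = strlen(s);
    memset(&st, 0, sizeof st);          /* initial conversion state */
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2)
            return (size_t)-1;
        if (n == 0)
            break;                      /* NUL character; not expected here */
        s += n;                         /* always advance by a whole character */
        left -= n;
        count++;
    }
    return count;
}
```

Because the loop always advances by the length that mbrtowc()
reports, it never stops in the middle of a multibyte character.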
UNICODE
7. Other Special Topics
7.1. Mail/News
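For historical reasons, the headers and bodies of mail and news
must use 7-bit codes. To satisfy this requirement,
different methods are used for the body and for the headers.
In the body, ISO 2022-JP is used to write Japanese.
Please tell me how other languages solve this problem.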
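The 7-bit requirement can be illustrated with a small check. This is
only a sketch; the names is_7bit, nihon_iso2022jp, and nihon_eucjp
are hypothetical. The two arrays hold the same two kanji, once in
ISO 2022-JP and once in EUC-JP.

```c
#include <assert.h>
#include <stddef.h>

/* Two kanji in ISO 2022-JP: ESC $ B switches to JIS X 0208,
 * ESC ( B switches back to ASCII. Every byte is below 0x80. */
const unsigned char nihon_iso2022jp[] = {
    0x1B, '$', 'B', 0x46, 0x7C, 0x4B, 0x5C, 0x1B, '(', 'B'
};

/* The same two kanji in EUC-JP: every byte has the MSB set. */
const unsigned char nihon_eucjp[] = { 0xC6, 0xFC, 0xCB, 0xDC };

/* Returns 1 if none of the n bytes of s has the MSB set, else 0. */
int is_7bit(const unsigned char *s, size_t n)
{
    assert(s != NULL);
    for (size_t i = 0; i < n; i++)
        if (s[i] & 0x80)
            return 0;
    return 1;
}
```

The ISO 2022-JP form passes the check and the EUC-JP form does not,
which is why the former can be used in the body.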
The codeset used in the body is specified in the Content-Type header.
However, when non-ASCII characters are wanted in a header itself
(for example Subject or From), the codeset cannot be
specified in a separate place. MIME (RFC-*) is used for
this purpose.