Internationalization

Wide characters

Earlier in this section we looked at the encoding scheme used for the multibyte characters that are needed to represent Asian-language ideograms. Because single-byte characters can be intermixed with multibyte characters, the sequence of bytes that encodes an ideogram must be self-identifying: regardless of the supplementary code set used, each byte of a multibyte character has its high-order bit set. Any byte of a multibyte character can therefore always be distinguished from a member of the primary, 7-bit US ASCII code set, whose high-order bit is clear (0). Each character in code set 2 or 3 is also preceded by a shift byte (SS2 or SS3, respectively); so if code set 1 were dedicated to a single-byte character set, either code set 2 or code set 3 could still be used to represent multibyte characters. Given these encodings, a program that needs the next character can tell from the next byte whether it is a single-byte character or the first byte of a multibyte character. In the latter case, the program must continue to read bytes until the character is complete.
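The test described above can be sketched in a few lines of C. This is only an illustration: the function name euc_codeset is hypothetical, and the values 0x8E and 0x8F are the conventional EUC single-shift bytes, which this section does not itself spell out. The sketch also makes no attempt to reject control bytes in the range 0x80 to 0x9F other than the two shift bytes.

```c
#define SS2 0x8E  /* single-shift 2: introduces a code set 2 character */
#define SS3 0x8F  /* single-shift 3: introduces a code set 3 character */

/* Hypothetical helper: return the EUC code set (0-3) of the character
   that starts with byte b. */
int euc_codeset(unsigned char b)
{
    if ((b & 0x80) == 0)
        return 0;   /* high-order bit clear: 7-bit US ASCII */
    if (b == SS2)
        return 2;   /* shift byte precedes a code set 2 character */
    if (b == SS3)
        return 3;   /* shift byte precedes a code set 3 character */
    return 1;       /* high-order bit set, no shift byte: code set 1 */
}
```

How many bytes follow the first one depends on the widths configured for each code set in the current locale, so a real program would normally leave that decision to mblen or mbtowc.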

Some of the inconvenience of handling multibyte characters would be eliminated, of course, if all characters occupied the same number of bytes. ANSI C provides the implementation-defined integral type wchar_t to let you manipulate variable-width characters as uniformly sized data objects called wide characters. Because an Asian-language character set can contain thousands or tens of thousands of ideograms, a 32-bit integral value is needed to hold all members. wchar_t is defined in the headers <stdlib.h> and <wchar.h> as a typedef for a 32-bit signed integer.

Implementations provide appropriate libraries with functions that you can use to manage multibyte and wide characters. We will look at these functions below.

For each wide character there is a corresponding EUC representation and vice versa; the wide character that corresponds to a regular single-byte character has the same numeric value as its single-byte value, including the null character. There is no guarantee that the value of the macro EOF can be stored in a wchar_t, just as EOF might not be representable as a char.

EUC and corresponding 32-bit wide-character representation

Code set	EUC code representation	Wide-character representation
0	`0xxxxxxx`	`0000000000000000000000000xxxxxxx`
1	`1xxxxxxx`	`0011000000000000000000000xxxxxxx`
	`1xxxxxxx1xxxxxxx`	`001100000000000000xxxxxxxxxxxxxx`
	`1xxxxxxx1xxxxxxx1xxxxxxx`	`00110000000xxxxxxxxxxxxxxxxxxxxx`
2	`SS2 1xxxxxxx`	`0001000000000000000000000xxxxxxx`
	`SS2 1xxxxxxx1xxxxxxx`	`000100000000000000xxxxxxxxxxxxxx`
	`SS2 1xxxxxxx1xxxxxxx1xxxxxxx`	`00010000000xxxxxxxxxxxxxxxxxxxxx`
3	`SS3 1xxxxxxx`	`0010000000000000000000000xxxxxxx`
	`SS3 1xxxxxxx1xxxxxxx`	`001000000000000000xxxxxxxxxxxxxx`
	`SS3 1xxxxxxx1xxxxxxx1xxxxxxx`	`00100000000xxxxxxxxxxxxxxxxxxxxx`
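As a rough illustration of the table, the following sketch packs the bytes of one EUC character into a 32-bit value. The function name euc_pack is hypothetical, and the sketch assumes that the first byte of the character supplies the most significant group of seven bits, which the table does not state explicitly. In practice mbtowc performs this conversion for you.

```c
#include <stdint.h>

/* Hypothetical helper: build the 32-bit wide-character value for one EUC
   character. codeset is 0-3; bytes holds the n (1-3) bytes of the
   character, not including any SS2 or SS3 shift byte. */
uint32_t euc_pack(int codeset, const unsigned char *bytes, int n)
{
    /* Top four bits from the table: 0000, 0011, 0001, 0010. */
    static const uint32_t tag[4] = {
        0x00000000UL, 0x30000000UL, 0x10000000UL, 0x20000000UL
    };
    uint32_t wc = 0;
    int i;

    for (i = 0; i < n; i++)
        wc = (wc << 7) | (uint32_t)(bytes[i] & 0x7F);  /* drop the high bit */
    return tag[codeset] | wc;
}
```

For example, under these assumptions the two-byte code set 1 character 0xA4 0xA2 becomes 0x30001222, and the ASCII character 'A' keeps its single-byte value 0x41.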

Most of the functions provided let you convert multibyte characters into wide characters and back again. Before we turn to the functions, we should note that most application programs will not need to convert multibyte characters to wide characters in the first place. Programs such as diff, for example, read in and write out multibyte characters, needing only to check for an exact byte-for-byte match. More complicated programs such as grep, which use regular expression pattern matching, may need to understand multibyte characters, but only the common set of functions that manages the regular expression needs this knowledge. The program grep itself requires no other special multibyte character handling. Finally, note that except for libc, the libraries described below are archives, not shared objects; they cannot be dynamically linked with your program.

Multibyte and wide-character conversion

ANSI C provides five library functions that manage multibyte and wide characters:

mblen length of next multibyte character
mbtowc convert multibyte character to wide character
wctomb convert wide character to multibyte character
mbstowcs convert multibyte character string to wide character string
wcstombs convert wide character string to multibyte character string

The first three functions are described on the mbchar(3C) manual page, the last two on the mbstring(3C) page.
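A typical use of mblen is to step through a multibyte string one character at a time. The sketch below counts characters rather than bytes; the function name count_mbchars is hypothetical, and the caller is assumed to have selected the appropriate locale with setlocale beforehand (in the default "C" locale every character is a single byte).

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters (not bytes) in the multibyte
   string s. Returns -1 if an invalid byte sequence is found. */
long count_mbchars(const char *s)
{
    size_t left = strlen(s);
    long count = 0;

    (void)mblen(NULL, 0);           /* reset any shift state */
    while (left > 0) {
        int n = mblen(s, left);     /* bytes in the next character */
        if (n <= 0)
            return -1;              /* invalid or incomplete character */
        s += n;
        left -= (size_t)n;
        count++;
    }
    return count;
}
```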

Input/output

Since most programs will convert between multibyte and wide characters just before or after performing I/O, libc provides routines that let you manage the conversion within the I/O function itself as if the input or output stream were wide characters instead of multibyte characters. fgetwc, for instance, reads bytes from a stream until a complete EUC character has been seen and returns it in its wide-character representation. fgetws does the same thing for strings; fputwc and fputws are the corresponding write versions. Of course, these routines and others are functionally similar to the Intro(3S) functions; they differ only in their handling of EUC representations. See their manual pages for details. Here is a look at how you can expect the functions to work.

Given the following declarations

   #include <stdio.h>
   #include <wchar.h>
   
   wchar_t s1[BUFSIZ];  /* declare array s1 to store wide characters */
   char    s2[BUFSIZ];  /* declare array s2 of characters for EUC
                           representation */

a multibyte string can be input into s1 using fgetws:

   fgetws(s1, BUFSIZ, stdin);  /* read EUC string from stdin and
                                  convert to process code string in s1 */

fgets and mbstowcs:

   fgets(s2, BUFSIZ, stdin);   /* read EUC string from stdin into s2 */
   mbstowcs(s1, s2, BUFSIZ);   /* convert EUC string in s2 to process
                                  code string in s1 */

the %S conversion specifier for scanf:

   scanf("%S", s1);     /* read EUC string from stdin and convert
                           to process code string in s1 */

the %s conversion specifier for scanf and mbstowcs:

   scanf("%s", s2);            /* read EUC string from stdin into s2 */
   mbstowcs(s1, s2, BUFSIZ);   /* convert EUC string in s2 to process
                                  code string in s1 */

You can use fputws, wcstombs, and the %S conversion specifier for printf (see fprintf(3S)) in the same way for output.
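The fragments above ignore return values for brevity. Both mbstowcs and wcstombs return (size_t)-1 when they meet an invalid character, and neither writes a terminating null if the destination fills up first. A minimal sketch of checked wrappers follows; the names mb_to_wide and wide_to_mb are hypothetical.

```c
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert the multibyte string src into dst, which
   has room for n wide characters. Returns the number of wide characters
   stored (excluding the terminator), or -1 on an invalid sequence or if
   the result, with its terminator, does not fit. */
long mb_to_wide(wchar_t *dst, const char *src, size_t n)
{
    size_t r = mbstowcs(dst, src, n);

    if (r == (size_t)-1 || r >= n)
        return -1;
    return (long)r;
}

/* Hypothetical wrapper: the reverse conversion; n is the size of dst in
   bytes. Returns the number of bytes stored, or -1 on failure. */
long wide_to_mb(char *dst, const wchar_t *src, size_t n)
{
    size_t r = wcstombs(dst, src, n);

    if (r == (size_t)-1 || r >= n)
        return -1;
    return (long)r;
}
```

A program might call mb_to_wide on a line obtained with fgets, process the wide characters, and then call wide_to_mb before writing the result with fputs.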

Character classification and conversion

Single- and multibyte character classification and conversion functions are provided in libc. You can use these routines to test 7-bit US ASCII characters, for instance, in their wide-character representations, or to determine whether multibyte characters are ideograms, phonograms, or the like. See the wctype(3C) and wconv(3C) manual pages for details.

As noted, these routines are declared in the <wchar.h> header.
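As an illustration, the sketch below classifies and converts characters in their wide-character form. The function names count_wdigits and upcase_wide are hypothetical. On current ISO C systems iswdigit and towupper are declared in <wctype.h>, so the sketch includes that header as well as <wchar.h>; treat the exact header as system-dependent.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the decimal digits in a wide string. */
size_t count_wdigits(const wchar_t *ws)
{
    size_t n = 0;

    for (; *ws != L'\0'; ws++)
        if (iswdigit((wint_t)*ws))
            n++;
    return n;
}

/* Hypothetical helper: convert a wide string to upper case in place.
   Characters with no upper-case form are left unchanged. */
void upcase_wide(wchar_t *ws)
{
    for (; *ws != L'\0'; ws++)
        *ws = (wchar_t)towupper((wint_t)*ws);
}
```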

curses support

32-bit versions of certain UNIX System V Release 4 (SVR4) curses functions are provided in libocurses and declared in <ocurses.h>. Check the curses(3ocurses) manual page for some of the things you need to look out for in using these functions.

The POSIX curses library (libcurses) supports the wide character functions specified in the POSIX standard. See Intro(3curses).

C language features

To give even more flexibility to the programmer in an Asian environment, ANSI C provides 32-bit wide character constants and wide string literals. These have the same form as their non-wide versions except that they are immediately prefixed by the letter L:

'x': regular character constant
'¥': regular character constant
L'x': wide character constant
L'¥': wide character constant
"abc¥xyz": regular string literal
L"abc¥xyz": wide string literal

Note that multibyte characters are valid in both the regular and wide versions. The sequence of bytes necessary to produce the ideogram ¥ is encoding-specific, but if it consists of more than one byte, the value of the character constant '¥' is implementation-defined, just as the value of 'ab' is implementation-defined. A regular string literal contains exactly the bytes (except for escape sequences) specified between the quotes, including the bytes of each specified multibyte character.

When the compilation system encounters a wide character constant or wide string literal, each multibyte character is converted (as if by calling the mbtowc function) into a wide character. Thus the type of L'¥' is wchar_t and the type of L"abc¥xyz" is array of wchar_t with length eight. (Just as with regular string literals, each wide string literal has an extra zero-valued element appended, but in these cases it is a wchar_t with value zero.)
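The element count can be checked with sizeof. The sketch below uses an ASCII-only literal so that it does not depend on the source file's encoding; the names sample, sample_elems, and sample_len are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Six characters plus the terminating wide null: seven elements. */
static const wchar_t sample[] = L"abcxyz";

/* Number of wchar_t elements in the array, terminator included. */
size_t sample_elems(void)
{
    return sizeof sample / sizeof sample[0];
}

/* Number of wide characters before the terminator. */
size_t sample_len(void)
{
    return wcslen(sample);
}
```

Here sample_elems returns 7 and sample_len returns 6, regardless of how many bytes each wchar_t occupies.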

Just as regular string literals can be used as a short-hand method for character array initialization, wide string literals can be used to initialize wchar_t arrays:

   wchar_t *wp = L"a¥z";
   wchar_t x[] = L"a¥z";
   wchar_t y[] = {L'a', L'¥', L'z', 0};
   wchar_t z[] = {'a', L'¥', 'z', '\0'};

In the above example, the three arrays x, y, and z, as well as the array pointed to by wp, have the same length, and all are initialized with identical values.

Adjacent wide string literals will be concatenated, just as with regular string literals. Adjacent regular and wide string literals produce undefined behavior.
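The following sketch checks both points with ASCII-only literals: a wide literal, two concatenated wide literals, and an element-by-element initializer that mixes regular and wide character constants all produce the same array. The names whole, parts, elems, and literals_match are hypothetical.

```c
#include <wchar.h>

static const wchar_t whole[] = L"abcxyz";
static const wchar_t parts[] = L"abc" L"xyz";   /* adjacent wide literals */
static const wchar_t elems[] = {L'a', 'b', L'c', L'x', 'y', L'z', 0};

/* Hypothetical check: return 1 if all three arrays have the same size
   and the same contents, 0 otherwise. */
int literals_match(void)
{
    size_t n = sizeof whole / sizeof whole[0];

    return sizeof whole == sizeof parts
        && sizeof whole == sizeof elems
        && wmemcmp(whole, parts, n) == 0
        && wmemcmp(whole, elems, n) == 0;
}
```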