Tcl_SetDefaultEncodingDir(3tcl)
Tcl_GetEncoding(3) Tcl Library Procedures Tcl_GetEncoding(3)
_________________________________________________________________
NAME
Tcl_GetEncoding, Tcl_FreeEncoding, Tcl_GetEncodingFromObj,
Tcl_ExternalToUtfDString, Tcl_ExternalToUtf,
Tcl_UtfToExternalDString, Tcl_UtfToExternal,
Tcl_WinTCharToUtf, Tcl_WinUtfToTChar, Tcl_GetEncodingName,
Tcl_SetSystemEncoding, Tcl_GetEncodingNameFromEnvironment,
Tcl_GetEncodingNames, Tcl_CreateEncoding,
Tcl_GetEncodingSearchPath, Tcl_SetEncodingSearchPath,
Tcl_GetDefaultEncodingDir, Tcl_SetDefaultEncodingDir - pro-
cedures for creating and using encodings
SYNOPSIS
#include <tcl.h>
Tcl_Encoding
Tcl_GetEncoding(interp, name)
void
Tcl_FreeEncoding(encoding)
int
Tcl_GetEncodingFromObj(interp, objPtr, encodingPtr)
char *
Tcl_ExternalToUtfDString(encoding, src, srcLen, dstPtr)
char *
Tcl_UtfToExternalDString(encoding, src, srcLen, dstPtr)
int
Tcl_ExternalToUtf(interp, encoding, src, srcLen, flags, statePtr,
dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)
int
Tcl_UtfToExternal(interp, encoding, src, srcLen, flags, statePtr,
dst, dstLen, srcReadPtr, dstWrotePtr, dstCharsPtr)
char *
Tcl_WinTCharToUtf(tsrc, srcLen, dstPtr)
TCHAR *
Tcl_WinUtfToTChar(src, srcLen, dstPtr)
const char *
Tcl_GetEncodingName(encoding)
int
Tcl_SetSystemEncoding(interp, name)
const char *
Tcl Last change: 8.1 1
Tcl_GetEncodingNameFromEnvironment(bufPtr)
void
Tcl_GetEncodingNames(interp)
Tcl_Encoding
Tcl_CreateEncoding(typePtr)
Tcl_Obj *
Tcl_GetEncodingSearchPath()
int
Tcl_SetEncodingSearchPath(searchPath)
const char *
Tcl_GetDefaultEncodingDir(void)
void
Tcl_SetDefaultEncodingDir(path)
ARGUMENTS
Tcl_Interp *interp (in)
    Interpreter to use for error reporting, or NULL if no error
    reporting is desired.
const char *name (in)
    Name of encoding to load.
Tcl_Encoding encoding (in)
    The encoding to query, free, or use for converting text. If
    encoding is NULL, the current system encoding is used.
Tcl_Obj *objPtr (in)
    Name of encoding to get token for.
Tcl_Encoding *encodingPtr (out)
    Points to storage where encoding token is to be written.
const char *src (in)
    For the Tcl_ExternalToUtf functions, an array of bytes in the
    specified encoding that are to be converted to UTF-8. For the
    Tcl_UtfToExternal and Tcl_WinUtfToTChar functions, an array of
    UTF-8 characters to be converted to the specified encoding.
const TCHAR *tsrc (in)
    An array of Windows TCHAR characters to convert to UTF-8.
int srcLen (in)
    Length of src or tsrc in bytes. If the length is negative, the
    encoding-specific length of the string is used.
Tcl_DString *dstPtr (out)
    Pointer to an uninitialized or free Tcl_DString in which the
    converted result will be stored.
int flags (in)
    Various flag bits OR-ed together. TCL_ENCODING_START signifies
    that the source buffer is the first block in a (potentially
    multi-block) input stream, telling the conversion routine to
    reset to an initial state and perform any initialization that
    needs to occur before the first byte is converted.
    TCL_ENCODING_END signifies that the source buffer is the last
    block in a (potentially multi-block) input stream, telling the
    conversion routine to perform any finalization that needs to
    occur after the last byte is converted and then to reset to an
    initial state. TCL_ENCODING_STOPONERROR signifies that the
    conversion routine should return immediately upon reading a
    source character that does not exist in the target encoding;
    otherwise a default fallback character will automatically be
    substituted.
Tcl_EncodingState *statePtr (in/out)
    Used when converting a (generally long or indefinite length)
    byte stream in a piece-by-piece fashion. The conversion routine
    stores its current state in *statePtr after src (the buffer
    containing the current piece) has been converted; that state
    information must be passed back when converting the next piece
    of the stream so the conversion routine knows what state it was
    in when it left off at the end of the last piece. May be NULL,
    in which case the value specified for flags is ignored and the
    source buffer is assumed to contain the complete string to
    convert.
char *dst (out)
    Buffer in which the converted result will be stored. No more
    than dstLen bytes will be stored in dst.
int dstLen (in)
    The maximum length of the output buffer dst in bytes.
int *srcReadPtr (out)
    Filled with the number of bytes from src that were actually
    converted. This may be less than the original source length if
    there was a problem converting some source characters. May be
    NULL.
int *dstWrotePtr (out)
    Filled with the number of bytes that were actually stored in
    the output buffer as a result of the conversion. May be NULL.
int *dstCharsPtr (out)
    Filled with the number of characters that correspond to the
    number of bytes stored in the output buffer. May be NULL.
Tcl_DString *bufPtr (out)
    Storage for the prescribed system encoding name.
const Tcl_EncodingType *typePtr (in)
    Structure that defines a new type of encoding.
Tcl_Obj *searchPath (in)
    List of filesystem directories in which to search for encoding
    data files.
const char *path (in)
    A path to the location of the encoding file.
_________________________________________________________________
INTRODUCTION
These routines convert between Tcl's internal character
representation, UTF-8, and character representations used by
various operating systems or file systems, such as Unicode,
ASCII, or Shift-JIS. When operating on strings, such as
obtaining the names of files or displaying characters
using international fonts, the strings must be
translated into one or possibly multiple formats that the
various system calls can use. For instance, on a Japanese
Unix workstation, a user might obtain a filename represented
in the EUC-JP file encoding and then translate the charac-
ters to the jisx0208 font encoding in order to display the
filename in a Tk widget. The purpose of the encoding pack-
age is to help bridge the translation gap. UTF-8 provides
an intermediate staging ground for all the various encod-
ings. In the example above, text would be translated into
UTF-8 from whatever file encoding the operating system is
using. Then it would be translated from UTF-8 into whatever
font encoding the display routines require.
Some basic encodings are compiled into Tcl. Others can be
defined by the user or dynamically loaded from encoding
files in a platform-independent manner.
DESCRIPTION
Tcl_GetEncoding finds an encoding given its name. The name
may refer to a built-in Tcl encoding, a user-defined encod-
ing registered by calling Tcl_CreateEncoding, or a
dynamically-loadable encoding file. The return value is a
token that represents the encoding and can be used in subse-
quent calls to procedures such as Tcl_GetEncodingName,
Tcl_FreeEncoding, and Tcl_UtfToExternal. If the name did
not refer to any known or loadable encoding, NULL is
returned and an error message is left in interp.
The encoding package maintains a database of all encodings
currently in use. The first time name is seen,
Tcl_GetEncoding returns an encoding with a reference count
of 1. If the same name is requested again, the
reference count for that encoding is incremented without the
overhead of allocating a new encoding and all its associated
data structures.
When an encoding is no longer needed, Tcl_FreeEncoding
should be called to release it. When an encoding is no
longer in use anywhere (i.e., it has been freed as many
times as it has been gotten) Tcl_FreeEncoding will release
all storage the encoding was using and delete it from the
database.
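The get/free pairing described above can be pictured with a toy registry. This is only a sketch of the counting discipline and not Tcl's implementation: the names DemoEncoding, demo_get, and demo_free are hypothetical, and the fixed-size table is an assumption made to keep the example short.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an encoding record; not Tcl's real structure. */
typedef struct DemoEncoding {
    char name[32];
    int refCount;               /* 0 means the slot is unused */
} DemoEncoding;

#define DEMO_SLOTS 8
static DemoEncoding demo_table[DEMO_SLOTS];

/* Return the entry for name, creating it with a count of 1 the first
 * time the name is seen and incrementing the count on later requests.
 * Returns NULL if the name is too long or the table is full. */
DemoEncoding *demo_get(const char *name)
{
    DemoEncoding *freeSlot = NULL;
    size_t i;

    if (strlen(name) >= sizeof(demo_table[0].name)) {
        return NULL;
    }
    for (i = 0; i < DEMO_SLOTS; i++) {
        if (demo_table[i].refCount > 0) {
            if (strcmp(demo_table[i].name, name) == 0) {
                demo_table[i].refCount++;
                return &demo_table[i];
            }
        } else if (freeSlot == NULL) {
            freeSlot = &demo_table[i];
        }
    }
    if (freeSlot != NULL) {
        strcpy(freeSlot->name, name);
        freeSlot->refCount = 1;
    }
    return freeSlot;
}

/* Drop one reference; the slot becomes reusable when the count hits 0. */
void demo_free(DemoEncoding *enc)
{
    if (enc != NULL && enc->refCount > 0 && --enc->refCount == 0) {
        enc->name[0] = '\0';
    }
}
```

In this model, two demo_get calls for the same name yield the same pointer with a count of 2, and the entry only disappears after two matching demo_free calls.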
Tcl_GetEncodingFromObj treats the string representation of
objPtr as an encoding name, and finds an encoding with that
name, just as Tcl_GetEncoding does. When an encoding is
found, it is cached within the objPtr value for future
reference, the Tcl_Encoding token is written to the storage
pointed to by encodingPtr, and the value TCL_OK is returned.
If no such encoding is found, the value TCL_ERROR is
returned, and no writing to *encodingPtr takes place. Just
as with Tcl_GetEncoding, the caller should call
Tcl_FreeEncoding on the resulting encoding token when that
token will no longer be used.
Tcl_ExternalToUtfDString converts a source buffer src from
the specified encoding into UTF-8. The converted bytes are
stored in dstPtr, which is then null-terminated. The caller
should eventually call Tcl_DStringFree to free any informa-
tion stored in dstPtr. When converting, if any of the char-
acters in the source buffer cannot be represented in the
target encoding, a default fallback character will be used.
The return value is a pointer to the value stored in the
DString.
Tcl_ExternalToUtf converts a source buffer src from the
specified encoding into UTF-8. Up to srcLen bytes are con-
verted from the source buffer and up to dstLen converted
bytes are stored in dst. In all cases, *srcReadPtr is
filled with the number of bytes that were successfully con-
verted from src and *dstWrotePtr is filled with the
corresponding number of bytes that were stored in dst. The
return value is one of the following:
TCL_OK
    All bytes of src were converted.
TCL_CONVERT_NOSPACE
    The destination buffer was not large enough for all of the
    converted data; as many characters as could fit were converted
    though.
TCL_CONVERT_MULTIBYTE
    The last few bytes in the source buffer were the beginning of a
    multibyte sequence, but more bytes were needed to complete this
    sequence. A subsequent call to the conversion routine should
    pass a buffer containing the unconverted bytes that remained in
    src plus some further bytes from the source stream to properly
    convert the formerly split-up multibyte sequence.
TCL_CONVERT_SYNTAX
    The source buffer contained an invalid character sequence. This
    may occur if the input stream has been damaged or if the input
    encoding method was misidentified.
TCL_CONVERT_UNKNOWN
    The source buffer contained a character that could not be
    represented in the target encoding and TCL_ENCODING_STOPONERROR
    was specified.
Tcl_UtfToExternalDString converts a source buffer src from
UTF-8 into the specified encoding. The converted bytes are
stored in dstPtr, which is then terminated with the
appropriate encoding-specific null. The caller should even-
tually call Tcl_DStringFree to free any information stored
in dstPtr. When converting, if any of the characters in the
source buffer cannot be represented in the target encoding,
a default fallback character will be used. The return value
is a pointer to the value stored in the DString.
Tcl_UtfToExternal converts a source buffer src from UTF-8
into the specified encoding. Up to srcLen bytes are con-
verted from the source buffer and up to dstLen converted
bytes are stored in dst. In all cases, *srcReadPtr is
filled with the number of bytes that were successfully con-
verted from src and *dstWrotePtr is filled with the
corresponding number of bytes that were stored in dst. The
return values are the same as the return values for
Tcl_ExternalToUtf.
Tcl_WinUtfToTChar and Tcl_WinTCharToUtf are Windows-only
convenience functions for converting between UTF-8 and Win-
dows strings. On Windows 95 (as with the Unix operating
system), all strings exchanged between Tcl and the operating
system are "char" based. On Windows NT, some strings
exchanged between Tcl and the operating system are "char"
oriented while others are in Unicode. By convention, in
Windows a TCHAR is a character in the ANSI code page on Win-
dows 95 and a Unicode character on Windows NT.
If you planned to use the same "char" based interfaces on
both Windows 95 and Windows NT, you could use
Tcl_UtfToExternal and Tcl_ExternalToUtf (or their
Tcl_DString equivalents) with an encoding of NULL (the
current system encoding). On the other hand, if you planned
to use the Unicode interface when running on Windows NT and
the "char" interfaces when running on Windows 95, you would
have to perform the following type of test over and over in
your program (as represented in pseudo-code):
if (running NT) {
encoding <- Tcl_GetEncoding("unicode");
nativeBuffer <- Tcl_UtfToExternal(encoding, utfBuffer);
Tcl_FreeEncoding(encoding);
} else {
nativeBuffer <- Tcl_UtfToExternal(NULL, utfBuffer);
}
Tcl_WinUtfToTChar and Tcl_WinTCharToUtf automatically handle
this test and use the proper encoding based on the current
operating system. Tcl_WinUtfToTChar returns a pointer to a
TCHAR string, and Tcl_WinTCharToUtf expects a TCHAR string
pointer as the src string. Otherwise, these functions
behave identically to Tcl_UtfToExternalDString and
Tcl_ExternalToUtfDString.
Tcl_GetEncodingName is roughly the inverse of
Tcl_GetEncoding. Given an encoding, the return value is the
name argument that was used to create the encoding. The
string returned by Tcl_GetEncodingName is only guaranteed to
persist until the encoding is deleted. The caller must not
modify this string.
Tcl_SetSystemEncoding sets the default encoding that should
be used whenever the user passes a NULL value for the encod-
ing argument to any of the other encoding functions. If
name is NULL, the system encoding is reset to the default
system encoding, binary. If the name did not refer to any
known or loadable encoding, TCL_ERROR is returned and an
error message is left in interp. Otherwise, this procedure
increments the reference count of the new system encoding,
decrements the reference count of the old system encoding,
and returns TCL_OK.
Tcl_GetEncodingNameFromEnvironment provides a means for the
Tcl library to report the encoding name it believes to be
the correct one to use as the system encoding, based on
system calls and examination of the environment suitable for
the platform. It accepts bufPtr, a pointer to an
uninitialized or freed Tcl_DString, and writes the encoding
name to it. The return value is the Tcl_DStringValue of bufPtr.
Tcl_GetEncodingNames sets the interp result to a list con-
sisting of the names of all the encodings that are currently
defined or can be dynamically loaded, searching the encoding
path specified by Tcl_SetDefaultEncodingDir. This procedure
does not ensure that the dynamically-loadable encoding files
contain valid data, but merely that they exist.
Tcl_CreateEncoding defines a new encoding and registers the
C procedures that are called back to convert between the
encoding and UTF-8. Encodings created by Tcl_CreateEncoding
are thereafter visible in the database used by
Tcl_GetEncoding. Just as with the Tcl_GetEncoding pro-
cedure, the return value is a token that represents the
encoding and can be used in subsequent calls to other encod-
ing functions. Tcl_CreateEncoding returns an encoding with
a reference count of 1. If an encoding with the specified
name already exists, then its entry in the database is
replaced with the new encoding; the token for the old encod-
ing will remain valid and continue to behave as before, but
users of the new token will now call the new encoding pro-
cedures.
The typePtr argument to Tcl_CreateEncoding contains informa-
tion about the name of the encoding and the procedures that
will be called to convert between this encoding and UTF-8.
It is defined as follows:
typedef struct Tcl_EncodingType {
const char *encodingName;
Tcl_EncodingConvertProc *toUtfProc;
Tcl_EncodingConvertProc *fromUtfProc;
Tcl_EncodingFreeProc *freeProc;
ClientData clientData;
int nullSize;
} Tcl_EncodingType;
The encodingName provides a string name for the encoding, by
which it can be referred to in other procedures such as
Tcl_GetEncoding. The toUtfProc refers to a callback pro-
cedure to invoke to convert text from this encoding into
UTF-8. The fromUtfProc refers to a callback procedure to
invoke to convert text from UTF-8 into this encoding. The
freeProc refers to a callback procedure to invoke when this
encoding is deleted. The freeProc field may be NULL. The
clientData contains an arbitrary one-word value passed to
toUtfProc, fromUtfProc, and freeProc whenever they are
called. Typically, this is a pointer to a data structure
containing encoding-specific information that can be used by
the callback procedures. For instance, two very similar
encodings such as ascii and macRoman may use the same call-
back procedure, but use different values of clientData to
control its behavior. The nullSize specifies the number of
zero bytes that signify end-of-string in this encoding. It
must be 1 (for single-byte or multi-byte encodings like
ASCII or Shift-JIS) or 2 (for double-byte encodings like
Unicode). Constant-sized encodings with 3 or more bytes per
character (such as CNS11643) are not accepted.
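The role of nullSize can be illustrated by the scan needed to find the end of a string when a negative srcLen asks for the encoding-specific length. The following is a minimal sketch under that reading; the helper name demo_encoded_length is hypothetical and the code is not taken from Tcl.

```c
#include <stddef.h>

/* Hypothetical helper: count the bytes that precede the terminator in
 * src, where the terminator is nullSize consecutive zero bytes found at
 * an offset that is a multiple of nullSize.  nullSize must be 1 or 2. */
size_t demo_encoded_length(const char *src, int nullSize)
{
    size_t len = 0;

    if (nullSize == 2) {
        /* Double-byte text: a lone zero byte is part of a character. */
        while (src[len] != '\0' || src[len + 1] != '\0') {
            len += 2;
        }
    } else {
        while (src[len] != '\0') {
            len++;
        }
    }
    return len;
}
```

With nullSize 2 the scan steps two bytes at a time, so the zero high byte of a character such as 'a' stored as 0x61 0x00 does not end the string early.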
The callback procedures toUtfProc and fromUtfProc should
match the type Tcl_EncodingConvertProc:
typedef int Tcl_EncodingConvertProc(
ClientData clientData,
const char *src,
int srcLen,
int flags,
Tcl_EncodingState *statePtr,
char *dst,
int dstLen,
int *srcReadPtr,
int *dstWrotePtr,
int *dstCharsPtr);
The toUtfProc and fromUtfProc procedures are called by the
Tcl_ExternalToUtf or Tcl_UtfToExternal family of functions
to perform the actual conversion. The clientData parameter
to these procedures is the same as the clientData field
specified to Tcl_CreateEncoding when the encoding was
created. The remaining arguments to the callback procedures
are the same as the arguments, documented at the top, to
Tcl_ExternalToUtf or Tcl_UtfToExternal, with the following
exceptions. If the srcLen argument to one of those high-
level functions is negative, the value passed to the call-
back procedure will be the appropriate encoding-specific
string length of src. If any of the srcReadPtr, dstWro-
tePtr, or dstCharsPtr arguments to one of the high-level
functions is NULL, the corresponding value passed to the
callback procedure will be a non-NULL location.
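As an illustration of the callback contract, here is a sketch of a toUtfProc-style function for ISO 8859-1 input. It is written to build without <tcl.h>, so the result codes (DEMO_OK, DEMO_CONVERT_NOSPACE) and the void pointer parameters are hypothetical stand-ins for TCL_OK, TCL_CONVERT_NOSPACE, ClientData, and Tcl_EncodingState; a real callback would use the Tcl types and constants.

```c
#include <stddef.h>

/* Hypothetical result codes standing in for TCL_OK and
 * TCL_CONVERT_NOSPACE; the real values come from <tcl.h>. */
enum { DEMO_OK, DEMO_CONVERT_NOSPACE };

/* Sketch of a toUtfProc-style callback for ISO 8859-1 input.  The
 * parameter list mirrors Tcl_EncodingConvertProc, with void pointers in
 * place of ClientData and Tcl_EncodingState.  clientData, flags and
 * statePtr are unused because this encoding is stateless. */
int demo_latin1_to_utf8(void *clientData, const char *src, int srcLen,
        int flags, void *statePtr, char *dst, int dstLen,
        int *srcReadPtr, int *dstWrotePtr, int *dstCharsPtr)
{
    int read = 0, wrote = 0, chars = 0, result = DEMO_OK;

    (void) clientData; (void) flags; (void) statePtr;
    while (read < srcLen) {
        unsigned char ch = (unsigned char) src[read];
        int need = (ch < 0x80) ? 1 : 2;

        if (wrote + need > dstLen) {
            /* Stop before writing a partial character. */
            result = DEMO_CONVERT_NOSPACE;
            break;
        }
        if (need == 1) {
            dst[wrote++] = (char) ch;
        } else {
            dst[wrote++] = (char) (0xC0 | (ch >> 6));
            dst[wrote++] = (char) (0x80 | (ch & 0x3F));
        }
        read++;
        chars++;
    }
    /* The callback always receives non-NULL output locations. */
    *srcReadPtr = read;
    *dstWrotePtr = wrote;
    *dstCharsPtr = chars;
    return result;
}
```

The three output counters are written unconditionally, which relies on the guarantee above that the callback never sees NULL for them even when the caller of the high-level function passed NULL.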
The callback procedure freeProc, if non-NULL, should match
the type Tcl_EncodingFreeProc:
typedef void Tcl_EncodingFreeProc(
ClientData clientData);
This freeProc function is called when the encoding is
deleted. The clientData parameter is the same as the
clientData field specified to Tcl_CreateEncoding when the
encoding was created.
Tcl_GetEncodingSearchPath and Tcl_SetEncodingSearchPath are
called to access and set the list of filesystem directories
searched for encoding data files.
The value returned by Tcl_GetEncodingSearchPath is the value
stored by the last successful call to
Tcl_SetEncodingSearchPath. If no calls to
Tcl_SetEncodingSearchPath have occurred, Tcl will compute an
initial value based on the environment. There is one
encoding search path for the entire process, shared by all
threads in the process.
Tcl_SetEncodingSearchPath stores searchPath and returns
TCL_OK, unless searchPath is not a valid Tcl list, which
causes TCL_ERROR to be returned. The elements of searchPath
are not verified as existing readable filesystem
directories. When searching for encoding data files takes
place, any non-existent or non-readable filesystem
directories on the searchPath are silently ignored.
Tcl_GetDefaultEncodingDir and Tcl_SetDefaultEncodingDir are
obsolete interfaces best replaced with calls to
Tcl_GetEncodingSearchPath and Tcl_SetEncodingSearchPath.
They are called to access and set the first element of the
searchPath list. Since Tcl searches searchPath for encoding
data files in list order, these routines establish the
"default" directory in which to find encoding data files.
ENCODING FILES
Space would prohibit precompiling into Tcl every possible
encoding algorithm, so many encodings are stored on disk as
dynamically-loadable encoding files. This behavior also
allows the user to create additional encoding files that can
be loaded using the same mechanism. These encoding files
contain information about the tables and/or escape sequences
used to map between an external encoding and Unicode. The
external encoding may consist of single-byte, multi-byte, or
double-byte characters.
Each dynamically-loadable encoding is represented as a text
file. The initial line of the file, beginning with a "#"
symbol, is a comment that provides a human-readable descrip-
tion of the file. The next line identifies the type of
encoding file. It can be one of the following letters:
[1] S
A single-byte encoding, where one character is always
one byte long in the encoding. An example is iso8859-
1, used by many European languages.
[2] D
A double-byte encoding, where one character is always
two bytes long in the encoding. An example is big5,
used for Chinese text.
[3] M
A multi-byte encoding, where one character may be
either one or two bytes long. Certain bytes are lead
bytes, indicating that another byte must follow and
that together the two bytes represent one character.
Other bytes are not lead bytes and represent them-
selves. An example is shiftjis, used by many Japanese
computers.
[4] E
An escape-sequence encoding, specifying that certain
sequences of bytes do not represent characters, but
commands that describe how following bytes should be
interpreted.
The rest of the lines in the file depend on the type.
Cases [1], [2], and [3] are collectively referred to as
table-based encoding files. The lines in a table-based
encoding file are in the same format as this example taken
from the shiftjis encoding (this is not the complete file):
# Encoding file: shiftjis, multi-byte
M
003F 0 40
00
0000000100020003000400050006000700080009000A000B000C000D000E000F
0010001100120013001400150016001700180019001A001B001C001D001E001F
0020002100220023002400250026002700280029002A002B002C002D002E002F
0030003100320033003400350036003700380039003A003B003C003D003E003F
0040004100420043004400450046004700480049004A004B004C004D004E004F
0050005100520053005400550056005700580059005A005B005C005D005E005F
0060006100620063006400650066006700680069006A006B006C006D006E006F
0070007100720073007400750076007700780079007A007B007C007D203E007F
0080000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000FF61FF62FF63FF64FF65FF66FF67FF68FF69FF6AFF6BFF6CFF6DFF6EFF6F
FF70FF71FF72FF73FF74FF75FF76FF77FF78FF79FF7AFF7BFF7CFF7DFF7EFF7F
FF80FF81FF82FF83FF84FF85FF86FF87FF88FF89FF8AFF8BFF8CFF8DFF8EFF8F
FF90FF91FF92FF93FF94FF95FF96FF97FF98FF99FF9AFF9BFF9CFF9DFF9EFF9F
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
81
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
300030013002FF0CFF0E30FBFF1AFF1BFF1FFF01309B309C00B4FF4000A8FF3E
FFE3FF3F30FD30FE309D309E30034EDD30053006300730FC20152010FF0F005C
301C2016FF5C2026202520182019201C201DFF08FF0930143015FF3BFF3DFF5B
FF5D30083009300A300B300C300D300E300F30103011FF0B221200B100D70000
00F7FF1D2260FF1CFF1E22662267221E22342642264000B0203220332103FFE5
FF0400A200A3FF05FF03FF06FF0AFF2000A72606260525CB25CF25CE25C725C6
25A125A025B325B225BD25BC203B301221922190219121933013000000000000
000000000000000000000000000000002208220B2286228722822283222A2229
000000000000000000000000000000002227222800AC21D221D4220022030000
0000000000000000000000000000000000000000222022A52312220222072261
2252226A226B221A223D221D2235222B222C0000000000000000000000000000
212B2030266F266D266A2020202100B6000000000000000025EF000000000000
The third line of the file is three numbers. The first
number is the fallback character (in base 16) to use when
converting from UTF-8 to this encoding. The second number
is a 1 if this file represents the encoding for a symbol
font, or 0 otherwise. The last number (in base 10) is how
many pages of data follow.
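Reading that third line could be sketched as follows. The struct and function names (DemoTableHeader, demo_parse_header) are hypothetical, and the sketch only covers the header line, not the pages that follow.

```c
#include <stdio.h>

/* Hypothetical holder for the three header numbers of a table-based
 * encoding file. */
typedef struct DemoTableHeader {
    unsigned int fallback;  /* fallback character, read as base 16 */
    int symbol;             /* 1 for a symbol font, 0 otherwise */
    int numPages;           /* number of pages that follow, base 10 */
} DemoTableHeader;

/* Parse a header line such as "003F 0 40".  Returns 1 on success and 0
 * if the line does not contain three numbers. */
int demo_parse_header(const char *line, DemoTableHeader *hdr)
{
    return sscanf(line, "%x %d %d", &hdr->fallback, &hdr->symbol,
            &hdr->numPages) == 3;
}
```

For the shiftjis excerpt, the line "003F 0 40" gives a fallback of 0x3F (the question mark), no symbol font, and 40 pages.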
Subsequent lines in the example above are pages that
describe how to map from the encoding into 2-byte Unicode.
The first line in a page identifies the page number. Fol-
lowing it are 256 double-byte numbers, arranged as 16 rows
of 16 numbers. Given a character in the encoding, the high
byte of that character is used to select which page, and the
low byte of that character is used as an index to select one
of the double-byte numbers in that page - the value obtained
being the corresponding Unicode character. By examination
of the example above, one can see that the characters 0x7E
and 0x8163 in shiftjis map to 203E and 2026 in Unicode,
respectively.
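The page lookup just described can be sketched in a few lines. DemoPage and demo_lookup are hypothetical names, and storing the pages in a plain array that is searched linearly is an assumption made for brevity rather than a description of Tcl's internal tables.

```c
#include <stddef.h>

/* Hypothetical in-memory form of one page: its number plus 256 entries. */
typedef struct DemoPage {
    unsigned char number;
    unsigned short map[256];
} DemoPage;

/* Map a one- or two-byte code to Unicode: the high byte selects the
 * page, the low byte indexes into it.  Returns 0 when the page is
 * absent or the entry is 0000 (no such character). */
unsigned short demo_lookup(const DemoPage *pages, size_t numPages,
        unsigned int code)
{
    unsigned char hi = (unsigned char) ((code >> 8) & 0xFF);
    unsigned char lo = (unsigned char) (code & 0xFF);
    size_t i;

    for (i = 0; i < numPages; i++) {
        if (pages[i].number == hi) {
            return pages[i].map[lo];
        }
    }
    return 0;
}
```

Loaded with the two pages from the excerpt, demo_lookup maps 0x7E to 0x203E through page 00 and 0x8163 to 0x2026 through page 81.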
Following the first page will be all the other pages, each
in the same format as the first: one number identifying the
page followed by 256 double-byte Unicode characters. If a
character in the encoding maps to the Unicode character
0000, it means that the character does not actually exist.
If all characters on a page would map to 0000, that page can
be omitted.
Case [4] is the escape-sequence encoding file. The lines in
this type of file are in the same format as this example
taken from the iso2022-jp encoding:
# Encoding file: iso2022-jp, escape-driven
E
init {}
final {}
iso8859-1 \x1b(B
jis0201 \x1b(J
jis0208 \x1b$@
jis0208 \x1b$B
jis0212 \x1b$(D
gb2312 \x1b$A
ksc5601 \x1b$(C
In the file, the first column represents an option and the
second column is the associated value. init is a string to
emit or expect before the first character is converted,
while final is a string to emit or expect after the last
character. All other options are names of table-based
encodings; the associated value is the escape-sequence that
marks that encoding. Tcl syntax is used for the values; in
the above example, for instance, "{}" represents the empty
string and "\x1b" represents character 27.
When Tcl_GetEncoding encounters an encoding name that has
not been loaded, it attempts to load an encoding file called
name.enc from the encoding subdirectory of each directory
that Tcl searches for its script library. If the encoding
file exists, but is malformed, an error message will be left
in interp.
KEYWORDS
utf, encoding, convert