UTF-8
(Partly from Silicon Graphics' Moving Worlds documentation)

The 2 byte (UCS-2) encoding of ISO 10646 is identical to the Unicode standard. In order to allow standard ASCII text editors to continue to work with most VRML files, we have chosen to support the UTF-8 encoding of ISO 10646. This encoding allows ASCII text (0x0..0x7F) to appear without any changes and encodes all characters from 0x80..0x7FFFFFFF into a series of six or fewer bytes.

If the most significant bit of the first byte is 0, then the remaining seven bits are interpreted as an ASCII character. Otherwise, the number of leading 1 bits indicates the total number of bytes in the sequence, counting the first byte. There is always a 0 bit between the count bits and any data.

The first byte could be one of the following. The X indicates bits available to encode the character.

byte one    total bytes in character    max char bits    possible numeric range
--------    ------------------------    -------------    ------------------------------------
0XXXXXXX    only this byte               7               0..0x7F (ASCII)
110XXXXX    two bytes                   11               Maximum character value is 0x7FF
1110XXXX    three bytes                 16               Maximum character value is 0xFFFF
11110XXX    four bytes                  21               Maximum character value is 0x1FFFFF
111110XX    five bytes                  26               Maximum character value is 0x3FFFFFF
1111110X    six bytes                   31               Maximum character value is 0x7FFFFFFF
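The first-byte rules in the table can be sketched as a small Python function. This is only an illustration of the scheme described here; the name `sequence_length` is made up for this example and does not come from the VRML or Moving Worlds documentation.

```python
def sequence_length(first_byte: int) -> int:
    """Return the total number of bytes in the sequence that starts with first_byte.

    Hypothetical helper: follows the six-byte scheme in the table above.
    """
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")

    # Most significant bit is 0: a plain ASCII character, one byte in total.
    if first_byte & 0x80 == 0:
        return 1

    # Count the leading 1 bits; that count is the total sequence length.
    count = 0
    mask = 0x80
    while first_byte & mask:
        count += 1
        mask >>= 1

    # A single leading 1 (10xxxxxx) marks a following byte, not a first byte,
    # and more than six leading 1 bits is outside the table.
    if count == 1 or count > 6:
        raise ValueError("not a valid first byte")
    return count
```

For instance, `sequence_length(0xC2)` gives 2, because 0xC2 is 11000010 in binary and starts with two 1 bits.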
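All following bytes have this format: 10xxxxxx. For example, a three byte character encoding can hold 16 bits of actual character data, spread through the three bytes (4 bits in the first byte and 6 bits in each of the two following bytes).

Note that the Unicode UTF-8 standard does not, as of the time of this writing, include the high end of the possible range for the encoding described above.

A two byte example with ®

The symbol for a registered trademark is "circled R registered sign", or 174 (0xAE) in both ISO/Latin-1 (8859/1) and ISO 10646. In UTF-8 it has the following two-byte encoding: 0xC2, 0xAE. Here's a rough idea of how that's generated: 174 is 10101110 in binary, which padded to the 11 bits of a two byte character is 00010 101110. The upper five bits go into the first byte after the 110 prefix (11000010 = 0xC2), and the lower six bits go into the following byte after the 10 prefix (10101110 = 0xAE).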
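As a minimal sketch of the bit manipulation involved, the Python function below encodes a character value using the six-byte scheme described on this page. The name `encode_utf8` and the `layouts` table are assumptions made for this illustration; the sketch does no validation beyond the numeric range, and today's UTF-8 (RFC 3629) stops at four bytes and 0x10FFFF.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode a character value with the (up to) six-byte scheme described above.

    Hypothetical helper for illustration only.
    """
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("character value out of range")

    # ASCII passes through unchanged as a single byte.
    if code_point <= 0x7F:
        return bytes([code_point])

    # (maximum character value, total bytes, first-byte prefix bits)
    layouts = [
        (0x7FF, 2, 0xC0),       # 110XXXXX
        (0xFFFF, 3, 0xE0),      # 1110XXXX
        (0x1FFFFF, 4, 0xF0),    # 11110XXX
        (0x3FFFFFF, 5, 0xF8),   # 111110XX
        (0x7FFFFFFF, 6, 0xFC),  # 1111110X
    ]
    for max_value, total, prefix in layouts:
        if code_point <= max_value:
            break

    # Peel off six bits at a time, lowest bits first, into 10xxxxxx bytes.
    tail = []
    for _ in range(total - 1):
        tail.append(0x80 | (code_point & 0x3F))
        code_point >>= 6

    # Whatever is left fits in the X bits of the first byte.
    first = prefix | code_point
    return bytes([first] + tail[::-1])


print(encode_utf8(174).hex())  # prints "c2ae"
```

The result for 174 agrees with Python's built-in encoder: `"®".encode("utf-8")` is also `b"\xc2\xae"`.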