UTF-8 Encoding Reference
UTF-8 is a variable-length character encoding standard used for electronic
communication. Defined by the Unicode Standard, the name is derived from
Unicode Transformation Format 8-bit. UTF-8 is capable of encoding all
1,112,064 valid Unicode code points using one to four bytes.
Code point - UTF-8 conversion
First code  Last code  Byte1     Byte2     Byte3     Byte4
U+0000      U+007F     0xxxxxxx
U+0080      U+07FF     110xxxxx  10xxxxxx
U+0800      U+FFFF     1110xxxx  10xxxxxx  10xxxxxx
U+010000    U+10FFFF   11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
The bits of the code point's binary value replace the x placeholders, from
the most significant bit on the left (in Byte1) to the least significant bit
on the right (in the last byte). Unused leading positions are filled with
zeros.
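The bit placement described by the table can be sketched in a few lines of
Python. This is only an illustration and not JANOS code. The function name
utf8_encode is hypothetical, and rejecting surrogate code points
(U+D800 to U+DFFF) is an assumption drawn from the Unicode Standard rather
than from this page.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the conversion table.

    Hypothetical helper for illustration; not part of JANOS.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption: surrogates are rejected, as the Unicode Standard requires.
        raise ValueError("surrogate code points cannot be encoded")

    if code_point <= 0x7F:
        # 0xxxxxxx: the byte is the code point itself.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx: 5 bits, then 6 bits.
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx: 4 bits, then 6, then 6.
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 3 bits, then 6, 6, 6.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# U+00E9 (the accented e in "résumé") needs two bytes.
print(utf8_encode(0xE9).hex())  # c3a9
```

For U+00E9 the binary value is 11101001. The top five bits (00011, padded
with zeros) go into Byte1 and the low six bits (101001) go into Byte2, giving
11000011 10101001, or C3 A9 in hexadecimal.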
ASCII characters in the range 0x00 to 0x7F are encoded as a single byte
identical to their ASCII value. If Byte1 is larger than 0x7F, its first bit
is 1, which indicates that additional bytes are used in the encoding. As the
table above shows, the leading bits of Byte1 define how many bytes are used.
Each additional byte begins with 10 and provides 6 more bits of the final
binary value.
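Reading a sequence back works the same way in reverse. The following Python
sketch is an illustration only. The names utf8_sequence_length and
utf8_decode_one are hypothetical, and the code deliberately omits checks a
strict decoder would need, such as rejecting overlong encodings and
surrogates.

```python
def utf8_sequence_length(byte1: int) -> int:
    """Return the number of bytes in a sequence, judged from its first byte.

    Hypothetical helper for illustration; not part of JANOS.
    """
    if byte1 & 0x80 == 0x00:   # 0xxxxxxx
        return 1
    if byte1 & 0xE0 == 0xC0:   # 110xxxxx
        return 2
    if byte1 & 0xF0 == 0xE0:   # 1110xxxx
        return 3
    if byte1 & 0xF8 == 0xF0:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is never valid.
    raise ValueError("not a valid first byte")


def utf8_decode_one(data: bytes) -> int:
    """Decode the first code point in data (simplified, no overlong checks)."""
    length = utf8_sequence_length(data[0])
    if len(data) < length:
        raise ValueError("truncated sequence")
    if length == 1:
        return data[0]
    # Keep only the x bits of Byte1: 5, 4, or 3 of them.
    value = data[0] & (0xFF >> (length + 1))
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("continuation byte must begin with 10")
        # Each continuation byte contributes 6 more bits.
        value = (value << 6) | (byte & 0x3F)
    return value


print(utf8_sequence_length(0xC3))               # 2
print(hex(utf8_decode_one(bytes([0xC3, 0xA9]))))  # 0xe9
```

Because a continuation byte always begins with 10 and a first byte never
does, a reader that starts in the middle of a stream can skip forward to the
next first byte and resynchronize.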
UTF-8 encodings used in the context of JNIOR are rarely more than two bytes.
Note that JANOS offers a shortcut for selecting the appropriate Unicode
character for common accenting. For instance, typing the base character 'e'
and then pressing Ctrl-U twice toggles to the correct letter used in the
word résumé.
[/flash/manpages/reference.hlp:769]