M_T08_EN_B
Introduction
UTF-8 encoding
UTF-8 was designed to be compatible with software originally written to process one-byte
characters. Each character is encoded as a sequence of 1 to 4 bytes.
UTF-8 is standardised in RFC 3629 (UTF-8, a transformation format of ISO 10646). The encoding is also defined
in Technical Report 17 of the Unicode standard. It is part of Chapter 3, "Conformance", of that standard and
is approved by the International Organization for Standardization (ISO), the Internet Engineering Task Force
(IETF) and most national standardisation organisations.
Encoding
Characters numbered 0 to 127 are encoded on 1 byte whose most significant bit is always 0.
Characters with a number greater than 127 are encoded on several bytes. In this case, the most
significant bits of the first byte form a run of 1s as long as the number of bytes used to encode the character,
followed by a 0; each of the following bytes has 10 as its most significant bits.
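The rule above can be sketched in Python as a way to read the sequence length from the first byte. This is a minimal illustration only; the function name `sequence_length` is hypothetical and is not part of any program described in this manual.

```python
def sequence_length(first_byte: int) -> int:
    """Return how many bytes the UTF-8 sequence starting with first_byte uses."""
    if (first_byte & 0b10000000) == 0:
        # Most significant bit is 0: a one-byte character (0 to 127).
        return 1
    # Count the run of 1 bits at the most significant end of the byte.
    count = 0
    mask = 0b10000000
    while first_byte & mask:
        count += 1
        mask >>= 1
    if count == 1:
        # Pattern 10xxxxxx is a following byte, never the start of a sequence.
        raise ValueError("not a first byte")
    if count > 4:
        # UTF-8 does not use sequences longer than 4 bytes.
        raise ValueError("invalid first byte")
    return count
```

For example, `sequence_length(0xC3)` gives 2 because 0xC3 is 11000011 in binary, which starts with two 1 bits.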
Definition of the number of bytes used
UTF-8 binary representation            Meaning
0xxxxxxx                               1 byte coding 1 to 7 bits (from 0 to 127)
110xxxxx 10xxxxxx                      2 bytes coding 8 to 11 bits (from 128 to 2047)
1110xxxx 10xxxxxx 10xxxxxx             3 bytes coding 12 to 16 bits (from 2048 to 65535)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    4 bytes coding 17 to 21 bits (from 65536 to 2097151)
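As an illustration of the table, the following Python sketch places the bits of a character number into the x positions of each pattern. The name `utf8_encode` is hypothetical. The sketch assumes the RFC 3629 upper limit of 1114111 (0x10FFFF) rather than the full 21-bit range, and it does not check for the reserved range 0xD800 to 0xDFFF.

```python
def utf8_encode(number: int) -> bytes:
    """Encode one character number as UTF-8, following the table row by row."""
    if number < 0 or number > 0x10FFFF:
        raise ValueError("character number out of range")
    if number <= 0x7F:
        # 0xxxxxxx
        return bytes([number])
    if number <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (number >> 6),
            0b10000000 | (number & 0b111111),
        ])
    if number <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (number >> 12),
            0b10000000 | ((number >> 6) & 0b111111),
            0b10000000 | (number & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (number >> 18),
        0b10000000 | ((number >> 12) & 0b111111),
        0b10000000 | ((number >> 6) & 0b111111),
        0b10000000 | (number & 0b111111),
    ])
```

Each following byte carries 6 bits of the character number, which is why the ranges in the table grow from 7 to 11, 16 and 21 bits.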
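The same scheme could be extended to sequences of up to 6 bytes, but UTF-8 as defined in RFC 3629 limits
sequences to 4 bytes and character numbers to 1114111 (0x10FFFF). The scheme would also allow a character
to be encoded on more bytes than necessary, but UTF-8 forbids this: each character must be encoded with the
shortest possible sequence.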
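The ban on using more bytes than needed can be observed with Python's built-in decoder, which is used here only as a convenient conforming implementation. In this small sketch the letter A (65) is deliberately padded into the two-byte pattern, and the decoder refuses it.

```python
# 65 is 1000001 in binary; forced into 110xxxxx 10xxxxxx it becomes
# 11000001 10000001, a two-byte form of a one-byte character.
padded_a = bytes([0b11000001, 0b10000001])

try:
    padded_a.decode("utf-8")
    accepted = True
except UnicodeDecodeError:
    # A strict UTF-8 decoder rejects the over-long form.
    accepted = False

print(accepted)
```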
Note: a UTF-8 representation on 4 bytes corresponds to a character number greater than 65535;
such characters must not be used with the T08 program.
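A text can be checked against this limit before it is used. The sketch below is an assumption about how such a check might look; the function name `within_16_bit_limit` is hypothetical and is not part of the T08 program.

```python
def within_16_bit_limit(text: str) -> bool:
    """Return True if no character number in text exceeds 65535."""
    # Characters above 65535 are exactly those that need 4 bytes in UTF-8.
    return all(ord(character) <= 0xFFFF for character in text)
```

With this helper, `within_16_bit_limit("A é €")` is true, while a text containing the character numbered 128512 (0x1F600) is rejected.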
Example
Example of the UTF-8 encoding
Character    Character number    UTF-8 binary encoding
A            65                  01000001
é            233                 11000011 10101001
€            8364                11100010 10000010 10101100
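The rows of this example can be reproduced with Python's built-in UTF-8 encoder. The helper name `utf8_bits` is hypothetical and serves only to print each byte as 8 binary digits.

```python
def utf8_bits(character: str) -> str:
    """Return the UTF-8 bytes of a character as space-separated binary digits."""
    return " ".join(f"{byte:08b}" for byte in character.encode("utf-8"))

for character in ("A", "é", "€"):
    # Prints the character, its number, and its UTF-8 binary encoding.
    print(character, ord(character), utf8_bits(character))
```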
In any UTF-8 character string, any byte whose most significant bit is 0 encodes a single-byte US-ASCII
character. The characters whose numbers lie between 0 and 127 are therefore represented the same way as in
ASCII (unaccented capital and small letters, digits and some common symbols).
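This compatibility can be checked with a short Python sketch, again relying on the built-in encoder. The sample text is arbitrary and chosen only to mix ASCII and non-ASCII characters.

```python
# Every character numbered 0 to 127 is one byte with the same value as in ASCII.
same_as_ascii = all(chr(n).encode("utf-8") == bytes([n]) for n in range(128))

# In a mixed string, the bytes whose most significant bit is 0 are exactly
# the ASCII characters; the bytes of "é" and "€" all have that bit set to 1.
encoded = "Café €5".encode("utf-8")
ascii_part = bytes(b for b in encoded if b < 0x80).decode("ascii")

print(same_as_ascii, repr(ascii_part))
```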