UTF-8 is an efficient encoding of Unicode character strings. It recognizes that the majority of text-based communications are in ASCII and therefore optimizes the encoding of these characters.

Unicode is preferred to ASCII because it permits the inclusion of accents, scientific symbols and characters used in languages other than English. The UTF-8 format is a standard encoding that provides the most efficient means of encoding 16-bit Unicode characters in cases where the majority of characters are in the ASCII range. Both UTF-8 and the alternative UTF-16 encoding are supported by all widely used operating systems and major applications (and have been for more than 15 years).
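
The size advantage for mostly-ASCII text can be illustrated with a short Python sketch. Python is used here only for illustration, and the sample term is an arbitrary choice rather than something taken from this guide.

```python
# Compare the encoded size of a mostly-ASCII term in UTF-8 and UTF-16.
# The sample term is an assumption chosen for illustration; it contains
# two accented characters (u+00e9 and u+00e8) among 28 characters in total.
text = "M\u00e9ni\u00e8re's disease (disorder)"

utf8_size = len(text.encode("utf-8"))
# "utf-16-le" is used so that no byte order mark is added to the count.
utf16_size = len(text.encode("utf-16-le"))

print(len(text), utf8_size, utf16_size)  # prints: 28 30 56
```

Each ASCII character costs one byte in UTF-8 and two in UTF-16, so the more ASCII a text contains, the larger the saving.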

SNOMED CT uses the UTF-8 representation of characters in terms and other text fields.

Character encoding

ASCII characters are encoded as a single byte.
Greek, Hebrew, Arabic and most accented European characters are encoded as two bytes.
All other 16-bit characters are encoded as three bytes.
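
These byte counts can be checked with a few lines of Python. This is a minimal sketch; the sample characters are arbitrary choices made for illustration and are not taken from this guide.

```python
# One sample character per group; these samples are assumptions.
samples = [
    "A",       # ASCII
    "\u03b1",  # Greek small letter alpha
    "\u05d0",  # Hebrew letter alef
    "\u0639",  # Arabic letter ain
    "\u00e9",  # Latin small letter e with acute
    "\u20ac",  # Euro sign, outside the two byte range
]

# Map each code point (written in this section's 'u+xxxx' style)
# to the number of bytes in its UTF-8 encoding.
byte_counts = {f"u+{ord(ch):04x}": len(ch.encode("utf-8")) for ch in samples}

for code_point, count in byte_counts.items():
    print(code_point, count)
```

The first sample takes one byte, the next four take two bytes each, and the last takes three.
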
The individual characters are encoded according to the following rules.

Single byte encoding

Characters in the range 'u+0000' to 'u+007f' are encoded as a single byte.

byte 0
0	bits 0-6

Two byte encoding

Characters in the range 'u+0080' to 'u+07ff' are encoded as two bytes.

byte 0				byte 1
1	1	0	bits 6-10	1	0	bits 0-5

Three byte encoding

Characters in the range 'u+0800' to 'u+ffff' are encoded as three bytes:

byte 0					byte 1			byte 2
1	1	1	0	bits 12-15	1	0	bits 6-11	1	0	bits 0-5
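
The three layouts above can be sketched as a small hand-written encoder. The function name `encode_bmp` is hypothetical, and the sketch only covers the range 'u+0000' to 'u+ffff' described in this section; in practice Python's built-in `str.encode("utf-8")` does the same job, and the loop at the end compares the two.

```python
def encode_bmp(code_point: int) -> bytes:
    """Encode one code point in the range u+0000 to u+ffff by hand.

    Illustrative helper only (hypothetical name).
    """
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only code points up to u+ffff are handled here")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption made explicit: surrogate code points are not
        # characters and have no UTF-8 encoding.
        raise ValueError("surrogate code points cannot be encoded")
    if code_point <= 0x007F:
        # Single byte: 0 followed by bits 0-6.
        return bytes([code_point])
    if code_point <= 0x07FF:
        # Two bytes: 110 + bits 6-10, then 10 + bits 0-5.
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    # Three bytes: 1110 + bits 12-15, then 10 + bits 6-11, then 10 + bits 0-5.
    return bytes([
        0xE0 | (code_point >> 12),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


# Compare the hand-written encoder with Python's built-in encoder.
for cp in (0x0053, 0x00AE, 0x2462):
    assert encode_bmp(cp) == chr(cp).encode("utf-8")
    print(f"u+{cp:04x}", " ".join(f"{b:08b}" for b in encode_bmp(cp)))
```

The shifts and masks mirror the bit ranges in the layouts: `>> 6` drops bits 0-5, and `& 0x3F` keeps the lowest six bits.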

Notes on encoding rules

The first bits of each byte indicate the role of the byte. A zero bit terminates this role information. Thus possible byte values are:

Bits	Byte value	Role
0???????	000-127	Single byte encoding of a character
10??????	128-191	Continuation of a multi-byte encoding
110?????	192-223	First byte of a two byte character encoding
1110????	224-239	First byte of a three byte character encoding
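1111????	240-255	Not used for 16-bit characters (values 240-244 begin four byte encodings of characters above 'u+ffff'; other values are invalid in UTF-8)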
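
The role of a byte can be read from its value alone, as the following Python sketch shows. The function name `byte_role` and its return strings are hypothetical; values from 240 upwards are simply reported as falling outside the one to three byte forms described in this section.

```python
def byte_role(value: int) -> str:
    """Return the role of a byte value (0-255) using the ranges in the table.

    Illustrative helper only (hypothetical name).
    """
    if not 0 <= value <= 255:
        raise ValueError("a byte value must be in the range 0-255")
    if value < 0x80:   # 0??????? (0-127)
        return "single byte encoding"
    if value < 0xC0:   # 10?????? (128-191)
        return "continuation byte"
    if value < 0xE0:   # 110????? (192-223)
        return "first byte of a two byte encoding"
    if value < 0xF0:   # 1110???? (224-239)
        return "first byte of a three byte encoding"
    return "outside the one to three byte forms"  # 1111???? (240-255)


# Classify each byte of a two-character string: "T" followed by u+00ae.
roles = [byte_role(b) for b in "T\u00ae".encode("utf-8")]
print(roles)
```

Because continuation bytes always start with the bits 10, a reader that starts in the middle of a byte stream can skip forward to the next byte that is not a continuation and resume decoding from there.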

Example encoding

Character	S	C	T	®		③
Unicode	0053	0043	0054	00AE		2462
Bytes	01010011	01000011	01010100	11000010	10101110	11100010	10010001	10100010
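
The rows of this example can be reproduced with Python's built-in encoder. This is a minimal sketch: the string is assembled from the code points listed in the Unicode row, and the variable names are arbitrary.

```python
# Build the example string from the code points in the Unicode row.
text = "".join(chr(cp) for cp in (0x0053, 0x0043, 0x0054, 0x00AE, 0x2462))

# For each character, record its code point and the bits of each encoded byte.
rows = [
    (f"{ord(ch):04X}", [f"{b:08b}" for b in ch.encode("utf-8")])
    for ch in text
]

for code_point, bits in rows:
    print(code_point, " ".join(bits))
```

Each printed line shows one code point followed by one, two or three groups of eight bits, depending on which encoding rule applies.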
