Unicode
1. Unicode
Representing text-format data in computers is a matter of:
- repertoire: defining an unordered collection of characters (the repertoire)
- charset: assigning each character of the repertoire a number (code point); a coded character set (charset) maps characters of the repertoire to numeric values
- encoding: assigning each number a bit representation

Thus we should tell `charset` and `encoding` apart when talking about text-format data. But for some simple text formats like ASCII, the two concepts are basically the same: ASCII is both a charset and an encoding scheme.
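To make the distinction concrete, here is a minimal Java sketch (the class name `CharsetVsEncoding` and the helper `dump` are just illustrative names): the same code point, U+00E9, is a single entry at the charset level, but serializes to different bytes under different encodings.

```java
import java.nio.charset.StandardCharsets;

public class CharsetVsEncoding {
    // Print a byte array as hex.
    static void dump(String label, byte[] bytes) {
        System.out.printf("%-8s:", label);
        for (byte b : bytes) System.out.printf(" %02X", b & 0xFF);
        System.out.println();
    }

    public static void main(String[] args) {
        String s = "\u00e9"; // é is the single code point U+00E9 in the charset...
        // ...but its byte representation depends on the encoding:
        dump("LATIN-1",  s.getBytes(StandardCharsets.ISO_8859_1)); // E9
        dump("UTF-8",    s.getBytes(StandardCharsets.UTF_8));      // C3 A9
        dump("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE));   // 00 E9
    }
}
```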
1.1. Charset
ASCII uses 7 bits to encode 128 characters; LATIN-1 uses 8 bits to encode 256 characters.
Unicode is a charset, so when talking about Unicode, temporarily forget encoding schemes like `utf` and `ucs`…
The first version of Unicode used 16-bit code points, which allowed for encoding 65536 characters.
Starting with Unicode 2.0, the standard began assigning code points from 0 to 10FFFF (which requires 21 bits).
The first 128 characters of Unicode are assigned to the same characters as in the ASCII charset, and the same is true for the 256 characters of LATIN-1; thus ASCII and LATIN-1 are subsets of Unicode. Note that we are talking about `charset`s here; for the various encoding schemes, the conclusion may not apply.
1.2. Encoding
Unicode assigns each character a number from 0 to 10FFFF; such a character number is called a `code point`. Code points are just non-negative integers: they have no implicit binary representation and no fixed width of 21 bits. Binary representation and unit width are defined by the encoding scheme.
There are three encoding schemes:
- UTF-16, the default encoding scheme, maps a code point to either one or two 16-bit code units
- UTF-8 offers backward compatibility with ASCII-based APIs and protocols[1]; a code point is mapped to 1, 2, 3, or 4 bytes
- UTF-32 is the simplest but most memory-intensive encoding: it uses a fixed 32-bit integer for each code point

Both UTF-16 and UTF-8 are variable-length encoding schemes.
For input/output, an encoding scheme needs to define a byte serialization of text. UTF-8 is directly serializable because it is byte-based. For each of UTF-16 and UTF-32, two variants are defined, big-endian and little-endian; the corresponding encoding schemes are called UTF-16BE, UTF-16LE, UTF-32BE, and UTF-32LE.
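The endianness variants are easy to observe with the JDK's built-in charsets; a small sketch (the class and helper names are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class EndiannessDemo {
    static void dump(String label, byte[] bytes) {
        System.out.printf("%-9s:", label);
        for (byte b : bytes) System.out.printf(" %02X", b & 0xFF);
        System.out.println();
    }

    public static void main(String[] args) {
        String s = "A"; // U+0041
        dump("UTF-16BE", s.getBytes(StandardCharsets.UTF_16BE)); // 00 41
        dump("UTF-16LE", s.getBytes(StandardCharsets.UTF_16LE)); // 41 00
        dump("UTF-16",   s.getBytes(StandardCharsets.UTF_16));   // FE FF 00 41 (BOM, then BE)
    }
}
```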
For historical reasons, there are also two other encoding schemes:
UCS2
UCS2 is basically the same as the UTF-16 of the first Unicode version: a fixed-length encoding scheme that uses 16 bits to encode code points between 0000 and FFFF. It is deprecated because it can't encode code points between 10000 and 10FFFF.
UCS4
UCS4 is the same as UTF-32.
1.3. UTF-16
UTF-16 is the default encoding form of Unicode. It is a variable-length encoding scheme:
- for code points between 0000 and FFFF (the Basic Multilingual Plane, BMP), UTF-16 uses one 16-bit code unit
- for code points between 10000 and 10FFFF (the supplementary planes), UTF-16 uses two 16-bit code units, called a `surrogate pair`: the first surrogate in the pair must be in the range D800-DBFF, and the second one must be in the range DC00-DFFF. The code point values D800-DFFF are set aside for this mechanism and will never, by themselves, be assigned any characters.
For historical reasons, UCS2 is sometimes mistaken for UTF-16. E.g., Java claims that a `char` represents a UTF-16 character; in fact, a Java `char` is a single 16-bit code unit of the original UTF-16, i.e. the deprecated UCS2, so a `char` in Java can't represent a code point between 10000 and 10FFFF!
1.3.1. The UTF-16 encoding
| lead \ trail | DC00 | DC01 | … | DFFF |
|---|---|---|---|---|
| D800 | 10000 | 10001 | … | 103FF |
| D801 | 10400 | 10401 | … | 107FF |
| … | … | … | … | … |
| DBFF | 10FC00 | 10FC01 | … | 10FFFF |
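The mapping in this table is simple arithmetic on the code point. A minimal sketch in Java (the helper name `toSurrogatePair` is just for illustration; the JDK's `Character.toChars` does the same job):

```java
public class Utf16Demo {
    // Encode a supplementary code point (0x10000-0x10FFFF) as a surrogate pair.
    static char[] toSurrogatePair(int codePoint) {
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char lead  = (char) (0xD800 + (offset >> 10));   // high 10 bits -> D800-DBFF
        char trail = (char) (0xDC00 + (offset & 0x3FF)); // low 10 bits  -> DC00-DFFF
        return new char[] { lead, trail };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x10001);
        System.out.printf("U+10001 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // prints: U+10001 -> D800 DC01, matching the table above
    }
}
```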
1.4. UTF-8
To meet the requirements of byte-oriented, ASCII-based systems, Unicode defines UTF-8. UTF-8 is a variable-width encoding scheme; it encodes a character to 1, 2, 3, or 4 bytes.
1.4.1. The UTF-8 encoding
| Bits | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| 7 | U+007F | 0xxxxxxx | | | |
| 11 | U+07FF | 110xxxxx | 10xxxxxx | | |
| 16 | U+FFFF | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 21 | U+1FFFFF | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
What we can read from this table:
- UTF-8 is compatible with ASCII and thus compatible with legacy ASCII-based systems
- UTF-8 is a prefix[2] encoding scheme
- 4-byte UTF-8 is enough to encode all the 21-bit Unicode code points
- characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16; as a result, text in (for example) Chinese, Japanese, or Hindi can take more space in UTF-8
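A hedged sketch of the byte patterns in the table, written as a hand-rolled encoder (illustrative only: it omits validation of surrogate ranges and of the U+10FFFF upper bound; real code should use the JDK's UTF-8 charset):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Encode {
    // Encode one code point following the byte patterns in the table above.
    static byte[] encode(int cp) {
        if (cp <= 0x7F)   return new byte[] { (byte) cp };                        // 0xxxxxxx
        if (cp <= 0x7FF)  return new byte[] { (byte) (0xC0 | (cp >> 6)),          // 110xxxxx
                                              (byte) (0x80 | (cp & 0x3F)) };      // 10xxxxxx
        if (cp <= 0xFFFF) return new byte[] { (byte) (0xE0 | (cp >> 12)),         // 1110xxxx
                                              (byte) (0x80 | ((cp >> 6) & 0x3F)),
                                              (byte) (0x80 | (cp & 0x3F)) };
        return new byte[] { (byte) (0xF0 | (cp >> 18)),                           // 11110xxx
                            (byte) (0x80 | ((cp >> 12) & 0x3F)),
                            (byte) (0x80 | ((cp >> 6) & 0x3F)),
                            (byte) (0x80 | (cp & 0x3F)) };
    }

    public static void main(String[] args) {
        // U+4E2D ("中", a BMP character): three bytes, as the 16-bit row predicts.
        for (byte b : encode(0x4E2D)) System.out.printf("%02X ", b & 0xFF); // E4 B8 AD
        System.out.println();
        // Cross-check against the JDK's own encoder:
        for (byte b : "\u4E2D".getBytes(StandardCharsets.UTF_8))
            System.out.printf("%02X ", b & 0xFF);                           // E4 B8 AD
    }
}
```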
1.5. Unicode and Java
Java internally uses UTF-16 to represent `Character` and `String`. BUT, since UTF-16 is a variable-width encoding, how do we represent a supplementary character with the Java `Character` class?
In fact, `char` literals can only represent UCS2 code units, i.e. they are limited to values from 0000 to FFFF; supplementary characters (10000-10FFFF) must be represented as a surrogate pair within a char sequence, or as an integer.
```java
char c = '\u1234';                          // ok: a BMP code unit
char c2 = '\u10001';                        // error: \u escapes take exactly 4 hex digits
String s = "\u10001";                       // NOT U+10001: parsed as \u1000 followed by '1'
String s2 = "\ud800\udc01";                 // ok: U+10001 as a surrogate pair
char[] chars = Character.toChars(0x10001);  // ok: returns {'\ud800', '\udc01'}
```
The Java `String` and `Character` classes have many methods to cope with code points; when dealing with supplementary characters, we must take care.
```java
String s = new String(new int[] { 0x10001 }, 0, 1);

System.out.println("s.length: " + s.length());
// output: 2  (one supplementary character, two char code units)

for (int i = 0; i < s.length(); ++i) {
    System.out.printf("cp: 0x%x%n", s.codePointAt(i));
}
// output: 0x10001
//         0xdc01  (indexing by char lands on the trail surrogate)

System.out.println("cp count: " + s.codePointCount(0, s.length()));
// output: 1

// Related APIs:
// boolean Character.isHighSurrogate(char c)
// boolean Character.isLowSurrogate(char c)
// int     Character.toCodePoint(char high, char low)
// char[]  Character.toChars(int codePoint)
```
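Since Java 8, `String.codePoints()` iterates by code point rather than by `char`, which sidesteps the surrogate pitfalls above; a small sketch:

```java
String s = "a" + new String(Character.toChars(0x10001)) + "b";
s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
// output: U+0061
//         U+10001
//         U+0062
```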
2. ICU
2.1. Collator
2.2. Normalizer
2.3. BIDI
3. Reference
Footnotes:
[1] ASCII-based systems take 0x0 as the `end of data` mark, e.g. `char *` in C, while the UTF-8 encoding of any code point other than U+0000 never contains a 0x0 byte.
[2] A prefix code is a code system (typically a variable-length code) in which no valid code word is a prefix (start) of any other valid code word. For example, a code with code words {9, 59, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not. See Wikipedia: prefix code.