Unicode

Table of Contents

1. Unicode
2. ICU
3. Reference

Representing text-format data in computers is a matter of:

  1. repertoire: defining a repertoire, an unordered collection of characters
  2. charset: a coded character set (charset) assigns each character of the repertoire a number (code point)
  3. encoding: assigning each code point a bit representation

Thus, we should tell `charset` and `encoding` apart when talking about text-format data. But for some simple formats like ASCII, the two concepts are basically the same: ASCII is both a charset and an encoding scheme.
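A quick sketch of the distinction in Java (the character U+4E2D is just an arbitrary example): the code point is a charset-level fact, while the bytes differ per encoding scheme.

import java.nio.charset.StandardCharsets;

String s = "\u4e2d";                                          // one character from the repertoire
System.out.printf("code point: U+%04X%n", s.codePointAt(0));  // charset level: U+4E2D
// encoding level: the same code point, different bit representations
for (byte b : s.getBytes(StandardCharsets.UTF_8))    System.out.printf("%02x ", b & 0xff); // e4 b8 ad
System.out.println();
for (byte b : s.getBytes(StandardCharsets.UTF_16BE)) System.out.printf("%02x ", b & 0xff); // 4e 2d
System.out.println();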

1.1. Charset

ASCII uses 7 bits to encode 128 characters. LATIN-1 uses 8 bits to encode 256 characters.

Unicode is a charset, so when we talk about Unicode, temporarily forget about encoding schemes like `utf` and `ucs`…

The first version of Unicode used 16-bit code points, which allowed for encoding 65536 characters.

Starting with Unicode 2.0, the standard began assigning code points from 0-10ffff (which requires 21 bits).

The first 128 code points of Unicode are assigned to the same characters as in the ASCII charset, and the first 256 match LATIN-1; thus ASCII and LATIN-1 are subsets of Unicode. Note that we are talking about the `charset` level; for various encoding schemes the conclusion MAYBE doesn't apply.
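The subset relation can be checked directly. A minimal sketch: decoding a LATIN-1 byte yields a Unicode code point with the same numeric value.

import java.nio.charset.StandardCharsets;

byte[] latin1 = { (byte) 0xE9 };  // 'é' in LATIN-1
String s = new String(latin1, StandardCharsets.ISO_8859_1);
System.out.printf("U+%04X%n", s.codePointAt(0)); // U+00E9: same value as the LATIN-1 byte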

1.2. Encoding

Unicode assigns each character a number from 0-10ffff; such a character number is called a `code point`. Code points are just non-negative integers: they don't have an implicit binary representation or a width of 21 bits. Binary representation and unit width are defined by the encoding scheme.

There are 3 encoding schemes:

  1. UTF-16, the default encoding scheme, maps a code point to either one or two 16-bit units
  2. UTF-8 offers backward compatibility with ASCII-based APIs and protocols [1]; a code point is mapped to 1, 2, 3 or 4 bytes
  3. UTF-32 is the simplest but most memory-intensive encoding scheme: it uses a fixed 32-bit integer for each code point

Both UTF-16 and UTF-8 are variable length encoding schemes.

For input/output, an encoding scheme needs to define a byte serialization of text. UTF-8 is itself serializable because it is byte-based. For each of UTF-16 and UTF-32, two variants are defined: big-endian and little-endian; the corresponding encoding schemes are called UTF-16BE, UTF-16LE, UTF-32BE and UTF-32LE.
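A minimal sketch showing the byte serializations of one supplementary code point under several of these schemes (charset names as the JDK accepts them; UTF-32BE is not in StandardCharsets, so it goes through Charset.forName, which works on common JDKs):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

String s = new String(Character.toChars(0x10001)); // U+10001
for (byte b : s.getBytes(StandardCharsets.UTF_8))      System.out.printf("%02x ", b & 0xff); // f0 90 80 81
System.out.println();
for (byte b : s.getBytes(StandardCharsets.UTF_16BE))   System.out.printf("%02x ", b & 0xff); // d8 00 dc 01
System.out.println();
for (byte b : s.getBytes(StandardCharsets.UTF_16LE))   System.out.printf("%02x ", b & 0xff); // 00 d8 01 dc
System.out.println();
for (byte b : s.getBytes(Charset.forName("UTF-32BE"))) System.out.printf("%02x ", b & 0xff); // 00 01 00 01
System.out.println();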

For historical reasons, there are also two other encoding schemes:

  1. UCS2

    UCS2 is basically the same as the UTF-16 of the first Unicode version: it is a fixed-length encoding scheme, using 16 bits to encode code points between 0-ffff. It is deprecated because it can't encode code points between 10000-10ffff.

  2. UCS4

    UCS4 is the same as UTF-32.

1.3. UTF-16

UTF-16 is the default encoding form of Unicode. It is a variable-length encoding scheme:

  • for code points between 0-ffff (the Basic Multilingual Plane, BMP), UTF-16 uses one 16-bit unit
  • code points between 10000-10ffff (the supplementary planes) are encoded with two 16-bit units, called a `surrogate pair`: the first surrogate in the pair must be in the range D800-DBFF, and the second one must be in the range DC00-DFFF. The code point values D800-DFFF are set aside for this mechanism and will never, by themselves, be assigned any characters.

    For historical reasons, UCS2 is sometimes mistaken for UTF-16. E.g. Java claims that a `char` represents a UTF-16 code unit; in fact, a single `char` behaves like the older fixed-width UTF-16, i.e. the deprecated UCS2, thus one `char` in Java can't represent a code point between 10000-10ffff!

1.3.1. The UTF-16 encoding

Table 1: supplementary code points (10000-10ffff)
lead\tail   DC00     DC01     …    DFFF
D800        10000    10001    …    103ff
D801        10400    10401    …    107ff
…           …        …        …    …
DBFF        10fc00   10fc01   …    10ffff
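The mapping in Table 1 can be computed directly. A minimal sketch (toSurrogates is a hypothetical helper written for this note, checked against Character.toChars):

// compute the surrogate pair for a supplementary code point (10000-10ffff)
static char[] toSurrogates(int cp) {
    int v = cp - 0x10000;                      // 20 bits remain after the offset
    char high = (char) (0xD800 + (v >> 10));   // top 10 bits -> lead surrogate
    char low  = (char) (0xDC00 + (v & 0x3FF)); // low 10 bits -> tail surrogate
    return new char[] { high, low };
}

char[] manual  = toSurrogates(0x10001);
char[] library = Character.toChars(0x10001);
System.out.printf("manual:  %04x %04x%n", (int) manual[0], (int) manual[1]);   // d800 dc01
System.out.printf("library: %04x %04x%n", (int) library[0], (int) library[1]); // d800 dc01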

1.4. UTF-8

To meet the requirements of byte-oriented, ASCII-based systems, Unicode defines UTF-8. UTF-8 is a variable-width encoding scheme that encodes a character to 1, 2, 3 or 4 bytes.

1.4.1. The UTF-8 encoding

Table 2: UTF-8 encoding
Bits   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
7      U+007F            0xxxxxxx
11     U+07FF            110xxxxx   10xxxxxx
16     U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
21     U+1FFFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

What we can read from this table:

  1. UTF-8 is compatible with ASCII and thus compatible with those legacy ASCII-based systems
  2. UTF-8 is a prefix [2] encoding scheme
  3. 4 bytes of UTF-8 are enough to encode all of the 21-bit Unicode code points (see the encoder sketch after this list)
  4. Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi can take more space in UTF-8
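As an illustration of Table 2, a minimal hand-rolled encoder sketch. encodeUtf8 is a hypothetical helper for this note; it skips the validation a real encoder needs (e.g. rejecting the surrogate range d800-dfff), and for real code String.getBytes(StandardCharsets.UTF_8) is the way to go.

// encode one code point following the bit patterns of Table 2
static byte[] encodeUtf8(int cp) {
    if (cp < 0x80)    return new byte[] { (byte) cp };             // 1 byte: plain ASCII
    if (cp < 0x800)   return new byte[] {
            (byte) (0xC0 | (cp >> 6)),                             // 110xxxxx
            (byte) (0x80 | (cp & 0x3F)) };                         // 10xxxxxx
    if (cp < 0x10000) return new byte[] {
            (byte) (0xE0 | (cp >> 12)),                            // 1110xxxx
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
    return new byte[] {
            (byte) (0xF0 | (cp >> 18)),                            // 11110xxx
            (byte) (0x80 | ((cp >> 12) & 0x3F)),
            (byte) (0x80 | ((cp >> 6) & 0x3F)),
            (byte) (0x80 | (cp & 0x3F)) };
}

for (byte b : encodeUtf8(0x10001)) System.out.printf("%02x ", b & 0xff); // f0 90 80 81
System.out.println();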

1.5. Unicode and Java

Java internally uses UTF-16 to represent `Character` and `String`. BUT, since UTF-16 is a variable-width encoding, how can a supplementary character be represented with the Java `Character` class?

In fact, `char` literals can only represent UCS2 code units, i.e. they are limited to values from 0000-ffff; supplementary characters (10000-10ffff) must be represented as a surrogate pair within a char sequence, or as an integer.

char c1 = '\u1234';         // ok: a single BMP code unit
char c2 = '\u10001';        // error: \u escapes take 4 hex digits, so this is '\u1000' followed by '1'
String s1 = "\u10001";      // compiles, but it is the 2-char string "\u1000" + "1", NOT U+10001
String s2 = "\ud800\udc01"; // ok: the surrogate pair for U+10001
char[] chars = Character.toChars(0x10001); // ok: {0xd800, 0xdc01}

Java's `String` and `Character` classes have a lot of methods for coping with code points, especially supplementary characters; when dealing with supplementary characters, we must take care.

String s = new String(new int[] {0x10001}, 0, 1);
System.out.println("s.length: " + s.length()); // output: 2 (one surrogate pair = two chars)
for (int i = 0; i < s.length(); ++i) {
    // at index 1, codePointAt lands on the low surrogate and returns it as-is
    System.out.printf("cp: 0x%x\n", s.codePointAt(i));
}
// output: cp: 0x10001
//         cp: 0xdc01
System.out.println("cp count: " + s.codePointCount(0, s.length())); // output: 1

boolean Character.isHighSurrogate(char c);      // is c a lead surrogate (d800-dbff)?
boolean Character.isLowSurrogate(char c);       // is c a tail surrogate (dc00-dfff)?
int Character.toCodePoint(char high, char low); // combine a surrogate pair into a code point
char [] Character.toChars(int codePoint);       // code point -> one char, or a surrogate pair
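A minimal sketch of walking a String code point by code point with these helpers:

String s = "a" + new String(Character.toChars(0x10001)) + "b";
for (int i = 0; i < s.length(); ) {
    char c = s.charAt(i);
    int cp;
    if (Character.isHighSurrogate(c) && i + 1 < s.length()
            && Character.isLowSurrogate(s.charAt(i + 1))) {
        cp = Character.toCodePoint(c, s.charAt(i + 1)); // combine the pair
        i += 2;
    } else {
        cp = c;                                         // a lone BMP code unit
        i += 1;
    }
    System.out.printf("U+%04X%n", cp); // U+0061, U+10001, U+0062
}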

2. ICU

2.1. Collator

2.2. Normalizer

2.3. BIDI

3. Reference

Footnotes:

[1]

ASCII-based systems take 0x0 as the `end of data` mark (e.g. `char *` strings in C), while UTF-8 never produces a 0x0 code unit when encoding any code point other than NUL itself.

[2]

A prefix code is a type of code system (typically a variable-length code) in which no valid code word is a prefix (start) of any other valid code word in the set. For example, a code with code words {9, 59, 55} has the prefix property; a code consisting of {9, 5, 59, 55} does not. Wikipedia: prefix code

Author: [email protected]
Date:
Last updated: 2022-02-25 Fri 22:34

Creative Commons License