Basic concepts of character encoding
1, ASCII encoding: a single-byte encoding covering the characters needed to display English text (English letters, digits, punctuation, and so on). Standard ASCII defines 128 characters; since one byte can hold 256 values, the remaining range is available for the extensions described next.
2, ANSI encoding: a language like Chinese obviously cannot fit all of its characters into 256 code points, so the ASCII table was extended by using two or more bytes to represent one Chinese character. Different countries and regions formulated their own such standards, and these extended, mostly double-byte encodings are collectively called ANSI encodings. In other words, ANSI is a general term for the various extensions of the ASCII table, and which encoding it actually refers to depends on the language of the operating system: on a Chinese system, ANSI means GB2312; on a Japanese system, ANSI means JIS (Shift-JIS).
3, Unicode encoding: Unicode is a single very large character set, a unified standard that can accommodate the symbols of every language in the world. Each symbol has its own code point; for example, U+0639 is the Arabic letter Ain, U+0041 is the uppercase English letter A, and the Chinese character "汉" (Han) is U+6C49.
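To see these code points in a program, here is a minimal C11 sketch using char16_t (one UTF-16 code unit); it simply checks the values quoted above:

    #include <stdio.h>
    #include <uchar.h>   /* char16_t */

    int main(void) {
        char16_t a   = u'A';       /* code point U+0041, uppercase A */
        char16_t han = u'\u6C49';  /* code point U+6C49, the character "Han" */

        printf("U+%04X\n", (unsigned)a);    /* prints U+0041 */
        printf("U+%04X\n", (unsigned)han);  /* prints U+6C49 */
        return 0;
    }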
4, code page (codepage): Unicode is a worldwide unified standard, so a text encoded in Unicode can contain Chinese, Japanese, Arabic, and so on at the same time and display correctly on any system. The various ANSI encodings, however, are incompatible with one another, so something is needed to record the mapping between each ANSI encoding and Unicode (and therefore, indirectly, between the different ANSI encodings); that something is the code page. For example, the code page for simplified Chinese is 936 (CP_936, the default code page of a Chinese Windows system), and this is exactly what the first parameter of MultiByteToWideChar in the Windows API means. Without knowing the code page, the system has no idea how to convert between encodings.
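As a concrete illustration of that code-page parameter, here is a minimal sketch assuming a Windows toolchain; the GBK bytes 0xBA 0xBA for "汉" are taken from the standard code-page table:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* "汉" (Han) encoded in GBK/GB2312 is the two bytes 0xBA 0xBA. */
        const char gbk[] = { (char)0xBA, (char)0xBA, 0 };
        wchar_t wide[8] = { 0 };

        /* The first parameter is the code page: 936 = simplified Chinese ANSI. */
        int n = MultiByteToWideChar(936, 0, gbk, -1, wide, 8);
        if (n > 0)
            printf("U+%04X\n", (unsigned)wide[0]);  /* expected: U+6C49 */
        return 0;
    }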
5, SBCS (single-byte character set), MBCS (multi-byte character set) and the wide character set correspond roughly to the ASCII, ANSI and Unicode encodings described above. (DBCS, the double-byte character set, is the most common form of MBCS.)
6, common Chinese encodings: GB2312 (CP_20936) -> GBK (CP_936) -> GB18030 (CP_54936). The three are backward compatible: each later standard is a superset of the earlier one, so GB18030 contains every character of GBK, which in turn contains every character of GB2312. In 2000, GB18030 replaced GBK as the official national standard.
7, UCS (Unicode Character Set): UCS-2 uses 2 bytes to represent one character, and UCS-4 uses 4 bytes per character. In everyday work we almost always deal with UCS-2.
8, UTF (UCS Transformation Format): UCS only specifies how characters are numbered; it says nothing about how the resulting code points are transmitted or stored. UTF specifies how many bytes are used to store each code point. UTF-7, UTF-8 and UTF-16 are the common transformation formats. A UTF-8 encoding is not numerically identical to the Unicode code point, but the two can be converted into each other purely by computation, unlike the conversion between ANSI and Unicode, which requires a manually defined mapping table. UTF-16 corresponds exactly to UCS-2 and can also represent part of UCS-4 (via surrogate pairs); UTF-32 corresponds exactly to UCS-4 but is rarely used.
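To make "convertible by computation" concrete, here is a small sketch that derives the UTF-8 bytes of a code point purely with bit arithmetic (it deliberately ignores code points above U+FFFF and surrogates to stay short):

    #include <stdio.h>

    /* Encode one Unicode code point (up to U+FFFF here) as UTF-8 bytes.
       Returns the number of bytes written to out. */
    static int utf8_encode(unsigned cp, unsigned char out[3]) {
        if (cp < 0x80) {               /* 1 byte:  0xxxxxxx */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {       /* 2 bytes: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (cp >> 6));
            out[1] = (unsigned char)(0x80 | (cp & 0x3F));
            return 2;
        } else {                       /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xE0 | (cp >> 12));
            out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (cp & 0x3F));
            return 3;
        }
    }

    int main(void) {
        unsigned char buf[3];
        int i, n = utf8_encode(0x6C49, buf);  /* "汉", U+6C49 */
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);          /* prints E6 B1 89 */
        printf("\n");
        return 0;
    }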
9, UTF-8 is compatible with ASCII: English letters take 1 byte and Chinese characters usually take 3 bytes. UTF-16 stores every (BMP) character in 2 bytes, and for these characters the stored value equals the Unicode code point. UTF-16 further splits into UTF-16LE (little endian) and UTF-16BE (big endian). Take the letter 'A' (U+0041): stored as UTF-8 it is the single byte 0x41; as UTF-16LE it is 0x41 0x00 (low-order byte first); as UTF-16BE it is 0x00 0x41 (high-order byte first). These are exactly the encoding choices Notepad offers when saving a file.
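The little-endian/big-endian difference can be reproduced in a few lines (a sketch for BMP code points only):

    #include <stdio.h>

    int main(void) {
        unsigned cp = 0x0041;  /* the letter 'A' */
        unsigned char le[2] = { (unsigned char)(cp & 0xFF), (unsigned char)(cp >> 8) };
        unsigned char be[2] = { (unsigned char)(cp >> 8), (unsigned char)(cp & 0xFF) };

        printf("UTF-16LE: %02X %02X\n", le[0], le[1]);  /* 41 00 */
        printf("UTF-16BE: %02X %02X\n", be[0], be[1]);  /* 00 41 */
        return 0;
    }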
10, BOM (byte order mark): UTF-8, UTF-16LE and UTF-16BE, discussed above, are all Unicode encodings, but even if a system knows that a text file is Unicode, it still cannot tell which of these encodings was used. Hence the convention of inserting a few marker bytes at the beginning of the file to identify the encoding: the BOM for UTF-8 is 0xEF 0xBB 0xBF, for UTF-16LE it is 0xFF 0xFE, and for UTF-16BE it is 0xFE 0xFF. A BOM is not strictly necessary; it only helps a program detect the encoding automatically. If we select the encoding manually (as with ANSI), a file without a BOM can still be displayed correctly. Conversely, when a program opens a text file it usually reads these first few bytes to determine the encoding, as in the sketch below.
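A minimal sketch of the kind of check a program makes when it opens a file (the file name sample.txt is only a placeholder):

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("sample.txt", "rb");
        if (!f) return 1;

        unsigned char b[3] = { 0 };
        size_t n = fread(b, 1, 3, f);  /* read the first bytes of the file */
        fclose(f);

        if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            puts("UTF-8 with BOM");
        else if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            puts("UTF-16LE");
        else if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            puts("UTF-16BE");
        else
            puts("no BOM (could be ANSI, or UTF-8 without BOM)");
        return 0;
    }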