Jun Da's WebCentral
Chinese text computing
(This is the 1998 version. An updated 2004 version is now available)
Comparison of three GB2312-80 tables in the public domain
(Last updated: 2000-10-22)
1. The confusion
The GB2312-80 standard is the most popular encoding scheme supported by software applications developed or localized for users of simplified Chinese. The standard is said to enumerate 6763 distinct simplified Chinese characters (cf. Lunde). However, several GB2312 tables in the public domain give different numbers of characters. This note documents where they differ.
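The figure of 6763 can be checked by walking the code grid. The following is a minimal sketch in Python, assuming the usual layout in which hanzi occupy rows 16 to 87 of a 94-cell grid, the last five cells of row 55 (D7FA to D7FE) are unassigned, and the internal code is the row and cell number each offset by 0xA0. The function and constant names are chosen for illustration only.

```python
# Count the hanzi code points of GB2312-80 by walking the row/cell grid.
FIRST_ROW, LAST_ROW = 16, 87   # rows 16-55: level 1, rows 56-87: level 2
CELLS_PER_ROW = 94
# Assumption stated above: cells 90-94 of row 55 (D7FA-D7FE) are unassigned.
UNASSIGNED = {(55, cell) for cell in range(90, 95)}


def hanzi_codes():
    """Return the internal codes (as hex strings) of all assigned hanzi cells."""
    codes = []
    for row in range(FIRST_ROW, LAST_ROW + 1):
        for cell in range(1, CELLS_PER_ROW + 1):
            if (row, cell) in UNASSIGNED:
                continue
            # Internal code: each of row and cell is offset by 0xA0.
            codes.append(f"{0xA0 + row:02X}{0xA0 + cell:02X}")
    return codes


codes = hanzi_codes()
print(len(codes))            # 6763
print(codes[0], codes[-1])   # B0A1 F7FE
```

A table that reports more than 6763 hanzi therefore either repeats a character or assigns one to a cell outside this range.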
Three GB2312-80 tables can be found on the internet, which include:
Examination of the documents suggests that the three tables were compiled by different (groups of) people. More information about the three documents can be found at:
In the following table, 'Local copies' contains copies of the documents from their original sites. (They are provided here simply for fast access.) 'Character lists' contains lists of hanzi (one hanzi per line) generated from the original documents. 'Sorted lists' contains sorted lists of the unique hanzi in the corresponding character lists. In the last two columns, numbers in brackets indicate the number of hanzi found in each document.
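A character list and its sorted list can be produced along the following lines. This is only a sketch, not the procedure actually used for the lists on this page: it assumes the table has already been read into a Python string, and that "hanzi" means any character in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). The sample text is a made-up excerpt.

```python
def character_list(text):
    """Return every hanzi in text, in order of appearance (duplicates kept)."""
    # Assumption: a hanzi is any character in U+4E00..U+9FFF.
    return [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]


def sorted_list(chars):
    """Return the unique hanzi, sorted by Unicode code point."""
    return sorted(set(chars))


sample = "B1C8 比\nDFC1 比\nB2B8 哺\n"   # made-up excerpt, not a real table
chars = character_list(sample)
unique = sorted_list(chars)

print("\n".join(chars))                   # one hanzi per line
print(f"({len(chars)}) hanzi in total, ({len(unique)}) unique")
```

A gap between the two counts shows that a table lists the same character under more than one code.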
The following is a list of characters that may demonstrate how the three tables differ in the number of unique characters. Column 1 is the internal code given in the three tables. Column 2 is from the IFA2312 table, Column 3 the IFB2312 table and Column 4 the HY2312 table.
Code IFA IFB HY
B1C8 比 比 比
DFC1 比 吡
B2B8 哺 哺 哺
DFB2 哺 卟
BAF3 后 后 后
E1E1 后 後
BCEE 硷 碱
BCEF 硷 硷 硷
C0F5 栗 栗 栗
C0FC 栗 傈
C0FA 历 历 历
F0DF 历 疬
C3B4 么 么 么
F7E1 么 麽
C3B8 酶 酶
C4F2 尿 尿 尿
EBE5 尿 脲
C7A4 千 扦
C7A5 千 钎
C7A7 千 千 千
C8FD 三 三 三
C8FE 三 叁
C9CA 商 墒
C9CC 商 商 商
CCAA 太 酞
CCAB 太 太 太
CFC6 掀 掀 掀
CFC7 掀 锨
D3DA 于 于 于
ECB6 于 於
D3E0 余 余 余
E2C5 余 馀
DFB8 吒 吒 吒
DFE5 吒 咤
E6B1 吒 姹
DAA1 淞 凇
E4C1 淞 淞 淞
E7D6 缰 缰
D7FE 镕
EED0 钸 钚
EEDF 钸 钸 钸
From the above list, it seems that IFB is perhaps the most faithful to the original GB2312 standard, which includes quite a few traditional Chinese characters. In comparison, IFA converts those traditional characters into their simplified counterparts. The HY list simply omits the code points that were used for traditional characters (which were included in the original GB2312 standard specification).
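One way to test which entries follow the standard is to compare each (code, character) pair against an independent GB2312 decoder. The sketch below uses Python's built-in `gb2312` codec as the reference, which is an assumption: the codec is taken to reflect GB2312-80, but it is not the standard document itself. The helper names `decode_gb` and `mismatches` are made up for illustration, and the sample pairs are taken from the list above.

```python
def decode_gb(code_hex):
    """Decode a two-byte internal code such as 'B0A1'.

    Returns None when the reference codec has no character for the code.
    """
    try:
        return bytes.fromhex(code_hex).decode("gb2312")
    except UnicodeDecodeError:
        return None


def mismatches(entries):
    """Return (code, listed, decoded) for entries that disagree with the codec."""
    result = []
    for code, listed in entries:
        decoded = decode_gb(code)
        if decoded != listed:
            result.append((code, listed, decoded))
    return result


# (code, character) pairs as they appear in the list above.
entries = [("B1C8", "比"), ("DFC1", "比"), ("DFC1", "吡")]
for code, listed, decoded in mismatches(entries):
    print(code, "is listed as", listed, "but the codec gives", decoded)
```

Running every row of a table through such a check gives a rough measure of how closely that table follows the reference.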
5. Suggestions for using the three tables
If you are concerned about backward compatibility, it is best to use a combined list based on the three tables. A simple run of 'sort | uniq' under a Unix shell will get it done. However, if you do not care about the compatibility issue, the Haiyan list will be your best choice.
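For readers without a Unix shell, the same merge can be sketched in Python. The three lists below are made-up miniatures labelled A, B and C, not the real tables, and the helper names are chosen for illustration. The second function also records which list each character came from, which 'sort | uniq' alone does not show.

```python
def merge_lists(*lists):
    """Union of all lists, duplicates removed, sorted by Unicode code point."""
    merged = set()
    for chars in lists:
        merged.update(chars)
    return sorted(merged)


def sources(named_lists):
    """Map each character to the names of the lists that contain it."""
    found = {}
    for name, chars in named_lists.items():
        for ch in set(chars):
            found.setdefault(ch, []).append(name)
    return found


# Made-up miniature lists for illustration; not the real tables.
lists = {
    "A": ["后", "于"],
    "B": ["后", "後", "于", "於"],
    "C": ["后", "于"],
}

combined = merge_lists(*lists.values())
print("\n".join(combined))

for ch, names in sorted(sources(lists).items()):
    print(ch, ",".join(names))
```

The sort order here is by Unicode code point, whereas the order produced by the shell's sort depends on the locale and on the encoding of the input files. The set of characters in the result is the same either way.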
Copyright. 1998-2000. Jun Da. email@example.com