Jun Da's WebCentral


 

Chinese text computing

(This is the 1998 version. An updated 2004 version is now available)

 


Comparison of three GB2312-80 tables in the public domain

(Last updated: 2000-10-22)

1. The confusion

The GB2312-80 standard is the most popular encoding scheme supported by software applications developed or localized for users of simplified Chinese. The standard is said to enumerate 6763 distinct simplified Chinese characters (cf. Lunde). However, several GB2312 tables in the public domain give different numbers of characters. This note documents where they differ.

2. Sources

Three GB2312-80 tables can be found on the internet. They are referred to below as IFA2312, IFB2312 and HY2312.

Examination of the documents suggests that the three tables were compiled by different (groups of) people. More information about the three documents can be found at:

3. Data

In the following table, 'Local copies' contains local copies of the documents from their original sites. (They are provided here simply for fast access.) 'Character lists' contains lists of hanzi (with one hanzi per line) generated from their original documents. 'Sorted lists' are lists of unique hanzi from their corresponding lists. In the last two columns, numbers in brackets indicate the number of hanzi found in each document.

GB2312 Comparison Table

	Local copies	Character lists	Sorted lists
IFA2312	ifatable.txt	ifa2312list.txt (6768)	ifa2312lista.txt (6748)
IFB2312	ifbtable.txt	ifb2312list.txt (6768)	ifb2312lista.txt (6768)
HY2312	hytable.txt	hy2312list.txt (6746)	hy2312lista.txt (6746)
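
The bracketed counts can be reproduced from any one-hanzi-per-line list. The following is a minimal Python sketch, assuming the list has already been read into a string. The function name `count_hanzi` and the sample data are hypothetical and are not taken from the original files.

```python
def count_hanzi(text):
    """Return (total, unique) counts for a one-hanzi-per-line list."""
    # Strip whitespace and skip blank lines so stray newlines are not counted.
    entries = [line.strip() for line in text.splitlines() if line.strip()]
    return len(entries), len(set(entries))


# Hypothetical sample: the same hanzi appears twice, as it would if a
# table listed it under two different internal codes.
sample = "比\n比\n哺\n后\n"
total, unique = count_hanzi(sample)
print(total, unique)
```

A gap between the two counts, such as 6768 versus 6748 for IFA2312, indicates that some hanzi occur more than once in the character list.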

4. Comparison

The following is a list of characters that may demonstrate how the three tables differ in the number of unique characters. Column 1 is the internal code given in the three tables. Column 2 is from the IFA2312 table, Column 3 the IFB2312 table and Column 4 the HY2312 table.

	IFA	IFB	HY

B1C8	比	比	比

DFC1	比	吡	

B2B8	哺	哺	哺

DFB2	哺	卟	

BAF3	后	后	后

E1E1	后	後	

BCEE	硷	碱	

BCEF	硷	硷	硷

C0F5	栗	栗	栗

C0FC	栗	傈	

C0FA	历	历	历

F0DF	历	疬	

C3B4	么	么	么

F7E1	么	麽	

C3B8	酶	酶	

C4F2	尿	尿	尿

EBE5	尿	脲	

C7A4	千	扦	

C7A5	千	钎	

C7A7	千	千	千

C8FD	三	三	三

C8FE	三	叁	

C9CA	商	墒	

C9CC	商	商	商

CCAA	太	酞	

CCAB	太	太	太

CFC6	掀	掀	掀

CFC7	掀	锨	

D3DA	于	于	于

ECB6	于	於	

D3E0	余	余	余

E2C5	余	馀	

DFB8	吒	吒	吒

DFE5	吒	咤	

E6B1	吒	姹	

DAA1	淞	凇	

E4C1	淞	淞	淞

E7D6	缰	缰	

D7FE	镕		

EED0	钸	钚	

EEDF	钸	钸	钸
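
The internal codes in column 1 of the list above are two-byte values written in hexadecimal. As a rough cross-check, a code can be decoded with Python's built-in `gb2312` codec. This is only a sketch: the helper name `decode_gb` is hypothetical, and the codec reflects Python's own mapping, which is not guaranteed to match any of the three tables.

```python
def decode_gb(hex_code):
    """Decode a four-digit hex internal code such as 'B1C8' to one character."""
    # bytes.fromhex turns 'B1C8' into the two bytes 0xB1 0xC8.
    return bytes.fromhex(hex_code).decode("gb2312")


print(decode_gb("B1C8"))  # 比
```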

From the above list it seems that IFB is perhaps the most faithful to the original GB2312 standard, which includes quite a few traditional Chinese characters. In comparison, IFA converts those traditional characters into their simplified counterparts. The HY list simply omits the code points that the original GB2312 standard specification used for traditional characters.
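
Differences of this kind can be found mechanically. The sketch below assumes each table has been loaded into a dictionary that maps an internal code to a character. The function names `duplicated_chars` and `disagreements` are hypothetical, and the data is a four-row excerpt of the list above rather than the full tables.

```python
from collections import defaultdict


def duplicated_chars(table):
    """Map each character assigned to more than one code to its sorted codes."""
    codes_by_char = defaultdict(list)
    for code, char in table.items():
        codes_by_char[char].append(code)
    return {c: sorted(codes) for c, codes in codes_by_char.items() if len(codes) > 1}


def disagreements(a, b):
    """Return the codes present in both tables whose characters differ."""
    shared = a.keys() & b.keys()
    return {code: (a[code], b[code]) for code in shared if a[code] != b[code]}


# Excerpt of the comparison list (rows B1C8, DFC1, C8FD, C8FE).
ifa = {"B1C8": "比", "DFC1": "比", "C8FD": "三", "C8FE": "三"}
ifb = {"B1C8": "比", "DFC1": "吡", "C8FD": "三", "C8FE": "叁"}

print(duplicated_chars(ifa))
print(disagreements(ifa, ifb))
```

On this excerpt, IFA assigns 比 and 三 to two codes each, while IFB has no duplicates and differs from IFA at DFC1 and C8FE.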

5. Suggestions for using the three tables

If you are concerned about backward compatibility, it is best to use a merged list based on the three tables. A simple run of 'sort | uniq' under a Unix shell will get it done. However, if you do not care about the compatibility issue, the Haiyan (HY) list will be your best choice.
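
For readers without a Unix shell, the same merge can be sketched in Python. The function name `merge_lists` and the short sample lists are hypothetical. Note that Python sorts by Unicode code point, so the order may differ from what 'sort' produces under a given locale.

```python
def merge_lists(*lists):
    """Union several hanzi lists and return the result sorted by code point."""
    merged = set()
    for chars in lists:
        merged.update(chars)
    return sorted(merged)


# Hypothetical short excerpts standing in for the three character lists.
ifa = ["比", "哺", "后"]
ifb = ["比", "吡", "後"]
hy = ["比", "哺"]

for hanzi in merge_lists(ifa, ifb, hy):
    print(hanzi)
```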

 

 


Copyright. 1998-2000. Jun Da. jda@mtsu.edu