Jun Da's WebCentral



Chinese text computing

(This is the 1998 version. An updated 2004 version is now available)



Hanzi (Chinese character) segmenter

(Last modified: 2000-10-22)

The following is a tiny Perl script that generates a list of hanzi (i.e., Chinese characters) from a GB-encoded running text. Note that only hanzi are output (i.e., two-byte codes in the code space beginning at B0A1); everything else is discarded. The script is presented here for illustration purposes only.

#!/usr/local/bin/perl

# By Da Jun


# For comments and suggestions, please contact me at:

#		jun@biosci.utexas.edu

# Last modified: Dec. 8, 1999

# Use at your own risk :-). Freely distributable as long as this notice
# is kept intact.

# This script segments a plain GB-encoded text file (which may contain
# other ASCII codes) into a list of characters with one character per line.
# All other codes are discarded. Output is sent to STDOUT, with each line
# containing one character followed by \n (newline).


# To run the script on a Unix system, do the following:

# 1) Save it as a text file (e.g., name it 'seggb').
# 2) Find out where the Perl interpreter is on your system. It is
#    usually in the /usr/local/bin folder (which is the default used
#    here) or /usr/bin (on some Unix systems). The shell command "whereis perl"
#    will tell you where the Perl interpreter is on your system.
# 3) Make the script executable by issuing the following command at the prompt:

#    	chmod u+x seggb

# Now you are ready to run the script.

# At the prompt, issue the following command (assuming you saved the script as
# 'seggb'):

#	seggb myGBtextfile

# in which 'myGBtextfile' is the name of any GB text file you want to segment.
# Note that several files can be processed at the same time, e.g.,

# 	seggb file1 file2 file3 ...

# The script can also take input from a pipe. Suppose we have a text file
# 'fhy.txt'. We can also use the script in the following (dummy) way:

#	cat fhy.txt | seggb

while ( $line = <> ) {
  # First, we pre-process the input line to get rid of a few known whitespace
  # and control characters that may be hidden in the text file.
  $line =~ s/[ \n\r\f\t]//g;
  # Second, we want to make sure that the line is not empty (otherwise there
  # will be nothing to process). Note that we use the line length as the test.
  # We could instead compare $line with the empty string (''), but in practice
  # testing the string length has worked better on texts that contain a
  # mixture of two-byte and one-byte codes. I don't know why this is the case.
  # If the line is not empty,
  if ( length($line) > 0 ) {
  # we do the following:
    while ( length($line) > 0 ) {
    # 1) Get rid of any ASCII code(s) that may be at the beginning of the line.
      $line =~ s/^[\x00-\x7F]+//;
    # 2) Take the first two bytes of $line:
      $mychar = substr($line, 0, 2);
    # 3) If the two bytes stored in $mychar are GB-encoded hanzi, we send them
    #    to STDOUT. Note that the string in the quotes is the two-byte code
    #    B0A1, written with hex escapes.
      if ( length($mychar) == 2 && $mychar ge "\xB0\xA1" ) { print "$mychar\n"; }
    # 4) Get rid of the first two bytes and process the next two bytes in the
    #    line. A single leftover byte at the end of the line is also dropped,
    #    so the loop always terminates.
      $line =~ s/^..?//s;
    }	# End of the inner while starting from 1)
  }	# End of the if at the top which tests that the line is not empty.
} 	# End of the top while loop
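
As a variation on the script above, the same segmentation can be sketched with one tokenizing regular expression. This is only an illustration under stated assumptions: the sub name hanzi_in is made up here, and, unlike the "B0A1 and above" test in the script, it keeps only pairs inside the GB2312 hanzi block (lead byte B0 to F7, trail byte A1 to FE). The sample bytes C4E3 BAC3 are the GB codes for 你好; the result is printed in hex so it can be read on any terminal.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the list of
# two-byte GB hanzi found in a byte string.
sub hanzi_in {
    my ($bytes) = @_;
    my @hanzi;
    # Consume the string one token at a time, anchored with \G, so that
    # two-byte codes stay aligned: a token is either one ASCII byte or a
    # high byte plus whatever byte follows it. A lone high byte at the end
    # matches neither alternative, which simply ends the loop.
    while ( $bytes =~ /\G(?:[\x00-\x7F]|([\x80-\xFF][\x00-\xFF]))/g ) {
        my $pair = $1;
        next unless defined $pair;    # an ASCII byte: discard it
        # Assumption: hanzi are the GB2312 pairs B0A1..F7FE.
        push @hanzi, $pair if $pair =~ /\A[\xB0-\xF7][\xA1-\xFE]\z/;
    }
    return @hanzi;
}

# "Hi " + C4E3 BAC3 + "!": the ASCII is dropped, the two hanzi are kept.
print unpack( "H*", $_ ), "\n" for hanzi_in("Hi \xC4\xE3\xBA\xC3!");
# prints:
# c4e3
# bac3
```

To use this on files, one could call hanzi_in($line) inside a while ( $line = <> ) loop and print each returned pair followed by a newline, as the script above does.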

Copyright. 1998-2000. Jun Da. jda@mtsu.edu