Hanzi (Chinese character) segmenter

(Last modified: 2000-10-22)

The following is a tiny Perl script that generates a list of hanzi (i.e., Chinese characters) from GB-encoded running text. Note that only hanzi are output (i.e., two-byte codes from B0A1 upward). The script is presented here for illustration purposes only.


#!/usr/local/bin/perl
# By Da Jun
#
# For comments and suggestions, please contact me at:
#
#		jun@lingua.mtsu.edu
#
# Last modified: Dec. 8, 1999
#
# Use at your own risk :-). Freely distributable as long as this notice
# is kept intact.
#
# This script segments a plain GB-encoded text file (which may also contain
# ASCII codes) into a list of characters, one character per line.
# All other codes are discarded. Output is written to STDOUT, with each line
# containing one character followed by \n (newline).
#
# To run the script on a Unix system, do the following:
#
# 1) Save it as a text file (e.g., name it 'seggb').
# 2) Find out where the Perl interpreter is on your system. It is
#    usually in /usr/local/bin (the default used here) or /usr/bin
#    (on some Unix systems). The shell command "whereis perl"
#    will tell you where the Perl interpreter is on your system.
#    If it is somewhere else, change the first line of this script to match.
# 3) Make the script executable by issuing the following command at the prompt:
#
#    	chmod u+x seggb
#
# Now you are ready to run the script.
#
# At the prompt, issue the following command (assuming you saved the script as
# 'seggb'; type './seggb' if the current directory is not in your PATH):
#
#	seggb myGBtextfile
#
# in which 'myGBtextfile' is the name of any GB text file you want to segment.
# Note that several files can be processed in one run, e.g.,
#
# 	seggb file1 file2 file3 ...
#
# The script can also take input from a pipe. Suppose we have a text file
# 'fhy.txt'. We can also use the script in the following (dummy) way:
#
#	cat fhy.txt | seggb
#
# END OF NOTES
while ( $line = <> ) {
  # First, we pre-process the input line to get rid of a few known control
  # characters that may be hidden in the text file.
  $line =~ s/[ \n\r\f\t]//g;
  # Second, we want to make sure that the line is not empty (otherwise
  # there is nothing to process). Note that we test the line length with a
  # numeric comparison. We could equally test with "if ($line ne '')". What
  # must be avoided is a plain truth test such as "if ($line)", because Perl
  # treats the string "0" as false, so a line consisting of the single
  # character 0 would be taken as empty.
  # If the line is not empty,
  if ( length($line) > 0 ) {
  # we do the following:

    while ( length($line) > 0 ) {
    # 1) Get rid of any ASCII code(s) that may be at the beginning of the line.
    #    Every byte in the range 00-7F is removed, including '~' and DEL.
      $line =~ s/^[\x00-\x7F]+//;
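    # 2) Take the first two bytes of $line:
      $mychar = substr($line, 0, 2);
    # 3) If the two bytes stored in $mychar are a GB-encoded hanzi, we send
    #    them to STDOUT. The string we compare against is the byte pair B0 A1,
    #    the first hanzi code in GB. It is written with hex escapes so that
    #    the script still works after being saved or copied in another encoding.
      if ( length($mychar) == 2 && $mychar ge "\xB0\xA1" ) { print "$mychar\n"; }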
    # 4) Get rid of the first two bytes and process the next two bytes in the
    #    line. The pattern also removes a single stray byte left at the end of
    #    the line, so the loop always terminates.
      $line =~ s/^..?//s;
    }	# End of the inner while starting from 1)
  }	# End of the if at the top which tests that the line is not empty.
}	# End of the top while loop
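
The test in step 3 accepts any byte pair that sorts at or after B0A1, which is enough for clean GB text but may let malformed pairs through. A stricter variant could check explicit byte ranges. The sketch below is an illustration only, not part of the original script. It assumes the GB 2312 layout, in which hanzi have a lead byte in B0-F7 and a trail byte in A1-FE, and the helper name gb_hanzi is hypothetical.

```perl
#!/usr/bin/perl
# Sketch only: collect GB 2312 hanzi from a byte string using explicit ranges.
# gb_hanzi() is a hypothetical helper name, not part of the original script.
use strict;
use warnings;

sub gb_hanzi {
    my ($bytes) = @_;
    my @hanzi;
    # At each position try, in order:
    #   1) a hanzi pair (lead byte B0-F7, trail byte A1-FE), captured in $1;
    #   2) any other two-byte GB pair (symbols, punctuation), skipped as a unit;
    #   3) any single byte (ASCII or a stray high byte), skipped.
    while ( $bytes =~ /\G(?:([\xB0-\xF7][\xA1-\xFE])|[\xA1-\xFE][\xA1-\xFE]|.)/gs ) {
        push @hanzi, $1 if defined $1;
    }
    return @hanzi;
}

# Sample input: "ab", hanzi B0A1, GB punctuation A1A3, "c", hanzi D6D0.
my @found = gb_hanzi("ab\xB0\xA1\xA1\xA3c\xD6\xD0");
printf "%d hanzi found\n", scalar @found;
```

Skipping non-hanzi GB pairs as a unit keeps the scan aligned on two-byte boundaries, and consuming any other byte singly means a stray byte cannot stall the loop.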