Hanzi (Chinese character) segmenter
(Last modified: 2000-10-22)

The following is a tiny Perl script which generates a list of hanzi (i.e., Chinese characters) from a GB-encoded running text. Note that only hanzi will be displayed (i.e., the code space beginning at B0A1). The script is presented here for illustration purposes only.

#!/usr/local/bin/perl
# By Da Jun
#
# For comments and suggestions, please contact me at:
#
# jun@lingua.mtsu.edu
#
# Last modified: Dec. 8, 1999
#
# Use at your own risk :-). Freely distributable as long as this notice
# is intact.
#
# This script segments a plain GB encoded text file (which may contain
# other ascii codes) into a list of characters with one character per line.
# All other codes are discarded. Output is dumped to STDOUT with each line
# containing one character followed by \n (newline).
#
# To run the script on a unix system, do the following:
#
# 1) Save it as a text file (e.g. name it as 'seggb');
# 2) Find out where the Perl Interpreter is on your system. It is
#    usually in the /usr/local/bin folder (which is the default used
#    here) or /usr/bin (on some unix systems). The shell command "whereis perl"
#    will tell you where the Perl interpreter is on your system.
# 3) Make the script executable by issuing the following command at the prompt:
#
#       chmod u+x seggb
#
# Now you are ready to run the script.
#
# At the prompt, issue the following command (assuming you saved the script as
# 'seggb'):
#
#       seggb myGBtextfile
#
# in which 'myGBtextfile' is the name of any GB text file you want to segment.
# Note that several files can be processed at the same time, e.g.,
#
#       seggb file1 file2 file3 ...
#
# The script can also take input from the I/O pipe. Suppose we have a textfile
# 'fhy.txt'. We can also use the script in the following (dummy) way:
#
#       cat fhy.txt | seggb
#
# END OF NOTES

while ( $line = <> ) {

    # First, we pre-process the input line to get rid of a few known control
    # characters that may be hidden in the text file.
    $line =~ s/[ \n\r\f\t]//g;

    # Second, we want to make sure that the line is not empty (Otherwise there'll
    # be nothing to process). Note that we use line length as a test. We could
    # test if the string is empty or not by using "if ($line eq '')". But it seems
    # that using string length is better in dealing with texts that may contain a
    # mixture of both two-byte and one-byte codes. I don't know why it is the
    # case but this is what I found out in practice.

    # If the line is not empty,
    if ( length($line) > 0 ) {

        # we do the following:
        while ( length($line) > 0 ) {

            # 1) Get rid of any ascii code(s) that may be at the beginning of the line.
            $line =~ s/^[\x00-\x7F]+//;

            # 2) Take the first two bytes of $line:
            $mychar = substr($line, 0, 2);

            # 3) If the two bytes stored in $mychar are GB-encoded, we send them out
            #    to STDOUT. Note that the string in the quotes is the two-byte
            #    code B0A1, written as hex escapes. The length test keeps a stray
            #    single byte at the end of the line from being printed.
            if ( length($mychar) == 2 && $mychar ge "\xB0\xA1" ) {
                print "$mychar\n";
            }

            # 4) Get rid of the first two and process the next two bytes in the line.
            #    (The second byte is optional so that a stray single byte at the
            #    end of the line is also removed and the loop terminates.)
            $line =~ s/^..?//s;

        } # End of the inner while starting from 1)
    } # End of the if at the top which tests that the line is not empty.
} # End of the top while loop
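
The script above accepts any two-byte sequence that compares at or above B0A1, so it tests only the lower bound of the hanzi code space. As a hedged variation, the sketch below also checks the upper bound and the trail byte. It assumes the GB2312 layout in which hanzi occupy lead bytes B0-F7 and trail bytes A1-FE. The helper name extract_hanzi is hypothetical and is not part of the original script. The demo prints hex codes rather than raw bytes so that the output does not depend on the terminal encoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return the GB2312
# hanzi found in a byte string, in order of appearance.
# Assumption: hanzi have a lead byte in B0-F7 and a trail byte in A1-FE.
sub extract_hanzi {
    my ($bytes) = @_;
    my @hanzi;

    # Walk the string one token at a time so that two-byte alignment is kept:
    #   1) a hanzi (captured in $1);
    #   2) any other two-byte GB2312 code, e.g. punctuation (skipped);
    #   3) any single byte, e.g. ascii or a stray high byte (skipped).
    while ( $bytes =~ /\G(?:([\xB0-\xF7][\xA1-\xFE])|[\xA1-\xFE][\xA1-\xFE]|[\x00-\xFF])/g ) {
        push @hanzi, $1 if defined $1;
    }
    return @hanzi;
}

# Demo input: ascii "A", hanzi B0A1, GB2312 punctuation A1A2, hanzi D6D0, newline.
my @found = extract_hanzi("A\xB0\xA1\xA1\xA2\xD6\xD0\n");

# Print each hanzi as an uppercase hex code.
print join( ' ', map { uc( unpack( 'H*', $_ ) ) } @found ), "\n";   # prints "B0A1 D6D0"
```

To print the characters themselves with one character per line, as the original script does, replace the last statement with: print "$_\n" for @found;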