Hanzi (Chinese character) segmenter 2
(Last modified: 2000-10-22)

The following is a small C program that generates a list of hanzi (Chinese characters) from GB-encoded running text. Note that only hanzi are displayed (i.e., two-byte codes in the code space beginning at B0A1); the GB symbols and punctuation encoded below B0A1 are skipped. The program is presented here for illustration purposes only. It uses the same idea to identify hanzi as described in the Perl segmenter.


 

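#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[]) {
    FILE *fp;
    unsigned char c1;
    unsigned char c2;
    char mystring[3];

    if (argc < 2)
    {fprintf(stderr, "usage: %s gb-encoded-file\n", argv[0]); exit(1);}
    if ((fp = fopen(argv[1], "rb"))==NULL)
    {printf("cannot open file\n"); exit(1);}
    while (fread(&c1,sizeof(char),1,fp) > 0) {
        if (c1 >= 0x80) { /* high bit set: first byte of a two-byte GB code */
            if (fread(&c2,sizeof(char),1,fp) == 0)
                break; /* file ends in the middle of a two-byte code */
            mystring[0] = (char) c1;
            mystring[1] = (char) c2;
            mystring[2] = '\0';
            /* "\xB0\xA1" is the first hanzi code, B0A1; strcmp compares bytes as unsigned char */
            if ( strcmp(mystring, "\xB0\xA1") >=0 ) {
                printf("%s\n", mystring);
            } /* end of the immediate if */
        } /* end of the outer if which tests two-byte encoding */
        /* else { printf("%c\n", c1); } */ /* non hanzi output if turned on */
    } /* end of while */
    fclose(fp);
    return 0;
} /* end of main */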
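
The hanzi test in the program above can also be written without building a temporary string. The following is a minimal sketch of that idea, not part of the original program: the helper names is_hanzi_pair and print_hanzi are hypothetical, and the sketch assumes the input is already in memory as a byte buffer. The byte comparison is intended to accept exactly the same two-byte codes as the strcmp test against B0A1.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: nonzero if the two-byte code (b1, b2) sorts at or
   after B0A1, mirroring the strcmp comparison in the program above. */
static int is_hanzi_pair(unsigned char b1, unsigned char b2)
{
    return b1 > 0xB0 || (b1 == 0xB0 && b2 >= 0xA1);
}

/* Hypothetical helper: write each hanzi found in buf to out, one per line,
   and return how many were written. */
static size_t print_hanzi(const unsigned char *buf, size_t len, FILE *out)
{
    size_t i = 0, count = 0;

    while (i < len) {
        if (buf[i] < 0x80) {      /* single-byte (ASCII) character */
            i++;
            continue;
        }
        if (i + 1 >= len)         /* buffer ends inside a two-byte code */
            break;
        if (is_hanzi_pair(buf[i], buf[i + 1])) {
            fprintf(out, "%c%c\n", buf[i], buf[i + 1]);
            count++;
        }
        i += 2;                   /* skip both bytes of the GB code */
    }
    return count;
}
```

Working on unsigned char avoids depending on whether plain char is signed on a given compiler, which the test (int) c1 < 0 relies on. Like the original comparison, this sketch only checks that a code sorts at or after B0A1; it does not verify that the second byte falls inside the GB 2312 range.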