Jun Da's WebCentral


 

Chinese text computing

(This is the 1998 version. An updated 2004 version is now available)

 


Hanzi (Chinese character) segmenter 2

(Last modified: 2000-10-22)

The following is a small C program that generates a list of hanzi (Chinese characters) from GB-encoded running text. Note that only hanzi are displayed (i.e., characters in the code space beginning at B0A1). The program is presented here for illustration purposes only. It uses the same idea to identify hanzi as described in the Perl segmenter.


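#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[]) {
	FILE *fp;
	unsigned char c1;	/* current byte read from the file */
	unsigned char c2;	/* second byte of a two-byte GB code */
	char mystring[3];

	if (argc < 2 || (fp = fopen(argv[1], "rb")) == NULL) {
		printf("cannot open file\n");
		exit(1);
	}

	while (fread(&c1, sizeof(char), 1, fp) > 0) {
		if (c1 >= 0x80) {	/* high bit set: first byte of a two-byte GB code */
			if (fread(&c2, sizeof(char), 1, fp) == 0)
				break;	/* file ended in the middle of a character */
			mystring[0] = (char) c1;
			mystring[1] = (char) c2;
			mystring[2] = '\0';
			/* "\xB0\xA1" is the GB code B0A1, where the hanzi code space begins */
			if ( strcmp(mystring, "\xB0\xA1") >= 0 ) {
				printf("%s\n", mystring);
			} /* end of the immediate if */
		} /* end of the outer if which tests two-byte encoding */
		/* else { printf("%c\n", c1); }	 */ /* non hanzi output if turned on */
	} /* end of while */

	fclose(fp);
	return 0;
} /* end of main */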
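
The program above accepts any two-byte sequence that sorts at or after B0A1. As a minimal sketch of a stricter test, assuming the input is GB2312 (where hanzi have a first byte in B0 to F7 and a second byte in A1 to FE), the check could be written as an explicit range comparison on both bytes. The helper names is_gb2312_hanzi and count_gb2312_hanzi are hypothetical and are not part of the original program; the range is rectangular and does not consult a table of assigned code points.

```c
#include <stddef.h>

/* Hypothetical helper (not in the original program): returns 1 when the
   byte pair (b1, b2) lies inside the GB2312 hanzi rows, 0 otherwise.
   Assumes GB2312: first byte 0xB0..0xF7, second byte 0xA1..0xFE. */
static int is_gb2312_hanzi(unsigned char b1, unsigned char b2)
{
    return b1 >= 0xB0 && b1 <= 0xF7 && b2 >= 0xA1 && b2 <= 0xFE;
}

/* Hypothetical helper: counts the hanzi in a GB2312-encoded buffer. */
static size_t count_gb2312_hanzi(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    size_t count = 0;

    while (i < len) {
        if (buf[i] < 0x80) {
            i++;                    /* single-byte ASCII character */
        } else if (i + 1 < len) {
            if (is_gb2312_hanzi(buf[i], buf[i + 1]))
                count++;
            i += 2;                 /* skip both bytes of the pair */
        } else {
            break;                  /* truncated pair at end of buffer */
        }
    }
    return count;
}
```

Working on unsigned char values avoids depending on whether plain char is signed on a given compiler, and checking the second byte rejects pairs such as B1 20 that a comparison against B0A1 alone would let through.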


Copyright. 1998-2000. Jun Da. jda@mtsu.edu