Jun Da's Chinese text computing page


Chinese text computing (This is the 1998 version. An updated 2004 version is now available)

Hanzi (Chinese character) segmenter 2

(Last modified: 2000-10-22)

The following is a small C program that generates a list of hanzi (Chinese characters) from GB-encoded running text. Only hanzi are printed, i.e., two-byte characters in the code space beginning at B0A1. The program is presented here for illustration purposes only. It identifies hanzi using the same idea as the Perl segmenter.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main (int argc, char *argv[]) {
	FILE *fp;
	char c1;
	char c2;
	char mystring[3];

	/* a file name is required as the first argument */
	if (argc < 2 || (fp = fopen(argv[1], "rb")) == NULL)
		{printf("cannot open file\n"); exit(1);}

	while (fread(&c1, sizeof(char), 1, fp) > 0) {
		/* high bit set: first byte of a two-byte GB character */
		if ((unsigned char) c1 >= 0x80) {
			if (fread(&c2, sizeof(char), 1, fp) == 0)
				break; /* truncated character at end of file */
			mystring[0] = c1;
			mystring[1] = c2;
			mystring[2] = '\0';
			/* "\xB0\xA1" is code B0A1, the first hanzi in GB */
			if ( strcmp(mystring, "\xB0\xA1") >= 0 ) {
				printf("%s\n", mystring);
			} /* end of the immediate if */
		} /* end of the outer if which tests two-byte encoding */
		/* else { printf("%c\n", c1); }	 */ /* non hanzi output if turned on */
	} /* end of while */

	fclose(fp);
	return 0;
} /* end of main */
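
The program above treats every two-byte sequence at or above B0A1 as hanzi. As a minimal sketch of a stricter test, assuming the input is GB2312 in its usual EUC-CN byte form (where hanzi occupy first bytes B0 to F7 and second bytes A1 to FE), the check can be written with explicit byte ranges. The function names is_gb_hanzi and count_gb_hanzi are hypothetical and are not part of the original program.

```c
#include <stddef.h>

/* Assumption: GB2312 (EUC-CN) input. Returns 1 if the byte pair lies in
   the hanzi area (first byte 0xB0-0xF7, second byte 0xA1-0xFE), else 0. */
int is_gb_hanzi(unsigned char b1, unsigned char b2) {
	return b1 >= 0xB0 && b1 <= 0xF7 && b2 >= 0xA1 && b2 <= 0xFE;
}

/* Counts the hanzi in a buffer of GB2312 bytes (hypothetical helper). */
size_t count_gb_hanzi(const unsigned char *buf, size_t len) {
	size_t i = 0, count = 0;
	while (i < len) {
		if (buf[i] < 0x80) {	/* single-byte ASCII character */
			i++;
			continue;
		}
		if (i + 1 >= len)	/* truncated two-byte character */
			break;
		if (is_gb_hanzi(buf[i], buf[i + 1]))
			count++;
		i += 2;			/* skip both bytes of the pair */
	}
	return count;
}
```

Comparing byte values directly avoids relying on whether char is signed on a given compiler, and it also rejects pairs above the hanzi area, which the string comparison would accept.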
