Tuesday, February 5, 2013

Comparing Chinese and Bengali texts

Extending multi-version documents (MVDs) to properly support languages like Chinese and Bengali, whose characters need 16 bits rather than 8, turns out to be easier than I thought. Currently the nmerge tool, which produces MVDs, works only with 8-bit bytes internally, so individual characters may be split over several bytes, as in UTF-8 encoding. Things get complicated whenever differences are detected between parts of characters. Making everything 16-bit will facilitate the comparison of texts in practically any living language and avoid such complications. The exceptions are dead languages like Phoenician or Lydian, whose characters lie outside the 16-bit range, and even those can be encoded in UTF-16 as pairs of 16-bit units. I don't have any Chinese examples, but my friends in India have provided me with some interesting Bengali texts, which I'll be using for testing.
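To make the problem concrete, here is a small Python sketch (not nmerge's actual code; the helper names `common_prefix_len` and `utf16_units` are made up for this illustration). It compares two neighbouring Bengali letters first as UTF-8 bytes and then as 16-bit units. At the byte level the two letters appear to share a two-byte prefix, so a difference is detected in the middle of a character; at the 16-bit level each letter is a single unit and the difference falls cleanly on a character boundary.

```python
def common_prefix_len(a, b):
    """Count how many leading items two sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def utf16_units(text):
    """Return the text as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


ka = "\u0995"   # BENGALI LETTER KA
kha = "\u0996"  # BENGALI LETTER KHA

# As UTF-8 each letter takes three bytes, and the first two bytes coincide.
byte_prefix = common_prefix_len(ka.encode("utf-8"), kha.encode("utf-8"))

# As 16-bit units each letter is one unit, so nothing is shared.
unit_prefix = common_prefix_len(utf16_units(ka), utf16_units(kha))

# A Phoenician letter lies outside the 16-bit range and needs two units.
phoenician_alf = "\U00010900"
alf_units = utf16_units(phoenician_alf)

print(byte_prefix)     # 2
print(unit_prefix)     # 0
print(len(alf_units))  # 2
```

The last line shows the dead-language case: the Phoenician letter still becomes valid UTF-16, but as a pair of units, so the split-character problem could in principle return for such scripts.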

5 comments:

Unknown said...

This is a great idea and great information...

desmond said...

Which language were you interested in?

সুচরিতাসু said...

Being one of the people who prepared those Bengali texts, I know that they have been encoded in UTF-8. Do you need to convert them to UTF-16? And is the conversion as easy as opening the file in Notepad and using the 'Save As' option?
I would like to see the results after you compare the set--it sounds very interesting! :)

sumaya754 said...

Although Bangladesh and China are native, there is a huge difference between Bengali and Chinese.

desmond said...

The way I have designed the new version of nmerge, a file in any encoding can be used as input, and it will be saved in that same encoding. So if your files are in UTF-8, then the MVD made from them will be in UTF-8. However, when loaded into the computer for comparison, the text will be converted to UTF-16. So basically you don't have to worry about that. Just encode it however you like.
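As a rough illustration of that round trip, here is a minimal Python sketch. It is not the nmerge implementation: the function names `load_for_comparison` and `save_in_original` are hypothetical, and it simply assumes the encoding of the input is known. The file's bytes are decoded from whatever encoding they arrived in, turned into 16-bit units for comparison, and written back out in the original encoding.

```python
def load_for_comparison(raw, encoding):
    """Decode bytes from their stored encoding into 16-bit UTF-16 units."""
    text = raw.decode(encoding)
    data = text.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def save_in_original(units, encoding):
    """Turn 16-bit units back into bytes in the original encoding."""
    data = b"".join(u.to_bytes(2, "big") for u in units)
    return data.decode("utf-16-be").encode(encoding)


# A Bengali word stored as UTF-8: 5 characters occupying 15 bytes.
raw = "বাংলা".encode("utf-8")

units = load_for_comparison(raw, "utf-8")  # 5 units, one per character
restored = save_in_original(units, "utf-8")

print(len(raw), len(units))  # 15 5
print(restored == raw)       # True
```

The stored file is never changed by the comparison step: what comes out is byte-for-byte what went in, which is why the choice of encoding is left to whoever prepares the texts.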