Tamil Text Processing


It is easy to do simple text processing such as finding the length of a string, finding a duplicate word, or sorting an array of words. Have you done anything similar in Tamil (or any other language you speak)? It is equally easy, but you must know a few things first, Unicode in particular.

The first step is the equivalent of "Hello, World": write a one-line program (in C++ or Java) that prints, say, "வணக்கம்". To accomplish this you'll need to set up your IDE (Eclipse, NetBeans, Visual Studio, etc.) to

1) Display output (console) in Unicode

2) Display source code (text editor) in Unicode (so you can type Tamil strings)

3) You may also want to set up your laptop/PC to type Tamil easily. Look for utilities like ekalappai or Azhagi on the web. (Similar tools are available for other Indian languages.)

Instructions for each of these steps are widely available on the web, and none of them takes long.
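As a rough illustration of the "Hello, World" step in Java, here is a minimal sketch. The class name TamilHello is made up for this example. It assumes Java 10 or later (for the PrintStream constructor that takes a Charset), a source file saved as UTF-8 and compiled with `javac -encoding UTF-8`, and a console that can display Tamil.

```java
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

class TamilHello {
    // The greeting from the text, typed directly as a Tamil string literal.
    static final String GREETING = "வணக்கம்";

    public static void main(String[] args) {
        // Wrap System.out so the bytes are written as UTF-8,
        // whatever the platform's default encoding happens to be.
        PrintStream out = new PrintStream(System.out, true, StandardCharsets.UTF_8);
        out.println(GREETING);
    }
}
```

If the console shows question marks or boxes instead of Tamil, the program is usually fine and the console encoding or font is what needs changing.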


If you have come this far, it is time to read about Unicode in general, and then a bit more about how your language (Tamil, Hindi, Telugu, etc.) is encoded in Unicode. We use Tamil as the example here.

Now you are ready to do some fun stuff:

1. What is the most frequently used Tamil letter?

2. What is the longest word in Tamil, excluding proper nouns? (If proper nouns are included, which may not be the right thing to do, it is திருவாலவாயுடையார்திருவிலையாடற்புராணம்.) How many letters does it have? How many characters/bytes does it take?

3. Sort an array of Tamil words in ascending order. (You may need a little help. See below.)

4. What is the average number of characters needed for a Tamil letter? (Letter vs. character: e.g. டி is one letter, but it needs two characters, ட + ி, to be represented in Unicode.)
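The letter vs. character distinction behind these questions can be sketched in Java as follows. This is only one possible approach, not a definitive implementation: the class TamilLetters and its methods are invented for this example, and it assumes that a letter is a base character followed by any Tamil vowel signs or pulli (U+0BBE to U+0BCD, plus the length mark U+0BD7). Conjuncts such as க்ஷ are counted as two letters by this simple rule.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TamilLetters {
    // Assumption: dependent signs are the Tamil vowel signs and pulli
    // (U+0BBE..U+0BCD) plus the au length mark (U+0BD7).
    static boolean isDependentSign(char c) {
        return (c >= 0x0BBE && c <= 0x0BCD) || c == 0x0BD7;
    }

    // Splits a word into letters: each base character plus the signs that follow it.
    static List<String> letters(String word) {
        List<String> result = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= word.length(); i++) {
            if (i == word.length() || !isDependentSign(word.charAt(i))) {
                result.add(word.substring(start, i));
                start = i;
            }
        }
        return result;
    }

    // Total number of Java chars divided by total number of letters.
    static double averageCharsPerLetter(List<String> words) {
        int chars = 0, letterCount = 0;
        for (String w : words) {
            chars += w.length();
            letterCount += letters(w).size();
        }
        return (double) chars / letterCount;
    }

    // Returns the letter with the highest count (the earliest leader wins a tie).
    static String mostFrequentLetter(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String w : words) {
            for (String letter : letters(w)) {
                int n = counts.merge(letter, 1, Integer::sum);
                if (best == null || n > counts.get(best)) best = letter;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String word = "டி";
        System.out.println("letters: " + letters(word).size());
        System.out.println("chars:   " + word.length());
        System.out.println("bytes:   " + word.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The same three measurements (letters, chars, UTF-8 bytes) applied to every word in a word list answer questions 2 and 4, and mostFrequentLetter answers question 1.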

You can do all of this and more if you have access to a comprehensive list of Tamil words. We have just that: a list of more than 100,000 Tamil words taken from the Tamil Lexicon, an authoritative Tamil dictionary.

Download Tamil Lexicon Words (Right-click the link and choose "Save link as". If you simply click the link, the file will open in a new window and, depending on your settings, it may show a lot of garbage. No problem: simply save the page and then open it in your favourite editor.)


 Sorting

A little infrastructure is needed to sort a list of Tamil words. We provide this (simple) infrastructure in Java; a similar technique should be possible in C++ as well. (If time permits, we'll provide a C++ example later.) For the curious: you need to know what a Collator is. It is a class that specifies the lexicographic (i.e. alphabetical) ordering of the letters of a given language, such as Tamil. Please read about it in the Java API documentation.

You also need to know what a Comparator is. It is a class that specifies whether a given word is less than, equal to, or greater than another word. Look in particular at its compare method.

We give you a TamilComparator, which also includes a Collator. Use it to sort Tamil words and to do a few more interesting things that you couldn't do before. E.g. does the list of Tamil Lexicon words provided above contain any duplicate words?
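To show how a Collator and a Comparator fit together, here is a small sketch. It is not the TamilComparator offered for download: the class TamilSortSketch is invented for this example, and it simply asks the JDK for a Collator for the "ta" locale. Whether a given JDK ships Tamil-specific collation rules is an assumption; if it does not, the result falls back to a default ordering that may differ from dictionary order in places.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

class TamilSortSketch {
    // A Comparator backed by the JDK's Collator for the "ta" locale.
    // Assumption: the JDK's rules for this locale are good enough for the task.
    static Comparator<String> tamilOrder() {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("ta"));
        return collator::compare;
    }

    // Returns a sorted copy, leaving the input list untouched.
    static List<String> sorted(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(tamilOrder());
        return copy;
    }

    // After sorting, equal words sit next to each other, so one pass finds duplicates.
    static List<String> duplicates(List<String> words) {
        Comparator<String> order = tamilOrder();
        List<String> sortedWords = sorted(words);
        List<String> dups = new ArrayList<>();
        for (int i = 1; i < sortedWords.size(); i++) {
            String prev = sortedWords.get(i - 1);
            String cur = sortedWords.get(i);
            boolean alreadyReported = !dups.isEmpty()
                    && order.compare(dups.get(dups.size() - 1), cur) == 0;
            if (order.compare(prev, cur) == 0 && !alreadyReported) {
                dups.add(cur);
            }
        }
        return dups;
    }

    public static void main(String[] args) {
        List<String> words = List.of("மரம்", "கல்வி", "அம்மா", "ஆடு", "கல்வி");
        System.out.println(sorted(words));
        System.out.println(duplicates(words));
    }
}
```

Swapping tamilOrder() for a comparator with explicit Tamil rules changes the ordering without touching the sorting or duplicate-finding code.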

Download TamilComparator.java (Right-click the link and choose "Save link as". If you simply click the link, the file will open in a new window and, depending on your settings, it may show a lot of garbage. No problem: simply save the page and then open it in your favourite editor.)


 Edit Distance (aka Levenshtein Distance)

How would you compute the edit distance between two Tamil strings? The algorithm is well understood and documented (see e.g. http://en.wikipedia.org/wiki/Edit_distance). The difficulty is that a letter is represented by a varying number of characters. E.g. the Tamil letter 'க்' takes two characters (க + ்), 'க' is just one character, and க்ஷ் (க + ் + ஷ + ்) takes 4 characters. As a consequence, two words that differ by, say, one letter may differ by one to 4 characters. So you need to find the letter boundaries and then compare whole letters instead of blindly comparing characters.
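The idea of finding letter boundaries first and only then running the standard algorithm can be sketched like this. The class TamilEditDistance and its methods are made up for this example. The boundary rule is an assumption: a letter is a base character plus any following vowel signs or pulli (U+0BBE to U+0BCD, or U+0BD7), and க் followed by ஷ is merged into one conjunct letter. Other conjuncts are not handled.

```java
import java.util.ArrayList;
import java.util.List;

class TamilEditDistance {
    // Assumption: Tamil vowel signs and pulli lie in U+0BBE..U+0BCD, plus U+0BD7.
    static boolean isDependentSign(char c) {
        return (c >= 0x0BBE && c <= 0x0BCD) || c == 0x0BD7;
    }

    // Index just past the base character at 'from' and the signs attached to it.
    static int endOfCluster(String word, int from) {
        int end = from + 1;
        while (end < word.length() && isDependentSign(word.charAt(end))) end++;
        return end;
    }

    static List<String> letters(String word) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            int end = endOfCluster(word, i);
            // "\u0B95\u0BCD" is க் and '\u0BB7' is ஷ: treat க் + ஷ as one letter (க்ஷ).
            if (word.substring(i, end).equals("\u0B95\u0BCD")
                    && end < word.length() && word.charAt(end) == '\u0BB7') {
                end = endOfCluster(word, end);
            }
            result.add(word.substring(i, end));
            i = end;
        }
        return result;
    }

    // Standard Levenshtein distance, computed over letters instead of chars.
    static int distance(String a, String b) {
        List<String> s = letters(a), t = letters(b);
        int[] prev = new int[t.size() + 1];
        for (int j = 0; j <= t.size(); j++) prev[j] = j;
        for (int i = 1; i <= s.size(); i++) {
            int[] cur = new int[t.size() + 1];
            cur[0] = i;
            for (int j = 1; j <= t.size(); j++) {
                int cost = s.get(i - 1).equals(t.get(j - 1)) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[t.size()];
    }

    public static void main(String[] args) {
        System.out.println(letters("கல்வி") + " vs " + letters("கருவி"));
        System.out.println(distance("கல்வி", "கருவி"));
    }
}
```

Running the same dynamic programme directly on chars would count the change from ல் to ரு as two substitutions, which is exactly the over-counting the letter-level version avoids.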

Edit distance is very useful in spell checkers. For other uses see Wikipedia. "Approximate String Matching" is the area where this algorithm is very useful.

1) You can find the average edit distance between two consecutive words in the Tamil Lexicon.

2) Given a Tamil word, find all the words that differ from it by 1 (or 2) letters. (You'll need a BK-tree, named after Burkhard and Keller.)

3) Write a simple game. Given two Tamil words, find a way, if any, to go from the first word to the second, where each step changes one letter and the resulting word must be present in the dictionary. E.g. is there a way from கல்வி to அகரு? Ans: கல்வி --> கல் --> அகல் --> அகரு, where every intermediate word is also a valid word. The path should also be the shortest. E.g. கல்வி to கருவி is just one step, but you can also do கல்வி --> கல் --> அகல் --> அகரு --> கருவி, which is not the best solution.
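For the game in item 3, one common way to get a shortest path is breadth-first search over the dictionary. The sketch below is an illustration under assumptions, not a finished solution: the class WordLadderSketch is invented here, and the test for "one letter apart" is passed in as a parameter so that any letter-level edit distance can be plugged in. The main method uses a hand-listed set of pairs purely as a stand-in for that test.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

class WordLadderSketch {
    // Breadth-first search: the first time the goal is reached, the path is a shortest one.
    // Returns an empty list when no path exists.
    static List<String> shortestPath(String start, String goal, Collection<String> dictionary,
                                     BiPredicate<String, String> oneLetterApart) {
        Map<String, String> cameFrom = new HashMap<>(); // word -> previous word on the path
        Deque<String> queue = new ArrayDeque<>();
        cameFrom.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            if (current.equals(goal)) {
                List<String> path = new ArrayList<>();
                for (String w = goal; w != null; w = cameFrom.get(w)) path.add(w);
                Collections.reverse(path);
                return path;
            }
            // Linear scan of the dictionary; a BK-tree could narrow this down.
            for (String candidate : dictionary) {
                if (!cameFrom.containsKey(candidate) && oneLetterApart.test(current, candidate)) {
                    cameFrom.put(candidate, current);
                    queue.add(candidate);
                }
            }
        }
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        List<String> dictionary = List.of("கல்வி", "கல்", "அகல்", "அகரு", "கருவி");
        // Stand-in for a real letter-level test: pairs listed by hand for this demo.
        Set<String> pairs = Set.of("கல்வி|கல்", "கல்|அகல்", "அகல்|அகரு", "கல்வி|கருவி");
        BiPredicate<String, String> oneLetterApart =
                (a, b) -> pairs.contains(a + "|" + b) || pairs.contains(b + "|" + a);
        System.out.println(shortestPath("கல்வி", "அகரு", dictionary, oneLetterApart));
    }
}
```

With a word list of 100,000 entries the linear scan inside the loop becomes the bottleneck, which is where the BK-tree from item 2 would help.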