I presented at the Unicode Conference two weeks ago, on Oct. 16, on important yet overlooked issues that affect languages that use abugida scripts and have agglutinative morphology, using the Tamil language as a case study. Although the talk was mainly about the issues around dictionary datasets, it also covered input methods and the need for phoneme-level segmentation in these use cases.
The talk covered the following topics:
- What are abugida scripts and agglutinative grammars?
- How do we distinguish the linguistic terms carefully from the Unicode / computing terms?
- What are some of the user-facing problems in internationalization (i18n) for languages with these aspects?
- Why would dictionary data be useful in some of these cases?
- What is the landscape for dictionary datasets?
- What are some of the important non-technical issues for dictionary datasets? How do they influence technical issues and outcomes for the user?
- What is the approach I took for deriving a dictionary dataset from scratch?
- Why are phonemes necessary for doing almost anything interesting (“NLP”) above the layer of fonts for abugida scripts? What analogies can we use to make better sense of that point?
- With the ideas discussed, how can we think of improving input method designs for Tamil and other abugida scripts?
Phonemes have come up for me a lot, starting with the original clj-thamil project in 2014, which led to writing programs in languages other than English (as explained in the 2015 Clojure/West talk). In the 2017 Tamil Internet Conference talk, I explained more clearly why phoneme-level analysis is needed to enable NLP applications.
Ironically, even though I was listed as a co-author on this paper on designing a better Tamil keyboard, presented at the 2004 Tamil Internet Conference, I understood the idea only after it was explained to me. Once explained, it made sense, but I wasn’t the one pushing the idea, and it ultimately fell on deaf ears. Even its inclusion in the conference proceedings was tenuous and came only after a struggle. Only much later, after needing to think in terms of phonemes for the clj-thamil project, did I realize that the Tamil grammar lessons I had started in 2001, now hosted on learntamil.com, are written with the assumption that the reader understands the “basic math” of phonemes within the script of the language. That assumption is reasonable, but the word “phoneme” is never used in the lessons, and omitting it would have been intentional had it been a conscious choice to begin with. Only after these realizations along the way did the idea of phonemes start to make perfect sense in the context of input methods on smartphones. That, in turn, allowed me to reinterpret the 2004 Tamil keyboard proposal in those terms. The topic of phonemes keeps coming up because there is no way to avoid it if you want to make meaningful progress in Tamil NLP.
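To make the “basic math” of phonemes concrete, here is a minimal sketch in Python (not Clojure, and not the clj-thamil API) of splitting Tamil text into letter-level phonemes. The function and constant names are hypothetical. The sketch assumes that a consonant phoneme is written as the consonant plus the pulli, that a vowel phoneme is written as its independent vowel letter, and that any consonant code point not followed by a pulli or vowel sign carries the inherent vowel அ.

```python
import unicodedata

PULLI = "\u0bcd"           # ்  marks a bare consonant
INHERENT_VOWEL = "\u0b85"  # அ  implied when a consonant has no sign

# Dependent vowel sign -> independent vowel letter.
VOWEL_SIGNS = {
    "\u0bbe": "\u0b86",  # ா -> ஆ
    "\u0bbf": "\u0b87",  # ி -> இ
    "\u0bc0": "\u0b88",  # ீ -> ஈ
    "\u0bc1": "\u0b89",  # ு -> உ
    "\u0bc2": "\u0b8a",  # ூ -> ஊ
    "\u0bc6": "\u0b8e",  # ெ -> எ
    "\u0bc7": "\u0b8f",  # ே -> ஏ
    "\u0bc8": "\u0b90",  # ை -> ஐ
    "\u0bca": "\u0b92",  # ொ -> ஒ
    "\u0bcb": "\u0b93",  # ோ -> ஓ
    "\u0bcc": "\u0b94",  # ௌ -> ஔ
}


def is_consonant(ch):
    """True for code points in the Tamil consonant range (U+0B95..U+0BB9)."""
    return "\u0b95" <= ch <= "\u0bb9"


def to_phonemes(text):
    """Split text into a list of letter-level phonemes.

    Characters that are not Tamil consonants (independent vowels,
    aytham, spaces, Latin letters, ...) are passed through unchanged.
    """
    # NFC composes two-part vowel signs (e.g. ெ + ா) into one code point.
    text = unicodedata.normalize("NFC", text)
    phonemes = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if is_consonant(ch):
            phonemes.append(ch + PULLI)
            if nxt == PULLI:
                i += 2  # bare consonant: no vowel follows
            elif nxt in VOWEL_SIGNS:
                phonemes.append(VOWEL_SIGNS[nxt])
                i += 2
            else:
                phonemes.append(INHERENT_VOWEL)
                i += 1
        else:
            phonemes.append(ch)
            i += 1
    return phonemes
```

For example, `to_phonemes("தமிழ்")` yields the five phonemes த், அ, ம், இ, ழ், even though the word is written with only three visible letters. Suffix attachment in an agglutinative grammar operates on this sequence rather than on the visible letters, which is why the segmentation matters above the font layer.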
But there are ideas about agglutinative language support in input methods, as well as the inclusion of dictionary datasets in input methods and in nearly all NLP applications, that are wide open for exploration. I am hoping to see more inspired, focused work and inclusive, good-natured collaboration from the community in those directions.
One set of topics that the recent conference presentation didn’t go into was the technical details of the code that derives Tamil dictionary data from scratch. It was nice to be able to use Clojure to navigate immutable representations of prefix trees in a way that effectively allows movement up and down the tree as if the tree nodes had parent pointers. Doing so required maintaining the existing prefix tree code and adding new code that makes the prefix trees support the zipper interface. The new code in dict-data-builder also introduces functions for verb conjugation and more functions for noun inflection, since the code finally tackles head-on the problem of finding the verb class, which is what enables proper verb conjugation.
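The zipper idea can be illustrated with a short sketch. It is written in Python rather than Clojure, and every name in it (`Node`, `Loc`, `insert`, `down`, `up`, and so on) is hypothetical; it is not the code in dict-data-builder or clj-thamil. It only shows the general technique: a location pairs the current node with the path taken from the root, so moving “up” works without any parent pointers stored in the immutable nodes.

```python
from typing import NamedTuple


class Node(NamedTuple):
    terminal: bool   # True if a word ends at this node
    children: dict   # char -> Node; never mutated after creation


EMPTY = Node(False, {})


def insert(node, word):
    """Return a new trie containing word; the input trie is unchanged."""
    if not word:
        return Node(True, node.children)
    head, rest = word[0], word[1:]
    child = insert(node.children.get(head, EMPTY), rest)
    return Node(node.terminal, {**node.children, head: child})


class Loc(NamedTuple):
    node: Node   # the node currently in focus
    path: tuple  # ((parent_node, key), ...) from the root down to here


def root_loc(trie):
    return Loc(trie, ())


def down(loc, key):
    """Move to the child under key, or return None if there is none."""
    child = loc.node.children.get(key)
    if child is None:
        return None
    return Loc(child, loc.path + ((loc.node, key),))


def up(loc):
    """Move to the parent, or return None at the root.

    The parent is rebuilt around the focused node so that any edit
    made at this location is carried upward.
    """
    if not loc.path:
        return None
    parent, key = loc.path[-1]
    rebuilt = Node(parent.terminal, {**parent.children, key: loc.node})
    return Loc(rebuilt, loc.path[:-1])


def edit(loc, fn):
    """Replace the focused node with fn(node), leaving the path as is."""
    return Loc(fn(loc.node), loc.path)


def prefix(loc):
    """The string spelled out by the path from the root to this location."""
    return "".join(key for _, key in loc.path)


def root(loc):
    """Walk back to the top and return the (possibly edited) trie."""
    while loc.path:
        loc = up(loc)
    return loc.node
```

A typical use would be to walk `down` character by character while matching a word, call `up` to back off to a shorter prefix when a suffix rule fails, and call `root` at the end to get a new trie that includes any edits. In Clojure, the standard `clojure.zip` namespace offers this style of navigation for any tree that can describe its own children, which is presumably what makes the zipper interface attractive for prefix trees.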