
Redesigning an Input Method for an Abugida Script

After previously discussing the problems of input methods for abugida scripts, and then adding more supporting details to that argument, I finally started prototyping possible implementations of the idea (try it out!).

But quite a few constraints and tradeoffs come up once you start thinking about the details. I think these issues apply generally to most abugida scripts, so I am documenting all of the details below. Also, getting a new input method adopted requires more than perfecting the technical details and user experience: it also requires overcoming user inertia (or creating awareness) and educating industry experts and those implementing changes. If you have feedback, please send it my way so that I can continue to update this post with the latest information.

Existing input methods for Tamil

The following is not a full history of input methods in Tamil, but just an overview of the technical details of how they work.

  • Anjal (a.k.a. Romanized) – Murasu Anjal brought about the earliest widespread Tamil input method, which uses an English keyboard and a well-defined English -> Tamil transliteration scheme. Ex: “th” -> த்; “tha” -> த; “m” -> ம்; “mi” -> மி; “z” -> ழ்; “thamiz” -> தமிழ். Each successive English letter typed updates the output Tamil text accordingly, as the above examples hint. Ex: “t” -> ட்; “th” -> த்; “tha” -> த; etc. (demo site)
  • Tamil99 – a standard adopted by the Tamil Nadu government in 1999 that composes the keyboard using Tamil vowels and consonants*, not English letters. The consonants are not pure consonants, but rather a consonant + அ (short a) combination letter. To compensate, the standard defines behavior to conveniently achieve pure consonants in a large number of occurrences — when a doubled consonant is typed, or when a strong consonant follows a soft consonant, the first consonant is assumed to only be a pure consonant. (demo site)
  • Google Input Tools-style Transliteration – This uses an English keyboard layout, like Anjal style keyboards. But instead of a strict deterministic transliteration scheme, it allows leniency in the pre-transliterated English, and compensates by using some sort of scheme to predict an ordered list of the most likely Tamil transliterations. I don’t know what the scheme is, but I assume that the scheme is based on Tamil text corpora and statistics. (demo site – select Tamil in the language drop down, then the first option “தமிழ்” in the next drop down to the right)
  • Gboard Tamil->Tamil – This keyboard layout uses Tamil letters very similar to Tamil99, but instead of keeping the keys static like Tamil99, it changes the vowel keys into consonant+vowel letter combinations whenever a consonant is typed. If a non-Tamil key or a Tamil consonant+vowel letter is typed, the vowel keys are displayed again. But if a different consonant is typed after the first, then the consonant+vowel keys adjust again. I depict this method in my conference talk slides. (Note: Gboard has an English->Tamil transliteration method, but this behaves similarly to the Google Input Tools-style transliteration described above.)
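To make the Anjal-style mechanics concrete, here is a minimal sketch of a deterministic, longest-prefix-match transliterator. The mapping tables below are an assumed subset of the full Anjal scheme, for illustration only:

```python
# A toy, deterministic Anjal-style transliterator using longest-prefix
# matching. The mapping tables are an assumed subset, for illustration only.
PULLI = "\u0BCD"  # ◌் : marks a pure consonant

CONS = {"th": "த", "t": "ட", "m": "ம", "z": "ழ", "k": "க",
        "r": "ர", "l": "ல", "n": "ன", "p": "ப", "v": "வ", "s": "ச"}
IND_VOWELS = {"a": "அ", "A": "ஆ", "i": "இ", "u": "உ", "E": "ஏ", "O": "ஓ"}
SIGNS = {"a": "", "A": "ா", "i": "ி", "u": "ு", "E": "ே", "O": "ோ"}

def transliterate(latin: str) -> str:
    out, i = "", 0
    while i < len(latin):
        # try the longest consonant match first, so "th" wins over "t"
        for n in (2, 1):
            if latin[i:i+n] in CONS:
                out += CONS[latin[i:i+n]] + PULLI
                i += n
                break
        else:
            ch = latin[i]
            if ch in SIGNS and out.endswith(PULLI):
                out = out[:-1] + SIGNS[ch]  # vowel joins the pure consonant
            elif ch in IND_VOWELS:
                out += IND_VOWELS[ch]       # standalone (word-initial) vowel
            else:
                out += ch                   # pass through unmapped characters
            i += 1
    return out

# Re-running on each successive prefix reproduces the incremental behavior:
for prefix in ("t", "th", "tha", "thamiz"):
    print(prefix, "->", transliterate(prefix))  # ட், த், த, தமிழ்
```

Note how the same function, applied to each growing prefix of the input, reproduces the "each keystroke updates the output" behavior described above.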

Availability notes: Anjal-style transliteration is available on Windows via Murasu Anjal and Keyman, on macOS (built-in settings), and I think on iOS and Android via Sellinam (the mobile version of Murasu Anjal). Tamil99 is available on Windows and macOS via the same software as the Anjal style, and it is built in on iOS. Gboard seemingly offers its Tamil->Tamil style in place of the Tamil99 method; on Gboard, all South Asian abugida scripts' input methods are implemented to behave the same way (dynamically changing vowel keys), and this style of input exists only on Gboard. Google Input Tools is available within the UIs for Gmail and Google Translate, and it is also available for Chrome OS (Chromebooks) if the user installs the corresponding official Chrome extension.

Shortcomings in existing input methods

What are the shortcomings of the existing input methods? After all, there is no reason to invent something new if there is no room for improvement over what already exists.

Anjal transliteration shortcomings

  • With this method, you have to know English to type in Thamil.
  • With this method, you have a leaky abstraction: the representation of sounds in English affects how the input keystrokes transform into the output Tamil text. Some consonants require 2 keys while others require only 1, even though the transformation of a vowel after a consonant is much more linguistically consistent to the user. Ex: “t” -> ட், “th” -> த், “tha” -> த. This is especially noticeable when a consonant that requires 2 keys is doubled (ex: த் = t h, so த்த் = t h t h, which becomes painful when the user mistypes the “thth” somehow).
  • With this method, you have to know the proper keys to distinguish the 2 or 3 “l”s, 2 “r”s, and 3 “n”s (l, L, z; r, R; n, N, n-/n^/w). You also have to know whether certain consonant clusters followed by a vowel imply both consonants or just one (ex: does “…ngO…” = ங் + க் + ஓ or ங் + ஓ? does “…nja…” = ஞ் + ச் + அ or ஞ் + அ?)
  • Chrome OS and Linux implementations – input methods on Chrome OS and Linux seem to have an internal state machine for an invisible text field that requires whitespace / punctuation to be entered after a word ending in a pure consonant (if you switch to an English keyboard mid-word, the state machine thinks it needs to backtrack and may delete the most recently outputted letter from the real text field).

Tamil99 shortcomings

  • In Tamil99, words ending in pure consonants are an irregularity: the automatic dotting does not apply, so the pulli must be typed explicitly.
  • In Tamil99, pure consonants in the middle of a word that aren't auto-detected must likewise be dotted manually.
  • In Tamil99, there is another irregularity: you need to know that words where a consonant appears twice in a row but the first occurrence is not a pure consonant cannot be handled automatically (ex: the ம ம் sequence in உரிமம்; the ர ர் sequence in காவல்காரர்). The automatic dotting works only in cases like ட் in கட்டம் and even ட் in கட்டடம்.
  • Tamil99 probably won’t work well for transliterated loan words and old poetry transcriptions where whitespace does not correspond to word boundaries.
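The auto-dotting behavior and the உரிமம்-style exception can be sketched with a simplified model. This is my own illustrative approximation of the Tamil99 rules described above, not the official standard's specification:

```python
# A simplified model (my own sketch, not the official Tamil99 standard) of
# automatic pulli insertion, and of the உரிமம்-style exception noted above.
PULLI = "\u0BCD"
SIGNS = {"அ": "", "ஆ": "ா", "இ": "ி", "உ": "ு", "ஏ": "ே", "ஓ": "ோ"}
CONSONANTS = set("கஙசஞடணதநபமயரலவழளறன")
NASAL_OF = {"க": "ங", "ச": "ஞ", "ட": "ண", "த": "ந", "ப": "ம", "ற": "ன"}

def type_t99(keys):
    text, prev = "", None
    for key in keys:
        if key in CONSONANTS:
            # auto-pulli: doubled consonant, or nasal followed by its plosive
            if prev in CONSONANTS and (prev == key or NASAL_OF.get(key) == prev):
                text = text[:-1] + prev + PULLI + key
            else:
                text += key
        elif key in SIGNS and prev in CONSONANTS:
            text += SIGNS[key]   # vowel key combines with the consonant
        else:
            text += key          # standalone vowel, explicit pulli key, etc.
        prev = key
    return text

print(type_t99(["க", "ட", "ட", "ம", PULLI]))            # கட்டம் : auto-dot works
print(type_t99(["உ", "ர", "இ", "ம", "அ", "ம", PULLI]))  # உரிமம் : அ blocks a wrong auto-dot
print(type_t99(["உ", "ர", "இ", "ம", "ம", PULLI]))       # உரிம்ம் : without அ, the auto-dot misfires
```

The third call shows the irregularity: the user must insert an otherwise-redundant அ press to stop the doubling rule from dotting the first ம.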

Google Input Tools-style transliteration / Gboard English->Tamil shortcomings

  • This method creates a menu of options, and the user must switch attention back and forth between the output text line and the menu of transliteration options for the next word.
  • This depends on the accuracy of whatever the underlying data is for the transliteration to work well.
  • It is hard to predict the transliteration output. The same potentially ambiguous example from Anjal (ex: does “…ngO…” = ங் + க் + ஓ or ங் + ஓ?) is not even necessarily deterministic here. Another example: the Tamil consonant ட can be pronounced as “t” or “d” depending on its position in the word, yet “kEdkum” gives an incorrectly spelled word while “kEtkum” gives the correctly spelled word. So knowledge of Tamil pronunciation rules for such “hard” consonants (plosives) and knowledge of English letters are both required and intertwined, which implies complexity.
  • There is no deterministic transliteration scheme available as a fallback, so the user has no way to type precisely and override the predictive transliteration scheme.
  • Similar to Tamil99, this probably won’t work well for transliterated loan words and old poetry transcriptions where whitespace does not correspond to word boundaries.

Gboard Tamil->Tamil shortcomings

  • Gboard – the dynamically changing keyboard keys are obviously bad UX. Keys, like any other buttons, shouldn’t change dynamically, much less change so frequently as to be describable as “flickering”.
  • Gboard – the user is unwittingly typing in terms of code points: every key press corresponds to a Unicode code point. This does not correspond to the linguistic equivalent of the letter (the combination of phonemes), let alone to how users would write these letters by hand (for what that’s worth).
  • Bad UX aside, this input style is somewhat similar to the Tamil99 style, but it does not include the Tamil99 affordances (compensations) of automatic dotting for doubled consonants and for common consonant clusters (ம்ப், ங்க், etc.) that occur as nasal+plosive pairs at the same point of articulation.
  • Due to the above points, Gboard's method is strictly worse than Tamil99: flickering UX, no automatic dotting, and no discernible benefits over Tamil99.

Designing a better keyboard

The style of input that I’m proposing here for abugida scripts including Tamil is based on the phonemes of the abugida. I’ll call this style “Phonemic”. (demo site)

But first, let’s describe the goals of creating a new input method for abugida writing systems.

Design goals

  • The new input method should not require knowledge of English. This serves native Tamil speakers, who learn speaking/listening and literacy (reading/writing) in Tamil first. Requiring young and other mono-lingual users to learn a new writing system to type their own language should be seen as a barrier to adoption, not a foregone conclusion. Therefore, the keys should be in Tamil.
  • The input method should allow the user to have full control over what they type. There should be a clear determinism between what they type and what appears as the output text. It should not depend on inaccessible underlying data.
  • The input method should be more efficient than existing methods — it should require fewer keypresses and/or shorter finger travel than existing methods.
  • The input method should have no dynamically changing (flickering) keys. This is clearly an achievable goal for abugida scripts.
  • Ideally: there should be some intuition — some linguistic basis — for users when typing words. This only matters if it simplifies how users convert words into keypresses by making it more consistent, in which case it would likely contribute to greater efficiency.

The Phonemic style input method achieves all of the above goals for Tamil.

Intuition for the Phonemic style of input

As previously discussed in my conference talk, the Phonemic style of input for abugida scripts has a linguistic intuition, and it also has analogues to input methods for other writing systems.

Comparison of Phonemic to existing input methods

Let’s look at examples of representing words in the different input method styles. The number of keystrokes required appears in parentheses before each key sequence; where a style allows typing a long vowel as a doubled short vowel, that variant is also shown.

மாமா
  • Anjal: (4) m A m A; long vowels doubled: (6) m a a m a a
  • Tamil99: (4) ம ஆ ம ஆ
  • Phonemic: (4) ம் ஆ ம் ஆ; long vowels doubled: (6) ம் அ அ ம் அ அ

அம்மா
  • Anjal: (4) a m m A; long vowels doubled: (5) a m m a a
  • Tamil99: (4) அ ம ம ஆ
  • Phonemic: (4) அ ம் ம் ஆ; long vowels doubled: (5) அ ம் ம் அ அ

கார்மேகம்
  • Anjal: (8) k A r m E k a m; long vowels doubled: (10) k a a r m e e k a m
  • Tamil99: (8) க ஆ ர ம ஏ க ம ◌்
  • Phonemic: (8) க் ஆ ர் ம் ஏ க் அ ம்; long vowels doubled: (10) க் அ அ ர் ம் எ எ க் அ ம்

வாத்து
  • Anjal: (7) v A t h t h u; long vowels doubled: (8) v a a t h t h u
  • Tamil99: (5) வ ஆ த த உ
  • Phonemic: (5) வ் ஆ த் த் உ; long vowels doubled: (6) வ் அ அ த் த் உ

சாமான்கள்
  • Anjal: (8) s A m A n k a L; long vowels doubled: (10) s a a m a a n k a L
  • Tamil99: (9) ச ஆ ம ஆ ன ◌் க ள ◌்
  • Phonemic: (8) ச் ஆ ம் ஆ ன் க் அ ள்; long vowels doubled: (10) ச் அ அ ம் அ அ ன் க் அ ள்

உரிமம்
  • Anjal: (6) u r i m a m
  • Tamil99: (7) உ ர இ ம அ ம ◌்
  • Phonemic: (6) உ ர் இ ம் அ ம்

காவல்காரர்
  • Anjal: (10) k A v a l k A r a r; long vowels doubled: (12) k a a v a l k a a r a r
  • Tamil99: (11) க ஆ வ ல ◌் க ஆ ர அ ர ◌்
  • Phonemic: (10) க் ஆ வ் அ ல் க் ஆ ர் அ ர்; long vowels doubled: (12) க் அ அ வ் அ ல் க் அ அ ர் அ ர்

வகுப்புக்கு
  • Anjal: (10) v a k u p p u k k u
  • Tamil99: (9) வ க உ ப ப உ க க உ
  • Phonemic: (10) வ் அ க் உ ப் ப் உ க் க் உ

கட்டடம்
  • Anjal: (8) k a t t a d a m, or (8) k a t t a t a m, or (8) k a d d a d a m
  • Tamil99: (6) க ட ட ட ம ◌்
  • Phonemic: (8) க் அ ட் ட் அ ட் அ ம்

கட்டம்
  • Anjal: (6) k a t t a m
  • Tamil99: (5) க ட ட ம ◌்
  • Phonemic: (6) க் அ ட் ட் அ ம்

காட்டும்
  • Anjal: (6) k A t t u m; long vowels doubled: (7) k a a t t u m
  • Tamil99: (7) க ஆ ட ட உ ம ◌்
  • Phonemic: (6) க் ஆ ட் ட் உ ம்; long vowels doubled: (7) க் அ அ ட் ட் உ ம்

போகிறோம்
  • Anjal: (7) p O k i R O m; long vowels doubled: (9) p o o k i R o o m
  • Tamil99: (8) ப ஓ க இ ற ஓ ம ◌்
  • Phonemic: (7) ப் ஓ க் இ ற் ஓ ம்; long vowels doubled: (9) ப் ஒ ஒ க் இ ற் ஒ ஒ ம்

சேர்கிறோம்
  • Anjal: (8) s E r k i R O m; long vowels doubled: (10) s e e r k i R o o m
  • Tamil99: (10) ச ஏ ர ◌் க இ ற ஓ ம ◌்
  • Phonemic: (8) ச் ஏ ர் க் இ ற் ஓ ம்; long vowels doubled: (10) ச் எ எ ர் க் இ ற் ஒ ஒ ம்


  • The Phonemic style is strictly more efficient than the Anjal transliteration style: it uses the same number of keystrokes or fewer. The two are nearly equivalent in behavior (as stated above, abugida phonemes are like alphabet letters), but the Phonemic style is better when a single phoneme requires more than 1 English letter to type (ex: த் <-> “th”).
  • The Phonemic style uses Tamil letters only, while the Anjal style uses English letters only.
  • For the above reasons, let’s rule out the Anjal style from consideration, and only compare Tamil99 to Phonemic.
  • There is a clear inconsistency / cognitive load required of Tamil99 users to know when to use the code point U+0BCD ([◌்] — “pulli” – required to indicate pure consonant). In other words, Tamil99 can auto-insert pulli code points in many instances, except when it can’t. The user can almost but not quite fully think in terms of the phonemes, because the user has to think of this grapheme code point that does not map to a phoneme (rather, it maps to the “subtraction” of a phoneme).
  • There is a difference in efficiency (in terms of number of keystrokes) that favors one style or the other depending on the word. Words that have C+V letters where the vowel = அ (short a) give an advantage to Tamil99, but Tamil99 is at a disadvantage when a word ends in a pure consonant, or when a pure consonant precedes a different consonant without triggering the automatic dotting.
  • Previously, I ran a letter frequency analysis of random modern Tamil text and found that the number of occurrences of Consonant+அ (where Tamil99 saves a keystroke) is less than the number of occurrences of {Consonant, Consonant+{vowel besides அ}}. This means that in practice, the Phonemic style is more efficient because it results in fewer keypresses.
  • Beyond keypresses, efficiency can also occur based on the amount of finger travel. Finger travel would be a function of both the layout of the keys and the language’s average case input. While it appears to me that Tamil99’s layout is fixed (ex: iOS, Keyman), the Phonemic input method’s key layout is still an open question. This leaves room for further benefit of Phonemic relative to Tamil99.
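To make that kind of frequency analysis concrete, the letter classification behind it can be sketched as follows (my own illustrative code, not the original analysis; the Grantha consonants are omitted):

```python
# A sketch of the letter classification behind a Tamil99-vs-Phonemic keystroke
# comparison. Illustrative only; a real analysis would run over a large corpus.
PULLI = "\u0BCD"
VOWEL_SIGNS = set("ாிீுூெேைொோௌ")
CONSONANT_BASES = set("கஙசஞடணதநபமயரலவழளறன")

def tally(text: str) -> dict:
    """Classify each Tamil letter into the categories that decide the
    Tamil99 vs Phonemic keystroke comparison."""
    counts = {"C+a": 0, "pure C": 0, "C+other V": 0}
    i = 0
    while i < len(text):
        if text[i] in CONSONANT_BASES:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if nxt == PULLI:
                counts["pure C"] += 1       # Tamil99 may need an extra pulli key
                i += 2
            elif nxt in VOWEL_SIGNS:
                counts["C+other V"] += 1    # 2 keys in both styles
                i += 2
            else:
                counts["C+a"] += 1          # Tamil99 saves a key here
                i += 1
        else:
            i += 1                          # standalone vowels, spaces, etc.
    return counts

print(tally("கார்மேகம் வாத்து உரிமம்"))  # {'C+a': 2, 'pure C': 4, 'C+other V': 5}
```

Summing these tallies over a corpus indicates which style wins on keystrokes for typical text.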

Technical considerations

  • Just like many input methods, the Phonemic input method requires context in order to allow the output text to mutate based on new input typed into the input method. This is because the Unicode representation of a C+V grapheme cluster has nothing to do with the Unicode representation of the C grapheme cluster — graphemes != phonemes.
  • More specifically, if the keys of the input are the pure consonants (க், ங், ச், … ன்) and standalone vowels (அ, ஆ, … ஔ), then typing the pure consonant க் (k) key appends க் (output text = “….க்”), and then typing the pure vowel இ (i) key converts that trailing க் to கி (output text = “….கி”).
  • Because Tamil and most languages with abugida scripts have short / long vowel pairs (ex: அ (a) / ஆ (aa); இ (i) / ஈ (ii); உ (u) / ஊ (uu); …), it might be useful to allow long vowels to be typed, equivalently, by doubling a short vowel (ex: க் (k) + ஆ (aa) = கா (kaa); க் (k) + அ (a) + அ (a) = கா (kaa)). In Tamil, the 1:2 duration proportion of short to long vowels adheres to the grammar described in the oldest known literature. And if I understand correctly, this short : long proportion also exists in Sanskrit and descendant languages as described by the grammar of an equally ancient literature. Therefore, this particular vowel phoneme double-tap equivalence should work for at least all South Asian abugidas.
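A minimal sketch of this editing rule, including the optional doubled-short-vowel equivalence (my own illustrative code; the demo site's implementation may differ):

```python
# A minimal sketch of the Phonemic editing rule (my own illustrative code,
# not necessarily how the demo site implements it).
PULLI = "\u0BCD"  # ◌் (pulli): marks a pure consonant

# Standalone vowel key -> dependent vowel sign used after a consonant
VOWEL_SIGNS = {
    "அ": "",   # the inherent vowel: just drop the pulli
    "ஆ": "ா", "இ": "ி", "ஈ": "ீ", "உ": "ு", "ஊ": "ூ",
    "எ": "ெ", "ஏ": "ே", "ஐ": "ை", "ஒ": "ொ", "ஓ": "ோ", "ஔ": "ௌ",
}
SHORT_TO_LONG = {"அ": "ஆ", "இ": "ஈ", "உ": "ஊ", "எ": "ஏ", "ஒ": "ஓ"}
CONSONANT_BASES = set("கஙசஞடணதநபமயரலவழளறன")

def press(output: str, key: str) -> str:
    """Apply one keypress: a pure consonant key like க், or a vowel key like இ."""
    if key in VOWEL_SIGNS:
        if output.endswith(PULLI):
            # C + V: mutate the trailing pure consonant into the CV letter
            return output[:-1] + VOWEL_SIGNS[key]
        if key in SHORT_TO_LONG:
            # optional rule: typing a short vowel twice yields the long vowel
            short = VOWEL_SIGNS[key]
            long_ = VOWEL_SIGNS[SHORT_TO_LONG[key]]
            if short and output.endswith(short):
                return output[:-1] + long_     # ex: கி + இ -> கீ
            if key == "அ" and output and output[-1] in CONSONANT_BASES:
                return output + long_          # ex: க + அ -> கா
    return output + key  # standalone vowel, consonant, or anything else

def type_word(keys):
    out = ""
    for k in keys:
        out = press(out, k)
    return out

print(type_word(["க்", "ஆ", "ர்", "ம்", "ஏ", "க்", "அ", "ம்"]))  # கார்மேகம்
print(type_word(["வ்", "அ", "அ", "த்", "த்", "உ"]))              # வாத்து
```

The whole method reduces to one context rule (a vowel mutates a trailing pure consonant) plus the optional doubling rule, which is the consistency argument in a nutshell.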

Importance of the Phonemic input method for all abugidas

I believe that the Phonemic input method is even more important for all other languages using abugida scripts, and that the above analysis showing its clear benefit for Tamil is an underestimate of its benefit for the others.

My belief is based on a simple observation that I have some confidence in: among South Asian abugidas, the Tamil script is the least irregular. (Disclaimer: I know very little about other Indic scripts, but I hope I know enough that this assertion is still fair.)

For example, in Unicode, Tamil does not have ligature characters that need the use of Zero-Width Joiner (ZWJ) or Zero-Width Non-Joiner (ZWNJ) code points for encoding ligature grapheme clusters. Even without ZWJ and ZWNJ characters, there can still be consonant-consonant ligatures. In Malayalam, for the city name Kochi (കൊച്ചി), the phoneme representation is k + o + ch + ch + i. The difference between the Malayalam orthography (കൊച്ചി) and the Tamil orthography (கொச்சி) is that Malayalam replaces the double “ch” with a special ligature. This double-“ch” ligature grapheme cluster is available only in the long-press popup menu on the base “ch” key (ച). Now imagine instead that single-tapping ച twice in a row yields ച്ച (“chch”), and that a subsequent tap of ഇ (i) yields ച്ചി (“chchi”). The Malayalam input would be as simple as the Tamil, without the user needing to hunt for the various ligature keys and combining grapheme marks among the long-press menus and combining mark keys. (The combining mark keys themselves may be a little confusing to users, since they use dotted circle placeholders to indicate the relative positions of the combining marks.)
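This hypothetical Malayalam behavior needs only a pure-consonant-plus-vowel editing rule, with the virama in place of the pulli; the ligature shaping is then entirely the font's job. A sketch with deliberately minimal tables (my own illustration, not any shipped keyboard):

```python
# Sketch: phonemic input for Malayalam. The ligature rendering of ച്ച is the
# font's job; the input method only needs to emit the right code points.
VIRAMA = "\u0D4D"  # ് : Malayalam's analogue of the Tamil pulli
SIGNS = {"അ": "", "ഇ": "\u0D3F", "ഉ": "\u0D41", "ഒ": "\u0D4A"}  # ി, ു, ൊ

def press(output: str, key: str) -> str:
    if key in SIGNS and output.endswith(VIRAMA):
        return output[:-1] + SIGNS[key]  # vowel joins the pure consonant
    return output + key

def type_ml(keys):
    out = ""
    for k in keys:
        out = press(out, k)
    return out

print(type_ml(["ക്", "ഒ", "ച്", "ച്", "ഇ"]))  # k + o + ch + ch + i -> കൊച്ചി
```

The phoneme sequence k + o + ch + ch + i produces the correctly encoded code point sequence for കൊച്ചി without the user ever touching a long-press menu.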


A more obvious example: although I have fewer ways to gauge plausibility with the Devanagari script, I am aware that certain combinations of consonants and vowels can result in ligatures that must be used in place of the constituent letters (example). An example like this in Devanagari also shows that the context-dependent orthography changes can be somewhat complex and irregular, but thinking in phoneme space can simplify the complexity for the user (and reduce the number of keys for the input method).

When we consider abugida scripts from Southeast Asia, the complexities there further strengthen the need to think of input methods (and other higher-level internationalization support) in terms of phonemes. One talk at the 2020 Unicode Conference discussed the confusion caused by the visual similarity of a Khmer letter's renderings even when the order of its combining marks is permuted. During the Q&A, there was a protracted discussion about the Unicode definition of the canonical ordering of the combining mark code points, and how to get users typing Khmer text to obey that canonical ordering. Delving into that level of detail was tricky enough for the Unicode and script experts who were debating each other.

The presenter concluded the talk by saying that the best way to solve the problem is to come up with a better input method, because that would ensure that users type any new text in a way that obeys the canonical ordering. Basically, this is really hard to deal with after the fact for existing content, so all we can easily control is the future. If you've followed me so far, you can see the flaw in the discussion prior to that parting message: they were looking at the problem at the wrong level of detail (code points) when they should instead be finding higher-level ways for users to input the language. It is a lot to ask of users that they think of their writing system in terms of Unicode code points. Enter the phonemic input method to the rescue: it would reduce the number of keys that users have to deal with, and hopefully bring some consistency to their typing mental model. Let the complexity become a concern for the input method implementors.
Now, since I don’t know much at all about Khmer, maybe what I’m suggesting won’t be perfect (of course there will be some caveats), but I don’t think the overall idea would be completely off-base.

And in general, this should be a lesson for all i18n and i18n-adjacent participants in these discussions. A lot of work for Unicode and i18n has gone into laying the broad foundational basis for basic language support in computers: “can you display text in everyone’s languages properly?” This requires Unicode standards, code points, properties data for each code point, algorithm libraries to process text that encodes those code points, layout engines, and fonts. Different writing systems have different concerns for various applications. For any application for abugida scripts that is “higher-level” than the existing levels of support, as I’ve said in my talk and before, we really need to start making sense of these scripts in phoneme space, not code point space or grapheme cluster space.

The problem of improving input methods for these scripts and languages is a natural place to start this shift in approach. Another application of the idea might be backspace behavior. If input methods allow addition of text (oftentimes as “appending to the end”), backspace is like its inverse — popping the stack of those appends. Can we make backspace behavior more consistent and simpler by thinking in terms of phonemes? (Backspace behavior on Malayalam ligatures is semi-intuitive, but backspacing any multi-code point Tamil letter only deletes code points :-\)
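As a sketch of what phoneme-wise backspace could look like for Tamil (my own illustration; the character classes are a subset):

```python
# Sketch: backspace that deletes one *phoneme* rather than one code point.
# Illustrative only; Grantha consonants and edge cases are omitted.
PULLI = "\u0BCD"
VOWEL_SIGNS = set("ாிீுூெேைொோௌ")
CONSONANT_BASES = set("கஙசஞடணதநபமயரலவழளறன")

def backspace_phoneme(text: str) -> str:
    """Remove the most recently 'typed' phoneme from the end of the text."""
    if not text:
        return text
    last = text[-1]
    if last in VOWEL_SIGNS:
        return text[:-1] + PULLI         # கி -> க் : remove the vowel phoneme
    if last in CONSONANT_BASES:
        return text[:-1] + last + PULLI  # க -> க் : remove the inherent அ
    if last == PULLI:
        return text[:-2]                 # க் -> "" : remove the consonant phoneme
    return text[:-1]                     # standalone vowel or other character

word = "தமிழ்"
while word:                              # தமி, தம், த, த், then empty
    word = backspace_phoneme(word)
    print(word or "(empty)")
```

Each press pops exactly one phoneme (த் அ ம் இ ழ், in reverse), which is the "inverse of the append stack" behavior suggested above.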

Anticipated usage patterns

Anticipated platforms/devices/OSes

Although it would be nice for an optimal Tamil input method to be used on all platforms and operating systems, there seems to be inertia among existing users of computers with physical keyboards (desktops and laptops). The reason is that physical keyboards in South Asia are usually US-101 keyboards; at least from my personal observation in Tamil Nadu, there are very few non-English (non-US-101) physical keyboards.

Existing users of computers with physical keyboards are already fluent in English and accustomed to typing in English, so the Anjal style is prevalent among those Tamil-speaking users because it reuses that English typing knowledge. Conversely, those users of the Anjal style have a hard time using a different input method style because they have no visual cues on the physical keyboard for which Tamil phoneme corresponds to which US-101 key.

Therefore, it seems most likely to expect adoption on touchscreen mobile devices with virtual on-screen software keyboards, since these virtual keyboards can display the Tamil phoneme on each virtual key. Also, changing the software of a virtual keyboard has a far lower upfront cost than printing the Tamil phonemes on a US-101 keyboard.

Anticipated audience (new users / existing users / both?)

As mentioned before, existing users of an Anjal style input method on physical keyboards are unlikely to change their input method.

There is a chance that users who find that Tamil99 presents minor challenges will be willing to switch to a Phonemic input method on touchscreen devices, but this needs to be prototyped and tested. The uncertainty in this prediction is whether existing Tamil99 users consciously recognize the mental burden of the input method's inconsistency.

There is a chance that young / new Tamil users of physical keyboards might find the Phonemic style more intuitive than the Anjal style. It seems very likely for such users to adopt the Phonemic style on mobile as their first input method there.

Future Work

Layout of keys

  • The layout of keys for the Phonemic input style is still an open question. Each of the variations in the demo site, listed in the “Keyboard” dropdown from “தமிழ் Phonemic A” to “தமிழ் Phonemic D”, gives a slight variation.
  • Putting letters together that appear next to each other in the alphabet might encourage adoption in the short term, but alphabetical proximity does not necessarily make the keyboard efficient, which is probably what matters for long-term benefits. For example, “soft” consonants (nasals) that are articulated at the same point as “strong” consonants (plosives) occur next to each other in alphabetical order as a pair, with the plosive preceding the nasal. (This is similar to the Pāṇini alphabet for Sanskrit, IIUC.) Since nasal+plosive pairs occur often in words, it could be beneficial to have their keys side-by-side if they are typed by separate fingers. But if they are typed by the same finger, then that could actually introduce a minor slowdown compared to using different fingers.
  • Unlike the US QWERTY keyboard, which was technically designed to be inefficient, it might be beneficial to have a balance of keypresses for the left hand and right hand for average Tamil text.
  • Of course, the most frequent keys should be on the home row, and maybe the top row should be preferred over the bottom row of a US-101 keyboard.
  • Finger travel should be measured for various key layout options.
  • All of the above factors must have influenced the Dvorak layout over QWERTY…
  • One interesting language-specific observation is that because Tamil plain text only really uses periods, but not the other punctuation on nearby keys in a US 101 keyboard (, < > / : ; ‘ [ ] { } | \), it might be better to replace some of those keys to ensure the existence of long vowel keys alongside the short vowel keys, and only preserve the period and question mark.
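A crude way to score candidate layouts on finger travel is to sum distances between successive keys over representative key sequences. The key positions below are entirely hypothetical, purely to illustrate the metric:

```python
# Toy finger-travel metric for comparing candidate layouts. The positions
# are made up for illustration; a real evaluation would use full layouts
# and a large corpus, and would model multiple fingers.
from math import dist

LAYOUT = {  # phoneme key -> (row, column); hypothetical placements
    "அ": (0, 0), "ஆ": (0, 1), "இ": (0, 2), "உ": (0, 3), "ஏ": (0, 4),
    "க்": (1, 0), "ம்": (1, 1), "ர்": (1, 2), "த்": (1, 3), "வ்": (1, 4),
}

def travel(keys):
    """Total straight-line distance between successive keys, as a crude
    single-finger proxy for typing effort."""
    return sum(dist(LAYOUT[a], LAYOUT[b]) for a, b in zip(keys, keys[1:]))

# Score the key sequence for கார்மேகம் under this hypothetical layout:
print(travel(["க்", "ஆ", "ர்", "ம்", "ஏ", "க்", "அ", "ம்"]))
```

Running this over a corpus of key sequences, once per candidate layout, gives a first-order ranking; as noted above, a multi-finger model would refine it.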

Can this be extended for a syllabary?

It’s not definitively clear.

Abugidas are also called alphasyllabaries because they’re not quite alphabetic, but they do have a regularity of graphemes based on phonemes that give them more regularity than a true syllabary.

In particular, most syllabaries seem to not contain pure consonants, and this is potentially a sticking point for adapting the Phonemic style to syllabaries, unless the “short a” phoneme can be thought of as a “neutral” / “default” vowel in the syllabary’s respective language.

Feedback from you

What do you think about the Tamil Phonemic input method?

What do you think about the benefit of the general Phonemic strategy of input methods for other abugida scripts (especially for those in South and Southeast Asia?)

Send me an email or fill out this form to send me feedback. Thanks!

4 replies on “Redesigning an Input Method for an Abugida Script”

Great post Elango. I’ve been looking for something like this for a long time. The Phonemic input is a great idea and is much better than Anjal or Tamil99 style.

However, I wonder if it makes sense to retain the Tamil99 style sans the Pulli (Virama) key mapping. Because this would allow the Tamil99 users to easily adapt to the Phonemic form. Secondly, I also feel that the separation of Consonants and Vowels on the right and left sides of the keyboard has become somewhat of a standard in both Tamil99 and GBoard inputs. I think it can gain adoption much better. Any particular reason, why you think Tamil99 key layout needs to be changed?

Once this is standard, having the Slide-to-type/ swipe-typing feature on top of such a keyboard would be a great leap for many users I sincerely hope.

Thanks Baskaran. And good question. Others have asked me, too, about the exact layout of the keys, for exactly the reason you are talking about. They say that it would be easier to learn for the first time if the keys were arranged in a meaningful order, say alphabetical order. But my response is that there is a tradeoff to be made in the design here, as there often is elsewhere. The layout that makes it easy to learn isn’t the layout that is most efficient. Take English keyboards for example: QWERTY was designed to be anti-efficient (to prevent typewriter hammers from jamming), Dvorak was designed with English language statistics in mind to be most efficient, and a hypothetical ABCDEF keyboard would be neither here nor there.

One thing that I think is important, both for user experience and even for the details of technical work, is consistency. A guiding principle for good UX design is achieving the least mental overhead for the user, and consistency is a necessary (but not sufficient) ingredient for that. For that reason, I think Phonemic with the pulli (புள்ளி) is preferable to Tamil99 in the long run since it is internally consistent, and similarly a different layout for Phonemic gives options for a more efficient layout than Tamil99. I think Tamil99 is already suitable enough for people to get used to it quickly, but a Phonemic keyboard should aim for efficiency and consistency, which will benefit users over the long term. I am not entirely sure yet, but I think if/when Phonemic proves successful, the layout may also make it simpler to scale the implementation to other abugida scripts, too.

I agree with you. The easy to learn layout is certainly not going to be optimal in the long run.

I did a detailed statistical analysis of Tamil characters using corpora of ~660M words and came up with a design based on it. I can’t attach the images of the heatmap analysis or the keyboard design here. However, here is a brief snapshot of the top 50 characters along with their frequencies.

0 ** க : 101639761
1 ** ம் : 79185832
2 ** ர் : 71118296
3 ** ல் : 70827906
4 ** க் : 66335732
5 ** து : 64629393
6 ** ன் : 61646613
7 ** ப : 59613686
8 ** வ : 57886610
9 ** த் : 56277080
10 ** த : 53254312
11 ** தி : 51930875
12 ** ட் : 47817917
13 ** ப் : 44539257
14 ** கு : 43653712
15 ** அ : 42878798
16 ** ள் : 41534988
17 ** டு : 37927692
18 ** ட : 37830559
19 ** ரு : 37696186
20 ** ந் : 35519809
21 ** இ : 35468786
22 ** வி : 31253880
23 ** ம : 31160274
24 ** ர : 31027484
25 ** ய : 27613523
26 ** டி : 27531725
27 ** கி : 27052040
28 ** ற் : 26546556
29 ** ங் : 24240744
30 ** எ : 23936794
31 ** ன : 23305401
32 ** ண் : 21940617
33 ** ரி : 21371176
34 ** று : 20671301
35 ** கா : 20480983
36 ** சி : 20347697
37 ** யி : 19852486
38 ** மா : 19291103
39 ** ச : 17913051
40 ** தா : 17797583
41 ** மு : 16937471
42 ** லை : 16936147
43 ** உ : 16532718
44 ** ற : 16473631
45 ** ல : 16459674
46 ** பா : 16419343
47 ** ள : 16128845
48 ** பி : 15935413
49 ** ளி : 15286857
50 ** பு : 14982198

If you could drop me a line at baskarans _at_, I can share my results with you. I would be glad to have your thoughts on the keyboard layout and see how it can be improved.

Interesting, thanks for the info. I’ve been thinking of a keyboard layout in terms of phonemes (and I’ve been saying for a few years now that Tamil and other abugida script languages should be thought of in terms of phonemes much more often than not). Of course, that’s the basis for this proposed fundamental shift in keyboard layout. I think a bigram analysis of phonemes on your dataset would be interesting. Instead of the (12+1)*(18+1) = 247 letters, it would instead be every sequential pairing of the (12+18) = 30 phonemes, giving 30^2 = 900 bigrams. That would give you the data for finger travel and key relatedness, etc. If you’re interested in looking into that, I’ve just pushed commits to the main branch of clj-thamil so that you can run the new “phonemes” operation to preprocess your text into just the letters that represent the phonemes of the original text.
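For what it's worth, once text is reduced to a phoneme sequence, the bigram tally itself is simple. An illustrative sketch (here the phoneme decomposition is written out by hand rather than produced by clj-thamil):

```python
# Count adjacent phoneme pairs; useful input for key-adjacency and
# finger-travel decisions. Illustrative sketch only.
from collections import Counter

def phoneme_bigrams(phonemes):
    """Tally every sequential pair in a phoneme sequence."""
    return Counter(zip(phonemes, phonemes[1:]))

# அம்மா decomposed by hand into its phonemes: அ ம் ம் ஆ
print(phoneme_bigrams(["அ", "ம்", "ம்", "ஆ"]))
```

Run over a large corpus, the resulting counts over the ~900 possible bigrams are exactly the statistics a layout optimizer would consume.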

Of course, no matter how you slice it, there is still going to be some executive decision-making when producing the final key layout. There is no objective quantitative measure of finger travel that I know of. And even if you could minimize finger travel using Cartesian distance, that works for mobile but probably not for physical desktop/laptop keyboards. On physical keyboards, you might get faster typing speeds when different fingers type consecutive keys (especially if the fingers are on opposite hands). So physical keyboard design might be at odds with mobile on-screen keyboard design.

Anyhoo, I’ll drop you a line.
