Google Summer of Code 2012 Project: Disha

New Visual Keyboard For Bengali

Project Background:

Currently, the most popular keyboard layouts available for Indic scripts such as Bengali use a non-visual style of typing. So what exactly is a “Non-Visual Style of Typing”? Put simply, the sequence in which the characters are typed into the system is not exactly the sequence in which they are displayed.
This can be explained by a simple example of an input combination of the Bengali consonant Ka (ক) and the dependent vowel sign E (ে) as follows:-
Typing sequence: ক+ে
Display sequence: কে
This non-visual style follows a uniform method in which characters are typed according to their type (i.e. consonants, independent / dependent vowels, special characters, conjunct characters), each of which is governed by a specific set of rules.

Problem with the current system:

Even though the existing non-visual style of writing is quite prevalent, it poses a major learning challenge for new users, who are usually more used to the conventional visual way of writing.
How so?
Going back to the above example, the most common problem is that inexperienced users usually try the consonant-vowel combination in the following way:-
ে+ক
thus ending up with the following display:-
েক
which is not how it should be.
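
The difference between the two orders can be checked at the code point level. The Python sketch below is purely illustrative (the variable and function names are made up for this example). It shows that the correctly displayed syllable stores the consonant first, so typing the vowel sign first produces a different string.

```python
import unicodedata

KA = "\u0995"            # BENGALI LETTER KA
VOWEL_SIGN_E = "\u09C7"  # BENGALI VOWEL SIGN E

logical = KA + VOWEL_SIGN_E  # order expected by the existing layouts
visual = VOWEL_SIGN_E + KA   # order a new user tends to type


def describe(text):
    """Return the Unicode names of the characters in text, in stored order."""
    return [unicodedata.name(ch) for ch in text]


print(describe(logical))
print(describe(visual))
```

Both strings contain the same two characters, but only the first one is the sequence that the existing layouts and fonts treat as the syllable কে.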

This project is thus aimed at creating a Visual Typing Method for complex scripts like Bengali.

Project Implementation Logic:

The example in the previous section outlines just one of the cases the project has to handle. Quite a few more need to be implemented on top of the existing system to create such a Visual Layout. This project is primarily concerned with dependent vowels and split vowels that require the rendering of pre-base matras for base consonants.
The main implementation focus of this project can thus be listed as below:-

Case 1:
BENGALI VOWEL SIGN E (ে) [Unicode: 0x09C7]: In this case the input system should be able to process an input combination in which the dependent vowel is typed first, followed by the base consonant. For example, if the base consonant is ক, the input combination ে+ক should be processed as কে. The input system first takes ে as input, stores it in its buffer and waits for the following input. If the following input is a consonant such as ক, it renders ে as the pre-base matra and ক as the base consonant.

Case 2:
BENGALI VOWEL SIGN AI (ৈ) [Unicode: 0x09C8]: This case is similar in behaviour to the previous case, the only difference being the input vowel, which is Oikar in this case. Thus the input sequence is ৈ+ক and the display sequence is কৈ.

Case 3:
BENGALI VOWEL SIGN I (ি) [Unicode: 0x09BF]: This case is also similar to the previous two. The dependent vowel ি is typed first, followed by the consonant, as ি+ক. It is displayed as কি, where ি again becomes the pre-base matra and ক the base consonant.

Case 4:
BENGALI VOWEL SIGN O (ো) [Unicode: 0x09CB]: This case implements split vowels, wherein one part of the vowel sign is input before the base consonant, while the other part may be input after the consonant. This can be explained by the following example:-
Input combination: ে+ক+া
Display combination: কো
This input sequence can be handled by the input system in two parts. In the first part the input system stores the input vowel ে in the buffer. If the next character is a consonant such as ক, it renders the combination as কে and again stores this in the buffer. In the second part, if the next input character is the vowel া, it renders the combination as কো and commits it as the output. Otherwise the input system commits the output as কে and re-initializes the input state to handle the next character.
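
The two-part behaviour can be sketched in Python as follows. This is only an illustration, not the m17n implementation: the function name `compose_split_vowel` is hypothetical, and using Unicode NFC normalization to merge the two parts is an assumption of this sketch. It relies on U+09CB having the canonical decomposition U+09C7 U+09BE.

```python
import unicodedata

VOWEL_SIGN_E = "\u09C7"   # pre-base part, typed first
VOWEL_SIGN_AA = "\u09BE"  # post-base part, typed last
KA = "\u0995"


def compose_split_vowel(keys):
    """Turn the visual sequence [E, consonant, optional AA] into stored text."""
    pre, consonant = keys[0], keys[1]
    buffered = consonant + pre  # first part: consonant + E held in the buffer
    if len(keys) > 2 and keys[2] == VOWEL_SIGN_AA:
        # Second part: NFC merges U+09C7 + U+09BE into the single sign U+09CB.
        return unicodedata.normalize("NFC", buffered + VOWEL_SIGN_AA)
    return buffered  # no second part: commit consonant + E as it is


print(compose_split_vowel([VOWEL_SIGN_E, KA, VOWEL_SIGN_AA]))
```

A real input method would keep the buffer between key presses instead of receiving the whole sequence at once; the list argument is used here only to keep the example short.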

Case 5:
BENGALI VOWEL SIGN AU (ৌ) [Unicode: 0x09CC]: This case again behaves similarly to the previous case. The only difference is in the second part of the implementation: if the input following the consonant is the AU length mark ৗ (0x09D7), the system renders the output as কৌ.

The above mentioned five cases are the main priority for the implementation of the Visual Keyboard Layout. The other dependent vowels in the Bengali language are mostly either post base, above base or below base matras which are already implemented by the present input layouts, hence they need not be reworked.

Project Progress:

The new input system is based on the bn-probhat keyboard input layout, as it provides the most comprehensive key mapping for the Bengali language. The essential key mappings in the visual keyboard layout have been kept the same as the Probhat input system based on the m17n database.
For the purpose of the actual keyboard layout implementation, some research work was done on the specifics of the m17n library and database.

The actual implementation of the Project can be done in two ways:-

  1. Combination Based Input System: In this method the combinations for various input sequences for all three pre-base matras can be defined in the mim file itself.
  2. Condition Based Input System: This method uses a logical condition based approach, whereby conditional logic can be defined in the mim file itself based on a logical algorithm.

Pros and Cons of Combination-Based Implementation Method:

Pros:-
1. Since this is a direct input mapping rather than a logical implementation, I suspect that performance-wise this layout may be faster than the condition-based one.
2. As this is a combination of one-to-one and two-to-one mapping, this is also a fairly simple implementation to understand.
3. Inherently takes care of a lot of constraint checking, as this kind of input mapping directly overrides any possible conflicts.

Cons:-
1. The source code for the input layout tends to be redundant and lengthy.
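
To give a feel for why the combination-based source grows long, here is a small Python sketch that generates such a table. It is not the mim file itself. Keys are modelled as the Bengali characters they produce, the consonant range U+0995 to U+09B9 is an assumption, and the inclusion of the two split-vowel combinations is also an assumption made for illustration.

```python
import unicodedata

VOWEL_SIGN_E = "\u09C7"
PRE_BASE_MATRAS = ["\u09BF", VOWEL_SIGN_E, "\u09C8"]  # I, E, AI
SECOND_PARTS = ["\u09BE", "\u09D7"]                   # AA, AU length mark

# Assigned letters between KA (U+0995) and HA (U+09B9); gaps are skipped.
CONSONANTS = [chr(cp) for cp in range(0x0995, 0x09BA)
              if unicodedata.category(chr(cp)) == "Lo"]


def build_table():
    """Map every typed key sequence to the text that should be committed."""
    table = {}
    for consonant in CONSONANTS:
        for matra in PRE_BASE_MATRAS:
            # Two-to-one mapping: matra typed first, stored after the consonant.
            table[matra + consonant] = consonant + matra
        for second in SECOND_PARTS:
            # Split vowels: E + consonant + second part, merged by NFC.
            typed = VOWEL_SIGN_E + consonant + second
            stored = consonant + VOWEL_SIGN_E + second
            table[typed] = unicodedata.normalize("NFC", stored)
    return table


table = build_table()
print(len(table), "entries for", len(CONSONANTS), "consonants")
```

Every consonant contributes five entries here, which is where the redundancy comes from when the same combinations are written out by hand.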

Implementation Logic for Condition-Based Method:

The actual work on the input system consists of two essential parts:-

Part 1:

The creation of a new MIM file defining the basic one to one mapping of individual keyboard inputs to individual characters of the Bengali Language. This mapping has been done in accordance with the existing mapping present in the bn-probhat layout for the m17n database.

Part 2:

Defining new logical rules in the form of conditions in the new input system. These logical rules have been narrowed down to the implementation of the five individual cases as listed above.

Pseudo-Code for Conditional Approach:

After some experimentation and research, the base logic required for the keyboard layout to work in the desired way was worked out as follows. The pseudo-code below was written with an m17n database implementation in mind:-

  1. The input system reads one character at a time.
  2. Check if the input character is either ি [Unicode: 0x09BF], ে [Unicode: 0x09C7] or ৈ [Unicode: 0x09C8]
  3. If the condition in step 2 is true, go to step 4, else carry on with normal rendering and commit.
  4. If input character is ি [Unicode: 0x09BF], perform the following steps:-
    1. Read the next input character and store it in temporary variable, say c.
    2. Check if it is a consonant, i.e.: (c > 0x0994) & (c < 0x09C0). If true go to sub-step 3.
    3. Check for consonant rule exceptions, i.e.: (c != 0x0999) & (c != 0x099E). If true go to sub-step 4.
    4. Join 0x09BF and c.
    5. Commit.
  5. If input character is ৈ [Unicode: 0x09C8], perform the following steps:-
    1. Read the next input character and store it in temporary variable, say c.
    2. Check if it is a consonant, i.e.: (c > 0x0994) & (c < 0x09C0). If true go to sub-step 3.
    3. Check for consonant rule exceptions, i.e.: (c != 0x0999) & (c != 0x099E). If true go to sub-step 4.
    4. Join 0x09C8 and c.
    5. Commit.
  6. If input character is ে [Unicode: 0x09C7], perform the following steps:-
    1. Read the next input character and store it in a temporary variable, say c1.
    2. Check if it is a consonant, i.e.: (c1 > 0x0994) & (c1 < 0x09C0). If true go to sub-step 3.
    3. Check for consonant rule exceptions, i.e.: (c1 != 0x0999) & (c1 != 0x099E). If true go to sub-step 4.
    4. Join 0x09C7 and c1 and store the combined characters into c2.
    5. Read the next input character and store it in temporary variable, say c3.
    6. Check if (c3 = 0x09BE) | (c3 = 0x09D7). If true go to sub-step 7, else commit c2 and pass c3 to the initialized state.
    7. Join c2 and c3.
    8. Commit.
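
The steps above can be sketched in Python as a rough check of the logic. This is not the m17n implementation. It processes a whole string rather than one key press at a time, the name `visual_to_logical` is hypothetical, and the consonant test is narrowed to U+0995 through U+09B9 (KA to HA), which is an assumption of this sketch. The join is assumed to store the consonant before the vowel sign.

```python
PRE_BASE = {"\u09BF", "\u09C7", "\u09C8"}  # I, E, AI
SECOND_PART = {"\u09BE": "\u09CB",         # E + AA -> vowel sign O
               "\u09D7": "\u09CC"}         # E + AU length mark -> vowel sign AU
EXCEPTIONS = {"\u0999", "\u099E"}          # NGA, NYA


def is_joinable_consonant(ch):
    """Consonant test; the range U+0995..U+09B9 is an assumption."""
    return "\u0995" <= ch <= "\u09B9" and ch not in EXCEPTIONS


def visual_to_logical(keys):
    """Convert characters typed in visual order to stored (logical) order."""
    out = []
    i = 0
    while i < len(keys):
        ch = keys[i]
        nxt = keys[i + 1] if i + 1 < len(keys) else ""
        if ch in PRE_BASE and nxt and is_joinable_consonant(nxt):
            third = keys[i + 2] if i + 2 < len(keys) else ""
            if ch == "\u09C7" and third in SECOND_PART:
                out.append(nxt + SECOND_PART[third])  # step 6, sub-steps 5 to 8
                i += 3
            else:
                out.append(nxt + ch)                  # join and commit
                i += 2
        else:
            out.append(ch)                            # normal rendering
            i += 1
    return "".join(out)


print(visual_to_logical("\u09C7\u0995\u09BE"))
```

Input that does not start with one of the three pre-base matras, or where the matra is not followed by an accepted consonant, is passed through unchanged.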

Status of Each Implementation Method:-

After quite a bit of experimentation and a bit of a wild goose chase in trying to debug a non-existent bug, the combination based input system is finally complete and tested. It works great with the m17n system and IBus daemon.

However, not everything is so smooth with my bonus alternative method. Even though the input system loads fine and even gels well with the IBus and m17n systems, it does not yet seem to be able to join the three pre-base matras with the input consonants. I’m trying to create a joining functional state so as to finish it without any intrusion into the m17n system; however, it still does not provide the desired output glyphs. I now suspect that this type of implementation may require some modifications in the m17n system.

[Note to self: Finishing the project within the stipulated time still holds higher priority than working on a bonus method. But having both would be cool!!]

  1. I tried making a mim file for Assamese input. It worked fine but some keys did not map correctly. Especially the lower row in the keyboard (zxcv–,./), when pressed returns the corresponding roman letter. Can you please help me solving the problem? One more thing, can I put unicode character code instead of symbols in the mim file?? Please help.

    • Hi,

      This sounds interesting. Is the mim file available somewhere from where I can check it? For the problem regarding keyboard inputs returning incorrect outputs, you may try doing a step by step rechecking of the system’s workflow. You may do this by using m17n-edit via the following steps:-

      Copy the file .mim under ~/.m17n.d
      % export MDEBUG_DATABASE=1
      % export MDEBUG_INPUT=1
      % m17n-edit --im

      Also, you should be able to use UTF-8 character codes inside the mim file instead of symbols wherever required. For a list of UTF-8 codes, you may check out: http://www.utf8-chartable.de/unicode-utf8-table.pl

  2. hi, i’m just new to this field of open source developing, i like to join the upcoming Google Summer of Code 2013. i saw you on the Google Summer of Code 2012 accepted list and i also saw you have succeeded, so i hope you can help me, i need to know what are the skills i need to participate, i’m a Bsc. Computer Science 2nd year student of University of Calicut, kerala. i know C,C++, ORACLE, VB.NET, and little bit of HTML. i don’t know anything about open source programming. just advice me how to take step 1 to this field, and to participate in Google Summer of Code 2013 just like you did on 2012. please

    with Hope,
    Sreedhar.p

    • Hey Sreedhar,

      I would suggest that you start by going through the list of accepted organizations from last year and look for the ones that use the languages you are comfortable working with. C and C++ should definitely put you on quite good ground. You can make a list of the organizations that you would like to work with, go through their project lists, and based upon your liking start interacting with those organizations. You can then either talk to them about one of their project ideas, or you may come up with your own. The sooner you start the interaction the better.

      Hope that helps.

      Regards,
      Sayak

  3. hello sayak,

    if i need to apply when Student application period opens, is there a possibilty of getting selected? and during application submission what are the things i need? can i scan my proof documents and send? please tell how did you apply in 2012? so i can do like that.
    Thank you for your Valuable Reply……… and for helping me.

    From

    Sreedhar.p
