tance measures and requires no linguistic knowledge. They concluded that stemming improves recall of IR systems for Indian languages like Bengali. Dasgupta and Ng (2007) worked on unsupervised morphological parsing for Bengali. Pandey and Siddiqui (2008) proposed an unsupervised stemming algorithm for Hindi based on Goldsmith's (2001) approach.

Unlike previous approaches for Indian languages, which are either rule based or completely unsupervised, we propose a hybrid approach which harnesses linguistic knowledge in the form of a hand-crafted suffix list.

3 Gujarati Morphology

Gujarati has three genders (masculine, neuter and feminine), two numbers (singular and plural) and three cases (nominative, oblique/vocative and locative) for nouns. The gender of a noun is determined either by its meaning or by its termination. The nouns get inflected on the basis of the word ending, number and case.

The Gujarati adjectives are of two types: declinable and indeclinable. The declinable adjectives have the termination -ũ in neuter absolute. The masculine absolute of these adjectives ends in -o and the feminine absolute in -ī. For example, the adjective sārũ (good) takes the form sārũ, sāro and sārī when used for a neuter, masculine and feminine object respectively. These adjectives agree with the noun they qualify in gender, number and case. The adjectives that do not end in -ũ in neuter absolute singular are classified as indeclinable and remain unaltered when attached to a noun.

The Gujarati verbs are inflected based upon a combination of gender, number, person, aspect, tense and mood.

There are several postpositions in Gujarati which get bound to the nouns or verbs that they follow, for example the genitive marker (an n-initial suffix), -mā̃ (in) and -e (the ergative marker). These postpositions are agglutinated to the nouns or verbs rather than merely following them.

We created a list of hand-crafted Gujarati suffixes which contains the postpositions and the inflectional suffixes for nouns, adjectives and verbs for use in our approach.

4 Our Approach

Our approach is based on Goldsmith's (2001) take-all-splits method. Goldsmith's method was purely unsupervised, but we have used a list of hand-crafted Gujarati suffixes in our approach to learn a better set of stems and suffixes during the training phase. We make use of a list of Gujarati words extracted from the EMILLE corpus for the purpose of learning the probable stems and suffixes for Gujarati during the training phase. This set of stems and suffixes is then used for stemming any word provided to the stemmer. The details of our approach are described below.

4.1 Training Phase

During the training phase, we try to obtain the optimal split position for each word present in the Gujarati word list provided for training. We obtain the optimal split for any word by taking all possible splits of the word (see Figure 1) and choosing the split which maximizes the function given in Eqn 1 as the optimal split position. The suffix corresponding to the optimal split position is verified against the list of 59 Gujarati suffixes created by us. If it cannot be generated by agglutination of the hand-crafted suffixes, then the length of the word is chosen as the optimal split position, i.e., the entire word is treated as a stem with no suffix.

{stem1 + suffix1, stem2 + suffix2, ..., stemL + suffixL}, where suffixL = NULL

Figure 1: All Possible Word Segmentations

$$ f(i) = i \cdot \log(\mathrm{freq}(\mathit{stem}_i)) + (L - i) \cdot \log(\mathrm{freq}(\mathit{suffix}_i)) \qquad \text{(Eqn 1)} $$

i: split position (varies from 1 to L)
L: length of the word

The function used for finding the optimal split position reflects the probability of a particular split, since the probability of any split is determined by the frequencies of the stem and suffix generated by that split. The frequency of shorter stems and suffixes is very high when compared to the slightly longer ones. Thus the multipliers i (the length of stem_i) and L - i (the length of suffix_i) have been introduced in the function in order to compensate for this disparity.
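The split selection described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names (`can_agglutinate`, `split_score`, `optimal_split`) are hypothetical, and the frequency tables are plain dictionaries. Stems or suffixes that have not been seen are given a frequency of 1 so that their logarithm is 0, which is an assumption because the text does not say how frequencies are initialised. Split positions are counted in Python characters (code points), and the example words are romanised for readability.

```python
import math


def can_agglutinate(suffix, handcrafted):
    """Return True if `suffix` is a concatenation of hand-crafted suffixes.

    The empty (NULL) suffix is always accepted.
    """
    # reachable[k] is True when suffix[:k] can be built from listed suffixes.
    reachable = [True] + [False] * len(suffix)
    for end in range(1, len(suffix) + 1):
        reachable[end] = any(
            reachable[start] and suffix[start:end] in handcrafted
            for start in range(end)
        )
    return reachable[-1]


def split_score(word, i, stem_freq, suffix_freq):
    """Eqn 1: f(i) = i * log(freq(stem_i)) + (L - i) * log(freq(suffix_i))."""
    stem, suffix = word[:i], word[i:]
    # Assumption: unseen stems and suffixes count as frequency 1 (log = 0).
    score = i * math.log(stem_freq.get(stem, 1))
    if suffix:  # the NULL suffix has length L - i = 0 and contributes nothing
        score += (len(word) - i) * math.log(suffix_freq.get(suffix, 1))
    return score


def optimal_split(word, stem_freq, suffix_freq, handcrafted):
    """Pick the split maximising Eqn 1, then validate its suffix."""
    length = len(word)
    best = max(
        range(1, length + 1),
        key=lambda i: split_score(word, i, stem_freq, suffix_freq),
    )
    # If the suffix cannot be formed from hand-crafted suffixes,
    # treat the entire word as a stem with no suffix.
    if not can_agglutinate(word[best:], handcrafted):
        best = length
    return best


if __name__ == "__main__":
    handcrafted = {"no", "ne"}
    print(optimal_split("pashuno", {"pashu": 8, "pash": 2}, {"no": 20}, handcrafted))
    print(optimal_split("anek", {"ane": 50}, {"k": 5}, handcrafted))
```

With these toy frequencies, "pashuno" is split after five characters (pashu + no), while "anek" is kept whole: Eqn 1 alone would prefer ane + k, but "k" cannot be built from the hand-crafted suffixes. For text in Gujarati script, a code-point split can fall inside a consonant-plus-vowel-sign cluster, so a real implementation would have to decide which unit to split on.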
Once we obtain the optimal split of any word, we update the frequencies of the stem and suffix generated by that split. We iterate over the word list and re-compute the optimal split position until the optimal split positions of all the words remain unchanged. The training phase was observed to typically take three iterations.

4.2 Signatures

After the training phase, we have a list of stems and suffixes along with their frequencies. We use this list to create signatures. As shown in Figure 2, each signature contains a list of stems and a list of suffixes appearing with these stems.

Stems: pashu (animal), jang (war)
Suffixes: no, ne and three further suffixes beginning with n

Figure 2: Sample Signature

The signatures which contain very few stems or very few suffixes may not be useful in stemming of unknown words; thus we eliminate the signatures containing at most one stem or at most one suffix. The stems and suffixes in the remaining signatures will be used to stem new words. An overview of the training algorithm is shown in Figure 3.

Step 1: Obtain the optimal split position for each word in the word list provided for training, using Eqn 1 and the list of hand-crafted suffixes.
Step 2: Repeat Step 1 until the optimal split positions of all the words remain unchanged.
Step 3: Generate signatures using the stems and suffixes generated from the training phase.
Step 4: Discard the signatures which contain either only one stem or only one suffix.

Figure 3: Overview of training algorithm

4.3 Stemming of any unknown word

For stemming of any word given to the stemmer, we evaluate the function in Eqn 1 for each possible split using the frequencies of stems and suffixes obtained from the training process. The word is stemmed at the position for which the value of the function is maximum.

5 Experiments and Results

We performed various experiments to evaluate the performance of the stemmer using the EMILLE Corpus for Gujarati. We extracted around ten million words from the corpus. These words also contained Gujarati transliterations of English words. We tried to filter out these words by using a Gujarati-to-English transliteration engine and an English dictionary. We obtained 8,525,649 words after this filtering process.

We have used five-fold cross validation for evaluating the performance. We divided the extracted words into five equal parts, of which four were used for training and one for testing. In order to create gold standard data, we extracted a thousand words from the corpus randomly and tagged the ideal stem for these words manually.

For each of the five test sets, we measured the accuracy of stemming the words which are present in the test set as well as in the gold standard data. Accuracy is defined as the percentage of words stemmed correctly.

The experiments were aimed at studying the impact of (i) using a hand-crafted suffix list, (ii) fixing the minimum permissible stem size and (iii) providing unequal weightage to the stem and suffix for deciding the optimal split position. The results of these experiments are described in the following subsections.

5.1 Varying Minimum Stem Size

We varied the minimum stem size from one to six and observed its impact on the system performance. We performed the experiment with and without using the hand-crafted suffix list. The results of this experiment are shown in Table 1 and Figure 4.

The results of this experiment clearly indicate that there is a large improvement in the performance of the stemmer with the use of hand-crafted suffixes, and that the performance degrades if we keep a restriction on the minimum stem size. For higher values of minimum stem size, all the valid stems which are shorter than the minimum stem size do not get generated, leading to a decline in accuracy.

                   Accuracy (%)
Min Stem Size      With hand-crafted suffixes      Without hand-crafted suffixes
1                  67.86                           50.04
2                  67.70                           49.80
3                  66.43                           49.60
4                  59.46                           46.35
5                  51.65                           41.22
6                  43.81                           36.89

Table 1: Effect of use of hand-crafted suffixes and fixing min. stem size on stemmer's performance

Figure 4: Variation in the stemmer's accuracy with the variation in min. stem size

There are several spurious suffixes which get generated during the training phase and degrade the performance of the stemmer when we don't use the hand-crafted suffix list. For example, k is not a valid inflectional Gujarati suffix, but it does get generated if we don't use the hand-crafted suffix list, due to words such as anek (many) and ane (and). A simple validation of the suffixes generated during training against the hand-crafted suffix list leads to learning of better suffixes, and in turn better stems, during the training phase, thereby improving the system's performance.

Thus we decided to make use of the hand-crafted suffix list during the training phase and not to put any restriction on the minimum stem size.

5.2 Providing unequal weightage to stem and suffix

We have provided equal weightage to stem and suffix in Eqn 1, which is responsible for determining the optimal split position of any word. We obtained Eqn 2 from Eqn 1 by introducing a parameter α in order to provide unequal weightage to the stem and suffix and observe its effect on system performance. We used Eqn 2 instead of Eqn 1 and varied α from 0.1 to 0.9 in this experiment. The results of this experiment are shown in Table 2.

$$ f(i) = \alpha \cdot i \cdot \log(\mathrm{freq}(\mathit{stem}_i)) + (1 - \alpha) \cdot (L - i) \cdot \log(\mathrm{freq}(\mathit{suffix}_i)) \qquad \text{(Eqn 2)} $$

Table 2: Effect of α on the stemmer's performance

The accuracy was found to be maximum when the value of α was fixed to 0.5, i.e., when stem and suffix were given equal weightage for determining the optimal split of any word.

6 Conclusion and Future Work

We developed a lightweight stemmer for Gujarati using a hybrid approach, which has an accuracy of 67.86%. We observed that use of a hand-crafted Gujarati suffix list boosts the accuracy by about 17%. We also found that fixing the minimum stem size and providing unequal weightage to stem and suffix degrade the performance of the system.

Our stemmer is lightweight and removes only the inflectional endings, as we have developed it for use in an IR system. The list of hand-crafted suffixes can be extended to include derivational suffixes for performing full-fledged stemming, which may be required in applications such as displaying words in a user interface.

So far we have measured the performance of the stemmer only in terms of accuracy. We plan to evaluate the stemmer in terms of the index compression achieved and the impact on precision and recall of a Gujarati IR system.

References

The EMILLE Corpus, http://www.lancs.ac.uk/fass/projects/corpus/emille/

Creutz, Mathias and Krista Lagus. 2005. Unsupervised morpheme segmentation and morphology
induction from text corpora using Morfessor 1.0. Technical Report A81, Publications in Computer and Information Science, Helsinki University of Technology.

Creutz, Mathias and Krista Lagus. 2007. Unsupervised models for morpheme segmentation and morphology learning. Association for Computing Machinery Transactions on Speech and Language Processing, 4(1):1-34.

Dasgupta, Sajib and Vincent Ng. 2006. Unsupervised Morphological Parsing of Bengali. Language Resources and Evaluation, 40(3-4):311.

Goldsmith, John A. 2001. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198.

Goldsmith, John A. 2006. An algorithm for the unsupervised learning of morphology. Natural Language Engineering, 12(4):353-371.

Jurafsky, Daniel and James H. Martin. 2009. Speech and Language Processing: An Introduction to Natural Language Processing, Speech Recognition, and Computational Linguistics. 2nd edition. Prentice-Hall, Englewood Cliffs, NJ.

Lovins, Julie B. 1968. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22-31.

Majumder, Prasenjit, Mandar Mitra, Swapan K. Parui, Gobinda Kole, Pabitra Mitra and Kalyankumar Datta. 2007. YASS: Yet another suffix stripper. Association for Computing Machinery Transactions on Information Systems, 25(4):18-38.

Pandey, Amaresh K. and Tanveer J. Siddiqui. 2008. An unsupervised Hindi stemmer with heuristic improvements. In Proceedings of the Second Workshop on Analytics For Noisy Unstructured Text Data, 303:99-105.

Porter, Martin F. 1980. An algorithm for suffix stripping. Program, 14(3):130-137.

Ramanathan, Ananthakrishnan and Durgesh D. Rao. A Lightweight Stemmer for Hindi. Workshop on Computational Linguistics for South Asian Languages, EACL, 2003.

Tisdall, William St. Clair. 1892. A simplified grammar of the Gujarati language: together with A short reading book and vocabulary. D. B. Tarapo

