The Arabic-UNL Dictionary

Considerable care has gone into making the dictionary suitable, and in the required format, for the morphological, syntactic and semantic processing needed in both language analysis and language generation. The Arabic-UNL dictionary stores four types of linguistic information:
• Morphological information: The information that governs the correct structure of Arabic words, word formation and the different word forms, i.e., part of speech, lexical structure and inflectional paradigms.
• Syntactic information: The information responsible for generating well-formed Arabic structures, i.e., valency, aspect and subcategorization frames.
• Morpho-syntactic information: The information concerning grammatical categories and linguistic units that have both morphological and syntactic properties, such as transitivity, gender and number.
• Semantic information: Information about the semantic classification of words. It allows for the mapping between the semantic information in graphs and the syntactic structures of the generated sentences.
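The four information types above can be pictured as groups of fields on a single entry. The following is a minimal sketch, not the dictionary's actual schema: the class name, field names and sample values are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DictionaryEntry:
    """Hypothetical entry layout; every field name here is an assumption."""
    lemma: str
    # Morphological information
    part_of_speech: str
    inflectional_paradigm: str
    # Syntactic information
    valency: int
    subcat_frame: str
    # Morpho-syntactic information (e.g. transitivity, gender, number)
    features: dict = field(default_factory=dict)
    # Semantic information
    semantic_class: str = ""

    def by_information_type(self) -> dict:
        """Group the fields under the four information types."""
        return {
            "morphological": {
                "part_of_speech": self.part_of_speech,
                "inflectional_paradigm": self.inflectional_paradigm,
            },
            "syntactic": {
                "valency": self.valency,
                "subcat_frame": self.subcat_frame,
            },
            "morpho-syntactic": dict(self.features),
            "semantic": {"semantic_class": self.semantic_class},
        }


# Made-up sample values for a transitive verb (transliterated "kataba").
entry = DictionaryEntry(
    lemma="kataba",
    part_of_speech="verb",
    inflectional_paradigm="paradigm-1",
    valency=2,
    subcat_frame="subject+object",
    features={"transitivity": "transitive"},
    semantic_class="action",
)
grouped = entry.by_information_type()
```

Grouping the fields this way keeps the semantic layer separate from the layers that drive word formation and sentence structure.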
The Arabic dictionary takes into account the varieties of Arabic used across the Arab world, where the largest divisions occur between the spoken languages of different regions. For example, the Arabic word “حسم” means “decisiveness” in Egypt but “discount” in Jordan. Both meanings are represented in the Arabic dictionary.

The Sources Used:


Different sources have been used to expand the Arabic-UNL dictionary:
• The English WordNet 3.0: 117,659 synsets have been associated with the corresponding lexical items of Arabic. It has been chosen because of the large number of concepts it covers and the information provided for each (gloss, examples, etc.).
• The International Corpus of Arabic (ICA): 50,000 lexemes have been selected from ICA to increase the size of the Arabic dictionary. ICA has been chosen because it contains 100 million words that reflect actual use of the Arabic language, selected from a wide range of sources representing a broad cross-section of regional varieties of Arabic.
• Wikipedia: about 72,000 Wikipedia entries have been added to the Arabic dictionary.

Dictionary Types:


The Arabic-UNL dictionary is divided into four types:
• Default Dictionary: language-independent; contains punctuation signs, blank spaces and regular expressions that handle special cases (such as URLs, e-mail addresses, etc.).
• Closed-Class Dictionary: contains natural language words that are not associated with UWs (such as determiners, adpositions and conjunctions) or that are associated with pro-UWs (pronouns).
• Open-Class Dictionary: contains natural language words that are associated with UWs (nouns, verbs, adjectives and adverbs).
• Proper Noun Dictionary: contains natural language words that have been classified as proper nouns.
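As a rough illustration of this four-way split, the sketch below routes a token to one of the dictionary types. It is only an assumption-laden toy: the function name, the part-of-speech labels and the URL/e-mail pattern are hypothetical and far simpler than what a real Default Dictionary would need.

```python
import re

# Hypothetical, deliberately simple pattern for URLs and e-mail addresses.
URL_OR_EMAIL = re.compile(r"^(https?://\S+|[\w.+-]+@[\w-]+\.[\w.-]+)$")

# Part-of-speech labels are illustrative, taken from the categories named above.
CLOSED_CLASS = {"determiner", "adposition", "conjunction", "pronoun"}
OPEN_CLASS = {"noun", "verb", "adjective", "adverb"}


def dictionary_type(token: str, pos: str = "") -> str:
    """Return which of the four dictionary types would hold this token."""
    # Blank spaces, URLs/e-mails and pure punctuation are language-independent.
    if token.isspace() or URL_OR_EMAIL.match(token):
        return "default"
    if token and all(not ch.isalnum() for ch in token):
        return "default"
    if pos == "proper noun":
        return "proper noun"
    if pos in CLOSED_CLASS:
        return "closed-class"
    if pos in OPEN_CLASS:
        return "open-class"
    raise ValueError(f"cannot classify {token!r} with pos {pos!r}")
```

For instance, `dictionary_type(",")` falls into the default dictionary without any part-of-speech label, while an Arabic noun needs `pos="noun"` to reach the open-class dictionary.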

Dictionary Size:


The Open-Class Dictionary is available in two different formats:
• Generative Dictionary: base forms with the corresponding lexical features and generation rules (to be used in natural language generation). It contains 218,710 entries.
• Enumerative Dictionary: word forms with the corresponding lexical features (to be used in natural language analysis). It contains 2,329,403 entries.
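The relationship between the two formats can be sketched as an expansion step: a generative entry (base form plus rules) yields several enumerative entries (one per word form). The code below is a toy under stated assumptions; the rule representation, the function name and the simplified transliterated suffixes are hypothetical and do not reflect the dictionary's real generation rules.

```python
def enumerate_forms(base: str, rules: list) -> list:
    """Expand one base form into word-form entries using (suffix, features) rules."""
    return [
        {"form": base + suffix, "base": base, **features}
        for suffix, features in rules
    ]


# Toy suffixation rules for a sound-plural noun, in simplified transliteration.
TOY_RULES = [
    ("", {"gender": "masculine", "number": "singular"}),
    ("a", {"gender": "feminine", "number": "singular"}),
    ("un", {"gender": "masculine", "number": "plural"}),
    ("at", {"gender": "feminine", "number": "plural"}),
]

forms = enumerate_forms("muallim", TOY_RULES)
surface_forms = [f["form"] for f in forms]

# Ratio of the two entry counts quoted above.
ratio = round(2_329_403 / 218_710, 2)
```

Taken at face value, the two entry counts give about 10.65 enumerative entries per generative entry, which is the kind of growth such an expansion produces.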

Dictionary Availability:

• The Arabic-UNL dictionary addresses the problem of accessibility by being open source.
• The Arabic-UNL dictionary is exported as a text file, which offers maximum flexibility for users and makes it readable by both humans and machines.
• The dictionary is available through: http://www.unlweb.net/unlarium/index.php?action=export .
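Because the export is plain text, it can be read with a few lines of code. The sketch below assumes each entry sits on one line shaped like `[word] {id} "UW" (attributes) <flags>;`. That layout, the sample line and all names in the code are assumptions for illustration; the actual export should be checked before relying on this pattern.

```python
import re

# Assumed one-line entry shape: [word] {id} "UW" (attr, ...) <flag, ...>;
ENTRY_PATTERN = re.compile(
    r'^\[(?P<word>[^\]]*)\]\s*'
    r'\{(?P<id>[^}]*)\}\s*'
    r'"(?P<uw>[^"]*)"\s*'
    r'\((?P<attributes>[^)]*)\)\s*'
    r'<(?P<flags>[^>]*)>\s*;'
)


def split_list(text: str) -> list:
    """Split a comma-separated field and drop empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_entry(line: str):
    """Parse one exported line; return None if it does not fit the assumed shape."""
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        return None
    return {
        "word": match.group("word"),
        "id": match.group("id"),
        "uw": match.group("uw"),
        "attributes": split_list(match.group("attributes")),
        "flags": split_list(match.group("flags")),
    }


# Made-up sample line, not taken from the real export.
sample = '[kitab] {1} "book" (POS=NOU, GEN=MCL) <ara, 0, 0>;'
parsed = parse_entry(sample)
```

Returning `None` for lines that do not match lets a caller skip comments or blank lines without raising.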

Dictionary Projects:


Several projects have been carried out in order to increase the dictionary size:
• MIR, which contains 117,659 entries. It aims at creating UNL->NL (generation) dictionaries based on WordNet 3.0.
• BRUNO, which contains 51,934 entries. It aims at providing NL->UNL (analysis) dictionaries based on the frequency of occurrence of lemmas in the source language.
• WIKIPEDIA, which contains 72,047 entries. It aims at creating dictionary entries corresponding to the titles of Wikipedia articles.

Dictionary Applications:


MUHIT (MUltilingual Harmonized dIcTionary) is a multilingual electronic dictionary whose entries are interlinked by sense. In MUHIT, natural language word forms are associated with a uniform concept identifier. The name "Muhit" was inspired by the Arabic word المحيط (al-Muhit), which means "ocean" and "comprehensive", and is part of the title of one of the most celebrated Arabic dictionaries (al-Qamus al-Muhit), compiled by al-Firuzabadi (1329–1414) and widely used for centuries. MUHIT contains more than 14,000,000 word forms in more than 50 languages. Arabic ranks first in MUHIT, representing about 20% of the whole; the Arabic share is 2,329,403 word forms (http://www.unlweb.net/muhit/index.php?muhit=report).
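Linking word forms to a uniform concept identifier means that a lookup in one language can be followed to any other language through the shared identifier. A minimal sketch of that idea follows; the class, its methods, the identifier "C1" and the sample words are hypothetical and are not MUHIT's actual interface or data.

```python
from collections import defaultdict


class ConceptIndex:
    """Toy sense-linked index: concept id -> language -> word forms."""

    def __init__(self):
        self._forms = defaultdict(lambda: defaultdict(list))
        self._concepts = defaultdict(set)  # (language, form) -> concept ids

    def add(self, concept_id: str, language: str, form: str) -> None:
        self._forms[concept_id][language].append(form)
        self._concepts[(language, form)].add(concept_id)

    def translate(self, form: str, source: str, target: str) -> list:
        """Follow the shared concept identifiers from source to target language."""
        results = []
        for concept_id in sorted(self._concepts.get((source, form), ())):
            results.extend(self._forms[concept_id].get(target, []))
        return results


index = ConceptIndex()
# "C1" is a made-up concept identifier.
index.add("C1", "en", "ocean")
index.add("C1", "ar", "محيط")
index.add("C1", "fr", "océan")

arabic = index.translate("ocean", "en", "ar")
```

Because every language attaches to the same identifier, adding a new language requires no pairwise bilingual mapping.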