International Corpus of Arabic

The International Corpus of Arabic (ICA)

Bibliotheca Alexandrina (BA) is one of the international Egyptian organizations that play a significant role in disseminating culture and knowledge and supporting scientific research. It initiated a leading project to build the “International Corpus of Arabic (ICA)”, an ambitious attempt to build a representative corpus of the Arabic language as it is used all over the Arab world, with the aim of supporting research on the Arabic language.
The ICA is a step-by-step guide to creating and analyzing Arabic linguistic corpora. Once finished, the analyzed version of the ICA will be the first planned analyzed corpus available as a linguistic resource for researchers. It is also the first systematic investigation of national varieties within the Arabic language, which should prove very useful for linguists who believe that their theories and descriptions of Arabic should be based on real, rather than contrived, data.
Alansary et al. started collecting the ICA in 2006. It should ultimately contain 100 million words. The samples are of written Modern Standard Arabic (MSA), selected from a wide range of sources that represent a broad cross-section of regional varieties of Arabic.


ICA Collection


In collecting a representative corpus of the Arabic language, our main focus was to cover the same genres from different sources across all Arab countries. Therefore, the ICA includes:
1. Diverse sources: newspapers, web articles, books, etc.
Some of these sources are divided into sub-sources. For example, the source “Press” is divided into “Newspapers”, “Electronic Press” and “Magazines”; the last is subsequently divided into “General” and “Specialized” magazines.
2. Diverse genres: Literature, Politics, Sciences, etc.
Some genres are also divided into sub-genres. For example, the genre “Literature” is divided into “Prose”, “Poetry” and “Studies of Linguistics and Literature”; the sub-genre “Prose” is further divided into “Novels”, “Short Stories”, “Child Stories” and “Plays”.
The following are some of the criteria we bore in mind when collecting the required data:
1. Different sources and genres should be weighted in proportion to how common they are.
2. The number of categories the corpus should contain, the number of texts in each category and the number of words in each sample should all be weighed.
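The first criterion amounts to a proportional allocation of a word budget. The following is a minimal sketch of that idea, not the ICA's actual procedure: the function name `allocate_words` and the weights are hypothetical and chosen purely for illustration, while the 100-million-word budget is the target size stated for the corpus.

```python
def allocate_words(total_words, weights):
    """Split a word budget across categories in proportion to integer weights."""
    weight_sum = sum(weights.values())
    # Integer share per category, rounded down.
    shares = {name: total_words * w // weight_sum for name, w in weights.items()}
    # Hand any words lost to rounding to the most heavily weighted categories.
    leftover = total_words - sum(shares.values())
    for name in sorted(weights, key=weights.get, reverse=True)[:leftover]:
        shares[name] += 1
    return shares


# Hypothetical weights, for illustration only; not the ICA's real proportions.
example_weights = {"Press": 5, "Net articles": 2, "Books": 2, "Academics": 1}
targets = allocate_words(100_000_000, example_weights)
```

The same function could be applied again inside each source to split its share across genres, which corresponds to the second criterion.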


ICA Design


In designing the ICA, we tried to arrive at a design that would make searching within the corpus as economical and easy as possible. The design chosen was to break up the corpus into the different sources (books, newspapers, etc.), and subsequently to break up these sources into the various genres (Literature, Sciences, etc.). In addition, a careful record of a variety of variables is kept with every text: when and where the text was written and published, its source and its genre.
• There are 4 sources across the corpus, namely: Press, Net articles, Books and Academics.
• The Press source is divided into three sub-sources, namely: Newspapers, Magazines and Electronic Press.
• There are 11 genres across the corpus, namely: Strategic Sciences, Social Sciences, Sports, Religion, Literature, Humanities, Natural Sciences, Applied Sciences, Art, Biography and Miscellaneous.
• There are 24 sub-genres, namely: Politics, Law, Economy, Sociology, Islamic, Christian, Other religions, Comparative religion, Prose, Poetry, Studies of Literature and Linguistics, History, Psychology, Philosophy, Geography, Biology, Physics, Chemistry, Geology and Environment, Space, Medicine, Engineering, Agriculture and Technology.
• There are 4 sub-sub-genres, namely: Novels, Short Stories, Child Stories and Plays.
• Publications from across the Arab world are covered, in addition to some texts published outside the Arab world.
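The design described here, a source and genre breakdown plus a metadata record kept with every text, can be sketched as a small data structure. This is only an illustration under stated assumptions: `TextRecord`, `filter_records` and all field names are hypothetical and do not describe the ICA's actual storage format. Only the Literature sub-genres are spelled out, because the mapping of the other sub-genres to genres is not given in the text.

```python
from dataclasses import dataclass
from typing import Optional

# Sources and their sub-sources, as listed in the design.
SOURCES = {
    "Press": ["Newspapers", "Magazines", "Electronic Press"],
    "Net articles": [],
    "Books": [],
    "Academics": [],
}

GENRES = [
    "Strategic Sciences", "Social Sciences", "Sports", "Religion",
    "Literature", "Humanities", "Natural Sciences", "Applied Sciences",
    "Art", "Biography", "Miscellaneous",
]

# Sub-genres of Literature and the sub-sub-genres of Prose.
LITERATURE = {
    "Prose": ["Novels", "Short Stories", "Child Stories", "Plays"],
    "Poetry": [],
    "Studies of Literature and Linguistics": [],
}


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical per-text metadata record (field names are assumptions)."""
    text_id: str
    source: str
    genre: str
    country: str   # where the text was written or published
    year: int      # when the text was written or published
    sub_source: Optional[str] = None
    sub_genre: Optional[str] = None  # not validated in this sketch

    def __post_init__(self):
        # Reject values that fall outside the source and genre breakdown.
        if self.source not in SOURCES:
            raise ValueError(f"unknown source: {self.source}")
        if self.genre not in GENRES:
            raise ValueError(f"unknown genre: {self.genre}")
        if self.sub_source is not None and self.sub_source not in SOURCES[self.source]:
            raise ValueError(f"unknown sub-source: {self.sub_source}")


def filter_records(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]
```

With records like these, a search such as `filter_records(records, source="Press", genre="Sports")` narrows the corpus by source first and genre second, mirroring the breakdown chosen in the design.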


Corpus Analysis


Currently, the analysis stage covers morphological analysis. It was done automatically, using both statistical and rule-based approaches, and builds on one of the best-known Arabic morphological analyzers, the Buckwalter Arabic Morphological Analyzer (BAMA) by Tim Buckwalter. For each word, the analysis lists information such as prefix(es), suffix(es), word class, stem, lemma, root and stem pattern, as well as number, gender and definiteness, according to the word’s context within the corpus.
Buckwalter’s Morphological Analyzer Enhancer (BAMAE)
BAMAE is a software application that helps to limit and enhance the solutions output by Buckwalter’s analyzer. It emulates other systems that aim to enhance BAMA’s output, for example the LDC Standard Arabic Morphological Analyzer (SAMA). The enhancer has been used to build an enhanced lexicon from the training data, from rules and information derived from the training data, and from Buckwalter’s solutions. This enhanced lexicon helps in:

  • Excluding and eliminating the wrong solutions.
  • Predicting the best analysis solution(s) for a word according to its context.
  • Supplying the missing information (i.e. definiteness, gender and number) for words.
  • Predicting the root and the word pattern of some words correctly.

To reach the best solution for an input word, BAMAE applies linguistic enhancements through successive disambiguation stages: it starts at the word level, continues at the context level and, if both of these levels fail to disambiguate, finally applies the memory level.
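The staged disambiguation can be pictured as a fallback cascade in which each level is tried only when the previous one fails. The sketch below shows that control flow only. What each level actually does inside BAMAE is not described here, so the three level functions are placeholder assumptions, and every name (`disambiguate`, `word_level`, `make_context_level`, `make_memory_level`) is hypothetical.

```python
from collections import Counter


def disambiguate(word, context, candidates, levels):
    """Try each level in order; return (level name, analysis) for the first success."""
    for name, level in levels:
        choice = level(word, context, candidates)
        if choice is not None:
            return name, choice
    return None, None


def word_level(word, context, candidates):
    # Placeholder: succeed only when a single candidate analysis exists.
    return candidates[0] if len(candidates) == 1 else None


def make_context_level(expected_after):
    # expected_after: hypothetical map from the previous token to a word class.
    def context_level(word, context, candidates):
        expected = expected_after.get(context[-1]) if context else None
        matches = [c for c in candidates if c["word_class"] == expected]
        return matches[0] if len(matches) == 1 else None
    return context_level


def make_memory_level(seen):
    # seen: hypothetical Counter of (word, word class) pairs from training data.
    def memory_level(word, context, candidates):
        if not candidates:
            return None
        best = max(candidates, key=lambda c: seen[(word, c["word_class"])])
        return best if seen[(word, best["word_class"])] > 0 else None
    return memory_level


# Illustrative data in Buckwalter-style transliteration: "ktb" may be read
# as a noun ("books") or a verb ("wrote"); "fy" is the preposition "in".
candidates = [
    {"lemma": "kitAb", "word_class": "NOUN"},
    {"lemma": "katab", "word_class": "VERB"},
]
levels = [
    ("word", word_level),
    ("context", make_context_level({"fy": "NOUN"})),
    ("memory", make_memory_level(Counter({("ktb", "VERB"): 3}))),
]
after_preposition = disambiguate("ktb", ["fy"], candidates, levels)
without_context = disambiguate("ktb", [], candidates, levels)
```

In this toy run the word following "fy" is resolved at the context level, while the same word with no context falls through to the memory level.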

Precision and recall were used to evaluate BAMAE on a sample of the training data (60,000 of 450,000 words). Precision was 0.87 and recall was 0.83.
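For readers unfamiliar with the two measures, the sketch below uses their standard definitions: precision is the share of returned solutions that are correct, and recall is the share of correct solutions that are returned. How solutions were counted in the BAMAE evaluation is not stated, so the counts in the example are invented for illustration. The F1 score, the harmonic mean of the two, is not reported in the text; it is derived here from the reported values only as a convenience.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Standard precision and recall from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy counts, invented for illustration only.
p, r = precision_recall(true_positives=8, false_positives=2, false_negatives=2)

# F1 derived from the reported precision (0.87) and recall (0.83): about 0.85.
derived_f1 = f1_score(0.87, 0.83)
```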