
As we move towards an increasingly globalized and knowledge-based economy, the ability to discover and share information across language and cultural boundaries has become ever more crucial. With the advent and rapid development of the Internet, the amount of information generated in different languages and disseminated via the World Wide Web (WWW) and social media platforms is growing exponentially. As the Internet becomes more pervasive, its users have become linguistically more diverse and culturally more heterogeneous. It is thus of utmost importance to ensure that online information resources and services are efficiently and equitably accessible to all users, regardless of their linguistic and cultural backgrounds. Unfortunately, owing to language barriers and the linguistic digital divide, the majority of the world’s population, including native speakers of resource-scarce African languages, has long been denied the opportunity to access and benefit from online resources. Most of the current major search engines and commercial Information Retrieval (IR) systems have focused primarily on well-resourced European and Asian languages and have paid little attention to the development of Cross-Language Information Retrieval (CLIR) for resource-scarce African languages. The need to explore and build more specialized information systems that enable speakers of African languages to discover valuable information beyond linguistic and cultural barriers has, therefore, become more urgent than ever before. Taking these facts into consideration, this study aims to explore and build an experimental CLIR system between one of the severely under-resourced African languages, Afaan Oromo, and one of the most commonly used online languages, English. We have focused on developing and evaluating the first Oromo-English CLIR system (hereafter also referred to as OMEN-CLIR), which is designed to make effective use of limited linguistic resources to search and retrieve relevant information across language and cultural boundaries.

Although Afaan Oromo is one of the major indigenous African languages, spoken by more than 35 million people in the Horn of Africa, it is still considered one of the most under-resourced languages, especially when being resourced is measured by the extent to which a language is supported by computational linguistic resources and information access technologies. Afaan Oromo poses a huge challenge to the development of natural language processing and information access technologies, not only because it is one of the most resource-scarce languages, but also because it has very rich and complex morphological processes. Unlike English, Afaan Oromo is a highly synthetic language with a very productive inflectional and derivational morphology. In this study, we have focused on exploring and building the basic linguistic resources and IR tools required for designing and developing OMEN-CLIR. The major linguistic resources and translation tools that we have designed and developed during the course of this study include a generic computational model of Afaan Oromo morphology, a machine-readable bilingual dictionary, a rule-based Afaan Oromo stemmer, and lists of Afaan Oromo suffixes and stopwords. Our machine-readable Oromo-English dictionary has served as the main source of knowledge for query translation, while our Afaan Oromo stemmer has played a key role in identifying and normalizing word form variations.
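To make the interplay of these resources concrete, the following is a minimal sketch of dictionary-based query translation with suffix stripping and stopword removal. It is an illustration only, not the OMEN-CLIR implementation: the stopword list, suffix list, dictionary entries, minimum stem length, and all function names are toy placeholders assumed here, and the longest-match suffix stripping is a generic simplification of a rule-based stemmer.

```python
# Toy placeholder resources (assumptions for illustration only);
# these are NOT the suffix list, stopword list, or dictionary built in the study.
STOPWORDS = {"fi"}
SUFFIXES = sorted(["oota", "tti", "a"], key=len, reverse=True)  # longest first
DICTIONARY = {"mana": ["house", "home"], "bishaan": ["water"]}
MIN_STEM_LEN = 3  # hypothetical guard against over-stemming short words


def stem(word):
    """Strip the longest matching suffix, keeping at least MIN_STEM_LEN characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


# Index the dictionary by stemmed headword so inflected query terms can match.
STEM_INDEX = {}
for headword, translations in DICTIONARY.items():
    STEM_INDEX.setdefault(stem(headword), []).extend(translations)


def translate_query(query):
    """Translate a source-language query into a bag of target-language terms."""
    terms = []
    for token in query.lower().split():
        if token in STOPWORDS:
            continue  # drop function words before lookup
        # Terms missing from the dictionary (e.g. proper names) are kept untranslated.
        terms.extend(STEM_INDEX.get(stem(token), [token]))
    return terms


print(translate_query("manoota fi bishaan"))
```

In this sketch, stemming both the dictionary headwords and the query terms lets different surface forms of a word reach the same entry, which is the role of word form normalization described above. All candidate translations of a term are kept, which is one simple way to handle translation ambiguity under limited resources.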

Apart from designing and building the components of OMEN-CLIR, for which the necessary linguistic resources and translation tools were crafted from scratch, another major contribution of this study is the evaluation of the proposed retrieval system. To assess the performance of OMEN-CLIR, we participated in two annual campaigns of the well-recognized international Cross-Language Evaluation Forum (CLEF). The main focus of this evaluation was to assess how well OMEN-CLIR performs in a standard international evaluation setting, namely the ad-hoc track of the CLEF campaign. Besides a number of official runs that we submitted to the ad-hoc tracks of the CLEF-2006 and CLEF-2007 campaigns, we have conducted various additional retrieval experiments to improve the performance of OMEN-CLIR. The evaluation results are very promising, given the disparity of the languages involved and the limited amount of linguistic resources used for developing OMEN-CLIR. In one of our official retrieval experiments, in which we applied our Afaan Oromo stemmer, our CLIR system achieved an average mean precision (AMP) of 29.90%, which is about 67.95% of a monolingual baseline. In general, the evaluation results show that it is viable to design and develop a CLIR system for resource-scarce African languages without relying on very rich linguistic resources that are not yet available for severely under-resourced languages like Afaan Oromo.
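As a reading aid, the relation between the two reported percentages can be written out. The monolingual baseline score itself is not stated here; the value below is only what the two reported figures imply, and the symbols are introduced for illustration.

```latex
\[
\frac{\mathrm{AMP}_{\text{CLIR}}}{\mathrm{AMP}_{\text{mono}}} \approx 0.6795
\quad\Longrightarrow\quad
\mathrm{AMP}_{\text{mono}} \approx \frac{29.90\%}{0.6795} \approx 44.0\%
\]
```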