Background

Digitization of Arabic-Language Books
Conference Planning Proposal
10 March 2005

The recent Google announcement heralds and exemplifies an impending wave of large-scale book digitization efforts that will dramatically change the information landscape of the English-speaking world and, to a lesser extent, other advanced countries. It is self-evidently important for political, educational, and cultural-custodial reasons that the Arabic-speaking world take a meaningful part in this sea change in the use and availability of texts.


Stanford University proposes to develop an international cooperative effort toward secure, effective, and useful digitization of Arabic-language books, in effect complementing what Google intends for Roman-alphabet (especially English and Western European language) literatures. We see a planning project as an immediate first step toward establishing a permanent core of Arabic books for both academic and popular use. In essence, this planning phase will bring together a group of potential partners and construct a cooperative organizational and technical base for a large-scale project to follow. Potential partners and contributors will be drawn from collections in the Middle East, North America, Europe, and other regions as available.


At present, we anticipate that most books will be scanned in place: because most libraries are unlikely to be willing to ship their holdings overseas, our operating assumption is that the raw digitizing will be done locally. This work will not require Arabic language skills (assuming unique identifiers, such as barcodes, are used for control purposes) and can rely to some extent on student workers or outside contractors.


However, subsequent processing (including quality review, OCR, chapter or other structural identification, further bibliographic or metadata description, encoding, and publishing) would best be performed by people with Arabic reading skills and, of course, in an area with reasonably low prevailing wages (at least compared to Palo Alto and similar university settings). Thus, we envisage a process in which large numbers of page images will be transmitted from various partner organizations to the BA, where finished, quality-controlled products will be produced. These products would then be redistributed as appropriate.


Early stages of such a project would establish best practices for steps such as file transfer, OCR (which is relatively well developed for Arabic fonts), QC, end-processing, encoding, publishing, and archiving. Over a period of several years, we believe a major collection of up to 500,000 titles could be created.


Copyright is a significant issue, as with all such efforts. We will adhere to the various copyright laws and regulations; our mission is to digitize for preservation, indexing, intellectual access, and other non-display/distribution purposes. Egyptian law may contain provisions of particular benefit to this undertaking. According to Online Security (an unverified source): “In official settings, copying from compilations for teaching purposes, for use in judiciary or administrative procedures, or for storage in a public library or archive are expressly allowed by Law 82.” A closely related issue would be the opportunity to provide much, if not all, of the digitized Arabic content to the Open Content project overseen by the Hewlett Foundation.


The critical first step will be to bring potential partners together for a serious, outcome-oriented discussion. We propose holding such a working group in Alexandria, in the fall of 2006, with approximately six North American curators and librarians, a like number of Middle Eastern colleagues, several technologists, and representatives of the Foundation. The outcomes of the working group would be a consortium agreement, a work plan (which would address in very broad strokes a strategy for defining a core collection), and, slightly later, a proposal to one or several foundations for operating funds.


Having a plan in place, even in advance of substantial work on the plan itself, could have a beneficial impact, in that it will encourage other institutions to address aspects and segments of the literatures not represented.
One of the fundamental questions for something of such scale and potential as this is “where do we start?” If we answer that question for the immediate participants, it will inform others as well; they don’t have to concern themselves particularly with segments of literatures prospectively handled by others.