Sunteți pe pagina 1din 37

Developing Digital Libraries:

General Strategies Sukhdev Singh


Based on From Paper to Collection by Dr. Michel Loots, Dan Camarzan and Ian H Witten. Human Info NGO, Belgium; Simple Words, Romania; University of waikato, New Zealand.

INTRODUCTION
Typical steps that have to be implemented are: i. Selecting the documents to be included ii. Securing copyrights permissions to use these documents in the digital library iii. Scanning and OCR of the hard-copy documents which are not available in to digital form to have a perfect digital format iv. Converting all documents to a format (integrating text and images).

v. Tagging the chapters, paragraphs and images of the digital documents vi. Organising the collection into a optimally structured digital library vii. Building the digital library using the Greenstone software viii. Printing and distributing the collection on CD-ROM and/or distributing it over the Internet

Answers the following before jumping in:


What is the goal of your collection? What is your target group? How big is itlocal, regional, or globaly? How many documents are you making available? How many pages? How much graphics content?

Does the material split into parts that will be consulted by a limited audience and parts that need to be disseminated widely? Are the documents already available electronically? If so, in which formats? (Note incidentally that PDF files are not automatically equivalent to digital full-text form, as they often contain only page images.) What is the copyright status of the

Who owns the copyright? Are there other organizations with the same target audience? Are you willing to collaborate with other groups? What budget is available for the whole project? What human resources are available (in person-months) for co-ordination, editing, scanning and programming? How many computers are available for this

How many CD-ROMs do you want to distribute? Will they be free, or for sale?

SCANNERS AND SCANNING


The first step in converting paper documents into a digital library collection is to obtain images of all pages of all publications in digital format.

Scanners
Scanners are available in all price ranges, and all shapes and sizes. They range from Rs. 5000 for flat-bed scanners to upwards of Rs. 25,00,000 for large industrial scanners from manufacturers such as Bell & Howell.

The output format of a scanned page is a computer file that is usually stored in TIFF or Bitmap format. Compressed TIFF IV is the best format to use. An average page scanned and converted to this format occupies only 50 Kb, compared to perhaps 2 Mb for the equivalent page in uncompressed Bitmap form.

Low-cost flat-bed scanner


Low-cost flat-bed units are the cheapest and most widely available type of scanner. There are many brands: HP, Agfa, Acer, etc. Prices range from 5,000 to 15,000. Both black-and-white and color images can be scanned. The low price allows each computer to have its own scanner.

Fact is that rates exceeding twelve pages per hour are rarely achieved.

these scanners are useful only for small jobs with limited numbers of pagesno more than 200 to 400 pages a month on a regular basis, or one-time jobs of up to 1000 or 2000 pages.

Low-end scanner with sheet feeder


Low-end scanners with sheet feeders typically cost between Rs. 25,000 and Rs. 60,000. Ten to fifty pages can be inserted, scanned and processed at once: thus the operator does not have to attend constantly to the machine. This increases capacity up to 150 to 200 pages per day. These scanners are more robust, and have a larger lifespan before repairusually in the range 30,000 to 50,000 pages.


A disadvantage is that only one side of the page is scanned at a timethe stack of pages must be reversed and rescanned in order to obtain an image of both sides. This often creates problems because sheet feeders are never without problems and sometimes pages get blocked. These scanners are useful for up to 1500 to 3000 pages a month.

Color scanners
Any scanning operation invariably involves some color images, so a color scanner will always be required. Generally speaking, less than 5% of any publication contains color images, plus the cover. Thus a low cost flat-bed scanner as described above suffices. It is advisable to select one capable of scanning up to 600 dpi resolution.

Professional duplex scanners


Professional scanners are reliable, heavy-duty machines capable of processing a large volume of pages typically from 2000 pages to 10,000 pages per day. They have an automatic sheet-feeder tray system that processes batches of about 50 to 200 pages. The best and fastest are duplex machines that scan both sides of the page at once.

Professional duplex scanners require a powerful computer with a hard disk of at least 10 to 20 Gb. Prices range from Rs. 2,50,000 to Rs. 25,00,000. For example, the Canon DR-6020 duplex scanner costs Rs.2,50,000 and works with double-sided documents. It has a capacity of about 2000 pages per day and a lifespan of 600,000 to 800,000 pages. Bell & Howell and Fujitsu scanners range from Rs. 5,00,000 to Rs. 25,00,000 and have a lifespan of many millions of pages.

Scanning programs
Every scanner comes with its own software, which means that the program must be installed on the computer that manages the scanner. Some have a computer card that needs to be installed in your computer to speed up the scanning operation.

Preparing the documents


Before being scanned, documents must be properly prepared. Dusty documents must be cleaned, humid documents dried, clips removed, pages unfolded. The spine of each book should be removed by cutting it off, straight and precisely. Books provided by libraries must often be rebound, and if so you should be particularly careful when removing spines in order to facilitate smooth rebinding.

The scanning process


Using software provided with the scanner, a digital image of each paper page is scanned and transformed into a Bitmap or TIFF image.

Quality control
First divide the material into batches with similar paper and print qualities. Perform OCR tests on a sample from the first batch to determine the optimal settings. Then scan all material in this batch before proceeding to the next one.

Filename conventions
Follow a suitable file naming convention.

Productivity and resources


You should not underestimate the magnitude of the scanning operation and particularly the OCR process that follows. It is best to consider scanning and OCR as completely separate activities. The optimal choice from an economic and practical point of view should be made individually for each one.

Scanning costs
An important decision is whether to invest in scanning equipment and perform all scanning oneself, or outsource it to a scanning company. The main considerations are:

pressure of time for the scanning job; total number of pages; salary costs of those who perform the scanning

The following steps are involved in converting paper documents to computer form: 1. scanning; 2. page layout analysis; 3. recognition; 4. scanning images and tables.

OCR: OPTICAL CHARACTER RECOGNITION

On the market are many good OCR programs, with prices ranging from Rs. 5000 to Rs.20,000. For example, among many others are: Read-Iris (http://www.readiris.com/) Omnipage (http://www.omnipage.com/) Fine-Reader (http://www.finereader.com/)

A choice must be made between undertaking the scanning and OCR inhouse or outsourcing it to a commercial organization. To do it in-house requires a scanner, OCR software program, OCR skill development, and a quality-conscious, highly motivated workforce.

The OCR process


The OCR process differs from one OCR program to another, and each one requires a considerable amount of learning. The programs manual will explain this process in detail. Four points deserve particular attention: quality control, tables, images, and specialized material such as formulas, foreign characters etc.

Quality control
The first is performed at the same time as OCR. Every OCR program has a built-in spell-checker that highlights every suspect letter. The second is a general check of the text once the OCR process is finished.

The third is a spelling check using Microsoft Word. Finally, the completed document should be checked by an independent person who samples the complete book and checks for errors, problems with tables and images, tagging, and the general look of the resulting text.

Tables Images Specialized material

Productivity and resources


Young people between 18 and 25 are best suited for this job. Because the first hours are always the most productive, the work should either be organized on a part-time basis or only the most motivated and concentrated people should be selected for full-time work.

Two-thirds of people tend to quit or get bored after about three to five weeks. This translates into poorer quality and low productivity in the last weeks. A regular supply of work is needed to justify the necessary training, to maintain concentration, and to keep spirits high.
28 Pages / Day

Alternatives to OCR
Manual retyping Image files Combining scanning and OCR

CREATING AN ELECTRONIC COLLECTION


Meta Data Tagging document files

S-ar putea să vă placă și