The Science of Digitizing and Displaying

DRAFT
The Science of Digitizing and Displaying Islamic Manuscripts
First, I wish to apologize to those in the audience who have already begun to digitize
their collections and probably have a far better knowledge of my topic than I. My
purpose here today is to address some of the issues facing the novice who is looking
to begin digitizing a collection. There are many reasons to digitize our collections of
manuscripts. Primary among them is preservation, which has two subpurposes. The
first is to preserve the text for the future, and the second is to preserve the codices
as artifacts. These two reasons are not mutually exclusive and are often viewed as
codependent goals. In this presentation I will discuss the sciences involved in the
decision process after one has determined one's primary goal in converting the
textual content of a codex into images.
Typically the initiative to digitize a manuscript is driven primarily by the desire to
share a repository’s manuscript holdings with the public. Sometimes this is done with
a static display in the more traditional setting of display cases, often using surrogates
to prevent the originals from being exposed to the damaging light found in many public
display venues. At other times it is done to mount a more permanent display that can
reach a much larger audience on the internet. In some cases these
exhibits are being created in virtual worlds such as Second Life
[http://www.seconlife.com], where the Stanford University Libraries have a number of
interactive exhibits using digital images of our collections.
One must first determine the primary (and perhaps even secondary) desired end
results before planning for digitization begins. Your critical choices
of methods and standards will depend on these end results. The choice of
full color, black and white, or greyscale; the resolution of the images; the orientation
of the images; the intended display mechanism; and lastly the available storage and
archiving facilities all play a role in the technical specifications chosen. These critical
choices are discussed below.
Reasons to Digitize
The desire to digitize a manuscript or collection of manuscripts is driven by many
factors. One of the primary reasons, as previously mentioned, is to emulate or
improve on the practice of exhibiting the manuscripts in a traditional setting in glass
cases in an enclosed space. This practice often involves lighting the manuscripts for
better viewing by exhibit visitors. The lighting conditions, though, are often
detrimental to the wellbeing of the materials. Too much intense light can fade the
inks on the parchment or paper, or even destroy the paper. In order to minimize this
destructive exposure it has become common practice to exhibit surrogates of the
manuscripts for public viewing. A digital copy is carefully produced and then printed as
an exact replica of the original, which is then placed in the hostile environment of the
exhibition cases.
Another reason to digitize is to share the content of the original by a different method.
Again the digitization of the manuscript is the first stage. It is then necessary to
provide a method for delivering the images to the intended users. This will be
discussed a little later. The choice of method may influence other selections made in
the process.
A side benefit of the previous purposes is that the imaging of the content can serve to
preserve the content of the codex for research without further damage to the physical
object. This can also be the primary purpose in digitizing a manuscript. Ancillary to
this may be the desire to provide additional access to the content with the use of OCR
technology as is commonly done with printed textual matter.
The Basics of Imaging
The earliest imaging of manuscripts was through the medium of microfilm.
The process was an extension of the art of photography. Microfilming became the
medium of choice for providing copies of manuscripts to researchers and for the
preservation of the content. This science is well developed and has proven very stable
given proper storage conditions for the resulting film. The shelf life of properly
processed microfilm is not yet known for certain, but it has proven to be stable for at
least a century, and some believe that the potential is closer to five centuries.
In the 1990s flatbed scanners became very popular as personal
computers were becoming a necessity in the workplace and even in many homes. The
flatbed scanner typically uses a flat document glass and a moving scan head to capture
the image line by line. This technology proved very effective with
single sheet documents, but it can be somewhat destructive for bound materials
such as codices, which must be pressed flat against the glass. Further technological
developments allowed for a less destructive process with an overhead scan head. As
cameras replaced scan heads, both the quality of the images produced and the speed of
imaging improved. This has brought much higher resolutions and much larger images, so
that extremely large originals such as newspapers and maps can be captured without
reduction in size. Scanners were subsequently enhanced to place less stress on the
original, so that documents no longer needed to be flattened on the table.
Software integrated with the scanner was even developed that could correct for the
curvature of a bound item without placing stress on the binding.
The Science
In the digital age the basic unit of measurement for images is the pixel. Pixels
(or picture elements) are the
individual “dots” that make up a composite image for display on a video device (or for
printing). When an image is captured it is essentially represented by rows and columns of
pixels that record the characteristics of the original. To create a black and white image
of an object, one bit per pixel is all that is needed, as that single bit can have but two
values, either a 1 or a 0, that is, either black or white. To create a greyscale image it is
common to use 8 bits per pixel, which gives 256 shades from black through
grey to white. For black and white fine art photography it is common to use 16
bits per pixel for 65,536 shades of grey to more closely represent what was captured
in analog photography. In a like manner with color, the greater the number of bits
representing each pixel, the more subtle the colors that can be captured: with 8 bits per
pixel you can represent 256 distinct colors, and with 16 bits per pixel you can represent
65,536 distinct colors. Many modern cameras are capable of recording up to 48 bits
per pixel for an astounding 281.5 trillion colors. Given that the
average human eye is considered capable of discerning only about 100,000 distinct colors
at best, the latter may be a bit excessive. This also raises a practical issue: the more
bits per pixel captured, the larger the resulting file.
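As a rough check of the figures above, the number of distinct tones or colors available at a given bit depth is 2 raised to the number of bits per pixel. The short Python sketch below reproduces the counts quoted in the text; the function name is only illustrative.

```python
def distinct_values(bits_per_pixel: int) -> int:
    """Number of distinct tones or colors representable at the given bit depth."""
    return 2 ** bits_per_pixel


if __name__ == "__main__":
    # Bit depths mentioned in the text: 1, 8, 16 and 48 bits per pixel.
    for bits in (1, 8, 16, 48):
        print(f"{bits:>2} bits per pixel: {distinct_values(bits):,} values")
```

Each additional bit doubles the number of values that can be recorded, and also adds one more bit of storage for every pixel in the image.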
The number of pixels per inch (ppi, often used interchangeably with dpi, or
dots per inch) to be captured depends on the intended use of the resulting images.
Current best practices in digital imaging for preservation generally call for 300 ppi,
although in many cases 400 ppi may be a better solution, particularly if there is a
possibility of needing images for commercial printing or, as in the case of Arabic
script materials, to retain the dots that distinguish between letters of the same shape,
or the vowel marks if the text is vocalized.
In preservation mode the scale chosen should be set to enable the redisplay of
the object at the exact size of the original without loss of clarity. To be able to
display the images for closer inspection, a 400 ppi image would be made.
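To make the effect of resolution concrete, the following Python sketch converts a page size and a ppi setting into pixel dimensions. The A4 page (210 mm by 297 mm) is only an example, and the function name is an assumption made for illustration.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm: float, height_mm: float, ppi: int) -> tuple[int, int]:
    """Pixel width and height needed to capture a page at the given resolution."""
    width_px = round(width_mm / MM_PER_INCH * ppi)
    height_px = round(height_mm / MM_PER_INCH * ppi)
    return width_px, height_px


if __name__ == "__main__":
    # Compare the two resolutions discussed in the text for an A4 page.
    for ppi in (300, 400):
        width_px, height_px = pixel_dimensions(210, 297, ppi)
        print(f"{ppi} ppi: {width_px} x {height_px} pixels")
```

Moving from 300 to 400 ppi increases the pixel count in both directions, so the total number of pixels grows by the square of the ratio, not by the ratio itself.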
Color and color depth. Plain text may be scanned in black and white (1 bit) if
color is unimportant, although it is often preferable to scan in greyscale (8 bit) for
more clarity. Caution: if color is later needed you will need to scan again, and this will
further stress the original. Thus even if immediate goals require only a black and
white image, it may still be best to scan in color. Black and white and greyscale
images can later be derived from the color original if needed.
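The derivation of greyscale and black and white images from a color master can be sketched in a few lines of Python. This is a minimal illustration, not a production workflow: the ITU-R BT.601 luma weights are one common choice among several, and the threshold of 128 and the sample pixel values are assumptions.

```python
def to_greyscale(rgb: tuple[int, int, int]) -> int:
    """Convert one 24-bit (R, G, B) pixel to an 8-bit grey value.

    Uses the ITU-R BT.601 luma weights, one common choice among several.
    """
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_black_and_white(grey: int, threshold: int = 128) -> int:
    """Reduce an 8-bit grey value to 1 bit: 1 for white, 0 for black."""
    return 1 if grey >= threshold else 0


if __name__ == "__main__":
    # Three hypothetical pixels: dark ink, cream paper, red rubrication.
    pixels = [(30, 25, 20), (235, 225, 200), (170, 30, 30)]
    greys = [to_greyscale(p) for p in pixels]
    print("grey values:", greys)
    print("1-bit values:", [to_black_and_white(g) for g in greys])
```

The conversion runs in one direction only. Once the red pixel has been reduced to a grey value, nothing in the derived image records that it was red, which is why the color scan should be kept as the master.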
In some cases it may be desirable to image what cannot be seen with the naked
eye, as in some palimpsests. The SLAC National Accelerator Laboratory at Stanford
has used x-ray fluorescence (XRF) to reveal hidden text on a palimpsest of
Archimedes. [http://ww2.slac.stanford.edu/tip/2005/may20/Archimedes.htm] Other
techniques use ultraviolet light. These will not be discussed in this paper.
Formats
The formats for image files are abundant, and many are available
depending on the intended use. It is generally agreed that TIFF (Tagged Image File
Format) is the preferred format for archival purposes. This format carries the
most information and is generally the native format for many scanners and digital
cameras. However, it also creates the largest files. Smaller derivative formats (JPEG,
GIF, PNG, etc.) for specific display purposes can be created from a
master TIFF file. The TIFF file is also the one most likely to be upgradable as newer
formats become available. The other formats generally use some form of compression in
order to make the files smaller and easier to handle and store, but many compression
algorithms cause a loss of fidelity to the original image. The end use of the images
and the expected mode of display will play an important part in the decision of which file
format to use and how much loss of clarity is acceptable for your application.
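The difference between compression that preserves the image exactly and compression that loses fidelity can be illustrated with a small Python sketch. It is not an implementation of any image format: the lossless half uses zlib, which implements the Deflate method that PNG also relies on, while the lossy half is a crude stand-in that rounds sample values and is assumed here purely for illustration.

```python
import zlib


def lossless_round_trip(data: bytes) -> bytes:
    """Compress and decompress with zlib; the result is identical to the input."""
    return zlib.decompress(zlib.compress(data))


def lossy_round_trip(data: bytes, step: int = 16) -> bytes:
    """Crude stand-in for lossy compression: round each sample down to a multiple of step."""
    return bytes((value // step) * step for value in data)


if __name__ == "__main__":
    # A hypothetical row of 8-bit greyscale samples.
    row = bytes([12, 13, 15, 200, 201, 203, 90, 91])
    print("lossless copy identical:", lossless_round_trip(row) == row)
    print("lossy copy identical:   ", lossy_round_trip(row) == row)
    print("lossy samples:", list(lossy_round_trip(row)))
```

In the lossy copy, neighbouring samples that differed slightly collapse to the same value. Fine distinctions of this kind, such as faint ink against paper, are what is at risk when a lossy derivative is mistaken for the master.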
Metadata
For the purposes of preservation and retrieval of the images you will need to have
information about these images in a consistent format. There are multiple schemas
available to describe not only the objects represented (codices) but also the digital
object (image) itself. Among the more common are: EAD (Encoded Archival
Description), TEI (Text Encoding Initiative), METS (Metadata Encoding and
Transmission Standard), MODS (Metadata Object Description Schema), and others.
Of primary importance in metadata management is a consistent and manageable
system of identification of the digital objects that can be integrated with your delivery
system.
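One way to keep identification consistent is to generate and validate identifiers by program instead of typing them by hand. The Python sketch below assumes a purely hypothetical pattern (collection code, manuscript number, folio number, and "r" or "v" for recto or verso); it is not taken from any of the schemas named above, and a real project would adapt the pattern to its own shelfmarks.

```python
import re

# Hypothetical pattern: collection code, 4-digit manuscript number,
# 3-digit folio number, and r (recto) or v (verso), e.g. ISL-0042-017r.
_PATTERN = re.compile(r"^([A-Z]+)-(\d{4})-(\d{3})([rv])$")


def image_identifier(collection: str, manuscript_no: int, folio: int, side: str) -> str:
    """Build an identifier for one page image."""
    if side not in ("r", "v"):
        raise ValueError("side must be 'r' (recto) or 'v' (verso)")
    return f"{collection}-{manuscript_no:04d}-{folio:03d}{side}"


def parse_identifier(identifier: str) -> dict:
    """Split an identifier into its parts so it can be matched to metadata records."""
    match = _PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a valid identifier: {identifier!r}")
    collection, manuscript_no, folio, side = match.groups()
    return {
        "collection": collection,
        "manuscript_no": int(manuscript_no),
        "folio": int(folio),
        "side": side,
    }


if __name__ == "__main__":
    name = image_identifier("ISL", 42, 17, "r")
    print(name, parse_identifier(name))
```

Zero-padded numbers keep the files in page order when sorted by name, and a strict pattern lets a delivery system reject misnamed files before they are published.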
Display Methods
For the delivery of images to an audience through the internet various methods
are in use. The quickest, easiest, and most homegrown of these methods is to create a
web site that presents static images. These are frequently GIF, JPEG, PNG, or PDF
files with a simple interface. The user often must navigate sequentially between the
images unless a textual or thumbnail index has been created. There are dozens of
image viewer software packages for the web that are available either free or for a
small charge. For a more user-friendly experience, animated images
simulating the turning of pages have been used. The additional ability to zoom in on an
image and move about the image is a great advantage for the serious scholar.
If the equipment, maintenance, and management of your own site are not
possible or desirable, you can use other services, some free and others for a fee. You
could simply upload your images to one of the photo sites (Picasa Web Albums from
Google, Flickr, etc.). These all have various limitations and are subject to the fate of the
company providing the service, since your image files are stored on its servers.
You also have the option to host the images on your own equipment using software
for image display. Or you can contract with other agencies (e.g. CONTENTdm from
OCLC) that will handle the images and access.
Magnification of the images for closer inspection is an added value that
should be considered. A scholar can often discern subtle differences in the ink, the hand,
the paper, or other elements by magnifying the image, without needing to
handle the codex.
Directional issues are a problem that often appears when the digitizing of Arabic script
manuscripts is outsourced to a service agency for production. Frequently the agency is
not aware that Arabic is read from right to left and will create a file of images that
follows the Western left-to-right direction. This makes for a display that is difficult to
read in a normal fashion.
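The directional problem can be made concrete with a small Python sketch that arranges page images into two-page openings for display. It assumes the images are already listed in reading order and that the viewer shows two pages side by side; the function name and file names are hypothetical.

```python
def openings(pages, right_to_left=True):
    """Group pages (in reading order) into two-page openings for display.

    Each opening is returned as (left_image, right_image); None marks an empty side.
    """
    if not pages:
        return []

    def place(first, second):
        # In a right-to-left book the page read first sits on the right.
        return (second, first) if right_to_left else (first, second)

    # The first page stands alone, opposite the inside of the cover.
    result = [place(None, pages[0])]
    rest = pages[1:]
    for i in range(0, len(rest), 2):
        first = rest[i]
        second = rest[i + 1] if i + 1 < len(rest) else None
        result.append(place(first, second))
    return result


if __name__ == "__main__":
    files = ["p1.jpg", "p2.jpg", "p3.jpg", "p4.jpg"]
    print("Arabic script codex:", openings(files, right_to_left=True))
    print("Western codex:      ", openings(files, right_to_left=False))
```

If an agency has delivered the files in reverse order, reversing the list first (for example with `list(reversed(files))`) restores the reading order before the openings are built.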
As mentioned earlier, in theory it may be possible to add content indexing to the
manuscript images with the use of OCR to make them searchable. This is currently
under investigation for Arabic script printed matter as well as handwritten materials,
but to date it has not met with complete success.
Storage Issues
Archiving the images is critical to the perpetuation of the content of the manuscripts
concerned. Although the paper of manuscripts has in many cases endured for
centuries, it is still fragile and subject to destruction and decay
over the long term. The use of burned CDs or DVDs as a storage medium is not
optimal. They are adequate for the transport of files from one location to another, but
should never be considered an archival copy, as the expected reliable life of the
medium is currently calculated at 4 to 5 years of use. For archival purposes any
stored electronic data must be refreshed periodically to ensure the integrity of the
files. Additionally, best practices recommend that image data be stored in multiple
locations for security reasons. There should never be only a single copy of the electronic
image files.
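A common way to confirm that stored copies are still intact is to compare checksums when the data is refreshed. The Python sketch below is one possible approach, not a prescribed procedure: the choice of SHA-256 and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(master: Path, copies: list[Path]) -> dict[str, bool]:
    """Report, for each stored copy, whether it exists and matches the master file."""
    expected = file_checksum(master)
    return {
        str(copy): copy.exists() and file_checksum(copy) == expected
        for copy in copies
    }
```

Running such a check on a schedule, against copies held in different locations, turns the advice above into a routine: a copy that is missing or no longer matches can be replaced from one that still does.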
The most staggering reality of digitizing a large collection is the size of the
resulting files. A single image of an A4 size original scanned at 400 ppi in black and
white (1 bit) will make an uncompressed file of about 1.84 megabytes. The same image
scanned at 400 ppi in 8 bit greyscale will result in a file of about 14.8 megabytes. If you
want 24 bit color your image will create a file of about 44.3 megabytes. This is for an
image of a single page; you must multiply by the number of pages (images) to get an idea
of the size of the files for an entire manuscript. Thus an A4 size manuscript of some 250
images in 24 bit color would come to about 11 gigabytes. If you then multiply this number,
as an average, by the number of manuscripts in your collection you can see the
staggering results. If your collection is about 1,000 codices, your storage
requirements for the images alone would be about 11 terabytes (a terabyte is 10 to
the 12th power bytes, that is, 1 with 12 zeros following).
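The arithmetic behind such estimates can be checked with a short Python sketch. It assumes uncompressed images with no file headers or metadata, an A4 page of 210 mm by 297 mm, and the hypothetical function names shown; results may differ slightly from quoted figures depending on rounding and on whether a megabyte is counted as 1,000,000 or 1,048,576 bytes.

```python
MM_PER_INCH = 25.4


def uncompressed_bytes(width_mm: float, height_mm: float, ppi: int, bits_per_pixel: int) -> float:
    """Approximate size in bytes of one uncompressed image (no headers or metadata)."""
    width_px = round(width_mm / MM_PER_INCH * ppi)
    height_px = round(height_mm / MM_PER_INCH * ppi)
    return width_px * height_px * bits_per_pixel / 8


def collection_bytes(bytes_per_image: float, pages_per_codex: int, codices: int) -> float:
    """Total storage for a collection, assuming every codex has the same page count."""
    return bytes_per_image * pages_per_codex * codices


if __name__ == "__main__":
    mib = 1024 ** 2
    for label, bits in (("1 bit black and white", 1), ("8 bit greyscale", 8), ("24 bit color", 24)):
        size = uncompressed_bytes(210, 297, 400, bits)
        print(f"{label}: {size / mib:.2f} MiB per page")

    color_page = uncompressed_bytes(210, 297, 400, 24)
    total = collection_bytes(color_page, pages_per_codex=250, codices=1000)
    print(f"1,000 codices of 250 pages each: {total / 1e12:.1f} TB")
```

Because every factor multiplies the others, changing any one input (resolution, bit depth, page count, or collection size) scales the total in direct proportion, which makes a sketch like this useful for comparing options before scanning begins.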
Conclusion
As in all things, there is no one right solution for all of us. But remember the basics:
1) define your purposes for digitizing
2) describe your primary intended audience and their needs
3) determine the best method of delivering images to meet the intended audience’s needs
4) evaluate the condition of the manuscripts to be digitized
5) determine the conservation measures needed for those items too fragile to be handled
6) plan long-term storage of the digital data
7) begin TODAY, for manuscripts, like many other things, do not get better with age
