First, I wish to apologize to those in the audience who have already begun to digitize
their collections and probably know far more about my topic than I do. My
purpose here today is to address some of the issues facing the novice who is looking
to begin digitizing a collection. There are many reasons to digitize our collections
of manuscripts. Primary among them is preservation, which has two sub-purposes:
the first is to preserve the text for the future, and the second is to preserve the
codices as artifacts. These two reasons are not mutually exclusive and are often
viewed as codependent goals. In this presentation I will discuss the sciences involved
in the decisions that follow once one has determined the primary goal in converting
the textual content of a codex into images.
One must first determine the primary (and perhaps even secondary) desired end
results before planning for digitization begins. The critical choices of methods
and standards will depend on these end results. The choice of full color, black
and white, or greyscale; the resolution of the images; the orientation of the
images; the intended display mechanism; and lastly the available storage and
archiving facilities all play a role in the technical specifications chosen. These
critical choices will be discussed.
Reasons to Digitize
Another reason to digitize is to provide a different method of sharing the content of
the original. Again, the digitization of the manuscript is the first stage. It is then necessary to
provide a method for delivering the images to the intended users. This will be
DRAFT
discussed a little later. The choice of method may influence other selections made in
the process.
A side benefit of the previous purposes is that the imaging of the content can serve to
preserve the content of the codex for research without further damage to the physical
object. This can also be the primary purpose in digitizing a manuscript. Ancillary to
this may be the desire to provide additional access to the content with the use of OCR
technology as is commonly done with printed textual matter.
In the 1990s flatbed scanners became very popular as personal computers were
becoming a necessity in the workplace and even in many homes. The flatbed scanner
typically uses a flat document glass and a moving scan head to capture the image
line by line. This technology proved very effective with single-sheet documents,
but it is somewhat destructive for bound materials such as codices. Further
technological developments allowed for a less destructive process with an overhead
scan head. As cameras replaced scan heads, both the quality of the images and the
speed of imaging improved. This made possible much higher resolutions and much
larger images, so that extremely large originals such as newspapers and maps could
be captured without reduction in size. Scanners were subsequently enhanced to place
less stress on the original, so that documents no longer needed to be flattened on
the table. Software integrated with the scanner was even developed that could
correct for the curvature of a bound item without placing stress on the binding.
The Science
In the digital age the basic unit of measurement for images is the pixel. Pixels
(or picture elements) are the individual “dots” that make up a composite image to
display on a video device (or to print). When an image is captured it is essentially
represented by rows and columns of pixels that record the characteristics of the
original. To create a black and white image of an object one bit per pixel is all
that is needed, as that single bit can have but two values, either a 1 or a 0 --
either black or white. To create a grayscale image it is common to use 8 bits per
pixel, which will give you 256 shades from black through grey to white. For black &
white fine art photography it is common to use 16 bits per pixel for 65,536 shades
of gray to more closely represent what was captured in analog photography. In like
manner with color, the greater the number of bits representing each pixel the more
subtle the colors that can be captured: with 8 bits per pixel you can represent 256
distinct colors, and with 16 bits per pixel you can represent 65,536 distinct
colors. Many modern cameras are capable of recording up to 48 bits
per pixel for an astounding 281.5 trillion colors. Given that the average human eye
is generally considered capable of discerning only about 100,000 distinct colors at
best, the latter may be a bit excessive. This also raises a practical issue: the more
bits per pixel captured, the larger the resulting file.
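The counts quoted above follow from the fact that n bits can hold 2 to the nth power
distinct values. A minimal sketch of that arithmetic (the function name is
illustrative only):

```python
def distinct_values(bits_per_pixel: int) -> int:
    """Return how many distinct tones or colors a pixel of this depth can hold."""
    return 2 ** bits_per_pixel


# Bit depths mentioned in the text, plus 24 bit color.
for bits in (1, 8, 16, 24, 48):
    print(f"{bits:>2} bits per pixel: {distinct_values(bits):,} distinct values")
```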
The number of pixels per inch (ppi – often used interchangeably with dpi, or
dots per inch) to be captured depends on the required use of the resulting images.
Current best practices in digital imaging for preservation generally call for 300 ppi,
although in many cases 400 ppi may be a better solution, particularly if the images
may be needed for commercial printing or, as in the case of Arabic script materials,
to retain the dots that distinguish letters of the same shape or the vowel marks of
vocalized text.
In preservation mode the scale chosen should be set to enable the redisplay of
the object at the exact size of the original without loss of clarity. To allow the
images to be enlarged for closer inspection, a 400 ppi image would be made.
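As a rough illustration of what these resolutions mean in practice, the sketch below
converts a physical page size into pixel dimensions. It assumes an A4 page of
210 × 297 mm as the example original; the function and constant names are
illustrative only.

```python
MM_PER_INCH = 25.4


def pixel_dimensions(width_mm: float, height_mm: float, ppi: int) -> tuple[int, int]:
    """Return (width, height) in pixels for a page captured at the given ppi."""
    width_px = round(width_mm / MM_PER_INCH * ppi)
    height_px = round(height_mm / MM_PER_INCH * ppi)
    return width_px, height_px


# Compare the two resolutions discussed in the text for an A4 original.
for ppi in (300, 400):
    w, h = pixel_dimensions(210, 297, ppi)
    print(f"A4 at {ppi} ppi: {w} x {h} pixels")
```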
Color and color depth. Plain text may be scanned in black and white (1 bit) if
color is unimportant, although it is often preferable to scan in greyscale (8 bit) for
more clarity. Caution: if color is later needed you will need to scan again and this will
further stress the original. Thus even if immediate goals only require a black and
white image, it may still be best to scan in color. Black and white and greyscale
images can be further derived from the color original if later needed.
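The derivation of greyscale and black and white values from a color original can be
sketched per pixel as follows. This is only an illustration: the luminance weights
(the common ITU-R BT.601 values), the threshold of 128, and the convention that
1 means white are assumptions, and imaging software may use different ones.

```python
def to_greyscale(rgb: tuple[int, int, int]) -> int:
    """Reduce an 8-bit-per-channel color pixel to one 8 bit grey value (0-255)."""
    r, g, b = rgb
    # Weighted sum, because the eye is more sensitive to green than to blue.
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_black_and_white(grey: int, threshold: int = 128) -> int:
    """Reduce an 8 bit grey value to 1 bit: 1 for white, 0 for black (assumed)."""
    return 1 if grey >= threshold else 0


# Hypothetical pixels: dark ink and light paper.
ink, paper = (40, 30, 25), (230, 220, 200)
for pixel in (ink, paper):
    grey = to_greyscale(pixel)
    print(pixel, "-> grey", grey, "-> bit", to_black_and_white(grey))
```

The reverse is not possible: once only the grey or 1 bit value has been kept, the
color information cannot be recovered from it.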
In some cases it may be desirable to image what cannot be seen with the naked
eye, as in some palimpsests. The SLAC National Accelerator Laboratory at Stanford
has used x-ray fluorescence (XRF) to reveal hidden text on a palimpsest of
Archimedes. [http://ww2.slac.stanford.edu/tip/2005/may20/Archimedes.htm] Other
techniques use ultraviolet light. These will not be discussed in this paper.
Formats
Image file formats are abundant, and many are available depending on the
intended use. It is generally agreed that a TIFF (Tagged Image File
Format) file is the preferred format for archival purposes. This format carries the
most information and is generally the native format for many scanners and digital
cameras. However, it also creates the largest file. Smaller derivative formats (JPEG,
GIF, PNG, etc.) for specific display purposes can be created from a
master TIFF file. The TIFF file is the one most likely to be upgradable as newer
formats become available. All other formats generally have some form of compression in
order to make the files smaller and easier to handle and store, but many compression
algorithms cause a loss of fidelity to the original image. The end use of the images
and the expected mode of display will play an important part in the decision of which
file format to use and how much loss of clarity is acceptable for your application.
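The difference between compression that preserves the image exactly and compression
that loses fidelity can be illustrated with a few bytes of pixel data. The sketch
below is not how any particular image format works: zlib stands in for a lossless
method, discarding the low 4 bits of each pixel stands in for a lossy one, and the
pixel values are hypothetical.

```python
import zlib

# A hypothetical run of 8 bit greyscale pixel values.
row = bytes([12, 13, 13, 14, 200, 201, 203, 202] * 32)

# Lossless: the file shrinks, and decompressing returns exactly the original bytes.
packed = zlib.compress(row)
restored = zlib.decompress(packed)
print("lossless round trip identical:", restored == row)
print("original bytes:", len(row), "compressed bytes:", len(packed))

# Lossy (illustrative only): keep just the top 4 bits of every pixel.
quantized = bytes(p & 0xF0 for p in row)
print("lossy result identical:", quantized == row)
```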
Metadata.
For the purposes of preservation and retrieval of the images you will need to have
information about these images in a consistent format. There are multiple schemas
available to describe not only the objects represented (codices) but also the digital
object (image) itself. Among the more common are: EAD (Encoded Archival
Display Methods
For the delivery of images to an audience through the internet various methods
are in use. The quickest, easiest, and most homegrown of these methods is to create a
web site that presents static images. These are frequently GIF, JPEG, PNG, or PDF
files with a simple interface. The user often must navigate sequentially between the
images unless a verbal or thumbnail image index has been created. There are dozens of
image viewer software packages for the web that are available either free or for a
small charge. For a more user-friendly experience, animated images simulating the
turning of pages have been used. The additional ability to zoom in on an
image and move about the image is a great advantage for the serious scholar.
Directional issues are a problem that often appears when the digitizing of Arabic script
manuscripts is outsourced to a service agency for production. Frequently the agency is
not aware that Arabic is read from right to left and will create a file of images that
follows the western left-to-right direction. This makes for a display that is difficult
to read in a normal fashion.
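One way to picture the directional problem is to pair page images into two-page
openings for display. The sketch below assumes the images are already listed in
reading order and simply pairs them two at a time, which ignores covers and single
leaves; the file names and the function name are hypothetical.

```python
from typing import Optional


def openings(pages: list[str], right_to_left: bool = True) -> list[tuple[Optional[str], Optional[str]]]:
    """Pair pages (given in reading order) into (left, right) display openings."""
    result = []
    for i in range(0, len(pages), 2):
        first = pages[i]
        second = pages[i + 1] if i + 1 < len(pages) else None
        if right_to_left:
            # The page read first belongs on the right-hand side.
            result.append((second, first))
        else:
            result.append((first, second))
    return result


scans = ["page_001.tif", "page_002.tif", "page_003.tif", "page_004.tif"]
for left, right in openings(scans, right_to_left=True):
    print(f"left: {left}   right: {right}")
```

Calling the same function with right_to_left=False gives the western layout, which
is the arrangement an unaware service agency is likely to deliver.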
Storage issues
Archiving the images is critical to the perpetuation of the content of the manuscripts
concerned. Although the paper of manuscripts has in many cases endured for
centuries, manuscripts are still fragile and subject to destruction and decay
over the long term. The use of burned CDs or DVDs as a storage medium is not
optimal. They are adequate for the transport of files from one location to another, but
should never be considered an archival copy, as the expected reliable life of the
medium is currently calculated at 4 to 5 years of use. For archival purposes any
storage of electronic data must be refreshed periodically to ensure the integrity of the
file. Additionally, best practices recommend that image data be stored in multiple
locations for security reasons. There should never be only a single copy of the
electronic image files.
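One common way to confirm that copies held in different locations are still identical
is to compare checksums. The sketch below is a suggestion rather than a method
described in this paper: it assumes SHA-256 from Python's standard hashlib module,
and the function names and file paths are hypothetical.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file, read in chunks to handle large images."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(paths: list[str]) -> bool:
    """Return True when every copy of the file has the same checksum."""
    return len({sha256_of(p) for p in paths}) == 1


# Hypothetical usage with two storage locations:
# copies_match(["/archive_a/ms001/page_001.tif", "/archive_b/ms001/page_001.tif"])
```

Recording the checksum when the master file is first created also makes it possible
to tell which copy has changed if the copies later disagree.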
The most staggering reality of digitizing a large collection is the size of the
resulting files. A single image of an A4 size original scanned at 400 ppi in black and
white (1 bit) will make a file of about 1.85 megabytes. The same image scanned at 400
ppi in 8 bit grayscale will result in a file of 14.8 megabytes. If you want 24 bit color
your image will create a file of 44.3 megabytes. This is for an image of a single page;
you must multiply by the number of pages (images) to get an idea of the size of the
files for an entire manuscript. Thus an A4 size manuscript of some 250
images in 24 bit color would come to roughly 11 gigabytes. If you then multiply this
number as an average by the number of manuscripts in your collection you can see the
staggering results. If your collection is about 1,000 codices your storage
requirements for the images alone would be roughly 11 terabytes (a terabyte is 10 to
the 12th power bytes – that is 1 with 12 zeros following).
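The storage arithmetic can be checked with a short calculation. This sketch assumes
an A4 page of 210 × 297 mm, uncompressed pixel data with no file header, and a
megabyte of 1,048,576 bytes; the 250 images per manuscript and 1,000 codices are the
figures used in the text, and the names are illustrative only.

```python
MM_PER_INCH = 25.4
BYTES_PER_MEGABYTE = 2 ** 20  # assumption: binary megabytes


def image_bytes(width_mm: float, height_mm: float, ppi: int, bits_per_pixel: int) -> float:
    """Return the uncompressed size in bytes of one scanned page."""
    width_px = round(width_mm / MM_PER_INCH * ppi)
    height_px = round(height_mm / MM_PER_INCH * ppi)
    return width_px * height_px * bits_per_pixel / 8


# Size of a single A4 page at 400 ppi for each bit depth.
for bits in (1, 8, 24):
    size_mb = image_bytes(210, 297, 400, bits) / BYTES_PER_MEGABYTE
    print(f"{bits:>2} bit: {size_mb:.2f} megabytes per page")

# Scale up to a manuscript and then to a collection, in 24 bit color.
manuscript_bytes = 250 * image_bytes(210, 297, 400, 24)
collection_bytes = 1000 * manuscript_bytes
print(f"manuscript: {manuscript_bytes / 1e9:.1f} billion bytes")
print(f"collection: {collection_bytes / 1e12:.1f} trillion bytes")
```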
Conclusion
As in all things there is no one right solution for us. But remember the basics: