By:
Tahir Sandhu
Gwen Williams
Collection Development:
LIS 450 DL required us to gather digital imagery, if we chose to
develop a digital collection of imagery and text. We visited the
Pennsylvania State University (PSU) English Emblem Books web site and
browsed through the catalog of the English Emblem Books. We also visited
the Middlebury College Minerva Project web site to obtain images from the
Minerva Britanna book of emblems.
We downloaded about 200 images from various books, along with a
complete bibliographic citation for each image. PSU and Middlebury had
already saved these images as JPEGs at a resolution of 105 pixels per
inch. The images measured about 8x7 inches. To manage them as cover
images for the bibliographic records, we resized them to a higher
resolution (200 pixels per inch) and smaller dimensions (about 2 inches
in width).
Thereafter, we designed a filing system for these images to ensure
that each image has: (a) a subject assigned to it using a noun phrase,
such as “Ape Never Man,” for an emblem that reflects on “how futile it is to
imitate someone”; (b) an acronym to indicate the author of the book that
originally contained the emblem; (c) a code to indicate the storage area
where we put each author’s work; (d) a year to indicate the publication of
the book; and (e) a file extension (.html) to indicate the type of file in which
the emblem was stored. The name of each emblem thus
consisted of these five elements in the following order:
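The five-part naming convention can be sketched in code. This is a minimal illustration, not part of the original project: it assumes that the elements are joined by underscores, as in the file name cited in Appendix C (Adversity_Misery_HH_02_1686.html), and that "HH" is the author acronym and "02" the storage code. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePath


@dataclass(frozen=True)
class EmblemName:
    subject: str     # noun phrase, words joined by underscores
    author: str      # acronym for the author of the source book
    storage: str     # code for the storage area holding that author's work
    year: int        # publication year of the book
    extension: str   # file type, e.g. ".html"


def parse_emblem_name(filename: str) -> EmblemName:
    """Split a record file name into its five elements."""
    path = PurePath(filename)
    # Assumption: the last three underscore-separated fields are the author
    # acronym, storage code, and year; everything before is the subject.
    subject, author, storage, year = path.stem.rsplit("_", 3)
    return EmblemName(subject, author, storage, int(year), path.suffix)


name = parse_emblem_name("Adversity_Misery_HH_02_1686.html")
print(name)
```

Splitting from the right keeps multi-word subjects intact, since only the subject phrase may itself contain underscores under this assumption.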
Figure B: Record List
Figure C: Bibliographic Record
This record element can be hyperlinked to three other elements: (a) “see
also references” in the top right box; (b) “special announcements” in the
bottom right; and (c) “additional element” in the bottom left.
The record element also has enough room for future inclusion of
additional information about the book in the top left (main box). We
anticipate that such bibliographic elements could be used for a
“conceptual linking” among various resources on the web.
As shown below, the LCSH would not yield any records from our
collection because our collection existed below the bibliographic level of the
LCSH.
Figure F: LCSH and a list of Facets
The subject string in this browsing hierarchy is thus a legal LCSH up to the
fourth level, Christian life. Thereafter, the subject is dispersed into its
essential facets, which recur in multiple emblems. All emblems having the
facet “obedience” are thus collocated as a subset in this browsing
hierarchy. An audience interested in “obedience” can thus access these
emblems directly.
Likewise, an audience interested in emblems that pontificate on
“God as a beloved” can browse through the hierarchical list until the
desired item is encountered (see below). Note that if we had not done a
“facet analysis” to assign subject headings below the LCSH, the emblems on
“God” and “God as beloved” would fail to appear as hyperlinks on the
screen. Audience members who have little knowledge of the LCSH would thus
not be able to find these items.
Figure G: LCSH, Sub-divisions, and a List of Facets
It is primarily for this reason that we did a “facet analysis” for each
digital emblem to determine (a) its class identity within the legal LCSH
headings, represented iconically with bookshelves in the screen
shots above and below; and (b) a descriptor that represents the facets that
most audience members are likely to use as a “browsing / search string.”
The screen shot below illustrates this:
We hope that our facet analysis will serve the search and browsing
requirements of two groups in our audience: faculty and professional
writers on the one hand, and students and amateur historians on the other.
For instance, faculty and professional writers are likely to search (browse)
for items in a digital library that collocates items according to the LCSH,
such as the strings represented with bookshelves above. Students and
amateur historians are likely to search (browse) for items with “keywords”
that have a certain “subject context,” such as “beloved,” whose subject
context in the above string is as follows: God as a subject in Love in Art as
expressed in the Poetry that one finds in the English Emblems of the 16th to
19th centuries.
We believe that such subject context is absolutely necessary for any
audience trained to find an item’s relevancy for their needs not because the
item contains a “keyword,” but because the item is contextualized. We have
tried to show that such contextualization can be determined by a careful
“subject/facet analysis.”
In the examples above, we show that the “keyword search query”
“passionate love” would retrieve every resource in a collection that
contains the text string: passionate love. An audience interested in the
“subject of passionate love” in the context of the subject string would
likely be disappointed to see items that discuss “passionate love” from the
standpoint of contemporary studies of passionate love as a “psychosomatic
or biological/hormonal/chemical” phenomenon. To avoid such random
collocation of items, we argue that the subject context of an item should
be determined on three levels:
(A): at the level of a discipline (Literature, 16th to 19th century),
which is a basic class (BC) in our subject string, together with the
“sub-divisions” of a BC, which are the intermediary strings between the BC
and the digital emblems. The class identity holds together all the
sub-divisions at the lower levels. The BC descriptor thus controls the
collocation of sub-divisions under a legitimate LCSH for a discipline, such
as literary studies. All sub-divisions can be assigned a new disciplinary
identity by changing only the BC descriptor into a legitimate LCSH for
another discipline, such as history. For instance, the BC in our example can
be re-designated as “History, 16th to 19th century,” so that all
sub-divisions in our examples can be used as browsing (search) strings for
students of history as well. By changing the BC for one discipline into
another discipline’s legitimate LCSH, without disrupting the underlying
subjects in the LCSH, the same digital resources can be repurposed for
multiple audiences in different disciplines. Subject analysis at the
discipline level is, we hope, a conceptual tool for building digital
collections that can be re-appropriated in a federated search arrangement.
(Greenstone has that power.)
(B): at the level of discourse in a discipline, which corresponds to the
facets in our subject string. “Obedience,” “Devotional Literature,” “Life of
Christ,” “Chastity,” etc. are the various discursive topics, or discourses, pertaining
to Christian Life in literature that the discipline of literary studies has
constituted over the years. These discourses allow refining and re-filtering
of digital resources that are attached to a BC string. Each resource
becomes attached to a subject context that the audience will decide as
relevant or irrelevant. Furthermore, subject assignment at the discourses
level will allow comparing and contrasting such a list of facets before a
complete cross-disciplinary switching is made. For instance, the facet
descriptors for literary studies, when switched at the BC level with a string
in history, would facilitate a consolidated list of facets for the new display,
which can be arranged on the fly.
(C): at the level of the surface of the resource, which is the
“keyword” that may or may not appear in the actual resource, such as the
digital emblem. These surface-level noun phrases would serve a double
purpose: (a) keyword in context, that is, each phrase would have the
subject context attached to it, as noted above; and (b) random access to
that phrase, provided the phrase appears in the actual resource, as part
of the full-text searches.
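The three levels can be sketched as a small data structure. This is an illustration under stated assumptions, not the way Greenstone stores subjects: the class, the function names, and the two sample items are hypothetical, and only the strings “Literature, 16th to 19th century,” “History, 16th to 19th century,” “Christian life,” and “passionate love” come from the discussion above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SubjectContext:
    basic_class: str     # (A) discipline level: the BC
    subdivisions: tuple  # (A) intermediary strings under the BC
    facet: str           # (B) discourse level
    keywords: tuple      # (C) surface-level noun phrases


def redesignate(ctx, new_basic_class):
    """Switch disciplines by changing only the BC; lower levels stay intact."""
    return replace(ctx, basic_class=new_basic_class)


def keyword_in_context(items, keyword, basic_class):
    """Keep items that carry the keyword AND sit under the given BC."""
    return [
        item for item in items
        if keyword in item.keywords and item.basic_class == basic_class
    ]


# Hypothetical sample items, invented for this sketch.
emblem = SubjectContext(
    basic_class="Literature, 16th to 19th century",
    subdivisions=("Christian life",),
    facet="God as beloved",
    keywords=("passionate love", "beloved"),
)
study = SubjectContext(
    basic_class="Psychology",
    subdivisions=(),
    facet="Hormonal phenomena",
    keywords=("passionate love",),
)

hits = keyword_in_context(
    [emblem, study], "passionate love", "Literature, 16th to 19th century"
)
as_history = redesignate(emblem, "History, 16th to 19th century")
print(hits, as_history)
```

A bare keyword match would return both items; attaching the basic class to the keyword excludes the out-of-context item, and re-designating the BC leaves the facet and keywords untouched.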
We have greatly benefited from the teachings of Professor Pauline
Cochrane and the works of the Indian classificationist S. R. Ranganathan
in proposing that the subject description for a digital resource, such as the
digital emblem, should not be assigned only at the “keyword level.” It
should be assigned at three levels: discipline, discourse, and surface. Such
subject description (facet analysis), we believe, will allow the full
integration of automated indexing and list-preparing tools (Greenstone) in
subject cataloging. The “synthetico-analytical” work of the
human indexer will thus render the machine a reliable tool for retrieving
only what the audience actually desires.
Implementation in Greenstone:
A complete list of bibliographic records on our Greenstone homepage
is given in Figure B (see above). Clicking a book icon leads to the
screen in Figure I, which contains two files: the graphic image on the left and
the html record on the right.
The image is associated as a “cover image” with the html file that
contains the bibliographic data for that image.
We made full use of the Greenstone plug-ins to avoid manual linking
of html and graphic files. We also avoided unnecessary encoding to
construct indexes and browsing hierarchies. In short, we made full use of
the Greenstone automation that allowed us to create templates for html
files and then invoked special routines in the collection configuration file,
“collect.cfg,” to allow Greenstone to put together a list of the searched
items on the fly. Greenstone also established the hyperlinks from the
leaves of each record to the resource sections where we stored that
information.
Figure I: Emblem and the Record
Encoding and Metadata:
After we carefully analyzed each emblem, we created the
bibliographic record shown above in Figure I. The record has seven
elements: “Bibliographic Record,” “Motto,” “Author,” “Book,” “Publisher,”
“Subjects,” and “Pictora.” All seven elements are placed in the “body” of an
“html” file, which is divided into a head containing only the “title” of
the html file and a body containing seven “layout tables.” Each table holds
one element of the record and consists of (a) a title and (b) a
section. The title of the section is what appears as text on the right side of the
leaves in Figure I. The section is what appears as a note card in Figure D. If
the button “Expand Text” in Figure I is clicked, all “sections” within the
html file are displayed in the same sequence in which they appear on the right
side of Figure I, which is a “table of contents” for the entire bibliographic
record.
The conceptual anatomy of the html file and the layout tables is
represented as follows:
In each table we inserted two cells, which Dreamweaver placed into
separate rows. We picked the color for each layout table from the color
palette and applied it to the table.
We inserted the title of the html file in the “title” window of the
Dreamweaver design and layout window.
We then split the Dreamweaver design and layout window to see the
“html codes” and the “layout elements” simultaneously.
We clicked on one layout box to see its start and end codes in the
code window. Then we enclosed the layout table with html section tags:
<Section>
layout table
</Section>
<Section>
<Description>
</Description>
layout table
</Section>
<Section>
<Description>
<Metadata name="Title" mode="accumulate"></Metadata>
</Description>
layout table
</Section>
Then we enclosed everything from the start tag <Section> through the end tag
</Description> within a comment. We also enclosed the end tag
</Section> within a comment. This made the section,
description, and metadata tags invisible in the browser, while the content
and the color of the layout table remained visible.
The manually entered and software-generated html encoding for one layout
table looked like the following:
<!--
<Section>
<Description>
<Metadata name="Title" mode="accumulate"></Metadata>
</Description>
-->
layout table
<!--
</Section>
-->
We copied and pasted the Dreamweaver and the manual tags in the
html file six times over. We then had our master template for the
bibliographic record.
This template is essentially a series of sections in an html file. Each
section has a header and a body, and within each section header we
encoded the metadata for that section. The metadata tags for the section
on author look like the following:
<!--
<Section>
<Description>
<Metadata name="Author" mode="accumulate"></Metadata>
<Metadata name="Title" mode="accumulate"></Metadata>
</Description>
-->
Greenstone builds an index on the first pair of metadata tags
only when mode="accumulate" is specified. Greenstone takes the second
pair of metadata tags and uses it as a “title” for that section to
display along the leaves. See Figure I above.
We, in essence, divided the html file into six sections, each
having a small index hook attached to it in the form of a pair of metadata
tags.
When we had prepared a master copy of a record for one author, we
saved it as a template for that author. Then we manually typed the
bibliographic information about each digital emblem in the index hook and
the body of each section. We saved each html file as a bibliographic record
with the naming convention described above.
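The comment-wrapped section pattern can be sketched as a small generator. We typed each record by hand from the Dreamweaver template, so this is only an illustration of the resulting markup, not the procedure we followed. The function name and the sample values are hypothetical, and the tag layout is copied from the template shown above.

```python
from html import escape


def hidden_section(metadata, layout_table):
    """Wrap one layout table in comment-hidden Section/Description tags."""
    for _, value in metadata:
        # "--" inside an html comment would end the comment early.
        if "--" in value:
            raise ValueError("metadata value must not contain '--'")
    tags = "\n".join(
        f'<Metadata name="{escape(name)}" mode="accumulate">'
        f"{escape(value)}</Metadata>"
        for name, value in metadata
    )
    return (
        "<!--\n<Section>\n<Description>\n"
        f"{tags}\n"
        "</Description>\n-->\n"
        f"{layout_table}\n"
        "<!--\n</Section>\n-->"
    )


# Sample values invented for this sketch.
section = hidden_section(
    [("Author", "Hugo"), ("Title", "Author")],
    "<table><tr><td>Author</td></tr><tr><td>Hugo</td></tr></table>",
)
print(section)
```

The browser renders only the layout table, while the Section, Description, and Metadata tags stay inside the two comments.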
flexibility in connecting with and navigating directories on the classroom
server; increased visibility during the importing and building processes in
that we could immediately spot error messages (and the successes that
scrolled machine characters past our eyes as the HASH directories and
indexes were built in a symmetry pattern not that dissimilar from
streaming fractals); and increased knowledge of Greenstone’s building
procedures. Still another advantage of building through the command line
was access to and control over optional switches important for full
utilization of Greenstone’s importing and building processes. For example,
using the command line mode for importing enabled us to add our emblem
htmls and JPEGs in batches. We were able to build and re-build existing
collections without generating duplicate Greenstone archive records: this
was achieved by invoking the optional switch “-removeold” at the import
process (see Witten and Bainbridge, pages 315-316). The Witten and
Bainbridge text contains a list of other such features available for the
command line mode import and build processes.
Once you set up the Greenstone framework for building a collection
(a step that includes supplying the collection name, in our case,
“DEmblems”), the software sets up seven directories to store: (a) source
files; (b) files in the Greenstone Archival Format (GAF), an XML format
that Greenstone uses to build web pages for the display; (c) indexes; (d)
images; (e) the collection logo; (f) plain text files such as hfiles; (g) perl
scripts; (h) the collection configuration file, collect.cfg; and (i) various
automatically generated directories and files, such as the hash directory
structure, the associated files directory, the fail.log, and the collection
information database. The seven directories are: “import,” “archives,”
“building,” “index,” “etc,” “images,” and “perllib.”
upon the naming convention that should give the JPEG the
same prefix that the html file has. For instance, RECORD1.jpg
will be associated with RECORD1.html. Note that, after all
collection building and indexing steps are completed, the
associated files will be stored in the index directory’s
sub-directory called “assoc.”
d. We specified the four alphabetical vertical browsing lists that
Greenstone will use to build browsing on the metadata
elements, “Source,” “Motto,” “Subject,” and “Pictora.” The
“Source” list was built at the document level and the other
three lists were built at the section level of the document
(recall that the metadata elements, “Motto,” “Subject,” and
“Pictora” were specified within sections of html documents).
Specifying the four alphabetical vertical browsing lists as we
did invoked default display features of Greenstone such as the
alphabetical buckets (A-B, C, D-F, etc.) and page advancing
arrows (icon + “Matches 11-20”) that respectively appear at
the top and bottom of search or browse query results.
e. We specified the h-files that Greenstone will use to build the
two browsing hierarchies, the subject classification as
explicated in the above report section, “Subject Description:
Facet Analysis,” and the Biblical motto classification ordered
on the hierarchical structure of the canon itself. (See
Appendix B: hfile for DEmblems, sub.txt).
f. We specified the “sort” operations on the two browsing
hierarchies. We did not specify the “sort” operations for the
vertical browsing lists, relying instead on the default sort,
which corresponds with the metadata element particular to
each list (e.g., the AZSectionList constructed on the metadata
element “Pictora” will be sorted by “Pictora”).
g. We specified the “buttonname” for each of the browsing lists.
As Figure I shows, three buttonnames display on the
navigational bar in the characteristic Greenstone colors and
fonts and four buttonnames appear as linked text. The
“search,” “filenames,” and “subjects” buttons display as they
do because these buttonname icons came with the downloaded
Greenstone software. Our linked text buttons, “MottoAtoZ,”
“BiblicalMotto,” “SubjectsAtoZ,” and “PictoraAtoZ,” are chosen
names specific to our collection: as such, the macro files did
not contain such icons. We toyed with the idea of substituting
a pre-made buttonname for uniform-display purposes. For
example, we could have easily specified that the “MottoAtoZ”
display as the pre-made button, “phrases.” However, we still
would have been left with finding pre-made buttons for
“BiblicalMotto” and for “PictoraAtoZ,” not to mention finding a
second subject-related button that differentiated the
alphabetical vertical list of subjects from the classified
hierarchy. Moreover, the term “phrases” is not synonymous
with the term “MottoAtoZ.” We decided that clarity of button-
naming took priority over uniform-display of the buttons.
h. We specified a formatting string for each leaf in a particular
browsing list to be hyperlinked with the source html document
in that list.
i. We specified through a formatting string the navigational
buttons, “Expand Text,” “Detach,” and “Highlight.” These
buttons appear at the lower left corner of the screen when the
full record and the image are displayed side by side.
j. We specified a formatting string for each section title to
display as a heading above its particular section layout table.
When the “Expand Text” button is selected, all headings are
displayed.
k. We invoked the collection icon feature of Greenstone by
specifying the path to our Fireworks designed logo, placed in
the DEmblems images directory.
l. We specified the searchable field names for the pull-down
menu. The searchable fields correspond to the metadata
elements and indexes specified above (see a.).
m. We wrote a succinct description of the collection in the
“collectionextra” line, indicating the LIS class it was
constructed for as well as the builders of the collection. The
collectionextra line functions as part of a splash-page for the
entire collection. As for additional information on the
splash-page, Greenstone automatically generates statements
for the search and browsing features specified. Any further
revision to the splash-page would entail manual revision of the
macros, a step we elected not to pursue at this time.
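The cover-image naming convention described in the list above (RECORD1.jpg paired with RECORD1.html) lends itself to a quick check before importing. The sketch below is our own illustration and not Greenstone's association logic: the function name is hypothetical, and it simply reports which html records in a directory have a JPEG with the same prefix.

```python
import tempfile
from pathlib import Path


def pair_cover_images(import_dir):
    """Map each html record to the JPEG sharing its prefix, or None."""
    pairs = {}
    for record in sorted(Path(import_dir).glob("*.html")):
        cover = record.with_suffix(".jpg")
        pairs[record.name] = cover.name if cover.exists() else None
    return pairs


# Demonstrate on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ("RECORD1.html", "RECORD1.jpg", "RECORD2.html"):
        (Path(tmp) / filename).touch()
    pairs = pair_cover_images(tmp)

print(pairs)
```

A record mapped to None would be imported without a cover image, which makes misspelled prefixes easy to spot before a build.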
Once we finalized the collect.cfg file, we executed the “import” perl script
and the “buildcol” perl script.
Greenstone parsed all source files and built a hash directory
structure for storing information about the GAF. The GAF and the hash
directory structure ensure that the human administrator knows the path
to source documents and the associated metadata that Greenstone stores
in XML in the GAF (see Appendix C: Greenstone Archival Format for
Adversity_Misery_HH_02_1686.html, an example of doc.xml).
Furthermore, the hash directory structure and the GAF ensure that the
software is able to build web pages on the fly once a search query is
executed and the desired document is clicked.
We executed the final step by moving the building directory contents
into the index directory. The English Emblem Book Digital Library was
thus complete and available for the audience.
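The import, build, and move steps can be summarized in a short sketch. It makes several assumptions: that the two perl scripts are named import.pl and buildcol.pl and are invoked with "perl -S", as in the standard Greenstone distribution; that the collection name is passed as the last argument; and that moving the building contents replaces whatever the index directory held before. The function names are hypothetical, the commands are only assembled and never executed, and the move is demonstrated on a temporary directory rather than a real collection.

```python
import shutil
import tempfile
from pathlib import Path


def build_commands(collection, remove_old=True):
    """Return the import and build command lines as argument lists."""
    import_cmd = ["perl", "-S", "import.pl"]
    if remove_old:
        import_cmd.append("-removeold")  # avoid duplicate archive records
    import_cmd.append(collection)
    build_cmd = ["perl", "-S", "buildcol.pl", collection]
    return [import_cmd, build_cmd]


def promote_build(collection_dir):
    """Move the contents of 'building' into 'index' (assumed to replace it)."""
    root = Path(collection_dir)
    building, index = root / "building", root / "index"
    if index.exists():
        shutil.rmtree(index)
    index.mkdir()
    for item in list(building.iterdir()):
        shutil.move(str(item), str(index / item.name))


commands = build_commands("DEmblems")

# Demonstrate the move on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "building" / "text").mkdir(parents=True)
    (root / "building" / "text" / "sample.idx").touch()
    (root / "index").mkdir()
    (root / "index" / "stale.idx").touch()
    promote_build(root)
    moved = sorted(p.name for p in (root / "index").iterdir())
    left_behind = list((root / "building").iterdir())

print(commands, moved, left_behind)
```

Keeping the build output in a separate directory until it is promoted means a failed build never disturbs the index that is currently being served.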
Possibilities for the English Emblem Books Digital Library:
There are a few select aspects of the collection that we would have
implemented had we had more time to work on the project. We have already
invested time in understanding and investigating possible solutions for
each of the following. In the end, the semester hourglass beat us.
Thumbnails. Greenstone has automated procedures that generate
thumbnails for images imported. Provided the appropriate plug-in is
invoked through the configuration file, an image and its corresponding
thumbnail are automatically associated by Greenstone; a GAF file records
the association; and the hash directory structure stores each. We
investigated thumbnails because we initially wanted (a) to not re-size the
JPEGs collected from PSU and Middlebury; and (b) to specify to
Greenstone that the thumbnail should stand as a substitute for the “cover
image.” Essentially we wanted to specify to Greenstone, “display the
thumbnail associated with this html in the cover image spot and make the
thumbnail link to the appropriate image.”
Revisions to the macros. Customization of colors, fonts, and
navigational features is certainly possible through working with the
macro files. We have already mentioned two areas where macro-work
could have been possible to enhance our already visually appealing
collection: creating collection-specific buttons for the navigational bar and
designing a different splash page. We were also interested in revising the
macros in such a way as to make Boolean searching across metadata
elements possible. For example, the current collection allows a search for
the subject, “Obedience.” The displayed results will show numerous leaves
from various books by various authors. We would like to have enabled a
Boolean search for the subject, “Obedience” AND the author, “Hugo.”
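The kind of Boolean search we had in mind can be sketched outside Greenstone as a filter over records. This is not a Greenstone feature or macro: the function names are hypothetical and the three sample records are invented placeholders. Only the field names “Subject” and “Author” and the values “Obedience” and “Hugo” come from the example above.

```python
def matches(record, field, wanted):
    """True if the record's field holds the wanted value (case-insensitive)."""
    value = record.get(field, [])
    values = value if isinstance(value, list) else [value]
    return any(wanted.lower() == v.lower() for v in values)


def boolean_and(records, **criteria):
    """Keep records that satisfy every field=value criterion."""
    return [
        record for record in records
        if all(matches(record, f, w) for f, w in criteria.items())
    ]


# Invented sample records for this sketch.
records = [
    {"Motto": "sample motto 1", "Author": "Hugo", "Subject": ["Obedience"]},
    {"Motto": "sample motto 2", "Author": "Other", "Subject": ["Obedience"]},
    {"Motto": "sample motto 3", "Author": "Hugo", "Subject": ["Chastity"]},
]

hits = boolean_and(records, Subject="Obedience", Author="Hugo")
print(hits)
```

A search on the subject alone would return the first two records; adding the author criterion narrows the result to the first.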
Incorporation of Strictly Textual Leaves. The emblem books and
books of emblems collected all had digitized images of pages that were not
emblems proper, that is, PSU and Middlebury had also digitized the various
prefaces, tables of contents, dissertations, dedications, and exhortations
associated with each book. Such leaves are strictly textual matter and, as
such, were beyond the initial scope of the project, focusing as we did on the
emblem proper. But for our audience these leaves are absolutely crucial for
studying the resources. It is, for example, obvious in the “preface” by the
English translator of Hugo’s Pia Desideria that this emblem book, and this
translation in particular, is intended to school 17th century English women
and children in a decidedly Christianized morality; and was censored by
the translator in order to cleanse the work of the ‘shameful’ and
‘ridiculous’ follies attributed to monks and Jesuits in the original book:
such historical and literary discourses are of paramount concern for our
audience and their respective disciplines.
Of the three possibilities described above, incorporation of strictly
textual leaves would seem the most important if we desired to take this
collection beyond the LIS450DL classroom and into the classrooms and
desktops of our identified audience.
Bibliography:
Witten, Ian H., and David Bainbridge. How to Build a Digital Library. The
Morgan Kaufmann series in multimedia information and systems.
Amsterdam [u.a.]: Morgan Kaufmann, 2003.
Appendix A: DEmblems Configuration File, collect.cfg
Appendix B: hfile for DEmblems, sub.txt
Appendix C: Greenstone Archival Format for
Adversity_Misery_HH_02_1686.html, an example of doc.xml