Opening Summary
Archival collection practices are evolving as digitization and born-digital documents garner increasing attention. It is important for aspiring archivists and information
professionals to understand how to create and maintain both physical and digital collections.
As practice in this area, our group used the Archive-It tool to create a collection of
electronic resources on the topic of Reading and the Brain. This report describes the
collection development process: the trial and error involved, decisions about scope and breadth, metadata,
assessment, and results.
similar to one another, most scientific or academic in nature, and those which held the best
information to cover our topic well. As we assessed the seeds for the most relevant and accurate
information, we experienced what Hunter (2003) describes about the appraisal process for
collection development: "Appraisal decisions are so difficult because all records have some
conceivable value" (p. 52). While every seed was relevant to our topic, we narrowed
the list to the five we thought best met these criteria, and after completing another
test crawl, we concluded that this seed selection would give us a more concise range of documents
to archive, one that more acutely represented the breadth of our topic. The number of document
results eventually climbed back above 50,000 before gradually sloping downward, but the
data were better aligned with our topic as a result of the changes to our seed group.
At this point, satisfied with the selections, we added seed-specific metadata.
When entering the seed metadata, we found the process slightly confusing: although
the metadata entries use archival terms, it is not immediately clear how those terms apply to
Internet sources. Eventually, we decided to use an already developed collection's data as a
metadata reference point. That collection, National Disaster and Digital Preservation: 9/11
and the Internet, was assembled by the University of Pittsburgh. This collection, along with
the knowledge gained in class and from the textbooks, helped us create the metadata for our collection.
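To illustrate how archival terms can map onto a web seed, a seed-level record might look like the sketch below. The field set follows Dublin Core, which Archive-It supports for seed and collection metadata; all values shown are hypothetical examples, not our actual entries:

```
Title:       Reading and the Brain: Research Overview
Creator:     Example University Neuroscience Lab
Subject:     reading; cognition; neuroscience
Description: Website presenting research on how reading
             affects brain development.
Date:        2014
Type:        Website
Format:      text/html
Identifier:  http://example.org/reading-brain/
Language:    en
```

Framing a seed this way helped us see that familiar archival fields such as Creator and Date still apply to web resources, even when the "record" is a living site rather than a fixed document.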
coupled with the small number of initial seeds, it didn't produce the results our group desired. So
while it brought back results, they lacked the depth of information we were hoping to obtain. For
our first production crawl, we increased the number of seeds to ten, running twice a
day.
With our production run, however, the problem we had with the test crawl was reversed:
we ran into the issue of having too many results. After a quick group consultation, we decided
to slim the number of seeds down to five, running once a day. This second production crawl
produced much better results overall. Our group was comfortable with
the current number of seeds and crawl frequency and decided to move forward with this data.
Once our seeds had been narrowed to our five best choices, we continued to
run test crawls but found that the number of results those crawls produced was still extremely
high. The group decided to look at some other collections contained within
Archive-It in order to get a better grasp of the information we were reading, and ultimately, we found
that researching other collections served as a reference point that allowed us to see our own
collection in an entirely new light. Not only had our group been allowing the chosen seeds to
crawl too frequently, but they had also not been crawling at the same frequency. Because some of
our seeds had been crawling once a day and others twice, we decided it would be best to have
them all crawl once daily and in unison. The test crawl conducted after these adjustments
returned a significantly smaller data set that made both examination and
analysis of the data much easier. Unfortunately, the effect did not last, and we
soon found our crawls producing more and more archived documents until the number of articles
once again reached approximately 50,000 hits.
The content itself was overwhelming to analyze at first, as there was a great amount of
material to sort through, and making coherent sense of it took more effort than the group
originally anticipated. The file types returned by our crawls were quite varied, including
text files, image files, applications, scripts, fonts, banners, and videos. There was a
surprising number of health-related YouTube videos embedded within the URLs, for instance, and
many of the sites listed in the Hosts Report were merely incidental, with no real relevance to
our topic. Governmental sites appeared frequently (such as those for the CDC, the FDA, and the
White House), while general business, health, and technology sites also proved to be rich
sources of material.
Many URLs were also found to be inaccessible to our group; although
they were discovered through our seed crawls, they were not archived within our
Archive-It account. The reasons these pages were not properly archived varied, but
ultimately came down to issues such as pages not being linked to the seed site, connection
errors on the host site, pages blocked by robots.txt, and URLs that fell outside of the crawl
scope (Archive-It How-to FAQ - Archive-It Help - IA Webteam Confluence, 2014).
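The robots.txt mechanism mentioned above is a plain-text file a site publishes at its root to tell crawlers which paths they may fetch; a crawler that honors it, as Archive-It does by default, skips the disallowed paths. A minimal sketch of that check using Python's standard urllib.robotparser (the rules and URLs below are hypothetical, not taken from our seeds):

```python
from urllib import robotparser

# Hypothetical robots.txt rules; a real crawler would fetch
# them from http://<host>/robots.txt before harvesting.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# Pages under /private/ are skipped; everything else may be crawled.
print(rp.can_fetch("archive-it", "http://example.org/articles/reading.html"))  # True
print(rp.can_fetch("archive-it", "http://example.org/private/drafts.html"))    # False
```

Pages excluded this way show up in crawl reports as discovered but not archived, which accounted for a portion of our inaccessible URLs.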
necessary for the crawls to harvest multimedia, PDF, and image content within relevant seed
sources but not initially recognizable as seed-related addresses. Moreover, the finesse of clarifying
and modifying crawl and seed scoping requires research and practice. Fortunately, the Archive-It
site provides extensive Help Documents in appropriate detail for self-learning in this regard
(Archive-It How-to FAQ - Archive-It Help - IA Webteam Confluence, 2014). Pages on
selecting seeds, scoping and running crawls, interpreting host reports, and managing
and analyzing reports were all very useful as we worked to adjust and delimit our returns
toward a manageable and accurate collection.
As we monitored our reports, it became clear that what initially seemed an auto-maintaining collection development tool would in fact require considerable expertise and effort to
manage successfully. Confusion remains among the team about the specific content of the returned
files, and much time is required to evaluate and assess the relevance and fit of specific items. In
this way, the Archive-It tool reinforces what we learned in our Collection Management and
Archival Studies courses: selecting, acquiring, preserving, and making usable electronic media
and Internet resources is a complex, continuous, care-demanding process, one that depends on
the expertise of many fields and disciplines.
Nonetheless, because the Archive-It tool creates accessible and shareable collections in a
predominantly open, collaborative environment, collections once begun can be honed and revised by many
experts and users throughout the world. This lends greater confidence to new collections started with Archive-It, for with accurate metadata our collection is interoperable and openly
searchable, allowing subject experts to add or alter metadata, refine and scope
hosts and seeds, and enhance and extend the value of the collection.
Once documents have been preserved in this process, Archive-It gives visitors
access to the captures through its Wayback feature, which allows
anyone, regardless of Internet browser or advances in technology, to browse any archived
website, including sites that no longer exist and older versions of currently
existing sites.
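As an illustration, Archive-It's public Wayback access addresses each capture by collection, capture timestamp, and original URL; the collection ID, timestamp, and target URL below are hypothetical:

```
https://wayback.archive-it.org/1234/20140315120000/http://example.org/reading-brain/
```

The 14-digit timestamp (YYYYMMDDhhmmss) selects which capture of the page is replayed, which is how older versions of a still-existing site remain reachable alongside the current one.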
Through both the preservation and access methods, Archive-It continues to show how
traditional archiving techniques and responsibilities can be adapted for new environments. While
Archive-It is not a perfect tool for the collection of digital documents, it has been making strides
to enhance how the digital environment will be preserved. Archive-It understands that preserving
the Internet means adapting that preservation to how the Internet develops, while also keeping all
information gathered accessible to any of the sites visitors and contributors.
Future Collection Management
For future development and maintenance of this collection, we suggest continued
research into academically and scientifically grounded resources on the effects of reading on
the brain, adding them as seeds to broaden the depth and breadth of the collection. To preserve
the collection's narrow scope, however, added articles should not diverge too far from the core
topics covered by the five seeds that started the collection. If more direct or specific
seeds relating to the topic are discovered, or if information in the original seeds becomes
outdated, it will be best to remove those seeds from the pool and replace them with seeds
containing up-to-date information. An electronic collection requires a unique appraisal method to
ensure its contents continually align with collection goals. Trace (2010) suggests the appraisal
process include "...ongoing monitoring of electronic records selected for preservation" (p. 61),
and that would absolutely be needed in continuing to develop and maintain this collection using
the Archive-It tool.
If an institution were interested in using Archive-It to explore creating a
web archive, Trace (2014) agrees that determining value before adding an electronic record to a
collection is the best course of action. It would be beneficial to decide on appraisal methods,
search parameters, and seed qualifications before setting out to research potential
seeds for a chosen topic. Hunter (2003) recommends the following values be considered when
appraising records:
operating value
administrative value
fiscal value
legal value
archival value
While the process for determining these values may differ when applied to electronic resources and
collections, they remain useful markers to gauge and coordinate the selection process. An
Archive-It electronic collection should still fit the goals and patron needs of the governing
institution, and these appraisal guidelines can help create a well-aligned scope.
Concluding Summary
Overall, the assignment was relatively simple, although it seemed a larger task to
surmount in the beginning. Learning a new computer process, interpreting the
results, and simultaneously recording the data as a legible report seemed quite a
laborious undertaking. However, as our group came together and began to collaborate
well, the difficulties became surmountable. So while there were problems and glitches, such as
dealing with the large volume of returns, our group was able to navigate these
challenges and come away with the ability to do quality research using the Archive-It tool.
References