
Reading and the Brain: An Archive-It Collection

Group D
Adam Doolittle


Chris Evans
Nathan Filbert
Rikki Carter
Shelby Cunningham

Opening Summary
Archival collection practices are evolving as digitization and born-digital documents garner increasing attention. It is important for aspiring archivists and information professionals to understand how to create and maintain both physical and digital collections. To practice these skills, our group used the Archive-It tool to create a collection of electronic resources on the topic of Reading and the Brain. This report describes the development of that collection: the trial and error involved, decisions about scope and breadth, metadata creation, assessment of the tool, and the results.

Theme, Seed Selection, and Scope


The seed selection process began when each member of our group suggested topics of interest, and we voted collectively on which to use for this project. We selected "Reading and the Brain," which seemed appropriately related to library and archival studies as well as generally interesting to us all. We then independently conducted online research to find seeds to contribute, and each of us entered our findings into our Archive-It collection. We had not come to a consensus on search criteria before we set out, as none of us were well versed in what might be available, but we decided that researching independently would produce a wide and varied sweep of the topic. Thus, we came together with a broad range of seeds representing various areas of our subject.
The group entered a total of ten seeds and began running crawls twice daily on some seeds and once daily on others. Each report showed an increasing number of documents archived, eventually upwards of 50,000, some more relevant than others. After evaluating these results, the group decided it best to deselect a handful of seeds, retaining those most similar to one another, most scientific or academic in nature, and best suited to covering our topic well. As we assessed the seeds for the most relevant and accurate information, we experienced what Hunter (2003) describes about the appraisal process for collection development: "Appraisal decisions are so difficult because all records have some conceivable value" (p. 52). While each seed was relevant to our topic, we narrowed our list to the five we thought best met these criteria, and after completing another test crawl, we concluded that this selection would give us a more concise range of documents to archive, representing the breadth of our topic more acutely. Eventually, the number of documents returned climbed back to over 50,000 before gradually sloping downward, but the data aligned more closely with our topic following the changes to our seed group. At this point, satisfied with the selections, we added seed-specific metadata.
When entering the seed metadata, we found the process slightly confusing: although the metadata entries use archival terms, it is not immediately clear how those terms apply to Internet sources. Eventually, we decided to use another collection's already-developed metadata as a reference point. That collection, National Disaster and Digital Preservation: 9/11 and the Internet, was assembled by the University of Pittsburgh (2014). This collection, along with the knowledge gained in class and from the course readings, helped us create the metadata for our collection.

Scope of the Collection Between Crawls


We determined the scope of our crawls through trial and error. First, however, we needed to come to a consensus on the number of seeds to enter into the Archive-It tool. We initially decided to start a test run with two to three seeds. We also had to decide how many times a day to run our crawls; at the outset, we chose once a day. Coupled with the small number of initial seeds, however, this did not produce the results our group desired: while the crawl brought back results, they lacked the depth of information we had hoped to obtain. For our first production crawl, we therefore increased the number of seeds to ten, crawling twice a day.
With the production run, the problem we had with the test crawl was reversed: we now had too many results. After a quick group consultation, we decided to slim down to five seeds crawling once a day. This second production crawl produced much better results overall. Our group decided we were comfortable with the current number of seeds and crawl frequency and moved forward with this data.

Test and Production Crawls


As our group began to conduct our first test crawls within Archive-It, we focused on capturing the URLs that best covered our topic of "Reading and the Brain." As previously stated, these test crawls were initially limited to ten seed URLs; however, upon viewing our results, we found that the number of documents archived was simply too vast and that we would need to limit our seeds in some way. We decided to retain the seeds containing the most quantifiable information and the most technical, educational, and comprehensive material on our topic. According to Trace (2010), "it is widely acknowledged that selection in the electronic environment requires different strategies (both tactical and methodological) to those traditionally used in the paper environment" (p. 60), and this was something we needed to take into account, since we were not dealing with any physical documents or files. It proved extremely important that we keep the seeds with the most varied data, both to cover the full spectrum of our topic and to make the acquisition of this material as unproblematic as possible.

Once our seeds had been narrowed to our five strongest choices, we continued to run test crawls but found that they still produced an extremely high number of results. The group decided to examine some other collections within Archive-It to get a better grasp of the information we were reading, and ultimately, researching other collections served as a reference point that allowed us to see our own collection in an entirely new light. Not only had our group been allowing the chosen seeds to crawl too frequently, but the seeds had also not been crawling at the same frequency. While some of our seeds had been crawling once a day and others twice, we decided it would be best to have them all crawl once daily and in unison. The test crawl conducted after these adjustments resulted in a significantly smaller data set that made examination and analysis of the data much easier. Unfortunately, the effect did not last, and we soon found our crawls producing more and more archived documents until the number once again reached approximately 50,000.
The content itself was overwhelming to analyze at first, as there was a great amount of material to sort through, and making coherent sense of it took more effort than the group originally anticipated. We found that the file types returned by our crawls were quite varied, including text files, image files, applications, scripts, fonts, banners, and videos. There were a surprising number of health-related YouTube videos embedded within the URLs, for instance, and many of the sites listed in the Hosts Report were merely incidental, with no real relevance to our topic. Governmental sites were recorded a great deal (such as those of the CDC, the FDA, and the White House), while general business, health, and technology sites also proved a rich source of material.
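As a hypothetical illustration of the kind of sorting this required (the records below are invented, not taken from our actual reports), grouping crawl results by file type quickly shows where the volume lies:

```python
from collections import Counter

# Invented sample of (url, mime_type) pairs standing in for a crawl report.
crawl_results = [
    ("http://example.org/article.html", "text/html"),
    ("http://example.org/brain.png", "image/png"),
    ("http://example.org/style.css", "text/css"),
    ("http://example.org/player.js", "application/javascript"),
    ("http://example.org/study.pdf", "application/pdf"),
    ("http://example.org/article2.html", "text/html"),
]

# Tally documents by type, mirroring the varied file types
# (text, images, scripts, applications) our crawls returned.
by_type = Counter(mime for _, mime in crawl_results)
for mime, count in by_type.most_common():
    print(f"{mime}: {count}")
```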

Many URLs were also inaccessible to our group; although they were discovered through our seed crawls, they were not archived within our Archive-It account. The reasons these pages were not properly archived varied, but ultimately came down to issues such as pages not being linked to the seed site, connection errors on the host site, pages blocked by robots.txt, and URLs that fell outside of the crawl scope ("Archive-It How-to FAQ," 2014).
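Web crawlers conventionally honor robots.txt exclusions before fetching a page, which is why disallowed pages appear as discovered but unarchived. As a minimal sketch of that check (the host and user agent below are hypothetical), Python's standard library can test whether a given URL is fetchable:

```python
from urllib import robotparser

# Hypothetical seed host and crawler name, for illustration only.
SEED_HOST = "http://example.org"
USER_AGENT = "example-archival-crawler"

rp = robotparser.RobotFileParser()
rp.set_url(SEED_HOST + "/robots.txt")
rp.read()  # download and parse the host's robots.txt

for url in [SEED_HOST + "/articles/reading.html",
            SEED_HOST + "/private/notes.html"]:
    if rp.can_fetch(USER_AGENT, url):
        print("OK to crawl:", url)
    else:
        # Pages disallowed here would show up as discovered but
        # unarchived, as we observed in our crawl reports.
        print("Blocked by robots.txt:", url)
```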

Assessing the Archive-It Tool


Upon first encountering the Archive-It tool (Internet Archive, 2014), we found the interface inviting and the links, categories, and input areas clear, providing simple setup and easy copy-and-paste of the URLs we wanted to test as seeds. The site is well organized into tabs, with links for seed entry and management, metadata creation and editing, crawl reports, and scoping control. We jumped in excitedly and began crawling seeds twice a day. We found the core functionality of the tool well matched to library and archive services: metadata definitions are set to simple Dublin Core elements, making them readily available and interoperable with other collection partners. The seed, scoping, and modification tools also seem sensible and comprehensible, providing useful and usable information and input charts that enable host constraints, crawl limits, and expansion. For librarians and archivists, overall, the Archive-It tool feels familiar and ready for convenient and effective use.
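To illustrate why those Dublin Core elements aid interoperability, the sketch below models a seed record as a dictionary keyed by standard Dublin Core element names; the field values are hypothetical, not drawn from our actual collection:

```python
# A minimal sketch of seed-level metadata built on Dublin Core
# elements. All values here are hypothetical examples.
seed_metadata = {
    "title": "Reading and the Brain: Example Seed",
    "creator": "Group D",
    "subject": "reading; neuroscience; literacy",
    "description": "A scientific article on how reading affects the brain.",
    "date": "2014-11-30",                       # date the seed was added
    "type": "Text",
    "format": "text/html",
    "identifier": "http://example.org/reading-and-the-brain",  # seed URL
    "language": "en",
}

# Because the keys are standard Dublin Core elements, records like
# this remain interoperable with other collections and partners.
for element, value in seed_metadata.items():
    print(f"dc:{element} = {value}")
```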
The learning gap we encountered with the tool came when we began to interpret reports and attempt successful modifications of our crawl scope and seed limits. The Hosts list is confusing to comprehend yet very informative regarding the convergence of linkages in individual websites and file URLs. Knowing which items to block or expand, limit or edit can be difficult, as many of the host addresses are mediatory (link and node hubs): necessary for the crawls to harvest multimedia, PDF, and image content within relevant seed sources, but not initially recognizable as seed-related addresses. Likewise, the finesse of clarifying and modifying crawl and seed scoping requires research and practice. Fortunately, the Archive-It site provides extensive help documents in appropriate detail for self-learning in this regard ("Archive-It How-to FAQ," 2014). Pages on selecting seeds, scoping and running crawls, and interpreting and analyzing host reports were all very useful as we worked to adjust and delimit our returns toward a manageable and accurate collection.
As we monitored our reports, it became clear that what initially seemed a self-maintaining collection development tool would in fact require much expertise and effort to manage successfully. Confusion remains among the team as to the specific content of the returned files, and much time is required to evaluate and assess the relevance and fit of specific items. In this way, the Archive-It tool reinforces what we learned in our collection management and archival studies courses: the selection, acquisition, preservation, and usability of electronic media and Internet resources is a complex, continuous, care-demanding process, one that must draw fruitfully on experts from many fields and disciplines.
Nonetheless, because the Archive-It tool creates accessible and shareable collections in a predominantly open, collaborative environment, collections once begun can be honed and revised by many experts and users throughout the world. This lends greater confidence to collections started with Archive-It, for with accurate metadata our collection is interoperable and openly searchable, allowing subject experts to add or alter metadata, refine and scope hosts and seeds, and enhance and extend the value of the collection.

Preservation and Access through Archive-It


Throughout the process of creating this collection, it became clear that, despite some issues, the Archive-It tool stands out as one of the best resources for digital document preservation and access. The archival profession has always been concerned with the preservation of its materials, but while archivists have several techniques for preserving physical materials, the same cannot be said for digital documents. Archivists are still developing methods to counteract the challenges the digital world has created.
One major difficulty archivists face when collecting digital documents is that digital records are "an evolving process" (Hunter, 2003, p. 240). A digital document uploaded one day could be completely removed the next. A tweet sent in November could have its format and content edited with Photoshop. And, most unfortunate of all, as technology rapidly changes, sites or other online documents sometimes simply become inaccessible to modern audiences.
Luckily, the makers of Archive-It had these problems in mind when they created their resource. Archive-It collects and preserves digital documents through a process called crawling. The crawls set up by the main website and by individual collections take snapshots of Internet sites as they exist at a given moment in time. These snapshots preserve the original format and context of a digital document, which is important to understanding that document's original meaning and significance. In a traditional archive, this would be the only time a record of the document is imaged and recorded. Archive-It, however, returns to a digital document in later crawls, thus documenting the modifications typical of the digital environment, be they site updates, new comments, or text edits.
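To make the snapshot idea concrete, the sketch below imitates the concept in miniature; it is not Archive-It's actual implementation, and the seed URL is hypothetical. The idea is to fetch a page on a schedule and save each capture under a timestamp, so successive versions can be compared later:

```python
import urllib.request
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical seed URL, for illustration only.
SEED_URL = "http://example.org/reading-and-the-brain"

def capture(url: str, archive_dir: str = "captures") -> Path:
    """Save one timestamped snapshot of a page."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    Path(archive_dir).mkdir(exist_ok=True)
    out = Path(archive_dir) / f"{timestamp}.html"
    with urllib.request.urlopen(url) as response:
        out.write_bytes(response.read())  # raw page as served at this moment
    return out

# Running this once a day yields a series of dated captures whose
# differences document how the page evolved, much as repeated
# Archive-It crawls do at scale.
print(capture(SEED_URL))
```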

After documents have been preserved in this way, Archive-It gives visitors access to the captures through its Wayback feature. The Wayback feature allows anyone, regardless of Internet browser or advances in technology, to browse any archived website, including sites that no longer exist and older versions of currently existing sites.
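Access in Wayback is URL-based: the public Wayback Machine addresses each capture by the original URL prefixed with a fourteen-digit timestamp, and Archive-It collections follow a similar timestamped convention. A short sketch of that addressing, using a hypothetical seed URL and capture time:

```python
# Wayback Machine captures follow the pattern
# web.archive.org/web/<timestamp>/<original-url>.
# The seed URL and timestamp below are hypothetical.
original_url = "http://example.org/reading-and-the-brain"
timestamp = "20141130120000"  # YYYYMMDDhhmmss of the capture

capture_url = f"https://web.archive.org/web/{timestamp}/{original_url}"
print(capture_url)
# -> https://web.archive.org/web/20141130120000/http://example.org/reading-and-the-brain
```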
Through both its preservation and access methods, Archive-It continues to show how traditional archiving techniques and responsibilities can be adapted to new environments. While Archive-It is not a perfect tool for collecting digital documents, it has been making strides toward better preservation of the digital environment. Archive-It recognizes that preserving the Internet means adapting that preservation to how the Internet develops, while keeping all gathered information accessible to the site's visitors and contributors.
Future Collection Management
For future development and maintenance of this collection, we suggest continued research into academically and scientifically grounded resources concerning the effects of reading on the brain, adding them as seeds to broaden the depth and breadth of the collection. To maintain the collection's narrow scope, however, new articles should not diverge too far from the core topics covered by the five seeds that started the collection. If more direct or specific seeds relating to the topic are discovered, or the information in the original seeds becomes outdated, it will be best to remove the old seeds from the pool and replace them with seeds containing up-to-date information. An electronic collection requires a unique appraisal method to ensure its contents align continually with collection goals. Trace (2010) suggests the appraisal process include "ongoing monitoring of electronic records selected for preservation" (p. 61), and such monitoring would absolutely be needed to continue developing and maintaining this collection using the Archive-It tool.
If an institution were interested in using Archive-It to explore the possibility of creating a web archive, Trace (2010) would agree that determining value before adding an electronic record to a collection is the best course of action. It would be beneficial to decide on appraisal methods, search parameters, and seed qualifications before setting out to research potential seeds for a chosen topic. Hunter (2003) recommends that the following values be considered when appraising records:

operating value
administrative value
fiscal value
legal value
archival value
While the process of determining these values may differ when applied to electronic resources and collections, they remain useful markers for gauging and coordinating the selection process. An Archive-It electronic collection should still fit the goals and patron needs of its governing institution, and these appraisal guidelines can help create a well-aligned scope.
Concluding Summary
Overall, the assignment was relatively simple, although it seemed a larger task at the outset. Learning a new computer process, interpreting its results, and simultaneously recording that data as a legible report seemed quite laborious. However, as our group came together and began to collaborate well, the difficulties became manageable. So while there were problems and glitches, such as dealing with the large number of returns, our group was able to navigate these challenges and come away with the ability to do quality research using the Archive-It tool.


References


About Us. (2014). Retrieved November 30, 2014, from https://archive-it.org/learn-more/learn-more


Archive-It How-to FAQ - Archive-It Help - IA Webteam Confluence. (2014). Retrieved November 30, 2014, from https://webarchive.jira.com/wiki/display/ARIH/Archive-It+How-to+FAQ
Hosts Report - Archive-It Help - IA Webteam Confluence. (2014). Retrieved November 30, 2014,
from https://webarchive.jira.com/wiki/display/ARIH/Hosts+Report
Hunter, G. S. (2003). Developing and maintaining practical archives: A how-to-do-it manual (2nd ed.). New York, NY: Neal-Schuman.
Internet Archive. (2014). Archive-It - Web Archiving Services for Libraries and Archives. Retrieved
November 30, 2014, from https://archive-it.org/
Scoping and Running Crawls - Archive-It Help - IA Webteam Confluence. (2014). Retrieved
November 30, 2014, from
https://webarchive.jira.com/wiki/display/ARIH/Scoping+and+Running+Crawls
Selecting Seeds - Archive-It Help - IA Webteam Confluence. (2014). Retrieved November 30, 2014,
from https://webarchive.jira.com/wiki/display/ARIH/Selecting+Seeds
Trace, C. B. (2010). On or off record?: Notions of value in the archive. In T. Eastwood & H. MacNeil (Eds.), Currents of archival thinking (pp. 47-68). Santa Barbara, CA: Libraries Unlimited.
University of Pittsburgh School of Information. (2014). National Disaster and Digital Preservation: 9/11 and the Internet. Retrieved December 2, 2014, from https://www.archive-it.org/collections/4486
