Albert C. Whittenberg
04 May 2009
Whittenberg 2
organizations are looking for a solid recordkeeping system. Knowing this, many
software companies continue to grind out new products year after year. For the
budget-minded organization such as a public university or archives, the high costs of
these packages make open source software solutions look attractive. One of these is DSpace,
numerous examples of DSpace being used in institutions (including MIT), this open
set in archivist and professor Philip C. Bantin’s book, Understanding Data and
Information Systems for Recordkeeping. Does DSpace meet all these requirements?
According to the official DSpace Wiki, there are 334 organizations currently
using DSpace in 56 countries.1 The Wiki further states that DSpace “captures, stores,
and that “research institutions worldwide use DSpace for a variety of digital archiving
needs.”2 The site also stresses repeatedly that the
software is completely open source and free to everyone. For those institutions that are
worried about technical support, the website also boasts a DSpace Community and a
DSpace Federation with mailing lists, conferences, user groups, workshops and a host of
1
DSpace Wiki, “DSpace Instances (as of 01/12/2009),” http://wiki.dspace.org/index.php/DSpaceInstances
(accessed April 2009).
2
DSpace Wiki, “What is DSpace,” http://wiki.dspace.org/index.php/EndUserFaq#What_is_DSpace.3F
(accessed April 2009).
other websites. DSpace has also been around for some time. According to the January
In March 2000, Hewlett-Packard Company (HP) awarded $1.8 million to the MIT
Libraries for an 18-month collaboration to build DSpace™, a dynamic repository
for the intellectual output in digital formats of multi-disciplinary research
organizations. HP Labs and MIT Libraries released the system worldwide on
November 4, 2002, under the terms of the BSD open source license [1], one
month after its introduction as a new service of the MIT Libraries. As an open
source system, DSpace is now freely available to other institutions to run as-is, or
to modify and extend as they require to meet local needs. From the outset, HP and
MIT designed the system to be run by institutions other than MIT, and to support
federation among its adopters, in both the technical and the social sense.3
The reason for the project was a “need to collect, preserve, index and distribute” research
materials like those being generated by faculty at MIT.4 It was to be free, easy to install
Data and Information Systems for Recordkeeping provides these. One definition is “a
special kind of information system that manages and preserves the records that provide
Management Standard 15489 defines a system “as an information system which captures,
manages and provides access to records through time” and names three characteristics of
records managed by such a system: authenticity, reliability and integrity.6 Finally, the
requirements for a recordkeeping system (as detailed in Bantin’s book) are as
follows:
3
D-Lib Magazine (January 2003), “DSpace: An Open Source Dynamic Digital Repository,”
http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April 2009).
4
Ibid.
5
Philip C. Bantin, Understanding Data and Information Systems for Recordkeeping (New York: Neal-
Schuman Publishers, 2008), 32.
6
Ibid.
1. Capture records,
2. Support classification scheme(s),
3. Capture record metadata,
4. Support audit control,
5. Ensure records are usable,
6. Manage security and control,
7. Schedule records for disposition, and
8. Preserve records7
“production digital repository service” for research organizations.8 This means that
record capture would generally be manual rather than automatic. Researchers
and their assistants would be submitting their information in some sort of digital format
the capture process in a recordkeeping system as listed in Bantin’s book, records can be
captured either automatically or manually (so DSpace does qualify in that respect). Other
characteristics mentioned are that the system must support capturing records from various
types of software and/or applications, must be able to maintain all components captured
What types of records does DSpace support? Can it handle the countless word
processing and web applications available today? According to the DSpace Wiki, the
7
Bantin, 35-36.
8
D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April
2009).
9
Bantin, 38.
4. Data sets
5. Computer programs
6. Visualizations, simulations, and other models
7. Multimedia publications
8. Administrative records
9. Published books
10. Overlay journals
11. Bibliographic datasets
12. Images
13. Audio files
14. Video files
15. Reformatted digital library collections
16. Learning objects
17. Web pages10
DSpace also gives institutions the capability to accommodate different workflows. This
means that different departments, groups, schools or teams can submit items and organize
them in different ways. Each group decides how items are grouped together, who can
submit and who can have access. DSpace can then maintain all
components as a single record or not, depending upon the department’s preference. While
this would seem to go against the definition of the record capture process in a true
recordkeeping system, the administrators of a DSpace instance could force the software
to group items.
Versioning was not available initially in DSpace. However, recent updates have
corrected this. According to the Wiki again, “a Google Summer of Code project in 2007
has implemented a versioning prototype, for DSpace Items, DSpace Items have two
identifiers, on permanent, the other is a version lineage id. The Lineage is comprised of
items, each with unique metadata and bundles, bitstreams within the items will be either
10
DSpace Wiki, “End User FAQ,”
http://wiki.dspace.org/index.php/EndUserFaq#What_kind_of_content_does_DSpace_support.3F (accessed
April 2009).
linked from the previous version or added anew.”11 This is a prototype and probably has
some bugs. The article in the wiki did not state whether any further updates had been made,
be resolved by the robust workflow system built into DSpace. Records can be classified
in a variety of ways or categories. One example given is in the already mentioned
article in D-Lib Magazine where “a department may choose to have two collections: one
for working papers and another for datasets. They may then decide that any member of
the faculty can deposit items to either collection directly, and that any member of the
general public can have access to these collections.”13 This is a very simple classification
scheme, but more complex ones can be implemented. Record creators (frequently called
users in the articles and wiki) can be classified as well as records.
standard in archives metadata: the qualified Dublin Core. This is composed of the 15
metadata elements of simple Dublin Core plus an additional three (Audience, Provenance
and RightsHolder):
11
DSpace Wiki, “DSpace 2.0/Comparing Existing Technologies,”
http://wiki.dspace.org/index.php/DSpace_2.0/Comparing_Existing_Technologies#Versioning_Content
(accessed April 2009).
12
Bantin, 39.
13
D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April
2009).
1. Title,
2. Creator,
3. Subject,
4. Description,
5. Publisher,
6. Contributor,
7. Date,
8. Type,
9. Format,
10. Identifier,
11. Source,
12. Language,
13. Relation,
14. Coverage and
15. Rights.14
Only three of these fields are mandatory, with the others being optional. All the metadata
information is in the item record and is fully searchable. In its article regarding DSpace,
D-Lib Magazine authors acknowledge that the metadata “is indexed for browsing and
Since only three fields must be present by design, it is also assumed that organizations
Does DSpace provide and support some type of audit control? The requirements
set in Understanding Data and Information Systems for Recordkeeping regarding audit
1. The system must maintain audit trails for all processes that create, update or
modify, delete, access and use records.
2. At a minimum, the system must track the action that was implemented (what
data or information was accessed, added, deleted or modified).
3. The system must automatically capture the audit trail.
4. The audit trail must be unalterable.
14
Dublin Core Metadata Initiative Website, “DCMI Metadata Terms,”
http://dublincore.org/documents/dcmi-terms/ (accessed April 2009).
15
D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April
2009).
5. The audit trail must be kept at least until the records it refers to are destroyed
or deleted.
6. The audit trail must be logically linked to the records it documents.
7. The audit data is not available for inspection or export by any user except
those authorized (administrators of the system for example).
8. Documentation must be created when changes are made to the system or
actions are taken on the records.16
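To make requirements 3, 4 and 6 more concrete, the following is a minimal sketch of one common way to build a tamper-evident audit trail: each entry is captured automatically, names the record it documents, and carries a hash of the previous entry so that any later alteration can be detected. This is an illustration only and is not taken from DSpace or from Bantin; the function names, the field names and the record identifier "item-1" are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(entry: dict) -> str:
    """Hash an entry's fields in a stable (sorted-key) JSON form."""
    payload = json.dumps(entry, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_event(trail: list, record_id: str, user: str, action: str) -> dict:
    """Append one audit event, chained to the hash of the previous event."""
    entry = {
        "record_id": record_id,  # logical link to the record (requirement 6)
        "user": user,
        "action": action,        # what was done (requirement 2)
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prev_hash": trail[-1]["hash"] if trail else GENESIS,
    }
    entry["hash"] = _digest(entry)  # computed before the "hash" key exists
    trail.append(entry)
    return entry


def verify_trail(trail: list) -> bool:
    """Return True only if no entry was altered, removed or reordered."""
    prev = GENESIS
    for entry in trail:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Hypothetical usage: two events on the same record, then a check.
trail = []
append_event(trail, "item-1", "alice", "create")
append_event(trail, "item-1", "bob", "update")
print(verify_trail(trail))
```

A chain of this kind does not prevent someone from editing the log, but it makes any such edit detectable, which is the practical meaning of "unalterable" in most software systems.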
DSpace fails on several of these items. In a design proposal for DSpace 2.0 from October
While checksums can ensure that records are not altered through the process, no
clearly stated that the records in the main relational database are not easily audited, nor
did the proposal mention enhancing this in later versions of the software. After
exploring the DSpace Wiki and the main DSpace.org website, I found little to no
information about auditing beyond what was found in the 2.0 proposal. In fact,
there were several examples of people requesting third-party auditing packages to use
with DSpace on the community listserv (with no clear answers given to solve the
problem).
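The difference between a checksum and an audit trail can be shown with a short sketch. A checksum comparison reveals whether a stored file still matches what was submitted, but it says nothing about who changed it, when, or why. The example below is an assumption-laden illustration rather than DSpace's own code: the function names are hypothetical, and SHA-256 is chosen only for convenience, since the sources cited here do not specify a digest algorithm.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of the data (algorithm choice is an assumption)."""
    return hashlib.new(algorithm, data).hexdigest()


def is_unaltered(data: bytes, stored_checksum: str, algorithm: str = "sha256") -> bool:
    """Compare the current content against the checksum stored at submission."""
    return checksum(data, algorithm) == stored_checksum


# Hypothetical usage: the checksum is stored when the item is deposited.
original = b"working paper, version 1"
stored = checksum(original)

print(is_unaltered(original, stored))
print(is_unaltered(b"working paper, version 2", stored))
```

The second comparison reports a mismatch, yet nothing in the result identifies the person or process responsible. That missing information is exactly what the audit requirements listed above ask a recordkeeping system to capture.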
Does DSpace ensure that records are usable? Again, what are the requirements?
must be “easily accessed and retrieved in a timely manner” with searching capabilities
16
Bantin, 41-42.
17
DSpace Wiki, “DSpace 2.0 Design Proposal,” http://wiki.dspace.org/static_files/1/16/Ds2arch.doc
(accessed April 2009).
including full text searches or metadata across files and categories (entire classification
scheme hierarchy).18 Using MIT’s DSpace instance as an example, users can browse
records based on collections, issue date, title, authors or subjects.19 Their search engine
seems to be limited to Boolean-type searches like those found in Google or Yahoo. A
quick test of the word “Washington” produced 5,571 hits with examples of where the
term is part of the title, part of the author’s name or mentioned somewhere in the text.
Using the term “physics,” I also received departmental and thesis links. Most of
the documents were available to review or print, with the majority being in Adobe
Acrobat (PDF) format. As long as the website is up, access is presumably available,
so records have the potential of being available 24 hours a day, seven days a week.
What about security? How does DSpace handle security and also control access?
As with many systems based on a web interface, the developers gave security a great deal of
attention. DSpace was created for the UNIX operating system, and the primary code was
written in Java. All additional components are open source as well and common to the
web environment (for example, DSpace uses Apache as its web server engine, which
is one of the most common in the industry). By avoiding specialized packages and
relying on components that are readily available and robust, DSpace gives an organization
countless tools that can be used to protect its servers. Antivirus, anti-spam,
firewall and other software is available in many flavors for a UNIX system.
primary focus of security is allowing only authorized employees or researchers the ability
18
Bantin, 42.
19
MIT Libraries DSpace Website, “Search DSpace,” http://dspace.mit.edu/search (accessed April 2009).
to create, delete or update records. DSpace should be able to limit access in terms of
record manipulation and should also never present information that a user does not have
the necessary permission to receive.20 DSpace responds to these requirements with its
website, contributors are limited to IU departments and scholars (faculty, students and
Content they would like to distribute widely and preserve over the long-term,
A contact person to work with the IUScholarWorks Repository team to set up
and run the Community,
The Community/Collection structure that is best for the department or unit’s
content,
Metadata (descriptive cataloging information) and
20
Bantin, 45.
21
D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April
2009).
As a user who is not a member of any IU community, I was able to access ScholarWorks
and browse the collection based on community, collection, issue date, author, title or
subject. I was able to access a wealth of information, but it was only read access. I was
never given the chance to manipulate the records in any fashion. To do this, I would
have had to go through a formal process with the IU staff to get an IUScholarWorks
Repository Account.
preservation, the DSpace creators focused on two main digital types called “bit
preservation” and “functional preservation”.23 The first means the record is preserved
exactly as it was submitted (down to the actual bit count). Functional means that the
record will be changed to allow for changes in software and technology to ensure its
preservation (although each repository should also have a solid program of backups and
DSpace creators cannot see the future and predict the countless software updates that may
occur. An example is co-creator MIT itself. They plan to provide functional support for
well-known documented standards such as TIFF or XML, but not for rare or complicated
states they are “not equipped to support the archiving and/or accessibility of dynamic
22
Indiana University ScholarWorks Repository Website, “Getting Started with the IUScholarWorks
Repository,” https://scholarworks.iu.edu/docs/repository/gettingstarted.shtml (accessed April 2009).
23
D-Lib Magazine (January 2003), http://www.dlib.org/dlib/january03/smith/01smith.html (accessed April
2009).
24
Ibid.
resources like open web sites, interactive applications, files with complex metadata
Part of the process of retaining and disposing of records is the ability to make
backups and to have the system create reports when records are changed or
deleted. The DSpace Wiki lists in detail the means to restore a system using a full
backup as well as what must be done if you are using a different platform or operating
system version. Since DSpace uses PostgreSQL (an open source database management
system), an SQL dump created from a backup can be uploaded into a new PostgreSQL
instance to get the system back online.26 Since DSpace uses a number of servers for the
overall system, it is also advisable for the organization to have sufficient hardware to
possibly do mirroring (at least a matching server for each one in the system that is
updated as the main ones are updated). Unfortunately, without access to any sort of
administrator account or any relevant information on the DSpace Wiki, there are
few items to list regarding reporting. Several of the wiki documents mention reports or
statistics, so one can only assume that reporting does exist. In any case, there are statistical packages
that can be used with an Apache web server to show when files are added, changed or
deleted. This is also true for PostgreSQL environments. If DSpace does not have it built
Does DSpace meet the requirements for a true recordkeeping system? It has the
capability to capture records in a variety of digital formats. Recent updates have also
25
Indiana University ScholarWorks Repository Website, “IUScholarWorks Repository FAQ for
Submitters,” https://scholarworks.iu.edu/docs/repository/faq.shtml (accessed April 2009).
26
DSpace Wiki, “Backup and Restore,” http://wiki.dspace.org/index.php/BackupRestore (accessed April
2009).
added versioning. Its workflow system does provide a means for classification schemes
as well as ensure that only the right people have access to change records (by forming
standard, and the information is incorporated in the record (and is searchable). Security
is a strength, as is the use of accepted software standards like UNIX, Java and
Apache (for which several strong security packages exist). Reporting may or may not be
a problem, but additional packages can be purchased to expand this capability. The
one glaring weakness seems to be audit control, with the online documentation clearly
stating that records in the database are not audited easily. As mentioned before, there are
a number of organizations trying to find a third-party software solution to resolve this
How can DSpace be converted into a true recordkeeping system? What steps
must take place? What types of functionality must be added? For the institution willing
to take these steps, it would seem logical to investigate third-party solutions (and
potentially open source solutions) for the problems with audit control, reporting and, to a
lesser degree, security. For example, PostgreSQL has a report generator through its open
source graphical user interface pgaccess.27 Free with support from several PostgreSQL
listservs, this could possibly be converted to produce some of the needed reports for an
organization. The options for security with UNIX servers and Apache web applications
are so numerous that an organization should get its IT department involved to wallow
27
PostgreSQL Website, “User Client Questions,”
http://www.postgresql.org/files/documentation/books/aw_pgsql/node194.html (accessed May 2009).
could be altered to require more than the three mandatory fields of Dublin Core.
However, the Dublin Core has relatively few fields, with most involved with creation.
Solid recordkeeping metadata should include fields throughout the life of the record. In
The Dublin Core version used by DSpace covers roughly the first and second category
while leaving some significant gaps for the rest. Is it any wonder that most of the
standards listed in Understanding Data and Information Systems for Recordkeeping are
dramatically larger (such as the European model listed with 109 elements with 79 being
mandatory)?29
may require additional software to expand its functionality depending on the institution’s
need. However, price seems to always be a concern for most universities or other
organizations that might need a digital repository. In this, DSpace knocks down most of
its competitors, and makes many an archive or library think about implementing it (as
28
Bantin, 48.
29
Ibid., 49.
seen by the 334 organizations using it already). Open source products are sometimes a
concern to install due to the lack of technical support. DSpace seems to have overcome
this as well through the online communities that have formed to help one another. Is it perfect?
It is not, but there is probably not a “perfect” system out there. DSpace should be
Bibliography
Bantin, Philip C. Understanding Data and Information Systems for Recordkeeping. New
York: Neal-Schuman Publishers, 2008.