
Redeye: A Digital Library for Forensic Document Triage

Paul Logasa Bogen II, Amber McKenzie, Rob Gillen


Computational Data Analytics Group
Oak Ridge National Laboratory
Oak Ridge, TN
redev@ornl.gov
ABSTRACT
Forensic document analysis has become an important aspect of the investigation of many different kinds of crimes, from money laundering to fraud and from cybercrime to smuggling. The current workflow for analysts includes powerful tools, such as Palantir and Analyst's Notebook, for moving from evidence to actionable intelligence, and tools for finding documents among the millions of files on a hard disk, such as Forensic Toolkit (FTK). However, analysts often leave the process of sorting through collections of seized documents, to filter out noise from actual evidence, to highly labor-intensive manual efforts. This paper presents the Redeye Analysis Workbench, a tool to help analysts move from manually sorting a collection of documents to performing intelligent document triage over a digital library. We discuss the tools and techniques we build upon, in addition to an in-depth discussion of our tool and how it addresses two major use cases we observed analysts performing. Finally, we also include a new layout algorithm for radial graphs that is used to visualize clusters of documents in our system.

Categories and Subject Descriptors
H.3.7 [Digital Libraries]: Collection, Systems issues.
I.7.5 [Document Capture]: Document analysis.

Keywords
Redeye, Document Triage, Forensic Science

Copyright 2013 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the U.S. Government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
JCDL'13, July 22-26, 2013, Indianapolis, Indiana, USA.
Copyright ACM 978-1-4503-2077-1/13/07...$15.00.

1. INTRODUCTION
In late 2011, a team of analysts at a federal law enforcement agency was facing a problem: how do you manage the deluge of documents generated by an executed search warrant in a manner that helps the analysts make sense of the information, given that the bulk of the documents obtained are most likely noise? This is essentially a classic digital libraries problem of ingesting, processing, and storing documents in a manner that enables easy retrieval of documents by users. With this insight into the problem, we began work on a tool, the Redeye Analysis Workbench, that could bridge the gap in forensic tools between forensic data recovery tools, such as FTK, and case preparation tools, such as Analyst's Notebook.

The remainder of this paper is divided into seven parts. First, we will talk about the background and predecessor systems that Redeye is built from and the challenges that necessitated a new tool. This will be followed by a discussion of related work on document suggestion, document clustering, document triage, and collection management. Third, we will discuss the design of Redeye in detail, followed by a discussion of the problem of forensic digital libraries and document triage. Finally, we will present our conclusions and areas of future work.

2. BACKGROUND
Previously, the ICR has produced three general-purpose tools, Piranha, Raptor, and Distribute The Highest Selected Textual Recommendation (DTHSTR), for filtering a collection of documents down to the relevant ones. Our work on the Redeye Analysis Workbench is largely inspired by adapting and packaging these tools, in concert with existing open source text and forensic tools, into a single workbench for forensic science document triage tasks. This section will briefly discuss each of these tools.

2.1 Piranha
Piranha was motivated by the need to address the challenges most people face when sifting through a large amount of data for accurate and relevant information [1]. The approach the ICR team took to address this was to create software that can quickly filter, relate, and show documents and relationships based on textual similarity.

The result of this work is the ICR's software agent approach to text analysis, which uses a large number of agents distributed over very large computer clusters. This approach, combined with the ICR's corpus-based term weighting scheme (TF-ICF) [2], allows Piranha to meet the high performance requirements posed by large data sets.

Piranha provides two primary functions: finding similar documents and document sampling. Finding similar documents allows a user to select a document of interest and then quickly find other documents that are a close match to it. For example, you may have an e-mail message of interest and want to find other similar e-mails in your collection; Piranha provides a visualization based on textual similarity that allows quick determination of those similar documents. Second, document sampling takes advantage of the common themes and topics often found over a coherent set of documents. Since a set of documents will usually contain common themes or topics, we can quickly identify a smaller number of representative documents from the set and return those documents to the analyst for review.
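To make the corpus-based weighting concrete, the following is a minimal sketch of a TF-ICF-style weight computation; the exact formulation in [2] may differ, and the reference-corpus statistics shown here are invented for illustration.

```python
import math
from collections import Counter

def tficf_vector(doc_tokens, corpus_doc_freq, corpus_size):
    """Weight one document's terms with a TF-ICF-style scheme.

    Unlike TF-IDF, the inverse factor comes from a fixed reference corpus,
    so each incoming document can be weighted in a single pass without
    knowing the rest of the stream.
    """
    term_freq = Counter(doc_tokens)
    vector = {}
    for term, tf in term_freq.items():
        # +1 guards against terms that never occur in the reference corpus.
        icf = math.log(corpus_size / (1.0 + corpus_doc_freq.get(term, 0)))
        vector[term] = tf * icf
    return vector

# Invented reference-corpus statistics, purely for illustration.
reference_freq = {"money": 120000, "transfer": 45000, "shell": 9000}
weights = tficf_vector("wire transfer to offshore shell company".split(),
                       reference_freq, corpus_size=1000000)
```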
2.2 Raptor and DTHSTR
The other predecessor systems to the Redeye Analysis Workbench are Raptor and DTHSTR. They are knowledge-sharing tools that leverage the DTHSTR techniques in semi-supervised machine learning and unsupervised text analytics to enhance and motivate sharing [3]. They accomplish this by increasing the benefit of sharing personal documents through helping a user retrieve relevant documents from the collection. This is done by using information provided by the user to build a profile of what they are interested in, which is then used to match documents from the repository.
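As a rough illustration of the profile-based matching that Raptor and DTHSTR perform, the sketch below ranks repository documents against a user-interest profile with cosine similarity over term-weight vectors (for example, the TF-ICF vectors sketched above). The function names and data layout are assumptions, not the published implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_profile(profile_vector, repository):
    """Order repository documents by similarity to the user's interest profile.

    `repository` maps each document id to its precomputed term-weight vector.
    """
    scored = ((doc_id, cosine(profile_vector, vec))
              for doc_id, vec in repository.items())
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```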

3. RELATED WORK
Beyond the prior systems from which the Redeye Analysis Workbench descends, there is a variety of work that we have drawn from for inspiration and direction. In general these works fall into two categories: document triage and digital forensics.

3.1 Document Triage
The term document triage was originally coined in Bae et al.'s 2005 work on the subject [4]. In this work, the authors drew on their prior work in an area known as information triage. While information triage focused on the sorting and assessing of information in documents [5], document triage focuses on the evaluation of the merit and disposition of the document itself [4]. This transition from supporting work at the information level to the entire document level was motivated by results found in Shipman et al.'s 2004 study on personal collections in spatial hypertext, where the authors noticed that users were often not just organizing information inside the VKB spatial hypertext tool, but were often organizing the documents themselves [6]. This has led to further work that seeks to automatically organize documents in a spatial hypertext system based on the inferred interests and the characteristics of the documents with which users interact [7].

While document triage was born out of work on spatial hypertext, the concept of evaluating merit and disposition to aid in sorting and assessing documents is not limited to this area. Buchanan and Owen approached the problem of skim reading as a document triage problem [8] by identifying relevant information in the document and giving it a more pronounced presentation in a page's thumbnail. Similarly, document triage has also been applied to work on both large displays [4] and small displays [9].

Recently, Cantrell et al. have taken the concept of document triage and brought it into the realm of digital forensics [10]. While prior work had mainly focused on the interfaces for document triage [7] and processes of organizing documents [11], Cantrell et al. primarily focus on formally modeling the entire triage process [10]. Their formal model includes seven stages.

First is planning and readiness, which is focused on training forensic staff in tools and techniques. The second is the live forensics phase, which encompasses making information from a target machine available, whether it is on an encrypted disk, in volatile memory, or just hidden in some obscure folder in the file system; it includes work on hardware write blockers as well as the software tools we will discuss in the next section. The third phase is the computer profile. This is the phase the authors identify as being poorly supported. In this phase, files are actually copied off and information about the imaged machine is stored. At this point, Cantrell et al. suggest a triage should occur to reduce the number of noise documents that go on to later stages. The fourth stage is the crime potential phase, in which the analyst applies their domain knowledge of a case to build a story from the data. Next, this story is presented in the fifth phase. This presentation is then used to further analyze the information in stage six, the triage examination phase. Finally, the seventh stage is the preservation stage, where evidence is formally collected, along with analysis, in a manner that ensures the chain of evidence is maintained.

While tools currently exist for many of these stages, which will be discussed in the following section, the computer profile and crime potential phases lack dedicated tools. At these stages we see an opportunity for document triage with a focus on helping an analyst make sense of documents they have already processed through the first two stages.

3.2 Analysis Tools and Digital Forensics
Because detailing the breadth of analysis and digital forensics applications is out of the scope of this paper, we will focus on two of the more common technologies available, Palantir and Analyst's Notebook, and also look at the area of digital archiving, in which digital forensics tools are being implemented.

Arguably one of the frontrunners in digital forensics applications is the Palantir platform, targeted at enabling analysts to discover relevant information within large, heterogeneous data sets [12]. Palantir comprises three parts: a data platform, an application platform, and a compute platform. The data platform is geared toward ingesting both structured and unstructured data for use during the analysis process. The application platform is designed to incorporate a number of data analysis and visualization tools, with an API that allows developers to create their own applications or tools within Palantir. Implementing a MapReduce framework, the Palantir Compute Platform facilitates running Palantir applications on HPC clusters to ensure that computationally heavy text analysis can be accomplished without lengthy lag times. The focus of the Palantir architecture is openness and flexibility, allowing the design and development of custom applications using the provided framework, but it lacks sufficient support at the data ingestion stage for adapting unstructured data for optimal use during analysis.

One of the more widely used platforms among Department of Defense organizations, IBM's Analyst's Notebook highlights four aspects of its platform: data acquisition, data modeling, analytic capabilities, and communication of complex data [13]. Analyst's Notebook's flexible data acquisition capabilities involve drag-and-drop functionality, volume structured data importing, and integration with other IBM analytical products [14]. The built-in flexible data model minimizes constraints on the representation of complex relational data in order to facilitate analysis and visualization.

These applications and platforms suffer from the same limitation in that they all assume that the input data is relevant and useful. This relies on the analyst to expend the effort needed to determine which data to input, or else risk inundating the system with superfluous data that the analysts have no interest in examining.

Taking a different approach, a separate body of research has aimed at utilizing techniques and applications from digital forensics to improve digital media archival ingest and preservation workflows. J. John targets the use of existing technologies in his approach to the capture, preservation, and access of personal digital archives [15]. Computer forensics applications are one of three categories of applications that he investigates, due to the area's overlapping requirements of evidence preservation and analysis repeatability. He notes that no single forensics product meets all the requirements of a forensic analyst, resulting in the number of competing products available. He also notes two limitations of computer forensics technologies: (1) the limited applicability of these applications to legacy or obsolete systems and media, and (2) the inability of these technologies to recreate the initial state and layout of the data.
Figure 1 - Search View

4. USE CASES
As shown in our related work, there is a gap in the workflow of forensic document analysts between their document recovery and cataloging tools and their analysis tools. Currently this gap is filled with extensive manpower that manually filters the uninteresting from the interesting documents for further analysis. This is where document triage work has the potential to improve the workflow of analysts. This section will discuss two use cases we observed and how a document triage system improves that workflow.

4.1 Actor Search
In the sponsor's previous workflow, a major portion of an analyst's time was spent manually searching through documents for information about entities and actors. This information could be geographical data, names, or contact information. This information was then collated to identify aliases and assemble profiles of the actors involved in the case. The process as it previously stood is described with the following narrative:

Julius and Dan are analysts working for a federal law enforcement agency. They are currently working on a case with suspected money laundering aspects. In order to pursue this case, the agency has obtained a warrant and conducted a search of the homes and offices of a suspected money launderer. This search yielded tens of thousands of documents, both scanned from paper originals and retrieved using FTK from hard disk images taken during the search. These documents are then indexed into a database that allows users to browse through the files one at a time. Julius and Dan divide the set of documents between them and then begin to look through each one. If Microsoft Office or other pre-installed software cannot open a document, the file is passed by. If a document is not in English, they determine based on appearance whether it is important enough to send to a translator. Scanned documents are simple TIF files in which the text cannot be searched once opened. As relevant material is found, the locations of the original files on the disk image are tracked so the files can later be copied into a folder for import into Analyst's Notebook, where further analysis can then occur.

As the narrative shows, the analysts' job was fraught with manual tasks that could be automated. By contrast, the following narrative shows the workflow we imagined when using Redeye:

Julius and Dan have now moved to using Redeye as a document triage tool. When documents are acquired under a warrant, the Redeye Ingestion tool is directed to their location and begins to process the documents. Instead of splitting the documents amongst themselves and looking at all of them, Julius and Dan now interact with the collection via search, looking for relevant names, places, or other keywords. Relevant documents are saved in a sub-collection that can later be exported with one click and are ready for import into Analyst's Notebook. Scanned documents are also now searchable due to the integration of Optical Character Recognition in the ingestion pipeline. If foreign language documents are found, the system provides a rough machine translation that informs the analysts whether they should send them to a linguist. Tasks that took Julius and Dan longer than a year can now be accomplished in weeks or days. Comments can now be stored on documents, allowing Julius and Dan to easily see each other's insights when documents are re-examined.

Using Redeye, the analysts' work has become markedly more efficient, and interactions between analysts occur with greater frequency. In general, Redeye has altered the nature of their work from sifting through documents to searching for entities. Analysts are now free to ask more questions and revisit documents. Additionally, since searching can be performed across cases, they can build links between cases, something that was difficult-to-impossible previously.
Figure 2 - Advisor View

4.2 Term Matching
The second observed activity in the prior workflow was term matching. The term matching task involves manually scanning through documents looking for relevant keywords from a list of terms. Unlike searching, the goal is not to find one, all, or none, but to find documents that match several critical phrases. This process is used to filter out the potentially relevant documents. It can be imagined with the following narrative:

Julius and Dan are provided a several-page list of important terms for the case at hand from a forensic accountant. The list includes terms that may indicate money laundering, divided into categories, as well as names, organizations, and locations of interest. None of the terms by themselves are indicative, nor is the full, multi-page list of terms likely to be found in a single document. Instead, Julius and Dan browse through documents looking for documents that contain several of the phrases. The location of these documents is recorded, and later the documents are copied off to a folder for further analysis in Analyst's Notebook. Due to the inherently poorly defined nature of the human task of determining what a sufficient number of terms is, the process is not easily reproducible, nor is it comprehensive, as it relies on Julius and Dan not overlooking particular documents or missing terms in a document.

Much like the previous search case, the term matching case is fraught with automatable processes. The use case we envisioned for automating this process draws from the prior work on Raptor.

Julius and Dan are provided a list of key terms in categories from a forensic accountant; however, instead of manually looking through documents for these terms, Julius and Dan divide the list by category and load their term sets into the Redeye Analysis Workbench. After several minutes of waiting, the Workbench produces a set of result lists, each corresponding to a category of terms. The documents in each set are ordered according to the similarity of the document to the terms in the category it matched against.

By providing these analytical tools to help filter down a large, noisy set of documents to a more manageable size, we enable analysts to more quickly triage documents into sets that are relevant to the case and sets that are not. Thus, we believe that we have helped bridge the gap between the document acquisition phase and the evidence-reporting phase.

5. SYSTEM DESIGN
Redeye is a combination of three major components: an ingestion pipeline, an Apache SOLR repository [20], and an Eclipse RCP-based user tool [16]. Each component will be covered in this section.

5.1 Ingestion
Each case processed presents different needs and interests. This leads to a design wherein the ingestion process is broken up into a series of steps that can be arranged via a pipeline to match the specific requirements of each case. This pipeline is loosely coupled and backed by a temporary state management store implemented using MongoDB [17]. The objectives of the ingestion process are to extract all of the relevant metadata from files and to perform pre-processing to aid performance during the analysis stages. Once this process is complete, the analysis tools do not need to interact with the original files other than to display them if requested.

Some examples of the pipeline steps include: tree building, text extraction, entity extraction, metadata extraction, duplicate file consolidation, compressed file extraction, machine translation, tokenization, and population into SOLR. With the exception of tree building, which is a recursive crawl of the raw data that builds the initial tree structure in the state management store, the majority of the stages can be ordered, included/excluded, re-run, etc. as desired. The pipeline is exposed via a simple graphical user interface that serves as a harness by which the analyst can control which stages are run, in what order, and how many times.
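A minimal sketch of such a loosely coupled pipeline, with MongoDB as the temporary state management store: each stage reads and updates per-file records, so stages can be included, excluded, reordered, or re-run independently. The stage functions, database names, and field names here are hypothetical stand-ins for Redeye's unpublished schema.

```python
import re
from pymongo import MongoClient

# Hypothetical state store layout: one record per crawled file.
state = MongoClient("mongodb://localhost:27017")["ingest_state"]["files"]

def text_extraction(record):
    # Placeholder for the real extractor (Redeye uses the IFilter system).
    with open(record["path"], errors="ignore") as fh:
        record["text"] = fh.read()
    return record

def entity_extraction(record):
    # Placeholder: the real pipeline applies a set of regular expressions.
    record["entities"] = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+",
                                    record.get("text", ""))
    return record

def run_pipeline(stages, case_id):
    """Run the selected stages, in the chosen order, over every file record."""
    for stage in stages:
        for record in state.find({"case": case_id}):
            state.replace_one({"_id": record["_id"]}, stage(record))

# Example: run only these two stages, in this order.
# run_pipeline([text_extraction, entity_extraction], case_id="case-2013-001")
```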
Figure 3 - Advanced Search Dialog

While most of the stages mentioned are self-explanatory, the text extraction stage warrants some additional detail. Previous home-grown iterations of the toolset utilized either custom text extraction techniques or a combination of open source text extraction libraries such as Apache's POI project [18]. While these approaches had some success, our results were disappointing when faced with the variety of the datasets our sponsor needed to process. Eventually, we elected to utilize the Microsoft IFilter plugin system that is part of the Windows Desktop Search platform [19]. The IFilter system provides a single API to interface with a wide range of file formats, providing a simple way to create format-agnostic text and metadata extraction tools. While the experience has not been free of problems, the number of documents that fail extraction has dropped significantly (failure rates average less than 2% of supported files, compared to 30% with our prior infrastructure). Additionally, the number of file types from which we are able to extract text is far greater than with our previous tool sets.

5.2 Repository
Apache SOLR, a full text search server built on the Apache Lucene search engine [20], is used as the repository and index for ingested data. This collection of data is then made available for searching from within the workbench.

SOLR provides a myriad of possible options and customizations to produce a full-fledged search application. For the Redeye Workbench, extracted textual content and metadata are indexed for each ingested document and made available to be searched by the user. SOLR also maintains information on comments made by analysts and on whether documents are flagged or hidden. SOLR's faceted search capabilities are employed to allow the user to make more detailed searches on select portions of the collection. Another SOLR feature, integrated within the workbench, is the terms component, which provides a list of the top terms within the data set along with their frequency. The terms component is used to implement the autosuggest feature that takes the characters typed into the search bar and provides a drop-down box with suggested completions. Through SOLR's multicore option, the data set is broken up into different cores. This allows us to scale SOLR horizontally as the collection of documents grows. An empty core with a merged schema solves the problem of searching across a number of cores with different properties.

5.3 Workbench
The workbench is the user interface for analysts in Redeye. It is built in the Eclipse RCP framework [16] around three major tasks. Each task is supported with a perspective containing one or more views. The three tasks, as mentioned in section four, are search, advisor, and analysis. Search and Advisor each have a single view, while analysis has three views: the List View, the Graph Visualization, and the Detail View. Each of the three analysis views represents an alternative way of looking at relevant documents identified in the search and advisor perspectives.

5.3.1 General Features
Throughout the various views in the Redeye Analysis Workbench, there are a number of common panes that can be seen in the screenshots. These include the Data Sets view, the Scratch Pad, and the console view. Data Sets is the staging ground between the pre-triage search stage and the main triage stage. Data sets contain a list of each set of documents for further examination. These sets can be generated from either the search view or the advisor view. The scratch pad is a place for the analyst to organize the results of interest to be exported out of the Redeye tool for import into an analysis tool like Analyst's Notebook. Finally, the console view provides a detailed look into the background processes of the system and provides status updates for long-running processes. Additionally, debugging information for reporting errors to the development team is displayed in the console view.

5.3.2 Search View
The initial view a user interacts with allows a traditional search of the data. The search interface can be seen in Figure 1. The view represents an initial pre-triage stage where an analyst can select which documents are relevant for further analysis using the summary of the document. Searches can be as simple as a list of optional terms or as complex as Boolean logic with constraints on the fields to search.

We also support filtering of results based on facets. These facets are drawn from a variety of metadata associated with documents, such as file type and document collection.

In addition to the standard search interface, we also provide an advanced search dialog that allows additional complex querying to be performed simply. Options in the advanced search dialog include searching by date range, searching for hidden documents (either exclusively or inclusively with unhidden results), and searching for documents that are flagged or have comments attached to them. A screenshot showing the use of the advanced search dialog can be seen in Figure 2.

Once documents are retrieved, we support two different mechanisms for saving them to a document set. First, a user can select the save button, which either saves all checked documents or prompts the user for a number of documents to save in order of relevance to the search query. Alternatively, documents can be dragged to either the scratch pad or the data sets view.
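The repository and search view described above map onto standard Solr request parameters. The sketch below issues a fielded Boolean query with a date range and facet counts, and uses the terms component for autosuggest; the core name and the field names (content, created, flagged, hidden, filetype, collection) are assumptions, not Redeye's actual schema.

```python
import requests

SOLR = "http://localhost:8983/solr/redeye"   # hypothetical core name

# A fielded Boolean query with a date range and facet counts, in the spirit
# of the advanced search dialog.
results = requests.get(f"{SOLR}/select", params={
    "q": 'content:("wire transfer" AND offshore) '
         'AND created:[2010-01-01T00:00:00Z TO 2012-12-31T23:59:59Z] '
         'AND flagged:true',
    "fq": "-hidden:true",                    # filter query: exclude hidden documents
    "facet": "true",
    "facet.field": ["filetype", "collection"],
    "rows": 50,
    "wt": "json",
}).json()

# The terms component backs the autosuggest drop-down: top completions for
# the characters typed so far, along with their frequencies.
suggestions = requests.get(f"{SOLR}/terms", params={
    "terms": "true",
    "terms.fl": "content",
    "terms.prefix": "laun",
    "terms.limit": 10,
    "wt": "json",
}).json()
```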
Figure 4 - Analysis View

5.3.3 Advisor View
In addition to the search view, the Redeye tool also provides the advisor view. The advisor view (Figure 3) is the implementation of the Raptor and DTHSTR techniques mentioned in our prior work.

The advisor view provides a large text entry box that can be populated by the user through direct entry, by copying and pasting from another application, or by loading a text file already on the user's machine. The user can then control the number of documents returned per set with a slider from 1 to the number of coarse-selected documents. This number of coarse-selected documents is adjustable in the system properties but is considered an advanced feature that most users would not normally use.

Finally, the user can push the build result sets button, and the system begins processing the term list and comparing the terms to documents in the SOLR repository. The process the advisor performs is as follows. First, the input is split on empty lines to form sets of category documents; the first line in each category becomes the title of the category, and the remaining lines become individual terms. Using a predefined threshold, the terms to be searched are evaluated on their discriminatory value and are either kept or trimmed from the search. Using these trimmed terms, the SOLR repository is searched for all documents that match one or more terms. The results are ranked by textual similarity to the search terms, and the top documents, up to a certain number (the coarse selection number set in the system preferences), are kept. Second, for each document retrieved by the coarse filter, the TF-ICF weighted term vector is compared to the TF-ICF weighted term vector of the category using a standard cosine similarity metric. This provides a ranking of documents that is cut off after a certain number of documents (the fine selection number set by the slider). Each of the resulting document sets is then saved to data sets for further analysis in the system.

5.3.4 List View
Once pre-triage filtering has been conducted, either using the search or the advisor views, the user can then switch to the list view. The list view, as shown in Figure 4, consists of five major components: the list, properties, entities, comments, and the timeline.

The list shows the main information about documents in the selected document set. In order to simplify the information presented, the list is paginated into sets of a user-specified size. Each document shown has a title, a number of status indicators, a snippet from the content of the document, and the path to the original document. The status indicators help show various information about the document: whether the original document is available, if the document has been flagged as important, if there are any comments about the document and, if hidden documents are set to be shown, if the document has been hidden.

Once a document is selected, a number of views are populated with information about that document. The first of them is the properties view. The properties view shows metadata extracted about the file itself, such as when it was created, whether it has text, and whether it was encrypted. If the file is an email, properties from the email header are displayed in addition to properties about the file itself; these include subject, sender, recipient, and date sent. The entities view, in contrast to the properties view, shows metadata about the document instead of the file. This is currently restricted to entities extracted during ingestion using a number of regular expressions. Entities include phone numbers and email addresses, among others.

The last view is the timeline view. The timeline view is a customized instance of the SIMILE JavaScript timeline inside an embedded web browser [21]. The timeline is hosted via an embedded instance of the Jetty web server. Additionally, two J2EE web services provide both the JSON dataset to populate the timeline and a callback feature to feed information about document selection from the timeline back into the Redeye Workbench.
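Returning to the advisor process of Section 5.3.3, the sketch below shows the category parsing and the coarse-then-fine selection it describes. The thresholds, limits, and the injected search and weighting helpers are assumptions rather than Redeye's actual values.

```python
def parse_categories(raw_text):
    """Split the advisor text box input on blank lines; the first line of each
    block is the category title, the remaining lines are its terms."""
    categories = {}
    for block in raw_text.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if lines:
            categories[lines[0]] = lines[1:]
    return categories

def build_result_set(terms, search, weight_vector, cosine,
                     coarse_limit=500, fine_limit=50, min_discrimination=0.1):
    """Coarse-then-fine selection for one category.

    `search(query, rows)` queries the SOLR repository, `weight_vector(tokens)`
    returns a TF-ICF-style term-weight dict, and `cosine` compares two such
    dicts; the limits and the discrimination threshold are hypothetical.
    """
    # Trim terms whose discriminatory value falls below the threshold.
    kept = [t for t in terms
            if weight_vector([t]).get(t, 0.0) >= min_discrimination]
    # Coarse pass: keep the top documents that match one or more terms.
    coarse = search(" OR ".join(kept), rows=coarse_limit)
    # Fine pass: rank by cosine similarity of document and category vectors.
    category_vec = weight_vector(kept)
    ranked = sorted(coarse,
                    key=lambda doc: cosine(weight_vector(doc["tokens"]),
                                           category_vec),
                    reverse=True)
    return ranked[:fine_limit]
```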
Figure 5 - Cluster Visualization

In addition to this view of the data, we also provide two additional views: one for viewing clusters of documents, and another for delving into the terms and entities in detail. Next we will discuss the cluster visualization and then the detail view.

5.3.5 Cluster Visualization
One of the major features of one of the predecessor tools of the Redeye Workbench, Piranha, is a hierarchical clustering of pages based on the top terms and phrases within the document [1]. This was presented as a radial graph. This same technique has been adapted into Redeye, along with a new implementation of the cluster visualization. Unlike the visualization in Piranha, the Redeye tool allows smooth zooming and panning without pauses between actions. This allows a more natural investigation of the graph [22]. Analysts, as part of their training, must verify the importance of every document to a case. Traditionally this meant wading individually through millions of documents for a few relevant ones. The cluster visualization allows the analyst to still look at each one, but in a more intelligent manner. Instead of looking at each document in depth, they can first look at the clusters to triage which sets of documents are more likely to contain pertinent information. This enables the analyst not only to prioritize their search for relevant information, but also gives them a means, by viewing the top terms of a document or a cluster, to more rapidly perform an exhaustive search, as the top terms help inform the analyst of how extensively they need to review documents in a cluster. An example of this view can be seen in Figure 5.

We have devised a novel graph layout algorithm around the desired properties of the graph. The properties we wish to guarantee are that all document nodes are on the same radius of the graph and that the root of the tree is located in the center. Additionally, we sought to minimize label overlap and edge length without relying on a dynamic layout algorithm, such as force atlas [23]. Our solution is to flip the problem of laying out a graph upside down and lay out the leaves first, then walk up the tree. This differs from other layouts that start from the root and work their way down to the leaves [24].

In order to perform this, we first need to know the leaves in order of their height. This is done by first walking the tree using a post-order traversal and storing each node in a stack. After this is complete, we iterate through the stack, setting the height of each node to either zero or the maximum height of the node's children plus one. These height-labeled nodes are placed on another stack that is then iterated through to calculate the angle of the node on its level.

If a node has no children, the angle of the node is computed from n, the number of leaf nodes; i, the count of leaf nodes already laid out; h, the height of the current node; and H, the maximum height of all nodes. If a node does have children, its angle is equal to the average angle of its children.
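A compact sketch of this leaves-first layout: a child-before-parent walk assigns heights, leaves are placed on the outer radius, and parents take the average angle of their children. The evenly spaced leaf angles and the linear radius falloff are simplifying assumptions; the published angle formula also involves the node height, and the radius uses the beta-value falloff discussed in the next paragraph.

```python
import math

class Node:
    def __init__(self, children=None):
        self.children = children or []
        self.height = 0       # 0 for leaves, max(child heights) + 1 otherwise
        self.theta = 0.0
        self.radius = 0.0

def radial_layout(root, max_pixels):
    # Walk the tree so that every child is visited before its parent.
    stack, order = [root], []
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(node.children)
    order.reverse()

    for node in order:                        # assign heights, leaves first
        if node.children:
            node.height = max(c.height for c in node.children) + 1
    H = max(n.height for n in order) or 1
    n_leaves = sum(1 for n in order if not n.children)

    laid_out = 0
    for node in order:
        if not node.children:
            # Assumption: leaves evenly spaced around the circle; the published
            # formula also involves the node height h and maximum height H.
            node.theta = 2 * math.pi * laid_out / n_leaves
            laid_out += 1
        else:
            node.theta = sum(c.theta for c in node.children) / len(node.children)
        # Assumption: linear falloff toward the center in place of the
        # beta-value radius (alpha = 1, beta = 2); the root lands at r = 0
        # and all leaves share the outer radius.
        node.radius = max_pixels * (1 - node.height / H)
        node.x = node.radius * math.cos(node.theta)
        node.y = node.radius * math.sin(node.theta)
    return order
```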
Finally, we calculate a radius for each node in the graph. The radius is computed from d, the maximum distance in pixels that a node can be from the center of the graph, and from B(h, α, β), the beta value at h with tuning parameters α and β [25]. We have experimentally determined satisfactory values of α = 1 and β = 2. Once we have obtained a (θ, r) value for each node, we use standard trigonometric formulas to convert it to an (x, y) position in the window.

Interactions with the graph view are fully integrated into the workbench. Selected documents are highlighted in red, and mousing over a node changes the selection. Double clicking on a node opens the original document in an external viewer.

Figure 6 - Detail View

5.3.6 Detail View
The final view in the Redeye Analysis Workbench is the detail view. This view flips the notion of documents with terms upside down and instead presents terms, phrases, and entities with the documents that contain them. This view, like the cluster visualization, was inherited in part from Piranha [1]. The view in Redeye has been refined and focused to present only information the analysts deemed useful upon evaluating Piranha for their needs.

The view consists of four main areas: the document source, the Top tabs, the All tabs, and the document display. The document source contains a list of the various collections the current document set has documents from. This allows the detail view to be restricted to results from a particular collection.

The Top and All tabs are similar in that they each consist of three tabs, each dedicated to words, phrases, or entities. The primary difference between the two is that Top only displays the top 500 of each, in order of importance in the collection as measured by a TF-ICF weighted term count, while All displays every word/phrase/entity in alphanumeric order.

The document display is populated based on the current selection in the two tab groups. By selecting a term/phrase/entity, all documents that contain that term/phrase/entity can be browsed using the up and down arrows above the document display. If a single document is selected, as Figure 6 shows, then just that single document is available in the document display. Lastly, the term/phrase/entity that the document is being viewed as part of is highlighted in the document to allow the analyst to quickly find the context in which the term is contained.

6. CONCLUSIONS
The gap in the forensic pipeline between file collection and the analysis of actionable intelligence presents a challenge in terms of manpower, time to completion, and completeness of analysis. In order to bridge this gap, we identified two primary use cases that are currently fully manual operations, actor searching and term matching, and designed the Redeye Analysis Workbench to automate the analysts' time-consuming tasks. Our solution has been deployed to our sponsor and is currently being used in active investigations. The response to our tool has been highly positive, and we are currently evaluating further use cases in which we would be able to extend Redeye to help reduce the demands of document pre-processing on the analyst. We are currently considering four possible areas of future work.
Our sponsor has indicated a strong interest in semi-automatic generation of social network graphs from documents. This area of work would extend existing work on social network extraction to help provide a visual and analyst-updatable representation of the graph of actors involved in an investigation.

Another area is alias resolution. With the kinds of cases the sponsor deals in, individuals may use several alternate names, companies may have various shell names, and both may use more than one email address, phone number, or mailing address. By analyzing clues in documents, the analysts currently resolve several virtual identities into one physical identity or organization. Providing tools and techniques to speed this process could provide a significant speed-up in the analysis of documents involved in a case.

Furthermore, there may be value in investigating visualization of the document collection in bulk. This kind of visualization could help analysts rapidly narrow down documents of interest or potentially allow them to spot abnormal documents in a collection that may contain important facts about a case.

Finally, we would like to apply more advanced named entity extraction techniques to help identify actors, locations, addresses, and other important pieces of information contained in documents. Due to the changing domain of cases and the often highly technical nature of the domains, traditional corpus-based named entity recognition (NER) techniques may not prove adequate to provide sufficiently high levels of performance across multiple cases. Instead, we are looking into developing novel semi-supervised and unsupervised NER techniques to address these concerns when dealing with technical, cross-domain collections.

In conclusion, we have identified a need for support in forensic document analysis, examined an existing workflow in use by a law enforcement agency, singled out two use cases where computer support could improve that workflow, and designed a digital-library-inspired tool for managing their document collections. We have shown that a digital libraries approach to forensic document analysis meets the analysts' needs, and we have provided a direction for further work in this area.

7. ACKNOWLEDGEMENTS
This document was prepared by Oak Ridge National Laboratory, P.O. Box 2008, Oak Ridge, Tennessee 37831-6285; managed by UT-Battelle, LLC, for the US Department of Energy under contract number DE-AC05-00OR22725.

This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the U.S. Department of Energy. The United States Government retains and the publisher, by accepting the article for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.

8. REFERENCES
[1] J. W. Reed, T. E. Potok, and R. M. Patton, "A multi-agent system for distributed cluster analysis," in Proceedings of the Third International Workshop on Software Engineering for Large-Scale Multi-Agent Systems (SELMAS'04), in conjunction with the 26th International Conference on Software Engineering, Edinburgh, Scotland, UK: IEE, 2004, pp. 152-155.
[2] J. W. Reed, Y. Jiao, T. E. Potok, B. A. Klump, M. T. Elmore, and A. R. Hurson, "TF-ICF: A new term weighting scheme for clustering dynamic data streams," in Machine Learning and Applications, 2006. ICMLA'06. 5th International Conference on, 2006, pp. 258-263.
[3] R. M. Patton, W. McNair, C. T. Symons, J. N. Treadwell, and T. E. Potok, "A Text Analysis Approach to Motivate Knowledge Sharing via Microsoft SharePoint," in System Science (HICSS), 2012 45th Hawaii International Conference on, 2012, pp. 3670-3678.
[4] S. Bae, R. Badi, K. Meintanis, J. Moore, A. Zacchi, H. Hsieh, C. Marshall, and F. Shipman, "Effects of display configurations on document triage," Human-Computer Interaction-INTERACT 2005, pp. 130-143, 2005.
[5] C. C. Marshall and F. M. Shipman III, "Spatial hypertext and the practice of information triage," in Proceedings of the eighth ACM conference on Hypertext, 1997, pp. 124-133.
[6] F. M. Shipman, H. Hsieh, J. M. Moore, and A. Zacchi, "Supporting personal collections across digital libraries in spatial hypertext," in Proceedings of the 4th ACM/IEEE-CS joint conference on Digital libraries, 2004, pp. 358-367.
[7] S. Bae, D. H. Kim, K. Meintanis, J. M. Moore, A. Zacchi, F. Shipman, H. Hsieh, and C. C. Marshall, "Supporting document triage via annotation-based multi-application visualizations," in Proceedings of the 10th annual joint conference on Digital libraries, 2010, pp. 177-186.
[8] G. Buchanan and T. Owen, "Improving skim reading for document triage," in Proceedings of the second international symposium on Information interaction in context, 2008, pp. 83-88.
[9] F. Loizides and G. R. Buchanan, "Performing document triage on small screen devices. Part 1: structured documents," in Proceedings of the third symposium on Information interaction in context, 2010, pp. 341-346.
[10] G. Cantrell, D. Dampier, Y. S. Dandass, N. Niu, and C. Bogen, "Research toward a Partially-Automated, and Crime Specific Digital Triage Process Model," Computer and Information Science, vol. 5, p. 29, 2012.
[11] G. Buchanan, "Rapid document navigation for information triage support," in Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, 2007, pp. 503-503.
[12] J. Payne, J. Solomon, R. Sankar, and B. McGrew, "Grand challenge award: Interactive visual analytics Palantir: The future of analysis," in Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on, 2008, pp. 201-202.
[13] IBM Corporation. (2012). IBM i2 Analyst's Notebook datasheet. Available: http://public.dhe.ibm.com/common/ssi/ecm/en/zzd03127usen/ZZD03127USEN.PDF
[14] i2 Limited, "i2 Analyst's Notebook 8: product overview," i2 Limited, 2009.
[15] J. L. John, "Adapting existing technologies for digitally archiving personal lives," iPRES 2008, p. 48, 2008.
[16] D. Rubel, "The heart of eclipse," Queue, vol. 4, pp. 36-44, 2006.
[17] 10gen Incorporated. (2013). mongoDB. Available: http://www.mongodb.org/
[18] Apache Software Foundation. (2012). Apache POI - the Java API for Microsoft Documents. Available: http://poi.apache.org
[19] Microsoft Corporation. (2012). IFilter interface. Available: http://msdn.microsoft.com/en-us/library/ms691105(v=vs.85).aspx
[20] Apache Software Foundation. (2012). Apache Solr. Available: http://lucene.apache.org/solr/
[21] Massachusetts Institute of Technology. (2009). SIMILE Widgets Timeline. Available: http://www.simile-widgets.org/timeline/
[22] I. Herman, G. Melançon, and M. S. Marshall, "Graph visualization and navigation in information visualization: A survey," Visualization and Computer Graphics, IEEE Transactions on, vol. 6, pp. 24-43, 2000.
[23] M. Bastian, S. Heymann, and M. Jacomy, "Gephi: An Open Source Software for Exploring and Manipulating Networks," 2009.
[24] G. M. Draper, Y. Livnat, and R. F. Riesenfeld, "A survey of radial methods for information visualization," Visualization and Computer Graphics, IEEE Transactions on, vol. 15, pp. 759-776, 2009.
[25] K. Pearson, "Mathematical Contributions to the Theory of Evolution. XIX. Second Supplement to a Memoir on Skew Variation," Philosophical Transactions of the Royal Society of London. Series A, vol. 216, pp. 429-457, 1916.