
Indexing structure

Text Collections and IR


• Large collections of documents from various sources:
news articles, research papers, books, digital libraries,
Web pages, etc.
Sample Statistics of Text Collections
• Dialog:
–claims to have more than 15 terabytes of data in >600
Databases, > 800 million unique records
• LEXIS/NEXIS:
–claims 7 terabytes, 1.7 billion documents, 1.5 million
subscribers, 11,400 databases; >200,000 searches per day;
9 mainframes, 300 Unix servers, 200 NT servers
• Web Search Engines:
–Google claims to index over 1.5 billion pages
• TREC collections:
–total of about 5 gigabytes of text
Dialog
• Providing more than 15 terabytes of content from the
world's most authoritative publishers, and the tools
to search every bit of it with speed and precision.
• The company is founded on the idea that information
matters—that it really can make a difference in the
world—or your corner of it.
• Dialog will vastly improve your ability to access and
distribute relevant information effectively and
efficiently.
• http://www.dialog.com/
– the site contains all the information and materials you
need to get the most from Dialog and DataStar.
LexisNexis (http://www.lexisnexis.com/)
• is a popular searchable archive of content from
newspapers, magazines, legal documents and other
printed sources.
• LexisNexis claims to be the "world’s largest
collection of public records, unpublished opinions,
forms, legal, news, and business information" while
offering their products to a wide range of
professionals in the legal, risk management, law
enforcement, accounting and academic markets.
– Typical customers of LexisNexis include lawyers, law
students, journalists, and academics.
• LexisNexis is divided into two sites that require
separate subscriptions:
– www.lexis.com is intended for legal research,
– www.nexis.com is intended for journalism research.
Web Search Engines
•There are more than 2,000 general web search engines. The big
four are Google, Yahoo!, Live Search, and Ask.
–Scientific search of web and selected journals: Scirus, About
–Meta search engine: Search.com, Searchhippo, Searchthe.net,
Windseek, Web-search, Webcrawler, Mamma, Ixquick, AllPlus, Fazzle,
Jux2
–Multimedia search engine: Blinkx
• Visual search engine: Ujiko, Web Brain, RedZee, Kartoo, Mooter
• Audio/sound search engine: Feedster, Findsounds
• video search engine: YouTube, Trooker
–Medical search engine: Search Medica, Healia, Omnimedicalsearch,
–Index/Directory: Sunsteam, Supercrawler, Thunderstone, Thenet1,
Webworldindex, Smartlinks, Whatusee, Re-quest, DMOZ,
Searchtheweb
• Index based: Abcsearchengine, Galaxy, Linkopedia, Beaucoup, Illumirate,
Infoservice, Buzzle
–Others: Lycos, Excite, Altavista, AOL Search, Intute, Accoona, Jayde,
Hotbot, InfoMine, Slider, Selectsurf, Questfinder, Kazazz, Answers,
Factbites, Alltheweb
•There are also Virtual Libraries: Pinakes, WWW Virtual Library,
Digital-librarian, Librarians Internet Index
The Text REtrieval Conf. (TREC)
•TREC was started in 1992. Its goal is to develop an evaluation
methodology for Terabyte-scale document collections.
–The size of the test data reached several GBs of text & millions of documents.
–The TREC test collections & evaluation software are available to all
researchers in IR, so as to evaluate their own retrieval systems at any time
•For each TREC, a test set of documents and questions is
provided.
–Participants run their own IR systems on the data, and return to TREC a
list of the retrieved top-ranked documents.
–TREC pools the individual results, judges the retrieved documents for
correctness, and evaluates the results.
–The TREC cycle ends with a workshop that enables participants to share
their experiences.
•The number of participating systems & tasks in TREC has
grown each year. For instance,
–93 groups representing 22 countries participated in TREC 2003.
•TREC has also sponsored the first large-scale evaluations of
–the retrieval of non-English (Spanish, Chinese, …) documents, retrieval of
recordings of speech, and retrieval across multiple languages.
–TREC has also introduced evaluations for open-domain question
answering and content-based retrieval of digital video.
Document corpus
•The corpus may be:
•Primary documents: e.g., books, journal articles or Web
pages.
•Surrogates: a representation of a document such as the title,
author, subject, and a short summary. e.g., catalog records or
abstracts, which refer to the primary documents.
• Surrogates are commonly used to display the answers to a user query.
•The storage of the documents may be:
•Central (monolithic) - all documents stored together on a
single server (e.g., library catalog)
•Distributed database - documents stored on several
servers; the database may be managed together (e.g.,
Medline) or managed independently (e.g., the Web)
•Each document has a unique identifier: a document ID,
which can be used by the search system to refer to the
actual document
Document corpus for Web Search
Engine
For Web search systems:
• A document is a Web page
• The document store is the Web
• The document ID is the URL of the document
Storage of text: Image vs. ASCII
• Document images: Scanned image of text document
–Not searchable as text: Texts (characters, words, etc.) are
represented as patterns of pixels
–Retrieval from Document Images: Two options
• Recognition-based retrieval: OCR is required to convert
document images to ASCII (may be error-prone)
– Apply text retrieval systems on the recognized documents
• Document image retrieval: retrieval without explicit
recognition. Search relevant documents directly from image
collections.
–How do present search engines (like Google) search for
relevant document images?
• Textual documents
–Searchable as text
–words are represented as ASCII codes
Designing an IR System
Our focus during IR system design is:
• In improving performance effectiveness of the
system
–Effectiveness of the system is measured in terms of
precision, recall, …
–Stemming, stopwords, weighting schemes, matching
algorithms
• In improving performance efficiency.
The concern here is
–storage space usage, access time, …
–Compression, data/file structures, space – time tradeoffs
• The two subsystems of an IR system:
–Searching and
–Indexing
Indexing Subsystem
documents
→ Assign document identifier (document + document ID)
→ Tokenize (tokens)
→ Stop list (non-stoplist tokens)
→ Stemming & Normalize (stemmed terms)
→ Term weighting (terms with weights)
→ Index
Searching Subsystem
query
→ Parse query (query tokens)
→ Stop list (non-stoplist tokens)
→ Stemming (stemmed terms)
→ Boolean operations against the Index (relevant document set)
→ Ranking of the retrieved document set (ranked document set)
Basic assertion
Indexing and searching:
inexorably connected
– you cannot search what was not first indexed
in some manner or other
– indexing of documents or objects is done in
order to make them searchable
• there are many ways to do indexing
– to index one needs an indexing language
• there are many indexing languages
• even taking every word in a document is an indexing
language

Knowing searching is knowing indexing


Implementation Issues
•Storage of text:
–The need for text compression: to reduce storage space

•Indexing text
–Organizing indexes
• What techniques to use ? How to select it ?
–Storage of indexes
• Is compression required? Do we store on memory or in a disk ?

•Accessing text
–Accessing indexes
• How to access indexes? What data/file structure to use?
–Processing indexes
• How to search a given query in the index? How to update the index?
–Accessing documents
Text Compression
• Text compression is about finding ways to represent
the text in fewer bits or bytes
• Advantages:
–saves storage space
–speeds up document transmission time
–takes less time to search the compressed text
• Common compression methods
–Statistical methods:
E.g. Huffman coding
•Estimate probabilities of symbols, code one at a time, shorter
codes for high probabilities
–Dictionary methods: adaptive methods
E.g. Ziv-Lempel compression:
•Replace words or symbols with a pointer to dictionary entries
Huffman coding
•Developed in the 1950s by David Huffman; widely used for
text compression, multimedia codecs and message
transmission.

•The problem: Given a set of n symbols and their weights
(or frequencies), construct a tree structure (a binary tree
for binary code) with the objective of reducing memory
space & decoding time per symbol.

•Example: a 4-symbol tree in which D4 hangs directly off
the root, D3 one level deeper, and D1, D2 share the
deepest level gives the codes:
D1 = 000
D2 = 001
D3 = 01
D4 = 1

•Huffman coding is constructed based on the frequency of
occurrence of letters in text documents.
How to construct Huffman coding
Step 1: Create forest of trees for each symbol, t1, t2,… tn
Step 2: Sort forest of trees according to falling
probabilities of symbol occurrence
Step 3: WHILE more than one tree exists DO
–Merge two trees t1 and t2 with least probabilities p1 and p2
–Label their root with sum p1 + p2
–Associate binary code: 1 with the right branch and 0 with the
left branch
Step 4: Create a unique codeword for each symbol by
traversing the tree from the root to the leaf.
–Concatenate all encountered 0s and 1s together during
traversal
• The resulting tree has a probability of 1 at its root and
symbols at its leaf nodes.
Example
• Consider the 7-symbol alphabet given in the following
table to construct the Huffman coding.

Symbol   Probability
a        0.05
b        0.05
c        0.1
d        0.2
e        0.3
f        0.2
g        0.1

• The Huffman encoding algorithm picks each time the two
symbols with the smallest frequency to combine.
Huffman code tree
• Merging the least-probable trees step by step gives:
a + b = 0.1;  0.1 + c = 0.2;  0.2 + g = 0.3;  d + f = 0.4;
0.3 + e = 0.6;  0.4 + 0.6 = 1 (the root).
• Reading the tree with 0 on the left branch and 1 on the right,
one consistent code assignment is:
d = 00, f = 01, e = 11, g = 100, c = 1010, a = 10110, b = 10111
• Using the Huffman coding, a table can be constructed by
working down the tree, left to right. This gives the binary
equivalent for each symbol in terms of 1s and 0s.
• What is the Huffman binary representation for ‘café’?
Word level example
• Given text: “for each rose, a rose is a rose”
– Construct the Huffman coding
Algorithm
procedure HuffmanCode(H, n)
// H is a priority queue of n single-node trees, keyed on frequency
for i = 1 to n-1 do
    r = new NodeType
    r.lchild = least(H)   // remove the two trees with
    r.rchild = least(H)   // the smallest frequencies
    r.frequency = r.lchild.frequency + r.rchild.frequency
    insert(H, r)
end for
return (H)   // H now contains the single Huffman tree
end procedure
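The procedure above can be sketched concretely (a minimal illustration, not from the slides; the priority queue is Python's heapq, and ties are broken by insertion order, so the exact codewords may differ from the tree drawn earlier while the average code length stays optimal):

```python
import heapq

def huffman_codes(freqs):
    """Build a Huffman code from a {symbol: frequency} map.
    Returns {symbol: bit string}."""
    # Heap entry: (frequency, tie_breaker, tree); a tree is either a
    # symbol (leaf) or a (left, right) pair (internal node).
    heap = [(f, i, s) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # the two least-probable trees
        f2, _, t2 = heapq.heappop(heap)
        count += 1
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):       # internal node: 0 left, 1 right
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"   # single-symbol edge case
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"a": .05, "b": .05, "c": .1, "d": .2,
                       "e": .3, "f": .2, "g": .1})
```

For the 7-symbol example this yields an expected code length of 2.6 bits per symbol, matching the tree built by hand.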
Exercise
•Given the following, apply the Huffman algorithm
to find an optimal binary code:

Character: a b c d e t

Frequency: 16 5 12 17 10 25
Ziv-Lempel compression
•The problem with Huffman coding is that it requires
knowledge about the data before encoding takes
place.
–Huffman coding requires frequencies of symbol occurrence
before codeword is assigned to symbols

•Ziv-Lempel compression
–Does not rely on prior knowledge about the data
–Rather builds this knowledge in the course of data
transmission/data storage
–Ziv-Lempel algorithm (called LZ) uses a table of code-
words created during data transmission;
•each time it replaces strings of characters with a reference to a
previous occurrence of the string.
Example: LZ compression
• Given a word containing only two letters, a and b,
compress it using LZ technique.
Steps in Compression
• First, split the given word into pieces of symbols
– In the example, the first piece of our sample text is a. The
second piece must then be aa. If we go on like this, we
obtain the breakdown of the data into pieces.
– Note that each new piece is the shortest string of
characters that we have not seen so far.
LZ Compression
•Second, index the pieces of text obtained in the
breaking-down process from 1 to n.
– The empty string (start of text) has index 0, a has index 1, ...
•Third, number the pieces of data using the above
indices.
–Thus a, with the initial (empty) string, is numbered 0a. String 2,
aa, is numbered 1a, because it contains a, whose index is 1, and
the new character a. Proceed numbering all the pieces in
terms of those preceding them.

•Does replacing characters with integers compress the
given text?
LZ Compression
•Now, compute how many bits needed to represent this coded
information.
–each piece of text is composed of an integer and an alphabet.
•The number of bits needed to represent the integer with
index i is at most the number of bits used to represent
the (i - 1)th index. For example,
–the number of bits needed to represent the index 6 in piece 8 is 3,
because it takes three bits to express 7 (the largest possible
preceding index) in binary.

•One of the advantages of Lempel-Ziv compression is that in a
long string of text, the number of bits needed to transmit the
coded information is small compared to the actual length of
the text.
–E.g. transmitting the actual text aab takes 24 bits (8 + 8 + 8),
whereas the code 2b takes only 12 bits.
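The piece-numbering scheme above is essentially the LZ78 member of the Lempel-Ziv family. A minimal sketch (an illustration of the idea, not any particular tool's implementation) that emits (index, character) pairs and decodes them back:

```python
def lz78_compress(text):
    """Split text into pieces, each the shortest string not seen so
    far, and emit (index of longest known prefix, new character)."""
    phrases = {"": 0}              # phrase -> index; 0 is the empty string
    output, current = [], ""
    for ch in text:
        if current + ch in phrases:
            current += ch          # grow while the piece is still known
        else:
            output.append((phrases[current], ch))
            phrases[current + ch] = len(phrases)
            current = ""
    if current:                    # trailing piece that was seen before
        output.append((phrases[current[:-1]], current[-1]))
    return output

def lz78_decompress(pairs):
    """Rebuild the text by replaying the phrase dictionary."""
    phrases, out = [""], []
    for idx, ch in pairs:
        phrases.append(phrases[idx] + ch)
        out.append(phrases[-1])
    return "".join(out)

coded = lz78_compress("aaaaab")    # pieces: a, aa, aab
```

Running it on "aaaaab" reproduces the slides' numbering 0a, 1a, 2b as the pairs (0, a), (1, a), (2, b).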
Indexing structures
–Background
–Inverted files
–Suffix Trees and Suffix Arrays
–Signature files

• Presentation (within 10 days):


–Discuss in detail theoretical and algorithmic concepts
(including construction, various operations, complexity,
etc.) on the following commonly used data structures
1. Tree (AVL tree, Binary tree),
2. Hashing,
3. B tree and its variants (B+/B++ Tree, B* Tree),
4. Hierarchical Tree (like Quad Tree and its variants)
5. PAT-Tree and its variants
6. Graph
7. Data structure vs. file structure, arrays, sorted arrays and linked list
Indexing: Basic Concepts
• Indexing is used to speed up access to desired
information from the document collection as per a
user's query such that
–it enhances efficiency in terms of time for retrieval: relevant
documents are searched and retrieved quickly
Example: author catalog in a library
• An index file consists of records, called index
entries.
• Index files are much smaller than the original file.
–Remember Heaps' Law: for 1 GB of TREC text collection
the vocabulary has a size of only 5 MB (Ref: Baeza-Yates
and Ribeiro-Neto, 2005)
–This size may be further reduced by Linguistic pre-
processing (like stemming & other normalization methods).
• The usual unit for indexing is the word
–Index terms - are used to look up records in a file.
Issues in Indexing
• Creating the index (with the objective of reducing
storage space & searching time)
–what index terms to use: words, sentences, paragraph, etc.
–what indexing structure to use: inverted file, suffix array...
• Storing the index
–the need for compression
–where to store the index
• Processing the index
–access index file directly from disk or load on RAM
–select file/data structure that speed up execution and reduce
memory usage
• Updating the index
–Is the update performed in batch or as documents arrive one
by one
–Are we updating incrementally or re-indexing from scratch?
–How to synchronize changes to index and documents
How Do Current Search Engines Index?
• Indexes are built using a web crawler, which
retrieves each page on the Web for indexing.
–After indexing, the local copy of each page is discarded,
unless stored in a cache.
• Some search engines: automatically index
–such search engines include:
Google, AltaVista, Excite, HotBot, InfoSeek, Lycos
• Some others: semi automatically index
–Partially human indexed, hierarchically organized
–Such search engines include:
Yahoo, Magellan, Galaxy, WWW Virtual Library
• Common features
–allow Boolean searches
Basic Indexing Process
Documents to be indexed:  "Friends, Romans, countrymen."
→ Tokenizer
Token stream:             Friends  Romans  countrymen
→ Linguistic preprocessing
Modified tokens:          friend  roman  countryman
→ Indexer
Index file (inverted file):
friend      → 2, 4
roman       → 1, 2
countryman  → 13, 16
Major Steps in Index Construction
• Source file: Collection of text document
–A document can be described by a set of representative keywords called
index terms.
• Index Terms Selection:
–Tokenize: identify words in a document, so that each document is
represented by a list of keywords or attributes
–Stop words: removal of high frequency words
• A stop list of words is used to filter the input text
–Word stem: reduce words with similar meaning into their stem/root word
• Suffix stripping is the common method
–Term relevance weight: Different index terms have varying relevance
when used to describe document contents.
• This effect is captured through the assignment of numerical weights to
each index term of a document.
• There are different index terms weighting methods: TF, TF*IDF, …

• Output: a set of index terms (the vocabulary) to be used
for indexing the documents in which each term occurs.
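The selection steps above can be sketched end to end. This is a toy illustration: the stop list and the suffix-stripping rule below are stand-ins for a real stop list and a real stemmer (such as Porter's):

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "a", "are", "of", "and", "to", "in"}  # toy stop list

def index_terms(document):
    """Tokenize, remove stopwords, strip common suffixes, and
    count term frequencies for one document."""
    tokens = re.findall(r"[a-z']+", document.lower())
    terms = []
    for t in tokens:
        if t in STOPWORDS:
            continue
        for suffix in ("ing", "s"):        # naive suffix stripping
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        terms.append(t)
    return Counter(terms)

tf = index_terms("The rose is a rose; roses are red.")
```

Here "rose" and "roses" are conflated into one index term with frequency 3, while the stopwords never reach the index.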
Building Index file
•An index file of a document collection is a file consisting of a list
of index terms and, for each term, links to the documents that
contain it
–A good index file maps each keyword Ki to the set of documents Di
that contain the keyword

•Index file usually has index terms in a sorted order.


–The sort order of the terms in the index file provides an order on a physical file

•An index file is list of search terms that are organized for associative
look-up, i.e., to answer user’s query:
–In which documents does a specified search term appear?
–Where within each document does each term appear? (There may be several
occurrences.)

•For organizing the index file for a collection of documents, there
are various options available:
–Decide what data structure and/or file structure to use. Is it sequential file,
inverted file, suffix array, signature file, etc. ?
Index file Evaluation Metrics
•Running time of the main operations
–Access/search time
–Update time (Insertion time, Deletion time, ….)

•Space overhead
–Computer storage space consumed.

•Access types supported efficiently. Does the
indexing structure allow access to:
–records with a specified term, or
–records with terms falling in a specified range of
values?
Sequential File
•Sequential file is the most primitive file structure.
•The records are generally arranged serially, one after
another, but in lexicographic order on the value of
some key field.
• a particular attribute is chosen as primary key whose value
will determine the order of the records.
• when the first key fails to discriminate among records, a
second key is chosen to give an order.
Sequential File
• It has neither a directory nor linking pointers.
• To access records, search serially:
–starting at the first record, read and investigate all
the succeeding records until the required record is
found or the end of the file is reached.
• Its main advantages are:
–easy to implement;
–provides fast access to the next record using
lexicographic order.
–Can be searched quickly, e.g., by binary search,
O(log n)
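That O(log n) lookup can be sketched over a sorted key field; the in-memory list here stands in for the file's records, reusing the Act/bus/pen/total toy data from the inverted-file slides:

```python
from bisect import bisect_left

# A sequential file: records arranged in lexicographic order on the key.
records = [
    ("act",   [2, 19, 29]),
    ("bus",   [3, 19, 22]),
    ("pen",   [5]),
    ("total", [11, 34]),
]
keys = [key for key, _ in records]

def lookup(term):
    """Binary search on the sorted key field: O(log n) comparisons."""
    i = bisect_left(keys, term)
    if i < len(keys) and keys[i] == term:
        return records[i][1]
    return None            # term not in the file
```

Inserting a new term, by contrast, may force a shift of every record after the insertion point, which is the update weakness noted on the next slide.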
Sequential File
• Its disadvantages:
–difficult to update. Index must be rebuilt if a
new term is added. Inserting a new record may
require moving a large proportion of the file;
–random access is extremely slow.

• The problem of update can be:
–solved by ordering records by date of
acquisition rather than by key value; hence, the
newest entries are added to the end of the file
and therefore pose no difficulty to updating
Inverted file
• A word-oriented indexing mechanism based on a sorted list
of keywords, with each keyword having links to the
documents containing it
–Building and maintaining an inverted index is a relatively
low-cost, low-risk task. On a text of n words an inverted index
can be built in O(n) time

• Data to be held in the inverted file includes the list of index
terms and, for each term:
–fij, number of occurrences of term tj in document di
–nj, number of documents containing tj
–mi, maximum frequency of any term in di
–n, total number of documents in a collection
–tf, total frequency of tj in nj
–….
Inverted file
• The inverted file contains:
–The vocabulary (List of terms)
–The occurrence (Location and frequency of terms in a
document collection)

• The vocabulary: the set of all distinct words (index
terms) in the text collection.
–The collection is organized by terms

• The occurrence: contains one record per term, listing
–all the text locations/positions where the word occurs
–Frequency of each term in a document, i.e. count number of
occurrences of keywords in a document
Inverted file
•Having information about vocabulary (list of
terms)
–speeds searching for relevant documents
•Having information about the location of each
term within the document helps for:
–user interface design: highlight location of search
term
–proximity based ranking: adjacency and near
operators (in Boolean searching)
•Having information about frequency is used for:
–calculating term weighting (like TF, TF*IDF, …)
–optimizing query processing
Structure of inverted index
Document-level indexing:
– indexing to identify in which documents a specific word
exists.
No. Term Document (freq; document)
1 cold <2; 1,4>
2 days <2; 3,6>
Word-level indexing
– indexing to identify specific location of a word in a document
it exists in.
No. Term Document (freq; (document; location))
1 cold <2; (1;6), (4;8)>
2 days <2; (3;2), (6;2)>
•Which one is better (i) from the user's perspective and (ii)
in terms of efficiency?
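Both granularities can be sketched side by side. This is a toy illustration: the four short documents below are made up, so the frequencies and positions differ from the table above:

```python
from collections import defaultdict

def build_indexes(docs):
    """From {doc_id: text}, build
    doc_level[term]  -> {doc_id: freq}          (document-level)
    word_level[term] -> {doc_id: [positions]}   (word-level)"""
    doc_level = defaultdict(dict)
    word_level = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split(), start=1):
            word_level[term].setdefault(doc_id, []).append(pos)
            doc_level[term][doc_id] = doc_level[term].get(doc_id, 0) + 1
    return doc_level, word_level

docs = {1: "cold days ahead", 3: "days", 4: "so cold", 6: "two days"}
doc_level, word_level = build_indexes(docs)
```

The word-level index strictly subsumes the document-level one (frequencies are just the lengths of the position lists), which is why it supports proximity operators at the cost of extra storage.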
Inversion of Word List
1. The input text is parsed into a list of words along with
their location in the text. (time and storage consuming
operation)
2. This list is inverted from a list of terms in location order
to a list of terms in alphabetical order.
3. Compute term frequency.
Inverted File
Documents are organized by the terms/words they contain.
This is called an index file. Text operations are performed
before building the index.

Word    Tot Freq   Document   Freq in Doc   Term Location
Act     3          2          1             66
                   19         1             213
                   29         1             45
bus     4          3          1             94
                   19         2             7, 212
                   22         1             56
Pen     1          5          1             43
total   3          11         2             3, 70
                   34         1             40
Construction of Inverted file
An inverted index consists of two files: vocabulary
and posting files
•A vocabulary file (Word list):
–stores all of the distinct terms (keywords) that appear
in any of the documents (in lexicographical order) and
–For each word a pointer to posting file
•The record kept for each term j in the word list
contains the following:
–term j
–number of documents in which term j occurs (nj)
–Total frequency of term j
–pointer to inverted (postings) list for term j
Postings File
• For each distinct term in the vocabulary, stores a list
of pointers to the documents that contain that term.
• Each element in an inverted list is called a posting,
i.e., the occurrence of a term in a document
• It is stored as a separate inverted list for each
column, i.e., a list corresponding to each term in the
index file.
–Each list consists of one or many individual postings
Advantage of dividing inverted file:
• Keeping a pointer in the vocabulary to the list in the
posting file allows:
–the vocabulary to be kept in memory at search time even for
large text collections, and
–the posting file to be kept on disk, accessed only to reach documents
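The split can be sketched as follows: the dictionary holds, per term, the counts and an offset into a flat postings array (the array stands in for the on-disk postings file; the term names and numbers reuse the Act/bus/pen/total example):

```python
def split_index(inverted):
    """Split {term: {doc: freq}} into an in-memory dictionary
    (term -> (n_docs, total_freq, offset)) and a flat postings
    array standing in for the on-disk postings file."""
    dictionary, postings = {}, []
    for term in sorted(inverted):
        plist = sorted(inverted[term].items())
        dictionary[term] = (len(plist),
                            sum(f for _, f in plist),
                            len(postings))      # record offset into "disk"
        postings.extend(plist)
    return dictionary, postings

def postings_for(term, dictionary, postings):
    """One dictionary probe, then one contiguous 'disk' read."""
    n_docs, _, offset = dictionary[term]
    return postings[offset:offset + n_docs]

inv = {"act": {2: 1, 19: 1, 29: 1}, "bus": {3: 1, 19: 2, 22: 1},
       "pen": {5: 1}, "total": {11: 2, 34: 1}}
dic, post = split_index(inv)
```

Because each inverted list is stored contiguously, answering a query touches the small in-memory dictionary once and then reads a single run of postings.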
• The following figure shows the general
structure of inverted index file.
Organization of Index File
Vocabulary (word list):

Term    No of Doc   Tot freq   Pointer to posting
Act     3           3          →
Bus     3           4          →
pen     1           1          →
total   2           3          →

Each pointer leads to that term's inverted list in the postings
file, and each posting in the list points into the documents.
Example:
• Given a collection of documents, they are parsed
to extract words and these are saved with the
document ID.

Doc 1: I did enact Julius Caesar: I was killed
i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble
Brutus hath told you Caesar was ambitious.
Sorting the Vocabulary
• After all documents have been parsed the inverted file is sorted
by terms
– Inverted index may record term locations within document during parsing
Term / Doc # pairs in parsing (location) order:
I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1,
the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2,
with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2,
caesar 2, was 2, ambitious 2

After sorting by term:
ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2,
caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1,
killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2,
you 2, was 1, was 2, with 2
Remove duplicate terms & add frequency
•Multiple term entries in a single document are merged, and
frequency information is added
•Counting the number of occurrences of terms in the
collection helps to compute TF

Term        Doc #   Freq
ambitious   2       1
be          2       1
brutus      1       1
brutus      2       1
capitol     1       1
caesar      1       1
caesar      2       2
did         1       1
enact       1       1
hath        2       1
I           1       2
i'          1       1
it          2       1
julius      1       1
killed      1       2
let         2       1
me          1       1
noble       2       1
so          2       1
the         1       1
the         2       1
told        2       1
you         2       1
was         1       1
was         2       1
with        2       1
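The parse, sort, and merge sequence on the two Caesar documents can be sketched as (punctuation dropped for simplicity):

```python
from collections import Counter

doc1 = "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me"
doc2 = "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious"

# Step 1: parse into (term, doc #) pairs in location order.
pairs = [(w.lower(), d)
         for d, text in [(1, doc1), (2, doc2)]
         for w in text.split()]
# Step 2: invert the list by sorting it into alphabetical order.
pairs.sort()
# Step 3: merge duplicate (term, doc #) entries, recording frequency.
freq = Counter(pairs)
```

The resulting counts reproduce the merged table above, e.g. ("caesar", 2) with frequency 2 and ("killed", 1) with frequency 2.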
Vocabulary and postings file
The file is commonly split into a Dictionary and a
Postings file
Dictionary (term, n docs, tot freq) with pointers into the postings
file of (doc #, freq) pairs:

Term        N docs   Tot Freq   Postings (doc #, freq)
ambitious   1        1          (2, 1)
be          1        1          (2, 1)
brutus      2        2          (1, 1), (2, 1)
capitol     1        1          (1, 1)
caesar      2        3          (1, 1), (2, 2)
did         1        1          (1, 1)
enact       1        1          (1, 1)
hath        1        1          (2, 1)
I           1        2          (1, 2)
i'          1        1          (1, 1)
it          1        1          (2, 1)
julius      1        1          (1, 1)
killed      1        2          (1, 2)
let         1        1          (2, 1)
me          1        1          (1, 1)
noble       1        1          (2, 1)
so          1        1          (2, 1)
the         2        2          (1, 1), (2, 1)
told        1        1          (2, 1)
you         1        1          (2, 1)
was         2        2          (1, 1), (2, 1)
with        1        1          (2, 1)

Each dictionary entry keeps a pointer to its inverted list in the
postings file.
Example of inverted file
•Search for the query: noble brutus

•Search for noble AND brutus returns document 2,
the only document that contains both words.

•Search for noble (w) brutus retrieves only
document 2, in which the two words are right
next to each other.
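Both queries can be sketched with a merge-intersection over sorted postings plus a positional check. The word positions below come from simple whitespace tokenization of the two Caesar documents, so they are approximate:

```python
def and_query(p1, p2):
    """Merge-intersect two sorted doc-ID lists (Boolean AND)."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

def adjacent(pos1, pos2):
    """True if some occurrence in pos2 immediately follows one in pos1."""
    later = set(pos2)
    return any(p + 1 in later for p in pos1)

# Word-level postings from the two Caesar documents:
noble = {2: [8]}               # "noble" is word 8 of Doc 2
brutus = {1: [12], 2: [9]}     # "Brutus": word 12 of Doc 1, word 9 of Doc 2
both = and_query(sorted(noble), sorted(brutus))              # AND
phrase = [d for d in both if adjacent(noble[d], brutus[d])]  # noble (w) brutus
```

The merge walks each postings list once, so the AND costs O(len(p1) + len(p2)); the adjacency test is what requires word-level (positional) postings.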
Inverted index storage
•Separation of inverted file into vocabulary and posting
file is a good idea.
–Vocabulary: For searching purpose we need only word list.
This allows the vocabulary to be kept in memory at search
time since the space required for the vocabulary is small.
• The vocabulary grows as O(n^β), where β is a constant between 0 and 1.
• Example: from 1,000,000,000 documents, there may be 1,000,000
distinct words. Hence, the size of index is 100 MBs, which can easily
be held in memory of a dedicated computer.

–The posting file requires much more space.


• For each word appearing in the text we are keeping statistical
information related to word occurrence in documents.
• Each posting's pointer to a document requires extra space,
O(n) overall.
•How to speed up access to inverted file?
Recall dictionary and postings files
• Where do we pay in storage?
• The dictionary, with one (term, n docs, tot freq, pointer) row
per distinct term, is small and is held in memory.
• The postings file, with one (doc #, freq) entry per term
occurrence, accounts for most of the storage and is kept on disk.
Data vs. File Structures
Both data and file structures involve:
–Representation of data, and
–Operations for accessing and updating the data
The difference between the two is that:
–Data structure refers to the organization of data in RAM
–File structure refers to the organization of data in secondary
storage (a file)
Accessing and processing data in RAM:
–Fast (since it is electronic, ~120 ns)
–Small in size (since expensive)
–Volatile (since it is electronic)
Accessing and processing data in a FILE:
–Slow (since it is electronic + mechanical, ~30 ms)
–Large in size (since cheap)
–Stable and persistent (since magnetic)
What important data structures do you know?
Important data structures
–Arrays vs. linked lists
–Stack vs. queue
–Tree (binary tree, AVL tree and other variants)
–B-Tree and its variants (B+ Tree, B++ Tree, B* Tree, …)
–Graph
–Hashing
–PAT tree and its variants
–Tries and suffix tries
–Hierarchical trees (like Quad Tree/Octree and their variants)
File Structure
A file structure:
–allows applications to read, write and modify data in a file;
–supports finding the data that matches some search criteria, or
reading through the data in some particular order;
–minimizes the number of trips to the disk needed to get the
desired information.
This can be done by understanding file structures and using
them properly.
It is relatively easy to come up with file structure designs
that meet the general goals when the files never change.
When files grow or shrink as information is added and deleted,
it is much more difficult.
Issues in File Structure
To utilize files efficiently and effectively, the
following are some of the common issues to
be addressed:
–Data compression: how to make files smaller
–Reclaiming space in files that have undergone
deletions and updates
–Sorting files in order to support fast searching
and access to the required information
Important File Structures
Sequential access: early work assumed that
files were on tape.
–Access was sequential, and the cost of access grew
in direct proportion to the size of the file
(analogous to accessing array elements
sequentially).
–As files grew very large, unaided sequential access
was not a good solution.
Disks (floppy, CD, …) allowed for direct
access. With this technology, better file
structures were developed.
Important File Structures
• Tree structures may be used (Binary tree, AVL tree, B-
Tree, etc)
– Answer query: Supports exact-match lookup. Find all
documents associated with one or a set of terms
– O(log n) lookups to find a list
– Usually easy to expand
– do not handle synonymy and polysemy well
• Hash table
– Answer query: Supports exact-match lookup. Find all
documents associated with one or a set of terms
– O(1) lookups to find a list
– May be complex to expand
– do not handle synonymy and polysemy well

• Trie (or digital search trees)
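The slide ends here, but a minimal trie sketch shows the idea: lookup cost is O(length of the term), independent of vocabulary size. The dict-of-dicts node layout and the "$" end-of-term marker are illustrative choices:

```python
class Trie:
    """Minimal trie (digital search tree) over term strings; each node
    maps a character to a child, terminal nodes carry a postings list."""
    def __init__(self):
        self.root = {}

    def insert(self, term, doc_id):
        node = self.root
        for ch in term:
            node = node.setdefault(ch, {})   # descend, creating nodes
        node.setdefault("$", []).append(doc_id)  # "$" marks end of term

    def lookup(self, term):
        node = self.root
        for ch in term:
            if ch not in node:
                return []                    # exact match fails
            node = node[ch]
        return node.get("$", [])

t = Trie()
for doc, term in [(1, "rose"), (2, "rose"), (2, "rows")]:
    t.insert(term, doc)
```

Shared prefixes ("ro" in rose/rows) are stored once, which is also what makes tries a natural basis for prefix and wildcard queries.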
