OUTLINE
1. Introduction
2. Structured Data
3. Unstructured Data
4. Semi-Structured Data
5. Difference between semi-structured and structured data
Introduction:
Data growth has seen exponential acceleration since
the advent of the computer and internet.
define: it is defined as the data that is stored on digital
[Figure: Structured data: spreadsheets, SQL, OLTP systems]
Characteristics of structured data
Conforms to a data model.
Data resides in fixed fields within a record or a file.
Attributes in a group are the same.
Definition, format, and meaning of data are explicitly known.
[Figure: Ease with structured data: storage, scalability, security, update and delete]
Sources of structured data
Structured data comes from databases such as Access, OLTP systems,
and SQL databases, as well as from spreadsheets such as Excel.
To summarize, structured data:
Consists of fully described data sets.
Has clearly defined categories and sub-categories.
Is placed neatly in rows and columns.
Goes into records, and hence the database is regulated by a
well-defined structure.
Can be indexed easily by the DBS itself or manually.
Advantages of structured data (easy to
work with structured data)
It is easy to work with structured data. The advantages
are:
Storage: Both defined and user-defined data types help
with the storage of structured data.
Scalability: Scalability is generally not an issue as the
data grows.
Security: Ensuring security is easy.
Update and delete: Updating, deleting, etc. is easy due to
the structured form.
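As an illustration of the points above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are assumptions made up for this example. It shows typed columns (storage) and simple update and delete statements on an in-memory database.

```python
import sqlite3


def demo_structured_store():
    """Create a typed table, then insert, update, and delete rows."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Every column has a declared type: this is the fixed data model.
    conn.execute(
        "CREATE TABLE employee ("
        " emp_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " department TEXT NOT NULL,"
        " salary REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO employee (emp_id, name, department, salary)"
        " VALUES (?, ?, ?, ?)",
        [
            (1, "Asha", "Finance", 52000.0),
            (2, "Ravi", "IT", 61000.0),
            (3, "Meera", "IT", 58000.0),
        ],
    )
    # Update and delete address rows by a known attribute.
    conn.execute("UPDATE employee SET salary = ? WHERE emp_id = ?", (65000.0, 2))
    conn.execute("DELETE FROM employee WHERE emp_id = ?", (1,))
    rows = conn.execute(
        "SELECT emp_id, name, department, salary FROM employee ORDER BY emp_id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(demo_structured_store())
```

Because every record has the same fields, the update and delete statements can target rows by attribute without inspecting the content itself.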
[Figure: Hassle-free structured data: retrieving information, indexing and searching, mining data, BI operations]
Hassle Free Retrieval
Retrieval of structured data is hassle free. The
features are as follows:
Retrieving information: A well-defined structure helps in
easy retrieval of data.
Indexing and searching: Data can be indexed based not only on a
text string but also on other attributes. This enables streamlined search.
Mining data: Structured data can be easily mined and knowledge
can be extracted from it.
BI operations: BI works extremely well with structured data. Hence
data mining, warehousing, etc. can be easily undertaken.
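The retrieval features above can be sketched with sqlite3 as well. The orders table, its columns, the index name, and the data are hypothetical. The sketch indexes a non-key attribute, searches on it, and runs a simple BI-style summary.

```python
import sqlite3


def build_db():
    """Build a small in-memory table and index it on a non-key attribute."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY,"
        " customer TEXT, city TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?, ?)",
        [
            (1, "Asha", "Pune", 120.0),
            (2, "Ravi", "Delhi", 80.5),
            (3, "Meera", "Pune", 42.0),
        ],
    )
    # Index on an attribute other than the key, so searches by city
    # do not need to scan every row.
    conn.execute("CREATE INDEX idx_orders_city ON orders (city)")
    return conn


def orders_in_city(conn, city):
    """Search by attribute value."""
    return conn.execute(
        "SELECT order_id, customer, amount FROM orders"
        " WHERE city = ? ORDER BY order_id",
        (city,),
    ).fetchall()


def total_by_city(conn):
    """A BI-style aggregate: total order amount per city."""
    return dict(
        conn.execute("SELECT city, SUM(amount) FROM orders GROUP BY city").fetchall()
    )


if __name__ == "__main__":
    db = build_db()
    print(orders_in_city(db, "Pune"))
    print(total_by_city(db))
```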
UNSTRUCTURED DATA
Unstructured data is data that cannot be stored in the form of
rows and columns as in a database and does not
conform to any data model, i.e. it is difficult to
determine the meaning of the data.
or MS Excel.
A lot of unstructured data is also noisy text such as chats,
SOLUTIONS FOR STORING
UNSTRUCTURED DATA
Changing format : Unstructured data may be converted to formats which
are easily managed, stored and searched.
Developing new hardware : New hardware needs to be developed to
support unstructured data. It may either complement the existing storage
device or may be stand-alone for unstructured data.
Storing in RDBMS/BLOBs (Binary Large Objects) : While unstructured
data such as video/image does not fit an ordinary typed relational column,
it can be stored as a BLOB, and there is no problem storing its metadata,
like the date & time of its creation, the author of the data, etc.
Storing in XML format : Unstructured data may be stored in XML format
which tries to give some structure to it by using tags and elements.
CAS (Content Addressable Storage) : It organizes files based on their
metadata and assigns a unique name to every object stored in it. Used
extensively to store emails.
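A minimal sketch of two of the ideas above, assuming sqlite3 as the RDBMS: the raw bytes go into a BLOB column, the metadata goes into ordinary columns, and each object gets a unique name. Using a SHA-256 hash of the content as that name is an assumption of this example (it is how content-addressed stores commonly work), and the table and column names are hypothetical.

```python
import hashlib
import sqlite3


def open_store():
    """Create an in-memory table with metadata columns plus a BLOB column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects ("
        " content_id TEXT PRIMARY KEY,"  # unique name for the object
        " author TEXT,"
        " created_at TEXT,"
        " size_bytes INTEGER,"
        " payload BLOB)"  # the unstructured bytes themselves
    )
    return conn


def store_object(conn, payload, author, created_at):
    """Store raw bytes with their metadata and return the object's name."""
    # Assumption: the unique name is the SHA-256 digest of the content.
    content_id = hashlib.sha256(payload).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO objects"
        " (content_id, author, created_at, size_bytes, payload)"
        " VALUES (?, ?, ?, ?, ?)",
        (content_id, author, created_at, len(payload), payload),
    )
    return content_id


def load_object(conn, content_id):
    """Fetch the stored bytes by name, or None if the name is unknown."""
    row = conn.execute(
        "SELECT payload FROM objects WHERE content_id = ?", (content_id,)
    ).fetchone()
    return None if row is None else row[0]


if __name__ == "__main__":
    store = open_store()
    name = store_object(store, b"fake image bytes", "asha", "2024-01-01T10:00:00")
    print(name, load_object(store, name))
```

Storing the same bytes twice yields the same name, so the second insert is ignored and the store keeps a single copy.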
CHALLENGES FACED WHILE EXTRACTING
INFORMATION FROM STORED UNSTRUCTURED
DATA
Interpretation : Unstructured data is not easily interpreted
by conventional search algorithms.
Classification/Taxonomy : Different naming conventions
followed across the firm make it difficult to classify the
data.
Indexing : Designing algorithms to understand the meaning
of the documents and then tagging or indexing them
accordingly is difficult.
Deriving meaning : Computer programs cannot
automatically derive meaning from unstructured data.
File formats : The increasing number of file formats makes it
difficult to interpret data.
Tags : As the data grows, it is not possible to put tags
manually.
POSSIBLE SOLUTIONS TO THESE
CHALLENGES
Tags : Unstructured data can be stored in a virtual repository and can
be automatically tagged. For example, Documentum provides this type of
solution.
Text mining : It helps in grouping as well as classifying unstructured
data and assists in analysis by considering grammar, context,
synonyms, etc.
Application platforms : Platforms such as XOLAP help extract information
from email and XML-based documents.
Classification/Taxonomy : Taxonomies within the firm can be
managed automatically to organize data in hierarchical structures.
Naming conventions/standards : Following naming conventions
across a firm can greatly improve storage, retrieval, indexing and search.
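The automatic tagging idea above can be sketched with a small keyword taxonomy. The categories and keywords below are invented for illustration, and real products use far richer text mining than a keyword match.

```python
import re

# Hypothetical firm-wide taxonomy: tag name -> keywords that imply the tag.
TAXONOMY = {
    "finance": {"invoice", "payment", "budget"},
    "hr": {"leave", "payroll", "recruitment"},
}


def auto_tag(text, taxonomy=TAXONOMY):
    """Return the sorted tags whose keywords appear in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tag for tag, keywords in taxonomy.items() if words & keywords)


if __name__ == "__main__":
    print(auto_tag("Please approve the invoice before payroll runs."))
```

Keeping the taxonomy in one shared table is what lets documents from different teams be classified under the same names.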
UIMA (Unstructured Information
Management Architecture)
UIMA is an open source platform from IBM which integrates
different types of analysis engines to provide a complete solution
for knowledge discovery from unstructured data.
In UIMA, the analysis engine enables integration and analysis of
unstructured information and bridges the gap between structured
and unstructured data.
It stores information in a structured format which can then be
mined, searched and put to other uses. Documents are analysed in
the following ways :
Breaking up of documents into separate words.
Grouping and classifying according to Taxonomy.
Detecting parts of speech, grammar, and synonyms.
Detecting relationship between various elements.
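The idea of chaining analysis steps that turn free text into a structured record can be sketched in plain Python. This is not the UIMA API: the stage functions, the toy part-of-speech lexicon, and the record layout are all assumptions made for this example.

```python
import re
from collections import Counter

# Toy lexicon (hypothetical); a real system would use a trained tagger.
TOY_LEXICON = {
    "the": "DET",
    "server": "NOUN",
    "disk": "NOUN",
    "failed": "VERB",
    "restarted": "VERB",
}


def tokenize(doc):
    """Break the document text into separate lowercase words."""
    doc["tokens"] = re.findall(r"[a-z]+", doc["text"].lower())
    return doc


def tag_parts_of_speech(doc):
    """Attach a part-of-speech label to each token using the toy lexicon."""
    doc["pos"] = [(tok, TOY_LEXICON.get(tok, "UNKNOWN")) for tok in doc["tokens"]]
    return doc


def count_terms(doc):
    """Add term frequencies, a structured view that can be searched or mined."""
    doc["term_counts"] = Counter(doc["tokens"])
    return doc


def analyse(text, stages=(tokenize, tag_parts_of_speech, count_terms)):
    """Run each analysis stage in order over one document record."""
    doc = {"text": text}
    for stage in stages:
        doc = stage(doc)
    return doc


if __name__ == "__main__":
    result = analyse("The server failed. The server restarted.")
    print(result["pos"])
    print(result["term_counts"])
```

Each stage only adds fields to the record, so stages can be added or swapped without changing the others.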
Getting to know semi-structured data
Only about 10% of the data in any organization is semi-structured.
Still, it is important to understand, manage, and analyze this
semi-structured data coming from heterogeneous sources.
Semi-structured data does not conform to any data model. Also, this
data cannot be stored in rows and columns as in a database.
Semi-structured data has tags and markers which help group the
data and describe how the data is stored. But they are not sufficient
for management and automation of data.
Similar entities are grouped and organized in a hierarchy. The
attributes or the properties within a group may or may not be the
same.
[Figure: characteristics of semi-structured data]
Does not conform to a data model but contains tags and elements.
Similar entities are grouped.
Cannot be stored in rows and columns as in a database.
Attributes in a group may not be the same.
The tags and elements describe how the data is stored.
Not sufficient metadata.
Email Standard format:
To : <NAME>
From : <NAME>
Subject : <TEXT>
CC : <NAME>
Body : <TEXT,GRAPHICS,IMAGES,ETC>
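The email layout above can be illustrated with Python's standard email package: the header fields are tagged and easy to pick out, while the body stays free text. The addresses and message content are made up for this sketch.

```python
from email import message_from_string

# Hypothetical message following the To/From/Subject/CC/Body layout.
RAW_EMAIL = (
    "To: bob@example.com\n"
    "From: alice@example.com\n"
    "Subject: Quarterly report\n"
    "Cc: carol@example.com\n"
    "\n"
    "Hi Bob,\n"
    "the report is attached.\n"
)


def split_email(raw):
    """Separate the tagged header fields from the free-text body."""
    msg = message_from_string(raw)
    headers = {name: msg[name] for name in ("To", "From", "Subject", "Cc")}
    body = msg.get_payload()  # plain string for a simple, non-multipart message
    return headers, body


if __name__ == "__main__":
    fields, text = split_email(RAW_EMAIL)
    print(fields)
    print(text)
```

The headers behave like structured fields, but nothing constrains what the body contains, which is why email counts as semi-structured.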
Where does semi-structured data come from?
Email
XML
TCP/IP packets
Zipped files
Binary executables
Mark-up languages
Integration of data from heterogeneous sources
[Figure: Integration of data from heterogeneous sources: OODBMS, RDBMS, legacy system, structured file]
How to manage semi-structured data?
Schemas :
These can be used to describe the structure of the data. Schemas
define the constraints on the structure and content of the documents.
Graph-based data models :
These can be used to describe data. This is a schema-less
approach and is also known as self-describing, as the data is
presented in such a way that it explains itself.
XML :
This is widely used to store and exchange semi-structured data.
Schemas in XML are not tightly coupled to the data.
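To make the XML point concrete, here is a small sketch using Python's standard xml.etree.ElementTree module. The library/book document is invented for this example. Note that the two book entries are grouped under one parent but do not carry the same set of attributes, and no schema is needed to read them.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: similar entities grouped in a hierarchy,
# but the second book has an extra <edition> element.
XML_DOC = """
<library>
  <book id="b1">
    <title>Data Basics</title>
    <author>A. Rao</author>
  </book>
  <book id="b2">
    <title>Semi-Structured Data</title>
    <author>M. Iyer</author>
    <edition>2</edition>
  </book>
</library>
"""


def read_books(xml_text):
    """Turn each <book> into a dict, keeping whatever fields it has."""
    root = ET.fromstring(xml_text.strip())
    books = []
    for book in root.findall("book"):
        record = {"id": book.get("id")}
        for child in book:
            # The tag name describes the value, so the data explains itself.
            record[child.tag] = child.text
        books.append(record)
    return books


if __name__ == "__main__":
    for entry in read_books(XML_DOC):
        print(entry)
```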
How to store semi-structured data?
Challenges faced:
Storage cost
RDBMS
Irregular and partial structure
Implicit structure
Evolving schemas
Distinction between schemas and data
Possible solutions include:
XML
RDBMS
Special Purpose DBMS
OEM (Object Exchange Model)
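As a rough illustration of a self-describing, graph-based model in the spirit of OEM, the sketch below represents each object with an identifier, a label, and a value that is either atomic or a list of child identifiers. The class name, field names, and sample objects are assumptions for this example, not a definition of OEM itself.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class LabelledObject:
    """One node in the graph: an id, a label, and an atomic or complex value."""
    oid: str
    label: str
    value: Union[str, int, list]  # a list holds the ids of child objects


def resolve(store, oid):
    """Walk the graph from one object and return a nested, labelled view."""
    node = store[oid]
    if isinstance(node.value, list):
        return {store[child].label: resolve(store, child) for child in node.value}
    return node.value


# Hypothetical data: a person object pointing at two atomic objects.
STORE = {
    "&1": LabelledObject("&1", "person", ["&2", "&3"]),
    "&2": LabelledObject("&2", "name", "Asha"),
    "&3": LabelledObject("&3", "age", 31),
}

if __name__ == "__main__":
    print(resolve(STORE, "&1"))
```

Since every value carries its own label, new kinds of objects can be added without first changing a schema.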
structured data.
This semi-structured data when stored in the structured