A Computational Framework For Integrating and Retrieving Biodiversity Data On A Large Scale

Daniel L. da Silva, Pedro L. P. Corrêa
Escola Politécnica da Universidade de São Paulo (EPUSP)
São Paulo, Brazil
{daniellins, pedro.correa}@usp.br

Silvio L. Stanzani, Paulo André Filipak
Escola Politécnica da Universidade de São Paulo (EPUSP)
São Paulo, Brazil
{silvio.stanzani, pfilipak}@usp.br

Andreiwid Sheffer Corrêa
Escola Politécnica da Universidade de São Paulo (EPUSP)
IMA, IFSP
São Paulo, Brazil
andreiwid@usp.br

Abstract: The digitization and integration of biodiversity data
are essential for supporting environmental conservation and the
sustainable use of natural resources. An increasing amount of
data is being made available by regional, national and global
initiatives, but using these data efficiently is still a
challenge. New techniques are needed to manage these various
types of biotic and abiotic data efficiently and to use them to
generate useful knowledge for decision-making processes. We
present work-in-progress research that proposes a computational
framework to manage biodiversity data and to enable an
efficient information retrieval process.
Keywords: Biodiversity; Architecture; Big Data; Integration;
NoSQL; Information Retrieval
I. INTRODUCTION
Biodiversity Informatics (BI) faces the challenge of applying
information technologies to the management, algorithmic
exploration, analysis and interpretation of environmental data,
particularly at the species level [1]. Among the BI challenges,
the integration of biodiversity databases has gained momentum
in recent years due to its importance for environmental
management and for decision-making processes in the
conservation and sustainable use of natural resources [2], [3].
However, many biodiversity data are not yet accessible or
discoverable. A survey conducted in 2000 showed that most of
the world's legacy biodiversity data had not even been
digitized [4].
Considering this scenario, several initiatives for the
digitization and integration of biodiversity data have been
undertaken, and others are ongoing, at three main levels:
regional, national and global. Notable examples are the projects
coordinated by Biodiversity Information Standards, also known
as the Taxonomic Databases Working Group (TDWG), and by the
Global Biodiversity Information Facility (GBIF).
Based on the results of these initiatives, we identified new
issues to be resolved in the BI area:
- How to manage the increasing amount of data from digitization
  projects, system integrations and scientific experiments?
- How to organize these data in a model that meets the diverse
  needs of researchers, institutions and citizens?
- How to optimize the location and retrieval of information for
  use in analysis tools and decision-making processes?
- How to enable the integration of biodiversity data with other
  relevant information domains, such as social and economic
  data?
This paper presents a work-in-progress project that
proposes a computational framework for integrating and
managing biodiversity data, focusing on the key issues
presented above.
Our approach will consider a strategy based on Big Data
technologies, Web Services and metadata standards to enable
the organization, integration and availability of biodiversity
data.
II. COMPUTATIONAL ARCHITECTURE OVERVIEW
Figure 1 shows the proposed architecture. It consists of
four functional layers: Offline Processing, Data
Management, Services and Data Publication.
[Figure 1: layered architecture diagram. Labels recovered from
the figure, top to bottom: Data Publication (data portal,
geospatial portal, analytics tools); Services (occurrence data,
taxon data, geospatial data); Data Management (Cassandra, Solr,
Neo4j, Lucene, PostgreSQL/PostGIS, Hadoop, shapefiles); Offline
Processing (harvesting, QA/QC process, metadata standards);
data providers 1 to N.]
Figure 1 - Computational Architecture for Biodiversity Data Management.
A. Offline Processing Layer
This layer is responsible for data harvesting and for Quality
Assurance and Quality Control (QA/QC) processes. It relies on
metadata standards, such as Darwin Core [5], and on
internationally recognized protocols for biodiversity data
interoperability, such as IPT, TAPIR and DiGIR [6].
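The QA/QC step of this layer could look roughly like the following Python sketch. It is only an illustration under stated assumptions: the field names are real Darwin Core terms, but the choice of required terms, the coordinate range checks and the function names check_record and split_by_quality are hypothetical and are not taken from the framework itself.

```python
from typing import Dict, List, Tuple

# Real Darwin Core term names; treating these three as mandatory is an
# assumption made for this sketch, not a rule stated in the paper.
REQUIRED_TERMS = ("occurrenceID", "scientificName", "basisOfRecord")


def check_record(record: Dict[str, str]) -> List[str]:
    """Return a list of QA/QC problems found in one harvested record."""
    problems = []
    for term in REQUIRED_TERMS:
        if not (record.get(term) or "").strip():
            problems.append(f"missing {term}")
    # Coordinates are optional, but must be numeric and in range when present.
    for term, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        raw = record.get(term)
        if raw is None or raw == "":
            continue
        try:
            value = float(raw)
        except ValueError:
            problems.append(f"{term} is not a number: {raw!r}")
            continue
        if abs(value) > limit:
            problems.append(f"{term} out of range: {value}")
    return problems


def split_by_quality(records) -> Tuple[list, list]:
    """Partition harvested records into accepted ones and rejected ones
    (each rejected record is paired with the reasons for rejection)."""
    accepted, rejected = [], []
    for record in records:
        problems = check_record(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```

Keeping the reasons next to each rejected record makes it possible to report quality problems back to the data provider instead of silently dropping data.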
B. Data Management Layer
This layer stores and organizes the data collected from data
providers. Following the concept of polyglot persistence, it
uses several database technologies (relational and
non-relational) simultaneously, each responsible for managing a
particular data type or solving a specific issue [7].
Non-relational databases have features that are attractive for
biodiversity data management, such as the capacity to handle
very large amounts of data, flexible schemas, horizontal
scalability and low cost. Indexing and data processing tools
are also part of this layer.
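To make the idea of polyglot persistence concrete, the sketch below routes two kinds of data to two different stores. It is a minimal illustration, not the framework's implementation: both stores are in-memory stand-ins written for this example (no real database driver is used), and the class names, the routing rule and the identifiers "occ:1", "occ:2", "taxon:A" and "taxon:B" are hypothetical.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """In-memory stand-in for a column-oriented store: one row per key,
    and each row may carry a different set of columns (flexible schema)."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, columns):
        self.rows[row_key].update(columns)

    def get(self, row_key):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self.rows.get(row_key, {}))


class GraphStore:
    """In-memory stand-in for a graph store holding
    (source, relation, target) edges."""

    def __init__(self):
        self.edges = set()

    def relate(self, source, relation, target):
        self.edges.add((source, relation, target))

    def targets(self, source, relation):
        return sorted(t for s, r, t in self.edges
                      if s == source and r == relation)


class PolyglotRepository:
    """Sends each kind of data to the store assumed to suit it best."""

    def __init__(self):
        self.occurrences = ColumnFamilyStore()   # heterogeneous records
        self.interactions = GraphStore()         # relationships

    def save_occurrence(self, identifier, columns):
        self.occurrences.put(identifier, columns)

    def save_interaction(self, taxon, relation, other):
        self.interactions.relate(taxon, relation, other)


repo = PolyglotRepository()
# Two providers supply different columns for their records.
repo.save_occurrence("occ:1", {"dwc:scientificName": "Cedrela fissilis Vell.",
                               "dbh": "95cm"})
repo.save_occurrence("occ:2", {"dwc:collectionCode": "CENAP"})
repo.save_interaction("taxon:A", "EATS", "taxon:B")
```

The calling code only sees the repository, so a store can be swapped or added without changing the services built on top of it.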
C. Web Services Layer
This layer provides services for locating and consuming the
stored data. The services are organized into components
according to their type and function, and are responsible for
providing information to the data publishing systems and to
external tools. We chose this strategy to keep applications
autonomous and to facilitate the reuse of these components.
Because of data heterogeneity and the standards considered in
the architecture, these services can provide data in various
formats, such as Darwin Core, EML and Plinian Core, expanding
the ability to integrate with other systems and computational
tools.
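One way a service could offer the same record in several formats is a small table of renderers, sketched below in Python. This is an assumption-laden illustration: the Darwin Core terms namespace URI is the published one, but the <record> root element, the format keys "json" and "dwc-xml" and the function names are hypothetical and do not follow any official exchange schema.

```python
import json
import xml.etree.ElementTree as ET

DWC_NS = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core terms namespace


def to_json(record):
    """Serialize the record as a JSON object with sorted keys."""
    return json.dumps(record, sort_keys=True, ensure_ascii=False)


def to_dwc_xml(record):
    """Wrap each field in an element from the Darwin Core terms namespace.
    The <record> root element is a placeholder, not an official schema."""
    ET.register_namespace("dwc", DWC_NS)
    root = ET.Element("record")
    for term, value in sorted(record.items()):
        child = ET.SubElement(root, f"{{{DWC_NS}}}{term}")
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Supporting a new output format means adding one entry here.
RENDERERS = {"json": to_json, "dwc-xml": to_dwc_xml}


def render(record, fmt):
    """Render a record in the requested format or raise ValueError."""
    try:
        renderer = RENDERERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return renderer(record)
```

A call such as render({"scientificName": "C. brachyurus"}, "dwc-xml") would return the record as namespaced XML, while an unknown format name raises ValueError.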
D. Data Publication Layer
This layer is responsible for publishing the integrated
information through a data portal that provides tools for
locating and manipulating taxonomic data and specimen
occurrence data. It also provides geospatial and analysis
tools.
III. CASE STUDY 1
This case study focused on experiments with two NoSQL databases
for representing biodiversity data.
Apache Cassandra is a column-oriented database, and its
flexible schema enables the management of heterogeneous data
from different institutions without loss of information.
Figure 2 shows a Cassandra column family storing species
occurrence data indexed by Life Science Identifiers (LSID).

Row key (LSID)                      Columns
urn:lsid:icmbio.gov.br:occ:MA1202   dwc:collectionCode = CENAP; dwc:scientificName = C. brachyurus; ...
urn:lsid:biocol.org:col:15528       abcd:SourceID = RB; dwc:scientificName = Cedrela fissilis Vell.; dbh = 95cm; ...
Figure 2 - Column family for species occurrence data.

Neo4j is a graph database that proved effective for managing
interactions between species and the environment, enabling fast
answers to questions such as "Find endangered species that eat
snakes and live in forests." Figure 3 shows a biodiversity data
representation based on a graph database.

[Figure 3: graph diagram. Labels recovered from the figure:
nodes Taxon, Specimen, Publication, Conservation Status, Biome,
Community, Location, Time and Life Stage; relationships EATS,
POLLINATES, USED BY, CLASSIFIED AS, DESCRIBED AT, FOUND AT,
COLLECTED AT and IN STAGE.]
Figure 3 - Species Interaction Data Model based on a Graph Database.

We are conducting further experiments with other tools and
technologies that may be considered for this architecture
later.
IV. CONSIDERATIONS AND FUTURE WORK
In this paper, we propose a framework to manage biodiversity
data provided by several institutions. Based on the research
goals, we seek to define a computational framework able to
organize data efficiently and to facilitate the information
retrieval process, in order to support decision-making in the
conservation and sustainable use of natural resources.
Experiments with computational tools and algorithms are in
progress to define a data organization process based on
metadata.
ACKNOWLEDGMENT
This research is supported by the Amazonas Research
Foundation (FAPEAM), the Brazilian Ministry of the
Environment (MMA) and the Deutsche Gesellschaft für
Internationale Zusammenarbeit (GIZ) GmbH.
REFERENCES
[1] J. Soberón and T. Peterson, "Biodiversity informatics: managing and
    applying primary biodiversity data," Philos. Trans. R. Soc. Lond.
    B. Biol. Sci., vol. 359, pp. 689-698, April 2004.
[2] R. D. M. Page, "Biodiversity informatics: the challenge of linking
    data and the role of shared identifiers," Brief. Bioinform., vol.
    9(5), pp. 345-354, January 2008.
[3] V. S. Chavan and P. Ingwersen, "Towards a data publishing
    framework for primary biodiversity data: challenges and potentials
    for the biodiversity informatics community," BMC Bioinformatics,
    vol. 10(Suppl 14), pp. 1-11, November 2009.
[4] F. A. Bisby, "The Quiet Revolution: Biodiversity Informatics and
    the Internet," Science, vol. 289, pp. 2309-2312, September 2000.
[5] J. Wieczorek, D. Bloom, R. Guralnick, S. Blum, M. Döring, R.
    Giovanni, T. Robertson, and D. Vieglais, "Darwin Core: An
    Evolving Community-Developed Biodiversity Data Standard,"
    PLoS ONE, vol. 7(1), e29715, January 2012.
[6] A. Goddard, N. Wilson, P. Cryer, and G. Yamashita, "Data hosting
    infrastructure for primary biodiversity data," BMC Bioinformatics,
    vol. 12(Suppl 15), pp. 1-14, December 2011.
[7] P. J. Sadalage and M. Fowler, NoSQL distilled: a brief guide to the
    emerging world of polyglot persistence. Upper Saddle River, NJ:
    Addison-Wesley, 2013.
