
A Computational Framework for Integrating and Retrieving Biodiversity Data on a Large Scale
Daniel L. da Silva, Pedro L. P. Corrêa
Escola Politécnica da Universidade de São Paulo (EPUSP)
São Paulo, Brazil
{daniellins, pedro.correa}@usp.br

Silvio L. Stanzani, Paulo André Filipak
Escola Politécnica da Universidade de São Paulo (EPUSP)
São Paulo, Brazil
{silvio.stanzani, pfilipak}@usp.br

Andreiwid Sheffer Corrêa
Escola Politécnica da Universidade de São Paulo (EPUSP)
IMA, IFSP
São Paulo, Brazil
andreiwid@usp.br

Abstract: The digitization and integration of biodiversity data are essential for supporting environmental conservation and the sustainable use of natural resources. An increasing amount of data is now made available by regional, national and global initiatives, but using these data efficiently is still a challenge. New techniques are needed to enable the efficient management and use of these various types of biotic and abiotic data to generate useful knowledge for decision-making processes. We present work-in-progress research that proposes a computational framework to manage biodiversity data and to enable an efficient information retrieval process.
Keywords: Biodiversity; Architecture; Big Data; Integration; NoSQL; Information Retrieval

I. INTRODUCTION

The area known as Biodiversity Informatics (BI) faces the challenge of applying information technologies to the management, algorithmic exploration, analysis and interpretation of environmental data, particularly at the species level [1]. Among the BI challenges, the integration of biodiversity databases has gained momentum in recent years due to its importance for environmental management and for decision-making processes on the conservation and sustainable use of natural resources [2], [3]. However, many biodiversity data are not yet accessible or discoverable. A survey conducted in 2000 found that most of the world's legacy biodiversity data were not even digitized [4].
Considering this scenario, several initiatives for the digitization and integration of biodiversity data have been undertaken, and others are ongoing, at three main levels: regional, national and global. We can highlight the projects coordinated by Biodiversity Information Standards, also known as the Taxonomic Databases Working Group (TDWG), and by the Global Biodiversity Information Facility (GBIF).
Based on the results of these initiatives, we identified new issues to be resolved in the BI area:
- How to manage the increasing amount of data from digitization projects, system integrations and scientific experiments?
- How to organize these data in a model that meets the diverse needs of researchers, institutions and citizens?
- How to optimize the location and retrieval of information for use in analysis tools and decision-making processes?
- How to enable the integration of biodiversity data with other relevant information domains, such as social and economic data?
This paper presents a work-in-progress project that proposes a computational framework for integrating and managing biodiversity data, focusing on the key issues presented above.
Our approach is based on Big Data technologies, Web Services and metadata standards to enable the organization, integration and availability of biodiversity data.

II. COMPUTATIONAL ARCHITECTURE OVERVIEW

Figure 1 shows the proposed architecture. It consists of four functional layers: Offline Processing, Data Management, Services and Data Publication.

[Figure 1: layered architecture diagram. Labels, from top to bottom: Data Publication (Data Portal, Geospatial Portal, Analytics Tools); Services (Occurrence Data, Taxon Data, Geospatial Data); Data Management (Cassandra, Solr, Neo4j, Lucene, PostgreSQL/PostGIS, Hadoop, Shapefiles); Offline Processing (Harvesting, QA/QC Process, Metadata Standards); Data Provider 1, Data Provider 2, ..., Data Provider N.]
Figure 1 - Computational Architecture for Biodiversity Data Management.

A. Offline Processing Layer
This layer is responsible for data harvesting, Quality Assurance and Quality Control processes. It considers the use of metadata standards, such as Darwin Core [5], and internationally recognized protocols for biodiversity data interoperability, such as IPT, TAPIR and DiGIR [6].
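
As an illustration of the kind of quality control this layer could apply to harvested records, the following is a minimal sketch, not the project's actual QA/QC process. It assumes each record is a dictionary keyed by Darwin Core term names; which terms are treated as mandatory, and the function names `check_occurrence` and `split_by_quality`, are assumptions made for this example.

```python
from datetime import date

# Darwin Core term names used as dictionary keys. Treating these three
# terms as mandatory is an assumption made for this sketch.
REQUIRED_TERMS = ("occurrenceID", "scientificName", "basisOfRecord")


def check_occurrence(record):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for term in REQUIRED_TERMS:
        if not str(record.get(term, "")).strip():
            problems.append(f"missing {term}")

    # Coordinates are optional, but when present they must be valid numbers.
    for term, limit in (("decimalLatitude", 90.0), ("decimalLongitude", 180.0)):
        value = record.get(term)
        if value is None or value == "":
            continue
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"{term} is not numeric")
            continue
        if not -limit <= number <= limit:
            problems.append(f"{term} out of range")

    # eventDate, when present, must parse as an ISO 8601 calendar date.
    event_date = record.get("eventDate")
    if event_date:
        try:
            date.fromisoformat(event_date)
        except (TypeError, ValueError):
            problems.append("eventDate is not a valid ISO 8601 date")
    return problems


def split_by_quality(records):
    """Separate harvested records into accepted and rejected lists."""
    accepted, rejected = [], []
    for record in records:
        problems = check_occurrence(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected
```

Rejected records are returned together with their list of problems, so a data provider could be told what to correct instead of having records silently dropped.
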
B. Data Management Layer
This layer performs the storage and organization of data collected from data providers. Based on the concept of polyglot persistence, it uses several database technologies (relational and non-relational) simultaneously, each of which is responsible for managing a particular data type or solving a specific issue [7]. Non-relational databases have features that are useful for biodiversity data management, such as the capacity to handle very large amounts of data, flexible schemas, horizontal scalability and low cost. Indexing and data processing tools are also part of this layer.
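
The idea of polyglot persistence can be sketched as a repository that routes each kind of data to the store suited to it. The example below is illustrative only: the two stores are in-memory stand-ins, not real database drivers, and all class names, method names and identifiers are hypothetical.

```python
from collections import defaultdict


class WideRowStore:
    """Stand-in for a column-oriented store: each row keeps its own columns."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, columns):
        # Rows may carry different column sets (flexible schema).
        self.rows[row_key].update(columns)

    def get(self, row_key):
        return dict(self.rows.get(row_key, {}))


class GraphStore:
    """Stand-in for a graph store: labelled edges between named nodes."""

    def __init__(self):
        self.edges = set()

    def relate(self, source, relation, target):
        self.edges.add((source, relation, target))

    def targets(self, source, relation):
        return {t for s, r, t in self.edges if s == source and r == relation}


class PolyglotRepository:
    """Routes each kind of data to the store that handles it."""

    def __init__(self):
        self.occurrences = WideRowStore()
        self.interactions = GraphStore()

    def save_occurrence(self, identifier, fields):
        self.occurrences.put(identifier, fields)

    def save_interaction(self, taxon, relation, other):
        self.interactions.relate(taxon, relation, other)


# Hypothetical usage: two occurrence rows with different columns,
# plus one interaction between two placeholder taxa.
repo = PolyglotRepository()
repo.save_occurrence("urn:lsid:example.org:occ:1",
                     {"dwc:scientificName": "Cedrela fissilis Vell.", "dbh": "95cm"})
repo.save_occurrence("urn:lsid:example.org:occ:2",
                     {"dwc:collectionCode": "CENAP"})
repo.save_interaction("taxon A", "EATS", "taxon B")
```

Callers depend only on the repository, so a store could be replaced by a real database client without changing the code that saves or reads data.
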
C. Web Services Layer
This layer provides services for locating and consuming the stored data. The services are organized into components according to their type and function, and they are responsible for providing information to the data publishing systems and to external tools. We chose this strategy to keep applications autonomous and to facilitate the reuse of these components.
Due to data heterogeneity and the standards considered in the architecture, these services could provide data in varying formats, such as Darwin Core, EML and Plinian Core, expanding the ability to integrate with other systems and computational tools.
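
One way a service component could offer the same record in more than one format is a small table of serializers selected by a format name. The sketch below is an assumption about how this might look, not the project's API. It covers only JSON (not mentioned in the architecture, used here for contrast) and a simplified XML that places fields in the Darwin Core terms namespace. The `record` wrapper element and the format names are hypothetical, and EML and Plinian Core are not covered.

```python
import json
import xml.etree.ElementTree as ET

DWC_NS = "http://rs.tdwg.org/dwc/terms/"  # Darwin Core terms namespace


def to_json(record):
    """Serialize the record as a JSON object with sorted keys."""
    return json.dumps(record, sort_keys=True)


def to_dwc_xml(record):
    """Wrap each field in an element from the Darwin Core terms namespace."""
    ET.register_namespace("dwc", DWC_NS)
    root = ET.Element("record")  # wrapper element name is an assumption
    for term, value in sorted(record.items()):
        child = ET.SubElement(root, f"{{{DWC_NS}}}{term}")
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Format names are hypothetical; new formats are added by registering
# another function here.
SERIALIZERS = {"json": to_json, "dwc-xml": to_dwc_xml}


def render(record, fmt):
    """Return the record in the requested format or raise ValueError."""
    try:
        serializer = SERIALIZERS[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt}") from None
    return serializer(record)
```

Keeping serializers separate from the retrieval logic means each consumer can request the representation it understands while the stored data stays the same.
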
D. Data Publication Layer
This layer publishes the integrated information through a data portal that provides tools for locating and manipulating taxonomic data and specimen occurrence data. It also provides geospatial and analysis tools.

III. STUDY CASE 1

This study case focused on experiments with two NoSQL databases for representing biodiversity data.
Apache Cassandra is a column-oriented database, and its flexible schema enables the management of heterogeneous data from different institutions without loss of information. Figure 2 shows a Cassandra column family storing species occurrence data indexed by Life Science Identifiers (LSID).

Row key: urn:lsid:icmbio.gov.br:occ:MA1202
    dwc:collectionCode = CENAP
    dwc:scientificName = C. brachyurus
    ...
Row key: urn:lsid:biocol.org:col:15528
    abcd:SourceID = RB
    dwc:scientificName = Cedrela fissilis Vell.
    dbh = 95cm
    ...
Figure 2 - Column family for species occurrence data.

Neo4j is a graph database that proved effective for managing interactions between species and the environment, enabling fast answers to questions such as "Find endangered species that eat snakes and live in forests." Figure 3 shows a biodiversity data representation based on a graph database.

[Figure 3: graph diagram. Node labels: Taxon, Specimen, Publication, Conservation Status, Community, Biome, Location, Time, Life Stage. Relationship labels: EATS, POLLINATES, CLASSIFIED AS, DESCRIBED AT, USED BY, FOUND AT, COLLECTED AT, IN STAGE.]
Figure 3 - Species Interaction Data Model based on a Graph Database.

We are conducting further experiments with other tools and technologies that may later be considered in this architecture.

IV. CONSIDERATIONS AND FUTURE WORK

In this paper, we propose a framework to manage biodiversity data provided by several institutions. Based on the research goals, we seek to define a computational framework able to organize data efficiently and to facilitate the information retrieval process, in order to support decision-making in the conservation and sustainable use of natural resources.
Experiments with computational tools and algorithms are in progress to define the process of organizing data based on their metadata.

ACKNOWLEDGMENT
This research is supported by the Amazonas Research Foundation (FAPEAM), the Brazilian Ministry of the Environment (MMA) and the Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ) GmbH.

REFERENCES
[1] J. Soberón and T. Peterson, "Biodiversity informatics: managing and applying primary biodiversity data," Philos. Trans. R. Soc. Lond. B. Biol. Sci., vol. 359, pp. 689-698, April 2004.
[2] R. D. M. Page, "Biodiversity informatics: the challenge of linking data and the role of shared identifiers," Brief. Bioinform., vol. 9(5), pp. 345-354, January 2008.
[3] V. S. Chavan and P. Ingwersen, "Towards a data publishing framework for primary biodiversity data: challenges and potentials for the biodiversity informatics community," BMC Bioinformatics, vol. 10(Suppl 14), pp. 1-11, November 2009.
[4] F. A. Bisby, "The Quiet Revolution: Biodiversity Informatics and the Internet," Science, vol. 289, pp. 2309-2312, September 2000.
[5] J. Wieczorek, D. Bloom, R. Guralnick, S. Blum, M. Döring, R. Giovanni, T. Robertson, and D. Vieglais, "Darwin Core: An Evolving Community-Developed Biodiversity Data Standard," PLoS ONE, vol. 7(1), e29715, January 2012.
[6] A. Goddard, N. Wilson, P. Cryer, and G. Yamashita, "Data hosting infrastructure for primary biodiversity data," BMC Bioinformatics, vol. 12(Suppl 15), pp. 1-14, December 2011.
[7] P. J. Sadalage and M. Fowler, NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Upper Saddle River, NJ: Addison-Wesley, 2013.
