Article in Nature journal Scientific Data: Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific
data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
Data Workflow
[Figure: data workflow cycle with the stages Plan, Collect, Assure, Describe, Preserve, Discover, Integrate and Analyze, and PUBLISH at the center]
Course Description
This course is intended for graduate students or highly qualified
undergraduates. The overall objective of this course is to teach
students how to conduct data analyses and modeling
experiments to produce Findable, Accessible, Interoperable and
Reusable (FAIR) science in hydrological, weather and climate
studies. Students will be exposed to the philosophy behind FAIR
principles, cyberinfrastructure resources for accessing data,
community-based computing tools, computer programming skills,
and publication of data and analysis results. Utilizing scaffolding
and templates provided in lecture and via online resources, the
students will apply what they have learned in authentic learning
exercises appropriate for research settings in hydrological,
weather and climate sciences.
Learning Outcomes
After taking this course, students will be able to:
(i) obtain and process hydrologic, weather and climate model
data using public domain resources;
(ii) perform geospatial and time series analysis of these data using
publicly available cyberinfrastructure tools;
(iii) process model data/output using publicly available toolkits;
and
(iv) publish data and results in the public domain for general
access and reproducibility.
Grading
20% for class participation,
40% for assignments/homework and
40% for final project.
Outline
Foundations I - principles and concepts
● Climate - Big Data Problem, data workflow
● FAIR Science
Foundations II - infrastructure and enabling technology
● MyGeoHub
● Jupyter Notebook
● GitHub
● NCAR Reanalysis Data
● Synda - Docker, Globus, ESGF
Applications
● ESMValTool - Anaconda
● Futureheatwaves - RStudio
● ARCCSS climate extremes -
Workflow I
Workflow II - build your own workflow ‘BYOW’
Foundations I:
Principles and Concepts
Climate - Big Data Problem
The big data problem
Typical Climate->Impacts Workflows
[Figure: typical climate-to-impacts workflow. Global model variables are downscaled to high resolution climate models, which feed impacts assessment and analysis.]
High Resolution Modeling
Global Model: CESM, 1° x 1° horizontal resolution, 30 minute time step
Regional Model: WRF, 3 km horizontal resolution (domains at 27 km, 9 km and 3 km), 20 second time step
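The grid spacings on this slide can be compared with a little arithmetic. The sketch below assumes a spherical Earth of radius 6371 km to convert the 1° global grid to kilometers, and it assumes that 27 km, 9 km and 3 km are successive nested domains; the function names are illustrative only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius (spherical approximation)

def degrees_to_km(deg):
    """Approximate north-south length of `deg` degrees of latitude in km."""
    return 2 * math.pi * EARTH_RADIUS_KM * deg / 360.0

def nest_ratios(spacings_km):
    """Ratio of each parent grid spacing to its child grid spacing."""
    return [parent / child for parent, child in zip(spacings_km, spacings_km[1:])]

global_dx_km = degrees_to_km(1.0)        # roughly 111 km for the 1 degree global grid
ratios = nest_ratios([27.0, 9.0, 3.0])   # assumed nesting order, coarse to fine
print(round(global_dx_km), ratios)
```

Under these assumptions each domain refines its parent by a factor of three, and the finest regional grid is several tens of times finer than the global grid.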
SIMULATION ROUTE MAP
Sensitivity Simulations using Global Model RCP 8.5 Output (1990, 2040, 2090)
• Model Re-Initialization
• Nesting Feedback
• Model Parameterization
• Planetary Boundary Layer
• Microphysics
• Land Surface
Future Climatology Simulations (2015-2100)
• Global Model RCP 8.5 driven
General Process of making a suitable
impacts model
•Parameter determination - through calibration, estimate
model parameters for land surface.
Avg. station density (area, km²/station): U.S. 700-1000
Hydrographs of routed
runoff show good
correspondence with
observed and
naturalized flows.
Validation with Illinois Soil Moisture
19 observing stations are compared to the 17 modeled 1/8° grid
cells that contain the observation points.
[Figure panels: Moisture Level, Moisture Flux, Variability, Persistence]
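Matching stations to model grid cells can be sketched as follows. This is a minimal illustration, assuming a regular 1/8° latitude-longitude grid aligned to whole multiples of the resolution; the station coordinates are hypothetical and are not the actual Illinois network. It shows one plausible reason why 19 stations might map to only 17 cells: two stations can fall inside the same cell.

```python
import math

def grid_cell(lat, lon, res=0.125):
    """Return the (row, col) index of the res-degree cell containing a point."""
    return (math.floor(lat / res), math.floor(lon / res))

# Hypothetical station coordinates (not the actual Illinois network).
stations = {
    "A": (40.05, -88.37),
    "B": (40.10, -88.30),   # falls in the same 1/8 degree cell as A
    "C": (38.95, -89.90),
}

cells = {name: grid_cell(lat, lon) for name, (lat, lon) in stations.items()}
unique_cells = set(cells.values())
print(len(stations), "stations ->", len(unique_cells), "grid cells")
```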
Key Ecosystem Services Vary by Region
Ecosystem Service | Biogeophysical Indicator | Biogeophysical Model
Wood for Timber, Fiber, Energy | Above ground C stock, annual NEP | Climate, LANDIS, PnET
Carbon Sequestration | Total C stock, annual NEP | Climate, PnET
Calibration / Validation Datasets:
● Surface climate observations, CoCoRaHS
● Forest Inventory & Eddy Tower Data
● Forest Inventory, Eddy Tower Data, & Sensor Data
● Sensor Data, USGS, & CoCoRaHS
Using impacts models to produce “indicators” or “metrics”, then convert these into $$
APPROACH
CREATE THE INFRASTRUCTURE AND OPEN-SOURCE TOOLS NECESSARY
TO DEVELOP A SELF-SUSTAINING COMMUNITY OF PRACTICE
[Figure: GeoHub at the center, connected to the following components]
● Maps, Data Visualization and Exploration
● Analysis of Tradeoffs and Synergies
● Modeling Frameworks and Computation
● Policy Briefs
● Community Interactions and Group Collaboration
● Training, Courses, Crowd Sourcing
Lecture 3 summary
FAIR Science
What is FAIR?
Findable. Accessible. Interoperable. Reusable.
● Supports the reuse of data
● Sharing knowledge and collaboration
● Leads to quicker discovery
● Simplifies the cycle of research
● Involves: researchers, publishers, and repositories
● Principles for: data, metadata, and infrastructure
Source: Shelley Stall “Assessing FAIRness within the Enabling FAIR Data project”
Research Data Ecosystem
Source: https://www.go-fair.org/fair-principles/
F1. (Meta)data are assigned a globally unique and eternally persistent
identifier
Do you assign a persistent identifier to data products in your
repository? If so, which PID type/scheme (e.g. DOI)? Do you
assign more than one type?
What kinds of things do you assign PIDs to? What is the
granularity (or granularities) of the things you assign PIDs to? Do
you assign them to individual data values or items, to individual
files, to coherent collections of files, and/or multiple
granularities?
F2. Data are described with rich metadata.
1. Do your available data products come with metadata accessible or browsable by human
users?
2. Do you attempt to capture good coverage of what are known as Dublin Core concepts?
3. Do you attempt to capture detailed metadata that is more specifically relevant to your user
community (i.e., that goes beyond Dublin Core)?
4. What standard or community metadata schemas do you support? Do you specifically
support any of the following: If you support any, do you support specific profiles of these
metadata schemas?
● Does your metadata include geolocation information?
● Does it include temporal information (e.g. coverage in time)?
● Does your repository accept metadata that is applicable to a specific discipline (and not
just generally applicable to all disciplines)? Does your repository disallow or reject
metadata that is specific to a particular discipline?
● Does your metadata include a concept of authors? Contact points? Are these separate
metadata elements?
● Do you capture ORCID or other PIDs for authors? If yes: which? Researcher-ID?
Scopus-ID?
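The questions above about Dublin Core coverage can be turned into a simple self-check. The sketch below lists the 15 elements of the Dublin Core Metadata Element Set and reports which ones a record leaves empty. The record itself, including its DOI, is hypothetical and for illustration only; the function name is not part of any library.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

def missing_elements(record):
    """Return Dublin Core elements that are absent or empty in `record`."""
    return sorted(e for e in DUBLIN_CORE if not record.get(e))

# Hypothetical record for illustration only.
record = {
    "title": "Daily maximum temperature, Midwest U.S.",
    "creator": "Jane Doe",
    "date": "2019-06-01",
    "identifier": "doi:10.0000/example",
    "coverage": "36N-49N, 104W-80W; 1990-2010",
    "rights": "CC BY 4.0",
}
print(missing_elements(record))
```

A check like this only measures coverage; it says nothing about whether the values are accurate or rich enough for a particular community.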
F3. Metadata clearly and explicitly include the identifier of the data they describe
Does your exportable metadata include the data’s PID (e.g., the
data DOI)?
Does your exportable metadata include other persistent
identifiers? ORCIDs? Literature (Crossref) DOIs? Sample
IGSNs? Author contributions (CRediT)?
F4. (Meta)data are registered or indexed in a searchable resource
Does your repository provide search capabilities of its contents?
Do you make your metadata searchable and/or indexable by any
external systems? Which ones?
Do you export your metadata through any of the following
mechanisms:
OAI-PMH
Linked Data Platform
ResourceSync
Landing Page meta tags or similar embedding mechanisms
Have you reviewed and ensured the existence and accuracy of
the re3data record for your repository?
FAIR Principles - Accessible
Accessible:
After the dataset is found, the user needs to be able to easily
access the datasets …
● A1 (meta)data are retrievable by their identifier using a
standardized communications protocol.
● A1.1 the protocol is open, free, and universally
implementable.
● A1.2 the protocol allows for an authentication and
authorization procedure, where necessary.
● A2 metadata are accessible, even when the data are no
longer available.
Source: https://www.go-fair.org/fair-principles/
A1. (Meta)data are retrievable by their identifier using a standardized
communications protocol.
To maximise data reuse, the protocol should be free (no-cost) and open (-sourced)
and thus globally implementable to facilitate data retrieval. Anyone with a computer
and an internet connection can access at least the metadata. Hence, this criterion
will impact your choice of the repository where you will share your data.
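As an illustration of retrieving metadata by identifier over an open protocol, the sketch below builds an HTTPS request that asks the DOI resolver for machine-readable metadata about the FAIR principles article cited in this course. It uses DOI content negotiation with the CSL JSON media type; whether a given registration agency honors that media type for a particular DOI is an assumption to verify. The request is built but not sent, so the example runs without network access.

```python
from urllib.request import Request

def doi_metadata_request(doi):
    """Build (but do not send) an HTTPS request for a DOI's metadata."""
    return Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )

req = doi_metadata_request("10.1038/sdata.2016.18")
print(req.full_url)
# To actually retrieve the record (requires network access):
#   from urllib.request import urlopen
#   import json
#   metadata = json.load(urlopen(req))
```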
Examples
This is a key, but often misunderstood, element of FAIR. The ‘A’ in FAIR does not
necessarily mean ‘open’ or ‘free’. Rather, it implies that one should provide the
exact conditions under which the data are accessible. Hence, even heavily
protected and private data can be FAIR. Ideally, accessibility is specified in such a
way that a machine can automatically understand the requirements, and then either
automatically execute the requirements or alert the user to the requirements. It
often makes sense to request users to create a user account for a repository. This
allows the repository to authenticate the owner (or contributor) of each dataset, and to potentially
set user-specific rights. Hence, this criterion will also affect your choice of the
repository where you will share your data.
Examples
HMAC authentication
HTTPS
FTPS
Telephone
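HMAC authentication, the first example above, can be sketched with the Python standard library. The idea is that the repository and the user share a secret key; the user signs each request, and the repository recomputes the signature to confirm who sent it. The key and the request string below are hypothetical, and real services define their own signing schemes.

```python
import hashlib
import hmac

def sign(secret, message):
    """Return the hex HMAC-SHA256 signature of `message` under `secret`."""
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify(secret, message, signature):
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(secret, message), signature)

secret = b"shared-secret"                 # hypothetical key issued to a user
message = b"GET /datasets/example-id"     # hypothetical request to be signed
signature = sign(secret, message)
print(verify(secret, message, signature))
```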
A2. Metadata are accessible, even when the data are no longer
available.
Examples
The RDF extensible knowledge representation model is a way to describe and structure datasets.
You can refer to the Dublin Core Schema as an example.
OWL
DAML+OIL
JSON LD
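To make the JSON-LD example concrete, the sketch below serializes a small metadata record that uses Dublin Core terms as its vocabulary. The `dcterms` prefix maps to the DCMI Metadata Terms namespace; the choice of properties is illustrative and is not a required profile.

```python
import json

doc = {
    "@context": {"dcterms": "http://purl.org/dc/terms/"},
    "@id": "https://doi.org/10.1038/sdata.2016.18",
    "dcterms:title": "The FAIR Guiding Principles for scientific data management and stewardship",
    "dcterms:creator": "Wilkinson, M. D. et al.",
    "dcterms:issued": "2016",
}
serialized = json.dumps(doc, indent=2)
print(serialized)
```

Because the property names resolve to shared, published terms, a machine reading this record can interpret "title" and "creator" the same way across repositories.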
I2. (Meta)data use vocabularies that follow FAIR principles.
Examples
Source: https://www.go-fair.org/fair-principles/
R1. Metadata have a plurality of accurate and relevant attributes.
It will be much easier to find and reuse data if many labels are attached to the data. Principle
R1 is related to F2, but R1 focuses on the ability of a user (machine or human) to decide if the data is
actually USEFUL in a particular context. To make this decision, the data publisher should provide not just
metadata that allows discovery, but also metadata that richly describes the context under which the data
was generated. This may include the experimental protocols, the manufacturer and brand of the
machine or sensor that created the data, the species used, the drug regime, etc. Moreover, R1 states
that the data publisher should not attempt to predict the data consumer’s identity and needs. We chose
the term ‘plurality’ to indicate that the metadata author should be as generous as possible in providing
metadata, even including information that may seem irrelevant.
Examples
Commonly used licenses like MIT or Creative Commons can be linked to your data.
Methods for marking up this metadata are provided by the DTL FAIRifier.
R1.2 (Meta)data are associated with their provenance.
For others to reuse your data, they should know where the data came from (i.e.,
clear story of origin/history, see R1), who to cite and/or how you wish to be
acknowledged. Include a description of the workflow that led to your data: Who
generated or collected it? How has it been processed? Has it been published
before? Does it contain data from someone else that you may have transformed or
completed? Ideally, this workflow is described in a machine-readable format.
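A machine-readable description of the workflow can start very small. The sketch below bundles a checksum of the data with who produced it, how it was processed, and what it was derived from. The field names, the creator, and the source DOI are hypothetical; formal models such as W3C PROV exist for richer provenance.

```python
import hashlib
import json

def provenance_record(data, creator, steps, derived_from=None):
    """Bundle a checksum of `data` with who made it and how."""
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "creator": creator,
        "processing_steps": list(steps),
        "derived_from": derived_from,
    }

record = provenance_record(
    data=b"station,tmax\nA,31.2\n",          # stand-in for the file contents
    creator="Jane Doe",                        # hypothetical author
    steps=["downloaded raw station file", "converted units to degrees C"],
    derived_from="doi:10.0000/example",        # hypothetical source dataset
)
print(json.dumps(record, indent=2))
```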
Examples
https://commons.wikimedia.org/wiki/File:Sampling_coral_microbiome_(27146437650).jpg
includes authorship details, and uses the Creative Commons Attribution Share Alike
license, which indicates exactly how the data author wishes to be cited.
R1.3 (Meta)data meet domain-relevant community standards.
It is easier to reuse data sets if they are similar: same type of data, data organised in a
standardised way, well-established and sustainable file formats, documentation (metadata)
following a common template and using common vocabulary. If community standards or
best practices for data archiving and sharing exist, they should be followed. For instance,
many communities have minimal information standards (e.g., MIAME, MIAPE). FAIR data
should at least meet those standards. Other community standards may be less formal, but
nevertheless, publishing (meta)data in a manner that increases its use(ability) for the
community is the primary objective of FAIRness. In some situations, a submitter may have
valid and specified reasons to divert from the standard good practice for the type of data to
be submitted. This should be addressed in the metadata. Note that quality issues are not
addressed by the FAIR principles. The data’s reliability lies in the eye of the beholder and
depends on the intended application.
Examples
http://schema.datacite.org/ [for general purpose, not domain-specific]
https://www.iso.org/standard/53798.html [geographic information and services]
http://cfconventions.org/ [climate and forecast]
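For climate data, following the CF conventions largely means attaching agreed attributes to each variable. The sketch below checks a variable's attributes for `units` and for at least one of `standard_name` or `long_name`. It is a simplified illustration that covers only these attributes, not a CF compliance checker, and the function name is hypothetical.

```python
def cf_attribute_problems(attrs):
    """List missing CF-style descriptive attributes for one variable."""
    problems = []
    if "units" not in attrs:
        problems.append("missing units")
    if "standard_name" not in attrs and "long_name" not in attrs:
        problems.append("missing standard_name or long_name")
    return problems

good = {"standard_name": "air_temperature", "units": "K"}
bad = {"description": "max temp"}
print(cf_attribute_problems(good), cf_attribute_problems(bad))
```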
FAIR Initiatives
Source: Shelley Stall “Assessing FAIRness within the Enabling FAIR Data project”
Helpful FAIR Resources
● Article describing the Enabling FAIR Data Project:
https://eos.org/editors-vox/enabling-findable-accessible-interoperable-and-reusable-data
● Outcome of the initial Stakeholder Meeting from Nov 16-17,
2017:
https://eos.org/agu-news/enabling-fair-data-across-the-earth-and-space-sciences
● DataONE webinar recording:
https://www.dataone.org/webinars/enabling-fair-data
● Enabling FAIR Data (high-level) Project Site:
http://www.copdess.org/home/enabling-fair-data-project/
Foundations II:
Infrastructure and Enabling technology
Setting up a FAIR environment
Jupyter within FAIR
Attributes: easy, free, open, shareable, customizable, various
languages, self-describing, findable and accessible
Good example of FAIR
Tips for how to write a notebook that is as FAIR as possible
Session this Wednesday on Jupyter
Downloading Jupyter Notebook
Online instructions for own machine: https://jupyter.org/install.html
Also available as tool in MyGeoHub!
To access:
1. Log into MyGeoHub
2. Click ‘Resources’ -> ‘Tools’
in top menu
3. Scroll and click ‘Jupyter
Notebook’
4. Click ‘Launch Tool’
MyGeoHub Jupyter Notebook
[video example]
GitHub
What is GitHub?
Hopefully you went to the session last week.
Intro to GitHub
GitHub is a good enabling tool for FAIR science
Download Reanalysis Data
Reanalysis is a scientific method for developing a comprehensive record of how weather and
climate are changing over time. In it, observations and a numerical model that simulates one or
more aspects of the Earth system are combined objectively to generate a synthesized estimate of
the state of the system. A reanalysis typically extends over several decades or longer, and covers
the entire globe from the Earth’s surface to well above the stratosphere. Reanalysis products are
used extensively in climate research and services, including for monitoring and comparing current
climate conditions with those of the past, identifying the causes of climate variations and change,
and preparing climate predictions. Information derived from reanalyses is also being used
increasingly in commercial and business applications in sectors such as energy, agriculture, water
resources, and insurance. https://reanalyses.org/
Validation with Observed Runoff
Hydrographs of routed
runoff show good
correspondence with
observed and
naturalized flows.
Downloading Reanalysis Data
RDA
Climate Data Guide
ECMWF
Get an OpenID
Open an account with NCAR
Open an account with ECMWF
SYNDA
What is SYNDA?
● SYNchronized DAta
● Command line tool
● Uses:
○ Search dataset
○ Search files
○ Download file(s)
○ Manage large number of files
○ Explore metadata
SYNDA and FAIR
● Open source
● Supports community
● Free database - enabling discovery
● Keeps track of metadata
● Maintains provenance information
Downloading SYNDA
Credentials needed:
● Globus - https://www.globus.org/
● ESGF - https://esgf-node.llnl.gov/user/add/?next=https://esgf-node.llnl.gov/projects/esgf-llnl/
Intro to ESGF
● ESGF - Earth System Grid Federation
○ Collaboration to develop, deploy, and maintain
software infrastructure
● Manages database for climate data from several
international and federal websites
○ Management, dissemination, and analysis of model
output and observational data
○ Over 700,000 datasets supported!
● Supports the Coupled Model Intercomparison Project
(CMIP)
● Used to discover, access, and analyze data
Downloading Docker
[Mac]
https://hub.docker.com/editions/community/docker-ce-desktop-mac
● Create an account to download
● Click “Get Docker”
● After download, double click docker.dmg to install
● Restart terminal after installation
● Check installation by running from terminal:
○ docker version
○ docker run hello-world
Downloading SYNDA
[Mac]
Commands (after docker installation inside terminal):
sudo docker pull prodiguer/synda
mkdir -p ~/synda/sdt/conf
curl https://raw.githubusercontent.com/Prodiguer/synda/master/sdt/conf/sdt.conf -o ~/synda/sdt/conf/sdt.conf
curl https://raw.githubusercontent.com/Prodiguer/synda/master/sdt/conf/credentials.conf -o ~/synda/sdt/conf/credentials.conf
vi ~/synda/sdt/conf/credentials.conf
Downloading SYNDA
[Mac]
Configuration:
Add into your .bashrc or .bash_profile
export ST_HOME=$HOME/sdt
export PATH=$ST_HOME/bin:$PATH
Edit credentials for ESGF and Globus
vi $ST_HOME/conf/credentials.conf
SYNDA should be up and working on your machine!
Downloading Synda - Video Example
SYNDA
[Useful Commands]
Search:
synda search rcp85 3hr start=2005-01-01T00:00:00Z end=2100-12-31T23:59:59Z (dataset)
synda search rcp85 3hr timeslice=20050101-21001231 -f (file)
Download:
synda get tasmax_day_FGOALS-s2_piControl_r1i1p1_20160101-20161231.nc
synda install CMIP5 CNRM-CM5 tas pr areacello
Other:
synda param
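The file name passed to `synda get` above follows the CMIP5 naming pattern: variable, MIP table, model, experiment, ensemble member, and an optional time range, separated by underscores. The sketch below splits such a name into labeled parts, which is handy when organizing many downloaded files. The function and key names are illustrative and are not part of SYNDA.

```python
def parse_cmip5_filename(filename):
    """Split a CMIP5-style file name into its underscore-separated facets."""
    stem = filename[:-3] if filename.endswith(".nc") else filename
    parts = stem.split("_")
    keys = ["variable", "mip_table", "model", "experiment", "ensemble"]
    facets = dict(zip(keys, parts))
    if len(parts) > len(keys):          # optional temporal subset
        start, _, end = parts[5].partition("-")
        facets["start"], facets["end"] = start, end
    return facets

facets = parse_cmip5_filename(
    "tasmax_day_FGOALS-s2_piControl_r1i1p1_20160101-20161231.nc"
)
print(facets)
```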
Helpful SYNDA Resources
Github:
https://github.com/Prodiguer/synda
Documentation:
http://prodiguer.github.io/synda/
Earth System Grid Federation presentation:
https://esgf.llnl.gov/esgf-media/2018-F2F/2018-12-05/2018-12-05_-_ESGF_F2F_Synda-BenNasser.pdf
SYNDA Examples
Applications
ESMValTool
What is ESMValTool?
Earth System Model eValuation Tool
● Community diagnostics and performance metric tool for the
evaluation of Earth System Models (ESMs)
Features:
● Facilitates the evaluation of ESMs
● Standardizes model evaluation against observations or
other models
● Wide scope of different diagnostics and performance
metrics covering ESMs (dynamics, radiation, clouds, carbon
cycle, chemistry, aerosol, sea-ice, etc.)
ESMValTool Structure
Conda and FAIR
TODO
Downloading Conda
● https://www.anaconda.com/distribution/#macos
● Click “download”
● Open package under downloads
○ Continue through installation process
● Agree to the terms of the software license
agreement
● Install
● Once installation is complete, close the installer
and move to trash
Downloading ESMValTool
[conda]
● Update conda: conda update -y conda
● Create environment using python 3:
conda create --name esmvaltool python=3
conda env update --name esmvaltool --file ESMValTool/environment.yml
● Activate the environment:
○ Bash: source activate esmvaltool
○ Tcsh: conda activate esmvaltool
● Software installation: python ESMValTool/setup.py install
● Check if installation is successful: esmvaltool --version
Downloading ESMValTool
[source]
Downloading ARCCSS climate extremes
Workflow I
Workflow II:
‘BYOW’ Bring Your Own Workflow