
CyberTraining

FAIR: Foundations, Applications, and Workflow


Matthew Huber and Venkatesh Merwade with
assistance from Ashley Dicks, Purdue University
FAIR Guiding Principles
FAIR is…
Findable
Accessible
Interoperable
Reusable

Article in Nature journal Scientific Data: Wilkinson, M. D. et al. The FAIR Guiding Principles for scientific
data management and stewardship. Sci. Data 3:160018 doi: 10.1038/sdata.2016.18 (2016).
Data Workflow
[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
Course Description
This course is intended for graduate students or highly qualified
undergraduates. The overall objective of this course is to teach
students how to conduct data analyses and modeling
experiments to produce Findable, Accessible, Interoperable and
Reusable (FAIR) science in hydrological, weather and climate
studies. Students will be exposed to the philosophy behind FAIR
principles, cyberinfrastructure resources for accessing data,
community-based computing tools, computer programming skills,
and publication of data and analysis results. Utilizing scaffolding
and templates provided in lecture and via online resources, the
students will apply what they have learned in authentic learning
exercises appropriate for research settings in hydrological,
weather and climate sciences.
Learning Outcomes
After taking this course, students will be able to:
(i) obtain and process hydrologic, weather and climate model
data using public domain resources;
(ii) perform geospatial and time series analysis of this data using
publicly available cyberinfrastructure tools;
(iii) process model data/output using publicly available toolkits;
and
(iv) publish data and results in the public domain for general
access and reproducibility.
Grading
20% for class participation,
40% for assignments/homework and
40% for final project.
Outline
Foundations I - principles and concepts
● Climate - Big Data Problem, data workflow
● FAIR Science
Foundations II - infrastructure and enabling technology
● MyGeoHub
● Jupyter Notebook
● GitHub
● NCAR Reanalysis Data
● Synda - Docker, Globus, ESGF
Applications
● ESMValTool - Anaconda
● Futureheatwaves - RStudio
● ARCCSS climate extremes -
Workflow I
Workflow II - build your own workflow ‘BYOW’
Foundations I:
Principles and Concepts
Climate - Big Data Problem
The big data problem
Typical Climate->Impacts Workflows

[Figure: typical climate-to-impacts workflow — global climate model variables → DOWNSCALING → high-resolution modeling → impacts assessment]
Global Model: CESM, 1° x 1° horizontal resolution, 30-minute time step
Regional Model: WRF, nested at 27 km / 9 km / 3 km, 3 km horizontal resolution, 20-second time step
SIMULATION ROUTE MAP
• Sensitivity simulations using global model RCP 8.5 output (1990, 2040, 2090): model re-initialization, nesting feedback, model parameterization (planetary boundary layer, microphysics, land surface)
• Historical simulations (1990-2014): regional model driven by global model output and by observations / reanalysis products
• Future climatology simulations (2015-2100): regional model driven by global model RCP 8.5 output
General Process of making a suitable impacts model
• Parameter determination - through calibration, estimate model parameters for land surface.
• Producing “pseudo-observations” - provide a baseline of data for model intercomparisons.
• Establish model climatology - for identifying long-term excursions and characterizing variability.
Typical hydrology model workflow (VIC)
• Model developed originally for 15 sub-regions, with consistent sources.
• Surface forcing data:
  • Daily precipitation, maximum and minimum temperatures (from gauge measurements)
  • Radiation and humidity parameterized from Tmax and Tmin
  • Wind (from NCEP/NCAR reanalysis)
• Soil parameters: derived from Penn State STATSGO in the U.S., FAO global soil map elsewhere.
• Vegetation coverage from the University of Maryland 1-km global land cover dataset.
Temperature and Precipitation Data
Precipitation and temperature from gauge observations gridded to 1/8°.

Avg. station density (km²/station): U.S. 700-1000; Canada 2500; Mexico 6000

Within the U.S.:
• Precipitation adjusted for time-of-observation
• Precipitation re-scaled to match PRISM mean for 1961-90
Validation with Observed Runoff

Hydrographs of routed runoff show good correspondence with observed and naturalized flows.
Validation with Illinois Soil Moisture
19 observing stations are compared to the 17 1/8º modeled grid
cells that contain the observation points.

Moisture
Level

Moisture
Flux

Variabilit
y

Persistenc
e
Key Ecosystem Services Vary by Region

Ecosystem Service | Biogeophysical Indicator | Biogeophysical Model
Wood for Timber, Fiber, Energy | Above-ground C stock, annual NEP | Climate, LANDIS, PnET
Carbon Sequestration | Total C stock, annual NEP | Climate, PnET
Land Surface Albedo | Mean annual albedo, translated into radiative forcing | Climate, LANDIS
Forest Biodiversity | Plant species number | LANDIS
Freshwater Quality Regulation (N, temperature, salt) | Upstream N removal; length of rivers above thresholds | PnET, FrAMES
Water Supply | Annual flow frequency at intakes | PnET, FrAMES
Flood Control | Runoff coefficients relative to reservoir capacity by subbasin | PnET, FrAMES
Recreation: Fish Habitat | Index of quality: flow, temperature, conductivity regime | FrAMES
Model Integration To Understand Spatio-temporal Dynamics in Ecosystem Services

[Figure: calibration / validation datasets — surface climate observations, CoCoRaHS, forest inventory & eddy tower data, sensor data, USGS]

Using impacts models to produce “indicators” or “metrics”, then convert these into $$
APPROACH
CREATE THE INFRASTRUCTURE AND OPEN-SOURCE TOOLS NECESSARY
TO DEVELOP A SELF-SUSTAINING COMMUNITY OF PRACTICE

[Figure: GeoHub at the center, connecting — Maps, Data Visualization and Exploration; Modeling Frameworks and Computation; Analysis of Tradeoffs and Synergies; Policy Briefs; Community Interactions and Group Collaboration; Training, Courses, Crowd Sourcing]
Lecture 3 summary

Standard workflow for climate impacts studies:
● Get global climate model output
● Downscale it.
● Develop, or simply use existing model/models of impacts/subsystems.
● Calibrate that impacts model and validate it.
● Run impacts model with modern climate model output
● Compare results when driven by future model output
● Then what?
Register for mygeohub.org
Data Workflow
[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
FAIR Science
What is FAIR?
Findable. Accessible. Interoperable. Reusable.
● Supports the reuse of data
● Sharing knowledge and collaboration
● Leads to quicker discovery
● Simplifies the cycle of research
● Involves: researchers, publishers, and repositories
● Principles for: data, metadata, and infrastructure

The core of FAIR science is data management!


FAIR Objectives
● FAIR-compliant data repositories will add value to research
data, provide metadata and landing pages for
discoverability, and support researchers with documentation
guidance, citation support, and data curation.
● FAIR-compliant Earth and space science publishers will align
their policies to establish a similar experience for
researchers. Data will be available through citations that
resolve to repository landing pages. Data are not placed in
the supplement.
● Publishers and repositories are working together towards
the advancement of FAIR science

Source: Shelley Stall “Assessing FAIRness within the Enabling FAIR Data project”
Research Data Ecosystem

Authors: Shelley Stall and Erin Robinson


FAIR Principles - Findable
Findable:
The datasets and resources should be easily located by humans
and computers …
● F1. (meta)data are assigned a globally unique and eternally
persistent identifier.
● F2. data are described with rich metadata.
● F3. (meta)data are registered or indexed in a searchable
resource.
● F4. metadata specify the data identifier.

Source: https://www.go-fair.org/fair-principles/
F1. (Meta)data are assigned a globally unique and eternally persistent
identifier
Do you assign a persistent identifier to data products in your
repository? If so, which PID type/scheme (e.g. DOI)? Do you
assign more than one type?
What kinds of things do you assign PIDs to? What is the
granularity (or granularities) of the things you assign PIDs to? Do
you assign them to individual data values or items, to individual
files, to coherent collections of files, and/or multiple
granularities?
F2. Data are described with rich metadata.
1. Do your available data products come with metadata accessible or browsable by human
users?
2. Do you attempt to capture good coverage of what are known as Dublin Core concepts?
3. Do you attempt to capture detailed metadata that is more specifically relevant to your user
community (i.e., that goes beyond Dublin Core)?
4. What standard or community metadata schemas do you support? Do you specifically
support any of the following? If you support any, do you support specific profiles of these
metadata schemas?
● Does your metadata include geolocation information?
● Does it include temporal information (e.g. coverage in time)?
● Does your repository accept metadata that is applicable to a specific discipline (and not
just generally applicable to all disciplines)? Does your repository disallow or reject
metadata that is specific to a particular discipline?
● Does your metadata include a concept of authors? Contact points? Are these separate
metadata elements?
● Do you capture ORCID or other PIDs for authors? If yes: which? Researcher-ID?
Scopus-ID?
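To make these questions concrete, here is a minimal sketch (in Python, with every value invented purely for illustration) of the kind of Dublin Core-style record, plus a few domain-specific extensions, that F2 is asking about:

# Illustrative Dublin Core-style descriptive metadata for a hypothetical
# gridded climate dataset. Field names follow common Dublin Core terms;
# all values (names, DOI, coverage) are placeholders, not real records.
dataset_metadata = {
    "title": "Daily gridded precipitation, example domain, 1990-2014",
    "creator": ["Jane Researcher (ORCID: 0000-0000-0000-0000)"],
    "publisher": "Example University Data Repository",
    "identifier": "doi:10.xxxx/example",        # persistent identifier (see F1)
    "subject": ["precipitation", "hydrology", "climate"],
    "description": "Gauge-based daily precipitation gridded to 1/8 degree.",
    "date": "2019-01-01",
    "type": "Dataset",
    "format": "application/x-netcdf",
    "language": "en",
    "rights": "CC-BY-4.0",
    # Domain-specific extensions beyond Dublin Core (F2 asks about these):
    "spatial_coverage": {"west": -125.0, "east": -67.0, "south": 25.0, "north": 53.0},
    "temporal_coverage": {"start": "1990-01-01", "end": "2014-12-31"},
}

for key, value in dataset_metadata.items():
    print(f"{key}: {value}")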
F3. (Meta)data are registered or indexed in a searchable resource
Does your exportable metadata include the data’s PID (e.g., the
data DOI)?
Does your exportable metadata include other persistent
identifiers? ORCiDs? Literature (Crossref) DOIs? Sample
IGSNs? Author contributions (CRediT)?
F4. Metadata specify the data identifier
Does your repository provide search capabilities of its contents?
Do you make your metadata searchable and/or indexable by any
external systems? Which ones?
Do you export your metadata through any of the following
mechanisms:
OAI-PMH
Linked Data Platform
ResourceSync
Landing Page meta tags or similar embedding mechanisms
Have you reviewed and ensured the existence and accuracy of
the re3data record for your repository?
FAIR Principles - Accessible
Accessible:
After the dataset is found, the user needs to be able to easily
access the datasets …
● A1 (meta)data are retrievable by their identifier using a
standardized communications protocol.
● A1.1 the protocol is open, free, and universally
implementable.
● A1.2 the protocol allows for an authentication and
authorization procedure, where necessary.
● A2 metadata are accessible, even when the data are no
longer available.

Source: https://www.go-fair.org/fair-principles/
A1. (Meta)data are retrievable by their identifier using a standardized
communications protocol.

1. Do you provide a landing page accessible by resolving a PID assigned by your


repository?
2. Do you support any machine-actionable data access mechanisms in which data can
be retrieved given its identifier? Which standard mechanisms do you support? Are
any considered specific to your repository?
3. Do you support access to metadata via URLs and content-negotiation?
4. Do you embed machine-readable metadata in your Landing Pages? Do you embed
via HTML <meta> tags? Do you embed JSON-LD data? XML?
5. Does access to any data in your repository require authentication and
authorization? Which machine-actionable access mechanisms support
authentication?
6. Do you support any open standards for authentication and authorization?
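As a concrete illustration of A1, the short Python sketch below retrieves machine-readable metadata for a DOI over plain HTTP using content negotiation against doi.org. The DOI shown is the Wilkinson et al. (2016) FAIR paper cited earlier; the requests library is assumed to be installed.

# Principle A1 in miniature: metadata retrievable by identifier over a
# standardized, open protocol (HTTP). DOI resolvers support content
# negotiation, so a machine can request structured metadata instead of a
# human-oriented landing page.
import requests

doi = "10.1038/sdata.2016.18"   # the FAIR Guiding Principles paper
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()

metadata = response.json()
print(metadata.get("title"))
print(metadata.get("publisher"))
print([author.get("family") for author in metadata.get("author", [])])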
A1.1 The protocol is open, free, and universally implementable.

To maximise data reuse, the protocol should be free (no-cost) and open (-sourced)
and thus globally implementable to facilitate data retrieval. Anyone with a computer
and an internet connection can access at least the metadata. Hence, this criterion
will impact your choice of the repository where you will share your data.

Examples

HTTP, FTP, SMTP, …


Telephone (arguably not universally-implementable, but close enough)
A counter-example would be Skype, which is not universally-implementable
because it is proprietary
Microsoft Exchange Server protocol is also proprietary
A1.2 The protocol allows for an authentication and authorization
procedure, where necessary.

This is a key, but often misunderstood, element of FAIR. The ‘A’ in FAIR does not
necessarily mean ‘open’ or ‘free’. Rather, it implies that one should provide the
exact conditions under which the data are accessible. Hence, even heavily
protected and private data can be FAIR. Ideally, accessibility is specified in such a
way that a machine can automatically understand the requirements, and then either
automatically execute the requirements or alert the user to the requirements. It
often makes sense to request users to create a user account for a repository. This
makes it possible to authenticate the owner (or contributor) of each dataset, and to potentially
set user-specific rights. Hence, this criterion will also affect your choice of the
repository where you will share your data.

Examples
HMAC authentication
HTTPS
FTPS
Telephone
A2. Metadata are accessible, even when the data are no longer
available.

Do you expose publicly any metadata for data with restricted


access—i.e. requiring authentication and authorization to
access?
Do you preserve access to metadata about data products that
are no longer available?
A2. Metadata are accessible, even when the data are no longer
available.

Datasets tend to degrade or disappear over time because there is a cost to


maintaining an online presence for data resources. When this happens, links
become invalid and users waste time hunting for data that might no longer be there.
Storing the metadata generally is much easier and cheaper. Hence, principle A2
states that metadata should persist even when the data are no longer sustained. A2
is related to the registration and indexing issues described in F4.

Examples

Metadata are valuable in and of themselves, when planning research, especially


replication studies. Even if the original data are missing, tracking down people,
institutions or publications associated with the original research can be extremely
useful.
FAIR Principles - Interoperable
Interoperable:
The datasets need to be in a format that is usable by others, and
therefore need to satisfy the following …
● I1. (meta)data use a formal, accessible, shared, and broadly
applicable language for knowledge representation.
● I2. (meta)data use vocabularies that follow FAIR principles.
● I3. (meta)data include qualified references to other
(meta)data.

Source for next set of slides: https://www.go-fair.org/fair-principles/


I1. (Meta)data use a formal, accessible, shared, and broadly applicable
language for knowledge representation.
Humans should be able to exchange and interpret each other’s data (so preferably do not use
dead languages). But this also applies to computers, meaning that data should be readable
for machines without the need for specialised or ad hoc algorithms, translators, or mappings.
Interoperability typically means that each computer system at least has knowledge of the other
system’s data exchange formats. For this to happen and to ensure automatic findability and
interoperability of datasets, it is critical to use (1) commonly used controlled vocabularies,
ontologies, thesauri (having resolvable globally unique and persistent identifiers, see F1) and (2)
a good data model (a well-defined framework to describe and structure (meta)data).

Examples
The RDF extensible knowledge representation model is a way to describe and structure datasets.
You can refer to the Dublin Core Schema as an example.
OWL
DAML+OIL
JSON-LD
I2. (Meta)data use vocabularies that follow FAIR principles.

The controlled vocabulary used to describe datasets needs to be documented and


resolvable using globally unique and persistent identifiers. This documentation
needs to be easily findable and accessible by anyone who uses the dataset.

Examples

Using the FAIR Data Point ensures I2.


Links to resources

FAIR Data Point specification


I3. (Meta)data include qualified references to other (meta)data.

A qualified reference is a cross-reference that explains its intent. For example, X is


regulator of Y is a much more qualified reference than X is associated with Y, or X
see also Y. The goal therefore is to create as many meaningful links as possible
between (meta)data resources to enrich the contextual knowledge about the data,
balanced against the time/energy involved in making a good data model. To be
more concrete, you should specify if one dataset builds on another data set, if
additional datasets are needed to complete the data, or if complementary
information is stored in a different dataset. In particular, the scientific links between
the datasets need to be described. Furthermore, all datasets need to be properly
cited (i.e., including their globally unique and persistent identifiers).

Examples

FAIR Data Point


http://www.uniprot.org/uniprot/C8V1L6.rdf
FAIR Principles - Reusable
Reusable:
The datasets need to be able to be used by various people,
therefore must have clear metadata …
● R1. meta(data) have a plurality of accurate and relevant
attributes.
● R1.1. (meta)data are released with a clear and accessible
data usage license.
● R1.2. (meta)data are associated with their provenance.
● R1.3. (meta)data meet domain-relevant community
standards.

Source: https://www.go-fair.org/fair-principles/
R1. Metadata have a plurality of accurate and relevant attributes.
It will be much easier to find and reuse data if many labels are attached to the data. Principle
R1 is related to F2, but R1 focuses on the ability of a user (machine or human) to decide if the data is
actually USEFUL in a particular context. To make this decision, the data publisher should provide not just
metadata that allows discovery, but also metadata that richly describes the context under which the data
was generated. This may include the experimental protocols, the manufacturer and brand of the
machine or sensor that created the data, the species used, the drug regime, etc. Moreover, R1 states
that the data publisher should not attempt to predict the data consumer’s identity and needs. We chose
the term ‘plurality’ to indicate that the metadata author should be as generous as possible in providing
metadata, even including information that may seem irrelevant.

Some points to take into consideration (non-exhaustive list):


Describe the scope of your data: for what purpose was it generated/collected?
Mention any particularities or limitations about the data that other users should be aware of.
Specify the date of generation/collection of the data, the lab conditions, who prepared the data, the
parameter settings, the name and version of the software used.
Is it raw or processed data?
Ensure that all variable names are explained or self-explanatory (i.e., defined in the research field’s
controlled vocabulary).
Clearly specify and document the version of the archived and/or reused data.
R1.1 (meta)data are released with a clear and accessible data usage
license.

Under ‘I’, we covered elements of technical interoperability. R1.1 is about legal


interoperability. What usage rights do you attach to your data? This should be
described clearly. Ambiguity could severely limit the reuse of your data by
organisations that struggle to comply with licensing restrictions. Clarity of licensing
status will become more important with automated searches involving more
licensing considerations. The conditions under which the data can be used should
be clear to machines and humans.

Examples

Commonly used licenses like MIT or Creative Commons can be linked to your data.
Methods for marking up this metadata are provided by the DTL FAIRifier.
R1.2 (Meta)data are associated with their provenance.

For others to reuse your data, they should know where the data came from (i.e.,
clear story of origin/history, see R1), who to cite and/or how you wish to be
acknowledged. Include a description of the workflow that led to your data: Who
generated or collected it? How has it been processed? Has it been published
before? Does it contain data from someone else that you may have transformed or
completed? Ideally, this workflow is described in a machine-readable format.

Examples

https://commons.wikimedia.org/wiki/File:Sampling_coral_microbiome_(27146437650).jpg
includes authorship details, and uses the Creative Commons Attribution Share Alike
license, which indicates exactly how the data author wishes to be cited.
R1.3 (Meta)data meet domain-relevant community standards.

It is easier to reuse data sets if they are similar: same type of data, data organised in a
standardised way, well-established and sustainable file formats, documentation (metadata)
following a common template and using common vocabulary. If community standards or
best practices for data archiving and sharing exist, they should be followed. For instance,
many communities have minimal information standards (e.g., MIAME, MIAPE). FAIR data
should at least meet those standards. Other community standards may be less formal, but
nevertheless, publishing (meta)data in a manner that increases its use(ability) for the
community is the primary objective of FAIRness. In some situations, a submitter may have
valid and specified reasons to divert from the standard good practice for the type of data to
be submitted. This should be addressed in the metadata. Note that quality issues are not
addressed by the FAIR principles. The data’s reliability lies in the eye of the beholder and
depends on the intended application.
Examples
http://schema.datacite.org/ [for general purpose, not domain-specific]
https://www.iso.org/standard/53798.html [geographic information and services]
http://cfconventions.org/ [climate and forecast]
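As a rough illustration of the CF conventions entry above, the following Python sketch attaches CF-style standard_name and units attributes to a gridded variable with xarray and writes a self-describing NetCDF file. It assumes numpy and xarray (with a NetCDF backend) are installed; the data values and file name are invented.

# Attaching CF-convention style metadata (R1.3) to a variable, then writing
# a self-describing NetCDF file. Values are synthetic, for illustration only.
import numpy as np
import xarray as xr

temperature = xr.DataArray(
    data=np.random.uniform(260.0, 300.0, size=(2, 3, 4)),
    dims=("time", "lat", "lon"),
    coords={
        "time": np.array(["2014-01-01", "2014-01-02"], dtype="datetime64[ns]"),
        "lat": [30.0, 31.0, 32.0],
        "lon": [-100.0, -99.0, -98.0, -97.0],
    },
    attrs={
        # CF attributes: a controlled standard_name plus physical units
        "standard_name": "air_temperature",
        "long_name": "Near-surface air temperature",
        "units": "K",
    },
    name="tas",
)

ds = xr.Dataset(
    {"tas": temperature},
    attrs={"Conventions": "CF-1.7", "title": "Illustrative CF-style dataset"},
)
ds.to_netcdf("example_tas_cf.nc")   # hypothetical output file name
print(ds)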
FAIR Initiatives

Example: the GO (Global Open) FAIR initiative works towards standardization of datasets worldwide.
FAIR - Your Part
Questions that will be asked:
1. What is the name of your repository and how is it accessed
(URL)?
2. What organization is considered the publisher/operator of
the repository?
3. Has your repository attained (or are you working toward)
any community certifications? If yes: which ones? Do you
have (or are you working toward) repository certification
(e.g. CoreTrustSeal, WDS, DSA, TRAC, NESTOR, ISO
16363)?
4. From whom do you accept data for deposit, generally?

Source: Shelley Stall “Assessing FAIRness within the Enabling FAIR Data project”
Helpful FAIR Resources
● Article describing the Enabling FAIR Data Project:
https://eos.org/editors-vox/enabling-findable-accessible-interoperable-and-reusable-data
● Outcome of the initial Stakeholder Meeting from Nov 16-17, 2017:
https://eos.org/agu-news/enabling-fair-data-across-the-earth-and-space-sciences
● DataONE webinar recording:
https://www.dataone.org/webinars/enabling-fair-data
● Enabling FAIR Data (high-level) Project Site:
http://www.copdess.org/home/enabling-fair-data-project/
Foundations II:
Infrastructure and Enabling technology
Setting up a FAIR environment

Start by making yourself FAIR:


Register for an ORCID iD. Perhaps also a ResearcherID and a Scopus ID.
Register for an OpenID
Create a Google Scholar Page
Functioning infrastructure needed
TODO
• Compiler, file type
• Conda, Python (version 3), Jupyter, Docker - table of versions, etc.
• Space - MyGeoHub? Internet?
MyGeoHub
What is MyGeoHub?
Covered on Friday.
Jupyter Notebook
What is the Jupyter Notebook?

[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
Jupyter within FAIR
Attributes: easy, free, open, shareable, customizable, supports various languages, self-describing, findable and accessible
A good example of FAIR
Tips for writing a notebook that is as FAIR as possible
Session this Wednesday on Jupyter
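One simple habit that supports this (a sketch, not a prescription): start the notebook with a provenance cell that records who ran it, when, with which software, and on which data. The data source, URL, and DOI below are placeholders to be replaced with whatever the notebook actually uses.

# A provenance cell for the top of a notebook, to make the analysis more
# self-describing (and hence more FAIR). All dataset details are placeholders.
import datetime
import platform
import sys

provenance = {
    "notebook_author": "Your Name (ORCID: 0000-0000-0000-0000)",
    "run_date": datetime.datetime.now().isoformat(timespec="seconds"),
    "python_version": sys.version.split()[0],
    "platform": platform.platform(),
    "input_data": {
        "description": "NCEP/NCAR reanalysis surface air temperature",
        "source_url": "https://example.org/path/to/data",   # placeholder
        "identifier": "doi:10.xxxx/placeholder",            # placeholder
    },
}

for key, value in provenance.items():
    print(f"{key}: {value}")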
Downloading Jupyter Notebook
Online instructions for own machine: https://jupyter.org/install.html
Also available as tool in MyGeoHub!

To access:
1. Log into MyGeoHub
2. Click ‘Resources’ -> ‘Tools’
in top menu
3. Scroll and click ‘Jupyter
Notebook’
4. Click ‘Launch Tool’
MyGeoHub Jupyter Notebook
[video example]
GitHub
What is GitHub?
Hopefully you went to the session last week.
Intro to GitHub
GitHub is a good enabling tool for FAIR science

[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
Download Reanalysis Data
Reanalysis is a scientific method for developing a comprehensive record of how weather and
climate are changing over time. In it, observations and a numerical model that simulates one or
more aspects of the Earth system are combined objectively to generate a synthesized estimate of
the state of the system. A reanalysis typically extends over several decades or longer, and covers
the entire globe from the Earth’s surface to well above the stratosphere. Reanalysis products are
used extensively in climate research and services, including for monitoring and comparing current
climate conditions with those of the past, identifying the causes of climate variations and change,
and preparing climate predictions. Information derived from reanalyses is also being used
increasingly in commercial and business applications in sectors such as energy, agriculture, water
resources, and insurance. https://reanalyses.org/
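Once a reanalysis file has been downloaded (see the portals on the next slide), a quick way to inspect it is with xarray. The file name and variable name below are assumptions for illustration; adjust them to the actual product you fetch.

# Inspecting a downloaded reanalysis NetCDF file with xarray.
import xarray as xr

ds = xr.open_dataset("air.mon.mean.nc")   # hypothetical monthly-mean file name
print(ds)                                 # dimensions, coordinates, metadata

air = ds["air"]                           # variable name varies by product
climatology = air.mean(dim="time")        # long-term mean at every grid point
print(climatology)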
Downloading Reanalysis Data

RDA
Climate Data Guide
ECMWF

Get an OpenID
Open an account with NCAR
Open an account with ECMWF
SYNDA
What is SYNDA?
● SYNchronized DAta
● Command line tool
● Uses:
  ○ Search datasets
  ○ Search files
  ○ Download file(s)
  ○ Manage large numbers of files
  ○ Explore metadata

[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
SYNDA and FAIR
● Open source
● Supports community
● Free database - enabling discovery
● Keeps track of metadata
● Maintains provenance information
Downloading SYNDA
Credentials needed:
● Globus - https://www.globus.org/
● ESGF - https://esgf-node.llnl.gov/user/add/?next=https://esgf-node.llnl.gov/projects/esgf-llnl/
Intro to ESGF
● ESGF - Earth System Grid Federation
○ Collaboration to develop, deploy, and maintain
software infrastructure
● Manages database for climate data from several
international and federal websites
○ Management, dissemination, and analysis of model
output and observational data
○ Over 700,000 datasets supported!
● Supports the Coupled Model Intercomparison Project
(CMIP)
● Used to discover, access, and analyze data

Workflow stages: Discover, Integrate, Analyze
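Besides the Synda command-line tool introduced above, ESGF can also be searched programmatically. The sketch below uses the esgf-pyclient package (pip install esgf-pyclient); the search node and facet values are examples only, not a prescribed workflow.

# Searching ESGF programmatically with esgf-pyclient.
from pyesgf.search import SearchConnection

conn = SearchConnection("https://esgf-node.llnl.gov/esg-search", distrib=True)

# Facets mirror the ones used in the synda examples later in this deck.
ctx = conn.new_context(project="CMIP5", experiment="rcp85", variable="tas")
print("Matching datasets:", ctx.hit_count)

dataset = ctx.search()[0]                 # first matching dataset
files = dataset.file_context().search()   # files belonging to that dataset
for f in list(files)[:5]:
    print(f.download_url)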


Intro to Globus
● Helps solve the problem of big data by allowing for
portable data
● Research data manager:
○ Transfer large data anywhere
○ Share between collaborators
○ Publish into curated collections for others to access
○ Build
● Access across all systems using ID
● Fast, secure, and reliable
● Low cost subscription - most universities cover

Workflow stages: Discover, Collect, Describe, Preserve
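For scripted transfers, Globus also provides a Python SDK (pip install globus-sdk). The following is a hedged sketch only: the client ID, endpoint UUIDs, and file paths are placeholders, and a real script requires an app registered at developers.globus.org plus endpoints you are authorized to use.

# Minimal sketch of a scripted Globus transfer with globus-sdk.
import globus_sdk

CLIENT_ID = "YOUR-NATIVE-APP-CLIENT-ID"      # placeholder
SRC_ENDPOINT = "source-endpoint-uuid"        # placeholder
DST_ENDPOINT = "destination-endpoint-uuid"   # placeholder

# Interactive login: paste the code shown at the authorization URL.
auth_client = globus_sdk.NativeAppAuthClient(CLIENT_ID)
auth_client.oauth2_start_flow()
print("Log in at:", auth_client.oauth2_get_authorize_url())
auth_code = input("Authorization code: ").strip()
tokens = auth_client.oauth2_exchange_code_for_tokens(auth_code)
transfer_token = tokens.by_resource_server["transfer.api.globus.org"]["access_token"]

tc = globus_sdk.TransferClient(
    authorizer=globus_sdk.AccessTokenAuthorizer(transfer_token)
)

# Describe and submit a single-file transfer between the two endpoints.
tdata = globus_sdk.TransferData(tc, SRC_ENDPOINT, DST_ENDPOINT, label="example transfer")
tdata.add_item("/path/on/source/file.nc", "/path/on/destination/file.nc")
result = tc.submit_transfer(tdata)
print("Submitted transfer, task ID:", result["task_id"])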


Downloading SYNDA
[Mac]
● Available at: https://github.com/Prodiguer/synda
● Installation methods: RPM, DEB, Source, and Docker
● Download docker at:
https://hub.docker.com/editions/community/docker-ce-desktop-mac
● Download SYNDA via docker:
https://hub.docker.com/r/prodiguer/synda
Intro to Docker
● Container platform
○ Bundles applications and dependencies into application
packages
● Manages applications, clouds and infrastructure
● Deploys isolated, complex environments
● Free to download - open source

Workflow stages: Analyze, Assure, Preserve


Downloading SYNDA
[Mac]

https://hub.docker.com/editions/community/docker-ce-desktop-mac
● Create an account to download
● Click “Get Docker”
● After download, double click docker.dmg to install
● Restart terminal after installation
● Check installation by running from terminal:
○ docker version
○ docker run hello-world
Downloading SYNDA
[Mac]
Commands (after docker installation inside terminal):
sudo docker pull prodiguer/synda
mkdir -p ~/synda/sdt/conf
curl https://raw.githubusercontent.com/Prodiguer/synda/master/sdt/conf/sdt.conf -o ~/synda/sdt/conf/sdt.conf
curl https://raw.githubusercontent.com/Prodiguer/synda/master/sdt/conf/credentials.conf -o ~/synda/sdt/conf/credentials.conf
vi ~/synda/sdt/conf/credentials.conf
Downloading SYNDA
[Mac]

Dataset installation and download:


sudo docker ps
sudo docker attach 204fcee0d7e5
● Note: this is the Container ID
synda daemon start &
synda install [dataset_name]
Detach container:
Ctrl+p+Ctrl+q
Downloading SYNDA
[Supercomputer Cluster]

Installation From Source:


wget --no-check-certificate https://raw.githubusercontent.com/Prodiguer/synda/master/sdc/install.sh
chmod +x ./install.sh
./install.sh
Downloading SYNDA
[Supercomputer Cluster]

Configuration:
Add into your .bashrc or .bash_profile
export ST_HOME=$HOME/sdt
export PATH=$ST_HOME/bin:$PATH
Edit credentials for ESGF and Globus
vi $ST_HOME/conf/credentials.conf
SYNDA should be up and working on your machine!
Downloading Synda - Video Example
SYNDA
[Useful Commands]
Search:
synda search rcp85 3hr start=2005-01-01T00:00:00Z
end=2100-12-31T23:59:59Z (dataset)
synda search rcp85 3hr timeslice=20050101-21001231 -f (file)
Download:
synda get
tasmax_day_FGOALS-s2_piControl_r1i1p1_20160101-20161231.nc
synda install CMIP5 CNRM-CM5 tas pr areacello
Other:
synda param
Helpful SYNDA Resources
Github:
https://github.com/Prodiguer/synda
Documentation:
http://prodiguer.github.io/synda/
Earth System Grid Federation presentation:
https://esgf.llnl.gov/esgf-media/2018-F2F/2018-12-05/2018-12-05_-_ESGF_F2F_Synda-BenNasser.pdf
SYNDA Examples
Applications
ESMValTool
What is ESMValTool?
Earth System Model eValuation Tool
● Community diagnostics and performance metric tool for the
evaluation of Earth System Models (ESMs)
Features:
● Facilitates the evaluation of ESMs
● Standardizes model evaluation against observations or
other models
● Wide scope of different diagnostics and performance
metrics covering ESMs (dynamics, radiation, clouds, carbon
cycle, chemistry, aerosol, sea-ice, etc.)
ESMValTool Structure

Source: Mattia Righi “ESMValTool v2.0 Technical Overview"


What is ESMValTool?
ESMValTool is a good enabling tool for FAIR science
● Free and available on GitHub
● Developed so additional analysis can be added
● Open exchange of code and results
● Standardized
● Broad documentation
● Highly flexible
● Multi-language support

[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
Downloading ESMValTool
● Available at: https://github.com/ESMValGroup/ESMValTool
● Installation methods: anaconda, source, and docker
capabilities
Intro to Anaconda (Conda)
● Package and environment manager
● Runs on Windows, macOS, and Linux
● Command line tool
● Managing environments:
○ Create separate environments with unique packages,
files, and dependencies
○ Isolated from other environments
● Easy to install packages to your environment
○ conda search <package>
○ conda install <package>
● Installs, runs, and updates packages

Workflow stages: Integrate, Analyze
Conda and FAIR
TODO
Downloading Conda
● https://www.anaconda.com/distribution/#macos
● Click “download”
● Open package under downloads
○ Continue through installation process
● Agree to the terms of the software license
agreement
● Install
● Once installation is complete, close the installer
and move to trash
Downloading ESMValTool
[conda]
● Update conda: conda update -y conda
● Create environment using Python 3:
conda create --name esmvaltool python=3
conda env update --name esmvaltool --file ESMValTool/environment.yml
● Activate the environment:
○ Bash: source activate esmvaltool
○ Tcsh: conda activate esmvaltool
● Software installation: python ESMValTool/setup.py install
● Check if installation is successful: esmvaltool --version
Downloading ESMValTool
[source]

● Download from GitHub: git clone https://github.com/ESMValGroup/ESMValTool.git
● Go to directory: cd ESMValTool
● It is suggested to install inside an Anaconda environment to manage ESMValTool dependencies
[Install using: python setup.py install]
Downloading ESMValTool
[docker]
● Download the most recent image via docker pull:
docker pull esmvalgroup/esmvaltool
● Example use:
[sudo] docker run -ti esmvalgroup/esmvaltool
Downloading ESMValTool
[supercomputer cluster]
● Download source from github: git clone
https://github.com/ESMValGroup/ESMValTool.git
● Via conda: module load anaconda
● Create environment:
conda create --name esmvaltool python=3
conda env update --name esmvaltool --file ESMValTool/environment.yml
● Activate environment:
conda activate esmvaltool
● Software installation:
python3 ESMValTool/setup.py install
● Check if successful: esmvaltool --version
ESMValTool examples
TODO
Helpful ESMValTool Resources
● About: https://www.esmvaltool.org/index.html
● User Guide:
https://esmvaltool.readthedocs.io/en/version2_development/user_guide2/index.html
● Github: https://github.com/ESMValGroup/ESMValTool
● Technical overview:
https://www.esmvaltool.org/download/Righi_ESMValTool2-TechnicalOverview.pdf
futureheatwaves
What is the futureheatwaves package?
● R package for analysis of climate projection files to identify and determine a defined type of multi-day extreme event
  ○ Ex: heat waves

[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
Intro to R
R is a language and environment used mainly for statistics, data
manipulation, calculation, and graphics
● FAIR Science
○ Free
○ Compiles and runs on a variety of UNIX platforms and systems
○ Well documented
○ Extended easily via “packages”
  ■ CRAN existence
○ Anyone can write and contribute their own packages
  ■ The R Journal can be a useful source to “promote” them

Workflow stages: Integrate, Analyze
What is CRAN?
The Comprehensive R Archive Network
● Main repository for R packages
● Easy for users to install
○ install.packages()
Download R
● https://www.r-project.org/
● Choose your preferred CRAN mirror (ideally one closest to
your location)
Downloading RStudio
Open source integrated development environment (IDE) designed
specifically for R
● Free desktop and web browser versions
Available for download at:
https://www.rstudio.com/products/RStudio/
● Windows
● Mac
● Linux
Downloading RStudio
Downloading futureheatwaves
Via RStudio:
● Open RStudio Desktop from your applications
● Install package from inside the console
○ install.packages("futureheatwaves")
Via Command Line:
● R
● install.packages("futureheatwaves")
● Choose a CRAN Mirror closest to your location
● Download should start
Simple R examples [R Studio]
Objectives:
1. Be able to install package
2. Be able to quickly and simply analyze data
Working with template script:
Futureheatwaves examples
TODO
Helpful R and futureheatwaves links
https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf
https://www.rstudio.com/products/RStudio/
ARCCSS climate extremes
What is the ARCCSS climate extremes package?

[Figure: data lifecycle — Plan → Collect → Assure → Describe → Preserve → Discover → Integrate → Analyze, with PUBLISH at the center]
Downloading ARCCSS climate extremes
Workflow I
Workflow II:
‘BYOW’ Bring Your Own Workflow
