
Accepted Manuscript

Chiminey: Connecting Scientists to HPC, Cloud and Big Data

Iman I. Yusuf, Ian E. Thomas, Maria Spichkova, Heinz W. Schmidt

PII: S2214-5796(16)30041-7
DOI: http://dx.doi.org/10.1016/j.bdr.2017.01.004
Reference: BDR 56

To appear in: Big Data Research

Received date: 18 April 2016


Revised date: 15 November 2016
Accepted date: 20 January 2017

Please cite this article in press as: I.I. Yusuf et al., Chiminey: Connecting Scientists to HPC, Cloud and Big Data, Big Data Res. (2017),
http://dx.doi.org/10.1016/j.bdr.2017.01.004

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing
this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is
published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all
legal disclaimers that apply to the journal pertain.
Chiminey: Connecting Scientists to HPC, Cloud and Big Data

Iman I. Yusuf (a), Ian E. Thomas (a), Maria Spichkova (b), Heinz W. Schmidt (b)

(a) RMIT University, eResearch Office, 17-23 Lygon Street, 3053, Carlton, Australia
(b) RMIT University, School of Science, 124 La Trobe Street, 3000, Melbourne, Australia

Email addresses: iman.yusuf@rmit.edu.au (Iman I. Yusuf), ian.edward.thomas@rmit.edu.au (Ian E. Thomas), maria.spichkova@rmit.edu.au (Maria Spichkova), heinz.schmidt@rmit.edu.au (Heinz W. Schmidt)

Abstract
The enabling of scientific experiments increasingly includes data, soft-
ware, computational and simulation elements, often embarrassingly parallel,
long running and data-intensive. Frequently, such experiments are run in a
cloud environment or on high-end clusters and supercomputers. Many disciplines
in science and engineering (outside computer science) find the requisite
computational skills attractive on the one hand but a distraction from their
science domain on the other. We developed Chiminey under the direction of
quantum physicists and molecular biologists to ease the steep learning curve in
the data management and software platforms required by complex computational
target systems. Chiminey acts as a smart connector that mediates the running of
specialist algorithms developed for workstations with moderately large data
sets and relatively little computational grunt. This connector allows the domain sci-
entists to choose the target platform and then manages it automatically; it
accepts all the necessary parameters to run many instances of their program
regardless of whether this runs on a peak supercomputer, a commercial cloud
like Amazon EC2 or (in Australia) the national federated university cloud
system NeCTAR. Chiminey negotiates with target system schedulers, dash-
boards and databases and provides an easy-to-use dashboard interface to the
running jobs, regardless of the specific target platform. The smart connector
encapsulates and virtualises a number of further aspects that the domain
scientists directing our effort found necessary or desirable.
In this article we present Chiminey and guide the reader through a hands-
on tutorial of this open-source platform. The only requirement is that the
reader has access to one of the supported clouds or cluster platforms - and
very likely there is a matching one. The tutorial stages range in difficulty
from requiring little to no technical background through to advanced sections,
such as programming your own domain-specific extension on top of the Chiminey
application programmer interfaces.
The different exercises we demonstrate include: installing the Docker de-
ployment environment and Chiminey system; registering resources for file
stores, Hadoop MapReduce and cloud virtual machines; activating hrmclite
and wordcount smart connectors – two demonstrators; running a smart con-
nector and investigating the resulting output files; and building a new smart
connector. We also discuss briefly where to find more detailed information
on, and what is involved in, contributing to the Chiminey open source code
base.
Keywords: Big data, cloud, e-science, high performance computing,
parallel processing, scientific computing, service computing, simulation

1. Introduction
In this article, we present the Chiminey platform, which provides a re-
liable computing and data management service. Chiminey enables domain
scientists, hereafter scientists, to compute on cloud-based, big data and
high-performance computing (HPC) facilities, handle failure during the
execution of applications, curate and visualise execution outputs, share such
data with collaborators or the public, and search for publicly available data,
all without the need for a technical understanding of cloud computing, HPC,
fault tolerance or data management. Many scientific experiments pose a twofold
challenge: they are complicated domain-specific research tasks (e.g., a
complicated analysis of quantum physics approaches), and at the same time the
corresponding computations and datasets are too large to be executed on a local
desktop machine, i.e., cloud-based and HPC solutions are required.
Any new technology usually means not only new opportunities but also new
challenges, as a technology typically demands some initial knowledge
acquisition from its users. Cloud computing [1] enables acquisition of very
large computing and storage resources, which can be integrated with big data
technologies for massive-scale computation. Such acquisition requires
relatively little specialised knowledge beyond that needed for a single PC.
Nevertheless, failure while setting up a cloud-based execution environ-
ment or during the execution itself is arguably inevitable: some or all of the
requested virtual machines (VMs) may not be successfully created/instanti-
ated, or the communication with an existing VM may fail due to long-distance
network failure – given cloud data centers are typically remote and commu-
nication crosses many network boundaries. Also, one has to realise that all
tasks of such parallel computations are required to complete, therefore the
failure of any one of them may corrupt the result in some way. Statistically
this means that the reliability of the overall task completion is the product
of that of the individual tasks – and with very many thousands or millions
of compute tasks this may quickly become a vanishingly small number.
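For illustration of this product rule (the figures are purely illustrative): if
each of n = 10,000 independent tasks completes successfully with probability
r = 0.999, the probability that the whole computation completes is
R = r^n = 0.999^10000 ≈ 4.5 × 10^-5, i.e., practically zero without fault
tolerance.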
When using cloud computing platforms and big data technologies like
Hadoop, scientists require operational skills and to some extent knowledge
of aspects of fault tolerance. Scientists need to learn, for example, how to
create and set up virtual machines (VMs), collect the results of experiments,
and finally destroy VMs. Such challenges distract the user from focusing on
their core goals. Thus, there is a need for a platform that encapsulates these
problems and isolates them from the user. This would enable the user to focus
on domain-specific problems, delegating to the tool the details that come with
accessing high-performance and cloud computing infrastructure and the data
management challenges these pose. For these reasons,
we propose a user-friendly open-source platform that would hide the above
problems from the user by encapsulating them in the platform’s functionality.
The proposed open-source platform has been applied across two research
disciplines, physics (material characterisation) and structural biology (un-
derstanding materials at the atomic scale), to assess its usability and practi-
cality. The domain experts noted the following advantages of the platform:
time savings for computing and data management, user-friendly interface for
the computation set up, and visualisation of the calculation results as 2D or
3D graphs.
Previous work: The first prototype of Chiminey was discussed in [2]. A
formal model of a platform for scalable and fault-tolerant cloud computations
as well as the implementation of this platform as the Chiminey platform was
introduced in [3]. The model allows us to have a precise and concise specifica-
tion of the platform on the logical level. We also presented the refined formal
model of a cloud-based platform and the latest version of its open-source im-
plementation [4], with the emphasis on usability and reliability aspects. The
feasibility of the Chiminey platform is shown using case studies from the
Theoretical Chemical and Quantum Physics group at RMIT University.
Outline: The rest of the article is organised as follows. Section 2 provides
background information and links or contrasts our work with related work. Sec-
tion 3 introduces one of the core artifacts of Chiminey, Smart Connectors,
as well as the resources the platform provides. Section 4 presents the tuto-
rial, targeting the different types of Chiminey users. Section 5 concludes the
article and presents the core directions of our future work on Chiminey.

2. Background
In 2009, Leavitt in his widely cited paper [5] (more than 500 citations)
analysed advantages and challenges related to cloud computing, highlighting
that this type of deployment architecture was becoming appealing to many
companies. Now, almost 8 years later, this paradigm has become more appealing
still. Another widely cited paper on the cloud computing paradigm [6] (more
than 600 citations) presents a survey by Zhang et al. The survey highlights the
key concepts of cloud computing, its architectural principles, state-of-the-art
implementations, and research challenges.
Cloud computing provides many benefits: provisioning of virtual machines (VMs)
within literally 15 minutes, whereas purchasing physical servers took days or
weeks; access to online storage and computing resources at a moment's notice;
cost savings by turning virtual servers, and hence the charges for them, on and
off at will; and, not least, improved resource utilisation across large numbers
of users in one or more data centres.
However, failure in cloud services is arguably inevitable due to config-
uration errors, continuous upgrades somewhere in the cloud software stack
or application layers, the unreliability of networks that remote services de-
pend on, and thus generally the heterogeneous character of widely distributed
systems. Yusuf and Schmidt [7] have shown in formal reliability and perfor-
mance studies, that fault-tolerance is best achieved by reflecting the static
and dynamic (behavioural) architecture of high-performance computational
programs. Compared to architecture-agnostic replication, architecture-aware

1
more than 500 citations
2
more than 600 citations

4
fault-tolerance can achieve higher reliability at lower costs, but needs to be
tuned to different architectural/behavioural patterns such as stream process-
ing, map-reduce, randomised access etc.
The development of formal models and architectures for systems involved
in cloud computing is a more recent area of systems engineering. Vaquero
et al. [8] studied more than 20 definitions of the term cloud computing to
extract a consensus definition as well as a minimum definition containing
the essential characteristics. As a result, they consolidated the following
definition:
Clouds are a large pool of easily usable and accessible virtual-
ized resources (such as hardware, development platforms and/or
services). These resources can be dynamically reconfigured to ad-
just to a variable load (scale), allowing also for an optimum re-
source utilization. This pool of resources is typically exploited by a
pay-per-use model in which guarantees are offered by the Infras-
tructure Provider by means of customized Service-Level Agree-
ments.
Buyya and Sulistio [9] presented a discrete-event grid simulation toolkit,
GridSim, that can be used for investigating the design of utility-oriented
computing systems such as Data Centers and Grids.
Ostermann et al. [10] asked whether the performance of clouds is sufficient for
scientific computing. They analysed the performance of the Amazon EC2 platform
using micro-benchmarks and kernels, and concluded that the performance and
reliability of the tested cloud are low, and probably insufficient for
scientific computing at large.
As cloud-based systems deal with safety- and security-critical data, the formal
modelling and verification of cloud architectures becomes increasingly
important. Su et al. [11] used the CSP framework to model the MapReduce system.
Reddy et al. [12] proposed an approach to verify the correctness of Hadoop (an
open-source implementation of MapReduce) using model-checking techniques. Our
previous work on a formal model of the Chiminey system was presented in [3, 4].
Several works have proposed or compared map-reduce approaches for cloud
computing; others address data stream processing systems; and yet others
parametric parallel solvers using special numeric packages or Monte Carlo walks
distributed over many VMs. For example, Martinaitis et al. [13] introduced an
approach towards component-based stream processing in clouds. Kuntschke and
Kemper [14] presented work on data stream sharing.
For scientific computing, it is crucial to allow researchers to build their
own workflows. There are different types of scientific workflow systems that
are designed to provide this functionality. Oinn et al. [15] presented Taverna
Workbench for the composition and execution of workflows for the life sci-
ences community. Taverna enables users to interoperate services, but does
not support the semantic integration of data outcomes of these services. Af-
gan et al. [16] introduced a Galaxy Cloud that provides an interface with
automated management of cloud computing resources, which was used to
conduct biomedical experiments. Buyya et al. [17] presented Nimrod, a software
infrastructure for executing large and complex computations. Nim-
rod contains a simple language for describing sweeps over a parameter space
and the input and output of data for processing. Nimrod is compatible with
the Kepler system [18], such that users can set up complex computational
workflows and have them executed without having to interface directly with
a high-performance computing system.
The contribution of the work presented in this paper is that our platform
provides drop-in components, so-called Smart Connectors (SCs), for existing
workflow engines and user-defined control of fault tolerance: (i) researchers
can utilise and adapt existing Smart Connectors; (ii) the functionality of the
target schedulers, workflow engines or middleware platforms is abstracted but
not duplicated; (iii) an SC's target can be high-performance clusters or
clouds; and (iv) new types of Smart Connectors can be developed with little
effort within the framework if necessary. To the best of our knowledge, there
is no other framework with these advantages. SCs are geared toward providing
flexibility and power underneath simplicity.

3. Chiminey
The Chiminey platform was implemented as a part of the Bioscience
Data Platform project [19], which is an agile software collaboration between
software engineering and applied natural sciences researchers in quantum
physics (nanomaterials) and computational biology (crystallography studies).
It was important to support these sciences while minimising the prior knowledge
of cloud and cluster computing needed, easing the use of parallel computing
within the virtual laboratory development that provided the context for our
Chiminey-related grant.

Figure 1: A reference architecture of the Chiminey platform. The architecture
comprises the Chiminey cluster (web front end, Smart Connectors, Data Manager,
MyTardis, Chiminey scripts and storage) communicating with external components:
external storage, instruments (microscopes, synchrotron, HPC, etc.) and
research repositories. Users submit and monitor jobs and are notified of job
status by email.

Python was selected as the development language due to its rapid prototyping
features, its integration with the MyTardis data curation system
(http://mytardis.org/), and its increasing uptake by scientists as a scientific
software develop-
ment language. However, the domain-specific calculations could be written
in any language. The choice of the language depends on the domain and the
concrete research task.
The reference architecture of the platform is presented in Figure 1. In
our implementation, the data can be sent to MyTardis [20], an applica-
tion for cataloguing, managing and assisting the sharing of large scientific
datasets privately and securely over the web. MyTardis is currently used
across Australian universities in collaborative characterisation of biomedical
or advanced materials and structures at the nano- and microscale.
Configuration parameters are provided by users through a browser inter-
face using web forms, prior to execution. The forms include
1. information for computation-specific plugins to implement sequential
user algorithms;
2. cloud storage and compute resource specification, in particular number
and type of virtual machines (or processes on a cluster);
3. any parallel-pattern specific parameters that are required to coordinate
compute sweeps on behalf of the user;

4. fault tolerance parameters to support predefined fault tolerance policies
such as replication and restart of VMs and processes in VMs;
5. data source and sink parameters, if data is fetched from or transferred
to repositories outside the cloud.

3.1. Smart Connectors


The platform provides access to a distributed computing infrastructure.
On the logical level it is modelled as the execution of a dynamically built
set of Smart Connectors (SCs), which handle the provision of cloud-based
infrastructure. SCs vary from each other by the type of computation to be
supported and/or the specific computing infrastructure to be provisioned.
An SC interacts with a cloud service (Infrastructure-as-a-Service) on be-
half of the user. With respect to the execution environment, the only
information expected from the user is the number of computing resources she
wishes to use, the credentials to access those resources, and the location for
transferring the output of the computation. Thus, the user does
not need to know about how the execution environment is set up (i.e., how
VMs are created and configured for the upcoming simulation), how a simula-
tion is executed, how the final output is transferred and how the environment
is cleaned up after the computation completion (i.e., how the VMs are de-
stroyed). The platform provides a set of APIs to create new and customise
existing SCs.
The platform identifies three classes of users:

• Software as a Service (SaaS) users: scientists who wish to run jobs on an
already deployed Chiminey platform;

• Platform as a Service (PaaS) users: scientists, with a technical background,
who would like to create new smart connectors;

• Expert users: users who focus on advanced topics like benchmarking,
reliability, utilisation and contributing to the Chiminey source code.

3.1.1. Stages
For both SaaS and PaaS users, the key concept is a Chiminey computational
stage. Each stage is a unit of schedulable computation within Chiminey. A smart
connector is composed of stages, each with a unique functionality. For the SaaS
user, stages define an underlying workflow that the user can control via the
Chiminey configuration panels. For the PaaS user, stages can be extended by
scripting and method redefinition, by adding functionality and configuration
options, or by prefilling configuration options so that the SaaS user of the
extended functionality does not have to deal with them.
Stages are implemented as Python classes with the following elements:

validation: Before the execution of a smart connector starts, the Chiminey
server checks whether the constraints of all stages of the smart connector are
met. This is done by invoking the input_valid(self, ...) method of each stage
of the smart connector.

pre-condition: The Chiminey server uses pre-conditions to determine the stage
that should be executed next. The Chiminey server invokes the method
is_triggered(self, ...) in order to check whether the pre-condition of a
particular stage is met.

action: This is the main functionality of a stage. Such functionality includes
creating virtual machines, waiting for computations to complete, and the like.
Once the Chiminey server determines the next stage to execute, the server
executes the stage via the process(self, ...) method.

post-condition: This is where the new state of the smart connector job is
written to persistent storage upon the successful completion of a stage
execution. During the execution of a stage, the state of a smart connector job
changes. This change is saved via the output(self, ...) method.

Classes with these elements can act as stages that can be connected to form
smart connectors. The Chiminey system provides a library of core stages that
can be used to create smart connectors following well-known computational
patterns, and a PaaS user can write additional stages or specialise existing
ones to implement additional behaviour. A minimal sketch of a stage class
follows.
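The sketch below illustrates these four elements in Python. It assumes a Stage
base class in the chiminey.corestages package; the exact base class, import
path and method signatures should be checked against the Chiminey source, and
the stage name and settings keys are hypothetical.

from chiminey.corestages import stage

class GreetingStage(stage.Stage):
    # Illustrative stage; the class name and the 'greeting_done' key
    # are made up for this sketch.

    def input_valid(self, settings):
        # validation: checked before the smart connector job starts
        return (True, 'valid input')

    def is_triggered(self, run_settings):
        # pre-condition: run only if no greeting has been produced yet
        return 'greeting_done' not in run_settings

    def process(self, run_settings):
        # action: the main functionality of the stage
        self.greeting = 'hello from a custom stage'

    def output(self, run_settings):
        # post-condition: persist the new state of the job
        run_settings['greeting_done'] = True
        return run_settings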
The provided suite of core stages supports a set of key phases of computation:
1. Data analysis: determining initial inputs for computations, including
algorithm parameters and compute and storage resources.
2. Execution environment setup: creating compute resources (if applicable) and
configuring them (e.g., bootstrapping software requirements).
3. Computations: scheduling of computations onto VMs, execution, and then
waiting for output.
4. Output transfer: transferring data into designated storage and/or data
curation systems like MyTardis.
5. Cleanup: decommissioning of allocated VM resources (if applicable).

3.1.2. Payload
For some specific types of stages, there are hooks that allow arbitrary
packages of domain-specific executables and data files to be processed within
the system. This allows the same smart connector to be parameterised on the
specific executable task. A payload is a set of system and, optionally,
domain-specific files that are needed for the correct execution of a smart
connector. The system files are composed of Makefiles and bash scripts, while
the domain-specific files are developer-provided executables. The system files
enable the Chiminey server to set up the execution environment, execute
domain-specific programs, and monitor the progress of setup and execution.

3.2. Resources
In the execution of a smart connector, external entities, both computational
and storage, are registered within the system as compute and storage resources,
respectively. Compute resources are used for the execution of tasks; examples
include Unix hosts, Jenkins servers, PBS cluster head nodes and cloud nodes.
Storage resources are Unix filesystems, either local or remote, which can store
directories and files, both as sources of data and sinks for computation
results.

3.3. Architecture
The Chiminey architecture relies on Docker, an automatic software deployment
tool [21] for software platforms and applications in the cloud. While VMs can
be transported from cloud to cloud, extended and specialised, VM images are
large and time-consuming to start up when many software platforms run in the
VM. Docker sits somewhere between managing entire VMs and managing a single
software package by sharing its source code in some open repository. To this
end, Docker introduces the notion of containers. Like the containers on a ship
share the same ship, Docker containers share the same operating system kernel,
file system, disks, etc., and ultimately the same VM or type of VM. Thus Docker
containers start up almost instantly and use fewer resources, and container
layers can be added as needed. Containers are based on open standards and run
on all major Linux distributions, on Windows, and generally on top of any
infrastructure.
The Chiminey system is a composition of Docker containers with specialised
functions, including a front-end portal (a Django MVC web framework), databases
(PostgreSQL), task schedulers (Celery), a task queue (Redis) and multiple
worker containers that execute jobs (Celery workers).
The basic installation is configured to run on a single container node VM,
though more sophisticated architectures can be deployed by using a multi-
node container orchestration tool such as Google Kubernetes, Docker Swarm,
or Mesos Marathon.
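To make this composition concrete, the sketch below shows the general shape of
a docker-compose file for such a single-node deployment. The service names,
images and settings are illustrative assumptions, not the shipped
configuration; see docker-compose.yml in the docker-chiminey repository for the
actual one.

# illustrative docker-compose sketch; images and values are placeholders
version: '2'
services:
  portal:                     # Django MVC front-end portal
    image: chiminey/chiminey
    ports:
      - "80:8000"
    depends_on: [db, redis]
  worker:                     # Celery worker container executing jobs
    image: chiminey/chiminey
    command: celery worker
    depends_on: [db, redis]
  db:                         # PostgreSQL database
    image: postgres
    environment:
      POSTGRES_PASSWORD: changeme
  redis:                      # Redis task queue backing Celery
    image: redis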

4. Experiencing Chiminey: A Hands-On Tutorial


We divide the tutorial into two main sets of activities: deployment and usage.
This tutorial style tries to engage biologists, physicists and other science
users. Chiminey is currently being used in entomology for sieving through
insect data, in materials science for classifying nanosurfaces, and in many
more areas besides its original uses during its initial development in
cooperation with biologists and quantum physicists.

• Deployment
These activities install and deploy the chiminey system in a standard
configuration:
1. container infrastructure — the container framework for deploy-
ment of the chiminey components.
2. Chiminey deployment — including containers for the portal, work-
ers and databases.
3. configuration of users — creation of accounts for users of chiminey.
4. registration of smart connectors — enabling of pre-existing smart
connectors with the system.

• Usage
These activities show the operation of the running chiminey system:
1. registration of resources — identifying local and remote storage and
computation resources and providing identifying handles within Chiminey.

2. creation of jobs — registration of new executions by identifying
resources and key parameters.
3. creation of new smart connectors — extending existing and build-
ing new smart connectors.

4.1. First Things First


The majority of the exercises in this tutorial require a running Chiminey
platform. Thus, here we show how to deploy and configure a Chiminey
platform via Docker, which is an automatic software deployment tool [21].

4.1.1. Docker
Purpose. In this section, you will create a virtual machine (VM) to run
Docker, on either Mac or Windows. Refer to the Docker manuals [21] if you have
a Linux OS.

Exercise 1. We start this tutorial by creating a Docker VM. (This exercise is
inspired by the official Docker website docs [21].)


1. Download Docker Toolbox from https://www.docker.com/toolbox.
2. When the download is complete, open the installation dialog box by
double-clicking the downloaded file.
3. Follow the on-screen prompts to install the Docker toolbox. You may
be prompted for a password just before the installation begins. You
need to enter your password to continue.
4. When the installation is completed, press Close to exit.
5. Verify that docker-engine and docker-compose are installed correctly.
• Open Docker Quickstart Terminal from your applications folder. A
terminal window opens, starts the Docker host VM if necessary, and ends
with a prompt indicating that Docker is configured.

• Run the Docker engine.
$ docker run hello-world

Expected output. You will see a message similar to the one below.
Unable to find image ’hello-world:latest’ locally
latest: Pulling from library/hello-world
03f4658f8b78: Pull complete
a3ed95caeb02: Pull complete
Digest: sha256:8be990ef2aeb16dbcb92...
Status: Downloaded newer image for hello-world:latest

Hello from Docker.


This message shows that your installation appears to be
working correctly.
...

• Run docker-compose
$ docker-compose --version

Expected output. If your OS is not an older Mac


docker-compose version x.x.x, build xxxxxxx

Expected output. If your OS is an older Mac.


Illegal instruction: 4

This error can be fixed by upgrading docker-compose.


$ pip install --upgrade docker-compose

4.1.2. Chiminey Deployment
Purpose. In this section, you will deploy a Chiminey platform.

Exercise 2. Follow the steps below to deploy a Chiminey platform on


the VM that is configured to run Docker (see Section 4.1.1).
1. Open Docker Quickstart Terminal.
2. Check if git is installed.
$ git

Expected output. If git is installed


usage: git [--version] [--help] [-C <path>] ..
[--exec-path[=<path>]] [--html-path] [...
[-p|--paginate|--no-pager] [--no- ...
[--git-dir=<path>] [--work-tree=<path>]...
<command> [<args>]
...

Expected output. If git is not installed:

bash: git: command not found

In that case, download and install git from http://git-scm.com/download.
3. Clone the docker-chiminey source code from github.com
$ git clone https://github.com/chiminey/docker-chiminey.git

4. Change your working directory


$ cd docker-chiminey

5. Review the passwords in the environment sections in docker-compose.yml.


6. Set up a self-signed certificate. You will be prompted to enter your country
code, state, city, etc.
$ sh makecert

7. Deploy the Chiminey platform.


$ docker-compose up -d

8. Verify Chiminey was deployed successfully.


• Identify the IP address of the VM on which Chiminey was deployed
$ env | grep DOCKER_HOST

Expected output.
DOCKER_HOST=tcp://IP:port
E.g., DOCKER_HOST=tcp://192.168.99.100:2376

• Open a browser and visit the Chiminey portal at IP, in our ex-
ample, http://192.168.99.100.
Expected output. After a while, the Chiminey portal will be shown.
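If the portal does not come up, a quick sanity check (assuming a standard
docker-compose setup) is to list the container states:

$ docker-compose ps

Every service defined in docker-compose.yml should be reported with state Up.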

4.1.3. Configuring the Chiminey deployment


Purpose. In this section, you will configure the Chiminey deployment.

Exercise 3. Here, we will configure the Chiminey deployment by creating the
superuser, initialising the database, and signing up regular users.
1. Open Docker Quickstart Terminal.
2. Change to docker-chiminey directory: $ cd docker-chiminey
3. Create the superuser: $ ./createsuper
4. Initialise the database: $ ./init
5. Create a regular user $ ./createuser
6. Verify the Chiminey platform is configured correctly.
• Open a browser and visit the Chiminey portal.
• Login with your regular username and password

Expected output. You will be redirected to a webpage that displays a
list of jobs. Since no jobs are run yet, the list is empty.

4.1.4. Activate Smart Connectors


Purpose. In this section, you will learn how to view a list of smart
connectors, and how to activate a smart connector.

Exercise 4. Here, you will view a list of smart connectors.


1. Open Docker Quickstart Terminal and change to docker-chiminey di-
rectory $ cd docker-chiminey.
2. List all smart connectors
$ ./listscs

Expected output. A list of smart connectors will be displayed


NAME: DESCRIPTION
hrmclite: Hybrid Reverse Monte Carlo without PSD
randnum: Randnum generator, with timestamp
wordcount: Counting words via Hadoop

Exercise 5. Here, you will activate a smart connector.


1. Open Docker Quickstart Terminal and change to docker-chiminey di-
rectory $ cd docker-chiminey.
2. Activate the hrmclite smart connector
$ ./activatesc hrmclite

The syntax to add any of the smart connectors that are included with
the platform is $ ./activatesc smart-connector-name.
3. Verify the smart connector is successfully activated.
• Open a browser and visit the Chiminey portal.
• Login with your regular username and password.
• Click Create Job.

Expected output. hrmclite will appear under the Smart Connectors list.

4.2. Experience Chiminey


This section focuses on topics that are related to submitting a smart
connector job. We will learn how to submit, monitor and terminate jobs; we
will also learn how to manage resources (Section 3.2).
In this section, we assume the following:
1. A Chiminey platform is already deployed and the following SCs are
activated (see Section 4.1):
(a) hrmclite: A cloud-based SC that executes Monte Carlo simulations [22].
(b) wordcount: A Hadoop-based SC that computes the frequency of each word
in a file matching the given regular expression. This SC uses a Hadoop
MapReduce compute resource.
2. You are logged in to the Chiminey platform.

(a) Open a browser and visit the Chiminey portal
(b) Login with your credentials

4.2.1. Registering resources


Purpose. In the exercises below, we will learn how to register resources.

Exercise 6. Here, we will register a Cloud-based compute resource.


1. Click Settings.
2. Click Compute Resource from the Settings menu.
3. Click Register Compute Resource.
4. Click the Cloud tab.
5. Select the resource type from the drop-down menu. The supported cloud types
are provided by Amazon, NeCTAR and RMIT eResearch (CSRACK).
6. Enter a unique resource name.
7. Enter EC2 access key and EC2 secret key.
8. Click Register.

Expected output. The newly added cloud-based compute resource will


be displayed under Cloud - NeCTAR/CSRack/Amazon EC2.

Exercise 7. Here, we will register an HPC compute resource, i.e. either


a cluster or a standalone server.
1. Click Settings.
2. Click Compute Resource from the Settings menu.
3. Click Register Compute Resource.
4. Click the HPC tab.
5. Enter a unique resource name.
6. Enter IP address or hostname of the cluster head node or the standalone
server.
7. Enter credentials, i.e. username and password. The password is not stored on
the Chiminey server; it is temporarily kept in memory to establish
private/public key authentication from the Chiminey server to the resource.
8. Click Register

Expected output. The new resource will be displayed under the HPC - Cluster
or Standalone Server list.

Exercise 8. Here, we will register a storage resource.


1. Click Settings.
2. Click Storage Resource from the Settings menu.
3. Click Register Storage Resource.
4. Click the Remote File System tab.
5. Follow steps from 5 to 8 of Exercise 7.

Expected output. The newly added resource will be displayed under the Re-
mote File System list.

Exercise 9. Here, we will register a Hadoop cluster.


1. Click Settings.
2. Click Compute Resource from the Settings menu.
3. Click Register Compute Resource.
4. Click the Analytics tab.
5. Select Hadoop MapReduce as Resource Type.
6. Enter a unique resource name.
7. Enter IP address or hostname of the resource.
8. Enter username and password.
9. Enter the Hadoop home path. This is the absolute path to the Hadoop
installation directory (containing the Hadoop executables) on the resource.
10. Click Register

Expected output. The new resource will be displayed under the Analytics -
Hadoop MapReduce list.

4.2.2. Updating and removing resources


Purpose. The following exercises show how to change the details of registered
resources, and how to remove a registered resource.

Exercise 10. Here, we focus on updating registered resources.
1. Click Settings.
2. From the Settings menu, depending on which resource you wish to up-
date, click either Compute Resource or Storage Resource. All registered
resources will be listed.
3. Locate the resource you wish to update, then click Update.
4. Make the changes, and when finished click Update.

Expected output. The resource will be listed with its new details.

Exercise 11. Here, we learn how to remove a registered resource.


Perform all the steps from Exercise 10 but click Delete. The resource will
be removed from the resources’ list.

4.2.3. Understanding the job submission UI


Purpose. The aim of the following discussion is to understand the job
submission UI, and therefore to be able to run any smart connector job without
difficulty. The job submission UI is accessed by clicking the Create Job tab.
Figure 2 shows the job submission UI of the wordcount SC.
The Chiminey job submission UI is composed of a list of activated SCs,
and the submission form of the selected SC. The submission form is divided
into various sections. In general, each submission form has at least the fol-
lowing three sections:

• Presets: The end-user can save the set of parameter values of a job as a
preset. Each preset must have a unique name. Using the unique preset name, the
end-user can retrieve, update and delete saved presets.

• Compute resource: This section includes the parameters that are needed to
utilise the compute resource associated with the given SC. Hadoop compute
resources need only the name of the registered Hadoop cluster (see Exercise 9),
while cloud compute resources need the resource name as well as the total
number of VMs that can be used for the computation. Note that the names of all
registered compute resources are automatically populated into a drop-down menu
on the submission form.

• Locations: These parameters are used to specify either input or output
directories on a registered storage resource. Each location consists of two
parameters: a storage location and a relative path. Storage location is a
drop-down menu that lists the names of all registered storages and their
corresponding root paths. A root path is an absolute path to the directory on
the storage resource under which all input and output files will be saved.
Relative path is the name of a subdirectory of the root path that contains
input and/or output files. In the case of input locations, Chiminey retrieves
the input files that are needed to run the smart connector job from this
subdirectory. In the case of output locations, Chiminey will save the output of
the SC job to the subdirectory.

Figure 2: Job submission UI for the wordcount SC
Some job submission forms include one or more of the following sections:
• Reliability: Fault-tolerance support is provided for each SC job. However,
the end-user can limit the degree of such support using the reliability
parameters: reschedule failed processes and maximum retries.

• Sweep: Sweep allows end-users to run multiple jobs simultaneously from a
single submission. The sweep allows end-users to provide ranges of input values
for parameters, and the resulting set of jobs produced spans all possible
values within that parameter space. These ranges of parameters are defined at
job submission time, rather than being "hard-coded" in the definition of the
smart connector. The common use cases for this feature are to generate multiple
results across one or more variation ranges for later comparison, and to
quickly perform experimental or ad hoc variations on existing connectors.
End-users specify the parameter(s) and their possible values via the sweep
parameter.

• Data curation resource: This section provides the parameters that are needed
to curate the output of an SC job. The section includes a drop-down menu that
is populated with the names of registered data curation services like MyTardis.

• Domain-specific parameters: These parameters are needed to guide the
execution of the domain-specific payload of a given SC. The wordcount SC has
Word Pattern, while hrmclite has pottype, error threshold, and others.

Exercise 12. Here we will submit an hrmclite SC job. The hrmclite submission
form has five sections: presets, cloud compute resource, input and output
locations, domain-specific parameters, and reliability.
1. Select the cloud compute resource for the job.
2. We will use 1 VM for the job. Number of VM instances represents the
ideal number of VMs the Chiminey server can create for the job. The
Chiminey platform terminates the job if it cannot create the minimum
number of VMs as requested via Minimum No. VMs.
3. Specify the input and output locations. Chiminey expects the input files to
be located under Relative path in a subdirectory called initial. Suppose a
storage resource has the root path /home/jsmith/experiments and the relative
path is given as hrmc_input; then the input files must be located under
/home/jsmith/experiments/hrmc_input/initial (see the sketch after this
exercise).
4. Keep the default values of the HRMCLite Smart Connector and Reliability
parameters.
5. Submit the job. Before submitting the job, you can save the parameter values
through presets.

Expected output. You will be redirected to the job monitoring page with the
new job listed at the top.
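As an illustration of the layout expected in step 3, the input directory could
be prepared on the storage resource as follows (the paths continue the example
above; the copied file names stand in for the actual hrmclite input files):

$ mkdir -p /home/jsmith/experiments/hrmc_input/initial
$ cp my_hrmc_inputs/* /home/jsmith/experiments/hrmc_input/initial/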

Exercise 13. Here we will submit a wordcount SC job. The wordcount submission
form has four sections: presets, Hadoop compute resource, input and output
locations, and domain-specific parameters.
1. Select the Hadoop compute resource for the job.
2. Enter the input and output locations as discussed in Exercise 12.
3. Keep the default values of the Word Count Smart Connector. The
default pattern counts all words in the input files.
4. Submit the job. Before submitting the job, you can save the parameter
values through presets.

Expected output. You will be redirected to the job monitoring page with the
new job listed at the top.

4.2.4. Monitoring and terminating jobs


Purpose. Here, we focus on job monitoring. Once a job is submitted, the
end-user can monitor the status of the job by clicking the Jobs tab. A status
summary of all jobs will be displayed, with the most recently submitted job at
the top. Click the Info button next to a job to view a detailed status report.

Figure 3: Monitoring a job

The Jobs page also allows researchers to terminate submitted jobs. To terminate
a job, check the box at the end of the status summary of the job, then click
the Terminate selected jobs button at the end of the page. The termination of
the selected jobs will be scheduled. Depending on the current activity of each
job, terminating one job may take longer than another.
When an SC job is completed, log in to your storage resource. The output is
located in the offset directory under the root path of the resource.
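For instance, assuming the output location of Exercise 12 was given as the
relative path hrmc_output on the storage resource with root path
/home/jsmith/experiments, the results could be inspected with (host name
illustrative):

$ ssh jsmith@storage-host
$ ls /home/jsmith/experiments/hrmc_output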

4.3. Get Your Hands Dirty


Here we cover the topics that are related to creating new smart connec-
tors. In order to complete the following exercises, you need to have access
to the Chiminey platform source code. The source code is located in the
/opt/chiminey/current directory of the running Chiminey docker container.

1. Log in to the Chiminey node using a terminal.
2. Change to the docker-chiminey directory: $ cd docker-chiminey
3. Log in to the Chiminey docker container: $ ./chimineyterm
4. Verify you are in /opt/chiminey/current: $ pwd

4.3.1. Creating a smart connector


Purpose. In this section, we create a new smart connector. Creating a smart
connector involves completing three tasks: providing the core functionality of
the SC, attaching resources and optional non-functional properties, and
registering the new SC with the Chiminey server.
The core functionality can be provided either via a payload or by overriding
the run_task method of the chiminey.corestages.execute.Execute class. This
tutorial uses the payload method; refer to the Chiminey documentation [23] for
the latter method.

Payload
A payload has the following structure:

payload_name
    bootstrap.sh
    process_payload
        main.sh
        schedule.sh
        domain-specific executables

The names of the files and directories under payload_name, except the
domain-specific ones, cannot be changed. bootstrap.sh includes instructions to
install packages, needed by the SC job, on the compute resource. schedule.sh is
needed to add process-specific configurations: some SCs spawn multiple
processes to complete a single job, and if each process needs to be configured
differently, the instructions on how to configure each process should be
recorded in schedule.sh. main.sh runs the core functionality of the SC and
writes the output to a file. The domain-specific executables are additional
files that are needed by main.sh.
Not all SC jobs require new packages to be installed, process-level
configuration or additional domain-specific executables. In such cases, the
minimal payload, shown below, can be used.
payload_name
    process_payload
        main.sh
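For payloads that do need setup, bootstrap.sh is an ordinary shell script
executed on the compute resource before the computation starts. A minimal
sketch follows; the package name and the package manager are illustrative and
depend on the VM image used:

#!/bin/sh
# illustrative bootstrap.sh: install packages the SC job depends on
yum -y install gcc-gfortran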

Resources and non-functional properties


Resources and non-functional properties are attached to an SC by overriding the
get_ui_schema_namespace method of the CoreInitial class
(chiminey.initialisation.coreinitial.CoreInitial). New domain-specific
variables can be introduced via the get_domain_specific_schemas method.

Exercise 14. Now we create a random number smart connector, hereafter referred
to as randnum. The randnum SC generates a random number with a timestamp, and
then writes the output to a file. We will use the minimal payload to provide
the core functionality of this SC. Thus, we will prepare the following payload:

payload_randnum
    process_payload
        main.sh

Below is the content of main.sh.


#!/bin/sh

OUTPUT_DIR=$1
echo $RANDOM > $OUTPUT_DIR/signed_randnum
# append the timestamp to the same file (>> rather than >, so the
# random number is not overwritten)
date >> $OUTPUT_DIR/signed_randnum

# --- EOF ---


Notice OUTPUT_DIR: this is the path to the output directory, and Chiminey
expects all outputs to be redirected to that location. The contents of
OUTPUT_DIR will be transferred to the output location at the end of each
computation.
The next step is to attach resources. We will use a Unix compute resource for
the computation. We also need to attach an output location. The list of
available resources and non-functional properties is given as the INPUT_FIELDS
parameter in chiminey/settings_changeme.py. Under chiminey/, we create a Python
package randnum (the SC package must be under the
/opt/chiminey/current/chiminey/ hierarchy), and add initialise.py.

from chiminey.initialisation import CoreInitial
from django.conf import settings

class RandNumInitial(CoreInitial):
    def get_ui_schema_namespace(self):
        schemas = [
            settings.INPUT_FIELDS['unix'],
            settings.INPUT_FIELDS['output_location'],
        ]
        return schemas

# --- EOF ---

Registration
The final step is registering the randnum SC with the Chiminey server. The
details of this SC will be added to the SMART_CONNECTORS dictionary in
chiminey/settings_changeme.py. The details include a unique name (with no
spaces), the Python path to the RandNumInitial class, the description of the
SC, and the absolute path to the payload.
"randnum": {
"name": "randnum",
"init": "chiminey.randnum.initialise.RandNumInitial",
"description": "Randnum generator, with timestamp",
"payload": "/opt/chiminey/current/payload_randnum"
},

Restart the Chiminey server and then activate randnum SC.
$ sh restart
$ ./activatesc randnum

Expected output. randnum SC will appear under the Smart Connector list of
the Chiminey portal (see Exercise 5).

Figure 4: Random number smart connector

Now we are ready to submit the randnum job (see Figure 4). We only need to
select the compute resource name and provide the storage resource for
transferring the output of the computation. Recall that storage resource
name/offset is a location (see Section 4.2.3). Suppose we have already
registered a storage resource with the name mystor and a root path of
/root/chiminey_home, and let the offset be randomnumber. We set the output
location to mystor/randomnumber. Submit the job and, when the job is completed,
log in to your storage resource and check the contents of randomnumber under
the root path of the resource, in the case of mystor, /root/chiminey_home.
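A quick check might look like the following; the job subdirectory name is
generated by Chiminey, so the wildcard is only illustrative:

$ ls /root/chiminey_home/randomnumber
$ cat /root/chiminey_home/randomnumber/*/signed_randnum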

Reflections
In this exercise, we have shown how to create a smart connector. Even though we
have used a simple random number generator, the tasks involved for other
programs are similar: if a program can be executed on a cloud, a cluster or
Hadoop, then it can be packaged as a smart connector. The huge benefit of using
the Chiminey server to run your program is that you don't need to worry about
how to manage the execution of your program on any of the provided compute
resources. For instance, if we want to generate random numbers on a cloud VM,
we need to change only one word in the get_ui_schema_namespace method: replace
'unix' by 'cloud'. Then restart Chiminey and activate your cloud-based random
number generator. We encourage the reader to check the examples in the Chiminey
documentation [23]; they show how to create various types of smart connectors:
cloud-based, Hadoop-based, and connectors using sweeps, reliability features
and data curation resources.

5. Summary
In this article we surveyed research methods and tools that lower the bar-
riers to accessing high-performance computing, big data and cloud resources
for domain scientists. We introduced Chiminey - a platform that evolved
in the bioscience and material science communities with our e-science soft-
ware research perspective complementing the scientific methods in these dis-
ciplines. The article summarised prior and current research in parallel and
distributed software architecture and fault-tolerance underpinning Chiminey.
We briefly mentioned various uses of the platform since its inception but
focussed largely on its architecture and simplified conceptual view of and
usage for scientific workflow automation, when big data is at the centre of
experiments, simulation and computation and when computational resources
include high-performance computing and/or cloud computing.
A large second part of the article is dedicated to hands-on tutorials
demonstrating the Chiminey system through a number of different scenarios
of usage and stages in the science workflow. We demonstrated:

• installing the Docker deployment environment and the Chiminey system,

• registering resources for file stores, Hadoop and cloud VMs,

• activating the hrmclite and wordcount smart connectors,

• running a smart connector and investigating the resulting output,

• building a new smart connector.

Chiminey is an extensible framework, as the stages, payloads and connectors
that are created can be reused and repurposed for other tasks. Chiminey
is licensed with a New BSD license, and scientists are invited to contribute
their new stages and connectors into the core library of functionality (see
http://chiminey.net) for others to utilise and extend.
Beyond its initial purpose, Chiminey has been used by further domain
scientists, for example in engineering disciplines for the testing of
software-intensive engineered systems, and also for ab initio quantum physics
simulation.

Acknowledgement
The Bioscience Data Platform project acknowledges funding from the
NeCTAR project No. 2179.

[1] M. Armbrust, A. Fox, R. Griffith, et al., A view of cloud computing,


Commun. ACM 53 (2010) 50–58.

[2] I. Yusuf, I. Thomas, M. Spichkova, S. Androulakis, G. Meyer,


D. Drumm, G. Opletal, S. Russo, A. Buckle, H. Schmidt, Chiminey:
Reliable computing and data management platform in the cloud, in:
Proceedings of the 37th International Conference on Software Engineering
(ICSE'15), pp. 677–680.

[3] M. Spichkova, I. Thomas, H. Schmidt, I. Yusuf, D. Drumm, S. An-


droulakis, G. Opletal, S. Russo, Scalable and fault-tolerant cloud com-
putations: Modelling and implementation, in: Proceedings of the 21st
IEEE International Conference on Parallel and Distributed Systems (IC-
PADS 2015), pp. 396–404.

[4] M. Spichkova, H. Schmidt, I. Thomas, I. Yusuf, S. Androulakis,


G. Meyer, Managing usability and reliability aspects in cloud comput-
ing, in: Proceedings of the 11th International Conference on Evaluation
of Novel Software Approaches to Software Engineering.

[5] N. Leavitt, Is cloud computing really ready for prime time?, Computer
42 (2009) 15–20.

[6] Q. Zhang, L. Cheng, R. Boutaba, Cloud computing: state-of-the-art


and research challenges, Journal of Internet Services and Applications
1 (2010) 7–18.

[7] I. Yusuf, H. Schmidt, Parameterised architectural patterns for providing
cloud service fault tolerance with accurate costings, in: Proc. of the 16th
Int. ACM Sigsoft Symp. on Component-Based Software Engineering, pp.
121–130.

[8] L. M. Vaquero, L. Rodero-Merino, J. Caceres, M. Lindner, A break in


the clouds: towards a cloud definition, SIGCOMM Comput. Commun.
Rev. 39 (2008) 50–55.

[9] R. Buyya, A. Sulistio, Service and utility oriented distributed computing


systems: Challenges and opportunities for modeling and simulation com-
munities, in: Proc. of the 41st Annual Simulation Symposium, ANSS-41
’08, IEEE Comp. Society, 2008, pp. 68–81.

[10] S. Ostermann, A. Iosup, N. Yigitbasi, R. Prodan, T. Fahringer,


D. Epema, A Performance Analysis of EC2 Cloud Computing Services
for Scientific Computing, in: D. Avresky, M. Diaz, A. Bode, B. Ciciani,
E. Dekel (Eds.), Cloud Computing, volume 34 of Lecture Notes of the
Institute for Computer Sciences, Social-Informatics and Telecommuni-
cations Engineering, Springer Berlin Heidelberg, 2010, pp. 115–131.

[11] W. Su, F. Yang, H. Zhu, Q. Li, Modeling MapReduce with CSP, in: 3rd IEEE
International Symposium on Theoretical Aspects of Software Engineering (TASE
2009), pp. 301–302.

[12] G. S. Reddy, Y. Feng, Y. Liu, J. S. Dong, S. Jun, R. Kanagasabai, To-


wards Formal Modeling and Verification of Cloud Architectures: A Case
Study on Hadoop, in: 2013 IEEE Ninth World Congress on Services,
pp. 306–311.

[13] P. N. Martinaitis, C. J. Patten, A. L. Wendelborn, Component-based


stream processing ”in the cloud”, in: Proceedings of the 2009 Workshop
on Component-Based High Performance Computing, CBHPC ’09, ACM,
2009, pp. 16:1–16:12.

[14] R. Kuntschke, A. Kemper, Data stream sharing, in: T. Grust,


H. Höpfner, A. Illarramendi, S. Jablonski, M. Mesiti, S. Müller, P.-
L. Patranjan, K.-U. Sattler, M. Spiliopoulou, J. Wijsen (Eds.), Current
Trends in Database Technology – EDBT 2006, volume 4254 of LNCS,
Springer, 2006, pp. 769–788.

[15] T. Oinn, M. Greenwood, M. Addis, et al., Taverna: Lessons in Cre-
ating a Workflow Environment for the Life Sciences, Concurrency and
Computation: Practice and Experience 18 (2006) 1067–1100.

[16] E. Afgan, D. Baker, N. Coraor, et al., Harnessing cloud computing with


Galaxy Cloud, Nature Biotechnology 29 (2011) 972–974.

[17] R. Buyya, D. Abramson, J. Giddy, Nimrod/G: An Architecture for a Re-


source Management and Scheduling System in a Global Computational
Grid, 2000.

[18] B. Ludäscher, I. Altintas, C. Berkley, et al., Scientific workflow manage-


ment and the Kepler system, Concurrency and Computation: Practice
and Experience 18 (2006) 1039–1065.

[19] NeCTAR, 2015. The National eResearch Collaboration Tools and Re-
sources. http://www.nectar.org.au/.

[20] S. Androulakis, J. Schmidberger, M. A. Bate, et al., Federated reposi-


tories of X-ray diffraction images, Acta Crystallographica, Section D 64
(2008) 810–814.

[21] Welcome Friends to the Docker Docs!, https://docs.docker.com/, 2016.
Accessed online on 18-Feb-2016.

[22] G. Opletal, T. C. Petersen, I. K. Snook, S. P. Russo, HRMC 2.0: Hy-


brid Reverse Monte Carlo method with silicon, carbon and germanium
potentials, Com. Phys. Comm. 184 (2013) 1946–1957.

[23] Welcome to the Chiminey Documentation,
http://chiminey.readthedocs.org/en/latest/, 2016. Accessed on 31-March-2016.
