X10: An Object-Oriented Approach to Non-uniform Cluster Computing
Vijay Saraswat
IBM Research

Overview

- Introduction and context
    - Clustered Computing
- Language model and constructs
    - Big picture
    - places, atomic, async, finish, clocks, arrays
- Example programs and demo
- Conclusion and Future Work
    - Guarantees
    - Challenges

July 23, 2003

IBM PL Day 2005

Acknowledgements

X10 core team:
- Philippe Charles
- Chris Donawa (IBM Toronto)
- Kemal Ebcioglu
- Christian Grothoff (Purdue)
- Allan Kielstra (IBM Toronto)
- Maged Michael
- Christoph von Praun
- Vivek Sarkar

X10 Tools:
Julian Dolby, Steve Fink, Robert Fuhrer, Matthias Hauswirth, Peter Sweeney, Frank Tip, Mandana Vaziri

Additional contributors to X10 ideas:
David Bacon, Bob Blainey, Perry Cheng, Julian Dolby, Guang Gao (U Delaware), Robert O'Callahan, Filip Pizlo (Purdue), Lawrence Rauchwerger (Texas A&M), Mandana Vaziri, Jan Vitek (Purdue), V.T. Rajan, Radha Jagadeesan (DePaul)

University partners:
MIT (StreamIt), Purdue University (X10), UC Berkeley (StreamBit), U. Delaware (Atomic sections), U. Illinois (Fortran plug-in), Vanderbilt University (Productivity metrics), DePaul U (Semantics)

X10 PM+Tools Team Lead:
Kemal Ebcioglu, Vivek Sarkar

PERCS Principal Investigator:
Mootaz Elnozahy

Performance and Productivity Challenges

1) Memory wall: Architectures exhibit severe non-uniformities in bandwidth and latency in the memory hierarchy.
2) Frequency wall: Architectures introduce hierarchical heterogeneous parallelism to compensate for the frequency scaling slowdown.
3) Scalability wall: Software will need to deliver ~10^5-way parallelism to utilize peta-scale parallel systems.

[Figure: levels of parallelism, from outermost to innermost: clusters (scale-out) of processor clusters; SMP; multiple cores on a chip; coprocessors (SPUs); SMTs; SIMD; ILP. Memory hierarchy: PEs with L1 caches, L2 caches, L3 cache, memory.]

High Complexity Limits Development Productivity

[Figure: one billion transistors in a chip, organized as processor clusters of PEs with L1 caches, L2 caches, an L3 cache, and memory. 1995: entire chip can be accessed in 1 cycle. 2010: only a small fraction of the chip can be accessed in 1 cycle.]

Major sources of complexity for the application developer:
1) Severe non-uniformities in data accesses
2) Applications must exhibit large degrees of parallelism (up to ~10^5 threads)

Complexity leads to increases in all phases of the HPC Software Lifecycle related to parallel code.

[Figure: HPC Software Lifecycle. Boxes: Requirements; Written Specification; Algorithm Development; Input Data; Parallel Specification; Source Code; Development of Parallel Source Code (Design, Code, Test, Port, Scale, Optimize); Production Runs of Parallel Code; Maintenance and Porting of Parallel Code.]

PERCS Programming Model/Tools: Overall Architecture

PERCS = Productive Easy-to-use Reliable Computer Systems

[Figure: layered architecture, top to bottom]
- Source code: X10; Java + Threads + Conc utils; C/C++ / MPI / OpenMP; Fortran / MPI / OpenMP
- Development toolkits: X10 Development Toolkit, Java Development Toolkit, C Development Toolkit, Fortran Development Toolkit, ...
- Integrated Programming Environment: Edit, Compile, Debug, Visualize, Refactor
    - Use Eclipse platform (eclipse.org) as foundation for integrating tools
    - Morphogenic Software: separation of concerns, separation of roles
- Components and runtimes: X10 components on the X10 runtime; Java components on the Java runtime; Fortran components on the Fortran runtime; C/C++ components on the C/C++ runtime; fast extern interface
- Integrated Concurrency Library: messages, synchronization, threads
- Continuous Program Optimization (CPO)
- PERCS System Software (K42)
- PERCS System Hardware
Side boxes: Performance Exploration; Productivity Metrics

X10 Design Assumptions

Goals: Productivity, Scalability

- Axiom: OO provides proven baseline productivity, maintenance, portability benefits.
- Axiom: Programmer must have explicit language constructs to deal with non-uniformity of access.
- Axiom: Design must rule out large classes of errors (Type safe, Memory safe, Pointer safe, Lock safe, Clock safe).
- Axiom: Allow specification of a large collection of activities.
- Axiom: A program must use scalable synchronization constructs.
- Axiom: The runtime may implement aggregate operations more efficiently than user-specified iterations with index variables.
- Axiom: The user may know more than the compiler/RTS.
- Axiom: Design must support incremental introduction of explicit place types/remote operations.
- Axiom: PM must integrate with static tools (Eclipse): flag performance problems, refactor code, detect races.
- Axiom: PM must support automatic static and dynamic optimization (CPO).

Support the high-productivity (and, possibly, high-performance) programmer.

The X10 Programming Model

[Figure: places joined by a partitioned global heap. Each place holds a place-local heap and activities with activity-local storage (heap, stack, control). Places exchange inbound and outbound activities and activity replies, and share immutable data. Granularity of a place can range from a single register file to an entire SMP system.]

- A program is a collection of places, each containing resident data and a dynamic collection of activities. (place)
- Program may distribute aggregate data (arrays) across places during allocation. (distribution)
- Program may directly operate only on local data, using atomic blocks. (atomic, when)
- Program may spawn multiple (local or remote) activities in parallel. (async, {at/for}each)
- Program must use asynchronous operations to access/update remote data.
- Program may detect termination or (repeatedly) detect quiescence of a data-dependent, distributed set of activities. (finish, clock)

Cluster Computing: common framework for P >= 1
- Shared Memory (P = 1)
- MPI (P > 1)

Formalized in Saraswat, Jagadeesan, "Concurrent Clustered Programming".

async

async PlaceExpressionSingleList_opt Statement

async (P) S
- Parent activity creates a new child activity at place P, to execute statement S; returns immediately.
- S may reference final variables in enclosing blocks.

    double A[D] = ...; // Global dist. array
    final int k = ...;
    async ( A.distribution[99] ) {
        // Executed at A[99]'s place
        atomic A[99] = k;
    }

cf. Cilk's spawn
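
The remote update on the async slide can be approximated in plain Java, which is the translation target named later in this deck. The following is only a rough sketch under stated assumptions: PlacesSketch, placeOf, asyncSet and read are hypothetical names, each place is modelled as a single-threaded executor, and the array uses a simple ceiling-based block layout. It is not how the X10 runtime is implemented.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for X10 places: one single-threaded executor per place.
final class PlacesSketch {
    private final ExecutorService[] places;
    private final double[] data;     // backing store for the "distributed" array
    private final int blockSize;     // assumed block layout: ceil(length / numPlaces)

    PlacesSketch(int numPlaces, int length) {
        places = new ExecutorService[numPlaces];
        for (int p = 0; p < numPlaces; p++) {
            places[p] = Executors.newSingleThreadExecutor();
        }
        data = new double[length];
        blockSize = (length + numPlaces - 1) / numPlaces;
    }

    // Plays the role of A.distribution[i]: which place owns index i.
    int placeOf(int i) { return i / blockSize; }

    // Like "async (P) S": hand S to place P and return immediately.
    Future<?> async(int place, Runnable body) { return places[place].submit(body); }

    // Like "async (A.distribution[i]) { atomic A[i] = k; }".
    // Tasks of one place run one at a time, which stands in for atomic here.
    Future<?> asyncSet(int i, double k) {
        return async(placeOf(i), () -> data[i] = k);
    }

    // Reads A[i] at its owning place and waits for the reply.
    double read(int i) {
        try {
            return places[placeOf(i)].submit(() -> data[i]).get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    void shutdown() {
        for (ExecutorService place : places) place.shutdown();
    }

    public static void main(String[] args) {
        PlacesSketch a = new PlacesSketch(4, 100);
        a.asyncSet(99, 42.0);              // the parent does not wait here
        System.out.println(a.read(99));    // queued behind the write at the same place
        a.shutdown();
    }
}
```

Because the read is queued at the same single-threaded place as the earlier write, it observes the written value; across different places no such ordering is implied.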

finish

Statement ::= finish Statement

finish S
- Execute S, but wait until all (transitively) spawned asyncs have terminated.
- Trap all exceptions thrown by spawned activities.
- Throw an (aggregate) exception if any spawned async terminates abruptly. (Rooted Exception Model)
- Useful for expressing synchronous operations on remote data
- And potentially, ordering information in a weakly consistent memory model

    finish ateach(point [i]:A) A[i] = i;
    finish async(A.distribution[j]) A[j] = 2;
    // All A[i]=i will complete before A[j]=2;

cf. Cilk's sync
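
The termination and exception rules of finish can be imitated in Java with a counter of pending activities. This is a minimal sketch, assuming a caller-supplied thread pool; FinishSketch, its async method and the demo helper are hypothetical names, and the aggregate exception is modelled with Java's suppressed-exception mechanism rather than any X10 class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Consumer;

// Hypothetical finish scope: counts spawned activities and collects their failures.
final class FinishSketch {
    private final ExecutorService pool;
    private final List<Throwable> failures = new ArrayList<>();
    private int pending = 0;

    private FinishSketch(ExecutorService pool) { this.pool = pool; }

    // Spawn an activity in this scope. Activities may call async again,
    // so transitively spawned work is counted as well.
    void async(Runnable body) {
        synchronized (this) { pending++; }           // count before the task can finish
        pool.execute(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                synchronized (this) { failures.add(t); }   // trap, do not lose
            } finally {
                synchronized (this) {
                    if (--pending == 0) notifyAll();
                }
            }
        });
    }

    // Like "finish S": run S, wait for all spawned activities, then
    // throw one aggregate exception if anything terminated abruptly.
    static void finish(ExecutorService pool, Consumer<FinishSketch> body) {
        FinishSketch scope = new FinishSketch(pool);
        try {
            body.accept(scope);
        } catch (RuntimeException e) {
            synchronized (scope) { scope.failures.add(e); }
        }
        scope.awaitAll();
    }

    private synchronized void awaitAll() {
        while (pending > 0) {
            try {
                wait();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted inside finish", e);
            }
        }
        if (!failures.isEmpty()) {
            RuntimeException aggregate =
                new RuntimeException(failures.size() + " activities terminated abruptly");
            for (Throwable t : failures) aggregate.addSuppressed(t);
            throw aggregate;
        }
    }

    // Mirrors the slide: all A[i] = i complete before A[j] = 2 starts.
    static int[] demo(int n, int j) {
        ExecutorService pool = Executors.newCachedThreadPool();
        int[] a = new int[n];
        try {
            finish(pool, f -> {
                for (int i = 0; i < n; i++) {
                    final int idx = i;
                    f.async(() -> a[idx] = idx);
                }
            });
            finish(pool, f -> f.async(() -> a[j] = 2));
        } finally {
            pool.shutdown();
        }
        return a;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(demo(8, 3)));
    }
}
```

The counter is incremented before the task is submitted, so the scope cannot observe zero pending activities while a spawned activity is still about to start.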

atomic

Statement ::= atomic Statement
MethodModifier ::= atomic

- Atomic blocks are conceptually executed in a single step, while other activities are suspended.
- An atomic block may not include:
    - Blocking operations
    - Accesses to data at remote places
    - Creation of activities at remote places

    // target defined in lexically enclosing environment.
    public atomic boolean CAS( Object old, Object newValue ) {
        if (target.equals(old)) {
            target = newValue;
            return true;
        }
        return false;
    }

    // push data onto concurrent list-stack
    Node<int> node = new Node<int>(17);
    atomic { node.next = head; head = node; }

when

Statement ::= WhenStatement
WhenStatement ::= when ( Expression ) Statement

- Activity suspends until a state in which the guard is true; in that state the body is executed atomically.

    class OneBuffer {
        nullable Object datum = null;
        boolean filled = false;

        public void send(Object v) {
            when ( !filled ) {
                this.datum = v;
                this.filled = true;
            }
        }

        public Object receive() {
            when ( filled ) {
                Object v = datum;
                datum = null;
                filled = false;
                return v;
            }
        }
    }
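
For comparison, the guarded-atomic behaviour of when is close to a Java monitor with a wait loop. The sketch below rewrites OneBuffer under that assumption; the generic parameter and the awaitChange helper are additions for illustration, not part of the slide.

```java
// Java analogue of the OneBuffer slide: "when (guard) body" becomes
// "while (!guard) wait(); body" inside a synchronized method.
final class OneBuffer<T> {
    private T datum = null;
    private boolean filled = false;

    // when ( !filled ) { datum = v; filled = true; }
    synchronized void send(T v) {
        while (filled) awaitChange();
        datum = v;
        filled = true;
        notifyAll();               // wake activities whose guard may now hold
    }

    // when ( filled ) { v = datum; datum = null; filled = false; return v; }
    synchronized T receive() {
        while (!filled) awaitChange();
        T v = datum;
        datum = null;
        filled = false;
        notifyAll();
        return v;
    }

    // Called only while holding this object's monitor.
    private void awaitChange() {
        try {
            wait();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        OneBuffer<String> buffer = new OneBuffer<>();
        new Thread(() -> buffer.send("hello")).start();
        System.out.println(buffer.receive());
    }
}
```

Unlike when, the Java version has to re-check the guard in a loop and signal explicitly after every state change; the X10 construct leaves both to the runtime.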

regions, distributions

Region
- A (multi-dimensional) set of indices

Distribution
- A mapping from indices to places

High level algebraic operations are provided on regions and distributions.

    region R = 0:100;
    region R1 = [0:100, 0:200];
    region RInner = [1:99, 1:199];
    // a local distribution
    distribution D1 = R -> here;
    // a blocked distribution
    distribution D = block(R);
    // union of two distributions
    distribution D = (0:1) -> P0 || (2:N) -> P1;
    distribution DBoundary = D - RInner;

Based on ZPL.
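
Since a distribution is described above as a mapping from indices to places, a one-dimensional version can be sketched as a function from int to int. The names DistributionSketch, constant, block and union are hypothetical, and the ceiling-based block size is an assumption: the slide does not say how block(R) divides a region whose size is not a multiple of the number of places.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical model: a 1-D distribution is a function index -> place id.
final class DistributionSketch {

    // Like "R -> here": every index maps to one place.
    static IntUnaryOperator constant(int place) {
        return i -> place;
    }

    // Like "block(R)" for R = lo:hi, assuming blocks of ceil(n / numPlaces) indices.
    static IntUnaryOperator block(int lo, int hi, int numPlaces) {
        int n = hi - lo + 1;
        int blockSize = (n + numPlaces - 1) / numPlaces;
        return i -> {
            if (i < lo || i > hi) {
                throw new IndexOutOfBoundsException("index " + i + " outside region");
            }
            return (i - lo) / blockSize;
        };
    }

    // Like "(lo1:hi1) -> ... || rest -> ...": use d1 inside lo1:hi1, d2 elsewhere.
    static IntUnaryOperator union(int lo1, int hi1, IntUnaryOperator d1, IntUnaryOperator d2) {
        return i -> (i >= lo1 && i <= hi1) ? d1.applyAsInt(i) : d2.applyAsInt(i);
    }

    public static void main(String[] args) {
        IntUnaryOperator d = block(0, 100, 4);   // region 0:100 over 4 places
        for (int i : new int[] {0, 25, 26, 100}) {
            System.out.println(i + " -> place " + d.applyAsInt(i));
        }
    }
}
```

With this representation the algebraic operations reduce to function composition, which is enough to reason about where an index lives but says nothing about how data is actually moved.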

arrays

Arrays may be
- Multidimensional
- Distributed
- Value types
- Initialized in parallel:
      int [D] A = new int[D] (point [i,j]) {return N*i+j;};

Array section
- A[RInner]

High level parallel array, reduction and span operators
- Highly parallel library implementation
- A-B (array subtraction)
- A.reduce(intArray.add,0)
- A.sum()

ateach, foreach

ateach ( FormalParam: Expression ) Statement
foreach ( FormalParam: Expression ) Statement

ateach (point p:A) S
- Creates |region(A)| async statements
- Instance p of statement S is executed at the place where A[p] is located

foreach (point p:R) S
- Creates |R| async statements in parallel at current place

Termination of all activities can be ensured using finish.

    public boolean run() {
        distribution D = distribution.factory.block(TABLE_SIZE);
        long[.] table = new long[D] (point [i]) { return i; };
        long[.] RanStarts = new long[distribution.factory.unique()]
            (point [i]) { return starts(i); };
        long[.] SmallTable = new long value[TABLE_SIZE]
            (point [i]) { return i*S_TABLE_INIT; };
        finish ateach (point [i] : RanStarts ) {
            long ran = nextRandom(RanStarts[i]);
            for (int count: 1:N_UPDATES_PER_PLACE) {
                int J = f(ran);
                long K = SmallTable[g(ran)];
                async atomic table[J] ^= K;
                ran = nextRandom(ran);
            }
        }
        return table.sum() == EXPECTED_RESULT;
    }

clocks

Operations
- clock c = new clock();
- async (P) clock (c1, ..., cn) S
    - (Clocked async): activity is registered on the clocks (c1, ..., cn)
- c.resume();
    - Signals completion of work by activity in this clock phase.
- next;
    - Blocks until all clocks it is registered on can advance.
    - Implicitly resumes all clocks.
- c.drop();
    - Unregister activity with c.

Static Semantics
- An activity may operate only on those clocks it is live on.
- In finish S, S may not contain any top-level clocked asyncs.

Dynamic Semantics
- A clock c can advance only when all its registered activities have executed c.resume().

No explicit operation to register a clock.

Supports over-sampling, hierarchical nesting.
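
The clock operations listed above have a loose counterpart in java.util.concurrent.Phaser, which may help readers who know Java barriers. The correspondence used below is an analogy chosen for this sketch, not a statement about the X10 runtime: register() for a clocked async, arrive() for c.resume(), awaitAdvance or arriveAndAwaitAdvance for next, and arriveAndDeregister() for c.drop(). ClockSketch and run are hypothetical names.

```java
import java.util.concurrent.Phaser;

// Two-phase computation: every worker publishes a value in phase 1,
// then reads its neighbour's value in phase 2.
final class ClockSketch {

    static int[] run(int workers) {
        Phaser clock = new Phaser(1);       // creator is registered, like "new clock()"
        int[] phase1 = new int[workers];
        int[] phase2 = new int[workers];

        for (int w = 0; w < workers; w++) {
            final int id = w;
            clock.register();               // clocked async: register the child before it starts
            new Thread(() -> {
                phase1[id] = id + 1;                        // phase-1 work
                clock.arriveAndAwaitAdvance();              // next;
                phase2[id] = phase1[(id + 1) % workers];    // all phase-1 writes are visible now
                clock.arriveAndDeregister();                // c.drop();
            }).start();
        }

        int phase = clock.arrive();         // c.resume(): signal without blocking
        clock.awaitAdvance(phase);          // next; after an earlier resume
        clock.arriveAndAwaitAdvance();      // next; wait for the phase-2 work
        clock.arriveAndDeregister();        // c.drop();
        return phase2;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(run(4)));
    }
}
```

One difference worth noting: a Phaser party that has already called arrive() must not arrive again in the same phase, so the split between resume and next has to be managed by hand, whereas the slide states that next implicitly resumes all clocks.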

Example: SpecJBB

    finish async {
        clock c = new clock();
        Company company = createCompany(...);
        for (int w : 0:wh_num) for (int t: 0:term_num)
            async clocked(c) { // a client
                initialize;
                next; //1.
                while (company.mode!=STOP) {
                    select a transaction;
                    think;
                    process the transaction;
                    if (company.mode==RECORDING)
                        record data;
                    if (company.mode==RAMP_DOWN) {
                        c.resume(); //2.
                    }
                }
                gather global data;
            } // a client

        // master activity
        next; //1.
        company.mode = RAMP_UP;
        sleep rampuptime;
        company.mode = RECORDING;
        sleep recordingtime;
        company.mode = RAMP_DOWN;
        next; //2.
        // All clients in RAMP_DOWN
        company.mode = STOP;
    } // finish
    // Simulation completed.
    print results.

Formal semantics (FX10)

- Based on Middleweight Java (MJ)
- Configuration is a tree of located processes
    - Tree necessary for finish.
- Clocks formalized using short circuits (PODC '88).
- Bisimulation semantics.

Basic theorems
- Equational laws
- Clock quiescence is stable.
- Monotonicity of places.
- Deadlock freedom (for language w/out when).
- Type Safety
- Memory Safety

Current Status

We have an operational X10 0.41 implementation. All programs shown here run.

Timeline
- 09/03: PERCS Kickoff
- 02/04: X10 Kickoff
- 07/04: X10 0.32 Spec Draft
- 02/05: X10 Prototype #1
- 07/05: X10 Productivity Study
- 12/05: X10 Prototype #2
- 06/06: Open Source Release?

[Figure: implementation pipeline. X10 source -> Parser (from X10 Grammar) -> AST -> Analysis passes -> Annotated AST -> Code emitter (using Code Templates) -> Target Java -> JVM with X10 Multithreaded RTS and native code -> Program output, PEM Events.]

Structure
- Translator based on Polyglot (Java compiler framework)
- X10 extensions are modular.
- Uses Jikes parser generator.

Code metrics (classes+interfaces / LOC)
- Parser: ~45/14K
- Translator: ~112/9K
- RTS: ~190/10K
- Polyglot base: ~517/80K
- Approx. 180 test cases.

Limitations
- Clocked final not yet implemented.
- Type-checking incomplete.
- No type inference.
- Implicit syntax not supported.

Future Work: Implementation

- Type checking/inference
    - Clocked types
    - Place-aware types
- Consistency management
    - Lock assignment for atomic sections
    - Data-race detection
- Activity aggregation
    - Batch activities into a single thread.
- Message aggregation
    - Batch small messages.
- Load-balancing
    - Dynamic, adaptive migration of places from one processor to another.
- Continuous optimization
- Efficient implementation of scan/reduce
- Efficient invocation of components in foreign languages
    - C, Fortran
- Garbage collection across multiple places

Welcome University Partners and other collaborators.

Future work: Other topics

Design/Theory
- Atomic blocks
- Structural study of concurrency and distribution
- Clocked types
- Hierarchical places
- Weak memory model
- Persistence/Fault tolerance
- Database integration

Tools
- Refactoring language.

Applications
- Several HPC programs planned currently.
- Also: web-based applications.

Welcome University Partners and other collaborators.

Backup material

Type system

- Value classes
    - May only have final fields.
    - May only be subclassed by value classes.
    - Instances of value classes can be copied freely between places.
- nullable is a type constructor
    - nullable T contains the values of T and null.
- Place types: T@P, specify the place at which the data object lives.

Future work: Include generics and dependent types.

Example: Latch

    public interface future {
        boolean forced();
        Object force();
    }

    public class boxed {
        nullable Object val;
    }

    public class Latch implements future {
        protected boolean forced = false;
        protected nullable boxed result = null;
        protected nullable exception z = null;

        public atomic boolean setValue( nullable Object val,
                                        nullable exception z ) {
            if ( forced ) return false;
            // These assignments happen only once.
            this.result.val = val;
            this.z = z;
            this.forced = true;
            return true;
        }

        public atomic boolean forced() {
            return forced;
        }

        public Object force() {
            when ( forced ) {
                if (z != null) throw z;
                return result;
            }
        }
    }