
Encyclopedia of Parallel Computing

David Padua (Ed.)

With  Figures and  Tables

Editor-in-Chief
David Padua
University of Illinois at Urbana-Champaign
Urbana, IL
USA

ISBN ---- e-ISBN ----


DOI ./----
Print and electronic bundle ISBN: ----
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 

© Springer Science+Business Media, LLC 

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer
Science+Business Media, LLC,  Spring Street, New York, NY , USA), except for brief excerpts in connection with reviews or scholarly
analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or
dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be
taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

Parallelism, the capability of a computer to execute operations concurrently, has been a constant throughout the
history of computing. It impacts hardware, software, theory, and applications. The fastest machines of the past few
decades, the supercomputers, owe their performance advantage to parallelism. Today, physical limitations have
forced the adoption of parallelism as the preeminent strategy of computer manufacturers for continued perfor-
mance gains of all classes of machines, from embedded and mobile systems to the most powerful servers. Parallelism
has been used to simplify the programming of certain applications which react to or simulate the parallelism of the
natural world. At the same time, parallelism complicates programming when the objective is to take advantage of
the existence of multiple hardware components to improve performance. Formal methods are necessary to study
the correctness of parallel algorithms and implementations and to analyze their performance on different classes of
real and theoretical systems. Finally, parallelism is crucial for many applications in the sciences, engineering, and
interactive services such as search engines.
Because of its importance and the challenging problems it has engendered, there have been numerous research
and development projects during the past half century. This Encyclopedia is our attempt to collect accurate and clear
descriptions of the most important of those projects. Although not exhaustive, with over  entries the Encyclo-
pedia covers most of the topics that we identified at the outset as important for a work of this nature. Entries include
many of the best known projects and span all the important dimensions of parallel computing including machine
design, software, programming languages, algorithms, theoretical issues, and applications.
This Encyclopedia is the result of the work of many, whose dedication made it possible. The  Editorial Board
Members created the list of entries, did most of the reviewing, and suggested authors for the entries. Colin Robert-
son, the Managing Editor, Jennifer Carlson, Springer’s Reference Development editor, and Editorial Assistants Julia
Koerting and Simone Tavenrath, worked for  long years coordinating the recruiting of authors and the review and
submission process. Melissa Fearon, Springer’s Senior Editor, helped immensely with the coordination of authors
and Editorial Board Members, especially during the difficult last few months. The nearly  authors wrote crisp,
informative entries. They include experts in all major areas, come from many different nations, and span several
generations. In many cases, the author is the lead designer or researcher responsible for the contribution reported
in the entry. It was a great pleasure for me to be part of this project. The enthusiasm of everybody involved made
this a joyful enterprise. I believe we have put together a meaningful snapshot of parallel computing in the last 
years and presented a believable glimpse of the future. I hope the reader agrees with me on this and finds the entries,
as I did, valuable contributions to the literature.

David Padua
Editor-in-Chief
University of Illinois at Urbana-Champaign
Urbana, IL
USA
Editors

David Padua
Editor-in-Chief
University of Illinois at Urbana-Champaign
Urbana, IL
USA

Colin Robertson
Managing Editor
University of Illinois at Urbana-Champaign
Urbana, IL
USA

Sarita Adve
Editorial Board Member
University of Illinois at Urbana-Champaign
Urbana, IL
USA

Gheorghe S. Almasi
Editorial Board Member
IBM T. J. Watson Research Center
Yorktown Heights, NY
USA

Srinivas Aluru
Editorial Board Member
Iowa State University
Ames, IA
USA

David Bader
Editorial Board Member
Georgia Tech
Atlanta, GA
USA

Gianfranco Bilardi
Editorial Board Member
University of Padova
Padova
Italy

Siddharta Chatterjee
Editorial Board Member
IBM Systems & Technology Group
Austin, TX
USA

Luiz DeRose
Editorial Board Member
Cray Inc.
St. Paul, MN
USA

Jack Dongarra
Editorial Board Member
University of Tennessee
Knoxville, TN
USA
Oak Ridge National Laboratory
Oak Ridge, TN
USA
University of Manchester
Manchester
UK

José Duato
Editorial Board Member
Universitat Politècnica de València
València
Spain

Paul Feautrier
Editorial Board Member
Ecole Normale Supérieure de Lyon
Lyon
France

María J. Garzarán
Editorial Board Member
University of Illinois at Urbana Champaign
Urbana, IL
USA

Michael Gerndt
Editorial Board Member
Technische Universitaet Muenchen
Garching
Germany

William Gropp
Editorial Board Member
University of Illinois Urbana-Champaign
Urbana, IL
USA

Thomas Gross
Editorial Board Member
ETH Zurich
Zurich
Switzerland

James C. Hoe
Editorial Board Member
Carnegie Mellon University
Pittsburgh, PA
USA

Laxmikant Kale
Editorial Board Member
University of Illinois at Urbana Champaign
Urbana, IL
USA

Hironori Kasahara
Editorial Board Member
Waseda University
Tokyo
Japan

Christian Lengauer
Editorial Board Member
University of Passau
Passau
Germany

José E. Moreira
Editorial Board Member
IBM Thomas J. Watson Research Center
Yorktown Heights, NY
USA

Yale N. Patt
Editorial Board Member
The University of Texas at Austin
Austin, TX
USA

Keshav Pingali
Editorial Board Member
The University of Texas at Austin
Austin, TX
USA

Markus Püschel
Editorial Board Member
ETH Zurich
Zurich
Switzerland

Ahmed H. Sameh
Editorial Board Member
Purdue University
West Lafayette, IN
USA

Vivek Sarkar
Editorial Board Member
Rice University
Houston, TX
USA

Pen-Chung Yew
Editorial Board Member
University of Minnesota at Twin Cities
Minneapolis, MN
USA
List of Contributors

DENNIS ABTS SRINIVAS ALURU


Google Inc. Iowa State University
Madison, WI Ames, IA
USA USA
and
SARITA V. ADVE Indian Institute of Technology Bombay
University of Illinois at Urbana-Champaign Mumbai
Urbana, IL India
USA
PATRICK AMESTOY
GUL AGHA Université de Toulouse ENSEEIHT-IRIT
University of Illinois at Urbana-Champaign Toulouse cedex 
Urbana, IL France
USA

BABA ARIMILLI
JASMIN AJANOVIC IBM Systems and Technology Group
Intel Corporation
Austin, TX
Portland, OR
USA
USA

ROGER S. ARMEN
SELIM G. AKL
Thomas Jefferson University
Queen’s University
Philadelphia, PA
Kingston, ON
USA
Canada

HASAN AKTULGA DOUGLAS ARMSTRONG


Purdue University Intel Corporation
West Lafayette, IN Champaign, IL
USA USA

JOSÉ I. ALIAGA DAVID I. AUGUST


TU Braunschweig Institute of Computational Mathematics Princeton University
Braunschweig Princeton, NJ
Germany USA

ERIC ALLEN CEVDET AYKANAT


Oracle Labs Bilkent University
Austin, TX Ankara
USA Turkey

GEORGE ALMASI DAVID A. BADER


IBM Georgia Institute of Technology
Yorktown Heights, NY Atlanta, GA
USA USA

MICHAEL BADER SCOTT BIERSDORFF


Universität Stuttgart University of Oregon
Stuttgart Eugene, OR
Germany USA

DAVID H. BAILEY GIANFRANCO BILARDI


Lawrence Berkeley National Laboratory University of Padova
Berkeley, CA Padova
USA Italy

RAJEEV BALASUBRAMONIAN ROBERT BJORNSON


University of Utah Yale University
Salt Lake City, UT New Haven, CT
USA USA

UTPAL BANERJEE GUY BLELLOCH


University of California at Irvine Carnegie Mellon University
Irvine, CA Pittsburgh, PA
USA USA

ALESSANDRO BARDINE ROBERT BOCCHINO


Università di Pisa Carnegie Mellon University
Pisa Pittsburgh, PA
Italy USA

MUTHU MANIKANDAN BASKARAN HANS J. BOEHM


Reservoir Labs, Inc. HP Labs
New York, NY Palo Alto, CA
USA USA

CÉDRIC BASTOUL ERIC J. BOHM


University Paris-Sud  - INRIA Saclay Île-de-France University of Illinois at Urbana-Champaign
Orsay Urbana, IL
France USA

AARON BECKER MATTHIAS BOLLHÖFER


University of Illinois at Urbana-Champaign Universitat Jaume I
Urbana, IL Castellón
USA Spain

MICHAEL W. BERRY DAN BONACHEA


The University of Tennessee Lawrence Berkeley National Laboratory
Knoxville, TN Berkeley, CA
USA USA

ABHINAV BHATELE PRADIP BOSE


University of Illinois at Urbana-Champaign IBM Corp. T.J. Watson Research Center
Urbana, IL Yorktown Heights, NY
USA USA

MARIAN BREZINA ÜMIT V. ÇATALYÜREK


University of Colorado at Boulder The Ohio State University
Boulder, CO Columbus, OH
USA USA

JEFF BROOKS LUIS H. CEZE


Cray Inc. University of Washington
St. Paul, MN Seattle, WA
USA USA

HOLGER BRUNST BRADFORD L. CHAMBERLAIN


Technische Universität Dresden Cray Inc.
Dresden Seattle, WA
Germany USA

HANS-JOACHIM BUNGARTZ ERNIE CHAN


Technische Universität München NVIDIA Corporation
Garching Santa Clara, CA
Germany USA

MICHAEL G. BURKE RONG-GUEY CHANG


Rice University National Chung Cheng University
Houston, TX Chia-Yi
USA Taiwan

ALFREDO BUTTARI BARBARA CHAPMAN


Université de Toulouse ENSEEIHT-IRIT University of Houston
Toulouse cedex  Houston, TX
France USA

ERIC J. BYLASKA DAVID CHASE


Pacific Northwest National Laboratory Oracle Labs
Richland, WA Burlington, MA
USA USA

ROY H. CAMPBELL DANIEL CHAVARRÍA-MIRANDA


University of Illinois at Urbana-Champaign Pacific Northwest National Laboratory
Urbana, IL Richland, WA
USA USA

WILLIAM CARLSON NORMAN H. CHRIST


Institute for Defense Analyses Columbia University
Bowie, MD New York, NY
USA USA

MANUEL CARRO MURRAY COLE


Universidad Politécnica de Madrid University of Edinburgh
Madrid Edinburgh
Spain UK

PHILLIP COLELLA KAUSHIK DATTA


University of California University of California
Berkeley, CA Berkeley, CA
USA USA

SALVADOR COLL JIM DAVIES


Universidad Politécnica de Valencia Oxford University
Valencia UK
Spain
JAMES DEMMEL
GUOJING CONG University of California at Berkeley
IBM Berkeley, CA
Yorktown Heights, NY USA
USA

MONTY DENNEAU
JAMES H. COWNIE IBM Corp., T.J. Watson Research Center
Intel Corporation (UK) Ltd.
Yorktown Heights, NY
Swindon
USA
UK

JACK B. DENNIS
ANTHONY P. CRAIG
Massachusetts Institute of Technology
National Center for Atmospheric Research
Cambridge, MA
Boulder, CO
USA
USA

MARK DEWING
ANTHONY CURTIS
Intel Corporation
University of Houston
Champaign, IL
Houston
USA
TX

H. J. J. VAN DAM VOLKER DIEKERT


Pacific Northwest National Laboratory Universität Stuttgart FMI
Richland, WA Stuttgart
USA Germany

FREDERICA DAREMA JACK DONGARRA


National Science Foundation University of Tennessee
Arlington, VA Knoxville, TN
USA USA

ALAIN DARTE DAVID DONOFRIO


École Normale Supérieure de Lyon Lawrence Berkeley National Laboratory
Lyon Berkeley, CA
France USA

RAJA DAS RON O. DROR


IBM Corporation D. E. Shaw Research
Armonk, NY New York, NY
USA USA

IAIN DUFF WU-CHUN FENG


Science & Technology Facilities Council Virginia Tech
Didcot, Oxfordshire Blacksburg, VA
UK USA
and
MICHAEL DUNGWORTH Wake Forest University
Winston-Salem, NC
USA

SANDHYA DWARKADAS
University of Rochester JOHN FEO
Pacific Northwest National Laboratory
Rochester, NY
Richland, WA
USA
USA

RUDOLF EIGENMANN
Purdue University JEREMY T. FINEMAN
West Lafayette, IN Carnegie Mellon University
USA Pittsburgh, PA
USA

E. N. (MOOTAZ) ELNOZAHY
IBM Research JOSEPH A. FISHER
Miami Beach, FL
Austin, TX
USA
USA

JOEL EMER CORMAC FLANAGAN


Intel Corporation University of California at Santa Cruz
Hudson, MA Santa Cruz, CA
USA USA

BABAK FALSAFI JOSÉ FLICH


Ecole Polytechnique Fédérale de Lausanne Technical University of Valencia
Lausanne Valencia
Switzerland Spain

PAOLO FARABOSCHI CHRISTINE FLOOD


Hewlett Packard Oracle Labs
Sant Cugat del Valles Burlington, MA
Spain USA

PAUL FEAUTRIER MICHAEL FLYNN


Ecole Normale Supérieure de Lyon Stanford University
Lyon Stanford, CA
France USA

KARL FEIND JOSEPH FOGARTY


SGI University of South Florida
Eagan, MN Tampa, FL
USA USA

PIERFRANCESCO FOGLIA PEDRO J. GARCIA


Università di Pisa Universidad de Castilla-La Mancha
Pisa Albacete
Italy Spain

TRYGGVE FOSSUM MICHAEL GARLAND


Intel Corporation NVIDIA Corporation
Hudson, MA Santa Clara, CA
USA USA

GEOFFREY FOX KLAUS GÄRTNER


Indiana University Weierstrass Institute for Applied Analysis and Stochastics
Bloomington, IN Berlin
USA Germany

MARTIN FRÄNZLE ED GEHRINGER


Carl von Ossietzky Universität North Carolina State University
Oldenburg Raleigh, NC
Germany USA

FRANZ FRANCHETTI ROBERT A. VAN DE GEIJN


Carnegie Mellon University The University of Texas at Austin
Pittsburgh, PA Austin, TX
USA USA

STEFAN M. FREUDENBERGER AL GEIST


Freudenberger Consulting Oak Ridge National Laboratory
Zürich Oak Ridge, TN
Switzerland USA

HOLGER FRÖNING THOMAS GEORGE


University of Heidelberg IBM Research
Heidelberg Delhi
Germany India

KARL FÜRLINGER MICHAEL GERNDT


Ludwig-Maximilians-Universität München Technische Universität München
Munich München
Germany Germany

EFSTRATIOS GALLOPOULOS AMOL GHOTING


University of Patras IBM Thomas. J. Watson Research Center
Patras Yorktown Heights, NY
Greece USA

ALAN GARA JOHN GILBERT


IBM T.J. Watson Research Center University of California
Yorktown Heights, NY Santa Barbara, CA
USA USA

ROBERT J. VAN GLABBEEK DON GRICE


NICTA IBM Corporation
Sydney Poughkeepsie, NY
Australia USA
and
The University of New South Wales LAURA GRIGORI
Sydney Laboratoire de Recherche en Informatique Universite
Australia Paris-Sud 
and Paris
Stanford University France
Stanford, CA
USA WILLIAM GROPP
University of Illinois at Urbana-Champaign
Urbana, IL
SERGEI GORLATCH USA
Westfälische Wilhelms-Universität Münster
Münster
ABDOU GUERMOUCHE
Germany Université de Bordeaux
Talence
KAZUSHIGE GOTO France
The University of Texas at Austin
Austin, TX JOHN A. GUNNELS
USA IBM Corp
Yorktown Heights, NY
ALLAN GOTTLIEB USA
New York University
New York, NY ANSHUL GUPTA
USA IBM T.J. Watson Research Center
Yorktown Heights, NY
USA
STEVEN GOTTLIEB
Indiana University
JOHN L. GUSTAFSON
Bloomington, IN
Intel Corporation
USA
Santa Clara, CA
USA
N. GOVIND
Pacific Northwest National Laboratory
ROBERT H. HALSTEAD
Richland, WA Curl Inc.
USA Cambridge, MA
USA
SUSAN L. GRAHAM
University of California KEVIN HAMMOND
Berkeley, CA University of St. Andrews
USA St. Andrews
UK
ANANTH Y. GRAMA
Purdue University JAMES HARRELL
West Lafayette, IN
USA

ROBERT HARRISON MANUEL HERMENEGILDO


Oak Ridge National Laboratory Universidad Politécnica de Madrid
Oak Ridge, TN Madrid
USA Spain
IMDEA Software Institute
Madrid
JOHN C. HART
Spain
University of Illinois at Urbana-Champaign
Urbana, IL
USA OSCAR HERNANDEZ
Oak Ridge National Laboratory
Oak Ridge, TN
MICHAEL HEATH USA
University of Illinois at Urbana-Champaign
Urbana, IL
PAUL HILFINGER
USA University of California
Berkeley, CA
USA
HERMANN HELLWAGNER
Klagenfurt University
Klagenfurt KEI HIRAKI
Austria The University of Tokyo
Tokyo
Japan
DANNY HENDLER
Ben-Gurion University of the Negev
H. PETER HOFSTEE
Beer-Sheva
IBM Austin Research Laboratory
Israel
Austin, TX
USA
BRUCE HENDRICKSON
Sandia National Laboratories CHRIS HSIUNG
Albuquerque, NM Hewlett Packard
USA Palo Alto, CA
USA

ROBERT HENSCHEL
Indiana University JONATHAN HU
Bloomington, IN Sandia National Laboratories
USA Livermore, CA
USA

KIERAN T. HERLEY THOMAS HUCKLE


University College Cork
Technische Universität München
Cork
Garching
Ireland Germany

MAURICE HERLIHY WEN-MEI HWU


Brown University University of Illinois at Urbana-Champaign
Providence, RI Urbana, IL
USA USA

FRANÇOIS IRIGOIN KRISHNA KANDALLA


MINES ParisTech/CRI The Ohio State University
Fontainebleau Columbus, OH
France USA

KEN’ICHI ITAKURA LARRY KAPLAN


Japan Agency for Marine-Earth Science and Technology Cray Inc.
(JAMSTEC) Seattle, WA
Yokohama USA
Japan
TEJAS S. KARKHANIS
JOSEPH F. JAJA IBM T.J. Watson Research Center
University of Maryland Yorktown Heights, NY
College Park, MD USA
USA
RAJESH K. KARMANI
JOEFON JANN University of Illinois at Urbana-Champaign
T. J. Watson Research Center, IBM Corp. Urbana, IL
Yorktown Heights, NY USA
USA
GEORGE KARYPIS
KARL JANSEN University of Minnesota
NIC, DESY Zeuthen Minneapolis, MN
Zeuthen USA
Germany
ARUN KEJARIWAL
PRITISH JETLEY Yahoo! Inc.
University of Illinois at Urbana-Champaign
Sunnyvale, CA
Urbana, IL
USA
USA

MALEQ KHAN
WIBE A. DE JONG
Virginia Tech
Pacific Northwest National Laboratory
Blacksburg, VA
Richland, WA
USA
USA

LAXMIKANT V. KALÉ THILO KIELMANN


Vrije Universiteit
University of Illinois at Urbana-Champaign
Amsterdam
Urbana, IL
The Netherlands
USA

ANANTH KALYANARAMAN GERRY KIRSCHNER


Washington State University Cray Incorporated
Pullman, WA St. Paul, MN
USA USA

AMIR KAMIL CHRISTOF KLAUSECKER


University of California Ludwig-Maximilians-Universität München
Berkeley, CA Munich
USA Germany

KATHLEEN KNOBE V. S. ANIL KUMAR


Intel Corporation Virginia Tech
Cambridge, MA Blacksburg, VA
USA USA

ANDREAS KNÜPFER KALYAN KUMARAN


Technische Universität Dresden Argonne National Laboratory
Dresden Argonne, IL
Germany USA

GIORGOS KOLLIAS JAMES LA GRONE


Purdue University University of Houston
West Lafayette, IN Houston, TX
USA USA

K. KOWALSKI
ROBERT LATHAM
Pacific Northwest National Laboratory Argonne National Laboratory
Richland, WA Argonne, IL
USA USA

QUINCEY KOZIOL
BRUCE LEASURE
The HDF Group
Saint Paul, MN
Champaign, IL
USA
USA

JENQ-KUEN LEE
DIETER KRANZLMÜLLER National Tsing-Hua University
Ludwig-Maximilians-Universität München
Hsin-Chu
Munich
Taiwan
Germany

CHARLES E. LEISERSON
MANOJKUMAR KRISHNAN
Massachusetts Institute of Technology
Pacific Northwest National Laboratory
Cambridge, MA
Richland, WA
USA
USA

CHI-BANG KUAN CHRISTIAN LENGAUER


National Tsing-Hua University University of Passau
Hsin-Chu Passau
Taiwan Germany

DAVID J. KUCK RICHARD LETHIN


Intel Corporation Reservoir Labs, Inc.
Champaign, IL New York, NY
USA USA

JEFFERY A. KUEHN ALLEN LEUNG


Oak Ridge National Laboratory Reservoir Labs, Inc.
Oak Ridge, TN New York, NY
USA USA

JOHN M. LEVESQUE PEDRO LÓPEZ


Cray Inc. Universidad Politécnica de Valencia
Knoxville, TN Valencia
USA Spain

MICHAEL LEVINE GEOFF LOWNEY


Intel Corporation
Husdon, MA
USA
JEAN-YVES L’EXCELLENT
ENS Lyon
VICTOR LUCHANGCO
Lyon
Oracle Labs
France
Burlington, MA
USA
JIAN LI
IBM Research
PIOTR LUSZCZEK
Austin, TX
University of Tennessee
USA
Knoxville, TN
USA
XIAOYE SHERRY LI
Lawarence Berkeley National Laboratory
OLAV LYSNE
Berkeley, CA The University of Oslo
USA Oslo
Norway
ZHIYUAN LI
Purdue University
XIAOSONG MA
West Lafayette, IN North Carolina State University
USA Raleigh, NC
USA
CALVIN LIN and
University of Texas at Austin Oak Ridge National Laboratory
Austin, TX Raleigh, NC
USA USA

HESHAN LIN ARTHUR B. MACCABE


Virginia Tech Oak Ridge National Laboratory
Blacksburg, VA Oak Ridge, TN
USA USA

HANS-WOLFGANG LOIDL KAMESH MADDURI


Heriot-Watt University Lawrence Berkeley National Laboratory
Edinburgh Berkeley, CA
UK USA

RITA LOOGEN JAN-WILLEM MAESSEN


Philipps-Universität Marburg Google
Marburg Cambridge, MA
Germany USA

KONSTANTIN MAKARYCHEV PHILLIP MERKEY


IBM T.J. Watson Research Center Michigan Technological University
Yorktown Heights, NY Houghton, MI
USA USA

JUNICHIRO MAKINO JOSÉ MESEGUER


National Astronomical Observatory of Japan University of Illinois at Urbana-Champaign
Tokyo Urbana, IL
Japan USA

ALLEN D. MALONY MICHAEL METCALF


University of Oregon Berlin
Eugene, OR Germany
USA
SAMUEL MIDKIFF
MADHA V. MARATHE Purdue University
Virginia Tech West Lafayette, IN
Blacksburg, VA USA
USA
KENICHI MIURA
ALBERTO F. MARTÍN National Institute of Informatics
Universitat Jaume I
Tokyo
Castellón
Japan
Spain

BERND MOHR
GLENN MARTYNA Forschungszentrum Jülich GmbH
IBM Thomas J. Watson Research Center
Jülich
Yorktown Heights, NY
Germany
USA

JOSÉ E. MOREIRA
ERIC R. MAY
IBM T.J. Watson Research Center
University of Michigan
Yorktown Heights, NY
Ann Arbor, MI
USA
USA

ALAN MORRIS
SALLY A. MCKEE
University of Oregon
Chalmers University of Technology
Eugene, OR
Goteborg
USA
Sweden

MIRIAM MEHL J. ELIOT B. MOSS


Technische Universität München University of Massachusetts
Garching Amherst, MA
Germany USA

BENOIT MEISTER MATTHIAS MÜLLER


Reservoir Labs, Inc. Technische Universität Dresden
New York, NY Dresden
USA Germany

PETER MÜLLER ALLEN NIKORA


ETH Zurich California Institute of Technology
Zurich Pasadena, CA
Switzerland USA

YOICHI MURAOKA ROBERT W. NUMRICH


Waseda University City University of New York
Tokyo New York, NY
Japan USA

ANCA MUSCHOLL
STEVEN OBERLIN
Université Bordeaux 
Talence
France
LEONID OLIKER
RAVI NAIR Lawrence Berkeley National Laboratory
IBM Thomas J. Watson Research Center Berkeley, CA
Yorktown Heights, NY USA
USA
DAVID PADUA
STEPHEN NELSON University of Illinois at Urbana-Champaign
Urbana, IL
USA
MARIO NEMIROVSKY
Barcelona Supercomputer Center SCOTT PAKIN
Barcelona Los Alamos National Laboratory
Spain Los Alamos, NM
USA
RYAN NEWTON
Intel Corporation
BRUCE PALMER
Hudson, MA Pacific Northwest National Laboratory
USA Richland, WA
USA
ROCCO DE NICOLA
Universita’ di Firenze
Firenze DHABALESWAR K. PANDA
The Ohio State University
Italy
Columbus, OH
USA
ALEXANDRU NICOLAU
University of California Irvine
Irvine, CA SAGAR PANDIT
USA University of South Florida
Tampa, FL
JAREK NIEPLOCHA† USA
Pacific Northwest National Laboratory
Richland, WA YALE N. PATT
USA The University of Texas at Austin
Austin, TX

deceased USA

OLIVIER PÈNE WILFRED POST


University de Paris-Sud-XI Oak Ridge National Laboratory
Orsay Cedex Oak Ridge, TN
France USA

PAUL PETERSEN CHRISTOPH VON PRAUN


Intel Corporation Georg-Simon-Ohm University of Applied Sciences
Champaign, IL Nuremberg
USA Germany

BERNARD PHILIPPE FRANCO P. PREPARATA


Campus de Beaulieu Brown University
Rennes Providence, RI
France USA

MICHAEL PHILIPPSEN
COSIMO ANTONIO PRETE
University of Erlangen-Nuremberg
Università di Pisa
Erlangen Pisa
Germany Italy

JAMES C. PHILLIPS
TIMOTHY PRINCE
University of Illinois at Urbana-Champaign
Intel Corporation
Urbana, IL
Santa Clara, CA
USA
USA

ANDREA PIETRACAPRINA
Università di Padova JEAN-PIERRE PROST
Morteau
Padova
France
Italy

GEPPINO PUCCI
KESHAV PINGALI
Università di Padova
The University of Texas at Austin
Padova
Austin, TX
Italy
USA

TIMOTHY M. PINKSTON MARKUS PÜSCHEL


University of Southern California ETH Zurich
Los Angeles, CA Zurich
USA Switzerland

ERIC POLIZZI ENRIQUE S. QUINTANA-ORTÍ


University of Massachusetts Universitat Jaume I
Amherst, MA Castellón
USA Spain

STEPHEN W. POOLE PATRICE QUINTON


Oak Ridge National Laboratory ENS Cachan Bretagne
Oak Ridge, TN Bruz
USA France

RAM RAJAMONY YVES ROBERT


IBM Research Ecole Normale Supérieure de Lyon
Austin, TX France
USA
ARCH D. ROBISON
Intel Corporation
ARUN RAMAN
Champaign, IL
Princeton University
USA
Princeton, NJ
USA
A. W. ROSCOE
Oxford University
LAWRENCE RAUCHWERGER Oxford
Texas A&M University UK
College Station, TX
USA ROBERT B. ROSS
Argonne National Laboratory
JAMES R. REINDERS Argonne, IL
Intel Corporation USA
Hillsboro, OR
USA CHRIS ROWEN
CEO, Tensilica
Santa Clara, CA, USA
STEVEN P. REINHARDT
DUNCAN ROWETH
Cray (UK) Ltd.
UK
JOHN REPPY
University of Chicago
SUKYOUNG RYU
Chicago, IL
Korea Advanced Institute of Science and Technology
USA
Daejeon
Korea
MARÍA ENGRACIA GÓMEZ REQUENA
Universidad Politécnica de Valencia VALENTINA SALAPURA
Valencia IBM Research
Spain Yorktown Heights, NY
USA
DANIEL RICCIUTO
Oak Ridge National Laboratory JOEL H. SALTZ
Emory University
Oak Ridge, TN
Atlanta, GA
USA
USA

ROLF RIESEN AHMED SAMEH


IBM Research Purdue University
Dublin West Lafayette, IN
Ireland USA

TANGUY RISSET MIGUEL SANCHEZ


INSA Lyon Universidad Politécnica de Valencia
Villeurbanne Valencia
France Spain

BENJAMIN SANDER MICHAEL L. SCOTT


Advanced Micro Device Inc. University of Rochester
Austin, TX Rochester, NY
USA USA

PETER SANDERS MATOUS SEDLACEK


Universitaet Karlsruhe Technische Universität München
Karlsruhe Garching
Germany Germany

DAVIDE SANGIORGI JOEL SEIFERAS


Universita’ di Bologna University of Rochester
Bologna Rochester, NY
Italy USA

VIVEK SARIN FRANK OLAF SEM-JACOBSEN


Texas A&M University The University of Oslo
College Station, TX Oslo
USA Norway

VIVEK SARKAR
ANDRÉ SEZNEC
Rice University
IRISA/INRIA, Rennes
Houston, TX
Rennes
USA
France

OLAF SCHENK
University of Basel JOHN SHALF
Lawrence Berkeley National Laboratory
Basel
Berkeley, CA
Switzerland
USA

MICHAEL SCHLANSKER
Hewlett-Packard Inc. MEIYUE SHAO
Umeå University
Palo-Alto, CA
Umeå
USA
Sweden
STEFAN SCHMID
Telekom Laboratories/TU Berlin DAVID E. SHAW
Berlin D. E. Shaw Research
Germany New York, NY
USA
MARTIN SCHULZ and
Lawrence Livermore National Laboratory Columbia University
Livermore, CA New York, NY
USA USA

JAMES L. SCHWARZMEIER XIAOWEI SHEN


Cray Inc. IBM Research
Chippewa Falls, WI Armonk, NY
USA USA

SAMEER SHENDE EDGAR SOLOMONIK


University of Oregon University of California at Berkeley
Eugene, OR Berkeley, CA
USA USA

GALEN M. SHIPMAN MATTHEW SOTTILE


Oak Ridge National Laboratory Galois, Inc.
Oak Ridge, TN Portland, OR
USA USA

HOWARD JAY SIEGEL M’HAMED SOULI


Colorado State University Université des Sciences et Technologies de Lille
Fort Collins, CO Villeneuve d’Ascq cédex
USA France

DANIEL P. SIEWIOREK WYATT SPEAR


Carnegie Mellon University University of Oregon
Pittsburgh, PA Eugene, OR
USA USA

FEDERICO SILLA EVAN W. SPEIGHT


Universidad Politécnica de Valencia IBM Research
Valencia Austin, TX
Spain USA

BARRY SMITH MARK S. SQUILLANTE


Argonne National Laboratory IBM
Argonne, IL Yorktown Heights, NY
USA USA

BURTON SMITH ALEXANDROS STAMATAKIS


Microsoft Corporation Heidelberg Institute for Theoretical Studies
Redmond, WA Heidelberg
USA Germany

MARC SNIR GUY L. STEELE, JR.


University of Illinois at Urbana-Champaign Oracle Labs
Urbana, IL Burlington, MA
USA USA

LAWRENCE SNYDER THOMAS L. STERLING


University of Washington Louisiana State University
Seattle, WA Baton Rouge, LA
USA USA

MARCO SOLINAS TJERK P. STRAATSMA


Università di Pisa Pacific Northwest National Laboratory
Pisa Richland, WA
Italy USA

PAULA E. STRETZ JOSEP TORRELLAS


Virginia Tech University of Illinois at Urbana-Champaign
Blacksburg, VA Urbana, IL
USA USA

THOMAS M. STRICKER JESPER LARSSON TRÄFF


Zürich, CH University of Vienna
Switzerland Vienna
Austria
JIMMY SU
University of California PHILIP TRINDER
Berkeley, CA Heriot-Watt University
USA Edinburgh
UK
HARI SUBRAMONI
The Ohio State University RAFFAELE TRIPICCIONE
Università di Ferrara and INFN Sezione di Ferrara
Columbus, OH
Ferrara
USA
Italy

SAYANTAN SUR
The Ohio State University MARK TUCKERMAN
New York University
Columbus, OH
New York, NY
USA
USA

JOHN SWENSEN
RAY TUMINARO
CPU Technology
Sandia National Laboratories
Pleasanton, CA
Livermore, CA
USA
USA

HIROSHI TAKAHARA BORA UÇAR


NEC Corporation
ENS Lyon
Tokyo
Lyon
Japan
France

MICHELA TAUFER MARAT VALIEV


University of Delaware Pacific Northwest National Laboratory
Newark, DE Richland, WA
USA USA

VINOD TIPPARAJU NICOLAS VASILACHE


Oak Ridge National Laboratory Reservoir Labs, Inc.
Oak Ridge, TN New York, NY
USA USA

ALEXANDER TISKIN MARIANA VERTENSTEIN


University of Warwick National Center for Atmospheric Research
Coventry Boulder, CO
UK USA

JENS VOLKERT TONG WEN


Johannes Kepler University Linz University of California
Linz Berkeley, CA
Austria USA

YEVGEN VORONENKO R. CLINT WHALEY


Carnegie Mellon University University of Texas at San Antonio
Pittsburgh, PA San Antonio, TX
USA USA

RICHARD W. VUDUC ANDREW B. WHITE


Georgia Institute of Technology Los Alamos National Laboratory
Atlanta, GA Los Alamos, NM
USA USA

GENE WAGENBRETH BRIAN WHITNEY


University of Southern Califorina Oracle Corporation
Topanga, CA Hillsboro, OR
USA USA

DALI WANG ROLAND WISMÜLLER


Oak Ridge National Laboratory University of Siegen
Oak Ridge, TN Siegen
USA Germany

JASON WANG ROBERT W. WISNIEWSKI


LSTC IBM
Livermore, CA Yorktown Heights, NY
USA USA

GREGORY R. WATSON DAVID WOHLFORD


IBM Reservoir Labs, Inc.
Yorktown Heights, NY New York, NY
USA USA

ROGER WATTENHOFER FELIX WOLF


ETH Zürich Aachen University
Zurich Aachen
Switzerland Germany

MICHAEL WEHNER DAVID WONNACOTT


Lawrence Berkeley National Laboratory Haverford College
Berkeley, CA Haverford, PA
USA USA

JOSEF WEIDENDORFER PATRICK H. WORLEY


Technische Universität München Oak Ridge National Laboratory
München Oak Ridge, TN
Germany USA

SUDHAKAR YALAMANCHILI FIELD G. VAN ZEE


Georgia Institute of Technology The University of Texas at Austin
Atlanta, GA Austin, TX
USA USA

KATHERINE YELICK LIXIN ZHANG


University of California at Berkeley and Lawrence Berkeley IBM Research
National Laboratory Austin, TX
Berkeley, CA USA
USA
GENGBIN ZHENG
PEN-CHUNG YEW University of Illinois at Urbana-Champaign
University of Minnesota at Twin-Cities Urbana, IL
Minneapolis, MN USA
USA
HANS P. ZIMA
BOBBY DALTON YOUNG California Institute of Technology
Colorado State University Pasadena, CA
Fort Collins, CO USA
USA
JAROSLAW ZOLA
CLIFF YOUNG Iowa State University
D. E. Shaw Research Ames, IA
New York, NY USA
USA

GABRIEL ZACHMANN
Clausthal University
Clausthal-Zellerfeld
Germany
A
Ab Initio Molecular Dynamics

▸Car-Parrinello Method


Access Anomaly

▸Race Conditions


Actors

Rajesh K. Karmani, Gul Agha
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Definition

Actors is a model of concurrent computation for developing parallel, distributed, and mobile systems. Each actor is an autonomous object that operates concurrently and asynchronously, receiving and sending messages to other actors, creating new actors, and updating its own local state. An actor system consists of a collection of actors, some of whom may send messages to, or receive messages from, actors outside the system.

Preliminaries

An actor has a name that is globally unique and a behavior which determines its actions. In order to send an actor a message, the actor’s name must be used; a name cannot be guessed but it may be communicated in a message. When an actor is idle, and it has a pending message, the actor accepts the message and does the computation defined by its behavior. As a result, the actor may take three types of actions: send messages, create new actors, and update its local state. An actor’s behavior may change as it modifies its local state.

Actors do not share state: an actor must explicitly send a message to another actor in order to affect the latter’s behavior. Each actor carries out its actions concurrently (and asynchronously) with other actors. Moreover, the path a message takes as well as network delays it may encounter are not specified. Thus, the arrival order of messages is indeterminate. The key semantic properties of the standard Actor model are encapsulation of state and atomic execution of a method in response to a message, fairness in scheduling actors and in the delivery of messages, and location transparency enabling distributed execution and mobility.

Actors. Fig.  Actors are concurrent objects that communicate through messages and may create new actors. An actor may be viewed as an object augmented with its own control, a mailbox, and a globally unique, immutable name

Advantages of the Actor Model

In the object-oriented programming paradigm, an object encapsulates data and behavior. This separates the interface of an object (what an object does) from its representation (how it does it). Such separation enables modular reasoning about object-based programs and facilitates their evolution. Actors extend the advantages of objects to concurrent computations by separating control (where and when) from the logic of a computation.

The Actor model of programming [] allows programs to be decomposed into self-contained, autonomous, interactive, asynchronously operating components. Due to their asynchronous operation, actors provide a model for the nondeterminism inherent in distributed systems, reactive systems, mobile systems, and any form of interactive computing.
History

The concept of actors has developed over  decades. The earliest use of the term “actors” was in Carl Hewitt’s Planner [] where the term referred to rule-based active entities which search a knowledge base for patterns to match, and in response, trigger actions. For the next  decades, Hewitt’s group worked on actors as agents of computation, and it evolved as a model of concurrent computing. A brief history of actor research can be found in []. The commonly used definition of actors today follows the work of Agha () which defines actors using a simple operational semantics [].

In recent years, the Actor model has gained in popularity with the growth of parallel and distributed computing platforms such as multicore architectures, cloud computers, and sensor networks. A number of actor languages and frameworks have been developed. Some early actor languages include ABCL, POOL, ConcurrentSmalltalk, ACT++, CEiffel (see [] for a review of these), and HAL []. Actor languages and frameworks in current use include Erlang (from Ericsson) [], E (Erights.org), Scala Actors library (EPFL) [], Ptolemy (UC Berkeley) [], ASP (INRIA) [], JCoBox (University of Kaiserslautern) [], SALSA (UIUC and RPI) [], Charm++ [] and ActorFoundry [] (both from UIUC), the Asynchronous Agents Library [] and Axum [] (both from Microsoft), and Orleans framework for cloud computing from Microsoft Research []. Some well-known open source applications built using actors include Twitter’s message queuing system and Lift Web Framework, and among commercial applications are Facebook Chat system and Vendetta’s game engine.

Illustrative Actor Language

In order to show how actor programs work, consider a simple imperative actor language ActorFoundry that extends Java. A class defining an actor behavior extends osl.manager.Actor. Messages are handled by methods; such methods are annotated with @message. The create(class, args) method creates an actor instance of the specified actor class, where args correspond to the arguments of a constructor in the class. Each newly created actor has a unique name that is initially known only to the creator at the point where the creation occurs.

Execution Semantics

The semantics of ActorFoundry can be informally described as follows. Consider an ActorFoundry program P that consists of a set of actor definitions. An actor communicates with another actor in P by sending asynchronous (non-blocking) messages using the send statement: send(a,msg) has the effect of eventually appending the contents of msg to the mailbox of the actor a. However, the call to send returns immediately, that is, the sending actor does not wait for the message to arrive at its destination. Because actors operate asynchronously, and the network has indeterminate delays, the arrival order of messages is nondeterministic. However, we assume that messages are eventually delivered (a form of fairness).
At the beginning of execution of P, the mailbox of each actor is empty and some actor in the program must receive a message from the environment. The ActorFoundry runtime first creates an instance of a specified actor and then sends a specified message to it, which serves as P’s entry point.

Each actor can be viewed as executing a loop with the following steps: remove a message from its mailbox (often implemented as a queue), decode the message, and execute the corresponding method. If an actor’s mailbox is empty, the actor blocks – waiting for the next message to arrive in the mailbox (such blocked actors are referred to as idle actors). The processing of a message may cause the actor’s local state to be updated, new actors to be created, and messages to be sent. Because of the encapsulation property of actors, there is no interference between messages that are concurrently processed by different actors.
processed by different actors. Synchronization
An actor program “terminates” when every actor Synchronization in actors is achieved through commu-
created by the program is idle and the actors are nication. Two types of commonly used communication
not open to the environment (otherwise the envi- patterns are Remote Procedure Call (RPC)-like mes-
ronment could send new messages to their mail- saging and local synchronization constraints. Language
boxes in the future). Note that an actor program need constructs can enable actor programmers to specify
not terminate – in particular, certain interactive pro- such patterns. Such language constructs are definable
grams and operating systems may continue to execute in terms of primitive actor constructs, but providing
indefinitely. them as first-class linguistic objects simplifies the task
Listing  shows the HelloWorld program in Actor- of writing parallel code.
Foundry. The program comprises of two actor def-
initions: the HelloActor and the WorldActor. An
instance of the HelloActor can receive one type of
RPC-Like Messaging
message, the greet message, which triggers the exe-
RPC-like communication is a common pattern of
cution of greet method. The greet method serves
message-passing in actor programs. In RPC-like com-
as P’s entry point, in lieu of the traditional main
munication, the sender of a message waits for the
method.
reply to arrive before the sender proceeds with process-
On receiving a greet message, the HelloActor
ing other messages. For example, consider the pattern
sends a print message to the stdout actor (a built-
shown in Fig.  for a client actor which requests a quote
in actor representing the standard output stream)
from a travel service. The client wishes to wait for the
along with the string “Hello.” As a result, “Hello”
quote to arrive before it decides whether to buy the trip,
will eventually be printed on the standard output
or to request a quote from another service.
stream. Next, it creates an instance of the WorldActor.
Without a high-level language abstraction to express
The HelloActor sends an audience message to the
RPC-like message pattern, a programmer has to explic-
WorldActor, thus delegating the printing of “World”
itly implement the following steps in their program:
to it. Note that due to asynchrony in communica-
tion, it is possible for “World” to be printed before . The client actor sends a request.
“Hello.” . The client then checks incoming messages.
 A Actors

dependencies in the code. This may not only make the


Client Re
que program’s execution more inefficient than it needs to be,
st
it can lead to deadlocks and livelocks (where an actor
Service #1
ignores or postpones processing messages, waiting for
ly an acknowledgment that never arrives).
Rep

Client Re Local Synchronization Constraints


que
st Asynchrony is inherent in distributed systems and
Service #2 mobile systems. One implication of asynchrony is that
ly the number of possible orderings in which messages
Rep may arrive is exponential in the number of messages
that are “pending” at any time (i.e., messages that have
Client
been sent but have not been received). Because a sender
may be unaware of what the state of the actor it is send-
Actors. Fig.  A client actor requesting quotes from ing a message to will be when the latter receives the
multiple competing services using RPC-like message, it is possible that the recipient may not be in
communication. The dashed slanted arrows denote a state where it can process the message it is receiv-
messages and the dashed vertical arrows denote that the ing. For example, a spooler may not have a job when
actor is waiting or is blocked during that period in its life some printer requests one. As another example, mes-
sages to actors representing individual matrix elements
(or groups of elements) asking them to process dif-
ferent iterations in a parallel Cholesky decomposition
. If the incoming message corresponds to the reply
algorithm need to be monotonically ordered. The need
to its request, the client takes the appropriate action
for such orderings leads to considerable complexity in
(accept the offer or keep searching).
concurrent programs, often introducing bugs or ineffi-
. If an incoming message does not correspond to the
ciencies due to suboptimal implementation strategies.
reply to its request, the message must be handled
For example, in the case of Cholesky decomposition,
(e.g., by being buffered for later processing), and the
imposing a global ordering on the iterations leads to
client continues to check messages for the reply.
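The following minimal Java sketch (illustrative only; the Msg type, the message names "requestQuote" and "quote", and the service mailbox are hypothetical, and it is not ActorFoundry code) spells out steps 1-4: the client sends a request and then keeps taking messages, acting on the matching reply and buffering everything else for later processing.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hand-written RPC-like messaging (steps 1-4 above), sketched outside any actor framework.
class QuoteClient {
    static final class Msg {
        final String kind;       // "quote" for the reply; other kinds are unrelated messages
        final int requestId;
        final double amount;
        Msg(String kind, int requestId, double amount) {
            this.kind = kind; this.requestId = requestId; this.amount = amount;
        }
    }

    private final BlockingQueue<Msg> mailbox = new LinkedBlockingQueue<>();
    private final Deque<Msg> deferred = new ArrayDeque<>();   // unrelated messages, buffered

    void deliver(Msg m) { mailbox.add(m); }                   // how other actors reach this client

    double requestQuote(BlockingQueue<Msg> service, int requestId) throws InterruptedException {
        service.add(new Msg("requestQuote", requestId, 0.0)); // step 1: send the request
        while (true) {
            Msg m = mailbox.take();                           // step 2: check incoming messages
            if (m.kind.equals("quote") && m.requestId == requestId) {
                return m.amount;                              // step 3: the matching reply arrived
            }
            deferred.addLast(m);                              // step 4: buffer and keep waiting
        }
    }
}

Language-level RPC-like constructs generate essentially this boilerplate (including re-processing the deferred messages afterward), which is why most actor languages and libraries provide them directly.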
RPC-like messaging is almost universally supported in actor languages and libraries. RPC-like messages are particularly useful in two kinds of common scenarios. One scenario occurs when an actor wants to send an ordered sequence of messages to a particular recipient – in this case, it wants to ensure that a message has been received before it sends another. A variant of this scenario is where the sender wants to ensure that the target actor has received a message before it communicates this information to another actor. A second scenario is when the state of the requesting actor is dependent on the reply it receives. In this case, the requesting actor cannot meaningfully process unrelated messages until it receives a response.

Because RPC-like messages are similar to procedure calls in sequential languages, programmers have a tendency to overuse them. Unfortunately, inappropriate usage of RPC-like messages introduces unnecessary dependencies in the code. This may not only make the program’s execution more inefficient than it needs to be, it can also lead to deadlocks and livelocks (where an actor ignores or postpones processing messages, waiting for an acknowledgment that never arrives).

Local Synchronization Constraints

Asynchrony is inherent in distributed systems and mobile systems. One implication of asynchrony is that the number of possible orderings in which messages may arrive is exponential in the number of messages that are “pending” at any time (i.e., messages that have been sent but have not been received). Because a sender may be unaware of what the state of the actor it is sending a message to will be when the latter receives the message, it is possible that the recipient may not be in a state where it can process the message it is receiving. For example, a spooler may not have a job when some printer requests one. As another example, messages to actors representing individual matrix elements (or groups of elements) asking them to process different iterations in a parallel Cholesky decomposition algorithm need to be monotonically ordered. The need for such orderings leads to considerable complexity in concurrent programs, often introducing bugs or inefficiencies due to suboptimal implementation strategies. For example, in the case of Cholesky decomposition, imposing a global ordering on the iterations leads to highly inefficient execution on multicomputers [].

Consider the example of a print spooler. Suppose a “get” message from an idle printer to its spooler may arrive when the spooler has no jobs to return to the printer. One way to address this problem is for the spooler to refuse the request. Now the printer needs to repeatedly poll the spooler until the latter has a job. This technique is called busy waiting; busy waiting can be expensive – preventing the waiting actor from possibly doing other work while it “waits,” and it results in unnecessary message traffic. An alternative is for the spooler to buffer the “get” message for deferred processing. The effect of such buffering is to change the order in which the messages are processed in a way that guarantees that the number of put messages processed by the spooler is always greater than the number of get messages processed by the spooler.

If pending messages are buffered explicitly inside the body of an actor, the code specifying the functionality (the how or representation) of the actor is mixed with the logic determining the order in which the actor processes the messages (the when). Such mixing violates the software principle of separation of concerns. Researchers have proposed various constructs to enable programmers to specify the correct orderings in a modular and abstract way, specifically, as logical formulae (predicates) over the state of an actor and the type of messages. Many actor languages and frameworks provide such constructs; examples include local synchronization constraints in ActorFoundry, and pattern matching on sets of messages in Erlang and the Scala Actors library.

Actors. Fig.  A file actor communication with a client is constrained using local synchronization constraints. The vertical arrows depict the timeline of the life of an actor and the slanted arrows denote messages. The labels inside a circle denote the messages that the file actor can accept in that particular state
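A sketch of the buffering alternative for the spooler (plain Java, not ActorFoundry’s constraint syntax; the Printer interface and the string representation of jobs are hypothetical) shows how “get” requests are deferred while no job is available and served as soon as a “put” arrives, so the printer never busy-waits.

import java.util.ArrayDeque;
import java.util.Deque;

// A spooler that defers "get" requests instead of refusing them. In an actor, put() and
// get() would be message handlers executed one at a time, so no locking is needed here.
class Spooler {
    interface Printer { void accept(String job); }           // stand-in for sending a reply message

    private final Deque<String> jobs = new ArrayDeque<>();
    private final Deque<Printer> waiting = new ArrayDeque<>(); // buffered "get" requests

    // Handler for a "put" message carrying a new job.
    void put(String job) {
        if (!waiting.isEmpty()) {
            waiting.removeFirst().accept(job);               // serve a deferred request immediately
        } else {
            jobs.addLast(job);
        }
    }

    // Handler for a "get" message from an idle printer.
    void get(Printer printer) {
        if (!jobs.isEmpty()) {
            printer.accept(jobs.removeFirst());
        } else {
            waiting.addLast(printer);                        // defer: no refusal, no busy waiting
        }
    }
}

With a local synchronization constraint, the same effect can be stated declaratively: the spooler indicates that a get message is acceptable only when a job is available, and the runtime performs the deferral, keeping the how separate from the when.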
Patterns of Actor Programming

Two common patterns of parallel programming are pipeline and divide-and-conquer []. These patterns are illustrated in Fig. a and b, respectively.

An example of the pipeline pattern is an image processing network (see Fig. a) in which a stream of images is passed through a series of filtering and transforming stages. The output of the last stage is a stream of processed images. This pattern has been demonstrated by an image processing example, written using the Asynchronous Agents Library [], which is part of the Microsoft Visual Studio .

A map-reduce graph is an example of the divide-and-conquer pattern (see Fig. b). A master actor maps the computation onto a set of workers and the output from each of these workers is reduced in the “join continuation” behavior of the master actor (possibly modeled as a separate actor) (e.g., see []). Other examples of the divide-and-conquer pattern are naïve parallel quicksort [] and parallel mergesort. The synchronization idioms discussed above may be used in succinct encoding of these patterns in actor programs since these patterns essentially require ordering the processing of some messages.

Actors. Fig.  Patterns of actor programming (from top): (a) pipeline pattern (b) divide-and-conquer pattern
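The divide-and-conquer shape can be sketched in a few lines of plain Java (illustrative only; the SumMaster class, the use of threads as workers, and the array-summing task are hypothetical): a master maps chunks of an array onto workers and reduces their partial results as the replies arrive, playing the role of the join continuation.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the divide-and-conquer (map-reduce) pattern: a master maps chunks of work onto
// workers and reduces their replies. Workers are modeled as threads that send partial sums
// back to the master's mailbox.
class SumMaster {
    static long parallelSum(long[] data, int numWorkers) throws InterruptedException {
        BlockingQueue<Long> replies = new ArrayBlockingQueue<>(numWorkers); // master's mailbox
        int chunk = (data.length + numWorkers - 1) / numWorkers;

        for (int w = 0; w < numWorkers; w++) {                 // "map" phase: create workers
            final int lo = w * chunk;
            final int hi = Math.min(data.length, lo + chunk);
            new Thread(() -> {
                long partial = 0;
                for (int i = lo; i < hi; i++) partial += data[i];
                replies.add(partial);                           // worker replies to the master
            }).start();
        }

        long total = 0;                                         // "reduce"/join continuation:
        for (int w = 0; w < numWorkers; w++) {                  // combine replies as they arrive
            total += replies.take();
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long[] data = new long[1_000_000];
        for (int i = 0; i < data.length; i++) data[i] = i;
        System.out.println(parallelSum(data, 4));
    }
}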

Semantic Properties

As mentioned earlier, some important semantic properties of the pure Actor model are encapsulation and atomic execution of methods (where a method represents computation in response to a message), fairness, and location transparency []. We discuss the implications of these properties.

Note that not all actor languages enforce all these properties. Often, the implementations compromise some actor properties, typically because it is simpler to achieve efficient implementations by doing so. However, it is possible by sophisticated program transformations, compilation, and runtime optimizations to regain almost all the efficiency that is lost in a naïve language implementation, although doing so is more challenging for library-like actor frameworks []. By failing to enforce some actor properties in an actor language or framework implementation, actor languages add to the burden of the programmers, who have to then ensure that they write programs in a way that does not violate the property.

Encapsulation and Atomicity

Encapsulation implies that no two actors share state. This is useful for enforcing an object-style decomposition in the code. In sequential object-based languages, this has led to the natural model of atomic change in objects: an object invokes (sends a message to) another object, which finishes processing the message before accepting another message from a different object. This allows us to reason about the behavior of the object in response to a message, given the state of the target object when it received the message. In a concurrent computation, it is possible for a message to arrive while an actor is busy processing another message. Now if the second message is allowed to interrupt the target actor and modify the target’s state while the target is still processing the first message, it is no longer feasible to reason about the behavior of the target actor based on what the target’s state was when it received the first message. This makes it difficult to reason about the behavior of the system as such interleaving of messages may lead to erroneous and inconsistent states.

Instead, the target actor processes messages one at a time, in a single atomic step consisting of all actions taken in response to a given message []. By dramatically reducing the nondeterminism that must be considered, such atomicity provides a macro-step semantics which simplifies reasoning about actor programs. Macro-step semantics is commonly used by correctness-checking tools; it significantly reduces the state-space exploration required to check a property against an actor program’s potential executions (e.g., see []).
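A small illustration (hypothetical, not tied to any particular actor framework): because an actor executes the handler for one message as a single macro-step, a read-modify-write of its state cannot be interleaved with another handler, so no locking is needed inside the actor.

// Hypothetical account actor: deposit() is a message handler. Under macro-step semantics
// the three steps below run as one atomic step per message, so two concurrent "deposit"
// messages can never interleave inside the handler and no update is lost.
class AccountActor {
    private long balanceCents = 0;          // encapsulated state, reachable only via messages

    void deposit(long amountCents) {        // executed one message at a time by the runtime
        long read = balanceCents;           // read
        long updated = read + amountCents;  // modify
        balanceCents = updated;             // write: never interleaved with another handler
    }

    void printBalance() {
        System.out.println(balanceCents);
    }
}

With ordinary shared objects, the same three steps executed by two threads can interleave and lose an update; ruling out such interleavings is also what allows model checkers to explore only message-level (macro-step) schedules.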
Actors A 

automatic migration in the runtime, much as indi- It has been noted that a faithful but naïve implemen-
A
rection in addressing facilitates compaction following tation of the Actor model can be highly inefficient []
garbage collection in sequential programming. (at least on the current generation of architectures).
Mobility is defined as the ability of a computation Consider three examples:
to migrate across different nodes. Mobility is impor-
tant for load-balancing, fault-tolerance, and reconfig- . An implementation that maps each actor to a sepa-
uration. In particular, mobility is useful in achieving rate process may have a high cost for actor creation.
scalable performance, particularly for dynamic, irreg- . If the number of cores is less than the number of
ular applications []. Moreover, employing different actors in the program (sometimes termed CPU over-
distributions in different stages of a computation may subscription), an implementation mapping actors to
improve performance. In other cases, the optimal or separate processes may suffer from high context
correct performance depends on runtime conditions switching cost.
such as data and workload, or security characteristics . If two actors are located on the same sequential
of different platforms. For example, web applications node, or on a shared-memory processor, it may
may be migrated to servers or to mobile clients depend- be an order of magnitude more efficient to pass a
ing on the network conditions and capabilities of the reference to the message contents rather than to
client []. make a copy of the actual message contents.
Mobility may also be useful in reducing the energy
consumed by the execution of parallel applications. These inefficiencies may be addressed by compi-
Different parts of an application often involve differ- lation and runtime techniques, or through a combi-
ent parallel algorithms and the energy consumption of nation of the two. The implementation of the ABCL
an algorithm depends on how many cores the algo- language [] demonstrates some early ideas for opti-
rithm is executed on and at what frequency these cores mizing both intra-node and internode execution and
communication between actors. The Thal language
operate []. Mobility facilitates dynamic redistribution
project [] shows that encapsulation, fairness, and uni-
of a parallel computation to the appropriate number
versal naming in an actor language can be implemented
of cores (i.e., to the number of cores that minimize
efficiently on commodity hardware by using a combi-
energy consumption for a given performance require-
nation of compiler and runtime. The Thal implemen-
ment and parallel algorithm) by migrating actors. Thus,
tation also demonstrates that various communication
mobility could be an important feature for energy-
abstractions such as RPC-like communication, local
aware programming of multicore (manycore) architec-
tures. Similarly, energy savings may be facilitated by synchronization constraints, and join expressions can
being able to migrate actors in sensor networks and also be supported efficiently using various compile-time
clouds. program transformations.
The Kilim framework develops a clever post-
compilation continuation-passing style (CPS) trans-
form (“weaving”) on Java-based actor programs for
Implementations supporting lightweight actors that can pause and
Erlang is arguably the best-known implementation of resume []. Kilim and Scala also add type systems to
the Actor model. It was developed to program tele- support safe but efficient messages among actors on a
com switches at Ericsson about  years ago. Some shared node [, ]. Recent work suggests that own-
recent actor implementations have been listed earlier. ership transfer between actors, which enables safe and
Many of these implementations have focused on a efficient messaging, can be statically inferred in most
particular domain such as the Internet (SALSA), dis- cases [].
tributed applications (Erlang and E), sensor networks On distributed platforms such as cloud comput-
(ActorNet), and, more recently multicore processors ers or grids, because of latency in sending messages
(Scala Actors library, ActorFoundry, and many others in to remote actors, an important technique for achiev-
development). ing good performance is communication–computation
 A Actors

overlap. Decomposition into actors and the placement of actors can significantly determine the extent of this overlap. Some of these issues have been effectively addressed in the Charm++ runtime []. Decomposition and placement issues are also expected to show up on scalable manycore architectures, since these architectures cannot be expected to support constant-time access to shared memory.

Finally, note that the notion of garbage in actors is somewhat complex. Because an actor name may be communicated in a message, it is not sufficient to mark the forward acquaintances (references) of reachable actors as reachable. The inverse acquaintances of reachable actors that may be potentially active need to be considered as well (these actors may send a message to a reachable actor). Efficient garbage collection of distributed actors remains an open research problem because of the difficulty of taking efficient distributed snapshots of the reachability graph in a running system [].
Tools
Several tools are available to aid in writing, maintaining, debugging, model checking, and testing actor programs. Both Erlang and Scala have a plug-in for the popular, open source IDE (Integrated Development Environment) called Eclipse (http://www.eclipse.org). A commercial testing tool for Erlang programs called QuickCheck [] is available. The tool enables programmers to specify program properties and input generators which are used to generate test inputs.
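The property-plus-generator idea can be illustrated with a small, generic sketch, shown here in Python rather than in QuickCheck's Erlang notation; the property and generator below are invented for illustration. A property is a boolean function of generated inputs, and the tool searches for a counterexample by repeated random generation.

```python
import random

def gen_messages():
    """Input generator: a random batch of integer 'add' amounts."""
    return [random.randint(-100, 100) for _ in range(random.randint(0, 50))]

def prop_order_insensitive(msgs):
    """Property: a counter actor that only accumulates amounts reaches the same
    state no matter in which order the same messages are delivered."""
    def run(batch):
        state = 0
        for m in batch:
            state += m
        return state
    shuffled = list(msgs)
    random.shuffle(shuffled)
    return run(msgs) == run(shuffled)

if __name__ == "__main__":
    for _ in range(1000):                      # generate-and-check loop
        batch = gen_messages()
        assert prop_order_insensitive(batch), f"counterexample: {batch}"
    print("property held on 1000 random message batches")
```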
JCute [] is a tool for automatic unit testing of programs written in a Java actor framework. Basset [] works directly on executable (Java bytecode) actor programs and is easily retargetable to any actor language that compiles to bytecode. Basset understands the semantic structure of actor programs (such as the macro-step semantics), enabling efficient path exploration through the Java Pathfinder (JPF) – a popular tool for model checking programs []. The term rewriting system Maude provides an Actor module to specify program behavior; it has been used to model check actor programs []. There has also been work on runtime monitoring of actor programs [].

Extensions and Abstractions
A programming language should facilitate the process of writing programs by being close to the conceptual level at which a programmer thinks about a problem rather than at the level at which it may be implemented. Higher-level abstractions for concurrent programming may be defined in interaction languages which allow patterns to be captured as first-class objects []. Such abstractions can be implemented through an adaptive, reflective middleware []. Besides programming abstractions for concurrency in the pure (asynchronous) Actor model, there are variants of the Actor model, such as for real-time, which extend the model [, ]. Two interaction patterns are discussed below to illustrate these ideas.

Pattern-Directed Communication
Recall that a sending actor must know the name of a target actor before the sending actor can communicate with the target actor. This property, called locality, is useful for compositional reasoning about actor programs – if it is known that only some actors can send a message to an actor A, then it may be possible to figure out what types of messages A may get and perhaps specify some constraints on the order in which it may get them. However, real-world programs generally create an open system which interacts with its external environment. This means that having ways of discovering actors which provide certain services can be helpful. For example, if an actor migrates to some environment, discovering a printer in that environment may be useful.

Pattern-directed communication allows programmers to declare properties of a group of actors, enabling the use of these properties to discover the actual recipients, which are chosen at runtime. In the ActorSpace model, an actor specifies recipients in terms of patterns over properties that must be satisfied by the recipients. The sender may send the message to all actors (in some group) that satisfy the property, or to a single representative actor []. There are other models for pattern-based communication. In Linda, potential recipients specify a pattern for messages they are interested in []. The sending actor simply inserts a message (called a tuple in Linda) into
a tuple-space, from which receiving actors may read or remove tuples that match the pattern of messages they are interested in.
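A minimal sketch of pattern-directed delivery in the Linda style is shown below. The toy `TupleSpace` class is invented for illustration and is not the API of any particular Linda or ActorSpace implementation: receivers supply a pattern, and a `None` field acts as a wildcard.

```python
class TupleSpace:
    """A toy Linda-style tuple space: senders insert tuples; receivers remove the
    first tuple matching a pattern, where a None field acts as a wildcard."""

    def __init__(self):
        self._tuples = []

    def out(self, *fields):
        """Insert a tuple (Linda's 'out')."""
        self._tuples.append(fields)

    def inp(self, *pattern):
        """Remove and return one matching tuple, or None (a nonblocking 'in')."""
        for i, tup in enumerate(self._tuples):
            if len(tup) == len(pattern) and all(
                p is None or p == f for p, f in zip(pattern, tup)
            ):
                return self._tuples.pop(i)
        return None

space = TupleSpace()
space.out("print-job", "printer-A", "hello.ps")
space.out("reading", "sensor-7", 21.5)

# A receiver interested in any print job, regardless of printer or document:
print(space.inp("print-job", None, None))   # ('print-job', 'printer-A', 'hello.ps')
```

An ActorSpace-style variant would instead match patterns against attributes advertised by receiver actors, so that a message is routed to one or all of the actors whose advertised properties satisfy the sender's pattern.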
Coordination
Actors help simplify programming by increasing the granularity at which programmers need to reason about concurrency; namely, they may reason in terms of the potential interleavings of messages to actors, instead of in terms of the interleavings of accesses to shared variables within actors. However, developing actor programs is still complicated and prone to errors. A key cause of complexity in actor programs is the large number of possible interleavings of messages to groups of actors: if these message orderings are not suitably constrained, some possible execution orders may fail to meet the desired specification.

Recall that local synchronization constraints postpone the dispatch of a message based on the contents of the message and the local state of the receiving actor (see section “Local Synchronization Constraints”). Synchronizers, on the other hand, change the order in which messages are processed by a group of actors by defining constraints on the ordering of messages processed at different actors in the group. For example, if withdrawal and deposit messages must be processed atomically by two different actors, a Synchronizer can specify that they must be scheduled together. Synchronizers are described in [].

In the standard actor semantics, an actor that knows the name of a target actor may send the latter a message. An alternate semantics introduces the notion of a channel; a channel is used to establish communication between a given sender and a given recipient. Recent work on actor languages has introduced stateful channel contracts to constrain the order of messages between two actors. Channels are a central concept for communication between actors in both Microsoft’s Singularity platform [] and Microsoft’s Axum language [], while they can be optionally introduced between two actors in Erlang. Channel contracts specify a protocol that governs the communication between the two end points (actors) of the channel. The contracts are stated in terms of state transitions based on observing messages on the channel. From the perspective of each end point (actor), the channel contract specifies the interface of the other end point (actor) in terms of not only the types of messages but also the ordering of messages. In Erlang, contracts are enforced at runtime, while in Singularity a more restrictive notion of typed contracts makes it feasible to check the constraints at compile time.

Current Status and Perspective
Actor languages have been used for parallel and distributed computing in the real world for some time (e.g., Charm++ for scientific applications on supercomputers [], Erlang for distributed applications []). In recent years, interest in actor-based languages has been growing, both among researchers and among practitioners. This interest is triggered by emerging programming platforms such as multicore computers and cloud computers. In some cases, such as cloud computing, web services, and sensor networks, the Actor model is a natural programming model because of the distributed nature of these platforms. Moreover, as multicore architectures are scaled, multicore computers will also look more and more like traditional multicomputer platforms. This is illustrated by the -core Single-Chip Cloud Computer (SCC) developed by Intel [] and the -core TILE-Gx by Tilera []. However, the argument for using actor-based programming languages is not simply that they are a good match for distributed computing platforms; it is that Actors is a good model in which to think about concurrency. Actors simplify the task of programming by extending object-based design to concurrent (parallel, distributed, mobile) systems.

Bibliography
. Agha G, Callsen CJ () Actorspace: an open distributed programming paradigm. In: Proceedings of the fourth ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP), San Diego, CA. ACM, New York, pp –
. Agha G, Frolund S, Kim WY, Panwar R, Patterson A, Sturman D () Abstraction and modularity mechanisms for concurrent computing. IEEE Trans Parallel Distr Syst ():–
. Agha G () Actors: a model of concurrent computation in distributed systems. MIT Press, Cambridge, MA
. Agha G () Concurrent object-oriented programming. Commun ACM ():–

. Arts T, Hughes J, Johansson J, Wiger U () Testing telecoms . Haller P, Odersky M () Actors that unify threads and events.
software with quviq quickcheck. In: ERLANG ’: proceedings of In: th International conference on coordination models and lan-
the  ACM SIGPLAN workshop on Erlang. ACM, New York, guages, vol  of lecture notes in computer science, Springer,
pp – Berlin
. Agha G, Kim WY () Parallel programming and complexity . Haller P, Odersky M () Capabilities for uniqueness and bor-
analysis using actors. In: Proceedings: third working conference rowing. In: D’Hondt T (ed) ECOOP  object-oriented pro-
on massively parallel programming models, , London. IEEE gramming, vol  of lecture notes in computer science, Springer,
Computer Society Press, Los Alamitos, CA, pp – Berlin, pp –
. Agha G, Mason IA, Smith S, Talcott C () A foundation for . Kim WY, Agha G () Efficient support of location transparency
actor computation. J Funct Program ():– in concurrent object-oriented programming languages. In: Super-
. Armstrong J () Programming Erlang: software for a concur- computing ’: proceedings of the  ACM/IEEE conference on
rent World. Pragmatic Bookshelf, Raleigh, NC supercomputing, San Diego, CA. ACM, New York, p 
. Astley M, Sturman D, Agha G () Customizable middleware . Korthikanti VA, Agha G () Towards optimizing energy costs
for modular distributed software. Commun ACM ():– of algorithms for shared memory architectures. In: SPAA ’:
. Astley M (–) The actor foundry: a java-based actor pro- proceedings of the nd ACM symposium on parallelism in algo-
gramming environment. Open Systems Laboratory, University of rithms and architectures, Santorini, Greece. ACM, New York,
Illinois at Urbana-Champaign, Champaign, IL pp –
. Briot JP, Guerraoui R, Lohr KP () Concurrency and distribu- . Kale LV, Krishnan S () Charm++: a portable concurrent
tion in object-oriented programming. ACM Comput Surv (): object oriented system based on c++. ACM SIGPLAN Not ():
– –
. Chang PH, Agha G () Towards context-aware web applica- . Karmani RK, Shali A, Agha G () Actor frameworks for the
tions. In: th IFIP international conference on distributed appli- JVM platform: a comparative analysis. In: PPPJ ’: proceed-
cations and interoperable systems (DAIS), , Paphos, Cyprus. ings of the th international conference on principles and prac-
LNCS , Springer, Berlin tice of programming in java, Calgary, Alberta. ACM, New York,
. Clavel M, Durán F, Eker S, Lincoln P, Martí-Oliet N, Meseguer J, pp –
Talcott C () All about maude – a high-performance logi- . Lauterburg S, Dotta M, Marinov D, Agha G () A frame-
cal framework: how to specify, program and verify systems in work for state-space exploration of javabased actor programs.
rewriting logic. Springer, Berlin In: ASE ’: proceedings of the  IEEE/ACM interna-
. Carriero N, Gelernter D () Linda in context. Commun ACM tional conference on automated software engineering, Auck-
:– land, New Zealand. IEEE Computer Society, Washington, DC,
. Caromel D, Henrio L, Serpette BP () Asynchronous sequen- pp –
tial processes. Inform Comput ():– . Lee EA () Overview of the ptolemy project. Technical report
. Intel Corporation. Single-chip cloud computer. http:// UCB/ERL M/. University of California, Berkeley
techresearch.intel.com/ProjectDetails.aspx?Id= . Lauterburg S, Karmani RK, Marinov D, Agha G () Eval-
. Microsoft Corporation. Asynchronous agents library. http:// uating ordering heuristics for dynamic partialorder reduction
msdn.microsoft.com/enus/library/dd(VS.).aspx techniques. In: Fundamental approaches to software engineering
. Microsoft Corporation. Axum programming language. http:// (FASE) with ETAPS, , LNCS , Springer, Berlin
msdn.microsoft.com/en-us/devlabs/dd.aspx . Negara S, Karmani RK, Agha G () Inferring ownership trans-
. Tilera Corporation. TILE-Gx processor family. http://tilera.com/ fer for efficient message passing. In: To appear in the th ACM
products/processors/TILE-Gxfamily SIGPLAN symposium on principles and practice of parallel pro-
. Fähndrich M, Aiken M, Hawblitzel C, Hodson O, Hunt G, Larus gramming (PPoPP). ACM, New York
JR, Levi S () Language support for fast and reliable message- . Ren S, Agha GA () Rtsynchronizer: language support for real-
based communication in singularity OS. SIGOPS Oper Syst Rev time specifications in distributed systems. ACM SIGPLAN Not
():– ():–
. Feng TH, Lee EA () Scalable models using model transforma- . Sturman D, Agha G () A protocol description language for
tion. In: st international workshop on model based architecting customizing failure semantics. In: Proceedings of the thirteenth
and construction of embedded systems (ACESMB), Toulouse, symposium on reliable distributed systems, Dana Point, CA. IEEE
France Computer Society Press, Los Alamitos, CA, pp –
. Houck C, Agha G () Hal: a high-level actor language and its . Sen K, Agha G () Automated systematic testing of open
distributed implementation. In: st international conference on distributed programs. In: Fundamental approaches to software
parallel processing (ICPP), vol II, An Arbor, MI, pp – engineering (FASE), volume  of lecture notes in computer
. Hewitt C () PLANNER: a language for proving theorems in science, Springer, Berlin, pp –
robots. In: Proceedings of the st international joint conference . Kliot G, Larus J, Pandya R, Thelin J, Bykov S, Geller A ()
on artificial intelligence, Morgan Kaufmann, San Francisco, CA, Orleans: a framework for cloud computing. Technical report
pp – MSR-TR--, Microsoft Research

. Singh V, Kumar V, Agha G, Tomlinson C () Scalability of more efficient, and it is most often related to differ-
parallel sorting on mesh multicomputers. In: Parallel processing
A
ent speeds or overheads associated with the resources
symposium, . Proceedings, fifth international. Anaheim, CA, of the computing node that are required by the
pp –
computing task.
. Srinivasan S, Mycroft A () Kilim: isolation typed actors for
java. In: Procedings of the European conference on object ori-
ented programming (ECOOP), Springer, Berlin
. Schäfer J, Poetzsch-Heffter A () Jcobox: generalizing active Discussion
objects to concurrent components. In: Proceedings of the Introduction
th European conference on object-oriented programming,
In parallel computing environments, it may be more
ECOOP’, Maribor, Slovenia. Springer, Berlin/Heidelberg
pp –
efficient to schedule a computing task on one com-
. Sen K, Vardhan A, Agha G, Rosu G () Efficient decentralized puting node than on another. Such affinity of a spe-
monitoring of safety in distributed systems. In: ICSE ’: proceed- cific task for a particular node can arise from many
ings of the th international conference on software engineer- sources, where the goal of the scheduling policy is
ing, Edinburg, UK. IEEE Computer Society. Washington, DC,
typically to optimize a functional of response time
pp –
. Varela C, Agha G () Programming dynamically reconfig-
and/or throughput. For example, the affinity might be
urable open systems with SALSA. ACM SIGPLAN Notices (): based on how fast the computing task can be exe-
– cuted on a computing node in an environment com-
. Venkatasubramanian N, Agha G, Talcott C () Scalable dis- prised of nodes with heterogeneous processing speeds.
tributed garbage collection for systems of active objects. In: As another example, the affinity might concern the
Bekkers Y, Cohen J (eds) International workshop on memory
resources associated with the computing nodes such
management, ACM SIGPLAN and INRIA, St. Malo, France.
Lecture notes in computer science, vol , Springer, Berlin, that each node has a collection of available resources,
pp – each task must execute on a node with a certain
. Visser W, Havelund K, Brat G, Park S () Model checking pro- set of resources, and the scheduler attempts to max-
grams. In: Proceedings of the th IEEE international conference imize performance subject to these imposed system
on automated software engineering, ASE ’, Grenoble, France. constraints.
IEEE Computer Society, Washington, DC, pp 
. Yonezawa A (ed) () ABCL: an object-oriented concurrent
Another form of affinity scheduling is based on the
system. MIT Press, Cambridge, MA state of the memory system hierarchy. More specif-
ically, it may be more efficient in parallel comput-
ing environments to schedule a computing task on a
particular computing node than on any other if rel-
evant data, code, or state information already resides
Affinity Scheduling in the caches or local memories associated with the
node. This is the form most commonly referred to as
Mark S. Squillante affinity scheduling, and it is the primary focus of this
IBM, Yorktown Heights, NY, USA entry. The use of such affinity information in general-
purpose multiprocessor systems can often improve per-
formance in terms of functionals of response time and
Synonyms throughput, particularly if this information is inexpen-
Cache affinity scheduling; Resource affinity
sive to obtain and exploit. On the other hand, the per-
scheduling
formance benefits of this form of affinity scheduling
often depend upon a number of factors and vary over
Definition time, and thus there is a fundamental scheduling trade-
Affinity scheduling is the allocation, or scheduling, of off between scheduling tasks where they execute most
computing tasks on the computing nodes where they efficiently and keeping the workload shared among
will be executed more efficiently. Such affinity of a task nodes.
for a node can be based on any aspects of the com- Affinity scheduling in general-purpose multiproces-
puting node or computing task that make execution sor systems will be next described within its historical

context, followed by a brief discussion of the perfor- times may cause significant increases in the execution
mance trade-off between affinity scheduling and load times of individual tasks as well as reductions in the
sharing. performance of the entire system. These cache-reload
effects may be further compounded by an increase in
General-Purpose Multiprocessor Systems the overhead of cache coherence protocols (due to a
Many of the general-purpose parallel computing envi- larger number of cache invalidations resulting from
ronments introduced in the s were shared-memory modification of a task’s data still resident in another
multiprocessor systems of modest size in compari- processor’s cache) and an increase in bus traffic and
son with the parallel computers of today. Most of the interference (due to cache misses).
general-purpose multiprocessor operating systems at To this end, Squillante and Lazowska [, ] for-
the time implemented schedulers based on simple pri- mulate and analyze mathematical models to investigate
ority schemes that completely ignored the affinity a task fundamental principles underlying the various per-
might have for a specific processor due to the contents of formance trade-offs associated with processor-cache
processor caches. On the other hand, as processor cycle affinity scheduling. These mathematical models repre-
times improved at a much faster rate than main memory sent general abstractions of a wide variety of general-
access times, researchers were observing the increas- purpose multiprocessor systems and workloads, ranging
ingly important relative performance impact of cache from more traditional time-sharing parallel systems
misses (refer to, e.g., []). The relative costs of a cache and workloads even up through more recent dynamic
miss were also increasing due to other issues includ- coscheduling parallel systems and workloads (see,
ing cache coherency [] and memory bus interference. e.g., []). Several different scheduling policies are
As a matter of fact, the caches in one of the general- considered, spanning the entire spectrum from ignor-
purpose multiprocessor systems at the time were not ing processor-cache affinity to fixing tasks to execute
intended to reduce the memory access time at all, but on specific processors. The results of this modeling
rather to reduce memory bus interference and protect analysis illustrate and quantify the benefits and limita-
the bus from most processor-memory references (refer tions of processor-cache affinity scheduling in general-
to, e.g., []). purpose shared-memory multiprocessor systems with
In one of the first studies of such processor-cache respect to improving and degrading the first two sta-
affinity issues, Squillante and Lazowska [, ] consid- tistical moments of response time and throughput. In
ered the performance implications of scheduling based particular, the circumstances under which performance
on the affinity of a task for the cache of a particular pro- improvement and degradation can be realized and the
cessor. The typical behaviors of tasks in general-purpose importance of exploiting processor-cache affinity infor-
multiprocessor systems can include alternation between mation depend upon many factors. Some of the most
executing at a processor and releasing this processor important of these factors include the size of proces-
either to perform I/O or synchronization operations (in sor caches, the locality of the task memory references,
which cases the task is not eligible for scheduling until the size of the set of cache blocks in active use by the
completion of the operation) or because of quantum task (its cache footprint), the ratio of the cache footprint
expiration or preemption (in which cases the task is loading time to the execution time of the task per visit
suspended to allow execution of another task). Upon to a processor, the time spent non-schedulable by the
returning and being scheduled on a processor, the task task between processor visits, the processor activities in
may experience an initial burst of cache misses and the between such visits, the system architecture, the paral-
duration of this burst depends, in part, upon the num- lel computing workload, the system scheduling strategy,
ber of blocks belonging to the task that are already res- and the need to adapt scheduling decisions with changes
ident in the cache of the processor. Continual increases in system load.
in cache sizes at the time further suggested that a signif- A large number of empirical research studies sub-
icant portion of the working set of a task may reside in sequently followed to further examine affinity schedul-
the cache of a specific processor under these scenarios. ing across a broad range of general-purpose parallel
Scheduling decisions that disregard such cache reload computing systems, workloads, scheduling strategies,

and resource types. Gupta et al. [] use a detailed that affinity scheduling yields significant performance
A
multiprocessor simulator to evaluate various schedul- improvements for a few of the application workloads
ing strategies, including gang scheduling (coschedul- and moderate performance gains for most of the appli-
ing), two-level scheduling (space-sharing) with process cation workloads considered in their study, while not
control, and processor-cache affinity scheduling, with degrading the performance of the rest of these work-
a focus on the performance impact of scheduling on loads. These results can once again be explained by the
cache behavior. The benefits of processor-cache affin- factors identified above.
ity scheduling are shown to exhibit a relatively small In addition to affinity scheduling in general-
but noticeable performance improvement, which can purpose shared-memory multiprocessors, a number
be explained by the factors identified above under the of related issues have arisen with respect to other
multiprocessor system and workload studied in []. types of resources in parallel systems. One particu-
Two-level scheduling with process control and gang larly interesting application area concerns affinity-based
scheduling are shown to provide the highest levels of scheduling in the context of parallel network protocol
performance, with process control outperforming gang processing, as first considered by Salehi et al. (see []
scheduling when applications have large working sets and the references cited therein), where parallel com-
that fit within a cache. Vaswani and Zahorjan [] then putation is used for protocol processing to support
developed an implementation of a space-sharing strat- high-bandwidth, low-latency networks. In particular,
egy (under which processors are partitioned among Salehi et al. [] investigate affinity-based schedul-
applications) to examine the implications of cache affin- ing issues with respect to: supporting a large num-
ity in shared-memory multiprocessor scheduling under ber of streams concurrently; receive-side and send-side
various scientific application workloads. The results of protocol processing (including data-touching protocol
this study illustrate that processor-cache affinity within processing); stream burstiness and source locality of
such a space-sharing strategy provides negligible per- network traffic; and improving packet-level concur-
formance improvements for the three scientific appli- rency and caching behavior. The approach taken to eval-
cations considered. It is interesting to note that the uate various affinity-based scheduling policies is based
models in [, ], parameterized by the system and on a combination of multiprocessor system measure-
application measurements and characteristics in [], ments, analytic modeling, and simulation. This com-
also show that affinity scheduling yields negligible per- bination of methods is used to illustrate and quantify
formance benefits in these circumstances. Devarakonda the potentially significant benefits and effectiveness of
and Mukherjee [] consider various implementation affinity-based scheduling in multiprocessor network-
issues involved in exploiting cache affinity to improve ing. Across all of the workloads considered, the authors
performance, arguing that affinity is most effective find benefits in managing threads and free-memory by
when implemented through a thread package which taking affinity issues into account, with different forms
supports the multiplexing of user-level threads on oper- of affinity-based scheduling performing best under dif-
ating system kernel-level threads. The results of this ferent workload conditions.
study show that a simple scheduling strategy can yield The trends of increasing cache and local mem-
significant performance improvements under an appro- ory sizes and of processor speeds decreasing at much
priate application workload and a proper implementa- faster rates than memory access times continued to
tion approach, which can once again be explained by the grow over time. This includes generations of non-
factors identified above. Torrellas et al. [, ] study uniform memory-access (NUMA) shared-memory and
the performance benefits of cache-affinity scheduling distributed-memory multiprocessor systems in which
in shared-memory multiprocessors under a wide vari- the remoteness of memory accesses can have an even
ety of application workloads, including various sci- more significant impact on performance. The penalty
entific, software development, and database applica- for not adhering to processor affinities can be consider-
tions, and under a time-sharing strategy that was in ably more significant in such NUMA and distributed-
widespread use by the vast majority of parallel com- memory multiprocessor systems, where the actual
puting environments at the time. The authors conclude cause of the larger costs depends upon the memory

management policy but are typically due to remote information about the average behavior of the sys-
access or demand paging of data stored in non-local tem while ignoring the current state, have been pro-
memory modules. These trends in turn have caused posed and studied by numerous researchers including
the performance benefits of affinity scheduling in par- Bokhari [], and Tantawi and Towsley []. These poli-
allel computing environments to continue to grow. The cies have been shown, in many cases, to provide better
vast majority of parallel computing systems available performance than policies that do not attempt to share
today, therefore, often exploit various forms of affinity the system workload. Other studies have shown that
scheduling throughout distinct aspects of the parallel more adaptive policies, namely those that make deci-
computing environment. There continues to be impor- sions based on the current state of the system, have the
tant differences, however, among the various forms of potential to greatly improve system performance over
affinity scheduling depending upon the system architec- that obtained with static policies. Furthermore, Livny
ture, application workload, and scheduling strategy, in and Melman [], and Eager, Lazowska, and Zahor-
addition to new issues arising from some more recent jan [] have shown that much of this potential can
trends such as power management. be realized with simple methods. This potential has
prompted a number of studies of specific adaptive load
sharing policies, including the research conducted by
Affinity Scheduling and Load Sharing Barak and Shiloh [], Wang and Morris [], Eager
Trade-off et al. [], and Mirchandaney et al. [].
In parallel computing environments that employ affin- While there are many similarities between the
ity scheduling, the system often allocates computing affinity scheduling and load sharing trade-off in
tasks on the computing nodes where they will be exe- distributed and shared-memory (as well as some
cuted most efficiently. Conversely, underloaded nodes distributed-memory) parallel computing environments,
are often inevitable in parallel computing environments there can be several important differences due to dis-
due to factors such as the transient nature of system tinct characteristics of these diverse parallel system
load and the variability of task service times. If comput- architectures. One of the most important differences
ing tasks are always executed on the computing nodes concerns the direct costs of moving a computing task. In
for which they have affinity, then the system may suf- a distributed system, these costs are typically incurred
fer from load sharing problems as tasks are waiting at by the computing node from which the task is being
overloaded nodes while other nodes are underloaded. migrated, possibly with an additional network delay.
On the one hand, if processor affinities are not fol- Subsequent to this move, the processing requirements
lowed, then the system may incur significant penalties of the computing task are identical to those at the com-
as each computing task must establish its working set in puting node where the task has affinity. On the other
close proximity to a computing node before it can pro- hand, the major costs of task migration in a shared-
ceed. On the other hand, scheduling decisions cannot be memory (as well as distributed-memory) system are
based solely on task-node affinity, else other scheduling the result of a larger service demand at the comput-
criteria, such as fairness, may be sacrificed. Hence, there ing node to which the computing task is migrated,
is a fundamental scheduling trade-off between keep- reflecting the time required to either establish the work-
ing the workload shared among nodes and scheduling ing set of the task in closer proximity to this node or
tasks where they execute most efficiently. An adaptive remotely access the working set from this node. Hence,
scheduling policy is needed that determines, as a func- there is a shift of the direct costs of migration from the
tion of system load, the appropriate balance between (overloaded) computing node for which the comput-
the extremes of strictly balancing the workload among ing task has affinity to the (underloaded) node receiving
all computing nodes and abiding by task-node affini- the task.
ties blindly. Motivated by these and related differences between
One such form of load sharing has received con- distributed and shared-memory systems, Squillante
siderable attention with respect to distributed system and Nelson [] investigated the fundamental trade-
environments. Static policies, namely those that use off between affinity scheduling and load sharing in

shared-memory (as well as some distributed-memory) attention in the literature (see, e.g., [, , ]). Exam-
A
parallel computing environments. More specifically, the ples of shared memory multiprocessor systems from
authors consider the question of how expensive task the s include the DEC Firefly [] and the Sequent
migration must become before it is not beneficial to Symmetry []. Multiprocessor operating systems of
have an underloaded computing node migrate a com- this period that completely ignored processor-cache
puting task waiting at another node, as it clearly would affinity include Mach [], DYNIX [], and Topaz [].
be beneficial to migrate such a task if the cost to do This entry primarily focuses on affinity scheduling
so was negligible. Squillante and Nelson [], therefore, based on the state of the memory system hierarchy
formulate and analyze mathematical models to inves- where it can be more efficient to schedule a com-
tigate this and related questions concerning the condi- puting task on a particular computing node than on
tions under which it becomes detrimental to migrate another if any relevant information already resides in
a task away from its affinity node with respect to the caches or local memories in close proximity to the node.
costs of not adhering to these task-node affinities. The This entry also considers the fundamental trade-off
results of this modeling analysis illustrate and quantify between affinity scheduling and load sharing. A number
the potentially significant benefits of migrating waiting of important references have been provided through-
tasks to underloaded nodes in shared-memory mul- out, to which the interested reader is referred together
tiprocessors even when migration costs are relatively with the citations provided therein. Many more refer-
large, particularly at moderate to heavy loads. By shar- ences on these subjects are widely available, in addition
ing the collection of computing tasks among all com- to the wide variety of strategies that have been proposed
puting nodes, a combination of affinity scheduling and to address affinity scheduling and related performance
threshold-based task migration can yield performance trade-offs.
that is better than a non-migratory policy even with
a larger service demand for migrated tasks, provided
proper threshold values are employed. These model- Bibliography
ing analysis results also demonstrate the potential for . Accetta M, Baron R, Bolosky W, Golub D, Rashid R, Tevanian A,
unstable behavior under task migration policies when Young M () Mach: a new kernel foundation for UNIX devel-
opment. In: Proceedings of USENIX Association summer tech-
improper threshold settings are employed, where opti-
nical conference, Atlanta, GA, June . USENIX Association,
mal policy thresholds avoiding such behavior are pro- Berkeley, pp –
vided as a function of system load and the relative . Archibald J, Baer J-L () Cache coherence protocols: eval-
processing time of migrated tasks. uation using a multiprocessor simulation model. ACM Trans
Comput Syst ():–
. Barak A, Shiloh A () A distributed load-balancing policy for
Related Entries a multicomputer. Softw Pract Exper ():–
Load Balancing, Distributed Memory . Bokhari SH () Dual processor scheduling with dynamic reas-
Operating System Strategies signment. IEEE Trans Softw Eng SE-():–
. Conway RW, Maxwell WL, Miller LW () Theory of schedul-
Scheduling Algorithms
ing. Addison-Wesley, Reading
. Craft DH () Resource management in a decentralized system.
In: Proceedings of symposium on operating systems principles,
Bibliographic Notes and Further
October . ACM, New York, pp –
Reading . Devarakonda M, Mukherjee A () Issues in implementation
Affinity scheduling based on the different speeds of cache-affinity scheduling. In: Proceedings of winter USENIX
of executing tasks in heterogeneous processor envi- conference, January . USENIX Association, Berkeley,
ronments has received considerable attention in the pp –
literature, including both deterministic models and . Eager DL, Lazowska ED, Zahorjan J () Adaptive load sharing
in homogeneous distributed systems. IEEE Trans Softw Eng SE-
stochastic models for which, as examples, the inter-
():–
ested reader is referred to [, ] and [, ], . Eager DL, Lazowska ED, Zahorjan J () A comparison of
respectively. Affinity scheduling based on resource receiver-initiated and sender-initiated adaptive load sharing.
requirement constraints have also received considerable Perform Evaluation ():–

. Gupta A, Tucker A, Urushibara S () The impact of oper- . Torrellas J, Tucker A, Gupta A () Evaluating the performance
ating system scheduling policies and synchronization meth- of cache-affinity scheduling in shared-memory multiprocessors.
ods on the performance of parallel applications. In: Proceed- J Parallel Distrib Comput ():–
ings of ACM SIGMETRICS conference on measurement and . Vaswani R, Zahorjan J () The implications of cache affin-
modeling of computer systems, May . ACM, New York, ity on processor scheduling for multiprogrammed, shared mem-
pp – ory multiprocessors. In: Proceedings of symposium on operating
. Horowitz E, Sahni S () Exact and approximate algorithms for systems principles, October . ACM, New York, pp –
scheduling nonidentical processors. J ACM ():– . Wang YT, Morris R () Load sharing in distributed systems.
. Jouppi NP, Wall DW () Available instruction-level parallelism IEEE Trans Comput C-():–
for superscalar and superpipelined machines. In: Proceedings of . Weinrib A, Gopal G () Decentralized resource allocation
international symposium on computer architecture, April . for distributed systems. In: Proceedings of IEEE INFO-
ACM Press, New York, pp – COM ’, San Francisco, April . IEEE, Washington, DC,
. Lin W, Kumar PR () Optimal control of a queueing system pp –
with two heterogeneous servers. IEEE Trans Automatic Contr . Yu OS () Stochastic bounds for heterogeneous-server queues
():– with Erlang service times. J Appl Probab :–
. Livny M, Melman M () Load balancing in homogeneous
broadcast distributed systems. In: Proceedings of ACM com-
puter network performance symposium. ACM Press, New York,
pp –
. Mirchananey R, Towsley D, Stankovic JA () Adaptive load
Ajtai–Komlós–Szemerédi Sorting
sharing in heterogeneous systems. J Parallel Distrib Comput Network
:–
. Salehi JD, Kurose JF, Towsley D () The effectiveness of affinity- AKS Network
based scheduling in multiprocessor networking (extended ver-
sion). IEEE/ACM Trans Networ ():–
. Sequent Computer Systems () Symmetry technical summary.
Sequent Computer Systems Inc, Beaverton
. Squillante MS, Lazowska ED () Using processor-cache affin- AKS Network
ity information in shared-memory multiprocessor scheduling.
Technical Report --, Department of Computer Science,
Joel Seiferas
University of Washington, June . Minor revision, Feb  University of Rochester, Rochester, NY, USA
. Squillante MS, Lazowska ED () Using processor-cache affinity
information in shared-memory multiprocessor scheduling. IEEE
Trans Parallel Distrib Syst ():– Synonyms
. Squillante MS, Nelson RD () Analysis of task migration Ajtai–Komlós–Szemerédi sorting network; AKS sorting
in shared-memory multiprocessors. In: Proceedings of ACM network; Logarithmic-depth sorting network
SIGMETRICS conference on measurement and modeling of com-
puter systems, May . ACM, New York, pp –
. Squillante MS, Zhang Y, Sivasubramaniam A, Gautam N, Definition
Franke H, Moreira J () Modeling and analysis of dynamic AKS networks are O(log n)-depth networks of -item
coscheduling in parallel and distributed environments. In: Pro- sorters that sort their n input items by following the
ceedings of ACM SIGMETRICS conference on measurement and  design by Miklós Ajtai, János Komlós, and Endre
modeling of computer systems, June . ACM, New York,
Szemerédi.
pp –
. Tantawi AN, Towsley D () Optimal static load balancing in
distributed computer systems. J ACM ():– Discussion
. Thacker C, Stewart LC, Satterthwaite EH Jr () Firefly: a mul-
tiprocessor workstation. IEEE Trans Comput C-():– Comparator Networks for Sorting
. Thakkar SS, Gifford PR, Fieland GF () The balance multipro- Following Knuth [, Section ..] or Cormen et al.
cessor system. IEEE Micro ():–
[, Chapter ], consider algorithms that reorder their
. Torrellas J, Tucker A, Gupta A () Benefits of cache-affinity
scheduling in shared-memory multiprocessors: a summary.
input sequences I (of items, or “keys,” from some
In: Proceedings of ACM SIGMETRICS conference on mea- totally ordered universe) via “oblivious comparison–
surement and modeling of computer systems, May . ACM, exchanges.” Each comparison–exchange is effected by
New York, pp – a “comparator” from a data position i to another data

position j, which permutes I[i] and I[j] so that I[i] ≤ continue recursively on each half. But the depth of just
A
I[j]. (So a comparator amounts to an application of a halving network would already have to be Ω(log n),
a -item sorter.) The sequence of such comparison– again seeming to lead, through recursion, only to sort-
exchanges depends only on the number n of input items, ing networks of depth Ω(log n).
but not on whether items get switched. Independent (To see that (all) the outputs of a perfect halver
comparison–exchanges, involving disjoint pairs of data have to be at depth Ω(log n), consider the computa-
positions, can be allowed to take place at the same time – tion on a sequence of n identical items, and consider
up to ⌊n/⌋ comparison–exchanges per parallel step. any particular output item. If its depth is only d, then
Each such algorithm can be reformulated so that there are n − d input items that can have no effect
every comparison–exchange [i : j] has i < j (Knuth’s on that output item. If this number exceeds n/, then
standard form) [, Exercise ..-, or , Exercise these uninvolved input items can all be made larger or
.-]. Following Knuth, restrict attention to such smaller to make the unchanged output wrong. (In fact, a
algorithms and represent them by left-to-right paral- slightly more careful argument by Alekseev [, ] shows
lel horizontal “time lines” for the n data positions (also also that the total number of comparators has to be
referred to as registers), connecting pairs of them by ver- Ω(n log n), again seeming to leave no room for a good
tical lines to indicate comparison–exchanges. A famous recursive result.))
example is Batcher’s family of networks for sorting
“bitonic” sequences []. The network for n =  is
depicted in Fig. . Since the maximum depth is , four
steps suffice. In general, Batcher’s networks sort bitonic I [0]
input sequences of lengths n that are powers of , in I [1]
log n steps.
I [2]
If such a network reorders every length-n input
sequence into sorted order (i.e., so that I[i] ≤ I[j] holds I [3]
whenever i < j does), then call it a sorting network. Since I [4]
the concatenation of a sorted sequence and the reverse
of a sorted sequence is bitonic, Batcher’s networks can I [5]

be used to merge sorted sequences in Θ(log n) paral- I [6]


lel steps (using Θ(n log n) comparators), and hence to
I [7]
implement merge sort in Θ(log n) parallel steps (using
Θ(n log n) comparators). I [8]
The total number Θ(n log n) of comparators used I [9]
by Batcher’s elegant sorting networks is worse by a factor
of Θ(log n) than the number Θ(n log n) = Θ(log n!) I [10]

of two-outcome comparisons required for the best I [11]


(nonoblivious) sorting algorithms, such as merge sort
I [12]
and heap sort [, ]. At the expense of simplicity
and practical multiplicative constants, the “AKS” sort- I [13]
ing networks of Ajtai, Komlós, and Szemerédi [–] I [14]
close this gap, sorting in O(log n) parallel steps, using
O(n log n) comparators. I [15]

Step 1 Step 2 Step 3 Step 4


The Sorting-by-Splitting Approach
The AKS algorithm is based on a “sorting-by-splitting” AKS Network. Fig.  Batcher’s bitonic merger for n = 
or “sorting-by-classifying” approach that amounts to an elements. Each successive comparator (vertical line)
ideal version of “quicksort” [, ]: Separate the items reorders its inputs (top and bottom endpoints) into sorted
to be sorted into a smallest half and a largest half, and order

0 λ⬘ 1/2 1 Lemma  For each ε > , ε-halving can be performed


prefix misplacement area
by comparator networks (one for each even n) of constant
depth (depending on ε, but independent of n).
AKS Network. Fig.  Setting for approximate halving
(prefix case) Proof From the study of “expander graphs” [, e.g.,],
using the fact that −εε
ε < , start with a constant d, deter-
0 λ⬘ λ 1 mined by ε, and a d-regular n/-by-n/ bipartite graph
prefix misplacement area with the following expansion property: Each subset S of
either part, with ∣S∣ ≤ εn/, has more neighbors than
AKS Network. Fig.  Setting for approximate λ-separation −ε
∣S∣.
ε
(prefix case) The ε-halver uses a first-half-to-second-half com-
parator corresponding to each edge in the d-regular
bipartite expander graph. This requires depth only d,
The , AKS breakthrough [] was twofold: to by repeated application of a minimum-cut argument, to
notice that there are shallow approximate separators, extract d matchings [, Exercise .-, for example].
and to find a way to tolerate their errors in sorting by To see that the result is indeed an ε-halver, consider,
classification. in any application of the network, the final positions of
the strays from the m ≤ n/ smallest elements. Those
positions and all their neighbors must finally contain ele-
Approximate Separation via Approximate
ments among the m smallest. If the fraction of strays
Halving via Bipartite Expander Graphs
were more than ε, then this would add up to more than
The approximate separators defined, designed, and used
εm + −ε ε
εm = m, a contradiction.
are based on the simpler definition and existence of
approximate halvers. The criterion for approximate Using approximate halvers in approximate separa-
halving is a relative bound (ε) on the number of mis- tion and exact sorting resembles such mundane tasks
placements from each prefix or suffix (of relative length as sweeping dirt with an imperfect broom or clearing
λ′ up to λ = /) of a completely sorted result to the snow with an imperfect shovel. There has to be an effi-
wrong half of the result actually produced (see Fig. ). cient, converging strategy for cleaning up the relatively
For approximate separation, the “wrong fraction” one- few imperfections. Subsequent strokes should be aimed
half is generalized to include even larger fractions  − λ. at where the concentrations of imperfections are cur-
(See Fig. .) rently known to lie, rather than again at the full job –
something like skimming spilled oil off the surface of
Definition  For each ε >  and λ ≤ /, the criterion
the Gulf of Mexico.
for ε-approximate λ-separation of a sequence of n input
The use of approximate halvers in approximate sep-
elements is that, for each λ′ ≤ λ, at most ελ′ n of the ⌊λ′ n⌋
aration involves relatively natural shrinkage of “stroke
smallest (respectively, largest) elements do not get placed
size” to sweep more extreme keys closer to the ends.
among the ⌊λn⌋ first (respectively, last) positions. For λ =
/, this is called ε-halving. Lemma  For each ε >  and λ ≤ /, ε-approximate
λ-separation and simultaneous ε-halving can be per-
Although these definitions do not restrict n, they
formed by comparator networks (one for each even n) of
will be needed and used only for even n.
constant depth (depending on ε and λ, but independent
The AKS construction will use constant-depth net-
of n).
works that perform both ε-halving and ε-approximate
λ-separation for some λ smaller than /. Note that nei- Proof For ε small in terms of ε and λ, make use of the
ther of these quite implies the other, since the former ε  -halvers already provided by Lemma  (the result for
constrains more prefixes and suffixes (all the way up λ = /):
to half the length), while the latter counts more mis- First apply the ε  -halver for length n to the whole
placements (the ones to a longer “misplacement area”) sequence, and then work separately on each resulting
as significant. half, so that the final result will remain an ε  -halver.

The halves are handled symmetrically, in paral- Based on its actual rank (position in the actual
A
lel; so focus on the first half. In terms of m = sorted order), each register’s current item can be con-
⌊λn⌋ (assuming n ≥ /λ), apply ε  -halvers to the sidered native to one bag at each level of the tree. For
⌈log (n/m)⌉ −  prefixes of lengths m, m, m, …, example, if it lies in the second quartile, it is native to
⌈log (n/m)−⌉ m, in reverse order, where the last one the second of the four bags that are grandchildren of the
listed (first one performed) is simplified as if the inputs root. Sometimes an item will occupy a bag to which it
beyond the first n/ were all some very large element. is not native, where it is a stranger; more specifically, it
Then the total number of elements from the smallest will be j-strange if its bag is at least j steps off its native
⌊λ′ n⌋ (for any λ′ ≤ λ) that do not end up among the path down from the root. (So the -strangers are all the
first ⌊λn⌋ = m positions is at most ε  λ′ n in each of the strangers; and the additional -strangers are actually
⌈log (n/m)⌉ intervals (m, m], (m, m], (m, m], …, native, and not strangers at all.)
(⌈log (n/m)−⌉ m, n/], and (n/, n], for a total of at most Initially, consider all n registers to be in the root
⌈log (n/m)⌉ε  λ′ n. bag (the one corresponding to the whole sequence of
For any chosen c <  (close to , making log (/c) indices), to which all contents are native. The strategy is
close to ), the following holds: Unless n is small (n < to design and follow a schedule, oblivious to the particu-
(/λ)(/( − c))), lar data, of applications of certain comparator networks
and conceptual rebaggings of the results, that is guar-
+log (n/m) = (n/m) = n/⌊λn⌋ < n/(λn − ) anteed to leave all items in the bags of constant-height
≤ n/(cλn) = (/c)(/λ), subtrees to which they are native. Then to finish, it
will suffice to apply a separate sorting network to the
O() registers in the bags of each of these constant-size
so that log (n/m) is less than log (/c) + log (/λ),
subtrees.
and the number of misplaced elements above is at most
⌈log (/c)+log (/λ)⌉ε  λ′ n. This is bounded by ελ′n as
The Structure of the Bagging Schedule
required, provided ε  ≤ ε/⌈log (/c) + log (/λ)⌉.
Each stage of the network acts separately on the contents
of each nonempty bag, which is an inductively pre-
Sorting with Approximate Separators dictable subsequence of the n registers. In terms of the
Without loss of generality, assume the n items to be bag’s current “capacity” b (an upper bound on its num-
sorted are distinct, and that n is a power of  larger ber of registers), a certain fixed “skimming” fraction λ,
than . To keep track of the progress of the recursive and other parameters to be chosen later (in retrospect),
classification, and the (now not necessarily contiguous) it applies an approximate separator from Lemma  to
sets of registers to which approximate halvers and sep- that sequence of registers, and it evacuates the results
arators should be applied, consider each of the registers to the parent and children bags as follows: If there is a
always to occupy one of n −  =  +  +  + ⋅ ⋅ ⋅+ n/ bags – parent, then “kick back” to it ⌊λb⌋ items (or as many as
one bag corresponding to each nonsingleton binary are available, if there are not that many) from each end
subinterval of the length-n index sequence: one whole, of the results, where too-small and too-large items will
its first and second halves (two halves), their first and tend to accumulate. If the excess (between these ends) is
second halves (four quarters), …, their first and second odd (which will be inductively impossible at the root),
halves (n/ contiguous pairs). These binary subinter- then kick back any one additional register (the middle
vals, and hence the associated bags, correspond nicely to one, say) to the parent. Send the first and second halves
the nodes of a complete binary tree, with each nontriv- of any remaining excess down to the respective children.
ial binary subinterval serving as parent for its first and Note that this plan can fail to be feasible in only one
second halves, which serve respectively as its left child case: the number of registers to be evacuated exceeds
and right child. (The sets, and even the numbers, of reg- ⌊λb⌋ + , but the bag has no children (i.e., it is a leaf).
isters corresponding to these bags will change with time, The parameter b is an imposed capacity that will
according to time-determined baggings and rebaggings increase (exponentially) with the depth of the bag in the
by the algorithm.) tree but decrease (also exponentially) with time, thus

“squeezing” all the items toward the leaves, as desired. Aim to choose the parameters so that the squeezing is slow enough that the separators from Lemma  have time to successfully skim and reroute all strangers back to their native paths.
To complete a (not yet proved) description of a network of depth O(log n) to sort n items, here is a preview of one set of adequate parameters:

λ = ε = / and b = n ⋅ ^d (.)^t,

where d ≥ 0 is the depth and t ≥ 0 is the number of previous stages. It turns out that adopting these parameters enables iteration until b_leaf < 1, and that at that time every item will be in a height- subtree to which it is native, so that the job can be finished with disjoint -sorters.

A Suitable Invariant
In this and the next section, the parameters are carefully reintroduced, and constraints on them are accumulated, sufficient for the analysis to work out. For A comfortably larger than  ( in the preview above) and ν less than but close to 1 (. in the preview above), define the capacity (b above) of a depth-d bag after t stages to be nν^t A^d. (Again note the dynamic reduction with time, so that this capacity eventually becomes small even compared to the number of items native to the bag.) Let λ < 1 be chosen as indicated later, and let ε > 0 be small.
Subject to the constraints accumulated in section “Argument that the Invariant is Maintained”, it will be shown there that each successful iteration of the separation–rebagging procedure described in section “The Structure of the Bagging Schedule” (i.e., each stage of the network) preserves the following four-clause invariant.
1. Alternating levels of the tree are entirely empty.
2. On each level, the number of registers currently in each bag (or in the entire subtree below) is the same (and hence at most the size of the corresponding binary interval).
3. The number of registers currently in each bag is bounded by the current capacity of the bag.
4. For each j ≥ 1, the number of j-strangers currently in the registers of each bag is bounded by λε^(j−1) times the bag’s current capacity.

How much successful iteration is enough? Until the leaf capacities b dip below the constant 1/λ. At that point, the subtrees of smallest height k such that (1/λ)(1/A)^(k+1) < 1 contain all the registers, because higher-level capacities are at most b/A^(k+1) < (1/λ)(1/A)^(k+1) < 1. And the contents are correctly classified, because the number of (k − i + 1)-strangers in each bag at each height i ≤ k is bounded by λε^(k−i) b/A^i ≤ λb < 1. So the job can be finished with independent 2^(k+1)-sorters, of depth at most (k + 1)(k + 2)/2 []. In fact, for the right choice of parameters (holding 1/λ to less than A), k as small as  can suffice, so that the job can be finished with mere -sorters, of depth just . Therefore, the number t of successful iterations need only exceed ((1 + log A)/log(1/ν)) log n − log(A/λ)/log(1/ν) = O(log n), since that is (exactly) enough to get the leaf capacity nν^t A^((log n)−1) down to 1/λ.
As noted in the previous section, only one thing can prevent successful iteration of the proposed procedure for each stage: ⌊λb⌋ + 1 being less than the number of items to be evacuated from a bag with current capacity b but with no children. Since such a bag is a leaf, it follows from Clause 2 of the invariant that the number of items is at most 2. Thus the condition is ⌊λb⌋ + 1 < 2, implying the goal b < 1/λ has already been reached.
Therefore, it remains only to choose parameters such that each successful iteration does preserve the entire invariant.

Argument that the Invariant Is Maintained
Only Clauses 3 and 4 are not immediately clear. First consider the former – that capacity will continue after the next iteration to bound the number of registers in each bag. This is nontrivial only for a bag that is currently empty. If the current capacity of such a bag is b ≥ A, then the next capacity can safely be as low as

(number of registers from below) + (number of registers from above)
≤ (λbA + 1) + b/(2A)
= b(λA + 1/(2A)) + 1
≤ b(λA + 1/(2A) + 1/A), since b ≥ A.

In the remaining Clause 3 case, the current capacity of the empty bag is b < A. Therefore, all higher bags’ capacities are bounded by b/A < 1, so that the n registers are equally distributed among the subtrees rooted
on the level below, an even number of registers per sub- For this remaining estimate, compare the current
tree. Since each root on that level has passed down an “actual” distribution with a more symmetric “bench-
equal net number of registers to each of its children, it mark” distribution that has an unchanged number of
currently holds an even number of registers and will not registers in each bag, but that, for each bag C′ on the
kick back an odd register to the bag of interest. In this same level as B, has only C′ -native items below C′ and
case, therefore, the next capacity can safely be as low as has d/ C′ -native items in the parent D′ of C′ . (If d
just λbA. is odd, then the numbers of items in D′ native to its
In either Clause  case, therefore, any two children will be ⌊d/⌋ and ⌈d/⌉, in either order.)
ν ≥ λA + /(A) will work. That there is such a redistribution follows from Clause 
Finally, turn to restoration of Clause . First, con- of the invariant: Start by partitioning the completely
sider the relatively easy case of j > . Again, this is sorted list among the bags on B’s level, and then move
nontrivial only for a bag that is currently empty. Sup- items down and up in appropriate numbers to fill the
pose the current capacity of such a bag is b. What is a budgeted vacancies.
bound on the number of j-strangers after the next step? In the benchmark distribution, the number of
It is at most C-native items in excess of d/ is . If the actual distribu-
(all (j + )-strangers currently in children) tion is to have an excess, where can the excess C-native
items come from, in terms of the benchmark distribu-
+ ((j − )-strangers currently in parent, and not
tion? They can come only from C-native items on levels
filtered out by the separation) above D’s and from a net reduction in the number of
< bAλε j + ε((b/A)λε j− ) C-native items in C’s subtree. The latter can only be via
≤ λε j− νb, provided Aε + /A ≤ ν . the introduction into C’s subtree of items not native to C.
By Clause  of the invariant, the number of these can be
Note that the bound ε((b/A)λε j− ) exploits the “filter- at most
ing” performance of an approximate separator: At most
λεbA + λε  bA + λε  bA + λε  bA + . . .
fraction ε of the “few” smallest (or largest) are permuted
“far” out of place. = λεbA( + (εA) + ((εA) )
All that remains is the more involved Clause  case + ((εA) ) + . . . ) < λεbA/( − (εA) ).
of j = . Consider any currently empty bag B, of current
capacity b. At the root there are always no strangers; so If i is the number of bags C′ on the level of C, then
assume B has a parent, D, and a sibling, C. Let d be the the total number of items on levels above D’s is at most
number of registers in D. i− b/A + i− b/A + i− b/A + . . . .
There are three sources of -strangers in B after
the next iteration, two previously strange, essentially as Since the number native to each such C′ is the same, the
above, and one newly strange: number native to C is at most /i times as much:
. Current -strangers at children (at most λεbA). b/(A) + b/(A) + b/(A) + . . .
. Unfiltered current -strangers in D (at most
ε(λ(b/A)), as above). < b/((A) ( − /(A) ))
. Items in D that are native to C but that now get sent = b/(A − A).
to B instead.
So the total number of -strangers from all sources
For one of these last items to get sent down to B, it must is at most
get permuted by the approximate separator of Lemma 
into “B’s half ” of the registers. The number that do is at λεbA + ε(λ(b/A)) + εb/(A) + λεbA/( − (εA) )
most the number of C-native items in excess of d/, plus + b/(A − A).
the number of “halving errors” by the approximate sep-
arator. By the approximate halving behavior, the latter is This total is indeed bounded, as needed, by λνb (λ times
at most εb/(A), leaving only the former to estimate. the new capacity), provided
λεA + ελ/A + ε/(A) + λεA/( − (εA) ) in a design originally conceived for perfect halvers.
The “one-step-backward, two-steps-forward” approach
+ /(A − A) ≤ λν.

is a valuable one that should and indeed does show
This completes the argument, subject to the follow- up in many other settings. This benefit holds regard-
ing accumulated constraints: less of whether the constant hidden in the big-O for
this particular result can ever be reduced to something
A > , practical.
ν < ,
ε > , Related Entries
λ < , Bitonic Sort
ν ≥ λA + /(A), Sorting

ν ≥ Aε + /A,
λν ≥ λεA + ελ/A + ε/(A) + λεA/( − (εA) ) Bibliographic Notes and Further
+/(A − A).
 Reading
Ajtai, Komlós, and Szemerédi submitted the first ver-
Here is one choice order and strategy that works out sion of their design and argument directly to a
neatly: journal []. Inspired by Joel Spencer’s ideas for a more
accessible exposition, they soon published a quite dif-
. Choose A big
ferent conference version []. The most influential ver-
. Choose λ between /(A − A) and /(A)
sion, on which most other expositions are based, was
. Choose ε small
the one eventually published by Paterson []. The ver-
. Choose ν within the resulting allowed range
sion presented here is based on a much more recent
For example, A = , λ = /, ε = /, and ν = simplification [], with fewer distinct parameters, but
/. In fact, perturbing λ and ε to / makes it pos- thus with less potential for fine-tuning more quantita-
sible to hold the parameter “k” of section “A Suitable tive results.
Invariant” to  and get by with -sorters at the very end, The embedded expander graphs, and thus the con-
as previewed in section “The Structure of the Bagging stant hidden in the O(log n) depth estimate, make these
Schedule”. networks impractical to build or use. The competi-
tion for the constant is Batcher’s extra log  n factor [],
Concluding Remarks which is relatively quite small for any remotely practical
Thus the AKS networks are remarkable sorting algo- value of n.
rithms, one for each n, that sort their n inputs in The shallowest approximate halvers can be pro-
just O(log n) oblivious compare–exchange steps that duced by more direct arguments rather than via
involve no concurrent reading or writing. expander graphs. By a direct and careful counting argu-
The amazing kernel of this celebrated result is that ment, Paterson [] (the theorem in his appendix, spe-
the number of such steps for the problem of approx- cialized to the case α = ) proves that the depth of an
imate halving (in provable contrast to the problem of ε-halver need be no more than ⌈( log ε)/ log( − ε) +
perfect halving) does not have to grow with n at all. /ε − ⌉. Others [, ] have less precisely sketched or
This is yet another (of many) reasons to marvel at the promised asymptotically better results.
existence of constant-degree expander graphs. And it Even better, one can focus on the construction
is another reason to consider approximate solutions to of approximate separators. Paterson [], for example,
fundamental algorithms. (Could some sort of approxi- notes the usefulness in that construction (as presented
mate merging algorithm similarly lead to a fast sorting above, for Lemma ) of somewhat weaker (and thus
algorithm based on merge sort?) shallower) halvers, which perform well only on extreme
Algorithmically most interesting is the measured use sets of sizes bounded in terms of an additional parame-
of approximate halvers to clean up their own errors ter α ≤ . Tailoring the parameters to the different levels
of approximate halvers in the construction, he man- . Manos H () Construction of halvers. Inform Process Lett
():–
ages to reduce the depth of separators enough to reduce
the depth of the resulting sorting network to less than . Paterson MS () Improved sorting networks with O(log N)
depth. Algorithmica (–):–
 log n.
. Seiferas J () Sorting networks of logarithmic depth, further
Ajtai, Komlós, and Szemerédi [] announce that simplified. Algorithmica ():–
design in terms of generalized, multi-way comparators . Seiferas J () On the counting argument for the existence of
(i.e., M-sorter units) can lead to drastically shallower expander graphs, manuscript
approximate halvers and “near-sorters” (their original
version [] of separators). Chvátal [] pursues this idea
carefully and arrives at an improved final bound less
than  log n, although still somewhat short of the
AKS Sorting Network
 log n privately claimed by Komlós. To date, the
AKS Network
tightest analyses have appeared only as sketches in pre-
liminary conference presentations or mere promises
of future presentation [], or as unpublished technical
reports []. Algebraic Multigrid
Leighton [] shows, as a corollary of the AKS result
(regardless of the networks), that there is an n-node Marian Brezina , Jonathan Hu , Ray Tuminaro

degree- network that can sort n items in O(log n) steps. University of Colorado at Boulder, Boulder, CO, USA

Considering the VLSI model of parallel computa- Sandia National Laboratories, Livermore, CA, USA
tion, Bilardi and Preparata [] show how to lay out the
AKS sorting networks to sort n O(log n)-bit numbers Synonyms
in optimal area O(n ) and time O(log n). AMC

Definition
Bibliography Multigrid refers to a family of iterative algorithms for
. Ajtai M, Komlós J, and Szemerédi E () Sorting in c log n
parallel steps. Combinatorica ():–
solving large sparse linear systems associated with a
. Ajtai M, Komlós J, Szemerédi E () An O(n log n) sorting broad class of integral and partial differential equa-
network, Proceedings of the fifteenth annual ACM symposium tions [, , ]. The key to its success lies in the use
on theory of computing, Association for computing machinery, of efficient coarse scale approximations to dramatically
Boston, pp – accelerate the convergence so that an accurate approxi-
. Ajtai M, Komlós J, Szemerédi E () Halvers and expanders,
mation is obtained in only a few iterations at a cost that
Proceedings, rd annual symposium on foundations of com-
puter science, IEEE computer society press, Los Alamitos, is linearly proportional to the problem size. Algebraic
pp – multigrid methods construct these coarse scale approx-
. Alekseev VE () Sorting algorithms with minimum memory. imations utilizing only information from the finest res-
Kibernetika ():– olution matrix. Use of algebraic multigrid solvers has
. Batcher KE () Sorting networks and their applications,
become quite common due to their optimal execution
AFIPS conference proceedings, Spring joint computer confer-
ence, vol , Thompson, pp –
time and relative ease-of-use.
. Bilardi G, Preparata FP () The VLSI optimality of the AKS
sorting network. Inform Process Lett ():– Discussion
. Chvatal V () Lecture notes on the new AKS sorting network,
DCS-TR-, Computer Science Department, Rutgers University Introduction
. Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction Multigrid algorithms are used to solve large sparse lin-
to algorithms, nd edn. The MIT Press, Cambridge ear systems
. Knuth DE () The art of computer programming, vol , Sort- Ax = b ()
ing and searching, nd edn. Addison–Wesley, Reading
. Leighton T () Tight bounds on the complexity of parallel where A is an n × n matrix, b is a vector of length n,
sorting. IEEE Trans Comput C-():– and one seeks a vector x also of length n. When these
matrices arise from elliptic partial differential equations AMG methods is quite extensive, but this entry only
(PDEs), multigrid algorithms are often provably opti- addresses one alternative referred to as smoothed aggre-
mal in that they obtain a solution with O(n/p + log p) gation [, ], denoted here as SA. Both C-AMG and
floating point operations where p is the number of pro- SA account for the primary AMG use by most scien-
cesses employed in the calculation. Their rapid conver- tists. Some publicly available AMG codes include [,
gence rate is a key feature. Unlike most simpler iterative , , , ]. A number of commercial software prod-
methods, the number of iterations required to reach a ucts also contain algebraic multigrid solvers. Today,
given tolerance does not degrade as the system becomes research continues on algebraic multigrid to further
larger (e.g., when A corresponds to a PDE and the mesh expand applicability and robustness.
used in discretization is refined).
Multigrid development started in the s with the Geometric Multigrid
pioneering works of Brandt [] and Hackbusch [, ], Although multigrid has been successfully applied to a
though the basic ideas first appeared in the s [, , wide range of problems, its basic principles are eas-
]. The first methods would now be classified as geo- ily understood by first studying the two-dimensional
metric multigrid (GMG). Specifically, applications sup- Poisson problem
ply a mesh hierarchy, discrete operators corresponding
to the PDE discretization on all meshes, interpolation −uxx − uyy = f ()
operators to transfer solutions from a coarse resolu-
tion mesh to the next finer one, and restriction oper- defined over the unit square with homogeneous Dirich-
ators to transfer solutions from a fine resolution mesh let conditions imposed on the boundary. Discretization
to the next coarsest level. In geometric multigrid, the leads to an SPD matrix problem
inter-grid transfers are typically based on a geometri-
cal relationship of a coarse mesh and its refinement, Ax = b. ()
such as using linear interpolation to take a solution on
The Gauss–Seidel relaxation method is one of the oldest
a coarse mesh and define one on the next finer mesh.
iterative solution techniques for solving (). It can be
While early work developed, proved, and demonstrated
written as
optimally efficient algorithms, it was the development of
algebraic multigrid (AMG) that paved the way for non- x(k+) ← (D + L)− (b − Ux(k) ) ()
multigrid experts to realize this optimal behavior across
a broad range of applications including combustion cal- where x(k) denotes the approximate solution at the kth
culations, computational fluid dynamics, electromag- iteration, D is the diagonal of A, L is the strictly lower tri-
netics, radiation transfer, semiconductor device model- angular portion of A, and U is the strictly upper triangu-
ing, structural analysis, and thermal calculations. While lar portion of A. While the iteration is easy to implement
better understood for symmetric positive definite (SPD) and economical to apply, a prohibitively large number
systems, AMG methods have achieved notable success of iterations (which increases as the underlying mesh
on other types of linear systems such as those involv- is refined) are often required to achieve an accurate
ing strongly convective flows [, ] or drift diffusion solution [].
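As a concrete illustration of the Gauss–Seidel iteration just written, applied to the discretized Poisson problem, the following sketch (Python with NumPy; not part of the original entry — the mesh size, the number of sweeps, and the random right-hand side are arbitrary illustrative choices) performs lexicographic sweeps on the standard 5-point stencil and prints the residual norm:

import numpy as np

def gauss_seidel_sweep(u, f, h):
    # One lexicographic Gauss-Seidel sweep for -u_xx - u_yy = f on the unit
    # square with homogeneous Dirichlet boundary (5-point stencil).
    # u and f are (n+2) x (n+2) arrays whose outer layer is the boundary.
    n = u.shape[0] - 2
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            u[i, j] = 0.25 * (u[i - 1, j] + u[i + 1, j]
                              + u[i, j - 1] + u[i, j + 1] + h * h * f[i, j])

def residual_norm(u, f, h):
    # || f - A u || at the interior points.
    Au = (4.0 * u[1:-1, 1:-1] - u[:-2, 1:-1] - u[2:, 1:-1]
          - u[1:-1, :-2] - u[1:-1, 2:]) / (h * h)
    return np.linalg.norm(f[1:-1, 1:-1] - Au)

n = 31
h = 1.0 / (n + 1)
u = np.zeros((n + 2, n + 2))          # zero initial guess
f = np.zeros((n + 2, n + 2))
f[1:-1, 1:-1] = np.random.rand(n, n)  # random right-hand side
for k in range(5):
    print(k, residual_norm(u, f, h))
    gauss_seidel_sweep(u, f, h)

The residual drops quickly at first and then stalls, while the remaining error becomes locally smooth — the behavior discussed next, which coarse-grid correction is designed to exploit.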
equations. Figure  depicts errors (e(k) = x(∗) − x(k) , where
The first algebraic multigrid scheme appeared in x(∗) is the exact solution) over the problem domain as a
the mid-s and is often referred to as classical alge- function of iteration number k for a typical situation on
braic multigrid [, ]. This original technique is still a  ×  mesh. Notice that while the error remains quite
one of the most popular, and many improvements have large after four iterations, it has become locally much
helped extend its applicability on both serial and parallel smoother. Denoting by r(k) = b−Ax(k) the residual, the
computers. Although classical algebraic multigrid is fre- error is found by solving Ae (k) = r(k) . Of course, solving
quently referred to as just algebraic multigrid, the nota- this residual equation appears to be no easier than solv-
tion C-AMG is used here to avoid confusion with other ing the original system. However, the basic multigrid
algebraic multigrid methods. The list of alternative idea is to recognize that smooth error is well represented
Algebraic Multigrid. Fig.  Errors as a function of k, the Gauss–Seidel iteration number (three panels showing the error over the problem domain at k = 0, k = 2, and k = 4)

MGV(Aℓ, xℓ, bℓ, ℓ):
  if ℓ ≠ ℓmax
    xℓ ← Sℓ^pre(Aℓ, xℓ, bℓ)
    rℓ ← bℓ − Aℓ xℓ
    xℓ+1 ← 0
    xℓ+1 ← MGV(Aℓ+1, xℓ+1, Iℓ^(ℓ+1) rℓ, ℓ+1)
    xℓ ← xℓ + I(ℓ+1)^ℓ xℓ+1
    xℓ ← Sℓ^post(Aℓ, xℓ, bℓ)
  else
    xℓ ← Aℓ^(−1) bℓ

Algebraic Multigrid. Fig.  Multigrid V-cycle (left: the recursive algorithm; right: the algorithm flow for a four-level method)
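The recursion of the figure above translates almost line for line into code. The following sketch (Python with NumPy; the level matrices, the grid transfers, and the smoother are assumed to be supplied by the caller, and damped Jacobi merely stands in for the pre- and post-relaxation) is a minimal rendering of the V-cycle, not a production implementation:

import numpy as np

def smooth(A, x, b, sweeps=1, omega=0.8):
    # Damped Jacobi as a stand-in smoother; any pre/post relaxation would do.
    d = A.diagonal()
    for _ in range(sweeps):
        x = x + omega * (b - A @ x) / d
    return x

def mgv(A, R, P, x, b, level=0):
    # A[l]: level-l operator; R[l]: restriction to level l+1; P[l]: prolongation back.
    if level == len(A) - 1:                     # coarsest level: direct solve
        Ac = A[level]
        Ac = Ac.toarray() if hasattr(Ac, "toarray") else Ac
        return np.linalg.solve(Ac, b)
    x = smooth(A[level], x, b)                  # pre-relaxation
    r = b - A[level] @ x                        # residual
    xc = np.zeros(R[level].shape[0])            # zero coarse initial guess
    xc = mgv(A, R, P, xc, R[level] @ r, level + 1)
    x = x + P[level] @ xc                       # prolongate and correct
    return smooth(A[level], x, b)               # post-relaxation

Repeated calls x = mgv(A, R, P, x, b) give the V-cycle iteration whose convergence history is compared against Jacobi and Gauss–Seidel below.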

on a mesh with coarser resolution. Thus, a coarse ver- can be applied without incurring prohibitive cost. Iℓℓ+
sion can be formed and used to obtain an approxima- restricts residuals from level ℓ to level ℓ + , and Iℓ+ℓ
pre post
tion with less expense by inverting a smaller problem. prolongates from level ℓ+ to level ℓ. Sℓ () and Sℓ ()
This approximation is then used to perform the update denote a basic iterative scheme (e.g., Gauss-Seidel) that
(k) (k)
x(∗) ≈ x(k+) = x(k) + ec where ec is defined by is applied to smooth the error. It will be referred to as
interpolating to the fine mesh the solution of the coarse relaxation in the remainder of the entry, although it
resolution version of Ae(k) = r(k) . This is referred to as is also commonly called smoothing. The right side of
coarse-grid correction. Since the coarse-level correction Fig.  depicts the algorithm flow for a four-level method.
may introduce oscillatory components back into the The lowest circle represents the direct solver. The cir-
error, it is optionally followed by application of a small cles to the left of this represent pre-relaxation while
number of simple relaxation steps (the post-relaxation). those to the right indicate post-relaxation. Finally, the
If the size of the coarse discretization matrix is small downward arrows indicate restriction while the upward
enough, Gaussian elimination can be applied to directly arrows correspond to interpolation or prolongation, fol-
obtain a solution of the coarse level equations. If the size lowed by correction of the solution. Each relaxation in
of the coarse system matrix remains large, the multigrid the hierarchy effectively damps errors that are oscilla-
idea can be applied recursively. tory with respect to the mesh at that level. The net effect
Figure  illustrates what is referred to as multigrid is that for a well-designed multigrid, errors at all fre-
V-cycle for solving the linear system Aℓ xℓ = bℓ . Sub- quencies are substantially reduced by one sweep of a
scripts are introduced to distinguish different resolution V-cycle iteration.
approximations. A = A is the operator on the finest Different multigrid variations visit coarse meshes
level, where one seeks a solution while Aℓmax denotes more frequently. The V-cycle (depicted here) and a
the coarsest level system where Gaussian elimination W-cycle are the most common. The W-cycle is obtained
by adding a second xℓ+ ← MGV() invocation immedi- on each level. Furthermore, errors that are relatively
ately after the current one in Fig. . In terms of cost, grid low frequency on a given level appear oscillatory on
transfers are relatively inexpensive compared to relax- some coarser level, and will be efficiently eliminated by
ation. The relaxation cost is typically proportional to the relaxation on that level.
number of unknowns or ck on a k×k×k mesh where c is For complex PDE operators, it can sometimes be
a constant. Ignoring the computational expense of grid challenging to define multigrid components in a way
transfers and assuming that coarse meshes are defined that preserves this complementary balance. Jacobi and
by halving the resolution in each coordinate direction, Gauss-Seidel relaxations, for example, do not effectively
the V-cycle cost is smooth errors in directions of weak coupling when
k  k k applied to highly anisotropic diffusion PDEs. This is
V-cycle cost ≈ c (k + + + + ...) easily seen by examining a damped Jacobi iteration
  
 x(k+) ← x(k) + ωD− (b − Ax(k) ) ()

fine-level relaxation cost.
 where D is the diagonal matrix associated with A and
That is, the extra work associated with the coarse level ω is a scalar damping parameter typically between zero
computations is almost negligible. and one. The error propagation is governed by
In addition to being inexpensive, coarse-level com-
putations dramatically accelerate convergence so that a e(k+) = (I − ωD− A)e(k) = (I − ωD− A)k+ e() ()
solution is obtained in a few iterations. Figure  illus- where e(k) = x(∗) − xk . Clearly, error components in
trates convergence histories for a Jacobi iteration, a eigenvector directions associated with small eigenvalues
Gauss–Seidel iteration, and a multigrid V-cycle iter- of D− A are almost unaffected by the Jacobi itera-
pre post
ation where Sℓ () and Sℓ () correspond to one tion. Gauss-Seidel exhibits similar behavior, though the
Gauss–Seidel sweep. These experiments use a standard analysis is more complex. When A corresponds to a
finite difference discretization of () on four different standard Poisson operator, these eigenvectors are all
meshes with a zero initial guess and a random right low frequency modes, and thus effectively attenuated
hand. Figure  clearly depicts the multigrid advantage. through the coarse-grid correction process. However,
The key to this rapid convergence lies in the complemen- when applied to uxx + єuyy (with є ≪ ), eigenvectors
tary nature of relaxation and the coarse level corrections. associated with low eigenvalues are smooth functions
Relaxation eliminates high frequency errors while the in the x direction, but may be oscillatory in the y direc-
coarse level correction eliminates low frequency errors tion. Within geometric multigrid schemes, where coars-
ening is dictated by geometrical considerations, these
oscillatory error modes are not reduced, and multigrid
convergence suffers when based on a simple point-wise relaxation, such as Jacobi. It may thus be necessary to employ more powerful relaxation (e.g., line relaxation) to make up for deficiencies in the choice of coarse grid representations.

Algebraic Multigrid. Fig.  Residuals as a function of k, the Gauss–Seidel iteration number (semilog plot of ||r||2 over 100 iterations for Jacobi, Gauss–Seidel, and multigrid on 31 × 31 and 63 × 63 meshes)

Algebraic Multigrid
Algebraic multigrid differs from geometric multigrid in that Iℓ^(ℓ+1) and I(ℓ+1)^ℓ are not defined from geometric information. They are constructed using only Aℓ and, optionally, a small amount of additional information (to be discussed). Once grid transfers on a given level are defined, a Galerkin projection is employed to obtain a coarse level discretization:

Aℓ+1 ← Iℓ^(ℓ+1) Aℓ I(ℓ+1)^ℓ. ()
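Given a prolongation matrix, the Galerkin coarse operator above is just a sparse triple product. A minimal SciPy sketch (the piecewise-constant prolongation used here is chosen only to make the example self-contained; real AMG codes construct it as described in the remainder of this section):

import numpy as np
import scipy.sparse as sp

n = 8
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # stand-in for A_l

# Piecewise-constant prolongation: coarse unknown i interpolates to fine {2i, 2i+1}.
rows = np.arange(n)
P = sp.csr_matrix((np.ones(n), (rows, rows // 2)), shape=(n, n // 2))

R = P.T                  # restriction taken as the transpose (symmetric case)
A_coarse = R @ A @ P     # Galerkin projection of A_l to the coarse level
print(A_coarse.toarray())

Taking the restriction as the transpose of the prolongation is exactly the choice discussed next for symmetric problems.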
Further, Iℓℓ+ is generally taken as the transpose of the where ê is a coarse vector minimizing the left side and β
prolongation operator for symmetric problems. In this is a constant independent of mesh size and e. Basically,
case, an algebraic V-cycle iteration can be completely () requires that vectors with smaller Aℓ -norm be more
specified by a procedure for generating prolongation accurately captured on coarse meshes. That is, high and
matrices along with a choice of appropriate relaxation low energy, measured by the Aℓ -norm, replaces the
method. In contrast to geometric multigrid, the relax- geometric notions of high and low frequency.
ation process is typically selected upfront, but the pro- It is possible to show that a two-level method that
longation is automatically tailored to the problem at employs a coarse grid correction satisfying () followed
hand. This is referred to as operator-dependent prolon- by a relaxation method Sℓ satisfying () is itself conver-
gation, and can overcome problems such as anisotropies gent, independent of problem size. The error propagation
of which geometric multigrid coarsening may not be operator associated with the coarse-grid correction is
aware. given by
Recalling the Jacobi error propagation formula (),
−
components associated with large eigenvalues of A are Tℓ = I − Iℓ+

(Iℓℓ+ Aℓ Iℓ+

) Iℓℓ+ Aℓ . ()
easily damped by relaxation while those associated with
small eigenvalues remain. (Without loss of general- Note that () moves fine error to the coarse level, per-
ity, one can assume that A has been scaled so that forms an exact solve, interpolates the result back to the
its diagonal is identically one. Thus, A as opposed to fine level, and differences it with the original fine error.
D− A is used to simplify the discussion.) In geometric Assuming that the grid transfer Iℓ+ℓ
is full-rank, the fol-
multigrid, the focus is on finding a relaxation scheme lowing estimate for a two-level multigrid method can be
such that all large eigenvalues of its error propagation proved using () and ():
operator correspond to low frequency eigenvectors. In

algebraic multigrid, one assumes that a simple (e.g., α
∣∣S ℓ Tℓ ∣∣A ℓ ≤ − , ()
Gauss–Seidel) relaxation is employed where the error β
propagation satisfies
where post-relaxation only is assumed to simplify the
∣∣Sℓ e∣∣A ℓ ≤ ∣∣e∣∣A ℓ − α∣∣e∣∣A discussion. That is, a coarse-grid correction followed

= ∣∣e∣∣A ℓ − α∣∣r∣∣ () by relaxation converges at a rate independent of the


mesh size. Given the general nature of problems treated
where α is a positive constant independent of the mesh with algebraic multigrid, however, sharp multilevel con-
size. Equation () indicates that errors associated with vergence results (as opposed to two-level results) are
a large Aℓ -norm are reduced significantly while those difficult to obtain. The best known multilevel AMG con-
associated with small residuals are relatively unchanged vergence bounds are of the form,  − C(L) 
, where the
by relaxation. The focus is now on constructing Iℓℓ+ constant C(L) depends polynomially on the number of

and Iℓ+ such that eigenvectors associated with small multigrid levels [].
eigenvalues of Aℓ are transferred between grid levels In addition to satisfying some approximation prop-
accurately. That is, error modes that are not damped erty, grid transfers must be practical. The sparsity pat-
by relaxation on a given level must be transferred to ℓ
terns of I ℓ+ and Iℓℓ+ effectively determine the number
another level so that they can be reduced there. Note of nonzeros in Aℓ+ . If prohibitively large, the method
that these components need not be smooth in the geo- is impractically expensive. Most AMG methods con-
metric sense, and are therefore termed algebraically struct grid transfers in several stages. The first stage
smooth. defines a graph associated with Aℓ . The second stage
More formally, the general principle for grid trans- coarsens this graph. Coarsening fixes the dimensions of
fers applied to symmetric problems is that they satisfy ℓ
I ℓ+ and Iℓ ℓ + . Coarsening may also effectively deter-
an approximation property, for example, mine the sparsity pattern of the grid transfer matrices or,
β within some AMG methods, the sparsity pattern may be
∣∣e − I ℓ+

ê∣∣ ≤ ∣∣e∣∣ () constructed separately. Finally, the actual grid transfer
∥Aℓ ∥ A ℓ
 A Algebraic Multigrid

coefficients are determined so that the approximation C-points identified, the interpolation coefficients are
property is satisfied and that relaxation on the coarse calculated. C-points themselves are simply interpolated
grid is effective. via injection. The strong influence of an F-point j on a
One measure of expense is the operator complex- different F-point i is given by the weighted interpolation
ity, Σ i (nnz(Ai )/nnz(A ), which compares number of from the C-points common to i and j. The latter is done
nonzeros on all levels to the number of nonzeros in in such a manner so as to preserve interpolation of con-
the finest grid matrix. The operator complexity gives stants. Additionally, it is based on the assumption that
an indication both of the amount of memory required residuals are small after relaxation. That is, Aℓ e ≈ , or,
to store the AMG preconditioner and the cost to apply equivalently, modes associated with small eigenvalues
it. Another measure of complexity is the matrix sten- should be accurately interpolated. Further details can be
cil size, which is the average number of coefficients in found in [].
a row of Aℓ . Increases in matrix stencil size can lead SA also coarsens based on the graph associated
to increases in communication and matrix evaluation with matrix Aℓ . In contrast to C-AMG, however, the
costs. In general, AMG methods must carefully bal- coarse grid unknowns are formed from disjoint groups,
ance accuracy with both types of complexity. Increasing or aggregates, of fine grid unknowns. Aggregates are
the complexity of an AMG method often leads to bet- formed around initially-selected unknowns called root
ter convergence properties, at the cost of each iteration points. An unknown i is included in an aggregate if it
being more expensive. Conversely, decreasing the com- is strongly coupled to the root point j. More precisely,
plexity of a method tends to lead to a method that unknowns
√ i and j are said to be strongly coupled if ∣aij ∣ >
converges more slowly, but is cheaper to apply. θ ∣aii ∣∣ajj ∣, where θ ≥  is a tuning parameter indepen-
Within C-AMG, a matrix graph is defined with the dent of i and j. Because the aggregates produced by SA
help of a strength-of-connection measure. In particu- tend to have a graph diameter , SA generally coarsens
lar, each vector unknown corresponds to a vertex. A at a rate of d , where d is the geometric problem dimen-
graph edge between vertex i and j is only added if i sion. After aggregate identification, a tentative prolon-
is strongly influenced by j or if −aij ≥ є maxi≠k (−aik ), gator is constructed so that it exactly interpolates a set
where є >  is independent of i and j. The basic idea is of user-defined vectors. For scalar PDEs this is often just
to ignore weak connections when defining the matrix a single vector corresponding to the constant function.
graph in order to coarsen only in directions of strong For more complex operators, however, it may be neces-
connections (e.g., to avoid difficulties with anisotropic sary to provide several user-defined vectors. In three-

applications). The structure of Iℓ+ is determined by dimensional elasticity, it is customary to provide six
selecting a subset of vertices, called C-points, from the vectors that represent the six rigid body modes (three
fine graph. The C-points constitute the coarse graph ver- translations and three rotations). In general, these vec-
tices; the remaining unknowns are called F-points and tors should be near-kernel vectors that are generally
will interpolate from the values at C-points. If C-points problematic for the relaxation. The tentative prolonga-
are too close to each other, the resulting complexities are tor is defined by restricting the user-defined vectors
high. If they are too far apart, convergence rates tend to to each aggregate. That is, each column of the tenta-
suffer. To avoid these two situations, the selection pro- tive prolongator is nonzero only for degrees of freedom
cess attempts to satisfy two conditions. The first is that within a single aggregate, and the values of these nonze-
if point j strongly influences an F-point i, then j should ros correspond to one of the user-defined vectors. The
either be a C-point or j and i should be strongly influ- columns are locally orthogonalized within an aggregate
enced by a common C-point. The second is that the by a QR algorithm to improve linear independence. The
set of C-points should form a maximal independent set final result is that the user-defined modes are within the
such that no two C-points are strongly influenced by range space of the tentative prolongator, the prolongator
each other. For isotropic problems, the aforementioned columns are orthonormal, and the tentative prolonga-
C-AMG algorithm tends to select every other unknown tor is quite sparse because each column’s nonzeros are
as a C-point. This yields a coarsening rate of roughly associated with a single aggregate. Unfortunately, indi-
d , where d is the spatial problem dimension. With the vidual tentative prolongator columns have a high energy
(or large A-norm). This essentially implies that some is challenging and relies on sophisticated multicoloring
algebraically smooth modes are not accurately repre- schemes on unstructured meshes []. As an alternative,
sented throughout the mesh hierarchy even though the most large parallel AMG packages support an option to
user-defined modes are accurately represented. To rec- employ Processor Block (or local) Gauss–Seidel. Here,
tify this, one step of damped Jacobi is applied to all each processor performs Gauss–Seidel as a subdomain
columns of the tentative prolongator: solver for a block Jacobi method. While Processor Block
Gauss–Seidel is easy to parallelize, the overall multigrid

I ℓ+ = (I − ωD−
ℓ A ℓ ) I ℓ+

convergence rate usually suffers and can even lead to
where ω is the Jacobi damping parameter and Dℓ is divergence if not suitably damped [].
the diagonal of Aℓ . This reduces the energy in each of Given a multigrid hierarchy, the main unique aspect
the columns while preserving the interpolation of user- associated with parallelization boils down to partition-
defined low energy modes. Further details can be found ing all the operators in the hierarchy. As these oper-
in [, ]. ators are associated with matrix graphs, it is actually
the graphs that must be partitioned. The partitioning
Parallel Algebraic Multigrid of the finest level graph is typically provided by an out-
Distributed memory parallelization of most simula- side application and so it ignores the coarse graphs that
tions based on PDEs is accomplished by dividing the are created during multigrid construction. While coarse
computational domain into subdomains, where one graph partitioning can also be done in this fashion, it
subdomain is assigned to each processor. A processor is desirable that the coarse and fine graph partitions
is then responsible for updating unknowns associated “match” in some way so that inter-processor communi-
within its subdomain only. To do this, processors occa- cation during grid transfers is minimized. This is usually
sionally need information from other processors; that done by deriving coarse graph partitions from the fine
information is obtained via communication. Partition- graph partition. For example, when the coarse graph
ing into boxes or cubes is straightforward for logi- vertices are a subset of fine vertices, it is natural to
cally rectangular meshes. There are several tools that simply use the fine graph partitioning on the coarse
automate the subdivision of domains for unstructured graph. If the coarse graph is derived by agglomerating
meshes [, , ]. A general goal during subdivision elements and the fine graph is partitioned by elements,
is to assign an equal amount of work to each pro- the same idea holds. In cases without a simple corre-
cessor and to reduce the amount of communication spondence between coarse and fine graphs, it is often
between processors by minimizing the surface area of natural to enforce a similar condition that coarse ver-
the subdomains. tices reside on the same processors that contain most
Multigrid parallelization follows in a similar fash- of the fine vertices that they interpolate []. Imbalance
ion. For example, V-cycle computations within a mesh can result if the original matrix A itself is poorly load-
are performed in parallel, but each mesh in the hier- balanced, or if the coarsening rate differs significantly
archy is addressed one at a time as in standard multi- among processes, leading to imbalance on coarse levels.
grid. A multigrid cycle (see Fig. ) consists almost Synchronization occurs within each level of the multi-
entirely of relaxation and sparse matrix–vector prod- grid cycle, so the time to process a level is determined
ucts. This means that the parallel performance depends by the slowest process. Even when work is well balanced,
entirely on these two basic kernels. Parallel matrix– processors may have only a few points as the total num-
vector products are quite straightforward, as are parallel ber of graph vertices can diminish rapidly from one
Jacobi and parallel Chebyshev relaxation (which are level to a coarser level. In fact, it is quite possible that
often preferred in this setting []). Chebyshev relax- the total number of graph vertices on a given level is
ation requires an estimate of the largest eigenvalue of actually less than the number of cores/processors on
A ℓ (which is often available or easily estimated) but not a massively parallel machine. In this case some pro-
that of the smallest eigenvalue, as relaxation need only cessors are forced to remain idle while more generally
damp the high end of the spectrum. Unfortunately, con- if processes have only a few points, computation time
struction of efficient parallel Gauss–Seidel algorithms can be dominated by communication. A partial solution
is to redistribute and load-balance points on a subset with the remaining terms are not significant. However,
of processes. While this certainly leaves processes idle when this first term is not so dominant, then ineffi-
during coarse level relaxation, this can speed up run ciencies associated with coarse level computations are a
times because communication occurs among fewer pro- concern. Despite a possible loss of efficiency, the conver-
cesses and expensive global all-to-all communication gence benefits of multigrid far outweigh these concerns
patterns are avoided. A number of alternative multigrid and so, generally, multilevel methods remain far supe-
algorithms that attempt to process coarse level correc- rior to single level methods, even on massively parallel
tions in parallel (as opposed to the standard sequential systems.
approach) have been considered []. Most of these, The other significant issue associated with paral-
however, suffer drawbacks associated with convergence lel algebraic multigrid is parallelizing the construction
rates or even more complex load balancing issues. phase of the algorithm. In principle this construction
The parallel cost of a simple V-cycle can be estimated phase is highly parallel due to the locality associated
to help understand the general behavior. In particular, with most algorithms for generating grid transfers and
assume that run time on a single level is modeled by setting up smoothers. The primary challenge typically
centers on the coarsening aspect of algebraic multigrid. For example, the smoothed aggregation algorithm

Tsl = c1 (k/q) + c2 (α + β (k/q)^(2/3))
requires construction of aggregates, then a tentative
where k is the number of degrees of freedom on the prolongator, followed by a prolongator smoothing step.
level, q is the number of processors, and α + βw mea- Prolongator smoothing as well as Galerkin projec-
sures the cost of sending a message of length w from one tion (to build coarse level discretizations) require only
processor to another on a distributed memory machine. an efficient parallel matrix–matrix multiplication algo-
The constants c and c reflect the ratio of computa- rithm. Further, construction of the tentative prolonga-
tion to communication inherent in the smoothing and tor only requires that the near-null space be injected
matrix-vector products. Then, if the coarsening rate per in a way consistent with the aggregates followed by
dimension in D is γ, it is easy to show that the run a series of independent small QR computations. It is,
time of a single AMG V-cycle with several levels, ℓ, is however, the aggregation phase that can be difficult.
approximately While highly parallel in principle, it is difficult to get
the same quality of aggregates without having some inefficiencies. Aggregation is somewhat akin to placing floor tiles in a room. If several workers start at

Tamg ≈ c1 (k/q) (γ³/(γ³ − 1)) + c2 β (k/q)^(2/3) (γ²/(γ² − 1)) + c2 α ℓ
where now k is the number of degrees of freedom different locations and lay tile simultaneously, the end
on the finest mesh. For γ = , a standard multigrid result will likely lead to the trimming of many interior
coarsening rate, this becomes tiles so that things fit. Unfortunately, a large degree of
irregularity in aggregates can either degrade the convergence properties of the overall method or significantly increase the cost per iteration. Further, once each pro-

Tamg ≈ (8/7) c1 (k/q) + (4/3) c2 β (k/q)^(2/3) + c2 α ℓ.
Comparing this with the single level execution time, cess (or worker in our analogy) has only a few rows of
we see that the first term is pure computation and it the relatively coarse operator Aℓ , the rate of coarsening
increases by . to account for the hierarchy of levels. slows, thus leading to more multigrid levels. Instead of
The second term reflects communication bandwidth naively allowing each process to coarsen its portion of
and it increases by .. The third term is commu- the graph independently, several different aggregation-
nication latency and it increases by ℓ. Thus, it is to based strategies have been proposed in the context of
be expected that parallel multigrid spends a higher SA, based on parallel maximal independent sets [, ]
percentage of time communicating than a single level or using graph-partitioning algorithms []. The latter
method. However, if (k/q) (the number of degrees can increase the coarsening rate and decrease the opera-
of freedom per processor) is relatively large, then the tor complexity. In both cases, an aggregate can span sev-
first term dominates and so inefficiencies associated eral processors. Parallel variants of C-AMG coarsening
have also been developed to minimize these effects. The emergence of very large scale multi-core
The ideas are quite similar to the smoothed aggrega- architectures presents increased stresses on the underly-
tion context, but the details tend to be distinct due to ing multigrid parallelization algorithms and so research
differences in coarsening rates and the fact that one is must continue to properly take advantage of these new
based on aggregation while the other centers on identi- architectures.
fying vertex subsets. Subdomain blocking [] coarsens
from the process boundary inward, and CLJP coars- Related Multigrid Approaches
ening [, ] is based on selecting parallel maximal When solving difficult problems, it is often advanta-
independent sets. A more recently developed aggres- geous to “wrap” the multigrid solver with a Krylov
sive coarsening variant, PMIS [], addresses the high method. This amounts to using a small number of AMG
complexities sometimes seen with CLJP. iterations as a preconditioner in a Krylov method. The
We conclude this section with some performance additional cost per iteration amounts to an additional
figures associated with the ML multigrid package that is residual computation and a small number of inner
part of the Trilinos framework. ML implements paral- products, depending on Krylov method used. Multigrid
lel smoothed aggregation along the lines just discussed. methods exhibiting a low-dimensional deficiency can
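In code, wrapping a V-cycle inside a Krylov method is a small change. The sketch below (Python with SciPy; mgv refers to the illustrative V-cycle function sketched earlier in this entry, and conjugate gradients is chosen assuming an SPD system) applies one V-cycle per preconditioner call:

import numpy as np
import scipy.sparse.linalg as spla

def amg_preconditioner(A, R, P):
    # One V-cycle per preconditioner application, exposed as a LinearOperator.
    n = A[0].shape[0]
    return spla.LinearOperator((n, n),
                               matvec=lambda r: mgv(A, R, P, np.zeros(n), r))

# Usage (A, R, P: a multigrid hierarchy; b: right-hand side):
#   x, info = spla.cg(A[0], b, M=amg_preconditioner(A, R, P))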
Figure  illustrates weak parallel scaling on a semicon- have good convergence rates restored this way.
ductor simulation. Each processor has approximately Standard implementations of C-AMG and SA base
,  degrees of freedom so that the problem size their coarsening on assumptions of smoothness. For
increases as more processors are used. In an ideal sit- C-AMG, this is that an algebraically smooth error
uation, one would like that the total simulation time behaves locally as a constant. When this is not the case,
remains constant as the problem size and processor convergence will suffer. One advantage of SA is that it is
count increase. Examination of Fig.  shows that exe- possible to construct methods capable of approximat-
cution times rise only slightly due to relatively constant ing any prescribed user-defined error component accu-
run time per iteration. rately on coarse meshes. Thus, in principle, if one knows

Algebraic Multigrid. Fig.  Weak scaling timings for a semiconductor parallel simulation (plot title: “Weak Scaling Study: Average Time for Different Preconditioners”, 2 × 1.5 micron BJT, steady-state drift-diffusion, 0.3 V bias; average CPU time per Newton step (preconditioner + linear solve) versus number of unknowns for 4p–4096p on Red Storm, 1 core of a 2.4 GHz dual-core Opteron; curves: 1-level DD ILU, 3-level NSA W(1,1) agg125, 3-level PGSA W(1,1) agg125). The left image is the steady-state electric potential; red represents high potential, blue indicates low potential. The scaling results compare GMRES preconditioned by domain-decomposition, a nonideal AMG method, and an AMG method intended for nonsymmetric problems, respectively. Results courtesy of []
difficult components for relaxation, suitable grid trans- Acknowledgment


fers can be defined that accurately transfer these modes Sandia National Laboratories is a multiprogram lab-
to coarse levels. However, there are applications where oratory operated by Sandia Corporation, a wholly
such knowledge is lacking. For instance, QCD prob- owned subsidiary of Lockheed Martin company, for the
lems may have multiple difficult components that are US Department of Energy’s National Nuclear Security
not only oscillatory, but are not a priori known. Adap- Administration under contract DE-AC-AL.
tive multigrid methods have been designed to address
such problems [, ]. The key feature is that they deter- Bibliography
mine the critical components and modify coarsening to . Adams MF () A parallel maximal independent set algorithm.
ensure their attenuation. These methods can be intri- In: Proceedings th Copper Mountain conference on iterative
cate, but are based on the simple observation that an methods, Copper Mountain
. Adams MF () A distributed memory unstructured Gauss-
iteration process applied to the homogeneous prob-
Seidel algorithm for multigrid smoothers. In: ACM/IEEE
lem, Aℓ xℓ = , will either converge with satisfactory Proceedings of SC: High Performance Networking and
rate, or reveal, as its solution, the error that the current Computing, Denver
method does not attenuate. As the method improves, . Adams M, Brezina M, Hu J, Tuminaro R (July ) Parallel
any further components are revealed more rapidly. The multigrid smoothing: polynomial versus Gauss-Seidel. J Comp
adaptive methods are currently more extensively devel- Phys ():–
. Bakhvalov NS () On the convergence of a relaxation method
oped within the SA framework, but progress has also
under natural constraints on an elliptic operator. Z Vycisl Mat Mat
been made within the context of C-AMG, []. Fiz :–
Another class of methods that attempt to take . Boman E, Devine K, Heaphy R, Hendrickson B, Leung V, Riesen
advantage of additional available information are LA, Vaughan C, Catalyurek U, Bozdag D, Mitchell W, Teresco J.
known as AMGe [, ]. These utilize local finite ele- Zoltan () .: Parallel partitioning, load balancing, and data-
management services; user’s guide. Sandia National Laboratories,
ment information, such as local stiffness matrices, to
Albuquerque, NM, . Tech. Report SAND-W. http://
construct inter-grid transfers. Although this departs www.cs.sandia.gov/Zoltan/ughtml/ug.html
from the AMG framework, variants related to the . Brandt A () Multi-level adaptive solutions to boundary value
methodology have been designed that do not explicitly problems. Math Comp :–
require local element matrices [, ]. . Brandt A, McCormick S, Ruge J () Algebraic multigrid
AMG methods have been successfully applied (AMG) for sparse matrix equations. In: Evans DJ (ed) Sparsity
and its applications. Cambridge University Press, Cambridge
directly to high-order finite element discretizations.
. Brezina M, Cleary AJ, Falgout RD, Henson VE, Jones JE, Manteuf-
However, given the much denser nature of a matrix, fel TA, McCormick SF, Ruge JW () Algebraic multigrid based
AH , obtained from high-order discretization, it is usu- on element interpolation (AMGe). SIAM J Sci Comp :–
ally advantageous to precondition the high-order sys- 
tem by an inverse of the low-order discretization, AL , . Brezina M, Falgout R, MacLachlan S, Manteuffel T, McCormick
S, Ruge J () Adaptive smoothed aggregation (αSA) multigrid.
corresponding to the problem over the same nodes
SIAM Rev ():–
that are used to define the high-order Lagrange inter- . Brezina M, Falgout R, MacLachlan S, Manteuffel T, McCormick
polation. One iteration of AMG can then be used S, Ruge J () Adaptive algebraic multigrid. SIAM J Sci Comp
to efficiently approximate the action of (AL )− , and :–
the use of AH is limited to computing a fine-level . Brezina M, Manteuffel T, McCormick S, Ruge J, Sanders G ()
residual, which may be computed in a matrix-free Towards adaptive smoothed aggregation (αSA) for nonsymmet-
ric problems. SIAM J Sci Comput :
fashion.
. Brezina M () SAMISdat (AMG) version . - user’s guide
. Briggs WL, Henson VE, McCormick S () A multigrid tuto-
rial, nd ed. SIAM, Philadelphia
. Chartier T, Falgout RD, Henson VE, Jones J, Manteuffel T,
Related Entries Mc-Cormick S, Ruge J, Vassilevski PS () Spectral AMGe
Metrics (ρAMGe). SIAM J Sci Comp :–
Preconditioners for Sparse Iterative Methods . Chow E, Falgout R, Hu J, Tuminaro R, Meier-Yang U ()
Rapid Elliptic Solvers A survey of parallelization techniques for multigrid solvers.
In: Parallel processing for scientific computing, SIAM book . Ruge J, Stüben K () Algebraic multigrid (AMG). In:
series on software, environments, and tools. SIAM, Philadelphia, McCormick SF (ed) Multigrid methods, vol  of Frontiers in
pp – applied mathematics. SIAM, Philadelphia, pp –
. Cleary AJ, Falgout RD, Henson VE, Jones JE () Coarse-grid . Sala M, Tuminaro R () A new Petrov-Galerkin smoothed
selection for parallel algebraic multigrid. In: Proceedings of the aggregation preconditioner for nonsymmetric linear systems.
fifth international symposium on solving irregularly structured SIAM J Sci Comput ():–
problems in parallel. Lecture Notes in Computer Science, vol . . De Sterck H, Yang UM, Heys JJ () Reducing complexity in
Springer New York, pp – parallel algebraic multigrid preconditioners. SIAM J Matrix Anal
. Dohrmann CR () Interpolation operators for algebraic Appl ():–
multigrid by local optimization. SIAM J Sci Comp :– . Trottenberg U, Oosterlee C, Schüller A () Multigrid. Aca-
. Fedorenko RP (/) A relaxation method for solving elliptic demic, London
difference equations. Z Vycisl Mat Mat Fiz :– (). Also . Tuminaro R, Tong C () Parallel smoothed aggregation multi-
in U.S.S.R. Comput Math Math Phys :– () grid: aggregation strategies on massively parallel machines. In:
. Fedorenko RP () The speed of convergence of one itera- Donnelley J (ed) Supercomputing  proceedings, 
tive process. Z. Vycisl. Mat Mat Fiz :–. Also in U.S.S.R. . Vaněk P, Brezina M, Mandel J () Convergence of algebraic
Comput Math Math Phys :– multigrid based on smoothed aggregation. Numerische Mathe-
. Gee M, Siefert C, Hu J, Tuminaro R, Sala M () ML matik :–
. smoothed aggregation user’s guide. Technical Report . Vaněk P, Mandel J, Brezina M () Algebraic multigrid by
SAND-, Sandia National Laboratories, Albuquerque, smoothed aggregation for second and fourth order elliptic prob-
NM,  lems. Computing :–
. Hackbusch W () On the convergence of a multi-grid iteration . Varga RS () Matrix iterative analysis. Prentice-Hall, Engle-
applied to finite element equations. Technical Report Report - wood Cliffs
, Institute for Applied Mathematics, University of Cologne, West . Yang UM () On the use of relaxation parameters in hybrid
Germany smoothers. Numer Linear Algebra Appl :–
. Hackbusch W () Multigrid methods and applications, vol .
Computational mathematics. Springer, Berlin
. Hackbusch W () Convergence of multi-grid iterations
applied to difference equations. Math Comput ():–
. Henson VE, Vassilevski PS () Element-free AMGe: general
algorithms for computing interpolation weights in AMG. SIAM J Algorithm Engineering
Sci Comp :–
. Henson VE, Yang UM () BoomerAMG: a parallel algebraic Peter Sanders
multigrid solver and preconditioner. Appl Numer Math (): Universitaet Karlsruhe, Karlsruhe, Germany
–
. Karypis G, Kumar V () Multilevel k-way partitioning scheme
for irregular graphs. Technical Report -, Department of
Computer Science, University of Minnesota Synonyms
. Karypis G, Kumar V () ParMETIS: Parallel graph partition- Experimental parallel algorithmics
ing and sparse matrix ordering library. Technical Report -,
Department of Computer Science, University of Minnesota
. Krechel A, Stüben K () Parallel algebraic multigrid based on Definition
subdomain blocking. Parallel Comput ():–
Algorithmics is the subdiscipline of computer science
. Lin PT, Shadid JN, Tuminaro RS, Sala M () Performance of
a Petrov-Galerkin algebraic multilevel preconditioner for finite
that studies the systematic development of efficient
element modeling of the semiconductor device drift-diffusion algorithms. Algorithm Engineering (AE) is a methodol-
equations. Int J Num Meth Eng. Early online publication, ogy for algorithmic research that views design, analysis,
doi:./nme. implementation, and experimental evaluation of algo-
. Mavriplis DJ () Parallel performance investigations of an
rithms as a cycle driving algorithmic research. Further
unstructured mesh Navier-Stokes solver. Intl J High Perf Comput
Appl :–
components are realistic models, algorithm libraries,
. Prometheus. http://www.columbia.edu/_m/promintro. and a multitude of interrelations to applications. Fig. 
html. gives an overview. A more detailed definition can be
. Stefan Reitzinger. http://www.numa.unilinz.ac.at/research/ found in []. This article is concerned with particular
projects/pebbles.html issues that arise in engineering parallel algorithms.
Discussion basic aspects of parallelizable problems, it has proved


quite difficult to implement PRAM algorithms on main-
Introduction
stream parallel computers.
The development of algorithms is one of the core areas
The remainder of this article closely follows Fig. ,
of computer science. After the early days of the s–
giving one section for each of the main areas’ mod-
s, in the s and s, algorithmics was largely
els, design, analysis, implementation, experiments,
viewed as a subdiscipline of computer science that is
instances/benchmarks, and algorithm libraries.
concerned with “paper-and-pencil” theoretical work –
design of algorithms with the goal to prove worst-
case performance guarantees. However, in the s Models
it became more and more apparent that this purely Parallel computers are complex systems containing pro-
theoretical approach delays the transfer of algorith- cessors, memory modules, and networks connecting
mic results into applications. Therefore, in algorithm them. It would be very complicated to take all these
engineering implementation and experimentation are aspects into account at all times when designing, ana-
viewed as equally important as design and analysis of lyzing, and implementing parallel algorithms. There-
algorithms. Together these four components form a fore we need simplified models. Two families of such
feedback cycle: Algorithms are designed, then analyzed models have proved very useful: In a shared memory
and implemented. Together with experiments using machine, all processors access the same global mem-
realistic inputs, this process induces new insights that ory. In a distributed memory machine, several sequential
lead to modified and new algorithms. The methodology computers communicate via an interconnection net-
of algorithm engineering is augmented by using realis- work. While these are useful abstractions, the difficulty
tic models that form the basis of algorithm descriptions, is to make these models more concrete by specifying
analysis, and implementation and by algorithm libraries what operations are supported and how long they take.
that give high quality reusable implementations. For example, shared memory models have to specify
The history of parallel algorithms is exemplary for how long it takes to access a memory location. From
the above development, where many clever algorithms sequential models we are accustomed to constant access
were developed in the s that were based on the Par- time and this also reflects the best case behavior of
allel Random Access Machine (PRAM) model of com- many parallel machines. However, in the worst case,
putation. While this yielded interesting insights into the most real world machines will exhibit severe contention

Realistic
Algorithm models
engineering Real
inputs
Design
Applications

Falsifiable
Analysis hypotheses Experiments
induction
Deduction
Implementation Appl. engineering
Perf.−
guarantees
Algorithm−
libraries

Algorithm Engineering. Fig.  Algorithm engineering as a cycle of design, analysis, implementation, and experimental
evaluation driven by falsifiable hypotheses
Algorithm Engineering A 

when many processors access the same memory mod- for gaps between theory and practice. Thus, the analysis
A
ule. Hence, despite many useful models (e.g., QRQW of such algorithms is an important aspect of AE.
PRAM – Queue Read Queue Write Parallel Random For example, a central problem in parallel process-
Access Machine []), there remains a considerable gap ing is partitioning of large graphs into approximately
to reality when it comes to large-scale shared memory equal-sized blocks such that few edges are cut. This
systems. problem has many applications, for example, in sci-
The situation is better for distributed memory entific computing. Currently available algorithms with
machines, in particular, when the sequential machines performance guarantees are too slow for practical use.
are connected by a powerful switch. We can then Practical methods iteratively contract the graph while
assume that all processors can communicate with an preserving its basic structure until only few nodes are
arbitrary partner with predictable performance. The left, compute an initial solution on this coarse represen-
LogP model [] and the Bulk Synchronous Parallel tation, and then improve by local search on each level.
model [] put this into a simple mathematical form. These algorithms are very fast and yield good solutions
Equally useful are folklore models where one simply in many situations, yet no performance guarantees are
defines the time needed for exchanging a message as the known (see [] for a recent overview).
sum of a startup overhead and a term proportional to
the message length. Then the assumption is usually that Implementation
every processor can only communicate with a single Despite huge efforts in parallel programming languages
other processor at a time, or perhaps it can receive from and in parallelizing compilers, implementing parallel
one processor and send to another one. Also note that algorithms is still one of the main challenges in the
algorithms designed for a distributed memory model algorithm engineering cycle. There are several reasons
often yield efficient implementations on shared memory for this. First, there are huge semantic gaps between
machines. the abstract algorithm description, the programming
tools used, and the actual hardware. In particular, really
Design efficient codes often use fairly low-level programming
As in algorithm theory, we are interested in effi- interfaces such as MPI or atomic memory operations
cient algorithms. However, in algorithm engineering, in order to keep the overheads for processor interac-
it is equally important to look for simplicity, imple- tion manageable. Perhaps more importantly, debugging
mentability, and possibilities for code reuse. In partic- parallel programs is notoriously difficult.
ular, since it can be very difficult to debug parallel code, Since performance is the main reason for using
the algorithms should be designed for simplicity and parallel computers in the first place, and because of
testability. Furthermore, efficiency means not just the complexity of parallel hardware, performance tun-
asymptotic worst-case efficiency, but we also have to ing is an important part of the implementation phase.
look at the constant factors involved. For example, we Although the line between implementation and exper-
have to be aware that operations where several pro- imentation is blurred here, there are differences. In
cessors interact can be a large constant factor more particular, performance tuning is less systematic. For
expensive than local computations. Furthermore, we example, there is no need for reproducibility, detailed
are not only interested in worst-case performance but studies of variances, etc., when one finds out that
also in the performance for real-world inputs. In partic- sequential file I/O is a prohibitive bottleneck or when it
ular, some theoretically efficient algorithms have similar turns out that a collective communication routine of a
best-case and worst-case behavior whereas the algo- particular MPI implementation has a performance bug.
rithms used in practice perform much better on all but
contrived examples. Experiments
Meaningful experiments are the key to closing the
Analysis cycle of the AE process. Compared to the natural sci-
Even simple and proven practical algorithms are often ences, AE is in the privileged situation where it can
difficult to analyze and this is one of the main reasons perform many experiments with relatively little effort.
 A Algorithm Engineering

However, the other side of the coin is highly nontrivial averaging over many repeated runs is advisable. Never-
planning, evaluation, archiving, postprocessing, and theless, the measurements may remain unreliable since
interpretation of results. The starting point should rare delays, for example, due to work done by the oper-
always be falsifiable hypotheses on the behavior of the ating system, can become quite frequent when they can
investigated algorithms, which stem from the design, independently happen on any processor.
analysis, implementation, or from previous experi-
ments. The result is a confirmation, falsification, or Speedup and Efficiency
refinement of the hypothesis. The results complement In parallel computing, running time depends on the
the analytic performance guarantees, lead to a better number of processors used and it is sometimes difficult
understanding of the algorithms, and provide ideas for to see whether a particular execution time is good or
improved algorithms, more accurate analysis, or more bad considering the amount of resources used. There-
efficient implementation. fore, derived measures are often used that express this
Experiments with parallel algorithms are challeng- more directly. The speedup is the ratio of the run-
ing because the number of processors (let alone other ning time of the fastest known sequential implementa-
architectural parameters) provide another degree of tion to that of the parallel running time. The speedup
freedom for the measurements, because even parallel directly expresses the impact of parallelization. The rel-
programs without randomization may show nondeter- ative speedup is easier to measure because it compares
ministic behavior on real machines, and because large with the parallel algorithm running on a single pro-
parallel machines are an expensive resource. cessor. However, note that the relative speedup can
Experiments on comparing the quality of the com- significantly overstate the actual usefulness of the par-
puted results are not so much different from the sequen- allel algorithm, since the sequential algorithm may be
tial case. In the following, we therefore concentrate on much faster than the parallel algorithm run on a single
performance measurements. processor.
For a fixed input and a good parallel algorithm, the
Measuring Running Time speedup will usually start slightly below one for a single
The CPU time is a good way to characterize the time processor, and initially goes up linearly with the number
used by a sequential process (without I/Os), even in of processors. Eventually, the speedup curve gets more
the presence of some operating system interferences. and more flat until parallelization overheads become
In contrast, in parallel programs we have to measure so large that the speedup goes down again. Clearly, it
the actual elapsed time (wall-clock time) in order to makes no sense to add processors beyond the maximum
capture all aspects of the parallel program, in partic- of the speedup curve. Usually it is better to stop much
ular, communication and synchronization overheads. earlier in order to keep the parallelization cost-effective.
Of course, the experiments must be performed on an Efficiency, the ratio of the speedup to the number of pro-
otherwise unloaded machine, by using dedicated job cessors, more directly expresses this. Efficiency usually
scheduling and by turning off unnecessary components starts somewhere below one and then slowly decreases
of the operating system on the processing nodes. Usu- with the number of processors. One can use a threshold
ally, further aspects of the program, like startup, initial- for the minimum required efficiency to decide on the
ization, and shutdown, are not interesting for the mea- maximum number of efficiently usable processors.
surement. Thus timing is usually done as follows: All Often, parallel algorithms are not really used to
processors perform a barrier synchronization imme- decrease execution time but to increase the size of the
diately before the piece of program to be timed; one instances that can be handled in reasonable time. From
processor x notes down its local time and all processors the point of view of speedup and efficiency, this is good
execute the program to be measured. After another bar- news because for a scalable parallel algorithm, by suffi-
rier synchronization, processor x measures the elapsed ciently increasing the input size, one can efficiently use
time. As long as the running time is large compared to any number of processors. One can check this exper-
the time for a barrier synchronization, this is an easy imentally by scaling the input size together with the
way to measure wall-clock time. To get reliable results, number of processors. An interesting property of an
Algorithm Engineering A 

algorithm is how much one has to increase the input calls and thus helps to decide which of several possi-
A
size with the number of processors. The isoefficiency ble calls to use, or whether a manual implementation
function [] expresses this relation analytically, giving could help.
the input size needed to achieve some given, constant Benchmark suites of input instances for an impor-
efficiency. As usual in algorithmics, one uses asymp- tant computational problem can be key to consistent
totic notation to get rid of constant factors and lower progress on this problem. Compared to the alterna-
order terms. tive that each working group uses its own inputs, this
has obvious advantages: there can be wider range of
inputs, results are easier to compare, and bias in instance
Speedup Anomalies
selection is less likely. For example, Chris Walshaw’s
Occasionally, efficiency exceeding one (also called
graph partitioning archive [] has become an important
superlinear speedup) causes confusion. By Brent’s prin-
reference point for graph partitioning.
ciple (a single processor can simulate a p-processor
Synthetic instances, though less realistic than real-
algorithm with a uniform slowdown factor of p) this
world inputs can also be useful since they can be gen-
should be impossible. However, genuine superlinear
erated in any size and sometimes are good as stress
absolute speedup can be observed if the program relies
tests for the algorithms (though it is often the other way
on resources of the parallel machine not available to
round – naively constructed random instances are likely
a simulating sequential machine, for example, main
to be unrealistically easy to handle). For example, for the
memory or cache.
graph partitioning problem, one can generate graphs
A second reason is that the computations done by
that almost look like random graphs but have a pre-
an algorithm can be done in many different ways, some
designed structure that can be more or less easy to detect
leading to a solution fast, some more slowly. Hence, the
according to tunable parameters.
parallel program can be “lucky” to find a solution more
than p times earlier than the sequential program. Inter-
Algorithm Libraries
estingly, such effects do not always disappear when aver-
Algorithm libraries are made by assembling implemen-
aging over all inputs. For example, Schöning [] gives
tations of a number of algorithms using the methods
a randomized algorithm for finding satisfying assign-
of software engineering. The result should be efficient,
ments to formulas in propositional calculus that are
easy to use, well documented, and portable. Algorithm
in conjunctive normal form. This algorithm becomes
libraries accelerate the transfer of know-how into appli-
exponentially faster when run in parallel on many (pos-
cations. Within algorithmics, libraries simplify compar-
sibly simulated) processors. Moreover, its worst-case
isons of algorithms and the construction of software
performance is better than any sequential algorithm.
that builds on them. The software engineering involved
Brent’s principle is not violated since the best sequential
is particularly challenging, since the applications to
algorithm turns out to be the emulation of the parallel
be supported are unknown at library implementation
algorithm.
time and because the separation of interface and (often
Finally, there are many cases were superlinear
highly complicated) implementation is very important.
speedup is not genuine, mostly because the sequential
Compared to an application-specific reimplementation,
algorithm used for comparison is not really the best one
using a library should save development time without
for the inputs considered.
leading to inferior performance. Compared to simple,
easy to implement algorithms, libraries should improve
Instances and Benchmarks performance. To summarize, the triangle between gen-
Benchmarks have a long tradition in parallel comput- erality, efficiency, and ease of use leads to challeng-
ing. Although their most visible use is for comparing ing trade-offs because often optimizing one of these
different machines, they are also helpful within the aspects will deteriorate the others. It is also worth men-
AE cycle. During implementation, benchmarks of basic tioning that correctness of algorithm libraries is even
operations help to select the right approach. For exam- more important than for other softwares because it is
ple, SKaMPI [] measures the performance of most MPI extremely difficult for a user to debug a library code
 A Algorithmic Skeletons

that has not been written by his team. All these diffi- of parallel computation. In: th ACM SIGPLAN symposium on
culties imply that implementing algorithms for use in a principles and practice of parallel programming, pp –, San
library is several times more difficult than implementa- Diego, – May, . ACM, New York
. Gibbons PB, Matias Y, Ramachandran V () The queue-read
tions for experimental evaluation. On the other hand,
queue-write pram model: accounting for contention in parallel
a good library implementation might be used orders algorithms. SIAM J Comput ():–
of magnitude more frequently. Thus, in AE there is a . Grama AY, Gupta A, Kumar V () Isoefficiency: measuring the
natural mechanism leading to many exploratory imple- scalability of parallel algorithms and architectures. IEEE Concurr
mentations and a few selected library codes that build ():–
. Reussner R, Sanders P, Prechelt L, Müller M () SKaMPI: a
on previous experimental experience.
detailed, accurate MPI benchmark. In: EuroPVM/MPI, number
In parallel computing, there is a fuzzy bound-  in LNCS, pp –
ary between software libraries whose main purpose is . Sanders P () Algorithm engineering – an attempt at a defini-
to shield the programmer from details of the hard- tion. In: Efficient Algorithms. Lecture Notes in Computer Science,
ware and genuine algorithm libraries. For example, the vol . Springer, pp –
. Schöning U () A probabilistic algorithm for k-sat and con-
basic functionality of MPI (message passing) is of the
straint satisfaction problems. In: th IEEE symposium on foun-
first kind, whereas its collective communication rou- dations of computer science, pp –
tines have a distinctively algorithmic flavor. The Intel . Singler J, Sanders P, Putze F () MCSTL: the multi-core stan-
Thread Building Blocks (http://www.threadingbuild dard template library. In: th international Euro-Par conference.
ingblocks.org) offers several algorithmic tools includ- LNCS, vol . Springer, pp –
ing a load balancer hidden behind a task concept and . Soperm AJ, Walshaw C, Cross M () A combined evolu-
tionary search and multilevel optimisation approach to graph
distributed data structures such as hash tables. The stan-
partitioning. J Global Optim ():–
dard libraries of programming languages can also be . Valiant L () A bridging model for parallel computation. Com-
parallelized. For example, there is a parallel version of mun ACM ()
the C++ STL in the GNU distribution []. . Walshaw C, Cross M () JOSTLE: parallelmultilevel graph-
The Computational Geometry Algorithms Library partitioning software – an overview. In: Magoules F (ed) Mesh
partitioning tech-niques and domain decomposition techniques,
(CGAL, http://www.cgal.org) is a very sophisticated
pp –. Civil-Comp Ltd. (Invited chapter)
example of an algorithms library that is now also getting
partially parallelized [].

Conclusion
This article explains how algorithm engineering (AE) Algorithmic Skeletons
provides a methodology for research in parallel algo-
rithmics that allows to bridge gaps between theory and Parallel Skeletons
practice. AE does not abolish theoretical analysis but
contains it as an important component that, when appli-
cable, provides particularly strong performance and
robustness guarantees. However, adding careful imple-
mentation, well-designed experiments, realistic inputs, All Prefix Sums
algorithm libraries, and a process coupling all of this Reduce and Scan
together provides a better way to arrive at algorithms Scan for Distributed Memory, Message-Passing
useful for real-world applications. Systems

Bibliography
. Batista VHF, Millman DL, Pion S, Singler J () Parallel geo-
metric algorithms for multi-core computers. In: th ACM sym-
posium on computational geometry, pp – Allen and Kennedy Algorithm
. Culler D, Karp R, Patterson D, Sahay A, Schauser KE, Santos E,
Subramonian R, Eicken Tv () LogP: towards a realistic model Parallelism Detection in Nested Loops, Optimal
Allgather A 

commonly assumed that all nodes know the size of all


Allgather A
subvectors in advance. The Message Passing Interface
(MPI), for instance, makes this assumption for both
Jesper Larsson Träff , Robert A. van de Geijn its regular MPI_Allgather and its irregular MPI_

University of Vienna, Vienna, Austria MPIAllgatherv collective operations. Other collec-

The University of Texas at Austin, Austin, TX, USA tive interfaces that support the operation make similar
assumptions. This can be assumed without loss of gen-
Synonyms erality. If it is not the case, a special, one item per node
All-to-all broadcast; Collect; Concatenation; Gather-to- allgather operation can be used to collect the subvector
all; Gossiping; Total exchange sizes at all nodes.
The allgather operation is a symmetric variant of
Definition the broadcast operation in which all nodes concurrently
Among a group of processing elements (nodes) each broadcast their data item, and is therefore often referred
node has a data item that is to be transferred to all other to as all-to-all broadcast. It is semantically equivalent
nodes, such that all nodes in the group end up having all to a gather operation that collects data items from all
of the data items. The allgather operation accomplishes nodes at a specific root node, followed by a broad-
this total data exchange. cast from that root node, or to p concurrent gather
operations with each node serving as root in one such
Discussion operation. This explains the term allgather. The term
The reader may consider first visiting the entry on col- concatenation can be used to emphasize that the data
lective communication. items are gathered in increasing index order at all nodes
It is assumed that the p nodes in the group of nodes if such is the case. When concerned with the operation
participating in the allgather operation are indexed con- for specific communication networks (graphs) the term
secutively, each having an index i with  ≤ i < p. gossiping has been used. The problem has, like broad-
It is furthermore assumed that the data items are to cast, been extensively studied in the literature, and other
be collected in some fixed order determined by the terminology is occasionally found.
node indices; each node may apply a different order.
Assume that each node i initially has a vector of data
xi of some number of elements ni with n = ∑i= ni
p− Lower Bounds
being the total number of subvector elements. Upon To obtain lower bounds for the allgather operation a
completion of the allgather operation each node i will fully connected, k-ported, homogeneous communication
have the full vector x consisting of the subvectors xi , system with linear communication cost is assumed. This
i = , . . . , p − . This is shown in Fig. . means that
All nodes in the group are assumed to explic- ● All nodes can communicate directly with all other
itly take part in the allgather communication oper- nodes, at the same communication cost,
ation. If all subvectors xi have the same num- ● In each communication operation, a node can send
ber of elements ni = n/p the allgather opera- up to k distinct messages to k other nodes, and
tion is said to be regular, otherwise irregular. It is simultaneously receive k messages from k possibly
different nodes, and
● The cost of transmitting a message of size n (in some
Before After
unit) between any two nodes is modeled by a simple,
Node  Node  Node  Node  Node  Node 
linear function α + nβ. Here α is a start-up latency
x x x x and β the transfer cost per unit.
x x x x
x x x x
With these approximations, lower bounds on the
number of communication rounds during which nodes
Allgather. Fig.  Allgather on three nodes are involved in k-ported communication, and the total
 A Allgather

amount of data that have to be transferred along a The cost of this algorithm is
critical path of any algorithm, can be easily established:
n (p − )n
(p − ) (α + β) = (p − )α + β.
● Since the allgather operation implies a broadcast of p p
a data item from each node, the number of com- This simple algorithm achieves the lower bound in
munication rounds is at least ⌈logk+ p⌉. This follows the second term, and is useful when the vector size n is
because the number of nodes to which this particu- large. The linear, first term is far from the lower bound.
lar item has been broadcast can at most increase by If the p nodes are instead organized in a r ×r mesh,
a factor of k +  per round, such that the number of with p = r r , the complete allgather operation can be
nodes to which the item has been broadcast after d accomplished by first gathering (simultaneously) along
rounds is at most (k + )d . the first dimension, and secondly gathering the larger
● The total amount of data to be received by node subvectors along the second dimension. The cost of this
j is n − nj , and since this can be received over algorithm becomes
k simultaneously active ports, a lower bound is
n−n n−n/p
max≤j<p ⌈ k j ⌉. For the regular case this is ⌈ k ⌉ =
n n
(p−)n
⌈ pk ⌉. (r − ) (α + β) + (r − ) (α + r β)
p p
In the linear cost model, a lower bound for (r − )n + (r − )r n
= (r + r − )α + β
the allgather operation is therefore ⌈logk+ p⌉α + p
(n−n
max ≤j<p ⌈ k j ⌉β. For regular allgather problems this (p − )n
= (r + r − )α + β.
simplifies to p
Generalizing the approach, if the p nodes are orga-
(p − )n
⌈logk+ p⌉α + ⌈ ⌉β. nized in a d-dimension with p = r × r × ⋯ × rd− , the
pk complete allgather operation can be accomplished by d
successive allgather operations along the d dimensions.
The model approximations abstract from any specific
The cost of this algorithm becomes
network topology, and for specific networks, including
networks with a hierarchical communication structure,
better, more precise lower bounds can sometimes be n n
(rd− − ) (α + β) + (rd− − ) (α + rd− β) + ⋯
established. The network diameter will for instance pro- p p
vide another lower bound on the number of communi- ⎛d− ⎞ (p − )n
cation rounds. = ∑(rj − )α + β.
⎝ j= ⎠ p
Notice that as the number of dimensions increases,
Algorithms the α term decreases while the (optimal) β term does
At first it is assumed that k =  and that the allgather is not change.
regular, that is, the size ni of each subvector xi is equal If p = d so that d = log p this approach yields an
to n/p. A brief survey of practically useful, common algorithm with cost
algorithmic approaches to the problem follows.
⎛log  p− ⎞ (p − )n (p − )n
∑ ( − )α + β = (log p)α + β.
⎝ j= ⎠ p p
Ring, Mesh, and Hypercube
A flexible class of algorithms can be described by first The allgather in each dimension now involves only
viewing the nodes as connected as a ring where node two nodes, and becomes a bidirectional exchange of
i sends to node (i + ) mod p and receives from node data. This algorithm maps easily to hypercubes, and
(i − ) mod p. An allgather operation can be accom- conversely a number of essentially identical algorithms
plished in p −  communication rounds. In round j, were originally developed for hypercube systems. For
 ≤ j < p − , node i sends subvector x(i−j) mod p and fully connected networks it is restricted to the situa-
receives subvector x(i−−j) mod p . tion where p is a power of two, and does not easily,
Allgather A 

j←1
j ← 2 A
while j < p do
/* next round */
par/* simultaneous send-receive */
Send subvector (xi,x(i+1) mod p, . . . ,x(i+j−1) mod p) to node (i−j) mod p
Receive subvector (x(i+j) mod p,x(i+j+1) mod p, . . . ,x(i+j+j−1) mod p) from node (i+j) mod p
endpar
j ← j
j ← 2j
endwhile
/* last subvector */
j ← p−j
par/* simultaneous send-receive */
Send subvector (xi,x(i+1) mod p, . . . ,x(i+j−1) mod p) to node (i−j) mod p
Receive subvector (x(i+j+1) mod p,x(i+j+1) mod p, . . . ,x(i+j+j−1) mod p) from node (i+j) mod p
endpar

Allgather. Fig.  The dissemination allgather algorithm for node i,  ≤ i < p, and  < p

without loss of theoretical performance, generalize to ⌈log p⌉-regular communication pattern is an instance
arbitrary values of p. It is sometimes called the bidirec- of a so-called circulant graph.
tional exchange algorithm, and relies only on restricted,
telephone-like bidirectional communication. For this
Composite Algorithms
algorithm both the α and the β terms achieve their
Different, sometimes practically efficient algorithms for
respective lower bounds. In contrast to, for instance,
the allgather operation can be derived by combinations
the broadcast operation, optimality can be achieved
of algorithms for broadcast and gather. The full vector
without the use of pipelining techniques.
can be gathered at a chosen root node and broadcast
from this node to all other nodes. This approach is
Dissemination Allgather
inherently a factor of two off from the optimal number
On networks supporting bidirectional, fully connected,
of communication rounds, since both gather and broad-
single-ported communication the allgather problem
cast requires at least ⌈log p⌉ communication rounds
can be solved in the optimal ⌈log  p⌉ number of commu-
even for fully connected networks. Other algorithms
nication rounds for any p as shown in Fig. . In round
can be derived from broadcast or gather, by doing these
k, node i communicates with nodes (i + k ) mod p and
operations simultaneously with each of the p nodes
(i − k ) mod p, and the size of the subvectors sent and
acting as either broadcast or gather root node.
received doubles in each round, with the possible excep-
tion of the last. Since each node sends and receives p − 
subvectors each of n/p elements the total cost of the
algorithm is Related Entries
Broadcast
(p − )n Collective Communication
⌈log p⌉α + β,
p Message Passing Interface (MPI)
for any number of nodes p, and meets the lower
bound for single-ported communication systems. It can
be generalized optimally (for some k) to k-ported com- Bibliographic Notes and Further
munication systems. Reading
This scheme is useful in many settings for The allgather operation is a symmetric variant of
implementation of other collective operations. The the broadcast operation, and together with this, one
 A All-to-All

of the most studied collective communication opera- . Hedetniemi SM, Hedetniemi T, Liestman AL () A survey
tions. Early theoretical surveys under different com- of gossiping and broadcasting in communication networks. Net-
munication assumptions can be found in [, , , ]. works :–
. Hensgen D, Finkel R, Manber U () Two algorithms for barrier
For (near-)optimal algorithms for hypercubes, meshes,
synchronization. Int J Parallel Program ():–
and tori, see [, ]. Practical implementations, for . Ho C-T, Kao M-Y () Optimal broadcast in all-port wormhole-
instance, for MPI for a variety of parallel systems routed hypercubes. IEEE Trans Parallel Distrib Syst ():–
have frequently been described, see, for instance, [, , . Krumme DW, Cybenko G, Venkataraman KN () Gossiping
, ]. Algorithms and implementations that exploit in minimal time. SIAM J Comput ():–
. Mamidala AR, Vishnu A, Panda DK () Efficient shared mem-
multi-port communication capabilities can be found in
ory and RDMA based design for MPI Allgather over InfiniBand.
[, , ]. Algorithms based on ring shifts are some- In: Recent advances in parallel virtual machine and message pass-
times called cyclic or bucket algorithms; in [, ] it is ing interface, th European PVM/MPI users’ group meeting.
discussed how to create hybrid algorithms for meshes Lecture notes in computer science, vol . Springer, Berlin,
of lower dimension and fully connected architectures pp –
. Mitra P, Payne DG, Schuler L, van de Geijn R () Fast collective
where the number of nodes is not a power of two. The
communication libraries, please. In: Intel Supercomputer Users’
ideas behind these algorithms date back to the early Group Meeting, University of Texas,  June 
days of distributed memory architectures [, ]. The . Qian Y, Afsahi A () RDMA-based and SMP-aware multi-port
dissemination allgather algorithm is from the funda- all-gather on multi-rail QsNetII SMP clusters. In: International
mental work of Bruck et al. [], although the term conference on parallel processing (ICPP ) Xi’ an, China, p. 
is not used in that paper [, ]. Attention has almost . Saad Y, Schultz MH () Data communication in parallel archi-
tectures. Parallel Comput ():–
exclusively been given to the regular variant of the
. Träff JL () Efficient allgather for regular SMP-clusters. In:
problem. A pipelined algorithm for very large, irregular Recent advances in parallel virtual machine and message passing
allgather problems was given in []. More general bib- interface, th European PVM/MPI users’ group meeting, Lecture
liographic notes can be found in the entry on collective notes in computer science, vol . Springer, Berlin, pp –
communication. . Träff JL, Ripke A, Siebert C, Balaji P, Thakur R, Gropp W ()
A pipelined algorithm for large, irregular all-gather problems. Int
J High Perform Comput Appl ():–
Bibliography . Yang Y, Wang J () Near-optimal all-to-all broadcast in multi-
dimensional all-port meshes and tori. IEEE Trans Parallel Distrib
. Benson GD, Chu C-W, Huang Q, Caglar SG () A compar-
Syst ():–
ison of MPICH allgather algorithms on switched networks. In:
Recent advances in parallel virtual machine and message pass-
ing interface, th European PVM/MPI users’ group meeting.
Lecture notes in computer science, vol . Springer, Berlin,
pp –
. Bruck J, Ho C-T, Kipnis S, Upfal E, Weathersby D () Efficient All-to-All
algorithms for all-to-all communications in multiport message-
passing systems. IEEE Trans Parallel Distrib Syst ():– Jesper Larsson Träff , Robert A. van de Geijn
. Chan E, Heimlich M, Purkayastha A, van de Geijn RA () 
University of Vienna, Vienna, Austria
Collective communication: theory, practice, and experience. Con- 
The University of Texas at Austin, Austin, TX, USA
currency Comput: Pract Exp ():–
. Chan E, van de Geijn RA, Gropp W, Thakur R () Collec-
tive communication on architectures that support simultaneous Synonyms
communication over multiple links. In: ACM SIGPLAN sympo-
Complete exchange; Index; Personalized all-to-all
sium on principles and practice of parallel programming (PPoPP),
ACM, New York, pp –
exchange; Transpose
. Chen M-S, Chen J-C, Yu PS () On general results for all-to-all
broadcast. IEEE Trans Parallel Distrib Syst ():– Definition
. Fox G, Johnson M, Lyzenga G, Otto S, Salmon J, Walker D ()
Among a set of processing elements (nodes) each
Solving problems on concurrent processors, vol I. Prentice-Hall,
Englewood Cliffs
node has distinct (personalized) data items destined
. Fraigniaud P, Lazard E () Methods and problems of commu- for each of the other nodes. The all-to-all operation
nication in usual networks. Discret Appl Math (–):– accomplishes this total data exchange among the set of
All-to-All A 

nodes, such that each node ends up having an individual node knows not only the number of elements of all
A
data item from each of the other nodes. subvectors it has to send to other nodes, but also the
number of elements in all subvectors it has to receive
Discussion from other nodes. This can be assumed without loss of
The reader may consider first visiting the entry on col- generality, since a regular all-to-all operation with sub-
lective communication. vectors of size one can be used to exchange and collect
Let the p nodes be indexed consecutively, the required information on number of elements to be
,  . . . , p − . Initially each node i has a (column)vector sent and received.
of data x(i) that is further subdivided into subvectors All-to-all communication is the most general and
(i) (i) most communication intensive collective data-exchange
xj for j = , . . . , p − . The subvector xj is to be sent
to node j from node i. Upon completion of the all-to-all communication pattern. Since it is allowed that some
(i) (i)
exchange operation node i will have the vector consist- ∣xj ∣ =  and that some subvectors xj are identical
(j) for some i and j, the irregular all-to-all operation sub-
ing of the subvectors xi for j = , . . . , p − . In effect,
the matrix consisting of the i columns x(i) is transposed sumes other common collective communication pat-
(j) terns like broadcast, gather, scatter, and allgather.
with the ith row consisting of subvectors xi originally
Algorithms for specific patterns are typically con-
distributed over the p nodes j = , . . . , p −  now trans-
siderably more efficient than general, all-to-all algo-
ferred to node i. This transpose all-to-all operation is
(i) rithms, which motivates the inclusion of a spectrum
illustrated in Fig. . The subvectors xi do not have to
of collective communication operations in collective
be communicated, but for symmetry reasons it is con-
communication interfaces and libraries. The Message-
venient to think of the operation as if each node also
Passing Interface (MPI), for instance, has both regu-
contributes a subvector to itself.
lar MPI_Alltoall and irregular MPI_Alltoallv
The all-to-all exchange can be interpreted as an
and MPI_Alltoallw operations in its repertoire,
actual matrix transpose operation in the case where
(j) as well as operations for the specialized operations
all subvectors xi have the same number of ele-
broadcast, gather, scatter, and allgather, and structured
ments n. In that case the all-to-all problem is called
data can be flexibly described by so-called user-defined
regular and the operation is also sometimes termed
datatypes.
index. The all-to-all operation is, however, also well
All-to-all communication is required in FFT com-
defined in cases where subvectors have different num-
putations, matrix transposition, generalized permuta-
ber of elements. Subvectors could for instance dif-
(i) tions, etc., and is thus a fundamental operation in a large
fer for each row index j, e.g., nj = ∣xj ∣ for all
number of scientific computing applications.
nodes i, or each subvector could have a possibly dif-
(i)
ferent number of elements nij = ∣xj ∣ without any
specified relation between the subvector sizes. In all
Lower Bounds
The (regular) all-to-all communication operation
such cases, the all-to-all problem is called irregular.
(i) requires that each node exchanges data with each
Likewise, the subvectors xj can be structured objects,
other node. Therefore, a lower bound on the all-to-all
for instance matrix blocks. It is common to assume that
communication time will be determined by a minimal
for both regular and irregular all-to-all problems each
bisection and the bisection bandwidth of the commu-
nication system or network. The number of subvectors
Before After that have to cross any bisection of the nodes (i.e., par-
tition into two roughly equal-sized subsets) is p / (for
Node  Node  Node  Node  Node  Node 
even p), namely p/ × p/ = p / subvectors from each
() () () () () ()
x x x x x x subset of the partition. The number of communication
() () () () () () links that can be simultaneously active in a minimal
x x x x x x
() () () () () () partition determines the number of communication
x x x x x x
rounds that are at the least needed to transfer all subvec-
All-to-All. Fig.  All-to-all communication for three nodes tors across the bisection assuming that subvectors are
 A All-to-All

not combined. A d-dimensional, symmetric torus with Algorithms


unidirectional communication has a bisection of kd− In the following, the regular all-to-all operation is con-

with k = d p, and the number of required communi- sidered. Each node i has a subvector xj(i) to transfer

cation rounds is therefore p d p/. A hypercube (which to each other node j, and all subvectors have the same
(j)
can also be construed as a log p dimensional mesh) has number of elements n = ∣xi ∣.
bisection p/. The low bisection of the torus limits the
bandwidth that can be achieved for all-to-all communi-
Fully Connected Systems
cation. If the bisection bandwidth is B vector elements
In a direct algorithm each node i sends each subvec-
per time unit, any all-to-all algorithms requires at least
tor directly to its destination node j. It is assumed
np /B units of time.
that communication takes place in rounds, in which
Assume now a fully connected, homogeneous,
all or some nodes send and/or receive data. The diffi-
k-ported, bidirectional send-receive communication
culty is to organize the communication in such a way
network. Each node can communicate directly with any
that no node stays idle for too many rounds, and that
other node at the same cost, and can at the same time
in the same round no node is required to send or
receive data from at most k nodes and send distinct data
receive more than the k subvectors permitted by the
to at most k, possibly different nodes. In this model,
k-ported assumption.
the following general lower bounds on the number of
For one-ported, bidirectional systems the simplest
communication rounds and the amount of data trans-
algorithm takes p− communication rounds. In round r,
ferred in sequence per node can be proved. Here n is (i)
 ≤ r < p, node i sends subvector x(i+r) mod p to node
the amount of data per subvector for each node.
((i−r) mod p)
(i + r) mod p, and receives subvector xi
(i)
from node (i − r) mod p. If required, subvector xi is
● A complete all-to-all exchange requires at least
copied in a final noncommunication round. This algo-
⌈logk+ p⌉ communication rounds and transfers at
rithm achieves the lower bound on data transfer, at the
least ⌈ n(p−) ⌉ units of data per node.
k expense of p −  communication rounds, and trivially
● Any all-to-all algorithm that uses ⌈log k+ p⌉ commu-
np generalizes to k-ported systems. In this case, the last
nication rounds must transfer at least Ω( k+ logk+ p)
round may not be able to utilize all k communication
units of data per node.
n(p−) ports.
● Any all-to-all algorithm that transfers exactly k
p− For single-ported systems (k = ) with weaker
units of data per node requires at least k commu- bidirectional communication capabilities allowing only
nication rounds. that each node i sends and receives from the same
node j (often referred to as the telephone model)
These lower bounds bound the minimum required in a communication round, or even unidirectional
time for the all-to-all operation. In contrast to, for communication capabilities allowing each node to only
instance, allgather communication, there is a trade-off either send or receive in one communication round, a
between the number of communication rounds and different algorithm solves the problem. Such an algo-
the amount of data transferred. In particular, it is not rithm is described in the following.
possible to achieve a logarithmic number of commu- A fully connected communication network can be
nication rounds and a linear number of subvectors per viewed as a complete, undirected graph with processing
node. Instead, fewer rounds can only be achieved at the nodes modeled as graph nodes. For even p the complete
expense of combining subvectors and sending the same graph can be partitioned into p− disjoint -factors (per-
subvector several times. As a first approximation, com- fect matchings), each of which associates each node i
munication cost is modeled by a linear cost function with a different node j. For odd p this is not possible,
such that the time to transfer n units of data is α + nβ but the formula j = (r − i) mod p for r = , . . . , p − 
for a start-latency α and cost per unit β. All-to-all com- associates a different node j with i for each round r such
munication time is then at least α times the number of that if i is associated with j in round r, then j is asso-
communication rounds plus β times the units of data ciated with i in the same round. In each round there
transferred in sequence by a node. is exactly one node that becomes associated with itself.
All-to-All A 

This can be used to achieve the claimed -factorization also pairs node i with itself in some round, and the self-
A
for even p. Let the last node i = p −  be special. Per- exchange could be done in this round. This would lead
form the p −  rounds on the p −  non-special nodes. to an algorithm with p rounds, in some of which some
In the round where a non-special node becomes asso- nodes perform the self-exchange. If bidirectional com-
ciated with itself, it instead performs an exchange with munication is not supported, each exchange between
the special node. It can be shown that p and p −  com- nodes i and j can be accomplished by the smaller
munication rounds for odd and even p, respectively, is numbered node sending and then receiving from the
optimal. The algorithm is depicted in Fig. . It takes p −  larger numbered node, and conversely the larger num-
communication rounds for even p, and p rounds for bered node receiving and then sending to the smaller
odd p. If a self-copy is required it can for odd p be done numbered node.
in the round where a node is associated with itself and If p is a power of two the pairing j = i ⊕ r, where
for even p either before or after the exchange. Note that ⊕ denotes the bitwise exclusive-or operation, likewise
in the case where p is even the formula j = (r − i) mod p produces a -factorization and this has often been used.

Hypercube
if odd(p) then By combining subvectors the number of communi-
for r ← 0, 1,. . . ,p − 1 do cation rounds and thereby the number of message
j ← (r − i) mod p start-ups can be significantly reduced. The price is an
if i = j then
par/* simultaneous send-receive */ increased communication volume because subvectors
(i)
Send subvector xj to node j will have to be sent multiple times via intermediate
(j)
Receive subvector xi from node j nodes. Such indirect algorithms were pioneered for
end par
end if
hypercubes and later extended to other networks and
end for communication models. In particular for the fully con-
else if i < p − 1 then /* p even */ nected, k-ported model algorithms exist that give the
/* non-special nodes */
for r ← 0,1,. . . ,p − 2 do
optimal trade-off (for many values of k and p) between
j ← (r − i − 1) mod (p − 1) the number of communication rounds and the amount
if i = j then j ← p − 1 of data transferred per node.
par/* simultaneous send-receive */
(i) A simple, indirect algorithm for the d-dimensional
Send subvector xj to node j
(j)
Receive subvector xi from node j
hypercube that achieves the lower bound on the
end par number of communication rounds at the expense of
end for a logarithmic factor more data is as follows. Each
else /* special node */
for r ← 0,1,. . . ,p − 2 do hypercube node i communicates with each of its d
if even (r) then j ← r/2 else j ← (p−1+r)/2 neighbors, and the nodes pair up with their neighbors
par/* simultaneous send-receive */ in the same fashion. In round r, node i pairs up and
(i)
Send subvector xj to node j
(j) performs an exchange with neighbor j = i ⊕  d−r for
Receive subvector xi from node j
end par r = , . . . , d. In the first round, node i sends in one mes-
end for sage the d− subvectors destined to the nodes of the
end if d − -dimensional subcube to which node j belongs. It
All-to-All. Fig.  Direct, telephone-like all-to-all receives from node j the d− subvectors for the d − -
communication based on -factorization of the complete p dimensional subcube to which it itself belongs. In the
node graph. In each communication round r each node i second round, node i pairs up with a neighbor of a
becomes associated with a unique node j with which it d−-dimensional subcube, and exchanges the d− own
performs an exchange. The algorithm falls into three subvectors and the additional d− subvectors for this
special cases for odd and even p, respectively, and for the subcube received in the first round. In general, in each
latter for nodes i < p −  and the special node i = p − . For round r node i receives and sends r d−−r = d− = p/
odd p the number of communication rounds is p, and p −  subvectors.
for even p. In both cases, this is optimal in the number of Total cost of this algorithm in the linear cost model
p
communication rounds is log p(α + β  n).
 A All-to-All

Irregular All-to-All Communication the lower bounds on number of rounds and the trade-
The irregular all-to-all communication problem is con- off between amount of transferred data and required
siderably more difficult both in theory and in prac- number of communication rounds. Variants and com-
tice. In the general case with no restrictions on the binations of these algorithms have been implemented
sizes of the subvectors to be sent and received, find- in various MPI like libraries for collective communica-
ing communication schedules that minimize the num- tion [, , ].
ber of communication rounds and/or the amount A proof that a -factorization of the p node com-
of data transferred is an NP-complete optimization plete graph exists when p is even can be found in []
problem. For problem instances that are not too and elsewhere.
irregular a decomposition into a sequence of more reg- All-to-all algorithms not covered here for meshes
ular problems and other collective operations some- and tori can be found in [, , , ], and algorithms
times work, and such approaches have been used in for multistage networks in [, ]. Lower bounds for
practice. Heuristics and approximation algorithms for tori and meshes based on counting link load can be
many variations of the problem have been consid- found in []. Implementation considerations for some
ered in the literature. For concrete communication of these algorithms for MPI for the Blue Gene systems
libraries, a further practical difficulty is that full infor- can be found in [].
mation about the problem to be solved, that is the Irregular all-to-all exchange algorithms and imple-
sizes of all subvectors for all the nodes is typically mentations have been considered in [, ] (decompo-
not available to any single node. Instead each node sition into a series of easier problems) and later in [, ],
knows only the sizes of the subvectors it has to send the former summarizing many complexity results.
(and receive). To solve the problem this informa-
tion has to be centrally gathered entailing a sequen- Bibliography
tial bottleneck, or distributed algorithms or heuris- . Bala V, Bruck J, Cypher R, Elustondo P, Ho A, Ho CT, Kipnis S,
tics for computing a communication schedule must be Snir M () CCL: a portable and tunable collective communi-
employed. cations library for scalable parallel computers. IEEE T Parall Distr
():–
. Bokhari SH () Multiphase complete exchange: a theoretical
analysis. IEEE T Comput ():–
Related Entries . Bruck J, Ho CT, Kipnis S, Upfal E, Weathersby D () Efficient
Allgather algorithms for all-to-all communications in multiport message-
Collective Communication passing systems. IEEE T Parall Distr ():–
FFT (Fast Fourier Transform) . Fox G, Johnson M, Lyzenga G, Otto S, Salmon J, Walker D ()
Solving problems on concurrent processors, vol I. Prentice-Hall,
Hypercubes and Meshes
Englewood Cliffs
MPI (Message Passing Interface)
. Goldman A, Peters JG, Trystram D () Exchanging messages
of different sizes. J Parallel Distr Com ():–
. Harary F () Graph theory. Addison-Wesley, Reading, Mass
Bibliographic Notes and Further . Johnsson SL, Ho CT () Optimum broadcasting and per-
Reading sonalized communication in hypercubes. IEEE T Comput ():
–
All-to-all communication has been studied since the
. Kumar S, Sabharwal Y, Garg R, Heidelberger P () Optimiza-
early days of parallel computing, and many of the tion of all-to-all communication on the blue gene/l supercom-
results presented here can be found in early, seminal puter. In: International conference on parallel processing (ICPP),
work [, ]. Classical references on indirect all-to- Portland, pp –
all communication for hypercubes are [, ], see also . Lam CC, Huang CH, Sadayappan P () Optimal algorithms for
all-to-all personalized communication on rings and two dimen-
later [] for combinations of different approaches lead-
sional tori. J Parallel Distr Com ():–
ing to hybrid algorithms. The generalization to arbitrary . Liu W, Wang CL, Prasanna VK () Portable and scalable algo-
node counts for the k-ported, bidirectional communi- rithm for irregular all-to-all communication. J Parallel Distr Com
cation model was developed in [], which also proves :–
AMD Opteron Processor Barcelona A 

. Massini A () All-to-all personalized communication on


multistage interconnection networks. Discrete Appl Math AMD Opteron Processor A
(–):– Barcelona
. Ranka S, Wang JC, Fox G () Static and run-time algorithms
for all-to-many personalized communication on permutation Benjamin Sander
networks. IEEE T Parall Distr ():– Advanced Micro Device Inc., Austin, TX, USA
. Ranka S, Wang JC, Kumar M () Irregular personalized com-
munication on distributed memory machines. J Parallel Distr
Com ():–
. Ritzdorf H, Träff JL () Collective operations in NEC’s high- Synonyms
performance MPI libraries. In: International parallel and dis- Microprocessors
tributed processing symposium (IPDPS ), p 
. Saad Y, Schultz MH () Data communication in parallel archi-
tectures. Parallel Comput ():– Definition
. Scott DS () Efficient all-to-all communication patterns in
The AMD OpteronTM processor codenamed “Barcelona”
hypercube and mesh topologies. In: Proceedings th conference
on distributed memory concurrent Computers, pp –
was introduced in September  and was notable
. Suh YJ, Shin KG () All-to-all personalized communication in as the industry’s first “native” quad-core x proces-
multidimensional torus and mesh networks. IEEE T Parall Distr sor, containing four cores on single piece of silicon.
():– The die also featured an integrated L cache which was
. Suh YJ, Yalamanchili S () All-to-all communication with accessible by all four cores, Rapid Virtualization Index-
minimum start-up costs in D/D tori and meshes. IEEE T Parall
ing to improve virtualization performance, new power
Distr ():–
. Thakur R, Gropp WD, Rabenseifner R () Improving the per- management features including split power planes for
formance of collective operations in MPICH. Int J High Perform the integrated memory controller and the cores, and
C :– a higher-IPC (Instructions per Clock) core including
. Tseng YC, Gupta SKS () All-to-all personalized communi- significantly higher floating-point performance. Finally,
cation in a wormhole-routed torus. IEEE T Parall Distr ():
the product was a plug-in replacement for the previous-
–
. Tseng YC, Lin TH, Gupta SKS, Panda DK () Bandwidth-
generation AMD Opteron processor – “Barcelona” used
optimal complete exchange on wormhole-routed D/D torus the same Direct Connect system architecture, the same
networks: A diagonal-propagation approach. IEEE T Parall Distr physical socket including commodity DDR memory,
():– and fit in approximately the same thermal envelope.
. Yang Y, Wang J () Optimal all-to-all personalized exchange
in self-routable multistage networks. IEEE T Parall Distr ():
– Discussion
. Yang Y, Wang J () Near-optimal all-to-all broadcast in mul-
tidimensional all-port meshes and tori. IEEE T Parall Distr Previous Opteron Generations
():– The first-generation AMD Opteron processor was
introduced in April of . AMD Opteron introduced
the AMD instruction set architecture, which was an
evolutionary extension to the x instruction set that
provided additional registers and -bit addressing [].
All-to-All Broadcast An important differentiating feature of AMD was
Allgather that it retained compatibility with the large library of
-bit x applications which had been written over
the -year history of that architecture – other -bit
architectures at that time either did not support the
Altivec x ISA or supported it only through a slow emulation
mode. The first-generation AMD Opteron also intro-
IBM Power Architecture duced Direct Connect architecture which featured a
Vector Extensions, Instruction-Set Architecture (ISA) memory controller integrated on the same die as the
 A AMD Opteron Processor Barcelona

processor, and HyperTransportTM Technology bus links through an improved power management flexibility and
to directly connect the processors and enable glueless the shared L cache.
two-socket and four-socket multiprocessor systems. The L cache was a non-inclusive architecture –
The second-generation AMD Opteron processor the contents of the L caches were independent and
was introduced two years later in April of  and typically not replicated in the L, enabling more effi-
enhanced the original offering by integrating two cores cient use of the cache hierarchy. The L and L caches
(Dual-Core) with the memory controller. The dual-core were designed as a victim-cache architecture. A newly
AMD Opteron was later improved with DDR mem- retrieved cache line would initially be written to the
ory for higher bandwidth, and introduced AMD-VTM L cache. Eventually, the processor would evict the line
hardware virtualization support. from the L cache and would then write it into the L
“Barcelona” introduced the third generation of the cache. Finally, the line would be evicted from the L
AMD Opteron architecture, in September of . cache and the processor would write it to the L cache.
The shared cache used a “sharing-aware” fill and
System-on-a-Chip replacement policy. Consider the case where a line-
“Barcelona” was the industry’s first native quad-core fill request hits in the L cache: The processor has the
processor – as shown in Fig. , the processor contained option of leaving a copy of the line in the L (predict-
four processing cores (each with a private  k L ing that other cores are likely to want that line in the
cache), a shared  MB L cache, an integrated crossbar future) or moving the line to the requesting core, inval-
which enabled efficient multiprocessor communication, idating it from the L to leave a hole which could be
and DDR memory controller and three HyperTrans- used by another fill. Lines which were likely to be shared
port links. The tight integration on a single die enabled (for example, instruction code or lines which had been
efficient communication between all four cores, notably shared in the past) would leave a copy of the line in

AMD Opteron Processor Barcelona. Fig.  Quad-Core AMD OpteronTM processor design
AMD Opteron Processor Barcelona A 

the L cache for other cores to access. Lines which were same peak bandwidth as the previous generation: Each
A
likely to be private (for example, requested with write socket still had two channels of DDR memory, run-
permission or which had never been previously shared ning at up to  MHz and delivering a peak band-
in the past) would move the line from the L to the L. width of . GB/s (per socket). To feed the additional
The L cache maintained a history of which lines had cores, the memory controller in “Barcelona” included
been shared to guide the fill policy. Additionally, the several enhancements which improved the delivered
sharing history information influenced the L replace- bandwidth.
ment policy – the processor would preferentially keep One notable feature was the introduction of “inde-
lines which had been shared in the past by granting pendent” memory channels. In the second-generation
them an additional trip through the replacement policy. AMD Opteron product, each memory channel serviced
In Client CPUs, AMD later introduced a “Triple- half of a -byte memory request – i.e., in parallel
Core” product based on the same die used by the memory controller would read  bytes from each
“Barcelona” – this product served an important prod- channel. This was referred to as a “ganged” controller
uct category (for users who valued an affordable prod- organization and has the benefit of perfectly balancing
uct with more than dual-core processing power). The the memory load between the two memory channels.
Triple-Core product was targeted at the consumer However, this organization also effectively reduces the
market and was based on the same die used by number of available dram banks by a factor of two –
“Barcelona” with three functional cores rather than four. when a dram page is opened on the first channel, the
controller opens the same page on the other channel at
Socket Compatibility the same time. Effectively, the ganged organization cre-
“Barcelona” was socket-compatible with the previous- ates pages which are twice as big, but provides half as
generation Dual-Core AMD OpteronTM product: many dram pages as a result.
“Barcelona” had the same pinout, used the same two With four cores on a die all running four different
channels of DDR memory, and also fit in a similar threads, the memory accesses tend to be unrelated and
thermal envelope enabling the use of the same cool- more random. DRAMs support only a small number
ing and thermal solution. Even though “Barcelona” used of open banks, and requests which map to the same
four cores (rather than two) and each core possessed bank but at different addresses create a situation called
twice the peak floating-point capability of the previ- a “page conflict.” The page conflict can dramatically
ous generation, “Barcelona” was able to fit in the same reduce the efficiency and delivered bandwidth of the
thermal envelope through the combination of a smaller DRAM, because the DRAM has to repeatedly switch
process (“Barcelona” was the first AMD server prod- between the multiple pages which are competing for
uct based on  nm SOI technology) and a reduction the same open bank resources. Additionally, write traf-
in peak operating frequency (initially “Barcelona” ran at fic coming from the larger multi-level “Barcelona” cache
. GHz). The doubling of core density in the same plat- hierarchy created another level of mostly random mem-
form was appealing and substantially increased perfor- ory traffic which had to be efficiently serviced. All of
mance on many important server workloads compared these factors led to a new design in which the two mem-
to the previous-generation AMD Opteron product. ory channels were controlled independently – i.e., each
Customers could upgrade their AMD Opteron products channel could independently service an entire -byte
to take advantage of quad-core “Barcelona,” and OEMs cache line rather than using both channels for the same
could leverage much of their platform designs from the cache line. This change enabled the two channels to
previous generation. independently determine which pages to open, effec-
tively doubling the number of banks and enabling the
Delivering More Memory Bandwidth design to better cope with the more complex memory
As mentioned above, the “Barcelona” processor’s socket- stream coming from the quad-core processor.
compatibility eased its introduction to the marketplace. A low-order address bit influenced the channel
However, the four cores on a die placed additional selection, which served to spread the memory traf-
load on the memory controller – the platform had the fic between the two controllers and avoid overloading
 A AMD Opteron Processor Barcelona

one of the channels. The design hashed the low-order The DRAM controller was also internally re-
address bit with other higher-order address bit to spread plumbed with wider busses (the main busses were
the traffic more evenly, and in practice, applications increased from -bits to -bits) and with additional
showed an near-equal allocation between the two chan- buffering. Many of these changes served to prepare
nels, enabling the benefits of the extra banks provided the controller to support future higher-speeds mem-
by the independent channels along with equal load bal- ory technologies. One notable change was the addition
ancing. The independent channels also resulted in a of write-bursting logic, which buffered memory writes
longer burst length (twice as long as the ganged organi- until a watermark level was achieved, at which time the
zation), which reduced pressure on the command bus; controller would burst all the writes to the controller. As
essentially each Read or Write command performed compared to trickling the writes to the DRAM as they
twice as much work with the independent organiza- arrived at the memory controller, the write-bursting
tion. This was especially important for some DIMMs mode was another bandwidth optimization. Typically,
which required commands to be sent for two consec- the read and write requests address different banks, so
utive cycles (“T” mode). switching between a read mode and a write mode both
“Barcelona” continued to support a dynamic open- minimizes the number of read/write bus turnarounds
page policy, in which the memory controller leaves and minimizes the associated page open/close traffic.
dram pages in the “open” state if the pages are likely
to be accessed in the future. As compared to a closed-
page design (which always closes the dram pages), the “Barcelona” Core Architecture
dynamic open-page design can improve latency and The “Barcelona” core was based on the previous
bandwidth in cases where the access stream has local- “K” core design but included comprehensive IPC
ity to the same page (by leaving the page open), and (Instruction-Per-Clock) improvements throughout the
also deliver best-possible bandwidth when the access entire pipeline. One notable feature was the “Wide
stream contains conflicts (by recognizing the page- Floating-Point Accelerator,” which doubled the raw
conflict stream and closing the page). “Barcelona” intro- computational floating-point data paths and floating-
duced a new predictor which examined the history of point execution units from -bits to -bits. A
accesses to each bank to determine whether to leave the . GHz “Barcelona” core possessed a peak rate of
page open or to close it. The predictor was effective at  GFlops of single-precision computation; the quad-
increasing the number of page hits (delivering lower core “Barcelona” could then deliver  GFlops at peak
latency) and reducing the number of page conflicts (twice as many cores, each with twice as much per-
(improving bandwidth). formance). This doubling of each core’s raw computa-
“Barcelona” also introduced a new DRAM prefetcher tion bandwidth was accompanied by a doubling of the
which monitored read traffic and prefetched the next instruction fetch bandwidth (from -bytes/cycle to -
cache line when it detected a pattern. The prefetcher bytes/cycle) and a doubling of the data cache bandwidth
had sophisticated pattern detection logic which could (two -bit loads could be serviced each clock cycle),
detect both forward and backward patterns, unit and enabling the rest of the pipeline to feed the new higher-
non-unit strides, as well as some more complicated bandwidth floating-point units. Notably, SSE instruc-
patterns. The prefetcher targeted a dedicated buffer to tions include one or more prefix bytes, frequently use
store the prefetched data (rather than a cache) and thus the REX prefix to encode additional registers, and can
the algorithm could aggressively exploit unused dram be quite large. The increased instruction fetch band-
bandwidth without concern for generating cache pollu- width was therefore important to keep the pipeline
tion. The prefetcher also had a mechanism to throttle balanced and able to feed the high-throughput wide
the prefetcher if the prefetches were inaccurate or if floating-point units.
the dram bandwidth was consumed by non-prefetch “Barcelona” introduced a new “unaligned SSE
requests. Later generations of the memory controller mode,” which allowed a single SSE instruction to both
would improve on both the prefetch and the throttling load and execute an SSE operation, without concern for
mechanisms. alignment. Previously, users had to use two instructions
AMD Opteron Processor Barcelona A 

(an unaligned load followed by an execute operation). Overall the “Barcelona” core was a comprehen-
A
This new mode further reduced pressure on the instruc- sive but evolutionary improvement over the previous
tion decoders and also reduced register pressure. The design. The evolutionary improvement provided a con-
mode relaxed a misguided attempt to simplify the archi- sistent optimization strategy for compilers and soft-
tecture by penalizing unaligned operations and was ware developers: optimizations for previous-generation
later adopted as the x standard. Many important algo- AMD OpteronTM processors were largely effective on
rithms such as video decompression can benefit from the “Barcelona” core as well.
vector instructions but cannot guarantee alignment in
the source data (for example, if the input data is com-
pressed). Virtualization and Rapid Virtualization
The “Barcelona” pipeline included an improved Indexing
branch predictor, using more bits for the global history In , virtualization was an emerging application class
to improve the accuracy and also doubling the size of driven by customer desire to more efficiently utilize
the return stack. “Barcelona” added a dedicated predic- multi-core server systems and thus was an important
tor for indirect branches to improve performance when target for the quad-core “Barcelona.” One performance
executing virtual functions commonly used in mod- bottleneck in virtualized applications was the address
ern programming styles. The wider -byte instruc- translation performed by the hypervisor – the hypervi-
tion fetch improved both SSE instruction throughput sor virtualizes the physical memory in the system and
as well as higher throughput in codes with large integer thus has to perform an extra level of address translation
instructions, particularly when using some of the more between the guest physical and the actual host physi-
complicated addressing modes. cal address. Effectively, the hypervisor creates another
“Barcelona” added a Sideband Stack Optimizer fea- level of page tables for this final level of translation.
ture which executed common stack operations (i.e., Previous-generation processors performed this transla-
the PUSH and POP instructions) with dedicated stack tion with a software-only technique called “shadow pag-
adjustment logic. The logic broke the serial dependence ing.” Shadow paging required a large number of hyper-
chains seen in consecutive strings of PUSH and POP visor intercepts (to maintain the page tables) which
instructions (common at function entry and exit), and slowed performance and also suffered from an increase
also freed the regular functional units to execute other in the memory footprint. “Barcelona” introduced Rapid
operations. Virtualization Indexing (also known as “nested pag-
“Barcelona” also improved the execution core with ing”) which provided hardware support for performing
a data-dependent divide, which provided an early the final address translation; effectively, the hardware
out for the common case where the dividend was was aware of both the host and guest page tables and
small. “Barcelona” introduced the SSEa instruction could walk both as needed []. “Barcelona” also pro-
set, which added a handful of instructions including vided translation caching structures to accelerate the
leading-zero count and population count, bit INSERT nested table walk.
and EXTRACT, and streaming single-precision store “Barcelona” continued to support AMD-VTM hard-
operations. ware virtualization support, tagged TLBs to reduce
“Barcelona” core added an out-of-order load fea- TLB flushing when switching between guests, and
ture, which enabled loads to bypass other loads in the the Device Exclusion Vector for security. Addition-
pipeline. Other memory optimizations included a wider ally, well-optimized virtualization applications typically
L bus, larger data and instruction TLBs,  GB page demonstrate a high degree of local memory accesses,
size, and -bit physical address to support large server i.e., accesses to the integrated memory controller rather
database footprints. The L data TLB was expanded to than to another socket in the multi-socket system.
 entries, and each entry could hold any of the three AMD’s Direct Connect architecture, which provided
page sizes in the architecture ( K,  M, or  GB); this lower latency and higher bandwidth for local memory
provided flexibility for the architecture to efficiently run accesses, was thus particularly well suited for running
applications with both legacy and newer page sizes. virtualization applications.
 A Amdahl’s Argument

Power Reduction Features system) and thus the system can immediately return
The “Barcelona” design included dedicated power sup- the requested data without having to probe the system.
plies for the CPU cores and the memory controller, HT Assist enables the AMD Opteron platform to effi-
allowing the voltage for each to be separately controlled. ciently scale bandwidth to -socket and -socket server
This allowed the cores to operate at reduced power con- systems.
sumption levels while the memory controller continued The next AMD OpteronTM product is on the near
to run at full speed and service memory requests from horizon as well. Codenamed “Magny-Cours,” this pro-
other cores in the system. In addition, the core design cessor is planned for introduction in the first quarter of
included the use of fine-gaters to reduce the power , includes up to -cores in each socket, and intro-
to logic on the chip which was not currently in use. duces a new G platform. Each G socket contains 
One example was the floating-point unit – for inte- DDR channels and  HyperTransport links. “Magny-
ger code, when the floating-point unit was not in use, Cours” continues to use evolved versions of the core
“Barcelona” would gate the floating-point unit and sig- and memory controller that were initially introduced in
nificantly reduce the consumed power. The fine-gaters “Barcelona.”
could be re-enabled in a single cycle and did not cause “Barcelona” broke new ground as the industry’s
any visible increased latency. first native quad-core device, including a shared L
The highly integrated “Barcelona” design also cache architecture and leadership memory bandwidth.
reduced the overall system chip count and thus reduced The “Barcelona” core was designed with an evolution-
system power. Notably the AMD OpteronTM system ary approach, and delivered higher core performance
architecture integrated the northbridge on the same die (especially on floating-point codes), and introduced
as the processor (reducing system chip count by one new virtualization technologies to improve memory
device present in some legacy system architectures), translation performance and ease Hypervisor imple-
and also used commodity DDR memory (which mentation. Finally, “Barcelona” was plug-compatible
consumed less power than the competing FB-DIMM with the previous-generation AMD Opteron proces-
standard). sors, leveraging the stable AMD Direct Connect archi-
tecture and cost-effective commodity DDR memory
Future Directions technology.
The AMD OpteronTM processor codenamed “Barcelona”
was the third generation in the AMD Opteron proces-
sor line. “Barcelona” was followed by the AMD Opteron Bibliography
processor codenamed “Shangai,” which was built in . Advanced Micro Devices, Inc. x-TM Technology White
 nm SOI process technology, included a larger  M Paper. http://www.amd.com/us-en/assets/content_type/white_
shared L cache, further core and northbridge perfor- papers_and_tech_docs/x-_wp.pdf
. Advanced Micro Devices, Inc. () AMD-VTM Nested Paging.
mance improvements, faster operating frequencies, and
http://developer.amd.com/assets/NPT-WP-%-final-TM.pdf
faster DDR and HT interconnect frequencies. “Shang- . Advanced Micro Devices, Inc. () AMD architecture tech
hai” was introduced in November of . docs. http://www.amd.com/us-en/Processors/DevelopWithAMD/
“Shanghai” was followed by the AMD Opteron pro- ,___,.html
cessor codenamed “Istanbul,” which integrated six cores . Sander B () Core optimizations for system-level perfor-
mance. http://www.instat.com/fallmpf//conf.htm http://www.
onto a single die, and again plugged into the same
instat.com/Fallmpf//
socket as “Barcelona” and “Shanghai.” “Istanbul” also
included an “HT Assist” feature which substantially
reduced probe traffic in the system. HT Assist adds
cache directory to each memory controller; the direc-
tory tracks lines in the memory range serviced by the
memory controller which are cached somewhere in the
Amdahl’s Argument
system. Frequently, a memory access misses the direc-
tory (indicating the line is not cached anywhere in the Amdahl’s Law
Amdahl’s Law A 

Discussion A
Amdahl’s Law
Graphical Explanation
John L. Gustafson The diagram in Fig.  graphically explains the formula
Intel Labs, Santa Clara, CA, USA in the definition.
The model sets the time required to solve the present
workload (top bar) to unity. The part of the workload
that is serial, f , is unaffected by parallelization. (See dis-
Synonyms cussion below for the effect of including the time for
Amdahl’s argument; Fixed-size speedup; Law of dimin- interprocessor communication.) The model assumes
ishing returns; Strong scaling that the remainder of the time,  − f , parallelizes per-
fectly so that it takes only /P as much time as on the
serial processor. The ratio of the top bar to the bottom
Definition bar is thus /( f + ( − f )/P).
Amdahl’s Law says that if you apply P processors
to a task that has serial fraction f , the predicted
History
net speedup is
In the late s, research interest increased in the idea
of achieving higher computing performance by using

Speedup = −f
. many computers working in parallel. At the Spring
f+ P  meeting of the American Federation of Informa-
tion Processing Societies (AFIPS), organizers set up
More generally, it shows the speedup that results from a session entitled “The best approach to large com-
applying any performance enhancement by a factor of puting capability – A debate.” Daniel Slotnick pre-
P to only one part of a given workload. sented “Unconventional Systems,” a description of a
A corollary of Amdahl’s Law, often confused with -processor ensemble controlled by a single instruc-
the law itself, is that even when one applies a very tion stream, later known as the ILLIAC IV []. IBM’s
large number of processors P (or other performance chief architect, Gene Amdahl, presented a counterargu-
enhancement) to a problem, the net improvement in ment entitled “Validity of the single processor approach
speed cannot exceed /f . to achieving large scale computing capabilities” []. It

Time for present workload

f 1–f

Serial P processors applied to


fraction parallel fraction

1–f
f
P
Reduced time

Amdahl’s Law. Fig.  Graphical explanation of Amdahl’s Law


 A Amdahl’s Law

was in this presentation that Amdahl made a specific of Diminishing Returns is a classic guideline in eco-
argument about the merits of serial mainframes over nomics and business. In the early s, computer users
parallel (and pipelined) computers. attributed the success of Cray Research vector comput-
The formula known as Amdahl’s Law does not ers over rivals such as those made by CDC to Cray’s
appear anywhere in that paper. Instead, the paper shows better attention to Amdahl’s Law. The Cray designs did
a hand-drawn graph that includes the performance not take vector pipelining to such extreme levels rela-
speedup of a -processor system over a single proces- tive to the rest of their system and often got a higher
sor, as the fraction of parallel work increases from % to fraction of peak performance as a result. The widely
%. There are no numbers or labels on the axes in the used textbook on computer architecture by Hennessy
original, but Fig.  reproduces his graph more precisely. and Patterson [] harkens back to the traditional view
Amdahl estimated that about % of an algorithm of Amdahl’s Law as guidance for computer designers,
was inherently serial, and data management imposed particularly in its earlier editions.
another % serial overhead, which he showed as the The formula Amdahl used was simply the use of ele-
gray area in the figure centered about % parallel con- mentary algebra to combine two different speeds, here
tent. He asserted this was the most probable region of defined as work per unit time, not distance per unit time:
operation. From this, he concluded that a parallel sys- it simply compares two cases of the net speed as the
tem like the one Slotnick described would only yield total work divided by the total time. This common result
from about X to X speedup. At the debate, he pre- is certainly not due to Amdahl, and he was chagrined
sented the formula used to produce the graph, but did at receiving credit for such an obvious bit of mathe-
not include it in the text of the paper itself. matics. “Amdahl’s Law” really refers to the argument
This debate was so influential that in less than a year, that the formula (along with its implicit assumptions
the computing community was referring to the argu- about typical serial fractions and the way computer
ment against parallel computing as “Amdahl’s Law.” The costs and workloads scale) predicts harsh limits on what
person who first coined the phrase may have been Willis parallel computing can achieve. For those who wished
H. Ware, who in  first put into print the phrase to avoid the change to their software that parallelism
“Amdahl’s Law” and the usual form of the formula, in would require, either for economic or emotional rea-
a RAND report titled “The Ultimate Computer” []. sons, Amdahl’s Law served as a technical defense for
The argument rapidly became part of the commonly their preference.
accepted guidelines for computer design, just as the Law

30

25

20
Performance

15

10

0
0.0 0.2 0.4 0.6 0.8 1.0
Fraction of arithmetic that can be run in parallel

Amdahl’s Law. Fig.  Amdahl’s original -processor speedup graph (reconstructed)


Amdahl’s Law A 

Estimates of the “Serial Fraction” Prove Machines had announced systems with over , pro-
A
Pessimistic cessors, and Karp gave the community  years to solve
The algebra of Amdahl’s Law is unassailable since it the problem putting a deadline at the end of  to
describes the fundamental way speeds add algebraically, achieve the goal. Karp suggested fluid dynamics, struc-
for a fixed amount of work. However, the estimate of tural analysis, and econometric modeling as the three
the serial fraction f (originally %, give or take %) application areas to draw from, to avoid the use of
was only an estimate by Amdahl and was not based on contrived and unrealistic applications. The published
mathematics. speedups of the  era tended to be less than tenfold
There exist algorithms that are inherently almost and used applications of little economic value (like how
% serial, such as a time-stepping method for a to place N queens on a chessboard so that no two can
physics simulation involving very few spatial variables. attack each other).
There are also algorithms that are almost % parallel, The Karp Challenge was widely distributed by
such as ray tracing methods for computer graphics, or e-mail but received no responses for years, suggesting
computing the Mandelbrot set. It therefore seems rea- that if -fold speedups were possible, they required
sonable that there might be a rather even distribution of more than a token amount of effort. C. Gordon Bell,
serial fraction from  to  over the entire space of com- also interested in promoting the advancement comput-
puter applications. The following figure shows another ing, proposed a similar challenge but with two alter-
common way to visualize the effects of Amdahl’s Law, ations: He raised the award to $,, and said that it
with speedup as a function of the number of processors. would be given annually to the greatest parallel speedup
Figure  shows performance curves for serial fractions achieved on three real applications, but only awarded if
., ., ., and . for a -processor computer system. the speedup was at least twice that of the previous award.
The limitations of Amdahl’s Law for performance This definition was the original Gordon Bell Prize [],
prediction were highlighted in , when IBM scien- and Bell envisioned that the first award might be for
tist Alan Karp publicized a skeptical challenge (and a something close to tenfold speedup, with increasingly
token award of $) to anyone who could demonstrate difficult advances after that.
a speedup of over  times on three real computer By late , Sandia scientists John Gustafson, Gary
applications []. He had just returned from a confer- Montry, and Robert Benner undertook to demon-
ence at which startup companies nCUBE and Thinking strate high parallel speedup on applications from fluid

15
Speedup (time reduction)

se)
rea
inc

10
ar

Serial fraction f = 0.1


line
al (
Ide

Serial fraction f = 0.2


5
Serial fraction f = 0.3

Serial fraction f = 0.4

10 20 30 40 50 60
Number of processors

Amdahl’s Law. Fig.  Speedup curves for large serial fractions


 A Amdahl’s Law

60 e)
re as
inc
ar 1
50 (l ine 0 .00
al f=
Speedup (time reduction)
Ide ctio
n
a
l fr
40 ria
Se
0.01
actio nf=
30
Se rial fr

20

10 Serial fraction f = 0.1

10 20 30 40 50 60
Number of processors

Amdahl’s Law. Fig.  Speedup curves for smaller serial fractions

dynamics, structural mechanics, and acoustic wave Observable Fraction and Superlinear
propagation. They recognized that an Amdahl-type Speedup
speedup (now called “strong scaling”) was more chal- For many scientific codes, it is simple to instrument
lenging to achieve than some of the speedups claimed and measure the amount of time f spent in serial exe-
for distributed memory systems like Caltech’s Cosmic cution. One can place timers in the program around
Cube [] that altered the problem according to the serial regions and obtain an estimate of f that might or
number of processors in use. However, using a ,- might not strongly depend on the input data. One can
processor nCUBE , the three Sandia researchers were then apply this fraction for Amdahl’s Law estimates of
able to achieve performance on , processors rang- time reduction, or Gustafson’s Law estimates of scaled
ing from X to X that of a single processor run- speedup. Neither law takes into account communica-
ning the same size problem, implying the Amdahl serial tion costs or intermediate degrees of parallelism.
fraction was only .–. for those applications. A more common practice is to measure the paral-
This showed that the historical estimates of values for lel speedup as the number of processors is varied, and
the serial fraction in Amdahl’s formula might be far too fit the resulting curve to derive f . This approach may
high, at least for some applications. While the mathe- yield some guidance for programmers and hardware
matics of Amdahl’s Law is unassailable, it was a poorly developers, but it confuses serial fraction with com-
substantiated opinion that the actual values of the serial munication overhead, load imbalance, changes in the
fraction for computing workloads would always be too relative use of the memory hierarchy, and so on. The
high to permit effective use of parallel computing. term “strong scaling” refers to the requirement to keep
Figure  shows Amdahl’s Law, again for a - the problem size the same for any number of processors.
processor system, but with serial fractions of ., ., A common phenomenon that results from “strong scal-
and .. ing” is that when spreading a problem across more and
The TOP list ranks computers by their abil- more processors, the memory per processor goes down
ity to solve a dense system of linear equations. to the point where the data fits entirely in cache, result-
In November , the top-ranked system (Jaguar, ing in superlinear speedup []. Sometimes, the superlin-
Oak Ridge National Laboratories) achieved over % ear speedup effects and the communication overheads
parallel efficiency using , computing cores. partially cancel out, so what appears to be a low value
For Amdahl’s Law to hold, the serial fraction must be of f is actually the result of the combination of the
about one part per million for this system. two effects. In modern parallel systems, performance
Amdahl’s Law A 

analysis with Amdahl’s original law alone will usually is less reasonable when the speeds differ by many orders
A
be inaccurate, since so many other parallel processing of magnitude. Since the execution time of an application
phenomena have large effects on the speedup. tends to match human patience (which differs accord-
ing to application), people might scale the problem such
Impact on Parallel Computing that the time is constant and thus is the controlled vari-
Even at the time of the  AFIPS conference, there was able. That is, t = t , and the speedup simplifies to w /w .
already enough investment in serial computing software See Gustafson’s Law.
that the prospect of rewriting all of it to use paral-
lelism was quite daunting. Amdahl’s Law served as a
strong defense against having to rewrite the code to System Cost: Linear with the Number
exploit multiple processors. Efforts to create experimen- of Processors?
tal parallel computer systems proceeded in the decades Another implicit assumption is that system cost is linear
that followed, especially in academic or research labo- in the number of processors, so anything less than per-
ratory settings, but the major high-performance com- fect speedup implies that cost-effectiveness goes down
puter companies like IBM, Digital, Cray, and HP did every time an approach uses more processors. At the
not create products with a high degree of parallelism, time Amdahl made his argument in , this was a
and cited Amdahl’s Law as the reason. It was not until reasonable assumption: a system with two IBM pro-
the  solution to the Karp Challenge, which made cessors would probably have cost almost exactly twice
clear that Amdahl’s Law need not limit the utility of that of an individual IBM processor. Amdahl’s paper
highly parallel computing, that vendors began develop- even states that “. . . by putting two processors side by
ing commercial parallel products in earnest. side with shared memory, one would find approximately
. times as much hardware,” where the additional .
hardware is for sharing the memory with a crossbar
Implicit Assumptions, and Extensions
switch. He further estimated that memory conflicts
to the Law
would add so much time that net price performance of
Fixed Problem Size a dual-processor system would be . that of a single
The assumption that the computing community over- processor.
looked for  years was that the problem size is fixed. His cost assumptions are not valid for present-era
If one applies many processors (or other performance system designs. As Moore’s Law has decreased the cost
enhancement) to a workload, it is not necessarily true of transistors to the point where a single silicon chip
that users will keep the workload fixed and accept holds many processors in the same package that for-
shorter times for the execution of the task. It is com- merly held a single processor, it is apparent that system
mon to increase the workload on the faster machine to costs are far below linear in the number of processors.
the point where it takes the same amount of time as Processors share software and other facilities that can
before. cost much more than individual processor cores. Thus,
In comparing two things, the scientific approach is while Amdahl’s algebraic formula is true, the implica-
to control all but one variable. The natural choice when tions it provided in  for optimal system design have
comparing the performance of two computations is to changed. For example, it might be that increasing the
run the same problem in both situations and look for a number of processors by a factor of  only provides a
change in the execution time. If speed is w/t where w net speedup of .X for the workload, but if the qua-
is work and t is time, then speedup for two situations drupling of processors only increases system cost by
is the ratio of the speeds: (w /t )/(w /t ). By keeping .X, the cost-effectiveness of the system increases with
the work the same, that is, w = w , the speedup sim- parallelism. Put another way, the point of diminishing
plifies to t /t , and this avoids the difficult problem of returns for adding processors in  might have been a
defining “work” for a computing task. Amdahl’s Law single processor. With current economics, it might be
uses this “fixed-size speedup” assumption. While the a very large number of processors, depending on the
assumption is reasonable for small values of speedup, it application workload.
 A Amdahl’s Law

All-or-None Parallelism processor j − , except that processor  sends to pro-


In the Amdahl model, there are only two levels of con- cessor N, forming a communication ring (completely
currency for the use of N processors: N-fold parallel or parallel).
serial. A more realistic and detailed model recognizes
that the amount of exploitable parallelism might vary
from one to N processors through the execution of a Analogies
program []. The speedup is then Without mentioning Amdahl’s Law by name, others
have referred, often humorously, to the limitations of
Speedup=/ ( f + f / + f / + . . . + fN /N) , parallel processing. Fred Brooks, in The Mythical Man-
Month (), pointed out the futility of trying to com-
where fj is the fraction of the program that can be run
plete software projects in less time by adding more
on j processors in parallel, and f + f + . . . + fN = .
people to the project []. “Brooks’ Law” is his observa-
tion that adding engineers to a project can actually make
Sharing of Resources the project take longer. Brooks quoted the well-known
Because the parallelism model was that of multiple quip, “Nine women can’t have a baby in one month,” and
processors controlled by a single instruction stream, may have been the first to apply that quip to computer
Amdahl formulated his argument for the parallelism of technology.
a single job, not the parallelism of multiple users run- From  to , Ambrose Bierce wrote a collec-
ning multiple jobs. For parallel computers with multiple tion of cynical definitions called The Devil’s Dictionary.
instruction streams, if the duration of a serial section is It includes the following definition:
longer than the time it takes to swap in another job in
the queue, there is no reason that N −  of the N pro- ▸ Logic, n. The art of thinking and reasoning in strict
cessors need to go idle as long as there are users waiting accordance with the limitations and incapacities of
for the system. As the previous section mentions, the human misunderstanding. The basis of logic is the
degree of parallelism can vary throughout a program. syllogism, consisting of a major and a minor premise
A sophisticated queuing system can allocate processing and a conclusion – thus:
resources to other users accordingly, much as systems Major Premise: Sixty men can do a piece of work  times
partition memory dynamically for different jobs. as quickly as one man.
Minor Premise: One man can dig a post-hole in s;
therefore –
Communication Cost Conclusion: Sixty men can dig a post-hole in s.
Because Amdahl formulated his argument in , he This may be called the syllogism arithmetical, in which,
treated the cost of communication of data between by combining logic and mathematics, we obtain a dou-
processors as negligible. At that time, computer arith- ble certainty, and are twice blessed.
metic took so much longer than data motion that the
data motion was overlapped or insignificant. Arith- In showing the absurdity of using  processors
metic speed has improved much more than the speed of (men) for an inherently serial task, he predated Amdahl
interprocessor communication, so many have improved by almost  years.
Amdahl’s Law as a performance model by incorporating Transportation provides accessible analogies for
communication terms in the formula. Amdahl’s Law. For example, if one takes a trip at
Some have suggested that communication costs are  miles per hour and immediately turns around, how
part of the serial fraction of Amdahl’s Law, but this fast does one have to go to average  miles per hour?
is a misconception. Interprocessor communication can This is a trick question that many people incorrectly
be serial or parallel just as the computation can. For answer, “ miles per hour.” To average  miles per
example, a communication algorithm may ask one pro- hour, one would have to travel back at infinite speed
cessor to send data to all others in sequence (completely and instantly. For a fixed travel distance, just as for
serial) or it may ask each processor j to send data to a fixed workload, speeds do not combine as a simple
Amdahl’s Law A 

arithmetic average. This is contrary to our intuition, Related Entries A


which may be the reason some consider Amdahl’s Law Brent’s Theorem
such a profound observation. Gustafson’s Law
Metrics
Pipelining
Perspective
Amdahl’s  argument became a justification for the
avoidance of parallel computing for over  years. It
was appropriate for many of the early parallel com- Bibliographic Notes and Further
puter designs that shared an instruction stream or the Reading
memory fabric or other resources. By the s, hard- Amdahl’s original  paper is short and readily avail-
ware approaches emerged that looked more like col- able online, but as stated in the Discussion section, ith
lections of autonomous computers that did not share has neither the formula nor any direct analysis. An
anything yet were capable of cooperating on a single objective analysis of the Law and its implications can
task. It was not until Gustafson published his alter- be found in [] or []. The series of editions of thehth
native formulation for parallel speedup in , along textbook on computer architecture by Hennessey and
with several examples of actual ,-fold speedups Patterson [] began in  with a strong alignment toht
from a -processor system, that the validity of the Amdahl’s  debate position against parallel comput-
parallel-computing approach became widely accepted ing, and has evolved a less strident stance in more recent
outside the academic community. Amdahl’s Law still editions.
is the best rule-of-thumb when the goal of the perfor- For a rigorous mathematical treatment of Amdahl’s
mance improvement is to reduce execution time for a Law that covers many of the extensions and refine-
fixed task, whereas Gustafson’s Law is the best rule-of- ments mentioned in the Discussion section, see [].
thumb when the goal is to increase the problem size for One of the first papers to show how fixed-sized speedup
a fixed amount of time. Amdahl’s and Gustafson’s Laws measurement is prone to superlinear speedup effects
do not contradict one another, nor is either a corol- is [].
lary or equivalent of the other. They are for different A classic  work on speedup and efficiency is
assumptions and different situations. “Speedup versus efficiency in parallel systems” by D. L.
Gene Amdahl, in a  personal interview, stated Eager, J. Zahorjan, and E. D. Lazowska, in IEEE Trans-
that he never intended his argument to be applied to actions, March , –. DOI=./..
the case where each processor had its own operating
system and data management, and would have been
Bibliography
far more open to the idea of parallel computing as
. Amdahl GM () Validity of the single-processor approach
a viable approach had it been posed that way. He is to achieve large scale computing capabilities. AFIPS Joint
now a strong advocate of parallel architectures and sits Spring Conference Proceedings  (Atlantic City, NJ, Apr. –
on the technical advisory board of Massively Parallel ), AFIPS Press, Reston VA, pp –, At http://www-
Technologies, Inc. inst.eecs.berkeley.edu/∼n/paper/Amdahl.pdf
. Bell G (interviewed) (July ) An interview with Gordon Bell.
With the commercial introduction of single-image
IEEE Software, ():–
systems with over , processors such as Blue Gene, . Brooks FP () The mythical man-month: Essays on software
and clusters with similar numbers of server proces- engineering. Addison-Wesley, Reading. ISBN ---
sor cores, it becomes increasingly unrealistic to use a . Gustafson JL (April ) Fixed time, tiered memory, and super-
fixed-size problem to compare the performance of a linear speedup. Proceedings of the th distributed memory con-
single processor with that of the entire system. Thus, ference, vol , pp –. ISBN: ---
. Gustafson JL, Montry GR, Benner RE (July ) Development
scaled speedup (Gustafson’s Law) applies to measure
of parallel methods for a -processor hypercube. SIAM J Sci
performance of the largest systems, with Amdahl’s Law Statist Comput, ():–
applied mainly where the number of processors changes . Gustafson (May ) Reevaluating Amdahl’s law. Commun
over a narrow range. ACM, ():–. DOI=./.
 A AMG

. Hennessy JL, Patterson DA (, , , ) Computer accelerate molecular dynamics (MD) simulations of
architecture: A quantitative approach. Elsevier Inc. biomolecular systems. Anton performs massively par-
. Hwang K, Briggs F () Computer architecture and parallel
allel computation on a set of identical MD-specific
processing, McGraw-Hill, New York. ISBN: 
ASICs that interact in a tightly coupled manner using a
. Karp A () http://www.netlib.org/benchmark/karp-challenge
. Lewis TG, El-Rewini H () Introduction to parallel comput- specialized high-speed communication network. Anton
ing, Prentice Hall. ISBN: ---, – enabled, for the first time, the simulation of proteins at
. Seitz CL () Experiments with VLSI ensemble machines. Jour- an atomic level of detail for periods on the order of a
nal of VLSI and computer systems, vol . No. , pp – millisecond – about two orders of magnitude beyond
. Slotnick D () Unconventional systems. AFIPS joint spring
the previous state of the art – allowing the observation
conference proceedings  (Atlantic City, NJ, Apr. –). AFIPS
Press, Reston VA, pp – of important biochemical phenomena that were previ-
. Sun X-H, Ni L () Scalable problems and memory-bounded ously inaccessible to both computational and experi-
speedup. Journal of parallel and distributed computing, vol . mental study.
No , pp –
. Ware WH () The Ultimate Computer. IEEE spectrum, vol .
No. , pp – Discussion
Introduction
Classical molecular dynamics (MD) simulations give
scientists the ability to trace the motions of biologi-
AMG cal molecules at an atomic level of detail. Although
MD simulations have helped yield deep insights into
Algebraic Multigrid the molecular mechanisms of biological processes in
a way that could not have been achieved using only
laboratory experiments [, ], such simulations have
historically been limited by the speed at which they can
Analytics, Massive-Scale be performed on conventional computer hardware.
A particular challenge has been the simulation
Massive-Scale Analytics of functionally important biological events that often
occur on timescales ranging from tens of microseconds
to a millisecond, including the “folding” of proteins into
their native three-dimensional shapes, the structural
Anomaly Detection changes that underlie protein function, and the inter-
actions between two proteins or between a protein and
Race Detection Techniques a candidate drug molecule. Such long-timescale simu-
Intel Parallel Inspector lations pose a much greater challenge than simulations
of larger chemical systems at more moderate timescales:
the number of processors that can be used effectively in
parallel scales with system size but not with simulation
Anton, A Special-Purpose length, because of the sequential dependencies within a
Molecular Simulation Machine simulation.
Anton, a specialized, massively parallel supercom-
Ron O. Dror , Cliff Young , David E. Shaw, puter developed by D. E. Shaw Research, accelerated

D. E. Shaw Research, New York, NY, USA such calculations by several orders of magnitude com-

Columbia University, New York, NY, USA
pared with the previous state of the art, enabling the
simulation of biological processes on timescales that
Definition might otherwise not have been accessible for many
Anton is a special-purpose supercomputer architec- years. The first -node Anton machine (Fig. ), which
ture designed by D. E. Shaw Research to dramatically became operational in late , completed an all-atom
Anton, A Special-Purpose Molecular Simulation Machine A 

Anton, A Special-Purpose Molecular Simulation Machine. Fig.  A -node Anton machine

Anton, A Special-Purpose Molecular Simulation Machine. Table  The longest (to our knowledge) all-atom MD
simulations of proteins in explicitly represented water published through the end of 
Length (μs) Protein Hardware Software Citation
, BPTI Anton [native] []
 gpW Anton [native] []
 WW domain x cluster NAMD [, ]
 Villin HP- x cluster NAMD []
 Villin HP- x cluster GROMACS []
 Rhodopsin Blue Gene/L Blue Matter [, ]
 β AR x cluster Desmond []

protein simulation spanning more than a millisecond is executed by a programmable portion of each chip
of biological time in  []. By way of compari- that achieves a substantial degree of parallelism while
son, the longest such simulation previously reported in preserving the flexibility necessary to accommodate
the literature, which was performed on general-purpose anticipated advances in physical models and simulation
computer hardware using the MD code NAMD, was  methods.
microseconds (μs) in length []; at the time, few other Anton was created to attack a somewhat differ-
published simulations had reached  μs (Table ). ent problem than the ones addressed by several other
An Anton machine comprises a set of identical projects that have deployed significant computational
processing nodes, each containing a specialized MD resources for MD simulations. The Folding@Home
computation engine implemented as a single ASIC project [], for example, uses hundreds of thousands of
(Fig. ). These processing nodes are connected through PCs (made available over the Internet by volunteers) to
a specialized high-performance network to form a simulate a very large number of separate molecular tra-
three-dimensional torus. Anton was designed to use jectories, each of which is limited to the timescale acces-
both novel parallel algorithms and special-purpose sible on a single PC. While a great deal can be learned
logic to dramatically accelerate those calculations that from a large number of independent MD trajectories,
dominate the time required for a typical MD simula- many other important problems require the examina-
tion []. The remainder of the simulation algorithm tion of a single, very long trajectory – the principal
 A Anton, A Special-Purpose Molecular Simulation Machine

Host
−Y +Y +X Computer

Torus Torus Torus Host


Link Link Link Interface

Torus Router Router


Link
+Z

Router

Memory Controller
Flexible High-
Subsystem Throughput
Torus
Link

−Z

Router

DRAM
Interaction
Subsystem
(HTIS)
Router
Torus
Link

−X

Router

Memory Controller

Intra-chip
DRAM Ring Network

Anton, A Special-Purpose Molecular Simulation Machine. Fig.  Block diagram of a single Anton ASIC, comprising the
specialized high-throughput interaction subsystem, the more general-purpose flexible subsystem, six inter-chip torus
links, an intra-chip communication ring, and two memory controllers

task for which Anton was designed. Other projects analogous goal, Anton (the machine) was designed as
have produced special-purpose hardware (e.g., FAST- a sort of “computational microscope,” providing con-
RUN [], MDGRAPE [], and MD Engine []) to temporary biological and biomedical researchers with
accelerate the most computationally expensive elements a tool for understanding organisms and their diseases
of an MD simulation. Such hardware reduces the effec- at previously inaccessible spatial and temporal scales.
tive cost of simulating a given period of biological time, Anton has enabled substantial advances in the study
but Amdahl’s law and communication bottlenecks pre- of the processes by which proteins fold, function, and
vent the efficient use of enough such chips in parallel interact with drugs [, ].
to extend individual simulations beyond timescales of a
few microseconds. Structure of a Molecular Dynamics
Anton was named after Anton van Leeuwenhoek, Computation
often referred to as the “father of microscopy.” In An MD simulation computes the motion of a collec-
the seventeenth century, van Leeuwenhoek built high- tion of atoms – for example, a protein surrounded by
precision optical instruments that allowed him to visu- water – over a period of time according to the laws of
alize bacteria and other microorganisms, as well as classical physics. Time is broken into a series of dis-
blood cells and spermatozoa, revealing for the first crete time steps, each representing a few femtoseconds
time an entirely new biological world. In pursuit of an of simulated time. For each time step, the simulation
Anton, A Special-Purpose Molecular Simulation Machine A 

performs a computationally intensive force calculation how MD is done, simultaneously considering changes
A
for each atom, followed by a less expensive integration to algorithms, software, and, especially, hardware.
operation that advances the positions and velocities of Hardware specialization allows Anton to redeploy
the atoms. resources in ways that benefit MD. Compared to other
Forces are evaluated based on a model known as high-performance computing applications, MD uses
a force field. Anton supports a variety of commonly much computation and communication but surpris-
used biomolecular force fields, which express the total ingly little memory. Anton exploits this property by
force on an atom as a sum of three types of component using only SRAMs and small first-level caches on the
forces: () bonded forces, which involve interactions ASIC, constraining all code and data to fit on-chip in
between small groups of atoms connected by one or normal operation (for chemical systems that exceed
more covalent bonds; () van der Waals forces, which SRAM size, Anton pages state to each node’s local
include interactions between all pairs of atoms in the DRAM). The area that would have been spent on large
system, but which fall off quickly with distance and caches and aggressive memory hierarchies is instead
are typically only evaluated for nearby pairs of atoms; dedicated to computation and communication. Each
and () electrostatic forces, which include interactions Anton ASIC contains dedicated, specialized hardware
between all pairs of charged atoms, and fall off slowly datapaths to evaluate the range-limited interactions
with distance. and perform charge spreading and force interpola-
Electrostatic forces are typically computed by one tion, packing much more computational logic on a
of several fast, approximate methods that account for chip than is typical of general-purpose architectures.
long-range effects without requiring the explicit interac- Each ASIC also contains programmable processors with
tion of all pairs of atoms. Anton, like most MD codes for specialized instruction set architectures tailored to the
general-purpose hardware, divides electrostatic interac- remainder of the MD computation. Anton’s specialized
tions into two contributions. The first decays rapidly network fabric not only delivers bandwidth and latency
with distance, and is thus computed directly for all two orders of magnitude better than Gigabit Ethernet,
atom pairs separated by less than some cutoff radius. but also sustains a large fraction of peak network band-
This contribution and the van der Waals interactions width when delivering small packets and provides hard-
together constitute the range-limited interactions. The ware support for common MD communication patterns
second contribution (long-range interactions) decays such as multicast [].
more slowly, but can be expressed as a convolution The most computationally intensive parts of an MD
and efficiently computed using fast Fourier transforms simulation – in particular, the electrostatic interac-
(FFTs) []. This process requires the mapping of tions – are also the most well established and unlikely to
charges from atoms to nearby mesh points before the change as force field models evolve, making these calcu-
FFT computations (charge spreading), and the calcula- lations particularly amenable to hardware acceleration.
tion of forces on atoms based on values associated with Dramatically speeding up MD, however, requires that
nearby mesh points after the FFT computations ( force one accelerates more than just an “inner loop.” Calcula-
interpolation). tion of electrostatic and van der Waals forces accounts
for roughly % of the computational time for a rep-
The Role of Specialization in Anton resentative MD simulation on a single general-purpose
During the five years spent designing and building processor. Amdahl’s law states that no matter how much
Anton, the number of transistors on a chip increased one accelerates this calculation, the remaining compu-
by roughly tenfold, as predicted by Moore’s law. Anton, tations, left unaccelerated, would limit the maximum
on the other hand, enabled simulations approximately speedup to a factor of . Hence, Anton dedicates a
, times faster than was possible at the beginning significant fraction of silicon area to accelerating other
of that period, providing access to biologically crit- tasks, such as bonded force computation and integra-
ical millisecond timescales significantly sooner than tion, incorporating programmability as appropriate to
would have been possible on commodity hardware. accommodate a variety of force fields and simulation
Achieving this performance required reengineering features.
 A Anton, A Special-Purpose Molecular Simulation Machine

System Architecture produced by Anton in certain modes of operation are


The building block of an Anton system is a node, which exactly reversible (a physical property guaranteed by
includes an ASIC with two major computational sub- Newton’s laws but rarely achieved in numerical simu-
systems (Fig. ). The first is the high-throughput interac- lation) [].
tion subsystem (HTIS), designed for computing massive
numbers of range-limited pairwise interactions of vari-
ous forms. The second is the flexible subsystem, which is The High-Throughput Interaction Subsystem
composed of programmable cores used for the remain- (HTIS)
ing, less structured part of the MD calculation. The The HTIS is the largest computational accelerator in
Anton ASIC also contains a pair of high-bandwidth Anton, handling the range-limited interactions, charge
DRAM controllers (augmented with the ability to accu- spreading, and force interpolation. These tasks account
mulate forces and other quantities), six high-speed for a substantial majority of the computation involved
(. Gbit/s per direction) channels that provide com- in an MD simulation and require several hundred
munication to neighboring ASICs, and a host interface microseconds per time step on general-purpose super-
that communicates with an external host computer for computers. The HTIS accelerates these computations
input, output, and general control of the Anton sys- such that they require just a few microseconds on
tem. The ASICs are implemented in -nm technol- Anton, using an array of  hardwired pairwise point
ogy and clocked at  MHz, with the exception of the interaction modules (PPIMs) (Fig. ). The heart of each
arithmetically intensive portion of the HTIS, which is PPIM is a force calculation pipeline that computes the
clocked at  MHz. force between a pair of particles; this is a -stage
An Anton machine may incorporate between  and pipeline (at  MHz) of adders, multipliers, function
, nodes, each of which is responsible for updating evaluation units, and other specialized datapath ele-
the position of particles within a distinct region of space ments. The functional units of this pipeline use cus-
during a simulation. One -node machine, one tomized numerical precisions: bit width varies across
-node machine, ten -node machines, and several the different stages to minimize die area while ensuring
smaller machines were operational as of June . For an accurate -bit result. The HTIS keeps these pipelines
a given machine size, the nodes are connected to form a operating at high utilization through careful choreog-
three-dimensional torus (i.e., a three-dimensional mesh raphy of data flow both between chips and within a
that wraps around in each dimension, which maps nat- chip. A single HTIS can perform , interactions
urally to the periodic boundary conditions used during per microsecond; a modern x core, by contrast, can
most MD simulations). Four nodes are incorporated in perform about  interactions per microsecond [].
each node board, and  node boards fit in a -inch Despite its name, the HTIS also addresses latency: a -
rack; larger machines are constructed by linking racks node Anton performs the entire range-limited interac-
together. tion computation of a ,-atom MD time step in
Almost all computation on Anton uses fixed-point just  μs, over two orders of magnitude faster than any
arithmetic, which can be thought of as operating on contemporaneous general-purpose computer.
twos-complement numbers in the range [−, ). In prac- The computation is parallelized across chips using
tice, most of the quantities handled in an MD simu- a novel technique, the NT method [], which requires
lation fall within well-characterized, bounded ranges less communication bandwidth than traditional meth-
(e.g., bonds are between  and  Å in length), so there ods for parallelizing range-limited interactions. Figure 
is no need for software or hardware to dynamically shows the spatial volume from which particle data must
normalize fixed-point values. Use of fixed-point arith- be imported into each node using the NT method
metic reduces die area requirements and facilitates the compared with the import volume required by the tra-
achievement of certain desirable numerical properties: ditional “half-shell” approach. As the level of paral-
for example, repeated Anton simulations will produce lelism increases, the import volume of the NT method
bitwise identical results even when performed on dif- becomes progressively smaller in both absolute and
ferent numbers of nodes, and molecular trajectories asymptotic terms than that of the traditional method.
Anton, A Special-Purpose Molecular Simulation Machine A 

PPIM
A
− − −
x2 x2 x2
+
<
HTIS
communication
ring
interfaces
− − − × × + × +
1
x2 x2 x2 x
particle + x2
memory
< × ×
f(x) g(x)
interaction
×>> ×>> ×>>
control block
+
× × ×

Anton, A Special-Purpose Molecular Simulation Machine. Fig.  High-throughput interaction subsystem (HTIS) and
detail of a single pairwise point interaction module (PPIM). In addition to the  PPIMs, the HTIS includes two
communication ring interfaces, a buffer area for particle data, and an embedded control core called the interaction control
block. The U-shaped arrows show the flow of data through the particle distribution and force reduction networks. Each
PPIM includes eight matchmaking units (shown stacked), a number of queues, and a force calculation pipeline that
computes pairwise interactions

The NT method requires that each chip compute pipelines, each PPIM includes eight dedicated match
interactions between particles in one spatial region (the units that collectively check each arriving plate particle
tower) and particles in another region (the plate). The against the tower particles stored in the PPIM to deter-
HTIS uses a streaming approach to bring together all mine which pairs may need to interact, using a low-
pairs of particles from the two sets. Figure  depicts precision distance test. Each particle pair that passes
the internal structure of the HTIS, which is dominated this test and satisfies certain other criteria proceeds
by the two halves of the PPIM array. The HTIS loads to the PPIM’s force calculation pipeline. As long as at
tower particles into the PPIM array and streams the least one-eighth of the pairs checked by the match units
plate particles through the array, past the tower parti- proceed, the force calculation pipeline approaches full
cles. Each plate particle accumulates the force from its utilization.
interactions with tower particles as it streams through In addition to range-limited interactions, the HTIS
the array. While the plate particles are streaming by, performs charge spreading and force interpolation.
each tower particle also accumulates the force from its Anton is able to map these tasks to the HTIS by employ-
interactions with plate particles. After the plate particles ing a novel method for efficient electrostatics com-
have been processed, the accumulated tower forces are putation, k-space Gaussian Split Ewald (k-GSE) [],
streamed out. which employs radially symmetric spreading and inter-
Not all plate particles need to interact with all tower polation functions. This radial symmetry allows the
particles; some pairs, for example, exceed the cutoff dis- hardware that computes pairwise nonbonded inter-
tance. To improve the utilization of the force calculation actions between pairs of particles to be reused for
 A Anton, A Special-Purpose Molecular Simulation Machine

interactions between particles and grid points. Both the


k-GSE method and the NT method were developed
while re-examining fundamental MD algorithms dur-
ing the design phase of Anton.

The Flexible Subsystem


Although the HTIS handles the most computationally
intensive parts of an Anton calculation, the flexible sub-
system performs a far wider variety of tasks. It initi-
ates each force computation phase by sending particle
positions to multiple ASICs. It handles those parts of
force computation not performed in the HTIS, includ-
Anton: A Special-Purpose Molecular Simulation Machine.
ing calculation of bonded force terms and the FFT. It
Fig.  Import regions associated with two parallelization
performs all integration tasks, including updating posi-
methods for range-limited pairwise interactions. (a) In a
tions and velocities, constraining some particle pairs to
traditional spatial decomposition method, each node
be separated by a fixed distance, modulating tempera-
imports particles from the half-shell region so that they
ture and pressure, and migrating atoms between nodes
can interact with particles in the home box. (b) In the NT
as the molecular system evolves. Lastly, it performs all
method, each node computes interactions between
boot, logging, and maintenance activities. The compu-
particles in a tower region and particles in a plate region.
tational details of these tasks vary substantially from one
Both of these regions include the home box, but particles
MD simulation to another, making programmability a
in the remainder of each region must be imported. In both
requirement.
methods, each pair of particles within the cutoff radius of
The flexible subsystem contains eight geometry cores
one another will have their interaction computed on some
(GCs) that were designed at D. E. Shaw Research to per-
node of the machine
form fast numerical computations, four control cores
(Tensilica LXs) that coordinate the overall data flow
in the Anton system, and four data transfer engines
that allow communication to be hidden behind com- flexible subsystem. In addition to the usual system inter-
putation. The GCs perform the bulk of the flexible face and cache interfaces, each control core also con-
subsystem’s computational tasks, and they have been nects to a -KB scratchpad memory, which holds MD
customized in a number of ways to speed up MD. simulation data for background transfer by the data
Each GC is a dual-issue, statically scheduled SIMD transfer engine. These engines can be programmed to
processor with pipelined multiply accumulate support. write data from the scratchpad to network destinations
The GC’s basic data type is a vector of four -bit and to monitor incoming writes for synchronization
fixed-point values, and two independent SIMD oper- purposes. The background data transfer capability pro-
ations on these vectors issue each cycle. The GC’s vided by these engines is crucial for performance, as it
instruction set includes element-wise vector operations enables overlapped communication and computation.
(for example, vector addition), more complicated vec- The control cores also handle maintenance tasks, which
tor operations such as a dot product (which is used tend not to be performance-critical (e.g., checkpointing
extensively in calculating bonded forces and applying every million time steps).
distance constraints), and scalar operations that read Considerable effort went into keeping the flexible
and write arbitrary scalar components of the vector reg- subsystem from becoming an Amdahl’s law bottleneck.
isters (essentially accessing the SIMD register file as a Careful scheduling allows some of the tasks performed
larger scalar register file). by the flexible subsystem to be partially overlapped with
Each of the four control cores manages a corre- or completely hidden behind communication or HTIS
sponding programmable data transfer engine, used to computation []. Adjusting parameters for the algo-
coordinate communication and synchronization for the rithm used to evaluate electrostatic forces (including the
Anton, A Special-Purpose Molecular Simulation Machine A 

cutoff radius and the FFT grid density) shifts computa- assembly language, while the control cores of the flexi-
A
tional load from the flexible subsystem to the HTIS []. ble subsystem and a control processor in the HTIS are
A number of mechanisms balance load among the cores programmed in C augmented with intrinsics. Various
of the flexible subsystem and across flexible subsystems fixed-function hardware units, such as the PPIMs, are
on different ASICs to minimize Anton’s overall execu- programmed by configuring state machines and filling
tion time. Even with these and other optimizations, the tables. Anton’s design philosophy emphasized perfor-
flexible subsystem remains on the critical path for up to mance over ease of programmability, although increas-
one-third of Anton’s overall execution time. ingly sophisticated compilers and other tools to simplify
programming are under development.
Communication Subsystem
The communication subsystem provides high-speed,
Anton Performance
low-latency communication both between ASICs and
Figure  shows the performance of a -node Anton
among the subsystems within an ASIC []. Within
machine on several different chemical systems, vary-
a chip, two -bit, -MHz communication rings
ing in size and composition. On the widely used Joint
link all subsystems and the six inter-chip torus ports.
AMBER-CHARMM benchmark system, which con-
Between chips, each torus link provides .-Gbit/s full-
tains , atoms and represents the protein dihydro-
duplex communication with a hop latency around  ns.
folate reductase (DHFR) surrounded by water, Anton
The communication subsystem supports efficient mul-
simulates . μs per day of wall-clock time []. The
ticast, provides flow control, and provides class-based
fastest previously reported simulation of this system was
admission control with rate metering.
obtained using a software package, called Desmond,
In addition to achieving high bandwidth and low
which was developed within our group for use on com-
latency, Anton supports fine-grained inter-node com-
modity clusters []. This Desmond simulation executed
munication, delivering half of peak bandwidth on mes-
at a rate of  nanoseconds (ns) per day on a -
sages of just  bytes. These properties are critical to
node .-GHz Intel Xeon E cluster connected by
delivering high performance in the communication-
a DDR InfiniBand network, using only two of the eight
intensive tasks of an MD simulation. A -node Anton
cores on each node in order to maximize network band-
machine, for example, performs a  ×  × , spatially
width per core []. (Using more nodes, or more cores
distributed D FFT in under  μs, an order of magni-
per node, leads to a decrease in performance as a result
tude faster than the contemporary implementations in
of an increase in communication requirements.) Due
the literature [].
to considerations related to the efficient utilization of
resources, however, neither Desmond nor other high-
Software performance MD codes for commodity clusters are typ-
Although Anton’s hardware architecture incorporates ically run at such a high level of parallelism, or in a
substantial flexibility, it was designed to perform vari- configuration with most cores on each node idle. In
ants of a single application: molecular dynamics. practice, the performance realized in such cluster-based
Anton’s software architecture exploits this fact to maxi- simulations is generally limited to speeds on the order of
mize application performance by eliminating many of  ns/day. The previously published simulations listed
the layers of a traditional software stack. The Anton in Table , for example, ran at  ns/day or less – over
ASICs, for example, do not run a traditional operating two orders of magnitude short of the performance we
system; instead, the control cores of the flexible sub- have demonstrated on Anton.
system run a loader that installs code and data on the Anton machines with fewer than  nodes may
machine, simulates for a time, then unloads the results prove more cost effective when simulating certain
of the completed simulation segment. smaller chemical systems. A -node Anton machine
Programming Anton is complicated by the hetero- can be partitioned, for example, into four -node
geneous nature of its computational units. The geome- machines, each of which achieves . μs/day on the
try cores of the flexible subsystem are programmed in DHFR system – well over % of the . μs/day
 A Anton, A Special-Purpose Molecular Simulation Machine

20

Performance (simulated μ s/day)


Water only
gpW
Protein in water
15 DHFR

10 aSFP

5 NADHOx
FtsZ T7Lig

0
0 20 40 60 80 100 120
Thousands of atoms

Anton: A Special-Purpose Molecular Simulation Machine. Fig.  Performance of a -node Anton machine for
chemical systems of different sizes. All simulations used .-femtosecond time steps with long-range interactions
evaluated at every other time step; additional simulation parameters can be found in Table  of reference []

achieved when parallelizing the same simulation across an MD simulation, and many other high-performance
all  nodes. Configurations with more than  computing problems, can be accelerated through par-
nodes deliver increased performance for larger chem- allelization. The design of Anton broke with com-
ical systems, but do not benefit chemical systems with modity designs, embraced specialized architecture and
only a few thousand atoms, for which the increase co-designed algorithms, and achieved a three-order-
in communication latency outweighs the increase in of-magnitude speedup over a development period of
parallelism. approximately five years. Anton has thus given sci-
Prior to Anton’s completion, few reported all-atom entists, for the first time, the ability to perform MD
protein simulations had reached  μs, the longest being simulations on the order of a millisecond – 
a -μs simulation that took over  months on the times longer than any atomically detailed simula-
NCSA Abe supercomputer [] (Table ). On June , tion previously reported on either general-purpose
, Anton completed the first millisecond-long sim- or special-purpose hardware. For computer archi-
ulation – more than  times longer than any reported tects, Anton’s level of performance raises the questions
previously. This ,-μs simulation modeled a pro- of which other high-performance computing prob-
tein called bovine pancreatic trypsin inhibitor (BPTI) lems might be similarly accelerated, and whether the
(Fig. ), which had been the subject of many previ- economic or scientific benefits of such acceleration
ous MD simulations; in fact, the first MD study of would justify building specialized machines for those
a protein, published in  [], simulated BPTI for problems.
. ps. The Anton simulation, which was over  mil- In its first two years of operation, Anton has
lion times longer, revealed unanticipated behavior that begun to serve as a “computational microscope,” allow-
was not evident at previously accessible timescales, ing the observation of biomolecular processes that
including transitions among several distinct structural have been inaccessible to laboratory experiments and
states [, ]. that were previously well beyond the reach of com-
puter simulation. Anton has revealed, for example,
the atomic-level mechanisms by which certain pro-
Future Directions teins fold (Fig. ) and the structural dynamics under-
Commodity computing benefits from economies of lying the function of important drug targets [,
scale but imposes limitations on the extent to which ]. Anton’s predictive power has been demonstrated
Anton, A Special-Purpose Molecular Simulation Machine A 

Anton: A Special-Purpose Molecular Simulation Machine. Fig.  Two renderings of a protein (BPTI) taken from a
molecular dynamics simulation on Anton. (a) The entire simulated system, with each atom of the protein represented by a
sphere and the surrounding water represented by thin lines. For clarity, water molecules in front of the protein are not
pictured. (b) A “cartoon” rendering showing important structural elements of the protein (secondary and tertiary
structure)

Related Entries
Amdahl’s Law
Distributed-Memory Multiprocessor
GRAPE
IBM Blue Gene Supercomputer
NAMD (NAnoscale Molecular Dynamics)
N-Body Computational Methods
a t = 5 μs b t = 25 μs c t = 50 μs
QCDSP and QCDOC Computers
Anton: A Special-Purpose Molecular Simulation
Machine. Fig.  Unfolding and folding events in a -μs Bibliographic Notes and Further
simulation of the protein gpW, at a temperature that Reading
equally favors the folded and unfolded states. Panel (a) The first Anton machine was developed at D. E. Shaw
shows a snapshot of a folded structure early in the Research between  and . The overall archi-
simulation, (b) is a snapshot after the protein has partially tecture was described in [], with the HTIS, the
unfolded, and (c) is a snapshot after it has folded again. flexible subsystem, and the communication subsystem
Anton has also simulated the folding of several proteins described in more detail in [], [], and [], respec-
from a completely extended state to the experimentally tively. Initial performance results were presented in [],
observed folded state and initial scientific results in [] and []. Other aspects
of the Anton architecture, software, and design process
are described in several additional papers [, , , ].
through comparison with experimental observations A number of previous projects built specialized
[, ]. Anton thus provides a powerful complement hardware for MD simulation, including MD-GRAPE
to laboratory experiments in the investigation of funda- [], MD Engine [], and FASTRUN []. Extensive
mental biological processes, and holds promise as a tool effort has focused on efficient parallelization of MD
for the design of safe, effective, precisely targeted drugs. on general-purpose architectures, including IBM’s Blue
 A Anton, A Special-Purpose Molecular Simulation Machine

Gene [] and commodity clusters [, , ]. More Hierarchical simulation-based verification of Anton, a special-
recently, MD has also been ported to the Cell BE pro- purpose parallel machine. In: Proceedings of the th IEEE inter-
national conference on computer design (ICCD ’), Lake Tahoe
cessor [] and to GPUs [].
. Grossman JP, Young C, Bank JA, Mackenzie K, Ierardi DJ, Salmon
JK, Dror RO, Shaw DE () Simulation and embedded soft-
Bibliography ware development for Anton, a parallel machine with heteroge-
. Bhatele A, Kumar S, Mei C, Phillips JC, Zheng G, Kalé LV neous multicore ASICs. In: Proceedings of the th IEEE/ACM/
() Overcoming scaling challenges in biomolecular simula- IFIP international conference on hardware/software codesign
tions across multiple platforms. In: Proceedings of the IEEE inter- and system synthesis (CODES/ISSS ’)
national parallel and distributed processing symposium, Miami . Hess B, Kutzner C, van der Spoel D, Lindahl E () GROMACS
. Bowers KJ, Chow E, Xu H, Dror RO, Eastwood MP, Gregersen : algorithms for highly efficient, load-balanced, and scalable
BA, Klepeis JL, Kolossváry I, Moraes MA, Sacerdoti FD, Salmon molecular simulation. J Chem Theor Comput :–
JK, Shan Y, Shaw DE () Scalable algorithms for molecular . Ho CR, Theobald M, Batson B, Grossman JP, Wang SC,
dynamics simulations on commodity clusters. In: Proceedings Gagliardo J, Deneroff MM, Dror RO, Shaw DE () Post-silicon
of the ACM/IEEE conference on supercomputing (SC). IEEE, debug using formal verification waypoints. In: Proceedings of the
New York design and verification conference and exhibition (DVCon ’),
. Chow E, Rendleman CA, Bowers KJ, Dror RO, Hughes DH, San Jose
Gullingsrud J, Sacerdoti FD, Shaw DE () Desmond per- . Khalili-Araghi F, Gumbart J, Wen P-C, Sotomayor M, Tajkhorshid
formance on a cluster of multicore processors. D. E. Shaw E, Shulten K () Molecular dynamics simulations of mem-
Research Technical Report DESRES/TR--, New York. brane channels and transporters. Curr Opin Struct Biol :–
http://deshawresearch.com . Klepeis JL, Lindorff-Larsen K, Dror RO, Shaw DE () Long-
. Dror RO, Arlow DH, Borhani DW, Jensen MØ, Piana S, Shaw timescale molecular dynamics simulations of protein structure
DE () Identification of two distinct inactive conformations of and function. Curr Opin Struct Biol :–
the β  -adrenergic receptor reconciles structural and biochemical . Kuskin JS, Young C, Grossman JP, Batson B, Deneroff MM, Dror
observations. Proc Natl Acad Sci USA :– RO, Shaw DE () Incorporating flexibility in Anton, a special-
. Dror RO, Grossman JP, Mackenzie KM, Towles B, Chow E, ized machine for molecular dynamics simulation. In: Proceedings
Salmon JK, Young C, Bank JA, Batson B, Deneroff MM, Kuskin of the th annual international symposium on high-performance
JS, Larson RH, Moraes MA, Shaw DE () Exploiting - computer architecture (HPCA ’). IEEE, New York
nanosecond end-to-end communication latency on Anton. In: . Larson RH, Salmon JK, Dror RO, Deneroff MM, Young C, Gross-
Proceedings of the conference for high performance computing, man JP, Shan Y, Klepeis JL, Shaw DE () High-throughput
networking, storage and analysis (SC). IEEE, New York pairwise point interactions in Anton, a specialized machine
. Ensign DL, Kasson PM, Pande VS () Heterogeneity even for molecular dynamics simulation. In: Proceedings of the th
at the speed limit of folding: large-scale molecular dynamics annual international symposium on high-performance computer
study of a fast-folding variant of the villin headpiece. J Mol architecture (HPCA ’). IEEE, New York
Biol :– . Luttman E, Ensign DL, Vishal V, Houston M, Rimon N, Øland
. Fine RD, Dimmler G, Levinthal C () FASTRUN: a special J, Jayachandran G, Friedrichs MS, Pande VS () Acceler-
purpose, hardwired computer for molecular simulation. Proteins ating molecular dynamic simulation on the cell processor and
:– PlayStation . J Comput Chem :–
. Fitch BG, Rayshubskiy A, Eleftheriou M, Ward TJC, Giampapa . Martinez-Mayorga K, Pitman MC, Grossfield A, Feller SE, Brown
ME, Pitman MC, Pitera JW, Swope WC, Germain RS () Blue MF () Retinal counterion switch mechanism in vision evalu-
Matter: scaling of N-body simulations to one atom per node. IBM ated by molecular simulations. J Am Chem Soc :–
J Res Dev : . McCammon JA, Gelin BR, Karplus M () Dynamics of folded
. Freddolino PL, Liu F, Gruebele MH, Schulten K () Ten- proteins. Nature :–
microsecond MD simulation of a fast-folding WW domain. Bio- . Pande VS, Baker I, Chapman J, Elmer SP, Khaliq S, Larson SM,
phys J :L–L Rhee YM, Shirts MR, Snow CD, Sorin EJ, Zagrovic B ()
. Freddolino PL, Park S, Roux B, Schulten K () Force field bias Atomistic protein folding simulations on the submillisecond
in protein folding simulations. Biophys J :– time scale using worldwide distributed computing. Biopolymers
. Freddolino P, Schulten K () Common structural transitions :–
in explicit-solvent simulations of villin headpiece folding. Bio- . Piana S, Sarkar K, Lindorff-Larsen K, Guo M, Gruebele M, Shaw
phys J :– DE () Computational design and experimental testing of the
. Grossfield A, Pitman MC, Feller SE, Soubias O, Gawrisch K fastest-folding β-sheet protein. J Mol Biol :–
() Internal hydration increases during activation of the . Rosenbaum DM, Zhang C, Lyons JA, Holl R, Aragao D, Arlow
G-protein-coupled receptor rhodopsin. J Mol Biol :– DH, Rasmussen SGF, Choi H-J, DeVree BT, Sunahara RK,
. Grossman JP, Salmon JK, Ho CR, Ierardi DJ, Towles B, Bat- Chae PS, Gellman SH, Dror RO, Shaw DE, Weis WI, Caffrey M,
son B, Spengler J, Wang SC, Mueller R, Theobald M, Young Gmeiner P, Kobilka BK () Structure and function of an irre-
C, Gagliardo J, Deneroff MM, Dror RO, Shaw DE () versible agonist-β  adrenoceptor complex. Nature :–
Array Languages A 

. Shan Y, Klepeis JL, Eastwood MP, Dror RO, Shaw DE ()
Gaussian split Ewald: a fast Ewald mesh method for molecular Architecture Independence A
simulation. J Chem Phys :
. Shaw DE () A fast, scalable method for the parallel evalua- Network Obliviousness
tion of distance-limited pairwise particle interactions. J Comput
Chem :–
. Shaw DE, Deneroff MM, Dror RO, Kuskin JS, Larson RH, Salmon
JK, Young C, Batson B, Bowers KJ, Chao JC, Eastwood MP,
Gagliardo J, Grossman JP, Ho CR, Ierardi DJ, Kolossváry I, Klepeis Area-Universal Networks
JL, Layman T, McLeavey C, Moraes MA, Mueller R, Priest EC,
Shan Y, Spengler J, Theobald M, Towles B, Wang SC () Anton: Universality in VLSI Computation
a special-purpose machine for molecular dynamics simulation.
In: Proceedings of the th annual international symposium on
computer architecture (ISCA ’). ACM, New York
. Shaw DE, Dror RO, Salmon JK, Grossman JP, Mackenzie
KM, Bank JA, Young C, Deneroff MM, Batson B, Bowers Array Languages
KJ, Chow E, Eastwood MP, Ierardi DJ, Klepeis JL, Kuskin JS,
Larson RH, Lindorff-Larsen K, Maragakis P, Moraes MA, Piana Calvin Lin
S, Shan Y, Towles B () Millisecond-scale molecular dynam- University of Texas at Austin, Austin, TX, USA
ics simulations on Anton. In: Proceedings of the conference for
high performance computing, networking, storage and analysis
(SC). ACM, New York
. Shaw DE, Maragakis P, Lindorff-Larsen K, Piana S, Dror RO, East-
Definition
wood MP, Bank JA, Jumper JM, Salmon JK, Shan Y, Wriggers W An array language is a programming language that sup-
() Atomic-level characterization of the structural dynamics ports the manipulation of entire arrays – or portions of
of proteins. Science :– arrays – as a basic unit of operation.
. Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG,
Schulten K () Accelerating molecular modeling applications
with graphics processors. J Comput Chem :– Discussion
. Taiji M, Narumi T, Ohno Y, Futatsugi N, Suengaga A, Takada
Array languages provide two primary benefits: () They
N, Konagaya A () Protein Explorer: a petaflops special-
purpose computer system for molecular dynamics simulations.
raise the level of abstraction, providing conciseness and
In: Proceedings of the ACM/IEEE conference on supercomputing programming convenience; () they provide a natural
(SC ’), Phoenix, AZ. ACM, New York source of data parallelism, because the multiple ele-
. Toyoda S, Miyagawa H, Kitamura K, Amisaki T, Hashimoto E, ments of an array can typically be manipulated concur-
Ikeda H, Kusumi A, Miyakawa N () Development of MD rently. Both benefits derive from the removal of control
Engine: high-speed accelerator with parallel processor design for
flow. For example, the following array statement assigns
molecuar dynamics simulations. J Comput Chem :–
. Young C, Bank JA, Dror RO, Grossman JP, Salmon JK, Shaw
each element of the B array to its corresponding element
DE () A  ×  × , spatially distributed D FFT in four in the A array:
microseconds on Anton. In: Proceedings of the conference for
high performance computing, networking, storage and analysis
A := B;
(SC). ACM, New York This same expression could be expressed in a scalar
language using an explicit looping construct:
for i := 1 to n
Application-Specific Integrated for j := 1 to n
Circuits A[i][j] := B[i][j];
VLSI Computation The array statement is conceptually simpler because
it removes the need to iterate over individual array
elements, which includes the need to name individual
Applications and Parallelism elements. At the same time, the array expression admits
more parallelism because it does not over-specify the
Computational Sciences order in which the pairs of elements are evaluated; thus,
 A Array Languages

the elements of the B array can be assigned to the B(2:n,2:n) = (A(1:n-1,2:n)+


elements of the A array in any order, provided that A(3:n+1,2:n)+
they obey array language semantics. (Standard array A(2:n, 1:n-1)+
language semantics dictate that the righthand side of A(2:n, 3:n+3)+
an assignment be fully evaluated before its value is
The problem becomes much worse, of course, for
assigned to the variable on the lefthand side of the
higher-dimensional arrays.
assignment.)
The C∗ language [], which was designed for
For the above example, the convenience of array
the Connection machine, simplifies array indexing by
operators appears to be minimal, but in the context of
defining indices that are relative to each array element.
a parallel computer, the benefits to programmer pro-
For example, the core of the Jacobi iteration can be
ductivity can be substantial, because a compiler can
expressed in C∗ as follows, where active is an array
translate the above array statement to efficient paral-
that specifies the elements of the array where the com-
lel code, freeing the programmer from having to deal
putation should take place:
with many low-level details that would be necessary if
writing in a lower-level language, such as MPI. In partic- where (active)
ular, the compiler can partition the work and compute {
loop bounds, handling the messy cases where values do B = ([.-1][.]A + [.+1][.]A
not divide evenly by the number of processors. Further- + [.][.-1]A + [.][.+1]A)/4;
more, if the compiler can statically identify locations }
where communication must take place, it can allocate
extra memory to cache communicated values, and it can C∗ ’s relative indexing is less error prone than slices:
insert communication where necessary, including any Each relative index focuses attention on the differences
necessary marshalling of noncontiguous data. Even for among the array references, so it is clear that the first two
shared memory machines, the burden of partitioning array references refer to the neighbors above and below
work, inserting appropriate synchronization, etc., can each element and that the last two array references refer
be substantial. to the neighbors to the left and right of each element.
The ZPL language [] further raises the level of
abstraction by introducing the notion of a region to rep-
Array Indexing resent index sets. Regions are a first-class language con-
Array languages can be characterized by the mecha- struct, so regions can be named and manipulated. The
nisms that they provide for referring to portions of ability to name regions is important because – for soft-
an array. The first array language, APL, provided no ware maintenance and readability reasons – descriptive
method of accessing portions of an array. Instead, in names are preferable to constant values. The ability to
APL, all operators are applied to all elements of their manipulate regions is important because – like C∗ ’s rel-
array operands. ative indexing – it defines new index sets relative to
Languages such as Fortran  use array slices to con- existing index sets, thereby highlighting the relationship
cisely specify indices for each dimension of an array. For between the new index set and the old index set.
example, the following statement assigns the upper left To express the Jacobi iteration in ZPL, programmers
 ×  corner of array A to the upper left corner of array B. would use the At operator (@) to translate an index set
by a named vector:
B(1:3,1:3) = A(1:3,1:3)
[R] B := (A@north + A@south +
The problem with slices is that they introduce con-
A@west + A@east)/4;
siderable redundancy, and they force the program-
mer to perform index calculations, which can be error The above code assumes that the programmer has
prone. For example, consider the Jacobi iteration, which defined the region R to represent the index set
computes for each array element the average of its four [:n][:n], has defined north to be a vector whose
nearest neighbors: value is [−, ], has defined south to be a vector
Array Languages A 

whose value is [+,], and so forth. Given these defini- irregular grids. In particular, FIDIL supports index sets,
A
tions, the region R provides a base index set [:n][:n] known as domains, which are first-class objects that
for every reference to a two-dimensional array in this can be manipulated through union, intersection, and
statement, so R applies to every occurrence of A in difference operators.
this statement. The direction north shifts this base
index set by − in the first dimension, etc. Thus, the
above statement has the same meaning as the Fortran Array Operators
 slice notation, but it uses named values instead of Array languages can also be characterized by the set of
hard-coded constants. array operators that they provide. All array languages
ZPL provides other region operators to support support elementwise operations, which are the natu-
other common cases, and it uses regions in other ways, ral extension of scalar operators to array operands. The
for example, to declare array variables. More signifi- Array Indexing section showed examples of the elemen-
cantly, regions are quite general, as they can represent twise + and = operators.
sparse index sets and hierarchical index sets. While element-wise operators are useful, at some
More recently, the Chapel language from Cray [] point, data parallel computations need to be combined
provides an elegant form of relative indexing that can or summarized, so most array languages provide reduc-
take multiple forms. For example, the below Chapel tion operators, which combine – or reduce – multiple
code looks almost identical to the ZPL equivalent, values of an array into a single scalar value. For exam-
except that it includes the base region, R, in each array ple, the values of an array can be reduced to a scalar
index expression: by summing them or by computing their maximum or
minimum value.
T[R] = (A[R+north] + A[R+south]+
A[R+east] + A[R+west])/4.0; Some languages provide additional power by allow-
ing reductions to be applied to a subset of an array’s
Alternatively, the expression could be written from the dimensions. For example, each row of values in a two-
perspective of a single element in the index set: dimensional array can be reduced to produce a sin-
[ij in R] T(ij) = (A(ij+north) + gle column of values. In general, a partial reduction
A(ij+south) + accepts an n-dimensional array of values and produces
A(ij+east) + an m-dimensional array of values, where m < n. This
A(ij+west))/4.0; construct can be further generalized by allowing the
programmer to specify an arbitrary associative and
The above formulation is significant because it allows commutative function as the reduction operator.
the variable ij to represent an element of an Parallel prefix operators – also known as scan oper-
arbitrary tuple, where the tuple (referred to as a domain ators – are an extension of reductions that produce an
in Chapel) could represent many different types of index array of partial values instead of a single scalar value.
sets, including sparse index sets, hierarchical index sets, For example, the prefix sum accepts as input n values
or even nodes of a graph. and computes all sums, x + x + x + . . . + xk for  ≤
Finally, the following Chapel code fragment shows k ≤ n. Other parallel prefix operations are produced
that the tuple can be decomposed into its constituent by replacing the + operator with some other associative
parts, which allows arbitrary arithmetic to be per- operator. (When applied to multi-dimensional arrays,
formed on each part separately: the array indices are linearized in some well-defined
manner, e.g., using Row Major Order.)
[(i,j) in R] A(ij) = (A(i-1,j)+
The parallel prefix operator is quite powerful
A(i+1,j)+
because it provides a general mechanism for paralleliz-
A(i,j-1)+
ing computations that might seem to require sequential
A(i,j+1))/4.0;
iteration. In particular, sequential loop iterations that
The FIDIL language [] represents an early attempt accumulate information as they iterate can typically be
to raise the level of abstraction, providing support for solved using a parallel prefix.
 A Array Languages

Given the ability to index an individual array ele- irregular pointer-based data structures are also needed.
ment, programmers can directly implement their own In principle, Chapel’s domain construct supports all of
reduction and scan code, but there are several bene- these extensions of array languages, but further research
fits of language support. First, reduction and scan are is required to fully support these data structures and
common abstractions, so good linguistic support makes to provide good performance. Second, it is important
these abstractions easier to read and write. Second, to integrate task parallelism with data parallelism, as is
language support allows their implementations to be being explored in languages such as Chapel and X.
customized to the target machine. Third, compiler sup-
port introduces nontrivial opportunities for optimiza- Related Entries
tion: When multiple reductions or scans are performed Chapel (Cray Inc. HPCS Language)
in sequence, their communication components can be Fortran  and Its Successors
combined to reduce communication costs. HPF (High Performance Fortran)
Languages such as Fortran  and APL also pro- NESL
vide additional operators for flattening, re-shaping, and ZPL
manipulating arrays in other powerful ways. These lan-
guages also provide operators such as matrix multiplica-
tion and matrix transpose that treat arrays as matrices. Bibliographic Notes and Further
The inclusion of matrix operations blurs the distinction Reading
between array languages and matrix languages, which The first array language, APL [], was developed in the
are described in the next section. early s and has often been referred to as the first
write-only language because of its terse, complex nature.
Matrix Languages Subsequent array languages that extended
Array languages should not be confused with matrix more conventional languages began to appear in the
languages, such as Matlab []. An array is a program- late s. For example, extensions of imperative lan-
ming language construct that has many uses. By con- guages include C∗ [], FIDIL [], Dataparallel C [],
trast, a matrix is a mathematical concept that carries and Fortran  []. NESL [] is a functional language
additional semantic meaning. To understand this dis- that includes support for nested one dimensional arrays.
tinction, consider the following statement: In the early s, High Performance Fortran (HPF)
A = B * C; was a data parallel language that extended Fortran 
and Fortran  to provide directives about data distri-
In an array language, the above statement assigns to bution. At about the same time, the ZPL language []
A the element-wise product of B and C. In a matrix showed that a more abstract notion of an array’s index
language, the statement multiples B and C. set could lead to clear and concise programs.
The most popular matrix language, Matlab, was More recently, the DARPA-funded High-
originally designed as a convenient interactive interface Productivity Computing Systems project led to the
to numeric libraries, such as EISPACK and LINPACK, development of Chapel [] and X [], which both inte-
that encourages exploration. Thus, for example, there grate array languages with support for task parallelism.
are no variable declarations. These interactive features Ladner and Fischer [] presented key ideas of the
make Matlab difficult to parallelize, because they inhibit parallel prefix algorithm, and Blelloch [] elegantly
the compiler’s ability to carefully communication the demonstrated the power of the scan operator for array
computation and communication. languages.

Future Directions
Bibliography
Array language support can be extended in two dimen-
. Adams JC, Brainerd WS, Martin JT, Smith BT, Wagener JL ()
sions. First, the restriction to flat dense arrays is too Fortran  handbook. McGraw-Hill, New York
limiting for many computations, so language support . Blelloch G () Programming parallel algorithms. Comm ACM
for sparsely populated arrays, hierarchical arrays, and ():–
Array Languages, Compiler Techniques for A 

. Blelloch GE () NESL: a nested data-parallel language. Tech- array operations and intrinsic functions to deliver per-
nical Report CMUCS--, School of Computer Science,
A
formance for parallel architectures, such as vector pro-
Carnegie Mellon University, Pittsburgh, PA, January  cessors, multi-processors and VLIW processors.
. Chamberlain BL () The design and implementation of a
region-based parallel language. PhD thesis, University of Wash-
ington, Department of Computer Science and Engineering, Discussion
Seattle, WA
. Chamberlain BL, Callahan D, Zima HP () Parallel pro- Introduction to Array Languages
grammability and the Chapel language. Int J High Perform Com- There are several programming languages providing
put Appl ():–
a rich set of array operations and intrinsic functions
. Ebcioglu K, Saraswat V, Sarkar V () X: programming
for hierarchical parallelism and non-uniform data access. In:
along with array constructs to assist data-parallel pro-
International Workshop on Language Runtimes, OOPSLA , gramming. They include Fortran , High Perfor-
Vancouver, BC mance Fortran (HPF), APL and MATLAB, etc. Most
. Amos Gilat () MATLAB: an introduction with applications, of them provide programmers array intrinsic functions
nd edn. Wiley, New York and array operations to manipulate data elements of
. Hatcher PJ, Quinn MJ () Data-parallel programming on
multidimensional arrays concurrently without requir-
MIMD computers. MIT Press, Cambridge, MA
. Hilfinger PN, Colella P () FIDIL: a language for scientific pro- ing iterative statements. Among these array languages,
gramming. Technical Report UCRL-, Lawrence Livermore Fortran  is a typical example, which consists of an
National Laboratory, Livermore, CA, January  extensive set of array operations and intrinsic functions
. Iverson K () A programming language. Wiley, New York as shown in Table . In the following paragraphs, sev-
. Ladner RE, Fischer MJ () Parallel prefix computation. JACM
eral examples will be provided to bring readers basic
():–
. Lin C, Snyder L () ZPL: an array sublanguage. In: Banerjee U,
information about array operations and intrinsic func-
Gelernter D, Nicolau A, Padua D (eds) Languages and compilers tions supported by Fortran . Though in the examples
for parallel computing. Springer-Verlag, New York, pp – only Fortran  array operations are used for illustra-
. Rose JR, Steele Jr GL () C∗ : an extended C language for tion, the array programming concepts and compiler
data parallel programming. In: nd International Conference on techniques are applicable to common array languages.
Supercomputing, Santa Clara, CA, March 
Fortran  extends former Fortran language features
to allow a variety of scalar operations and intrinsic func-
tions to be applied to arrays. These array operations and
intrinsic functions take array objects as inputs, perform
a specific operation on array elements concurrently, and
Array Languages, Compiler return results in scalars or arrays. The code fragment
Techniques for below is an array accumulation example, which involves
two two-dimensional arrays and one array-add opera-
Jenq-Kuen Lee , Rong-Guey Chang , Chi-Bang Kuan tion. In this example, all array elements in the array S is

National Tsing-Hua University, Hsin-Chu, Taiwan
 going to be updated by accumulating by those of array
National Chung Cheng University, Chia-Yi, Taiwan
A in corresponding positions. The accumulation result,
also a two-dimensional  ×  array, is at last stored
back to array S, in which each data element contains
Synonyms element-wise sum of array A and array S.
Compiler optimizations for array languages
integer S( , ) , A( , )
Definition S = S + A
Compiler techniques for array languages generally
include compiler supports, optimizations and code gen- Besides primitive array operations, Fortran  also
eration for programs expressed or annotated by all provides programmers a set of array intrinsic functions
kinds of array languages. These compiler techniques to manipulate array objects. These intrinsic functions
mainly take advantage of data-parallelism explicit in listed in Table  include functions for data movement,
 A Array Languages, Compiler Techniques for

Array Languages, Compiler Techniques for. Table  Array The second example presents a way to reorganize
intrinsic functions in Fortran  data elements within an input array with array intrinsic
Array intrinsics Functionality functions. The circular-shift (CSHIFT) intrinsic func-
CSHIFT Circular-shift elements of the input tion in Fortran  performs data movement over an
array along one specified dimension input array along a given dimension. Given a two-
DOT_PRODUCT Compute dot-product of two input dimensional array, it can be considered to shift data
arrays as two vectors elements of the input array to the left or right, up
EOSHIFT End-off shift elements of the input or down in a circular manner. The first argument of
array along one specified dimension CSHIFT indicates the input array to be shifted while
MATMUL Matrix multiplication the rest two arguments specify the shift amount and
MERGE Combine two conforming arrays under the dimension along which data are shifted. In the code
the control of an input mask fragment below, one two-dimensional array, A, is going
PACK Pack an array under the control of an to be shifted by one-element offset along the first dimen-
input mask
sion. If the initial contents of array A are labeled as the
Reduction Reduce an array by one specified left-hand side of Fig. , after the circular-shift its data
dimension and operator
contents will be moved to new positions as shown in the
RESHAPE Construct an array of a specified shape right-hand side of Fig. , where the first row of A sinks
from elements of the input array
to the bottom and the rest move upward by one-element
SPREAD Replicate an array by adding one
offset.
dimension
Section move Perform data movement over a region integer A( , )
of the input array
TRANSPOSE Matrix transposition A = CSHIFT(A ,  , )
UNPACK Unpack an array under the control of
an input mask
Compiler Techniques for Array Languages
So far, readers may have experienced the concise rep-
resentation of array operations and intrinsic functions.
matrix multiplication, array reduction, compaction, etc.
The advantages brought by array operations mainly
In the following paragraphs, two examples using array
focus on exposing data parallelism to compilers for con-
intrinsic functions are provided: the first one with
current data processing. The exposed data parallelism
array reduction and the second one with array data
can be used by compilers to generate efficient code
movement.
to be executed on parallel architectures, such as vec-
The first example, shown in the code fragment
tor processors, multi-processors and VLIW processors.
below, presents a way to reduce an array with a spe-
Array operations provide abundant parallelism to be
cific operator. The SUM intrinsic function, one instance
exploited by compilers for delivering performance on
of array reduction functions in Fortran , sums up an
those parallel processors. In the following paragraphs,
input array along a specified dimension. In this exam-
two compiler techniques on compiling Fortran  array
ple, a two-dimension  ×  array A, composed of total
operations will be elaborated to readers. Though these
 elements, is passed to SUM with its first dimension
techniques are originally developed for Fortran , they
specified as the target for reduction. After reduction by
are also applicable to common array languages, such as
SUM, it is expected that the SUM will return a one-
Matlab and APL.
dimension array consisting of four elements, each of
The first technique covers array operation synthe-
them corresponds to a sum of the array elements in the
sis for consecutive array operations, which treats array
first dimension.
intrinsic functions as mathematical functions. In the
i n t e g e r A( , ), S() synthesis technique, each array intrinsic function has
its data access function that specifies its data access pat-
S = SUM(A , ) tern and the mapping relationship between its input and
Array Languages, Compiler Techniques for A 

Consecutive Array Operations


11 12 13 14 21 22 23 24 A
As shown in the previous examples, array operations
21 22 23 24 31 32 33 34
A= A′ = take one or more arrays as inputs, conduct a specific
31 32 33 34 41 42 43 44 operation to them, and return results in an array or a
41 42 43 44 11 12 13 14
scalar value. For more advanced usage, multiple array
operations can be cascaded to express compound com-
Array Languages, Compiler Techniques for. Fig.  Array putation over arrays. An array operation within the con-
contents change after CSHIFT (shift amount = , secutive array operations takes inputs either from input
dimension = ) arrays or intermediate results from others, processing
data elements and passing its results to the next. In this
way, they conceptually describe a particular relationship
output arrays. Through providing data access functions between source arrays and target arrays.
for all array operations, this technique can synthesize The following code fragment is an example with
multiple array operations into a composite data access consecutive array operations, which involves three array
function, which expresses data accesses and computa- operations, TRANSPOSE, RESHAPE, and CSHIFT, and
tion to generate the target arrays from source arrays. In three arrays, A(, ), B(, ), and C(). The cascaded
this way, compiling programs directly by the composite array operations at first transpose array A and reshape
data access function can greatly improve performance array C into two  ×  matrices, afterwards sum the
by reducing redundant data movement and temporary two  ×  intermediate results, and finally circular-shift
storages required for passing immediate results. the results along its first dimension. With this exam-
The second compiler technique concerns compiler ple, readers may experience the concise representation
supports and optimizations for sparse array programs. and power of compound array operations in the way
In contrast with dense arrays, sparse arrays consist manipulating arrays without iterative statements.
of much more zero elements than nonzero elements.
Having this characteristic, they are more applicable i n t e g e r A( , ) , B( , ), C()
than dense arrays in many scientific applications. For
example, sparse linear systems, such as Boeing–Harwell B = CSHIFT((TRANSPOSE(A)
matrix, are with popular usages. Similar to dense arrays, + RESHAPE(C , / , /)) ,  , )
there are demands for support of array operations
To compile programs with consecutive array oper-
and intrinsic functions to elaborate data parallelism
ations, one straightforward compilation may translate
in sparse matrices. Later on, we will have several each array operation into a parallel loop and create tem-
paragraphs illustrating how to support sparse array porary arrays to pass intermediate results used by rest
operations in Fortran , and we will also cover some array operations. Take the previous code fragment as an
optimizing techniques for sparse programs. example, at first, compilers will separate the TRANS-
POSE function from the consecutive array operations
Synthesis for Array Operations and create a temporary array T to keep the transposed
In the next paragraphs, an array operation synthe- results. Similarly, another array T is created for the
sis technique targeting on compiling Fortran  array RESHAPE function, and these two temporary arrays, T
operations is going to be elaborated. This synthesis tech- and T, are summed into another temporary array, T.
nique can be applied to programs that contain consec- At last, T is taken by the CSHIFT and used to produce
utive array operations. Array operations here include the final results in the target array B.
not only those of Fortran  but all array operations
that can be formalized into data access functions. With i n t e g e r A( , ) , B( , ) , C()
this technique, compilers can generate efficient codes i n t e g e r T( , ) , T( , ) , T( , )
by removing redundant data movement and temporary
storages, which are often introduced in compiling array T = TRANSPOSE(A)
programs. T = RESHAPE(C , / , /)
 A Array Languages, Compiler Techniques for

T = T + T segmented data accessing, which means data in the


B = CSHIFT(T ,  , ) arrays are accessed in parts or with strides. Due to seg-
mented data access, the data access functions cannot
This straightforward scheme is inefficient as it intro-
be represented in a single continuous form. Instead,
duces unnecessary data movement between temporary
they have to be described by multiple data access pat-
arrays and temporary storages for passing intermediate
terns, each of which covers has an disjointed array index
results. To compile array programs in a more efficient
range. To represent an array index range, a notation
way, array operation synthesis can be applied to obtain
called, segmentation descriptors, can be used, which are
a function F at compile-time such that B = F(A, C),
boolean predicates of the form:
which is the synthesized data access function of the
compound operations and it is functionally identical to ϕ(/fi (i , i , ⋯, in ), ⋯, fm (i , i , ⋯, in )/, /l : u : s ,
the original sequence of array operations. The synthe- l : u : s , ⋯, lm : um : sm /)
sized data access function specifies a direct mapping
from source arrays to target arrays, and can be used by where fi is an index function and li , ui , and si are
compilers to generate efficient code without introducing the lower bound, upper bound, and stride of the
temporary arrays. index function fi (i , i , ⋯, in ). The stride si can be
omitted if it is equal to one, representing contiguous
Data Access Functions data access. For example, the segmented descriptor
The concept of data access functions needs to be fur- ϕ(/i, j/, / : ,  : /) delimits the range of (i = ,
ther elaborated here. An array operation has its own j =  : ).
data access function that specifies element-wise map- After providing a notation for segmented index
ping between its input and output arrays. As array oper- ranges, let us go back to array operations with sin-
ations in Fortran  have various formats, there are gle source and segmented data access functions, such
different looks of data access functions. In the follow- as CSHIFT and EOSHIFT in Fortran . For an
ing paragraphs, total three types of data access functions array operation with an n-dimensional target array T
for different types of array operations will be provided and an m-dimensional source array S, annotated in
as the basis of array operation synthesis. T = Array_Operation(S), its data access function can
The first type of data access functions is for be represented as follows:
array operations that contain a single source array ⎧


⎪ S[f (i , i , ⋯, in ), f (i , i , ⋯, in ), ⋯,
and comply with continuous data accessing, such as ⎪




TRANSPOSE and SPREAD in Fortran . For an ⎪

⎪ fm (i , i , ⋯, in )] ∣ γ
array operation with n-dimensional target array T and ⎪




m-dimensional source array S, which can be annotated T[i , i , ⋯, in ] = ⎨ S[g (i , i , ⋯, in ), g (i , i , ⋯, in ), ⋯,



in T = Array_Operation(S), its data access function ⎪



⎪ gm (i , i , ⋯, in )] ∣ γ 
can be represented as equation (). In equation (), fi ⎪





in the right hand side is an index function represent- ⎪
⎩⋯

ing an array subscript of the source array S, where the ()
array subscripts from i to in are index variables of the where γ  and γ  are two separate array index ranges for
target array T. For example, the data access function index function fi and gi respectively. Take CSHIFT as
for T = TRANSPOSE(S) is as follows: T[i, j] = S[ j, i] an example, for an array operation B = CSHIFT(A, , )
where the index functions by definition are f (i, j) = j with the input A and output B are both  ×  arrays,
and f (i, j) = j. its data access function can be represented as equa-
T[i , i , ⋯, in ] = S[f (i , i , ⋯, in ), f (i , i , ⋯, in ), ⋯, tion (). In equation (), the data access function of
CSHIFT(A, , ) is divided into two parts, describing
fm (i , i , ⋯, in )] ()
different sections of the target array B will be computed
The second type of data access functions is for array by different formulas: for the index range, i = , j =  ∼ ,
operations that also have a single source array but with B[i, j] is assigned by the value of A[i−, j]; for the index
Array Languages, Compiler Techniques for A 

range, i =  ∼ , j =  ∼ , B[i, j] is assigned by the value i n t e g e r A(, ) , B(, ) , C()


A
of A[i + , j].
B = CSHIFT((TRANSPOSE(A)



⎪ + RESHAPE(C, /, /)) ,  , )
⎪A[i − , j] ∣ ϕ(/i, j/, / : ,  : /)
B[i, j] = ⎨ ()


⎪ In the first step, a parse tree is constructed for the
⎪A[i + , j] ∣ ϕ(/i, j/, / : ,  : /)
⎩ consecutive array operations. In the parse tree, source
arrays are leaf nodes while target arrays are roots, and
The third type of data access functions is for array each internal node corresponds to an array operation.
operations with multiple source arrays and continuous The parse tree for the running example is provided in
data access. For an array operation with k source arrays, Fig. . All temporary arrays created in the straightfor-
annotated as T = Array_Operation(S , S , ⋯, Sk ), its data ward compilation are arbitrarily given an unique name
access function can be represented as equation (), for identification. In the example, temporary arrays T,
where F is a k-nary function used to describe how the T and T are attached to nodes of “+,” TRANSPOSE,
desired output to be derived by k data elements from the and RESHAPE as their intermediate results. The root
input arrays. Each element from source arrays is associ- node is labeled with the target array, array B in this
j
ated with an index function of the form fi , where i and j example, which will contain the final results of the
indicates its input array number and dimension, respec- consecutive array operations.
tively. Take whole-array addition C(:,:) = A(:,:) + B(:,:) In the second step, data access functions are pro-
as an example, its data access function can be repre- vided for all array operations. In the running example,
sented in C[i, j] = F(A[i, j], B[i, j]), where F(x, y) = the data access function of T = TRANSPOSE(A) is
x + y. T[i, j] = A[ j, i]; the data access function of T =
RESHAPE(C, /, /) is T[i, j] = C[i + j ∗ ]; the
T[i , i , ⋯, in ] = F (S [f (i , i , ⋯, in ), data access function of the T = T + T is T[i, j] =
F (T[i, j], T[i, j]), where F (x, y) = x + y; the data
f (i , i , ⋯, in ), ⋯, fd (i , i , ⋯, in )] , ⋯, access function of B = CSHIFT(T) is as follows:
()
Sk [fk (i , i , ⋯, in ), fk (i , i , ⋯, in ), ⎧


⎪ T[i + , j] ∣ ϕ(/i, j/, / : ,  : /)

B[i, j] = ⎨ ()
⋯, fkdk (i , i , ⋯, in )]) ⎪


⎪ T[i − , j] ∣ ϕ(/i, j/, / : ,  : /)

In the last step, the collected data access functions
Synthesis of Array Operations are going to be synthesized, starting from the data access
After providing the definition of data access functions,
let us begin to elaborate the mechanism of array opera-
tion synthesis. For illustration, the synthesis framework CSHIFT B
can be roughly partitioned into three major steps:
. Build a parse tree for consecutive array + T3
operations
. Provide a data access function for each array
operation TRANSPOSE T1 RESHAPE T2
. Synthesize collected data access functions into a
composite one
A C
Throughout the three steps, the consecutive array oper-
ations shown in the previous example will again be used Array Languages, Compiler Techniques for. Fig.  The
as a running example to illustrate the synthesis flow. We parse tree for consecutive array operations in the running
replicate the code fragment here for reference. example
 A Array Languages, Compiler Techniques for

function of array operation at the root node. Through- j


out the synthesis process, every temporary array at 0 1 2 3

right-hand side (RHS) of access function is replaced 0


by the data access function that defines the temporary 1
array. During the substituting process, other tempo- i
2
rary arrays will continually appear in the RHS of the
updated data access function. This process repeats until 3
all temporary arrays in the RHS are replaced with source
arrays. B(i, j ) = A(j, i+1), + C(i+1+j *4)
For the running example, the synthesis process
B(i, j ) = A(j, i−3), + C(i −3+j *4)
begins with the data access function of CSHIFT for
it is the root node in the parse tree. The data access
Array Languages, Compiler Techniques for. Fig. 
function of CSHIFT is listed in Equation (), in which
Constituents of the target array are described by the
temporary array T appears in the right-hand side. To
synthesized data access function
substitute T, the data access function of T = T + T,
which specifies T[i, j] = F (T[i, j], T[i, j]) is of
our interest. After the substitution, it may result in the the target array B, gray blocks and white blocks in the
following data access function: Fig. , will be computed in different manners.
B[i, j] =
⎧ Code Generation with Data Access Functions


⎪F (T[i + , j], T[i + , j]) ∣ ϕ(/i, j/, / : ,  : /)
⎪ After deriving a composite data access function,




compilers can generate a parallel loop with FORALL
⎪ F (T[i − , j], T[i − , j]) ∣ ϕ(/i, j/, / : ,  : /)
⎩  statements for the consecutive array operations. For the
() running example, compilers may generate the pseudo
In the updated data access function, come out two code below, with no redundant data movement between
new temporary arrays T and T. Therefore, the pro- temporary arrays and therefore no additional storages
cess continues to substitute T with T[i, j] = A[ j, i] required.
(the data access function of TRANSPOSE) and T with integer A( ,) , B( ,) , C(   )
T[i, j] = C[i + j ∗ ] (the data access function of
RESHAPE). At the end of synthesis process, a composite FORALL i =  t o  , j =  t o 
data access function will be derived as follows, contain- IF (i, j) ∈ ϕ(/i, j/, / : ,  : /) THEN
ing only the target array B and the source array A and C B [ i , j ] =A[ j, i + ]+ C[ i +  + j ∗  ]
without any temporary arrays. IF (i, j) ∈ ϕ(/i, j/, / : ,  : /) THEN
B [ i , j ] =A[ j , i − ] +C[ i −+ j ∗ ]
B[i, j] =
⎧ END FORALL


⎪F (A[ j, i + ], C[i +  + j ∗ ]) ∣ ϕ(/i, j/, / : ,  : /)





⎪F (A[ j, i − ], C[i −  + j ∗ ]) ∣ ϕ(/i, j/, / : ,  : /) Optimizations on Index Ranges
⎩  The code generated in the previous step consists of only
()
one single nested loop. It is straightforward but ineffi-
The derived data access function describes a direct cient because it includes two if-statements to ensure that
and element-wise mapping from data elements in the different parts of the target array will be computed by
source arrays to those of the target array. The mapping different equations. These guarding statements incurred
described by the synthesized data access function can by segmented index ranges will be resolved at runtime
be visualized in Fig. . Since the synthesized data access and thus lead to performance degradation. To optimize
function consists of two formulas with different seg- programs with segmented index ranges the loop can be
mented index ranges, it indicates that two portions of further divided into multiple loops, each of which has a
Array Languages, Compiler Techniques for A 

0 <0,0,0,5,3,0> 0 0 0 0
A
0 <2,0,1,0,0,0> <0,0,9,3,0,0> 0 0 0

<9,0,0,0,6,0> 0 0 0 <1,0,6,2,3,5> 0

0 0 0 0 0 <0,2,0,3,4,0>

0 0 <1,0,0,0,7,2> 0 0 0

0 <2,4,2,8,0,0> 0 0 0 <0,0,0,7,3,4>

Array Languages, Compiler Techniques for. Fig.  A three-dimensional sparse array A(, , )

bound to cover its array access range. By the index range A(, , ) is depicted in Fig. . The sparse array A is
optimization, the loop for the running example can be a three-dimensional  ×  ×  matrix with its first two
divided into two loops for two disjointed access ranges. dimensions containing plenty of zero elements, which
At last, we conclude this section by providing efficient are zero-vectors in the rows and columns.
code as follows for the running example. It is efficient
in array data accessing and promising to deliver per- Compression and Distribution Schemes
formance. For more information about array operation for Sparse Arrays
synthesis, readers can refer to [–]. For their sparse property, sparse arrays are usually
represented in a compressed format to save compu-
i n t e g e r A(  ,  ) , B (  ,  ) , C (   )
tation and memory space. Besides, to process sparse
workload on distributed memory environments effi-
FORALL i =  t o  , j =  t o 
ciently sparse computation are often partitioned to
B [ i , j ] = A[ j , i +  ] + C[ i +  + j ∗ ]
be distributed to processors. For these reasons, both
END FORALL
compression and distribution schemes have to be con-
sidered in supporting parallel sparse computation.
FORALL i =  t o  , j =  t o 
Table  lists options of compression and distribution
B [ i , j ] =A[ j , i −  ] +C[ i −+ j ∗  ]
schemes for two-dimensional sparse arrays. There will
END FORALL
be more detailed descriptions for these compression
and distribution in upcoming paragraphs.
Support and Optimizations for Sparse
Array Operations Compression Schemes
In the next section, another compiler technique is going For a one-dimensional array, its compression can be
to be introduced, which targets array operation sup- either in a dense representation or in a pair-wise
ports and optimizations for sparse arrays. In contrast sparse representation. In the sparse representation, an
to dense arrays, sparse arrays consist of more zero ele- array containing index and value pairs are used to
ments than nonzero elements. With this characteristic, record nonzero elements, with no space for keeping
they are more applicable than dense arrays in many zero elements. For example, a one-dimensional array
data specific applications, such as network theory and ⟨, , , , , ⟩ can be compressed by the pair (, ) and
earthquake detection. Besides, they are also extensively (, ), representing two non-zero elements in the array,
used in scientific computation and numerical analysis. which are the fourth element equaling to five and the
To help readers understand sparse arrays, a sparse array fifth element equaling to three.
 A Array Languages, Compiler Techniques for

Array Languages, Compiler Techniques for. Table  Value 5 3


Compression and distribution schemes for Index 4 5 DA CO RO
two-dimensional arrays Value 2 1
Index 1 3 2 1
Compression scheme Distribution scheme
2 2
Value 9 3
Compressed row storage (CRS) (Block, *) 3 4
Index 3 4
1 6
Compressed column storage (CCS) (*, Block)
Value 9 6 5 7
Dense representation (Block, Block) Index 1 5 6 8
3
Value 1 6 2 3 5 10
2
Index 1 3 4 5 6 6
0 11 0 0 CRS CCS Value 2 3 4
34 0 21 0 Index 2 4 5
RO CO DA RO CO DA
0 97 0 9
Value 1 7 2
51 28 0 0 1 2 11 1 2 34 Index 1 5 6
CRS view
2 1 34 3 4 51 Value 2 4 2 8
Index 1 2 3 4
4 3 21 6 1 11
0 11 0 0 Value 7 3 4
6 2 97 7 3 97
34 0 21 0 Index 4 5 6
8 4 9 8 4 28
0 97 0 9
Array Languages, Compiler Techniques for. Fig.  A
51 28 0 0 1 51 2 21
compression scheme for the three-dimensional array in
CCS view 2 28 3 9
Fig. 

Array Languages, Compiler Techniques for. Fig.  CRS


and CCS compression schemes for a two-dimensional
array Compression schemes for higher-dimensional arrays
can be constructed by employing -d and -d sparse
arrays as bases to construct higher-dimensional arrays.
For higher-dimensional arrays, there are two com- Figure  shows a compression scheme for the three-
pression schemes for two-dimensional arrays, which dimensional sparse array in Fig. . The three-dimens-
are Compressed Row Storage (CRS) and Compressed ional sparse array is constructed by a two-dimensional
Column Storage (CCS). They have different ways to CRS structure with each element in the first level
compress a two-dimensional array, either by rows or by is a one-dimensional sparse array. Similarly, one can
columns. The CCS scheme regards a two-dimensional employ -d sparse arrays as bases to build four-
array as a one-dimensional array of its rows, whereas dimensional spare arrays. The representation will be a
the CRS scheme considers it as a one-dimensional array two-dimensional compressed structure with each DA
of its columns. Both of them use a CO and DA tuple, field in the structure also a two-dimensional com-
an index and data pair, to represent a nonzero element pressed structure.
in a one-dimensional array, a row in CRS or a column The code fragment below shows an instance to
in CCS. The CO fields correspond to index offsets of implement three-dimensional sparse arrays for both
nonzero values in a CRS row or in a CCS column. In compression schemes, in which each data field contains
addition to CO and DA pairs, both CRS and CCS rep- a real number. This derived sparsed_real data type
resentation contain a RO list that keeps value indexes can be used to declare sparse arrays with CRS or CCS
where each row or column starts. In Fig. , we show an compression schemes as the example shown in Fig. .
example to encode a two-dimensional array with CRS All array primitive operations such as +, −, ∗, etc., and
and CCS schemes, where total  non-zero values are array intrinsic functions applied on the derived type
recorded with their values in the DA fields and CCS can be overloaded to sparse implementations. In this
column or CRS row indexes in the CO fields. way, data stored in spare arrays can be manipulated as
Array Languages, Compiler Techniques for A 

they are in dense arrays. For example, one can conduct The following code fragment is used to illustrate
A
a matrix multiplication over two sparse arrays via a how to assign a distribution scheme to a sparse array
sparse matmul, in which only nonzero elements will be on a distributed environment. The three-dimensional
computed. sparse array in Fig.  is again used for explanation.
At first, the sparse array is declared as a -D sparse
type s p a r s e  d _ r e a l array by the derived type, sparsed_real. Next, the
type ( d e s c r i p t o r ) : : d bound function is used to specify the shape of the
i n t e g e r , p o i n t e r , dimension ( : ) : : RO sparse array, specifying its size of each dimension. After
i n t e g e r , p o i n t e r , dimension ( : ) : : CO shape binding, the sparse array is assigned to a (Block,
type ( s p a r s e  d _ r e a l ) , pointer , Block, ∗ ) distribution scheme through the distribution
dimension ( : ) : : DA function, by which the sparse array will be partitioned
end t y p e s p a r s e  d _ r e a l and distributed along its first two dimensions. Fig-
ure  shows the partition and distribution situation
over a distributed environment with four-processors,
Distribution Schemes where each partition covers data assigned to a target
In distributed memory environments, data distribution processor.
of sparse arrays needs to be considered. The distri-
bution schemes currently considered for sparse arrays use s p a r s e
are general block partitions based on the number of type ( s p a r s e  d _ r e a l ) : : A
nonzero elements. For two-dimensional arrays, there
are (Block, ∗ ), (∗ , Block) and (Block, Block) distribu- c a l l bound ( A ,  ,  ,  )
tions. In the (∗ , Block) scheme a sparse array is dis- c a l l d i s t r i b u t i o n (A , Block , Block , * )
tributed by rows, while in the (Block, ∗ ) distribution an ...
array is distributed by columns. For the (Block, Block)
distribution, an array is distributed both by rows and
columns. Similarly, distribution schemes for higher- Programming with Sparse Arrays
dimensional arrays can be realized by extending more With derived data types and operation overloading,
dimensions in data distribution options. sparse matrix computation can also be expressed as

0 <0,0,0,5,3,0> 0 0 0 0

0 <2,0,1,0,0,0> <0,0,9,3,0,0> 0 0 0

<9,0,0,0,6,0> 0 0 0 <1,0,6,2,3,5> 0

0 0 0 0 0 <0,2,0,3,4,0>

0 0 <1,0,0,0,7,2> 0 0 0

0 <2,4,2,8,0,0> 0 0 0 <0,0,0,7,3,4>

Array Languages, Compiler Techniques for. Fig.  A three-dimensional array is distributed on four processors by (Block,
Block, ∗ )
 A Array Languages, Compiler Techniques for

integer, parameter : : row = 1000 integer, parameter: : row = 1000


real, dimension ( row , 2∗row−1) : : A type ( s p a r s e 2 d_r e a l ) : : A
real, dimension ( row ) : : x, b type( s p a r s e 1 d_r e a l ) : : x, b
integer, dimension (2∗row−1) : : s h i f t integer, dimension (2∗row−1) : : s h i f t

b = sum (A∗e o s h i f t ( s p r e a d ( x, dim=2, n c o p i e s =2∗row−1), c a l l bound ( A , row , 2∗row−1)


dim=1, s h i f t=a r t h (−row +1 ,1 ,2∗row−1)) , dim=2) c a l l bound ( x , row )
c a l l bound ( b , row )
Array Languages, Compiler Techniques for. Fig.  b = sum (A∗e o s h i f t ( s p r e a d ( x, dim=2, n c o p i e s =2∗row−1) ,
Numerical routines for banded matrix multiplication dim=1, s h i f t=a r t h (−row +1 ,1 ,2∗row−1)) , dim=2)

Array Languages, Compiler Techniques for. Fig.  Sparse


implementation for the banmul routine
concisely as dense computation. The code fragment in
Fig.  is a part of a Fortran  program excerpted
from the book, Numerical Recipes in Fortran  []
with its computation kernel named banded multipli-
cation that calculates b = Ax, with the input, A, a Selection of Compression and Distribution
matrix and the second input, x, a vector. (For more Schemes
information about banded vector-matrix multiplica- Given a program with sparse arrays, there introduces
tion, readers can refer to the book [].) In Fig. , an optimizing problem, how to select distribution and
both the input and output are declared as dense compression schemes for sparse arrays in the pro-
arrays, with three Fortran  array intrinsic functions, gram. In the next paragraphs, two combinations of
EOSHIFT, SPREAD, and SUM used to produce its compression and distribution schemes for the banmul
results. kernel are presented to show how scheme combina-
When the input and output are sparse matrices with tions make impact to performance. At first Fig. 
plenty of zero elements, the dense representation and shows the expression tree of the banmul kernel in
computation will be inefficient. With the support of Fig. , where T, T, and T beside the internal
sparse data types and array operations, the banmul ker- nodes are temporary arrays introduced by compilers
nel can be rewritten into a sparse version as shown in for passing intermediate results for array operations.
Fig. . Comparing Fig.  with Fig. , one can find that As mentioned before, through exposed routines, pro-
the sparse implementation is very similar to the dense grammers can choose compression and distribution
version, with slight differences in array declaration and schemes for input and output arrays. Nevertheless,
extra function calls for array shape binding. By chang- it still has the need to figure out proper compres-
ing inputs and outputs from dense to sparse arrays, sion and distribution schemes for temporary arrays
a huge amount of space is saved by only recording since they are introduced by compilers rather than
non-zero elements. Besides, the computation kernel is programmers.
unchanged thanks to operator overloading, and it tends The reason why compression schemes for input,
to have better performance for only processing nonzero output and temporary arrays should be concerned is
elements. One thing to be noted here, the compression briefed as follows. When an array operation is con-
and distribution scheme of the sparse implementation ducted on arrays in different compression schemes, it
are not specified for input and output arrays, and thus requires data conversion that reorganizes compressed
the default settings are used. In advanced program- arrays from one compression scheme to another for
ming, these schemes can be configured via exposed a conforming format. That is because array opera-
routines to programmers, which can be used to opti- tions are designed to process input arrays and return
mize programs for compressed data organizations. We results in one conventional scheme, either in CRS
use following paragraphs to discuss how to select a or in CCS. The conversion cost implicit in array
proper compression and distribution scheme for sparse operations hurts performance a lot and needs to be
programs. avoided.
Array Languages, Compiler Techniques for A 

Array Languages, Compiler Techniques for. Table 


Sum b
Assignments with less compression conversions A
Array Compressed scheme Distribution scheme

x d (Block)
* T3
T CCS (Block, *)

T CCS (Block, *)
A
Eoshift T2
T CCS (Block, *)

A CCS (Block, *)

b d (Block)
Spread T1

x Sparsity, Cost Models, and the Optimal


Selection
Array Languages, Compiler Techniques for. Fig. 
To compile sparse programs for distributed environ-
Expression tree with intrinsic functions in banmul
ments, in additional to conversion costs, communica-
tion costs need to be considered. In order to minimize
the total execution time, an overall cost model needs be
Array Languages, Compiler Techniques for. Table 
introduced, which covers three factors that have major
Assignments with extra compression conversions needed
impact to performance: computation, communication
Array Compressed scheme Distribution scheme and compression conversion cost. The costs of compu-
x d (Block) tation and communication model time consumed on
computing and transferring nonzero elements of sparse
T CRS (Block, *)
arrays, while the conversion cost mentioned before
T CCS (Block, *) models time consumed on converting data in one com-
T CRS (Block, *) pression schemes to another.
All these costs are closely related to the numbers of
A CRS (Block, *)
nonzero elements in arrays, defined as sparsity of arrays,
b d (Block) which can be used to infer the best scheme combination
for a sparse program. The sparsity informations can be
provided by programmers that have knowledge about
program behaviors, or they can be obtained through
In Table  and Table , we present two combinations
profiling or advanced probabilistic inference schemes
of compression and distribution schemes for the ban-
discussed in []. With sparsity informations, the selec-
mul kernel. To focus on compression schemes, all arrays
tion process can be formulated into a cost function
now have same distribution schemes, (Block) for one-
that estimates overall execution time to run a sparse
dimensional arrays and (Block, *) for two-dimensional
program on distributed memory environments. With
arrays. As shown in Table , the input array A and three
this paragraph, we conclude the compiler supports and
temporary arrays are in sparse representation, with only
optimizations for sparse arrays. For more information
T assigned to CCS scheme and the rest assigned to
about compiler techniques for sparse arrays, readers can
CRS scheme. When the conversion cost is considered,
refer to [, ].
we can tell that the scheme selection in Table  is bet-
ter than that in Table  because it requires less con-
versions: one for converting the result of Spread to Related Entries
CCS and the other for converting the result of Eoshift Array Languages
to CRS. BLAS (Basic Linear Algebra Subprograms)
 A Array Languages, Compiler Techniques for

Data Distribution Except the compiler techniques mentioned in this


Dense Linear System Solvers entry, there is a huge amount of literature discussing
Distributed-Memory Multiprocessor compiler techniques for dense and sparse array pro-
HPF (High Performance Fortran) grams, which are not limited to Fortran . Among
Locality of Reference and Parallel Processing them there is a series of work on compiling ZPL [–],
Metrics a language that defines a concise representation for
Reduce and Scan describing data-parallel computations. In [], another
Shared-Memory Multiprocessors approach similar to array operations synthesis is pro-
posed to achieve the same goal through loop fusion
and array contraction. For other sparse program opti-
Bibliographic Notes and Further mizations, there are also several research efforts on
Reading Matlab [–]. One design and implementation of
For more information about compiler techniques sparse array support in Matlab is provided in [], and
of array operation synthesis, readers can refer to there are related compiler techniques on Matlab sparse
papers [–]. Among them, the work in [, ] focuses arrays discussed in [, ].
on compiling array operations supported by Fortran .
Besides this, the synthesis framework, they also provide
solutions to performance anomaly incurred by the pres-
ence of common subexpressions and one-to-many array Bibliography
operations. In their following work [], the synthesis . Press WH, Teukolsky SA, Vetterling WT, Flannery BP ()
framework is extended to synthesize HPF array oper- Numerical recipes in Fortran : the art of parallel scientific
computing. Cambridge University Press, New York
ations on distributed memory environments, in which
. Hwang GH, Lee JK, Ju DC () An array operation synthe-
communication issues are addressed. sis scheme to optimize Fortran  programs. ACM SIGPLAN
For more information about compiler supports and Notices (ACM PPoPP Issue)
optimizations for sparse programs, readers can refer . Hwang GH, Lee JK, Ju DC () A function-composition
to papers [, ]. The work in [] puts lots of efforts approach to synthesize Fortran  array operations. J Parallel
Distrib Comput
to support Fortran  array operations for sparse
. Hwang GH, Lee JK, Ju DC () Array operation synthesis
arrays on parallel environments, in which both com- to optimize HPF programs on distributed memory machines.
pression schemes and distribution schemes for multi- J Parallel Distrib Comput
dimensional sparse arrays are considered. Besides this, . Chang RG, Chuang TR, Lee JK () Parallel sparse supports
the work also provides a complete complexity analysis for array intrinsic functions of Fortran . J Supercomputing
for the sparse array implementations and report that the ():–
. Chang RG, Chuang TR, Lee JK () Support and optimization
complexity is in proportion to the number of nonzero
for parallel sparse programs with array intrinsics of Fortran .
elements in sparse arrays, which is consistent with the Parallel Comput
conventional design criteria for sparse algorithms and . Lin C, Snyder L () ZPL: an array sublanguage. th Inter-
data structures. national Workshop on Languages and Compilers for Parallel
The support of sparse arrays and operations in [] Computing, Portland
. Chamberlain BL, Choi S-E, Christopher Lewis E, Lin C, Snyder L,
is divided into a two-level implementation: in the low-
Weathersby D () Factor-join: a unique approach to compiling
level implementation, a sparse array needs to be speci- array languages for parallel machines. th International Work-
fied with compression and distribution schemes; in the shop on Languages and Compilers for Parallel Computing, San
high-level implementation, all intrinsic functions and Jose, California
operations are overloaded for sparse arrays, and com- . Christopher Lewis E, Lin C, Snyder L () The implementa-
pression and distribution details are hidden in imple- tion and evaluation of fusion and contraction in array languages.
International Conference on Programming Language Design and
mentations. In the work [], a compilation scheme
Implementation, San Diego
is proposed to transform high-level representations to . Shah V, Gilbert JR () Sparse matrices in Matlab*P: design and
low-level implementations with the three costs, compu- implementation. th International Conference on High Perfor-
tation, communication and conversion, considered. mance Computing, Springer, Heidelberg
Asynchronous Iterative Algorithms A 

. Buluç A, Gilbert JR () On the representation and multipli- high communication and synchronization costs. While
cation of hypersparse matrices. nd IEEE International Sympo-
A
these costs have improved over the years, a corre-
sium on Parallel and Distributed Processing, Miami, Florida sponding increase in processor speeds has tempered the
. Buluç A, Gilbert JR () Challenges and advances in parallel
impact of these improvements. Indeed, the speed gap
sparse matrix-matrix multiplication. International Conference on
Parallel Processing, Portland between the arithmetic and logical operations and the
memory access and message passing operations has a
significant impact even on parallel programs execut-
ing on a tightly coupled shared-memory multiprocessor
Asynchronous Iterations implemented on a single semiconductor chip.
System techniques such as prefetching and multi-
Asynchronous Iterative Algorithms threading use concurrency to hide communication and
synchronization costs. Application programmers can
complement such techniques by reducing the num-
ber of communication and synchronization operations
Asynchronous Iterative when they implement parallel algorithms. By analyz-
Algorithms ing the dependencies among computational tasks, the
programmer can determine a minimal set of communi-
Giorgos Kollias, Ananth Y. Grama, Zhiyuan Li
cation and synchronization points that are sufficient for
Purdue University, West Lafayette, IN, USA
maintaining all control and data dependencies embed-
ded in the algorithm. Furthermore, there often exist
Synonyms several different algorithms to solve the same compu-
Asynchronous iterations; Asynchronous iterative tational problem. Some may perform more arithmetic
computations operations than others but require less communica-
tion and synchronization. A good understanding of the
Definition available computing system in terms of the tradeoff
In iterative algorithms, a grid of data points are updated between arithmetic operations versus the communica-
iteratively until some convergence criterion is met. The tion and synchronization cost will help the programmer
update of each data point depends on the latest updates select the most appropriate algorithm.
of its neighboring points. Asynchronous iterative algo- Going beyond the implementation techniques men-
rithms refer to a class of parallel iterative algorithms tioned above requires the programmer to find ways to
that are capable of relaxing strict data dependencies, relax the communication and synchronization require-
hence not requiring the latest updates when they are ment in specific algorithms. For example, an iterative
not ready, while still ensuring convergence. Such relax- algorithm may perform a convergence test in order to
ation may result in the use of inconsistent data which determine whether to start a new iteration. To per-
potentially may lead to an increased iteration count form such a test often requires gathering data which
and hence increased computational operations. On the are scattered across different processors, incurring the
other hand, the time spent on waiting for the lat- communication overhead. If the programmer is famil-
est updates performed on remote processors may be iar with the algorithm’s convergence behavior, such a
reduced. Where waiting time dominates the computa- convergence test may be skipped until a certain num-
tion, a parallel program based on an asynchronous algo- ber of iterations have been executed. To further reduce
rithm may outperform its synchronous counterparts. the communication between different processors, algo-
rithm designers have also attempted to find ways to
Discussion relax the data dependencies implied in conventional
parallel algorithms such that the frequency of com-
Introduction munication can be substantially reduced. The concept
Scaling application performance to large number of of asynchronous iterative algorithms, which dates back
processors must overcome challenges stemming from over  decades, is developed as a result of such attempts.
 A Asynchronous Iterative Algorithms

With the emergence of parallel systems with tens of one can constrain the system state to be determined
thousands of processors and the deep memory hierar- by the prior state of the system in a prescribed time
chy accessible to each processor, asynchronous itera- window. So long as system invariants such as energy
tive algorithms have recently generated a new level of are maintained, this approach is valid for many appli-
interest. cations. For example, in protein folding and molecu-
lar docking, the objective is to find minimum energy
Motivating Applications states. It is shown to be possible to converge to true
Among the most computation-intensive applications minimum energy states under such relaxed models of
currently solved using large-scale parallel platforms synchrony.
are iterative linear and nonlinear solvers, eigenvalue While these are two representative examples from
solvers, and particle simulations. Typical time-dependent scientific domains, nonscientific algorithms lend them-
simulations based on iterative solvers involve multiple selves to relaxed synchrony just as well. For example, in
levels at which asynchrony can be exploited. At the information retrieval, PageRank algorithms can tolerate
lowest level, the kernel operation in these solvers are relaxed dependencies on iterative computations [].
sparse matrix-vector products and vector operations. In
a graph-theoretic sense, a sparse matrix-vector prod-
Iterative Algorithms
uct can be thought of as edge-weighted accumula-
An iterative algorithm is typically organized as a series
tions at nodes in a graph, where nodes correspond
of steps essentially of the form
to rows and columns and edges correspond to matrix
entries. Indeed, this view of a matrix-vector product
x(t + ) ← f (x(t)) ()
forms the basis for parallelization using graph partition-
ers. Repeated matrix-vector products, say, to compute
where operator f (⋅) is applied to some data x(t) to pro-
A(n) y, while solving Ax = b, require synchronization
duce new data x(t+). Here integer t counts the number
between accumulations. However, it can be shown that
of steps, assuming starting with x(), and captures the
within some tolerance, relaxing strict synchronization
notion of time. Given certain properties on f (⋅), that
still maintains convergence guarantees of many solvers.
it contracts (or pseudo-contracts) and it has a unique
Note that, however, the actual iteration count may be
fixed point, and so on, an iterative algorithm is guar-
larger. Similarly, vector operations for computing resid-
anteed to produce at least an approximation within a
uals and intermediate norms can also relax synchro-
prescribed tolerance to the solution of the fixed-point
nization requirements. At the next higher level, if the
equation x = f (x), although the exact number of needed
problem is nonlinear in nature, a quasi-Newton scheme
steps cannot be known in advance.
is generally used. As before, convergence can be guar-
In the simplest computation scenario, data x reside
anteed even when the Jacobian solves are not exact.
in some Memory Entity (ME) and operator f (⋅) usually
Finally, in time-dependent explicit schemes, some time-
consists of both operations to be performed by some
skew can be tolerated, provided global invariants are
Processing Entity (PE) and parameters (i.e., data)
maintained, in typical simulations.
hosted by some ME. For example, if the iteration step
Particle systems exhibit similar behavior to one
is a left matrix-vector multiplication x ← Ax, x is con-
mentioned above. Spatial hierarchies have been well
sidered to be the “data” and the matrix itself with the
explored in this context to derive fast methods such
set of incurred element-by-element multiplications and
as the Fast Multipole Method and Barnes-Hut method.
subsequent additions to be the “operator”, where the
Temporal hierarchies have also been explored, albeit to
“operator” part also contains the matrix elements as its
a lesser extent. Temporal hierarchies represent one form
parameters.
of relaxed synchrony, since the entire system does not
However, in practice this picture can get compli-
evolve in lock steps. Informally, in a synchronous sys-
cated:
tem, the state of a particle is determined by the state of
the system in the immediately preceding time step(s) in ● Ideally x data and f (⋅) parameters should be readily
explicit schemes. Under relaxed models of synchrony, available to the f (⋅) operations. This means that one
Asynchronous Iterative Algorithms A 

would like all data to fit in the register file of a typ- fragments {xj } of the newly computed data x for an iter-
A
ical PE, or even its cache files (data locality). This is ation step. So, to preserve the exact semantics, it should
not possible except for very small problems which, wait for these data fragments to become available, in
most of the time, are not of interest to researchers. effect causing its PE to synchronize with all those PEs
Data typically reside either in the main memory executing the operator fragments involved in updating
or for larger problems in a slow secondary mem- these {xj } in need. Waiting for the availability of data
ory. Designing data access strategies which feed the at the end of each iteration make these iterations “syn-
PE at the fastest possible rate, in effect optimizing chronous” and this practice preserves the semantics. In
data flow across the memory hierarchy while pre- networked environments, such synchronous iterations
serving the semantics of the algorithm, is important can be coupled with either “synchronous communica-
in this aspect. Modern multicore architectures, in tions” (i.e., synchronous global communication at the
which many PEs share some levels of the mem- end of each step) or “asynchronous communications”
ory hierarchy and compete for the interconnection for overlapping computation and communication (i.e.,
paths, introduce new dimensions of complexity in asynchronously sending new data fragments as soon as
the effective mapping of an iterative algorithm. they are locally produced and blocking the receipt of
● There are cases in which data x are assigned to more new data fragments at the end of each step) [].
than one PEs and f (⋅) is decomposed. This can be Synchronization between PEs is crucial to enforc-
the result either of the sheer volume of the data or ing correctness of the iterative algorithm implementa-
the parameters of f (⋅) or even the computational tion but it also introduces idle synchronization phases
complexity of f (⋅) application which performs a between successive steps. For a moment consider the
large number of operations in each iteration on each extreme case of a parallel iterative computation where
data unit. In some rarer cases the data itself or the for some PE the following two conditions happen to
operator remains distributed by nature throughout hold:
the computation. Here, the PEs can be the cores
of a single processor, the nodes (each consisting of . It completes its iteration step much faster than the
multiple cores and processors) in a shared-memory other PEs.
machine or similar nodes in networked machines. . Its input communication links from the other PEs
The latter can be viewed as PEs accessing a network are very slow.
of memory hierarchies of multiple machines, i.e., an
extra level of ME interconnection paths.
It is evident that this PE will suffer a very long
idle synchronization phase, i.e., time spent doing noth-
ing (c.f. Fig. ). Along the same line of thought, one
Synchronous Iterations and Their Problems could also devise the most unfavorable instances of
The decomposition of x and f (⋅), as mentioned above, computation/communication time assignments for the
necessitates the synchronization of all PEs involved at other PEs which suffer lengthy idle synchronization
each step. Typically, each fragment fi (⋅) of the decom- phases. It is clear that the synchronization penalty for
posed operator may need many, if not all, of the iterative algorithms can be severe.

A 1 idle 2 idle

B 1 2 idle

Asynchronous Iterative Algorithms. Fig.  Very long idle periods due to synchronization between A and B executing
synchronous iterations. Arrows show exchange messages, the horizontal axis is time, boxes with numbers denote iteration
steps, and dotted boxes denote idle time spent in the synchronization. Note that such boxes will be absent in
asynchronous iterations
 A Asynchronous Iterative Algorithms

Asynchronous Iterations: Basic Idea and hand the synchronous convergence assumption guar-
Convergence Issues antees that local computation also cannot produce any
The essence of asynchronous iterative algorithms is state escaping the current box, its only effect being a
to reduce synchronization penalty by simply eliminat- possible inclusion of some of the local data compo-
ing the synchronization phases in iterative algorithms nents in the respective intervals defining some smaller
described above. In other words, each PE is permitted (nested) box. The argument here is that if communi-
to proceed to its next iteration step without waiting for cations are nothing but state transitions parallel to box
updating its data from other PEs. Use newly arriving facets and computations just drive some state coordi-
data if available, but otherwise reuse the old data. This nates in the facet of some smaller (nested) box, then
approach, however, fails to retain the temporal order- their combination will continuously drive the set of
ing of the operations in the original algorithm. The local states to successively smaller boxes and ultimately
established convergence properties are usually altered toward convergence within a tolerance.
as a consequence. The convergence behavior of the new It follows that since a synchronous iteration is just a
asynchronous algorithm is much harder to analyze than special case in this richer framework, its asynchronous
the original [], the main difficulty being the intro- counterpart will typically fail to converge for all asyn-
duction of multiple independent time lines in practice, chronous scenarios of minimal assumptions. Two broad
one for each PE. Even though the convergence anal- classes of asynchronous scenarios are usually identified:
ysis uses a global time line, the state space does not
involve studying the behavior of the trajectory of only Totally asynchronous The only constraint here is that, in
one point (i.e., the shared data at the end of each itera- the local computations, data components are ulti-
tion as in synchronous iterations) but rather of a set of mately updated. Thus, no component ever becomes
points, one for each PE. An extra complexity is injected arbitrarily old as the global time line progresses,
by the flexibility of the local operator fragments fi to be potentially indefinitely, assuming that any local
applied to rather arbitrary combinations of data com- computation marks a global clock tick. Under such a
ponents from the evolution history of their argument constraint, ACT can be used to prove, among other
lists. Although some assumptions may underly a partic- things, that the totally asynchronous execution of
ular asynchronous algorithm so that certain scenarios of the classical iteration x ← Ax + b, where x and
reusing old data components are excluded, a PE is free b are vectors and A a matrix distributed row-wise,
to use, in a later iteration, data components older than converges to the correct solution provided that the
those of the current one. In other words, one can have spectral radius of the modulus matrix is less than
non-FIFO communication links for local data updates unity (ρ(∣A∣) < ). This is a very interesting result
between the PEs. from a practical standpoint as it directly applies to
The most general strategy for establishing conver- classical stationary iterative algorithms (e.g., Jacobi)
gence of asynchronous iterations for a certain problem with nonnegative iteration matrix, those commonly
is to move along the lines of the Asynchronous Con- used for benchmarking the asynchronous model
vergence Theorem (ACT) []. One tries to construct a implementations versus the synchronous ones.
sequence of boxes (cartesian products of intervals where Partially asynchronous The constraint here is stricter:
data components are known to lie) with the property each PE must perform at least one local itera-
that the image of each box under f (⋅) will be contained tion within the next B global clock ticks (for B
in the next box in this sequence, given that the orig- a fixed integer) and it must not use components
inal (synchronous) iterative algorithm converges. This that are computed B ticks prior or older. More-
nested box structure ensures that as soon as all local over, for the locally computed components, the most
states enter such a box, no communication scenario can recent must always be used. Depending on how
make any of the states escape to an outer box (decon- restricted the value of B may be, partially asyn-
verge), since updating variables through communica- chronous algorithms can be classified in two types,
tion is essentially a coordinate exchange. On the other Type I and Type II. The Type I is guaranteed to
Asynchronous Iterative Algorithms A 

converge for arbitrarily chosen B. An interesting the global convergence tests too early. Indeed, experi-
A
case of Type I has the form x ← Ax where x is ments have shown long phases in which PEs continually
a vector and A is a column-stochastic matrix. This enter and exit their “locally converged” status. With the
case arises in computing stationary distributions of introduction of an extra waiting period for the local
Markov chains [] like the PageRank computa- convergence status to stabilize, one can avoid the pre-
tion. Another interesting case is found in gossip-like mature trigger of the global convergence detection pro-
algorithms for the distributed computation of statis- cedure which may be either centralized or distributed.
tical quantities, where A = A(t) is a time-varying Taking centralized global detection for example, there
row-stochastic matrix []. For Type II partially is a monitor PE which decides global convergence and
asynchronous algorithms, B is restricted to be a notifies all iterating PEs accordingly. A practical strat-
function of the structure and the parameters of the egy is to introduce two integer “persistence parame-
linear operator itself to ensure convergence. Exam- ters” (a localPersistence and a globalPersistence). If
ples include gradient and gradient projection algo- local convergence is preserved for more than localPer-
rithms, e.g., those used in solving optimization and sistence iterations, the PE will notify the monitor. As
constrained optimization problems respectively. In soon as the monitor finds out that all the PEs remain
these examples, unfortunately, the step parameter in in locally convergence status for more than globalPer-
the linearized operator must vary as inversely pro- sistence of their respective checking cycles, it signals
portional to B in order to attain convergence. This global convergence to all the PEs.
implies that, although choosing large steps could Another solution, with a simple proof of correct-
accelerate the computation, the corresponding B ness, is to embed a conservative iteration number in
may be too small to enforce in practice. messages, defined as the smallest iteration number of
all those messages used to construct the argument list of
the current local computation, incremented by one [].
Implementation Issues This is strongly reminiscent of the ideas in the ACT and
Termination Detection practically the iteration number is the box counter in
In a synchronous iterative algorithm, global conver- that context, where nested boxes are marked by succes-
gence can be easily decided. It can consist of a local con- sively greater numbers of such. In this way, one can pre-
vergence detection (e.g., to make sure a certain norm is compute some bound for the target iteration number for
below a local threshold in a PE) which triggers global a given tolerance, with simple local checks and minimal
convergence tests in all subsequent steps. The global communication overhead. A more elaborate scheme for
tests can be performed by some PE acting as the monitor asynchronous computation termination detection in a
to check, at the end of each step, if the error (deter- distributed setting can be found in literature [].
mined by substituting the current shared data x in the
fixed point equation) gets below a global threshold. At Asynchronous Communication
that point, the monitor can signal all PEs to terminate. The essence of asynchronous algorithms is not to let
However, in an asynchronous setting, local convergence local computations be blocked by communications with
detection at some PE does not necessarily mean that other PEs. This nonblocking feature is easier to imple-
the computation is near the global convergence, due to ment on a shared memory system than on a distributed
the fact that such a PE may compute too fast and not memory system. This is because on a shared memory
receive much input from its possibly slow communi- system, computation is performed either by multiple
cation in-links. Thus, in effect, there could also be the threads or multiple processes which share the physi-
case in which each PE computes more or less in iso- cal address space. In either case, data communication
lation from the others an approximation to a solution can be implemented by using shared buffers. When a
to the less-constrained fixed point problem concern- PE sends out a piece of data, it locks the data buffer
ing its respective local operator fragment fi (⋅) but not until the write completes. If another PE tries to receive
f (⋅). Hence, local convergence detection may trigger the latest data but finds the data buffer locked, instead
 A Asynchronous Iterative Algorithms

of blocking itself, the PE can simply continue iterating the computation with its old, local copy of the data. On the other hand, to make sure the receive operation retrieves the data in the buffer in its entirety, the buffer must also be locked until the receive is complete. The sending PE hence may also find itself locked out of the buffer. Conceptually, one can give the send operation the privilege to preempt the receive operation, and the latter may erase the partial data just retrieved and let the send operation deposit the most up to date data. However, it may be easier to just let the send operation wait for the lock, especially because the expected waiting will be much shorter than what is typically experienced in a distributed memory system.
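A minimal sketch of this receive-side behavior with POSIX threads is shown below; the shared buffer, its mutex, and the fragment size are assumptions made for the example, and error handling is omitted:

    #include <pthread.h>
    #include <string.h>

    #define FRAG_SIZE 1024

    double shared_buf[FRAG_SIZE];                     /* latest data written by the sender */
    pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Try to refresh the local copy; if the sender currently holds the lock,
     * simply keep iterating with the old local copy instead of blocking.    */
    void maybe_receive(double local_copy[FRAG_SIZE])
    {
        if (pthread_mutex_trylock(&buf_lock) == 0) {
            memcpy(local_copy, shared_buf, sizeof shared_buf);
            pthread_mutex_unlock(&buf_lock);
        }
        /* else: buffer busy; the old copy remains valid for this iteration */
    }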
On distributed memory systems, the communication protocols inevitably perform certain blocking operations. For example, on some communication layer, a send operation will be considered incomplete until the receiver acknowledges the safe receipt of the data. Similarly, a receive operation on a certain layer will be considered incomplete until the arriving data become locally available. From this point of view, a communication operation in effect synchronizes the communicating partners, which is not quite compatible with the asynchronous model. For example, the so-called nonblocking send and probe operations assume the existence of a hardware device, namely, the network interface card, that is independent of the PE. However, in order to avoid expensive copy operations between the application and the network layers, the send buffer is usually shared by both layers, which means that before the PE can reuse the buffer, e.g., while preparing for the next send in the application layer, it must wait until the buffer is emptied by the network layer. (Note that the application layer cannot force the network layer to undo its ongoing send operation.) This practically destroys the asynchronous semantics. To overcome this difficulty, computation and communication must be decoupled, e.g., implemented as two separate software modules, such that the blocking due to communication does not impede the computation. Mechanisms must be set up to facilitate fast data exchange between these two modules, e.g., by using another data buffer to copy data between them. Now, the computation module will act like a PE sending data in a shared memory system while the communication module acts like a PE receiving data. One must also note that multiple probes by a receiving PE are necessary since multiple messages with the same “envelope” might have arrived during a local iteration. This again has a negative impact on the performance of an asynchronous algorithm, because the receiving PE must find the most recent of these messages and discard the rest.
The issues listed above have been discussed in the context of the MPI library []. Communication libraries such as Jace [] implemented in Java, or its C++ port, CRAC [], address some of the shortcomings of “general-purpose” message passing libraries. Internally they use separate threads for the communication and computation activities, with synchronized queues of buffers for the messages and automatic overwriting of the older ones.

Recent Developments and Potential Future Directions
Two-Stage Iterative Methods and Flexible Communications
There exist cases in which an algorithm can be restructured so as to introduce new opportunities for parallelization and thus yield more places to inject asynchronous semantics. Notable examples are two-stage iterative methods for solving linear systems of equations of the form Ax = b. Matrix A is written as A = M − N with M being a block diagonal part. With such splitting, the iteration reads as Mx(t + 1) ← Nx(t) + b, and its computation can be decomposed row-wise and distributed to the PEs, i.e., a block Jacobi iteration. However, to get a new data fragment xi(t + 1) at each iteration step at some PE, a new linear system of equations must be solved, which can be done by a new splitting of the local Mi part, resulting in a nested iteration. In a synchronous version, new data fragments are exchanged only at the end of each iteration to coordinate execution. The asynchronous model applied in this context [] not only relaxes the timing of the xi(t + 1) exchanges of the synchronous model, but it also introduces asynchronous exchanges even during the nested iterations (toward xi(t + 1)) and their immediate use in the computation, not at the end of the respective computational phases. This idea was initially used for solving systems of nonlinear equations [].
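The following sketch illustrates the structure of one outer step of such a block Jacobi iteration for the row block owned by one PE, with a few inner relaxation sweeps (Gauss–Seidel on the diagonal block) standing in for an exact solve of the local system; the dense row-major storage and all names are assumptions made only for this example:

    /* One outer step for the rows [lo,hi) owned by a PE.  A is n-by-n in
     * row-major order; x is the PE's possibly stale view of the global
     * iterate.  Off-block entries of x are held fixed during the inner
     * sweeps, so the inner loop approximately solves the local block system. */
    void block_jacobi_step(int n, int lo, int hi,
                           const double *A, const double *b,
                           double *x, int inner_sweeps)
    {
        for (int sweep = 0; sweep < inner_sweeps; sweep++) {
            for (int i = lo; i < hi; i++) {
                double s = b[i];
                for (int j = 0; j < n; j++)
                    if (j != i)
                        s -= A[i * n + j] * x[j];
                x[i] = s / A[i * n + i];
            }
        }
        /* In the asynchronous variant, updated entries of x[lo..hi) may be
         * published to, and stale values read from, other PEs even between
         * inner sweeps rather than only at the end of the outer step.       */
    }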
Asynchronous Tiled Iterations (or Parallelizing Data Locality)
As mentioned in Sect. , the PEs executing the operator and the MEs hosting its parameters and data must have fast interconnection paths. Restructuring the computation so as to maximize reuse of cached data across iterations has been studied in the past []. These are tiling techniques [] applied to chunks of iterations (during which convergence is not tested) and coupled with strategies for breaking the so induced dependencies. In this way data locality is considerably increased, but opportunities for parallelization are confined only within the current tile data and not the whole data set, as is the general case, e.g., in iterative stencil computations. Furthermore, additional synchronization barriers, scaling with the number of tiles, are introduced. In a very recent work [], the asynchronous execution of tiled chunks is proposed for regaining the parallelization degree of nontiled iterations: each PE is assigned a set of tiles (its sub-grid) and performs the corresponding loops without synchronizing with the other PEs. Only the convergence test at the end of such a phase enforces synchronization. So on the one hand, locality is preserved since each PE traverses its current tile data only, and on the other hand all available PEs execute concurrently in a similar fashion without synchronizing, resulting in a large degree of parallelization.

When to Use Asynchronous Iterations?
Asynchronism can enter an iteration in both natural and artificial ways. In naturally occurring asynchronous iterations, PEs are either asynchronous by default or it is unacceptable to synchronize them; computation over sensor networks or distributed routing over data networks are such examples. However, the asynchronous execution of a parallelized algorithm enters artificially, in the sense that most of the times it comes as a variation of the synchronous parallel port of a sequential one. Typically, there is a need to accelerate the sequential algorithm and as a first step it is parallelized, albeit in synchronous mode in order to preserve semantics. Next, in the presence of large synchronization penalties (when PEs are heterogeneous both in terms of computation and communication, as in some Grid installations) or extra flexibility needs (such as asynchronous computation starts, dynamic changes in data or topology, non-FIFO communication channels), asynchronous implementations are evaluated. Note that in all those cases of practical interest, networked PEs are implied. The interesting aspect of [, ] is that it broadens the applicability of the asynchronous paradigm in shared memory setups for yet another purpose, which is to preserve locality but without losing parallelism itself as a performance boosting strategy.

Related Entries
Memory Models
Synchronization

Bibliographic Notes and Further Reading
The asynchronous computation model, as an alternative to the synchronous one, has a life span of almost  decades. It started with its formal description in the pioneering work of Chazan and Miranker [] and its first experimental investigations by Baudet [] back in the s. Perhaps the most extensive and systematic treatment of the subject is contained in a book by Bertsekas and Tsitsiklis in the s [], particularly in its closing three chapters. During the last  decades an extensive literature has been accumulated [, , , , , , , , , , ]. Most of these works explore theoretical extensions and variations of the asynchronous model coupled with very specific applications. However, in the most recent ones, focus has shifted to more practical, implementation-level aspects [, , ], since the asynchronous model seems appropriate for the realization of highly heterogeneous, Internet-scale computations [, , ].

Bibliography
. Bahi JM, Contassot-Vivier S, Couturier R, Vernier F () A decentralized convergence detection algorithm for asynchronous parallel iterative algorithms. IEEE Trans Parallel Distrib Syst ():–
. Bahi JM, Contassot-Vivier S, Couturier R () Asynchronism for iterative algorithms in a global computing environment. In: th Annual International Symposium on high performance computing systems and applications (HPCS’). IEEE, Moncton, Canada, pp –
. Bahi JM, Contassot-Vivier S, Couturier R () Coupling . Chazan D, Miranker WL () Chaotic relaxation. J Linear
dynamic load balancing with asynchronism in iterative algo- Algebra Appl :–
rithms on the computational Grid. In: th International Parallel . Couturier R, Domas S () CRAC: a Grid environment to solve
and Distributed Processing Symposium (IPDPS’), p . IEEE, scientific applications with asynchronous iterative algorithms. In:
Nice, France Parallel and Distributed Processing Symposium, . IPDPS
. Bahi JM, Contassot-Vivier S, Couturier R () Performance . IEEE International, p –
comparison of parallel programming environments for imple- . Elsner L, Koltracht I, Neumann M () On the convergence
menting AIAC algorithms. In: th International Parallel and of asynchronous paracontractions with application to tomo-
Distributed Processing Symposium (IPDPS’). IEEE, Santa graphic reconstruction from incomplete data. Linear Algebra
Fe, USA Appl :–
. Bahi JM, Contassot-Vivier S, Couturier R () Parallel itera- . Frommer A, Szyld DB () On asynchronous iterations. J Com-
tive algorithms: from sequential to Grid computing. Chapman & put Appl Math (–):–
Hall/CRC, Boca Raton, FL . Frommer A, Schwandt H, Szyld DB () Asynchronous
. Bahi JM, Domas S, Mazouzi K () Combination of Java and weighted additive schwarz methods. ETNA :–
asynchronism for the Grid: a comparative study based on a paral- . Frommer A, Szyld DB () Asynchronous two-stage iterative
lel power method. In: th International Parallel and Distributed methods. Numer Math ():–
Processing Symposium (IPDPS ’), pp a, . IEEE, Santa Fe, . Liu L, Li Z () Improving parallelism and locality with
USA, April  asynchronous algorithms. In: th ACM SIGPLAN Symposium
. Bahi JM, Domas S, Mazouzi K (). Jace: a Java environ- on principles and practice of parallel programming (PPoPP),
ment for distributed asynchronous iterative computations. In: pp –, Bangalore, India
th Euromicro Conference on Parallel, Distributed and Network- . Lubachevsky B, Mitra D () A chaotic, asynhronous algorithm
Based Processing (EUROMICRO-PDP’), pp –. IEEE, for computing the fixed point of a nonnegative matrix of unit
Coruna, Spain spectral radius. JACM ():–
. Baudet GM () Asynchronous iterative methods for multipro- . Miellou JC, El Baz D, Spiteri P () A new class of asyn-
cessors. JACM ():– chronous iterative algorithms with order intervals. Mathematics
. El Baz D () A method of terminating asynchronous itera- of Computation, ():–
tive algorithms on message passing systems. Parallel Algor Appl . Moga AC, Dubois M () Performance of asynchronous
:– linear iterations with random delays. In: Proceedings of the
. El Baz D () Communication study and implementation anal- th International Parallel Processing Symposium (IPPS ’),
ysis of parallel asynchronous iterative algorithms on message pp –
passing architectures. In: Parallel, distributed and network-based . Song Y, Li Z () New tiling techniques to improve cache tem-
processing, . PDP ’. th EUROMICRO International Con- poral locality. ACM SIGPLAN Notices ACM SIGPLAN Conf
ference, pp –. Weimar, Germany Program Lang Design Implement ():–
. El Baz D, Gazen D, Jarraya M, Spiteri P, Miellou JC () Flexible . Spiteri P, Chau M () Parallel asynchronous Richardson
communication for parallel asynchronous methods with applica- method for the solution of obstacle problem. In: Proceed-
tion to a nonlinear optimization problem. D’Hollander E, Joubert ings of th Annual International Symposium on High Perfor-
G et al (eds). In: Advances in Parallel Computing: Fundamen- mance Computing Systems and Applications, Moncton, Canada,
tals, Application, and New Directions. North Holland, vol , pp –
pp – . Strikwerda JC () A probabilistic analysis of asynchronous
. El Baz D, Spiteri P, Miellou JC, Gazen D () Asynchronous iteration. Linear Algebra Appl (–):–
iterative algorithms with flexible communication for nonlinear . Su Y, Bhaya A, Kaszkurewicz E, Kozyakin VS () Further
network flow problems. J Parallel Distrib Comput ():– results on convergence of asynchronous linear iterations. Linear
. Bertsekas DP () Distributed asynchronous computation of Algebra Appl (–):–
fixed points. Math Program ():– . Szyld DB () Perspectives on asynchronous computations for
. Bertsekas DP, Tsitsiklis JN () Parallel and distributed compu- fluid flow problems. First MIT Conference on Computational
tation. Prentice-Hall, Englewood Cliffs, NJ Fluid and Solid Mechanics, pp –
. Blathras K, Szyld DB, Shi Y () Timing models and local . Szyld DB, Xu JJ () Convergence of some asynchronous non-
stopping criteria for asynchronous iterative algorithms. J Parallel linear multisplitting methods. Num Algor (–):–
Distrib Comput ():– . Uresin A, Dubois M () Effects of asynchronism on the con-
. Blondel VD, Hendrickx JM, Olshevsky A, Tsitsiklis JN () vergence rate of iterative algorithms. J Parallel Distrib Comput
Convergence in multiagent coordination, consensus, and flock- ():–
ing. In: Decision and Control,  and  European Con- . Wolfe M () More iteration space tiling. In: Proceedings of the
trol Conference. CDC-ECC’. th IEEE Conference on,  ACM/IEEE conference on Supercomputing, p . ACM,
pp – Reno, NV
. Kollias G, Gallopoulos E, Szyld DB () Asynchronous iterative computations with Web information retrieval structures: The PageRank case. In: Joubert GR, Nagel WE, et al (eds) Parallel Computing: Current and Future issues of High-End computing, NIC Series. John von Neumann-Institut für Computing, Jülich, Germany, vol , pp –
. Kollias G, Gallopoulos E () Asynchronous Computation of PageRank computation in an interactive multithreading environment. In: Frommer A, Mahoney MW, Szyld DB (eds) Web Information Retrieval and Linear Algebra Algorithms, Dagstuhl Seminar Proceedings. IBFI, Schloss Dagstuhl, Germany, ISSN: –

Asynchronous Iterative Computations
Asynchronous Iterative Algorithms

ATLAS (Automatically Tuned Linear Algebra Software)
R. Clint Whaley
University of Texas at San Antonio, San Antonio, TX, USA

Synonyms
Numerical libraries

Definition
ATLAS [–, , ] is an ongoing research project that uses empirical tuning to optimize dense linear algebra software. The fruits of this research are embodied in an empirical tuning framework available as an open source/free software package (also referred to as “ATLAS”), which can be downloaded from the ATLAS homepage []. ATLAS generates optimized libraries which are also often collectively referred to as “ATLAS,” “ATLAS libraries,” or more precisely, “ATLAS-tuned libraries.” In particular, ATLAS provides a full implementation of the BLAS [, , , ] (Basic Linear Algebra Subprograms) API, and a subset of optimized LAPACK [] (Linear Algebra PACKage) routines. Because dense linear algebra is rich in operand reuse, many routines can run tens or hundreds of times faster when tuned for the hardware than when written naively. Unfortunately, highly tuned codes are usually not performance portable (i.e., a code transformation that helps performance on architecture A may reduce performance on architecture B).
The BLAS API provides basic building block linear algebra operations, and was designed to help ease the performance portability problem. The idea was to design an API that provides the basic computational needs for most dense linear algebra algorithms, so that when this API has been tuned for the hardware, all higher-level codes that rely on it for computation automatically get the associated speedup. Thus, the job of optimizing a vast library such as LAPACK can be largely handled by optimizing the much smaller code base involved in supporting the BLAS API. The BLAS are split into three “levels” based on how much cache reuse they enjoy, and thus how computationally efficient they can be made to be. In order of efficiency, the BLAS levels are: Level 3 BLAS [], which involve matrix–matrix operations that can run near machine peak; Level 2 BLAS [, ], which involve matrix–vector operations; and Level 1 BLAS [, ], which involve vector–vector operations. The Level 1 and 2 BLAS have the same order of memory references as floating point operations (FLOPS), and so will run at roughly the speed of memory for out-of-cache operation.
tions (FLOPS), and so will run at roughly the speed of
memory for out-of-cache operation.
Synonyms The BLAS were extremely successful as an API,
Numerical libraries allowing dense linear algebra to run at near-peak rates
of execution on many architectures. However, with
Definition hardware changing at the frantic pace dictated by
ATLAS [–, , ] is an ongoing research project Moore’s Law, it was an almost impossible task for hand
that uses empirical tuning to optimize dense linear tuners to keep BLAS libraries up-to-date on even those
algebra software. The fruits of this research are embod- fortunate families of machines that enjoyed their atten-
ied in an empirical tuning framework available as tions. Even worse, many architectures did not have any-
an open source/free software package (also referred one willing and able to provide tuned BLAS, which
to as “ATLAS”), which can be downloaded from the left investigators with codes that literally ran orders
ATLAS homepage []. ATLAS generates optimized of magnitude slower than they should, representing
libraries which are also often collectively referred to as huge missed opportunities for research. Even on sys-
“ATLAS,” “ATLAS libraries,” or more precisely, “ATLAS- tems where a vendor provided BLAS implementations,
tuned libraries.” In particular, ATLAS provides a full license issues often prevented their use (e.g., SUN pro-
implementation of the BLAS [, , , ] (Basic vided an optimized BLAS for the SPARC, but only
Linear Algebra Subprograms) API, and a subset of licensed its use for their own compilers, which left
optimized LAPACK [] (Linear Algebra PACKage) rou- researchers using other languages such as High Per-
tines. Because dense linear algebra is rich in operand formance Fortran without BLAS; this was one of the
reuse, many routines can run tens or hundreds of times original motivations to build ATLAS).
Empirical tuning arose as a response to this need for performance portability. The idea is simple enough in principle: Rather than hand-tune operations to the architecture, write a software framework that can vary the implementation being optimized (through techniques such as code generation) so that thousands of inter-related transformation combinations can be empirically evaluated on the actual machine in question. The framework uses actual timings to discover which combinations of transformations lead to high performance on this particular machine, resulting in portably efficient implementations regardless of architecture. Therefore, instead of waiting months (or even years) for a hand tuner to do the same thing, the user need only install the empirical tuning package, which will produce a highly tuned library in a matter of hours.

ATLAS Software Releases and Version Numbering
ATLAS almost always has two current software releases available at any one time. The first is the stable release, which is the safest version to use. The stable release has undergone extensive testing, and is known to work on many different platforms. Further, every known bug in the stable release is tracked (along with associated fixes) in the ATLAS errata file []. When errors affecting answer accuracy are discovered in the stable release, a message is sent to the ATLAS error list [], which any user can sign up for. In this way, users get updates anytime the library they are using might have an error, and they can update the software with the supplied patch if the error affects them. Stable releases happen relatively rarely (say once every year or two).
The second available package is the developer release, which is meant to be used by ATLAS developers, contributors, and people happy to live on the bleeding edge. Developer releases typically contain a host of features and performance improvements not available in the stable release, but many of these features will have been exposed to minimal testing (a new developer release may have only been crudely tested on a single platform, whereas a new stable release will have been extensively tested on dozens of platforms). Developer releases happen relatively often (it is not uncommon to release two in the same week).
Each ATLAS release comes with a version number, which is comprised of: <major number>.<minor number>.<update number>. The meaning of these terms is:
Major number: Major release numbers are changed only when fairly large, sweeping changes are made. Changes in the API are the most likely to cause a major release number to increment. For example, when ATLAS went from supporting only matrix multiply to all the Level 3 BLAS, the major number changed; the same happened when ATLAS went from supporting only the Level 3 BLAS to all BLAS.
Minor number: Minor release numbers are changed at each official release. Even numbers represent stable releases, while odd minor numbers are reserved for developer releases.
Update number: Update numbers are essentially patches on a particular release. For instance, stable ATLAS releases only occur roughly once per year or two. As errors are discovered, they are errata-ed, so that a user can apply the fixes by hand. When enough errata are built up that it becomes impractical to apply the important ones by hand, an update release is issued. So, stable updates are typically bug fixes, or important system workarounds, while developer updates often involve substantial new code. A typical number of updates to a stable release might be something like . A developer release may have any number of updates.
So, .. would be a stable release, with one group of fixes already applied. .. would be the th update (th release) of the associated developer release.

Essentials of Empirical Tuning
Any package that adapts software based on timings falls into a classification that ATLAS shares, which we call AEOS (Automated Empirical Optimization of Software). These packages can vary strongly on details, but they must have some commonalities:
. The search must be automated in some way, so that an expert hand-tuner is not required.
→ ATLAS has a variety of searches for different operations, all of which can be found in the ATLAS/tune directory.
. The decision of whether a transformation is useful or not must be empirical, in that an actual timing
measurement on the specific architecture in question is performed, as opposed to the traditional application of transformations using static heuristics or profile counts.
→ ATLAS has a plethora of timers and testers, which can be found in ATLAS/tune and ATLAS/bin. These timers must be much more accurate and context-sensitive than typical timers, since optimization decisions are based on them. ATLAS uses the methods described in [] to ensure high-quality timings.
. These methods must have some way to vary/adapt the software being tuned. ATLAS currently uses parameterized adaptation, multiple implementation, and source generation (see Methods of Software Adaptation for details).

Methods of Software Adaptation
Parameterized adaptation: The simplest method is having runtime or compile-time variables that cause different behaviors depending on input values. In linear algebra, the most important of such parameters is probably the blocking factor(s) used in blocked algorithms, which, when varied, varies the data cache utilization. Other parameterized adaptations in ATLAS include a large number of crossover points (empirically found points in some parameter space where a second algorithm becomes superior to a first). Important crossover points in ATLAS include: whether the problem size is large enough to withstand a data copy, whether the problem is large enough to utilize parallelism, whether a problem dimension is close enough to degenerate that a special-case algorithm should be used, etc.
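The role of such a blocking factor can be seen in a simplified cache-blocked matrix multiply in which NB is the compile-time parameter that an empirical search would vary and time; this toy kernel only illustrates the parameter and is not ATLAS’s actual code:

    #ifndef NB
    #define NB 64          /* blocking factor chosen by the empirical search */
    #endif

    /* C += A*B for N-by-N row-major matrices, computed in NB-by-NB tiles so
     * that the working set of the inner loops fits in the data cache.        */
    void gemm_blocked(int N, const double *A, const double *B, double *C)
    {
        for (int ii = 0; ii < N; ii += NB)
          for (int kk = 0; kk < N; kk += NB)
            for (int jj = 0; jj < N; jj += NB)
              for (int i = ii; i < ii + NB && i < N; i++)
                for (int k = kk; k < kk + NB && k < N; k++) {
                  double aik = A[i * N + k];
                  for (int j = jj; j < jj + NB && j < N; j++)
                    C[i * N + j] += aik * B[k * N + j];
                }
    }

Compiling and timing this kernel for a range of NB values, and keeping the fastest, is exactly the kind of decision a parameterized search automates.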
Not all important tuning variables can be handled by parameterized adaptation (simple examples include instruction cache size, choice of combined or separate multiply and add instructions, length of floating point and fetch pipelines, etc.), since varying them actually requires changing the underlying source code. This then brings in the need for the second method of software adaptation, source code adaptation, which involves actually generating differing implementations of the same operation.
ATLAS presently uses two methods of source code adaptation, which are discussed in greater detail below.
. Multiple implementation: Formalized search of multiple hand-written implementations of the kernel in question. ATLAS uses multiple implementation in the tuning of all levels of the BLAS.
. Source generation: Write a program that can generate differing implementations of a given algorithm based on a series of input parameters. ATLAS presently uses source generation in tuning matrix multiply (and hence the entire Level 3 BLAS).

Multiple implementation: Perhaps the simplest approach for source code adaptation is for an empirical tuning package to supply various hand-tuned implementations, and then the search heuristic can be as simple as trying each implementation in turn until the best is found. At first glance, one might suspect that supplying these multiple implementations would make even this approach to source code adaptation much more difficult than the traditional hand-tuning of libraries. However, traditional hand-tuning is not the mere application of known techniques that it may appear to be when examined casually. Knowing the size and properties of your level 1 cache is not sufficient to choose the best blocking factor, for instance, as this depends on a host of interlocking factors which often defy a priori understanding in the real world. Therefore, it is common in hand-tuned optimizations to utilize the known characteristics of the machine to narrow the search, but then the programmer writes various implementations and chooses the best.
For multiple implementation, this process remains the same, but the programmer adds a search and timing layer to accomplish what would otherwise be done by hand. In the simplest cases, the time to write this layer may not be much if any more than the time the implementer would have spent doing the same process in a less formal way by hand, while at the same time capturing at least some of the flexibility inherent in empirical tuning. Due to its obvious simplicity, this method is highly parallelizable, in the sense that multiple authors can meaningfully contribute without having to understand the entire package. In particular, various specialists on given architectures can provide hand-tuned routines without needing to understand other architectures, or the higher level codes (e.g., timers, search heuristics, higher-level routines which utilize these basic kernels, etc.). Therefore, writing a multiple
implementation framework can allow for outside contribution of hand-tuned kernels in an open source/free software framework such as ATLAS.
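A stripped-down sketch of such a search and timing layer is shown below; the candidate kernels behind the function pointers and the coarse clock()-based timer are placeholders for hand-written implementations and a much more careful timing method:

    #include <stdio.h>
    #include <time.h>

    typedef void (*kernel_fn)(int N, const double *A, const double *B, double *C);

    /* Given an array of candidate kernels (hand-written implementations of
     * the same operation), time each one on the same operands and return
     * the index of the fastest on this machine.                             */
    int pick_best(kernel_fn cand[], int ncand, int N,
                  const double *A, const double *B, double *C)
    {
        int best = 0;
        double best_t = 1e300;
        for (int i = 0; i < ncand; i++) {
            clock_t t0 = clock();
            cand[i](N, A, B, C);                     /* one timed experiment */
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("candidate %d: %.6f s\n", i, t);
            if (t < best_t) { best_t = t; best = i; }
        }
        return best;
    }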
Source generation: In source generation, a source generator (i.e., a program that writes other programs) is produced. This source generator takes as parameters the various source code adaptations to be made. As before, simple examples include loop unrolling factors, choice of combined or separate multiply and add instructions, length of floating point and fetch pipelines, and so on. Depending on the parameters, the source generator produces a routine with the requisite characteristics. The great strength of source generators is their ultimate flexibility, which can allow for far greater tunings than could be produced by all but the best hand-coders. However, generator complexity tends to go up along with flexibility, so that these programs rapidly become almost insurmountable barriers to outside contribution.
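In its simplest form, a source generator is just a program that prints another program. The toy generator below, which is only illustrative and far simpler than ATLAS’s generators, emits a C vector-scaling routine whose loop is unrolled by a factor given on the command line:

    #include <stdio.h>
    #include <stdlib.h>

    /* Usage: ./gen <unroll>  — prints a C routine whose loop body is
     * replicated <unroll> times; each unroll factor is one point in the
     * space of candidate implementations.                                */
    int main(int argc, char **argv)
    {
        int u = (argc > 1) ? atoi(argv[1]) : 4;
        if (u < 1) u = 1;
        printf("void scale(int n, double alpha, double *x)\n{\n");
        printf("    int i;\n");
        printf("    for (i = 0; i <= n - %d; i += %d) {\n", u, u);
        for (int k = 0; k < u; k++)
            printf("        x[i + %d] *= alpha;\n", k);
        printf("    }\n");
        printf("    for (; i < n; i++)\n        x[i] *= alpha;\n");
        printf("}\n");
        return 0;
    }

The search layer then compiles and times the generator’s output for each parameter setting of interest, just as in the multiple implementation case.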
ATLAS therefore combines these two methods of source adaptation, where the GEMM kernel source generator produces strict ANSI/ISO C for maximal architectural portability. Multiple implementation is utilized to encourage outside contribution, and allows for extreme architectural specialization via assembly implementations. Parameterized adaptation is then combined with these two techniques to fully tune the library.
Both multiple implementation and code generation are specific to the kernel being tuned, and can be either platform independent (if written to use portable languages such as C) or platform dependent (as when assembly is used or generated). Empirically tuned compilers can relax the kernel-specific tuning requirement, and there has been some initial work on utilizing this third method of software adaptation [, ] for ATLAS, but this has not yet been incorporated into any ATLAS release.

Search and Software Adaptation for the Level 3 BLAS
ATLAS’s Level 3 BLAS are implemented as GEMM-based BLAS, and so ATLAS’s empirical tuning is all done for matrix multiply (see [] and [] for descriptions of ATLAS’s GEMM-based BLAS). As of this writing, the most current stable release is .., and the newest developer release is ... Some features discussed below are available only in the later developer releases; this is noted whenever true.
GEMM (GEneral rectangular Matrix Multiply) is empirically tuned using all discussed methods (parameterized adaptation, multiple implementation, and source generation). Parameterization is mainly used for crossover points and cache blocking, and ATLAS uses both its methods of source code adaptation in order to optimize GEMM:
. Code generation: ATLAS’s main code generator produces ANSI C implementations of ATLAS’s matrix multiply kernel []. The code generator can be found in ATLAS/tune/blas/gemm/emit_mm.c. ATLAS/tune/blas/gemm/mmsearch.c is the master GEMM search that not only exercises the options to emit_mm.c, but also invokes all subservient searches. With emit_mm.c and mmsearch.c, ATLAS has a general-purpose code generator that can work on any platform with an ANSI C compiler. However, most compilers are generally unsuccessful in vectorizing these types of code (and vectorization has become critically important, especially on the x86), and so from version .. and later, ATLAS has a code generator written by Chad Zalkin that generates SSE vectorized code using gcc’s implementation of the Intel SSE intrinsics. The vectorized source generator is ATLAS/tune/blas/gemm/mmgen_sse.c, and the search that exercises its options is ATLAS/tune/blas/gemm/mmksearch_sse.c.
. Multiple implementation: ATLAS also tunes GEMM using multiple implementation, and this search can be found in ATLAS/tune/blas/gemm/ummsearch.c, and all the hand-tuned kernel implementations which are searched can be found in ATLAS/tune/blas/gemm/CASES/.

Empirical Tuning in the Rest of the Package
Currently, the Level 1 and 2 BLAS are tuned only via parameterization and multiple implementation searches. ATLAS .. and greater has some prototype SSE generators for matrix–vector multiply and rank-1 update (the most important of the Level 2 BLAS routines) available in ATLAS/tune/blas/gemv/mvgen_sse.c & ATLAS/tune/blas/ger/r1gen_sse.c, respectively. These generators are currently not
debugged, and lack a search, but may be used in later releases.
ATLAS can also autotune LAPACK’s blocking factor, as discussed in []. ATLAS currently autotunes the QR factorization for both parallel and serial implementations.

Parallelism in ATLAS
ATLAS is currently used in two main ways in parallel programming. Parallel programmers call ATLAS’s serial routines directly in algorithms they themselves have parallelized. The other main avenue of parallel ATLAS use is programmers that write serial algorithms which get implicit parallelism by calling ATLAS’s parallelized BLAS implementation. ATLAS currently parallelizes only the Level 3 BLAS (the Level 1 and 2 are typically bus-bound, so parallelization of these operations can sometimes lead to slowdown due to increased bus contention). In the current stable (..), the parallel BLAS use pthreads. However, recent research [] forced a complete rewrite of the threading system for the developer release, which results in as much as a doubling of parallel application performance. The developer release also supports Windows threads and OpenMP in addition to pthreads. Ongoing work involves improving parallelism at the LAPACK level [], and empirically tuning parallel crossover points, which may lead to the safe parallelization of the Level 1 and 2 BLAS.

Discussion
History of the Project
The research that eventually grew into ATLAS was undertaken by R. Clint Whaley in early , as a direct response to a problem from ongoing research on parallel programming and High Performance Fortran (HPF). The Innovative Computer Laboratory (ICL) of the University of Tennessee at Knoxville (UTK) had two small clusters, one consisting of SPARC chips, and the other PentiumPROs. For the PentiumPROs, no optimized BLAS were available. For the SPARC cluster, SUN provided an optimized BLAS, but licensed them so that they could not be used with non-SUN compilers, such as the HPF compiler. This led to the embarrassment of having a -processor parallel algorithm run slower than a serial code on the same chips, due to lack of a portably optimal BLAS.
The PHiPAC effort [] from Berkeley comprised the first systematic attempt to harness automated empirical tuning in this area. PHiPAC did not deliver the required performance on the platforms being used, but the idea was obviously sound. Whaley began working on this idea on nights and weekends in an attempt to make the HPF results more credible. Eventually, a working prototype was produced, and was demonstrated to the director of ICL (Jack Dongarra), and it became the full-time project for Whaley, who was eventually joined on the project by Antoine Petitet. ATLAS has been under continuous development since that time, and has followed Whaley to a variety of institutions, as shown in the ATLAS timeline below. Please note that ATLAS is an open source project, so many people have substantially contributed to ATLAS that are not mentioned here. Please see ATLAS/doc/AtlasCredits.txt for a rough description of developer contribution to ATLAS.

Rough ATLAS Timeline Including Stable Releases
Early : Development of prototype by Whaley in spare time.
Mid : Whaley full time on ATLAS development at ICL/UTK.
Dec : Technical report describing ATLAS published. ATLAS v . released, provides S/D GEMM only.
Sep : Version . released, using SuperScalar GEMM-based BLAS [] to provide the entire real Level 3 BLAS in both precisions.
Mid : Antoine Petitet joins ATLAS group at ICL/UTK.
Feb : Version . released. Automated install and configure steps, all Level 3 BLAS supported in all four types/precisions, C interface to BLAS added.
Dec : Version .Beta released. ATLAS provides complete BLAS support with C and F interfaces. GEMM generator can generate all transpose cases, saving the need to copy small matrices. Addition of LU, Cholesky, and associated LAPACK routines.
Dec : Version . released. Pthreads support for parallel Level 3 BLAS. Support for user-contribution of kernels, and associated multiple implementation search.
Mid : Antoine leaves to take job at SUN/France. between degrees of parallelism, which should improve
Jan : Whaley begins PhD work in iterative compi- current performance and enable opportunities to safely
lation at FSU. parallelize additional operations. Further work along
Jun : Version . released. Level  BLAS optimized. the lines of [] is being pursued in order to more effec-
Addition of LAPACK inversion and related routines. tively parallelize LAPACK. Finally, use of massively par-
Addition of sanity check after install. allel GPUs is being investigated based on the impressive
Dec : Version . released. Numerous optimiza- initial work of [, , ].
tions, but no new API coverage.
Jul : Whaley begins as assistant professor at UTSA. LAPACK
Mar : NSF/CRI funding for ATLAS obtained. In addition to the parallel work mentioned in the previ-
Dec : Version . released. Complete rewrite of ous section, the ATLAS group is expanding the coverage
configure and install, as part of overall moderniza- of routines to include all Householder factorization-
tion of package (bitrotted from  to , where related routines, based on the ideas presented in []
there was no financial support, and so ATLAS work (this will result in ATLAS providing a native imple-
only done on volunteer basis). Numerous fixes and mentation, with full C and Fortran interfaces, for
optimizations, but API support unchanged. Addi- all dense factorizations). Another ongoing investiga-
tion of ATLAS install guide. Extended LAPACK tion involves extending the LAPACK tuning discussed
timing/testing available. in [] to handle more routines more efficiently. Finally,
the ATLAS group is researching error control [] with
Ongoing Research and Development an eye to keeping error bounds low while using faster
There are a host of areas in ATLAS that require signifi- algorithms.
cant research in order to improve. Since , there has
been continuous ATLAS R&D coming from Whaley’s Bibliography
group at the University of Texas at San Antonio. Here, . Anderson E, Bai Z, Bischof C, Demmel J, Dongarra J, Du
we discuss only those areas that the ATLAS team is actu- Croz J, Greenbaum A, Hammarling S, McKenney A, Ostrou-
ally currently investigating (time of writing: Novem- chov S, Sorensen D () LAPACK users’ guide, rd edn. SIAM,
Philadelphia, PA
ber ). Initial work on these ideas is already in
. Bilmes J, Asanović K, Chin CW, Demmel J () Optimizing
the newest developer releases, and at least some of the matrix multiply using PHiPAC: a portable, high-performance,
results will be available in the next stable release (..). ANSI C coding methodology. In: Proceedings of the ACM
SIGARC International Conference on SuperComputing, Vienna,
Kernel Tuning Austria, July 
Improving ATLAS’s kernel tuning is an ongoing focus. . Castaldo AM, Whaley RC () Minimizing startup costs for
performance-critical threading. In: Proceedings of the IEEE
One area of interest involves exploiting empirical com-
international parallel and distributed processing symposium,
pilation [, ] for all kernels, and adding additional Rome, Italy, May 
vectorized source generators, particularly for the Level  . Castaldo AM, Whaley RC () Scaling LAPACK panel
and  BLAS. Additionally, it is important to add sup- operations using parallel cache assignment. In: Accepted for pub-
port for tuning the Level  and  BLAS to different lication in th AMC SIGPLAN Annual Symposium on Prin-
cache states, as discussed in []. There is also an ongo- ciples and Practice of Parallel Programming, Bangalore, India,
January 
ing effort to rewrite the ATLAS timing and tuning . Castaldo AM, Whaley RC, Chronopoulos AT () Reducing
frameworks so that others can easily plug in their own floating point error in dot product using the superblock family of
searches or empirical tuning frameworks for ATLAS to algorithms. SIAM J Sci Comput ():–
automatically use. . Dongarra J, Du Croz J, Duff I, Hammarling S () A set of level
 basic linear algebra subprograms. ACM Trans Math Softw ():
–
Improvements in Parallel Performance
. Dongarra J, Du Croz J, Hammarling S, Hanson R () Algo-
There is ongoing work aimed at extending [], both rithm : an extended set of basic linear algebra subprograms:
to cover more OSes, to even further improve over- model implementation and test programs. ACM Trans Math
heads, and to empirically tune a host of crossover points Softw ():–
. Dongarra J, Du Croz J, Hammarling S, Hanson R () An . Whaley RC, Whalley DB () Tuning high performance
extended set of FORTRAN basic linear algebra subprograms. kernels through empirical compilation. In: The  interna-
A
ACM Trans Math Softw ():– tional conference on parallel processing, Oslo, Norway, June
. Elmroth E, Gustavson F () Applying recursion to serial and , pp –
parallel qr factorizaton leads to better performance. IBM J Res . Yi Q, Whaley RC () Automated transformation for
Dev ():– performance-criticial kernels. In: ACM SIGPLAN sympo-
. Hanson R, Krogh F, Lawson C () A proposal for standard sium on library-centric software design, Montreal, Canada,
linear algebra subprograms. ACM SIGNUM Newsl ():– October 
. Kågström B, Ling P, van Loan C () Gemm-based level 
blas: high performance model implementations and performance
evaluation benchmark. ACM Trans Math Softw ():–
. Lawson C, Hanson R, Kincaid D, Krogh F () Basic linear
algebra subprograms for fortran usage. ACM Trans Math Softw Atomic Operations
():–
. Li Y, Dongarra J, Tomov S () A note on autotuning GEMM
Synchronization
for GPUs. Technical Report UT-CS--, University of Ten- Transactional Memories
nessee, January 
. Whaley TC et al. Atlas mailing lists. http://math-atlas.
sourceforge.net/faq.html#lists
. Volkov V, Demmel J () Benchmarking GPUs to tune
dense linear algebra. In Supercomputing . Los Alamitos, Automated Empirical
November  Optimization
. Volkov V, Demmel J (). LU, QR and Cholesky factorizations
using vector capabilities of GPUs. Technical report, University of Autotuning
California, Berkeley, CA, May 
. Whaley RC () Atlas errata file. http://math-atlas.sourceforge.
net/errata.html
. Whaley RC () Empirically tuning lapack’s blocking factor for
increased performance. In: Proceedings of the International Mul- Automated Empirical Tuning
ticonference on Computer Science and Information Technology,
Wisla, Poland, October 
Autotuning
. Whaley RC, Castaldo AM () Achieving accurate and
context-sensitive timing for code optimization. Softw Practice
Exp ():–
. Whaley RC, Dongarra J () Automatically tuned linear algebra
software. Technical Report UT-CS--, University of Ten- Automated Performance Tuning
nessee, TN, December . http://www.netlib.org/lapack/lawns/
lawn.ps
Autotuning
. Whaley RC, Dongarra J () Automatically tuned linear alge-
bra software. In: SuperComputing : high performance net-
working and computing, San Antonio, TX, USA, . CD-ROM
proceedings. Winner, best paper in the systems category. http://
www.cs.utsa.edu/~whaley/papers/atlas_sc.ps
. Whaley RC, Dongarra J () Automatically tuned linear algebra
Automated Tuning
software. In: Ninth SIAM conference on parallel processing for
scientific computing, . CD-ROM proceedings. Autotuning
. Whaley RC, Petitet A () Atlas homepage. http://math-atlas.
sourceforge.net/
. Whaley RC, Petitet A () Minimizing development and main-
tenance costs in supporting persistently optimized BLAS. Softw
Practice Exp ():–, February . http://www.cs.utsa. Automatically Tuned Linear
edu/~whaley/papers/spercw.ps Algebra Software (ATLAS)
. Whaley RC, Petitet A, Dongarra JJ () Automated empirical
optimization of software and the ATLAS project. Parallel Comput ATLAS (Automatically Tuned Linear Algebra
(–):–
Software)
Autotuning
Richard W. Vuduc
Georgia Institute of Technology, Atlanta, GA, USA

Synonyms
Automated empirical optimization; Automated empirical tuning; Automated performance tuning; Automated tuning; Software autotuning

Definition
Automated performance tuning, or autotuning, is an automated process, guided by experiments, of selecting one from among a set of candidate program implementations to achieve some performance goal. “Performance goal” may mean, for instance, the minimization of execution time, energy delay, storage, or approximation error. An “experiment” is the execution of a benchmark and observation of its results with respect to the performance goal. A system that implements an autotuning process is referred to as an autotuner. An autotuner may be a stand-alone code generation system or may be part of a compiler.

Discussion
Introduction: From Manual to Automated Tuning
When tuning a code by hand, a human programmer typically engages in the following iterative process. Given an implementation, the programmer develops or modifies the program implementation, performs an experiment to measure the implementation’s performance, and then analyzes the results to decide whether the implementation has met the desired performance goal; if not, he or she repeats these steps. The use of an iterative, experiment-driven approach is typically necessary when the program and hardware performance behavior is too complex to model explicitly in another way.
Autotuning attempts to automate this process. More specifically, the modern notion of autotuning is a process consisting of the following components, any or all of which are automated in a given autotuning approach or system:
● Identification of a space of candidate implementations. That is, the computation is associated with some space of possible implementations that may be defined implicitly through parameterization.
● Generation of these implementations. That is, an autotuner typically possesses some facility for producing (generating) the actual code that corresponds to any given point in the space of candidates.
● Search for the best implementation, where best is defined relative to the performance goal. This search may be guided by an empirically derived model and/or actual experiments, that is, benchmarking candidate implementations.
Bilmes et al. first described this particular notion of autotuning in the context of an autotuner for dense matrix–matrix multiplication [].

Intellectual Genesis
The modern notion of autotuning can be traced historically to several major movements in program generation.
The first is the work by Rice on polyalgorithms. A polyalgorithm is a type of algorithm that, given a particular input instance, selects from among several candidate algorithms to carry out the desired computation []. Rice’s later work included his attempt to formalize algorithm selection mathematically, along with an approximation theory approach to its solution, as well as applications in both numerical algorithms and operating system scheduler selection []. The key influence on autotuning is the notion of multiple candidates and the formalization of the selection problem as an approximation and search problem.
A second body of work is in the area of profile- and feedback-directed compilation. In this work, the compiler instruments the program to gather and store data about program behavior at runtime, and then uses this data to guide subsequent compilation of the same program. This area began with the introduction of detailed program performance measurement [] and measurement tools []. Soon after, measurement became an integral component in the compilation process itself in several experimental compilers, including Massalin’s superoptimizer for
exploring candidate instruction schedules, as well as the peephole optimizers of Chang et al. []. Dynamic or just-in-time compilers employ similar ideas; see Smith’s survey of the state-of-the-art in this area as of  []. The key influence on modern autotuning is the idea of automated measurement-driven transformation.
The third body of work that has exerted a strong influence on current work in autotuning is that of formalized domain-driven code generation systems. The first systems were developed for signal processing and featured high-level transformation/rewrite systems that manipulated symbolic formulas and translated these formulas into code [, ]. The key influence on modern autotuning is the notion of high-level mathematical representations of computations and the automated transformations of these representations.
Autotuning as it is largely studied today began with the simultaneous independent development of the PHiPAC autotuner for dense matrix–matrix multiplication [], FFTW for the fast Fourier transform [], and the OCEANS iterative compiler for embedded systems []. These were soon followed by additional systems for dense linear algebra (ATLAS) [] and signal transforms (SPIRAL) [].

Contemporary Issues and Approaches
A developer of an autotuner must, in implementing his or her autotuning system or approach, consider a host of issues for each of the three major autotuning components, that is, identification of candidate implementations, generation, and search. The issues cut across components, though for simplicity the following exposition treats them separately.

Identification of Candidates
The identification of candidates may involve characterization of the target computations and/or anticipation of the likely target hardware platforms. Key design points include what candidates are possible, who specifies the candidates, and how the candidates are represented for subsequent code generation and/or transformation.
For example, for a given computation there may be multiple candidate algorithms and data structures, for example, which linear solver, what type of graph data structure. If the computation is expressed as code, the space may be defined through some parameterization of the candidate transformations, for example, depth of loop unrolling, cache blocking/tiling sizes. There may be numerous other implementation issues, such as where to place data or how to schedule tasks and communication.
Regarding who identifies these candidates, possibilities include the autotuner developer; the autotuner itself, for example, through preprogrammed and possibly extensible rewrite rules; or even the end-user programmer, for example, through a meta-language or directives.
Among target architectures, autotuning researchers have considered all of the different categories of sequential and parallel architectures. These include everything from superscalar cache-based single- and multicore systems to vector-enabled systems, and shared and distributed memory multiprocessor systems.

Code Generation
Autotuners employ a variety of techniques for producing the actual code.
One approach is to build a specialized code generator that can only produce code for a particular computation or family of computations. The generator itself might be as conceptually simple as a script that, given a few parameter settings, produces an output implementation. It could also be as sophisticated as a program that takes an abstract mathematical formula as input, using symbolic algebra and rewriting to transform the formula to an equivalent formula, and translating the formula to an implementation. In a compiler-based autotuner, the input could be code in a general-purpose language, with conventional compiler technology enabling the transformation or rewriting of that input to some implementation output. In other words, there is a large design space for the autotuner code generator component. Prominent examples exist using combinations of these techniques.

Search
Given a space of implementations, selecting the best implementation is, generically, a combinatorial search problem. Key questions are what type of search to use, when to search, and how to evaluate the candidate implementations during search.
Among types of search, the simplest is an exhaustive approach, in which the autotuner enumerates all
possible candidates and experimentally evaluates each one. To reduce the potential infeasibility of exhaustive search, numerous pruning heuristics are possible. These include random search, simulated annealing, statistical experimental design approaches, machine-learning guided search, genetic algorithms, and dynamic programming, among others.
On the question of when to search, the main categories are off-line and at runtime. An off-line search would typically occur once per architecture or architectural family, prior to use by an application. A runtime search, by contrast, occurs when the computation is invoked by the end-user application. Hybrid approaches are of course possible. For example, a series of off-line benchmarks could be incorporated into a runtime model that selects an implementation. Furthermore, tuning could occur across multiple application invocations through historical recording and analysis mechanisms, as is the case in profile and feedback-directed compilation.
The question of evaluation is one of the methodologies for carrying out the experiment. One major issue is whether this evaluation is purely experimental or guided by some predictive model (or both). The model itself may be parameterized and parameter values learned (in the statistical sense) during tuning. A second major issue is under what context evaluation occurs. That is, tuning may depend on features of the input data, so conducting an experiment in the right context could be critical to effective tuning.

Additional Pointers
The literature on work related to autotuning is large and growing. At the time of this writing, the last major comprehensive surveys of autotuning projects had appeared in the Proceedings of the IEEE special issue edited by Moura et al. [] and the article by Vuduc et al. [, Section ], which include many of the methodological aspects of autotuning described here.
There are numerous community-driven efforts to assemble autotuning researchers and disseminate their results. These include: the U.S. Department of Energy (DOE) sponsored workshop on autotuning, organized under the auspices of CScADS []; the International Workshop on Automatic Performance Tuning []; and the Collective Tuning (cTuning) wiki [], to name a few.
There are numerous debates and issues within this community at present, a few examples of which follow:
● How will we measure the success of autotuners, in terms of performance, productivity, and/or other measures?
● To what extent can entire programs be autotuned, or will large successes be largely limited to relatively small library routines and small kernels extracted from programs?
● For what applications is data-dependent tuning really necessary?
● Is there a common infrastructure (tools, languages, intermediate representations) that could support autotuning broadly, across application domains?
● Where does the line between an autotuner and a “traditional” compiler lie?
● When is search necessary, rather than analytical models? Always? Never? Or only sometimes, perhaps as a “stopgap” measure when porting to new platforms? How does an autotuner know when to stop?

Related Entries
Algorithm Engineering
ATLAS (Automatically Tuned Linear Algebra Software)
Benchmarks
Code Generation
FFTW
Profiling
Spiral

Bibliography
. () Collective Tuning Wiki. http://ctuning.org
. () International Workshop on Automatic Performance Tuning. http://iwapt.org
. Aarts B, Barreteau M, Bodin F, Brinkhaus P, Chamski Z, Charles H-P, Eisenbeis C, Gurd J, Hoogerbrugge J, Hu P, Jalby W, Knijnenburg PMW, O’Boyle MFP, Rohou E, Sakellariou R, Schepers H, Seznec A, Stöhr E, Verhoeven M, Wijshoff HAG () OCEANS: Optimizing Compilers for Embedded Applications. In: Proc. Euro-Par, vol  of LNCS, Passau, Germany. Springer, Berlin/Heidelberg
. Bilmes J, Asanovic K, Chin C-W, Demmel J () Optimizing matrix multiply using PHiPAC: A portable, high-performance, ANSI C coding methodology. In: Proc. ACM Int’l Conf Supercomputing (ICS), Vienna, Austria, pp –
. Center for Scalable Application Development Software () Workshop on Libraries and Autotuning for Petascale Applications. http://cscads.rice.edu/workshops/summer/autotuning
. Chang PP, Mahlke SA, Mei W, Hwu W () Using profile information to assist classic code optimizations. Software: Pract Exp ():–
. Covell MM, Myers CS, Oppenheim AV () Symbolic and knowledge-based signal processing, chapter : computer-aided algorithm design and rearrangement. Signal Processing Series. Prentice-Hall, pp –
. Frigo M, Johnson SG () A fast Fourier transform compiler. ACM SIGPLAN Notices ():–. Origin: Proc. ACM Conf. Programming Language Design and Implementation (PLDI)
. Graham SL, Kessler PB, McKusick MK () gprof: A call graph execution profiler. ACM SIGPLAN Notices ():–. Origin: Proc. ACM Conf. Programming Language Design and Implementation (PLDI)
. Johnson JR, Johnson RW, Rodriguez D, Tolimieri R () A methodology for designing, modifying, and implementing Fourier Transform algorithms on various architectures. Circuits, Syst Signal Process ():–
. Knuth DE () An empirical study of FORTRAN programs. Software: Practice Exp ():–
. Moura JMF, Püschel M, Dongarra J, Padua D (eds) () Proceedings of the IEEE: Special Issue on Program Generation, Optimization, and Platform Adaptation, vol . IEEE Comp Soc. http://ieeexplore.ieee.org/xpl/tocresult.jsp?isNumber=&puNumber=
. Püschel M, Moura JMF, Johnson J, Padua D, Veloso M, Singer B, Xiong J, Franchetti F, Gacic A, Voronenko Y, Chen K, Johnson RW, Rizzolo N () SPIRAL: Code generation for DSP transforms. Proc. IEEE: Special issue on “Program Generation, Optimization, and Platform Adaptation” ():–
. Rice JR () A polyalgorithm for the automatic solution of nonlinear equations. In: Proc. ACM Annual Conf./Annual Mtg. New York, pp –
. Rice JR () The algorithm selection problem. In: Alt F, Rubinoff M, Yovits MC (eds) Adv Comp :–
. Smith MD () Overcoming the challenges to feedback-directed optimization. ACM SIGPLAN Notices ():–
. Vuduc R, Demmel J, Bilmes J () Statistical models for empirical search-based performance tuning. Int’l J High Performance Comp Appl (IJHPCA) ():–
. Whaley RC, Petitet A, Dongarra J () Automated empirical optimizations of software and the ATLAS project. Parallel Comp (ParCo) (–):–
B
architecture in use. The strength of the bandwidth-
Backpressure latency models is that they model quite precisely
the communication and synchronization operations.
Flow Control When the parameters of a given machine are fed
into the performance analysis of a parallel algorithm,
bandwidth-latency models lead to rather precise and
Bandwidth-Latency Models (BSP, expressive performance evaluations. The most impor-
LogP) tant models of this class are bulk synchronous parallel
(BSP) and LogP, as discussed below.
Thilo Kielmann , Sergei Gorlatch

Vrije Universiteit, Amsterdam, The Netherlands The BSP Model

Westfälische Wilhelms-Universität Münster, Münster, The bulk synchronous parallel (BSP) model was pro-
Germany posed by Valiant [] to overcome the shortcomings
of the traditional PRAM (Parallel Random Access
Synonyms Machine) model, while keeping its simplicity. None of
Message-passing performance models; Parallel com- the suggested PRAM models offers a satisfying fore-
munication models cast of the behavior of parallel machines for a wide
range of applications. The BSP model was developed
Definition as a bridge between software and hardware developers:
Bandwidth-latency models are a group of performance if the architecture of parallel machines is designed as
models for parallel programs that focus on modeling prescribed by the BSP model, then software develop-
the communication between the processes in terms of ers can rely on the BSP-like behavior of the hardware.
network bandwidth and latency, allowing quite precise Furthermore it should not be necessary to customize
performance estimations. While originally developed perpetually the model of applications to new hardware
for distributed-memory architectures, these models details in order to benefit from a higher efficiency of
also apply to machines with nonuniform memory emerging architectures.
access (NUMA), like the modern multi-core The BSP model is an abstraction of a machine with
architectures. physically distributed memory that uses a presenta-
tion of communication as a global bundle instead of
Discussion single point-to-point transfers. A BSP model machine
consists of a number of processors equipped with mem-
Introduction
ory, a connection network for point-to-point messages
The foremost goal of parallel programming is to speed
between processors, and a synchronization mechanism,
up algorithms that would be too slow when exe-
which allows a barrier synchronization of all processors.
cuted sequentially. Achieving this so-called speedup
Calculations on a BSP machine are organized as a
requires a deep understanding of the performance of
sequence of supersteps as shown in Fig. :
the inter-process communication and synchronization,
together with the algorithm’s computation. Both com- ● In each superstep, each processor executes local
putation and communication/synchronization perfor- computations and may perform communication
mance strongly depend on properties of the machine operations.

David Padua (ed.), Encyclopedia of Parallel Computing, DOI ./----,


© Springer Science+Business Media, LLC 
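To make the superstep structure concrete, the following schematic sketch shows how a BSP computation alternates the three phases. It is only a structural outline under stated assumptions: the supersteps are assumed to be given as pairs of callables, and barrier stands in for whatever synchronization mechanism a real BSP library would provide.

    def run_bsp(supersteps, barrier):
        """Schematic structure of a BSP computation.

        Each superstep computes on locally held data, then posts its
        communication, and ends with a barrier, so data exchanged in a
        superstep becomes usable only in the next one.
        """
        for local_compute, exchange in supersteps:
            local_compute()   # phase 1: local computation only
            exchange()        # phase 2: global communication (an h-relation)
            barrier()         # phase 3: barrier synchronization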
The BSP model machine can be viewed as a MIMD (Multiple Instruction Multiple Data) system, because the processes can execute different instructions simultaneously. It is loosely synchronous at the superstep level, compared to the instruction-level tight synchrony in the PRAM model: within a superstep, different processes execute asynchronously at their own pace. There is a single address space, and a processor can access not only its local memory but also any remote memory in another processor, the latter imposing communication.

Within a superstep, each computation operation uses only data in the processor's local memory. These data are put into the local memory either at program start-up time or by the communication operations of previous supersteps. Therefore, the computation operations of a process are independent of other processes, while it is not allowed for multiple processes to read or write the same memory location in the same step. Because of the barrier synchronization, all memory and communication operations in a superstep must completely finish before any operation of the next superstep begins. These restrictions imply that a BSP computer has a sequential consistency memory model.

A program execution on a BSP machine is characterized using the following four parameters []:

● p: the number of processors.
● s: the computation speed of the processors, expressed as the number of computation steps that can be executed by a processor per second. In each step, one arithmetic operation on local data can be executed.
● l: the number of steps needed for the barrier synchronization.
● g: the average number of steps needed for transporting a memory word between two processors of the machine.

In a real parallel machine, there are many different patterns of communication between processors. For simplicity, the BSP model abstracts the communication operations using the h-relation concept: an h-relation is an abstraction of any communication operation in which each node sends at most h words to various nodes and each node receives at most h words. On a BSP computer, the time to realize any h-relation is not longer than g⋅h.

The BSP model is more realistic than the PRAM model, because it accounts for several overheads ignored by PRAM:

● To account for load imbalance, the computation time w is defined as the maximum number of steps spent on computation operations by any processor.
● The synchronization overhead is l, which has a lower bound equal to the communication network latency (i.e., the time for a word to propagate through the physical network) and is always greater than zero.
● The communication overhead is g⋅h steps, i.e., g⋅h is the time to execute the most time-consuming h-relation. The value of g is platform-dependent: it is smaller on a computer with more efficient communication support.
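As a small illustration of how these parameters combine, the sketch below computes h for a communication phase and bounds its cost by g⋅h. The numerical values and the distribution of words over processors are invented for the example.

    def h_relation(words_sent, words_received):
        # h of a communication phase: the maximum number of words that any
        # single processor sends or receives.
        return max(max(words_sent), max(words_received))

    def communication_steps(g, words_sent, words_received):
        # BSP bound: any h-relation completes within g * h steps.
        return g * h_relation(words_sent, words_received)

    # Example with invented numbers: 4 processors, g = 4 steps per word.
    sent     = [6, 3, 3, 2]
    received = [2, 5, 4, 3]
    print(communication_steps(4, sent, received))   # 4 * 6 = 24 steps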
For a real parallel machine, the value of g depends on the bandwidth of the communication network, the communication protocols, and the communication library. The value of l depends on the diameter of the network, as well as on the communication library. Both parameters are usually estimated using benchmark programs. Since the value s is used for normalizing the values l and g, only p, l, and g are independent parameters. The execution time of a BSP program is the sum of the execution times of all supersteps. The execution time of each superstep comprises three terms: (1) the maximum local computation time of all processes, w, (2) the cost of the global communication realizing an h-relation, and (3) the cost of the barrier synchronization finalizing the superstep:

Tsuperstep = w + g⋅h + l    (1)

BSP allows for overlapping of the computation, the communication, and the synchronization operations within a superstep. If all three types of operations are fully overlapped, then the time for a superstep becomes max(w, g⋅h, l). However, usually the more conservative w + g⋅h + l is used.
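A minimal cost-model sketch of Eq. (1) follows; the per-superstep values of w and h are assumed to be supplied by the algorithm designer, and the fully overlapped variant is included only for comparison.

    def superstep_time(w, h, g, l, overlap=False):
        """Cost of one superstep under the BSP model (Eq. 1).

        w: maximum local computation steps over all processors
        h: size of the h-relation realized by the superstep
        g: steps per transported word, l: barrier synchronization steps
        """
        return max(w, g * h, l) if overlap else w + g * h + l

    def bsp_program_time(supersteps, g, l):
        # Total execution time: the sum over all supersteps (w, h) pairs.
        return sum(superstep_time(w, h, g, l) for (w, h) in supersteps)

    # Example with invented machine and program parameters:
    print(bsp_program_time([(1000, 50), (400, 200)], g=4, l=100))   # 2600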
The BSP model was implemented in a so-called BSPlib library [, ] that provides operations for the initialization of supersteps, the execution of communication operations, and barrier synchronizations.

The LogP Model
In [], Culler et al. criticize BSP's assumption that the length of supersteps must allow realizing arbitrary h-relations, which means that the granularity has a lower bound. Also, the messages that have been sent during a superstep are not available to a recipient before the following superstep begins, even if a message is sent and received within the same superstep. Furthermore, the BSP model assumes hardware support for the synchronization mechanism, although most existing parallel machines do not provide such support. Because of these problems, the LogP model [] has been devised, which is arguably more closely related to the modern hardware used in parallel machines.

In analogy to BSP, the LogP model assumes that a parallel machine consists of a number of processors equipped with memory. The processors can communicate using point-to-point messages through a communication network. The behavior of the communication is described using four parameters L, o, g, and P, which give the model the name LogP:

● L (latency) is an upper bound for the network's latency, i.e., the maximum delay between sending and receiving a message.
● o (overhead) describes the period of time in which a processor is busy with sending or receiving a message; during this time, no other work can be performed by that processor.
● g (gap) denotes the minimal timespan between sending or receiving two messages back-to-back.
● P is the number of processors of the parallel machine.

Figure 2 shows a visualization of the LogP parameters []. Except for P, all parameters are measured in time units or multiples of machine cycles. The model assumes a network with finite capacity. From the definitions of L and g, it follows that, for a given pair of source and destination nodes, the number of messages that can be on their way through the network simultaneously is limited by L/g. A processor trying to send a message that would exceed this limitation will be blocked until the network can accept the next message. This property models the network bandwidth, where the parameter g reflects the bottleneck bandwidth, independent of the bottleneck's location, be it the network link, or be it the processing time spent on sending or receiving. g thus denotes an upper bound for o.

Bandwidth-Latency Models (BSP, LogP). Fig. 2  Parameter visualization of the LogP model: P processors, each with memory M, connected by a communication network; the send/receive overhead o and the latency L are indicated

For this capacity constraint, LogP both gets praised and criticized. The advantage of this constraint is a very realistic modeling of communication performance, as the bandwidth capacity of a communication path can get limited by any entity between the sender's memory and the receiver's memory. Due to this focus on point-to-point communication, LogP (variants) have been successfully used for modeling various computer communication problems, like the performance of the Network File System (NFS) [] or collective communication operations from the Message Passing Interface (MPI) []. The disadvantage of the capacity constraint is that LogP exposes the sensitivity of a parallel computation to the communication performance of individual processors. This way, analytic performance modeling is much harder with LogP, as compared to models with a higher abstraction level like BSP [].

While LogP assumes that the processors are working asynchronously, it requires that no message may exceed a preset size, i.e., larger messages must be fragmented. The latency of a single message is not predictable, but it is limited by L. This means, in particular, that messages can overtake each other such that the recipient may potentially receive the messages out of order. The values of the parameters L, o, and g depend on hardware characteristics, the communication software used, and the underlying communication protocols.

The time to communicate a message from one node to another (i.e., the start-up time t) consists of three terms: t = o + L + o. The first o is called the send overhead, which is the time at the sending node to execute a message send operation in order to inject a message into the network. The second o is called the receive overhead, which is the time at the receiving node to execute a message receive operation. For simplicity, the two overheads are assumed equal and called the overhead o, i.e., o is the length of a time period during which a node is engaged in sending or receiving a message. During this time, the node cannot perform other operations (e.g., overlapping computations).

In the LogP model, the runtime of an algorithm is determined as the maximum runtime across all processors. A consequence of the LogP model is that the access to a data element in the memory of another processor costs Tm = 2⋅(L + 2⋅o) time units (a message round-trip), of which one half is used for reading, and the other half for writing. A sequence of n pipelined messages can be delivered in Tn = L + 2⋅o + (n − 1)⋅g time units, as shown in Fig. 3.

Bandwidth-Latency Models (BSP, LogP). Fig. 3  Modeling the transfer of a large message in n segments using the LogP model. The last message segment is sent at time Ts = (n − 1)⋅g and will be received at time Tr = Ts + o + L

A strong point of LogP is its simplicity. This simplicity, however, can at times lead to inaccurate performance modeling. The most obvious limitation is the restriction to small messages only. This was overcome by introducing a LogP variant, called LogGP []. It contains an additional parameter G (Gap per Byte) as the time needed to send a byte of a large message; 1/G is the bandwidth per processor. The time for sending a message consisting of n bytes is then Tn = o + (n − 1)⋅G + L + o.
A more radical extension is the parameterized LogP model [], PLogP for short. PLogP has been designed taking observations about communication software, like the Message Passing Interface (MPI), into account. These observations are: (1) overhead and gap strongly depend on the message size; some communication libraries even switch between different transfer implementations for short, medium, and large messages. (2) Send and receive overhead can strongly differ, as the handling of asynchronously incoming messages needs a fundamentally different implementation than a synchronously invoked send operation. In the PLogP model (see Fig. 4), the original parameters o and g have been replaced by the send overhead os(m), the receive overhead or(m), and the gap g(m), where m is the message size. L and P remain the same as in LogP.

Bandwidth-Latency Models (BSP, LogP). Fig. 4  Visualization of a message transport with m bytes using the PLogP model. The sender is busy for os(m) time. The message has been received at T = L + g(m), out of which the receiver had been busy for or(m) time

PLogP allows precise performance modeling when parameters for the relevant message sizes of an application are used. In [], this has been demonstrated for MPI's collective communication operations, even for hierarchical communication networks with different sets of performance parameters. Pješivac-Grbović et al. [] have shown that PLogP provides flexible and accurate performance modeling.

Concluding Remarks
Historically, the major focus of parallel algorithm development had been the PRAM model, which ignores data access and communication cost and considers only load balance and extra work. PRAM is very useful in understanding the inherent concurrency of an application, which is the first conceptual step in developing a parallel program; however, it does not take into account important realities of particular systems, such as the fact that data access and communication costs are often the dominant components of execution time.

The bandwidth-latency models described in this chapter articulate the performance issues against which software must be designed. Based on a clearer understanding of the importance of communication costs on modern machines, models like BSP and LogP help analyze communication cost and hence improve the structure of communication. These models expose the important costs associated with a communication event, such as latency, bandwidth, or overhead, allowing algorithm designers to factor them into the comparative analysis of parallel algorithms. Even more, the emphasis on modeling communication cost has shifted to the cost at the nodes that are the endpoints of the communication message, such that the number of messages and contention at the endpoints have become more important than mapping to network technologies. In fact, both the BSP and LogP models ignore network topology, modeling network delay as a constant value.

An in-depth comparison of BSP and LogP has been performed in [], showing that both models are roughly equivalent in terms of expressiveness, slightly favoring BSP for its higher-level abstraction. But it was exactly this model that was found to be too restrictive by the designers of LogP []. Both models have their advantages and disadvantages. LogP is better suited for modeling applications that actually use point-to-point communication, while BSP is better and simpler for data-parallel applications that fit the superstep model. The BSP model also provides an elegant framework that can be used to reason about communication and parallel performance. The major contribution of both models is the explicit acknowledgement of communication costs that are dependent on properties of the underlying machine architecture.

The BSP and LogP models are important steps toward a realistic architectural model for designing and analyzing parallel algorithms. By experimenting with the values of the key parameters in the models, it is possible to determine how an algorithm will perform across a range of architectures and how it should be structured for different architectures or for portable performance.

Related Entries
Amdahl's Law
BSP (Bulk Synchronous Parallelism)
Collective Communication
Gustafson's Law
Models of Computation, Theoretical
PRAM (Parallel Random Access Machines)
Bibliographic Notes and Further Reading
The presented models and their extensions were originally introduced in papers [, , ] and described in textbooks [–], which were partially used for writing this entry. A number of models similar to BSP and LogP have been proposed: Queuing Shared Memory (QSM) [], LPRAM [], Decomposable BSP [], etc. Further papers in the list of references deal with particular applications of the models and with their classification and standardization.

Bibliography
1. Alexandrov A, Ionescu M, Schauser KE, Scheiman C () LogGP: incorporating long messages into the LogP model – one step closer towards a realistic model for parallel computation. In: th ACM Symposium on Parallel Algorithms and Architectures (SPAA'), Santa Barbara, California, pp –, July 
2. Bilardi G, Herley KT, Pietracaprina A, Pucci G, Spirakis P () BSP versus LogP. Algorithmica :–
3. Culler DE, Dusseau AC, Martin RP, Schauser KE () Fast parallel sorting under LogP: from theory to practice. In: Portability and Performance for Parallel Processing, Wiley, Southampton, pp –, 
4. Culler DE, Karp R, Sahay A, Schauser KE, Santos E, Subramonian R, von Eicken T () LogP: towards a realistic model of parallel computation. In: th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'), pp –, 
5. Goudreau MW, Hill JM, Lang K, McColl WF, Rao SD, Stefanescu DC, Suel T, Tsantilas T () A proposal for a BSP Worldwide standard. Technical Report, BSP Worldwide, www.bsp-wordwide.org, 
6. Hill M, McColl W, Skillicorn D () Questions and answers about BSP. Scientific Programming ():–
7. Kielmann T, Bal HE, Verstoep K () Fast measurement of LogP parameters for message passing platforms. In: th Workshop on Runtime Systems for Parallel Programming (RTSPP), held in conjunction with IPDPS , May 
8. Kielmann T, Bal HE, Gorlatch S, Verstoep K, Hofman RFH () Network performance-aware collective communication for clustered wide area systems. Parallel Computing ():–
9. Martin RP, Culler DE () NFS sensitivity to high performance networks. In: ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pp –, 
10. Pješivac-Grbović J, Angskun T, Bosilca G, Fagg GE, Gabriel E, Dongarra JJ () Performance analysis of MPI collective operations. In: th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS'), April 
11. Valiant LG () A bridging model for parallel computation. Communications of the ACM ():–
12. Rauber T, Rünger G () Parallel programming: for multicore and cluster systems. Springer, New York
13. Hwang K, Xu Z () Scalable parallel computing. WCB/McGraw-Hill, New York
14. Culler DE, Singh JP, Gupta A () Parallel computer architecture – a hardware/software approach. Morgan Kaufmann, San Francisco
15. Gibbons PB, Matias Y, Ramachandran V () Can a shared-memory model serve as a bridging-model for parallel computation? Theor Comput Syst ():–
16. Aggarwal A, Chandra AK, Snir M () Communication complexity of PRAMs. Theor Comput Sci :–
17. De la Torre P, Kruskal CP () Submachine locality in the bulk synchronous setting. In: Proceedings of the EUROPAR', LNCS , pp –, Springer, Berlin, August 

Banerjee's Dependence Test

Utpal Banerjee
University of California at Irvine, Irvine, CA, USA

Definition
Banerjee's Test is a simple and effective data dependence test widely used in automatic vectorization and parallelization of loops. It detects dependence between statements caused by subscripted variables by analyzing the subscripts.

Discussion

Introduction
A restructuring compiler transforms a sequential program into a parallel form that can run efficiently on a parallel machine. The task of the compiler is to discover the parallelism that may be hiding in the program and bring it out into the open. The first step in this process is to compute the dependence structure imposed on the program operations by the sequential execution order. The parallel version of the sequential program must obey the same dependence constraints, so that it computes the same final values as the original program. (If an operation B depends on an operation A in the sequential program, then A must be executed before B in the parallel program.) To detect possible dependence between two operations, one needs to know if
Banerjee’s Dependence Test B 

they access the same memory location during sequen- Lexicographic Order
tial execution and in which order. Banerjee’s Test pro- For any positive integer m, the set of all integer
vides a simple mechanism for dependence detection in m-vectors (i , i , . . . , im ) is denoted by Zm . The zero B
loops when subscripted variables are involved. vector (, , . . . , ) is abbreviated as . Addition and
In this essay, Banerjee’s test is developed for one- subtraction of members of Zm are defined coordinate-
dimensional array variables in assignment statements wise in the usual way.
within a perfect loop nest. Pointers are given for exten- For  ≤ ℓ ≤ m, a relation ≺ℓ in Zm is defined as
sions to more complicated situations. The first section follows: If i = (i , i , . . . , im ) and j = ( j , j , . . . , jm ) are
below is on mathematical preliminaries, where certain vectors in Zm , then i ≺ℓ j if
concepts and results are presented that are essential for i = j , i = j , . . . , iℓ− = jℓ− , and iℓ < jℓ .
an understanding of the test. After that the relevant
dependence concepts are explained, and then the test The lexicographic order ≺ in Zm is then defined by
itself is discussed. requiring that i ≺ j, if i ≺ℓ j for some ℓ in  ≤ ℓ ≤ m.
It is often convenient to write i ≺m+ j when i = j ,
i = j , . . . , im = jm , that is, i = j. The notation i ⪯ j
Mathematical Preliminaries means either i ≺ j or i = j, that is, i ≺ℓ j for some ℓ in
 ≤ ℓ ≤ m + . Note that ⪯ is a total order in Zm .
Linear Diophantine Equations
The associated relations ≻ and ⪰ are defined in the
Let Z denote the set of all integers. An integer b divides
usual way: j ≻ i means i ≺ j, and j ⪰ i means i ⪯ j. (⪰ is
an integer a, if there exists an integer c such that a = bc.
also a total order in Zm .) An integer vector i is positive
For a list of integers a , a , . . . , am , not all zero, the great-
if i ≻ , nonnegative if i ⪰ , and negative if i ≺ .
est common divisor or gcd is the largest positive integer
Let R denote the field of real numbers. The sign
that divides each member of the list. It is denoted by
function sgn : R → Z is defined by
gcd(a , a , . . . , am ). The gcd of a list of zeros is defined


to be . ⎪

⎪  if x > 


A linear diophantine equation in m variables is an ⎪

equation of the form sgn(x) = ⎨  if x = 







⎪ − if x < ,
a x  + a  x  + ⋯ + a m x m = c ⎩
for each x in R. The direction vector of any vector
where the coefficients a , a , . . . , am are integers not all (i , i , . . . , im ) in Zm is (sgn(i ), sgn(i ), . . . , sgn(im )).
zero, c is an integer, and x , x , . . . , xm are integer vari- Note that a vector is positive (negative) if and only if its
ables. A solution to this equation is a sequence of inte- direction vector is positive (negative).
gers (i , i , . . . , im ) such that ∑m
k= a k ik = c. The following
theorem is a well-known result in Number Theory. Extreme Values of Linear Functions
For any positive integer m, let Rm denote the
Theorem  The linear diophantine equation
m-dimensional Euclidean space consisting of all real
m-vectors. It is a real vector space where vector addi-
a  x + a x + ⋯ + a m xm = c
tion and scalar multiplication are defined coordinate-
has a solution if and only if gcd(a , a , . . . , am ) divides c. wise. The concepts of the previous subsection stated in
terms of integer vectors can be trivially extended to the
Proof The “only if ” Part is easy to prove. If the equation realm of real vectors. For a real number a, we define the
has a solution, then there are integers i , i , . . . , im such positive part a+ and the negative part a− as in []:
that ∑m k= ak ik = c. Since gcd(a , a , . . . , am ) divides each ∣a∣ + a
ak , it must also divide c. To get the “if ” Part (and derive a+ = = max(a, ) and

the general solution), see the proof of Theorem . ∣a∣ − a
in []. a− = = max(−a, ).

Lexicographic Order
For any positive integer m, the set of all integer m-vectors (i1, i2, . . . , im) is denoted by Zm. The zero vector (0, 0, . . . , 0) is abbreviated as 0. Addition and subtraction of members of Zm are defined coordinatewise in the usual way.

For 1 ≤ ℓ ≤ m, a relation ≺ℓ in Zm is defined as follows: if i = (i1, i2, . . . , im) and j = (j1, j2, . . . , jm) are vectors in Zm, then i ≺ℓ j if

i1 = j1, i2 = j2, . . . , iℓ−1 = jℓ−1, and iℓ < jℓ.

The lexicographic order ≺ in Zm is then defined by requiring that i ≺ j if i ≺ℓ j for some ℓ in 1 ≤ ℓ ≤ m. It is often convenient to write i ≺m+1 j when i1 = j1, i2 = j2, . . . , im = jm, that is, i = j. The notation i ⪯ j means either i ≺ j or i = j, that is, i ≺ℓ j for some ℓ in 1 ≤ ℓ ≤ m + 1. Note that ⪯ is a total order in Zm.

The associated relations ≻ and ⪰ are defined in the usual way: j ≻ i means i ≺ j, and j ⪰ i means i ⪯ j. (⪰ is also a total order in Zm.) An integer vector i is positive if i ≻ 0, nonnegative if i ⪰ 0, and negative if i ≺ 0.

Let R denote the field of real numbers. The sign function sgn : R → Z is defined by

sgn(x) = 1 if x > 0,  sgn(x) = 0 if x = 0,  sgn(x) = −1 if x < 0,

for each x in R. The direction vector of any vector (i1, i2, . . . , im) in Zm is (sgn(i1), sgn(i2), . . . , sgn(im)). Note that a vector is positive (negative) if and only if its direction vector is positive (negative).

Extreme Values of Linear Functions
For any positive integer m, let Rm denote the m-dimensional Euclidean space consisting of all real m-vectors. It is a real vector space where vector addition and scalar multiplication are defined coordinatewise. The concepts of the previous subsection stated in terms of integer vectors can be trivially extended to the realm of real vectors. For a real number a, we define the positive part a+ and the negative part a− as in []:

a+ = (∣a∣ + a)/2 = max(a, 0)   and   a− = (∣a∣ − a)/2 = max(−a, 0).

Thus, a+ = a and a− = 0 for a ≥ 0, while a+ = 0 and a− = −a for a ≤ 0. For example, if a = 3, then a+ = 3 and a− = 0, while (−3)+ = 0 and (−3)− = 3. The following lemma lists the basic properties of positive and negative parts of a number (Lemma . in [], Lemma . in []).

Lemma 1  For any real number a, the following statements hold:
1. a+ ≥ 0, a− ≥ 0,
2. a = a+ − a−, ∣a∣ = a+ + a−,
3. (−a)+ = a−, (−a)− = a+,
4. (a+)+ = a+, (a+)− = 0,
5. (a−)+ = a−, (a−)− = 0,
6. −a− ≤ a ≤ a+.

The next lemma gives convenient expressions for the extreme values of a simple function. (This is Lemma . in []; it generalizes Lemma . in [].)

Lemma 2  Let a, p, q denote real constants, where p < q. The minimum and maximum values of the function f(x) = ax on the interval p ≤ x ≤ q are (a+ p − a− q) and (a+ q − a− p), respectively.

Proof  For p ≤ x ≤ q, the following hold:

a+ p ≤ a+ x ≤ a+ q,
−a− q ≤ −a− x ≤ −a− p,

since a+ ≥ 0 and −a− ≤ 0. Adding these two sets of inequalities, one gets

a+ p − a− q ≤ ax ≤ a+ q − a− p,

since a = a+ − a−. To complete the proof, note that these bounds for f(x) are actually attained at the end points x = p and x = q, and therefore they are the extreme values of f. For example, when a ≥ 0, it follows that

a+ p − a− q = ap = f(p),
a+ q − a− p = aq = f(q).
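These definitions translate directly into code; the sketch below is a small helper (the function names are chosen here, not taken from the entry) that will be reused when evaluating Banerjee's bounds later on.

    def pos(a):
        # Positive part: a+ = max(a, 0).
        return a if a > 0 else 0

    def neg(a):
        # Negative part: a- = max(-a, 0), so a = pos(a) - neg(a).
        return -a if a < 0 else 0

    def extreme_values(a, p, q):
        """Lemma 2: minimum and maximum of f(x) = a*x over p <= x <= q."""
        assert p < q
        return pos(a) * p - neg(a) * q, pos(a) * q - neg(a) * p

    print(extreme_values(-2, 0, 10))   # (-20, 0)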
Banerjee’s Dependence Test B 

As in the proof of Lemma , it can be shown in Dependence Concepts


each case that each bound is actually attained at some For more details on the material of this section, see
point of the corresponding domain. Hence, the bounds Chapter  in []. An assignment statement has the form B
represent the extreme values of f in each case.
S: x=E
m
The Euclidean space R is an inner product where S is a label, x a variable, and E an expression.
space, where the inner product of two vectors x = Such a statement reads the memory locations specified
(x , x , . . . , xm ) and y = (y , y , . . . , ym ) is defined by in E, and writes the location x. The output variable of
⟨x, y⟩ = ∑m k= xk yk . The inner product defines a norm S is x and its input variables are the variables in E. Let
(length), the norm defines a metric (distance), and the Out(S) = {x} and denote by In(S) the set of all input
metric defines a topology. In this topological vector variables of S.
space, one can talk about bounded sets, open and closed The basic program model is a perfect nest of loops
sets, compact sets, and connected sets. The follow- L = (L , L , . . . , Lm ) shown in Fig. . For  ≤ k ≤ m,
ing theorem is easily derived from well-known results the index variable Ik of Lk runs from  to some pos-
in topology. (See the topology book by Kelley [] for itive integer Nk in steps of . The index vector of the
details.) loop nest is I = (I , I , . . . , Im ). An index point or index
Theorem  Let f : Rm → R be a continuous function. value of the nest is a possible value of the index vec-
If a set A ⊂ Rm is closed, bounded, and connected, then tor, that is, a vector i = (i , i , . . . , im ) ∈ Zm such that
f (A) is a finite closed interval of R.  ≤ ik ≤ Nk for  ≤ k ≤ m. The subset of Zm consist-
ing of all index points is the index space of the loop nest.
Proof Note that Rm and R are both Euclidean spaces. During sequential execution of the nest, the index vec-
A subset of a Euclidean space is compact if and only tor starts at the index point  = (, , . . . , ) and ends
if it is closed and bounded. Thus, the set A is com- at the point (N , N , . . . , Nm ) after traversing the entire
pact and connected. Since f is continuous, it maps a index space in the lexicographic order.
compact set onto a compact set, and a connected set The body of the loop nest L is denoted by H(I) or H;
onto a connected set. Hence, the set f (A) is compact it is assumed to be a sequence of assignment statements.
and connected, that is, closed, bounded, and connected. A given index value i defines a particular instance H(i)
Therefore, f (A) must be a finite closed interval of R. of H(I), which is an iteration of L. An iteration H(i) is
Corollary  Let f : Rm → R denote a continuous executed before an iteration H(j) if and only if i ≺ j.
function, and let A be a closed, bounded, and connected A typical statement in the body of the loop nest is
subset of Rm . Then f assumes a minimum and a max- denoted by S(I) or S, and its instance for an index value
imum value on A. And for any real number c satisfying i is written as S(i). Let S and T denote any two (not nec-
the inequalities essarily distinct) statements in the body. Statement T
depends on statement S, if there exist a memory location
min f (x) ≤ c ≤ max f (x), M, and two index points i and j, such that
x∈A x∈A

the equation f (x) = c has a solution x ∈ A. L1 : do I1 = 0, N1, 1


L2 : do I2 = 0, N2, 1
Proof By Theorem , the image f (A) of A is a finite .. ..
closed interval [α, β] of R. Then for each x ∈ A, . .
Lm : do Im = 0, Nm, 1
one has α ≤ f (x) ≤ β. So, α is a lower bound
H(I1, I2, . . . , Im)
and β an upper bound for f on A. Since α ∈ f (A)
enddo
and β ∈ f (A), there are points x , x ∈ A such ..
that f (x ) = α and f (x ) = β. Thus, f assumes .
enddo
a minimum and a maximum value on A, and α = enddo
minx∈A f (x) and β = maxx∈A f (x). If α ≤ c ≤ β,
then c ∈ f (A), that is, f (x ) = c for some x ∈ A. Banerjee’s Dependence Test. Fig.  A perfect loop nest
The body of the loop nest L is denoted by H(I) or H; it is assumed to be a sequence of assignment statements. A given index value i defines a particular instance H(i) of H(I), which is an iteration of L. An iteration H(i) is executed before an iteration H(j) if and only if i ≺ j.

A typical statement in the body of the loop nest is denoted by S(I) or S, and its instance for an index value i is written as S(i). Let S and T denote any two (not necessarily distinct) statements in the body. Statement T depends on statement S if there exist a memory location M, and two index points i and j, such that

1. The instances S(i) of S and T(j) of T both reference (read or write) M;
2. In the sequential execution of the program, S(i) is executed before T(j).

If i and j are a pair of index points that satisfy these two conditions, then it is convenient to say that T(j) depends on S(i). Thus, T depends on S if and only if at least one instance of T depends on at least one instance of S. The concept of dependence can have various attributes as described below.

Let i = (i1, i2, . . . , im) and j = (j1, j2, . . . , jm) denote a pair of index points, such that T(j) depends on S(i). If S and T are distinct and S appears lexically before T in H, then S(i) is executed before T(j) if and only if i ⪯ j. Otherwise, S(i) is executed before T(j) if and only if i ≺ j. If i ⪯ j, there is a unique integer ℓ in 1 ≤ ℓ ≤ m + 1 such that i ≺ℓ j. If i ≺ j, there is a unique integer ℓ in 1 ≤ ℓ ≤ m such that i ≺ℓ j. This integer ℓ is a dependence level for the dependence of T on S. The vector d ∈ Zm, defined by d = j − i, is a distance vector for the same dependence. Also, the direction vector σ of d, defined by

σ = (sgn(j1 − i1), sgn(j2 − i2), . . . , sgn(jm − im)),

is a direction vector for this dependence. Note that both d and σ are nonnegative vectors if S precedes T in H, and both are positive vectors otherwise. The dependence of T(j) on S(i) is carried by the loop Lℓ if the dependence level ℓ is between 1 and m. Otherwise, the dependence is loop-independent.

A statement can reference a memory location only through one of its variables. The definition of dependence is now extended to make explicit the role played by the variables in the statements under consideration. A variable u(I) of the statement S and a variable v(I) of the statement T cause a dependence of T on S if there are index points i and j, such that

1. The instance u(i) of u(I) and the instance v(j) of v(I) both represent the same memory location;
2. In the sequential execution of the program, S(i) is executed before T(j).

If these two conditions hold, then the dependence caused by u(I) and v(I) is characterized as

1. A flow dependence if u(I) ∈ Out(S) and v(I) ∈ In(T);
2. An anti-dependence if u(I) ∈ In(S) and v(I) ∈ Out(T);
3. An output dependence if u(I) ∈ Out(S) and v(I) ∈ Out(T);
4. An input dependence if u(I) ∈ In(S) and v(I) ∈ In(T).

Banerjee's Test
In its simplest form, Banerjee's Test is a necessary condition for the existence of dependence of one statement on another at a given level, caused by a pair of one-dimensional array variables. This form of the test is described in Theorem 4. Theorem 5 gives the version of the test that deals with dependence with a fixed direction vector. Later on, pointers are given for extending the test in different directions.

Theorem 4  Consider any two assignment statements S and T in the body H of the loop nest of Fig. 1. Let X(f(I)) denote a variable of S and X(g(I)) a variable of T, where X is a one-dimensional array, f(I) = a0 + a1 I1 + ⋯ + am Im, g(I) = b0 + b1 I1 + ⋯ + bm Im, and the a's and the b's are all integer constants. If X(f(I)) and X(g(I)) cause a dependence of T on S at a level ℓ, then the following two conditions hold:

(A) gcd(a1 − b1, . . . , aℓ−1 − bℓ−1, aℓ, . . . , am, bℓ, . . . , bm) divides (b0 − a0);
(B) α ≤ b0 − a0 ≤ β, where

α = −bℓ − ∑_{k=1}^{ℓ−1} (ak − bk)− Nk − (aℓ− + bℓ)+ (Nℓ − 1) − ∑_{k=ℓ+1}^{m} (ak− + bk+) Nk,

β = −bℓ + ∑_{k=1}^{ℓ−1} (ak − bk)+ Nk + (aℓ+ − bℓ)+ (Nℓ − 1) + ∑_{k=ℓ+1}^{m} (ak+ + bk−) Nk.
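The two conditions of Theorem 4 are mechanical to evaluate. The sketch below (function and variable names are chosen for this sketch; the pos/neg helpers from the earlier sketch are redefined locally so the block is self-contained) computes the gcd condition and the bounds α and β for a given level ℓ.

    from math import gcd
    from functools import reduce

    def pos(x): return x if x > 0 else 0
    def neg(x): return -x if x < 0 else 0

    def banerjee_level_test(a, b, N, ell):
        """Necessary conditions of Theorem 4 for dependence at level ell.

        a, b: coefficient lists [a0, a1, ..., am] and [b0, b1, ..., bm]
        N:    loop bounds [N1, ..., Nm]; ell: dependence level (1-based)
        Returns (gcd_ok, alpha, beta); dependence at level ell requires
        gcd_ok and alpha <= b0 - a0 <= beta.
        """
        m = len(N)
        coeffs = [a[k] - b[k] for k in range(1, ell)] + a[ell:m + 1] + b[ell:m + 1]
        g = reduce(gcd, coeffs, 0)
        diff = b[0] - a[0]
        gcd_ok = (diff == 0) if g == 0 else (diff % g == 0)

        alpha = (-b[ell]
                 - sum(neg(a[k] - b[k]) * N[k - 1] for k in range(1, ell))
                 - pos(neg(a[ell]) + b[ell]) * (N[ell - 1] - 1)
                 - sum((neg(a[k]) + pos(b[k])) * N[k - 1] for k in range(ell + 1, m + 1)))
        beta = (-b[ell]
                + sum(pos(a[k] - b[k]) * N[k - 1] for k in range(1, ell))
                + pos(pos(a[ell]) - b[ell]) * (N[ell - 1] - 1)
                + sum((pos(a[k]) + neg(b[k])) * N[k - 1] for k in range(ell + 1, m + 1)))
        return gcd_ok, alpha, beta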
Banerjee’s Dependence Test B 

The restrictions on i , i , . . . , im , j , j , . . . , jm are as Remarks


follows: . In general, the two conditions of Theorem  are
necessary for existence of dependence, but they are B

 ≤ ik ≤ Nk ⎪⎪

⎪ not sufficient. Suppose both conditions hold. First,
⎬ ( ≤ k ≤ m) () think in terms of real variables. Let P denote the

 ≤ jk ≤ N k ⎪

⎪ subset of Rm defined by the inequalities ()–().

Then P is a closed, bounded, and connected set. The
i k = jk ( ≤ k ≤ ℓ − ) ()
right-hand side of () represents a real-valued con-
iℓ ≤ jℓ − . () tinuous function on Rm . Its extreme values on P
are given by α and β. Condition (B) implies that
(As iℓ and jℓ are integers, i ℓ < jℓ means iℓ ≤ jℓ − .) there is a real solution to equation () with the con-
Since f (i) and g(j) must be identical, it follows that straints ()–() (by Corollary  to Theorem ). On
the other hand, condition (A) implies that there is an
ℓ−
integer solution to equation () without any further
b − a = ∑(ak ik − bk jk ) + (aℓ iℓ − bℓ jℓ )
k= constraints (by Theorem ). Theoretically, the two
m conditions together do not quite imply the existence
+ ∑ (ak ik − bk jk ). () of an integer solution with the constraints needed to
k=ℓ+
guarantee the dependence of T(j) on S(i). However,
Because of (), this equation is equivalent to note that using Theorem , one would never falsely
conclude that a dependence does not exist when in
ℓ− fact it does.
b − a = ∑(ak − bk )ik + (a ℓ iℓ − bℓ jℓ )
k=
. If one of the two conditions of Theorem  fails to
m hold, then there is no dependence. If both of them
+ ∑ (ak ik − bk jk ). () hold, there may or may not be dependence. In prac-
k=ℓ+
tice, however, it usually turns out that when the
First, think of i , i , . . . , im , jℓ , jℓ+ , . . . , jm as integer conditions hold, there is dependence. This can be
variables. Since the linear diophantine equation () has explained by the fact that there are certain types of
a solution, condition (A) of the theorem is implied by array subscripts for which the conditions are indeed
Theorem . sufficient, and these are the types most commonly
Next, think of i  , i , . . . , im , jℓ , jℓ+ , . . . , jm as real vari- encountered in practice. Theorem . in [] shows
ables. Note that for  ≤ k < t ≤ m, there is no relation that the conditions are sufficient, if there is an inte-
between the pair of variables (ik , jk ) and the pair of vari- ger t >  such that ak , bk ∈ {−t, , t} for  ≤ k ≤ m.
ables (it , jt ). Hence, the minimum (maximum) value of Psarris et al. [] prove the sufficiency for another
the right-hand side of () can be computed by summing large class of subscripts commonly found in real
the minimum (maximum) values of all the individual programs.
terms (ak ik − bk jk ). For each k in  ≤ k ≤ m, one can Example  Consider the loop nest of Fig. , where
compute the extreme values of the term (ak ik − bk jk ) X is a one-dimensional array, and the constant terms
by using a suitable case of Theorem . It is then clear a , b in the subscripts are integers unspecified for the
that α is the minimum value of the right-hand side moment. Suppose one needs to check if T is output-
of (), and β is its maximum value. Hence, (b − a ) dependent on S at level . For this problem, m = ,
must lie between α and β. This is condition (B) of the ℓ = , and
theorem.
(N , N , N ) = (, , ),
For the dependence problem of Theorem , equa- (a , a , a ) = (, , −), (b , b , b ) = (, −, ).
tion () or () is the dependence equation, condi-
tion (A) is the gcd Test, and condition (B) is Banerjee’s Sincegcd(a −b , a , a , b , b ) = gcd(−, , −, −, ) = ,
Test. condition (A) of Theorem  is always satisfied. To test
Example 1  Consider the loop nest of Fig. 2, where X is a one-dimensional array, and the constant terms a0, b0 in the subscripts are integers unspecified for the moment. Suppose one needs to check if T is output-dependent on S at level 2. For this problem, m = 3, ℓ = 2, and

(N1, N2, N3) = (100, 50, 40),
(a1, a2, a3) = (1, 2, −1),  (b1, b2, b3) = (3, −1, 2).

L1:  do I1 = 0, 100, 1
L2:    do I2 = 0, 50, 1
L3:      do I3 = 0, 40, 1
S:         X(a0 + I1 + 2I2 − I3) = ⋯
T:         X(b0 + 3I1 − I2 + 2I3) = ⋯
         enddo
       enddo
     enddo

Banerjee's Dependence Test. Fig. 2  Loop nest of Example 1

Since gcd(a1 − b1, a2, a3, b2, b3) = gcd(−2, 2, −1, −1, 2) = 1, condition (A) of Theorem 4 is always satisfied. To test condition (B), evaluate α and β:

α = −b2 − (a1 − b1)− N1 − (a2− + b2)+ (N2 − 1) − (a3− + b3+) N3 = −319,
β = −b2 + (a1 − b1)+ N1 + (a2+ − b2)+ (N2 − 1) + (a3+ + b3−) N3 = 148.

Condition (B) then becomes

−319 ≤ b0 − a0 ≤ 148.

First, choose a0 and b0 such that (b0 − a0) lies outside this range. Then Banerjee's Test is not satisfied, and by Theorem 4, statement T is not output-dependent on statement S at level 2.

Next, take a0 = b0, so that (b0 − a0) = 0 is within the range and Banerjee's Test is satisfied. Theorem 4 cannot guarantee the existence of the dependence in question. However, there do exist two index points i and j of the loop nest, such that i ≺2 j, the instance S(i) of statement S is executed before the instance T(j) of statement T, and both instances write the same memory location. Thus, statement T is indeed output-dependent on statement S at level 2.
Banerjee’s Dependence Test B 

Related Entries A loop of the form “do I = p, q, θ,” where p, q, θ are


Code Generation integers and θ ≠ , can be converted into the loop “do
Dependence Abstractions Î = , N, ,” where I = p + Îθ and N = ⌊(q − p)/θ⌋. The B
Dependences new variable Î is the iteration variable of the loop. Using
Loop Nest Parallelization this process of loop normalization, one can convert any
Omega Test nest of loops to a standard form if the loop limits and
Parallelization, Automatic strides are integer constants. (See [].)
Parallelism Detection in Nested Loops, Optimal Consider now the dependence problem posed by
Unimodular Transformations variables that come from a multidimensional array,
where each subscript is a linear (affine) function of the
index variables. The gcd Test can be generalized to han-
Bibliographic Notes and Further dle this case; see Section . of [] and Section . of
Reading []. Also, Banerjee’s Test can be applied separately to
Banerjee’s Test first appeared in Utpal Banerjee’s the subscripts in each dimension; see Theorem . in
MS thesis [] at the University of Illinois, Urbana- []. If the test is not satisfied in one particular dimen-
Champaign, in . In that thesis, the test is developed sion, then there is no dependence as a whole. But,
for checking dependence at any level when the subscript if the test is satisfied in each dimension, there is no
functions f (I) and g(I) are polynomials in I , I , . . . , Im . definite conclusion. Another alternative is array lin-
The case where f (I) and g(I) are linear (affine) func- earization; see Section . in []. Zhiyuan Li et al. []
tions of the index variables is then derived as a special did a study of dependence analysis for multidimen-
case (Theorem .). Theorem . of [] appeared as The- sional array elements, that did not involve subscript-
orem  in the  paper by Banerjee, Chen, Kuck, and by-subscript testing, nor array linearization. The gen-
Towle []. Theorem  presented here is essentially the eral method presented in Chapter  of [] includes Li’s
same theorem, but has a stronger gcd Test. (The stronger λ-test.
gcd Test was pointed out by Kennedy in [].) The dependence problem is quite complex when one
Banerjee’s Test for a direction vector of the form has an arbitrary loop nest with loop limits that are lin-
σ = (, . . . , , , −, ∗, . . . , ∗) was given by Kennedy in ear functions of index variables, and multidimensional
[]. It is a special case of Theorem  presented here. See array elements with linear subscripts. To understand the
also the comprehensive paper by Allen and Kennedy general problem and some methods of solution, see the
[]. Wolfe and Banerjee [] give an extensive treatment developments in [] and [].
of the dependence problem involving direction vectors. Many researchers have studied Banerjee’s Test over
(The definition of the negative part of a number used in the years; the test in various forms can be found in many
[] is slightly different from that used here and in most publications. Dependence Analysis [] covers this test
other publications.) quite extensively. For a systematic development of the
The first book on dependence analysis [] was pub- dependence problem, descriptions of the needed math-
lished in ; it gave the earliest coverage of Banerjee’s ematical tools, and applications of dependence analysis
Test in a book form. Dependence Analysis [] published to program transformations, see the books [–] in
in  is a completely new work that subsumes the the series on Loop Transformations for Restructuring
material in []. compilers.
It is straightforward to extend theorems  and  to
the case where we allow the loop limits to be arbitrary
integer constants, as long as the stride of each loop is Bibliography
kept at . See theorems . and . in []. This model . Allen JR, Kennedy K (Oct ) Automatic translation of
FORTRAN programs to vector form. ACM Trans Program Lang
can be further extended to test the dependence of a
Syst ():–
statement T on a statement S when the nest of loops . Banerjee U (Nov ) Data dependence in ordinary programs.
enclosing S is different from the nest enclosing T. See MS Thesis, Report –, Department of Computer Science,
theorems . and . in []. University of Illinois at Urbana-Champaign, Urbana, Illinois
Bibliography
1. Allen JR, Kennedy K (Oct ) Automatic translation of FORTRAN programs to vector form. ACM Trans Program Lang Syst ():–
2. Banerjee U (Nov ) Data dependence in ordinary programs. MS thesis, Report –, Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
3. Banerjee U () Dependence analysis for supercomputing. Kluwer, Norwell
4. Banerjee U () Loop transformations for restructuring compilers: the foundations. Kluwer, Norwell
5. Banerjee U () Loop transformations for restructuring compilers: loop parallelization. Kluwer, Norwell
6. Banerjee U () Loop transformations for restructuring compilers: dependence analysis. Kluwer, Norwell
7. Banerjee U, Chen S-C, Kuck DJ, Towle RA (Sept ) Time and parallel processor bounds for FORTRAN-like loops. IEEE Trans Comput C-():–
8. Kelley JL () General topology. D. Van Nostrand Co., New York
9. Kennedy K (Oct ) Automatic translation of FORTRAN programs to vector form. Rice Technical Report --, Department of Mathematical Sciences, Rice University, Houston, Texas
10. Li Z, Yew P-C, Zhu C-Q (Jan ) An efficient data dependence analysis for parallelizing compilers. IEEE Trans Parallel Distrib Syst ():–
11. Psarris K, Klappholz D, Kong X (June ) On the accuracy of the Banerjee test. J Parallel Distrib Comput ():–
12. Wolfe M, Banerjee U (Apr ) Data dependence and its application to parallel processing. Int J Parallel Programming ():–

Barnes-Hut

N-Body Computational Methods

Barriers

NVIDIA GPU
Synchronization

Basic Linear Algebra Subprograms (BLAS)

BLAS (Basic Linear Algebra Subprograms)

Behavioral Equivalences

Rocco De Nicola
Università degli Studi di Firenze, Firenze, Italy

Synonyms
Behavioral relations; Extensional equivalences

Definition
Behavioral equivalences serve to establish in which cases two reactive (possibly concurrent) systems offer similar interaction capabilities relative to other systems representing their operating environment. Behavioral equivalences have been mainly developed in the context of process algebras, mathematically rigorous languages that have been used for describing and verifying properties of concurrent communicating systems. By relying on the so-called structural operational semantics (SOS), labeled transition systems are associated to each term of a process algebra. Behavioral equivalences are used to abstract from unwanted details and identify those labeled transition systems that react "similarly" to external experiments. Due to the large number of properties which may be relevant in the analysis of concurrent systems, many different theories of equivalences have been proposed in the literature. The main contenders consider those systems equivalent that (1) perform the same sequences of actions, or (2) perform the same sequences of actions and after each sequence are ready to accept the same sets of actions, or (3) perform the same sequences of actions and after each sequence exhibit, recursively, the same behavior. This approach leads to many different equivalences that preserve significantly different properties of systems.

Introduction
In many cases, it is useful to have theories which can be used to establish whether two systems are equivalent or whether one is a satisfactory "approximation" of another. It can be said that a system S1 is equivalent to a system S2 whenever "some" aspects of the externally observable behavior of the two systems are compatible. If the same formalism is used to model what is required of a system (its specification) and how it can actually be built (its implementation), then it is possible to use theories based on equivalences to prove that a particular concrete description is correct with respect to a given abstract one. If a step-wise development method is used, equivalences may permit substituting large specifications with equivalent concise ones. In general, it is useful to be able to interchange subsystems proved behaviorally equivalent, in the sense that one subsystem may replace another as part of a larger system without affecting the behavior of the overall system.
The kind of equivalences, or approximations, involved depends very heavily on how the systems under consideration will be used. In fact, the way a system is used determines the behavioral aspects which must be taken into account and those which can be ignored. It is then important to know, for the considered equivalence, the system properties it preserves.

In spite of the general agreement on taking an extensional approach for defining the equivalence of concurrent or nondeterministic systems, there is still disagreement on what "reasonable" observations are and how their outcomes can be used to distinguish or identify systems. Many different theories of equivalences have been proposed in the literature for models which are intended to be used to describe and reason about concurrent or nondeterministic systems. This is mainly due to the large number of properties which may be relevant in the analysis of such systems. Almost all the proposed equivalences are based on the idea that two systems are equivalent whenever no external observation can distinguish them. In fact, for any given system it is not its internal structure which is of interest but its behavior with respect to the outside world, i.e., its effect on the environment and its reactions to stimuli from the environment.

One of the most successful approaches for describing the formal, precise behavior of concurrent systems is the so-called operational semantics. Within this approach, concurrent programs or systems are modeled as labeled transition systems (LTSs) that consist of a set of states, a set of transition labels, and a transition relation. The states of the transition systems are programs, while the labels of the transitions between states represent the actions (instructions) or the interactions that are possible in a given state.

When defining behavioral equivalence of concurrent systems described as LTSs, one might think that it is possible to consider systems equivalent if they give rise to the same (isomorphic) LTSs. Unfortunately, this would lead to unwanted distinctions; e.g., it would consider the two LTSs below different, in spite of the fact that their behavior is the same: they can (only) execute infinitely many a-actions, and they should thus be considered equivalent.

[Two LTSs: a single state p with an a-labeled self-loop, and two states q and q1 connected by a-labeled transitions in both directions]

The basic principles for any reasonable equivalence can be summarized as follows. It should:

● Abstract from states (consider only the actions)
● Abstract from internal behavior
● Identify processes whose LTSs are isomorphic
● Consider two processes equivalent only if both can execute the same action sequences
● Allow one to replace a subprocess by an equivalent counterpart without changing the overall semantics of the system

However, these criteria are not sufficiently insightful and discriminative, and the above adequacy requirements turn out to be still too loose. They have given rise to many different kinds of equivalences, even when all actions are considered visible.

The main equivalences over LTSs introduced in the literature consider as equivalent those systems that:

1. Perform the same sequences of actions
2. Perform the same sequences of actions and after each sequence are ready to accept the same sets of actions
3. Perform the same sequences of actions and after each sequence exhibit, recursively, the same behavior

These three different criteria lead to three groups of equivalences that are known as traces equivalences, decorated-traces equivalences, and bisimulation-based equivalences. Equivalences in different classes behave differently relative to the three labeled transition systems in Fig. 1. The three systems represent the specifications of three vending machines that accept two coins and deliver coffee or tea. The trace-based equivalences equate all of them, the bisimulation-based equivalences distinguish all of them, and the decorated traces distinguish the leftmost system from the other two, but equate the central and the rightmost one.

Many of these equivalences have been reviewed []; here, only the main ones are presented. First, equivalences that consider invisible (τ) actions just like normal actions will be presented; then their variants that abstract from internal actions will be introduced.
 B Behavioral Equivalences

p q r
coin1 coin1
coin1 coin1

p1 q1 r2 r1

coin2 coin2 coin2 coin2 coin2

p2 q2 q3 r4 r3

coffee tea coffee tea coffee tea

p3 p4 q4 q5 r6 r5

Behavioral Equivalences. Fig.  Three vending machines

The equivalences will be formally defined on states we have that P ≃T Q , but P , unlike Q , after perform-
μ
of LTSs of the form ⟨Q, Aτ , −→ ⟩, where Q is a set of states, ing action a, can reach a state in which it cannot perform
ranging over p, q, p′ , q , …, Aτ is the set of labels, rang- any action, i.e., a deadlocked state.
ing over a, b, c, …, that also contains the distinct silent Traces equivalence identifies all of the three LTSs of
μ Fig. . Indeed, it is not difficult to see that the three vend-
action τ, and −→ is the set of transitions. In the follow-
ing, s will denote a generic element of A∗τ , the set of all ing machines can perform the same sequences of visible
sequences of actions that a process might perform. actions. Nevertheless, a customer with definite prefer-
ences for coffee who is offered to choose between the
Traces Equivalence three machines would definitely select to interact with
The first equivalence is known as traces equivalence the leftmost one since the others do not let him choose
and is perhaps the simplest of all; it is imported from what to drink.
automata theory that considers those automata equiv-
alent that generate the same language. Intuitively, two
processes are deemed traces equivalent if and only if Bisimulation Equivalence
they can perform exactly the same sequences of actions. The classical alternative to traces equivalence is bisim-
s
→ p′ , with s = μ  μ  . . . μ n ,
In the formal definition, p − ilarity (also known as observational equivalence) that
μ μ μn
denotes the sequence p −→ p −→ p . . . −→ p′ of considers equivalent two systems that can simulate each
transitions. other step after step []. Bisimilarity is based on the
Two states p and q are traces equivalent (p ≃T q) if : notion of bisimulation:
s
→ p′ implies q −
. p −
s
→ q′ for some q′ and A relation R ⊆ Q × Q is a bisimulation if, for any
s ′ s pair of states p and q such that ⟨p, q⟩ ∈ R, the following
. q →
− q implies p −→ p′ for some p′ .
holds:
A drawback of ≃T is that it is not sensitive to dead- μ μ
locks. For example, if we consider the two LTSs below: . For all μ ∈ Aτ and p′ ∈ Q, if p −
→ p′ then q −
→ q′ for
′ ′ ′
some q ∈ Q s.t. ⟨p , q ⟩ ∈ R
μ μ
p3 . For all μ ∈ Aτ and q′ ∈ Q, if q −
→ q′ then p −
→ p′ for
a
some p′ ∈ Q s.t. ⟨p′ , q′ ⟩ ∈ R
p1
a Two states p, q are bisimilar (p ∼ q) if there exists a
b
bisimulation R such that ⟨p, q⟩ ∈ R.
p2 p4
This definition corresponds to the circular defini-
tion below that more clearly shows that two systems are
q1
a
q2
b
q3
bisimilar (observationally equivalent) if they can per-
form the same action and reach bisimilar states. This
Behavioral Equivalences B 

recursive definition can be solved with the usual fixed on the notions of observers, observations, and success-
points techniques. ful observations. Equivalences are defined that consider
Two states p, q ∈ Q are bisimilar, written p ∼ q, if and equivalent those systems that satisfy (lead to successful B
only if for each μ ∈ Aτ : observations by) the same sets of observers. An observer
μ μ is an LTS with actions in Aτ,w ≜ Aτ ∪ {w}, with w ∈/ A.
. if p !→ p′ then q !→ q′ for some q′ such that To determine whether a state q satisfies an observer with
p′ ∼ q′ ; initial state o, the set OBS(q, o) of all computations from
μ μ
. if q !→ q′ then p !→ p′ for some p′ such that ⟨q, o⟩ is considered.
p′ ∼ q′ . Given an LTS ⟨Q, Aτ , −
μ
→ ⟩ and an observer
μ
Bisimilarity distinguishes all machines of Fig. . This ⟨O, Aτ,w , −
→ ⟩, and a state q ∈ Q and the initial state
is because the basic idea behind bisimilarity is that two o ∈ O, an observation c from ⟨q, o⟩ is a maximal sequence
states are considered equivalent if by performing the of pairs ⟨qi , oi ⟩, such that ⟨q , o ⟩ = ⟨q, o⟩. The transition
μ
same sequences of actions from these states it is pos- ⟨qi , oi ⟩ −
→ ⟨qi+ , oi+ ⟩ can be proved using the following
sible to reach equivalent states. It is not difficult to see inference rule:
that bisimilarity distinguishes the first and the second μ μ
→ E′
E− → F′
F−
machine of Fig.  because after receiving two coins (coin μ ∈ Aτ
μ
and coin ) the first machine still offers the user the pos- → ⟨E′ , F ′ ⟩
⟨E, F⟩ −
sibility of choosing between having coffee or tea while
An observation from ⟨q, o⟩ is successful if it contains
the second does not. To see that also the second and the w
third machine are distinguished, it is sufficient to con- a configuration ⟨qn , on ⟩ ∈ c, with n ≥ , such that on −→ o
sider only the states reachable after just inserting coin for some o.
because already after this insertion the user loses his When analyzing the outcome of observations, one
control of the third machine. Indeed, there is no way for has to take into account that, due to nondeterminism,
this machine to reach a state bisimilar to the one that the a process satisfies an observer sometimes or a process
second machine reaches after accepting coin . satisfies an observer always. This leads to the following
definitions:
. q may satisfy o if there exists an observation from
Testing Equivalence
⟨q, o⟩ that is successful.
The formulation of bisimilarity is mathematically very
. q must satisfy o if all observations from ⟨q, o⟩ are
elegant and has received much attention also in
successful.
other fields of computer science []. However, some
researchers do consider it too discriminating: two pro- These notions can be used to define may, must and,
cesses may be deemed unrelated even though there is testing equivalence.
no practical way of ascertaining it. As an example, con-
May equivalence : p is may equivalent to q (p ≃m q) if,
sider the two rightmost vending machines of Fig. . They
for all possible observers o:
are not bisimilar because after inserting the first coin in
p may satisfy o if and only if q may satisfy o;
one case there is still the illusion of having the possibil-
Must equivalence : p is must equivalent to q (p ≃M q) if,
ity of choosing what to drink. Nevertheless, a customer
for all possible observers o:
would not be able to appreciate their differences since
p must satisfy o if and only if q must satisfy o.
there is no possibility of deciding what to drink with
Testing equivalence : p is testing equivalent to q (p ≃test q)
both machines.
if p ≃m q and p ≃M q.
Testing equivalence has been proposed [] (see
also []) as an alternative to bisimilarity; it takes The three vending machines of Fig.  are may equiv-
to the extreme the claim that when defining behav- alent, but only the two rightmost ones are must equiva-
ioral equivalences, one does not want to distinguish lent and testing equivalent. Indeed, in most cases must
between systems that cannot be taken apart by external equivalence implies may equivalence, and thus in most
observers and bases the definition of the equivalences cases must and testing do coincide. The two leftmost
 B Behavioral Equivalences

machines are not must equivalent because one after that coincides with failures equivalence. For the consid-
receiving the two coins the machine cannot refuse to ered class of systems, it also holds that must and testing
(must) deliver the drink chosen by the customer while equivalence ≃test do coincide. Thus, bisimilarity implies
the other can. testing equivalence that in turn implies traces
May and must equivalences have nice alternative equivalence.
characterizations. It has been shown that may equiva-
lence coincides with traces equivalence and that must Weak Variants of the Equivalences
equivalence coincides with failures equivalence, another When considering abstract versions of systems making
well-studied relation that is inspired by traces equiva- use of invisible actions, it turns out that all equivalences
lence but takes into account the possible interactions considered above are too discriminating. Indeed, traces,
( failures) after each trace and is thus more discrimi- testing/failures, and observation equivalence would dis-
native than trace equivalence []. Failures equivalence tinguish the two machines of Fig.  that, nevertheless,
relies on pairs of the form ⟨s, F⟩, where s is a trace and exhibit similar observable behaviors: get a coin and
F is a set of labels. Intuitively, ⟨s, F⟩ is a failure for a pro- deliver a coffee. The second one can be obtained, e.g.,
cess if it can perform the sequence of actions s to evolve from the term
into a state from which no action in F is possible. This
equivalence can be formulated on LTS as follows: coin.grinding.coffee.nil
Failures equivalence: Two states p and q are failures-
equivalent, written p ≃F q, if and only if they possess the by hiding the grinding action that is irrelevant for the
same failures, i.e., if for any s ∈ A∗τ and for any F ⊆ Aτ : customer.
s s Because of this overdiscrimination, weak variants of
. p −→ p′ and Init(p′ ) ∩ F = / implies q −
→ q′ for some the equivalences have been defined that permit ignoring
q′ and Init(q′ ) ∩ F = / (to different extents) internal actions when considering
s s
. q −→ q′ and Init(q′) ∩ F = / implies p −→ p′ for some the behavior of systems. The key step of their defini-
′ ′
p and Init(p ) ∩ F = / tion is the introduction of a new transition relation
a
where Init(q) represents the immediate actions of that ignores silent actions. Thus, q ⇒ = q′ denotes that
state q. q reduces to q′ by performing the visible action a pos-
sibly preceded and followed by any number (also )
s
= q′ , instead,
of invisible actions (τ). The transition q ⇒
Hierarchy of Equivalences denotes that q reduces to q′ by performing the sequence
The equivalences considered above can be precisely s of visible actions, each of which can be preceded and
related (see [] for a first study). Their relationships є
followed by τ-actions, while ⇒ = indicates that only
over the class of finite transition systems with only vis- τ-actions, possibly none, are performed.
ible actions are summarized by the figure below, where Weak traces equivalence The weak variant of traces
the upward arrow indicates containment of the induced equivalence is obtained by simply replacing the tran-
relations over states and ≡ indicates coincidence. s
sitions p −→ p′ above with the observable transitions
s
p⇒= p′ .

F M
coin coffee
q0

T m
coin t coffee
p0
Overall, the figure states that may testing gives rise
to a relation (≃m ) that coincides with traces equiva- Behavioral Equivalences. Fig.  Weakly equivalent
lence, while must testing gives rise to a relation ≃M vending machines
Behavioral Equivalences B 

Two states p and q are weak traces equivalent (p ≊T q) action possibly preceded and followed by any number
if for any s ∈ A∗ : of invisible actions.

. p ⇒
s
= p′ implies q ⇒
s
= q′ for some q′ p
a

B
s s
= q′ implies p ⇒
. q ⇒ = p′ for some p′
t q1 t t qn a q¢1 t q¢2 t t
q ··· ··· q¢
Weak testing equivalence To define the weak variants
of may, must, and testing equivalences (denoted by
≊m , ≊M , ≊test respectively), it suffices to change experi- Branching bisimulation equivalence An alternative to
ments so that processes and observers can freely per- weak bisimulation has also been proposed that consid-
form silent actions. To this purpose, one only needs ers those τ-actions important that appear in branch-
to change the inference rule of the observation step: ing points of systems descriptions: only silent actions
μ that do not eliminate possible interaction with external
⟨qi , oi ⟩ −
→ ⟨qi+ , oi+ ⟩ that can now be proved using:
observers are ignored.
τ
→ E′
E−
τ
→ F′
F− A symmetric relation R ⊆ Q × Q is a branching
τ τ bisimulation if, for any pair of states p and q such that
→ ⟨E′ , F⟩
⟨E, F⟩ − → ⟨E, F ′ ⟩
⟨E, F⟩ − μ
→ p′ , with μ ∈ Aτ and p′ ∈ Q, at least one
⟨p, q⟩ ∈ R, if p −
a a
→ E′
E− → F′
F− of the following conditions holds:
a a∈A
→ ⟨E′ , F ′ ⟩
⟨E, F⟩ − ● μ = τ and ⟨p′ , q⟩ ∈ R
є μ
● = q′′ −
q⇒ → q′ for some q′ , q′′ ∈ Q such that ⟨p, q′′ ⟩ ∈ R
To define, instead, weak failures equivalence, it suffices to
s s and ⟨p′ , q′ ⟩ ∈ R
replace p − → p′ with p ⇒
= p′ in the definition of its strong
variant. It holds that weak traces equivalence coin- Two states p, q are branching bisimilar (p ≈b q)
cides with weak may equivalence, and that weak failures if there exists a branching bisimulation R such that
equivalence ≊F coincides with weak must equivalence. ⟨p, q⟩ ∈ R.
Weak bisimulation equivalence For defining weak obser- The figure below describes the intuition behind
vational equivalence, a new notion of (weak) bisim- branching bisimilarity; it corresponds to the defini-
ulation is defined that again assigns a special role to tion above although it might appear, at first glance,
τ’s. To avoid having four items, the definition below more demanding. In order to consider two states, say
requires that the candidate bisimulation relations be p and q, equivalent, it is necessary, like for weak bisim-
symmetric: ilarity, that for each visible action performed by one
A symmetric relation R ⊆ Q × Q is a weak bisimula- of them the other has to have the possibility of per-
tion if, for any pair of states p and q such that ⟨p, q⟩ ∈ R, forming the same visible action possibly preceded and
the following holds: followed by any number of invisible actions. Branching
a a
bisimilarity, however, imposes the additional require-
● For all a ∈ A and p′ ∈ Q, if p − → p′ then q ⇒
= q′ for ment that all performed internal actions are not used
′ ′ ′
some q ∈ Q s. t. ⟨p , q ⟩ ∈ R to change equivalent class. Thus, all states reached via
τ є
● For all p′ ∈ Q, if p − → p′ then q ⇒
= q′ for some q′ τ’s before performing the visible action are required
∈ Q s.t. ⟨p′ , q′ ⟩ ∈ R to be equivalent to p, while all states reached via τ’s
after performing the visible action are required to be
Two states p, q are weakly bisimilar (p ≈ q) if there
equivalent to p′ .
exists a weak bisimulation R such that ⟨p, q⟩ ∈ R.
The figure below describes the intuition behind
weak bisimilarity. In order to consider two states, say a

p
p and q, equivalent, it is necessary that for each vis-
ible action performed by one of them the other has t t t a q¢1 t q¢2 t t
q q1 ··· qn ··· q¢
to have the possibility of performing the same visible
 B Behavioral Equivalences

Hierarchy of Weak Equivalences p3


Like for the strong case, also weak equivalences can be a
clearly related. Their relationships over the class of finite p1
transition systems with invisible actions, but without t
τ-loops (so-called non-divergent or strongly convergent b
p2 p4
LTSs) are summarized by the figure below, where the
upward arrow indicates containment of the induced
relations over states. b q6
q3
t
q1 t a q5
b
q2 b
q4

F M
Both of them, after zero or more silent actions, can
be either in a state where both actions a and b are possi-
T m ble or in a state in which only a b transition is possible.
However, via a τ-action, the topmost system can reach a
state that has no equivalent one in the bottom one, thus
Thus, over strongly convergent LTSs with silent they are not weakly bisimilar.
actions, branching bisimilarity implies weak bisimilar- The next two LTSs are instead equated by weak
ity, and this implies testing and failures equivalences; bisimilarity, and thus by weak trace and weak must
and these imply traces equivalence. equivalences, but are not branching bisimilar.
A number of counterexamples can be provided to
show that the implications of the figure above are proper p2
c
and thus that the converse does not hold. a
p0 p1
The two LTSs reported below are weak traces equiv- t
alent and weakly may equivalent, but are distinguished b
p3 p4
by all the other equivalences.

b q6
a q5
p3
a q3 q0 a c q2
a
q1 t
p1 q1 b b
t q3 q4
q2
p2 b p4
It is easy to see that from the states p and q , the
same visible action is possible and bisimilar states can
be reached. The two states p and q are instead not
Indeed, they can perform exactly the same weak traces, branching bisimilar because p , in order to match the
but while the former can silently reach a state in which a action of q to q and reach a state equivalent to
an a-action can be refused the second cannot. q , needs to reach p through p , but these two states,
The next two LTSs are equated by weak trace and connected by a τ-action, are not branching bisimilar.
weak must equivalences, but are distinguished by weak It is worth concluding that the two LTSs of Fig.  are
bisimulation and branching bisimulation. equated by all the considered weak equivalences.
Benchmarks B 

Future Directions were introduced in []. Failure semantic was first intro-
The study on behavioral equivalences of transition sys- duced in [].
tems is still continuing. LTSs are increasingly used as the B
basis for specifying and proving properties of reactive
Bibliography
systems. For example, they are used in model checking as
. Baeten JCM, Weijland WP () Process algebra. Cambridge
the model against which logical properties are checked. University Press, Cambridge
It is then important to be able to use minimal systems . Brookes SD, Hoare CAR, Roscoe AW () A theory of commu-
that are, nevertheless, equivalent to the original larger nicating sequential processes. J ACM ():–
ones so that preservation of the checked properties is . De Nicola R () Extensional equivalences for transition sys-
tems. Acta Informatica ():–
guaranteed. Thus, further research is expected on devis-
. De Nicola R, Hennessy M () Testing equivalences for
ing efficient algorithms for equivalence checking and on processes. Theor Comput Sci :–
understanding more precisely the properties of systems . Hennessy M () Algebraic theory of processes. The MIT Press,
that are preserved by the different equivalences. Cambridge
. Hennessy M, Milner R () Algebraic laws for nondeterminism
and concurrency. J ACM ():–
Related Entries . Hoare CAR () Communicating sequential processes.
Actors Prentice-Hall, Englewood Cliffs
Bisimulation . Milner R () Communication and concurrency. Prentice-Hall,
Upper Saddle River
CSP (Communicating Sequential Processes)
. Roscoe AW () The theory and practice of concurrency.
Pi-Calculus Prentice-Hall, Hertfordshire
Process Algebras . Sangiorgi D () On the origins of bisimulation and coinduc-
tion. ACM Trans Program Lang Syst ():.–.
. van Glabbeek RJ () The linear time-branching time
Bibliographic Notes and Further spectrum I: the semantics of concrete, sequential processes.
Reading In: Bergstra JA, Ponse A, Smolka SA (eds) Handbook of process
The theories of equivalences can be found in a num- algebra, Elsevier, Amsterdam, pp –
. van Glabbeek RJ () The linear time-branching time spec-
ber of books targeted to describing the different process
trum II. In: Best E (ed) CONCUR ’, th international con-
algebras. The theory of bisimulation is introduced in [], ference on concurrency theory, Hildesheim, Germany, Lecture
while failure and trace semantics are considered in [] notes in computer science, vol . Springer-Verlag, Heidelberg,
and []. The testing approach is presented in []. pp –
Moreover, interesting papers relating the different
approaches are [], the first paper to establish precise
relationships between the many equivalences proposed
in the literature, and the two papers by R. van Glabbeek: Behavioral Relations
[], considering systems with only visible actions, and
[], considering also systems with invisible actions. Behavioral Equivalences
In his two companion papers, R. van Glabbeek pro-
vides a uniform, model-independent account of many
of the equivalences proposed in the literature and
proposes several motivating testing scenarios, phrased Benchmarks
in terms of “button pushing experiments” on reactive
Jack Dongarra, Piotr Luszczek
machines to capture them.
University of Tennessee, Knoxville, TN, USA
Bisimulation and its relationships with modal log-
ics is deeply studied in [], while a deep study of its
origins and its use in other areas of computer science Definition
is provided in []. Branching bisimulation was first Computer benchmarks are computer programs that
introduced in [], while the testing based equivalences form standard tests of the performance of a computer
 B Benchmarks

and the software through which it is used. They another, so the best machine for circuit simulation may
are written to a particular programming model and not be the best machine for computational fluid dynam-
implemented by specific software, which is the final ics. Finally, the performance depends greatly on a com-
arbiter as to what the programming model is. A bination of compiler characteristics and human efforts
benchmark is therefore testing a software interface to were expended on obtaining the results.
a computer, and not a particular type of computer The conclusions drawn from a benchmark study of
architecture. computer performance depend not only on the basic
timing results obtained, but also on the way these are
interpreted and converted into performance figures.
Discussion The choice of the performance metric, may itself influ-
The basic goal of performance modeling is to measure, ence the conclusions. For example, is it desirable to
predict, and understand the performance of a computer have a computer that generates the most mega op per
program or set of programs on a computer system. In second (or has the highest Speedup), or the computer
other words, it transcends the measurement of basic that solves the problem in the least time? It is now
architectural and system parameters and is meant to well known that high values of the first metrics do
enhance the understanding of the performance behav- not necessarily imply the second property. This confu-
ior of full complex applications. However, the programs sion can be avoided by choosing a more suitable met-
and codes used in different areas of science differ in ric that re effects solution time directly, for example,
a large number of features. Therefore the performance either the Temporal, Simulation, or Benchmark perfor-
of full application codes cannot be characterized in a mance, defined below. This issue of the sensible choice
general way independent of the application and code of performance metric is becoming increasingly impor-
used. The understanding of the performance character- tant with the advent of massively parallel computers
istics is tightly bound to the specific computer program which have the potential of very high Giga-op rates,
code used. Therefore the careful selection of an inter- but have much more limited potential for reducing
esting program for analysis is the crucial first step in solution time.
any more detailed and elaborate investigation of full Given the time of execution T and the floating-
application code performance. The applications of per- point operation-count several different performance
formance modeling are numerous, including evaluation measures can be defined. Each metric has its own uses,
of algorithms, optimization of code implementations, and gives different information about the computer and
parallel library development, and comparison of system algorithm used in the benchmark. It is important there-
architectures, parallel system design, and procurement fore to distinguish the metrics with different names,
of new systems. symbols and units, and to understand clearly the differ-
A number of projects such as Perfect, NPB, Park- ence between them. Much confusion and wasted work
Bench, HPC Challenge, and others have laid the can arise from optimizing a benchmark with respect to
groundwork for a new era in benchmarking and eval- an inappropriate metric. If the performance of different
uating the performance of computers. The complex- algorithms for the solution of the same problem needs
ity of these machines requires a new level of detail in to be compared, then the correct performance metric to
measurement and comprehension of the results. The use is the Temporal Performance which is defined as the
quotation of a single number for any given advanced inverse of the execution time
architecture is a disservice to manufacturers and users
RT = /T.
alike, for several reasons. First, there is a great varia-
tion in performance from one computation to another A special case of temporal performance occurs for
on a given machine; typically the variation may be one simulation programs in which the benchmark problem
or two orders of magnitude, depending on the type is defined as the simulation of a certain period of phys-
of machine. Secondly, the ranking of similar machines ical time, rather than a certain number of timesteps. In
often changes as one goes from one application to this case, the term “simulation performance” is used,
Beowulf Clusters B 

and it is measured in units such as simulated days per often proprietary, and/or subject to distribution restric-
day (written sim-d/d or ‘d’/d) in weather forecasting, tions. To minimize the negative impact of these fac-
where the apostrophe is used to indicate “simulated” or tors, the use of compact applications was proposed in B
simulated pico-seconds per second (written sim-ps/s or many benchmarking efforts. Compact applications are
‘ps’/s) in electronic device simulation. It is important to typical of those found in research environments (as
use simulation performance rather than timestep/s for opposed to production or engineering environments),
comparing different simulation algorithms which may and usually consist of up to a few thousand lines of
require different sizes of timestep for the same accuracy source code. Compact applications are distinct from
(e.g., an implicit scheme that can use a large timestep, kernel applications since they are capable of produc-
compared with an explicit scheme that requires a much ing scientifically useful results. In many cases, compact
smaller step). In order to compare the performance of applications are made up of several kernels, interspersed
a computer on one benchmark with its performance on with data movements and I/O operations between the
another, account must be taken of the different amounts kernels.
of work (measured in op) that the different problems Any of the performance metrics, R, can be described
require for their solution. The benchmark performance with a two-parameter Amdahl saturation, for a fixed
is defined as the ratio of the floating-point operation- problem size as a function of number of
count and the execution time processors p,
R = R∞ /( + p/ /p)
RB = FB /T.
where R∞ is the saturation performance approached as
The units of benchmark performance are Giga-op/s p → ∞ and p/ is the number of processors required
(benchmark name), where the name of the benchmark to reach half the saturation performance. This univer-
is included in parentheses to emphasize that the per- sal Amdahl curve [, ] could be matched against the
formance may depend strongly on the problem being actual performance curves by changing values of the
solved, and to emphasize that the values are based on two parameters (R∞ , p/ ).
the nominal benchmark op-count. In other contexts
such performance figures would probably be quoted Related Entries
as examples of the so-called sustained performance of HPC Challenge Benchmark
a computer. For comparing the observed performance LINPACK Benchmark
with the theoretical capabilities of the computer hard- Livermore Loops
ware, the actual number of floating-point operations TOP
performed FH is computed, and from it the actual hard-
ware performance Bibliography
. Hockney RW () A framework for benchmark analysis. Super-
RH = FH /T. computer (IX-):–
. Addison C, Allwright J, Binsted N, Bishop N, Carpenter B, Dalloz P,
Parallel speedup is a popular metric that has been Gee D, Getov V, Hey A, Hockney R, Lemke M, Merlin J, Pinches M,
Scott C, Wolton I () The genesis distributed-memory bench-
used for many years in the study of parallel computer
marks. Part : methodology and general relativity benchmark with
performance. Speedup is usually defined as the ratio of results for the SUPRENUM computer. Concurrency: Practice and
execution time of one-processor T and execution time Experience ():–
on p-processors Tp .
One factor that has hindered the use of full applica-
tion codes for benchmarking parallel computers in the
past is that such codes are difficult to parallelize and to Beowulf Clusters
port between target architectures. In addition, full appli-
cation codes that have been successfully parallelized are Clusters
 B Beowulf-Class Clusters

Discussion
Beowulf-Class Clusters
Notations and Conventions
Clusters Is this essay, a program is represented as a sequence
of operations, i.e., of instances of high level statements
or machine instructions. Such a sequence is called a
trace. Each operation has a unique name, u, and a
text T(u), usually specified as a (high-level language)
Bernstein’s Conditions statement. There are many schemes for naming oper-
ations: for polyhedral programs, one may use integer
Paul Feautrier
vectors, and operations are executed in lexicographic
Ecole Normale Supérieure de Lyon, Lyon, France
order. For flowcharts programs, one may use words of a
regular language to name operations, and if the program
has function calls, words of a context-free language [].
Definition In the last two cases, u is executed before v iff u is a
Bersntein’s conditions [] are a simple test for deciding prefix of v. In what follows, u ≺ v is a shorthand for
if statements or operations can be interchanged without “u is executed before v.” For sequential programs, ≺ is
modifying the program results. The test applies to oper- a well-founded total order: there is no infinite chain
ations which read and write memory at well defined x , x , . . . , xi , . . . such that xi+ ≺ xi . This is equivalent
addresses. If u is an operation, let M(u) be the set of to stipulating that a program execution has a begining,
(addresses of) the memory cells it modifies, and R(u) but may not have an end.
the set of cells it reads. Operations u and v can be All operations will be assumed deterministic: the
reordered if: state of memory after execution of u depends only on
T(u) and on the previous state of memory.
M(u) ∩ M(v) = M(u) ∩ R(v) = R(u) ∩ M(v) = / () For static control programs, one can enumerate the
unique trace – or at least describe it – once and for all.
If these conditions are met, one says that u and v com-
One can also consider static control program families,
mute or are independent.
where the trace depends on a few parameters which are
Note that in most languages, each operation writes
know at program start time. Lastly, one can consider
at most one memory cell: W(u) is a singleton. However,
static control parts of programs or SCoPs. Most of this
there are exceptions: multiple and parallel assignments,
essay will consider only static control programs.
vector assignments among others.
When applying Bernstein’s conditions, one usually
The importance of this result stems from the fact
considers a reference trace, which comes from the orig-
that most program optimizations consist – or at least,
inal program, and a candidate trace, which is the result
involve – moving operations around. For instance, to
of some optimization or parallelization. The problem is
improve cache performance, one must move all uses of
to decide whether the two traces are equivalent, in a
a datum as near as possible to its definition. In paral-
sense to be discussed later. Since program equivalence
lel programming, if u and v are assigned to different
is in general undecidable, one has to restrict the set of
threads or processors, their order of execution may be
admissible transformations. Bernstein’s conditions are
unpredictable, due to arbitrary decisions of a scheduler
specially usefull for dealing with operation reordering.
or to the presence of competing processes. In this case,
if Bernstein’s conditions are not met, u and v must be
Commutativity
kept in the same thread.
To prove that Berstein’s conditions are sufficient for
Checking Bernstein’s conditions is easy for opera-
commutativity, one needs the following facts:
tions accessing scalars (but beware of aliases), is more
difficult for array accesses, and is almost impossible for ● When an operation u is executed, the only mem-
pointer dereferencing. See the Dependences entry for ory cells which may be modified are those whose
an in-depth discussion of this question. adresses are in M(u)
Bernstein’s Conditions B 

● The values stored in M(u) depend only on u and on Legality


the values read from R(u). Here, the question is to decide whether a candidate
▸ Consider two operations u and v which satisfy (Eq. ). trace is equivalent to a reference trace, where the two B
Assume that u is executed first. When v is executed traces contains exactly the same operations. There are
later, it finds in R(v) the same values as if it were two possibilities for deciding equivalence. Firstly, if the
executed first, since M(u) and R(v) are disjoint. traces are finite, one may examine the state of memory
Hence, the values stored in M(v) are the same, and after their termination. There is equivalence if these two
they do not overwrite the values stored by u, since states are identical. Another possibility is to construct
M(u) and M(v) are disjoint. The same reasoning the history of each memory cell. This is a list of values
applies if v is executed first. ordered in time. A new value is appended to the history
of x each time an operation u such that x ∈ M(u) is exe-
The fact that u and v do not meet Bernstein’s condi-
cuted. Two traces are equivalent if all cells have the same
tions is written u ⊥ v to indicate that u and v cannot be
history. This is clearly a stronger criterion than equal-
executed in parallel.
ity of the final memory; it has the advantage of being
applicable both to terminating programs and to non-
Atomicity terminating systems. The histories are especially simple
When dealing with parallel programs, commutativity is when a trace has the single assignment property: there is
not enough for correctness. Consider for instance two only one operation that writes into x. In that case, each
operations u and v with T(u) = [x = x + 1] and history has only one element.
T(v) = [x = x + 2]. These two operations com-
mute, since their sequential execution in whatever order Terminating Programs
is equivalent to a unique operation w such that T(w) = A terminating program is specified by a finite list of
[x = x + 3]. However, each one is compiled into operations, [u , . . . , un ], in order of sequential execu-
a sequence of more elementary machine instructions, tion. There is a dependence relation ui → uj iff i < j
which when executed in parallel, may result in x being and ui ⊥ uj .
increased by  or  or  (see Fig. , where r1 and r2 are All reorderings of the u: [v , . . . , vn ] such that
processor registers). the execution order of dependent operations is not
Observe that these two operations do not satisfy modified:
Bernstein’s conditions. In contrast, operations that sat- ui → u j , u i = v i ′ , u j = v j ′ ⇒ i ′ < j′
isfy Bernstein’s conditions do not need to be protected are legal.
by critical sections when run in parallel. The reason is
that neither operation modifies the input of the other, ▸ The proof is by a double induction. Let k be the length
and that they write in distinct memory cells. Hence, the of the common prefix of the two programs:
stored values do not depend on the order in which the ui = vi , i = , k.
writes are interleaved. Note that k may be null. The element uk+ occurs some-
where among the v, at position i > k. The element vi−
occurs among the u at position j > k +  (see Fig. ). It
x = 0 x = 0 x = 0 follows that uk+ = vi and uj = vi− are ordered differ-
r1 = x -- r1 = x -- r1 = x --
-- r2 = x r1 += 1 -- -- r2 = x ently in the two programs, and hence must satisfy Bern-
r1 += 1 -- x = r1 -- r1 += 1 -- stein’s condition. vi− and vi can therefore be exchanged
-- r2 += 2 -- r2 = x -- r2 += 2 without modifying the result of the reordered program.
x = r1 -- -- r2 += 2 -- x = r2
-- x = r2 -- x = r2 x = r1 -- Continuing in this way, vi can be brought in position
-- -- -- -- -- -- k + , which means that the common prefix has been
x = 2 x = 3 x = 1 extended one position to the right. This process can be
P #1 P #2 P #1 P #2 P #1 P #2
continued until the length of the prefix is n. The two
Bernstein’s Conditions. Fig.  Several possible interleaves programs are now identical, and the final result of the
of x = x + 1 and x = x + 2 candidate trace has not been modified.
 B Bernstein’s Conditions

uk vi−1 vi by A[σ (y, u)]. That the new trace has the single assign-
v ment property is clear. It is equivalent to the reference
trace in the following sense: for each cell x, construct
a history by appending the value of A[u] each time an
operation u such that x ∈ M(u) is executed. Then the
vk histories of a cell in the reference trace and in the single
u assigment trace are identical.
uj
uk+1 ▸ Let us say that an operation u has a discrepancy for x if
the value assigned to x by u in the reference trace is dif-
Bernstein’s Conditions. Fig.  The commutation Lemma
ferent from the value of A[u] in the single assignment
trace. Let u be the earliest such operation. Since all
operations are assumed deterministic, this means that
The property which has just been proved is crucial there is a cell y ∈ R(u ) whose value is different from
for program optimization, since it gives a simple test for A[σ(y, u )]. Hence σ(y, u ) ≺ u also has a discrepancy,
the legality of statement motion, but what is its import a contradiction.
for parallel programming?
The point is that when parallelizing a program, its Single assignment programs (SAP) where first pro-
operations are distributed among several processors or posed by Tesler and Enea [] as a tool for parallel
among several threads. Most parallel architectures do programming. In a SAP, the sets M(u) ∩ M(v) are
not try to combine simultaneous writes to the same always empty, and if there is a non-empty R(u) ∩ M(v)
memory cell, which are arbitrarily ordered by the bus where u ≺ v, it means that some variable is read before
arbiter or a similar device. It follows that if one is only being assigned, a programming error. Some authors
interested in the final result, each parallel execution is [] then noticed that a single assignment program is a
equivalent to some interleave of the several threads of collection of algebraic equations, which simplifies the
the program. Taking care that operations which do not construction of correctness proofs.
satisfy Bernstein’s condition are excuted in the order
specified by the original sequentail program guarantees Non-Terminating Systems
deterministic execution and equivalence to the sequen- The reader may have noticed that the above legal-
tial program. ity proof depends on the finiteness of the program
trace. What happens when one wants to build a non-
terminating parallel system, as found for instance in
Single Assignment Programs signal processing applications or operating systems?
A trace is in single assignment form if, for each mem- For assessing the correctness of a transformation, one
ory cell x, there is one and only one operation u such cannot observe the final result, which does not exists.
that x ∈ M(u). Any trace can be converted to (dynamic) Beside, one clearly needs some fairness hypothesis: it
single assignment form – at least in principle – by the would not do to execute all even numbered operations,
following method. ad infinitum, and then to execute all odd numbered
Let A be an (associative) array indexed by the oper- operations, even if Bernstein’s conditions would allow
ation names. Assuming that all M(u) are singletons, it. The needed property is that for all operations u in the
operation u now writes into A[u] instead of M(u). reference trace, there is a finite integer n such that u is
The source of cell x at u, noted σ (x, u), is defined as: the n-th operation in the candidate trace.
Consider first the case of two single assignment
● x ∈ M(σ (x, u))
traces, one of which is the reference trace, the other hav-
● σ (x, u) ≺ u
ing been reordered while respecting dependences. Let u
● there is no v such that σ (x, u) ≺ v ≺ u and x ∈ M(v)
be an operation. By the fairness hypothesis, u is present
In words, σ (x, u) is the last write to x that precedes u. in both traces. Assume that the values written in A[u]
Now, in the text of u, replace all occurences of y ∈ R(u) by the the two traces are distinct. As above, one can find
Bernstein’s Conditions B 

an operation v such that A[v] is read by u, A[v] has dif- is not executed too early, and since there is a dependence
ferent values in the two traces, and v ≺ u in the two from a test to each enclosed operation, that no operation
traces. One can iterate this process indefinitely, which is executed before the tests results are known. One must B
contradicts the well-foundedness of ≺. take care not to compute dependences between opera-
Consider now two ordinary traces. After conversion tions u and v which have incompatible guards gu and gv
to single assignment, one obtain the same values for the such that gu ∧ gv = false.
A[u]. If one extract an history for each cell x as above, The case of while loops is more complex. Firstly, the
one obtain two identical sequence of values, since oper- construction while(true) is the simplest way of writing
ations that write to x are in dependence and hence are a non terminating program, whose analysis has been
ordered in the same direction in the two traces. discussed above. Anything that follows an infinite loop
Observe that this proof applies also to terminating is dead code, and no analysis is needed for it. Consider
traces. If the cells of two terminating traces have identi- now a terminating loop:
cal histories, it obviously follows that the final memory
while(p) do S;
states are identical. On the other hand, the proof for
terminating traces applies also, in a sequential con- The several executions of the continuation predicate,
text, to operations which commutes without satisfying p, must be considered as operations. Strictly speaking,
Bernstein’s conditions. one cannot execute an instance of S before the corre-
sponding instance of p, since if the result of p is false,
Dynamic Control Programs S is not executed. On the other hand, there must be a
The presence of tests whose outcome cannot be dependence from S to p, since otherwise the loop would
predicted at compile time greatly complicates program not terminate. Hence, a while loop must be executed
analysis. The simplest case is that of well structured pro- sequentially. The only way out is to run the loop specu-
grams, which uses only the if then else construct. For latively, i.e., to execute instances of the loop body before
such programs, a simple syntactical analysis allows the knowing the outcome of the continuation predicate, but
compiler to identify all tests which have an influence on this method is beyond the scope of this essay.
the execution of each operation. One has to take into
account three new phenomena: Related Entries
● A test is an operation in itself, which has a set of read Dependences
cells, and perhaps a set of modified cells if the source Polyhedron Model
language allows side effects
● An operation cannot be executed before the out- Bibliographic Notes and Further
comes of all controlling tests are known Reading
● No dependence exists for two operations which See Allen and Kennedy’s book [] for many uses of the
belong to opposite branches of a test concept of dependence in program optimization.
A simple solution, known as if-conversion [], can For more information on the transformation to
be used to solve all three problems at once. Each test: Single Assignment form, see [] or []. For the use of
Single Assignment Programs for hardware synthesis,
if(e) then . . . else . . . is replaced by a new operation b
see [] or [].
= e; where b is a fresh boolean variable. Each operation
in the range of the test is guarded by b or ¬b, depending
on whether the operation is on the then or else branch Bibliography
of the test. In the case of nested tests, this transformation . Allen JR, Kennedy K, Porterfield C, Warren J () Conversion
is applied recursively; the result is that each operation of control dependence to data dependence. In: Proceedings of
the th ACM SIGACT-SIGPLAN symposium on principles of
is guarded by a conjunction of the b’s or their com-
programming languages, POPL ’, ACM, New York, pp –
plements. Bernstein’s conditions are then applied to the . Amiranoff P, Cohen A, Feautrier P () Beyond iteration vec-
resulting trace, the b variables being included in the read tors: instancewise relational abstract domains. In: Static analysis
and modified sets as necessary. This insures that the test symposium (SAS ’), Seoul, August 
 B Bioinformatics

. Arsac J () La construction de programmes structurés. is punctuated by the occasional development of new
Dunod, Paris technologies in the field which generally created new
. Bernstein AJ () Analysis of programs for parallel processing.
types of data acquisition (such as microarrays to cap-
IEEE Trans Electron Comput EC-:–
ture gene expression developed in the mid-s) or
. Feautrier P () Dataflow analysis of scalar and array references.
Int J Parallel Program ():– more rapid acquisition of data (such as the develop-
. Feautrier P () Array dataflow analysis. In: Pande S, Agrawal D ment of next-generation sequencing technologies in the
(eds) Compiler optimizations for scalable parallel systems. Lec- mid-s). Generally, these developments have ush-
ture notes in computer science, vol , chapter . Springer, ered in new avenues of bioinformatics research due
Berlin, pp –
to new applications enabled by novel data sources, or
. Kennedy K, Allen R () Optimizing compilers for modern
architectures: a dependence-based approach. Morgan Kaufman, increases in the scale of data that need to be archived
San Francisco and analyzed, or new applications that come within
. Leverge H, Mauras C, Quinton P () The alpha language and reach due to improved scales and efficiencies. For exam-
its use for the design of systolic arrays. J VLSI Signal Process ple, the rapid adoption of microarray technologies for
:–
measuring gene expressions ushered in the era of sys-
. Tesler LG, Enea HJ () A language design for concurrent
processes. In: AFIPS SJCC , Thomson Book Co., pp –
tems biology; the relentless increases in cost efficiencies
. Verdoolaege S, Nikolov H, Stefanov T () Improved derivation in sequencing enabled genome sequencing for many
of process networks. In: Digest of the th workshop on optimiza- species, which then formed the foundation for the field
tion for DSP and embedded systems, New York, March , of comparative genomics.
pp – Thanks to the aforementioned advances and our
continually improving knowledge of how biological sys-
tems are designed and operate, bioinformatics devel-
oped into a broad field with several well-defined sub-
Bioinformatics fields of specialization – computational genomics, com-
parative genomics, metagenomics, phylogenetics, sys-
Srinivas Aluru tems biology, structural biology, etc. Several entries in
Iowa State University, Ames, IA, USA this encyclopedia are designed along the lines of such
Indian Institute of Technology Bombay, Mumbai, India
subfields whenever a sufficient body of work exists in
development of parallel methods in the area. This entry
Synonyms contains general remarks about the field of parallel
Computational biology computational biology; the readers are referred to the
related entries for an in-depth discussion and appropri-
Definition ate references for specific topics. An alternative view to
Bioinformatics and/or computational biology is broadly classifying bioinformatics research relies on the organ-
defined as the development and application of informat- isms of study – () microbial organisms, () plants,
ics techniques for solving problems arising in biological and () humans/animals. Even though the underly-
sciences. ing bioinformatics techniques are generally applicable
across organisms, the key target applications tend to
Discussion vary based on this organismal classification. In studying
The terms “Bioinformatics” and “Computational Biol- microbial organisms, a key goal is the ability to engi-
ogy” are broadly used to represent research in com- neer them for increased production of certain products
putational models, methods, databases, software, and or to meet certain environmental objectives. In agri-
analysis tools aimed at solving applications in the bio- cultural biotechnology, key challenges being pursued
logical sciences. Although the origins of the field can be include increasing yields, increasing nutitional content,
traced as far back as , the field exploded in promi- developing robust crops and biofuel production. On the
nence during the s with the conception and exe- other hand, much of the research on humans is driven
cution of the human genome project, and such explo- by medical concerns and understanding and treating
sive growth continues to this date. This long time line complex diseases.
Bioinformatics B 

Perhaps the oldest studied problem in computa- upward of  billion reads per experiment by early .
tional biology is that of molecular dynamics, originally The high throughput sequencing data generated by
studied in computational chemistry but increasingly these systems is impacting many subfields of computa- B
being applied to the study of protein structures in biol- tional biology, and the data deluge is severely straining
ogy. Outside of this classical area, research in parallel the limits of what can be achieved by sequential meth-
computational biology can be traced back to the devel- ods. Next-generation sequencing is shifting individual
opment of parallel sequence alignment algorithms in investigators into terascale and bigger organizations
the late s. One of the early applications where par- into petascale, necessitating the development of parallel
allel bioinformatics methods proved crucial is that of methods. Many other types of high-throughput instru-
genome assembly. At the human genome scale, this mentation are becoming commonplace in biology, rais-
requires inferring a sequence by assembling tens of mil- ing further complexities in massive scale, heterogenous
lions of randomly sampled fragments of it, which would data integration and multiscale modeling of biologi-
take weeks to months if done serially. Albeit the initial cal systems. Emerging high performance computing
use of parallelism only in phases where such paral- paradigms such as Clouds and manycore GPU plat-
lelism is obvious, the resulting improvements in time- forms are also providing impetus to the development of
to-solution proved adequate, and subsequently moti- parallel bioinformatics methods.
vated the development of more sophisticated parallel
bioinformatics methods for genome assembly. Spurred Related Entries
by the enormous data sizes, problem complexities, and Genome Assembly
multiple scales of biological systems, interest in paral- Homology to Sequence Alignment, From
lel computational biology continues to grow. However, Phylogenetics
the development of parallel methods in bioinformatics Protein Docking
represents a mere fraction of the problems for which Suffix Trees
sequential solutions have been developed. In addition, Systems Biology, Network Inference in
progress in parallel computational biology has not been
uniform across all subfields of computational biology.
For example, there has been little parallel work in the
Bibliographic Notes and Further
field of comparative genomics, not counting trivial uses
Reading
Readers who wish to conduct an in-depth study of
of parallelism such as in all-pairs alignments. In other
bioinformatics and computational biology may find the
fields such as systems biology, work in parallel meth-
comprehensive handbook on computational biology a
ods is in its nascent stages with certain problem areas
useful reference []. Works on development of parallel
(such as network inference) targeted more than oth-
methods in computational biology can be found in the
ers. The encyclopedia entries reflect this development
book chapter [], the survey article [], the first edited
and cover the topical areas that reflect strides in parallel
volume on this subject [], and several journal special
computational biology.
issues [–]. The annual IEEE International Workshop
Going forward, there are compelling developments
on High Performance Computational Biology held since
that favor growing prominence of parallel computing
 provides the primary forum for research dissemi-
in the field of bioinformatics and computational biol-
nation in this field and the readers may consult the pro-
ogy. One such development is the creation of high-
ceedings (www.hicomb.org) for scoping the progress in
throughput second- and third-generation sequencing
the field and for further reference.
technologies. After more than three decades of sequenc-
ing DNA one fragment at a time, next-generation
sequencing technologies permit the simultaneous
Bibliography
. Aluru S (ed) () Handbook of computational molecular biol-
sequencing of a large number of DNA fragments. With
ogy. Chapman & Hall/CRC Computer and Information Science
throughputs increasing at the rate of a factor of  per Series, Boca Raton
year, sequencers that were generating a few million . Aluru S, Amato N, Bader DA, Bhandarkar S, Kale L, Mari-
DNA reads per experiment in  are delivering nescu D, Samatova N () Parallel computational biology.
 B Bisimilarity

In: Heroux MA, Raghavan P, Simon HD (eds) Parallel Process- be labeled by predicates from a given set P that hold in
ing for Scientific Computing (Software, Environments and Tools). that state.
Society for Industrial and Applied Mathematics (SIAM), Philadel-
phia, pp – Definition  Let A and P be sets (of actions and pred-
. Aluru S, Bader DA () Special issue on high performance icates, respectively).
computational biology. J Parallel Distr Com (–):– A labeled transition system (LTS) over A and P is a triple
. Aluru S, Bader DA () Special issue on high performance
(S, →, ⊧) with:
computational biology. Concurrency-Pract Ex ():–
. Aluru S, Bader DA () Special issue on high performance ● S a class (of states).
computational biology. Parallel Comput ():– a
● → a collection of binary relations !→ ⊆ S × S – one
. Amato N, Aluru S, Bader DA () Special issue on high per-
formance computational biology. IEEE Trans Parallel Distrib Syst for every a ∈ A – (the transitions),
a
():– such that for all s ∈ S the class {t ∈ S ∣ s !→ t} is a
. Bader DA () Computational biology and high-performance set.
computing. Commun ACM ():– ● ⊧ ⊆ S × P. s ⊧ p says that predicate p ∈ P holds in
. Zomaya AY (ed) () Parallel computing for bioinformatics and
state s ∈ S.
computational biology: models, enabling technolgoies, and case
studies. Wiley, Hoboken LTSs with A a singleton (i.e., with → a single binary
relation on S) are known as Kripke structures, the models
of modal logic. General LTSs (with A arbitrary) are the
Kripke models for polymodal logic. The name “labeled
Bisimilarity transition system” is employed in concurrency theory.
There, the elements of S represent the systems one is
Bisimulation a
interested in, and s !→ t means that system s can
evolve into system t while performing the action a. This
approach identifies states and systems: The states of a
Bisimulation system s are the systems reachable from s by follow-
ing the transitions. In this realm P is often taken to

Robert J. van Glabbeek be empty, or it contains a single predicate indicating
NICTA, Sydney, Australia successful termination.
The University of New South Wales, Sydney, Australia
Stanford University, Standford, CA, USA Definition  A process graph over A and P is a tuple
g = (S, I, →, ⊧) with (S, →, ⊧) an LTS over A and P in
which S is a set, and I ∈ S.
Synonyms
Process graphs are used in concurrency theory to
Bisimulation equivalence; Bisimilarity
disambiguate between states and systems. A process
graph (S, I, →, ⊧) represents a single system, with S the
Definition set of its states and I its initial state. In the context of
Bisimulation equivalence is a semantic equivalence rela- an LTS (S, →, ⊧) two concurrent systems are modeled
tion on labeled transition systems, which are used to by two members of S; in the context of process graphs,
represent distributed systems. It identifies systems with they are two different graphs. The nondeterministic finite
the same branching structure. automata used in automata theory are process graphs
with a finite set of states over a finite alphabet A and a set
Discussion P consisting of a single predicate denoting acceptance.
Labeled Transition Systems
A labeled transition system consists of a collection of
states and a collection of transitions between them. The Bisimulation Equivalence
transitions are labeled by actions from a given set A that Bisimulation equivalence is defined on the states of a
happen when the transition is taken, and the states may given LTS, or between different process graphs.
Bisimulation B 

Definition  Let (S, →, ⊧) be an LTS over A and P. A Definition  The language L of polymodal logic over
bisimulation is a binary relation R ⊆ S × S, satisfying: A and P is given by:
∧ if sRt then s ⊧ p ⇔ t ⊧ p for all p ∈ P. ● ⊺∈L B
a
∧ if sRt and s !→ s′ with a ∈ A, then there exists a t ′ ● p ∈ L for all p ∈ P
a
with t !→ t′ and s′ Rt′ . ● if φ, ψ ∈ L for then φ ∧ ψ ∈ L
a
∧ if sRt and t !→ t ′ with a ∈ A, then there exists an ● if φ ∈ L then ¬φ ∈ L
a
s′ with s !→ s′ and s′ Rt′ . ● if φ ∈ L and a ∈ A then //a//φ ∈ L

Two states s, t ∈ S are bisimilar, denoted s ↔ t, if there Basic (as opposed to poly-) modal logic is the spe-
exists a bisimulation R with sRt. cial case where ∣A∣ = ; there //a//φ is simply denoted
◇φ. The Hennessy–Milner logic is polymodal logic with
Bisimilarity turns out to be an equivalence relation
P = /. The language L∞ of infinitary polymodal logic
on S, and is also called bisimulation equivalence.
over A and P is obtained from L by additionally allow-
Definition  Let g = (S, I, →, ⊧) and h = (S′ , I ′ , →′ , ing ⋀i∈I φ i to be in L∞ for arbitrary index sets I and
⊧′ ) be process graphs over A and P. A bisimulation φ i ∈ L∞ for i ∈ I. The connectives ⊺ and ∧ are then the
between g and h is a binary relation R ⊆ S × S′ , satisfy- special cases I = / and ∣I∣ = .
ing IRI ′ and the same three clauses as above. g and h are
Definition  Let (S, →, ⊧) be an LTS over A and P.
bisimilar, denoted g ↔ h, if there exists a bisimulation
The relation ⊧ ⊆ S×P can be extended to the satisfaction
between them.
relation ⊧ ⊆ S × L∞ , by defining

● s ⊧ ⋀i∈I φ i if s ⊧ φ i for all i ∈ I – in particular, s ⊧ ⊺


for any state s ∈ S
a a a
● s ⊧ ¬φ if s ⊧/φ
a
● s ⊧ /a/φ if there is a state t with s !→ t and t ⊧ φ
/ /

b c b c
Write L(s) for {φ ∈ L ∣ s ⊧ φ}.
Theorem  [] Let (S, →, ⊧) be an LTS and s, t ∈ S.
Then s ↔ t ⇔ L∞ (s) = L∞ (t).
Example The two process graphs above (over A = In case the systems s and t are image finite, it suf-

{a, b, c} and P = { }), in which the initial states are fices to consider finitary polymodal formulas only [].
indicated by short incoming arrows and the final states In fact, for this purpose it is enough to require that one

(the ones labeled with ) by double circles, are not of s and t is image finite.
bisimulation equivalent, even though in automata the-
Definition  Let (S, →, ⊧) be an LTS. A state t ∈ S is
ory they accept the same language. The choice between
reachable from s ∈ S if there are si ∈ S and ai ∈ A for
b and c is made at a different moment (namely, before ai
i = , . . . , n with s = s , si− !→ si for i = , . . . , n, and
vs. after the a-action); that is, the two systems have
sn = t. A state s ∈ S is image finite if for every state t ∈ S
a different branching structure. Bisimulation semantics
reachable from s and for every a ∈ A, the set {u ∈ S ∣
distinguishes systems that differ in this manner. a
t !→ u} is finite.

Modal Logic Theorem  [] Let (S, →, ⊧) be an LTS and s, t ∈S with


(Poly)modal logic is an extension of propositional logic s image finite. Then s ↔ t ⇔ L(s) = L(t).
with formulas //a//φ, saying that it is possible to follow
an a-transition after which the formula φ holds. Modal
formulas are interpreted on the states of labeled transi- Non-well-Founded Sets
tion systems. Two systems are bisimilar iff they satisfy Another characterization of bisimulation semantics
the same infinitary modal formulas. can be given by means of Aczel’s universe V of
 B Bisimulation

non-well-founded sets []. This universe is an exten- of related systems. The notions of weak and delay bisim-
sion of the Von Neumann universe of well-founded ulation equivalence, which were both introduced by
sets, where the axiom of foundation (every chain x ∋ Milner under the name observational equivalence, make
x ∋ ⋯ terminates) is replaced by an anti-foundation more identifications, motivated by observable machine-
axiom. behaviour according to certain testing
scenarios.
Definition  Let (S, →, ⊧) be an LTS, and let B denote τ
Write s 8⇒ t for ∃n ≥  : ∃s , . . . , sn : s = s !→
the unique function M : S → V satisfying, for all s ∈ S, τ τ
s !→ ⋯ !→ sn = t, that is, a (possibly empty) path of
(a)
a τ-steps from s to t. Furthermore, for a ∈ Aτ , write s !→ t
M(s) = {//a, M(t)// ∣ s !→ t}. a (a)
for s !→ t ∨ (a = τ ∧ s = t). Thus !→ is the same as !→
a
(τ)
It follows from Aczel’s anti-foundation axiom that for a ∈ A, and !→ denotes zero or one τ-step.
such a function exists. In fact, the axiom amounts to Definition  Let (S, →, ⊧) be an LTS over Aτ and P.
saying that systems of equations like the one above
Two states s, t ∈ S are branching bisimulation equivalent,
have unique solutions. B(s) could be taken to be the denoted s ↔b t, if they are related by a binary relation
branching structure of s. The following theorem then R ⊆ S × S (a branching bisimulation), satisfying:
says that two systems are bisimilar iff they have the same
branching structure. ∧ if sRt and s ⊧ p with p ∈ P, then there is a t with
t 8⇒ t ⊧ p and sRt .
Theorem  [] Let (S, →, ⊧) be an LTS and s, t ∈ S.
∧ if sRt and t ⊧ p with p ∈ P, then there is a s with
Then s ↔ t ⇔ B(s) = B(t).
s 8⇒ s ⊧ p and s Rt.
a
∧ if sRt and s !→ s′ with a ∈ Aτ , then there are t , t , t′
(a)
Abstraction with t 8⇒ t !→ t = t ′ , sRt , and s′ Rt ′ .
a
In concurrency theory it is often useful to distinguish ∧ if sRt and t !→ t ′ with a ∈ Aτ , then there are s , s , s′
(a)
between internal actions, which do not admit interac- with s8⇒s !→ s = s′ , s Rt, and s′ Rt′ .
tions with the outside world, and external ones. As nor-
Delay bisimulation equivalence, ↔d , is obtained by
mally there is no need to distinguish the internal actions
dropping the requirements sRt and s Rt. Weak bisimu-
from each other, they all have the same name, namely,
lation equivalence [], ↔w , is obtained by furthermore
τ. If A is the set of external actions a certain class of sys-
relaxing the requirements t = t ′ and s = s′ to t 8⇒ t′
tems may perform, then Aτ := A˙∪{τ}. Systems in that
and s 8⇒ s′ .
class are then represented by labeled transition systems
over Aτ and a set of predicates P. The variant of bisimu- These definitions stem from concurrency theory.
lation equivalence that treats τ just like any action of A is On Kripke structures, when studying modal or tempo-
called strong bisimulation equivalence. Often, however, ral logics, normally a stronger version of the first two
one wants to abstract from internal actions to various conditions is imposed:
degrees. A system doing two τ actions in succession is
∧ if sRt and p ∈ P, then s ⊧ p ⇔ t ⊧ p.
then considered equivalent to a system doing just one.
However, a system that can do either a or b is consid- For systems without τ’s all these notions coincide with
ered different from a system that can do either a or first strong bisimulation equivalence.
τ and then b, because if the former system is placed
in an environment where b cannot happen, it can still Concurrency
do a instead, whereas the latter system may reach a When applied to parallel systems, capable of perform-
state (by executing the τ action) in which a is no longer ing different actions at the same time, the versions of
possible. bisimulation discussed here employ interleaving seman-
Several versions of bisimulation equivalence that tics: no distinction is made between true parallelism and
formalize these desiderata occur in the literature. its nondeterministic sequential simulation. Versions of
Branching bisimulation equivalence [], like strong bisimulation that do make such a distinction have been
bisimulation, faithfully preserves the branching structure developed as well, most notably the ST-bisimulation [],
Bitonic Sort B 

which takes temporal overlap of actions into account, Definition


and the history preserving bisimulation [], which even Bitonic Sort is a sorting algorithm that uses comparison-
keeps track of causal relations between actions. For swap operations to arrange into nondecreasing order an B
this purpose, system representations such as Petri nets input sequence of elements on which a linear order is
or event structures are often used instead of labeled defined (for example, numbers, words, and so on).
transition systems.
Discussion
Bibliography
Introduction
. Aczel P () Non-well-founded Sets, CSLI Lecture Notes .
Stanford University, Stanford, CA Henceforth, all inputs are assumed to be numbers, with-
. van Glabbeek RJ () Comparative concurrency semantics and out loss of generality. For ease of presentation, it is also
refinement of actions. PhD thesis, Free University, Amsterdam. assumed that the length of the sequence to be sorted is
Second edition available as CWI tract , CWI, Amsterdam  an integer power of . The Bitonic Sort algorithm has the
. Hennessy M, Milner R () Algebraic laws for nondeterminism
following properties:
and concurrency. J ACM ():–
. Hollenberg MJ () Hennessy-Milner classes and process alge- . Oblivious: The indices of all pairs of elements
bra. In: Ponse A, de Rijke M, Venema Y (eds) Modal logic and involved in comparison-swaps throughout the exe-
process algebra: a bisimulation perspective, CSLI Lecture Notes ,
CSLI Publications, Stanford, CA, pp –
cution of the algorithm are predetermined, and do
. Milner R () Operational and algebraic semantics of concur- not depend in any way on the values of the input ele-
rent processes. In: van Leeuwen J (ed) Handbook of theoretical ments. It is important to note here that the elements
computer science, Chapter . Elsevier Science Publishers B.V., to be sorted are assumed to be kept into an array
North-Holland, pp – and swapped in place. The indices that are referred
to here are those in the array.
Further Readings . Recursive: The algorithm can be expressed as a pro-
Baeten JCM, Weijland WP () Process algebra. Cambridge cedure that calls itself to operate on smaller versions
University Press of the input sequence.
Milner R () Communication and concurrency. Prentice Hall, New . Parallel: It is possible to implement the algorithm
Jersey using a set of special processors called “compara-
Sangiorgi D () on the origins of bisimulation and coinduction.
tors” that operate simultaneously, each implement-
ACM Trans Program Lang Syst (). doi: ./.
ing a comparison-swap. A comparator receives a
distinct pair of elements as input and produces that
pair in sorted order as output in one time unit.
Bitonic sequence. A sequence (a , a , . . . , am ) is said to
Bisimulation Equivalence be bitonic if and only if:
Bisimulation (a) Either there is an integer j,  ≤ j ≤ m, such that

a ≤ a ≤ ⋯ ≤ aj ≥ aj+ ≥ aj+ ≥ ⋯ ≥ am

(b) Or the sequence does not initially satisfy the con-


Bitonic Sort dition in (a), but can be shifted cyclically until the
Selim G. Akl condition is satisfied.
Queen’s University, Kingston, ON, Canada For example, the sequence (, , , , , , , , )
is bitonic, as it satisfies condition (a). Similarly, the
sequence (, , , , , , ), which does not satisfy con-
Synonyms dition (a), is also bitonic, as it can be shifted cyclically
Bitonic sorting network; Ditonic sorting to obtain (, , , , , , ).
 B Bitonic Sort

Let (a , a , . . . , am ) be a bitonic sequence, and let Finally, since ak ≥ ak+ , ak ≥ ak−m , ak−m+ ≥ ak−m ,
di = min(ai , am+i ) and ei = max(ai , am+i ), for i = , , and ak−m+ ≥ ak+ , it follows that:
. . . , m. The following properties hold:
(a) The sequences (d , d , . . . , dm ) and (e , e , . . . , em ) max(ak−m , ak+ ) ≤ min(ak , ak−m+ ).
are both bitonic.
(b) max(d , d , . . . , dm ) ≤ min(e , e , . . . , em ).
Sorting a Bitonic Sequence
In order to prove the validity of these two properties, Given a bitonic sequence (a , a , . . . , am ), it can
it suffices to consider sequences of the form: be sorted into a sequence (c , c , . . . , cm ), arranged
in nondecreasing order, by the following algorithm
a ≤ a ≤ . . . ≤ aj− ≤ aj ≥ aj+ ≥ . . . ≥ am ,
MERGEm :
for some  ≤ j ≤ m, since a cyclic shift of
Step . The two sequences (d , d , . . . , dm ) and (e , e ,
{a , a , . . . , am } affects {d , d , . . . , dm } and
. . . , em ) are produced.
{e , e , . . . , em } similarly, while affecting neither of the
Step . These two bitonic sequences are sorted indepen-
two properties to be established. In addition, there is
dently and recursively, each by a call to MERGEm .
no loss in generality to assume that m < j ≤ m, since
{am , am− , . . . , a } is also bitonic and neither property It should be noted that in Step  the two sequences
is affected by such reversal. can be sorted independently (and simultaneously
There are two cases: if enough comparators are available), since no ele-
ment of (d , d , . . . , dm ) is larger than any element of
. If am ≤ am , then ai ≤ am+i . As a result, di = ai , and
(e , e , . . . , em ). The m smallest elements of the final
ei = am+i , for  ≤ i ≤ m, and both properties hold.
sorted sequence are produced by sorting (d , d , . . . ,
. If am > am , then since aj−m ≤ aj , an index k exists,
dm ), and the m largest by sorting (e , e , . . . , em ). The
where j ≤ k < m, such that ak−m ≤ ak and ak−m+ >
recursion terminates when m = , since MERGE is
ak+ . Consequently:
implemented directly by one comparison-swap (or one
di = ai and ei = am+i for  ≤ i ≤ k − m, comparator).

and
Sorting an Arbitrary Sequence
di = am+i and ei = ai for k − m < i ≤ m.
Algorithm MERGEm assumes that the input sequence
Hence: to be sorted is bitonic. However, it is easy to mod-
ify an arbitrary input sequence into a sequence of
di ≤ di+ for  ≤ i < k − m,
bitonic sequences as follows. Let the input sequence
and be (a , a , . . . , an ), and recall that, for simplicity, it
di ≥ di+ for k − m ≤ i < m, is assumed that n is a power of . Now the following
n/ comparisons-swaps are performed: For all odd i,
implying that {d , d , . . . , dm } is bitonic. Similarly, ai is compared with ai+ and a swap is applied if nec-
ei ≤ ei+ , for k − m ≤ i < m, em ≤ e , ei ≤ ei+ , for essary. These comparison-swaps are numbered from
 ≤ i < j − m, and ei ≥ ei+ , for j − m ≤ i < k − m,  to n/. Odd-numbered comparison-swaps place the
implying that {e , e , . . . , em } is also bitonic. Also, smaller of (ai , ai+ ) first and the larger of the pair second.
max(d , d , . . . , dm ) = max(dk−m , dk−m+ ) Even-numbered comparison-swaps place the larger of
(ai , ai+ ) first and the smaller of the pair second.
= max(ak−m , ak+ ),
At the end of this first stage, n/ bitonic sequences
and are obtained. Each of these sequences is of length 
and can be sorted using MERGE . These instances of
min(e , e , . . . , em ) = min(ek−m , ek−m+ ) MERGE are numbered from  to n/. Odd-numbered
= min(ak , ak−m+ ). instances sort their inputs in nondecreasing order while
Bitonic Sort B 

even-numbered instances sort their inputs in nonin- bitonic sequences of length  and  are shown in Figs. 
creasing order. This yields n/ bitonic sequences each and , respectively.
of length . Finally, a combinational circuit for sorting an arbi- B
The process continues until a single bitonic sequence trary sequence of numbers, namely, the sequence
of length n is produced and is sorted by giving it as input (, , , , , , , ) is shown in Fig. . Comparators
to MERGEn . If a comparator is used to implement each that reverse their outputs (i.e., those that produce the
comparison-swap, and all independent comparison- larger of their two inputs on the left output line, and the
swaps are allowed to be executed in parallel, then the smaller on the right output line) are indicated with the
sequence is sorted in (( + log n) log n)/ time units (all letter R.
logarithms are to the base ). This is now illustrated.
Analysis
Implementation as a Combinational Circuit The depth of a sorting circuit is the number of rows it
Figure  shows a schematic diagram of a comparator. contains – that is, the maximum number of comparators
The comparator receives two numbers x and y as input, on a path from input to output. Depth, in other words,
and produces the smaller of the two on its left output represents the time it takes the circuit to complete the
line, and the larger of the two on its right output line. sort. The size of a sorting circuit is defined as the number
A combinational circuit for sorting is a device, built of comparators it uses.
entirely of comparators that takes an arbitrary sequence Taking m = i , the depth of the circuit in Fig. ,
at one end and produces that sequence in sorted order which implements algorithm MERGEm for sorting a
at the other end. The comparators are arranged in rows. bitonic sequence of length m, is given by the recur-
Each comparator receives its two inputs from the input rence:
sequence, or from comparators in the previous row, and d() = 
delivers its two outputs to the comparators in the fol-
d(i ) =  + d(i− ),
lowing row, or to the output sequence. A combinational
circuit has no feedback: Data traverse the circuit in the whose solution is d(i ) = i. The size of the circuit is
same way as water flows from high to low terrain. given by the recurrence:
Figure  shows a schematic diagram of a combina- s() = 
tional circuit implementation of algorithm MERGEm ,
which sorts the bitonic sequence (a , a , . . . , am ) into s(i ) = i− + s(i− ),
nondecreasing order from smallest, on the leftmost line, whose solution is s(i ) = ii− .
to largest, on the rightmost. A circuit for sorting an arbitrary sequence of length
Clearly, a bitonic sequence of length  is sorted by n, such as the circuit in Fig.  where n = , consists of
a single comparator. Combinational circuits for sorting log n phases: In the ith phase, n/i circuits are required,
each implementing MERGEi , and having a size of s(i )
and a depth of d(i ), for i = , , . . . , log n. The depth and
x y
size of this circuit are

log n log n
( + log n) log n
∑ d( ) = ∑ i =
i
,
i= i= 
and

log n log n
∑ (log n)−i s(i ) = ∑ (log n)−i ii−
i= i=
n( + log n) log n
min(x,y) max(x,y) = ,

Bitonic Sort. Fig.  A comparator respectively.
 B Bitonic Sort

Bitonic input
a1 a2 a3 am−2 am −1 am am+1 am+2 am +3 a2m−2 a2m−1 a2m

MERGEm MERGEm

c1 c2 c3 cm−2 cm−1 cm cm+1 cm +2 cm+3 c2m−2 c2m−1 c2m


Sorted output

Bitonic Sort. Fig.  A circuit for sorting a bitonic sequence of length  m

a1 a2 a3 a4 A lower bound on the number of comparisons. A


lower bound on the number of comparisons required in
the worst case by a comparison-based sorting algorithm
to sort a sequence of n numbers is derived as follows. All
comparison-based sorting algorithms are modeled by a
binary tree, each of whose nodes represents a compari-
son between two input elements. The leaves of the tree
represent the n! possible outcomes of the sort. A path
from the root of the tree to a leaf is a complete execu-
tion of an algorithm on a given input. The length of the
longest such path is log n!, which is a quantity on the
order of n log n.
Several sequential comparison-based algorithms,
such as Heapsort and Mergesort, for example, achieve
this bound, to within a constant multiplicative factor,
and are therefore said to be optimal. The Bitonic Sort
c1 c2 c3 c4
circuit always runs in time on the order of log n and
Bitonic Sort. Fig.  A circuit for sorting a bitonic sequence is therefore significantly faster than any sequential algo-
of length  rithm. However, the number of comparators it uses, and
therefore the number of comparisons it performs, is on
the order of n log n, and consequently, it is not optimal
in that sense.
Lower Bounds Lower bounds on the depth and size. These lower
In order to evaluate the quality of the Bitonic Sort circuit, bounds are specific to circuits. A lower bound on the
its depth and size are compared to two types of lower depth of a sorting circuit for an input sequence of n
bounds. elements is obtained as follows. Each comparator has
Bitonic Sort B 

a1 a2 a3 a4 a5 a6 a7 a8

c1 c2 c3 c4 c5 c6 c7 c8

Bitonic Sort. Fig.  A circuit for sorting a bitonic sequence of length 

two outputs, implying that an input element can reach bound on the number of comparisons required to sort,
at most r locations after r rows. Since each element in addition to the lower bounds on depth and size spe-
should be able to reach all n output positions, a lower cific to circuits. It is, therefore, theoretically optimal on
bound on the depth of the circuit is on the order of log n. all counts.
A lower bound on the size of a sorting circuit is In practice, however, the AKS circuit may not be
obtained by observing that the circuit must be able to very useful. In addition to its high conceptual com-
produce any of the n! possible output sequences for each plexity, its depth and size expressions are preceded by
input sequence. Since each comparator can be in one of constants on the order of  . Even if the depth of the
two states (swapping or not swapping its two inputs), AKS circuit were as small as  log n, in order for the
the total number of configurations in which a circuit latter to be smaller than the depth of the Bitonic Sort
with c comparators can be is c , and this needs to be circuit, namely, (( + log n) log n)/, the input sequence
at least equal to n!. Thus, a lower bound on the size of a will need to have the astronomical length of n >  ,
sorting circuit is on the order of n log n. which is greater than the number of atoms that the
The Bitonic Sort circuit exceeds each of the lower observable universe is estimated to contain.
bounds on depth and size by a factor of log n, and is
therefore not optimal in that sense as well. Future Directions
There exists a sorting circuit, known as the AKS cir- An interesting open problem in this area of research is to
cuit, which has a depth on the order of log n and a design a sorting circuit with the elegance, regularity, and
size on the order of n log n. This circuit meets the lower simplicity of the Bitonic Sort circuit, while at the same
 B Bitonic Sort

5 3 2 6 1 4 7 5

R R

3 5 6 2 1 4 7 5

R R

3 6 2 5 7 1 5 4

R R

2 3 5 6 7 5 4 1

2 7 3 5 4 5 1 6

2 4 1 3 5 7 5 6

1 2 3 4 5 5 6 7

Bitonic Sort. Fig.  A circuit for sorting an arbitrary sequence of length 

time matching the theoretical lower bounds on depth Permutation Circuits


and size as closely as possible. Sorting

Related Entries Bibliographic Notes and Further


AKS Sorting Network Reading
Bitonic Sorting, Adaptive In , Ken Batcher presented a paper at the AFIPS
Odd-Even Sorting conference, in which he described two circuits for
Bitonic Sort B 

sorting, namely, the Odd-Even Sort and the Bitonic and so on. If the circuit has depth D, then M sequences
Sort circuits []. This paper pioneered the study of can be sorted in D + M −  time units [].
parallel sorting algorithms and effectively launched B
the field of parallel algorithm design and analysis.
Bibliography
Proofs of correctness of the Bitonic Sort algorithm
. Ajtai M, Komlós J, Szemerédi E () An O(n log n) sorting
are provided in [, , ]. Implementations of Bitonic network. In: Proceedings of the ACM symposium on theory of
Sort on other architectures beside combinational cir- computing, Boston, Massachusetts, pp –
cuits have been proposed, including implementations . Ajtai M, Komlós J, Szemerédi E () Sorting in c log n parallel
on a perfect shuffle computer [], a mesh of pro- steps. Combinatorica :–
. Akl SG () Parallel sorting algorithms. Academic Press,
cessors [, ], and on a shared-memory parallel
Orlando, Florida
machine []. . Akl SG () The design and analysis of parallel algorithms.
In its simplest form, Bitonic Sort assumes that n, the Prentice-Hall, Englewood Cliffs, New Jersey
length of the input sequence, is a power of . When . Akl SG () Parallel computation: models and methods. Pren-
n is not a power of , the sequence can be padded tice Hall, Upper Saddle River, New Jersey
with z zeros such that n + z is the smallest power . Batcher KE () Sorting networks and their applications. In:
Proceedings of the AFIPS spring joint computer conference.
of  larger than n. Alternatively, several variants of
Atlantic City, New Jersey, pp –. Reprinted in: Wu CL,
Bitonic Sort were proposed in the literature that are Feng TS (eds) Interconnection networks for parallel and
capable of sorting input sequences of arbitrary length distributed processing. IEEE Computer Society, , pp
[, , ]. –
Combinational circuits for sorting (also known as . Bilardi G, Nicolau A () Adaptive bitonic sorting: An optimal
parallel algorithm for shared-memory machines. SIAM J Comput
sorting networks) are discussed in [–, , –, ].
():–
A lower bound for oblivious merging is derived in [] . Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction
and generalized in [], which demonstrates that the to algorithms, nd edn. MIT Press/McGraw-Hill, Cambridge,
bitonic merger is optimal to within a small constant fac- Massachusetts/New York
tor. The AKS sorting circuit (whose name derives from . Floyd RW () Permuting information in idealized two-
the initials of its three inventors) was first described level storage. In: Miller RE, Thatcher JW (eds) Complex-
ity of computer computations. Plenum Press, New York, pp
in [] and then in []. Unlike the Bitonic Sort cir-
–
cuit that is based on the idea of repeatedly merging . Gibbons A, Rytter W () Efficient parallel algorithms.
increasingly longer subsequences, the AKS circuit sorts Cambridge University Press, Cambridge, England
by repeatedly splitting the original input sequence into . JáJá J () An introduction to parallel algorithms. Addison-
increasingly shorter and disjoint subsequences. While Wesley, Reading, Massachusetts
. Knuth DE () The art of computer programming, vol .
theoretically optimal, the circuit suffers from large mul-
Addison-Wesley, Reading, Massachusetts
tiplicative constants in the expressions for its depth and . Leighton FT () Introduction to parallel algorithms and archi-
size, making it of little use in practice, as mentioned tectures. Morgan Kaufmann, San Mateo, California
earlier. A formulation in [] manages to reduce the . Liszka KJ, Batcher KE () A generalized bitonic sorting net-
constants to a few thousands, still a prohibitive number. work. In: Proceedings of the international conference on parallel
Descriptions of the AKS circuit appear in [, , , ]. processing, vol , pp –
. Nakatani T, Huang ST, Arden BW, Tripathi SK () K-way
Sequential sorting algorithms, including Heapsort and
bitonic sort. IEEE Trans Comput ():–
Mergesort, are covered in [, ]. . Nassimi D, Sahni S () Bitonic sort on a mesh-connected
One property of combinational circuits, not shared parallel computer. IEEE Trans Comput C-():–
by many other parallel models of computation, is their . Parberry I () Parallel complexity theory. Pitman, London
ability to allow several input sequences to be processed . Paterson MS () Improved sorting networks with O(log N)
depth. Algorithmica :–
simultaneously, in a pipeline fashion. This is certainly
. Smith JR () The design and analysis of parallel algorithms.
true of sorting circuits: Once the elements of the first Oxford University Press, Oxford, England
input sequence have traversed the first row and moved . Stone HS () Parallel processing with the perfect shuffle. IEEE
on to the second, a new sequence can enter the first row, Trans Comput C-():–
 B Bitonic Sorting Network

. Thompson CD, Kung HT () Sorting on a mesh-connected ● It can be implemented in a highly parallel manner
parallel computer. Commun ACM ():– on modern architectures, such as a streaming archi-
. Wang BF, Chen GH, Hsu CC () Bitonic sort with and arbitrary
tecture (GPUs), even without any scatter operations,
number of keys. In: Proceedings of the international conference
that is, without random access writes.
on parallel processing. vol , Illinois, pp –
. Yao AC, Yao FF () Lower bounds for merging networks. One of the main differences between “regular” bitonic
J ACM ():–
sorting and adaptive bitonic sorting is that regular
bitonic sorting is data-independent, while adaptive
bitonic sorting is data-dependent (hence the name).
As a consequence, adaptive bitonic sorting cannot
Bitonic Sorting Network be implemented as a sorting network, but only on archi-
tectures that offer some kind of flow control. Nonethe-
Bitonic Sort
less, it is convenient to derive the method of adaptive
bitonic sorting from bitonic sorting.
Sorting networks have a long history in computer
science research (see the comprehensive survey []).
Bitonic Sorting, Adaptive One reason is that sorting networks are a convenient
way to describe parallel sorting algorithms on CREW-
Gabriel Zachmann
PRAMs or even EREW-PRAMs (which is also called
Clausthal University, Clausthal-Zellerfeld, Germany
PRAC for “parallel random access computer”).
In the following, let n denote the number of keys
Definition to be sorted, and p the number of processors. For the
Adaptive bitonic sorting is a sorting algorithm suitable sake of clarity, n will always be assumed to be a power
for implementation on EREW parallel architectures. of . (In their original paper [], Bilardi and Nicolau
Similar to bitonic sorting, it is based on merging, which have described how to modify the algorithms such that
is recursively applied to obtain a sorted sequence. In they can handle arbitrary numbers of keys, but these
contrast to bitonic sorting, it is data-dependent. Adap- technical details will be omitted in this article.)
tive bitonic merging can be performed in O( np ) parallel The first to present a sorting network with optimal
time, p being the number of processors, and executes asymptotic complexity were Ajtai, Komlós, and Sze-
only O(n) operations in total. Consequently, adaptive merédi []. Also, Cole [] presented an optimal parallel
n log n
bitonic sorting can be performed in O( p ) time, merge sort approach for the CREW-PRAM as well as
which is optimal. So, one of its advantages is that it exe- for the EREW-PRAM. However, it has been shown that
cutes a factor of O(log n) less operations than bitonic neither is fast in practice for reasonable numbers of keys
sorting. Another advantage is that it can be imple- [, ].
mented efficiently on modern GPUs. In contrast, adaptive bitonic sorting requires less
than n log n comparisons in total, independent of the
number of processors. On p processors, it can be imple-
Discussion n log n
mented in O( p ) time, for p ≤ logn n .
Introduction Even with a small number of processors it is effi-
This chapter describes a parallel sorting algorithm, cient in practice: in its original implementation, the
adaptive bitonic sorting [], that offers the following sequential version of the algorithm was at most by a
benefits: factor . slower than quicksort (for sequence lengths
up to  ) [].
● It needs only the optimal total number of compar-
ison/exchange operations, O(n log n).
● The hidden constant in the asymptotic number of Fundamental Properties
operations is less than in other optimal parallel sort- One of the fundamental concepts in this context is the
ing methods. notion of a bitonic sequence.
Bitonic Sorting, Adaptive B 

Definition  (Bitonic sequence) Let a = (a , . . . , an− ) sense) of bitonic sequences; it is convenient to define the
be a sequence of numbers. Then, a is bitonic, iff it mono- following rotation operator.
tonically increases and then monotonically decreases, Definition  (Rotation) Let a = (a , . . . , an− ) and B
or if it can be cyclically shifted (i.e., rotated) to j ∈ N. We define a rotation as an operator Rj on the
become monotonically increasing and then monoton- sequence a:
ically decreasing.
Figure  shows some examples of bitonic sequences. Rj a = (aj , aj+ , . . . , aj+n− )
In the following, it will be easier to understand This operation is performed by the network shown
any reasoning about bitonic sequences, if one consid- in Fig. . Such networks are comprised of elementary
ers them as being arranged in a circle or on a cylinder: comparators (see Fig. ).
then, there are only two inflection points around the cir- Two other operators are convenient to describe
cle. This is justified by Definition . Figure  depicts an sorting.
example in this manner.
Definition  (Half-cleaner) Let a = (a , . . . , an− ).
As a consequence, all index arithmetic is understood
modulo n, that is, index i + k ≡ i + k mod n, unless La = (min(a , a n ) , . . . , min(a n − , an− )) ,
otherwise noted, so indices range from  through n − . Ua = (max(a , a n ) , . . . , max(a n − , an− )) .
As mentioned above, adaptive bitonic sorting can be
regarded as a variant of bitonic sorting, which is in order In [], a network that performs these operations
to capture the notion of “rotational invariance” (in some together is called a half-cleaner (see Fig. ).

i i i
1 n 1 n 1 n

Bitonic Sorting, Adaptive. Fig.  Three examples of sequences that are bitonic. Obviously, the mirrored sequences
(either way) are bitonic, too

Bitonic Sorting, Adaptive. Fig.  Left: according to their definition, bitonic sequences can be regarded as lying on a
cylinder or as being arranged in a circle. As such, they consist of one monotonically increasing and one decreasing part.
Middle: in this point of view, the network that performs the L and U operators (see Fig. ) can be visualized as a wheel of
“spokes.” Right: visualization of the effect of the L and U operators; the blue plane represents the median
 B Bitonic Sorting, Adaptive

a min(a,b) LR n a, which can be verified trivially. In the latter case,


Eq.  becomes
b max(a,b)
LRj a = (min(aj , aj+ n ) , . . . , min(a n − , an− ) , . . . ,
min(aj− , aj−+ n ))
a max(a,b)
= Rj La.
Thus, with the cylinder metaphor, the L and U oper-
b min(a,b)
ators basically do the following: cut the cylinder with
Bitonic Sorting, Adaptive. Fig.  Comparator/exchange circumference n at any point, roll it around a cylinder
elements with circumference n , and perform position-wise the
max and min operator, respectively. Some examples are
shown in Fig. .
The following theorem states some important prop-
erties of the L and U operators.
Theorem  Given a bitonic sequence a,
max{La} ≤ min{Ua} .
Moreover, La and Ua are bitonic too.
In other words, each element of La is less than or
Bitonic Sorting, Adaptive. Fig.  A network that performs equal to each element of Ua.
the rotation operator This theorem is the basis for the construction of the
bitonic sorter []. The first step is to devise a bitonic
merger (BM). We denote a BM that takes as input
bitonic sequences of length n with BMn . A BM is recur-
sively defined as follows:
BMn (a) = ( BM n (La), BM n (Ua) ) .
The base case is, of course, a two-key sequence, which
is handled by a single comparator. A BM can be easily
represented in a network as shown in Fig. .
Given a bitonic sequence a of length n, one can show
Bitonic Sorting, Adaptive. Fig.  A network that performs that
the L and U operators BMn (a) = Sorted(a). ()

It should be obvious that the sorting direction can be


changed simply by swapping the direction of the ele-
It is easy to see that, for any j and a, mentary comparators.
Coming back to the metaphor of the cylinder, the
La = R−j mod n LRj a, () first stage of the bitonic merger in Fig.  can be visual-
ized as n comparators, each one connecting an element
and of the cylinder with the opposite one, somewhat like
spokes in a wheel. Note that here, while the cylinder can
Ua = R−j mod n URj a. () rotate freely, the “spokes” must remain fixed.
From a bitonic merger, it is straightforward to derive
This is the reason why the cylinder metaphor is valid. a bitonic sorter, BSn , that takes an unsorted sequence,
The proof needs to consider only two cases: j = n and produces a sorted sequence either up or down.
and  ≤ j < n . In the former case, Eq.  becomes La = Like the BM, it is defined recursively, consisting of two
Bitonic Sorting, Adaptive B 

Ua
Ua
a
La a
B
La
i i
1 n/2 1 n/2 n

Bitonic Sorting, Adaptive. Fig.  Examples of the result of the L and U operators. Conceptually, these operators fold the
bitonic sequence (black), such that the part from indices n +  through n (light gray) is shifted into the range  through n
(black); then, L and U yield the upper (medium gray) and lower (dark gray) hull, respectively

BM(n)

La BM(n/2)

Sorted
n/2-1
Bitonic

n/2

Ua BM(n/2)

n–1

1 stage

Bitonic Sorting, Adaptive. Fig.  Schematic, recursive diagram of a network that performs bitonic merging

smaller bitonic sorters and a bitonic merger (see Fig. ). Clearly, there is some redundancy in such a net-
Again, the base case is the two-key sequence. work, since n comparisons are sufficient to merge two
sorted sequences. The reason is that the comparisons
Analysis of the Number of Operations of performed by the bitonic merger are data-independent.
Bitonic Sorting
Since a bitonic sorter basically consists of a number of
bitonic mergers, it suffices to look at the total number of Derivation of Adaptive Bitonic Merging
comparisons of the latter. The algorithm for adaptive bitonic sorting is based on
The total number of comparators, C(n), in the the following theorem.
bitonic merger BMn is given by: Theorem  Let a be a bitonic sequence. Then, there is
an index q such that
n n
C(n) = C( ) + , with C() = ,
 
which amounts to La = (aq , . . . , aq+ n − ) ()
 Ua = (aq+ n , . . . , aq− ) ()
C(n) = n log n.

As a consequence, the bitonic sorter consists of (Remember that index arithmetic is always mod-
O(n log n) comparators. ulo n.)
 B Bitonic Sorting, Adaptive

BS(n)

Sorted
BS(n/2)

Sorted
Bitonic
Unsorted n/2-1
BM(n)
n/2

Sorted
BS(n/2)

n-1

Bitonic Sorting, Adaptive. Fig.  Schematic, recursive diagram of a bitonic sorting network

the median, each half must have length n . The indices


where the cut happens are q and q + n . Figure  shows
m an example (in one dimension).
The following theorem is the final keystone for the
adaptive bitonic sorting algorithm.
Theorem  Any bitonic sequence a can be partitioned
0 q+n/2 q n-1 into four subsequences (a , a , a , a ) such that either

U L (La, Ua) = (a , a , a , a ) ()


Bitonic Sorting, Adaptive. Fig.  Visualization for the or
proof of Theorem  (La, Ua) = (a , a , a , a ). ()
Furthermore,
n
The following outline of the proof assumes, for the ∣a ∣ + ∣a ∣ = ∣a ∣ + ∣a ∣ = , ()

sake of simplicity, that all elements in a are distinct. Let
m be the median of all ai , that is, n elements of a are less
∣a ∣ = ∣a ∣ , ()
than or equal to m, and n elements are larger. Because
of Theorem , and
∣a ∣ = ∣a ∣ , ()
max{La} ≤ m < min{Ua} .
where ∣a∣ denotes the length of sequence a.
Employing the cylinder metaphor again, the median Figure  illustrates this theorem by an example.
m can be visualized as a horizontal plane z = m that This theorem can be proven fairly easily too: the
cuts the cylinder. Since a is bitonic, this plane cuts the length of the subsequences is just q and n − q, where q is
sequence in exactly two places, that is, it partitions the the same as in Theorem . Assuming that max{a } <
sequence into two contiguous halves (actually, any hor- m < min{a }, nothing will change between those
izontal plane, i.e., any percentile partitions a bitonic two subsequences (see Fig. ). However, in that case
sequence in two contiguous halves), and since it is min{a } > m > max{a }; therefore, by swapping
Bitonic Sorting, Adaptive B 

a
m
B
a 0 n/2 n–1 0 q n/2 q + n/2 n – 1

b a1 a2 a3 a4

Ua

m m

La
0 q n/2 0 q n/2 q + n/2 n – 1
d
a1 a4 a3 a2

c La Ua

Bitonic Sorting, Adaptive. Fig.  Example illustrating Theorem 

a and a (which have equal length), the bounds with C() = , C() =  and n = k . This amounts
max{(a , a )} < m < min{a , a )} are obtained. The to
other case can be handled analogously.
Remember that there are n comparator-and- C(n) = n − log n − .
exchange elements, each of which compares ai and
The only question that remains is how to achieve the
ai+ n . They will perform exactly this exchange of sub-
data rearrangement, that is, the swapping of the subse-
sequences, without ever looking at the data.
quences a and a or a and a , respectively, without
Now, the idea of adaptive bitonic sorting is to find
sacrificing the worst-case performance of O(n). This
the subsequences, that is, to find the index q that marks
can be done by storing the keys in a perfectly balanced
the border between the subsequences. Once q is found,
tree (assuming n = k ), the so-called bitonic tree. (The
one can (conceptually) swap the subsequences, instead
tree can, of course, store only k −  keys, so the n-th
of performing n comparisons unconditionally.
Finding q can be done simply by binary search key is simply stored separately. ) This tree is very similar
to a search tree, which stores a monotonically increas-
driven by comparisons of the form (ai , ai+ n ).
ing sequence: when traversed in-order, the bitonic tree
Overall, instead of performing n comparisons in the
produces a sequence that lists the keys such that there
first stage of the bitonic merger (see Fig. ), the adaptive
are exactly two inflection points (when regarded as a
bitonic merger performs log( n ) comparisons in its first
circular list).
stage (although this stage is no longer representable by
Instead of actually copying elements of the sequence
a network).
in order to achieve the exchange of subsequences, the
Let C(n) be the total number of comparisons per-
adaptive bitonic merging algorithm swaps O(log n)
formed by adaptive bitonic merging, in the worst case.
pointers in the bitonic tree. The recursion then works on
Then
the two subtrees. With this technique, the overall num-
k− ber of operations of adaptive bitonic merging is O(n).
n n
C(n) = C( ) + log(n) = ∑ i log( i ) ,
 i=  Details can be found in [].
 B Bitonic Sorting, Adaptive

Clearly, the adaptive bitonic sorting algorithm needs A GPU Implementation


O(n log n) operations in total, because it consists of Because adaptive bitonic sorting has excellent scalabil-
log(n) many complete merge stages (see Fig. ). ity (the number of processors, p, can go up to n/ log(n))
It should also be fairly obvious that the adaptive and the amount of inter-process communication is
bitonic sorter performs an (adaptive) subset of the fairly low (only O(p)), it is perfectly suitable for imple-
comparisons that are executed by the (nonadaptive) mentation on stream processing architectures. In addi-
bitonic sorter. tion, although it was designed for a random access
architecture, adaptive bitonic sorting can be adapted
to a stream processor, which (in general) does not
have the ability of random-access writes. Finally, it can
The Parallel Algorithm
be implemented on a GPU such that there are only
So far, the discussion assumed a sequential implemen-
O(log (n)) passes (by utilizing O(n/ log(n)) (concep-
tation. Obviously, the algorithm for adaptive bitonic
tual) processors), which is very important, since the
merging can be implemented on a parallel architecture,
just like the bitonic merger, by executing recursive calls
on the same level in parallel.
Unfortunately, a naïve implementation would
require O(log n) steps in the worst case, since there Algorithm : Adaptive construction of La and Ua
are log(n) levels. The bitonic merger achieves O(log n) (one stage of adaptive bitonic merging)
parallel time, because all pairwise comparisons within input : Bitonic tree, with root node r and extra
one stage can be performed in parallel. But this is not node e, representing bitonic sequence a
straightforward to achieve for the log(n) comparisons output: La in the left subtree of r plus root r, and Ua
of the binary-search method in adaptive bitonic merg- in the right subtree of r plus extra node e
ing, which are inherently sequential. // phase : determine case
However, a careful analysis of the data dependencies
if value(r) < value(e) then
between comparisons of successive stages reveals that case = 
the execution of different stages can be partially over-
else
lapped []. As La, Ua are being constructed in one stage case = 
by moving down the tree in parallel layer by layer (occa-
swap value(r) and value(e)
sionally swapping pointers); this process can be started
( p, q ) = ( left(r) , right(r) )
for the next stage, which begins one layer beneath the
for i = , . . . , log n −  do
one where the previous stage began, before the first stage
// phase i
has finished, provided the first stage has progressed “far
enough” in the tree. Here, “far enough” means exactly test = ( value(p) > value(q) )
two layers ahead. if test == true then
This leads to a parallel version of the adaptive bitonic swap values of p and q
merge algorithm that executes in time O( np ) for p ∈ if case ==  then
swap the pointers left(p) and
O( logn n ), that is, it can be executed in (log n) parallel
left(q)
time.
else
Furthermore, the data that needs to be communi-
swap the pointers right(p) and
cated between processors (either via memory, or via
right(q)
communication channels) is in O(p).
if ( case ==  and test == false ) or ( case ==
It is straightforward to apply the classical sorting-
by-merging approach here (see Fig. ), which yields the  and test == true ) then
( p, q ) = ( left(p) , left(q) )
adaptive bitonic sorting algorithm. This can be imple-
mented on an EREW machine with p processors in else
n log n ( p, q ) = ( right(p) , right(q) )
O( p ) time, for p ∈ O( logn n ).
Bitonic Sorting, Adaptive B 

Algorithm : Merging a bitonic sequence to obtain a Algorithm : Simplified adaptive construction of La


sorted sequence and Ua
input : Bitonic tree, with root node r and extra node input : Bitonic tree, with root node r and extra node B
e, representing bitonic sequence a e, representing bitonic sequence a
output: Sorted tree (produces sort(a) when output: La in the left subtree of r plus root r, and Ua
traversed in-order) in the right subtree of r plus extra node e
construct La and Ua in the bitonic tree by // phase 
Algorithm  if value(r) > value(e) then
call merging recursively with left(r) as root and r swap value(r) and value(e)
as extra node swap pointers left(r) and right(r)
call merging recursively with right(r) as root and ( p, q ) = ( left(r) , right(r) )
e as extra node for i = , . . . , log n −  do
// phase i
if value(p) > value(q) then
swap value(p) and value(q)
number of passes is one of the main limiting factors on swap pointers left(p) and left(q)
GPUs. ( p, q ) = ( right(p) , right(q) )
This section provides more details on the imple-
else
mentation on a GPU, called “GPU-ABiSort” [, ]. ( p, q ) = ( left(p) , left(q) )
For the sake of simplicity, the following always assumes
increasing sorting direction, and it is thus not explicitely
specified. As noted above, the sorting direction must
be reversed in the right branch of the recursion in the simplified construction of La and Ua is presented in
bitonic sorter, which basically amounts to reversing the Algorithm . (Obviously, the simplified algorithm now
comparison direction of the values of the keys, that is, really needs trees with pointers, whereas Bilardi’s orig-
compare for < instead of > in Algorithm . inal bitonic tree could be implemented pointer-less
As noted above, the bitonic tree stores the sequence (since it is a complete tree). However, in a real-world
(a , . . . , an− ) in in-order, and the key an− is stored in implementation, the keys to be sorted must carry point-
the extra node. As mentioned above, an algorithm that ers to some “payload” data anyway, so the additional
constructs (La, Ua) from a can traverse this bitonic tree memory overhead incurred by the child pointers is at
and swap pointers as necessary. The index q, which is most a factor ..)
mentioned in the proof for Theorem , is only deter-
mined implicitly. The two different cases that are men-
tioned in Theorem  and Eqs.  and  can be distin- Outline of the Implementation
guished simply by comparing elements a n − and an− . As explained above, on each recursion level j =
This leads to Algorithm . Note that the root of , . . . , log(n) of the adaptive bitonic sorting algorithm,
the bitonic tree stores element a n − and the extra log n−j+ bitonic trees, each consisting of j− nodes,
node stores an− . Applying this recursively yields Algo- have to be merged into log n−j bitonic trees of j nodes.
rithm . Note that the bitonic tree needs to be con- The merge is performed in j stages. In each stage k =
structed only once at the beginning during setup time. , . . . , j − , the construction of La and Ua is executed
Because branches are very costly on GPUs, one on k subtrees. Therefore, log n−j ⋅k instances of the La
should avoid as many conditionals in the inner loops / Ua construction algorithm can be executed in par-
as possible. Here, one can exploit the fact that Rn/ a = allel during that stage. On a stream architecture, this
(a n , . . . , an− , a , . . . , a n − ) is bitonic, provided a is potential parallelism can be exposed by allocating a
bitonic too. This operation basically amounts to swap- stream consisting of log n−j+k elements and executing
ping the two pointers left(root) and right(root). The a so-called kernel on each element.
 B Bitonic Sorting, Adaptive

The La / Ua construction algorithm consists of j - k phases, where each phase reads and modifies a pair of nodes, (p, q), of a bitonic tree. Assume that a kernel implementation performs the operation of a single phase of this algorithm. (How such a kernel implementation is realized without random-access writes will be described below.) The temporary data that have to be preserved from one phase of the algorithm to the next one are just two node pointers (p and q) per kernel instance. Thus, each of the 2^(log n - j + k) elements of the allocated stream consists of exactly these two node pointers. When the kernel is invoked on that stream, each kernel instance reads a pair of node pointers, (p, q), from the stream, performs one phase of the La/Ua construction algorithm, and finally writes the updated pair of node pointers (p, q) back to the stream.

Eliminating Random-Access Writes
Since GPUs do not support random-access writes (at least, for almost all practical purposes, random-access writes would kill any performance gained by the parallelism), the kernel has to be implemented so that it modifies node pairs (p, q) of the bitonic tree without random-access writes. This means that it can output node pairs from the kernel only via linear stream write. But this way it cannot write a modified node pair to its original location from where it was read. In addition, it cannot simply take an input stream (containing a bitonic tree) and produce another output stream (containing the modified bitonic tree), because then it would have to process the nodes in the same order as they are stored in memory, but the adaptive bitonic merge processes them in a random, data-dependent order.
Fortunately, the bitonic tree is a linked data structure where all nodes are directly or indirectly linked to the root (except for the extra node). This allows us to change the location of nodes in memory during the merge algorithm as long as the child pointers of their respective parent nodes are updated (and the root and extra node of the bitonic tree are kept at well-defined memory locations). This means that for each node that is modified its parent node has to be modified also, in order to update its child pointers.
Notice that Algorithm  basically traverses the bitonic tree down along a path, changing some of the nodes as necessary. The strategy is simple: simply output every node visited along this path to a stream. Since the data layout is fixed and predetermined, the kernel can store the index of the children with the node as it is being written to the output stream. One child address remains the same anyway, while the other is determined when the kernel is still executing for the current node. Figure  demonstrates the operation of the stream program using the described stream output technique.

Complexity
A simple implementation on the GPU would need O(log³ n) phases (or "passes" in GPU parlance) in total for adaptive bitonic sorting, which amounts to O(log³ n) stream operations in total.
This is already very fast in practice. However, the optimal complexity of O(log² n) passes can be achieved exactly as described in the original work [], that is, phase i of a stage k can be executed immediately after phase i + 1 of stage k - 1 has finished. Therefore, the execution of a new stage can start at every other step of the algorithm.
The only difference from the simple implementation is that kernels now must write to parts of the output stream, because other parts are still in use.

GPU-Specific Details
For the input and output streams, it is best to apply the ping-pong technique commonly used in GPU programming: allocate two such streams and alternatingly use one of them as input and the other one as output stream.

Preconditioning the Input
For merge-based sorting on a PRAM architecture (and assuming p < n), it is a common technique to sort locally, in a first step, p blocks of n/p values, that is, each processor sorts n/p values using a standard sequential algorithm.
The same technique can be applied here by implementing such a local sort as a kernel program. However, since there is no random write access to non-temporary memory from a kernel, the number of values that can be sorted locally by a kernel is restricted by the number of temporary registers.
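The following C sketch illustrates the pass structure and the ping-pong scheme described under "GPU-Specific Details" above. It is our own illustration under simple assumptions, not code from the GPU-ABiSort implementation; the names (pair_t, run_phase_on_element, run_passes) are hypothetical, and the inner loop stands in for the parallel kernel launch on a GPU.

#include <stddef.h>

/* One stream element: the two node pointers (here, indices) carried between phases. */
typedef struct { int p, q; } pair_t;

/* Placeholder for one phase of the La/Ua construction applied to one element. */
static pair_t run_phase_on_element(pair_t in) {
    /* ... read nodes p and q, compare, update child indices, output visited nodes ... */
    return in;
}

/* Execute num_passes phases over a stream of len elements using ping-pong buffers. */
void run_passes(pair_t *bufA, pair_t *bufB, size_t len, int num_passes) {
    pair_t *in = bufA, *out = bufB;
    for (int pass = 0; pass < num_passes; ++pass) {
        for (size_t i = 0; i < len; ++i)        /* on a GPU this loop is one kernel invocation */
            out[i] = run_phase_on_element(in[i]);
        pair_t *tmp = in; in = out; out = tmp;  /* swap roles: last output becomes next input */
    }
}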
Bitonic Sorting, Adaptive. Fig.  To execute several instances of the adaptive La/Ua construction algorithm in parallel, where each instance operates on a bitonic tree of  nodes, three phases are required. This figure illustrates the operation of these three phases. On the left, the node pointers contained in the input stream are shown as well as the comparisons performed by the kernel program. On the right, the node pointers written to the output stream are shown as well as the modifications of the child pointers and node values performed by the kernel program according to Algorithm 

On recent GPUs, the maximum output data size of a kernel is  ×  bytes. Since usually the input consists of key/pointer pairs, the method starts with a local sort of  key/pointer pairs per kernel. For such small numbers of keys, an algorithm with asymptotic complexity of O(n²) performs faster than asymptotically optimal algorithms.
After the local sort, a further stream operation converts the resulting sorted subsequences of length  pairwise to bitonic trees, each containing  nodes. Thereafter, the GPU-ABiSort approach can be applied as described above, starting with j = .
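To illustrate why a simple O(n²) method is adequate for the small per-kernel blocks mentioned above, here is a minimal insertion sort over a fixed-size block of key/pointer pairs. This is a generic C sketch of the idea, not the actual kernel code; the type and function names are ours.

/* A key/pointer pair as used throughout this entry. */
typedef struct { float key; unsigned int ptr; } kv_t;

/* Sort a small block in place by key. For a handful of elements, this
 * O(n^2) insertion sort typically beats asymptotically optimal algorithms
 * because of its trivial control flow and register-friendly data access. */
void local_sort(kv_t *block, int n) {
    for (int i = 1; i < n; ++i) {
        kv_t x = block[i];
        int j = i - 1;
        while (j >= 0 && block[j].key > x.key) {
            block[j + 1] = block[j];
            --j;
        }
        block[j + 1] = x;
    }
}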
n           CPU sort      GPUSort    GPU-ABiSort
32,768      9–11 ms       4 ms       5 ms
65,536      19–24 ms      8 ms       8 ms
131,072     46–52 ms      18 ms      16 ms
262,144     98–109 ms     38 ms      31 ms
524,288     203–226 ms    80 ms      65 ms
1,048,576   418–477 ms    173 ms     135 ms

Bitonic Sorting, Adaptive. Fig.  Timings on a GeForce  system; the figure plots the running time (in ms) against the sequence length. (There are two curves for the CPU sort, so as to visualize that its running time is somewhat data-dependent)

The Last Stage of Each Merge
Adaptive bitonic merging, being a recursive procedure, eventually merges small subsequences, for instance of length . For such small subsequences it is better to use a (nonadaptive) bitonic merge implementation that can be executed in a single pass of the whole stream.

Timings
The following experiments were done on arrays consisting of key/pointer pairs, where the key is a uniformly distributed random -bit floating point value and the pointer a -byte address. Since one can assume (without loss of generality) that all pointers in the given array are unique, these can be used as secondary sort keys for the adaptive bitonic merge.
The experiments described in the following compare the implementation of GPU-ABiSort of [, ] with sorting on the CPU using the C++ STL sort function (an optimized quicksort implementation) as well as with the (nonadaptive) bitonic sorting network implementation on the GPU by Govindaraju et al., called GPUSort []. Contrary to the CPU STL sort, the timings of GPU-ABiSort do not depend very much on the data to be sorted, because the total number of comparisons performed by the adaptive bitonic sorting is not data-dependent.
Figure  shows the results of timings performed on a PCI Express bus PC system with an AMD Athlon + CPU and an NVIDIA GeForce  GTX GPU with  MB memory. Obviously, the speedup of GPU-ABiSort compared to CPU sorting is .–. for n ≥ . Furthermore, up to the maximum tested sequence length n = 2^20 (= 1,048,576), GPU-ABiSort is up to . times faster than GPUSort, and this speedup is increasing with the sequence length n, as expected.
The timings of the GPU approaches assume that the input data is already stored in GPU memory. When embedding the GPU-based sorting into an otherwise purely CPU-based application, the input data has to be transferred from CPU to GPU memory, and afterwards the output data has to be transferred back to CPU memory. However, the overhead of this transfer is usually negligible compared to the achieved sorting speedup: according to measurements by [], the transfer of one million key/pointer pairs from CPU to GPU and back takes in total roughly  ms on a PCI Express bus PC.

Conclusion
Adaptive bitonic sorting is not only appealing from a theoretical point of view, but also from a practical one. Unlike other parallel sorting algorithms that exhibit optimal asymptotic complexity too, adaptive bitonic sorting offers low hidden constants in its asymptotic complexity and can be implemented on parallel architectures by a reasonably experienced programmer. The practical implementation of it on a GPU outperforms the implementation of simple bitonic sorting on the same GPU by a factor of ., and it is a factor of  faster than a standard CPU sorting implementation (STL).

Related Entries
AKS Network
Bitonic Sort
Non-Blocking Algorithms
Metrics

Bibliographic Notes and Further Reading
As mentioned in the introduction, this line of research began with the seminal work of Batcher [] in the late 1960s, who described parallel sorting as a network.
Research of parallel sorting algorithms was reinvigorated in the s, when a number of theoretical questions were settled [, , , , , ].
Another wave of research on parallel sorting ensued from the advent of affordable, massively parallel architectures, namely, GPUs, which are, more precisely, streaming architectures. This spurred the development of a number of practical implementations [, –, , , ].

Bibliography
. Ajtai M, Komlós J, Szemerédi J () An O(n log n) sorting network. In: Proceedings of the fifteenth annual ACM symposium on theory of computing (STOC ’), New York, NY, pp –
. Akl SG () Parallel sorting algorithms. Academic, Orlando, FL
. Azar Y, Vishkin U () Tight comparison bounds on the complexity of parallel sorting. SIAM J Comput ():–
. Batcher KE () Sorting networks and their applications. In: Proceedings of the  Spring joint computer conference (SJCC), Atlanta City, NJ, vol , pp –
. Bilardi G, Nicolau A () Adaptive bitonic sorting: an optimal parallel algorithm for shared-memory machines. SIAM J Comput ():–
. Cole R () Parallel merge sort. SIAM J Comput ():–. See correction in SIAM J Comput , 
. Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduction to algorithms, rd edn. MIT Press, Cambridge, MA
. Gibbons A, Rytter W () Efficient parallel algorithms. Cambridge University Press, Cambridge, England
. Govindaraju NK, Gray J, Kumar R, Manocha D () GPUTeraSort: high performance graphics coprocessor sorting for large database management. Technical Report MSR-TR--, Microsoft Research (MSR), December . In: Proceedings of ACM SIGMOD conference, Chicago, IL
. Govindaraju NK, Raghuvanshi N, Henson M, Manocha D () A cache-efficient sorting algorithm for database and data mining computations using graphics processors. Technical report, University of North Carolina, Chapel Hill
. Greß A, Zachmann G () GPU-ABiSort: optimal parallel sorting on stream architectures. In: Proceedings of the th IEEE international parallel and distributed processing symposium (IPDPS), Rhodes Island, Greece, p 
. Greß A, Zachmann G () GPU-ABiSort: optimal parallel sorting on stream architectures. Technical Report IfI--, TU Clausthal, Computer Science Department, Clausthal-Zellerfeld, Germany
. Kipfer P, Westermann R () Improved GPU sorting. In: Pharr M (ed) GPU Gems : programming techniques for high-performance graphics and general-purpose computation. Addison-Wesley, Reading, MA, pp –
. Leighton T () Tight bounds on the complexity of parallel sorting. In: STOC ’: Proceedings of the sixteenth annual ACM symposium on theory of computing, ACM, New York, NY, USA, pp –
. Natvig L () Logarithmic time cost optimal parallel sorting is not yet fast in practice! In: Proceedings supercomputing ’, New York, NY, pp –
. Purcell TJ, Donner C, Cammarano M, Jensen HW, Hanrahan P () Photon mapping on programmable graphics hardware. In: Proceedings of the  annual ACM SIGGRAPH/Eurographics conference on graphics hardware (EGGH ’), ACM, New York, pp –
. Satish N, Harris M, Garland M () Designing efficient sorting algorithms for manycore GPUs. In: Proceedings of the  IEEE international symposium on parallel and distributed processing (IPDPS), IEEE Computer Society, Washington, DC, USA, pp –
. Schnorr CP, Shamir A () An optimal sorting algorithm for mesh connected computers. In: Proceedings of the eighteenth annual ACM symposium on theory of computing (STOC), ACM, New York, NY, USA, pp –
. Sintorn E, Assarsson U () Fast parallel GPU-sorting using a hybrid algorithm. J Parallel Distrib Comput ():–

BLAS (Basic Linear Algebra Subprograms)

Robert van de Geijn, Kazushige Goto
The University of Texas at Austin, Austin, TX, USA

Definition
The Basic Linear Algebra Subprograms (BLAS) are an interface to commonly used fundamental linear algebra operations.

Discussion
Introduction
The BLAS interface supports portable high-performance implementation of applications that are matrix and vector computation intensive. The library or application developer focuses on casting computation in terms of the operations supported by the BLAS, leaving the architecture-specific optimization of that software layer to an expert.
A Motivating Example
The use of the BLAS interface will be illustrated by considering the Cholesky factorization of an n × n matrix A. When A is Symmetric Positive Definite (a property that guarantees that the algorithm completes), its Cholesky factorization is given by the lower triangular matrix L such that A = LL^T.
An algorithm for this operation can be derived as follows: Partition

     ⎛ α11   ⋆  ⎞             ⎛ λ11   0  ⎞
A →  ⎝ a21  A22 ⎠   and   L → ⎝ l21  L22 ⎠ ,

where α11 and λ11 are scalars, a21 and l21 are vectors, A22 is symmetric, L22 is lower triangular, and the ⋆ indicates the symmetric part of A that is not used. Then

⎛ α11   ⋆  ⎞   ⎛ λ11   0  ⎞ ⎛ λ11   0  ⎞^T   ⎛ λ11²                  ⋆                ⎞
⎝ a21  A22 ⎠ = ⎝ l21  L22 ⎠ ⎝ l21  L22 ⎠    = ⎝ λ11 l21   l21 l21^T + L22 L22^T        ⎠ .

This yields the following algorithm for overwriting A with L:
● α11 ← √α11.
● a21 ← a21/α11.
● A22 ← −a21 a21^T + A22, updating only the lower triangular part of A22. (This is called a symmetric rank-1 update.)
● Continue by overwriting A22 with L22 where A22 = L22 L22^T.

A simple code in Fortran is given by

do j=1, n
  A( j,j ) = sqrt( A( j,j ) )
  do i=j+1,n
    A( i,j ) = A( i,j ) / A( j,j )
  enddo
  do k=j+1,n
    do i=k,n
      A( i,k ) = A( i,k ) - A( i,j ) * A( k,j )
    enddo
  enddo
enddo

Vector–Vector Operations (Level-1 BLAS)
The first BLAS interface was proposed in the 1970s when vector supercomputers were widely used for computational science. Such computers could achieve near-peak performance as long as the bulk of computation was cast in terms of vector operations and memory was accessed mostly contiguously. This interface is now referred to as the Level-1 BLAS.
Let x and y be vectors of appropriate length and α be a scalar. Commonly encountered vector operations are multiplication of a vector by a scalar (x ← αx), inner (dot) product (α ← x^T y), and scaled vector addition (y ← αx + y). This last operation is known as an axpy: alpha times x plus y.
The Cholesky factorization, coded in terms of such operations, is given by

do j=1, n
  A( j,j ) = sqrt( A( j,j ) )
  call dscal( n-j, 1.0d00/A( j,j ), A( j+1, j ), 1 )
  do k=j+1,n
    call daxpy( n-k+1, -A( k,j ), A( k,j ), 1, A( k, k ), 1 )
  enddo
enddo

Here
● The first letter in dscal and daxpy indicates that the computation is with double precision numbers.
● The call to dscal performs the computation a21 ← a21/α11.
● The loop

  do i=k,n
    A( i,k ) = A( i,k ) - A( i,j ) * A( k,j )
  enddo

  is replaced by the call

  call daxpy( n-k+1, -A( k,j ), A( k,j ), 1, A( k, k ), 1 )

If the operations supported by dscal and daxpy achieve high performance on a target architecture then so will the implementation of the Cholesky factorization, since it casts most computation in terms of those operations.
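For readers using the C interface to the BLAS (CBLAS, discussed later in this entry), the same Level-1 formulation might look as follows. This is a minimal sketch of ours, assuming a CBLAS installation that provides cblas_dscal and cblas_daxpy; the function name chol_level1 and the column-major, lower-triangle conventions are our own choices.

#include <math.h>
#include <cblas.h>

/* Unblocked Cholesky factorization of the lower triangle of an n-by-n
 * column-major matrix A with leading dimension lda, cast in terms of
 * Level-1 BLAS (dscal and daxpy) through the CBLAS interface. */
void chol_level1(int n, double *A, int lda)
{
    for (int j = 0; j < n; ++j) {
        A[j + j * lda] = sqrt(A[j + j * lda]);          /* alpha11 <- sqrt(alpha11) */
        cblas_dscal(n - j - 1, 1.0 / A[j + j * lda],
                    &A[(j + 1) + j * lda], 1);          /* a21 <- a21 / alpha11     */
        for (int k = j + 1; k < n; ++k)                 /* A22 <- A22 - a21 a21^T   */
            cblas_daxpy(n - k, -A[k + j * lda],
                        &A[k + j * lda], 1,
                        &A[k + k * lda], 1);
    }
}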
A representative calling sequence for a Level-1 BLAS routine is given by

_axpy( n, alpha, x, incx, y, incy )

which implements the operation y = αx + y. Here
● The "_" indicates the data type. The choices for this first letter are
  s   single precision
  d   double precision
  c   single precision complex
  z   double precision complex
● The operation is identified as axpy: alpha times x plus y.
● n indicates the number of elements in the vectors x and y.
● alpha is the scalar α.
● x and y indicate the memory locations where the first elements of x and y are stored, respectively.
● incx and incy equal the increment by which one has to stride through memory to locate the elements of vectors x and y, respectively.

The following are the most frequently used Level-1 BLAS:

Routine/Function    Operation
_swap               x ↔ y
_scal               x ← αx
_copy               y ← x
_axpy               y ← αx + y
_dot                x^T y
_nrm2               ∥x∥₂
_asum               ∥re(x)∥₁ + ∥im(x)∥₁
i_amax              min(k) : ∣re(x_k)∣ + ∣im(x_k)∣ = max_i (∣re(x_i)∣ + ∣im(x_i)∣)

Matrix–Vector Operations (Level-2 BLAS)
The next level of BLAS supports operations with matrices and vectors. The simplest example of such an operation is the matrix–vector product: y ← Ax where x and y are vectors and A is a matrix. Another example is the computation A22 = −a21 a21^T + A22 (symmetric rank-1 update) in the Cholesky factorization. This operation can be recoded as

do j=1, n
  A( j,j ) = sqrt( A( j,j ) )
  call dscal( n-j, 1.0d00 / A( j,j ), A( j+1, j ), 1 )
  call dsyr( 'Lower triangular', n-j, -1.0d00, A( j+1,j ), 1, A( j+1,j+1 ), lda )
enddo

Here, dsyr is the routine that implements a double precision symmetric rank-1 update. Readability of the code is improved by casting computation in terms of routines that implement the operations that appear in the algorithm: dscal for a21 = a21/α11 and dsyr for A22 = −a21 a21^T + A22.
The naming convention for Level-2 BLAS routines is given by

_XXYY,

where
● "_" can take on the values s, d, c, z.
● XX indicates the shape of the matrix:
  XX   matrix shape
  ge   general (rectangular)
  sy   symmetric
  he   Hermitian
  tr   triangular
  In addition, operations with banded matrices are supported, which we do not discuss here.
● YY indicates the operation to be performed:
  YY   operation
  mv   matrix-vector multiplication
  sv   solve vector
  r    rank-1 update
  r2   rank-2 update

A representative call to a Level-2 BLAS operation is given by

dsyr( uplo, n, alpha, x, incx, A, lda )

which implements the operation A = αxx^T + A, updating the lower or upper triangular part of A by choosing uplo as 'Lower triangular' or 'Upper triangular,' respectively.
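As a companion to the Fortran loop above, the Level-2 formulation in C via CBLAS might look as follows. This is an illustrative sketch of ours (the name chol_level2 and the storage conventions are assumptions), using the standard cblas_dscal and cblas_dsyr calls.

#include <math.h>
#include <cblas.h>

/* The same unblocked Cholesky factorization, now cast in terms of the
 * Level-2 BLAS routine dsyr (symmetric rank-1 update) via CBLAS.
 * A is column-major, lower triangle, with leading dimension lda. */
void chol_level2(int n, double *A, int lda)
{
    for (int j = 0; j < n; ++j) {
        A[j + j * lda] = sqrt(A[j + j * lda]);          /* alpha11 <- sqrt(alpha11) */
        cblas_dscal(n - j - 1, 1.0 / A[j + j * lda],
                    &A[(j + 1) + j * lda], 1);          /* a21 <- a21 / alpha11     */
        cblas_dsyr(CblasColMajor, CblasLower, n - j - 1,
                   -1.0, &A[(j + 1) + j * lda], 1,
                   &A[(j + 1) + (j + 1) * lda], lda);   /* A22 <- A22 - a21 a21^T   */
    }
}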
The parameter lda (the leading dimension of matrix A) indicates the increment by which memory has to be traversed in order to address successive elements in a row of matrix A.
The following table gives the most commonly used Level-2 BLAS operations:

Routine/Function    Operation
_gemv               general matrix-vector multiplication
_symv               symmetric matrix-vector multiplication
_trmv               triangular matrix-vector multiplication
_trsv               triangular solve vector
_ger                general rank-1 update
_syr                symmetric rank-1 update
_syr2               symmetric rank-2 update

There are also interfaces for operation with banded matrices stored in packed format as well as for operations with Hermitian matrices.

Matrix–Matrix Operations (Level-3 BLAS)
The problem with vector operations and matrix–vector operations is that they perform O(n) computations with O(n) data and O(n²) computations with O(n²) data, respectively. This makes it hard, if not impossible, to leverage cache memory now that processing speeds greatly outperform memory speeds, unless the problem size is relatively small (fits in cache memory).
The solution is to cast computation in terms of matrix-matrix operations like matrix–matrix multiplication. Consider again Cholesky factorization. Partition

     ⎛ A11   ⋆  ⎞             ⎛ L11   0  ⎞
A →  ⎝ A21  A22 ⎠   and   L → ⎝ L21  L22 ⎠ ,

where A11 and L11 are nb × nb submatrices. Then

⎛ A11   ⋆  ⎞   ⎛ L11   0  ⎞ ⎛ L11   0  ⎞^T   ⎛ L11 L11^T                ⋆              ⎞
⎝ A21  A22 ⎠ = ⎝ L21  L22 ⎠ ⎝ L21  L22 ⎠    = ⎝ L21 L11^T   L21 L21^T + L22 L22^T       ⎠ .

This yields the algorithm
● A11 = L11 where A11 = L11 L11^T (Cholesky factorization of a smaller matrix).
● A21 = L21 where L21 L11^T = A21 (triangular solve with multiple right-hand sides).
● A22 = −L21 L21^T + A22, updating only the lower triangular part of A22 (symmetric rank-k update).
● Continue by overwriting A22 with L22 where A22 = L22 L22^T.

A representative code in Fortran is given by

do j=1, n, nb
  jb = min( nb, n-j+1 )
  call chol( jb, A( j, j ), lda )
  call dtrsm( 'Right', 'Lower triangular', 'Transpose', 'Nonunit diag',
              n-j-jb+1, jb, 1.0d00, A( j, j ), lda, A( j+jb, j ), lda )
  call dsyrk( 'Lower triangular', 'No transpose', n-j-jb+1, jb,
              -1.0d00, A( j+jb, j ), lda, 1.0d00, A( j+jb, j+jb ), lda )
enddo

Here subroutine chol performs a Cholesky factorization; dtrsm and dsyrk are Level-3 BLAS routines:
● The call to dtrsm implements A21 ← L21 where L21 L11^T = A21.
● The call to dsyrk implements A22 ← −L21 L21^T + A22.
The bulk of the computation is now cast in terms of matrix–matrix operations which can achieve high performance.
The naming convention for Level-3 BLAS routines is similar to that for the Level-2 BLAS. A representative call to a Level-3 BLAS operation is given by

dsyrk( uplo, trans, n, k, alpha, A, lda, beta, C, ldc )

which implements the operation C ← αAA^T + βC or C ← αA^T A + βC depending on whether trans is chosen as 'No transpose' or 'Transpose,' respectively. It updates the lower or upper triangular part of C depending on whether uplo equals 'Lower triangular' or 'Upper triangular,' respectively. The parameters lda and ldc are the leading dimensions of arrays A and C, respectively.
The following table gives the most commonly used Level-3 BLAS operations:

Routine/Function    Operation
_gemm               general matrix-matrix multiplication
_symm               symmetric matrix-matrix multiplication
_trmm               triangular matrix-matrix multiplication
_trsm               triangular solve with multiple right-hand sides
_syrk               symmetric rank-k update
_syr2k              symmetric rank-2k update

Impact on Performance
Figure 1 illustrates the performance benefits that come from using the different levels of BLAS on a typical architecture.

BLAS-Like Interfaces
CBLAS
A C interface for the BLAS, CBLAS, has also been defined to simplify the use of the BLAS from C and C++. The CBLAS supports matrices stored in row and column major format.

Libflame
The libflame library that has resulted from the FLAME project encompasses the functionality of the BLAS as well as higher level linear algebra operations. It uses an object-based interface so that a call to a BLAS routine like _syrk becomes

FLA_Syrk( uplo, trans, alpha, A, beta, C )

thus hiding many of the dimension and indexing details.

Sparse BLAS
Several efforts were made to define interfaces for BLAS-like operations with sparse matrices. These do not seem to have caught on, possibly because the storage of sparse matrices is much more complex.

Parallel BLAS
Parallelism with BLAS operations can be achieved in a number of ways.

Multithreaded BLAS
On shared-memory architectures multithreaded BLAS are often available. Such implementations achieve parallelism within each BLAS call without need for changing code that is written in terms of the interface. Figure 2 shows the performance of the Cholesky factorization codes when multithreaded BLAS are used on a multicore architecture.

PBLAS
As part of the ScaLAPACK project, an interface for distributed memory parallel BLAS was proposed, the PBLAS. The goal was to make this interface closely resemble the traditional BLAS. A call to dsyrk becomes

pdsyrk( uplo, trans, n, k, alpha, A, iA, jA, descA, beta, C, iC, jC, descC )

where the new parameters iA, jA, descA, etc., encapsulate information about the submatrix with which to multiply and the distribution to a logical two-dimensional mesh of processing nodes.

PLAPACK
The PLAPACK project provides an alternative to ScaLAPACK. It also provides BLAS for distributed memory architectures, but (like libflame) goes one step further toward encapsulation. The call for parallel symmetric rank-k update becomes

PLA_Syrk( uplo, trans, alpha, A, beta, C )

where all information about the matrices, their distribution, and the storage of local submatrices are encapsulated in the parameters A and C.
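As noted under "Multithreaded BLAS" above, code written in terms of the BLAS interface does not change when a threaded implementation is linked in. The small C sketch below (ours, for illustration only; the wrapper name is hypothetical) makes a single cblas_dgemm call; with a multithreaded BLAS that one call runs in parallel, with the thread count usually chosen outside the program, for example through an environment variable such as OMP_NUM_THREADS (the exact mechanism depends on the library).

#include <cblas.h>

/* C := alpha*A*B + beta*C for column-major m x k, k x n, and m x n matrices.
 * No threading code appears here; a multithreaded BLAS parallelizes the call. */
void scaled_matmul(int m, int n, int k,
                   double alpha, const double *A, int lda,
                   const double *B, int ldb,
                   double beta, double *C, int ldc)
{
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}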
BLAS (Basic Linear Algebra Subprograms). Fig. 1 Performance of the different implementations of Cholesky factorization that use different levels of BLAS. The target processor has a peak of . Gflops (billions of floating point operations per second). BLAS1, BLAS2, and BLAS3 indicate that the bulk of computation was cast in terms of Level-1, -2, or -3 BLAS, respectively. (The figure plots GFlops against n for one thread, with curves for a hand-optimized implementation, BLAS3, BLAS2, BLAS1, and the triple loops.)

Available Implementations
Many of the software and hardware vendors market high-performance implementations of the BLAS. Examples include IBM's ESSL, Intel's MKL, AMD's ACML, NEC's MathKeisan, and HP's MLIB libraries. Widely used open source implementations include ATLAS and the GotoBLAS. Comparisons of performance of some of these implementations are given in Figs. 3 and 4.
Neither the details about the platform on which the performance data was gathered nor the versions of the libraries that were used are given, because architectures and libraries continuously change and therefore which is faster or slower can easily change with the next release of a processor or library.

Related Entries
ATLAS (Automatically Tuned Linear Algebra Software)
libflame
LAPACK
PLAPACK
ScaLAPACK

Bibliographic Notes and Further Reading
What came to be called the Level-1 BLAS were first published in 1979, followed by the Level-2 BLAS in 1988 and the Level-3 BLAS in 1990 [, , ].
Matrix–matrix multiplication (_gemm) is considered the most important operation, since high-performance implementations of the other Level-3 BLAS can be coded in terms of it []. Many implementations of _gemm are now based on the techniques developed by Kazushige Goto []. These techniques extend to the high-performance implementation of other Level-3 BLAS [] and multithreaded architectures []. Practical algorithms for the distributed memory parallel implementation of matrix–matrix multiplication, used by ScaLAPACK and PLAPACK, were first discussed in [, ] and for other Level-3 BLAS in [].
BLAS (Basic Linear Algebra Subprograms). Fig. 2 Performance of the different implementations of Cholesky factorization that use different levels of BLAS, using four threads on an architecture with four cores and a peak of . Gflops. (The figure plots GFlops against n, with curves for a hand-optimized implementation, BLAS3, BLAS2, BLAS1, and the triple loops.)

BLAS (Basic Linear Algebra Subprograms). Fig. 3 Performance of different BLAS libraries for matrix–matrix multiplication (dgemm). (One thread; the figure plots GFlops against n, with curves for GotoBLAS, MKL, ACML, and ATLAS.)
BLAS (Basic Linear Algebra Subprograms). Fig. 4 Parallel performance of different BLAS libraries for matrix–matrix multiplication (dgemm). (Four threads; the figure plots GFlops against n, with curves for GotoBLAS, MKL, ACML, and ATLAS.)

As part of the BLAS Technical Forum, an effort was made in the late 1990s to extend the BLAS interfaces to include additional functionality []. Outcomes included the CBLAS interface, which is now widely supported, and an interface for Sparse BLAS [].

Bibliography
. Agarwal RC, Gustavson F, Zubair M () A high-performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication. IBM J Res Dev ()
. Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, Whaley RC (June ) An updated set of Basic Linear Algebra Subprograms (BLAS). ACM Trans Math Softw ():–
. Chtchelkanova A, Gunnels J, Morrow G, Overfelt J, van de Geijn RA (Sept ) Parallel implementation of BLAS: general techniques for level-3 BLAS. Concurrency: Pract Exp ():–
. Dongarra JJ, Du Croz J, Hammarling S, Duff I (March ) A set of level-3 basic linear algebra subprograms. ACM Trans Math Softw ():–
. Dongarra JJ, Du Croz J, Hammarling S, Hanson RJ (March ) An extended set of FORTRAN basic linear algebra subprograms. ACM Trans Math Softw ():–
. Duff IS, Heroux MA, Pozo R (June ) An overview of the sparse basic linear algebra subprograms: the new standard from the BLAS technical forum. ACM Trans Math Softw ():–
. Goto K, van de Geijn R () High-performance implementation of the level-3 BLAS. ACM Trans Math Softw ():–
. Goto K, van de Geijn RA () Anatomy of high-performance matrix multiplication. ACM Trans Math Softw ():–
. Kågström B, Ling P, Van Loan C () GEMM-based level-3 BLAS: high performance model implementations and performance evaluation benchmark. ACM Trans Math Softw ():–
. Lawson CL, Hanson RJ, Kincaid DR, Krogh FT (Sept ) Basic linear algebra subprograms for Fortran usage. ACM Trans Math Softw ():–
. Marker B, Van Zee FG, Goto K, Quintana-Ortí G, van de Geijn RA () Toward scalable matrix multiply on multithreaded architectures. In: Kermarrec A-M, Bougé L, Priol T (eds) Euro-Par, LNCS , pp –
. van de Geijn R, Watts J (April ) SUMMA: scalable universal matrix multiplication algorithm. Concurrency: Pract Exp ():–

Blocking

Tiling
Blue CHiP

Lawrence Snyder
University of Washington, Seattle, WA, USA

Synonyms
Blue CHiP project; CHiP architecture; CHiP computer; Configurable, Highly parallel computer; Programmable interconnect computer; Reconfigurable computer

Definition
The CHiP Computer is the Configurable, Highly Parallel architecture, a multiprocessor composed of processor-memory elements in a lattice of programmable switches []. Designed in  to exploit Very Large Scale Integration (VLSI), the switches are set under program control to connect and reconnect processors. Though the CHiP architecture was the first use of programmable interconnect, it has since been applied most widely in Field Programmable Gate Arrays (FPGAs).
The Blue CHiP Project, the effort to investigate and develop the CHiP architecture, took Carver Mead's "tall, thin man" vision as its methodology. Accordingly, it studied "How best to apply VLSI technology" from five research perspectives: VLSI, architecture, software, algorithms, and theory.

Discussion
Background and Motivation
In the late 1970s, as VLSI densities were improving to allow significant functionality on a chip, Caltech's Carver Mead offered a series of short courses at several research universities []. He taught the basics of chip design to computer science grad students and faculty. In his course, which required students to do a project fabricated using ARPA's multi-project chip facility, Mead argued that the best way to leverage VLSI technology was for a designer to be a "tall, thin man," a person knowledgeable in the entire chip development stack: electronics, circuits, layout, architecture, algorithms, software, and applications. "Thin" implied that the person had a working knowledge of each level, but might not be an expert. The Blue CHiP project adopted Mead's idea: Use the "tall, thin man" approach to create new ways to apply VLSI technology.
The Blue CHiP Project and the CHiP architecture were strongly influenced by ideas tested in a design included in the MPC- multi-project chip, a batch of chip designs created by Mead's students []. That design, called the Tree Organized Processor Structure (referenced as AG- in the archive []), was a height  binary tree with field-programmable processors for evaluating Boolean expressions []. Though the design used field-effect programmability, the binary tree was embedded directly into the layout; see Fig. . Direct embedding seemed unnecessarily restrictive. The lattice structure of the CHiP architecture, developed in Spring , was to provide a more flexible communication capability.
Because parallel computation was not well understood, because the CHiP computer was an entirely new architecture, because nothing was known about programmable interconnect, and because none of the "soft" parts of the project (algorithms, OS, languages, applications) had ever been considered in the context of configurability, there was plenty of research to do. And it involved so many levels that Mead's "tall, thin man" approach was effectively a requirement. Having secured Office of Naval Research funding for the Blue CHiP Project in Fall , work began at the start of .

Blue CHiP. Fig.  The tree organized processor structure, an antecedent to the CHiP architecture []; the root is in the center and its children are to its left and right

The CHiP Architecture Overview and Operation
The focal point of the Blue CHiP project was the Configurable, Highly Parallel (CHiP) computer []. The CHiP architecture used a set of processor elements (PEs) – -bit processors with random access memory for program and data – embedded in a regular grid, or lattice, of wires and programmable switches. Interprocessor communication was implemented by setting the switches under program control to connect PEs with direct, circuit-switched channels. The rationale was simple: Processor design was well understood, so a specific hardware implementation made sense, but the optimal way to connect together parallel processors was not known. Building a dynamically configurable switching fabric that could reconnect processes as the computation progressed permitted optimal communication.

Phases
The CHiP architecture exploited the observation that large computations (worthy of parallel processing) are typically expressed as a sequence of algorithms, or phases, that perform the major steps of the computation. Each phase – matrix multiplication, FFT, pivot selection, etc. – is a building block for the overall computation. The phases execute one after another, often being repeated for iterative methods, simulations, approximations, etc. Phases tend to have a specific characteristic communication structure. For example, FFT uses the butterfly graph to connect processors; the Kung-Leiserson systolic array matrix multiplication uses a "hex array" interconnect [MC]; parallel prefix computations use a complete binary tree, etc. Rather than trying to design a one-size-fits-all interconnect, the CHiP architecture provides the mechanism to program it.
From an OS perspective a parallel computation on the CHiP machine proceeds as follows:

Load switch memories in the lattice with configuration settings for all phases
Load PE memories with binary object code for all phases
i = 0
while phases not complete {
  Select the lattice configuration, connecting processors as required for Phase[i]
  Run Phase[i] to completion
  i = i + 1
}

Notice that lattice configurations are typically reused during a program execution.

Configuration
Configuring a communication structure works as follows: The interconnection graph structure, or configuration, is stored in the lattice, and when selected, it remains in effect for the period of that phase of execution, allowing the PEs to communicate and implement the algorithm. PE-to-PE communication is realized by a direct circuit-switched connection []. The overall graph structure is realized by setting each switch appropriately.
A switch has circuitry to connect its d incident edges (bit-parallel data wires), allowing up to d/2 communication paths to cross the switch independently. See Fig. . A switch has memory to store the settings needed for a set of configurations. Each switch stores its setting for configuration i in its memory location i. The lattice implements configuration i by selecting the ith memory location for all switches.
Figure  shows schematic diagrams of a degree 4 switch capable of 2 independent connections, that is, paths that can crossover. Interconnections are programmed graphically, using an iconography shown in Fig. a (and Fig. ). Figure b shows the connectivity allowing the wires to connect arbitrarily. Notice that fan-out is possible. The independent connections require separate buses, shown as paired, overlapping diamonds. Higher degree switches (more incident edges) require more buses; wider datapaths (-wide is shown) require more diamonds. Figure c shows the configuration setting in memory location M asserting the connection of the east edge to the right bus. Notice that memory is provided to store the connections for each bus, and that there are three states: disconnected, connect to left, and connect to right. Not shown in the figure is the logic to select a common memory location on all switches, that is, the logic selecting M, as well as circuitry to insure the quality of the signals.
Blue CHiP. Fig.  A schematic diagram for a degree four switch: (a) iconography used in programming
interconnection graphs; from top, unconnected, east-west connection, east-west and north-south connections, fan-out
connection, two corner-turning connections; (b) wire arrangement for four incident edges (-bit-parallel wires); the
buses implementing the connection are shown as overlapping diamonds; (c) detail for making a connection of the east
edge to the right bus per configuration stored in memory location M

Blue CHiP. Fig.  Sample switch lattices in which processors are represented by squares and switches by circles;
(a) an -degree, single wide lattice; (b) a -degree, double wide lattice []

As a final detail, notice that connections can cause the order of the wires of a datapath to be flipped. Processor ports contain logic to flip the order of the bits to the opposite of the order it received. When a new configuration comes into effect, each connection sends/receives an lsb, revealing whether the bits are true or reversed, allowing the ports to select/deselect the flipping hardware.

Processor Elements
Each PE is a computer, a processor with memory, in which all input and output is accessible via ports. (Certain "edge PEs" also connect to external I/O devices.) A lattice with switches having d incident edges hosts PEs with d incident edges, and therefore d ports. Programs communicate simply by sending and receiving through the ports; since the lattice implements circuit-switched communication, the values can be sent directly with little "packetizing" beyond a task designation in the receiving processor.
Each PE has its own memory, so in principle each PE can execute its own computation in each phase; that is, the CHiP computer is a MIMD, or multiple-instruction, multiple-data, computer. As phase-based parallel algorithms became better understood, it became clear that PEs typically execute only a few different computations to implement a phase. For example, in parallel-prefix algorithms, which have tree-connected processors, the root, the interior nodes, and the leaves execute slightly different computations. Because these are almost always variations of each other, a single program, multiple data (SPMD) model was adopted.

The Controller
The orchestration of phase execution is performed by an additional processor, a "front end" machine, called the controller. The controller loads the configuration settings into the switches, and loads the PEs with their code for each phase. It then steps through the phase execution logic (see "Phases"), configuring the lattice for a phase and executing it to completion; processors use the controller's network to report completion.
The controller code embodies the highest-level logic of a computation, which tends to be very straightforward. The controller code for the Simple Benchmark []

for m := 0 to mmax do
begin
  phase(hydro);
  phase(viscos);
  phase(new_Δt);
  phase(thermo);
  phase(row_solve);
  phase(column_solve);
  phase(energy_bal);
end.

illustrates this point and is typical.

Blue CHiP Project Research
The Blue CHiP Project research, following Mead, conducted research in five different topic areas: VLSI, architecture, software, algorithms, and theory. These five topics provide an appropriate structure for the remainder of this entry, when extended with an additional section to summarize the ideas. Though much of the work was targeted at developing a CHiP computer, related topics were also studied.

VLSI: Programmable Interconnect
Because the CHiP concept was such a substantial departure from conventional computers and the parallel computers that preceded it, initial research focused less on building chips, that is, the VLSI, and more on the overall properties of the lattice that were the key to exploiting VLSI. A simple field-effect programmable switch had been worked out in the summer of , giving the team confidence that fabricating a production switch would be doable when the overall lattice design was complete.
The CHiP computer's lattice is a regular grid of (data) wires with switches at the cross points and processors inset at regular intervals. The lattice abstraction uses circles for switches and squares for processors. An important study was to understand how a given size lattice hosted various graphs; see Fig.  for typical abstractions. The one layer metal processes of the early 1980s caused the issue of lattice bandwidth to be a significant concern.
Lattice Properties Among the key parameters [] that describe a lattice are the degree of the nodes, that is, the number of incident edges, and the corridor width, the number of switches separating two PEs. The architectural question is: What is a sufficiently rich lattice to handle most interconnection graphs? This is a graph embedding question, and though graph embedding was an active research area at the time, asymptotic results were not helpful. Rather, it was important for us to understand graph embeddings for a limited number of nodes, say  or fewer, of the sort of graphs arising in phase-based parallel computation.
What sort of communication graphs do parallel algorithms use? Table  lists some typical graphs and algorithms that use them; there are other, less familiar graphs. Since lattices permit nonplanar embeddings, that is, wires can cross over, a sufficiently wide corridor lattice of sufficiently high degree switches can embed any graph in our domain. But economy is crucial, especially with VLSI technologies with few layers.
Embeddings It was found that graphs could be embedded in very economical lattices. Examples of two of the less obvious embeddings are shown in Fig. . The embedding of the binary tree is an adaptation of the "H-embedding" of a binary tree into the plane.
The shuffle exchange graph for  nodes of a single wide lattice is quite involved – shuffle edges are solid, exchange edges are dashed. Shuffle exchange on  nodes is not known to be embeddable in a single wide lattice, and was assumed to require corridor width of . Notice that once a graph has been embedded in a lattice, it can be placed in a library for use by those who prefer to program using the logical communication structure rather than the implementing physical graph.
Techniques Most graphs are easy to lay out because most parallel computations use sparse graphs with little complexity. In programming layouts, however, one is often presented with the same problem repeatedly. For example, laying out a torus is easy, but minimizing the wire length – communication time is proportional to length – takes some technique. See Fig. . An interleaving of the row and column PEs of a torus by "folding" at the middle vertically, and then "folding" at the middle horizontally, produces a torus with interleaved processors. Each of a PE's four neighbors is at most three switches away. Opportunities to use this "trick" of interleaving PEs to shorten long wires arise in other cases. A related idea (Fig. b) is to "lace" the communication channels in a corridor to maximize the number of datapaths used.
The conclusion from the study of lattice embeddings indicated, generally, that the degree should be , and that corridor width is sufficiently valuable that it should be "as wide as possible." Given the limitations of VLSI technology at the time, only  was realistic.

Blue CHiP. Table  Communication graph families and parallel algorithms that communicate using those graphs

Communication Graph                Parallel Applications
-degree, -degree mesh              Jacobi relaxation for Laplace equations
-degree, -degree mesh              Matrix multiplication, dynamic programming
Binary tree                        Aggregation (sum, max), parallel prefix
Toroidal -degree, -degree mesh     Simulations with periodic boundaries
Butterfly graph                    Fast Fourier transform
Shuffle-exchange graph             Sorting

Architecture: The Pringle
When the project began, VLSI technology did not yet have the densities necessary to build a prototype CHiP computer beyond single-bit processors; fitting one serious processor on a die was difficult enough, much less a lattice and multiple processors. Nevertheless, the roadmap of future technology milestones was promising, so the project opted to make a CHiP simulator to use while waiting for technology to improve. The simulated CHiP was called the Pringle [] in reference to the snack food by that name that also approximates a chip.

Blue CHiP. Fig.  Example embeddings; (a) a -node binary tree embedded in a  node, -degree, single wide lattice; (b) a -node shuffle exchange graph embedded in a  node, -degree, single wide lattice
Blue CHiP. Fig.  Embedding techniques; (a) a -degree torus embedded using “alternating PE positions” to reduce the
length of communication paths to  switches; solid lines are vertical connections, dashed lines are horizontal connections;
(b) “lacing” a width  corridor to implement  independent datapaths, two of which have been highlighted in bold

Blue CHiP. Table  The Pringle datasheet; Pringle was an emulator for the CHiP architecture with an -bit datapath

Processor elements:
  Number of PEs: 
  PE microprocessor: Intel 
  PE datapath width:  bits
  PE floating point chip: Intel 
  PE RAM size:  Kb
  PE EPROM size:  Kb
  PE clock rate:  MHz

Switch:
  Switch structure: Polled bus
  Switch clock rate:  MHz
  Bus bandwidth:  Mb/s
  Switch datapath:  bits

Controller:
  Controller microprocessor: Intel 

The Pringle was designed to behave like a CHiP computer except that instead of using a lattice for interprocessor communication, it simulated the lattice with a polled bus structure. Of course, this design serializes the CHiP's parallel communication, but the Pringle could potentially advance  separate communications in one polling sweep. The switch controller stored multiple configurations in its memory (up to ), implementing one per phase. See the datasheet in Table .
Two copies of Pringle were eventually built, one for Purdue University and one for the University of Washington. Further data on the Pringle is available in the reports [].

Software: Poker
The primary software effort was building the Poker Parallel Programming Environment []. Poker was targeted at programming the CHiP Computer, which informed the structure of the programming system substantially. But there were other influences as well.
For historical context, the brainstorming and design of Poker began at Purdue in January  in a seminar dedicated to that purpose. At that point, Xerox PARC's Alto was well established as the model for building a modern workstation computer. It had introduced bit-mapped graphic displays, and though they were still extremely rare, the Blue CHiP Project was committed to using them. The Alto project also supported interactive programming environments such as SmallTalk and Interlisp. It was agreed that Poker needed to support interactive programming, too, but the principles and techniques for building windows, menus, etc., were not yet widely understood. Perhaps the most radical aspect of the Alto, which also influenced the project, was dedicating an entire computer to support the work of one user. It was decided to use a VAX for that purpose. This decision was greeted with widespread astonishment among the Purdue CS faculty, because all of the rest of the departmental computing was performed by another time-shared VAX.
The Poker Parallel Programming Environment opened with a "configuration" screen, called CHiP Params, that asked programmers for the parameters of the CHiP computer they would be programming. Items such as the number of processors, the degree of the switches, etc. were requested. Most of this information would not change in a production setting using a physical computer, but for a research project, it was important.
Programming the CHiP machine involved developing phase programs for specific algorithms such as matrix multiplication, and then assembling these phase pieces into the overall computation []. Each phase program requires these components:
● Switch Settings (SS) specification – defines the interconnection graph
● Code Names (CN) specification – assigns a process to each PE, together with parameters
● Port Names (PN) specification – defines the names used by each PE to refer to its neighbors
● I/O Names (IO) specification – defines the external data files used and created by the phase
● Text files – process code written in an imperative language (XX or C)
Not all phases read or write external files, so IO is not always needed; all of the other specifications are required.
Poker emphasized graphic programming and minimized symbolic programming. Programmers used one of several displays – full-screen windows – to program the different types of information describing a computation. Referring to Fig. , the windows were: Switch Settings for defining the interconnection graph (Fig. a), Code Names for assigning a process (and parametric data) to each processing element (Fig. b), Port Names for naming a PE's neighbors (Fig. c), I/O Names for reading and writing to standard in and out, and Trace View (Fig. d) for seeing the execution in the same form as used in the "source" code.
In addition to the graphic display windows, a text editor was used to write the process code, whose names are assigned in the Code Names window. Process code was written in a very simple language called XX. (The letters XX were used until someone could think of an appropriate name for the language, but soon it was simply called Dos Equis.) Later, a C-preprocessor called Poker C was also implemented. One additional facility was a Command Request window for assembling phases.
After building a set of phases, programmers needed to assemble them, using the Command Request (CR) interface, into a sequence of phase invocations to solve the problem. This top-level logic is critical because when a phase change occurs – for example, moving from a pivot selection step to an elimination step – the switches in the lattice typically change.
The overall structure of the Poker Environment is shown in Fig.  []. The main target of CHiP programs was a generic simulator, since production level computations were initially not needed. The Cosmic Cube from Caltech and the Sequent were contemporary parallel machines that were also targeted [].

Algorithms: Simple
In terms of developing CHiP-appropriate algorithms, the Blue CHiP Project benefited from the wide interest at the time in systolic array computation. Systolic arrays are data parallel algorithms that map easily to a CHiP computer phase []. Other parallel algorithms of the time – parallel prefix, nearest neighbor iterative relaxation algorithms, etc. – were also easy to program. So, with basic building blocks available, project personnel split their time between composing these known algorithms (phases) into complete computations, and developing new parallel algorithms. The SIMPLE computation illustrates the former activity.
SIMPLE The SIMPLE computation is, as its name implies, a simple example of a Lagrangian hydrodynamics computation developed by researchers at Livermore National Labs to illustrate the sorts of computations of interest to them that would benefit from performance improvements. The program existed only in Fortran, so the research question became developing a clean parallel solution that eventually became the top-level logic of the earlier section "The Controller." Phases had to be defined, interconnection graphs had to be created, data flow between phases had to be managed, and so forth. This common task – converting from sequential to parallel for the CHiP machine – is fully described for SIMPLE in a very nice paper by Gannon and Panetta [].
Blue CHiP. Fig.  Poker screen shots of a phase program for a  processor CHiP computer for a master/slave computation;
(a) Switch Settings (, is the master), editing mode for SS is shown in Fig. a; (b) Code Names; (c) Port Names; and (d) Trace
View midway through the (interpreted) computation

New Algorithms Among the new algorithms were image-processing computations, since they were thought to be a domain benefiting from CHiP-style parallelism []. One particularly elegant connected components counting algorithm illustrates the style.
Using a -degree version of a morphological transformation due to Levialdi, a bit array is modified using two transformations applied at all positions (rendered as two small templates in the original figure), producing a series of morphed arrays. In words, the first rule specifies that a 1-bit adjacent to 0-bits to its north, northwest, and west becomes a 0-bit in the next generation; the second rule specifies that a 0-bit adjacent to 1-bits to its north and west becomes a 1-bit in the next generation; all other bits are unchanged; an isolated bit that is about to disappear due to the first rule should be counted as a component. See Fig. .
The CHiP phase for the counting connected components computation assigns a rectangular subarray of pixels to each processor, and uses a -degree mesh interconnection. PEs exchange bits around the edge of their subarray with adjacent neighbors, and then apply the rules to create the next iteration, repeating. As is typical, the controller is not involved in the phase until all processors signal that they have no more bits left in their subarray, at which point the phase ends.
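The following C sketch (ours, for illustration only) applies one generation of the two rules just described to a small bit array; in the CHiP phase each PE would run the same step over its own subarray after exchanging boundary bits with its mesh neighbors. The 0/1 assignment follows the rules as stated above, and the counting of vanishing isolated bits is omitted.

#define H 8
#define W 8

/* One generation of the Levialdi-style transformation described above, applied
 * to an H x W array of 0/1 values. Out-of-range neighbors are treated as 0.
 * Rule 1: a 1 whose north, northwest, and west neighbors are all 0 becomes 0.
 * Rule 2: a 0 whose north and west neighbors are both 1 becomes 1.
 * All other bits are unchanged. */
void levialdi_step(const unsigned char in[H][W], unsigned char out[H][W]) {
    for (int i = 0; i < H; ++i) {
        for (int j = 0; j < W; ++j) {
            unsigned char n  = (i > 0)          ? in[i - 1][j]     : 0;
            unsigned char w  = (j > 0)          ? in[i][j - 1]     : 0;
            unsigned char nw = (i > 0 && j > 0) ? in[i - 1][j - 1] : 0;
            unsigned char b  = in[i][j];
            if (b == 1 && n == 0 && w == 0 && nw == 0)  out[i][j] = 0;  /* rule 1 */
            else if (b == 0 && n == 1 && w == 1)        out[i][j] = 1;  /* rule 2 */
            else                                        out[i][j] = b;
        }
    }
}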
Blue CHiP. Fig.  Structure of the Poker parallel programming environment. The "front end" was the portion visible in the graphic user interface; the "back ends" were the "platforms" on which a Poker program could be "run." (The diagram shows the programming views – CHiP params, IO names, Port names, Code names, Switch settings, Text, Make view, Command request, Run-time view Trace, and I/O, together with a text editor – feeding a program database; compilers for Poker C and XX; and back ends consisting of simulators (generic, Cosmic Cube, CP), an emulator (Pringle), and parallel computers (Pringle, Cosmic Cube, Sequent).)

Blue CHiP. Fig.  The sequence of ten bit arrays produced using Levialdi's transformation; pixels are blackened when counted. The algorithm "melts" a component to the lower right corner of its bounding box, where it is counted

Theory: Graphs
Though the project emphasized the practical task of exploiting VLSI, a few theorems were proved, too. Two research threads are of interest. The first concerned properties of algorithms to convert data-driven programs to loop programs. Data-driven is an asynchronous communication protocol where, for example,
reads to a port prior to data arrival stall. The CHiP machine was data driven, but the protocol carries overhead. The question was whether data-driven programs could be more synchronous. The theorems concerned properties of an algorithm to convert from data-driven to synchronous; the results were reported [] and implemented in Poker, but not heavily used.
The second thread turned out to be rather long. It began with an analysis of yield in VLSI chip fabrication. Called the Tile Salvage Problem, the analysis concerned how the chips on a wafer, some of which had tested faulty, could be grouped into large working blocks []. The solution found that matching working pairs is in polynomial time, but finding  ×  working blocks is NP-hard; an optimal-within- approximation was developed for  ×  blocks. When the work was shown to Tom Leighton, he found closely related planar graph matching questions, and added/strengthened some results. He told Peter Shor, who did the same. And finally, David Johnson made still more improvements. The work was finally published as Generalized Planar Matching [].

Summary
The Blue CHiP Project, which ran for six years, produced a substantial amount of research, most of which is not reviewed here. Much of the research was integrated across multiple topics by applying Mead's "tall, thin man" approach. The results in the hardware domain included a field programmable switch, a programmable communication fabric (lattice), an architecture, wafer-scale integration studies, and a hardware emulator (Pringle) for the machine. The results in the software domain included the Poker Parallel Programming Environment that was built on a graphic workstation, communication graph layouts, programs for the machine, and new parallel algorithms. Related research included studies in wafer scale integration, both theory and design, as well as the extension of the CHiP approach to signal processing and other domains. The work was never directly applied because the technology of the day was not sufficiently advanced;  years after the conception of the CHiP architecture, however, the era of multiple-cores-per-chip offers suitable technology.
Apart from the nascent state of the technology, an important conclusion of the project was that the problem in parallel computing is not a hardware problem, but a software problem. The two key challenges were plain:
● Performance – developing highly parallel computations that exploit locality,
● Portability – expressing parallel computations at a high enough level of abstraction that a compiler can target any MIMD machine.
The two challenges are enormous, and largely remain unsolved into the twenty-first century. Regarding the first, exploiting locality was a serious concern for the CHiP architecture with its tight VLSI connection, making us very sensitive to the issue. But it was also clear that locality is always essential in parallelism, and valuable for all computation. Regarding the second, it was clear in retrospect that the project's programming and algorithm development were very tightly coupled to the machine, as were other projects of the day. Whereas CHiP computations could usually be hosted effectively on other computers, the converse was not true. Shouldn't all code be machine independent? The issue in both cases concerned the parallel programming model.
These conclusions were embodied in a paper known as the "Type Architecture" paper []. (Given the many ways "type" is used in computer science, it was not a good name; in this case it meant "characteristic form" as in type species in biology.) The paper, among other things, predicts the rise of message passing programming, and criticizes contemporary programming approaches. Most importantly, it defines a machine model – a generic parallel machine – called the CTA. This machine plays the same role that the RAM or von Neumann machine plays in sequential computing. The CTA is visible today in applications ranging from message passing programming to LogP analysis.

Related Entries
CSP (Communicating Sequential Processes)
Graph Algorithms
Networks, Direct
Reconfigurable Computer
Routing (Including Deadlock Avoidance)
Systolic Arrays
Universality in VLSI Computation
VLSI Computation
Blue Gene/P B 

Bibliographic Notes and Additional . Snyder L, Socha D () Poker on the cosmic cube: the first
Reading retargetable parallel programming language and environment.
The focal point of the project, the machine, and its soft- Proceedings of the international conference on parallel process-
ing, Los Alamitos B
ware were strongly influenced by VLSI technology, but
. Kung HT, Leiserson CE () Algorithms for VLSI processor
the technology was not yet ready for a direct appli- arrays. In: Mead C, Conway L (eds) Introduction to VLSI systems,
cation of the approach; it would take until roughly Addison-Wesley, Reading
. The ideas that the project developed – phase- . Cypher RE, Sanz JLC, Snyder L () Algorithms for image com-
based parallelism, high level parallel language, empha- ponent labeling on SIMD mesh connected computers. IEEE Trans
Comput ():–
sis on locality, emphasis on data parallelism, etc. –
. Cuny JE, Snyder L () Compilation of data-driven programs
turned out to drive the follow-on research much more for synchronous execution. Proceedings of the tenth ACM sym-
than VLSI. Indeed, there were several follow-on efforts posium on the principles of programming languages, Austin, pp
to develop high performance, machine independent –
parallel languages [], and eventually, ZPL []. The . Berman F, Leighton FT, Snyder L () Optimal tile salvage,
Purdue University Technical Report TR-, Jan 
problem has not yet been completely solved, but the
. Berman F, Johnson D, Leighton T, Shor P, Snyder L () Gen-
Chapel [] language is the latest to embody these eralized planar matching. J Algorithms :–
ideas. . Snyder L () Type architectures, shared memory, and the
corollary of modest potential, Annual review of computer science,
vol , Annual Reviews, Palo Alto
Bibliography . Lin C () The portability of parallel programs across MIMD
. Snyder L () Introduction to the configurable, highly parallel computers, PhD Dissertation, University of Washington
computer. Computer ():–, Jan  . Chamberlain B, Choi S-E, Lewis E, Lin C, Snyder L and Weath-
. Conway L () The MPC adventures, Xerox PARC Tech. ersby W () The case for high-level parallel programming in
Report VLSI--; also published at http://ai.eecs.umich.edu/ ZPL. IEEE Comput Sci Eng ():–, July–Sept 
people/conway/VLSI/MPCAdv/MPCAdv.html . Chamberlain BL, Callahan D, Zima HP () Parallel pro-
. Conway L MPC: a large-scale demonstration of a new way grammability and the Chapel language. Int J High Perform Com-
to create systems in silicon, http://ai.eecs.umich.edu/people/ put Appl ():–, Aug 
conway/VLSI/MPC/MPCReport.pdf . Snyder L () Overview of the CHiP computer. In: Gray JP (ed)
. Snyder L () Tree organized processor structure, A VLSI VLSI , Academic, London, pp –
parallel processor design, Yale University Technical Report
DCS/TR
. Snyder L () Introduction to the configurable, highly paral-
lel computer, Technical Report CSD-TR-, Purdue University,
Nov 
. Snyder L () Programming processor interconnection
Blue CHiP Project
structures, Technical Report CSD-TR-, Purdue University,
Oct 
Blue CHiP
. Gannon DB, Panetta J () Restructuring SIMPLE for the CHiP
architecture. Parallel Computation ():–
. Kapauan AA, Field JT, Gannon D, Snyder L () The
Pringle parallel computer. Proceedings of the th interna-
tional symposium on computer architecture, IEEE, New York, Blue Gene/L
pp –
. Snyder L () Parallel programming and the Poker program- IBM Blue Gene Supercomputer
ming environment. Computer ():–, July 
. Snyder L () The Poker (.) programmer’s guide, Purdue
University Technical Report TR-, Dec 
. Notkin D, Snyder L, Socha D, Bailey ML, Forstall B, Gates K,
Greenlaw R, Griswold WG, Holman TJ, Korry R, Lasswell G,
Mitchell R, Nelson PA () Experiences with poker. Proceed- Blue Gene/P
ings of the ACM/SIGPLAN conference on parallel programming:
experience with applications, languages and systems IBM Blue Gene Supercomputer
 B Blue Gene/Q

i.e., the address of the instruction following a branch B


Blue Gene/Q is predicted. Then the instructions following the branch
can speculatively progress in the processor pipeline
IBM Blue Gene Supercomputer without waiting for the completion of the execution of
branch B.
Anticipating the address of the next instruction to
be executed was recognized as an important issue very
Branch Predictors early in the computer industry back in the late s.
However, the concept of branch prediction was really
André Seznec introduced around  by Smith in []. On the occur-
IRISA/INRIA, Rennes, Rennes, France rence of a branch, the effective information that must
be predicted is the address of the next instruction.
However, in practice several different informations are
Definition predicted in order to predict the address of the next
The branch predictor is a hardware mechanism that instruction. First of all, it is impossible to know that an
predicts the address of the instruction following the instruction is a branch before it has been decoded, that
branch. For respecting the sequential semantic of a pro- is the branch nature of the instruction must be known
gram, the instructions should be fetched, decoded, exe- or predicted before fetching the next instruction. Sec-
cuted, and completed in the order of the program. This ond, on taken branches, the branch target is unknown
would lead to quite slow execution. Modern processors at instruction fetch time, that is the potential target of
implement many hardware mechanisms to execute con- the branch must be predicted. Third, most branches
currently several instructions while still respecting the are conditional, the direction of the branch taken or
sequential semantic. Branch instructions cause a partic- not-taken must be predicted.
ular burden since their result is needed to begin the exe- It is important to identify that not all information is
cution of the subsequent instructions. To avoid stalling of equal importance for performance. Failing to predict
the processor execution on every branch, branch pre- that an instruction is a branch means that instructions
dictors are implemented in hardware. The branch pre- are fetched in sequence until the branch instruction
dictor predicts the address of the instruction following is decoded. Since decoding is performed early in the
the branch. pipeline, the instruction fetch stream can be repaired
very quickly. Likewise, failing to predict the target of
the direct branch is not very dramatic. The effective
Discussion target of a direct branch can be computed from the
Introduction instruction codeop and its address, thus the branch
Most instruction set architectures have a sequen- can be computed very early in the pipeline — gen-
tial semantic. However, most processors implement erally, it becomes available at the end of the decode
pipelining, instruction level parallelism, and out-of- stage. On the other hand, the direction of a condi-
order execution. Therefore on state-of-the-art proces- tional branch and the target of an indirect branch are
sors, when an instruction completes, more than one only known when the instruction has been executed,
hundred of subsequent instructions may already be i.e., very late in the pipeline, thus potentially generating
in progress in the processor pipeline. Enforcing the very long misprediction penalties, sometimes  or 
semantic of a branch is therefore a major issue: The cycles.
address of the instruction following a branch instruc- Since in many applications most branches are con-
tion is normally unknown before the branch completes. ditional and the penalty on a direction misprediction
Control flow instructions are quite frequent, up to is high, when one refers to branch prediction, one
% in some applications. Therefore to avoid stalling generally refers to predicting directions of conditional
the issuing of subsequent instructions until the branch branches. Therefore, most of this article is dedicated to
completes, microarchitects invented branch prediction, conditional branch predictions.
Branch Predictors B 

General Hardware Branch Prediction a lot of attention from the research community in the
Principle s and the early s.
Some instruction sets have included some software B
hints to help branch prediction. Hints like “likely taken” PC-Based Prediction Schemes
and “likely not-taken” have been added to the encoding It is natural to predict a branch based on its program
of the branch instruction. These hints can be inserted counter. When introducing conditional branch predic-
by the compilers based on application knowledge, e.g., tion [], Smith immediately introduced two schemes
a loop branch is likely to be taken, or on profiling infor- that capture the most frequent behaviors of the condi-
mation. However, the most efficient schemes are essen- tional branches.
tially hardware and do not rely on any instruction set Since most branches in a program tend to be biased
support. toward taken or not-taken, the simplest scheme con-
The general principle that has been used in hard- sisting to predict that a branch will follow the same
ware branch prediction scheme is to predict that the direction than the last time it has been executed is
behavior of the branch to be executed will be a replay quite natural. Hardware implementation of this simple
of its past behavior. Therefore, hardware branch predic- scheme necessitates to store only one bit per branch.
tors are based on memorization of the past behavior of When a conditional branch is executed, its direction is
the branches and some limited hardware logic. recorded, e.g., in the BTB along with the branch target.
A single bit is used:  encodes not-taken and  encodes
Predicting Branch Natures and Direct Branch taken. The next time the branch is fetched, the direction
Targets stored in the BTB is used as a prediction. This scheme is
For a given static instruction, its branch nature (is it a surprisingly efficient on most programs often achieving
branch or not) remains unchanged all along the pro- accuracy higher than %. However, for branches that
gram life, apart in case of self-modifying code. This privilege one direction and from time to time branch
stands also for targets of direct branches. The Branch in the other direction, this -bit scheme tends to predict
Target Buffer, or BTB [], is a special cache which aims the wrong direction twice in a row. This is the case for
at predicting whether an instruction is a branch and its instance, for loop branches which are taken except the
potential target. At fetch time, the BTB is checked with last iteration. The -bit scheme fails to predict the first
the instruction Program Counter. On a hit, the instruc- iteration and the last iteration.
tion is predicted to be a branch and the address stored in Smith [] proposed a slightly more complex
the hitting line of the BTB is assessed to be the potential scheme based on a saturated -bit counter automaton
target of the branch. The nature of the branch (uncondi- (Fig. ): The counter is incremented on a taken branch
tional, conditional, return, indirect jumps) is also read and decremented on a not-taken branch. The most sig-
from the BTB. nificant bit of the counter is used as the prediction.
A branch may miss the BTB, e.g., on its first execu- On branches exhibiting a strong bias, the -bit scheme
tion or after its eviction on a conflict miss. In this case, avoids to encounter two successive mispredictions after
the branch is written in the BTB: its kind and its tar- one occurrence of the non-bias direction. This -bit
get are stored in association with its Program Counter.
Note that on a BTB hit, the target may be mispredicted
on indirect branches and returns. This will shortly be Predict Taken Predict Not-Taken
presented in section on “Predicting Indirect Branch T T T T
Targets”.
3 2 1 0
N N N N
Predicting Conditional Branch Directions
Most branches are conditional branches and the penalty 2:Weakly Taken 0:Strongly Not Taken
on mispredicting the direction of a conditional branch
3:Strongly Taken 1:Weakly Not Taken
is really high on modern processors. Therefore, predict-
ing the direction of conditional branches has received Branch Predictors. Fig.  A -bit counter automaton
 B Branch Predictors

counter predictor is often referred to as a -bit bimodal 1


1
predictor. 1
0
1 for (i=0; i<100; i++)
If the last 3 iterations
History-Based Conditional Branch Prediction 1 for (j=0; j<4; j++)
have been taken then predict
1 loop body
Schemes not taken else predict taken
0
In the early s, branch prediction accuracy became 1
an important issue. The performance of superscalar 1
1
processors is improved by any reduction of branch mis- 0
prediction rate. The -bit and -bit prediction schemes
use a very limited history of the behavior of a program Branch Predictors. Fig.  A loop with four iterations
to predict the outcome direction of a branch. Yeh and
Patt [] and Pan and So [] proposed to use more m bits
information on the passed behavior of the program, try- PC
ing to better isolate the context in which the branch is
executed.
Two families of predictors were defined. Local 2m L bits local
history predictors use only the past behavior of the history
program on the particular branch. Global history pre-
dictors use the past behavior of the whole program on
histo L bits
all branches.

PHT
Local History Predictors 2n 2 bits counters
Using the past behavior of the branch to be predicted
appears attractive. For instance, on the loop nest illus-
Prediction 0/1
trated on Fig. , the number of iterations in the inner
loop is fixed (). When the outcome of the last three Branch Predictors. Fig.  A local history predictor
occurrences of the branch are known, the outcome of
the present occurrence is completely determined: If the
last three occurrences of the branch were taken then the Effective implementation of local history predic-
current occurrence is the last iteration, therefore is not-
tors is quite difficult. First the prediction requires to
taken otherwise the branch is taken. Local history is not
chain two successive table reads, thus creating a diffi-
limited to capturing the behavior of branches with fixed
cult latency issue for providing prediction in time for
number of iterations. For instance, it can capture the
use. Second, in aggressive superscalar processors, sev-
behavior of any branch exhibiting a periodic behavior
eral branches (sometimes tens) are progressing specu-
as long as the history length is equal or longer than the
latively in the pipeline. Several instances of the same
period.
branch could have been speculatively fetched and pre-
A local history predictor can be implemented as
dicted before the first occurrence is finally committed.
illustrated Fig. . For each branch, the history of its
The branch prediction should be executed using the
past behavior is recorded as a vector of bits of length
speculative history. Thus , maintaining correct specula-
L and stored in a local history table. The branch pro- tive history for local history predictors is very complex
gram counter is used as an index to read the local history for wide-issue superscalar processors.
table. Then the branch history is associated with the
program counter to read the prediction table. The pre-
diction table entries are generally saturated -bit coun- Global History Predictors
ters. The history vector must be adapted on each branch The outcome of a branch is often correlated with the
occurrence and stored back in the local history table. outcome of branches that have been recently executed.
Branch Predictors B 

For instance, in the example illustrated in Fig. , the out- may have occurred and two paths can be represented by
come of the third branch is completely determined by the same global history vector. The path vector combin-
the outcome of the first two branches. ing all the addresses of the last control flow instructions B
First generation global history predictors such as that lead to a branch is unique. Using the path history
GAs [] or gshare [] (Fig. ) are associating with instead of the global history vector generally results in a
a fixed length global history vector and the program slightly higher accuracy [].
counter of the branch to index a prediction table. These
predictors were shown to suffer from two antagonistic
phenomena. First, it was shown that using a very long
Hybrid Predictors
history is sometimes needed to capture correlation. Sec-
Global history predictors and local history predictors
ond, using a long history results in possible destructive
were shown to capture different branch behaviors. In
conflicts on a limited size prediction table.
, McFarling [] proposed to combine several pre-
A large body of research in the mid-s was dedi-
dictors to improve their accuracy. Even combining a
cated to reduce the impact of these destructive conflicts
global history predictor with a simple bimodal predictor
or aliasing [, , , ].
was shown to provide enhanced prediction accuracy.
By the end of the s, these dealiased global his-
The first propositions of hybrid predictors were to
tory predictors were known to be more accurate at equal
use a metapredictor to select a prediction. The metapre-
storage complexity than local predictors.
dictor is also indexed with the program counter and
Path or global history predictor: In some cases,
the branch history. The metapredictor learns which
the global conditional branch history vector does not
predictor component is more accurate. The bc-gskew
uniquely determine the instruction path that leads to a
predictor  proposed for the cancelled Compaq EV
particular branch; e.g., an indirect branch or a return
processor [] leveraged hybrid prediction to combine
several global history predictors including a major-
ity vote gskew predictor with different history lengths.
B1: if cond1 then ..
However, a metapredictor is not a cost-effective solution
B2: if cond2 then .. to select among more than two predictions [].
B3: if cond1 and cond 2 then ..

Branch Predictors. Fig.  Branch correlation: outcome of


branch  is uniquely determined by the outcomes of Toward Using Very Long Global History
branches  and  While some applications will be very accurately pre-
dicted using a short branch history length, limit stud-
n bits ies were showing that some benchmarks would benefit
PC from using very long history in s of bit range.
In , Jimenez and Lin [] proposed to use neu-
Hash PC and address e.g. xor ral nets inspired techniques to combine predictions. On
XOR
perceptron predictors, the prediction is computed as the
histo L bits
sign of a dot-product of a vector of signed prediction
counters read in parallel by the branch history vector
(Fig. ). Perceptron predictors allow to use very long
PHT
global history vectors, e.g,  bits, but suffer from a
2n 2 bits
counters long latency prediction computation and require huge
predictor table volumes.
Building on top of the neural net inspired predic-
tors, the GEometric History Length or GEHL predictor
Prédiction 0/1
(Fig. ) was proposed in  []. This predictor com-
Branch Predictors. Fig.  The gshare predictor bines a few global history predictors (typically from 
 B Branch Predictors

BIM bimodal prediction


n
address
PREDICTION

G0
e-gskew n majority
vote

e–gskew prediction
G1
n
address history

Meta
n
metaprediction

Branch Predictors. Fig.  The bc-gskew predictor

Signed 8-bit branch history Final prediction computation through a sum


conters as (-1,+1)
T0
T1

X Σ L(0)
T2
Σ
T3
L(1)
L(2) T4
Sign=prediction
L(3)
Prediction=Sign
L(4)
Update on mispredictions or if ⎮SUM⎮< θ
Branch Predictors. Fig.  The GEHL predictor: tables are
indexed using different global history length that forms a
Branch Predictors. Fig.  The perceptron predictor
geometric series

to ) indexed with different history lengths. The pre- but relies on partial tag matching for final prediction
diction is computed as the sign of the sum of the read computation. Each table in the predictor is read and the
predictions. The set of history lengths forms a geomet- prediction is provided by the hitting predictor compo-
ric series, for instance, , , , , , , , . The use of nent with the longest history. If there is no hitting com-
such a geometric series allows to concentrate most of ponent then the prediction is provided by the default
the storage budget on short histories while still captur- predictor.
ing long-run correlations on very long histories. Using a Realistic size TAGE currently represents the state
medium number of predictor tables (–) and maximal of the art in conditional branch predictors [] and its
history length of more than  bits was shown to be misprediction rate is within % of the currently known
realistic. limits for conditional branch predictabilty [].
While using an adder tree was shown as an effec-
tive final prediction computation function by the GEHL
predictor, partial tag matching may be even more stor- Predicting Indirect Branch Targets
age effective. The TAGE predictor (Fig. ) proposed in Return and indirect jump targets are also only known at
 [] uses the geometric history length principle, execution time and must be predicted.
Branch Predictors B 

pc pc h[0:L1] pc h[0:L2] pc h[0:L3]

hash has hash has has has


B
ctr tag ctr tag ctr tag
u u u

=? =? =?

1 1 1 1 1 1
1

1
Tagless base
Predictor 1

prediction

Branch Predictors. Fig.  The TAGE predictor: partial tag match determines the prediction

Predicting Return Targets predicting the direction of the conditional branches, i.e.,
Kaeli and Emma [] remarked that, in most cases and the global branch history or program path history.
particularly for compiler generated code the return First it was proposed to use the last encountered
address of procedure calls obey a very simple call-return target, i.e., just the entry in the BTB. Chang et al. [] pro-
rule: The return address target is the address just fol- posed to use a predictor indexed by the global history
lowing the call instruction. They also remarked that the and the program counter. Driesen and Holzle [] pro-
return addresses of the procedures could be predicted posed to use an hybrid predictor based on tag matching.
through a simple return address stack (RAS). On a call, Finally Seznec and Michand [] proposed ITTAGE, a
the address of the next instruction in sequence is pushed multicomponent indirect jump predictor, based on par-
on the top of the stack. On a return, the target of the tial tag matching and the use of geometric history length
return is popped from the top of the stack. as TAGE.
When the code is generated following the call–
return rule and if there is no branch misprediction
Conclusion
between the call and the return, an infinite size RAS pre-
In the modern processors, the whole set of branch pre-
dicts the return target with a % accuracy. However,
dictors (conditional, returns, indirect jumps, BTB) is
several difficulties arise for practical implementation.
an important performance enabler. They are particu-
The call–return rule is not always respected. The RAS is
larly important on deep pipelined wide-issue super-
size limited, but in practice a -entry RAS is sufficient
scalar processors, but they are also becoming important
for most applications. The main difficulty is associated
in low power embedded processors as they are also
with the speculative execution: When on the wrong
implementing instruction-level parallelism.
path returns are fetched followed by one or more calls,
A large body of research has addressed branch pre-
valid RAS entries are corrupted with wrong informa-
diction during the s and the early s and com-
tion, thus generating mispredictions on the returns on
plex predictors inspired by this research have been
the right path. Several studies [, , ] have addressed
implemented in processors during the last decade. Cur-
this issue, and the effective accuracy of return target
rent state-of-the-art branch predictors combine multi-
prediction is close to perfect.
ple predictions and rely on the use of global history.
These predictors cannot deliver a prediction in a sin-
gle cycle. Since, prediction is needed on the very next
Indirect Jump Target Predictions cycle, various techniques such as overriding predictors
To predict the targets of indirect jumps, one can use [] or ahead pipelined predictors [] were proposed
the same kind of information that can be used for and implemented.
 B Brent’s Law

Branch predictor accuracy seems to have reached a . Seznec A () Analysis of the o-gehl branch predictor. In: Pro-
plateau since the introduction of TAGE. Radically new ceedings of the nd Annual International Symposium on Com-
puter Architecture, Madison, – June . IEEE, Los Alamitos
predictor ideas, probably new sources of predictabil-
. Seznec A () The idealistic gtl predictor. J Instruction Level
ity, will probably be needed to further increase the Parallelism. http://www.jilp.org/vol
predictor accuracy. . Seznec A () The l-tage branch predictor. J Instruction Level
Parallelism. http://www.jilp.org/vol
Bibliography . Seznec A, Fraboulet A () Effective a head pipelining of the
instruction addres generator. In: Proceedings of the th Annual
. Chang P-Y, Hao E, Patt YN () Target prediction for indirect
International Symposium on Computer Architecture, San Diego,
jumps. In: ISCA ’: Proceedings of the th annual international
– June . IEEE, Los Alamitos
symposium on Computer architecture, Denver, – June .
. Seznec A, Michaud P () A case for (partially)-tagged geo-
ACM Press, New York, pp –
metric history length predictors. J Instruction Level Parallelism.
. Diefendorff K () Compaq chooses SMT for alpha. Micropro-
http://www.jilp.org/vol
cessor Report, Dec 
. Skadron K, Martonosi M, Clark D () Speculative updates
. Driesen K, Holzle U () The cascaded predictor: economical
of local and global branch history: a quantitative analysis. J
and adaptive branch target prediction. In: Proceeding of the st
Instruction-Level Parallelism 
Annual ACM/IEEE International Symposium on Microarchitec-
. Smith J () A study of branch prediction strategies. In: Proceed-
ture, Dallas, Dec , pp –
ings of the th Annual International Symposium on Computer
. Eden AN, Mudge T () The yags branch predictor. In: Proceed-
Architecture, May . ACM, New York, pp –
ings of the st Annual International Symposium on Microarchi-
. Sprangle E, Chappell R, Alsup M, Patt Y () The agree pre-
tecture, Dallas, Dec 
dictor: a mechanism for reducing negative branch history inter-
. Evers M () Improving branch behavior by understanding
ference. In: th Annual International Symposium on Computer
branch behavior. Ph.D. thesis, University of Michigan
Architecture, Denver, – June . ACM, New York
. Jiménez D () Reconsidering complex branch predictors. In:
. Vandierendonck H, Seznec A () Speculative return address
Proceedings of the th International Symposium on High Perfor-
stack management revisited. ACM Trans Archit Code Optim
mance Computer Architecture, Anaheim, – Feb . IEEE,
():–
Los Alamitos
. Yeh T-Y, Patt Y () Two-level adaptive branch prediction. In:
. Jiménez DA, Lin C () Dynamic branch prediction with per-
Proceedings of the th International Symposium on Microarchi-
ceptrons. In: HPCA: Proceedings of the th International Sympo-
tecture, Albuquerque, – Nov . ACM, New York
sium on High Performance Computer Architecture, Monterrey,
. Yeh T-Y, Patt YN () Two-level adaptive training branch pre-
– Jan . IEEE, Los Alamitos, pp –
diction. In: Proceedings of the th Annual International Sympo-
. Jourdan S, Hsing T-H, Stark J, Patt YN () The effects of
sium on Microarchitecture, Albuquerque, – Nov . ACM,
mispredicted-path execution on branch prediction structures. In:
New York
Proceedings of the International Conference on Parallel Architec-
tures and Compilation Techniques, Boston, – Oct . IEEE,
Los Alamitos
. Kaeli DR, Emma PG () Branch history table prediction of
moving target branches due to subroutine returns. SIGARCH Brent’s Law
Comput Archit News ():–
. Lee C-C, Chen I-C, Mudge T () The bi-mode branch predic- Brent’s Theorem
tor. In: Proceedings of the th Annual International Symposium
on Microarchitecture, Dec 
. Lee J, Smith A () Branch prediction strategies and branch
target buffer design. IEEE Comput ():–
Brent’s Theorem
. McFarling S () Combining branch predictors, TN , John L. Gustafson
DECWRL, Palo Alto, June 
Intel Corporation, Santa Clara, CA, USA
. Michaud P, Seznec A, Uhlig R () Trading conflict and capac-
ity aliasing in conditional branch predictors. In: Proceedings of
the th Annual International Symposium on Computer Archi- Synonyms
tecture (ISCA), Denver, – June . ACM, New York Brent’s law
. Pan S, So K, Rahmeh J () Improving the accuracy of dynamic
branch prediction using branch correlation. In: Proceedings of the
th International Conference on Architectural Support for Pro-
Definition
gramming Languages and Operating Systems, Boston, – Oct Assume a parallel computer where each processor
. ACM, New York can perform an arithmetic operation in unit time.
Brent’s Theorem B 

Further, assume that the computer has exactly enough Example


processors to exploit the maximum concurrency in an Suppose the task is to solve the following system of two
algorithm with N operations, such that T time steps suf- equations in two unknowns, u and v: B
fice. Brent’s Theorem says that a similar computer with
fewer processors, P, can perform the algorithm in time au + bu = x
N−T cu + du = y
TP ≤ T + ,
P
where P is less than or equal to the number of proces- One solution method is Cramer’s Rule, which is far
sors needed to exploit the maximum concurrency in the less efficient than Gaussian elimination in general, but
algorithm. exposes so much parallelism that it can take less time on
a PRAM computer. Figure  shows how a PRAM with six
processors can calculate the solution values, u and v in
Discussion
only three time steps:
Brent’s Theorem assumes a PRAM (Parallel Random
Thus, T =  time steps, and P ≤ , the amount of
Access Machine) model [, ]. The PRAM model is an
concurrency in the first time step. The total number of
idealized construct that assumes any number of pro-
arithmetic operations, N, is  (six multiplications, fol-
cessors can access any items in memory instantly, but
lowed by three subtractions, followed by two divisions).
then take unit time to perform an operation on those
Brent’s Theorem tells us an upper bound on the time a
items. Thus, PRAM models can answer questions about
PRAM with four processors would take to perform the
how much arithmetic parallelism one can exploit in an
algorithm:
algorithm if the communication cost were zero. Since
many algorithms have very large amounts of theoretical
N−T
PRAM-type arithmetic concurrency, Brent’s Theorem TP ≤ T + , so
P
bounds the theoretical time the algorithm would take  − 
on a system with fewer processors. T ≤  + = .

PRAM variables: a b c d x y

Timestep 1 a×d b×c c×y d×x a×y b×x

Timestep 2 ad – bc = D dx – cy ay – bx

Timestep 3 (dx – cy) ÷ D (ay – bx) ÷ D

Brent’s Theorem. Fig.  Concurrency of a solver for two equations in two unknowns

Timestep 1 a×d b×c c×y d×x

Timestep 2 a×y b×x

Timestep 3 ad – bc = D dx – cy ay – bx

Timestep 4 (dx – cy) ÷ D (ay – bx) ÷ D

Brent’s Theorem. Fig.  Linear solver on only four processors


 B Brent’s Theorem

k1 k2 k3 k4 k5 k6 k7 k8 k9 k10 k11 k12 k13 k14 k15 k16

Timestep 1 + + + + + + + +

Timestep 2 + + + +

Timestep 3 + +

Timestep 4 +

Brent’s Theorem. Fig.  Binary sum collapse with a maximum parallelism of eight processors

Figure  shows that four time steps suffice in this case, Proof of Brent’s Theorem
fewer than the five in the bound predicted by Brent’s Let nj be the number of operations in step j
Theorem. The figure omits dataflow arrows, for clarity. that the PRAM can execute concurrently, where j is
Brent’s Theorem says the four-processor system , , . . . , T. Assume the PRAM has P processors, where
should require no more than five time steps, and Fig.  P ≤ max(nj ).
j
shows this is clearly the case. A property of the ceiling function is that for any two
positive integers k and m,
Asymptotic Example: Parallel Summation
A common use of Brent’s Theorem is to place bounds k k+m−
⌈ ⌉≤ .
on the applicability of parallelism to problems parame- m m
terized by size. For example, to find the sum of a list of Hence, the time for the PRAM to perform each time
n numbers, a PRAM can add the numbers in pairs, then step j has the following bound:
the resulting sums in pairs, and so on until the result is
a single summation value. Figure  shows such a binary nj nj + P − 
⌈ ⌉≤ .
sum collapse for a set of  numbers. P P
In the following discussion, the notation
The time for all time steps has the bound
lg(x) denotes the logarithm base  of x, and the
notation ⌈X⌉ denotes the “ceiling” function, the smallest T nj T n +P−
j
integer larger than x. TP ≤ ∑⌈ ⌉≤∑
j= P j= P
If n is exactly a power of , such that n = T , then T T
nj P T 
the PRAM can complete the sum in T time steps. That =∑ +∑ −∑ .
is, T = lg(n). If n is not an integer power of , then j= P j= P j= P

T = ⌈lg(n)⌉ because the summation takes one addi-


Since the sum of the nj is the total number of opera-
tional step. The total number of arithmetic operations
tions N, and the sum of  for T steps is simply T, this
for the summation of n numbers is n –  additions.
simplifies to
The maximum parallelism occurs in the first time
step, when n/ processors can operate on the list con- ∑ nj T ∑ N −T
currently. If the number of processors in a PRAM is P, TP ≤ +∑− =T+ .
P j= P P
and P is less than n/, then Brent’s Theorem shows []
that the execution time can be made less than or equal to
(n − ) − ∣lg(n)∣ Application to Superlinear Speedup
TP = ⌈lg(n)⌉ + .
P Ideal speedup is speedup that is linear in the number of
For values of n much larger than P, the above formula processors. Under the strict assumptions of Brent’s The-
shows TP ≈ Pn . orem, superlinear speedup (or superunitary efficiency,
Brent’s Theorem B 

sometimes erroneously termed superunitary speedup) Related Entries


is impossible. Bandwidth-Latency Models (BSP, LogP)
Superlinear speedup means that the time to per- Memory Wall B
form an algorithm on a single processor is more than P PRAM (Parallel Random Access Machines)
times as long as the time to perform it on P processors.
In other words, P processors are more than P times as
fast as a single processor. According to Brent’s Theorem
Bibliographic Notes and Further
with P = ,
Reading
As explained above, Brent’s Theorem reflects the com-
N −T puter design issues of the s, and readers should view
T ≤ T + = N.
 Brent’s original  paper in this context. It was in 
that the esoteric nature of the PRAM model became
Speedup greater than the number or processors means clear, in the opposing papers [] and []. Faber, Lubeck,
T > P ⋅ TP . Brent’s Theorem says this is mathemat- and White [] assumed the PRAM model to state that
ically impossible. However, Brent’s Theorem is based one cannot obtain superlinear speedup since computing
on the assumptions of the PRAM model that have lit- resources increase linearly with the number of proces-
tle resemblance to real computer design, such as the sors. Parkinson [], having had experience with the ICL
assumptions that every arithmetic operation requires Distributed Array Processor (DAP), based his model
the same amount of time, and that memory access has and experience on premises very different from the
zero latency and infinite bandwidth. The debate over hypothetical PRAM model used in Brent’s Theorem.
the possibility of superlinear speedup first appeared in Parkinson noted that simple constructs like the sum of
 [, ], and the debate highlighted the oversim- two vectors could take place superlinearly faster on P
plification of the PRAM model for parallel computing processors because a shared instruction to add, sent to
performance analysis. the P processors, does not require the management of a
loop counter and address increment that a serial proces-
sor requires. The  work Introduction to Algorithms
Perspective
by Cormen, Leiserson, and Rivest [] is perhaps the first
The PRAM model used by Brent’s Theorem assumes
textbook to formally recognize that the PRAM model is
that communication has zero cost, and arithmetic work
too simplistic, and thus Brent’s Theorem has diminished
constitutes all of the work of an algorithm. The oppo-
predictive value.
site is closer to the present state of computer tech-
nology (see Memory Wall), which greatly diminishes
the usefulness of Brent’s Theorem in practical problem Bibliography
solving. It is so esoteric that it does not even provide . Brent RP () The parallel evaluation of general arithmetic
useful upper or lower bounds on how parallel process- expressions. J ACM ():–
ing might improve execution time, and modern per- . Cole R () Faster optimal parallel prefix sums and list ranking.
formance analysis seldom uses it. Brent’s Theorem is Inf Control ():–
. Cormen TH, Leiserson CE, Rivest RL () Introduction to algo-
a mathematical model more related to graph theory
rithms, MIT Press Cambridge
and partial orderings than to actual computer behavior.
. Faber V, Lubeck OM, White AB () Superlinear speedup of
When Brent constructed his model in , most com- an efficient sequential algorithm is not possible. Parallel Comput
puters took longer to perform arithmetic on operands :–
than they took to fetch and store the operands, so the . Helmbold DP, McDowell CE () Modeling speedup (n) greater
approximation was appropriate. More recent abstract than n. In: Proceedings of the international conference on parallel
processing, :–
models of parallelism take into account communica-
. Parkinson D () Parallel efficiency can be greater than unity.
tion costs, both bandwidth and latency, and thus can Parallel Comput :–
provide better guidance for the parallel performance . Smith JR () The design and analysis of parallel algorithms.
bounds of current architectures. Oxford University Press, New York
 B Broadcast

● All nodes can communicate through a communica-


Broadcast tion network.
● Individual nodes perform communication opera-
Jesper Larsson Träff , Robert A. van de Geijn tions that send and/or receive individual messages.

University of Vienna, Vienna, Austria
 ● Communication is through a single port, such that a
The University of Texas at Austin, Austin, TX, USA
node can be involved in at most one communication
operation at a time. Such an operation can be either
Synonyms a send to or a receive from another node (unidi-
One-to-all broadcast; Copy rectional communication), a combined send to and
receive from another node (bidirectional, telephone
Definition like communication), or a send to and receive from
Among a group of processing elements (nodes), a two possibly different nodes (simultaneous send-
designated root node has a data item to be com- receive, fully bidirectional communication).
municated (copied) to all other nodes. The broad- ● It is assumed that the communication medium is
cast operation performs this collective communication homogeneous and fully connected such that all
operation. nodes can communicate with the same costs and any
maximal set of pairs of disjoint nodes can commu-
Discussion nicate simultaneously.
The reader may consider first visiting the entry on ● A reasonable first approximation for the time for
Collective Communication. transferring a message of size n between (any) two
Let p be the number of nodes in the group that nodes is α + nβ where α is the start-up cost (latency)
participate in the broadcast operation and number and β is the cost per item transfered (inverse of the
these nodes consecutively from  to p − . One node, bandwidth).
the root with index r, has a vector of data, x of
size n, to be communicated to the remainder p −  Under these assumptions, two lower bounds for the
nodes: broadcast operation can be easily justified, the first for
the α term and the second for the β term:
Before After ● ⌈log p⌉α. Define a round of communication as a
Node r Node  Node  Node r Node  Node  period during which each node can send at most one
x x x x message and receive at most one message. In each
round, the number of nodes that know message x
All nodes are assumed to explicitly take part in the can at most double, since each node that has x can
broadcast operation. It is generally assumed that before send x to a new node. Thus, a minimum of ⌈log p⌉
its execution all nodes know the index of the desig- rounds are needed to broadcast the message. Each
nated root node as well as the the amount n of data round costs at least α.
to be broadcast. The data item x may be either a sin- ● nβ. If p >  then the message must leave the root
gle, atomic unit or divisible into smaller, disjoint pieces. node, requiring a time of at least nβ.
The latter can be exploited algorithmically when n is
When assumptions about the communication system
large.
change, these lower bounds change as well. Lower
Broadcast is a key operation on parallel systems
bounds for mesh and torus networks, hypercubes, and
with distributed memory. On shared memory sys-
many other communication networks are known.
tems, broadcast can be beneficial for improving locality
and/or avoiding memory conflicts.
Tree-Based Broadcast Algorithms
Lower Bounds A well-known algorithm for broadcasting is the so-
To obtain some simple lower bounds, it is assumed that called Minimum Spanning Tree (MST) algorithm
Broadcast B 

Path
a ...

B
Binary tree

Fibonacci trees, F0, F1, F2, F3, F4

Binomial trees, B0, B1, B2, B3, B4

Star graph

e ...

Broadcast. Fig.  Commonly used broadcast trees: (a) Path; (b) Binary Tree; (c) Fibonacci trees, the ith Fibonacci tree for
i >  consists of a new root node connected to Fibonacci trees Fi− and Fi− ; (d) Binomial trees, the ith binomial tree for i > 
consists of a new root with children Bi− , Bi− , . . .; (e) Star

(which is a misnomer, since all broadcast trees are span- If the data transfer between two nodes is represented
ning trees and minimum for homogeneous communi- by a communication edge between the two nodes, the
cation media), which can be described as follows: MST algorithm constructs a binomial spanning tree
over the set of nodes. An equivalent (also recursive)
● Partition the set of nodes into two roughly equal- construction is as follows. The th binomial tree B con-
sized subsets. sists of a single node. For i >  the ith binomial tree Bi
● Send x from the root to a node in the subset that does consists of the root with i children Bj for j = i − , . . . , .
not include the root. The receiving node will become The number of nodes in Bi is i . It can be seen that the
a local root in that subset. number of nodes at level i is (logi p) from which the term

● Recursively broadcast x from the (local) root nodes originates.
in the two subsets. The construction and structure of the binomial
(MST) tree is shown in Fig. d.
Under the stated communication model, the total cost
of the MST algorithm is ⌈log p⌉(α + nβ). It achieves
the lower bound for the α term but not for the β term Pipelining
for which it is a logarithmic factor off from the lower For large item sizes, pipelining is a general technique to
bound. improve the broadcast cost. Assume first that the nodes
 B Broadcast

are communicating along a directed path with node  broadcast cost of


being the root and node i sending to node i +  for √ √
(logΘ p − )α +  (logΘ p − )α βn + βn,
 ≤ i < p −  (node p −  only receives data). This is

shown in Fig. a. The message to be broadcast is split where Θ = +   . Finally it should be noted that pipelin-
into k blocks of size n/k each. In the first round of the ing cannot be employed with any advantage for trees
algorithm, the first block is sent from node  to node . with nonconstant degrees like the binomial tree and the
In the second round, this block is forwarded to node  degenerated star tree shown in Fig. d–e.
while the second block is sent to node . In this fashion, A third lower bound for broadcasting of data k items
a pipeline is established that communicates the blocks can be justified similarly to the two previously intro-
to all nodes. duced bounds, and is applicable to algorithms that apply
Under the prior model, the cost of this algorithm is pipelining.
n k+p− ● k −  + ⌈log p⌉ rounds. First k −  items have to leave
(k + p − )(α + β) = (k + p − )α + nβ
k k the root, and the number of rounds for the last item
p− to arrive at some last node is an additional ⌈log p⌉.
= (p − )α + nβ + kα + nβ.
k
With this bound, the best possible broadcast cost in
It takes p −  communication rounds for the first piece
the linear cost model is therefore (k −  + log p)(α +
to reach node p − , which afterward in each succes-
βn/k) when x is (can be) divided into k roughly equal-
sive round receives a new block. Thus, an additional
sized blocks of size at most ⌈n/k⌉. Balancing the kα term
k −  rounds are required for a total of k + p −  rounds.
p− against the (log p − )βn/k term) achieves a minimum
Balancing the kα term against the k nβ term gives a
time of
minimum time of √ √ 
( (log p − )α + βn ) = (log p − )α
√ √ √

( (p − )α + βn ) = (p − )α +  (p − )nαβ + nβ. √ √
+  (log p − )α βn + βn


and best possible number of blocks k = (p−)βn . This (log  p−)βn
α with the best number blocks being k = α
.
meets the lower bound for the β term, but not for the α
term. For very large n compared to p the linear pipeline
Simultaneous Trees
can be a good (practical) algorithm. It is straightforward
None of the tree-based algorithm were optimal in the
to implement and has very small extra overhead.
sense of meeting the lower bound for both the α and the
For the common case where p and/or the start-up
β terms. A practical consequence of this is that broad-
latency α is large compared to n, the linear pipeline suf-
cast implementations become cumbersome in that algo-
fers from the nonoptimal (p−)α term, and algorithms
rithms for different combinations of p and n have to be
with a shorter longest path from root node to receiving
maintained. The breakthrough results of Johnsson and
nodes will perform better. Such algorithms can apply
Ho [] show how employing multiple, simultaneously
pipelining using different, fixed degree trees, as illus-
active trees can be used to overcome these limitations
trated in Fig. a–c. For instance, with a balanced binary
and yield algorithms that are optimal in the number of
tree as shown in Fig. b the broadcast time becomes
communication rounds. The results were at first formu-
√ √
(log p − ) +  (log p − )α βn + βn. lated for hypercubes or fully connected communication
networks where p is a power of two, and (much) later
The latency of this algorithm is significantly better than extended to communication networks with arbitrary
the linear pipeline, but the β term is a factor of  off number of nodes. The basic idea can also be employed
from optimal. By using instead skewed trees, the time for meshes and hypercubes.
at which the last node receives the first block of x The idea is to embed simultaneous, edge disjoint
be improved which affects both α and β terms. The spanning trees in the communication network, and use
Fibonacci tree shown in Fig. c for instance achieves a each of these trees to broadcast different blocks of the
Broadcast B 

input. If this embedding can be organized such that blocks, and so on. The broadcasting is done along a reg-
each node has the same number of incoming edges ular pattern, which is symmetric for all nodes. For a
(from different trees) and outgoing edges (to possibly d-dimensional hypercube, node i in round t, t ≥ , sends B
different trees), a form of pipelining can be employed, and receives a block to and from the node found by tog-
even if the individual trees are not necessarily fixed gling bit t mod d. As can be seen, each node is a leaf
degree trees. To illustrate the idea, consider the three- in tree j if bit j of i is , otherwise an internal node in
dimensional hypercube, and assume that an infinite tree j. Leaves in tree j = t mod d receive block t − d
number of blocks have to be broadcast. in round d. When node i is an internal node in tree j,
100 101 the block received in round j is sent to the children of
i which are the nodes found by toggling the next, more
000 001 significant bits after j until the next position in i that is a
. Thus, to determine which blocks are received and sent
by node i in each round, the number of zeroes from each
110 111 bit position in i until the next, more significant  will
suffice.
010 011 The algorithm for broadcasting k blocks in the opti-
mal k −  + log p number of rounds is given formally in
Let the nodes be numbered in binary with the Fig. . To turn the description above into an algorithm
broadcast root being node 000. The three edge-disjoint for an arbitrary, finite number of blocks, the follow-
trees used for broadcasting blocks , , , . . .; , , , . . . ing modifications are necessary. First, any block with
and , , , . . .; respectively, are as shown below: a negative number is neither sent nor received. Sec-
ond, for blocks with a number larger than k − , block
100 101 100 101 k −  is taken instead. Third, if k −  is not a multiple
000 001 000 001
of log p the broadcast is started at a later round f such
that indeed k + f −  is a multiple of log p. The table
110 111 110 111
BitDistancei [j] stores for each node i the distance
from bit position j in i to the next  to the left of posi-
010 011 010 011
tion j (with wrap around after d bits). The root 000 has
100 101 no ones. This node only sends blocks, and and there-
fore BitDistance [k] = k for  ≤ k < d. For each i,
000 001
 ≤ i < p the table can be filled in O(log p) steps.
110 111
Composing from Other Collective
010 011
Communications
Another approach to implementing the broadcast
The trees are in fact constructed as edge disjoint operation follows from the observation that data
spanning trees excluding node 000 rooted at the root can be broadcast to p nodes by scattering p dis-
hypercube neighbors 001, 010, and 100. The root is joint blocks of x of size n/p across the nodes, and
connected to each of these trees. In round , block  then reassembling the blocks at each node by an
is sent to the first tree, which in rounds , , and  is allgather operation. On a network that can host a
responsible for broadcasting to its spanning subtree. In hypercube, the scatter and allgather operations can
round , block  is sent to the second tree which uses p−
be implemented at a cost of (log p)α + p nβ each,
rounds , , and  to broadcast, and in round  block under the previously used communication cost model
 is sent to the third tree which uses rounds , , and (see entries on scatter and allgather). This yields a
 to broadcast this block. In round , the root sends cost of
the third block, the broadcasting of which takes place p−
simultaneously with the broadcasting of the previous  log  pα +  nβ.
p
 B Broadcast

f ← ((k mod d) + d − 1) mod d /* Start round for first phase */


t←0
while t < k + d − 1 do
/* New phase consisting of (up to) d rounds */
for j ← f to d − 1
s ← t − d + (1 − ij ) ∗ BitDistancei [j] /* block to send */
r ← t − d + ij ∗ BitDistancei [j] /* block to receive */
if s ≥ k then s ← k − 1
if r ≥ k then r ← k − 1
par/* simultaneous send-receive with neighbor */
if s ≥ 0 then Send block s to node (i xor 2j )
if r ≥ 0 then Receive block r from node (i xor 2j )
end par
t ← t + 1 /* next round */
end for
f ← 0 /* next phases start from 0 */
endwhile

Broadcast. Fig.  The algorithm for node i,  ≤ i < p for broadcasting k blocks in a d-dimensional hypercube or fully
connected network with p = d nodes. The algorithm requires the optimal number of k −  + d rounds. The jth bit of i is
denoted ij and is used to determine the hypercube neighbor of node i for round j

The cost of this algorithm is within a factor two of the lower bounds for both the α and β term, and it is considerably simpler to implement than approaches that use pipelining.

General Graphs
Good broadcast algorithms are known for many different communication networks under various cost models. However, the problem of finding a best broadcast schedule for an arbitrary communication network is a hard problem. More precisely, the following problem of determining whether a given number of rounds suffices to broadcast in an arbitrary, given graph is NP-complete [, Problem ND]: Given an undirected graph G = (V, E), a root vertex r ∈ V, and an integer k (number of rounds), is there a sequence of vertex and edge subsets {r} = V0, E1, V1, E2, V2, . . . , Ek, Vk = V with Vi ⊆ V, Ei ⊆ E, such that each e ∈ Ei has one endpoint in Vi−1 and one in Vi, no two edges of Ei share an endpoint, and Vi = Vi−1 ∪ {w ∣ (v, w) ∈ Ei}?

Related Entries
Allgather
Collective Communication
Collective Communication, Network Support for
Message Passing Interface (MPI)
PVM (Parallel Virtual Machine)

Bibliographic Notes and Further Reading
Broadcast is one of the most thoroughly studied collective communication operations. Classical surveys with extensive treatment of broadcast (and allgather/gossiping) problems under various communication and network assumptions can be found in [, ]. For a survey of broadcasting in distributed systems, which raises many issues not discussed here, see []. Well-known algorithms that have been implemented in, for instance, Message Passing Interface (MPI) libraries include the MST algorithm, binary trees, and scatter-allgather approaches. Fibonacci trees for broadcast were explored in []. Another skewed, pipelined tree structure, termed fractional trees, that yields close to optimal results for certain ranges of p and n was proposed in [].
The basic broadcast techniques discussed in this entry date back to the early days of parallel computing [, ]. The scatter/allgather algorithm was already discussed in [] and was subsequently popularized for mesh architectures in []. It was further popularized for use in MPI implementations in []. A different implementation of this paradigm was given in []. Modular construction of hybrid algorithms from MST broadcast and scatter/allgather is discussed in [].
The classical paper by Johnsson and Ho [] introduced the simultaneous tree algorithm that was
discussed, which they call the n-ESBT algorithm. This algorithm can be used for networks that can host a hypercube and has been used in practice for hypercubes and fully connected systems. It achieves the lower bound on the number of communication rounds needed to broadcast k blocks of data. The exposition given here is based on [, ]. It was for a number of years an open problem how to achieve similar optimality for arbitrary p (and k). The first round-optimal algorithms were given in [, ], but these seem not to have been implemented. A different, explicit construction was found and described in [], and implemented in an MPI library. A very elegant (and practical) extension of the hypercube n-ESBT algorithm to arbitrary p was presented in []. That algorithm uses the hypercube algorithm for the largest i such that 2^i ≤ p. Each node that is not in the hypercube is paired up with a hypercube node, and the two nodes in each such pair jointly, in an alternating fashion, carry out the work of a hypercube node. Yet another algorithm, optimal up to a lower-order term and based on using two edge-disjoint binary trees, was given in []. Disjoint spanning tree algorithms for multidimensional mesh and torus topologies were given in [].
The linear model used for modeling communication (transfer) costs is folklore. An arguably more accurate performance model of communication networks is the so-called LogGP model (and its variants), which accounts more accurately for the time in which processors are involved in data transfers. With this model, yet other broadcast tree structures yield the best performance [, ]. The so-called postal model, in which a message sent at some communication round is received a number of rounds λ later at the destination, was used in [, ] and gives rise to yet more tree algorithms. Broadcasting in heterogeneous systems has recently received renewed attention [, , ].
Finding minimum round schedules for broadcast remains NP-hard for many special networks []. General approximation algorithms have been proposed, for instance, in [].

Bibliography
. Bar-Noy A, Kipnis S () Designing broadcasting algorithms in the postal model for message-passing systems. Math Syst Theory ():–
. Bar-Noy A, Kipnis S, Schieber B () Optimal multiple message broadcasting in telephone-like communication systems. Discret Appl Math (–):–
. Barnett M, Payne DG, van de Geijn RA, Watts J () Broadcasting on meshes with wormhole routing. J Parallel Distrib Comput ():–
. Beaumont O, Legrand A, Marchal L, Robert Y () Pipelining broadcast on heterogeneous platforms. IEEE Trans Parallel Distrib Syst ():–
. Bruck J, De Coster L, Dewulf N, Ho C-T, Lauwereins R () On the design and implementation of broadcast and global combine operations using the postal model. IEEE Trans Parallel Distrib Syst ():–
. Bruck J, Cypher R, Ho C-T () Multiple message broadcasting with generalized Fibonacci trees. In: Symposium on Parallel and Distributed Processing (SPDP). IEEE Computer Society Press, Arlington, Texas, USA, pp –
. Chan E, Heimlich M, Purkayastha A, van de Geijn RA () Collective communication: theory, practice, and experience. Concurr Comput ():–
. Culler DE, Karp RM, Patterson D, Sahay A, Santos EE, Schauser KE, Subramonian R, von Eicken T () LogP: A practical model of parallel computation. Commun ACM ():–
. Défago X, Schiper A, Urbán P () Total order broadcast and multicast algorithms: taxonomy and survey. ACM Comput Surveys ():–
. Fox G, Johnson M, Lyzenga M, Otto S, Salmon J, Walker D () Solving problems on concurrent processors, vol I. Prentice-Hall, Englewood Cliffs
. Fraigniaud P, Lazard E () Methods and problems of communication in usual networks. Discret Appl Math (–):–
. Fraigniaud P, Vial S () Approximation algorithms for broadcasting and gossiping. J Parallel Distrib Comput :–
. Garey MR, Johnson DS () Computers and intractability: a guide to the theory of NP-completeness. Freeman, San Francisco (With an addendum, )
. Hedetniemi SM, Hedetniemi T, Liestman AL () A survey of gossiping and broadcasting in communication networks. Networks :–
. Jansen K, Müller H () The minimum broadcast time problem for several processor networks. Theor Comput Sci (&):–
. Jia B () Process cooperation in multiple message broadcast. Parallel Comput ():–
. Johnsson SL, Ho C-T () Optimum broadcasting and personalized communication in hypercubes. IEEE Trans Comput ():–
. Kwon O-H, Chwa K-Y () Multiple message broadcasting in communication networks. Networks :–
. Libeskind-Hadas R, Hartline JRK, Boothe P, Rae G, Swisher J () On multicast algorithms for heterogeneous networks of workstations. J Parallel Distrib Comput :–
. Liu P () Broadcast scheduling optimization for heterogeneous cluster systems. J Algorithms ():–
. Saad Y, Schultz MH () Data communication in parallel architectures. Parallel Comput ():–
. Sanders P, Sibeyn JF () A bandwidth latency tradeoff for broadcast and reduction. Inf Process Lett ():–
. Sanders P, Speck J, Träff JL () Two-tree algorithms for full bandwidth broadcast, reduction and scan. Parallel Comput :–
. Santos EE () Optimal and near-optimal algorithms for k-item broadcast. J Parallel Distrib Comput ():–
. Thakur R, Gropp WD, Rabenseifner R () Improving the performance of collective operations in MPICH. Int J High Perform Comput Appl :–
. Träff JL () A simple work-optimal broadcast algorithm for message-passing parallel systems. In: Recent advances in parallel virtual machine and message passing interface. th European PVM/MPI users' group meeting. Lecture Notes in Computer Science, vol . Springer, pp –
. Träff JL, Ripke A () Optimal broadcast for fully connected processor-node networks. J Parallel Distrib Comput ():–
. Watts J, van de Geijn RA () A pipelined broadcast for multidimensional meshes. Parallel Process Lett :–

BSP
Bandwidth-Latency Models (BSP, LogP)
BSP (Bulk Synchronous Parallelism)

BSP (Bulk Synchronous Parallelism)
Alexander Tiskin
University of Warwick, Coventry, UK

Definition
Bulk-synchronous parallelism is a type of coarse-grain parallelism, where inter-processor communication follows the discipline of strict barrier synchronization. Depending on the context, BSP can be regarded as a computation model for the design and analysis of parallel algorithms, or a programming model for the development of parallel software.

Discussion
Introduction
The model of bulk-synchronous parallel (BSP) computation was introduced by Valiant [] as a "bridging model" for general-purpose parallel computing. The BSP model can be regarded as an abstraction of both parallel hardware and software, and supports an approach to parallel computation that is both architecture-independent and scalable. The main principles of BSP are the treatment of a communication medium as an abstract fully connected network, and the decoupling of all interaction between processors into point-to-point asynchronous data communication and barrier synchronization. Such a decoupling allows an explicit and independent cost analysis of local computation, communication, and synchronization, all of which are viewed as limited resources.

BSP Computation
The BSP Model
A BSP computer (see Fig. ) contains
● p Processors; each processor has a local memory and is capable of performing an elementary operation or a local memory access every time unit.
● A communication environment, capable of accepting a word of data from every processor, and delivering a word of data to every processor, every g time units.
● A barrier synchronization mechanism, capable of synchronizing all the processors simultaneously every l time units.

BSP (Bulk Synchronous Parallelism). Fig.  The BSP computer

The processors may follow different threads of computation, and have no means of synchronizing with one another apart from the global barriers.
A BSP computation is a sequence of supersteps (see Fig. ). The processors are synchronized between supersteps; the computation within a superstep is completely asynchronous. Consider a superstep in which every processor performs a maximum of w local operations, sends a maximum of hout words of data, and receives a maximum of hin words of data. The value w is the local computation cost, and h = hout + hin is the communication cost of the superstep. The total superstep cost is defined
BSP (Bulk Synchronous Parallelism). Fig.  BSP computation

as w + h ⋅ g + l, where the communication gap g and the latency l are parameters of the communication environment. For a computation comprising S supersteps with local computation costs w_s and communication costs h_s, 1 ≤ s ≤ S, the total cost is W + H ⋅ g + S ⋅ l, where
● W = ∑_{1 ≤ s ≤ S} w_s is the total local computation cost.
● H = ∑_{1 ≤ s ≤ S} h_s is the total communication cost.
● S is the synchronization cost.
The values of W, H, and S typically depend on the number of processors p and on the problem size.
In order to utilize the computer resources efficiently, a typical BSP program regards the values p, g, and l as configuration parameters. Algorithm design should aim to minimize local computation, communication, and synchronization costs for any realistic values of these parameters. The main BSP design principles are
● Load balancing, which helps to minimize both the local computation cost W and the communication cost H
● Data locality, which helps to minimize the communication cost H
● Coarse granularity, which helps to minimize (or sometimes to trade off) the communication cost H and the synchronization cost S
The term "data locality" refers here to placing a piece of data in the local memory of a processor that "needs it most." It has nothing to do with the locality of a processor in a specific network topology, which is actively discouraged from use in the BSP model. To distinguish these concepts, some authors use the terms "strict locality" [] or "co-locality" [].
The values of the network parameters g, l for a specific parallel computer can be obtained by benchmarking. The benchmarking process is described in []; the resulting lists of machine parameters can be found in [, ].

BSP vs Traditional Parallel Models
Traditionally, much of theoretical research in parallel algorithms has been done using the Parallel Random Access Machine (PRAM), proposed initially in []. This model contains
● A potentially unlimited number of processors, each capable of performing an elementary operation every time unit
● Global shared memory, providing uniform access for every processor to any location in one time unit
A PRAM computation proceeds in a sequence of synchronous parallel steps, each taking one time unit. Concurrent reading or writing of a memory location by several processors within the same step may be allowed or disallowed. The number of processors is potentially unbounded, and is often considered to be a function of the problem size. If the number of processors p is fixed, the PRAM model can be viewed as a special case of the BSP model, with g = l =  and communication realized by reading from/writing to the shared memory.
Since the number of processors in a PRAM can be unbounded, a common approach to PRAM algorithm design is to associate a different processor with every data item. Often, the processing of every item is identical; in this special case, the computation is called data-parallel. Programming models and languages designed for data-parallel computation can benefit significantly from the BSP approach to cost analysis; see [] for a more detailed discussion.
It has long been recognized that in order to be practical, a model has to impose a certain structure on the
communication and/or synchronization patterns of a parallel computation. Such a structure can be provided by defining a restricted set of collective communication primitives, called skeletons, each with an associated cost model (see, e.g., []). In this context, a BSP superstep can be viewed as a simple generalized skeleton (see also []). However, current skeleton proposals concentrate on more specialized, and somewhat more arbitrary, skeleton sets (see, e.g., []).
The CTA/Phase Abstractions model [], which underlies the ZPL language, is close in spirit to skeletons. A CTA computation consists of a sequence of phases, each with a simple and well-structured communication pattern. Again, the BSP model takes this approach to the extreme, where a superstep can be viewed as a phase allowing a single generic asynchronous communication pattern, with the associated cost model.

Memory Efficiency
The original definition of BSP does not account for memory as a limited resource. However, the model can be easily extended by an extra parameter m, representing the maximum capacity of each processor's local memory. Note that this approach also limits the amount of communication allowed within a superstep: h ≤ m. One of the early examples of memory-sensitive BSP algorithm design is given by [].
An alternative approach to reflecting memory cost is given by the model CGM, proposed in []. A CGM is essentially a memory-restricted BSP computer, where memory capacity and maximum superstep communication are determined by the size of the input/output: h ≤ m = O(N/p). A large number of algorithms have been developed for the CGM; see, e.g., [].

Memory Management
The BSP model does not directly support shared memory, a feature that is often desirable for both algorithm design and programming. Furthermore, the BSP model does not properly address the issue of input/output, which can also be viewed as accessing an external shared memory. Virtual shared memory can be obtained by PRAM simulation, a technique introduced in []. An efficient simulation on a p-processor BSP computer is possible if the simulated virtual PRAM has at least p log p processors, a requirement known as slackness. Memory access in the randomized simulation is made uniform by address hashing, which ensures a nearly random and independent distribution of virtual shared memory cells.
In the automatic mode of BSP programming proposed in [] (see also []), shared memory simulation completely hides the network and the processors' local memories. The algorithm designer and the programmer can enjoy the benefits of virtual shared memory; however, data locality is destroyed, and, as a result, performance may suffer. A useful compromise is achieved by the BSPRAM model, proposed in []. This model can be seen as a hybrid of BSP and PRAM, where each processor keeps its local memory, but in addition there is a uniformly accessible shared (possibly external) memory. Automatic memory management can still be achieved by the address-hashing technique of []; additionally, there are large classes of algorithms, identified by [], for which simpler, slackness-free solutions are possible.
In contrast to the standard BSP model, the BSPRAM is meaningful even with a single processor (p = 1). In this case, it models a sequential computation that has access both to main and external memory (or the cache and the main memory). Further connections between parallel and external-memory computations are explored, e.g., in [].
Paper [] introduced a more elaborate model, EM-BSP, where each processor, in addition to its local memory, can have access to several external disks, which may be private or shared. Paper [] proposed a restriction of BSPRAM, called BSPGRID, where only the external memory can be used for persistent data storage between supersteps – a requirement reflecting some current trends in processor architecture. Paper [] proposed a shared-memory model QSM, where normal processors have communication gap g, and each shared memory cell is essentially a "mini-processor" with communication gap d. Naturally, in a superstep every such "mini-processor" can "send" or "receive" at most p words (one for each normal processor), hence the model is similar to BSPRAM with communication gap g and latency dp + l.
Virtual shared memory is implemented in several existing or proposed programming environments offering BSP-like functionality (see, e.g., [, ]).
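To make the cost accounting used throughout this entry concrete, the following small helper (not part of the model's definition; the type and function names are illustrative) evaluates the superstep cost w + h ⋅ g + l and the total cost W + H ⋅ g + S ⋅ l defined earlier from per-superstep values.

typedef struct { double w, h; } bsp_superstep;   /* h = h_in + h_out */

/* Total BSP cost W + H*g + S*l for S supersteps on a machine with
   communication gap g and latency l. */
double bsp_total_cost(const bsp_superstep *steps, int S, double g, double l)
{
    double W = 0.0, H = 0.0;
    for (int s = 0; s < S; s++) {
        W += steps[s].w;                 /* W = sum of local computation costs w_s */
        H += steps[s].h;                 /* H = sum of communication costs h_s */
    }
    return W + H * g + (double)S * l;
}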
Heterogeneity
In the standard BSP model, all processors are assumed to be identical. In particular, all have the same local processing speed (which is an implicit parameter of the model), and the same communication gap g. In practice, many parallel architectures are heterogeneous, i.e., include processors with different speeds and communication performances. This fact has prompted heterogeneous extensions of the BSP model, such as HBSP [] and HCGM []. Both these extended models introduce a processor's speed as an explicit parameter. Each processor has its own speed and communication gap; these two parameters can be either independent or linked (e.g., proportional). The barrier structure of a computation is kept in both models.

Other Variants of BSP
The BSP* model [] is a refinement of the BSP model with an alternative cost formula for small-sized h-relations. Recognizing the fact that communication of even a small amount of data incurs a constant-sized overhead, the model introduces a parameter b, defined as the minimum communication cost of any, even zero-sized, h-relation. Since the overhead reflected by the parameter b can also be counted as part of superstep latency, the BSP* computer with communication gap g and latency l is asymptotically equivalent to a standard BSP computer with gap g and latency l + b.
The E-BSP model [] is another refinement of the BSP model, where the cost of a superstep is parameterized separately by the maximum amount of data sent, the maximum amount of data received, the total volume of communicated data, and the network-specific maximum distance of data travel. The OBSP* model [] is an elaborate extension of BSP, which accounts for varying computation costs of individual instructions, and allows the processors to run asynchronously while maintaining "logical supersteps". While E-BSP and OBSP* may be more accurate on some architectures (in [], a linear array and a 2D mesh are considered), they lack the generality and simplicity of pure BSP. A simplified version of E-BSP, asymptotically equivalent to pure BSP, is defined in [].
The PRO approach [] is introduced as another parallel computation model, but can perhaps be better understood as an alternative BSP algorithm design philosophy. It requires that algorithms are work-optimal, disregards point-to-point communication efficiency by setting g = , and instead puts the emphasis on synchronization efficiency and memory optimality.

BSP Algorithms
Basic Algorithms
As a simple parallel computation model, BSP lends itself to the design of efficient, well-structured algorithms. The aim of BSP algorithm design is to minimize the resource consumption of the BSP computer: the local computation cost W, the communication cost H, the synchronization cost S, and also sometimes the memory cost M. Since the aim of parallel computation is to obtain speedup over sequential execution, it is natural to require that a BSP algorithm should be work-optimal relative to a particular "reasonable" sequential algorithm, i.e., the local computation cost W should be proportional to the running time of the sequential algorithm, divided by p. It is also natural to allow a reasonable amount of slackness: an algorithm only needs to be efficient for n ≫ p, where n is the problem size. The asymptotic dependence of the minimum value of n on the number of processors p must be clearly specified by the algorithm.
Assuming that a designated processor holds a value a, the broadcasting problem asks that a copy of a is obtained by every processor. Two natural solutions are possible: either direct or binary tree broadcast.
In the direct broadcast method, a designated processor makes p − 1 copies of a and sends them directly to the destinations. The BSP costs are W = O(p), H = O(p), S = O(1). In the binary tree method, initially, the designated processor is defined as being awake, and the other processors as sleeping. The processors are woken up in log p rounds. In every round, every awake processor makes a copy of a and sends it to a sleeping processor, waking it up. The BSP costs are W = O(p), H = O(log p), S = O(log p). Thus, there is a trade-off between the direct and the binary tree broadcast methods, and no single method is optimal (see [, ]).
The array broadcasting problem asks, instead of a single value, to broadcast an array a of size n ≥ p. In contrast with the ordinary broadcast problem, there exists an optimal method for array broadcast, known as two-phase broadcast (a folklore result, described, e.g., in []). In this method, the array is partitioned into p
blocks of size n/p. The blocks are scattered across the processors; then, a total-exchange of the blocks is performed. Assuming sufficient slackness, the BSP costs are W = O(n/p), H = O(n/p), S = O(1).
Many computational problems can be described as computing a directed acyclic graph (dag), which characterizes the problem's data dependencies. From now on, it is assumed that a BSP algorithm's input and output are stored in the external memory. It is also assumed that all problem instances have sufficient slackness.
The balanced binary tree dag of size n consists of 2n − 1 nodes, arranged in a rooted balanced binary tree with n leaves. The direction of all edges can be either top-down (from root to leaves), or bottom-up (from leaves to root). By partitioning into appropriate blocks, the balanced binary tree dag can be computed with BSP costs W = O(n/p), H = O(n/p), S = O(1) (see [, ]).
The butterfly dag of size n consists of n log n nodes, and describes the data dependencies of the Fast Fourier Transform algorithm on n points. By partitioning into appropriate blocks, the butterfly dag can be computed with BSP costs W = O(n log n/p), H = O(n/p), S = O(1) (see [, ]).
The ordered 2D grid dag of size n consists of n^2 nodes arranged in an n × n grid, with edges directed top-to-bottom and left-to-right. The computation takes n inputs to the nodes on the left and top borders, and returns n outputs from the nodes on the right and bottom borders. By partitioning into appropriate blocks, the ordered 2D grid dag can be computed with BSP costs W = O(n^2/p), H = O(n), S = O(p) (see []).
The ordered 3D grid dag of size n consists of n^3 nodes arranged in an n × n × n grid, with edges directed top-to-bottom, left-to-right, and front-to-back. The computation takes n^2 inputs to the nodes on the front, left, and top faces, and returns n^2 outputs from the nodes on the back, right, and bottom faces. By partitioning into appropriate blocks, the ordered 3D grid dag can be computed with BSP costs W = O(n^3/p), H = O(n^2/p^(2/3)), S = O(p^(1/3)) (see []).

Further Algorithms
The design of efficient BSP algorithms has become a well-established topic. Some examples of BSP algorithms proposed in the past include list and tree contraction [], sorting [, , , , ], convex hull computation [, , , , , , ], and selection [, ]. In the area of matrix algorithms, some examples of the proposed algorithms include matrix–vector and matrix–matrix multiplication [, , ], Strassen-type matrix multiplication [], triangular system solution and several versions of Gaussian elimination [, , ], and orthogonal matrix decomposition [, ]. In the area of graph algorithms, some examples of the proposed algorithms include Boolean matrix multiplication [], minimum spanning tree [], transitive closure [], the algebraic path problem and the all-pairs shortest path problems [], and graph coloring []. In the area of string algorithms, some examples of the proposed algorithms include the longest common subsequence and edit distance problems [–, , , ], and the longest increasing subsequence problem [].

BSP Programming
The BSPlib Standard
Based on the experience of early BSP programming tools [, , ], the BSP programming community agreed on a common library standard, BSPlib []. The aim of BSPlib is to provide a set of BSP programming primitives, striking a reasonable balance between simplicity and efficiency. BSPlib is based on the single program/multiple data (SPMD) programming model, and contains communication primitives for direct remote memory access (DRMA) and bulk-synchronous message passing (BSMP). Experience shows that DRMA, due to its simplicity and deadlock-free semantics, is the method of choice for all but the most irregular applications. Routine use of DRMA is made possible by the barrier synchronization structure of BSP computation.
The two currently existing major implementations of BSPlib are the Oxford BSP toolset [] and the PUB library []. Both provide a robust environment for the development of BSPlib applications, including mechanisms for optimizing communication [, ], load balancing, and fault tolerance []. The PUB library also provides a few additional primitives (oblivious synchronization, processor subsets, multithreading). Both the Oxford BSP toolset and the PUB library include tools for performance analysis and prediction. Recently, new approaches have been developed to BSPlib implementation [, , ] and performance analysis [, ].
Beyond BSPlib
The Message Passing Interface (MPI) is currently the most widely accepted standard of distributed-memory parallel programming. In contrast to BSPlib, which is based on a single programming paradigm, MPI provides a diverse set of parallel programming patterns, allowing the programmer to pick and choose a paradigm most suitable for the application. Consequently, the number of primitives in MPI is an order of magnitude larger than in BSPlib, and the responsibility to choose the correct subset of primitives and to structure the code rests with the programmer. It is not surprising that a carefully chosen subset of MPI can be used to program in the BSP style; an example of such an approach is given by [].
The ZPL language [] is a global-view array language based on a BSP-like computation structure. As such, it can be considered to be one of the earliest high-level BSP programming tools. Another growing trend is the integration of the BSP model with modern programming environments. A successful example of integrating BSP with Python is given by the package Scientific Python [], which provides high-level facilities for writing BSP code, and performs communication by calls to either BSPlib or MPI. Tools for BSP programming in Java have been developed by projects NestStep [] and JBSP []; a Java-like multithreaded BSP programming model is proposed in []. A functional programming model for BSP is given by the BSMLlib library []; a constraint programming approach is introduced in []. Projects InteGrade [] and GridNestStep [] are aiming to implement the BSP model using Grid technology.

Related Entries
Bandwidth-Latency Models (BSP, LogP)
Collective Communication
Dense Linear System Solvers
Functional Languages
Graph Algorithms
Load Balancing, Distributed Memory
Linear Algebra, Numerical
Models of Computation, Theoretical
Parallel Skeletons
PGAS (Partitioned Global Address Space) Languages
PRAM (Parallel Random Access Machines)
Reduce and Scan
Sorting
SPMD Computational Model
Synchronization
ZPL

Bibliographic Notes and Further Reading
A detailed treatment of the BSP model, BSP programming, and several important BSP algorithms is given in the monograph []. Collection [] includes several chapters dedicated to BSP and related models.

Bibliography
. Alverson GA, Griswold WG, Lin C, Notkin D, Snyder L () Abstractions for portable, scalable parallel programming. IEEE Trans Parallel Distrib Syst ():–
. Alves CER, Cáceres EN, Castro Jr AA, Song SW, Szwarcfiter JL () Efficient parallel implementation of transitive closure of digraphs. In: Proceedings of EuroPVM/MPI, Venice. Lecture notes in computer science, vol . Springer, pp –
. Alves CER, Cáceres EN, Dehne F, Song SW () Parallel dynamic programming for solving the string editing problem on a CGM/BSP. In: Proceedings of the th ACM SPAA, Winnipeg, pp –
. Alves CER, Cáceres EN, Dehne F, Song SW () A parallel wavefront algorithm for efficient biological sequence comparison. In: Proceedings of ICCSA, Montreal. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Alves CER, Cáceres EN, Song SW () A coarse-grained parallel algorithm for the all-substrings longest common subsequence problem. Algorithmica ():–
. Ballereau O, Hains G, Lallouet A () BSP constraint programming. In: Gorlatch S, Lengauer C (eds) Constructive methods for parallel programming, vol . Advances in computation: Theory and practice. Nova Science, New York, Chap 
. Bäumker A, Dittrich W, Meyer auf der Heide F () Truly efficient parallel algorithms: -optimal multisearch for an extension of the BSP model. Theor Comput Sci ():–
. Bisseling RH () Parallel scientific computation: A structured approach using BSP and MPI. Oxford University Press, New York
. Bisseling RH, McColl WF () Scientific computing on bulk synchronous parallel architectures. Preprint , Department of Mathematics, University of Utrecht, December 
. Blanco V, González JA, León C, Rodríguez C, Rodríguez G, Printista M () Predicting the performance of parallel programs. Parallel Comput :–
. Bonorden O, Juurlink B, von Otte I, Rieping I () The Paderborn University BSP (PUB) library. Parallel Comput ():–
. Cáceres EN, Dehne F, Mongelli H, Song SW, Szwarcfiter JL () A coarse-grained parallel algorithm for spanning tree and connected components. In: Proceedings of Euro-Par, Pisa. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Calinescu R, Evans DJ () Bulk-synchronous parallel algorithms for QR and QZ matrix factorisation. Parallel Algorithms Appl :–
. Chamberlain BL, Choi S-E, Lewis EC, Lin C, Snyder L, Weathersby WD () ZPL: A machine independent programming language for parallel computers. IEEE Trans Softw Eng ():–
. Cinque L, Di Maggio C () A BSP realisation of Jarvis' algorithm. Pattern Recogn Lett ():–
. Cole M () Algorithmic skeletons. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer, London, pp –
. Cole M () Bringing skeletons out of the closet: A pragmatic manifesto for skeletal parallel programming. Parallel Comput :–
. Corrêa R et al (eds) () Models for parallel and distributed computation: theory, algorithmic techniques and applications, vol . Applied Optimization. Kluwer, Dordrecht
. Dehne F, Dittrich W, Hutchinson D () Efficient external memory algorithms by simulating coarse-grained parallel algorithms. Algorithmica ():–
. Dehne F, Fabri A, Rau-Chaplin A () Scalable parallel computational geometry for coarse grained multicomputers. Int J Comput Geom :–
. Diallo M, Ferreira A, Rau-Chaplin A, Ubéda S () Scalable D convex hull and triangulation algorithms for coarse grained multicomputers. J Parallel Distrib Comput :–
. Donaldson SR, Hill JMD, Skillicorn D () Predictable communication on unpredictable networks: Implementing BSP over TCP/IP and UDP/IP. Concurr Pract Exp ():–
. Dymond P, Zhou J, Deng X () A D parallel convex hull algorithm with optimal communication phases. Parallel Comput ():–
. Fantozzi C, Pietracaprina A, Pucci G () A general PRAM simulation scheme for clustered machines. Int J Foundations Comput Sci ():–
. Fortune S, Wyllie J () Parallelism in random access machines. In: Proceedings of ACM STOC, San Diego, pp –
. Gebremedhin AH, Essaïdi M, Lassous GI, Gustedt J, Telle JA () PRO: A model for the design and analysis of efficient and scalable parallel algorithms. Nordic J Comput ():–
. Gebremedhin AH, Manne F () Scalable parallel graph coloring algorithms. Concurr Pract Exp ():–
. Gerbessiotis AV, Siniolakis CJ () Deterministic sorting and randomized median finding on the BSP model. In: Proceedings of the th ACM SPAA, Padua, pp –
. Gerbessiotis AV, Siniolakis CJ, Tiskin A () Parallel priority queue and list contraction: The BSP approach. Comput Informatics :–
. Gerbessiotis AV, Valiant LG () Direct bulk-synchronous parallel algorithms. J Parallel Distrib Comput ():–
. Goldchleger A, Kon F, Goldman A, Finger M, Bezerra GC. InteGrade: Object-oriented Grid middleware leveraging idle computing power of desktop machines. Concurr Comput Pract Exp :–
. Goodrich M () Communication-efficient parallel sorting. In: Proceedings of the th ACM STOC, Philadelphia, pp –
. Gorlatch S, Lengauer C (eds) () Constructive methods for parallel programming, vol . Advances in computation: Theory and practice. Nova Science, New York
. Goudreau MW, Lang K, Rao SB, Suel T, Tsantilas T () Portable and efficient parallel computing using the BSP model. IEEE Trans Comput ():–
. Gu Y, Lee B-S, Cai W () JBSP: A BSP programming library in Java. J Parallel Distrib Comput :–
. Hains G, Loulergue F () Functional bulk synchronous parallel programming using the BSMLlib library. In: Gorlatch S, Lengauer C (eds) Constructive methods for parallel programming, vol . Advances in computation: Theory and practice. Nova Science, New York, Chap 
. Hammond K, Michaelson G (eds) () Research directions in parallel functional programming. Springer, London
. Heywood T, Ranka S () A practical hierarchical model of parallel computation I: The model. J Parallel Distrib Comput ():–
. Hill J () Portability of performance in the BSP model. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer, London, pp –
. Hill JMD, Donaldson SR, Lanfear T () Process migration and fault tolerance of BSPlib programs running on a network of workstations. In: Pritchard D, Reeve J (eds) Proceedings of Euro-Par, Southampton. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Hill JMD, Jarvis SA, Siniolakis C, Vasilev VP () Analysing an SQL application with a BSPlib call-graph profiling tool. In: Pritchard D, Reeve J (eds) Proceedings of Euro-Par. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Hill JMD, McColl WF, Stefanescu DC, Goudreau MW, Lang K, Rao SB, Suel T, Tsantilas T, Bisseling RH () BSPlib: The BSP programming library. Parallel Comput ():–
. Hill JMD, Skillicorn DB () Lessons learned from implementing BSP. Future Generation Comput Syst (–):–
. Hinsen K () High-level parallel software development with Python and BSP. Parallel Process Lett ():–
. Ishimizu T, Fujiwara A, Inoue M, Masuzawa T, Fujiwara H () Parallel algorithms for selection on the BSP and BSP* models. Syst Comput Jpn ():–
. Juurlink BHH, Wijshoff HAG () The E-BSP model: Incorporating general locality and unbalanced communication into the BSP model. In: Bougé et al (eds) Proceedings of Euro-Par (Part II), Lyon. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Juurlink BHH, Wijshoff HAG () A quantitative comparison of parallel computation models. ACM Trans Comput Syst ():–
. Kee Y, Ha S () An efficient implementation of the BSP programming library for VIA. Parallel Process Lett ():–
. Keßler CW () NestStep: Nested parallelism and virtual shared memory for the BSP model. J Supercomput :–
. Kim S-R, Park K () Fully-scalable fault-tolerant simulations for BSP and CGM. J Parallel Distrib Comput ():–
. Krusche P, Tiskin A () Efficient longest common subsequence computation using bulk-synchronous parallelism. In: Proceedings of ICCSA, Glasgow. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Krusche P, Tiskin A () Longest increasing subsequences in scalable time and memory. In: Proceedings of PPAM, Revised Selected Papers, Part I. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Krusche P, Tiskin A () New algorithms for efficient parallel string comparison. In: Proceedings of ACM SPAA, Santorini, pp –
. Lecomber DS, Siniolakis CJ, Sujithan KR () PRAM programming: in theory and in practice. Concurr Pract Exp :–
. Mattsson H, Kessler CW () Towards a virtual shared memory programming environment for grids. In: Proceedings of PARA, Copenhagen. Lecture notes in computer science, vol . Springer-Verlag, pp –
. McColl WF () Scalable computing. In: van Leeuwen J (ed) Computer science today: Recent trends and developments. Lecture notes in computer science, vol . Springer, Berlin, pp –
. McColl WF () Universal computing. In: Bougé L et al (eds) Proceedings of Euro-Par (Part I), Lyon. Lecture notes in computer science, vol . Springer, Berlin, pp –
. McColl WF, Miller Q () Development of the GPL language. Technical report (ESPRIT GEPPCOM project), Oxford University Computing Laboratory
. McColl WF, Tiskin A () Memory-efficient matrix multiplication in the BSP model. Algorithmica (/):–
. Miller R () A library for bulk-synchronous parallel programming. In: Proceedings of general purpose parallel computing. British Computer Society, pp –
. Morin P () Coarse grained parallel computing on heterogeneous systems. In: Proceedings of ACM SAC, Como, pp –
. Nibhanupudi MV, Szymanski BK () Adaptive bulk-synchronous parallelism on a network of non-dedicated workstations. In: High performance computing systems and applications. Kluwer, pp –
. The Oxford BSP Toolset () http://www.bsp-worldwide.org/implmnts/oxtool
. Ramachandran V () A general purpose shared-memory model for parallel computation. In: Heath MT, Ranade A, Schreiber RS (eds) Algorithms for parallel processing. IMA volumes in mathematics and applications, vol . Springer-Verlag, New York
. Saukas LG, Song SW () A note on parallel selection on coarse-grained multicomputers. Algorithmica (–):–
. Sibeyn J, Kaufmann M () BSP-like external-memory computation. In: Bongiovanni GC, Bovet DP, Di Battista G (eds) Proceedings of CIAC, Rome. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Skillicorn DB () Predictable parallel performance: The BSP model. In: Corrêa R et al (eds) Models for parallel and distributed computation: theory, algorithmic techniques and applications, vol . Applied Optimization. Kluwer, Dordrecht, pp –
. Song SW () Parallel graph algorithms for coarse-grained multicomputers. In: Corrêa R et al (eds) Models for parallel and distributed computation: theory, algorithmic techniques and applications, vol . Applied Optimization. Kluwer, Dordrecht, pp –
. Tiskin A () Bulk-synchronous parallel multiplication of Boolean matrices. In: Proceedings of ICALP, Aalborg. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Tiskin A () The bulk-synchronous parallel random access machine. Theor Comput Sci (–):–
. Tiskin A () All-pairs shortest paths computation in the BSP model. In: Proceedings of ICALP, Crete. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Tiskin A () A new way to divide and conquer. Parallel Process Lett ():–
. Tiskin A () Parallel convex hull computation by generalised regular sampling. In: Proceedings of Euro-Par, Paderborn. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Tiskin A () Efficient representation and parallel computation of string-substring longest common subsequences. In: Proceedings of ParCo, Malaga. NIC Series, vol . John von Neumann Institute for Computing, pp –
. Tiskin A () Communication-efficient parallel generic pairwise elimination. Future Generation Comput Syst :–
. Tiskin A () Parallel selection by regular sampling. In: Proceedings of Euro-Par (Part II), Ischia. Lecture notes in computer science, vol . Springer, pp –
. Valiant LG () A bridging model for parallel computation. Commun ACM ():–
. Vasilev V () BSPGRID: Variable resources parallel computation and multiprogrammed parallelism. Parallel Process Lett ():–
. Williams TL, Parsons RJ () The heterogeneous bulk synchronous parallel model. In: Rolim J et al (eds) Proceedings of IPDPS workshops, Cancun. Lecture notes in computer science, vol . Springer, Berlin, pp –
. Zheng W, Khan S, Xie H () BSP performance analysis and prediction: Tools and applications. In: Malyshkin V (ed) Proceedings of PaCT, Newport Beach. Lecture notes in computer science, vol . Springer, Berlin, pp –

Bulk Synchronous Parallelism (BSP)
BSP (Bulk Synchronous Parallelism)
Bus: Shared Channel
Buses and Crossbars

Buses and Crossbars
Rajeev Balasubramonian, Timothy M. Pinkston
University of Utah, Salt Lake City, UT, USA
University of Southern California, Los Angeles, CA, USA

Synonyms
Bus: Shared channel; Shared interconnect; Shared-medium network; Crossbar; Interconnection network; Point-to-point switch; Switched-medium network

Definition
Bus: A bus is a shared interconnect used for connecting multiple components of a computer on a single chip or across multiple chips. Connected entities either place signals on the bus or listen to signals being transmitted on the bus, but signals from only one entity at a time can be transported by the bus at any given time. Buses are popular communication media for broadcasts in computer systems.
Crossbar: A crossbar is a non-blocking switching element with N inputs and M outputs used for connecting multiple components of a computer where, typically, N = M. The crossbar can simultaneously transport signals on any of the N inputs to any of the M outputs as long as multiple signals do not compete for the same input or output port. Crossbars are commonly used as basic switching elements in switched-media network routers.

Discussion
Introduction
Every computer system is made up of numerous components such as processor chips, memory chips, peripherals, etc. These components communicate with each other via interconnects. One of the simplest interconnects used in computer systems is the bus. The bus is a shared medium (usually a collection of electrical wires) that allows one sender at a time to communicate with all sharers of that medium. If the interconnect must support multiple simultaneous senders, more scalable designs based on switched media must be pursued. The crossbar represents a basic switched-media building block for more complex but scalable networks.
In traditional multi-chip multiprocessor systems (c. ), buses were primarily used as off-chip interconnects, for example, front-side buses. Similarly, crossbar functionality was implemented on chips that were used mainly for networking. However, the move to multi-core technology has necessitated the use of networks even within a mainstream processor chip to connect its multiple cores and cache banks. Therefore, buses and crossbars are now used within mainstream processor chips as well as chip sets. The design constraints for on-chip buses are very different from those of off-chip buses. Much of this discussion will focus on on-chip buses, which continue to be the subject of much research and development.

Basics of Bus Design
A bus comprises a shared medium with connections to multiple entities. An interface circuit allows each of the entities either to place signals on the medium or to sense (listen to) the signals already present on the medium. In a typical communication, one of the entities acquires ownership of the bus (the entity is now known as the bus master) and places signals on the bus. Every other entity senses these signals and, depending on the content of the signals, may choose to accept or discard them. Most buses today are synchronous, that is, the start and end of a transmission are clearly defined by the edges of a shared clock. An asynchronous bus would require an acknowledgment from the receiver so the sender knows when the bus can be relinquished.
Buses often are collections of electrical wires (alternatively, buses can be a collection of optical waveguides over which information is transmitted photonically []), where each wire is typically organized as "data," "address," or "control." In most systems, networks are used to move messages among entities on the data bus; the address bus specifies the entity that must receive the message; and the control bus carries auxiliary signals such as arbitration requests and error correction codes. There is another nomenclature that readers may also encounter. If the network is used to implement a
cache coherence protocol, the protocol itself has three types of messages: (1) DATA, which refers to blocks of memory; (2) ADDRESS, which refers to the memory block's address; and (3) CONTROL, which refers to auxiliary messages in the protocol such as acknowledgments. Capitalized terms as above will be used to distinguish message types in the coherence protocol from signal types on the bus. For now, it will be assumed that all three protocol message types are transmitted on the data bus.

Arbitration Protocols
Since a bus is a shared medium that allows a single master at a time, an arbitration protocol is required to identify this bus master. A simple arbitration protocol can allow every entity to have ownership of the bus for a fixed time quantum, in a round-robin manner. Thus, every entity can make a local decision on when to transmit. However, this wastes bus bandwidth when an entity has nothing to transmit during its turn.
The most common arbitration protocol employs a central arbiter; entities must send their bus requests to the arbiter, and the arbiter sends explicit messages to grant the bus to requesters. If the requesting entity is not aware of its data bus occupancy time beforehand, the entity must also send a bus release message to the arbiter after it is done. The request, grant, and release signals are part of the control network. The request signal is usually carried on a dedicated wire between an entity and the arbiter. The grant signal can also be implemented similarly, or as a shared bus that carries the ID of the grantee. The arbiter has state to track data bus occupancy, buffers to store pending requests, and policies to implement priority or fairness. The use of pipelining to hide arbitration delays will be discussed shortly.
Arbitration can also be done in a distributed manner [], but such methods often incur latency or bandwidth penalties. In one example, a shared arbitration bus is implemented with wired-OR signals. Multiple entities can place a signal on the bus; if any entity places a "one" on the bus, the bus carries "one," thus using wires to implement the logic of an OR gate. To arbitrate, all entities place their IDs on the arbitration bus; the resulting signal is the OR of all requesting IDs. The bus is granted to the entity with the largest ID, and this is determined by having each entity sequentially drop out if it can determine that it is not the largest ID in the competition.

Pipelined Bus
Before a bus transaction can begin, an entity must arbitrate for the bus, typically by contacting a centralized arbiter. The latency of request and grant signals can be hidden with pipelining. In essence, the arbitration process (that is handled on the control bus) is overlapped with the data transmission of an earlier message. An entity can send a bus request to the arbiter at any time. The arbiter buffers this request, keeps track of occupancy on the data bus, and sends the grant signal one cycle before the data bus will be free. In a heavily loaded network, the data bus will therefore rarely be idle and the arbitration delay is completely hidden by the wait for the data bus. In a lightly loaded network, pipelining will not hide the arbitration delay, which is typically at least three cycles: one cycle for the request signal, one cycle for logic at the arbiter, and one cycle for the grant signal.

Case Study: Snooping-Based Cache Coherence Protocols
As stated earlier, the bus is a vehicle for transmission of messages within a higher-level protocol such as a cache coherence protocol. A single transaction within the higher-level protocol may require multiple messages on the bus. Very often, the higher-level protocol and the bus are codesigned to improve efficiency. Therefore, as a case study, a snooping bus-based coherence protocol will be discussed.
Consider a single-chip multiprocessor where each processor core has a private L1 cache, and a large L2 cache is shared by all the cores. The multiple L1 caches and the multiple banks of the L2 cache are the entities connected to a shared bus (Fig. ). The higher-level coherence protocol ensures that data in the L1 and L2 caches is kept coherent, that is, a data modification is eventually seen by all caches and multiple updates to one block are seen by all caches in exactly the same order.
A number of coherence protocol operations will now be discussed. When a core does not find its data in its local L1 cache, it must send a request for the data block to other L1 caches and the L2 cache. The core's L1 cache first sends an arbitration request for the bus to the
Buses and Crossbars. Fig.  Cores and L2 cache banks connected with a bus. The bus is composed of wires that handle data, address, and control

arbiter. The arbiter eventually sends the grant signal to the requesting L1. The arbitration is done on the control portion of the bus. The L1 then places the ADDRESS of the requested data block on the data bus. On a synchronous bus, we are guaranteed that every other entity has seen the request within one bus cycle. Each such "snooping" entity now checks its L1 cache or L2 bank to see if it has a copy of the requested block. Since every lookup may take a different amount of time, a wired-AND signal is provided within the control bus so everyone knows that the snoop is completed. This is an example of bus and protocol codesign (a protocol CONTROL message being implemented on the bus' control bus). The protocol requires that an L1 cache respond with data if it has the block in "modified" state; else, the L2 cache responds with data. This is determined with a wired-OR signal: all L1 caches place the outcome of their snoop on this wired-OR signal, and the L2 cache accordingly determines if it must respond. The responding entity then fetches data from its arrays and places it on the data bus. Since the bus is not released until the end of the entire coherence protocol transaction, the responder knows that the data bus is idle and need not engage in arbitration (another example of protocol and bus codesign). Control signals let the requester know that the data is available, and the requester reads the cache block off the bus.
The use of a bus greatly simplifies the coherence protocol. It serves as a serialization point for all coherence transactions. The timing of when an operation is visible to everyone is well known. The broadcast of operations allows every cache on the bus to be self-managing. Snooping bus-based protocols are therefore much simpler than directory-based protocols on more scalable networks.
As described, each coherence transaction is handled atomically, that is, one transaction is handled completely before the bus is released for use by other transactions. This means that the data bus is often idle while caches perform their snoops and array reads. Bus utilization can be improved with a split transaction bus. Once the requester has placed its request on the data bus, the data bus is released for use by other transactions. Other transactions can now use the data bus for their requests or responses. When a transaction's response is ready, the data bus must be arbitrated for. Every request and response must now carry a small tag so responses can be matched up to their requests. Additional tags may also be required to match the wired-OR signals to the request.
The split transaction bus design can be taken one step further. Separate buses can be implemented for ADDRESS and DATA messages. All requests (ADDRESS messages) and corresponding wired-OR CONTROL signals are carried on one bus. This bus acts as the serialization point for the coherence protocol. Responders always use a separate bus to return DATA messages. Each bus has its own separate arbiter and corresponding control signals.

Bus Scalability
A primary concern with any bus is its lack of scalability. First, if many entities are connected to a bus, the bus speed reduces because it must drive a heavier load over a longer distance. In an electrical bus, the higher capacitive load from multiple entities increases the RC delay; in an optical bus, the reduced number of photons received at photodetectors from dividing the optical power budget among multiple entities likewise increases the time to detect bus signals. Second, with many entities competing for the shared bus, the wait-time to access the bus increases with the number of entities. Therefore, conventional wisdom states that more scalable switched-media networks are preferred when connecting much more than  or  entities []. However, the simplicity of bus-based protocols (such as the snooping-based cache coherence protocol) makes them attractive for small- or medium-scale symmetric multiprocessing (SMP) systems. For example, the IBM POWER™ processor chip supports  cores on its SMP bus []. Buses
are also attractive because, unlike switched-media networks, they do not require energy-hungry structures such as buffers and crossbars. Researchers have considered multiple innovations to extend the scalability of buses, some of which are discussed next.
One way to scale the number of entities connected using buses is, simply, to provide multiple buses, for example, dual-independent buses or quad-independent buses. This mitigates the second problem listed above regarding the high rate of contention on a single bus, but steps must still be taken to maintain cache coherency via snooping on the buses. The Sun Starfire multiprocessor [], for example, uses four parallel buses for ADDRESS requests, wherein each bus handles a different range of addresses. Tens of dedicated buses are used to connect up to  IBM POWER™ processor chips in a coherent SMP system []. While this option has high cost for off-chip buses because of pin and wiring limitations, a multi-bus for an on-chip network is not as onerous because of plentiful metal area budgets.

Buses and Crossbars. Fig.  A hierarchical bus structure that localizes broadcasts to relevant clusters

Some recent works have highlighted the potential of bus-based on-chip networks. Das et al. [] argue that buses should be used within a relatively small cluster of cores because of their superior latency, power, and simplicity. The buses are connected with a routed mesh network that is employed for communication beyond the cluster. The mesh network is exercised infrequently because most applications exhibit locality. Udipi et al. [] take this hierarchical network approach one step further. As shown in Fig. , the intra-cluster buses are themselves connected with an inter-cluster bus. Bloom filters are used to track the buses that have previously handled a given address. When coherence transactions are initiated for that address, the Bloom filters ensure that the transaction is broadcasted only to the buses that may find the address relevant. Locality optimizations such as page coloring help ensure that bus broadcasts do not travel far, on average. Udipi et al. also employ multiple buses and low-swing wiring to further extend bus scalability in terms of performance and energy.

Buses and Crossbars. Fig.  (a) A "dance-hall" configuration of processors and memory. (b) The circuit for a  ×  crossbar

Crossbars
Buses are used as a shared fabric for communication among multiple entities. Communication on a bus is always broadcast-style, i.e., even though a message is going from entity-A to entity-B, all entities see the message and no other message can be simultaneously in transit. However, if the entities form a "dance hall" configuration (Fig. a) with processors on one side and memory on the other side, and most communication is between processors and memory, a crossbar interconnect becomes a compelling choice. Although crossbars incur a higher wiring overhead than buses, they allow multiple messages simultaneously to be in transit, thus increasing the network bandwidth. Given this, crossbars serve as the basic switching element within switched-media network routers.
 B Buses and Crossbars

A crossbar circuit takes N inputs and connects each Related Entries


input to any of the M possible outputs. As shown in Cache Coherence
Fig. b, the circuit is organized as a grid of wires, with Collective Communication
inputs on the left, and outputs on the bottom. Each wire Interconnection Networks
can be thought of as a bus with a unique master, that is, Networks, Direct
the associated input port. At every intersection of wires, Network Interfaces
a pass transistor serves as a crosspoint connector to short Networks, Multistage
the two wires, if enabled, connecting the input to the PCI-Express
output. Small buffers can also be located at the cross- Routing (Including Deadlock Avoidance)
points in buffered crossbar implementations to store Switch Architecture
messages temporarily in the event of contention for the Switching Techniques
intended output port. A crossbar is usually controlled
by a centralized arbiter that takes output port requests Bibliographic Notes and Further
from incoming messages and computes a viable assign- Reading
ment of input port to output port connections. This, For more details on bus design and other networks,
for example, can be done in a crossbar switch alloca- readers are referred to the excellent textbook by Dally
tion stage prior to a crossbar switch traversal stage for and Towles []. Recent papers in the architecture com-
message transport. Multiple messages can be simulta- munity that focus on bus design include those by Udipi
neously in transit as long as each message is headed et al. [], Das et al. [], and Kumar et al. []. Kumar
to a unique output and each emanates from a unique et al. [] articulate some of the costs of implement-
input. Thus, the crossbar is non-blocking. Some imple- ing buses and crossbars in multi-core processors and
mentations allow a single input message to be routed to argue that the network must be codesigned with the
multiple output ports. core and caches for optimal performance and power.
A crossbar circuit has a cost that is proportional to A few years back, S. Borkar made a compelling argu-
N × M. The circuit is replicated W times, where W rep- ment for the widespread use of buses within multi-core
resents the width of the link at one of the input ports. chips that is highly thought provoking [, ]. The paper
It is therefore not a very scalable circuit. In fact, larger by Charlesworth [] on Sun’s Starfire, while more than a
centralized switches such as Butterfly and Benes switch decade old, is an excellent reference that describes con-
fabrics are constructed hierarchically from smaller siderations when designing a high-performance bus for
crossbars to form multistage indirect networks or MINs. a multi-chip multiprocessor. Future many-core proces-
Such networks have a cost that is proportional to sors may adopt photonic interconnects to satisfy the
Nlog(M) but have a more restrictive set of messages that high memory bandwidth demands of the many cores.
can be routed simultaneously without blocking. A single photonic waveguide can carry many wave-
A well-known example of a large-scale on-chip lengths of light, each carrying a stream of data. Many
crossbar is the Sun Niagara processor []. The cross- receiver “rings” can listen to the data transmission, each
bar connects eight processors to four L cache banks ring contributing to some loss in optical energy. The
in a “dance-hall” configuration. A recent example of Corona paper by Vantrease et al. [] and the paper
using a crossbar to interconnect processor cores in by Kirman et al. [] are excellent references for more
other switched point-to-point configurations is the Intel details on silicon photonics, optical buses, and optical
QuickPath Interconnect []. More generally, crossbars crossbars.
find extensive use in network routers. Meshes and tori, The basic crossbar circuit has undergone little
for example, implement a  ×  crossbar in router change over the last several years. However, given the
switches where the five input and output ports cor- recent interest in high-radix routers which increase the
respond to the North, South, East, West neighbors input/output-port degree of the crossbar used as the
and the node connected to each router. The mesh- internal router switch, Kim et al. [] proposed hierar-
connected Tilera Tile-GxTM -core processor is a chical crossbar and buffered crossbar organizations to
recent example []. facilitate scalability. Also, given the relatively recent shift
Butterfly B 

in focus to energy-efficient on-chip networks, Wang . Kirman N, Kyrman M, Dokania R, Martinez J, Apsel A, Watkins
et al. [] proposed techniques to reduce the energy M, Albonesi D () Leveraging optical technology in future
bus-based chip multiprocessors. In: Proceedings of MICRO,
usage within crossbar circuits. They introduced a cut- B
Orlando
through crossbar that is optimized for traffic that travels . Kongetira P () A -way multithreaded SPARC processor. In:
in a straight line through a mesh network’s router. The Proceedings of hot chips , Stanford. http://www.hotchips.org/
design places some restrictions on the types of mes- archives/
sage turns that can be simultaneously handled. Wang . Kumar R, Zyuban V, Tullsen D () Interconnections in multi-
et al. also introduce a segmented crossbar that pre- core architectures: understanding mechanisms, overheads, and
scaling. In: Proceedings of ISCA, Madison
vents switching across the entire length of wires when
. Tendler JM () POWER processors: the beat goes on. http://
possible. www.ibm.com / developerworks / wikis /download/attachments /
/POWER+-+The+Beat+Goes+On.pdf
Bibliography . Tilera. TILE-Gx processor family product brief. http://www.
. Borkar S () Networks for multi-core chips – a contrarian tilera.com /sites /default /files /productbriefs / PB_Processor_
view, ISLPED Keynote. www.islped.org/X/BorkarISLPED. A_v.pdf
pdf . Udipi A, Muralimanohar N, Balasubramonian R () Towards
. Borkar S () Networks for multi-core chips – a controversial scalable, energy-efficient, bus-based on-chip networks. In:
view. In: Workshop on on- and off-chip interconnection networks Proceedings of HPCA, Bangalore
for multicore systems (OCIN), Stanford . Vantrease D, Schreiber R, Monchiero M, McLaren M, Jouppi
. Charlesworth A () Starfire: extending the SMP envelope. N, Fiorentino M, Davis A, Binkert N, Beausoleil R, Ahn J-H
IEEE Micro ():– () Corona: system implications of emerging nanophotonic
. Dally W, Towles B () Route packets, not wires: on-chip inter- technology. In: Proceedings of ISCA, Beijing
connection networks. In: Proceedings of DAC, Las Vegas . Wang H-S, Peh L-S, Malik S () Power-driven design of
. Dally W, Towles B () Principles and practices of interconnec- router microarchitectures in on-chip networks. In: Proceedings
tion networks, st edn. Morgan Kaufmann, San Francisco of MICRO, San Diego
. Das R, Eachempati S, Mishra AK, Vijaykrishnan N, Das CR
() Design and evaluation of hierarchical on-chip network
topologies for next generation CMPs. In: Proceedings of HPCA,
Raleigh
. Intel Corp. An introduction to the Intel QuickPath interconnect.
http://www.intel.com/technology/quickpath/introduction.pdf Butterfly
. Kim J, Dally W, Towles B, Gupta A () Microarchitecture of a
high-radix router. In: Proceedings of ISCA, Madison Networks, Multistage
C
syntax. Rather than introducing a plethora of new lan-
C* guage constructs to express parallelism, C* relies on
existing C operators, applied to parallel data, to express
Guy L. Steele Jr.
such notions as broadcasting, reduction, and interpro-
Oracle Labs, Burlington, MA, USA
cessor communication in both regular and irregular
patterns [, , ].
Definition
The original proposed name for the language was
C* (pronounced “see-star”) refers to two distinct data-
*C, not only by analogy with *Lisp, but with a view for
parallel dialects of C developed by Thinking Machines
the potential of making a similarly data-parallel exten-
Corporation for its Connection Machine supercom-
sion to the C++ language, which would then naturally
puters. The first version () is organized around
be called *C++. However, the marketing department
the declaration of domains, similar to classes in C++,
of Thinking Machines Corporation decided that “C*”
but when code associated with a domain is activated,
sounded better. This inconsistency in the placement
it is executed in parallel within all instances of the
of the “*” did confuse many Thinking Machines cus-
domain, not just a single designated instance. Com-
tomers and others, resulting in frequent references to
pound assignment operators such as += are extended in
“*C” anyway, and even to “Lisp*” on occasion.
C* to perform parallel reduction operations. An elabo-
rate theory of control flow allows use of C control state-
ments in a MIMD-like, yet predictable, fashion despite The Initial Design of C*
the fact that the underlying execution model is SIMD. The basic idea was to start with the C programming lan-
The revised version () replaces domains with shapes guage and then augment it with the ability to declare
that organize processors into multidimensional arrays something like a C++ class, but with the keyword
and abandons the MIMD-like control-flow theory. class replaced with the keyword domain. As in C++,
a domain could have functions as well as variables as
members. However, the notion of method invocation
Discussion (calling a member function on a single specific instance)
Of the four programming languages (*Lisp, C*, CM For-
was replaced by the notion of domain activation (calling
tran, and CM-Lisp) provided by Thinking Machines
a member function, or executing code, on all instances
Corporation for Connection Machine Systems, C* was
simultaneously and synchronously). Everything else in
the most clever (indeed, perhaps too clever) in trying
the language was driven by that one design decision,
to extend features of an already existing sequential lan-
that one metaphor.
guage for parallel execution. To quote the language
Two new keywords, mono and poly, were intro-
designers:
duced to describe in which memory data resided – the
▸ C* is an extension of the C programming language front-end computer or the Connection Machine pro-
designed to support programming in the data paral- cessors, respectively. Variables declared within sequen-
lel style, in which the programmer writes code as if a tial code were mono by default, and variables declared
processor were associated with every data element. C* within parallel code were poly by default, so the prin-
features a single new data type (based on classes in cipal use of these keywords was in describing pointer
C++), a synchronous execution model, and a minimal types; for example, the declaration
number of extensions to C statement and expression mono int *poly p;

David Padua (ed.), Encyclopedia of Parallel Computing, DOI ./----,


© Springer Science+Business Media, LLC 
 C C*

indicates that p is a poly pointer to a mono int (i.e., for the maximum and minimum operators. We quickly
p holds many values, all stored in the Connection discovered that users had some difficulty remembering
Machine processors, one in each instance of the cur- which was which.) [, ]
rent domain; and each of these values is a pointer to an
integer that resides in the front-end processor, but each In addition, most of the binary compound operators are
of these pointers might point to a different front-end pressed into service as unary reduction operators. Thus
integer). +=b computes the sum of all the active values in b and
C* systematically extended the standard operators returns that sum as a mono value; similarly >?=b finds
in C to have parallel semantics by applying two rules: the largest active value in b.
() if a binary operator has a scalar operand and a paral- C* also added a modulus operator %% because of
lel operand, the scalar value is automatically replicated the great utility of modulus, as distingushed from the
to form a parallel value (an idea previously seen in both remainder operator %, in performing array index cal-
APL and Fortran x), and () an operator applied to par- culations: when k is zero, the expression (k-1)%%n
allel operands is executed for all active processors as if produces a much more useful result, namely, k-1, than
in some serial order. In this way, binary operators such does the expression (k-1)%n, which produces -1.
as + and - and % can be applied elementwise to many Because a processor can contain pointers to vari-
sets of operands at once, and binary operators with ables residing in other processors, interprocessor com-
side effects – the compound assignment operators – are munication can be expressed simply by dereferencing
guaranteed to have predictable sensible behavior; e.g., if such pointers. Thus if p is a poly pointer to a poly
a is a scalar variable and b is a parallel variable, then the int, then *p causes each active processor to fetch
effect of a+=b is to add every active element of b into an int value through its own p pointer, and *p = b
a, because it must behave as if each active element of b causes each active processor to store its b value indi-
were added into a in some sequential order. (In practice, rectly through p (thus “sending a message” to some
the implementation used a parallel algorithm to sum the other processor). The “combining router” feature of the
elements of b and then added that sum into a.) Connection Machine could be invoked by an expres-
C* makes two extensions to the standard set of C sion such as *p += b, which might cause the b val-
operators, both motivated by a desire to extend the par- ues to be sent to some smaller number of destina-
allel functionality provided by the standard operators tions, resulting in the summing (in parallel) of vari-
in a consistent manner. Two common arithmetic oper- ous subsets of the b values into the various individual
ations, min and max, are provided in C as preprocessor destinations.
macros rather than as operators; one disadvantage of a C* has an elaborate theory of implicitly synchro-
macro is that it cannot be used as part of a compound nized control flow that allows the programmer to code,
assignment operator, and so one cannot write a max= b for the most part, as if each Connection Machine pro-
in the same way that one can write a+=b. C* intro- cessor were executing its own copy of parallel code
duces operators <? and >? to serve as min and max independently as if it were ordinary sequential C code.
operations. The designers commented: The idea is that each processor has its own program
counter (a “virtual pc”), but can make no progress until
▸ They may be understood in terms of their traditional a “master pc” arrives at that point in the code, at which
macro definitions point that virtual pc (as well every other virtual pc wait-
a <? b means ((a) < (b)) ? (a) : (b) ing at that particular code location) becomes active,
joining the master pc and participating in SIMD exe-
a >? b means ((a) > (b)) ? (a) : (b)
cution. Whenever the master pc reaches a conditional
but of course the operators, unlike the macro defini- branch, each active virtual pc that takes the branch
tions, evaluate each argument exactly once. The opera- becomes inactive (thus, e.g., in an if statement, after
tors <? and >? are intended as mnemonic reminders evaluation of the test expression, processors that need
of these definitions. (Such mnemonic reminders are to execute the else part take an implicit branch to
important. The original design of C* used >< and <> the else part and wait while other processors possibly
C* C 

proceed to execute the “then” part). Whenever the mas- each such instance. The “magic identifier” this is a
ter pc reaches the end of a statement, every virtual pc pointer to the current domain instance, and its value is
becomes inactive, and the master pc is transferred from therefore different within each executing instance; sub-
the current point to a new point in the code, namely, the tracting from it a pointer to the first instance in the
earliest point that has waiting program counters, within array, namely, &sieve[0], produces the index of that C
the innermost statement that has waiting program coun- instance within the sieve array, by the usual process
ters within it, that contains the current point. Frequently of C pointer arithmetic. Local variables value and
this new point is the same as the current point, and fre- candidate are declared within each instance of the
quently this fact can be determined at compile time; domain, and therefore are parallel values. The while
but in general the effect is to keep trying to pull lag- statement behaves in such a way that different domain
ging program counters forward, in such a way that once instances can execute different number of iterations, as
the master pc enters a block, it does not leave the block appropriate to the data within that instance; when an
until every virtual pc has left the block. Thus this exe- active instance computes a zero (false) value for the test
cution rule respects block structure (and subroutine call expression candidate, that instance simply becomes
structure) in a natural manner, while allowing the pro- inactive. When all instances have become inactive, then
grammer to make arbitrary use of goto statements the while statement completes after making active
if desired. (This theory of control flow was inspired exactly those instances that had been active when exe-
by earlier investigations of a SIMD-like theory giving cution of the while statement began. The mono stor-
rise to a superficially MIMD-like control structure in age class keyword indicates that a variable should be
Connection Machine Lisp [].) allocated just once (in the front-end processor), not
A compiler for this version of C* was independently once within each domain instance. The unary oper-
implemented at the University of New Hampshire [] ator <?= is the minimum-reduction operator, so the
for a MIMD multicomputer (an N-Cube , whose expression (<?= value) returns the smallest integer
processors communicated by message passing through that any active domain instance holds in its value
a hypercube network). variable. The array sieve can be indexed in the nor-
Figure  shows an early () example of a C* pro- mal C fashion, by either a mono (front-end) index, as
gram that identifies prime numbers by the method of in this example code, or by a poly (parallel) value,
the Sieve of Eratosthenes, taken (with one minor cor- which can be used to perform inter-domain (i.e., inter-
rection) from []. In typical C style, N is defined to processor) communication. The if statement is han-
be a preprocessor name that expands to the integer dled in much the same way as the while statement:
literal 100000. The name bit is then defined to be If an active instance computes a zero (false) value for
a synonym for the type of -bit integers. The domain the test expression candidate, that instance simply
declaration defines a parallel processing domain named becomes inactive during execution of the “then” part of
SIEVE that has a field named prime of type bit, and the if statement, and then becomes active again. (If an
then declares an array named sieve of length N, each if statement has an else part, then processors that
element of which is an instance of this domain. (This had been active for the “then” part become temporarily
single statement could instead have been written as two inactive during execution of the else part.).
statements:
domain SIEVE { bit prime; };
The Revised Design of C*
domain SIEVE sieve[N];
In , the C* language was revised substantially []
In this respect domain declarations are very much like to produce version .; this version was initially imple-
C struct declarations.) The function find_primes mented for the Connection Machine model CM-
is declared within the domain SIEVE, so when it [, ] and later for the model CM- as well as the
is called, its body is executed within every instance model CM- [, ]. This revised design was intended
of that domain, with a distinct virtual processor of to be closer in style to ANSI C than to C++. The
the Connection Machine containing and processing biggest change was to introduce the notion of a shape,
 C C*

#define N 100000

typedef int bit:1;

domain SIEVE { bit prime; } sieve[N];

void SIEVE::find_primes() {
int value = this - &sieve[0];
bit candidate = (value >= 2);
prime = 0;
while (candidate) {
mono int next_prime = (<?= value);
sieve[next_prime].prime = 1;
if (value % next_prime == 0) candidate = 0;
}
}

C*. Fig.  Example version  C* program for identifying prime numbers

which essentially describes a (possibly multidimen- then cube is a three-dimensional shape having ,
sional) array of virtual processors. An ordinary variable ( ×  × ) distinct positions, z is a parallel variable
declaration may be tagged with the name of a shape, consisting of one float value at each of these ,
which indicates that the declaration is replicated for shape positions, and a is a parallel array variable that has
parallel processing, one instance for each position in a × array of int values at each of , shape posi-
the shape. Where the initial design of C* required that tions (for a total of ,, distinct int values). Then
all instances of a domain be processed in parallel, the [3][13][5]z refers to one particular float value
new design allows declaration of several different shapes within z at position (, , ) within the cube shape,
(parallel arrays) having the same data layout, and differ- and [3][13][5]a[4][7] refers to one particular
ent shapes may be chosen at different times for parallel int element within the particular  ×  array at that
execution. Parallel execution is initiated using a newly same position. One may also write [3][13][5]a to
introduced with statement that specifies a shape: refer to (the address of) that same  ×  array. (This
use of left subscripts to index “across processors” and
with (shape) statement right subscripts to index “within a processor” may be
compared to the use of square brackets and parenthe-
The statement is executed with the specified shape as the ses to distinguish two sorts of subscript in Co-Array
“current shape” for parallel execution. C operators may Fortran [].)
be applied to parallel data much as in the original ver- Although pointers can be used for interproces-
sion of C*, but such data must have the current shape (a sor communication exactly as in earlier versions
requirement enforced by the compiler at compile time). of C*, such communication is more conveniently
Positions within shapes may be selected by indexing. expressed in C* version . by shape indexing (writing
In order to distinguish shape indexing from ordinary [i][j][k]z = b rather than *p = b, for example),
array indexing, shape subscripts are written to the left thus affording a more array-oriented style of program-
rather than to the right. Given these declarations: ming. Furthermore, version . of C* abandons the
entire “master program counter” execution model that
shape [16][16][16]cube; allowed all C control structures to be used for parallel
float:cube z; execution. Instead, statements such as if, while, and
int:cube a[10][10]; switch are restricted to test nonparallel values, and a
C* C 

newly introduced where statement tests parallel val- and therefore are parallel values. The while statement
ues, behaving very much like the where statement in of Fig.  becomes a while statement and a where
Fortran . The net effect of all these revisions is to give statement in Fig. ; the new while statement uses an
version . of C* an execution model and programming explicit or-reduction operator |= to decide whether
style more closely resembling those of *Lisp and CM there are any remaining candidates; if there are, the C
Fortran. where statement makes inactive every position in the
Figure  shows a rewriting of the code in Fig.  into shape for which its candidate value is zero. The dec-
C* version .. The domain declaration is replaced by laration of local variable next_prime does not men-
a shape declaration that specifies a one-dimensional tion a shape, so it is not a parallel variable (note that
shape named SIEVE of size N. The global variable the mono keyword is no longer used). A left subscript
prime of type bit is declared within shape SIEVE. is used to assign 1 to a single value of prime at the
The function find_primes is no longer declared position within shape SIEVE indicated by the value of
within a domain. When it is called, the with state- next_prime.
ment establishes SIEVE as the current shape for par- The need to split what was a single while state-
allel execution; conceptually, a distinct virtual processor ment in earlier versions of C* into two nested statements
of the Connection Machine is associated with each posi- (a while with an or-reduction operator containing a
tion in the current shape. The function pcoord takes where that typically repeats the same expression) has
an integer specfying an axis number and returns, at been criticized, as well as the syntax of type and shape
each position of the current shape, an integer indicating declarations and the lack of nested parallelism [].
the index of the position along the specified axis; thus Version . of C* [] introduced a “global/local pro-
in this example pcoord returns values ranging from gramming” feature, allowing C* to be used for MIMD
0 to 99999. Local variables value and candidate programming on the model CM-. A special “proto-
are explicitly declared as belonging to shape SIEVE, type file” is used to specify which functions are local

#define N 100000

typedef int bit:1;

shape [N]SIEVE;
bit:SIEVE prime;

void find_primes() {
with (SIEVE) {
int:SIEVE value = pcoord(0);
bit:SIEVE candidate = (value >= 2);
prime = 0;
while (|= candidate) {
where (candidate) {
int next_prime = (<?= value);
[next_prime]prime = 1;
if (value % next_prime == 0) candidate = 0;
}
}
}
}

C*. Fig.  Example version  C* program for identifying prime numbers


 C Cache Affinity Scheduling

and the calling interface to be used when global code . Thinking Machines Corporation () C∗ programming guide.
calls each local function. The idea is that a local func- Cambridge, MA
tion executes on a single processing node but can use . Thinking Machines Corporation () Connection Machine
CM- technical summary, rd edn. Cambridge, MA
all the facilities of C*, including parallelism (which
. Thinking Machines Corporation () C∗ . Alpha release
might be implemented on the model CM- through notes. Cambridge, MA
SIMD vector accelerator units). This facility is very . Tichy WF, Philippsen M, Hatcher P () A critique of the
similar to “local subprograms” in High Performance programming language C∗ . Commun ACM ():–
Fortran [].

Related Entries Cache Affinity Scheduling


Coarray Fortran
Connection Machine Affinity Scheduling
Connection Machine Fortran
Connection Machine Lisp
HPF (High Performance Fortran) Cache Coherence
*Lisp
Xiaowei Shen
IBM Research, Armonk, NY, USA
Bibliography
. Frankel JL () A reference description of the C* language. Definition
Technical Report TR-, Thinking Machines Corporation, A shared-memory multiprocessor system provides a
Cambridge, MA
global address space in which processors can exchange
. Koelbel CH, Loveman DB, Schreiber RS, Steele GL Jr, Zosel ME
() The High Performance Fortran handbook. MIT Press, information and synchronize with one another. When
Cambridge, MA shared variables are cached in multiple caches simulta-
. Numrich RW, Reid J () Co-array Fortran for parallel pro- neously, a memory store operation performed by one
gramming. SIGPLAN Fortran Forum ():– processor can make data copies of the same variable
. Quinn MJ, Hatcher PJ () Data-parallel programming on mul-
in other caches out of date. Cache coherence ensures a
ticomputers. IEEE Softw ():–
. Rose JR, Steele GL Jr () C∗ : An extended C language for coherent memory image for the system so that each pro-
data parallel programming. Technical Report PL –, Thinking cessor can observe the semantic effect of memory access
Machines Corporation, Cambridge, MA operations performed by other processors in time.
. Rose JR, Steele GL Jr () C∗ : An extended C language for data
parallel programming. In: Supercomputing ’: Proceedings of
the second international conference on supercomputing, vol II:
Discussion
Industrial supercomputer applications and computations. Inter- The cache coherence mechanism plays a crucial role in
national Supercomputing Institute, Inc., St. Petersburg, Florida, the construction of a shared-memory system, because
pp – of its profound impact on the overall performance and
. Steele GL Jr, Daniel Hillis W () Connection Machine Lisp: implementation complexity. It is also one of the most
fine-grained parallel symbolic processing. In: LFP ’: Proc. 
complicated problems in the design, because an efficient
ACM conference on LISP and functional programming, ACM
SIGPLAN/SIGACT/SIGART, ACM, New York, pp –, Aug
cache coherence protocol usually incorporates various
 optimizations.
. Thinking Machines Corporation () Connection Machine
model CM- technical summary. Technical report HA-, Cache Coherence and Memory Consistency
Cambridge, MA
The cache coherence protocol of a shared-memory mul-
. Thinking Machines Corporation () C∗ programming guide,
version . Pre-Beta. Cambridge, MA tiprocessor system implements a memory consistency
. Thinking Machines Corporation () C∗ user’s guide, version model that defines the semantics of memory access
. Pre-Beta. Cambridge, MA instructions. The essence of memory consistency is the
Cache Coherence C 

correspondence between each load instruction and the Sometimes people may get confused between mem-
store instruction that supplies the data retrieved by ory consistency models and cache coherence protocols.
the load instruction. The memory consistency model A memory consistency model defines the semantics
of uniprocessor systems is intuitive: a load opera- of memory operations, in particular, for each memory
tion returns the most recent value written to the load operation, the data value that should be provided C
address, and a store operation binds the value for sub- by the memory system. The memory consistency model
sequent load operations. In parallel systems, however, is a critical part of the semantics of the Instruction-Set
notions such as “the most recent value” can become Architecture of the system, and thus should be exposed
ambiguous since multiple processors access memory to the system programmer. A cache coherence proto-
concurrently. col, in contrast, is an implementation-level protocol that
An ideal memory consistency model should allow defines how caches should be kept coherent in a mul-
efficient implementations while still maintaining sim- tiprocessor system in which data of a memory address
ple semantics for the architect and the compiler writer can be replicated in multiple caches, and thus should be
to reason about. Sequential consistency [] is a domi- made transparent to the system programmer. Generally
nant memory model in parallel computing for decades speaking, in a shared-memory multiprocessor system,
due to its simplicity. A system is sequentially consis- the underlying cache coherence protocol, together with
tent if the result of any execution is the same as if the some proper memory operation reordering constraint
operations of all the processors were executed in some often enforced when memory operations are issued,
sequential order, and the operations of each individual implements the semantics of the memory consistency
processor appear in this sequence in the order specified model defined for the system.
by its program.
Sequential consistency is easy for programmers
to understand and use, but it often prohibits many Snoopy Cache Coherence
architectural and compiler optimizations. The desire to A symmetric multiprocessor (SMP) system generally
achieve higher performance has led to relaxed mem- employs a snoopy mechanism to ensure cache coher-
ory models, which can provide more implementation ence. When a processor reads an address not in its
flexibility by exposing optimizing features such as cache, it broadcasts a read request on the bus or net-
instruction reordering and data caching. Modern work, and the memory or the cache with the most up-
microprocessors [, ] support selected relaxed memory to-date copy can then supply the data. When a processor
consistency models that allow memory accesses to be broadcasts its intention to write an address which it does
reordered, and provide memory fences that can be used not own exclusively, other caches need to invalidate or
to ensure proper memory access ordering constraints update their copies.
whenever necessary. With snoopy cache coherence, when a cache miss
It is worth noting that, as a reaction to ever- occurs, the requesting cache sends a cache request to
changing memory models and their complicated and the memory and all its peer caches. When a peer cache
imprecise definitions, there is a desire to go back receives the cache request, it performs a cache snoop
to the simple, easy-to-understand sequential consis- operation and produces a cache snoop response indi-
tency, even though there are a plethora of problems cating whether the requested data is found in the peer
in its high-performance implementation []. Ingenious cache and the state of the corresponding cache line.
solutions have been devised to maintain the sequen- A combined snoop response can be generated based
tial consistency semantics so that programmers can- on cache snoop responses from all the peer caches. If
not detect if and when the memory accesses are the requested data is found in a peer cache, the peer
out of order or nonatomic. For example, advances cache can source the data to the requesting cache via
in speculative execution may permit memory access a cache-to-cache transfer, which is usually referred to
reordering without affecting the semantics of sequential as a cache intervention. The memory is responsible for
consistency. supplying the requested data if the combined snoop
 C Cache Coherence

response shows that the data cannot be supplied by any It should be pointed out that the MESI protocol
peer cache. described above is just an exemplary protocol to show
the essence of cache coherence operations. It can be
Example: The MESI Cache Coherence modified or tailored in various ways for implementa-
Protocol tion optimization. For example, one can imagine that
A number of snoopy cache coherence protocols have a cache line in a shared state can provide data for a
been proposed. The MESI coherence protocol and its read cache miss, rather than letting the memory pro-
variations have been widely used in SMP systems. As vide the data. This may provide better response time
the name suggests, MESI has four cache states, modified for a system in which cache-to-cache data transfer is
(M), exclusive (E), shared (S), and invalid (I). faster than memory-to-cache data transfer. Since there
can be more than one cache with the requested data in
● I (invalid): The data is not valid. This is the initial
the shared state, the cache coherence protocol needs to
state or the state after a snoop invalidate hit.
specify which cache in the shared state should provide
● S (shared): The data is valid, and can also be valid
the data, or how extra data copies should be handled in
in other caches. This state is entered when the data
case multiple caches provide the requested data at the
is sourced from the memory or another cache in
same time.
the modified state, and the corresponding snoop
response shows that the data is valid in at least one
of the other caches. Example: An Enhanced MESI Cache
● E (exclusive): The data is valid, and has not been Coherence Protocol
modified. The data is exclusively owned, and cannot In modern SMP systems, when a cache miss occurs,
be valid in another cache. This state is entered when if the requested data is found in both the memory
the data is sourced from the memory or another and a cache, supplying the data via a cache interven-
cache in the modified state, and the corresponding tion is often preferred over supplying the data from
snoop response shows that the data is not valid in the memory, because cache-to-cache transfer latency is
another cache. usually smaller than memory access latency. Further-
● M (modified): The data is valid and has been mod- more, when caches are on the same die or in the same
ified. The data is exclusively owned, and cannot be package module, there is usually more bandwidth avail-
valid in another cache. This state is entered when a able for cache-to-cache transfers, compared with the
store operation is performed on the cache line. bandwidth available for off-chip DRAM accesses.
The IBM Power- system [], for example, enhances
With the MESI protocol, when a read cache miss occurs,
the MESI coherence protocol to allow more cache inter-
if the requested data is found in another cache and the
ventions. Compared with MESI, an enhanced coher-
cache line is in the modified state, the cache with the
ence protocol allows data of a shared cache line to be
modified data supplies the data via a cache intervention
sourced via a cache intervention. In addition, if data
(and writes the most up-to-date data back to the mem-
of a modified cache line is sourced from one cache to
ory). However, if the requested data is found in another
another, the modified data does not have to be written
cache and the cache line is in the shared state, the cache
back to the memory immediately. Instead, a cache with
with the shared data does not supply the requested data,
the most up-to-date data can be held responsible for
since it cannot guarantee from the shared state that it is
memory update when it becomes necessary to do so. An
the only cache that is to source the data. In this case, the
exemplary enhanced MESI protocol employing seven
memory will supply the data to the requesting cache.
cache states is as follows.
When a write cache miss occurs, if data of the mem-
ory address is cached in one or more other caches in the ● I (invalid): The data is invalid. This is the initial state
shared state, all those cached copies in other caches are or the state after a snoop invalidate hit.
invalidated before the write operation can be performed ● SL (shared, can be sourced): The data is valid and
in the local cache. may also be valid in other caches. The data can
Cache Coherence C 

be sourced to another cache in the same module requested data to reduce communication latency and
via a cache intervention. This state is entered when bandwidth consumption of cache intervention. Thus, it
the data is sourced from another cache or from the is probably desirable to enhance cache coherence mech-
memory. anisms with cost-conscious cache-to-cache transfers to
● S (shared): The data is valid, and may also be valid in improve overall performance in SMP systems. C
other caches. The data cannot be sourced to another
cache. This state is entered when a snoop read hit Broadcast-Based Cache Coherence Versus
from another cache in the same module occurs on a Directory-Based Cache Coherence
cache line in the SL state. A major drawback of broadcast-based snoopy cache
● M (modified): The data is valid, and has been mod- coherence protocols is that a cache request is usually
ified. The data is exclusively owned, and cannot be broadcast to all caches in the system. This can cause
valid in another cache. The data can be sourced to serious problems to overall performance, system scala-
another cache. This state is entered when a store bility, and power consumption, especially for large-scale
operation is performed on the cache line. multiprocessor systems. Further, broadcasting cache
● ME (exclusive): The data is valid, and has not been requests indiscriminately may consume enormous net-
modified. The data is exclusively owned, and cannot work bandwidth, while snooping peer caches unnec-
be valid in another cache. essarily may require excessive cache snoop ports. It is
● MU (unsolicited modified): The data is valid and worth noting that servicing a cache request may take
is considered to have been modified. The data is more time than necessary when far away caches are
exclusively owned and cannot be valid in another snooped unnecessarily.
cache. Unlike broadcast-based snoopy cache coherence
● T (tagged): The data is valid and has been modi- protocols, a directory-based cache coherence protocol
fied. The modified data has been sourced to another maintains a directory entry to record the cache sites in
cache. This state is entered when a snoop read hit which each memory block is currently cached []. The
occurs on a cache line in the M state. directory entry is often maintained at the site in which
the corresponding physical memory resides. Since the
When data of a memory address is shared in multiple locations of shared copies are known, the protocol
caches in a single module, the single module can include engine at each site can maintain coherence by employ-
at most one cache in the SL state. The cache in the SL ing point-to-point protocol messages. The elimination
state is responsible for supplying the shared data via a of broadcast overcomes a major limitation on scaling
cache intervention when a cache miss occurs in another cache coherent machines to large-scale multiprocessor
cache in the same module. At any time, the particular systems.
cache that can source the shared data is fixed, regard- Typical directory-based protocols maintain a direc-
less of which cache has issued the cache request. When tory entry for each memory block to record the caches
data of a memory address is shared in more than one in which the memory block is currently cached. With
module, each module can include a cache in the SL state. a full-map directory structure, for example, each direc-
A cache in the SL state can source the data to another tory entry comprises one bit for each cache in the sys-
cache in the same module, but cannot source the data tem, indicating whether the cache has a data copy of the
to a cache in a different module. memory block. Given a memory address, its directory
In systems in which a cache-to-cache transfer can entry is usually maintained in a node in which the cor-
take multiple message-passing hops, sourcing data from responding physical memory resides. This node is often
different caches can result in different communication referred to as the home of the memory address. When
latency and bandwidth consumption. When a cache a cache miss occurs, the requesting cache sends a cache
miss occurs in a requesting cache, if requested data is request to the home, which generates appropriate point-
shared in more than one peer cache, a peer cache that is to-point coherence messages according to the directory
closest to the requesting cache is preferred to supply the information.
 C Cache-Only Memory Architecture (COMA)

However, directory-based cache coherence protocols at different receiving caches. Appropriate mechanisms
have various shortcomings. For example, maintaining a are needed to guarantee correctness of cache coherence
directory entry for each memory block usually results for network-based multiprocessor systems [, ].
in significant storage overhead. Alternative directory
structures may reduce the storage overhead with perfor- Bibliography
mance compromises. Furthermore, accessing directory . Lamport L () How to make a multiprocessor computer that
can be time consuming since directory informa- correctly executes multiprocess programs. IEEE Trans Comput
tion is usually stored in DRAM. Caching recently C-():–
. May C, Silha E, Simpson R, Warren H () The powerPC archi-
used directory entries can potentially reduce direc-
tecture: a specification for a new family of RISC processors. Mor-
tory access latencies but with increased implementation gan Kaufmann, San Francisco
complexity. . Intel Corporation () IA- application developer’s architec-
Accessing directory causes three or four message- ture guide
passing hops to service a cache request, compared with . Gniady C, Falsafi B, Vijaykumar T () Is SC+ILP=RC? In: Pro-
two message-passing hops with snoopy cache coher- ceedings of the th annual international symposium on computer
architecture (ISCA ), Atlanta, – May , pp –
ence protocols. Consider a scenario in which a cache
. Tendler J, Dodson J, Fields J, Le H, Sinharoy B () POWER-
miss occurs in a requesting cache, while the requested system microarchitecture. IBM J Res Dev ():
data is modified in another cache. To service the cache . Chaiken D, Fields C, Kurihara K, Agarwal A () Directory-
miss, the requesting cache sends a cache request to based cache coherence in large-scale multiprocessors. Computer
the corresponding home. When the home receives the ():–
. Martin M, Hill M, Wood D () Token coherence: decoupling
cache request, it forwards the cache request to the cache
performance and corrections. In: Proceedings of the th annual
that contains the modified data. When the cache with international symposium on computer architecture international
the modified data receives the forwarded cache request, symposium on computer architecture, San Diego, – June 
it sends the requested data to the requesting cache (an . Strauss K, Shen X, Torrellas J () Uncorq: unconstrained snoop
alternative is to send the requested data to the home, request delivery in embedded-ring multiprocessors. In: Proceed-
which will forward the requested data to the requesting ings of the th annual IEEE/ACM international symposium on
microarchitecture, Chicago, pp –, – Dec 
cache).

Cache Coherence for Network-Based


Multiprocessor Systems
In a modern shared-memory multiprocessor system, Cache-Only Memory Architecture
caches can be interconnected with each other via a (COMA)
message-passing network instead of a shared bus to
Josep Torrellas
improve system scalability and performance. In a bus-
University of Illinois at Urbana-Champaign, Urbana,
based SMP system, the bus behaves as a central arbi-
IL, USA
trator that serializes all bus transactions. This ensures
a total order of bus transactions. In a network-based
multiprocessor system, in contrast, when a cache broad- Synonyms
casts a message, the message is not necessarily observed COMA (Cache-only memory architecture)
atomically by all the receiving caches. For example, it
is possible that cache A multicasts a message to caches Definition
B and C, cache B receives the broadcast message and A Cache-Only Memory Architecture (COMA) is a
then sends a message to cache C, and cache C receives type of cache-coherent nonuniform memory access
cache B’s message before receiving cache A’s multicast (CC-NUMA) architecture. Unlike in a conventional
message. CC-NUMA architecture, in a COMA, every shared-
Protocol correctness can be compromised when memory module in the machine is a cache, where each
multicast messages can be received in different orders memory line has a tag with the line’s address and state.
Cache-Only Memory Architecture (COMA) C 

As a processor references a line, it transparently brings inserted in both the processor’s cache and the node’s
it to both its private cache(s) and its nearby portion of AM. A line can be evicted from an AM if another line
the NUMA shared memory (Local Memory) – possibly needs the space. Ideally, with this support, the proces-
displacing a valid line from its local memory. Effectively, sor dynamically attracts its working set into its local
each shared-memory module acts as a huge cache mem- memory module. The lines the processor is not access- C
ory, giving the name COMA to the architecture. Since ing overflow and are sent to other memories. Because a
the COMA hardware automatically replicates the data large AM is more capable of containing a node’s current
and migrates it to the memory module of the node that working set than a cache is, more of the cache misses are
is currently accessing it, COMA increases the chances of satisfied locally within the node.
data being available locally. This reduces the possibility There are three issues that need to be addressed in
of frequent long-latency memory accesses. Effectively, COMA, namely finding a line, replacing a line, and deal-
COMA dynamically adapts the shared data layout to the ing with the memory overhead. In the rest of this article,
application’s reference patterns. these issues are described first, then different COMA
designs are outlined, and finally further readings are
suggested.
Discussion
Basic Concepts Finding a Memory Line
In a conventional CC-NUMA architecture, each node In a COMA, the address of a memory line is a global
contains one or more processors with private caches and identifier, not an indicator of the line’s physical location
a memory module that is part of the NUMA shared in memory. Just like a normal cache, the AM keeps a tag
memory. A page allocated in the memory module of with the address and state of the memory line currently
one node can be accessed by the processors of all other stored in each memory location. On a cache miss, the
nodes. The physical page number of the page speci- memory controller has to look up the tags in the local
fies the node where the page is allocated. Such node is AM to determine whether or not the access can be ser-
referred to as the Home Node of the page. The physi- viced locally. If the line is not in the local AM, a remote
cal address of a memory line includes the physical page request is issued to locate the block.
number and the offset within that page. COMA machines have a mechanism to locate a line
In large machines, fetching a line from a remote in the system so that the processor can find a valid copy
memory module can take several times longer than of the line when a miss occurs in the local AM. Differ-
fetching it from the local memory module. Conse- ent mechanisms are used by different classes of COMA
quently, for an application to attain high performance, machines.
the local memory module must satisfy a large fraction One approach is to organize the machine hierarchi-
of the cache misses. This requires a good placement of cally, with the processors at the leaves of the tree. Each
the program pages across the different nodes. If the pro- level in the hierarchy includes a directory-like structure,
gram’s memory access patterns are too complicated for with information about the status of the lines present in
the software to understand, individual data structures the subtree extending from the leaves up to that level of
may not end up being placed in the memory module of the hierarchy. To find a line, the processing node issues
the node that access them the most. In addition, when a a request that goes to successively higher levels of the
page contains data structures that are read and written tree, potentially going all the way to the root. The pro-
by different processors, it is hard to attain a good page cess stops at the level where the subtree contains the line.
placement. This design is called Hierarchical COMA [, ].
In a COMA, the hardware can transparently elimi- Another approach involves assigning a home node
nate a certain class of remote memory accesses. COMA to each memory line, based on the line’s physical
does this by turning memory modules into large caches address. The line’s home has the directory entry for the
called Attraction Memory (AM). When a processor line. Memory lines can freely migrate, but directory
requests a line from a remote memory, the line is entries do not. Consequently, to locate a memory line,
 C Cache-Only Memory Architecture (COMA)

a processor interrogates the directory in the line’s home It also enhances line migration to the AMs of the ref-
node. The directory always knows the state and location erencing nodes because less line relocation traffic is
of the line and can forward the request to the right node. needed.
This design is called Flat COMA []. Without unallocated space, every time a line is
inserted in the AM, another line would have to be relo-
Replacing a Memory Line cated. The ratio between the allocated data size and the
The AM acts as a cache, and lines can be displaced total size of the AMs is called the Memory Pressure. If the
from it. When a line is displaced in a plain cache, it is memory pressure is %, then % of the AM space is
either overwritten (if it is unmodified) or written back available for data replication. Both the relocation traffic
to its home memory module, which guarantees a place and the number of AM misses increase with the mem-
for the line. ory pressure []. For a given memory size, choosing an
A memory line in COMA does not have a fixed appropriate memory pressure is a trade-off between the
backup location where it can be written to if it gets effect on page faults, AM misses, and relocation traffic.
Different Cache-Only Memory Architecture Designs

Hierarchical COMA
The first designs of COMA machines follow what has been called Hierarchical COMA. These designs organize the machine hierarchically, connecting the processors to the leaves of the tree. These machines include the KSR- [] from Kendall Square Research, which has a hierarchy of rings, and the Data Diffusion Machine (DDM) [] from the Swedish Institute of Computer Science, which has a hierarchy of buses.
Each level in the tree hierarchy includes a directory-like structure, with information about the status of the lines extending from the leaves up to that level of the hierarchy. To find a line, the processing node issues a request that goes to successively higher levels of the tree, potentially going all the way to the root. The process stops at the level where the subtree contains the line.
In these designs, substantial latency occurs as the memory requests go up the hierarchy and then down to find the desired line. It has been argued that such latency can offset the potential gains of COMA relative to conventional CC-NUMA architectures [].

Flat COMA
A design called Flat COMA makes it easy to locate a memory line by assigning a home node to each memory line [] – based on the line's physical address. The line's home has the directory entry for the line, like in a conventional CC-NUMA architecture. The memory lines can freely migrate, but the directory entries of the memory lines are fixed in their home nodes. At a miss on a
line in an AM, a request goes to the node that is keeping the directory information about the line. The directory redirects the request to another node if the home does not have a copy of the line. In Flat COMA, unlike in a conventional CC-NUMA architecture, the home node may not have a copy of the line even though no processor has written to the line. The line has simply been displaced from the AM in the home node.
Because Flat COMA does not rely on a hierarchy to find a block, it can use any high-speed network.

Simple COMA
A design called Simple COMA (S-COMA) [] transfers some of the complexity in the AM line displacement and relocation mechanisms to software. The general coherence actions, however, are still maintained in hardware for performance reasons. Specifically, in S-COMA, the operating system sets aside space in the AM for incoming memory blocks on a page-granularity basis. The local Memory Management Unit (MMU) has mappings only for pages in the local node, not for remote pages. When a node accesses for the first time a shared page that is already in a remote node, the processor suffers a page fault. The operating system then allocates a page frame locally for the requested line. Thereafter, the hardware continues with the request, including locating a valid copy of the line and inserting it, in the correct state, in the newly allocated page in the local AM.
The rest of the page remains unused until future requests to other lines of the page start filling it. Subsequent accesses to the line get their mapping directly from the MMU. There are no AM address tags to check if the correct line is accessed.
Since the physical address used to identify a line in the AM is set up independently by the MMU in each node, two copies of the same line in different nodes are likely to have different physical addresses. Shared data needs a global identity so that different nodes can communicate. To this end, each node has a translation table that converts local addresses to global identifiers and vice versa.

Multiplexed Simple COMA
S-COMA sets aside memory space in page-sized chunks, even if only one line of each page is present. Consequently, S-COMA suffers from memory fragmentation. This can cause programs to have inflated working sets that overflow the AM, inducing frequent page replacements and resulting in high operating system overhead and poor performance.
Multiplexed Simple COMA (MS-COMA) [] eliminates this problem by allowing multiple virtual pages in a given node to map to the same physical page at the same time. This mapping is possible because not all the lines on a virtual page are used at the same time. A given physical page can now contain lines belonging to different virtual pages if each line has a short virtual page ID. If two lines belonging to different pages have the same page offset, they displace each other from the AM. The overall result is a compression of the application's working set.

Further Readings
There are several papers that discuss COMA and related topics. Dahlgren and Torrellas present a more in-depth survey of COMA machine issues []. There are several designs that combine COMA and conventional CC-NUMA architecture features, such as NUMA with Remote Caches (NUMA-RC) [], Reactive NUMA [], Excel-NUMA [], the Sun Microsystems' WildFire multiprocessor design [], the IBM Prism architecture [], and the Illinois I-ACOMA architecture []. A model for comparing the performance of COMA and conventional CC-NUMA architectures is presented by Zhang and Torrellas []. Soundararajan et al. [] describe the trade-offs related to data migration and replication in CC-NUMA machines.

Bibliography
. Basu S, Torrellas J () Enhancing memory use in simple COMA: multiplexed simple COMA. In: International symposium on high-performance computer architecture, Las Vegas, February 
. Burkhardt H et al () Overview of the KSR computer system. Technical Report, Kendall Square Research, Waltham, February 
. Dahlgren F, Torrellas J () Cache-only memory architectures. IEEE Computer Magazine ():–, June 
. Ekanadham K, Lim B-H, Pattnaik P, Snir M () PRISM: an integrated architecture for scalable shared memory. In: International symposium on high-performance computer architecture, Las Vegas, February 
. Falsafi B, Wood D () Reactive NUMA: a design for unifying S-COMA and CC-NUMA. In: International symposium on computer architecture, Denver, June 
. Hagersten E, Koster M () WildFire: a scalable path for SMPs. In: International symposium on high-performance computer architecture, Orlando, January 
. Hagersten E, Landin A, Haridi S () DDM – a cache-only memory architecture. IEEE Computer ():–
. Joe T, Hennessy J () Evaluating the memory overhead required for COMA architectures. In: International symposium on computer architecture, Chicago, April , pp –
. Moga A, Dubois M () The effectiveness of SRAM network caches in clustered DSMs. In: International symposium on high-performance computer architecture, Las Vegas, February 
. Saulsbury A, Wilkinson T, Carter J, Landin A () An argument for simple COMA. In: International symposium on high-performance computer architecture, Raleigh, January , pp –
. Soundararajan V, Heinrich M, Verghese B, Gharachorloo K, Gupta A, Hennessy J () Flexible use of memory for replication/migration in cache-coherent DSM multiprocessors. In: International symposium on computer architecture, Barcelona, June 
. Stenstrom P, Joe T, Gupta A () Comparative performance evaluation of cache-coherent NUMA and COMA architectures. In: International symposium on computer architecture, Gold Coast, Australia, May , pp –
. Torrellas J, Padua D () The Illinois aggressive COMA multiprocessor project (I-ACOMA). In: Symposium on the frontiers of massively parallel computing, Annapolis, October 
. Zhang Z, Cintra M, Torrellas J () Excel-NUMA: toward programmability, simplicity, and high performance. IEEE Trans Comput ():–. Special Issue on Cache Memory, February 
. Zhang Z, Torrellas J () Reducing remote conflict misses: NUMA with remote cache versus COMA. In: International symposium on high-performance computer architecture, San Antonio, February , pp –

Caches, NUMA
NUMA Caches

Calculus of Mobile Processes
Pi-Calculus

Carbon Cycle Research
Terrestrial Ecosystem Carbon Modeling

Car-Parrinello Method
Mark Tuckerman¹, Eric J. Bohm², Laxmikant V. Kalé², Glenn Martyna³
¹New York University, New York, NY, USA
²University of Illinois at Urbana-Champaign, Urbana, IL, USA
³IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA

Synonyms
Ab initio molecular dynamics; First-principles molecular dynamics

Definition
A Car–Parrinello simulation is a molecular dynamics based calculation in which the finite-temperature dynamics of a system of N atoms is generated using forces obtained directly from electronic structure calculations performed "on the fly" as the simulation proceeds. A typical Car–Parrinello simulation employs a density functional description of the electronic structure, a plane-wave basis expansion of the single-particle orbitals, and periodic boundary conditions on the simulation cell. The original paper has seen an exponential rise in the number of citations, and the method has become a workhorse for studying systems which undergo nontrivial electronic structure changes.

Discussion
Atomistic modeling of many systems in physics, chemistry, biology, and materials science requires explicit treatment of chemical bond-breaking and forming events. The methodology of ab initio molecular dynamics (AIMD), in which the finite-temperature dynamics of a system of N atoms is generated using forces obtained directly from the electronic structure calculations performed "on the fly" as the simulation proceeds, can describe such processes in a manner that
is both general and transferable from system to system. The Car–Parrinello method is one type of AIMD simulation.
Assuming that a physical system can be described classically in terms of its N constituent atoms, having masses M₁, . . . , MN and charges Z₁e, . . . , ZNe, the classical microscopic state of the system is completely determined by specifying the Cartesian positions R₁, . . . , RN of its atoms and their corresponding conjugate momenta P₁, . . . , PN as functions of time. In a standard molecular dynamics calculation, the time evolution of the system is determined by solving Hamilton's equations of motion

ṘI = PI/MI,    ṖI = FI

which can be combined into the second-order differential equations

MI R̈I = FI

Here, ṘI = dRI/dt, ṖI = dPI/dt are the first derivatives with respect to time of position and momentum, respectively, and R̈I = d²RI/dt² is the second derivative of position with respect to time, and FI is the total force on atom I due to all of the other atoms in the system. The force FI is a function FI(R₁, . . . , RN) of all of the atomic positions, hence Newton's equations of motion constitute a set of 3N coupled second-order ordinary differential equations.
Any molecular dynamics calculation requires the functional form FI(R₁, . . . , RN) of the forces as an input to the method. In most molecular dynamics calculations, the forces are modeled using simple functions that describe bond stretching, angle bending, torsion, van der Waals, and Coulombic interactions and a set of parameters for these interactions that are fit either to experiment or to high-level ab initio calculations. Such models are referred to as force fields, and while force fields are useful for many types of applications, they generally are unable to describe chemical bond-breaking and forming events and often neglect electronic polarization effects. Moreover, the parameters cannot be assumed to remain valid in thermodynamic states very different from the one for which they were originally fit. Consequently, most force fields are unsuitable for studying chemical processes under varying external conditions.

An AIMD calculation circumvents the need for an explicit force field model by obtaining the forces FI(R₁, . . . , RN) at a given configuration R₁, . . . , RN of the nuclei from a quantum mechanical electronic structure calculation performed at this particular nuclear configuration. To simplify the notation, let R denote the complete set R₁, . . . , RN of nuclear coordinates. Suppose the forces on the Born–Oppenheimer electronic ground state surface are sought. Let |Ψ₀(R)⟩ and Ĥel(R) denote, respectively, the ground-state electronic wave function and electronic Hamiltonian at the nuclear configuration R. If the system contains Ne electrons with position operators r̂₁, . . . , r̂Ne, then the electronic Hamiltonian in atomic units (e = 1, ħ = 1, me = 1, 4πε₀ = 1) is

Ĥel = −(1/2) ∑_{i=1}^{Ne} ∇i² + ∑_{i>j} 1/|r̂i − r̂j| − ∑_{i=1}^{Ne} ∑_{I=1}^{N} ZI/|r̂i − RI|

where the first, second, and third terms are the electron kinetic energy, the electron–electron Coulomb repulsion, and the electron–nuclear Coulomb attraction, respectively. The interatomic forces are given exactly by

FI(R) = −⟨Ψ₀(R)|∇I Ĥel(R)|Ψ₀(R)⟩ + ∑_{J≠I} ZI ZJ (RI − RJ)/|RI − RJ|³

by virtue of the Hellman–Feynman theorem.
In practice, it is usually not possible to obtain the exact ground-state wave function |Ψ₀(R)⟩, and, therefore, an approximate electronic structure method is needed. The approximation most commonly employed in AIMD calculations is the Kohn–Sham formulation of density functional theory. In the Kohn–Sham theory, the full electronic wave function is replaced by a set ψs(r), s = 1, . . . , Ns of mutually orthogonal single-electron orbitals (denoted collectively as ψ(r)) and the corresponding electron density

n(r) = ∑_{s=1}^{Ns} fs |ψs(r)|²

where fs is the occupation number of the state s. In closed-shell calculations Ns = Ne/2 with fs = 2, while for open-shell calculations Ns = Ne with fs = 1. The Kohn–Sham energy functional gives the total energy of the system as

E[ψ, R] = Ts[ψ] + EH[n] + Eext[n] + Exc[n]
where

Ts[ψ] = −(1/2) ∑_{s=1}^{Ns} fs ∫ dr ψs*(r) ∇² ψs(r)

EH[n] = (1/2) ∫ dr dr′ n(r) n(r′)/|r − r′|

Eext[n] = −∑_{I=1}^{N} ZI ∫ dr n(r)/|r − RI|

are the single-particle kinetic energy, the Hartree energy, and the external energy, respectively. The functional dependence of the term Exc[n], known as the exchange and correlation energy, is not known and must, therefore, be approximated. One of the most commonly used approximations is referred to as the generalized gradient approximation

Exc[n] ≈ ∫ dr f(n(r), ∇n(r))

where f is a scalar function of the density and its gradient. When the Kohn–Sham functional is minimized with respect to the electronic orbitals subject to the orthogonality condition ⟨ψs|ψs′⟩ = δss′, then the interatomic forces are given by

FI(R) = −∇I E[ψ⁽⁰⁾, R] + ∑_{J≠I} ZI ZJ (RI − RJ)/|RI − RJ|³

where ψ⁽⁰⁾ denotes the set of orbitals obtained by the minimization procedure.
Most AIMD calculations use a basis set for expanding the Kohn–Sham orbitals ψs(r). Because periodic boundary conditions are typically employed in molecular dynamics simulations, a useful basis set is a simple plane-wave basis. In fact, when the potential is periodic, the Kohn–Sham orbitals must be represented as Bloch functions, ψsk(r) = exp(ik · r) us(r), where k is a vector in the first Brillouin zone. However, if a large enough simulation cell is used, k can be taken to be (0, 0, 0) (the Gamma-point) for many chemical systems, as will be done here. In this case, the plane-wave expansion of ψs(r) becomes the simple Fourier representation

ψs(r) = (1/√V) ∑_g Cs(g) e^{ig·r}

where g = 2πn/V^{1/3}, with n a vector of integers, denotes the Fourier-space vector corresponding to a cubic box of volume V, and {Cs(g)} is the set of expansion coefficients. At the Gamma-point, the orbitals are purely real, and the coefficients satisfy the condition Cs*(g) = Cs(−g), which means that only half of the reciprocal space is needed to reconstruct the full set of Kohn–Sham orbitals. A similar expansion

n(r) = (1/V) ∑_g ñ(g) e^{ig·r}

is employed for the electronic density. Note that the coefficients ñ(g) depend on the orbital coefficients Cs(g). Because the density is real, the coefficients ñ(g) satisfy ñ*(g) = ñ(−g). Again, this condition means that the full density can be reconstructed using only half of the reciprocal space. In order to implement these expansions numerically, they must be truncated. The truncation criterion is based on the plane-wave kinetic energy |g|²/2. Specifically, the orbital expansion is truncated at a value Ecut such that |g|²/2 < Ecut, and the density, being determined from the squares of the orbitals, is truncated using the condition |g|²/2 < 4Ecut. When the above two plane-wave expansions are substituted into the Kohn–Sham energy functional, the resulting energy is an ordinary function of the orbital expansion coefficients that must be minimized with respect to these coefficients subject to the orthonormality constraint ∑_g Cs*(g) Cs′(g) = δss′.
An alternative to explicit minimization of the energy was proposed by Car and Parrinello (see bibliographic notes) based on the introduction of an artificial dynamics for the orbital coefficients

μ C̈s(g) = −∂E/∂Cs*(g) − ∑_{s′} Λss′ Cs′(g)

MI R̈I = −∂E/∂RI

In the above equations, known as the Car–Parrinello equations, μ is a mass-like parameter (having units of energy × time²) that determines the time scale on which the coefficients evolve, and Λss′ is a matrix of Lagrange multipliers for enforcing the orthogonality condition as a constraint. The mechanism of the Car–Parrinello equations is the following: An explicit minimization of the energy with respect to the orbital coefficients is carried out at a given initial configuration of the nuclei. Following this, the Car–Parrinello equations are integrated numerically with a value of μ small enough to
ensure that the coefficients respond as quickly as possible to the motion of the nuclei. In addition, the orbital coefficients must be assigned a fictitious kinetic energy satisfying

μ ∑_s ∑_g |Ċs(g)|² ≪ ∑_I MI |ṘI|²

in order to ensure that the coefficients remain close to the ground-state Born–Oppenheimer surface throughout the simulation.
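The structure of this fictitious dynamics can be made concrete with the toy Python sketch below. It is written for this entry rather than taken from any production code: the "energy" gradients are placeholder quadratic functions, all parameters are arbitrary, and orthonormality is restored with a simple Löwdin step instead of the Lagrange-multiplier (SHAKE/RATTLE-style) constraint solver that real Car–Parrinello codes use.

import numpy as np

def lowdin(C):
    # Orthonormalize the rows of C; a stand-in for the holonomic constraint.
    S = C @ C.conj().T
    w, U = np.linalg.eigh(S)
    return (U @ np.diag(w ** -0.5) @ U.conj().T) @ C

def cp_step(C, Cdot, R, Rdot, dE_dC, dE_dR, mu, M, dt):
    # One velocity-Verlet step of Car-Parrinello-like equations (toy version).
    aC, aR = -dE_dC(C, R) / mu, -dE_dR(C, R) / M[:, None]
    C_new = lowdin(C + dt * Cdot + 0.5 * dt**2 * aC)   # production codes also constrain Cdot
    R_new = R + dt * Rdot + 0.5 * dt**2 * aR
    aC2, aR2 = -dE_dC(C_new, R_new) / mu, -dE_dR(C_new, R_new) / M[:, None]
    return C_new, Cdot + 0.5 * dt * (aC + aC2), R_new, Rdot + 0.5 * dt * (aR + aR2)

# Placeholder quadratic "energy" so the example runs end to end (not a Kohn-Sham energy).
dE_dC = lambda C, R: 0.5 * C
dE_dR = lambda C, R: 1.0 * R

rng = np.random.default_rng(1)
C = lowdin(rng.standard_normal((4, 200)))   # 4 states, 200 plane-wave coefficients
Cdot = np.zeros_like(C)
R = rng.standard_normal((3, 3)); Rdot = np.zeros_like(R)
M = np.full(3, 1836.0)                      # nuclear masses much larger than mu
mu, dt = 400.0, 0.1                         # fictitious mass and time step (arbitrary units)

for _ in range(100):
    C, Cdot, R, Rdot = cp_step(C, Cdot, R, Rdot, dE_dC, dE_dR, mu, M, dt)

# Monitor the adiabaticity condition: fictitious kinetic energy versus nuclear kinetic energy.
print(mu * np.sum(np.abs(Cdot)**2), np.sum(M[:, None] * Rdot**2))

The printed pair illustrates the inequality above: the coefficient kinetic energy should remain small compared with the nuclear kinetic energy for the trajectory to stay near the Born–Oppenheimer surface.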
In a typical calculation containing – nuclei and a comparable number of Kohn–Sham orbitals, the number of coefficients per orbital is often in the range –, depending on the atomic types present in the system. Although this may seem like a large number of coefficients, the number would be considerably larger without a crucial approximation applied to the external energy Eext[n]. Specifically, electrons in low-lying energy states, also known as core electrons, are eliminated in favor of an atomic pseudopotential that requires replacing the exact Eext[n] by a functional of the following form

Eext[n, ψ] ≈ ∑_{I=1}^{N} ∫ dr n(r) vl(r − RI) + Enl[ψ] ≡ Eloc[n] + Enl[ψ]

where Eloc[n] is a purely local energy term, and Enl[ψ] is an orbital-dependent functional known as the nonlocal energy given by

Enl[ψ] = ∑_s ∑_I ∑_{l=0}^{lmax−1} ∑_{m=−l}^{l} wl |ZsIlm|²

with

ZsIlm = ∫ dr Flm(r − RI) ψs(r)

Here Flm(r) is a function particular to the pseudopotential, and l and m label angular momentum channels. The value of l is summed up to a maximum lmax − 1. Typically, lmax =  or , depending on the chemical composition of the system, but higher values can be included when necessary. Despite the pseudopotential approximation, AIMD calculations in a plane-wave basis are computationally very intensive and can benefit substantially from massive parallelization. In addition, a typical value of the integration time step in a Car–Parrinello AIMD simulation is . fs, which means that – or more electronic structure calculations are needed to generate a trajectory of just – ps. Thus, the parallelization scheme must, therefore, be efficient and scale well with the number of processors. The remainder of this entry will be devoted to the discussion of the algorithm and parallelization techniques for it.

Outline of the Algorithm
The calculation of the total energy and its derivatives with respect to orbital expansion coefficients and nuclear positions consists of the following phases:
1. Phase I: Starting with the orbital coefficients in reciprocal space, the electron kinetic energy and its coefficient derivatives are evaluated using the formula

Ts = (1/2) ∑_{s=1}^{Ns} fs ∑_g |g|² |Cs(g)|²

2. Phase II: The orbital coefficients are transformed from Fourier space to real space. This operation requires Ns three-dimensional fast Fourier transforms (FFTs).
3. Phase III: The real-space coefficients are squared and summed to generate the density n(r).
4. Phase IV: The real-space density n(r) is used to evaluate the exchange-correlation functional and its functional derivatives with respect to the density. Note that Exc[n] is generally evaluated on a regular mesh, hence, the functional derivatives are replaced by ordinary derivatives at the mesh points.
5. Phase V: The density is Fourier transformed to reciprocal space, and the coefficients ñ(g) are used to evaluate the Hartree and purely local part of the pseudopotential energy using the formulas

EH = (2π/V) ∑_{g≠(0,0,0)} |ñ(g)|²/|g|²

Eloc = (1/V) ∑_{I=1}^{N} ∑_g ñ*(g) ṽl(g) e^{−ig·RI}

and their derivatives and nuclear position derivatives, where ṽl(g) is the Fourier transform of the potential vl(r).
6. Phase VI: The derivatives from Phase V are Fourier transformed to real space and combined with the derivatives from Phase IV. The combined functional
derivatives are multiplied against the real-space orbital coefficients to produce part of the orbital forces in real space.
7. Phase VII: The reciprocal-space orbital coefficients are used to evaluate the nonlocal pseudopotential energy and its derivatives.
8. Phase VIII: The forces from Phase VI are combined and are then transformed back into reciprocal space, an operation that requires Ns FFTs.
9. Phase IX: The nuclear–nuclear Coulomb repulsion energy and its position derivatives are evaluated using standard Ewald summation.
10. Phase X: The reciprocal-space forces are combined with those from Phases I and VII to yield the total orbital forces. These, together with the nuclear forces, are fed into a numerical solver in order to advance the nuclear positions and reciprocal-space orbital coefficients to the next time step, and the process returns to Phase I. As part of this phase, the condition of orthogonality

⟨ψs|ψs′⟩ = ∑_g Cs*(g) Cs′(g) = δss′

is enforced as a holonomic constraint on the dynamics.
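The reciprocal-space bookkeeping behind Phases I–V can be illustrated with a small serial NumPy sketch. It is written for this entry (it is not code from PINY_MD or OpenAtom): the box size, cutoff, mesh, and random real coefficients are arbitrary, the Γ-point symmetry and the 4Ecut density cutoff are ignored for brevity, and only the kinetic, density, and Hartree pieces are shown.

import numpy as np

L, Nmesh, Ns, fs = 10.0, 32, 4, 2.0        # hypothetical box edge, FFT mesh, states, occupation
rng = np.random.default_rng(0)

# g-vectors of the cubic box: g = 2*pi*n/L for integer triples n (FFT ordering).
n1d = np.fft.fftfreq(Nmesh, d=1.0 / Nmesh)
nx, ny, nz = np.meshgrid(n1d, n1d, n1d, indexing="ij")
g2 = (2.0 * np.pi / L) ** 2 * (nx**2 + ny**2 + nz**2)

Ecut = 10.0
inside = 0.5 * g2 < Ecut                   # spherical truncation of the orbital expansion
C = rng.standard_normal((Ns, Nmesh, Nmesh, Nmesh)) * inside   # toy coefficients C_s(g)

# Phase I: Ts = 1/2 * sum_s f_s * sum_g |g|^2 |C_s(g)|^2
Ts = 0.5 * fs * np.sum(g2 * np.abs(C) ** 2)

# Phases II-III: FFT each state to real space, square and sum to get n(r).
V = L ** 3
psi_r = np.fft.ifftn(C, axes=(1, 2, 3)) * Nmesh**3 / np.sqrt(V)  # psi_s(r) = (1/sqrt(V)) sum_g C_s(g) e^{ig.r}
n_r = fs * np.sum(np.abs(psi_r) ** 2, axis=0)

# Phase V: density back to reciprocal space, then the Hartree energy from n~(g).
n_g = np.fft.fftn(n_r) * (V / Nmesh**3)    # n(r) = (1/V) sum_g n~(g) e^{ig.r}
g2_safe = np.where(g2 == 0.0, 1.0, g2)
EH = (2.0 * np.pi / V) * np.sum(np.where(g2 == 0.0, 0.0, np.abs(n_g) ** 2 / g2_safe))

print(Ts, EH)

The same pattern, a spherically truncated coefficient array, per-state FFTs, and sums over the mesh, is what the parallel schemes below distribute over processors.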
Parallelization
Two basic strategies for parallelization of ab initio molecular dynamics will be outlined. The first is a hybrid state/grid-level parallelization scheme useful on machines with a modest number of fast processors having large memory but with slow communication. This scheme does not require a parallel fast Fourier transform (FFT) and can be coded up relatively easily starting from an optimized serial code. The second scheme is a fine-grained parallelization approach based on parallelization of all operations and is useful for massively parallel architectures. The tradeoff of this scheme is its considerable increase in coding complexity. An intermediate scheme between these builds a parallel FFT into the hybrid scheme as a means of reducing the memory and some of the communication requirements. In all such schemes, if the FFT used is a complex-to-complex FFT and the orbitals are real, then the states can be double packed into the FFT routine and transformed two at a time, which increases the overall efficiency of the algorithms to be presented below.

Hybrid scheme – Let ncoef represent the number of coefficients used to represent each Kohn–Sham orbital, and let nstate represent the number of orbitals (also called "states"). In a serial calculation, the coefficients are then stored in two arrays a(ncoef,nstate) and b(ncoef,nstate) holding the real and imaginary parts of the coefficients, respectively. Alternatively, complex data typing could be used for the coefficient array. Let nproc be the number of processors available for the calculation. In the hybrid parallel scheme, the coefficients are stored in one of two ways at each point in the calculation. The default storage mode is called "transposed" form in which the coefficient arrays are dimensioned as a(ncoef/nproc,nstate) and b(ncoef/nproc,nstate), so that each processor has a portion of the orbitals for all of the states. At certain points in the calculation, the coefficients are transformed to "normal" form in which the arrays are dimensioned as a(ncoef,nstate/nproc) and b(ncoef,nstate/nproc), so that each processor has a fraction of the orbitals but each of the orbitals is complete on the processor.
In the hybrid scheme, operations on the density, both in real and reciprocal spaces, are carried out in parallel. These terms include the Hartree and local pseudopotential energies, which are carried out on a spherical reciprocal-space grid, and the exchange-correlation energy, which is carried out on a rectangular real-space grid. The Hartree and local pseudopotential terms require arrays gdens_a(ndens/nproc) and gdens_b(ndens/nproc) that hold, on each processor, a portion of the real and imaginary reciprocal-space density coefficients ñ(g). Here, ndens is the number of reciprocal-space density coefficients. The exchange-correlation energy makes use of an array rdens(ngrid/nproc) that holds, on each processor, a portion of the real-space density n(r). Here, ngrid is the number of points on the rectangular grid.
Given this division of data over the processors, the algorithm proceeds in the following steps:
1. With the coefficients in transposed form, each processor calculates its contribution to the electronic kinetic energy and the corresponding forces on the coefficients it holds. A reduction is performed to obtain the total kinetic energy.
2. The coefficient arrays are transposed into normal form, and each processor must subsequently perform 0.5∗nstate/nproc three-dimensional serial FFTs on its subset of states to obtain the corresponding orbitals in real space. These orbitals are stored as creal(ngrid,nstate/nproc). Each processor sums the squares of these orbitals over its nstate/nproc states at each grid point.
3. Each processor performs a serial FFT on its portion of the density to obtain its contribution to the reciprocal-space density coefficients ñ(g).
4. Reduce_scatter operations are used to sum each processor's contributions to the real and reciprocal-space densities and distribute ngrid/nproc and ndens/nproc real and reciprocal-space density values on each processor.
5. Each processor calculates its contribution to the Hartree and local pseudopotential energies, Kohn–Sham potential, and nuclear forces using its reciprocal-space density coefficients, and the exchange-correlation energies and Kohn–Sham potential using its real-space density coefficients.
6. As the full Kohn–Sham potential is needed on each processor, Allgather operations are used to collect the reciprocal and real-space potentials. The reciprocal-space potential is additionally transformed to real space by a single FFT.
7. With the coefficients in normal form, the Kohn–Sham potential contributions to the coefficient forces are computed via the product VKS(r)ψs(r). Each processor computes this product for the states it has.
8. Each processor transforms its coefficient force contributions VKS(r)ψs(r) back to reciprocal space by performing 0.5∗nstate/nproc serial FFTs.
9. With the coefficients in normal form, each processor calculates its contribution to the nonlocal pseudopotential energy, coefficient forces, and nuclear forces for its states.
10. Each processor adds its contributions to the nonlocal forces to those from Steps  and  to obtain the total coefficient forces in normal form. The energies and nuclear forces are reduced over processors.
11. The coefficients and coefficient forces are transformed back to transposed form, and the equations of motion are integrated with the coefficients in transposed form.
12. In order to enforce the orthogonality constraints, various partially integrated coefficient and coefficient velocity arrays are multiplied together in transposed form on each processor. In this way, each processor has a set of Ns × Ns matrices that are reduced over processors to obtain the corresponding Ns × Ns matrices needed to obtain the Lagrange multipliers. The Lagrange multiplier matrix is broadcast to all of the processors and each processor applies this matrix on its subset of the coefficient or coefficient velocity array.
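The reduction in Step 12 is just a sum of per-processor Ns × Ns partial products. The sketch below (an illustration written for this entry, assuming mpi4py is available; the sizes and the real, Γ-point-style coefficients are hypothetical) shows the pattern with a single MPI Allreduce.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nstate, ncoef_loc = 16, 256                 # hypothetical: ncoef_loc = ncoef/nproc on this rank
C_loc = np.random.rand(ncoef_loc, nstate)   # this rank's slice of the coefficients ("transposed" form)

# Partial overlap matrix from the locally held coefficients: S_ss' = sum_g C_s(g) C_s'(g).
S_partial = C_loc.T @ C_loc                 # Ns x Ns

# Sum the partial matrices over all processors; every rank ends up with the full matrix.
S = np.empty_like(S_partial)
comm.Allreduce(S_partial, S, op=MPI.SUM)

# The Ns x Ns Lagrange-multiplier matrix derived from S (and from the analogous
# coefficient/velocity products) is then applied to the local slice with one matrix
# product per rank, e.g. C_loc = C_loc @ Lam.

Run under MPI (for example, mpirun -n 4 python overlap.py with a hypothetical file name); the same partial-product-plus-reduction pattern also serves the intermediate and fine-grained schemes described next.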
Intermediate scheme – The storage requirements of this scheme are the same as those of the hybrid approach except that the coefficient arrays are only used in their transposed form. Since all FFTs are performed in parallel, a full grid is never required, which cuts down on the memory and scratch-space requirements. The key to this scheme is having a parallel FFT capable of achieving good load balance in the presence of the spherical truncation of the reciprocal-space coefficient and density grids.
This algorithm is carried out as follows:
1. With the coefficients in transposed form, each processor calculates its contribution to the electronic kinetic energy and the corresponding forces on the coefficients it holds. A reduction is performed to obtain the total kinetic energy.
2. A set of 0.5∗nstate three-dimensional parallel FFTs is performed in order to transform the coefficients to corresponding orbitals in real space. These orbitals are stored in an array with dimensions creal(ngrid/nproc,nstate). Since each processor has a full set of states, the real-space orbitals can simply be squared so that each processor has the full density on its subset of ngrid/nproc grid points.
3. A parallel FFT is performed on the density to obtain the reciprocal-space density coefficients ñ(g), which are also divided over the processors. Each processor will have ndens/nproc coefficients.
4. Each processor calculates its contribution to the Hartree and local pseudopotential energies, Kohn–Sham potential, and nuclear forces using its reciprocal-space density coefficients. Each processor
also calculates its contribution to the exchange-correlation energies and Kohn–Sham potential using its real-space density coefficients.
5. A parallel FFT is used to transform the reciprocal-space Kohn–Sham potential to real space.
6. With the coefficients in transposed form, the Kohn–Sham potential contributions to the coefficient forces are computed via the product VKS(r)ψs(r). Each processor computes this product at the grid points it has.
7. The coefficient force contributions VKS(r)ψs(r) on each processor are transformed back to reciprocal space via a set of 0.5∗nstate three-dimensional parallel FFTs.
8. With the coefficients in transposed form, each processor calculates the nonlocal pseudopotential energy, coefficient forces, and nuclear forces using its subset of reciprocal-space grid points.
9. Each processor adds its contributions to the nonlocal forces to those from Steps  and  to obtain the total coefficient forces in transposed form. The energies and nuclear forces are reduced over processors.
10. In order to enforce the orthogonality constraints, various partially integrated coefficient and coefficient velocity arrays are multiplied together in transposed form on each processor. In this way, each processor has a set of Ns × Ns matrices that are reduced over processors to obtain the corresponding Ns × Ns matrices needed to obtain the Lagrange multipliers. The Lagrange multiplier matrix is broadcast to all of the processors and each processor applies this matrix on its subset of the coefficient or coefficient velocity array.

3D parallel FFTs and fine-grained data decomposition – Any fine-grained parallelization scheme for Car–Parrinello molecular dynamics requires a scalable three-dimensional FFT and a data decomposition strategy that allows parallel operations to scale up to large numbers of processors. Much of this is true even for the intermediate scheme discussed above.
Starting with the 3D parallel FFT, when the full set of indices of the coefficients Cs(g) are displayed, the array appears as Cs(gx, gy, gz), where the reciprocal-space points lie within a sphere of radius √(2Ecut). A common approach for transforming this array to real space is based on data transposes. First, a set of one-dimensional FFTs is computed to yield Cs(gx, gy, gz) → C̃s(gx, gy, z). Since the data is dense in real space, this operation transforms the sphere into a cylinder. Next, a transpose is performed on the z index to parallelize it and collect complete planes of gx and gy along the cylindrical axis. Once this operation is complete, the remaining two one-dimensional FFTs can be performed. The first maps the cylinder onto a rectangular slab, and the final FFT transforms the slab into a dense real-space array ψs(x, y, z).
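A minimal way to see a transpose-based distributed 3D FFT is the sketch below. It is a simplified slab (plane) decomposition over dense data written for this entry, assuming mpi4py; it does not reproduce the sphere-to-cylinder-to-slab sequence, the Charm++ objects, or the load balancing used in OpenAtom, but it shows the essential pattern: FFT the locally complete axes, perform an all-to-all transpose, then FFT the remaining axis.

from mpi4py import MPI
import numpy as np

def parallel_fft3(local, comm):
    # 3D FFT of an (NX, NY, NZ) array distributed as x-slabs of shape (NX/P, NY, NZ).
    # Returns the transform distributed as y-slabs of shape (NX, NY/P, NZ).
    P = comm.Get_size()
    nx_loc, NY, NZ = local.shape
    # 1) FFT along the two locally complete axes (y and z).
    tmp = np.fft.fftn(local, axes=(1, 2))
    # 2) All-to-all transpose: split y into P blocks, exchange, and glue x back together.
    send = np.ascontiguousarray(
        tmp.reshape(nx_loc, P, NY // P, NZ).transpose(1, 0, 2, 3))
    recv = np.empty_like(send)                      # (P, NX/P, NY/P, NZ)
    comm.Alltoall(send, recv)
    glued = recv.reshape(P * nx_loc, NY // P, NZ)   # x now complete, y distributed
    # 3) FFT along the remaining, now local, x axis.
    return np.fft.fft(glued, axis=0)

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    P, rank = comm.Get_size(), comm.Get_rank()
    NX = NY = NZ = 32                                # hypothetical mesh, divisible by P
    local = np.random.rand(NX // P, NY, NZ) + 0j     # this rank's x-slab of one state
    out = parallel_fft3(local, comm)                 # y-slab of the full 3D transform

For P = 1 the routine reduces to np.fft.fftn, which is a convenient correctness check; the communication cost is concentrated in the single Alltoall, which is exactly the step whose locality the plane-mapping discussion below tries to optimize.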
Given this strategy for the 3D FFT, the optimal data decomposition scheme is based on dividing the state and density arrays into planes both in reciprocal space and in real space. Briefly, it is useful to think of dividing the coefficient array into data objects represented as G(s, p), where p indexes the (gx, gy) planes. The planes are optimally grouped in such a way that complete lines along the gz axis are created. This grouping is important due to the spherical truncation of reciprocal space. This type of virtualization allows parallelization tools such as the Charm++ software to be employed as a way of mapping the data objects to the physical processors, as discussed by Bohm et al. in []. The granularity of the object G(s, p) is something that can be tuned for each given architecture.
Once the data decomposition is accomplished, the planes must be mapped to the physical processors. A simple map that allocates all planes of a state to the same processor, for example, would make all 3D FFT transpose operations local, resulting in good performance. However, such a mapping is not scalable because massively parallel architectures will have many more processors than states. Another extreme would map all planes of the same rank in all of the states to the same processor. Unfortunately, this causes the transposes to be highly nonlocal, and this leads to a communication bottleneck. Thus, the optimal mapping is a compromise between these two extremes, mapping collections of planes in a state partition to the physical processors, where the size of these collections depends on the number of processors and communication bandwidth. This mapping then enables the parallel computation of overlap matrices obtained by summing objects with different state indices s and s′ over reciprocal space.
This entry is focused on the parallelization of plane-wave based Car–Parrinello molecular dynamics.
However, other basis sets offer certain advantages over plane waves. These are localized real-space basis sets useful for chemical applications where the orbitals ψs(r) can be easily transformed into a maximally spatially localized form. Future work will focus on developing parallelization strategies for ab initio molecular dynamics calculations with such basis sets.

Bibliographic Notes and Further Reading
As noted in the introduction, the Car–Parrinello method was introduced in Ref. [] and is discussed in greater detail in a number of review articles [, ] and a recent book []. Further details on the algorithms presented in this entry and their implementation in the open-source package PINY_MD can be found in Refs. [, ]. A detailed discussion of the Charm++ runtime environment alluded to in the "Parallelization" section can be found in Ref. []. Finally, fine-grained algorithms leveraging the Charm++ runtime environment in the manner described in this article are described in more detail in Refs. [, ]. These massively parallel techniques have been implemented in the Charm++ based open-source ab initio molecular dynamics package OpenAtom [].

Bibliography
. Car R, Parrinello M () Unified approach for molecular dynamics and density-functional theory. Phys Rev Lett :
. Marx D, Hutter J () Ab initio molecular dynamics: theory and implementation. In: Grotendorst J (ed) Modern methods and algorithms of quantum chemistry. NIC Series, vol , Forschungszentrum Jülich, Jülich, pp –
. Tuckerman ME () Ab initio molecular dynamics: basic concepts, current trends, and novel applications. J Phys Condens Mat :R
. Marx D, Hutter J () Ab initio molecular dynamics: basic theory and advanced methods. Cambridge University Press, New York
. Tuckerman ME, Yarne DA, Samuelson SO, Hughes AL, Martyna GJ () Exploiting multiple levels of parallelism in molecular dynamics based calculations via modern techniques and software paradigms on distributed memory computers. Comp Phys Commun :
. The open-source package PINY_MD is freely available for download via the link http://homepages.nyu.edu/~mt/PINYMD/PINY.html
. Kale LV, Krishnan S () Charm++: parallel programming with message-driven objects. In: Wilson GV, Lu P (eds) Parallel programming using C++. MIT, Cambridge, pp –
. Vadali RV, Shi Y, Kumar S, Kale LV, Tuckerman ME, Martyna GJ () Scalable fine-grained parallelization of plane-wave-based ab initio molecular dynamics for large supercomputers. J Comp Chem :
. Bohm E, Bhatele A, Kale LV, Tuckerman ME, Kumar S, Gunnels JA, Martyna GJ () Fine-grained parallelization of the Car–Parrinello ab initio molecular dynamics method on the IBM Blue Gene/L supercomputer. IBM J Res Dev :
. OpenAtom is freely available for download via the link http://charm.cs.uiuc.edu/OpenAtom

CDC 
Control Data 

Cedar Multiprocessor
Pen-Chung Yew
University of Minnesota at Twin-Cities, Minneapolis, MN, USA

Definition
The Cedar multiprocessor was designed and built at the Center for Supercomputing Research and Development (CSRD) in the University of Illinois at Urbana-Champaign (UIUC) in the 1980s. The project brought together a group of researchers in computer architecture, parallelizing compilers, parallel algorithms/applications, and operating systems to develop a scalable, hierarchical, shared-memory multiprocessor. It was the largest machine-building effort in academia since ILLIAC-IV. The machine became operational in  and decommissioned in . Some pieces of the boards are still displayed in the Department of Computer Science at UIUC.

Discussion
Introduction
The Cedar multiprocessor was the first scalable, cluster-based, hierarchical shared-memory multiprocessor of its kind in the 1980s. It was designed and built at the
Center for Supercomputing Research and Development (CSRD) in the University of Illinois at Urbana-Champaign (UIUC). The project succeeded in building a complete scalable shared-memory multiprocessor system with a working -cluster (-processor) hardware prototype [], a parallelizing compiler for Cedar Fortran [, , ], and an operating system, called Xylem, for scalable multiprocessors []. Real application programs were ported and run on Cedar [, ] with performance studies presented in []. The Cedar project was started in  and the prototype became functional in .
Cedar had many features that were later used extensively in large-scale multiprocessor systems, such as software-managed cache memory to avoid very expensive cache coherence hardware support; vector data prefetching to cluster memories for hiding long memory latency; parallelizing compiler techniques that take sequential applications and extract task-level parallelism from their loop structures; language extensions that include memory attributes of the data variables to allow programmers and compilers to manage data locality more easily; and parallel dense/sparse matrix algorithms that could speed up the most time-consuming part of many linear systems in large-scale scientific applications.

Machine Organization of Cedar
The organization of Cedar consists of multiple clusters of processors connected through two high-bandwidth single-directional global interconnection networks (GINs) to a globally shared memory system (GSM) (see Fig. ). One GIN carries memory requests from the clusters to the GSM. The other GIN carries data and responses from the GSM back to the clusters. In the Cedar prototype built at CSRD, an Alliant FX/ system was used as a cluster (see Fig. ). Some hardware components in the FX/, such as the crossbar switch and the interface between the shared cache and the crossbar switch, were modified to accommodate additional ports for the global network interface (GNI). The GNI provides pathways from a cluster to the GIN. Each GNI board also has a software-controlled vector prefetching unit (VPU) that is very similar to the DMA engines in IBM's later Cell Broadband Engine.

Global Global Global


memory memory memory
SP SP SP

Stage 2
8  8 Switch 8  8 Switch 8  8 Switch

8  8 Switch 8  8 Switch Stage 1 8  8 Switch

Cluster 0 Cluster 1 Cluster 3

SP. Synchronization processor

Cedar Multiprocessor. Fig.  Cedar machine organization


Cedar Cluster: In each Alliant system, there are eight processors, called computational elements (CEs). Those eight CEs are connected to a four-way interleaved shared cache through an  ×  crossbar switch, and four ports to the GNI provide access to the GIN and GSM (see Fig. ). On the other side of the shared cache is a high-speed shared bus that is connected to multiple cluster memory modules and interactive processors (IPs). IPs handle input/output and network functions.

Cedar Multiprocessor. Fig.  Cedar cluster (CEs connected through a cluster switch to a four-way interleaved cache and the global interface, with a memory bus to the cluster memory modules and I/O subsystem, and a concurrency control bus linking the CEs)

Each CE is a pipelined implementation of the Motorola  instruction set architecture, one of the most popular high-performance microprocessors in the 1980s. The  ISA was augmented with vector instructions. The vector unit includes both -bit floating-point and -bit integer operations. It also has eight -word vector registers. Each vector instruction could have one memory- and one register-operand to balance the use of registers and the requirement of memory bandwidth. The clock cycle time was  ns, but each CE had a peak performance of . Mflops for a multiply-and-add -bit floating-point vector instruction. It was a very high-performance microprocessor at that time, and was implemented on one printed-circuit board.
There is a concurrency control bus (CCB) that connects all eight CEs to provide synchronization and coordination of all CEs. Concurrency control instructions include fast fork, join, and fetch-and-increment types of synchronization operations. For example, a single concurrent start instruction broadcast on the CCB will spread the iterations and initiate the execution of a concurrent loop among all eight CEs simultaneously. Loop iterations are self-scheduled among CEs by fetch-and-incrementing a loop counter shared by all CEs. The CCB also supports a cascade synchronization that enforces an ordered execution among CEs in the same cluster for a concurrent loop that requires a sequential ordering for a particular portion of the loop execution.
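The effect of this fetch-and-increment self-scheduling can be sketched in ordinary shared-memory code. The Python below is an illustration written for this entry, not Cedar Fortran or actual CCB semantics: every worker repeatedly grabs the next iteration number from a shared counter until the loop is exhausted.

import threading

N = 1000                       # loop trip count
counter = 0                    # shared loop counter, as held on the CCB
lock = threading.Lock()
result = [0.0] * N

def fetch_and_increment():
    # Atomically return the current counter value and advance it.
    global counter
    with lock:
        i = counter
        counter += 1
    return i

def worker():
    while True:
        i = fetch_and_increment()
        if i >= N:             # no iterations left
            return
        result[i] = i * 2.0    # the body of the concurrent loop

threads = [threading.Thread(target=worker) for _ in range(8)]   # eight CEs per cluster
for t in threads: t.start()
for t in threads: t.join()

Because each worker takes whichever iteration is next, the schedule adapts automatically to iterations of unequal cost, which is the point of self-scheduling on the CCB.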
Memory Hierarchy: The  GB physical memory address space of Cedar is divided into two equal halves between the cluster memory and the GSM. There are  MB of GSM and  MB of cluster memory in each cluster on the Cedar prototype. It also supports a virtual memory system with a  KB page size. GSM could be directly addressed and shared by all clusters, but cluster memory is only addressable by the CEs within each cluster. Data coherence among multiple copies of data in different cluster memories is maintained explicitly through software by either the programmer or the compiler. The peak GSM bandwidth is  MB/s, and  MB/s per CE. The GSM is double-word interleaved, and aligned among all global memory modules. There is a synchronization processor in each GSM module that could execute each atomic Cedar synchronization instruction issued from a CE and staged at the GNI.
Data Prefetching: It was observed early in the Cedar design phase that scientific application programs have very poor cache locality due to their vector operations. To compensate for the long access latency to GSM and to overcome the limitation of two outstanding memory requests per CE in the Alliant microarchitectural design, vector data prefetching is needed. The large GIN bandwidth is ideal for supporting such data prefetching.
To avoid major changes to Alliant's instruction set architecture (ISA) by introducing additional data prefetching instructions into its ISA, and also to avoid major alterations to the CE's control and data paths that would be needed to include a data prefetching unit (PFU) inside the CE, it was decided to build the PFU on the GIN board. Prefetched data is stored in a -word (each word is -byte) prefetch buffer inside the PFU to avoid polluting the shared data cache and to allow data reuse. A CE will stage a prefetch operation at the PFU by providing it with
the length, the stride, and the mask of the vector data to be prefetched. The prefetching operation could then be fired by providing the physical address of the first word of the vector. Prefetching could be overlapped with other computations or cluster memory operations.
When a page boundary is crossed during a data prefetching operation, the PFU will be suspended until the CE provides the starting address of the new page. In the absence of page crossing, the PFU will prefetch  words without pausing into its prefetch buffer. Due to a hardware limitation, only one prefetch buffer is implemented in each PFU. The prefetch buffer will thus be invalidated by another prefetch operation. Prefetched data could return from GSM out of order because of potential network and memory contention. A presence bit per data word in the prefetch buffer allows the CE both to access the data without waiting for the completion of the prefetch instruction, and to access the prefetched data in the same order as requested. It was later proved from experiments that the PFU is extremely useful in improving memory performance for scientific applications.
Data prefetching was later incorporated extensively into high-performance microprocessors that prefetch data from main memory into the last-level cache memory. Sophisticated compiler techniques that could insert prefetching instructions into user programs at compile time or runtime have been developed and used extensively. Cache performance has shown significant improvement with the help of data prefetching.
Global Interconnection Network (GIN): The Cedar network is a multi-stage shuffle-exchange network as shown in Fig. . It is constructed with  ×  crossbar switches with -bit wide data paths and some control signals for network flow control. As there are only  processors on the Cedar prototype, only two stages are needed for each direction of the GIN between clusters and GSM, i.e., it takes only two clock cycles to go from a GNI to one of the  GSM modules if there is no network contention. Hence, it provides a very low latency and high-bandwidth communication path to GSM from a CE. There is one GIN in each direction between clusters and GSM as mentioned above. To cut down possible signal noise and to maintain high reliability, all data signals between stages are implemented using differential signals. The GIN is packet-switched and self-routed. Routing is based on a tag-controlled scheme proposed in [] for multi-stage shuffle-exchange networks. There is a unique path between each pair of GIN input and output ports. A two-word queue is provided at each input and output port of the  ×  crossbar switch. Flow control between network stages is implemented to prevent buffer overflow and maintain low network contention. The total network bandwidth is  MB/s, and  MB/s per network port to match the bandwidth of GSM and CE. Hence, the requests and data flow from a CE to a GSM module and back to the CE. This forms a circular pipeline with balanced bandwidth that could support high-performance vector operations.
Memory-Based Synchronization: Given the packet-switched multi-stage interconnection networks in Cedar, instead of a shared bus, it is very difficult to implement atomic (indivisible) synchronization operations such as a test-and-set or a fetch-and-add operation efficiently on the system interconnect. Lock and unlock operations are two low-level synchronization operations that will require multiple passes through the GINs and GSM to implement a higher-level synchronization such as fetch-and-add, a very frequent synchronization operation in parallel applications that could be used to implement barrier synchronizations and self-scheduling for parallel loops.
Cedar implements a sophisticated set of synchronization operations []. They are very useful in supporting parallel loop execution that requires frequent fine-grained data synchronization between loop iterations that have cross-iteration data dependences, so-called doacross loops []. Cedar synchronization instructions are basically test-and-operate operations, where test is any relational operation on -bit data (e.g., ≤) and operate could be a read, write, add, subtract, or logical operation on -bit data. These Cedar synchronization operations are also staged and controlled in a GNI board by each CE. There is also a synchronization unit at each GSM module that executes these atomic operations at GSM. It is a very efficient and effective way of implementing atomic global synchronization operations right at GSM.
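A software model of such a test-and-operate primitive, and of its use for the cross-iteration synchronization in a doacross loop, might look like the sketch below. This is illustrative Python written for this entry, not the Cedar ISA: a real Cedar operation performs the test and the update of a synchronization variable in one atomic memory transaction at the GSM.

import threading

class SyncVar:
    # Toy memory-based synchronization variable: atomically test, then operate.
    def __init__(self, value=0):
        self.value = value
        self.cond = threading.Condition()

    def test_and_op(self, test, op):
        # Block until test(value) holds, then apply op and return the old value.
        with self.cond:
            self.cond.wait_for(lambda: test(self.value))
            old = self.value
            self.value = op(self.value)
            self.cond.notify_all()
            return old

# Doacross-style use: iteration i waits until iteration i-1 has advanced the key,
# then performs its own dependent update and advances the key in turn.
key = SyncVar(0)
out = [0] * 8

def iteration(i):
    key.test_and_op(lambda v: v >= i, lambda v: v)      # wait until key >= i
    out[i] = (out[i - 1] + 1) if i else 1                # cross-iteration dependence
    key.test_and_op(lambda v: True, lambda v: v + 1)     # advance the key

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
print(out)   # [1, 2, 3, 4, 5, 6, 7, 8]

Executing the test and the operate in one round trip to the GSM is what lets Cedar support this kind of fine-grained doacross synchronization without multiple network passes per event.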
Hardware Performance Monitoring System: It was determined at the Cedar design time that performance tuning and monitoring on a large-scale multiprocessor requires extensive support starting at the lowest hardware level. Important and critical system signals in all major system components, including the GIN
and GSM, must be made observable. To minimize the amount of hardware changes needed on the Alliant clusters, it was decided to use external monitoring hardware to collect time-stamped event traces and histograms of various hardware signals. This allows the hardware performance monitoring system to evolve over time without having to make major hardware changes on the Cedar prototype. The Cedar hardware event tracers can each collect M events, and the histogrammers have K -bit counters. These hardware tracers could be cascaded to capture more events. The triggering and stopping signals or events could come from special library calls in application programs, or from hardware signals such as a GSM request indicating a shared cache miss. Software tools are built to support starting and stopping of tracers, and off-loading data from the tracers and counters for extensive performance analysis.

Xylem Operating System
The Xylem operating system [] provides support for parallel execution of application programs on Cedar. It is based on the notion that a parallel user program is a flow graph of executable nodes. New system calls are added to allow Unix processes to create and control multiple tasks. It basically links the four separate operating systems in the Alliant clusters into one Cedar operating system. Xylem provides virtual memory, scheduling, and file system services for Cedar.
The Xylem scheduler schedules those tasks instead of the Unix scheduler. A Xylem task corresponds to one or more executions of each node in the program flow-graph. Support of multiprocessing in Xylem basically includes create_task, delete_task, start_task, end_task, and wait_task. The reason for separating create_task and start_task is that task creation is a very expensive operation, while starting a task is a relatively faster operation. Separating these two operations allows more efficient management of tasks; the same holds for separating the delete_task and end_task operations. For example, helper tasks could be created in the beginning of a process and then later started when parallel loops are being executed. These helper tasks could be put back to the idle state without being deleted and recreated later.
There is no parent–child relationship between the creator task and the created task, though. Either task could wait for, start, or delete the other task. A task could delete itself if it is the last task in the process. End_task marks the task idle and stops execution. Wait_task(tasknum) blocks the execution of the calling task until the task specified by tasknum enters an idle state. A task enters an idle state when it calls end_task. Hence, the waiting task will be unblocked when the task it is waiting for (identified by tasknum) ends execution.
Memory management of Xylem is based on a paging system. It has the notion of global memory pages and cluster memory pages, and provides kernels to allow a task to control its own memory environment or that of any other task in the process. They include kernels to allocate and de-allocate pages for any task in the process; change the attributes of pages; make copies and share pages with any other task in the process; and unmap pages from any task in the process. The attributes of a page include execute/no-execute, read-write/no-read-write, local/global/global-cached, and shared/private-copy/private-new/private.

Programming Cedar
A parallel program can be written using Cedar Fortran, a parallel dialect of the Fortran language, and Fortran  on Cedar. Cedar Fortran supports all key features of the Cedar system described above. Those Cedar features include its memory hierarchy, the data prefetching capability from GSM, the Cedar synchronization operations, and the concurrency control features on the CCB. These features are supported through a language extension to Fortran . Programs written in Fortran  could be restructured by a parallelizing restructurer and translated into parallel programs in Cedar Fortran. They are then fed into a backend compiler, mostly an enhanced and modified Alliant compiler, to produce Cedar executables. The parallelizing restructurer was based on the KAP restructurer [], a product of Kuck and Associates (KAI).
Cedar Fortran has many features common to the ANSI Technical Committee X3H5 standard for parallel Fortran, whose basis was PCF Fortran developed by the Parallel Computing Forum (PCF). The main features encompass parallel loops, vector operations that include vector reduction operations and a WHERE statement that could mask vector assignments as in Fortran-, declaration of visibility (or accessibility) of data, and post/wait synchronization.
Several variations of parallel loops that take into account the loop structure and the Cedar organization
are included in Cedar Fortran. There are basically two types of parallel loops: doall loops and doacross loops. A doacross loop is an ordered parallel loop whose loop iterations start sequentially as in its original sequential order. It uses cascade synchronization on the CCB to enforce such sequential order on parts of its loop body (see Fig. ). A doall loop is an unordered parallel loop that enforces no order among the iterations of its loop. However, a barrier synchronization is needed at the end of the loop.
The syntactic form of all variations of these two parallel loops is shown in Fig. .
There are three variations of the parallel loops: cluster loops, spread loops, and cross-cluster loops, denoted with prefixes C, S, X, respectively, in the concurrent loop syntax shown in Fig. . CDOALL and CDOACROSS loops require all CEs in a cluster to join in the execution of the parallel loops. SDOALL and SDOACROSS loops cause a single CE from each cluster to join the execution of the parallel loops. It is not necessarily the best way to execute a parallel loop, but if each loop iteration has a large working set that could fill the cluster memory, such a scheduling will be very effective. Another common situation is to have a CDOALL loop nested inside an SDOALL loop. It could engage all CEs in a cluster. An XDOALL loop will require all CEs in all clusters to execute the loop body.
Local declarations of data variables in a CDOALL and XDOALL will be visible only to a single CE, while visible to all CEs in a single cluster if in an SDOALL. The statements in the preamble of a loop are executed only once by each CE when it first joins the execution of the loop and prior to the execution of the loop body. The statements in the postamble of a loop are executed only once after all CEs complete the execution of the loop. The postamble is only available in SDOALL and XDOALL.
By default, data declared outside of a loop in a Cedar Fortran program are visible to all CEs in a single cluster. However, Cedar Fortran provides statements to explicitly declare variables outside of a loop to be visible to all CEs in all clusters (see Fig. ).
The GLOBAL and PROCESS COMMON statements in Fig.  declare that the data are visible to all CEs in all clusters. A single copy of the data exists in global memory. All CEs in all clusters could access the data, but it is the programmer's responsibility to maintain coherence if multiple copies of the data are kept in separate cluster memories. The CLUSTER and COMMON statements declare that the data are visible to all CEs inside a single cluster. A separate copy of the data is kept in the cluster memory of each cluster that participates in the execution of the program.
Implementation of Cedar Fortran on Cedar: All parallel loops in Cedar Fortran are self-scheduled. Iterations in a CDOALL or CDOACROSS loop are dispatched by the CCB inside each cluster. The CCB also provides cascade synchronization for a CDOACROSS loop through wait and advance calls in the loop as shown in Fig. . The execution of SDOALL and XDOALL loops is supported by the Cedar Fortran runtime library. The library starts a requested number of helper tasks by calling Xylem kernels in the beginning of the program execution. They remain idle until an SDOALL or XDOALL starts. The helper tasks begin to compete for loop iterations using self-scheduling.

{C/S/X}{DOALL/DOACROSS} index = start, end [,increment]
  [local declarations of data variables]
  [Preamble/Loop]
  Loop Body
  [Endloop/Postamble] (only SDO or XDO)
END {C/S/X}{DOALL/DOACROSS}

Cedar Multiprocessor. Fig.  Concurrent loop syntax

CDOACROSS j = 1, m
  A(j) = B(j) + C(j)
  call wait (1,1)
  D(j) = E(j) + D(j-1)
  call advance (1)
END DOACROSS

Cedar Multiprocessor. Fig.  An example DOACROSS loop with cascade synchronization

GLOBAL var [,var]
CLUSTER var [,var]
PROCESS COMMON /name/ var [,var]
COMMON /name/ var [,var]

Cedar Multiprocessor. Fig.  Variable declaration statements in Cedar Fortran

Subroutine-level tasking is also supported in the Cedar Fortran runtime library. It allows a new thread of execution to be started for running a subroutine. In the
Cedar Multiprocessor C 

mean time, the main thread of execution continues fol- also data locality is enhanced through advance tech-
lowing the subroutine call. The thread execution ends niques such as strip-mining, data globalization, and
when the subroutine returns. The new thread of exe- privatization [, ].
cution could be through one of the idle helper tasks or Performance measurements have been done exten-
through the creation of a new task. sively on Cedar [, , ]. Given the complexity of C
It also supports vector prefetching by generat- Cedar architecture, parallelizing restructurer, OS, and
ing prefetch instructions before a vector register load the programs themselves, it is very difficult to isolate the
instruction. It could reduce the latency and overhead individual effects at all levels that influence the final
of loading vector data from GSM. The Cedar synchro- performance of each program. The performance results
nization instructions are used primarily in the runtime shown in Fig.  are the most thorough measurements
library. They have been proven to be useful in control- presented in []. A suite of scientific application bench-
ling loop self-scheduling. They are also available to a mark programs called Perfect Benchmark [] were used
Fortran programmer through runtime library routines. to measure Cedar performance. It was a collection
Parallelizing Restructurer: There was a huge volume of Fortran programs that span a wide spectrum of
of research work on parallelizing scientific application scientific and engineering applications from quantum
programs written in Fortran before Cedar was built []. chromodynamics (QCD) to analog circuit simulation
Hence, there were many sophisticated program analysis (SPICE).
and parallelization techniques available through a very The table in Fig.  lists the performance improve-
advanced parallelizing restructurer based on the KAP ment over the serial execution time of each individual
restructurer for Cedar [, ]. The parallelizing restruc- program. The second column shows the performance
turer could convert a sequential program written in improvement using KAP-based parallelizing restruc-
FORTRAN  into a parallel program in Cedar For- turer. The results show that despite the most advance
tran that takes advantage of all unique features in Cedar. parallelizing restructurer at that time, the performance
Through this conversion and restructuring process, not improvement overall is still quite limited. Hence, each
only loop-level parallelism is exposed and extracted, but benchmark program was analyzed and parallelized

Program Complied by Auto, W/o Cedar W/o prefetch MFLOPS


Kap/Cedar transforms Synchronization time
time (Improvement) time (Improvement) time (% slowdown) (% slowdown) (YMP-8/Cedar)
ADM 689 (1.2) 73 (10.8) 81 (11%) 83 (2%) 6.9 (3.4)
ARC2D 218 (13.5) 141 (20.8) 141 (0%) 157 (11%) 13.1 (34.2)
BDNA 502 (1.9) 111 (8.7) 118 (6%) 122 (3%) 8.2 (18.4)
DYFESM 167 (3.9) 60 (11.0) 67 (12%) 100 (49%) 9.2 (6.5)
FLO52 100 (9.0) 63 (14.3) 64 (1%) 79 (23%) 8.7 (37.8)
MDG 3200 (1.3) 182 (22.7) 202 (11%) 202 (0%) 18.9 (1.1)
MG3Da 7929 (1.5) 348 (35.2) 346 (0%) 350 (1%) 31.7 (3.6)
OCEAN 2158 (1.4) 148 (19.8) 174 (18%) 187 (7%) 11.2 (7.4)
QCD 369 (1.1) 239 (1.8) 239 (0%) 246 (3%) 1.1 (11.8)
SPEC77 973 (2.4) 156 (15.2) 156 (0%) 165 (6%) 11.9 (4.8)
SPICE 95.1 (1.02) NA NA NA 0.5 (11.4)
TRACK 126 (1.1) 26 (5.3) 28 (8%) 28 (0%) 3.1 (2.7)
TRFD 273 (3.2) 21 (41.1) 21 (0%) 21 (0%) 20.5 (2.8)

aThis version of MG3D includes the elimination of file I/O.

Cedar Multiprocessor. Fig.  Cedar performance improvement for Perfect Benchmarks []
 C CELL

manually. The techniques are limited to those that could . Gallivan KA, Plemmons RJ, Sameh AH () Parallel algorithms
be implemented in an automated parallelizing restruc- for dense linear algebra computations. SIAM Rev ():–
. Guzzi MD, Padua DA, Hoeflinger JP, Lawrie DH () Cedar
turer. The third column under an automatable transfor-
Fortran and other vector and parallel Fortran dialects. J Super-
mation shows the performance improvement that could
comput ():–
be achieved if those programs are restructured by a . Kuck D et al () The cedar system and an initial performance
more intelligent parallelizing restructurer. study. In: Proceedings of international symposium on computer
It is quite clear that there was still a lot of potential in architecture, San Diego, CA, pp –
improving the performance of parallelizing restructurer . Kuck & Associates, Inc () KAP User’s Guide. Champaign
Illinois
because manually parallelized programs show substan-
. Lawrie DH () Access and alignment of data in an array
tial improvement in overall performance. However, processor. IEEE Trans Comput C-():–, Dec 
through another decade of research in more advanced . Midkiff S, Padua DA () Compiler algorithms for synchroniza-
techniques of parallelizing restructurer since the publi- tion. IEEE Trans C-():–
cation of [–, ], the consensus seems to be pointing . Padua DA, Wolfe MJ () Advanced compiler optimizations for
supercomputers. Commun ACM ():–
to a direction that programmers must be given more
. Zhu CQ, Yew PC () A scheme to enforce data dependence
tools and control to parallelize and write their own par- on large on large multiprocessor system. IEEE Trans Softw Eng
allel code instead of relying totally on a parallelizing SE-():–, June 
restructurer to convert a sequential version of their code
automatically into a parallel form, and expect to have a
performance improvement equals to that of the parallel
code implemented by the programmers themselves. CELL
The fourth column in the table of Fig.  shows the
performance improvement using Cedar synchroniza- Cell Broadband Engine Processor
tion instructions. The fifth column shows the impact
of vector prefetching on overall performance. The
improvement from vector prefetching was not as sig-
nificant as those obtained in later studies in other lit- Cell Broadband Engine Processor
eratures because the Cedar backend compiler was not
using sophisticated algorithms to place those prefetch- H. Peter Hofstee
ing instructions, and the number of prefetching buffers IBM Austin Research Laboratory, Austin, TX, USA
is too small.

Synonyms
CELL; Cell processor; Cell/B.E.
Bibliography
. Berry M et al () The perfect club benchmarks: effective
performance evaluation of supercomputers. Int J Supercomput Definition
Appl ():– The Cell Broadband Engine is a processor that conforms
. Eigenmann R et al () Restructuring Fortran Programs for to the Cell Broadband Engine Architecture (CBEA).
Cedar. In: Proceedings of ICPP’, vol , pp – The CBEA is a heterogeneous architecture defined
. Eigenmann R et al () The Cedar Fortran Project. CSRD
jointly by Sony, Toshiba, and IBM that extends the
Report No. , University of Illinois at Urbana-Champaign
. Eigenmann R, Hoeflinger J, Li Z, Padua DA () Experience
Power architecture with “Memory flow control” and
in the automatic parallelization of four perfect-benchmark pro- “Synergistic processor units.”
grams. In: Proceedings for the fourth workshop on languages CBEA compliant processors are used in a variety
and compilers for parallel computing, Santa Clara, CA, pp –, of systems, most notably the PlayStation  (now PS)
August  from Sony Computer Entertainment, the IBM QS,
. Emrath P et al () The xylem operating system. In: Proceedings
QS, and QS server blades, Regza-Cell Televisions
of ICPP’, vol , pp –
. Gallivan K et al () Preliminary performance analysis of the from Toshiba, single processor PCI-express accelerator
cedar multiprocessor memory system. In: Proceedings of  boards, rackmount servers, and a variety of custom sys-
ICPP, vol , pp – tems including the Roadrunner supercomputer at Los
Cell Broadband Engine Processor C 

Alamos National Laboratory that was the first to achieve


petaflop level performance on the Linpack benchmark
and was ranked as the world’s # supercomputer from Synergistic Load/store Local
processor store
June  to November . unit memory
Instr. fetch C
Discussion
Note: In what follows a system or chip that has mul- put/get (command) put/get (data)
tiple differing cores sharing memory is referred to as
heterogeneous and a system that contains multiple dif-
Memory
fering computational elements but that does not provide flow
shared memory between these elements is referred to as control
hybrid.

Cell Broadband Engine On-chip coherent bus


The Cell Broadband Engine Architecture (CBEA)
Cell Broadband Engine Processor. Fig.  Internal
defines a heterogeneous architecture for shared-memory
organization of the SPE
multi-core processors and systems. Heterogeneity allows
a higher degree of efficiency and/or improved per-
formance compared to conventional homogeneous
shared-memory multi-core processors by allowing results are produced in registers, and must be staged
cores to gain efficiency through specialization. through the local store on its way to main memory. The
The CBEA extends the Power architecture and pro- motivation for this organization is that while a large
cessors that conform to the CBEA architecture are also enough register file allows a sufficient number of opera-
fully Power architecture compliant. Besides the Power tions to be simultaneously executing to effectively hide
cores CBEA adds a new type of core: the Synergistic Pro- latencies to a store closely coupled to the core, a much
cessor Element (SPE). The SPE derives its efficiency and larger store or buffer is required to effectively hide the
performance from the following key attributes: latencies to main memory that, taking multi-core arbi-
tration into account, can approach a thousand cycles.
– A per-SPE local store for code and data
In all the current implementations of the SPE the local
– Asynchronous transfers between the local store and
store is  kB which is an order of magnitude smaller
global shared memory
than the size of a typical per-core on-chip cache for
– A single-mode architecture
similarly performing processors. SPEs can be effective
– A large register file
with an on-chip store that is this small because data
– A SIMD-only instruction set architecture
can be packed as it is transferred to the local store and,
– Instructions to improve or avoid branches
because the SPE is organized to maximally utilize the
– Mechanisms for fast communication and synchro-
available memory bandwidth, data in the local store is
nization
allowed to be replaced at a higher rate than is typical for
Whereas a (traditional) CISC processor defines instruc- a hardware cache.
tions that transform main memory locations directly In order to optimize main memory bandwidth, a
and RISC processors transform only data in registers sufficient number of main memory accesses (put and
and therefore must first load operands into registers, the get) must be simultaneously executable and data access
SPEs stage code and data from main memory to the must be nonspeculative. The latter is achieved by having
local store, and the SPEs RISC core called the Syner- software issue put and get commands rather than hav-
gistic Processor Unit (SPU) (Fig. ). The SPU operates ing hardware cache pre-fetch or speculation responsible
on local store the way a conventional RISC processor for bringing multiple sets of data on to the chip simul-
operates on memory, i.e., by loading data from the local taneously. To allow maximum flexibility in how put and
store into registers before it is transformed. Similarly, get commands are executed, without adding hardware
 C Cell Broadband Engine Processor

complexity, put and get semantics defines these opera- allows for a rich set of operations to be encoded in a
tions as asynchronous to the execution of the SPU. The -bit instruction, including some performance critical
unit responsible for executing these commands is the operations that specify three sources and an indepen-
Memory Flow Control unit (MFC). The MFC supports dent target including select (conditional assignment)
three mechanisms to issue commands. and multiply-add.
Any unit in the system with the appropriate mem- The abovementioned select instruction can quite
ory access privileges can issue an MFC command by often be used to design branchless routines. A second
writing to or reading from memory-mapped command architectural mechanism that is provided to limit
registers. branch penalties is a branch hint instruction that pro-
The SPU can issue MFC commands by reading or vides advance notice that an upcoming branch is pre-
writing to a set of channels that provide a direct inter- dicted taken and also specifies its target so that the
face to the MFC for its associated SPU. code can be pre-fetched by the hardware and the branch
Finally, put-list and get-list commands instruct the penalty avoided.
MFC to execute a list of put and get commands from the The Cell Broadband Engine and PowerXCelli pro-
local store. This can be particularly effective if a substan- cessors combine eight SPEs, a Power core and high-
tial amount of noncontiguous data needs to be gathered bandwidth memory controllers, and a configurable
into the local store or distributed back to main memory. off-chip coherence fabric onto a single chip. On-chip the
put and get commands are issued with a tag that cores and controllers are interconnected with a high-
allows groups of these commands to be associated in a bandwidth coherent fabric. While physically organized
tag group. Synchronization between the MFC and the as a set of ring buses for data transfers, the intent of
SPU or the rest of the system is achieved by checking this interconnect fabric is to allow all but those pro-
on the completion status of a tag group or set of tag grams most finely tuned for performance to ignore its
groups. The SPU can avoid busy waiting by checking bandwidth or connectivity limitations and only con-
on the completion status with a blocking MFC chan- sider bandwidth to memory and I/O, and bandwidth in
nel read operation. Put and get commands adhere to the and out of each unit (Fig. ).
Power addressing model for main memory and specify
effective addresses for main (shared) memory that are Cell B.E.-Based Systems
translated to real addresses according to the page and Cell Broadband Engine processors have been used in
segment tables maintained by the operating system. a wide variety of systems. Each system is reviewed
While the local store, like the register file, is considered briefly.
private, access to shared memory follows the normal
. Sony PlayStation . This is perhaps the best-known
Power architecture coherence rules.
use of the Cell B.E. processor. In PlayStation 
The design goals of supporting both a relatively large
local store and high single-thread performance imply
that bringing data from the local store to a register file Off-chip
is a multi-cycle operation (six cycles in current SPE Power I
implementations). In order to support efficient execu- SPE SPE SPE SPE O
core
tion for programs that randomly access data in the local and
store a typical loop may be unrolled four to eight times. C
Element interconnect bus O
Supporting this without creating a lot of register spills (on-chip coherent bus) H
requires a large register file. Thus the SPU was archi- E
R
tected to provide a -entry general-purpose register XDR E
file. Further efficiency is gained by using a single reg- N
Memory SPE SPE SPE SPE
T
ister file for all data types including bit, byte, half-word CTRL
bus
and word unsigned integers, and word and double-word
floating point. All of these data types are SIMD with a Cell Broadband Engine Processor. Fig.  Organization of
width that equates to  bits. The unified register file the Cell Broadband Engine and PowerXCelli
Cell Broadband Engine Processor C 

the Cell B.E. processor is configured with a high- bridge and network switching functions. This
bandwidth I/O interface to connect to the RSX configuration allows the system to achieve the very
graphics processor. A second I/O interface con- low communication latencies that are critical to
nects the Cell B.E. to a SouthBridge which provides Quantum Chromo Dynamics calculations. QPACE
the connectivity to optical BluRay, HDD, and other is a watercooled system, and the efficiency of water- C
storage and provides network connectivity. In the cooling in combination with the efficiency of the
PlayStation  application, seven of the eight syner- PowerXCelli processor made this system top the
gistic processors are used. This benefits the manu- green list (green.org) in November 
facturing efficiency of the processor. (Figs. –).
. Numerous PlayStation –based clusters. These range
In addition to the systems discussed above, the Cell
from virtual grids of PlayStations such as the grid
B.E. and PowerXCelli processors have been used in
used for the “Folding at Home” application that first
Televisions (Toshiba Regza-Cell), dual-Cell-processor
delivered Petascale performance, to numerous clus-
U servers, experimental blade servers that combine
ters of PlayStations running the Linux operating sys-
FPGAs and Cell processors, and a variety of custom
tem that are connected with Ethernet switches used
systems used as subsystems in medical systems and in
for applications from astronomy to cryptography.
aerospace and defense applications.
. IBM QS, QS, and QS server blades. In these
blades two Cell processors are used connected with
a high-bandwidth coherent interface. Each of the Programming Cell
Cell processors is also connected to a bridge chip The CBEA reflects the view that aspects of pro-
that provides an interface to PCI-express, Ethernet, grams that are critical to their performance should be
as well as other interfaces. The QS uses a ver- brought under software control for best performance
sion of the nm Cell processor with added double- and efficiency. Critical to application performance and
precision floating-point capability that also supports efficiency are:
larger (DDR) memory capacities.
– Thread concurrency (i.e., ability to run parts of the
. Cell-based PCI-express cards (multiple vendors).
code simultaneously on shared data)
These PCI-express cards have a single Cell B.E.
– Data-level concurrency (i.e., ability to apply opera-
or PowerXCelli processor and PCI-express bridge.
tions to multiple data simultaneously)
The cards are intended for use as accelerators in
– Data locality and predictability (i.e., ability to pre-
workstations.
dict what data will be needed next)
. The “Roadrunner” supercomputer at Los Alamos.
– Control flow predictability (i.e., ability to predict
This supercomputer consists of a cluster of more
what code will be needed next)
than , Dual Dual-Core Opteron-based IBM
server blades clustered together with an InfiniBand The threading model for the Cell processor is similar
network. Each Opteron blade is PCI-express con- to that of a conventional multi-core processor in that
nected to two QS server blades, i.e., one PowerX- main memory is coherently shared between SPEs and
Celli processor per Opteron Core on each blade. Power cores.
This supercomputer, installed at Los Alamos in The effective addresses used by the application
, was the first to deliver a sustained Petaflop on threads to reference shared memory are translated on
the Linpack benchmark used to rank supercomput- the SPEs in the same way, and based on the same seg-
ers (top.org). Roadrunner was nearly three times ment and page tables that govern translation on the
more efficient than the next supercomputer to reach Power cores. Threads can obtain locks in a consistent
a sustained Petaflop. manner on the Power cores and on the SPEs. To this
. The QPACE supercomputer developed by a Euro- end the SPE supports atomic “get line and reserve”
pean University consortium in collaboration with and “store line conditional” commands that mirror the
IBM leverages the PowerXCelli processor in com- Power core’s atomic load word and reserve and store
bination with an FPGA that combines the I/O word conditional operations. An important practical
 C Cell Broadband Engine Processor

1Gb
XDR Cell Ethernet
XDR IO
Broadband
XDR bridge
Engine
XDR

RSX
GPU

Cell Broadband Engine Processor. Fig.  PlayStation  (configuration and system/cluster picture)

IB
adapter

DDR2 PCIe AMD


DDR2 IBM PCIe bridge opteron
DDR2 PowerXCell8i bridge
DDR2

DDR2 PCIe AMD


DDR2 IBM PCIe bridge opteron
DDR2 PowerXCell8i bridge
DDR2
IB
adapter

Roadrunner accelerated node

Cell Broadband Engine Processor. Fig.  Roadrunner (system configuration and picture)
Cell Broadband Engine Processor C 

DDR2
DDR2 IBM
Xilinx C
Virtex 5
DDR2 PowerXCell8i
FPGA
DDR2

QPACE node card

Cell Broadband Engine Processor. Fig.  QPACE (card and system configuration and picture)

difference between a thread that runs on the Power core local store leaves the SPE idle for a substantial amount
and one that runs on an SPE is that the context on of time, then double-buffering of tasks within the same
the SPE is large and includes the  general-purpose user process can be an effective method to improve SPE
registers, the  kB local store, and the state of the utilization.
MFC. Therefore, unless an ABI is used that supports The SPU provides a SIMD instruction set that is
cooperative multitasking (i.e., switching threads only similar to the instruction sets of other media-enhanced
when there is minimal state in the SPE), doing a thread processors and therefore data-level concurrency is han-
switch is an expensive operation. SPE threads there- dled in much the same way. Languages such as OpenCL
fore are preferentially run to completion and when a provide language support for vector data types allowing
thread requires an operating system service it is gen- portable code to be constructed that leverages the SIMD
erally preferable to service this with code on another operations. Vectorizing compilers for Cell provide an
processor than to interrupt and context switch the SPE. additional path toward leveraging the performance pro-
In the Linux operating system for Cell, SPE threads are vided by the SIMD units while retaining source code
therefore first created as Power threads that initialize portability. The use of standard libraries, the implemen-
and start the SPEs and then remain available to service tation of which uses the SIMD instructions provided
operating system functions on behalf of the SPE threads by Cell directly, provides a third path toward leverag-
they started. Note that while a context switch on an SPE ing the SIMD capabilities of Cell and other processors
is expensive, once re-initialized, the local store is back in without sacrificing portability. While compilers for the
the same state where the process was interrupted, unlike Cell B.E. adhere to the language standards and thus also
a cache which typically suffers a significant number of accept scalar data types, performance on scalar applica-
misses right after a thread switch. Because the local store tions can suffer performance penalties for aligning the
is much larger than a typical register file and part of data in the SIMD registers.
the state of the SPE, a (preemptive) context switch on The handling of data locality predictability and the
an SPE is quite expensive. It is therefore best to think use of the local store is the most distinguishing char-
of the SPE as a single-mode or batch-mode resource. If acteristic of the Cell Broadband Engine. It is possible to
the time to gather the inputs for a computation into the use the local store as a software-managed cache for code
 C Cell Broadband Engine Processor

and data with runtime support and thus remove the or PowerXCelli to that of dual core processors that
burden of dealing with locality from both the program- require a similar amount of power and chip area in the
mer and the compiler. While this provides a path toward same semiconductor technology. For an application to
code portability of standard multithreaded languages, benefit from the Cell B.E. architecture it is most impor-
doing so generally results in poor performance due to tant that it be possible to structure the application such
the overheads associated with software tag checks on that the majority of the data can be pre-fetched into
loads and stores (the penalties associated with issuing the local store prior to program execution. The SIMD-
the MFC commands on a cache miss are insignificant in width on the Cell B.E. is similar to that of CPUs of its
comparison to the memory latency penalties incurred generation and thus the degree of data parallelism is not
on a miss). On certain types of codes, data pre-fetching a big distinguishing factor between Cell B.E. and CPUs.
commands generated by the compiler can be effective. GPUs are optimized for streaming applications and with
If the code is deeply vectorized then the local store can a higher degree of data parallelism can be more efficient
be treated by the compiler as essentially a large vector than Cell B.E. on those applications.
register file and if the vectors are sufficiently long this
can lead to efficient compiler-generated pre-fetching Related Entries
and gathering of data. Streaming languages, where a NVIDIA GPU
set of kernels is applied to data that is streamed from IBM Power Architecture
and back to global memory can also be efficiently sup-
ported. Also, if the code is written in a functional or Bibliographic Notes and Further
task-oriented style which allows operands of a piece Reading
of work to be identified prior to execution then again An overview of Cell Broadband Engine is provided in
compiler-generated pre-fetching of data can be highly [] with a more detailed look at aspects of the archi-
effective. Finally, languages that explicitly express data tecture in [] and []. In [] more detail is provided
locality and/or privacy can be effectively compiled to on the implementation aspects of the microprocessor.
leverage the local store. It is not uncommon for com- Reference [] goes into more detail on the design and
pilers or applications to employ multiple techniques, programming philosophy of the Cell B.E., whereas ref-
e.g., the use of a software data cache for hard to pre- erences [] and [] provide insight into compiler design
fetch data in combination with block pre-fetching of for the Cell B.E. Reference [] provides an overview
vector data. of the security architecture of the Cell B.E. References
Because the local store stores code as well as data, [] and [] provide an introduction to performance
software must also take care of bringing code into attributes of the Cell B.E.
the local store. Unlike the software data cache, a soft-
ware instruction cache generally operates efficiently and Bibliography
explicitly dealing with pre-fetching code to the local . Kahle JA, Day MN, Hofstee HP, Johns CR, Maeurer, TR, Shippy D
store can be considered a second-order optimization () Introduction to the cell multiprocessor. IBM J Res Dev
step for most applications. That said, there are cases, (/):–
such as hard-real-time applications, where explicit con- . Johns CR, Brokenshire DA () Introduction to the Cell Broad-
trol over code locality is beneficial. As noted earlier, band Engine architecture. IBM J Res Dev ():–
. Gschwind M, Hofstee HP, Flachs B, Hopkins M, Watanabe Y,
the Cell B.E. provides branch hint instructions that
Yamazaki T () Synergistic processing in cell’s multicore
allow compilers to leverage information about preferred architecture. IEEE Micro ():–
branch directions. . Flachs B, Asano S, Dhong SH, Hofstee HP, Gervais G, Kim R, Le
T, Liu P, Leenstra J, Liberty JS, Michael B, Oh H-J, Mueller SM,
Cell Processor Performance Takahashi O, Hirairi K, Kawasumi A, Murakami H, Noro H,
Cell Broadband Engine processors occupy a middle Onishi S, Pille J, Silberman J, Yong S, Hatakeyama A, Watan-
abe Y, Yano N, Brokenshire DA, Peyravian M, To V, Iwata E
ground between CPUs and GPUs. On properly struc-
() Microarchitecture and implementation of the synergistic
tured applications, an SPU is often an order of mag- processor in -nm and -nm SOI. IBM J Res Dev ():–
nitude more efficient than a high-end CPU. This can . Keckler SW, Olokuton K, Hofstee HP (eds) () Multicore
be seen by comparing the performance of the Cell B.E. processors and systems. Springer, New York
Cellular Automata C 

. Eichenberger AE, O’Brien K, O’Brien KM, Wu P, Chen T, Oden Discussion


PH, Prener DA, Sheperd JC, So B, Sura Z, Wang A, Zhang T,
Zhao P, Gschwind M, Achambault R, Gao Y, Koo R () Using Definition of Cellular Automata
advanced compiler technology to exploir the performance of the Cellular automata (CA) are a class of highly parallel
Cell Broadband Engine architecture. IBM Syst J ():–
computational systems based on a transition function
. Perez JM, Bellens P, Badia RM, Labarta J () CellSs: making C
it easier to program the Cell Broadband Engine processor. IBM J
applied to elements of a grid of cells. They were first
Res Dev ():– introduced in the s by John von Neumann as an
. Shimizu K, Hofstee HP, Liberty JS () Cell Broadband Engine effort to model self-reproducing biological systems in
processor vault security architecture, IBM J Res Dev ():– a framework based on mathematical logic. The system
. Chen T, Raghavan R, Dale JN, Iwata E () Cell Broadband is evolved in time by applying this transition function to
Engine architecture and its first implementation – a performance
all cells in parallel to generate the state of the system at
view. IBM J Res Dev ():–
. Williams S, Shalf J, Oliker L, Husbands P, Kamil S, Yelick K time t based on the state from time t−. Three properties
() The potential of the cell processor for scientific computing. are common amongst different CA algorithms:
In: Proceedings of the third conference on computing frontiers,
Ischia, pp –
. A grid of cells is defined such that for each cell there
exists a finite neighborhood of cells that influence
its state change. This neighborhood is frequently
defined to be other cells that are spatially adjacent.
The topology of the grid of cells determines the
Cell Processor neighbors of each cell.
. A fixed set of state values are possible for each cell.
Cell Broadband Engine Processor
These can be as simple as boolean true/false values,
all the way up to real numbers. Most common exam-
ples are restricted to either booleans or a small set of
integer values.
Cell/B.E. . A state transition rule that evolves the state of a cell
to a new state based on its current value and that of
Cell Broadband Engine Processor its neighbors.
Cells are positioned at the nodes of a regular grid,
commonly based on rectangular, triangular, or hexago-
nal cells. The grid of cells may be finite or unbounded
Cellular Automata in size. When finite grids are employed, the model must
Matthew Sottile account for the boundaries of the grid (Fig. ). The
Galois, Inc., Portland, OR, USA most common approaches taken are to either impose a
toroidal or cylindrical topology to the space in which all
or some edges wrap around, or adopt a fixed value for all
Definition
A cellular automaton is a computational system defined
as a collection of cells that change state in parallel based
on a transition rule applied to each cell and a finite
number of their neighbors. A cellular automaton can
be treated as a graph in which vertices correspond to
cells and edges connect adjacent cells to define their
neighborhood. Cellular automata are well suited to par-
allel implementation using a Single Program, Multiple a b
Data (SPMD) model of computation. They have histor-
ically been used to study a variety of problems in the Cellular Automata. Fig.  Common boundary types.
biological, physical, and computing sciences. (a) Toroidal, (b) planar with fixed values off grid
 C Cellular Automata

off-grid cells to represent an appropriate boundary con- are in the “on” state in the neighborhood of a cell is used
dition. Finite grids are easiest to model computationally, to determine the state of the central cell. In both cases
as unbounded grids will require a potentially large and there is a neighborhood of cells where a reduction oper-
endlessly growing amount of memory to represent all ation (such as +) is applied to their state to compute a
cells containing important state values. single value from which the new state of the central cell
is computed.
D Grids
In a simple D CA, the grid is represented as an array Common Cellular Automata
of cells. The neighborhood of a cell is the set of cells
D Boolean Automata
that are spatially adjacent within some finite radius. For
The simplest cellular automata are one-dimensional
example, in Fig.  a single cell is highlighted with a
boolean systems. In these, the grid is an array of boolean
neighborhood of radius , which includes one cell on
values. The smallest neighborhood is that in which a
each side.
cell is influenced only by its two adjacent neighbors. The
state transition rule for a cell ci is defined as
D Grids
In a D CA, there are different ways in which a neigh- ct+
i = R (cti , cti− , cti+ ) .
borhood can be defined. The most common are the von
Neumann and Moore neighborhoods as shown in Fig. . Wolfram Rule Scheme
For each cell, the eight neighbor cells are considered Wolfram [] defined a concise scheme for naming D
due to either sharing a common face or corner. The cellular automata that is based on a binary encoding
von Neumann neighborhood includes only those cells of the output value for a rule over a family of possible
that share a face, while the Moore neighborhood also inputs. The most common scheme is applied to systems
includes those sharing a corner. in which a cell is updated based only on the values of
When considering D systems, the transition rule itself and its two nearest neighbors. For any cell there
computation bears a strong similarity to stencil-based are only eight possible inputs to the transition rule R:
computations common in programs that solve systems , , , , , , , and . For each of these
of partial differential equations. For example, a simple inputs, R produces a single bit. Therefore, if each -bit
stencil computation involves averaging all values in the input is assigned a position in an -bit number and set
neighborhood of each point in a rectangular grid and that bit to either  or  based on the output of R for the
replacing the value of the point at the center with the corresponding -bit input state, an -bit number can be
result. In the example of Conway’s game of life intro- created that uniquely identifies the rule R.
duced later in this entry, there is a similar computation For example, a common example rule is number ,
for the transition rule, in which the number of cells that which in binary is written as . This means that
the input states , , , and  map to , and the
inputs , , , and  map to . It is common to see
these rules represented visually as in Fig. . The result of
Cellular Automata. Fig.  D neighborhood of radius  this rule is shown in Fig. .
This numbering scheme can be extended to D cellu-
lar automata with larger neighborhoods. For example, a
CA with a neighborhood of five cells (two on each side
of the central cell) would have  =  possible input
states; so each rule could be summarized with a single

a b
Cellular Automata. Fig.  D neighborhoods. (a) von Cellular Automata. Fig.  Rule  transition rule. Black is ,
Neumann (b) Moore white is 
Cellular Automata C 

This rule yields very interesting dynamic behavior as


it evolves. Many different types of structures have been
discovered, ranging from those that are static and stable,
those that repeat through a fixed sequence of states, and
persistent structures that move and produce new struc- C
tures. Figure a shows a typical configuration of the
space after a number of time steps from a random ini-
Cellular Automata. Fig.  A sequence of time steps for
tial state. A number of interesting structures that appear
the D rule  automaton, starting from the top row with a
commonly are also shown, such as the self-propagating
single cell turned on
glider (Fig. b), oscillating blinker (Fig. d), and a con-
figuration that rapidly stabilizes and ceases to change
(Fig. c).

Lattice Gas Automata


Cellular automata have a history of being employed for
modeling physical systems in the context of statistical
mechanics, where the macroscopic behavior of the sys-
tem is dictated primarily by microscopic interactions
with a small spatial extent. The Ising model used in solid
Cellular Automata. Fig.  A three color D automaton state physics is an example of non-automata systems
representing rule  with initial conditions  that share this characteristic of local dynamics leading
to a global behavior that mimics real physical systems.
In the s, cellular automata became a topic of
-bit number. The D scheme can also be extended to
interest in the physics community for modeling fluid
include cells that have more than two possible states.
systems based on the observation that macroscopic
Extensions to these larger state spaces yield example
behavior of fluids is determined by the microscopic
rules that result in very complex dynamics as they
behavior of a large ensemble of fluid particles. Instead
evolve. For example, rule  in Fig.  shows the evolu-
of considering cells as containing generic boolean val-
tion of the system starting from an initial condition of
ues, one could encode more information in each cell
 centered on the first row.
to represent entities such as interacting particles. Two
important foundational cellular automata systems are
D Boolean Automata described here that laid the basis for later models
In , Martin Gardner introduced a cellular automa- of modern relevance such as the Lattice Boltzmann
ton invented by John Conway [] on D grids with Method (LBM). The suitability of CA-based algorithms
boolean-valued cells that is known as the game of life. to parallelization is a primary reason for current interest
For each cell, the game of life transition rule consid- in LBM methods for physical simulation.
ers all cells in the -cell Moore neighborhood. Cells that
contain a true value are considered to be “alive,” while
HPP Lattice Gas
those that are false are “dead.” The rule is easily stated as
One of the earliest systems that gained attention was the
follows, where ct is the current state of the cell at time
HPP automaton of Hardy, Pomeau and de Pazzis []. In
step t, and N is the number of cells in its neighborhood
the simple single fluid version of this system, each cell
that are alive:
contains a -bit value. Each bit in this value corresponds
● If ct is alive: to a vector relating the cell to its immediately adjacent
– If N <  then ct+ is set to dead. neighbors (up, down, left, and right). A bit being on cor-
– If N >  then ct+ is set to dead. responds to a particle coming from the neighbor into
– If N =  or N =  then ct+ is set to alive. the cell, and a bit being off corresponds to the absence
● If ct is dead and N = , ct+ is set to alive. of a particle arriving from the neighbor.
 C Cellular Automata

Cellular Automata. Fig.  Plots showing behavior observed while evolving the game of life CA. (a) Game of life over three
time steps; (b) two full periods of the glider; (c) feature that stabilizes after three steps; (d) two features that repeat every
two steps

rule do not change if they are transformed by a simple


rotation. For example, in the cell illustrated here, rota-
tion of the state by ○ corresponds to a circular shift of
the input and output bit encodings by one. As such, a
small set of rules need to be explicitly defined and the
remainder can be computed by applying rotations. The
Cellular Automata. Fig.  A sample HPP collision rule same shortcut can be applied to generating the table of
transition rules for more sophisticated lattice gas sys-
The transition rule for the HPP automaton was con- tems due to the rotational invariance of the underlying
structed to represent conservation of momentum in a physics of the system.
physical system. For example, if a cell had a particle From a computational perspective, this type of rule
arriving from above and from the left, but none from set was appealing for digital implementation because
the right or from below, then after a time step the cell it could easily be encoded purely based on boolean
should produce a particle that leaves to the right and to operators. This differed from traditional methods for
the bottom. Similarly, if a particle arrives from the left studying fluid systems in which systems of partial dif-
and from the right, but from neither the top or bottom, ferential equations needed to be solved using floating
then two particles should exit from the top and bottom. point arithmetic.
In both of these cases, the momentum of the system
does not change. This simple rule models the physical
properties of a system based on simple collision rules. FHP Rules
This is illustrated in Fig.  for a collision of two particles The HPP system, while intuitively appealing, lacks
entering a cell facing each other. If the cell encoded the properties that are necessary for modeling realistic
entry state as the binary value  where the bits cor- physical systems – for example, the HPP system does
respond in order to North, East, South, and West, then not exhibit isotropic behavior. The details related to the
the exit state for this diagram would be defined as . physical basis of these algorithms are beyond the scope
The rules for the system are simple to generate by of this entry, but are discussed at length by Doolen [],
taking advantage of rotation invariant properties of the Wolf-Gladrow [], and Rivet []. A CA inspired by the
system – the conservation laws underlying a collision early HPP automaton was created by Frisch, Hasslacher,
Cellular Automata C 

and Pomeau [], leading to the family of FHP lattice gas any neighbor, remained in the same position. Multiple
methods. variants of the FHP-based rule set appeared in the lit-
The primary advance of these new methods made erature during the s and s. These introduced
over prior lattice gas techniques was that it was possible richer rule sets with more collision rules, with the effect
to derive a rigorous relationship between the FHP cellu- of producing models that corresponded to different vis- C
lar automaton and more common models for fluid flow cosity coefficients []. Further developments added the
based on Navier–Stokes methods requiring the solution ability to model multiple fluids interacting, additional
of systems of partial differential equations. Refinements forces influencing the flow, and so on.
of the FHP model, and subsequent models based on it, Modeling realistic fluid systems required models to
are largely focused on improving the correspondence expand to three dimensions. To produce a model with
of the CA-based algorithms to the physical systems correct isotropy, d’Humières, Lallemand and Frisch []
that they model and better match accepted traditional employed a four-dimensional structure known as the
numerical methods. face centered hyper-cubic, or FCHC, lattice. The three-
In the D FHP automaton, instead of a neighbor- dimensional lattice necessary for building a lattice gas is
hood based on four neighbors, the system is defined based on a dimension reducing procedure that projects
using a neighborhood of six cells arranged in a hexag- FCHC into three-dimensional space. Given this embed-
onal grid with triangular cells. This change in the ded lattice, a set of collision rules that obey the appro-
underlying topology of the grid of cells was criti- priate physical conservation laws can be defined much
cal to improving the physical accuracy of the system like those for FHP in two dimensions.
while maintaining a computationally simple CA-based Lattice gas methods were successful in demonstrat-
model. The exact same logic is applied in constructing ing that, with careful selection of transition rules, real-
the transition rules in this case as for the HPP case – istic fluid behavior could be modeled with cellular
the rules must obey conservation laws. The first instance automata. The lattice Boltzmann method (LBM) is a
of the automaton was based on this simple extension of derivative of these early CA-based fluid models that
HPP, in which the rules for transitions were intended remains in use for physical modeling problems today.
to model collisions in which the state represented only In the LBM, a similar grid of cells is employed, but
particles arriving at a cell. Two example rules are shown the advancement of their state is based on continuous
in Fig. , one that has two possible outcomes (each with valued function evaluation instead of a boolean state
equal probability), and one that is deterministic. In both machine.
cases, the starting state indicates the particles entering a
cell, and the exit state(s) show the particles leaving the Self-organized Criticality: Forest Fire Model
cell after colliding. In , a model was proposed by Drossel and Schw-
Later revisions of the model added a bit to the state abl [] to model forest fire behavior that built upon a
of each cell corresponding to a “rest particle” – a par- previous model proposed by Bak, Chen, and Tang [].
ticle that, absent any additional particle arriving from This model is one of a number of similar systems that
are used to study questions in statistical physics, such
as phase transitions and their critical points. A related
system is the sandpile model, in which a model is con-
structed of a growing pile of sand that periodically
or
experiences avalanches of different sizes.
As a cellular automaton, the forest fire model is
interesting because it is an automaton in which the rules
for updating a cell are probabilistic. The lattice gas mod-
els above include rules that are also probabilistic, but
unlike the forest fire system, the lattice gas probabili-
ties are fixed to ensure accurate correspondence with
Cellular Automata. Fig.  Two examples of FHP collision the physical system being modeled. Forest fire param-
rules eters are intended to be changed to study the response
 C Cellular Automata

of the overall system to their variation. Two probability within the system which is to be shown as capable of
parameters are required: p and f , both of which are real universal computation.
numbers between  and . The update rule for a cell is For example, the simple one-dimensional Rule 
defined as: automaton was shown by Cook [] to be capable of
universal computation. He accomplished this by iden-
● A cell that is burning becomes empty (burned out).
tifying structures known as “spaceships” that are self-
● A cell will start burning if one or more neighbor cells
perpetuating and could be configured to interact. These
are burning.
interactions could be used then to encode an exist-
● A cell will ignite with probability f regardless of how
ing computational system in the evolving state of the
many neighbors are burning.
Rule  system. The input data and program to exe-
● A cell that is empty turns into a non-burning cell
cute would then be encoded in a linear form to serve
with probability p.
as the initial state of the automaton. A similar approach
The third and fourth rules are the probabilistic parts was taken to show that the game of life could also
of the update algorithm. The third rule states that cells emulate any Turing machine or other universal com-
can spontaneously combust without requiring burning puting system. Using self-perpetuating structures, their
neighbors to ignite them. The fourth rule states that a generators, and structures that react with them, one
burned region will eventually grow back as a healthy, can construct traditional logic gates and connect them
non-burning cell. All four rules correspond to an intu- together to form structures equivalent to a traditional
itive notion for how a real forest functions. Trees can digital computer.
be ignited by their neighbors; events can cause trees to As Cook points out, a consequence of this is that it is
start burning in a non-burning region (such as by light- formally undecidable to predict certain behaviors, such
ning or humans), and over time, burned regions will as reaching periodic states or a specific configuration
eventually grow back. of bits. The property of universality does not imply that
This model differs from basic cellular automata due encoding of a system like a digital computer inside a CA
to the requirement that for each cell update there may be would be at all efficient – the CA implementation would
a required sampling of a pseudorandom number source be very slow relative to a real digital computer.
to determine whether or not spontaneous ignition or
new tree growth occurs. This has implications for par- Parallel Implementation
allel implementation because it requires the ability to Cellular automata fit well with parallel implementations
generate pseudorandom numbers in a parallel setting. due to the high degree of parallelism present in the
update rules that define them. Each cell is updated in
parallel with all others, and the evolution of the system
Universality in Cellular Automata proceeds by repeatedly applying the rule to update the
Cellular automata are not only useful for modeling entire set of cells.
physical phenomena. They have also been used to study The primary consideration for parallel implemen-
computability theory. It has been shown that cellu- tation of a cellular automaton is the dependencies
lar automata exist that exhibit the property of com- imposed by the update rule. Parallel implementation
putational universality. This is a concept that arises of any sequential algorithm often starts with an anal-
in computability theory and is based on the notion ysis of the dependencies within the program, both in
of Turing-completeness. Simply stated, a system that terms of control and data. In a CA implementation,
exhibits Turing-completeness is able to perform any cal- there exists a data dependency between time steps due
culation by following a simple set of rules on input data. to the need for state data from time step t to generate
In essence, given a carefully constructed input, the exe- the state for time t + .
cution of the automaton will perform a computation For this discussion, consider the simple D boolean
(such as arithmetic) that has been “programmed into” cellular automaton. To determine the value of the ith cell
the input data. This may be accomplished by finding at step t of the evolution of the system, the update rule
an encoding of an existing universal computing system requires the value of the cell itself and its neighbors (i−
Cellular Automata C 

and i + ) at step t − . This dependency has two effects interconnection network connecting processing nodes
that determine how the algorithm can be implemented together for exchanging data. The machine could be
in parallel. programmed for different CA rules, and was demon-
strated on examples from fluid dynamics (lattice gases),
. Given that the previous step t −  is not changed at
all and can be treated as read-only, all cell values for
statistical physics (diffusion limited aggregation), image C
processing, and large-scale logic simulation.
step t can be updated in parallel.
Another approach that has been taken to hard-
. An in-place update of the cells is not correct, as
ware implementation of CAs is the use of Field Pro-
any given cell i from step t −  is in the depen-
grammable Gate Arrays (FPGAs) to encode the boolean
dency set of multiple cells. If an in-place update
transition rule logic directly in hardware. More recently,
occurred, then it is possible that the value from step
hardware present in accelerator-based systems such as
t− would be destructively overwritten and lost, dis-
General Purpose Graphics Processing Units (GPGPU)
rupting the update for any other cell that requires its
presents a similar feature set as the traditional MPPs.
step t −  value. A simple double-buffering scheme
These devices present the programmer with the ability
can be used to overcome this issue for parallel imple-
to execute a large number of small parallel threads that
mentations.
operate on large volumes of data. Each thread in a GPU
In more sophisticated systems such as those dis- is very similar to the basic processing elements from a
cussed earlier, the number of dependencies that each traditional MPP. The appeal of these modern acceler-
cell has between time steps grows with the size of its ators is that they can achieve performance comparable
neighborhood. The game of life rule states that each cell to traditional supercomputers on small, specialized pro-
requires the state of nine cells to advance – the state of grams like cellular automata that are based on a single
the cell itself and that of its eight neighbors. The D small computation run in parallel on a large data set.
FCHC lattice gas has  channels per node to influ-
ence the transition rule leading to a large number of
dependencies per cell. References
. Bak P, Chen K, Tang C () A forest fire model and some
Hardware Suitability thoughts on turbulence. Phys Lett A :–
. Cook M () Universality in elementary cellular automata.
Historically cellular automata were of interest on mas- Complex Syst ():–
sively parallel processing (MPP) systems in which a . d’Humieres D, Lallemand P, Frisch U () Lattice gas models
very large number of simple processing elements were for -D hydrodynamics. Europhys Lett :–
available. Each processing element was assigned a small . Wolf-Gladrow DA () Lattice-gas cellular automata and lat-
region of the set of cells (often a spatially contigu- tice Boltz-Mann models: an introduction, volume  of Lecture
Notes in Mathematics. Springer, Berlin
ous patch), and iterated over the elements that it was
. Doolen GD (ed) () Lattice gas methods: theory, application,
responsible for to apply the update rule. A limited and Hardware. MIT Press, Cambridge, MA
amount of synchronization would be required to ensure . Drossel B, Schwabl F () Self-organized critical forest-fire
that each processing element worked on the same model. Phys Rev Lett ():–
update step. . Frisch U, Hasslacher B, Pomeau Y () Lattice-gas automata for
the Navier-Stokes equation. Phys Rev Lett ():–
In the s, there was a research effort at MIT to
. Gardner M () The fantastic combinations of John Conway’s
build the CAM [], a parallel architecture designed new solitaire game “life”. Sci Am :–
specifically for executing cellular automata. Machines . Hardy J, Pomeau Y, de Pazzis O () Time evolution of a two-
specialized for CA models could achieve performance dimensional model system. J Math Phys ():–
comparable to conventional parallel systems of the . Rivet J-P, Boon JP () Lattice gas hydrodynamics, volume 
era, including the Thinking Machines CM- and Cray of Cambridge Nonlinear Science Series. Cambridge University
Press, Cambridge
X-MP. The notable architectural features of the CAM
. Toffoli T, Margolus N () Programmable matter: concepts and
were the processing nodes based on DRAM for storing realization. Phys D :–
cell state data, SRAM for storing a lookup table hold- . Wolfram S () Statistical mechanics of cellular automata. Rev
ing the transition rules for a CA, and a mesh-based Mod Phys :–
 C Chaco

refinement provided by an implementation of the


Chaco Fiduccia-Mattheyses (FM) implementation.
● Chaco provides a robust yet efficient algorithm for
Bruce Hendrickson
computing Laplacian eigenvectors which can be
Sandia National Laboratories, Albuquerque, NM, USA
used for spectral partitioning or for other algo-
rithms.
Definition ● Chaco supports generalized spectral, combinatorial,
Chaco was the first modern graph partitioning code and multilevel algorithms that partition into more
developed for parallel computing applications. Although than two parts at each level of recursion.
developed in the early s, the code continues to be ● Chaco has several approaches that consider the
widely used today. topology of the target parallel computer during the
partitioning process.
Discussion
Chaco is a software package that implements a variety of Related Entries
graph partitioning techniques to support parallel appli- Load Balancing, Distributed Memory
cations. Graph partitioning is a widely used abstraction Graph Partitioning
for dividing work amongst the processors of a parallel Hypergraph Partitioning
machine in such a way that each processor has about METIS and ParMETIS
the same amount of work to do, but the amount of
interprocessor communication is kept small. Chaco is a Bibliographic Notes and Further
serial code, intended to be used as a preprocessing step Reading
to set up a parallel application. Chaco takes in a graph Chaco is available for download under an open source
which describes the data dependencies in a computa- license []. The Chaco Users Guide [] has much
tion and outputs a description of how the computation more detailed information about the capabilities of the
shouldbepartitionedamongsttheprocessorsofaparallel code. Subsequent partitioning codes like METIS, Jostle,
machine. PATOH, and Scotch have adopted and further refined
Chaco was developed by Bruce Hendrickson and many of the ideas first prototyped in Chaco.
Rob Leland at Sandia National Laboratories. It pro-
vides implementations of a variety of graph par- Acknowledgment
titioning algorithms. Chaco provided an advance Sandia is a multiprogram laboratory operated by
over the prior state of the art in a number of Sandia Corporation, a Lockheed Martin Company,
areas []. for the US Department of Energy under contract DE-
● Although multilevel partitioning was simultane- AC-AL.
ously co-invented by several groups [, , ], the
implementation in Chaco led to the embrace of Bibliography
this approach as the best balance between speed . Chaco: Software for partitioning graphs, http://www.sandia.
and quality for many practical problems. In paral- gov/∼bahendr/chaco.html
. Bui T, Jones C () A heuristic for reducing fill in sparse matrix
lel computing, this remains the dominant approach factorization. In: Proceedings of the th SIAM Conference on Par-
to partitioning problems. allel Processing for Scientific Computing, SIAM, Portland, OR,
● All the algorithms in Chaco support graphs with pp –
weights on both edges and vertices. . Cong J, Smith ML () A parallel bottom-up clustering algo-
● Chaco provides a suite of partitioning algorithms rithm with applications to circuit partitioning in VLSI design.
In: Proceedings th Annual ACM/IEEE International Design
including spectral, geometric and multilevel
Automation Conference, DAC ’, ACM, Dallas, TX, pp –
approaches. . Hendrickson B, Leland R () The Chaco user’s guide, version
● Chaco supports the coupling of global methods .. Technical Report SAND–, Sandia National Laborato-
(e.g., spectral or geometric partitioning) with local ries, Albuquerque, NM, October 
Chapel (Cray Inc. HPCS Language) C 

. Hendrickson B, Leland R () A multilevel algorithm for parti- academic partners at Caltech/JPL (Jet Propulsion Lab-
tioning graphs. In: Proceedings of the Supercomputing ’. ACM, oratory).
December . Previous version published as Sandia Technical
The second phase of the HPCS program, from sum-
Report SAND –
mer  through , saw a great deal of work in
the specification and early implementation of Chapel. C
Chapel’s initial design was spearheaded by David
Callahan, Brad Chamberlain, and Hans Zima. This
Chapel (Cray Inc. HPCS Language)
group published an early description of their design in
Bradford L. Chamberlain a paper entitled “The Cascade High Productivity Lan-
Cray Inc., Seattle, WA, USA guage” []. Implementation of a Chapel compiler began
in the winter of , initially led by Brad Chamberlain
and John Plevyak (who also played an important role in
Synonyms
the language’s early design). In late , a rough draft
Cascade high productivity language
of the language specification was completed. Around
this same time, Steve Deitz joined the implementation
Definition
effort, who would go on to become one of Chapel’s most
Chapel is a parallel programming language that
influential long-term contributors. This group estab-
emerged from Cray Inc.’s participation in the High Pro-
lished the primary themes and concepts that set the
ductivity Computing Systems (HPCS) program spon-
overall direction for the Chapel language. The motiva-
sored by the Defense Advanced Research Projects
tion and vision of Chapel during this time was captured
Agency (DARPA). The name Chapel derives from
in an article entitled “Parallel Programmability and the
the phrase “Cascade High Productivity Language,”
Chapel Language” [], which provided a wish list of
where “Cascade” is the project name for the Cray
sorts for productive parallel languages and evaluated
HPCS effort. The HPCS program was launched with
how well or poorly existing languages and Chapel met
the goal of raising user productivity on large-scale
the criteria.
parallel systems by a factor of ten. Chapel was
By the summer of , the HPCS program was
designed to help with this challenge by vastly improv-
transitioning into its third phase and the Chapel team’s
ing the programmability of parallel architectures while
composition had changed dramatically, as several of
matching or beating the performance, portability,
the original members moved on to other pursuits and
and robustness of previous parallel programming
several new contributors joined the project. Of the orig-
models.
inal core team, only Brad Chamberlain and Steve Deitz
remained and they would go on to lead the design and
Discussion implementation of Chapel for the majority of phase III
History of HPCS (ongoing at the time of this writing).
The Chapel language got its start in  during the first The transition to phase III also marked a point
phase of Cray Inc.’s participation in the DARPA HPCS when the implementation began to gain significant trac-
program. While exploring candidate system design con- tion due to some important changes that were made
cepts to improve user productivity, the technical leaders to the language and compiler design based on experi-
of the Cray Cascade project decided that one com- ence gained during phase II. In the spring of , the
ponent of their software solution would be to pursue first multi-threaded task-parallel programs began run-
the development of an innovative parallel programming ning on a single node. By the summer of , the first
language. Cray first reported their interest in pursuing a task-parallel programs were running across multiple
new language to the HPCS mission partners in January nodes with distributed memory. And by the fall of ,
. The language was named Chapel later that year, the first multi-node data-parallel programs were being
stemming loosely from the phrase “Cascade High Pro- demonstrated.
ductivity Language.” The early phases of Chapel devel- During this time period, releases of Chapel also
opment were a joint effort between Cray Inc. and their began taking place, approximately every  months. The
 C Chapel (Cray Inc. HPCS Language)

very first release was made available in December  Themes


on a request-only basis. The first release to the gen- Chapel’s design is typically described as being motivated
eral public occurred in November . And in April by five major themes: () support for general paral-
, the Chapel project more officially became an lel programming, () support for global-view abstrac-
open-source project by migrating the hosting of its code tions, () a multiresolution language design, () support
repository to SourceForge. for programmer control over locality and affinity, and
Believing that a language cannot thrive and become () a narrowing of the gap between mainstream and
adopted if it is controlled by a single company, the HPC programming models. This section provides an
Chapel team has often stated its belief that as Chapel overview of these themes.
grows in its capabilities and popularity, it should grad-
ually be turned over to the broader community as General Parallel Programming
an open, consortium-driven language. Over time, an Chapel’s desire to support general programming stems
increasing number of external collaborations have been from the observation that while programs and par-
established with members of academia, computing cen- allel architectures both typically contain many types
ters, and industry, representing a step in this direction. of parallelism at several levels, programmers typically
At the time of this writing, Chapel remains an active need to use a mix of distinct programming models
and evolving project. The team’s current emphasis is on to express all levels/types of software parallelism and
expanding Chapel’s support for user-defined distribu- to target all available varieties of hardware parallelism.
tions, improving performance of key idioms, support- In contrast, Chapel aims to support multiple levels of
ing users, and seeking out strategic collaborations. hardware and software parallelism using a unified set
of concepts for expressing parallelism and locality. To
this end, Chapel programs support parallelism at the
Influences function, loop, statement, and expression levels. Chapel
Rather than extending an existing language, Chapel language concepts support data parallelism, task par-
was designed from first principles. It was decided that allelism, concurrent programming, and the ability to
starting from scratch was important to avoid inher- compose these different styles within a single program
iting features from previous languages that were not naturally. Chapel programs can be executed on desktop
well suited to large-scale parallel programming. Exam- multicore computers, commodity clusters, and large-
ples include pointer/array equivalence in C and com- scale systems developed by Cray Inc. or other vendors.
mon blocks in Fortran. Moreover, Chapel’s design team
believed that the challenging part of learning any lan- Global-View Abstractions
guage is learning its semantics, not its syntax. To that Chapel is described as supporting global-view abstrac-
end, embedding new semantics in an established syn- tions for data and for control flow. In the tradition of
tax can often cause more confusion than benefit. To this ZPL and High Performance Fortran, Chapel supports
end, Chapel was designed from a blank slate. That said, the ability to declare and operate on large arrays in a
Chapel’s design does contain many concepts and influ- holistic manner even though they may be implemented
ences from previous languages, most notably ZPL, High by storing their elements within the distributed memo-
Performance Fortran (HPF), and the Tera/Cray MTA ries of many distinct nodes. Chapel’s designers felt that
extensions to C and Fortran, reflecting the backgrounds many of the most significant challenges to parallel pro-
of the original design team. Chapel utilizes a partitioned grammability stem from the typical requirement that
global namespace for convenience and scalability, simi- programmers write codes in a cooperating executable
lar to traditional PGAS languages like UPC, Co-Array or Single Program, Multiple Data (SPMD) program-
Fortran, and Titanium; yet it departs from those lan- ming model. Such models require the user to manually
guages in other ways, most notably by supporting more manage many tedious details including data ownership,
dynamic models of execution and parallelism. Other local-to-global index transformations, communication,
notable influences include CLU, ML, NESL, Java, C#, and synchronization. This overhead often clutters a pro-
C/C++, Fortran, Modula, and Ada. gram’s text, obscuring its intent and making the code
Chapel (Cray Inc. HPCS Language) C 

difficult to maintain and modify. In contrast, languages latencies incurred by communicating with other pro-
that support a global view of data such as ZPL, HPF, cessors over a network. For this reason, performance-
and Chapel shift this burden away from the typical minded programmers typically need to control where
user and onto the compiler, runtime libraries, and data data is stored on a large-scale system and where the
distribution authors. The result is a user code that is tasks accessing that data will execute relative to the data. C
cleaner and easier to understand, arguably without a As multicore processors grow in the number and variety
significant impact on performance. of compute resources, such control over locality is likely
Chapel departs from the single-threaded logical to become increasingly important for desktop program-
execution models of ZPL and HPF by also providing a ming as well. To this end, Chapel provides concepts
global view of control flow. Chapel’s authors define this that permit programmers to reason about the com-
as a programming model in which a program’s entry pute resources that they are targeting and to indicate
point is executed by a single logical task, and then addi- where data and tasks should be located relative to those
tional parallelism is introduced over the course of the compute resources.
program through explicit language constructs such as
parallel loops and the creation of new tasks. Support- Narrowing the Gap Between Mainstream
ing a global view for control flow and data structures and HPC Languages
makes parallel programming more like traditional pro- Chapel’s designers believe there to be a wide gap
gramming by removing the requirement that users must between programming languages that are being used in
write programs that are complicated by details related to education and mainstream computing such as Java, C#,
running multiple copies of the program in concert as in Matlab, Perl, and Python and those being used by the
the SPMD model. High Performance Computing community: Fortran, C,
and C++ in combination with MPI and OpenMP (and
A Multiresolution Design in some circles, Co-Array Fortran and UPC – Uni-
Another departure from ZPL and HPF is that those fied Parallel C). It was believed that this gap should
languages provide high-level data-parallel abstrac- be bridged in order to take advantage of productivity
tions without providing a means of abandoning those improvements in modern language design while also
abstractions in order to program closer to the machine being able to better utilize the skills of the emerging
in a more explicit manner. Chapel’s design team felt it workforce. The challenge was to design a language
was important for users to have such control in order that would not alienate traditional HPC programmers
to program as close to the machine as their algorithm who were perhaps most comfortable in Fortran or C.
requires, whether for reasons of performance or expres- An example of such a design decision was to have
siveness. To this end, Chapel’s features are designed in Chapel support object-oriented programming since it
a layered manner so that when high-level abstractions is a staple of most modern languages and program-
like its global-view arrays are inappropriate, the pro- mers, yet to make the use of objects optional so
grammer can drop down to lower-level features and that Fortran and C programmers would not need to
control things more explicitly. As an example of this, change their way of thinking about program and data
Chapel’s global-view arrays and data-parallel features structure design. Another example was to make Chapel
are implemented in terms of its lower-level task-parallel an imperative, block-structured language, since the lan-
and locality features for creating distinct tasks and map- guages that have been most broadly adopted by both the
ping them to a machine’s processors. The result is that mainstream and HPC communities have been impera-
users can write different parts of their program using tive rather than functional or declarative in nature.
different levels of abstraction as appropriate for that
phase of the computation. Language Features
Data Parallelism
Locality and Affinity Chapel’s highest-level concepts are those relating to data
The placement of data and tasks on large-scale machines parallelism. The central concept for data-parallel pro-
is crucial for performance and scalability due to the gramming in Chapel is the domain, which is a first-class
 C Chapel (Cray Inc. HPCS Language)

language concept for representing an index set, poten- forall person in Employees do
tially distributed between multiple processors. Domains if (Age(person) < 18) then
are an extension of the region concept in ZPL. They SSN = 0;
are used in Chapel to represent iteration spaces and Scalar functions and operators can be promoted in
to declare arrays. A domain’s indices can be Cartesian Chapel by calling them with array arguments. Such
tuples of integers, representing dense or sparse index promotions also result in data-parallel execution, equiv-
sets on a regular grid. They may also be arbitrary val- alent to the forall loops above. For example, each of the
ues, providing the capability for storing arbitrary sets following whole-array statements will be computed in
or key/value mappings. Chapel also supports the notion parallel in an element-wise manner:
of an unstructured domain in which the index val-
ues are anonymous, providing support for irregular, A = B + 1.0i * X;
B = sin(A);
pointer-based data structures.
Age += 1;
The following code declares three Chapel domains:
const D: domain(2) = [1..n, 1..n],
In addition to the use cases described above,
DDiag: sparse subdomain(D) domains are used in Chapel to perform set inter-
= [i in 1..n] (i,i), sections, to slice arrays and refer to subarrays, to
Employees: domain(string) perform tensor or elementwise iterations, and to
= readNamesFromFile(infile); dynamically resize arrays. Chapel also supports reduc-
tion and scan operators (including user-defined vari-
In these declarations, D represents a regular -
ations) for efficiently computing common collective
dimensional n×n index set; DDiag represents the sparse
operations in parallel. In summary, domains provide a
subset of indices from D describing its main diagonal;
very rich support for parallel operations on a rich set of
and Employees is a set of strings representing the names
potentially distributed data aggregates.
of a group of employees.
Chapel arrays are defined in terms of domains and
Locales
represent a mapping from the domain’s indices to a set
Chapel’s primary concept for referring to machine
of variables. The following declarations declare a pair of
resources is called the locale. A locale in Chapel is an
arrays for each of the domains above:
abstract type, which represents a unit of the target archi-
var A, B: [D] real, tecture that can be used for reasoning about locality.
X, Y: [DDiag] complex, Locales support the ability to execute tasks and to store
Age, SSN: [Employees] int; variables, but the specific definition of the locale for
The first declaration defines two arrays A and B over a given architecture is defined by a Chapel compiler.
domain D, creating two n × n arrays of real floating In practice, an SMP node or multicore processor is
point values. The second creates sparse arrays X and Y often defined to be the locale for a system composed of
that store complex values along the main diagonal of D. commodity processors.
The third declaration creates two arrays of integers Chapel programmers specify the number of locales
representing the employees’ ages and social security that they wish to use on the executable’s command
numbers. line. The Chapel program requests the appropriate
Chapel users can express data parallelism using resources from the target architecture and then spawns
forall loops over domains or arrays. As an example, the user’s code onto those resources for execution.
the following loops express parallel iterations over the Within the Chapel program’s source text, the set of exe-
indices and elements of some of the previously declared cution locales can be referred to symbolically using a
domains and arrays. built-in array of locale values named Locales. Like any
other Chapel array, Locales can be sliced, indexed, or
forall a in A do
reshaped to organize the locales in any manner that suits
a += 1.0;
the program. For example, the following statements
forall (i,j) in D do create customized views of the locale set. The first
A(i,j) = B(i,j) + 1.0i * X(i,j); divides the locales into two disjoint sets while the
Chapel (Cray Inc. HPCS Language) C 

second reshapes the locales into a -dimensional virtual of parallel, distributed data structures. While Chapel
grid of nodes: provides a standard library of domain maps, a major
// given: const Locales:
research goal of the language is to implement these
[0..#numLocales] locale; standard distributions using the same mechanism that
an end-user would rather than by embedding seman- C
const localeSetA = Locales[0..#localesInSetA], tic knowledge of the distributions into the compiler and
localeSetB = Locales[localesInSetA..]; runtime as ZPL and HPF did.
const compGrid
= Locales.reshape[1..2,numLocales/2];
Task Parallelism
As mentioned previously, Chapel’s data-parallel fea-
Distributions tures are implemented in terms of its lower-level task-
As mentioned previously, a domain’s indices may be dis- parallel features. The most basic task-parallel construct
tributed between the computational resources on which in Chapel is the begin statement, which creates a new
a Chapel program is running. This is done by specifying task while allowing the original task to continue execut-
a domain map as part of the domain’s declaration, which ing. For example, the following code starts a new task
defines a mapping from the domain’s index set to a tar- to compute an FFT while the original task goes on to
get locale set. Since domains are used to define iteration compute a Jacobi iteration:
spaces and arrays, this also implies a distribution of
begin FFT(A);
the computations and data structures that are expressed
Jacobi(B);
in terms of that domain. Chapel’s domain maps are
more than simply a mapping of indices to locales, how- Inter-task coordination in Chapel is expressed in a
ever. They also define how each locale should store its data-centric way using special variables called synchro-
local domain indices and array elements, as well as nization variables. In addition to storing a traditional
how operations such as iteration, random access, slic- data value, these variables also maintain a logical full/
ing, and communication are defined on the domains empty state. Reads to synchronization variables block
and arrays. To this end, Chapel domain maps can be until the variable is full and leave the variable empty.
thought of as recipes for implementing parallel, dis- Conversely, writes block until the variable is empty and
tributed data aggregates in Chapel. The domain map’s leave it full. Variations on these default semantics are
functional interface is targeted by the Chapel com- provided via method calls on the variable. This leads to
piler as it rewrites a user’s global-view array operations the very natural expression of inter-task coordination.
down to the per-node computations that implement the Consider, for example, the elegance of the following
overall algorithm. bounded buffer producer/consumer pattern:
A well-formed domain map does not affect the
var buffer: [0..#buffsize] sync int;
semantics of a Chapel program, only its implementation
and performance. In this way, Chapel programmers can begin { // producer
tune their program’s implementation simply by chang- for i in 0..n do
ing the domain declarations, leaving the bulk of the buffer[i%buffsize] = ...;
computation and looping untouched. For example, the }
{ // consumer
previous declaration of domain D could be changed as
for j in 0..n do
follows to specify that it should be distributed using an ...buffer[j%buffsize]...;
instance of the Block distribution: }
const D: domain(2) dmapped MyBlock
Because each element in the bounded buffer is declared
= [1..n, 1..n];
as a synchronized integer variable, the consumer will
However, none of the loops or operations written previ- not read an element from the buffer until the producer
ously on D, A, or B would have to change. has written to it, marking its state as full. Similarly, the
Advanced users can write their own domain maps in producer will not overwrite values in the buffer until the
Chapel and thereby create their own implementations consumer’s reads have reset the state to empty.
 C Chapel (Cray Inc. HPCS Language)

In addition to these basic task creation and syn- dynamic typing. It also supports generic program-
chronization primitives, Chapel supports additional ming and code reuse.
constructs to create and synchronize tasks in structured ● Iterators: Chapel’s iterators are functions that yield
ways that support common task-parallel patterns. multiple values over their lifetime (as in CLU or
Ruby) rather than returning a single time. These
Locality Control iterators can be used to control serial and parallel
While domain maps provide a high-level way of map- loops.
ping iteration spaces and arrays to the target architec- ● Configuration variables: Chapel’s configuration
ture, Chapel also provides a low-level mechanism for variables are symbols whose default values can be
controlling locality called the on-clause. Any Chapel overridden on the command line of the compiler
statement may be prefixed by an on-clause that indi- or executable using argument parsing that is imple-
cates the locale on which that statement should execute. mented automatically by the compiler.
On-clauses can take an expression of locale type as their ● Tuples: These support the ability to group values
argument, which specifies that the statement should be together in a lightweight manner, for example, to
executed on the specified locale. They may also take return multiple values from a function or to repre-
any other variable expression, in which case the state- sent multidimensional array indices using a single
ment will execute on the locale that stores that variable. variable.
For example, the following modification to an earlier
example will execute the FFT on Locale # while execut- Future Directions
ing the Jacobi iteration on the locale that owns element At the time of this writing, the Chapel language is
i,j of B: still evolving based on user feedback, code studies, and
the implementation effort. Some notable areas where
on Locales[1] do begin FFT(A);
additional work is expected in the future include:
on B(i,j) do Jacobi(B);
● Support for heterogeneity: Heterogeneous systems
Base Language are becoming increasingly common, especially
Chapel’s base language was designed to support pro- those with heterogeneous processor types such as
ductive parallel programming, the ability to achieve traditional CPUs paired with accelerators such as
high performance, and the features deemed necessary graphics processing units (GPUs). To support such
for supporting user-defined distributions effectively. systems, it is anticipated that Chapel’s locale con-
In addition to a fairly standard set of types, operators, cept will need to be refined to expose architectural
expressions, and statements, the base language supports substructures in abstract and/or concrete terms.
the following features: ● Transactional memory: Chapel has plans to support
an atomic block for expressing transactional compu-
● A rich compile-time language: Chapel supports the
tations against memory, yet software transactional
ability to define functions that are evaluated at com-
memory (STM) is an active research area in gen-
pile time, including functions that return types.
eral, and becomes even trickier in the distributed
Users may also indicate conditionals that should be
memory context of Chapel programs.
folded at compile time as well as loops that should
● Exceptions: One of Chapel’s most notable omissions
be statically unrolled.
is support for exception- and/or error-handling
● Static-type inference: Chapel supports a static-
mechanisms to deal with software failures in a
type inference scheme in which specifications can
robust way, and perhaps also to be resilient to hard-
be omitted in most declaration settings, causing
ware failures. This is an area where the original
the compiler to infer the types from  context. For
Chapel team felt unqualified to make a reasonable
example, a variable declaration may omit the vari-
proposal and intended to fill in that lack over time.
able’s type as long as it is initialized, in which case the
compiler infers its type from the initializing expres- In addition to the above areas, future design and
sion. This supports exploratory programming as in implementation work is expected to take place in the
scripting languages without the runtime overhead of areas of task teams, dynamic load balancing, garbage
Chapel (Cray Inc. HPCS Language) C 

collection, parallel I/O, language interoperability, and use of multiresolution language design in efforts like
tool support. Chapel [].
For programmers interested in learning how to
use Chapel, there is perhaps no better resource
Related Entries than the release itself, which is available as an C
Fortress (Sun HPCS Language) open-source download from SourceForge []. The
HPF (High Performance Fortran) release is made available under the Berkeley Software
NESL Distribution (BSD) license and contains a portable
PGAS (Partitioned Global Address Space) Languages implementation of the Chapel compiler along with doc-
Tera MTA umentation and example codes. Another resource is
ZPL a tutorial document that walks through some of the
HPC Challenge benchmarks, explaining how they can
be coded in Chapel. While the language has evolved
Bibliographic Notes and Further since the tutorial was last updated, it remains reason-
Reading ably accurate and provides a gentle introduction to the
The two main academic papers providing an overview language []. The Chapel team also presents tutorials to
of Chapel’s approach are also two of the earliest: “The the community fairly often, and slide decks from these
Cascade High Productivity Language” [] and “Parallel tutorials are archived at the Chapel Web site [].
Programmability and the Chapel Language” []. While Many of the resources above, as well as other useful
the language has continued to evolve since these papers resources such as presentations and collaboration ideas,
were published, the overall motivations and concepts can be found at the Chapel project Web site hosted at
are still very accurate. The first paper is interesting in Cray [].
that it provides an early look at the original team’s To read about Chapel’s chief influences, the best
design while the latter remains a good overview of the resources are probably dissertations from the ZPL
language’s motivations and concepts. For the most accu- team [, ], the High Performance Fortran Hand-
rate description of the language at any given time, the book [], and the Cray XMT Programming Environ-
reader is referred to the Chapel Language Specification. ment User’s Guide [] (the Cray XMT is the current
This is an evolving document that is updated as the lan- incarnation of the Tera/Cray MTA).
guage and its implementation improve. At the time of
this writing, the current version is . []. Acknowledgments
Other important early works describing specific lan- This material is based upon work supported by the
guage concepts include “An Approach to Data Distri- Defense Advanced Research Projects Agency under its
butions in Chapel” by Roxana Diaconescu and Hans Agreement No. HR---. Any opinions, find-
Zima []. While the approach to user-defined distri- ings and conclusions or recommendations expressed
butions eventually taken by the Chapel team differs in this material are those of the author(s) and do not
markedly from the concepts described in this paper [], necessarily reflect the views of the Defense Advanced
it remains an important look into the early design being Research Projects Agency.
pursued by the Caltech/JPL team. Another concept-
related paper is “Global-view Abstractions for User- Bibliography
Defined Reductions and Scans” by Deitz, Callahan, . Callahan D, Chamberlain B, Zima H (April ) The Cascade
Chamberlain, and Snyder, which explored concepts for high productivity language. th International workshop on high-
user-defined reductions and scans in Chapel []. level parallel programming models and supportive environments,
In the trade press, a good overview of Chapel in pp –, Santa Fe, NM
Q&A form entitled “Closing the Gap with the Chapel . Chamberlain BL (November ) The design and implementa-
tion of a region-based parallel language. PhD thesis, University of
Language” was published in HPCWire in  [].
Washington
Another Q&A-based document is a position paper . Chamberlain BL (October ) Multiresolution languages for
entitled “Multiresolution Languages for Portable yet portable yet efficient parallel programming. http://chapel.cray.
Efficient Parallel Programming,” which espouses the com/papers/DARPA-RFI-Chapel-web.pdf. Accessed  May 
 C Charm++

. Chamberlain BL, Callahan D, Zima HP (August ) Parallel Discussion


programmability and the Chapel language. Int J High Perform Charm++ [] is a parallel programming system devel-
Comput Appl ():– oped at the University of Illinois at Urbana-Champaign.
. Chamberlain BL, Deitz SJ, Hribar MB, Wong WA (November
It is based on a message-driven migratable objects pro-
) Chapel tutorial using global HPCC benchmarks: STREAM
Triad, Random Access, and FFT (revision .). http://chapel.cray. gramming model, and consists of a C++-based parallel
com/hpcc/hpccTutorial-..pdf. Accessed  May  notation, an adaptive runtime system (RTS) that auto-
. Chamberlain BL, Deitz SJ, Iten D, Choi S-E () User-defined mates resource management, a collection of debugging
distributions and layouts in Chapel: Philosophy and framework. and performance analysis tools, and an associated fam-
In: Hot-PAR ‘: Proceedings of the nd USENIX workshop on
ily of higher level languages. It has been used to program
hot topics, June 
. Chapel development site at SourceForge. http://sourceforge.net/ several highly scalable parallel applications.
projects/chapel. Accessed  May 
. Chapel project website. http://chapel.cray.com. Accessed  May Motivation and Design Philosophy
 One of the main motivations behind Charm++ is the
. Cray Inc., Seattle, WA. Chapel Language Specification (ver- desire to create an optimal division of labor between
sion .), October . http://chapel.cray.com/papers.html.
the programmer and the system: that is, to design a
Accessed  May 
. Cray Inc. Cray XMT Programming Environment User’s Guide, programming system so that the programmers do what
March  (see http://docs.cray.com). Accessed  May  they can do best, while leaving to the “system” what it
. Deitz SJ () High-Level Programming Language Abstractions can automate best. It was observed that deciding what to
for Advanced and Dynamic Parallel Computations. PhD thesis, do in parallel is relatively easy for the application devel-
University of Washington
oper to specify; conversely, it has been very difficult for
. Deitz SJ, Callahan D, Chamberlain BL, Synder L (March )
Global-view abstractions for user-defined reductions and scans.
a compiler (for example) to automatically parallelize a
In: PPoPP ’: Proceedings of the eleventh ACM SIGPLAN given sequential program. On the other hand, automat-
symposium on principles and practice of parallel programming, ing resource management – which subcomputation to
pp –. ACM Press, New York carry out on what processor and which data to store on a
. Diaconescu R, Zima HP (August ) An approach to data particular processor – is something that the system may
distributions in Chapel. Intl J High Perform Comput Appl
be able to do better than a human programmer, espe-
():–
. Feldman M, Chamberlain BL () Closing the parallelism cially as the complexity of the resource management
gap with the Chapel language. HPCWire, November . task increases. Another motivation is to emphasize the
http://www.hpcwire.com/hpcwire/--/closing_the_paral importance of data locality in the language, so that the
lelism_gap_with_the_chapel_language.html. Accessed  May programmer is made aware of the cost of non-local data

references.
. Koelbel CH, Loveman DB, Schreiber RS, Steele Jr GL, Zosel ME
(September ) the High Performance Fortran handbook. Sci-
entific and engineering computation. MIT Press, Cambridge, MA Programming Model in Abstract
In Charm++, computation is specified in terms of
collections of objects that interact via asynchronous
method invocations. Each object is called a chare.
Charm++ Chares are assigned to processors by an adaptive run-
time system, with an optional override by the program-
Laxmikant V. Kalé mer. A chare is a special kind of C++ object. Its behavior
University of Illinois at Urbana-Champaign, Urbana,
is specified by a C++ class that is “special” only in the
IL, USA
sense that it must have at least one method designated
as an “entry” method. Designating a method as an entry
method signifies that it can be invoked from a remote
Definition processor. The signatures of the entry methods (i.e., the
Charm++ is a C++-based parallel programming system type and structure of its parameters) are specified in a
that implements a message-driven migratable objects separate interface file, to allow the system to generate
programming model, supported by an adaptive runtime code for packing (i.e., serializing) and unpacking the
system. parameters into messages. Other than the existence of
Charm++ C 

the interface files, a Charm++ program is written in a method invocations. Note that this queue may include
manner very similar to standard C++ programs, and asynchronous method invocations for chares located on
thus will feel very familiar to C++ programmers. this processor, as well as “seeds” for the creation of new
The chares communicate via asynchronous method chares. These seeds can be thought of as invocations of
invocations. Such a method invocation does not return the constructor entry method. The scheduler repeatedly C
any value to the caller, and the caller continues with its selects a message (i.e., a pending method invocation)
own execution. Of course, the called chare may choose from the queue, identifies the object targeted, creat-
to send a value back by invoking an entry method upon ing an object if necessary, unpacks the parameters from
the caller object. Each chare has a globally valid ID (its the message if necessary, and then invokes the specified
proxy), which can be passed around via method invo- method with the parameters. Only when the method
cations. Note that the programmer refers to only the returns does it select the next message and repeats the
target chare by its global ID, and not by the proces- process.
sor on which it resides. Thus, in the baseline Charm++
model, the processor is not a part of the ontology of the Chare-arrays and Iterative Computations
programmer. The model described so far, with its support for dynamic
Chares can also create other chares. The creation of creation of work, is well-suited for expressing divide-
new chares is also asynchronous in that the caller does and-conquer as well as divide-and-divide computa-
not wait until the new object is created. Programmers tions. The latter occur in state-space search. Charm++
typically do not specify the processor on which the new (and its C-based precursor, Charm, and Chare Kernel
chare is to be created; the system makes this decision at []) were used in the late s for implementing par-
runtime. The number of chares may vary over time, and allel Prolog [] as well as several combinatorial search
is typically much larger than the number of processors. applications [].
In principle, singleton chares could also be used to
Message-Driven Scheduler create the arbitrary networks of objects that are required
At any given time, there may be several pending method to decompose data in Science and Engineering applica-
invocations for the chares on a processor. Therefore, the tions. For example, one can organize chares in a two-
Charm++ runtime system employs a user-level sched- dimensional mesh network, and through some addi-
uler (Fig. ) on each processor. The scheduler is user- tional message passing, ensure that each chare knows
level in the sense that the operating system is not aware the ID of its four neighboring chares. However, this
of it. Normal Charm++ methods are non-preemptive: method of creating a network of chares is quite cumber-
once a method begins execution, it returns control to some, as it requires extensive bookkeeping on the part of
the scheduler only after it has completed execution. the programmer. Instead, Charm++ supports indexed
The scheduler works with a queue of pending entry collections of chares, called chare-arrays. A Charm++
computation may include multiple chare-arrays. Each
chare-array is a collection of chares of the same type.
Each chare is identified by an index that is unique within
its collection. Thus, an individual chare belonging to
a chare-array is completely identified by the ID of the
Chares chare-array and its own index within it. Common index
Chares
structures include dense as well as sparse multidimen-
sional arrays, but arbitrary indices such as strings or bit
vectors are also possible. Elements in a chare-array may
Processor 1 Processor 2
be created all at once, or can be inserted one at a time.
Scheduler Scheduler
Method invocations can be broadcast to an entire
chare-array, or a section of it. Reductions over chare-
Message Queue Message Queue arrays are also supported, where each chare in a chare-
array contributes a value, and all submitted values are
Charm++. Fig.  Message-driven scheduler combined via a commutative-associative operation. In
 C Charm++

many other programming models, a reduction is a col- associate work directly with processors, one typically
lective operation which blocks all callers. In contrast, has two options. One may divide the set of processors
reductions in Charm++ are non-blocking, that is, asyn- so that a subset is executing one module (P) while the
chronous. A contribute call simply deposits the remaining processors execute the other (Q). Alterna-
value created by the calling chare into the system and tively, one may sequentialize the modules, executing P
returns to its caller. At some later point after all the first, followed by Q, on all processors. Neither alterna-
values have been combined, the system delivers them tive is efficient. Allowing the two modules to interleave
to a user-specified callback. The callback, for example, the execution on all processors is often beneficial, but
could be a broadcast to an entry method of the same is hard to express even with wild-card receives, and it
chare-array. breaks abstraction boundaries between the modules in
Charm++ does not allow generic global variables, any case. With message driven execution, such inter-
but it does allow “specifically shared variables.” The sim- leaving happens naturally, allowing idle time in one
plest of these are read-only variables, which are initial- module to be overlapped with computation in the other.
ized in the main chare’s constructor, and are treated as Coupled with the ability of the adaptive runtime sys-
constants for the remainder of the program. The run- tem to migrate communicating objects closer to each
time system (RTS) ensures that a copy of each read-only other, this adds up to strong support for concurrent
variable is available on all physical processors. composition, and thereby for increased modularity.

Benefits of Message-driven Execution Prefetching Data and Code


The message-driven execution model confers several Since the message driven scheduler can examine its
performance and/or productivity benefits. queue, it knows what the next several objects scheduled
to execute are and what methods they will be execut-
Automatic and Adaptive Overlap of ing. This information can be used to asynchronously
Computation and Communication prefetch data for those objects, while the system exe-
Since objects are scheduled based on the availability cutes the current object. This idea can be used by
of messages, no single object can occupy the proces- the Charm++ runtime system for increasing efficiency
sor while waiting for some remote data. Instead, objects in various contexts, including on accelerators such as
that have asynchronous method invocations (messages) the Cell processor, for out-of-core execution and for
waiting for them in the scheduler’s queue are allowed prefetching data from DRAM to cache.
to execute. This leads to a natural overlap of communi-
cation and computation, without any extra work from Capabilities of the Adaptive Runtime
the programmer. For example, a chare may send a mes- System Based on Migratability of Chares
sage to a remote chare and wait for another message Other capabilities of the runtime system arise from the
from it before continuing. The ensuing communication ability to migrate chares across processors and the abil-
time, which would otherwise be an idle period, is nat- ity to place newly created chares on processors of its
urally and automatically filled in (i.e., overlapped) by choice.
the scheduler with useful computation, that is, process-
ing of another message from the scheduler’s queue for Supporting Task Parallelism with Seed
another chare. Balancers
When a program calls for the creation of a singleton
Concurrent Composition chare, the RTS simply creates a seed for it. This seed
The ability to compose in parallel two individually par- includes the constructor arguments and class informa-
allel modules is referred to as concurrent composition. tion needed to create a new chare. Typically, these seeds
Consider two modules P and Q that are both ready are initially stored on the same processor where they are
to execute and have no direct dependencies among created, but they may be passed from processor to pro-
them. With other programming models (e.g., MPI) that cessor under the control of a runtime component called
Charm++ C 

the seed balancer. Different seed balancer strategies are operations. Thus, a , ×, ×,  cube of data par-
provided by the RTS. For example, in one strategy, each titioned into  ×  ×  array of chares, each holding
processor monitors its neighbors’ queues in addition ×× data subcube, can be shrunk from  cores
to its own, and balances seeds between them as it sees to (say)  cores without significantly losing efficiency.
fit. In another strategy, a processor that becomes idle Of course, some cores will house  objects, instead of C
requests work from a random donor – a work steal- the  objects they did earlier.
ing strategy [, ]. Charm++ also includes strategies
that balance priorities and workloads simultaneously, in
order to give precedence to high-priority work over the Fault Tolerance
entire system of processors. Charm++ provides multiple levels of support for fault
tolerance, including alternative competing strategies.
Migration-based Load Balancers At a basic level, it supports automated application-
Elements of chare-arrays can be migrated across pro- level checkpointing by leveraging its ability to migrate
cessors, either explicitly by the programmer or by the objects. With this, it is possible to create a checkpoint
runtime system. The Charm++ RTS leverages this capa- of the program without requiring extra user code. More
bility to provide a suite of dynamic load balancing interestingly, it is also possible to use a checkpoint cre-
strategies. One class of such strategies is based on the ated on P processors to restart the computation on a
principle of persistence, which is the empirical observa- different number of processors than P.
tion that in most science and engineering applications On appropriate machines and with job schedulers
expressed in terms of their natural objects, computa- that permit it, Charm++ can also automatically detect
tional loads and communication patterns tend to persist and recover from faults. This requires that the job sched-
over time, even for dynamically evolving applications. uler not kill a job if one of its nodes were to fail. At
Thus, the recent past is a reasonable predictor of the near the time of this writing, these schemes are available
future. Since the runtime system mediates communica- on workstation clusters. The most basic strategy uses
tion and schedules computations, it can automatically the checkpoint created on disk, as described above, to
instrument its execution so as to measure computa- effect recovery. A second strategy avoids using disks
tional loads and communication patterns accurately. for checkpointing, instead creating two checkpoints of
Load balancing strategies can use these measurements, each chare in the memory of two processors. It is suit-
or alternatively, any other mechanisms for predicting able for those applications whose memory footprint at
such patterns. Multiple load balancing strategies are the point of checkpointing is relatively small compared
available to choose from. The choice may depend on with the available memory. Fortunately, many applica-
the machine context and applications, although one can tions such as molecular dynamics and computational
always use the default strategy provided. Programmers astronomy fall into this category. When it can be used,
can write their own strategy, either to specialize it to the it is very fast, often accomplishing a checkpoint in less
specific needs of the application or in the hope of doing than a second and recovery in a few seconds. How-
better than the provided strategies. ever, both strategies described above send all processors
back to their checkpoints even when just one out of a
Dynamically Altering the Sets million processors has failed. This wastes all the compu-
of Processors Used tation performed by processors that did not fail. As the
A Charm++ program can be asked to change the set of number of processors increases and, consequently, the
processors it is using at runtime, without requiring any MTBF decreases, this will become an untenable recov-
effort by the programmer. This can be useful to increas- ery strategy. A third experimental strategy in Charm++
ing utilization of a cluster running multiple jobs that sends only the failed processor(s) to their checkpoints
arrive at unpredictable times. The RTS accomplishes by using a message-logging scheme. It also leverages
this by migrating objects and adjusting its runtime data the over-decomposition and migratability of Charm++
structures, such as spanning trees used in its collective objects to parallelize the restart process. That is, the
 C Charm++

objects from failed processors are reincarnated on multiple other processors, where they re-execute, in parallel, from their checkpoints using the logged messages. Charm++ also provides a fourth, proactive strategy to handle situations where a future fault can be predicted, say based on heat sensors or estimates of increasing (corrected) cache errors. The runtime simply migrates objects away from such a processor and readjusts its runtime data structures.

Associated Tools
Several tools have been created to support the development and tuning of Charm++ applications. LiveViz allows one to inject messages into a running program and display attributes and images from a running simulation. Projections supports performance analysis and visualization, including live visualization, parallel on-line analysis, and log-based post-mortem analysis. CharmDebug is a parallel debugger that understands Charm++ constructs and provides online access to runtime data structures. It also supports a sophisticated record-replay scheme and provisional message delivery for dealing with nondeterministic bugs. The communication required by these tools is integrated in the runtime system, leveraging the message-driven scheduler. No separate monitoring processes are necessary.

Code Example
Figures  and  show fragments from a simple Charm++ example program to give a flavor of the programming model. The program is a simple Lennard-Jones molecular dynamics code. The computation is decomposed into a one-dimensional array of LJ objects, each holding a subset of atoms. The interface file (Fig. ) describes the main Charm-level entities, their types and signatures. The program has a read-only integer called numChares that holds the size of the LJ chare-array. In this particular program, the main chare is called Main and has only one method, namely, its constructor. The LJ class is declared as constituting a one-dimensional array of chares. It has two entry methods in addition to its constructor. Note the others[n] notation used to specify a parameter that is an array of size n, where n itself is another integer parameter. This allows the system to generate code to serialize the parameters into a message, when necessary. CkReductionMsg is a system-defined type which is used as a target of entry methods used in reductions.
Some important fragments from the C++ file that defines the program itself are shown in Fig. . Sequential code not important for understanding the program is omitted. The classes Main and LJ inherit from classes generated by a translator based on the interface file. The program execution consists of a number of time steps. In each time step, each processor sends its particles on a round-trip to visit all other chares. Whenever a packet of particles visits a chare via the passOn method, the chare calculates forces on each of the visiting (others) particles due to each of its own particles. Thus, when the particles return home after the round-trip, they have accumulated forces due to all the other particles in the system. A sequential call (integrate) then adds local forces to the accumulated forces and calculates new velocities and positions for each owned particle. At this point, to make sure the time step is truly finished for all chares, the program uses an "asynchronous reduction" via the contribute call. To emphasize the asynchronous nature of this call, the example makes

mainmodule ljdyn {
readonly int numChares;
...
mainchare Main {
entry Main(CkArgMsg *m);
};

array [1D] LJ {
entry LJ(void);
entry void passOn(int home, int n, Particle others[n]);
entry void startNextStep(CkReductionMsg *m);
};
};

Charm++. Fig.  A simple molecular dynamics program: interface file


/*readonly*/ int numChares;

class Main : public CBase_Main {
  Main(CkArgMsg *m) {
    // Process command-line arguments
    ...
    numChares = atoi(m->argv[1]);
    ...
    CProxy_LJ arr = CProxy_LJ::ckNew(numChares);
  }
};

class LJ : public CBase_LJ {
  int timeStep, numParticles, next;
  Particle *myParticles;
  ...

  LJ() {
    ...
    myParticles = new Particle[numParticles];
    ... // initialize particle data
    next = (thisIndex + 1) % numChares;
    timeStep = 0;
    startNextStep((CkReductionMsg *)NULL);
  }

  void startNextStep(CkReductionMsg *m) {
    if (++timeStep > MAXSTEPS) {
      if (thisIndex == 0) { ckout << "Done\n" << endl; CkExit(); }
    } else
      thisProxy[next].passOn(thisIndex, numParticles, myParticles);
  }

  void passOn(int homeIndex, int n, Particle *others) {
    if (thisIndex != homeIndex) {
      interact(n, others);  // add forces on "others" due to my particles
      thisProxy[next].passOn(homeIndex, n, others);
    } else {  // particles are home, with accumulated forces
      CkCallback cb(CkIndex_LJ::startNextStep(NULL), thisProxy);
      contribute(cb);       // asynchronous barrier
      integrate(n, others); // add forces and update positions
    }
  }

  void interact(int n, Particle *others) {
    /* add forces on "others" due to my particles */
  }

  void integrate(int n, Particle *others) {
    /* ... apply forces, update positions ... */
  }
};

[Inset: the chares LJ[0], LJ[1], ..., LJ[n-1] form a ring; each packet of particles travels around the ring via calls to LJ[k].passOn(...)]

Charm++. Fig.  A simple molecular dynamics program: fragments from the C++ file

the call before integrate. The contribute call simply deposits the contribution into the system and continues on to integrate. The contribute call specifies that, after all the array elements of LJ have contributed, a callback will be made. In this case, the callback is a broadcast to all the members of the chare-array at the entry method startNextStep. The inherited variable thisProxy is a proxy to the entire chare-array. Similarly, thisIndex refers to the index of the calling chare in the chare-array to which it belongs.
Language Extensions and Features
The baseline programming model described so far is adequate to express all parallel interaction structures. However, for programming convenience, increased productivity, and/or efficiency, Charm++ supports a few additional features. For example, individual entry methods can be marked as "threaded." This results in the creation of a user-level, lightweight thread whenever the entry method is invoked. Unlike normal entry methods, which always complete their execution and return control to the scheduler before other entry methods are executed, threaded entry methods can block their execution; of course, they do so without blocking the processor they are running on. In particular, they can wait for a "future," wait until another entry method unblocks them, or make blocking method invocations. An entry method can be tagged as blocking (actually called a "sync" method), and such a method is capable of returning values, unlike normal methods, which are asynchronous and therefore have a return type of void. Often, threaded entry methods are used to describe the life cycle of a chare. Another notation within Charm++, called "structured dagger," accomplishes the same effect without the need for a separate stack and associated memory for each user-level thread, and without the (admittedly small) overhead associated with thread context switches. However, it requires that all dependencies on remote data be expressed in this notation within the text of an entry method. In contrast, the thread of control may block waiting for remote data within functions called from a threaded entry method.
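As a rough sketch of how these two attributes fit together, a threaded entry method might block on sync calls as shown below. All names (Driver, Worker, compute, ResultMsg, nWorkers, workers) are invented for this illustration, and the exact declaration syntax should be checked against the Charm++ manual.

// Hypothetical interface (.ci) fragment:
//   mainchare Driver { entry Driver(CkArgMsg *m); entry [threaded] void run(); };
//   array [1D]  Worker { entry Worker(void); entry [sync] ResultMsg *compute(int n); };

// C++ side: run() executes in its own lightweight user-level thread, so the
// blocking sync calls suspend only this thread, not the processor it runs on.
void Driver::run() {
  int total = 0;
  for (int i = 0; i < nWorkers; i++) {      // nWorkers, workers: assumed members of Driver
    ResultMsg *r = workers[i].compute(i);   // blocks until the reply message arrives
    total += r->value;                      // ResultMsg and its value field are invented here
    delete r;
  }
  CkPrintf("total = %d\n", total);
}

A structured-dagger version of the same logic would express the waits as declarative dependencies in the interface file instead of suspending a thread.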
Charm++ as described so far does not bring in the notion of a "processor" in the programming model. However, some low-level constructs that refer to processors are also provided to programmers and, especially, to library writers. For example, when a chare is created, one can optionally specify which processor to create it on. Similarly, when a chare-array is created, one can specify its initial mapping to processors. One can create specialized chare-arrays, called groups, that have exactly one member on each processor, which are useful for implementing services such as load balancers.

Languages in the Charm Family
AMPI, or Adaptive MPI, is an implementation of the message passing interface standard on top of the Charm++ runtime system. Each MPI process is implemented as a user-level thread that is embedded inside a Charm++ object, as a threaded entry method. These objects can be migrated across processors, as is usual for Charm++ objects, thus bringing benefits of the Charm++ adaptive runtime system, such as dynamic load balancing and fault tolerance, to traditional MPI programs. Since there may be multiple MPI "processes" on each core, commensurate with the overdecomposition strategy of Charm++ applications, the MPI programs need to be modified in a mechanical, systematic fashion to avoid conflicts among their global variables. Adaptive MPI provides tools for automating this process to some extent. As a result, a standard Adaptive MPI program is also a legal MPI program, but the converse is true only if the use of global variables has been handled via such modifications. In addition, Adaptive MPI provides primitives such as asynchronous collectives, which are not part of the MPI  standard. An asynchronous reduction, for example, carries out the communication associated with a reduction in the background while the main program continues with its computation. A blocking call is then used to fetch the result of the reduction.
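To make the global-variable issue concrete, the sketch below shows one common manual privatization pattern for such ports: mutable globals are gathered into a per-rank structure that is allocated locally and passed explicitly. The names GlobalState and do_work are placeholders, and AMPI's own automated tools may use different mechanisms (such as swapping global data segments) rather than source changes of this kind.

#include <mpi.h>
#include <cstdio>

// Before: a file-scope global such as "int iter;" would be shared by all MPI
// "processes" running as user-level threads in one address space.
// After: each rank owns its own copy, reached only through an explicit handle.
struct GlobalState {   // placeholder name
  int rank;
  int iter;
};

static void do_work(GlobalState &g) {   // placeholder routine
  g.iter++;                             // touches only this rank's private copy
  std::printf("rank %d at iteration %d\n", g.rank, g.iter);
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  GlobalState g;                        // per-rank state, no true global remains
  MPI_Comm_rank(MPI_COMM_WORLD, &g.rank);
  g.iter = 0;
  for (int i = 0; i < 3; i++) do_work(g);
  MPI_Finalize();
  return 0;
}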
Two recent languages in the Charm++ family are multiphase shared arrays (MSA) and Charisma. These are part of the Charm++ strategy of creating a toolbox consisting of incomplete languages that capture some interaction modes elegantly and frameworks that capture the needs of specific domains or data structures, both backed up by complete languages such as Charm++ and AMPI. The compositionality afforded by message-driven execution ensures that modules written using multiple paradigms can be efficiently composed in a larger application.
MSA is designed to support disciplined use of a shared address space. The computation consists of collections of threads and multiple user-defined data arrays, each partitioned into user-defined pages. Both kinds of entities are implemented as migratable objects (i.e., chares) available to the Charm++ runtime system. The threads can access the data in the arrays, but each array is in only one of a restrictive set of access modes at a time. Read-only, exclusive-write, and accumulate are examples of the access modes supported by MSA. At designated synchronization points,
a program may change the access modes of one or more arrays.
Charisma, another language implemented on top of Charm++, is designed to support computations that exhibit a static data-flow pattern among a set of Charm++ objects. Such a pattern, where the flow of messages remains the same from iteration to iteration even though the content and the length of the messages may change, is extremely common in science and engineering applications. For such applications, Charisma provides a convenient syntax that clearly captures the flow of values and control across multiple collections of objects. In addition, Charisma provides a clean separation of sequential and parallel code that is convenient for collaborative application development involving parallel programmers and domain specialists.

Frameworks atop Charm++
In addition to its use as a language for implementing applications directly, Charm++ serves as a backend for higher-level frameworks and languages such as those described above. Its utility in this context arises from the interoperability and runtime features it provides, which one can leverage to put together a new domain-specific framework relatively quickly. An example of such a framework is ParFUM, which is aimed at unstructured-mesh applications. ParFUM allows developers of sequential codes based on such meshes to retarget them to parallel machines with relatively few changes. It automates several commonly needed functions, including the exchange of boundary nodes (or boundary layers, in general) with neighboring objects. Once a code is ported to ParFUM, it can automatically benefit from other Charm++ features such as load balancing and fault tolerance.

Applications
Some of the highly scalable applications developed using Charm++ are in extensive use by scientists on national supercomputers. These include NAMD (for biomolecular simulations), OpenAtom (for electronic structure simulations), and ChaNGa (for astrophysical N-body simulations).

Origin and History
The Chare Kernel, a precursor of the Charm++ system, arose from work on parallel Prolog at the University of Illinois at Urbana-Champaign in the late s. The implementation mechanism required a collection of computational entities, one for each active clause (i.e., its activation record) of the underlying logic program. Each of these entities typically received multiple responses from its children in the proof tree, and each needed to create new nodes in the proof tree, which in turn had to fire new "tasks" for each of their active clauses. The implementation led to a message-driven scheduler and the dynamic creation of seeds of work. These entities were called chares, borrowing the term used by an earlier parallel functional-languages project, RediFlow. The Chare Kernel essentially separated this implementation mechanism from its parallel Prolog context into a C-based parallel programming paradigm of its own. Charm had similarities (in particular, its message-driven execution) with the earlier research on reworking Hewitt's Actor framework by Agha and Yonezawa [, ]. However, its intellectual progenitors were parallel logic and functional languages. With the increase in popularity of C++, Charm++ became the version of Charm for C++, which was a natural fit for its object-based abstraction. As many researchers in parallel logic programming shifted their attention to scientific computations, indexed collections of migratable chares were developed in Charm++ in the mid-s to simplify addressing chares. In the late s, Adaptive MPI was developed in the context of applications being developed at the Center for Simulation of Advanced Rockets at Illinois. Charm++ continues to be developed and maintained at the University of Illinois, and its applications are in regular use at many supercomputers around the world.

Availability and Usage
Charm++ runs on most parallel machines available at the time this entry was written, including multicore desktops, clusters, and large-scale proprietary supercomputers, running Linux, Windows, and other operating systems. Charm++ and its associated software tools and libraries can be downloaded in source and binary forms from http://charm.cs.illinois.edu under a license that allows free use for noncommercial purposes.
Related Entries
Actors
NAMD (NAnoscale Molecular Dynamics)
Combinatorial Search

Bibliographic Notes and Further Reading
One of the earliest papers on the Chare Kernel, the precursor of Charm++, was published in  [], followed by a more detailed description of the model and its load balancers [, ]. This work arose out of earlier work on parallel Prolog []. The C++-based version was described in an OOPSLA paper in  [] and was expanded upon in a book on parallel C++ in []. Early work on quantifying the benefits of the programming model is summarized in a later paper [].
Some of the early applications using Charm++ were in symbolic computing and, specifically, in parallel combinatorial search []. A scalable framework for supporting migratable arrays of chares is described in a paper [], which is also useful for understanding the programming model. An early paper [] describes support for migrating chares for dynamic load balancing. Recent papers describe an overview of Charm++ [] and its applications [].

Bibliography
. Kale LV, Krishnan S () CHARM++: a portable concurrent object oriented system based on C++. In: Paepcke A (ed) Proceedings of OOPSLA ', ACM, New York, September , pp –
. Shu WW, Kale LV () Chare Kernel – a runtime support system for parallel computations. J Parallel Distrib Comput :–
. Kale LV () Parallel execution of logic programs: the REDUCE-OR process model. In: Proceedings of the fourth international conference on logic programming, Melbourne, May , pp –
. Kale LV, Ramkumar B, Saletore V, Sinha AB () Prioritization in parallel symbolic computing. In: Ito T, Halstead R (eds) Lecture notes in computer science, vol . Springer, pp –
. Lin Y-J, Kumar V () And-parallel execution of logic programs on a shared-memory multiprocessor. J Logic Program (//&):–
. Frigo M, Leiserson CE, Randall KH () The implementation of the Cilk- multithreaded language. In: ACM SIGPLAN ' conference on programming language design and implementation (PLDI), Montreal, June , vol  of ACM SIGPLAN Notices, pp –
. Agha G () Actors: a model of concurrent computation in distributed systems. MIT Press, Cambridge
. Yonezawa A, Briot J-P, Shibayama E () Object-oriented concurrent programming in ABCL/. ACM SIGPLAN Notices, Proceedings of OOPSLA ', Nov , ():–
. Kale LV, Shu W () The Chare Kernel base language: preliminary performance results. In: Proceedings of the  international conference on parallel processing, St. Charles, August , pp –
. Kale LV () The Chare Kernel parallel programming language and system. In: Proceedings of the international conference on parallel processing, August , vol II, pp –
. Kale LV, Krishnan S () Charm++: parallel programming with message-driven objects. In: Wilson GV, Lu P (eds) Parallel programming using C++. MIT Press, Cambridge, pp –
. Gursoy A, Kale LV () Performance and modularity benefits of message-driven execution. J Parallel Distrib Comput :–
. Lawlor OS, Kale LV () Supporting dynamic parallel object arrays. Concurr Comput Pract Exp :–
. Brunner RK, Kale LV () Handling application-induced load imbalance using parallel objects. In: Parallel and distributed computing for symbolic and irregular applications. World Scientific, Singapore, pp –
. Kale LV, Zheng G () Charm++ and AMPI: adaptive runtime strategies via migratable objects. In: Parashar M (ed) Advanced computational infrastructures for parallel and distributed applications. Wiley-Interscience, Hoboken, pp –
. Kale LV, Bohm E, Mendes CL, Wilmarth T, Zheng G () Programming petascale applications with Charm++ and AMPI. In: Bader B (ed) Petascale computing: algorithms and applications. Chapman & Hall/CRC, Boca Raton, pp –

Checkpoint/Restart
Checkpointing

Checkpointing
Martin Schulz
Lawrence Livermore National Laboratory, Livermore, CA, USA

Synonyms
Checkpoint-recovery; Checkpoint/Restart

Definition
In the most general sense, Checkpointing refers to the ability to store the state of a computation in a way that
allows it to be continued at a later time without changing the computation's behavior. The preserved state is called the Checkpoint and the continuation is typically referred to as a Restart.
Checkpointing is most typically used to provide fault tolerance to applications. In this case, the state of the entire application is periodically saved to some kind of stable storage, e.g., disk, and can be retrieved in case the original application crashes due to a failure in the underlying system. The application is then restarted (or recovered) from the checkpoint that was created last and continued from that point on, thereby minimizing the time lost due to the failure.

Discussion
Checkpointing is a mechanism to store the state of a computation so that it can be retrieved at a later point in time and continued. The process of writing the computation's state is referred to as Checkpointing, the data written as the Checkpoint, and the continuation of the application as Restart or Recovery. The execution sequence between two checkpoints is referred to as a Checkpointing Epoch or just Epoch.
As discussed in section "Checkpointing Types", Checkpointing can be accomplished either at system level, transparently to the application (section "System-Level Checkpointing"), or at application level, integrated into an application (section "Application-Level Checkpointing"). While the first type is easier to apply for the end user, the latter one is typically more efficient.
While checkpointing is useful for any kind of computation, it plays a special role for parallel applications (section "Parallel Checkpointing"), especially in the area of High-Performance Computing. With rising numbers of processors in each system, the overall system availability is decreasing, making reliable fault tolerance mechanisms, like checkpointing, essential. However, in order to apply checkpointing to parallel applications, the checkpointing software needs to be able to create globally consistent checkpoints across the entire application, which can be achieved using either coordinated (section "Coordinated Checkpointing") or uncoordinated (section "Uncoordinated Checkpointing") checkpointing protocols.
Independent of the type and the underlying system, checkpointing systems always require some kind of storage to which the checkpoint can be saved, which is discussed in section "Checkpoint Storage Considerations".
Checkpointing is most commonly associated with fault tolerance: it is used to periodically store the state of an application to some kind of stable storage, such that, after a hardware or operating system failure, an application can continue its execution from the last checkpoint, rather than having to start from scratch. The following entry will concentrate on this usage scenario, but will also discuss some alternate scenarios in section "Alternate Usage Scenarios".

Checkpointing Types
Checkpointing can be implemented either at the system level, i.e., by the operating system or the system environment, or within the application itself.

System-Level Checkpointing
In system-level checkpointing, the state of a computation is saved by an external entity, typically without the application's knowledge or support. Consequently, the complete process information has to be included in the checkpoint, as illustrated in Fig. . This includes not only the complete memory footprint, including data segments, the heap, and all stacks, but also register and CPU state as well as open file and other resources. On restart, the complete memory footprint is restored, all file resources are made available to the process again, and then the register set is restored to its original state, including the program counter, allowing the application to continue at the same point where it had been interrupted for the checkpoint.
System-level checkpoint solutions can either be implemented inside the kernel, as a kernel module or service, or at the user level. The former has the advantage that the checkpointer has full access to the target process as well as its resources. User-level checkpointers have to find other ways to gather this information, e.g., by intercepting all system calls. On the flip side, user-level schemes are typically more portable and easier to deploy, in particular in large-scale production environments with limited access for end users.
[Figure: the process address space (text, data, heap, stack), the CPU state (registers, including SP and IP), and external state (file I/O, sockets, ...) are all captured in the system-level checkpoint]
Checkpointing. Fig.  System-level checkpointing

Application-Level Checkpointing
The alternative to system-level checkpointing is to integrate the checkpointing capability into the actual application, which leads to application-level checkpointing solutions. In such systems, the application is augmented with the ability to write its own state into a checkpoint as well as to restart from it. While this requires explicit code inside the application and hence is no longer transparent, it gives the application the ability to decide when checkpoints should be taken (i.e., when it is a good time to write the state, e.g., when memory usage is low or no extra resources are used) and what should be contained in the checkpoint (Fig. ). The latter enables applications to remove noncritical memory regions, e.g., temporary fields, from the checkpoint and hence reduce the checkpoint size.
To illustrate the latter point, let us consider a classical particle simulation in which the location and speed of a set of particles are updated at each iteration step based on the physical properties of the underlying system. The complete state of the computation can be represented by only the two arrays holding the coordinates and velocities of all particles in the simulation. If checkpoints are taken at iteration boundaries after the update of all arrays is complete, only these arrays need to be stored to continue the application at a later time. Any temporary array, e.g., one used to compute forces between particles, as well as stack information does not need to be included. A system-level checkpointer, on the other hand, would not be able to determine which parts of the data are relevant and which are not and would have to store the entire memory segment.
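A minimal sketch of such an application-level checkpoint for the particle example might look as follows. The file name, array layout, and the write_checkpoint/read_checkpoint routines are illustrative choices rather than part of any particular checkpointing system, and a production code would also guard against partial writes (e.g., by writing to a temporary file and renaming it).

#include <fstream>
#include <vector>
#include <cstddef>

// Only the arrays that define the simulation state are saved; temporary
// force arrays and stack contents are deliberately omitted.
static bool write_checkpoint(const char *path, std::size_t step,
                             const std::vector<double> &coords,
                             const std::vector<double> &velocities) {
  std::ofstream out(path, std::ios::binary);
  if (!out) return false;
  std::size_t n = coords.size();
  out.write(reinterpret_cast<const char *>(&step), sizeof(step));
  out.write(reinterpret_cast<const char *>(&n), sizeof(n));
  out.write(reinterpret_cast<const char *>(coords.data()), n * sizeof(double));
  out.write(reinterpret_cast<const char *>(velocities.data()), n * sizeof(double));
  return static_cast<bool>(out);
}

static bool read_checkpoint(const char *path, std::size_t &step,
                            std::vector<double> &coords,
                            std::vector<double> &velocities) {
  std::ifstream in(path, std::ios::binary);
  if (!in) return false;                 // no checkpoint: start from scratch
  std::size_t n = 0;
  in.read(reinterpret_cast<char *>(&step), sizeof(step));
  in.read(reinterpret_cast<char *>(&n), sizeof(n));
  coords.resize(n);
  velocities.resize(n);
  in.read(reinterpret_cast<char *>(coords.data()), n * sizeof(double));
  in.read(reinterpret_cast<char *>(velocities.data()), n * sizeof(double));
  return static_cast<bool>(in);
}

The simulation would call write_checkpoint at selected iteration boundaries and attempt read_checkpoint once at startup to decide whether to resume or to initialize from scratch.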
Application-level checkpointing is used in many high-performance computing applications, especially in simulation codes. Such checkpoints are part of the applications' base design and are implemented by the programmer. Additionally, systems like SRS [] provide toolboxes that allow users to implement application-level checkpoints in their codes on top of a simple and small API.

Trade-offs
Both approaches have distinct advantages and disadvantages. The key differences are summarized in Table . While system-level checkpoints provide full transparency to the user and require no special mechanism or consideration inside an application, this transparency is missing in application-level checkpointers. On the other hand, this transparency comes at the price of high implementation complexity for the checkpointing software: not only must it be able to checkpoint the complete system state of an arbitrary process with arbitrary resources, but it also has to do so at any time, independent of the state the process is in.
In contrast to system-level checkpointers, application-level approaches can exploit application-specific information to optimize the checkpointing process. They are able to control the timing of the checkpoints and they can limit the data that is written to the checkpoint, which reduces the data that has to be saved and hence the size of the checkpoints.
Combining the advantages of both approaches by providing a transparent solution with the benefits of an application-level approach is still a topic of basic research. The main idea is, for each checkpoint inserted
into the application (either by hand or by a compiler), to identify which variables need to be included in the checkpoint at that location, i.e., those variables that are in scope and that are actually used after the checkpoint location (and cannot be easily recomputed) [, ]. While this can eliminate the need to save some temporary arrays, current compiler analysis approaches are not powerful enough to achieve the same efficiency as manual approaches.

[Figure: the process address space (text with added checkpoint code, data, heap, stack), from which the application-level checkpoint is written]
Checkpointing. Fig.  Application-level checkpointing

Checkpointing. Table  Key differences between system- and application-level checkpointing

System-level checkpointing               Application-level checkpointing
Transparent                              Integrated into the application
Implementation complexity high           Implementation complexity medium or low
System specific                          Portable
Checkpoints taken at arbitrary times     Checkpoints taken at predefined locations
Full memory dump                         Only save what is needed
Large checkpoint files                   Checkpoint files only as large as needed

Parallel Checkpointing
Checkpointing is of special importance in parallel systems. The larger number of components used for a single computation naturally decreases the mean time between failures, making system faults more likely. Already today's largest systems consist of over , processing cores, and systems with , to , cores are common in High-Performance Computing []; future architectures will have even more, as plans for machines with over a million cores have already been announced []. This scaling trend requires effective fault tolerance solutions for such parallel platforms and their applications.
However, checkpointing a parallel application cannot be implemented by simply replicating a sequential checkpointing mechanism to all tasks of a parallel application, since the individual checkpoints would not be coordinated and it would therefore not be possible to capture a complete snapshot of the application's state from which the computation can be restarted.
The solution of this problem depends on the type of system used. In the following, the entry discusses approaches for shared memory and message passing applications, the two dominant programming models for HPC (High-Performance Computing).

Checkpointing in Shared Memory Applications
Checkpointing for shared memory systems programmed using threads can be implemented very similarly to a checkpointer for a sequential system. Since all state is shared, it is sufficient for a system-level checkpointer to suspend all threads before a checkpoint and then checkpoint the complete state of the process, including the separate call stacks and register files of all threads. No further coordination is necessary. On restart, the system has to recreate all threads that the original system had, including all resources like locks or semaphores held by the application before the
checkpoint, before loading the actual checkpoint into the context of all threads.
Application-level checkpointers typically use their knowledge about the code to insert checkpoints at locations in which only one thread is active and then can checkpoint the complete state as in a sequential program. Otherwise, a coordination between the threads and individual local checkpoints is necessary to ensure checkpoints with consistent global state (which is shared and accessible by all threads) and matching per-thread call stacks (which are only accessible by the individual threads). An example of such a coordination protocol is given by Bronevetsky et al. [], which ensures that all active threads are stopped at appropriate locations before taking a checkpoint.
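As an illustration of the kind of coordination involved, the generation-counting barrier below (a generic sketch, not any specific system's implementation) lets worker threads park at a safe point while one of them writes the shared state; all threads must agree, at the same safe point, on whether a checkpoint is due.

#include <condition_variable>
#include <mutex>

// Threads call maybe_checkpoint() at safe points; when a checkpoint is due,
// they wait until every thread has arrived, one of them saves the shared
// state, and then all resume.
class CheckpointBarrier {
public:
  explicit CheckpointBarrier(int nthreads) : nthreads_(nthreads) {}

  template <typename SaveFn>
  void maybe_checkpoint(bool requested, SaveFn save) {
    if (!requested) return;             // must evaluate identically on all threads
    std::unique_lock<std::mutex> lock(m_);
    int my_generation = generation_;
    if (++arrived_ == nthreads_) {      // last thread to arrive
      save();                           // shared state is quiescent here
      arrived_ = 0;
      ++generation_;
      cv_.notify_all();
    } else {
      cv_.wait(lock, [&] { return generation_ != my_generation; });
    }
  }

private:
  std::mutex m_;
  std::condition_variable cv_;
  int nthreads_;
  int arrived_ = 0;
  int generation_ = 0;
};

Each worker might call barrier.maybe_checkpoint(step % interval == 0, write_state) at iteration boundaries, where write_state serializes the shared arrays much as in the particle example above.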
Checkpointing in Message Passing Applications
The situation for message passing systems is more complicated since checkpoints from multiple processes, potentially running on different nodes, have to be taken. Therefore, a checkpointing mechanism needs to take into account which communication and interprocess dependencies exist before any checkpoint is actually committed to the system. In order to achieve this goal, an additional protocol is required that coordinates the checkpoints in one way or another.
Figure  illustrates the two conflict scenarios that can occur without proper coordination. All panels show the time lines of two processes P0 and P1, with time increasing from left to right and arrows indicating messages between the two processes. If local checkpoints are taken at arbitrary times and communication occurs between the individual instances of checkpoints, one of two scenarios can occur: a message is expected by a process after restart that will not be sent anymore because it was originally sent before the sender's checkpoint (Fig. a); or a message should not be sent anymore after a restart since it has already been consumed by the receiver before its own checkpoint (Fig. b). The first type of scenario is typically referred to as a late message, while the second one is referred to as an early message.
Two main approaches exist to achieve proper checkpoint coordination between parallel processes and to eliminate the conflicts caused by these two scenarios: Coordinated Checkpointing protocols enforce a coordination at checkpoint time and hence avoid the problem before it occurs, while Uncoordinated Checkpointing protocols allow checkpoints at arbitrary points, but then use additional mechanisms to eliminate inconsistencies if they should occur.

Coordinated Checkpointing
The first variant, coordinated checkpointing, controls the system in a way that all local checkpoints are taken at a point that avoids such inconsistencies. However, avoiding late messages in a general scheme, without any application knowledge or global state, is hard, since a receiver can never know whether another process has already sent a message or not when it decides to take a checkpoint. Luckily, tracking and dealing with late messages is straightforward: they can simply be buffered and stored at the receiver. During the restart, they are then replayed from the checkpoint instead of received from the network.
Early messages, on the other hand, are more difficult to handle. Not only do they require a global operation to avoid messages not being sent on restart, but they also require a deterministic execution after a restart to ensure that the same message content would have been sent in the early message, since the receiver has already consumed the message before the checkpoint and hence its contents are now part of the checkpointed state.
Due to these problems, most coordinated checkpointing schemes are designed in a way that prevents the occurrence of early messages. In the simplest case, as illustrated in Fig. a, checkpoints are taken at global synchronization points, e.g., at barrier operations. This ensures that all tasks are at the same point in their executions and that no early messages can occur in the system. However, this solution is limited to codes that have natural points for global synchronization.
In the more general case, which is applicable to any irregular application, the system aims at forming a globally consistent cut through the program that is not traversed by an early message (Fig. b). An algorithm for this has been introduced by Chandy and Lamport []. In this approach, the system forces a local checkpoint as soon as it receives a message from a process that has already started a checkpoint, but before the message is consumed. This prevents any process from receiving a message sent from a future checkpointing
epoch, since such messages would have triggered the automatic checkpoint.

[Figure: message exchanges between processes P0 and P1, with checkpoints (CP) shown on the left and restarts (RE) on the right: (a) late message conflict, (b) early message conflict]
Checkpointing. Fig.  Message conflicts caused by uncoordinated checkpoints (checkpoint shown left, restart shown right)

[Figure: parallel processes P0 to P3 taking checkpoints (a) at global barriers and (b) along global recovery lines]
Checkpointing. Fig.  Consistency lines in coordinated checkpointing systems

The main restriction of this approach is that it dictates when local checkpoints have to be taken. This limits possible optimizations of the checkpoint location and reduces the flexibility when such a mechanism is intended to be used in an application-level scheme. Some research projects [] have therefore proposed alternate solutions that allow early messages and add the necessary information to remove their effects to the checkpoint.
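The Chandy-Lamport-style rule can be sketched in a few lines by piggybacking an epoch number on every message. The Message and Process structures below are invented for illustration and stand in for whatever communication layer a real checkpointer would instrument.

#include <cstdint>

// Hypothetical per-process state and message header, used only for this sketch.
struct Message {
  std::uint64_t epoch;   // checkpointing epoch of the sender at send time
  // ... payload ...
};

struct Process {
  std::uint64_t epoch = 0;              // number of completed local checkpoints

  void take_local_checkpoint() {
    // write local state to stable storage (omitted), then advance the epoch
    ++epoch;
  }

  void on_send(Message &m) {
    m.epoch = epoch;                    // tag outgoing messages with our epoch
  }

  void on_receive(const Message &m) {
    // The sender has already started a later epoch: checkpoint *before*
    // consuming the message, so no message crosses the recovery line
    // from a future epoch.
    if (m.epoch > epoch) take_local_checkpoint();
    // A message with m.epoch < epoch is a late message; it would be logged
    // with the local checkpoint so it can be replayed after a restart.
    deliver(m);
  }

  void deliver(const Message &) { /* hand the message to the application */ }
};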
Uncoordinated Checkpointing
A second approach to parallel checkpointing is the use of an uncoordinated checkpointing system. In such systems, local checkpoints are taken at arbitrary times, but then the system deploys additional mechanisms to correct for inconsistent messages. The technique most used for this purpose is message logging. In this approach, all messages in an epoch are logged (together with the results of all nondeterministic operations and library calls) and stored together with the checkpoint. During restart, this complete message log is then used to recreate the complete global state of the application by replaying all messages and nondeterministic events that occurred between taking the individual local checkpoints.
In contrast to coordinated checkpointing, this approach requires less complexity during the initial application run, since both the checkpoints and the message logging are purely local operations; no global coordination has to be executed. On the other hand, the restart of an application is significantly more complex, since the shared state has to be recreated at that time. An
additional drawback is that, for some applications, the message logs can be of significant size, putting further pressure on the storage system.

Checkpoint Storage Considerations
Any checkpointing solution, independent of the design decisions discussed above, requires that each generated checkpoint be written to a storage location outside the current process, so that the computation can be restarted once the original process terminates. In the case of checkpointing for fault tolerance, this storage must also be able to survive the crash of the application or the underlying system.

Checkpoint Storage Locations
Therefore, the most common storage scenario is that checkpoints are written to disk, preferably on a remote server, since this allows the user to retrieve the checkpoint even if the initial system the application had been executed on becomes inaccessible or suffers a complete data loss. In the case of a parallel checkpointer, the storage location has to receive checkpoints from all processes in the target application. While this is typically not a problem for small systems, it can lead to significant bottlenecks on systems with large numbers of nodes. In such cases, checkpoints are typically written to a parallel file system like Lustre, GPFS, or PVFS, since those are designed to handle concurrent writes from a large number of processes.

Diskless Checkpointing
However, even with the use of a parallel file system, the I/O required for frequent checkpoints can be large and can, for growing numbers of nodes, start dominating the execution cost. Alternatively, checkpoints can also be stored in the memory of remote nodes within the cluster itself, which is referred to as Diskless Checkpointing and has, e.g., been demonstrated by Silva et al. [], Plank et al. [], and Zheng et al. [].
Such systems often distribute the state of a single node onto multiple remote nodes using error-correcting codes, such that the failure of one of the machines containing the checkpoint does not lead to the loss of the checkpoint. Diskless checkpointing eliminates the need for any I/O to a central storage facility outside the cluster and therefore reduces the time needed to store a checkpoint. However, during the recovery, a more complex protocol is required to identify the remote storage locations of the last checkpoint and to reassemble it for a proper restart. Further, the application remains vulnerable to a failure of the entire cluster, e.g., caused by power loss. Some checkpoint solutions, like SCR [], therefore provide mechanisms to combine in-memory and on-disk checkpoints by adaptively choosing the appropriate location based on the application's fault tolerance needs.
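A toy version of the encoding idea is shown below: if each of n nodes holds a checkpoint buffer of equal size, a parity buffer XOR-ed over all of them lets any single lost buffer be reconstructed from the survivors. Real diskless checkpointers distribute this work across the machine and often use stronger erasure codes.

#include <cstddef>
#include <cstdint>
#include <vector>

// XOR parity over equally sized checkpoint buffers (illustrative only).
std::vector<std::uint8_t> parity_of(const std::vector<std::vector<std::uint8_t>> &buffers) {
  std::vector<std::uint8_t> parity(buffers.at(0).size(), 0);
  for (const auto &buf : buffers)
    for (std::size_t i = 0; i < parity.size(); ++i)
      parity[i] ^= buf[i];
  return parity;
}

// Rebuild the buffer of one failed node from the parity and the survivors.
std::vector<std::uint8_t> reconstruct(const std::vector<std::uint8_t> &parity,
                                      const std::vector<std::vector<std::uint8_t>> &survivors) {
  std::vector<std::uint8_t> lost = parity;
  for (const auto &buf : survivors)
    for (std::size_t i = 0; i < lost.size(); ++i)
      lost[i] ^= buf[i];
  return lost;
}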
Incremental Checkpointing
A different aspect of optimizing storage concerns techniques to reduce the size of the checkpoints. Of particular interest for several research projects is thereby incremental checkpointing [, ]. In this approach, each checkpoint only stores the difference between itself and the previous checkpoint. In scenarios with frequent checkpoints or slowly evolving applications, this has the potential to reduce the checkpoint storage requirements, and with that also the time it takes to store the checkpoint, significantly. On the downside, though, this approach requires the system to keep all previous checkpoints around, since these may be needed to recreate the most recent state.
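The block-differencing flavor of this idea can be sketched as follows: the checkpointer remembers a hash per fixed-size block of the previous checkpoint and writes only the blocks whose hashes changed. The block size and hash function here are arbitrary choices for illustration; a real system might instead track dirty pages through the memory-protection hardware or use a cryptographic digest.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// A cheap FNV-1a hash stands in for whatever digest a real system would use.
static std::uint64_t block_hash(const std::uint8_t *data, std::size_t len) {
  std::uint64_t h = 1469598103934665603ULL;
  for (std::size_t i = 0; i < len; ++i) { h ^= data[i]; h *= 1099511628211ULL; }
  return h;
}

// Returns the indices of blocks that differ from the previous checkpoint and
// updates the stored hashes; only these blocks would be written out.
std::vector<std::size_t> changed_blocks(const std::vector<std::uint8_t> &state,
                                        std::vector<std::uint64_t> &prev_hashes,
                                        std::size_t block_size = 4096) {
  std::vector<std::size_t> dirty;
  std::size_t nblocks = (state.size() + block_size - 1) / block_size;
  prev_hashes.resize(nblocks, 0);       // first call: everything appears dirty
  for (std::size_t b = 0; b < nblocks; ++b) {
    std::size_t off = b * block_size;
    std::size_t len = std::min(block_size, state.size() - off);
    std::uint64_t h = block_hash(state.data() + off, len);
    if (h != prev_hashes[b]) { dirty.push_back(b); prev_hashes[b] = h; }
  }
  return dirty;
}

On restart, the chain of increments back to the last full checkpoint must be applied in order, which is exactly the drawback noted above.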
Alternate Usage Scenarios
The usage scenario commonly associated with checkpointing is to provide transparent fault tolerance to the application that is being checkpointed. However, the idea of checkpointing, using the same techniques, can also be applied in different scenarios. Some of them are described in the following.

Time Sharing of Large-Scale Resources
In order to allow fair scheduling for all users, large-scale compute centers often impose a limit on the total execution time of a single job. Large-scale applications, in particular simulation codes, however, often require significantly longer overall execution times than those dictated by these artificial limits.
In order to execute longer running jobs, such applications can take a checkpoint at the end of their time slot and then store this checkpoint on stable storage. When the application is granted access to a partition again, it can then read this checkpoint file and continue its execution.
This way, several partition allocations, potentially even from different machines, can be chained together.
Such checkpointing is typically implemented by the individual applications using an application-level approach. This reduces the amount of data that has to be stored and also allows for more flexibility during restart.
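A common application-level pattern for chaining time-limited allocations is sketched below: at startup the code tries to resume from a checkpoint file, and it writes a new checkpoint shortly before the allocation's walltime limit expires. The file name, time budget, and state layout are illustrative only, and a real code would reuse its existing checkpoint routines.

#include <chrono>
#include <fstream>
#include <string>

struct SimulationState {    // illustrative state; real codes hold field data, etc.
  long step = 0;
};

static bool load(SimulationState &s, const std::string &file) {
  std::ifstream in(file, std::ios::binary);
  return static_cast<bool>(in.read(reinterpret_cast<char *>(&s.step), sizeof(s.step)));
}

static void save(const SimulationState &s, const std::string &file) {
  std::ofstream out(file, std::ios::binary);
  out.write(reinterpret_cast<const char *>(&s.step), sizeof(s.step));
}

int main() {
  using clock = std::chrono::steady_clock;
  const auto budget = std::chrono::minutes(55);   // assumed 1-h allocation
  const auto start = clock::now();
  const std::string ckpt = "job.ckpt";            // placeholder file name

  SimulationState s;
  load(s, ckpt);                                  // resume if a checkpoint exists

  while (s.step < 1000000) {
    ++s.step;                                     // one simulation step (stub)
    if (clock::now() - start > budget) {          // out of time in this slot
      save(s, ckpt);                              // checkpoint and exit cleanly;
      return 0;                                   // the next allocation resumes here
    }
  }
  save(s, ckpt);
  return 0;
}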
Migration
Similar to the above is the idea of process migration: a process is stored, removed from the system, and then either immediately or later restored, but in this case on a different node. Figure  illustrates this further. The ability to migrate checkpoints is straightforward as long as the source and target architectures are identical, i.e., the checkpoint can be transferred and restarted without conversion.
In heterogeneous environments, however, this can be more complex. In this case, the checkpoint either needs to be stored in an architecture-independent format or needs to be converted from one architecture to the next. This has many implications, including the problem that heap addresses cannot be maintained anymore across checkpoint/restart sequences. Consequently, this option typically requires an application-level checkpointing approach or imposes limits on the applications, e.g., by only allowing a subset of a programming language to be used.
Migration systems that use some kind of checkpointing exist at various granularities. For example, the Condor scheduling system [, ] applies it to complete jobs to achieve a better utilization of compute clusters for high-throughput computing, while systems like Charm++ [, ] have the ability to apply it to fine-grain parallelism and use it to checkpoint individual threads or active objects in a large-scale system to achieve load balancing.

[Figure: migration managers on Node A and Node B move a process via its checkpoint on disk: (a) stop process and checkpoint, (b) invoke new remote process, (c) restart process from checkpoint]
Checkpointing. Fig.  Migrating a process using checkpointing

Virtual Machine Checkpointing
Virtual machines that allow the execution or emulation of an entire system on top of a (potentially different) host system have become commonplace. These systems typically allow the state of the virtual machine, which is often contained in a single file within the host system, to be saved such that the virtual machine can be interrupted and later continued. This functionality represents a system-level checkpointer, which is completely transparent. However, the target of the checkpointer is no longer a single process or computation, but rather the complete virtualized operating system image.

Exploration of Alternate Executions
Checkpointing systems can also be used to explore alternate execution sequences efficiently if those share a common prefix. Such scenarios can be found, e.g., in parameter studies of simulations (in which the common prefix is a warmup or initialization phase that should not be included in the actual parameter study) or in exhaustive search algorithms that require backtracking to intermediate execution points.
Figure  illustrates this approach further: the top graphic (Fig. a) shows the conceptual execution with a common prefix and three alternate executions after the prefix. Without checkpointing, the complete run requires the execution of the application three times (Fig. b), repeating the common prefix. Using checkpointing it is possible, though, to execute the prefix


[Figure: (a) multiple independent executions with a common prefix, (b) sequential execution of all alternatives, (c) checkpoint the common prefix and restart from the checkpoint]
Checkpointing. Fig.  Investigating alternate executions

only once, checkpoint the state of the process, and then continue with the first execution. Once this is complete, the checkpoint can be restarted as many times as necessary to complete the remaining execution alternatives (Fig. c).
In contrast to traditional checkpointing, where checkpoints are written proactively and read only once, in this scenario the application only writes a very limited set of checkpoints, but then reuses them several times.
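On a POSIX system, one lightweight way to realize the pattern of Fig. c for a parameter study is to compute the common prefix once and then fork one child per alternative, so that each child inherits the prefix state as a copy-on-write snapshot. This is an illustrative stand-in for writing the prefix checkpoint to disk and restarting it several times; the two hook functions are placeholders.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

// Hypothetical hooks for the expensive common prefix and the per-alternative run.
static void common_prefix()        { /* warmup / initialization phase */ }
static void run_alternative(int k) { std::printf("alternative %d done\n", k); }

int main() {
  common_prefix();                       // executed exactly once

  const int alternatives = 3;
  for (int k = 0; k < alternatives; ++k) {
    pid_t pid = fork();
    if (pid == 0) {                      // child: starts from the prefix state
      run_alternative(k);
      _exit(0);
    }
  }
  for (int k = 0; k < alternatives; ++k) wait(nullptr);   // collect the children
  return 0;
}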
Related Entries
Fault Tolerance
I/O

Bibliographic Notes and Further Reading
Elnozahy et al. provide an in-depth overview of checkpoint/restart implementations []. Besides that, checkpointing solutions have been implemented on several platforms, typically in the form of system-level checkpointing, in some cases, like on IBM's BlueGene line, even as part of the standard operating system stack. Some platform-independent solutions are also commercially available, one example being Librato's Availability Services (AvS) [] (a user-level implementation). Additionally, several academic projects offer system-level checkpoint/restart solutions, e.g., LibCkpt [] (user-level), BLCR [] (kernel-level), CoCheck [] (user-level for MPI and PVM applications), or Condor [, ] (user-level, integrated with a cluster scheduling system). Message logging-based approaches have received special attention in the past few years due to their scaling promises. Examples for the latter are the extensions to MVAPICH- by Bouteiller et al. [] and the Optimistic Message Logging approach by Wang et al. [].

Bibliography
. Vadhiyar S, Dongarra J () SRS – a framework for developing malleable and migratable parallel software. Parallel Process Lett ():–
. Beck M, Plank JS, Kingsley G, Kingsley G () Compiler-assisted checkpointing. In: Technical report CS--, Department of Computer Science, University of Tennessee, Knoxville, December 
. Chung-Chi Jim Li, Stewart EM, Fuchs WK () Compiler-assisted full checkpointing. Pract Exper ():–
. University of Mannheim, University of Tennessee, and NERSC/LBNL. TOP Supercomputing Sites. http://www.top.org/
. Lawrence Livermore National Laboratory. NNSA awards IBM contract to build next generation supercomputer, press release. https://publicaffairs.llnl.gov/news/newsreleases//NR---.html. Accessed Feb 
. Bronevetsky G, Pingali K, Stodghill P () Experimental evaluation of application-level checkpointing for OpenMP programs. In: International conference on supercomputing (ICS), Queensland, June 
. Chandy M, Lamport L () Distributed snapshots: determining global states of distributed systems. ACM Transact Comput Syst ():–
. Schulz M, Bronevetsky G, Fernandes R, Marques D, Pingali K, Stodghill P () Implementation and evaluation of a scalable application-level checkpoint-recovery scheme for MPI programs. In: Proceedings of IEEE/ACM supercomputing ', Washington, DC, November 
. Silva LM, Silva JG () An experimental study about diskless checkpointing. EUROMICRO Conf :
. Plank JS, Li K, Puening MA () Diskless checkpointing. IEEE Trans Parallel Distrib Syst ():–
. Zheng G, Shi L, Kale LV () FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In:  IEEE international conference on cluster computing,
pp –, San Diego, September 
. Moody A, Bronevetsky G, Mohror K, de Supinski BR () Design, modeling, and evaluation of a scalable multi-level checkpointing system. In: Proceedings of IEEE/ACM supercomputing ', New Orleans, LA, 
. Agarwal S, Garg R, Gupta MS, Moreira JE () Adaptive incremental checkpointing for massively parallel systems. In: ICS ': proceedings of the th annual international conference on supercomputing. ACM, New York, pp –
. Sancho JC, Petrini F, Johnson G, Fernández J, Frachtenberg E () On the feasibility of incremental checkpointing for scientific computing. Parallel Distrib Process Symp Int :b
. Litzkow JBM, Tannenbaum T, Livny M () Checkpoint and migration of UNIX processes in the Condor distributed processing system. In: Technical report , University of Wisconsin, Madison, 
. Condor. http://www.cs.wisc.edu/condor/manual
. CHARM research group. http://charm.cs.uiuc.edu/
. Kale LV, Krishnan S () CHARM++: a portable concurrent object oriented system based on C++. Parallel Process Lett ():–
. Elnozahy M, Alvisi L, Wang YM, Johnson DB () A survey of rollback-recovery protocols in message passing systems. In: Technical report CMU-CS--, School of Computer Science, Carnegie Mellon University, Pittsburgh, October 
. Librato. Availability Services (AvS). http://www.librato.com/products/availability.services
. Plank JS, Beck M, Kingsley G, Li K () Libckpt: transparent checkpointing under UNIX. In: Technical report UT-CS--, Department of Computer Science, University of Tennessee; Princeton University
. Duell J. The design and implementation of Berkeley Lab's Linux checkpoint/restart. http://www.nersc.gov/research/FTG/checkpoint/reports.html
. Stellner G () CoCheck: checkpointing and process migration for MPI. In: Proceedings of the th international parallel processing symposium (IPPS '), Honolulu, 
. Bouteiller A, Cappello F, Herault T, Krawezik G, Lemarinier P, Magniette F () MPICH-V: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging. In: Proceedings of IEEE/ACM supercomputing ', Phoenix, November 
. Wang YM, Fuchs WK () Optimistic message logging for independent checkpointing in message-passing systems. In: Proceedings of the th symposium on reliable distributed systems, Houston, October , pp –

Checkpoint-Recovery
Checkpointing

CHiP Architecture
Blue CHiP

CHiP Computer
Blue CHiP

Cholesky Factorization
Dense Linear System Solvers
Sparse Direct Methods

Cilk
Charles E. Leiserson
Massachusetts Institute of Technology, Cambridge, MA, USA

Synonyms
Cilk-; Cilk-; Cilk++; Cilk Plus

Definition
Cilk (pronounced "silk") is a linguistic and runtime technology for algorithmic multithreaded programming originally developed at MIT. The philosophy behind Cilk is that a programmer should concentrate on structuring her or his program to expose parallelism and exploit locality, leaving Cilk's runtime system with the responsibility of scheduling the computation to run efficiently on a given platform. The Cilk runtime system takes care of details like load balancing, synchronization, and communication protocols. Cilk is algorithmic in that the runtime system guarantees efficient and predictable performance. Important milestones in Cilk technology include the original Cilk-, which provided a provably efficient work-stealing runtime support but
little linguistic support; the later Cilk-, which provided simple linguistic extensions for multithreading to ANSI C; the commercial Cilk++, which extended the Cilk model to C++ and introduced "reducer hyperobjects" as an efficient means for resolving races on nonlocal variables; and Intel Cilk Plus, which provided transparent interoperability with legacy C/C++ binary executables.

Discussion
Introduction
Cilk technology has developed and evolved over more than  years since its origin at MIT. Key releases of Cilk include Cilk- [, , ], Cilk-NOW [, ], Cilk- [, , , ], JCilk [, , ], Cilk++ [, , ], Intel Cilk Plus [], and Cilk-M []. Section "A Brief History of Cilk Technology" overviews the history of Cilk technology. Some of the Cilks were more runtime systems than full-blown parallel languages, but it is the simplicity of the more linguistically oriented Cilks – Cilk-, JCilk, Cilk++, and Cilk Plus – which makes the technology compelling. This article will focus on Cilk++, whose linguistics are both full featured and simple.
Cilk++ is a faithful linguistic extension of the serial C++ programming language [], which means that parallel code retains its serial semantics when run on one processor. The Cilk++ extensions to C++ consist of just three keywords, which can be understood from an example. Figure  shows a Cilk++ program adapted from http://en.wikibooks.org/wiki/Algorithm_implementation/Sorting/Quicksort, which implements the quicksort algorithm [, Chap. ]. Observe that the program would be an ordinary C++ program if the keywords cilk_spawn and cilk_sync were elided and cilk_for replaced by for. The program so modified is called the serialization of the Cilk++ program. (The term serial elision was used in earlier Cilks, because all the keywords could simply be elided.) One of the things that makes Cilk simple is that the serialization of a parallel code always provides a legal semantics for the parallel code that can execute on a single processor.
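One simple way to obtain the serialization in practice is to define the three keywords away with the preprocessor, so that the source compiles with an ordinary C++ compiler; later Cilk distributions ship a stub header of essentially this form (for example, Intel Cilk Plus provides cilk_stub.h), though the three macros below are only a sketch of the idea.

#define cilk_spawn            /* spawn elided: the call becomes an ordinary call */
#define cilk_sync             /* sync elided: the statement becomes empty        */
#define cilk_for   for        /* the parallel loop becomes a serial loop         */

With these definitions included ahead of the program in Fig. , the quicksort code compiles and runs as plain serial C++, which is exactly the serialization described above.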
Parallel work is created when the keyword cilk_spawn precedes the invocation of a function. The semantics of spawning differ from a C++ function (or method) call only in that the parent can continue to execute in parallel with the child, instead of waiting for the child to complete as is done in C++. The scheduler in the Cilk++ runtime system takes the responsibility of scheduling the spawned functions on the individual processor cores of the multicore computer.
A function cannot safely use the values returned by its children until it executes a cilk_sync statement. The cilk_sync statement is a local "barrier," not a global one as, for example, is used in message-passing programming [, ]. In the quicksort example, a cilk_sync statement occurs on line  before the function returns to avoid the anomaly that would occur if the preceding calls to qsort were scheduled to run in parallel and did not complete before the return, thus leaving the vector to be sorted in an intermediate and inconsistent state.
In addition to explicit synchronization provided by the cilk_sync statement, every Cilk function syncs implicitly before it returns, thus ensuring that all of its children terminate before it does. Thus, for this example, the cilk_sync before the return is technically unnecessary.
Cilk++ provides faithful extensions of other C++ language features. It provides full support for C++ exceptions. When exceptions are thrown in parallel, the one that would occur first in a serial execution is the one that executes the catch block. Loops can be parallelized by simply replacing the keyword for with the cilk_for keyword, which allows all iterations of the loop to operate in parallel. Within the main routine from Fig. , for example, the loop starting on line  fills the array in parallel with "random" numbers. In addition, Cilk++ includes a library for mutual-exclusion (mutex) locks. Locking tends to be used much less frequently than in other parallel environments, such as Pthreads [], because all protocols for control synchronization are handled by the Cilk++ runtime system. Cilk++ also provides a powerful "hyperobject" library, which allows races on nonlocal variables to be mitigated without lock contention or restructuring of code. Cilk++ provides tool support in the form of the Cilkscreen race detector, which guarantees to find race bugs in ostensibly deterministic code,
1 // Parallel quicksort
2 using namespace std;
3
4 #include <algorithm>
5 #include <iterator>
6 #include <functional>
7 C
8 template <typename T>
9 void qsort(T begin, T end) {
10 if (begin != end) {
11 T middle = partition(begin, end, bind2nd(less<typename
iterator_traits<T>::value_type>(),*begin));
12 cilk_spawn qsort(begin, middle);
13 qsort(max(begin + 1, middle), end);
14 cilk_sync;
15 }
16 }
17
18 // Simple test code:
19 #include <iostream>
20 #include <cmath>
21
22 int cilk_main() {
23 int n = 100;
24 double a[n];
25
26 cilk_for (int i = 0; i < n; ++i) {
27 a[i] = sin((double) i);
28 }
29
30 qsort(a, a + n);
31 copy(a, a + n, ostream_iterator<double>(cout, "\n"));
32
33 return 0;
34 }

Cilk. Fig.  Parallel quicksort implemented in Cilk++

and the Cilkview scalability analyzer, which extrapolates speedups based on the algorithmic complexity measures of "work" and "span."
The remainder of this article is organized as follows. Section "A Brief History of Cilk Technology" overviews the evolution of Cilk technology. Section "The Dag Model for Multithreading" provides a brief tutorial on the theory of parallelism. Section "Runtime System" describes the performance guarantees of Cilk++'s "work-stealing" scheduler, illustrates the Cilkview scalability analyzer, and overviews how the scheduler operates. Section "Race Detection" briefly describes the Cilkscreen race-detection tool, and section "Reducer Hyperobjects" explains Cilk++'s "hyperobject" technology. Finally, section "Conclusion" provides some concluding remarks.

A Brief History of Cilk Technology
This section overviews the development of the Cilk technology, starting from its origin at the MIT Laboratory for Computer Science under the direction of Professor Charles E. Leiserson. The four MIT Ph.D. theses [, , , ] contain more detailed descriptions of the foundation and early history of Cilk.

The Origins of Cilk
The first implementation of Cilk arose from three separate projects at MIT in . The first project was theoretical work [, ] on scheduling multithreaded applications. The second was StarTech [, , ], a parallel chess program built to run on the Thinking Machines Corporation's Connection Machine Model CM- Supercomputer [].
was PCM/Threaded-C [], a C-based package for scheduling continuation-passing-style threads on the CM-.
In April  the three projects were combined and christened Cilk. (The name Cilk is not an acronym, but an allusion to nice threads (silk) and to the C programming language (hence Cilk and not Silk). One of Leiserson's graduate students commented that Cilk is a language of the C ilk. When his graduate students held an acronym contest for Cilk, the winner was "Charles's Idiotic Linguistic Kludge.") The team began by implementing a new parallel chess program. Unlike StarTech, which intertwined the search code and the scheduler, the new chess program implemented its search completely on top of a general-purpose runtime system that incorporated a provably efficient work-stealing scheduler. At the end of June, the ⋆Socrates chess program was entered into the  ACM International Chess Championship, where, running on a -node CM-, it finished third. The "Cilk-" system [] itself was released in September . A notable branch of the Cilk- codebase was Cilk-NOW [, ], which provided an adaptively parallel and fault-tolerant network-of-workstations implementation.

Simple Linguistics
The Cilk- release in May  [] featured full type-checking, supported all of ANSI C in its C-language subset, and offered call-return semantics for writing multithreaded procedures. Thus, instead of having to "wire together" continuations, the programmer could simply insert the spawn keyword before a function call and execute a sync statement to ensure that all spawned subroutines had completed (as described with cilk_spawn and cilk_sync in section "Introduction").
The runtime system was made more portable by replacing the general continuation-passing mechanism with a continuation mechanism based on Duff's device []. The new continuation mechanism greatly simplified the runtime system, which allowed the base release

nonetheless useful consistency model, and its relaxed semantics allows for an efficient, low-overhead software implementation.

Optimization and Enhancement
With the Cilk- release in June , the authors of Cilk changed their primary development platform from the distributed-memory CM- to the shared-memory Sun Microsystems SPARC SMP. The compiler and runtime system were completely reimplemented, eliminating continuation-passing as the basis of the scheduler and instead embedding scheduling decisions directly into the compiled code. Instead of stealing children, as in the earlier Cilk releases, Cilk- adopted the "lazy-task-creation" strategy [] of stealing parent continuations. The overhead to spawn a parallel thread in Cilk- was typically less than three times the cost of an ordinary C procedure call, and so Cilk- programs "scaled down" to run on one processor with nearly the efficiency of analogous C programs.
Cilk- provided two new language features to support speculative parallelism, so that applications such as computer chess [] could be more easily programmed. The keyword inlet specified that an internal function could be invoked by a returning child to execute code on its parent's frame. The keyword abort abruptly terminated subcomputations that had already been spawned. These mechanisms allowed a programmer to speculate that a subcomputation would be worthwhile to execute in parallel, but abort it efficiently if the subcomputation turned out to be superfluous.
For Cilk- [], which was released in March , the runtime system was rewritten to be more flexible and portable. Cilk-. could use operating system threads as well as processes to implement the individual Cilk "workers" that schedule Cilk threads. The Cilk-. release included a debugging tool called the Nondeterminator [, , , ], which allowed Cilk programmers to localize data-race bugs in their code. With the Cilk-. release, Cilk graduated from research
to support several architectures other than the CM-. prototype to real-world tool intended to be used by pro-
Cilk-, which was released in October , featured grammers who are not necessarily parallel-processing
an implementation of “dag-consistent” distributed shared experts. Many improvements were made to the runtime
memory [, , , ]. With this addition of shared system to make it faster, more portable, and more main-
memory, Cilk could be applied to solve a much wider tainable, but the basic language has remained largely
class of applications. Dag-consistency is a weak but the same.
Cilk C 

A Cilk for Java Cilk++ improved upon the original MIT Cilk in
Up to this point, the Cilk technology was grounded in several ways. The linguistic distinction between Cilk
the C programming language. JCilk [, ], however, functions and C/C++ functions was lessened, allow-
marked an excursion into Java [], largely to explore ing C++ “call-backs” to Cilk code, as long as the C++
how exception mechanisms should interoperate with code was compiled with the Cilk++ compiler. (This C
the Cilk spawn and sync primitives. Specifically, JCilk distinction was later removed altogether by Intel Cilk
defined semantics for exceptions that are consistent Plus.) The spawn and sync keywords were renamed
with the existing semantics of Java’s try and catch cilk_spawn and cilk_sync to avoid naming con-
constructs, but which handle concurrency in spawned flicts. Loops were parallelized by simply replacing the
methods. JCilk extends Java’s exception semantics to keyword for with the keyword cilk_for keyword,
allow exceptions to be passed from a spawned method which allows all iterations of the loop to operate in
to its parent in a natural way that obviates the need parallel. Cilk++ provided full support for C++ excep-
for Cilk-’s inlet and abort constructs. This exten- tions. Cilk++ also introduced “reducer hyperobjects”
sion is “faithful” in that it obeys Java’s ordinary serial (see section “Reducer Hyperobjects”), which allow races
semantics when executed on a single processor. When on nonlocal variables to be mitigated without lock con-
executed in parallel, however, an exception thrown by tention or restructuring of code.
a JCilk computation signals its sibling computations to The Cilk++ toolkit included the Cilkscreen race-
abort, which yields a clean semantics in which only detection tool which guarantees to find race bugs
a single exception from the enclosing try block is in ostensibly deterministic code. It also included the
handled. Cilkview scalability analyzer [], a software tool
for profiling, estimating scalability, and benchmarking
The Multicore Era multithreaded Cilk++ applications.
During most of the evolution of Cilk technology, a mul- Cilk Arts was sold to Intel Corporation in July ,
tiprocessor with n processors typically cost more than which continued developing the technology. In Septem-
n times the cost of a single processor. Moreover, since ber , Intel released its ICC compiler with Intel Cilk
clock frequency was increasing at about % per year, Plus []. The product included Cilk support for C
software vendors preferred to wait for more perfor- and C++, and the runtime system provided transparent
mance than refactor their codebases to support parallel integration with legacy binary executables.
processing. Thus, parallel computing, with or without
Cilk, was very much a niche business. Around ,
Research at MIT
however, the trend of ever-increasing clock frequency
Research on Cilk technology has also continued at MIT
hit a brick wall. Although the density of integrated cir-
and includes the following contributions:
cuits continued to double about every  months, the
heat dissipated by transistor switching reached its phys- ● An adaptive scheduler for multiple Cilk jobs that
ical limit. Vendors of microprocessor chips responded uses parallelism feedback to minimize wasted pro-
by placing multiple processing cores on a single chip, cessor cycles while guaranteeing fairness among the
ushering in the era of multicore computing. jobs []
In September , responding to the multicore ● The design of a work-stealing runtime-system,
trend, MIT spun out the Cilk technology to Cilk Arts, called CWSTM, that supports transactional mem-
Inc., a venture-funded start-up founded by technical ory, where transactions can contain both nested
leaders Charles E. Leiserson and Matteo Frigo, together parallelism and nested transactions []
with Stephen Lewin-Berlin and Duncan C. McCallum. ● A library for performing parallel sparse matrix-
Although Cilk Arts licensed the historical Cilk code- vector and matrix-transpose-vector multiplication
base from MIT, it developed an entirely new codebase using a novel matrix layout called compressed sparse
for a C++ product aptly named Cilk++ [, ], which blocks []
was released in December  for the Windows Visual ● The Helper library, which supports “helper” locks
Studio and Linux/gcc compilers. in which a thread that cannot acquire a lock, rather
 C Cilk

than blocking, helps to complete the parallel work her application to exhibit sufficient parallelism. Before
in the critical region protected by the lock [] describing the Cilk++ runtime system, it is helpful to
● The Nabbit library for efficiently executing task understand something about the theory of parallelism.
graphs with arbitrary acyclic dependencies [] Many discussions of parallelism begin with Amdahl’s
● A fast breadth-first search algorithm and method Law [], originally proffered by Gene Amdahl in .
for analyzing nondeterministic Cilk programs that Amdahl made what amounts to the following observa-
incorporate reducer hyperobjects [] tion. Suppose that % of a computation can be paral-
● The research prototype Cilk-M, which uses mem- lelized and % cannot. Then, even if the % that is
ory mapping to solve the “cactus-stack problem,” parallel were run on an infinite number of processors,
an interoperability problem with legacy binary exe- the total time is cut at most in half, leaving a speedup
cutables [] (see also section “Runtime System”) of at most . In general, if a fraction p of a computa-
tion can be run in parallel and the rest must run serially,
Recognition Amdahl’s Law upper-bounds the speedup by /( − p).
Over the years, Cilk technology has influenced many Although Amdahl’s Law provides some insight into
other non-Cilk concurrency platforms, including Sun parallelism, it does not quantify parallelism, and thus
Microsystems’ Fortress [], University of Texas’s Hood it does not provide a good understanding of what a
[], Java’s Fork/Join Framework [], Microsoft’s Task concurrency platform such as Cilk++ should offer for
Parallel Library (TPL) [] and PPL [], Intel’s Thread- multicore application performance. Fortunately, there
ing Building Blocks (TBB) [], and IBM’s X []. is a simple theoretical model for parallel computing
Cilk technology has garnered many awards, includ- which provides a more general and precise quantifica-
ing the following: tion of parallelism that subsumes Amdahl’s Law. This
“dag model of multithreading” [, ] provides a gen-
● First Prize in the  International Conference
eral and precise quantification of parallelism based on
on Functional Programming’s Programming Contest
the theory developed by Graham []; Brent []; Eager,
for Cilk Pousse, where Cilk was cited as “the supe-
Zahorjan, and Lazowska []; and Blumofe and Leiser-
rior programming tool of choice for discriminating
son [, ]. Tutorials on the dag model can be found in
hackers” []
[, Ch. ] and [].
● First Prize in the  HPC Challenge Class  (Pro-
The dag model of multithreading views the exe-
ductivity) competition, where Cilk was cited for
cution of a multithreaded program as a set of vertices
“Best Overall Productivity” []
called strands – sequences of serially executed instruc-
● The  award by ACM SIGPLAN for Most Influ-
tions containing no parallel control – with graph edges
ential  PLDI Paper for []
indicating ordering dependencies between strands, as
● The  award by the ACM Symposium on Paral-
illustrated in Fig. . (The literature sometimes uses the
lelism in Algorithms and Architectures for Best Paper
term “Cilk thread” for “strand.”) A strand x precedes
for []
a strand y, denoted x -≺ y, if x must complete before y
In addition, the Cilk-based chess programs StarTech, can begin. If neither x -≺ y nor y -≺ x, the strands are in
⋆Socrates, and Cilkchess have won numerous prizes in parallel, denoted by x ∥ y. Figure , for example, has
international computer-chess competitions.  -≺ ,  -≺ , and  ∥ . A strand can be as small as a
single instruction, or it can represent a longer chain of
The Dag Model for Multithreading serially executed instructions. A maximal strand is one
The Cilk++ runtime system contains a provably efficient that cannot be included in a longer strand. A maximal
work-stealing scheduler [, ], which scales applica- strand can be diced into a series of smaller strands in
tion performance linearly with processor cores, as long any manner that is convenient.
as the application exhibits sufficient parallelism (and The dag model of multithreading can be interpreted
the processor architecture provides sufficient memory in the context of the Cilk++ programming model. Nor-
bandwidth). Thus, to obtain good performance, the mal serial execution of one strand after another cre-
programmer needs to know what it means for his or ates a serial edge from the first strand to the next.
Cilk C 

1 unit time to execute a strand, the work for the example


dag in Fig.  is .
A simple notation makes things more precise. Let
2
TP be the fastest possible execution time of the applica-
tion on P processors. Since the work corresponds to the C
3 execution time on  processor, it is denoted by T . One
reason that work is an important measure is because it
4
provides a lower bound on P-processor execution time:
6 13
Tp ≥ T /P . ()
7 9 14 16
This Work Law holds, because in this simple theoretical
model, each processor executes at most  instruction per
5 8 10 17 unit time, and hence P processors can execute at most
P instructions per unit time. Thus, with P processors,
11
doing all the work requires at least T /P time.
15
One can interpret the Work Law Eq. () in terms of
the speedup on P processors, which is just T /TP . The
12 speedup tells how much faster the application runs on P
processors than on  processor. Rewriting the Work Law
18 yields T /TP ≤ P, which is to say that the speedup on P
processors can be at most P. If the application obtains
Cilk. Fig.  A dag representation of a multithreaded speedup P (which is the best one can do in this sim-
execution. Each vertex is a strand. Edges represent ple theoretical model), the application exhibits linear
ordering dependencies between instructions speedup. If the application obtains speedup greater than
P (impossible in the model due to the Work Law, but
possible in practice due to caching and other processor
A cilk_spawn of a function creates two depen- effects), the application exhibits superlinear speedup.
dency edges emanating from the instruction immedi-
ately before the cilk_spawn: the spawn edge goes The Span Law
to the strand containing the first instruction of the The second measure is span, which is the maximum
spawned function, and the continuation edge goes to time to execute along any path of dependencies in the
the strand containing the first instruction after the dag. Assuming that it takes unit time to execute a strand,
spawned function. A cilk_sync creates a return edge the span of the dag from Fig.  is , which corresponds
from the strand containing the final instruction of each to the path  -≺  -≺  -≺  -≺  -≺  -≺  -≺  -≺ . This path is
spawned function to the strand containing the instruc- sometimes called the critical path of the dag, and span
tion immediately after the cilk_sync. A cilk_for is sometimes referred to in the literature as critical-path
can be viewed as parallel divide-and-conquer recur- length. Since the span is the theoretically fastest time the
sion using cilk_spawn and cilk_sync over the dag could be executed on a computer with an infinite
iteration space. number of processors (assuming no overheads for com-
The dag model admits two natural measures that munication, scheduling, etc.), it can be denoted by T∞ .
allow parallelism to be defined precisely and provide Like work, span also provides a bound on P-processor
important bounds on performance and speedup. execution time:
TP ≥ T∞ . ()
The Work Law This Span Law arises for the simple reason that a finite
The first measure is work, which is the total time spent number of processors cannot outperform an infinite
in all the strands. Assuming for simplicity that it takes number of processors, because the infinite-processor
 C Cilk

machine could just ignore all but P of its processors and achieves provably tight bounds. An application with
mimic a P-processor machine exactly. sufficient parallelism can rely on the Cilk++ runtime
system to dynamically and automatically exploit an
Parallelism arbitrary number of available processor cores near opti-
The parallelism is defined as the ratio of work to span, mally. Moreover, on a single core, typical Cilk++ pro-
or T /T∞ . Parallelism can be viewed as the average grams run with negligible overhead (often % or less)
amount of work along each step of the critical path. when compared with its C++ serialization.
Moreover, perfect linear speedup cannot be obtained for
any number of processors greater than the parallelism
Performance Bounds
T /T∞ . To see why, suppose that P > T /T∞ , in which
Specifically, for an application with T work and T∞
case the Span Law Eq. () implies that the speedup sat-
span running on a computer with P processors, the
isfies T /TP ≤ T /T∞ < P. Since the speedup is strictly
Cilk++ work-stealing scheduler achieves expected run-
less than P, it cannot be perfect linear speedup. Another
ning time:
way to see that the parallelism bounds the speedup is
to observe that, in the best case, the work is distributed TP ≤ T /P + O(T∞) . ()
evenly along the critical path, in which case the amount
of work at each step is the parallelism. But, if the paral- If the parallelism T /T∞ exceeds the number P of pro-
lelism is less than P, there isn’t enough work to keep P cessors by a sufficient margin, this bound (proved in
processors busy at every step. []), guarantees near-perfect linear speedup. To see
As an example, the parallelism of the dag in Fig.  is why, assume that T /T∞ ≫ P, or equivalently, that
/ = . Thus, executing it with more than two proces- T∞ ≪ T /P. Thus, in Inequality (), the T /P term dom-
sors would be wasteful of cycles and yield diminishing inates the O(T∞ ) term, and thus the running time is
returns, since the additional processors will be surely TP ≈ T /P, leading to a speedup of T /TP ≈ P.
starved for work. The Cilkview scalability analyzer allows a program-
As a practical matter, many problems admit con- mer to determine the work and span of an application.
siderable parallelism. For example, matrix multiplica- Figure  shows the output of this tool running the quick-
tion of  ×  matrices is highly parallel, with a sort program from Fig.  on  million numbers. The
parallelism in the millions. Many problems on large upper bound on speedup provided by the Work Law
irregular graphs, such as breadth-first search, generally corresponds to the line of slope , and the upper bound
exhibit parallelism on the order of thousands. Sparse- provided by the Span Law corresponds to the horizontal
matrix algorithms can often exhibit parallelism in the line at .. The performance analysis tool also pro-
hundreds. vides an estimated lower bound on speedup – the lower
curve in the figure – based on burdened parallelism,
Upper Bounds on Speedup which takes into account the estimated cost of schedul-
The Work and Span Laws engender two important ing. Although quicksort seems naturally parallel, one
upper bounds on speedup. The Work Law implies that can show that the expected parallelism for sorting n
the speedup on P processors can be at most P: numbers is only O(lg n). Practical sorts with more par-
allelism exist, however. See [, Chap. ] for more
T /TP ≤ P . ()
details.
The Span Law dictates that speedup cannot exceed In addition to guaranteeing performance bounds,
parallelism: the Cilk++ runtime system also provides bounds on
T /Tp ≤ T /T∞ . () stack space. Specifically, on P processors, a Cilk++ pro-
gram consumes at most P times the stack space of a
Runtime System single-processor execution. Specifically, let SP denote
Although optimal multiprocessor scheduling is known the stack space required by a P-processor execution of
to be NP-complete [], Cilk++’s runtime system a given Cilk++ application. Thus, S is the stack space
employs a “work-stealing” scheduler [, ] that required by the C++ serialization of a Cilk++ program.
Cilk C 

Cilk. Fig.  Parallelism profile of quicksort produced by the Cilkview scalability analyzer

The space guarantee of Cilk++’s work-stealing sched- on one processor, however, this Cilk++ code uses no
uler is more stack space than a serial C++ execution, that is, the
call depth is of whichever invocation of foo requires
SP ≤ S P .
the deepest stack. On two processors, it requires at most
For specific applications, tighter bounds on space can twice this space, and so on. This guarantee contrasts
be shown []. Work-stealing schedulers that provide with the lack of a guarantee by more naive schedulers.
better space bounds at the expense of higher commu- For example, some schedulers would schedule this code
nication are also known [, ]. fragment by creating a work queue of one billion tasks,
To illustrate why a space bound is important, con- one for each iteration of the subroutine foo, before exe-
sider the following simple code fragment: cuting even the first iteration, thus suffering exorbitant
memory use.
for (int i=0; i<1000*1000*1000; ++i) {
cilk_spawn foo(i);
}
Work Stealing
cilk_sync;
Cilk++’s work-stealing scheduler operates as follows.
This code conceptually creates one billion invoca- When the runtime system starts up, it allocates as many
tions of foo that operate logically in parallel. Executing operating-system threads, called workers, as there are
 C Cilk

processors (although the programmer can override this allocating function activation frames. When a function
default decision). Each worker’s stack operates like a is called, the stack pointer is advanced, and when the
work queue. When a subroutine is spawned, the sub- function returns, the original stack pointer is restored.
routine’s activation frame containing its local variables This style of execution is space efficient, because all
is pushed onto the bottom of the stack. When it returns, the children of a given function can use and reuse the
the frame is popped off the bottom. Thus, in the com- same region of the stack. The compact linear-stack rep-
mon case, Cilk++ operates just like C++ and imposes resentation is possible only because in a serial language,
little overhead. a function has at most one extant child function at
When a worker runs out of work, however, it any time.
becomes a thief and “steals” the top frame from another The notion of an invocation tree can be extended
victim worker’s stack. Thus, the stack is in fact a double- to include spawns, as well as calls, but unlike the serial
ended queue, with the worker operating on the bottom walk of an invocation tree, a parallel execution unfolds
and thieves stealing from the top. This strategy has the the invocation tree more haphazardly and in parallel.
great advantage that all communication and synchro- Since multiple children of a function may exist simul-
nization is incurred only when a worker runs out of taneously, a linear-stack data structure no longer suf-
work. If an application exhibits sufficient parallelism, fices for storing activation frames. Instead, the tree of
one can prove mathematically [, ] that stealing is extant activation frames forms a cactus stack [], as
infrequent, and thus the cost of communication and shown in Fig. . The implementation of cactus stacks
synchronization to effect a steal is negligible. is a well-understood problem for which low-overhead
The dynamic load-balancing capability provided implementations exist [, ].
by the Cilk++ runtime system adapts well in real- Although a work-stealing scheduler can guarantees
world multiprogrammed computing environments. If a bounds on both time and stack space [, ], imple-
worker becomes descheduled by the operating system mentations that meet these bounds – including Cilk-,
(for example, because another application starts to run), Cilk-, and Cilk++ – suffer from interoperability prob-
the work of that worker can be stolen away by other lems with legacy (and third-party) serial binary exe-
workers. Thus, Cilk++ programs tend to “play nicely” cutables that have been compiled to use a linear stack.
with other jobs on the system. (Although Fortress, Java Fork/Join Framework, TPL,
Cilk++’s runtime system also makes Cilk++ pro- and X employ work stealing, they do not suffer from
grams performance-composable. Suppose that a pro- the same interoperability problems, because they are
grammer develops a parallel library in Cilk++. That
library can be called not only from a serial program
or the serial portion of a parallel program, it can be
invoked multiple times in parallel and continue to A B C D E
exhibit good speedup. In contrast, some concurrency
A A A A A
platforms constrain library code to run on a given num-
ber of processors, and if multiple instances of the library A C C C
execute simultaneously, they end up thrashing as they B
B C E
compete for processor resources. D

a D E b
Stacks and Cactus Stacks
An execution of a serial Algol-like language, such as C Cilk. Fig.  A cactus stack. (a) The invocation tree, where
[] or C++ [], can be viewed as a “walk” of an invoca- function A invokes B and C, and C invokes D and E. (b) The
tion tree, which dynamically unfolds during execution view of the stack by each of the five functions. In a serial
and relates function instances by the “calls” relation: If execution, only one view is active at any given time,
function instance A calls function instance B, then A because only one function executes at a time. In a parallel
is a parent of the child B in the invocation tree. Such execution, however, if some of the invocations are spawns,
serial languages admit a simple array-based stack for then multiple views may be active simultaneously
Cilk C 

byte-code interpreted by a virtual-machine environ- As an example of a race bug, suppose that line  in
ment.) Transitioning from serial code (using a linear Fig.  is replaced with the following line:
stack) to parallel code (using a cactus stack) is prob- qsort(max(begin + 1, middle-1), end);
lematic, because the type of stack impacts the calling
conventions used to allocate activation frames and pass The resulting serial code is still correct, but the par- C
arguments. The property of allowing arbitrary calling allel code now contains a race bug, because the two sub-
between parallel and serial code – including especially problems overlap, which could cause an error during
legacy (and third-party) serial binaries – is called serial- execution.
parallel reciprocity, or SP-reciprocity for short. Race conditions have been studied extensively [,
SP-reciprocity is especially important if one wishes , , , , , , , , , , , ]. They are
to multicore-enable legacy object-oriented environ- pernicious and occur nondeterministically. A program
ments by parallelizing an object’s member functions. with a race bug may execute successfully millions of
For example, suppose that a function A allocates a new times during testing, only to raise its head after the
object x whose type has a member function foo(), application is shipped. Even after detecting a race bug,
which is parallelized. Now, suppose that A is linked with writing regression tests to ensure its continued absence
a legacy binary containing a function B, and A passes is difficult.
x to B, which proceeds to invoke x.foo(). Without The Cilkscreen race detector is based on provably
SP-reciprocity, this simple callback does not work. good algorithms [, , ] developed originally for
Cilk- and Cilk++ do not support SP-reciprocity. MIT Cilk. In a single serial execution on a test input for a
Cilk Plus supports SP-reciprocity by sacrificing the deterministic program, Cilkscreen guarantees to report
space bound Inequality (Eq. ) and using multiple lin- a race bug if the race bug is exposed: that is, if two differ-
ear stacks to implement the cactus stack. Although no ent schedulings of the parallel code would produce dif-
provable bound exists to date on the number of linear ferent results. Cilkscreen uses efficient data structures
stacks needed, empirical studies indicate that P linear to track the series-parallel relationships of the execut-
stacks suffice. ing application during a serial execution of the paral-
The research prototype Cilk-M [] provides both lel code. As the application executes, Cilkscreen uses
SP-reciprocity and good theoretical bounds on time and dynamic instrumentation [, ] to intercept every
space. Cilk-M uses thread-local memory mapping to load and store executed at user level. Metadata in the
implement cactus stacks. Benchmark results indicate Cilk++ binaries allows Cilkscreen to identify the paral-
that the performance of the prototype Cilk-M runtime lel control constructs in the executing application pre-
system is comparable to the Cilk .. system, and the cisely, track the series-parallel relationships of strands,
consumption of stack space is modest. and report races precisely. Additional metadata allows
the race to be localized in the application source code.

Race Detection Reducer Hyperobjects


The Cilk++ development environment includes a race Many serial programs use nonlocal variables, which
detector, called Cilkscreen, a powerful debugging tool are variables that are bound outside of the scope of the
that greatly simplifies the task of ensuring that a parallel function, method, or class in which they are used. If a
application is correct. A data race [] exists if logi- variable is bound outside of all local scopes, it is a global
cally parallel strands access the same shared location, variable. Nonlocal variables have long been considered
the two strands hold no locks in common, and at least a problematic programming practice [], but program-
one of the strands writes to the location. A data race is mers often find them convenient to use, because they
usually a bug, because the program may exhibit unex- can be accessed at the leaves of a computation without
pected, nondeterministic behavior depending on how the overhead and complexity of passing them as param-
the strands are scheduled. Serial code containing non- eters through all the internal nodes. Thus, nonlocal vari-
local variables is particularly prone to the introduction ables have persisted in serial programming. In the world
of data races when the code is parallelized. of parallel computing, nonlocal variables may inhibit
 C Cilk

1 bool has_property(Node *); 1 bool has_property(Node *);


2 std::list<Node *> output_list; 2 std::list<Node *> output_list;
3 //... 3 // ...
4 void walk(Node *x) 4 void walk(Node *x)
5 { 5 {
6 if (x) 6 if (x)
7 { 7 {
8 if (has_property(x)) 8 if (has_property(x))
9 { 9 {
10 output_list.push_back(x); 10 output_list.push_back(x);
11 } 11 }
12 walk(x->left); 12 cilk_spawn walk(x->left);
13 walk(x->right); 13 walk(x->right);
14 } 14 cilk_sync;
15 } 15 }
16 }
Cilk. Fig.  C++ code to create a list of all the nodes in a
binary tree that satisfy a given property Cilk. Fig.  A naive Cilk++ parallelization of the code in
Fig. . This code has a data race in line 

otherwise independent parts of a multithreaded pro- 1 bool has_property(Node *);


gram from operating in parallel, because they introduce 2 std::list<Node *> output_list;
3 mutex L;
races. This section describes Cilk++ reducer hyperob-
4 // ...
jects [], which can mitigate races on nonlocal vari- 5 void walk(Node *x)
ables without creating lock contention or requiring code 6 {
7 if (x)
restructuring.
8 {
As an example of how a nonlocal variable can intro- 9 if (has_property(x))
duce a data race, consider the problem of walking a 10 {
11 L.lock();
binary tree to make a list of those nodes that satisfy
12 output_list.push_back(x);
a given property. A C++ code to solve the problem is 13 L.unlock();
abstracted in Fig. . If the node x being visited is non- 14 }
15 cilk_spawnwalk(x->left);
null, the code checks whether x has the desired property
16 walk(x->right);
in line , and if so, it appends x to the list stored in 17 cilk_sync;
the global variable output_list in line . Then, it 18 }
recursively visits the left and right children of x in lines 19 }

 and . Cilk. Fig.  Cilk++ code that solves the race condition
Figure  illustrates a straightforward parallelization using a mutex
of this code in Cilk++. In line  of the figure, the
walk function is spawned recursively on the left child,
while the parent continues on to execute an ordinary line , and after the update, it is released in line .
recursive call of walk in line . As the recursion Although this code is now correct, the mutex may cre-
unfolds, the running program generates a tree of par- ate a bottleneck in the computation. If there are many
allel execution that follows the structure of the binary nodes that have the desired property, the contention on
tree. Unfortunately, this naive parallelization contains the mutex can destroy all the parallelism. For example,
a data race. Specifically, two parallel instantiations of on one set of test inputs for a real-world tree-walking
walk may attempt to update the shared global variable code that performs collision-detection of mechanical
output_list in parallel at line . assemblies, lock contention actually degraded perfor-
The traditional solution to fixing this kind of data mance on  processors so that it was worse than running
race is to associate a mutual-exclusion lock (mutex) on a single processor. In addition, the locking solution
L with output_list, as is shown in Fig. . Before has the problem that it jumbles up the order of list ele-
updating output_list, the mutex L is acquired in ments. That might be okay for some applications, but
Cilk C 

other programs may depend on the order produced by the programmer to restructure the logic of his or her
the serial execution. program.
An alternative to locking is to restructure the code to As an example, Figure  shows how the tree-walking
accumulate the output lists in each subcomputation and code from Fig.  can be parallelized using a reducer. Line
concatenate them when the computations return. If one  declares output_list to be a reducer hyperobject C
is careful, it is also possible to keep the order of elements for list appending. The reducer_list_append class
in the list the same as in the serial execution. For the implements a reduce function that concatenates two
simple tree-walking code, code restructuring may suf- lists, but the programmer of the tree-walking code need
fice, but for many larger codes, disrupting the original not be aware of how this class is implemented. All the
logic can be time-consuming and tedious undertaking, programmer does is identify the global variables as the
and it may require expert skill, making it impractical for appropriate type of reducer when they are declared. No
parallelizing large legacy codes. logic needs to be restructured, and if the programmer
Cilk++ provides a novel approach [] to avoiding fails to catch all the use instances, the compiler reports
data races in code with nonlocal variables. A Cilk++ a type error.
reducer hyperobject is a linguistic construct that allows This parallelization takes advantage of the fact that
many strands to coordinate in updating a shared vari- list appending is associative. That is, if a list L is
able or data structure independently by providing them appended to a list L and the result appended to L ,
different but coordinated views of the same object. it is the same as if list L were appended to the result
The state of a hyperobject as seen by a strand of an of appending L to L . As the Cilk++ runtime system
execution is called the strand’s “view” of the object load-balances this computation over the available pro-
at the time the strand is executing. A strand can cessors, it ensures that each branch of the recursive
access and change any of its view’s state independently, computation has access to a private view of the variable
without synchronizing with other strands. Through- output_list, eliminating races on this global vari-
out the execution of a strand, the strand’s view of the able without requiring locks. When the branches syn-
reducer is private, thereby providing isolation from chronize, the private views are reduced (combined) by
other strands. When two or more strands join, their concatenating the lists, and Cilk++ carefully maintains
different views are combined according to a system- the proper ordering so that the resulting list contains
or user-defined reduce() method. Thus, reducers the identical elements in the same order as in a serial
preserve the advantages of parallelism without forcing execution.

1 #include <reducer_list.h>
2 bool has_property(Node *);
3 cilk::hyperobject<cilk::reducer_list_append<Node*>>
output_list;

4 // ...
5 void walk(Node *x)
6 {
7 if (x)
8 {
9 if (has_property(x))
10 {
11 output_list().push_back(x);
12 }
13 cilk_spawnwalk(x->left);
14 walk(x->right);
15 cilk_sync;
16 }
17 }

Cilk. Fig.  A Cilk++ parallelization of the code in Fig. , which uses a reducer hyperobject to avoid data races
 C Cilk

Conclusion . Allen E, Chase D, Hallett J, Luchangco V, Maessen JW, Ryu S,


Multicore microprocessors are now commonplace, and Steele Jr. GL, Hochstadt ST () The Fortress language speci-
Moore’s Law is steadily increasing the pressure on soft- fication, version .. Sun Microsystems, Burlington
. Amdahl G () The validity of the single processor approach to
ware developers to multicore-enable their codebases.
achieving large-scale computing capabilities. In: Proceedings of
Cilk++ provides a simple but effective concurrency the AFIPS spring joint computer conference, Atlantic City. ACM,
platform for multicore programming which leverages New York, pp –
almost two decades of research on multithreaded pro- . Arnold K, Gosling J () The Java programming language.
gramming. The Cilk++ model builds upon the sound Addison-Wesley, Reading
. Bender MA, Fineman JT, Gilbert S, Leiserson CE () On-
theoretical framework of multithreaded dags, allow-
the-fly maintenance of series-parallel relationships in fork-join
ing parallelism to be quantified in terms of work and multithreaded programs. In: Proceedings of the sixteenth annual
span. The Cilkscreen race detector allows race bugs ACM symposium on parallel algorithms and architectures (SPAA
to be detected and localized, and the Cilkview scal- ), Barcelona, Spain. ACM, New York, pp –
ability analyzer allows the parallelism of an applica- . Blelloch GE, Gibbons PB, Matias Y, Narlikar GJ () Space-
tion to be quantitatively measured. Cilk++’s hyperobject efficient scheduling of parallelism with synchronization variables.
In: Proceedings of the th annual ACM symposium on parallel
library mitigates races on nonlocal variables. Although
algorithms and architectures (SPAA), Newport. ACM, New York,
parallel programming will surely continue to evolve, pp –
Cilk++ today provides a full-featured suite of tech- . Blumofe RD () Executing multithreaded programs efficiently.
nology for multicore-enabling any compute-intensive PhD thesis, Department of Electrical Engineering and Computer
application. Science, Massachusetts Institute of Technology, Cambridge, MA.
Available as MIT Laboratory for Computer Science Technical
Report MIT/LCS/TR-
. Blumofe RD, Frigo M, Joerg CF, Leiserson CE, Randall KH
Acknowledgments () An analysis of dag-consistent distributed shared-memory
Thanks to the many students and postdocs who con- algorithms. In: SPAA’, Padua, Italy. ACM press, New York,
tributed to Cilk technology over the years. Space does pp –
not permit all to be listed. Thanks to the team at . Blumofe RD, Frigo M, Joerg CF, Leiserson CE, Randall KH ()
Dag-consistent distributed shared memory. In: IPPS’, Hon-
Cilk Arts and the new Cilk team at Intel. Special
olulu, Hawaii, pp –
thanks to Matteo Frigo and Bradley C. Kuszmaul, . Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH,
who contributed mightily to developing and main- Zhou Y () Cilk: an efficient multithreaded runtime system. J
taining the MIT Cilk codebase. Many thanks to the Parallel Distribut Comp ():–
DARPA and NSF sponsors of this research over the . Blumofe RD, Leiserson CE () Space-efficient scheduling of
years. multithreaded computations. SIAM J Comput ():–
. Blumofe RD, Leiserson CE () Scheduling multithreaded com-
putations by work stealing. J ACM ():–
. Blumofe RD, Papadopoulos D () Hood: a user-level threads
Bibliography library for multi-programmed multiprocessors. Technical report,
. Adkins D, Barton R, Dailey D, Frigo M, Joerg C, Leiserson C, University of Texas, Austin
Prokop H, Rinard M () Cilk Pousse. Winner of the  ICFP . Blumofe RD, Park DS () Scheduling large-scale parallel com-
Programming Contest putations on networks of workstations. In: Proceedings of the
. Agrawal K, Fineman JT, Sukha J () Nested parallelism in third international symposium on high performance distributed
transactional memory. In: Proceedings of the th ACM SIG- computing (HPDC), San Francisco, pp –
PLAN symposium on principles and Practice of parallel program- . Brent RP () The parallel evaluation of general arithmetic
ming (PPoPP ), ACM, New York, pp – expressions. J ACM ():–
. Agrawal K, He Y, Hsu WJ, Leiserson CE () Adaptive schedul- . Bruening D () Efficient, transparent, and comprehensive
ing with parallelism feedback. TOCS, ()::–: runtime code manipulation. PhD thesis, Department of Electri-
. Agrawal K, Leiserson CE, Sukha J () Helper locks for cal Engineering and Computer Science, Massachusetts Institute
fork-join parallel programming. In: PPoPP’, ACM, New York, of Technology
pp – . Buluç A, Fineman JT, Frigo M, Gilbert JR, Leiserson CE ()
. Agrawal K, Leiserson CE, Sukha J () Executing task Parallel sparse matrix-vector and matrix-transpose-vector mul-
graphs using work stealing. In: IPDPS, Atlanta. IEEE, tiplication using compressed sparse blocks. In: SPAA. ACM,
pp – New York, pp –
Cilk C 

. Charles P, Grothoff C, Saraswat V, Donawa C, Kielstra A, . Frigo M, Halpern P, Leiserson CE, Berlin SL () Reduc-
Ebcioglu K, von Praun C, Sarkar V () X: an object-oriented ers and other Cilk++ hyperobjects. In: SPAA’, Calgary. ACM,
approach to non-uniform cluster computing. In: OOPSLA ’. New York, pp –
ACM, New York, pp – . Frigo M, Leiserson CE, Randall KH () The implemen-
. Cheng GI () Algorithms for data-race detection in mul- tation of the Cilk- multithreaded language. In: Proceed-
tithreaded programs. Master’s thesis, Department of Electrical ings of the ACM SIGPLAN conference on programming C
Engineering and Computer Science, Massachusetts Institute of language design and implementation, Montreal, Canada, pp
Technology, Cambridge –
. Cheng GI, Feng M, Leiserson CE, Randall KH, Stark AF () . Frigo M, Luchangco V () Computation-centric memory
Detecting data races in Cilk programs that use locks. In: Pro- models. In: SPAA’. ACM, New York, pp –
ceedings of the ACM Symposium on parallel algorithms and . Garey MR, Johnson DS () Computers and Intractability. W.H.
architectures. ACM press, New York Freeman and Company, San Francisco
. Cilk Arts Inc. () Cilk++ programmer’s guide, release . . Goldstein SC, Schauser KE, Culler D () Enabling primitives
edition, December  for compiling parallel languages. In: LCR ’, Troy, NY, USA,
. Cormen TH, Leiserson CE, Rivest RL, Stein C () Introduc- Available from http://en.scientificcommons.org/
tion to algorithms, rd edn. The MIT Press, Cambridge . Graham RL () Bounds for certain multiprocessing anomalies.
. Crummey JM () On-the-fly detection of data races for pro- Bell System Tech J :–
grams with nested fork-join parallelism. In: Super-computing’, . Halbherr M, Zhou Y, Joerg CF () MIMD-style parallel pro-
Albuquerque. IEEE Computer Society Press, Los Alamitos, gramming with continuation-passing threads. In: Proceedings of
pp – the nd international workshop on massive parallelism: hardware,
. Dailey D, Leiserson CE () Using Cilk to write multiprocessor software, and applications, Capri, Italy
chess programs. J Int Comput Chess Assoc ():– . Hauck EA, Dent BA () Burroughs’ B/B stack mech-
. Danaher JS, Lee ITA, Leiserson CE () Exception handling anism. In: Proceedings of the AFIPS spring joint computer con-
in JCilk. In: Synchronization and concurrency in object-oriented ference. AFIPS Press, Montvale, pp –
languages (SCOOL), October . Available at http://hdl.handle. . He Y, Leiserson CE, Leiserson WM () The Cilkview scalability
net// analyzer. In: SPAA’. ACM, New York, pp –
. Danaher JS () The JCilk- runtime system. Master’s thesis, . Helmbold DP, McDowell CE, Wang JZ () Analyzing traces
Department of Electrical Engineering and Computer Science, with anonymous synchronization. In: Proceedings of the 
Massachusetts Institute of Technology, Cambridge international conference on parallel processing, University Park.
. Dinning A, Schonberg E () An empirical comparison of mon- pp II.–II.
itoring algorithms for access anomaly detection. In: Proceedings . Institute of Electrical and Electronic Engineers () Informa-
of the ACM SIGPLAN symposium on principles and practice of tion technology – portable operating system interface (POSIX) –
parallel programming, San Diego, ACM, New York, pp – Part : system application program interface (API) [C language],
. Dinning A, Schonberg E () Detecting access anomalies in pro-  edn. IEEE Standard .
grams with critical sections. In: Proceedings of the ACM/ONR . Intel Corporation () Intel Cilk++ SDK programmer’s guide,
workshop on parallel and distributed debugging, ACM Press, October . Intel Corporation, Document number: -
New York, pp – US
. Eager DL, Zahorjan J, Lazowska ED () Speedup ver- . Intel Corporation () Intel R C++ compiler . user and
sus efficiency in parallel systems. IEEE Trans Comput (): reference guides, September . Intel Corporation, Document
– number: -US
. Emrath PA, Ghosh S, Padua DA () Event synchroniza- . Joerg C, Kuszmaul BC () Massively parallel chess. In:
tion analysis for debugging parallel programs. In: Supercomput- Proceedings of the third DIMACS parallel implementa-
ing ’, Albuquerque. IEEE Computer Society, Washington DC, tion challenge, Rutgers University, New Jersey, October
pp – –
. Feng M, Leiserson CE () Efficient detection of determinacy . Joerg CF () The Cilk system for parallel multithreaded com-
races in Cilk programs. In: Proceedings of the ACM sympo- puting. PhD thesis, Department of Electrical Engineering and
sium on parallel algorithms and architectures, New Port. ACM, Computer Science, Massachusetts Institute of Technology, Cam-
New York, pp – bridge. Available as MIT Laboratory for Computer Science tech-
. Fenster Y () Detecting parallel access anomalies. Master’s nical report MIT/LCS/TR-
thesis, Hebrew University, Jerusalem . Kernighan BW, Ritchie DM () The C programming language,
. Frigo M () A fast Fourier transform compiler. ACM SIG- nd edn. Prentice Hall, Englewood Cliffs
PLAN Notices ():– . Kuszmaul BC () Synchronized MIMD computing. PhD the-
. Frigo M () Portable high-performance programs. PhD the- sis, MIT Department of EECS, Cambridge
sis, Department of Electrical Engineering and Computer Science, . Kuszmaul BC () The StarTech massively parallel chess pro-
Massachusetts Institute of Technology, Cambridge gram. J Int Comput Chess Assoc ():–
 C Cilk Plus

. Kuszmaul BC () Brief announcement: Cilk provides the In: Proceedings of the  international conference on parallel
“best overall productivity” for high performance computing (and processing, St. Charles, IL
won the HPC challenge award to prove it). In: SPAA’. ACM, . Netzer RHB, Miller BP () What are race conditions? ACM
New York, pp – Lett Program Lang Syst ():–
. Lea D () A Java fork/join framework. In: Java Grande Con- . Nudler I, Rudolph L () Tools for the efficient development
ference, Stanford University, Palo Alto, pp – of efficient parallel programs. In: Proceedings of the first Israeli
. Lee ITA () The JCilk multithreaded language. Master’s the- conference on computer systems engineering, Tel Aviv
sis, Department of Electrical Engineering and Computer Science, . Perković D, Keleher P () Online data-race detection via
Massachusetts Institute of Technology, Cambridge coherency guarantees. In: Proceedings of the second USENIX
. Lee ITA, Wickizer SB, Huang Z, Leiserson CE () Using mem- symposium on operating systems design and implementation
ory mapping to support cactus stacks in work-stealing runtime (OSDI), Seattle, Washington
systems. In: PACT’, Vienna. ACM, New York, pp – . Randall KH () Cilk: efficient multithreaded computing. PhD
. Leijen D, Schulte W, Burckhardt S () The design of a task thesis, Department of Electrical Engineering and Computer Sci-
parallel library. In: OOPSLA ’, Orlando, FL. ACM, New York, ence, Massachusetts Institute of Technology, Cambridge
pp – . Reinders J () Intel threading building blocks: Outfitting C++
. Leiserson CE () The Cilk++ concurrency platform. J Super- for multi-core processor parallelism. O’Reilly, Sebastopol, CA
comput ():– . Savage S, Burrows M, Nelson G, Sobalvarro P, Anderson T ()
. Leiserson CE, Abuhamdeh ZS, Douglas DC, Feynman CR, Gan- Eraser: a dynamic race detector for multi-threaded programs.
mukhi MN, Hill JV, Hillis WD, Kuszmaul BC, St. Pierre MA, Wells In: Proceedings of the sixteenth ACM symposium on operating
DS, Wong MC, Yang SW, Zak R () The network architecture systems principles (SOSP). ACM Press, New York, pp –
of the Connection Machine CM-. J Parallel Distribut Comput . Stark AF () Debugging multithreaded programs that incor-
():– porate user-level locking. Master’s thesis, Massachusetts Institute
. Leiserson CE, Schardl TB () A work-efficient parallel of Technology, Department of Electrical Engineering and Com-
breadth-first search algorithm (or how to cope with the non- puter Science, Cambridge
determinism of reducers). In: SPAA’. ACM, New York, pp . Stroustrup B () The C++ programming language, rd edn.
– Addison-Wesley, Upper Saddle River
. Luk CK, Cohn R, Muth R, Patil H, Klauser A, Lowney G, Wallace . Supercomputing technologies group, Massachusetts Institute of
S, Reddi VJ, Hazelwood K () Pin: building customized pro- Technology Laboratory for Computer Science () Cilk ...
gram analysis tools with dynamic instrumentation. In: PLDI ’: reference manual, April 
proceedings of the  ACM SIGPLAN conference on program- . The MPI Forum () MPI: a message passing interface. In:
ming language design and implementation, Chicago, ACM Press, Supercomputing ’, Portland, OR, pp –
New York, pp – . The MPI Forum () MPI-: Extensions to the message-passing
. Microsoft Corporation () Parallel patterns library (PPL). interface. Technical Report, University of Tennessee, Knoxville
Available at http://msdn.microsoft.com/en-us/library/dd. . Tom Duff (). Duff ’s device. Usenet posting. http://www.
aspx lysator.liu.se/c/duffs-device.html. Accessed  Nov 
. Miller BP, Choi JD () A mechanism for efficient debugging . Wulf W, Shaw M () Global variable considered harmful.
of parallel programs. In: Proceedings of the  ACM SIGPLAN SIGPLAN Notices ():–
conference on programming language design and implementa-
tion (PLDI), Atlanta. ACM, New York, pp –
. Miller RC () A type-checking preprocessor for Cilk , a
multithreaded C language. Master’s thesis, Massachusetts Insti- Cilk Plus
tute of Technology Electrical Engineering and Computer Science,
Cambridge Cilk
. Min SL, Choi JD () An efficient cache-based access anomaly
detection scheme. In: Proceedings of the fourth international con-
ference on architectural support for programming languages and
operating systems (ASPLOS), Palo Alto., pp – Cilk++
. Mohr E, Kranz DA, Halstead RH, Jr. () Lazy task creation:
a technique for increasing the granularity of parallel programs. Cilk
IEEE Trans Parallel Distribut Systems ():–
. Narlikar G () Space-efficient scheduling for parallel, multi-
threaded computations. PhD thesis, Carnegie Mellon University,
Pittsburgh Cilk-
. Netzer RHB, Ghosh S () Efficient race condition detection
for shared-memory programs with post/wait synchronization. Cilk
Clusters C 

commodity subsystems is achieved through the inter-


Cilk- connection by one or more private networks that are
themselves available from commercial sources. A com-
Cilk
modity cluster incorporates a system software stack for
resource management and application programming. C
Usually (but not always) such software is a mix of open
source and proprietary components. Typically, a cluster
is programmed through a message-passing application
Cilkscreen programming interface like MPI running on top of the
operating system on each subsystem node.
Race Detectors for Cilk and Cilk++ Programs

Discussion
Introduction and Overview
Cluster File Systems A commodity cluster is a distributed computing system
consisting of an integrated set of fully and indepen-
File Systems
dently operational and marketed computer subsystems
(node) used together to perform a single application
program or workload. These include compute nodes
interconnected by a commercial network and possible
additional nodes configured for specific purposes such
Cluster of Workstations as login nodes, administrative nodes, and secondary
storage managers. Each component is COTS (commer-
Clusters
cial off the shelf) and can be acquired separately on
the open market. Clusters emerged in the mid-s to
largely replace most other forms of high performance
computing and now serve effectively in virtually every
scientific user and commercial market sector. Except at
Clusters
the very top end where MPPs (Massively Parallel Pro-
Thomas L. Sterling cessors) maintain a significant presence, clusters dom-
Louisiana State University, Baton Rouge, inate high performance computing. This remarkable
LA, USA evolution is driven by a number of motivating factors,
one of which is performance to cost achieved through
the exploitation of economy of scale of mass marketed
Synonyms components harnessed en masse to deliver significant
Beowulf clusters; Beowulf-class clusters; Commodity improvements in a multiple of operational properties
clusters; COW, cluster of workstations; Distributed through aggregation by integration. The major compo-
computer; Distributed memory computers; Linux clus- nents then are the compute nodes, the interconnection
ters; Multicomputers; NOW, network of workstations; network, the node operating system, the system man-
PC clusters; Server farm agement middleware, and the application programming
interface. Clusters ride the continuous wave of tech-
Definition nology advances to deliver ever increasing value per
A commodity cluster is a computing system compris- unit cost, exploiting progress in every technology class
ing an integrated set of subsystems, each of which including microprocessors, memory, communication
may be acquired commercially and is capable of inde- networks and switching, operating system software, and
pendent and complete operation. The synthesis of the parallel numerical algorithms.
 C Clusters

Motivation
The advent of the modern commodity cluster has been motivated by the opportunities it offers. Here are provided some of these that have been responsible for the rapid growth and ultimate domination of clusters in the domain of scalable computing.

Scalable Performance: Performance measured by throughput exhibits positive sensitivity to scale (number of cluster nodes) over a broad range for many workloads and applications. Although narrower in range, variation in performance measured in terms of response time can also be seen to scale with respect to the number of nodes employed.

Cost: A major driver for the use of clusters is the exploitation of economy of scale derived from the mass market of the comprising node subsystems, resulting in a dramatic advantage in performance to cost with respect to custom parallel systems and a low cost for implementation of parallel systems. This factor is driving the dominance of commodity clusters in the mid-range of the processing market and at the high end for major transaction processing and major scientific applications. It should be noted that there are custom cases for specific algorithms in which performance to cost is competitive with clusters.

Problem Size: Large applications require data sets larger than the capacity of single enterprise server mid-level systems. Clusters provide a quick and relatively inexpensive way of aggregating many boxes of memory, cluster nodes, to build systems containing large memory capacity. Where memory is the most expensive component of the system, the use of mass market memories is a cost effective way of achieving the largest possible memory system. These are distributed memory, not shared memory.

Mass Storage: Clusters make an excellent platform for building mass storage systems or applications that are based on secondary storage. The use of industry standards and low cost RAID disk systems attached to nodes of clusters is widely used, for example, by large on-demand search engines making up some of the largest systems in the world.

Early Technology Adoption: Rapid advances in microprocessor technology are most quickly captured by the mass-market products to exploit economy of scale and achieve early return on investment. Cluster systems integrating such commodity subsystems as cluster nodes are therefore able to incorporate and exploit these advances early.

Flexibility of Configuration: Because of the method by which clusters are constructed, using building blocks derived from already commercialized products and COTS (commercial off the shelf) interconnect networks, the end system deployment is highly flexible in its configuration and not limited by the few system offerings of vendors, as is the case with custom MPPs. Number of nodes, topology, source of nodes, and many other attributes may be determined by the deploying institution and, indeed, can be modified over time, such as by the addition of updated nodes or GPU accelerators. This further extends to the software stack which, if derived from open source components, can largely be defined, refined, and updated by the local site, again not restricted by vendor proprietary constraints.

Sum of the Parts: This notion reflects both the obvious attributes of the cluster itself and, more importantly, the benefit of many contributing to a single class of systems, largely through an open source framework, even though commercial software targeting such a framework also is of value.

Programming Model Commonality: A major feature of clusters is that they share the same distributed memory programming models with their custom MPP counterparts. Such application programming interfaces as PVM and MPI, derived from the communicating sequential processes model, are portable between the two classes of distributed memory system. Due to latency, bandwidth, and overhead considerations, a commodity cluster of comparable size (number of nodes and memory capacity) to an MPP may deliver lower sustained performance than the MPP but will produce the same results.

Empowerment: This is the freedom many experience when working with clusters, and it has provided a strong impetus to acquiring skills and experimenting with such systems. The sense of empowerment is a consequence of the ability to implement and deploy clusters without the constraints implicit in working with a particular vendor. Many young practitioners have been thus inspired, with high school students finding ways to aggregate surplus computing elements into locally implemented systems, providing an important vehicle for education.
Coolness Factor: There is just something really cool about playing with clusters. Although a purely subjective factor, it nonetheless has driven much activity, especially among younger practitioners who react to many of the empowerment issues above but also to the attraction of open source software, unbounded performance opportunities, the sandbox mindset that enables self-motivated experimentation, and a subtle attribute of the pieces of a cluster "talking" to each other on a cooperative basis to make something happen together. There is also a strong sense of community among contributors in the field, a basis for a sense of association that appeals to many people. Working with commodity clusters is fun!

Cluster System Architecture

Hardware Architecture
The hardware architecture of a cluster system is simple in concept, if not in the details of implementation. To first order, it consists of a set of compute nodes and an interconnection network at its core connected to secondary storage, an external local area network, and one or more administration terminals. One or more of the compute nodes are dedicated to serve as the master, host, or login node to handle user sessions.

The heart of the cluster is the COTS (commercial off the shelf) compute node. This subsystem is a fully operational stand-alone computer with the important attribute that it benefits from a market far greater than that of the cluster of which it is a part. The resulting economy of scale yields an exceptional performance to cost greatly in excess of alternative HPC strategies.

Each node is a fully functioning computer system in its own right. It incorporates as its major components:

● One or more multicore processor components that provide the primary computing capability of the node. By far, the most widely used processor family for clusters is the X architecture, currently including the Intel Xeon (in multiple instantiations) and the AMD Opteron
● DRAM memory modules that provide the main storage of the node combined with a cache hierarchy supporting each of the processor cores
● Motherboard chip-set that provides all of the control features of the node, the intra-node communications, and the BIOS startup mechanisms for booting the system
● Network Interface Controller to connect the node to the cluster system private network
● Local disk storage for the node. Some nodes do not have this and are referred to as "diskless" or "stateless"
● Packaging, power, and cooling

A cluster is the integration of such nodes by means of a private system area network. The network consists of the NICs (described above) installed in the nodes, the switches that route packets, and the channels or cables connecting the switches to the nodes and to each other. The arrangement of the network components is described as the network topology, which for clusters is usually a tree (for smaller clusters) or a Clos network for larger clusters. Important network properties are its bandwidth and latency. Bandwidth establishes the amount of information that can be passed between all pairs of nodes simultaneously. Latency establishes the time it takes to move a packet between two nodes. The primary network types employed today in clusters are Ethernet in its myriad forms (primarily GigE) and the increasingly important Infiniband (IBA) network, which provides lower overheads and shorter latencies.
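To make these two properties concrete, a common first-order model (an illustration added here, not a formula from this entry) estimates the time to move a message of m bytes between two nodes as

    T(m) ≈ α + m / β

where α is the end-to-end latency and β the available bandwidth of the path; the latency term dominates for short messages and the bandwidth term for long ones.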
Software Architecture
The cluster software structure matches that of the hardware architecture in integrating the functionality of the separate components into a single operational system. The roles and responsibilities of cluster software range from the general managing of the individual hardware resources to providing programming environments for users and overall system administration. Typically every node supports its own operating system. Dominant among these is Linux. However, commercial versions of Unix such as Solaris, AIX, and HPUX are employed as well. Microsoft Windows is found on some clusters, especially in the business community such as financial markets.
The programming environments, at least for the scientific computing arena, are based on the MPI application programming interface, which is ubiquitous, stable, and portable. Application codes written in MPI (employing C, C++, or Fortran as the process language) are easily conveyed across systems of different scale (different numbers of nodes), different processor architectures, different network types and topologies, and even from clusters to MPPs with no or minimal changes for correctness of operation. Getting optimal performance from the different systems may require some code modifications. MPI describes communication patterns among processes on different nodes, the content of different messages, and their functionality as gather/scatter or collective operations. MPI has provided the first supercomputing common programming framework and is largely responsible for the success achieved by clusters today in industry and government sectors as well as academic research.
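To give a concrete flavor of this interface, the following minimal, self-contained program in C is an illustrative sketch added here (not an example from this entry); the file name, the toy problem, and the choice of MPI_Allreduce are arbitrary.

    /* mpi_sum.c -- illustrative sketch: each process sums part of a range
       and the partial sums are combined with a collective operation. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        long local = 0, global = 0;

        MPI_Init(&argc, &argv);                  /* start the MPI runtime       */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's identity     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);    /* number of processes (nodes) */

        /* each rank sums a disjoint slice of 1..1000000 */
        for (long i = rank + 1; i <= 1000000; i += size)
            local += i;

        /* collective reduction: every rank receives the full sum */
        MPI_Allreduce(&local, &global, 1, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum = %ld (computed by %d processes)\n", global, size);

        MPI_Finalize();
        return 0;
    }

Such a program is typically compiled with an MPI wrapper compiler (e.g., mpicc) and launched across the nodes with a launcher such as mpirun; the same source runs unchanged on a cluster or an MPP, which is the portability property described above.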
Additional software has been developed for system administration, workload scheduling, and initial installation and configuration.

History
The history of cluster computing has been driven by the opportunity to achieve a rapid increase in one or more operational properties of a computer system through the aggregation and interoperability of replicated independent systems by means of interconnection networks. Such properties include but are not limited to throughput performance, total storage capacity, performance to cost, reliability, availability, and input/output communications rate. Indeed, the concept of clustering in its broadest sense as an abstraction predates computing altogether and has its antecedents in the genesis of human civilization in the domains of transportation, manufacturing, human enterprise (including military), and living environments (buildings and collections thereof). The motto of the USA found on every dollar bill, "E Pluribus Unum," is an emblem of clustering at least at a high level of abstraction. This general strategy of synthesis of existing systems to comprise much larger ensemble systems has been applied to computing systems since its second generation in the s with SAGE and the IBM  as its earliest exemplars. The modern commodity cluster emerged in the early to mid s with the advent of critical enabling technologies derived from the previous decade, with  marking a milestone year of the first commodity cluster on the Top- list and the first Gordon Bell Prize awarded to a cluster based computation. Here is presented a brief history of clusters reflecting the diversity of ways they have been employed and the succession of enabling conditions that catalyzed their implementation, deployment, and use. For pedagogical purposes, this history is organized in four distinguishable periods: () prior to , () –, () –, and ()  to present and probably beyond. The first was a period of exploration where some rudimentary forms of clustering were attempted and various incipient technologies were derived in their earliest forms that would one day become critical to successful cluster systems. The second period saw the first real clusters and the development of the enabling technologies, both hardware and software, that would ultimately in synthesis launch the cluster revolution as they matured. The third period introduced the classical period of clusters at a time when microprocessors were differentiated between those employed in workstations (NOW) and those used in the lowest cost personal computers or PCs (Beowulf), only to find this distinction evaporate as microprocessor architectures converged. Equally important were the advances in network technologies, programming models, and middleware. The final (fourth) ongoing period is witnessing the dominance of commodity clusters for scientific (STEM) and commercial high performance computing with the advent of new multicore processor sockets and GPU accelerators. Except for the highest end computing systems, clusters are likely to serve as the principal family of supercomputers across a broad range of uses and markets for the foreseeable future.

Exploratory Period: Before 
The earliest efforts to exploit the concept of clustering predated the true opportunities for improvement in performance-to-cost but addressed certain key niche requirements. It was also a period when some of the technologies that would eventually contribute to the widespread use of clusters were initially devised.

Perhaps the first example of a cluster of computers was the IBM SAGE computer system developed for the Department of Defense NORAD air defense system.
SAGE accepted radar information from a couple of dozen sites using early wide area network technology and presented the strategic airspace status to operators using one of the first examples of digital video presentation displays. SAGE consisted of two CPUs connected, not to double the performance, but for reliability; if one failed, the other could immediately pick up the workload and continue the time critical processing of defense data. The SAGE CPUs were derived from the original MIT Whirlwind computer architecture but with extended word lengths and other features.

Many of the key technologies that would make up the primary components of later commodity clusters were first derived during this exploratory period. The microprocessor was first developed by Intel, introduced in  () with the X family that has become the dominant CPU of commodity clusters, marketed by . The first local area network technology to be standardized, Ethernet, was developed at Xerox PARC. The TCP protocol was also developed at this time. Together these two technologies established the foundations for networking and, through their evolutionary descendants, created the single most widely used network in clusters, both integrating their constituent elements and their external input/output interfaces. Software technology also was developed at this time that would have long-term impact on clusters. Among these was the Unix operating system by Bell Labs, which incorporated many of the service infrastructure capabilities that would be necessary for effective clustering. The programming languages Fortran and C, both widely used on clusters today, were developed at this time. Finally, the fundamental model of computation that is used almost exclusively on clusters, Communicating Sequential Processes or CSP, was formally described at this time, upon which a number of future application programming interfaces for distributed memory systems including clusters would be based.

Enabling Period: –
Throughout the enabling period, all of the technologies, both hardware and software, were developed that catalyzed the cluster explosion. The microprocessor evolved from the -bit to the -bit architectures with clock rates reaching  MHz. Local area network Ethernet technologies achieved  Mbps throughput with continuous improvements in bandwidth per cost as well as in switching and network interface controllers (NIC). With Moore's Law, DRAM memory density continued its exponential growth with concomitant improvements in storage capacity per cost. Similar improvements were achieved in the arena of SCSI and EIDE disk storage.

Software advances were equally important during this period, providing the initial basis for the first clusters. An advanced version of Unix, the DARPA-sponsored BSD Unix from UC Berkeley, which included virtual memory and network communications semantics, was a key development that spurred many industry OS packages like Solaris from Sun Microsystems and the early open source Linux operating system that was to become a major software base for future clusters. Early implementations of programming tools based on the CSP model were developed both commercially and in academia, including the influential PVM system jointly developed by Emory University and Oak Ridge National Laboratory. Originally targeted to early generation MPPs (Massively Parallel Processors), such application programming interfaces (API) were well suited for future clusters.

The use of local area networks (LAN), principally Ethernet, integrated working environments comprising distributed collections of workstations, usually for personal use, and shared resources such as printers, file servers, and external internet access. It was recognized by some that, depending on individual usage, many workstations lay idle for considerable periods and were potentially available for other purposes. Efforts were made to employ such aggregations of temporarily unused "workstation farms" for "cycle harvesting" by scheduling applications on unused workstations when available. Among many software packages developed for this purpose was Condor from the University of Wisconsin, still widely used today. Condor matches user job requirements with available system capabilities and schedules these pending jobs as suitable systems are made available.

Perhaps the earliest formal use of the term "cluster" occurred near the end of this period, introduced by Digital Equipment Corporation (DEC) to relate to the development of a collection of VAX minicomputers interconnected by means of proprietary network hardware and software. The early Andromeda (M) project at DEC explored this arena of opportunity, followed by the VAXcluster product offerings.
While the network technology was custom, the cluster "nodes" were essentially off-the-shelf VAX computers (e.g., model /) and therefore exhibited a critical property of the future generation of clusters to emerge a few years later.

By the end of the Enabling Period, essentially all technologies, system architectures, software frameworks, and methodologies had been developed, at least in experimental form or for ancillary purposes, which would lay the foundation for the initial forays into various forms of clusters and the eventual dominance of clusters as a principal means of achieving scalable systems.

Classical Period: –
The emergence of the commodity cluster exploited the concepts of clustering, the aggregation of multiple like systems by means of interconnection techniques to form a super-system, and key enabling technologies to deliver a new and at the time unique capability measured in performance-to-cost. While clusters are pervasive today in scientific and commercial computing arenas, at the beginning of the classical period it was far from certain that clusters would be developed, let alone have any significant impact on high performance computing. At that time, "big iron," the use of custom supercomputers, was ubiquitous, with the cold war driving their development and much of their deployment. To many the cluster concept was counterintuitive and even counterproductive to the goals and methodology embodied by such machines as the CRI Cray YMP and the MasPar , both custom high-speed computer systems of the day. To some, the cluster paradigm was viewed as a threat to mainstream supercomputing and, as it turned out, they were proved correct.

A number of early attempts at cluster implementation were conducted, with two projects standing out as the pathfinders for what was to become the dominant form of scalable computing up until the present. These were the NOW (Network of Workstations) project at UC Berkeley and the Beowulf project at NASA (Goddard Space Flight Center and Jet Propulsion Laboratory). From the benefit of hindsight these system concepts were very similar, but in the early to mid-s they were considered as quite distinct. NOW emphasized high-end workstations, high performance networks, and proprietary software. Beowulf emphasized low cost personal computers, mass market networks (Ethernet), and the new trend in open source software including the Gnu editors and compilers and the Linux operating system. Both projects began in , with significant systems deployed in , and both had strong impact on the community, essentially defining the range of capabilities and techniques to be incorporated to this day. Among these was the choice between vendor and end user installations. At a time when there were few vendor offerings in the cluster market, most clusters were assembled and installed (both hardware and software) from component systems by the end user facility, with early emphasis in the academic and government laboratory environments.

Another important advance was the development of the system area network (SAN) explicitly devised for interconnection networks in clusters. This was represented by Myrinet, a high speed, low overhead network marketed in  by Myricom initially for Sun workstations and later for general PCI connected nodes. Myrinet offered a superior performance, lower latency, and higher bandwidth interconnect to the competing Ethernet but at a higher cost. It was very successful in the emerging commercial marketplace but saw slower adoption by the cost sensitive Beowulf user community. Nevertheless, Myrinet narrowed the gap between commodity clusters and MPPs with their custom interconnects. Other commercial networks included Quadrics and SCI.
The Beowulf Project was initiated in  at the NASA Goddard Space Flight Center in collaboration with the USRA Center for Excellence in Space Data and Information Systems (CESDIS) and the University of Maryland at College Park. The project was undertaken to explore the opportunity of employing mass market personal computer systems for science simulation and data analysis, with a critical objective of low cost for dedicated user allocation. An initial set of performance and capacity requirements was established at a cost within $K () for the hardware deployment. The first Beowulf system, Wiglaf, saw first life in the summer of , and was followed by Hrothgar () and Hyglac (), all  nodes, culminating in  Gigaflops sustained performance on a real world problem for under $K, a performance to cost gain of a factor of X–X that of contemporary MPPs. All Beowulf systems of that time consisted of single microprocessor nodes, Intel X family microprocessors, and Ethernet interconnect ( Mbps followed by  Mbps Fast Ethernet) with tree topology. These systems would grow to between  and  nodes, all still in their original "pizza box" or "tower" cases and usually stacked on shelves or sometimes inserted in racks. By , a combined project of the California Institute of Technology and Los Alamos National Laboratory (LANL) won the first Gordon Bell Prize awarded to a cluster based computation. In the same year, the NOW project had the first cluster entry on the Top- List. The commodity cluster had been embraced by the mainstream HPC community. In , How to Build a Beowulf by Sterling et al was published by MIT Press and represented the state of the art as it was then known, providing a general methodology for realizing commodity clusters.

The most important advance in software for commodity cluster computing was the community-wide development and support for a general and widely accepted application programming interface, MPI (Message Passing Interface), that became the lingua franca for distributed memory system programming. The first MPI standard (MPI-) was released in  with the first implementations available as early as a year later. A critical factor in the success and early adoption of MPI by users and vendors alike was the MPICH library reference implementations that gave early access and credible performance to appropriately crafted applications using the message-passing model. A more advanced but fully backward compatible MPI standard (MPI-) was released in  with implementations following. Within the scientific computing arena, MPI is ubiquitous on both commodity clusters and MPPs, providing portability for users and stable target platforms and market for Independent Software Vendors.

During this period, advanced system software was developed to control cluster management. This middle layer of software, such as PBS, the Maui scheduler, and the ROCKS deployment package, was created out of a critical necessity for higher productivity of cluster installation, operation, and application. The do-it-yourself mentality, while serving as a catalyst for initial exploration and application of clusters, nonetheless suffered from inadequate methods and means of stable employment expected of other commercial computer systems. This middleware went a long way in rectifying this shortcoming while providing both open source and commercially maintained versions of the software for the widest usage.

As a commercial market for clusters emerged, vendors drove the next evolutionary step by developing and offering products better suited to clusters. Perhaps most important was the repackaging of the functional node into a dense rack mountable unit that provided high density packaging, superior air cooling, superior reliability, improved performance and memory capacity, built-in I/O and network ports, and both installed disk and diskless versions. Such nodes borrowed from a class of multiprocessor servers to provide multiple microprocessors in the same node sharing a common memory through cache coherence schemes (e.g., MESI). Offering alternative interconnection networks and different scale systems provided users with a wide space of choice and, through the mass market and economy of scale of these nodes, excellent performance to cost. By , commodity clusters comprised more than % of the Top- List of the world's fastest computer systems.

Advanced Period:  to Present (and Beyond)
In recent years, important changes in enabling technology have altered the form and function of commodity clusters. Prior to the advanced period, which began in approximately , performance gains were achieved primarily by technology advances driven in part by Moore's law. Both clock rate and processor core complexity continued to increase at a steady predictable rate. This was augmented with improved high density packaging and corresponding increases in the number of nodes to yield performance gains of greater than X per decade. By the beginning of the current period essentially all of the processor architecture tricks had been exploited, while limits on power consumption caused clock rates to flat-line around  GHz +/−%. At the same time, logic density continued to increase, leading to multicore components providing multiple cores per socket.
Initially MPI was thought to be sufficient to program multicore based commodity clusters, but subtle changes in the balance of communications resulting from these new system structures suggest that alternative programming methods, perhaps augmenting rather than replacing MPI, such as Cilk, TBB, and Concert, may be required to make best use of multicore based clusters.

A second important change is the recent interest in adapting special purpose graphical processor units (GPU), originally intended for the entertainment industry dependent on high-resolution and high-speed visualization for movie special effects and video games, to the more general purpose scientific computing application domain. This has resulted in the emergence of heterogeneous cluster architectures with individual nodes incorporating ancillary accelerators for speeding up, sometimes dramatically, aspects of the application workflow. This has challenged programming methodology once again, with such supportive APIs as CUDA and the emerging OCL to assist programmers in exploiting these new devices.

A new system area network and community standard, Infiniband (IBA or IB), has emerged as the principal challenger to Ethernet and has largely replaced Myrinet as the high performance, low latency, and low overhead interconnection fabric. Today, Infiniband represents % of the installed base of systems on the Top- List, surpassed only by Gigabit Ethernet at just over %. Early problems with initial IBA offerings slowed market penetration, but that has largely been resolved and a mix of these two interconnect technologies may be anticipated, with IB increasing its usage in the scientific cluster computing arena.

Commodity clusters now dominate high performance computing with greater than % of the HPC market overall. But at the highest end, MPPs dominate. This is due to the advantages of custom interconnect and packaging, even though the basic multicore microprocessor and memory technologies employed in both are basically the same. Such MPP machines as the IBM Blue Gene/P and the Cray XT are optimized for the requirements of major computer centers where performance, power, size, and reliability are more important than normalized cost. With the reliance on an increased number of cores for performance gain, custom MPPs are likely to find an increased foothold at the highest realms of HPC, replacing in dominance the commodity cluster.

A counter trend, however, is now asserting itself that is likely to increase cluster use even in some aspects of the very highest end of system performance, at least as measured in terms of throughput rather than strong scaled (reduction in execution time) problems. Web search engines and internet accessed "Cloud Computing" that respond to a very large remotely distributed user base call for unprecedented scale in commodity clusters. This new market will have a dramatic albeit unpredictable effect on the future evolution of the commodity cluster. The only thing that is certain is that clusters will continue to serve the computing community in myriad ways as they evolve to serve diverse markets and user communities.

Related Entries
Distributed-Memory Multiprocessor
MPI (Message Passing Interface)
Network of Workstations
Interconnection Networks

Bibliography
. Pfister GF () In search of clusters: the coming battle in lowly parallel computing. Prentice-Hall, Inc., Upper Saddle River, New Jersey
. Sterling TL, Salmon J, Becker DJ, Savarese DF () How to build a Beowulf: a guide to the implementation and application of PC clusters. MIT Press, Cambridge, Massachusetts
. Sterling T, Lusk E, Gropp W (eds) () Beowulf cluster computing with Linux. MIT Press, Cambridge, Massachusetts
. Kronenberg N, Levey H, Streeker W, Merewood R () The VAXcluster concept; an overview of a distributed system. Dig Tech J ():–
. Metcalfe RM, Boggs DR () Ethernet: distributed packet switching for local computer networks. Commun ACM ():–. DOI= http://doi.acm.org/./.
. Gropp W, Lusk E, Skjellum A () Using MPI: portable parallel programming with the message-passing interface. MIT Press, Cambridge, Massachusetts
. Torvalds L () The Linux edge. Commun ACM ():–. DOI= http://doi.acm.org/./.
. Pfister G () An introduction to the infiniband architecture. High Performance Mass Storage and Parallel I/O, IEEE Press, Los Alamitos
. Papadopoulos PM, Katz MJ, Bruno G () NPACI Rocks: tools and techniques for easily deploying manageable Linux clusters. Concurr Comput Pract Exp (–):–
. Henderson RL () Job scheduling under the portable batch system. In: Feitelson DG, Rudolph L (eds) Proceedings of the workshop on job scheduling strategies for parallel processing. Lecture notes in computer science, vol . Springer-Verlag, London, pp –
. Bode B, Halstead DM, Kendall R, Lei Z, Jackson D () The portable batch scheduler and the Maui scheduler on Linux clusters. In: Proceedings of the th annual Linux showcase & conference, vol . Atlanta, Georgia, – October . Atlanta Linux Showcase. USENIX Association, Berkeley, CA, p 
. Everett RR, Zraket CA, Benington HD () SAGE: a data-processing system for air defense. In: Papers and discussions presented at the December –, , eastern joint computer conference: computers with deadlines to meet. Washington, DC, – Dec . IRE-ACM-AIEE ' (Eastern). ACM, New York, pp –. DOI= http://doi.acm.org/./.

CM Fortran
Connection Machine Fortran

Cm* - The First Non-Uniform Memory Access Architecture
Daniel P. Siewiorek, Ed Gehringer
Carnegie Mellon University, Pittsburgh, PA, USA
North Carolina State University, Raleigh, NC, USA

Definition
The name Cm* derives from an architecture wherein each processor-memory pair was called a "computer module" that could be replicated any number of times, as denoted by the Kleene star borrowed from set theory.

Discussion
Chronology
The Cm* architecture was described in the fall of . It was designed to explore how microcomputers could be combined to form large digital systems. As the sixteen-processor multiprocessor C.mmp became operational at Carnegie Mellon University in the early s and Digital Equipment Corporation introduced the LSI- (January ), the project became focused on building a modularly expandable multiprocessor for parallel processing research. The architecture was specified by March , and the design was completed by the fall of . A single-cluster system was operational by July . A ten-processor, three-cluster system and operating system were demonstrated in June . All fifty processors in five clusters were operational by September . The machine was used experimentally for applications until January .

Cm* was one of the first large-scale general-purpose multiple-instruction/multiple-data (MIMD) processors []. Two complete operating systems – StarOS and Medusa – were developed along with a host of applications. Over a decade of experience representing more than  person-years of effort was accumulated in the emerging area of parallel processing. Cm* researchers explored issues ranging from programming parallel applications, to investigating interactions between the operating system and architecture, to developing an automated experimental environment that facilitated the construction of prototype applications [].

Project Genesis: Historically, Cm* had its beginning in the register-transfer module (RTM) project []. RTMs are a module set for the systematic construction of digital systems at the register-transfer level. The successor to the RTM project, and immediate forebear of Cm*, was the computer module (CM) project [, ], whose major objective was also the systematic construction of digital systems.

The prime assumption of the CM project was that a simple computer, a processor-memory pair, is an appropriate module for building large digital systems. Modular structures have several advantages, including reduced cost through faster system design, faster production, reduced inventories, and simplified maintenance. A second assumption of the CM project was that communication between modules should be at the level of a single memory reference. This allows fine-grained communication and supports the widest variety of interprocessor communication patterns.

Multiprocessors consist of many autonomous processors that address the same primary memory. In contrast, multicomputers are made up of autonomous computers communicating by means of messages through static or dynamic communication links. In particular, the Cm* architecture provides a continuum of memory sharing between processors ranging from no shared memory (multicomputer) to fully shared memory (multiprocessor).
Cm* - The First Non-Uniform Memory Access Architecture. Table  Synchronization granularity and distributed computer structures

Grain size | Synchronization interval (instructions) | Distributed computer structure | Communication overhead (instructions)
Fine       |                                          | Vector/array processor         |
Medium     | –                                        | Multiprocessor                 | –
Coarse     | –,                                       | Multicomputer                  | –,
Large      | ,– million                              | Network                        | ,– million

One way of positioning a multiprocessor is to consider the synchronization granularity [], or frequency of synchronization between tasks. Table  shows, for a variety of synchronization granularities, the best-suited distributed computer organization. Multiprocessor systems such as Cm* are more suitable for a medium grain of synchronization.

Multiprocessors and networks employ a variety of interconnection structures to couple processors with memory units. Among these are full interconnections, shared buses, multiple buses, crossbar switches, and multistage networks. See Siegel [] for a complete discussion of this topic. The fully connected network couples each processor with each memory through a dedicated link. A shared bus is a single communication path to which both processors and memory are connected. Bus arbitration may either be done in a central arbiter or be distributed among the units. The bus bandwidth is divided between the processors and does not scale for large N.

A multiple shared bus is a set of shared buses connected by gateways or switches. The switches may be organized in many ways. The best-known topologies are crossbar switches and hierarchical switches. In crossbar switches, each processor and each memory unit is connected to a dedicated shared bus. Switches are used to connect the processor buses and memory buses into the form of a crossbar. In the -by- crosspoint switch of C.mmp, each switch connected a full bus of over  wires. Crosspoint switch complexity grows as N², where N is the number of processor/memories, and it does not scale for large N. With a hierarchical switch, a group of functional units is clustered around a single shared bus, and the clusters are interconnected with another shared bus. By repeating the process, a hierarchically organized multi-shared-bus is obtained. Bandwidth can be increased by adding another bus. The number of bus connections grows approximately linearly with N.

Rather than have separate processors and memories, Cm* coupled processors and memories so that not all memory accesses encountered switch delays. Borrowing concepts from memory-mapped input/output, portions of the memory address space were assigned to local memory while the remainder of the address space could be mapped to any memory location in any other module – effectively turning Cm* into a software-managed hierarchical cache.

Two conceptually different operating systems were implemented for Cm*. Large portions of the operating-system kernels resided in firmware, with operating-system calls triggered by accessing special memory addresses.

Cm* Architecture Evolution
Cm* is structured using processor-memory pairs called computer modules or Cm's (after the PMS notation of []). The memory local to a processor also serves as the shared memory in the system. Inherent in this structure is the assumption of program locality. The efficient use of the system depends on ensuring that most of the code and data referenced by a processor will be held local to that processor. Early Cm* measurements with various benchmark applications indicate that local-memory hit ratios of .–. (the fraction of total memory references directed to local rather than shared memory) were readily achieved.
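The importance of these hit ratios can be seen with a simple weighted-average model (added here for illustration; it is not a measurement from the project). If h is the local hit ratio, t_local the local access time, and t_remote the average time of a reference that must leave the module, the mean access time is approximately

    t_avg ≈ h · t_local + (1 − h) · t_remote

so with h close to one, t_avg stays near t_local even when remote references are several times more expensive, which is the property on which the Cm* design depends.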
Initially, one self-contained module was envisioned that consisted of a processor, memory, and an intelligent interface. The result was termed a computer module (Cm). The address-mapping controller (Kmap) performed all the functions necessary for generating external memory requests and responding to external requests for its local memory.
So that the capacity for interprocessor communication would not be limited by any single communication path, each Kmap connected to two inter-Cm buses. Further investigation led to the conclusion that very little performance was lost by centralizing the modules' address-mapping and multiple-bus connection functions in a Kmap. The Kmap was shared by a number of computer modules, which saved at least half the cost of a one-Kmap-per-module design. The programmable high-performance Kmap was shared by several Cm's connected to an inter-Cm bus via simple interfaces (Slocals). The basic function of the Slocal was to act as a buffer between the processor and the inter-Cm bus and provide sufficient control functions to generate or respond to external memory requests.

Cm* Architecture
The original description of the design and implementation of Cm* appears in two papers [, ]; the design and switching of Cm* and a detailed account of one particular addressing structure are described by Swan []. Cm* consisted of  Cm's, connected together by a hierarchical, distributed switching structure depicted in Fig. . The lowest level of the switching hierarchy comprised Slocals, local switches that connect individual Cm's to the rest of the structure. Cm's were grouped together into clusters that were presided over by high-speed microprogrammable communication controllers called Kmaps. A Kmap provided the mechanism for Cm's in its cluster to communicate with each other, and cooperated with other Kmaps to service requests from its Cm's to access Cm's in nonlocal clusters.

[Figure : five clusters of ten computer modules each (Cm00–Cm09 through Cm40–Cm49); the Cm's of each cluster share a map bus presided over by a Kmap (Kmap0–Kmap4), with per-cluster disks and DA/SLU links (SLU to Host); the five Kmaps are joined by two intercluster buses. Legend: intercluster bus, map bus, PDP-11 bus, DA links, SLU to Host.]

Cm* - The First Non-Uniform Memory Access Architecture. Fig.  Cm* architecture
Besides this address-mapping function, since Kmaps were microprogrammable, it was usual to implement key operating-system functions in Kmap microcode. All communication that involved Kmaps was performed via packet switching rather than circuit switching to avoid deadlock over dedicated switching paths. Packet-switched communication also allowed the processing of requests by the Kmaps to be overlapped, since switching paths did not need to be allocated for the duration of a request; this resulted in improved utilization of the switching structure. The interconnection structure between Cm* clusters was essentially arbitrary. The Kmap in each cluster had two bidirectional ports, each of which could be connected to a separate intercluster bus to achieve a variety of interconnection schemes. In the implemented configuration, all five Kmaps were connected to two intercluster buses, as shown in Fig. . The physical implementation of Cm* is shown in Fig. .

Cm* - The First Non-Uniform Memory Access Architecture. Fig.  Physical structure of Cm* with four Cm's shown in the upper drawer and the Kmap in the lower drawer

Cm's and Slocals
Each Cm was a processor-memory-switch combination, consisting of a standard off-the-shelf Digital Equipment Corporation (DEC) LSI- (the first single-board implementation of the PDP- instruction set),  K or  K bytes of memory, one or more I/O devices, and a custom-designed Slocal that connected the processor-memory combination to the rest of the system (Fig. ). When the processor of a Cm initiated a memory reference, its Slocal was responsible for determining whether the reference was to be directed to local memory or out to the Kmap for further mapping. The Slocal used the four high-order bits of the processor's address, along with the current address space, to access a mapping table that determined whether the reference was to proceed locally or not. References that mapped to local memory proceeded with no loss of performance; references that mapped to another Cm were slower by a factor of three; and references that mapped to a Cm in another cluster were slowed again by a factor of three. These figures were the best possible ratios that could be achieved on Cm*, and therefore amount to constraints imposed by the hardware itself rather than to any microcode implementation features.
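The relocation decision described above can be modeled in a few lines of C. This is only an illustrative sketch added here, not the actual Slocal hardware or Kmap microcode; the table layout, the number of address spaces, and all names are invented for the example.

    /* Illustrative model of the Slocal's relocation decision (not actual
       Cm* logic).  A PDP-11 address is 16 bits; its four high-order bits
       select one of 16 entries in a per-address-space mapping table.      */
    #include <stdint.h>
    #include <stdbool.h>

    struct map_entry {
        bool     local;        /* true: reference proceeds to local memory */
        uint32_t remote_frame; /* otherwise: frame forwarded to the Kmap   */
    };

    /* one 16-entry table per address space (hypothetical layout) */
    static struct map_entry map_table[4][16];

    /* Decide where a reference goes, as the Slocal would. */
    bool slocal_is_local(unsigned space, uint16_t addr, uint32_t *frame_out)
    {
        unsigned page = addr >> 12;               /* four high-order bits   */
        const struct map_entry *e = &map_table[space][page];
        if (!e->local)
            *frame_out = e->remote_frame;         /* request goes to the Kmap */
        return e->local;
    }

In this sketch a local hit costs nothing extra, mirroring the factor-of-one, -three, and -nine access ratios quoted above for local, intracluster, and intercluster references.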
All I/O devices in Cm* were connected to the various LSI- buses. Since there was no interprocessor communication mechanism other than the standard one for memory references, interrupts generated by an I/O device needed to be fielded by the processor to which the device was directly attached.

Kmaps: Transaction Controllers
Acting both as the switching center within a cluster and as a node in a network of clusters, the Kmap served to coordinate synchronization and communication in Cm*. A fast (-ns cycle), horizontally microprogrammed (-bit wide) microprocessor, the Kmap itself consisted of three tightly coupled processors. The bus controller, or Kbus, acted as the arbitrator for the bus that connected Cm's in the local cluster to their Kmap; the Linc managed communication from the Kmap to other Kmaps; and the mapping processor, or Pmap, responded to requests from the Kbus and Linc, and performed most of the actual computation for a service request. The Pmap also directed the Kbus and Linc to perform any needed operations on behalf of the request being processed.

The Kmap was a transaction controller, sending and receiving message packets that contained requests and replies, following a protocol designed by the microprogrammer. The Kmap routinely mapped memory accesses by one processor into the local memory of another computer module. It needed to do so very rapidly and with little delay to achieve reasonable system performance. Thus, the Kmap contained some special hardware features designed to assist it in controlling many transactions at a high rate of speed.

The Kmap hardware supported eight separate Pmap processes, known as contexts, each with its own set of general-purpose registers and its own microsubroutine stack. Typically, each context was in charge of one transaction. When a context needed to wait for a message packet to return with the reply to some lower-level request, it had the Pmap switch to another context so that work on another transaction could proceed concurrently.

The Kmap also contained , words of random-access memory, called the data RAM, which the Pmap could read and write with a delay of a few microinstructions. Because the data RAM was shared by all contexts, it typically held information that was of interest to more than one transaction, such as cached pieces of address-translation tables and mechanisms for synchronizing the use of other resources among different contexts.

Taken individually, each Kmap appeared to the Pmap microprogrammer as a nonpreemptive, hardware-scheduled multiprogramming system (since contexts were never interrupted, but suspended only when the microprogrammer directed). Taken collectively, the network of all Kmaps presented the microprogrammer with a distributed system based on message-packet intercommunication.

The Interface Between Kmap and Computer Module
The Pmap communicated with the computer modules in its cluster via the map bus, a packet-switched bus controlled by the Kbus. The Kbus fielded requests and replies from the computer modules, coordinated the transfer of data across the map bus between computer modules or between a computer module and a Pmap context, and kept track of which Pmap contexts were free to service new requests. Two queues, the Kbus out queue and the Pmap run queue, provided the interface between the Kbus and Pmap.

A request by a computer module to the Kmap is said to invoke some Kmap operation. The most frequent Kmap operation was the mapping of a nonlocal access to a location in the physical memory of some computer module in the cluster. This was called a mapped reference. Each memory access issued by the processor of a computer module passed through its Slocal, which either routed the access directly to local memory or sent it out to the Kmap. A memory access handled by the Kmap also could be mapped back to the local memory of the issuing processor, but a direct local access was about three times faster.

For more complicated Kmap operations, the Pmap microprogrammer generally appropriated some subset of the computer module processor's virtual address space, designating specific addresses the processor could use to invoke other Kmap operations. These special Kmap operations were invoked in exactly the same manner as I/O operations are invoked on a computer that has memory-mapped device control registers.
The Interface Between Kmap and Kmap
Kmaps communicated with each other via an intercluster bus, a packet-switched bus that was jointly controlled by the Linc processors in each of the directly connected Kmaps. The Linc maintained queues of incoming and outgoing messages, interacted with the Kbus to activate and reactivate Pmap contexts, and provided the local storage for Pmap contexts to construct and inspect intercluster messages. Each Linc was interfaced to two independent intercluster buses.

An intercluster message contained up to eight -bit words of data, of which all words except the first were totally uninterpreted by the Linc. Each intercluster message was sent from an immediate source Kmap to an immediate destination Kmap. The number of the destination Kmap appeared in a fixed place in the message so that the Linc could determine which messages were sent to its cluster. Intercluster messages were of two types: forward messages, which invoked a new context at the destination Kmap, and return messages, which returned to a waiting context at the originating Kmap. A return message contained the context number of the to-be-reactivated Pmap context in a fixed place where the Linc could find it in order to inform the Kbus. These intercluster messages were designed to be used as a mechanism for implementing remote procedure calls between Kmaps.
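As a rough model of the packet format implied by this description (an illustration added here, not the real Cm* encoding; all field names and the layout are invented), an intercluster message might be represented as:

    /* Sketch of an intercluster message as described above: up to eight
       16-bit words, the first interpreted by the Linc, the rest opaque.   */
    #include <stdint.h>

    enum msg_type { MSG_FORWARD, MSG_RETURN };  /* forward vs. return message */

    struct intercluster_msg {
        enum msg_type type;   /* forward: invoke a new context at the
                                 destination; return: wake a waiting context */
        uint8_t  dest_kmap;   /* immediate destination Kmap (fixed position) */
        uint8_t  context;     /* for return messages: Pmap context to wake   */
        uint16_t data[7];     /* remaining words, uninterpreted by the Linc  */
    };

A Kmap with no direct bus to the destination would forward such a packet through intermediate Kmaps, as described next.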
In a configuration of the Cm* system, it was quite possible that two particular Kmaps had no intercluster bus in common and thus could not send messages directly to one another. Each Kmap connected directly to two intercluster buses, however, and as long as some path through a series of intermediate Kmaps could be found, the two Kmaps in question could still communicate, provided each of the intermediate Kmaps cooperated by forwarding the message closer to its ultimate destination.

Communication with Cm*
When software was first being developed on Cm*, there arose a need to let software developers allocate certain resources (such as clusters and debugging tools) themselves and to protect these resources from being disturbed by other users. The Cm* Host system was a serial-line-oriented resource management facility. It was initially implemented on a PDP-/ with  K words of memory.

Several communication lines connected the Host to components of Cm*, to terminals, and to other Carnegie Mellon University Computer Science Department computers. There were serial-line connections from the Cm* Host to ten individual Cm's, two of which were in each of the five clusters. Because the other  Cm's lacked serial-line connections, they needed to be loaded from other Cm's via a Kmap or from peripheral devices. The communication lines served two purposes. First, they allowed the user to communicate with other computers from Cm*, as some programs on Cm* interacted with programs on other machines. They also permitted the user to monitor debugging information on both machines simultaneously from one terminal.

In addition to its  Cm's, Cm* had three LSI-'s, known as hooks processors, which were used for debugging the Kmaps. These processors controlled the "hooks," which was a collective term for a set of hardware that was designed into each Kmap to permit complete external control and diagnosis. The hooks consisted of several control registers and other hardware within the Kmap, an LSI- interface to make the hardware accessible from an LSI-, and a bidirectional hooks bus used to transmit information between the control hardware and the LSI- bus interface. The hooks appeared to an LSI- as a group of eight words in its physical address space. By reading and writing these words, the hooks processor had almost total control over the internals of the Kmap. It could load microcode; start, stop, and single-cycle the Pmap/Linc and Kbus clocks; read out most, and write some, of the internal registers of the Pmap; disable certain error checks within the Pmap; and initialize the Kmap.

The diagnostic processor (DP) was added to the Cm* system to collect hardware reliability and availability information. It hosted a program called the Auto-Diagnostic Master, which ran diagnostics on Cm's that were otherwise idle. The DP was an LSI- with  K words of memory. It had two serial-line connections to the Cm* host. One connection provided a user interface through which one could request status reports about particular Cm's, particular clusters, or the entire Cm*.
The second serial-line interface was used by the Auto-Diagnostic program as a command interface to the Host. The program logged in over the second line, directed the Host to run diagnostics for it, and, on operator request, transferred the statistics file to a PDP-.

Because the system software developed for Cm* was written, compiled, and stored on a remote machine, object code frequently had to be transferred to Cm*. To facilitate high-speed transfers, a DA Link was developed to provide a -megabaud parallel DMA link between Cm* and a DEC-. Between an unloaded LSI- and an unloaded DEC KL-, these links could transfer more than , words per second. A file-transfer system residing on the DEC- provided reliable file transfers.

Cm* Experiments
An extensive set of experiments was conducted with Cm* involving two different operating systems. Many of the experiments focused on speedup, the ratio of the elapsed time required by the one-processor version of a parallel algorithm to the elapsed time for its N-processor counterpart. Speedup is usually between  and N. Three factors were responsible for the shape of the speedup curve (e.g., speedup as a function of the number of processors): algorithm penalty, implementation penalty, and their interaction [].
of Florida, Gainesville. ACM/SIGARCH, ACM, New York, pp
The algorithm penalty is composed of the separa- – (reprinted as Computer Architecture News  ())
tion overhead (cost of process decomposition and data . Fuller SH, Ousterhout JK, Raskin L, Rubinfeld P, Sindhu PS,
partitioning) and the reconstitution overhead (cost of Swan RJ () Multi-microprocessors: an overview and working
the interchange and reporting of intermediate and final example. Proc IEEE ():–
results). The implementation penalty is composed of . Gehringer EF, Siewiorek DP, Segall Z () Parallel processing:
the Cm* experience. Digital Press, Bedford
access overhead (cost of accessing shared resources) and
. Mohan J, Jones AK, Gehringer EF, Segall ZZ () Granularity of
the contention for these shared resources. The interac- parallel computation. In: Gallizzi EL et al. (eds) Proceedings of the
tion between the algorithm and implementation leads eighteenth Hawaii international conference on system sciences,
to two other types of overhead: the overhead of synchro- Hawaii, vol . IEEE, Washington, DC, pp –
nization and the cost of adapting a parallel algorithm to . Siegel HJ () Interconnection networks for large-scale parallel
erocessing. Lexington Books, Lexington
a specific architectural implementation.
. Swan RJ () The switching structure and addressing archi-
In addition, at the macro level, parallel algorithms tecture of an extensible multiprocessor, Cm*. PhD dissertation,
on Cm* were divided into six classes: asynchronous, Carnegie-Mellon University, Pittsburgh
synchronous, multiphase, partitioning, pipeline, and . Swan RJ, Fuller SH, Siewiorek DP () Pittsburgh Cm*: a mod-
transaction processing. These classes of parallel algo- ular, multi-processor. In: Proceedings of the national computer
rithm structures represented distinct cross-section of conference, AFIPS, Texas, USA, pp –
. Vrsalovic D, Gehringer EF, Segall ZZ, Siewiorek DP () The
the algorithm-overhead and implementation-overhead
influence of parallel decomposition on the performance of multi-
profiles. Once a parallel algorithm was classified into processor systems. In: Proceedings of the th annual symposium
one of the six categories, one could predict how it on computer architecture, Boston. IEEE Computer Society Press,
would perform. Further, the characteristics of parallel Los Alamitos, CA, pp –
 C CML

Execution model
CML The coarray execution model is based on the Single-
Program-Multiple-Data (SPMD) model where replica-
Concurrent ML
tions of a single program execute independently within
their separate memory spaces. Each replication is called
an image, and an intrinsic function,

CM-Lisp p = num_images()

returns the number of images at run-time. The run-


Connection Machine Lisp time system assigns local memory to each image, and
each image executes asynchronously along independent
paths through the program based on its own data and its
own unique image index. Another intrinsic function,
CnC me = this_image()
Concurrent Collections Programming Model returns the image index at run-time. Image indices start
at image number one following Fortran’s default rule for
lower bounds.
At times the independent images need to interact
Coarray Fortran with each other to coordinate their activities. The pro-
grammer inserts image control statements in the pro-
Robert W. Numrich gram at appropriate points to synchronize images and
City University of New York, New York, NY, USA to maintain memory consistency across images. The
statement,

Synonyms sync all


Co-array Fortran, CAF
for example, imposes a full barrier across all images.
Each image, when it encounters this statement, com-
Definition pletes all memory activity, both local and remote, and
The coarray programming model is an extension to the then registers its presence at the barrier. No image
Fortran language that provides an explicit syntax and an executes any statement following the barrier until all
explicit execution model for the development of parallel images have registered at the barrier. It is the easiest
application codes. way for the programmer to maintain memory consis-
tency across images because memory in each image
Introduction has reached a well-defined state. The programmer is
Fortran  contains the coarray parallel program- responsible for knowing what state is the correct state
ming model as a standard feature of the language []. across a barrier.
It is the first time that a parallel programming model The programmer may also want to synchronize
has been added to the language as a supported feature, among subsets of images. Pairwise interaction, for
portable across all platforms. Compilers supporting the example, may be accomplished by one image synchro-
model are available or under development from all the nizing with its neighbor to the right,
major compiler vendors.
sync images(me+1)
The coarray programming model consists of two
new features added to the language, an extension of and its partner synchronizing
the normal array syntax to represent data decomposi-
sync images(me-1)
tion plus an extension to the execution model to control
parallel work distribution. with its neighbor to the left.
Coarray Fortran C 

The programmer may also control interaction providing the programmer a powerful tool for defining
among processors by inserting critical segments into the distributed data structures. The only exception is that a
program, variable with the pointer attribute may not be declared
critical as a coarray.
⋮ Codimensions may also be multi-dimensional, C
end critical
type(someType) :: R[0:q,0:*]
where only one image at a time may enter the seg-
allowing the programmer to think of images as logi-
ment. The programmer may also use intrinsic locks to
cally decomposed into, for example, a two-dimensional
introduce more sophisticated locked segments similar
grid []. The final codimension is always an asterisk. An
to critical segments.
alternative form of the intrinsic function,
Coarrays and Codimensions [myP,myQ] = this_image(R)
A coarray is a variable,
returns image indices [myP,myQ] relative to the codi-
real :: x[*] mensions defined by the declaration statement. The pro-
grammer may want, for example, to identify the east-
declared with a codimension. The codimension in
west neighbors [, ] with co-indices [myP+1,myQ]
square brackets [*] means that there is a real variable
and [myP-1,myQ] and north-south neighbors with
with the same name x assigned to the local memory of
co-indices [myP,myQ+1] and [myP,myQ-1].
each image. By default the images are numbered start-
ing with image number one, and the asterisk means
that the upper bound on the codimension equals the Writing Code for the Coarray Model
number of images at run-time. The programmer writes Coarray variables are visible across all images. The pro-
code that is independent of the number of images and grammer may, therefore, write code that moves data
never writes code that assumes a particular number from one image to another by pointing to it with a
of images. If the programmer wants to think of image co-index. For example, the code
indices starting with number zero, or any other number real :: x[*],y
p, the alternative declaration, me = this_image()
y = x[me+1]
real :: x[p:*]
moves the value of the variable x[me+1], located in the
allows for complete flexibility. The upper bound for the
local memory of the image to the right, into the variable
codimension is always an asterisk, and image indices lie
y, located in the local memory of the image executing
between p and num_images()-p+1.
the statement. The variable y exists in the local memory
A normal array in Fortran is a set of scalar variables,
of each image but, since it is not a coarray variable, it is
real :: y(n) visible only locally to each image. An image may also
move values in the other direction,
each member of the set labeled by an integer dimen-
sion index. In the same way, a coarray is a set of general x[me-1] = y
objects, scalars, for example, in declaration () or arrays,
defining the value of x[me-1] in the local memory of
real :: z(n)[p:*] the image to the left.
An image executes each statement it encounters
in declaration (). Each image has an object of the same
according to the normal execution rules of Fortran.
name, same type, and same size assigned to its local
The image index in square brackets has nothing to do
memory, and each object in the set is labeled by an
with execution. The images to the right and to the left
integer codimension index. Coarray variables may be
know nothing about what other images may be doing
declared for any type including derived types,
with values of variables in their local memories. It is
type(someType) :: Q[p:*] the programmer’s responsibility to insert image control
 C Coarray Fortran

statements into the program to control memory activ- N

ity so that the state of memory across images is the state


the program expects for correct behavior. For example, [p,q+1]
suppose image number one reads the value of an input
variable that every image needs,
W [p-1,q] [p,q] [p+1,q] E
real :: n[*]
me = this_image()
if(me == 1) then
read(*,*) n
[p,q-1]
sync images(*)
else
sync images(1) S
n = n[1]
end if Coarray Fortran. Fig.  Representation of a
two-dimensional domain decomposition using
After reading the coarray variable n from stan-
codimensions for north-south-east-west directions
dard input, image number one executes a sync
images(*) statement notifying all the other images
that it has the value in its local copy of the vari- real,dimension(n,n),codimension[p,*]::a,b,c
able. The other images, meanwhile, execute the sync do R=1,p
c(:,:) = c(:,:) + &
images(1) statement waiting for a signal from image matmul(a(:,:)[myP,R],b(:,:)[R,myQ])
number one. When they receive it, they each individ- end do
ually read the value n[1] from image number one
without regard to what other images may be doing. They computes the matrix multiplication in parallel for a
each are guaranteed to receive the correct value of the [p,p] image decomposition and a global problem
variable. size that is a multiple of p. Each image accumulates
Multiple codimensions express the relationship the block of the result matrix that it owns by sum-
between north-south and east-west neighbors in a nat- ming together the products of the blocks of the first
ural way. A coarray variable with declaration matrix in the same row of images with the blocks of
the second matrix in the same column of images [].
real :: v(0:m+1,0:n+1)[proc,*]
When image indices are omitted from a coarray in a
might represent a velocity field decomposed into a two- statement, such as c(:,:), the indices default to the
dimensional grid with overlapping halo cells around the local values, c(:.:)[myP,myQ], as in this example.
edges to support, for example, a finite difference oper- This convention saves the programmer from specify-
ator [, ]. At certain points in the calculation, these ing redundant information and helps the compiler to
halo cells must be updated to maintain data consistency generate optimized code.
across images. With image indices as shown in Fig. ,
each image with image index [p,q] executes the halo Object-Oriented Design with Coarrays
exchange in the east-west direction, Starting with the Fortran  standard [], Fortran
is now an object-oriented language. Objects declared
v(:,n+1) = v(:,1)[p+1,q]
as coarrays provide a powerful combination for defin-
v(:,0) = v(:,n)[p-1,q]
ing distributed data structures with associated methods
using a natural representation in coarray syntax. Notice for communication between objects owned by different
that each statement moves an entire column of data images.
from one image to another. The north-south exchange
is similar. Abstract Maps
Multiple codimensions also represent linear algebra The coarray model contains no distributed data struc-
operations naturally using blocked matrix decomposi- tures. All objects are local objects assigned to the local
tions []. The following code, for example, memory of each image. Coarray objects are visible to
Coarray Fortran C 

other images but other objects are not. To create dis- AbstractMap
tributed objects, the programmer defines derived types
that are analogous with classes. Objects of this type ObjectMap
may have a complicated structure that describes how
the programmer wants data distributed across images. VectorMap SparseMatrixMap
C
Multi-dimensional coarrays may not be enough to
describe these structures. DenseMatrixMap

One way to describe complicated distributions is


Coarray Fortran. Fig.  Problem decomposition
through extensions of an abstract map,
represented as extensions of an abstract map
type,abstract,public :: AbstractMap
contains
procedure(getNumberOfObjects), & distribution. The vector map might be extended to rep-
public,pass(this),deferred :: &
getNumberOfObjects
resent a block matrix structure with one vector map
procedure(getImageIndex), & for the columns and another vector map for the rows.
public,pass(this),deferred :: & Another extension might represent a sparse vector or a
getImageIndex
sparse matrix structure [, ].
procedure(getLocalIndex), &
public,pass(this),deferred :: & An object map is a composite function,
getLocalIndex
⋮ Lp = Λp (Π(G)) ,
end type AbstractMap
abstract interface
loosely based on the Composite Design Pattern [, ]
integer function getNumberOfObjects(this) as shown in Fig. . A global set of n objects G =
import :: AbstractMap {G , . . . , Gn } is first permuted,
class(AbstractMap),intent(in) :: this
end function GetNumberOfObjects π = Π(G) ,

end interface
such that
π j = Πij (Gi ) i, j = , . . . , n .
This abstract object contains nothing more than a set of
interfaces for deferred procedures that every map must Subsets of these permuted objects are then projected to
p p
contain. For example, the function np local objects Lp = {L , . . . , Lnp } on specific images,
getNumberOfObjects() returns the number of Lp = Λp (π) ,
objects represented by the map. Another deferred func-
tion, getImageIndex(k), returns the image index such that
p j
that owns global object k in its own local memory. Lk = (Λp )k (π j ) ,
Another function, getLocalIndex(k), returns the k = , . . . , np ,
object index used locally by the image that owns the
p = , . . . , num_images()
global object k. Other functions, not listed, return the
global object index corresponding to a local object, represents local block k on image p.
the number of objects owned by a specific image, and The composite function has an inverse so that each
so forth. Any concrete map that extends this abstract image knows how its set of local objects is related to the
map must implement each of these functions plus other set of global objects. Figure , for example, shows how
procedures specific to a particular distribution. image q locates its neighbor to the left for its second
q
Specific concrete maps are extensions of the abstract local object L as the first local object L on image num-
p
map as shown in Fig. . One extension, called an object ber  and its neighbor to the right as local object L on
map, might represent the decomposition of a set of image number p. These relationships are encapsulated
unspecified objects. A specific kind of object might be a in procedures associated with each object that performs,
section of a vector with a corresponding map represent- for example, halo exchanges between distributed matrix
ing a particular decomposition such as a block-cyclic objects.
 C Coarray Fortran

G1 G2 Gj−1 Gj Gj+1 Gn

π1 π2 πk πk+1 πk+2 πn

L11 L12 Lq1 Lq2 Lq3 Lp1

Coarray Fortran. Fig.  A composite object map

G1 G2 Gj−1 Gj Gj+1 Gn

π1 π2 πk πk+1 πk+2 πn

L11 L12 Lq1 Lq2 Lq3 Lp1

Coarray Fortran. Fig.  Inverse map to nearest neighbors

Blocked Vector Maps Vector maps might also be extended to represent


An important distributed object is a blocked vector. In matrix maps using one vector map for rows of the
this case the objects to be mapped are blocks of a global matrix and another vector map for the columns of the
vector of size n where each block is a subsection of the matrix. Maps for irregular sparse matrices may also be
global vector block(n1:n2) with some rule for pick- designed where the objects represented in the map are
ing the lower bound n1 and the upper bound n2 for sparse blocks of the matrix arranged and distributed
each block. These blocks are assigned to images, and the in specific ways to match the requirements of specific
map keeps track of how many blocks go to each image, algorithms for direct or iterative equation solvers. Fur-
how the blocks are numbered locally on each image, and thermore, distributed data structures may be redefined
the relationship of the local blocks to the original global dynamically to reflect changes in the physical system as
vector. the computation advances in time. A new map is cre-
The programmer defines a Type VectorMap as ated and the data is redistributed from the old map to
an extension to the Type AbstractMap with what- the new map without having to change the details of the
ever components it needs to define a map for a blocked numerical algorithm.
vector and to implement the deferred functions, adding
new functions as needed. For example, there may be a Blocked Vectors
function, getBlockSize(k), that returns the size of A blocked vector object,
block k. The programmer also writes a constructor, a Type :: BlockedVector
function with the same name as the type, Type(VectorMap),pointer,private :: &
map
real,allocatable,target,private :: &
Type(VectorMap) :: map
block(:,:)
map = VectorMap(n,blockSize) contains
procedure,public,pass(this) :: &
getPointerToBlock
that fills in all the required information for map- procedure,public,pass(this) :: &
ping a global vector of length n into blocks of size haloExchange
blockSize and assigning them to the local memory ⋮
of each image. Other forms of the constructor, of course,
end type BlockedVector
might produce a more irregular decomposition of the
global vector with different block sizes on each image contains the data for each block in an allocatable
and different numbers of blocks on each image. component block(:,:) and a map that describes
Coarray Fortran C 

the distribution of the global vector. One form of the type(BlockedVector),codimension[*]::h,u,v


constructor for a concrete object of this type
The constructor for the vector map builds a map,
Type(BlockedVector) :: v
v = BlockedVector(map) type(VectorMap) :: map
map=VectorMap(n,k,p,w) C
accepts an existing vector map as an argument. The
constructor uses the map to determine the number of that describes a field with n grid points cut into blocks
blocks and the block sizes for each image and allocates of size k distributed over p images. The halo width w
space appropriately, equals one, wide enough for a two-point stencil for the
Type(BlockedVector) &
first-order difference operator. Three calls to the con-
function BlockedVector(map) result(this) structor for a blocked vector create the three fields h, u,
Type(VectorMap),target,intent(in)::map and v,
this%map=>map
k=map%getNumberOfLocalBlocks() h=BlockedVector(map)
do i=1,k u=BlockedVector(map)
n = map%getBlockSize(i) v=BlockedVector(map)
allocate(this%block(0:n+1,i)) h=h0
end do
end function BlockedVector based on the same predefined vector map guaranteeing
that all three fields have the same distribution. The state-
It also assigns its vector map component to point to the ment h=h0 uses an overloaded assignment statement to
dummy map argument. Notice that the constructor has initialize the field h from the array h0(:) containing
allocated halo cells of width one to each block. The halo its initial values. The other fields have initial value zero
width may be an optional argument to the constructor set by their constructors.
and may in fact be zero. Having created field objects, the programmer
decides to let each image perform work on the local
Application to Partial Differential blocks that it owns. For each of its local blocks, an image
Equations obtains the length of the block and a pointer into the
Consider a finite difference scheme applied to the one- block, with or without halos depending on how it is
dimensional shallow water equations []. The partial used in the difference formula. Each image performs the
differential equations, defined by Cahn [] and by appropriate finite difference operation independently
Arakawa and Lamb [], for the surface height h(x, t) and of the others. Synchronization among images occurs
the two velocity components u(x, t) and v(x, t), as func- within the halo exchange operation, which uses coar-
tions of the space variable x and the time variable t, are ray syntax internally to update overlapping halo regions.
the equations [, Eqs. –, p. ] With a loop over some predetermined number of time
steps, tMax, the code might look like the following:
∂u ∂h
− Fv + G = do t=1,tMax
∂t ∂x do b=1,u%getNumLocalBlocks()
∂v m = u%getBlockLength(b)
+ Fu =  hPtr => h%pointerToBlock(b)
∂t uPtr => u%pointerToBlockWithHalo(b)
∂h ∂u hPtr(1:m) = hPtr(1:m) &
+H =.
∂t ∂x -0.5*H*(dt/dx)*(uPtr(2:m+1) &
-uPtr(0:m))
end do
In these equations, F is the Coriolis frequency, G is the call h%haloExchange()
acceleration of gravity, and H is the mean height of the do b=1,u%getNumLocalBlocks()
surface, which is assumed to be small relative to the m = u%getBlockLength(b)
width of the space interval. hPtr => h%pointerToBlockWithHalo(b)
uPtr => u%pointerToBlock(b)
The fields u,v,h are represented as blocked vectors vPtr => v%pointerToBlock(b)
declared as coarrays, uPtr(1:m) = uPtr(1:m)+F*dt*vPtr(1:m) &
 C Code Generation

- 0.5*G*(dt/dx)*(hPtr(2:m+1) & (eds) Developments in teracomputing: proceedings of the ninth


- hPtr(0:m)) ECMWF workshop on the use of high performance comput-
vPtr(1:m) = vPtr(1:m) - F*dt*uPtr(1:m) ing in meteorology, Reading, – Nov . World Scientific,
end do pp –
call u%haloExchange() . Cahn A Jr () An investigation of the free oscillations of a
call v%haloExchange()
simple current system. J Meteorol ():–
end do . Fox GC, Otto SW, Hey AJG () Matrix algorithms on a hyper-
cube I: matrix multiplication. Parallel Comput :–
All the details of data distribution and all the details of
. Gamma E, Helm R, Johnson R, Vlissides J () Design patterns:
how to exchange data between objects is hidden from elements of reusable object-oriented software. Addison-Wesley,
the programmer. The blocked vector objects themselves Reading
contain all the necessary information, and the proce- . Metcalf M, Reid J, Cohen M () Fortran / explained.
dures associated with them know how to perform the Oxford University Press, Oxford
. Numrich RW () A parallel numerical library for co-array
required operations.
Fortran. In: Parallel processing and applied mathematics: pro-
ceedings of the sixth international conference on parallel pro-
Summary cessing and applied mathematics (PPAM), Poznan, – Sept
Fortran is a modern programming language. It is an . Lecture notes in computer science, LNCS, vol . Springer,
object-oriented language, defined by the Fortran  pp –
standard, and it is also a parallel language, defined . Numrich RW () Parallel numerical algorithms based on
tensor notation and co-array Fortran syntax. Parallel Comput
by the Fortran  standard. Programmers are able
:–
to develop modern application codes using object- . Numrich RW, Reid J, Kim K () Writing a multigrid solver
oriented design combined with the coarray parallel using co-array Fortran. In: Kågström B, Dongarra J, Elmroth
programming model. Development within a single lan- E, Waœniewski J (eds) Applied parallel computing: large scale
guage avoids many of the complications that occur scientific and industrial problems, th international workshop,
PARA, Umeå. Lecture notes in computer science, vol .
from differences between languages. New compilers will
Springer, pp –
support the model and will make application codes . Reid J () Coarrays in the next Fortran standard. ISO/IEC
portable across all platforms, and hardware vendors are JTC/SC/WG N
designing new architectures that will support the model
effectively.

Related Entries Code Generation


Chapel (Cray Inc. HPCS Language)
Fortran  and Its Successors Cédric Bastoul
MPI (Message Passing Interface) University Paris-Sud  - INRIA Saclay Île-de-France,
OpenMP Orsay, France
SPMD Computational Model
Titanium
UPC Synonyms
Polyhedra scanning
Bibliography
. Arakawa A, Lamb VR () Computational design of the basic
dynamical processes of the UCLA general circulation model. Definition
Meth Comput Phys :– Parallel code generation is the action of building a par-
. Balaji V, Clune TL, Numrich RW, Womack BT () An archi- allel code from an input sequential code according to
tectural design pattern for problem decomposition. In: Workshop some scheduling and placement information. Schedul-
on patterns in high performance computing, Champaign-Urbana,
ing specifies the desired order of the statement instances
– May 
. Burton PM, Carruthers B, Fisher GS, Johnson BH, Numrich RW with respect to each other in the target code. Placement
(). Converting the halo-update subroutine in the met office specifies the desired target processor for each statement
unified model to co-array fortran. In: Zwiehofer W, Kreitz N instance.
Code Generation C 

Discussion Two mathematical objects need to be defined for


the parallel code generation problem. First, iteration
Introduction
domains provide the relevant information for code
Exhibiting parallelism in a sequential code may require
generation from the input code, they are detailed in
complex sequences of transformations. When they
come from an expert in program optimization, they
Sect. on Representing Statement Instances: Iteration C
Domains. Second, space-time mapping functions pro-
are usually expressed by way of directives like tile or
vide the ordering and placement information to be
fuse or skew. When they come from a compiler, they
implemented by the target code. They are described in
are typically formulated as functions that map every
Sect. on Representing Order and Placement: Mapping
execution of every statement of the program in “time”
Functions.
(scheduling), to order them in a convenient way, and in
“space” (placement), to distribute them among various
processors. Representing Statement Instances: Iteration
Parallel code generation is the step in any program Domains
restructuring tool or parallelizing compiler that actu- The key aspect of the polyhedral model is to consider
ally generates a target code which implements the user statement instances. A statement instance is one partic-
directives or the compiler mapping functions. ular execution of a statement. Each instance of a state-
This task is challenging in many ways. Feasibility ment that is enclosed inside a loop can be associated
is the first challenge. Reconstructing a program with with the value of the outer loop counters (also called
respect to scheduling and placement information at iterators). For instance, let us consider the polynomial
the statement execution level may seem an impossi- multiply code in Fig. : the instance of statement S1 for
ble problem at first sight. It has been addressed by i =  is z[2] = 0.
working on a convenient representation of the prob- In the polyhedral model, statements are considered
lem itself, as it will be described momentarily. Scala- as functions of the outer loop counters that may produce
bility is another important issue. As compiler vendors statement instances: instead of simply “S1,” the nota-
try to ensure that compile time is (almost) linear in tion S1(i) is preferred. For instance, statement S1 for
the code length, the code generation step must be fast i =  is written S1(2) and statement S2 for i =  and
enough to be integrated into production tools. Quality j =  is written S1(). The vector of the iterator values
of the generated code must be paramount. The gener- is called the iteration vector.
ated code should not be too long and should not include Obviously, dealing with statement instances does
heavy control overhead that may offset the optimiza- not mean that unrolling all loops is necessary. First
tion it is enabling. Finally, flexibility must be provided because there would probably be too many instances to
to allow a large span of transformations on a large set of deal with, and second because the number of instances
programs. may not be known. For example, when the loops are
bounded with constants that are unknown at compile
time (called “parameters”), for instance, n in the exam-
Representation of the Problem ple code in Fig. . A compact way to represent all the
To solve the code generation problem, it needs to be for- instances of a given statement is to consider the set of
malized in some way. A mathematical representation, all possible values of its iteration vector. This set is called
known as the polyhedral model (also referred in the lit- the statement’s iteration domain. It can be conveniently
erature as the polytope model or the polyhedron model), described by all the constraints on the various iterators
is a convenient abstraction. It allows the description that the statement depends on. When those constraints
of the problem in a compact and expressive way. This
representation is also the key to solving it efficiently do i = 1, n
z[i] = 0 ! S1
thanks to powerful and scalable mathematical libraries do i = 1, n
as described in Sect. on Scanning Polyhedra. This do j = 1, n
model, with slight variations, has been used in most z[i+j] = z[i+j] + x[i] * y[j] ! S2
successful work on parallel code generation. Code Generation. Fig.  Polynomial multiply kernel
 C Code Generation

t = 1
!$OMP SECTIONS
!$OMP SECTION
x = a + b ! S1
x = a + b ! S1 qS1 = 1
y = c + d ! S2 qS2 = 2
!$OMP SECTION
z = a * b ! S3 qS3 = 1
z = a * b ! S3
!$OMP END SECTIONS
t = 2
y = c + d ! S2
a Original code b Scheduling c Target code

Code Generation. Fig.  One-Dimensional scheduling example

are affine and depend only on the outer loop counters In the case of scheduling, the logical dates express
and some parameters, the set of constraints defines a at which time a statement instance has to be executed,
polyhedron (more precisely this is a Z-polyhedron, but with respect to the other statement instances. It is typ-
polyhedron is used for short). Hence the name “polyhe- ically denoted θ S for a given statement S. For instance,
dral model.”A matrix representation with the following let us consider the three statements in Fig. a and their
general form for any statement S is used to facilitate the scheduling functions in Fig. b. The first and third state-
manipulation of the affine constraints: ments have to be executed both at logical date . This
means they can be executed in parallel at date  but they
DS = {xS ∈ ZnS ∣ AS xS + aS ≥ }
have to be executed before the second statement since
where xS is the nS -dimensional iteration vector, AS is its logical date is . The target code implementing this
a constant matrix and aS is a constant vector, possibly scheduling using OpenMP pragmas is shown in Fig. c,
parametric. For instance, here are the iteration domains where a fictitious variable t stands for the time. It can be
for the polynomial multiply example in Fig. : seen that at time t = , both S and S are run in parallel,
""⎡ while S is executed afterward at time t = .



⎪ """⎢  ⎤ ⎥ ⎛ − ⎞ ⎫



⎪ """⎢ ⎢

⎥ ⎜ ⎟ ⎪ Logical dates may be multidimensional, like clocks:
● DS = ⎨( i ) ∈ Z ""⎢ ⎥ ( i )+⎜ ⎟ ≥ ⎬
⎪ "
"""⎢ ⎪ the first dimension corresponds to days (most signif-


⎪ ""⎢ − ⎥
⎥ ⎝ n⎠ ⎪


⎩ ⎣ ⎦ ⎭ icant), next one is hours (less significant), the third
⎧ ⎡ ⎤ ⎫

⎪ ⎢   ⎥ ⎛ − ⎞ ⎪
⎪ to minutes, and so on. The order of multidimensional

⎪ ⎢ ⎥ ⎪


⎪ ⎢ ⎥ ⎜ ⎟ ⎪



⎪⎛ ⎞ ⎢ ⎢
⎥⎛ ⎞ ⎜
⎥ ⎜




⎪ dates with a decreasing significance for each dimen-

⎪ i ⎟ 
⎢ −  ⎥ i ⎜ n ⎟ ⎪

● = ⎨⎜  ⎥⎜ ⎟+⎜
DS ⎜ ⎟ ∈ Z ⎢ ⎥⎜ ⎟ ⎜
⎟ ≥  ⎬


⎪⎝ ⎠ ⎢ ⎢   ⎥ ⎝ ⎠

⎜ − ⎟ ⎪

⎪ sion is called the lexicographic order. Again, it is not

⎪ j ⎢ ⎥ j ⎜ ⎟ ⎪


⎪ ⎢ ⎥ ⎜ ⎟ ⎪


⎪ ⎢ ⎥ ⎪
⎪ possible to assign one logical date to each statement

⎪ ⎢  − ⎥ ⎝ n ⎠ ⎪

⎩ ⎣ ⎦ ⎭
instance for two reasons: this would probably lead to an
intractable number of logical dates and the number of
Representing Order and Placement: Mapping
instances may not be known at compile time. Hence, a
Functions
more compact representation called the scheduling func-
Iteration domains do not provide any information about
tion is used. A scheduling function associates a logical
the order in which statement instances have to be exe-
date with each statement instance of a statement. They
cuted, nor do they inform about the processor that has
have the following form for a statement S:
to execute them. Such information is provided by other
mathematical objects called space-time mapping func-
tions. They associate each statement instance with a θ S (x S ) = TS xS + t S ,
logical date when it has to be executed and a processor
coordinate where it has to be executed. In the literature, where x S is the iteration vector, TS is a constant
the part of those functions dedicated to time is called matrix, and t S is a constant vector, possibly paramet-
scheduling while the part dedicated to space is called ric. Scheduling functions can easily encode a wide range
placement (or allocation, or distribution). of usual transformations such as skewing, interchange,
Code Generation C 

reversal, shifting tiling, etc. Many program transforma- any ordering information: Iteration domains are noth-
tion frameworks have been proposed on top of such ing but “bags” of unordered statement instances. On the
functions, the first significant one being UTF (Uni- opposite, the space/time mapping functions, typically
fied Transformation Framework) by Kelly and Pugh in computed by an optimizing or parallelizing algorithm,
 []. provide the ordering information for each statement C
Placement is similar to scheduling, only the seman- instance. It is necessary to collect all this information
tics is different: instead of logical dates, a placement into a polyhedral representation before the actual code
function π S associates each instance of statement S with generation. There exist two ways to achieve this task. Let
a processor coordinate corresponding to the processor us consider an iteration domain defined by the system
that has to execute the instance. of affine constraints Ax + a ≥  and the transforma-
A space-time mapping function σS is a multidimen- tion function leading to a target index y = Tx. Any of
sional function embedding both space and time infor- the following formulas can be chosen to build the target
mation for statement S: some dimensions are devoted polyhedron T that embeds both instance and ordering
to scheduling while some others are dedicated to place- information:
ment. For instance, a compiler may suggest the fol-
Inverse Transformation By noticing that x = T − y it
lowing space-time mapping for the polynomial mul-
follows that the transformed polyhedron in the new
tiply code shown in Sect. on Representing Statement
coordinate system can be defined by:
Instances: Iteration Domains. Its first dimension is a
placement that corresponds to a wavefront parallelism T : {y ∣ [AT − ]y + a ≥ } .
for S2 and improves locality by executing the initializa-
tion of an array element by S on the same processor Generalized Change of Basis Alternatively, new dimen-
where it is used by S. The second dimension is a very sions corresponding to the ordering in leading posi-
simple constant scheduling that ensures the initializa- tions can be introduced (note that in the following
tion of the array element is done before its use (it is usual formula, constraints “above” the line are equalities
to add the identity schedule at the last dimensions; how- while constraints “under” the line are inequalities):
ever this will not be necessary for the continuation of ⎧
⎪ ⎡ ⎤ ⎫

⎪⎛ ⎞ ⎢ I −T ⎥
⎥ ⎛ y ⎞ ⎛ −t ⎞ = ⎪
⎪ y ⎟ ⎢

⎢ ⎥⎜ ⎟+⎜ ⎟


this example): T : ⎨⎜
⎜ ⎟ ⎢ ⎥⎜ ⎟ ⎜ ⎟  ⎬.
⎪ ⎪


⎪⎝ x ⎠ ⎢
⎢ 

A ⎥⎝ x ⎠ ⎝ a ⎠ ≥ ⎪ ⎪

⎡ ⎤ ⎩ ⎣ ⎦ ⎭
⎢ ⎥ ⎛⎞
⎢ ⎥
● σS ( i ) = ⎢ ⎥
⎢ ⎥( i ) + ⎜ ⎟
⎜ ⎟ The inverse transformation solution has been intro-
⎢⎥ ⎝⎠
⎢ ⎥
⎣ ⎦ duced since the seminal work on parallel code gen-
⎡ ⎤ eration by Ancourt and Irigoin []. It is simple and
⎛i⎞ ⎢ ⎢  
⎥⎛ i ⎞ ⎛  ⎞
⎥ compact but has several issues: the transformation
● σS ⎜ ⎟ ⎢
⎜ ⎟=⎢
⎥⎜ ⎟ +⎜ ⎟
⎥⎜ ⎟ ⎜ ⎟
⎝j⎠ ⎢ ⎥⎝ j ⎠ ⎝  ⎠ matrix must be invertible, and even when it is invertible,
⎢  ⎥
⎣ ⎦ the target polyhedra may embed some integer points
While working in the polyhedral representation, the that have no corresponding elements in the iteration
semantics of each dimension is not relevant: After code domain (this happens when the transformation matrix
generation, each dimension will be translated to some is not unimodular, i.e., whose determinant is neither +
loops that can be post-processed to become parallel nor −). This necessitates specific code generation pro-
or sequential according to their semantics (obviously cessing, briefly discussed in Sect. on Fourier–Motzkin
semantics information can be used to generate a better Elimination-Based Scanning Method. The second for-
code, but this is out of the scope of this introduction). mula is attributed to Le Verge, who named it the
Generalized Change of Basis []. It does not require
any property on the transformation matrix. Neverthe-
Putting Everything Together less, the additional dimensions may increase the com-
Iteration domains can be extracted directly by analyzing plexity of the code generation process. It has been
the input code. They represent for each statement the rediscovered independently from Le Verge’s work and
set of their instances. In particular, they do not encode used in production code generators only recently [].
 C Code Generation

Both formulas are used, and possibly mixed, in cur- Sect. on Parametric Integer Programming-Based Scan-
rent code generation tools, depending on the desired ning Method. Lastly, Quilleré, Rajopadhye, and Wilde
transformation properties. For instance, to apply the showed how to take advantage of high-level polyhe-
space-time mapping of the polynomial multiply pro- dral operations to generate efficient codes directly [].
posed in Sect. on Representing Order and Placement: As this later technique is now widely adopted in pro-
Mapping Functions, it is convenient to use the Gen- duction environments, it is discussed in some depth in
eralized Change of Basis because the transformation Sect. on QRW-Based Scanning Method.
matrices are not invertible:

⎪ ⎡ ⎤ ⎫
⎪ Fourier–Motzkin Elimination-Based Scanning
⎪ ⎢ − ⎥ ⎪


⎪⎛ p ⎞

⎢
⎢
  ⎥⎛ ⎞ ⎛  ⎞ ⎪


⎪ ⎢ ⎥ p ⎜ ⎟ ⎪
⎪ Method

⎪ ⎥⎜ ⎟ ⎜ ⎟ ⎪

⎪⎜
⎪⎜

⎟ ⎢    ⎥⎜ ⎟ ⎜  ⎟ = ⎪
⎥ ⎜ ⎟ ⎪

● TS = ⎨⎜ ⎟ ∈ Z ⎢

⎥⎜ ⎟ +⎜ ⎪
⎪⎜ ⎟
t ⎢ ⎢ ⎥⎜ t ⎟ ⎜ ⎟

 ⎬ Ancourt and Irigoin [] proposed in  the first solu-




⎜ ⎟
⎟ ⎢    ⎥ ⎜


⎟ ⎜ − ⎟ ≥ ⎪ ⎪



⎪ ⎢ ⎥ ⎜ ⎟ ⎪
⎪ tion to the polyhedron scanning problem. Their work

⎪⎝ i ⎠ ⎢ ⎥⎝ i ⎠ ⎜ ⎟ ⎪


⎪ ⎢ ⎥ ⎝ ⎪


⎪ ⎢   ⎥
− ⎦ n ⎠ ⎪

⎩ ⎣ ⎭ is based on the Fourier–Motzkin pair-wise elimina-

⎪ ⎡
⎢ ⎤ ⎫

tion technique []. The scope of their method was


⎪ ⎢   − − ⎥
⎥ ⎛  ⎞ ⎪



⎪ ⎢ ⎥ ⎜ ⎟ ⎪
⎪ quite restrictive since it could be applied to only one

⎪ ⎢ ⎥ ⎜ ⎟ ⎪

⎪ ⎛ ⎢  ⎥ ⎜ − ⎟ ⎪




p ⎞ ⎢    ⎥⎛ p ⎞ ⎜ ⎟ ⎪


⎪ polyhedron, with unimodular transformation matrices.

⎪ ⎜ ⎟ ⎢ ⎥⎜ ⎟ ⎜ ⎟ ⎪


⎪ ⎜ ⎟ ⎢ ⎥⎜ ⎟ ⎜ ⎟ ⎪


⎪⎜
⎜ t ⎟
⎟ 

⎢     ⎥
⎥⎜
⎜ t ⎟


⎜ − ⎟ = ⎪
⎟ ⎪


● TS = ⎨⎜ ⎟ ∈ Z ⎢ ⎥⎜ ⎟ +⎜ ⎟  ⎬ The basic idea was, for each dimension from the first one
⎪⎜

⎪ ⎜

⎟ ⎢
 ⎢ −
⎥⎜
 ⎥ ⎜

i ⎟



n ⎟ ≥




⎪ ⎜ i ⎟ ⎢   ⎥⎜ ⎟ ⎜ ⎟ ⎪
⎪ (outermost) to the last one (innermost), to project the

⎪⎜ ⎟ ⎢ ⎥⎜ ⎟ ⎜ ⎟ ⎪


⎪ ⎢ ⎥⎝ ⎜ ⎟ ⎪


⎪ ⎝ ⎠ ⎢  ⎥ j ⎠ ⎜ − ⎟ ⎪
⎪ polyhedron onto the axis and to deduce the correspond-

⎪ j ⎢    ⎥ ⎜ ⎟ ⎪



⎪ ⎢ ⎥ ⎜ ⎟ ⎪



⎪ ⎢ ⎥ ⎪


⎪ ⎢    − ⎥ ⎝ n ⎠ ⎪
⎪ ing loop bounds. For a given dimension ik , the Fourier–
⎩ ⎣ ⎦ ⎭
Motzkin algorithm can establish that L(i , ..., ik− ) + l ≤
In the target polyhedra, whatever the chosen for-
ck ik and ck ik ≤ U(i , ..., ik− )+u, where L and U are con-
mula, the order of the dimensions is meaningful: The
stant matrices, l and u are constant vectors of size ml
ordering is encoded as the lexicographic order of the
and mu respectively, and ck is a constant. Thus, the cor-
integer points. Thus, the parallel code generation prob-
responding scanning code for the dimension ik can be
lem is reduced to generating a code that enumerates the
derived:
integer points of several polyhedra, with respect to the
lexicographic ordering. ...
do ik = MAXm
j= ⌈(Lj (i , ..., ik− ) + lj )/ck ⌉,
l

ik ≤ MINm
j= ⌊(Uj (i , ..., ik− ) + uj )/ck ⌋
u

Scanning Polyhedra ...


Once the target code information has been encoded into
Body
some polyhedra that embed the iteration spaces as well
as the scheduling and placement constraints, the code The main drawback of this method is the large amount
generation problem translates to a polyhedra scanning of redundant control since eliminating a variable with
problem. The problem here is to find a code (preferably the Fourier–Motzkin algorithm may generate up to
efficient) visiting each integral point of each polyhedra, n / constraints for the loop bounds where n is the ini-
once and only once, with respect to the lexicographic tial number of constraints. Many of those constraints
order. Three main methods have been successful in are redundant and it is necessary to remove them for
doing this. Fourier–Motzkin elimination-based tech- efficiency.
niques have been the very first, introduced by the semi- Most further works tried to extend this first tech-
nal work of Ancourt and Irigoin []. They are discussed nique in order to reduce the redundant control and to
briefly in Sect. on Fourier–Motzkin Elimination-Based deal with more general transformations. Le Fur pre-
Scanning Method. While Fourier–Motzkin-based tech- sented a new redundant constraint elimination policy
niques aim at generating loop nests, an alternative by using the simplex method []. Li and Pingali []
method based on Parametric Integer Programming has as well as several other authors proposed to relax the
been suggested by Boulet and Feautrier to generate unimodularity constraint of the transformation to an
lower-level codes []. This method is discussed briefly in invertibility constraint by using the Hermite Normal
Code Generation C 

Form [] to avoid scanning “holes” in the polyhedron. Generalizing this method to many polyhedra implies
Griebl, Lengauer and Wetzel [] relaxed the constraints combining the different trees of conditions and sub-
of code generation further to transformation matri- sequent additional control cost and code duplication.
ces with non-full rank, and also presented preliminary While this technique has no widely used implementa-
techniques for scanning several polyhedra using a single tion, it is quite different than the others since it does not C
loop nest. Finally, Kelly, Pugh, and Rosser showed how aim at generating high-level loop statements. This prop-
to scan several polyhedra in the same code by generat- erty may be relevant for specific targets, for example,
ing a naive perfectly nested code and then (partly) elim- when the generated code is not the input of a compiler
inating redundant conditionals []. Their implemen- but of a high-level synthesis tool.
tation relies on an extension of the Fourier–Motzkin
technique called the Omega test. The implementation QRW-Based Scanning Method
of their algorithm within the Omega calculator is one Quilleré, Rajopadhye, and Wilde proposed in  the
of the most popular parallel code generators []. first code generation algorithm that builds a target code
without redundant control directly []. While previ-
ous schemes started from a generated code with some
Parametric Integer Programming-Based redundant control and then tried to improve it, their
Scanning Method technique (referred as the QRW algorithm) never fails
Boulet and Feautrier proposed in  a parallel code at removing control, and the processing is easier. Even-
generation technique that relies on Parametric Inte- tually it generates a better code more efficiently.
ger Programming (PIP for short) to build a code for The QRW algorithm is a generalization to several
scanning polyhedra []. The PIP algorithm computes polyhedra of the work of Le Verge, Van Dongen and
the lexicographic minimal integer point of a polyhe- Wilde on loop nest synthesis using polyhedral opera-
dron. Because the minimum point may not be the same tions []. It relies on high-level polyhedral operations
depending on the parameter values, it is returned as a (polyhedral intersection, union, projection, etc.), which
tree of conditions on the parameters where each leaf are available in various existing polyhedral libraries.
is either the solution for the corresponding parameter The basic mechanism is, starting from () the list of
constraints or ⊥ (called bottom), that is, no solution for polyhedra to scan and () a polyhedron encoding the
those parameter constraints. constraints on the parameters called the context, to
The basic idea of the Boulet and Feautrier algorithm recursively generate each level of the abstract syntax tree
(in the simplified case of scanning one polyhedron) is to of the scanning code (AST).
find the first integer point of the polyhedron, called first, The algorithm is sketched in Fig.  and a simplified
then to build a function next which for a given integer example is shown in Figs. –. It corresponds to the
point returns the next integer point in the polyhedron generation of the code implementing the polynomial
according to the lexicographic ordering. Both first and multiply space-time mapping introduced in Sect. on
next computations can be expressed as a problem of Representing Order and Placement: Mapping func-
finding the lexicographic minimum in a polyhedron. tions. Its input is the list of polyhedra to scan, the con-
Finally, the code can be built according to the following text and the first dimension to scan. This corresponds
canvas, where x is an integer point of the polyhedron to Fig.  in our example, with the first dimension to
that represents the iteration domain: scan being p. The first step of the algorithm intersects
the polyhedra with the context to ensure no instance
outside the context will be executed. Then it projects
x = first them onto the first dimension and separates the pro-
 if x =⊥ then goto  jections into disjoint polyhedra. For instance, for two
Body polyhedra, this could correspond to one domain where
the first polyhedron is “alone,” one domain where the
x = next
second polyhedron is “alone” and one domain where the
goto  two polyhedra coexist. This is depicted in Fig.  for our
 ... example: it depicts the projection onto the p axis and the
 C Code Generation

QRW: build a polyhedron scanning code AST without redundant control.


Input: a polyhedron list, a context C, the current dimension d.
Output: the AST of the code scanning the input polyhedra.

1. Intersect each polyhedron in the list with the context C


2. Project the polyhedra onto the outermost d dimensions
3. Separate these projections into disjoint polyhedra (this generates loops
for dimension d and new lists for dimension d +1)
4. Sort the loops to respect the lexicographic order
5. Recursively generate loop nests that scan each new list with dimension
d +1, under the context of the dimension d
6. Return the AST for dimension d

Code Generation. Fig.  Sketch of the QRW Code Generation Algorithm

j
n Context: n ≥ 3
2 ⎧
1 ⎨p = i
TS1 : t = 0

1 p 1 ≤ i ≤ 2n
2 ⎧
n ⎪
⎪ p=i+j

t=1
TS2 :

⎪ 1≤i≤n

2n 1≤j≤n
i 1 2 n 2n

Code Generation. Fig.  QWR Code Generation Example (/): Polyhedra to Scan and Context Information. The graphical
representation does not show the degenerated scheduling dimension t

S1 alone doall p = 1, 1
t=0
TS1 :
S1 and S2 i=1
Projection
doall p = 2, 2*n
onto (p) t=0
TS1 :
⎧i = p
0 p ⎪
⎪ t=1

 1≤i≤n
p>=2 p<=2n TS2 :

⎪ 1≤j≤n

p=1 i+j =p

Code Generation. Fig.  QWR Code Generation Example (/): Intersection with the Context, Projection, and Separation
onto the First Dimension. Two disjoint polyhedra are created: one where S is alone on p (it has only one integer point but
a loop is generated to scan it, for consistency) and one where S and S are together on p. In the right side, the new
polyhedra to scan have been intersected with the context (for the next step, p is a parameter as well as n)

separation (it can be seen here that the domain where S the next dimension loops for each disjoint polyhedron
is “alone” is empty). The constraints on dimension p for separately. The final result is shown in Fig.  for our
the resulting polyhedra give directly the loop bounds. example.
As the semantics of the placement dimension is to dis- The QRW algorithm is simple and efficient in prac-
tribute instances across different processors, this loop tice, despite the high theoretical complexity of most
is parallel. Then the algorithm recursively generates polyhedral operations. However, in its basic form, it
Code Generation C 

Projection p=1 p>=2 p<=2n doall p = 1, 1


onto (p,i) do t = 0, 0
p do i = 1, 1
1 i>=1 z[i] = 0 ! S1
2 S2 alone doall p = 2, 2*n
do t = 0, 0
n i<=n
do i = p, p C
z[i] = 0 ! S1
S1 alone i>=p−n do t = 1, 1
2n do i = max(1,p-n), min(p-1,n)
i<=p−1 do j = p-i, p-i
i 1 2 n 2n z[i+j] += x[i] * y[j] ! S2
i=p

Code Generation. Fig.  QWR Code Generation Example (/): Recursion on the Next Dimensions. First, the
projection/separation on (p, t) is done. It is trivial because t is a constant in every polyhedron: it only enforces disjonction
and ordering of the polyhedra inside the second doall loop. Next the same processing is applied for (p, t, i): the loop
bounds of the remaining dimensions can be deduced from the graphical representation (the trivial dimension t is not
shown)

tends to generate codes with costly modulo operations, is also solved only partly because only regular codes
and the separation process is likely to result in very long that fit the polyhedral model can be processed and only
codes. Several extensions to this algorithm have been affine transformations can be applied.
proposed to overcome those issues [, ]. CLooG, a
popular implementation of the extended QRW tech-
Future Directions
Two challenges of parallel code generation are partly
nique demonstrated effectiveness of the algorithm [].
solved: quality and flexibility. To achieve the best results,
It is now used in production environments such as in
autoparallelizers have to take into account some con-
GCC or IBM XL.
straints related to code generation that may conflict
with the extraction of parallelism, for example, limiting
Parallel Code Generation Today the absolute value of the transformation coefficients or
For a long time, scheduling and placement techniques relying on unimodular transformations. However, there
were many steps forward code generation capabili- exists an infinity of transformations that implement the
ties. In , Feautrier provided a general scheduling same mapping but have different properties with respect
technique for multiple polyhedra and general affine to code generation. Finding “code generation friendly”
functions []. At this time, the only code genera- equivalent transformations is a promising solution to
tion algorithm available had been designed in  by enhance the generated code quality.
Ancourt and Irigoin and supported only one polyhe- Several directions are under investigation to provide
dron and unimodular scheduling functions []. Some parallel code generation with more flexibility. Irregular
scheduling functions had to wait for nearly one decade extensions have been successfully implemented to some
to be successfully applied by a code generator. polyhedral code generators and ambitious techniques
Since then, the challenge of feasibility has been based on polynomials instead of affine expressions may
tackled: State-of-the-art parallel code generators can be the next step for parallel code generation [].
handle any affine transformation for many iteration
domains. Moreover, the scalability of code generators Related Entries
is good enough to enable parallel code generation as an Loop Nest Parallelization
option in production compilers. However, the quality of Omega Test
the generated code is still not guaranteed. Summarily, Parallelization, Automatic
code generators are very good for simple (unimodular) Polyhedron Model
transformations, reasonably good when the coefficients R-Stream Compiler
of the transformation functions are small and unpre- Scheduling Algorithms
dictable in the general case. The flexibility challenge Unimodular Transformations
 C Collect

Bibliographic Notes and Further . Le Fur M () Scanning parameterized polyhedron using
Reading Fourier-Motzkin elimination. Concurrency – Pract Exp ():
–
We detailed in Sect. on Scanning Polyhedra the three
main techniques designed for parallel code generation. The reader will find a deeper level of detail in the related papers. Kelly, Pugh, and Rosser's paper on code generation for multiple mappings provides an extensive description of the techniques behind the Omega Code Generator []. Boulet and Feautrier's paper on code generation without do-loops gives a thorough depiction of the PIP-based code generation technique []. Finally, the details of the most powerful code generation technique known so far are provided in Quilleré, Rajopadhye, and Wilde's paper on generation of efficient nested loops from polyhedra []. This reading is complemented by Bastoul's paper, which details several extensions to their algorithm and demonstrates the robustness of the extended technique for production compilers [].

Bibliography
. Ancourt C, Irigoin F () Scanning polyhedra with DO loops. In: rd ACM SIGPLAN symposium on principles and practice of parallel programming, Cologne, Germany, pp –
. Bastoul C () Code generation in the polyhedral model is easier than you think. In: IEEE international conference on parallel architectures and compilation techniques (PACT'), Juan-les-Pins, pp –
. Boulet P, Feautrier P () Scanning polyhedra without do-loops. In: IEEE international conference on parallel architectures and compilation techniques (PACT'), Paris, France, pp –
. Feautrier P () Some efficient solutions to the affine scheduling problem, part II: multidimensional time. Int J Parallel Program ():–
. Griebl M, Lengauer C, Wetzel S () Code generation in the polytope model. In: Proceedings of the international conference on parallel architectures and compilation techniques (PACT'), pp –
. Größlinger A () The challenges of non-linear parameters and variables in automatic loop parallelisation. Doctoral thesis, Fakultät für Informatik und Mathematik, Universität Passau
. Kelly W, Maslov V, Pugh W, Rosser E, Shpeisman T, Wonnacott D () The Omega library. Technical report, University of Maryland
. Kelly W, Pugh W () A framework for unifying reordering transformations. Technical Report UMIACS-TR--., University of Maryland Institute for Advanced Computer Studies
. Kelly W, Pugh W, Rosser E () Code generation for multiple mappings. In: Frontiers' Symposium on the frontiers of massively parallel computation, McLean, VA
. Le Verge H () Recurrences on lattice polyhedra and their applications, April . Unpublished work based on a manuscript written by H. Le Verge just before his untimely death in 
. Le Verge H, Van Dongen V, Wilde D () Loop nest synthesis using the polyhedral library. Technical Report , IRISA
. Li W, Pingali K () A singular loop transformation framework based on non-singular matrices. Int J Parallel Program ():–
. Quilleré F, Rajopadhye S, Wilde D () Generation of efficient nested loops from polyhedra. Int J Parallel Program ():–
. Schrijver A () Theory of linear and integer programming. Wiley, New York
. Vasilache N, Bastoul C, Cohen A () Polyhedral code generation in the real world. In: Proceedings of the international conference on compiler construction (ETAPS CC'), LNCS , Vienna, Austria, pp –

Collect
Allgather

Collective Communication
Robert van de Geijn, Jesper Larsson Träff
The University of Texas at Austin, Austin, TX, USA
University of Vienna, Vienna, Austria

Synonyms
Group communication; Inter-process communication

Definition
Collective communication is communication that involves a group of processing elements (termed nodes in this entry) and effects a data transfer between all or some of these processing elements. Data transfer may include the application of a reduction operator or other transformation of the data. Collective communication functionality is often exposed through library interfaces or language constructs. Collective communication is a natural extension of the message-passing paradigm.
Discussion

Introduction
Many commonly encountered communication patterns and computational operations involving data distributed across sets of processing elements (nodes) can be represented as collective communication, in which all nodes in a (sub)set of nodes collaborate to carry out a specific data redistribution or data reduction operation. Making such operations available in parallel programming languages, interfaces, or libraries has a number of advantages:

● It simplifies parallel and distributed programming, relieving the user from explicitly expressing complex communication patterns by means of more primitive communication operations.
● It raises the level of abstraction at which algorithms can be expressed and abstracts away details of the underlying parallel communication system.
● It makes it easier to reason about algorithm correctness and communication cost.
● It improves the functional portability of applications and contributes toward performance portability.
● It makes it possible to schedule complex communication patterns to efficiently exploit capabilities and specific properties of the underlying communication system.
● It introduces a productive separation of concerns between the application and interface developers.

Collective communication operations are found in some form in most parallel programming interfaces and languages, notably in the Message-Passing Interface (MPI) [], in Partitioned Global Address Space (PGAS) languages like UPC [], in libraries for Bulk Synchronous Processing (BSP) [, ], and many others, but also in application-specific and special-purpose libraries. By capturing patterns of communication and computation, collective communication is relevant not only for distributed memory parallel systems, but also for systems with shared memory between all or subsets of nodes. Collective communication indeed serves to abstract away such system characteristics, and can serve as a higher-level bridging model for ensuring (performance) portability of applications across systems with different communication capabilities. Research on efficient algorithms for collective communication under various computation and communication models and on practical implementations has been intensive over the past decades, and many good algorithms have found their way into common practice.

Many algorithms and applications are naturally and effectively expressed in a coarse-grained style as a sequence of local computations and collective communication operations. It has been argued that collective communication is more fundamental than explicit point-to-point or one-sided communication for expressing parallel computations []. Collective communication is a natural generalization of message-passing point-to-point communication, which explicitly involves two nodes, and of one-sided communication, which explicitly involves only one node.

A Motivating Example
Consider a matrix-vector multiplication y = Ax, where A is an n × n matrix and x and y are column vectors, to be computed in parallel. Let the p compute nodes be indexed from 0 to p − 1, assume for simplicity that p divides n, and assume that the input is evenly distributed among the nodes as described below.

Algorithm 1
Partition A, x, and y:

$$A \to \begin{pmatrix} A_0 \\ A_1 \\ \vdots \\ A_{p-1} \end{pmatrix}, \qquad x \to \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{p-1} \end{pmatrix}, \qquad \text{and} \qquad y \to \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{p-1} \end{pmatrix},$$

where Ak has n/p rows, and xk and yk are of size n/p. Assume that initially Ak, xk, and yk are assigned only to node k, k = 0, . . . , p − 1.

1. Collect all xk's on all nodes (allgather) so that the entire x is available on all nodes.
2. In parallel, compute yk = Ak x locally on each node k, k = 0, . . . , p − 1.

Step 1 is an example of the collective communication operation commonly referred to as allgather. Each node contributes a subvector, and after the operation all nodes have collected the full vector. Locally, each node performs O(n/p × n) arithmetic operations, and under reasonable communication assumptions O(n) time is required for the allgather step (see entry on Allgather). The local memory consumption is O(n²/p + n).
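As an illustration of how Algorithm 1 maps onto a concrete collective interface, the following minimal sketch expresses the allgather step and the local computation with MPI. It assumes a row-block distribution already set up by the caller and a row-major local block Ak; error handling is omitted.

    #include <mpi.h>
    #include <stdlib.h>

    /* Algorithm 1: y = A*x with A distributed by blocks of n/p rows.
     * xk (length n/p) is the locally owned piece of x; yk (length n/p)
     * receives the locally owned piece of y.  Assumes p divides n. */
    void matvec_allgather(const double *Ak, const double *xk,
                          double *yk, int n, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int nloc = n / p;

        /* Step 1: collect all subvectors xk so every node owns the full x. */
        double *x = malloc((size_t)n * sizeof(double));
        MPI_Allgather(xk, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

        /* Step 2: local matrix-vector product with the (n/p) x n block Ak. */
        for (int i = 0; i < nloc; i++) {
            double s = 0.0;
            for (int j = 0; j < n; j++)
                s += Ak[(size_t)i * n + j] * x[j];
            yk[i] = s;
        }
        free(x);
    }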
Algorithm 2
Partition A, x, and y:

$$A \to \begin{pmatrix} A_0 & A_1 & \cdots & A_{p-1} \end{pmatrix}, \qquad x \to \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{p-1} \end{pmatrix}, \qquad \text{and} \qquad y \to \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{p-1} \end{pmatrix},$$

where Ak has n/p columns, and xk and yk are of size n/p, and assume that initially Ak, xk, and yk are assigned only to node k, k = 0, . . . , p − 1.

1. In parallel, compute the n-element vector y(k) on each node k = 0, . . . , p − 1 by y(k) = Ak xk.
2. Compute the vector y = y(0) + y(1) + ⋯ + y(p−1) and store the final subvector yk on node k, k = 0, . . . , p − 1.

The operation in Step 2 is an example of a collective reduction operation known as reduce-scatter. This operation computes a global result by element-wise summing the n-element vector contributions from all participating nodes, leaving the parts of the result (subvectors of n/p elements) scattered over the nodes. Locally, each node performs O(n × n/p) arithmetic operations, and under reasonable communication assumptions, the reduce-scatter step can be done in O(n) time (see entry on Reduce and scan).

Unless the matrix A has additional structure that can be exploited both in the local computations and in the collective communication, neither Algorithm 1 nor Algorithm 2 is scalable beyond p ≥ n compute nodes. Sequential matrix-vector multiplication takes O(n²) operations, and parallel efficiency cannot be maintained with increasing p beyond n if n is kept constant. The algorithms illustrate the use of collective operations. Exploiting both ideas and using instead a two-dimensional partitioning of the matrix leads to a scalable algorithm which naturally relies on collective communication, as discussed later in this entry.

Classification of Collective Operations
The basic collective communication operations that will be described next can be divided into either redistribution (pure data transfer) or reduction operations that also entail a computation on the transferred data. They can further be classified along the following lines.

Rooted/non-rooted: In rooted (or asymmetric) operations, a specified node (the root) is either the sole origin of data to be redistributed or the sole destination of data or results contributed by the nodes involved in the communication. In non-rooted (or symmetric) operations all nodes contribute and receive data.
For rooted collectives, it is typically assumed that all nodes know the identity of the root node.
Regular/irregular: A collective operation is regular if the amount of data contributed or received by each involved node is the same, whereas for irregular collective operations, different nodes can contribute and/or receive different amounts of data.
For regular collectives, it is typically assumed that all processes know the amount of data to be redistributed; irregular operations may not make this assumption, and each node may know only the amount of data it has to contribute or receive in the operation.
Homogeneous/nonhomogeneous data: The data involved may either be homogeneous, as in simple arrays of elements of the same datatype, or nonhomogeneous, consisting of elements of different types, possibly stored in a nonconsecutive layout.
Synchronizing/non-synchronizing: A collective operation is synchronizing if a node starting the collective operation cannot continue its operation before all other nodes have commenced and reached a certain point in the operation. In contrast, in non-synchronizing operations, each node may be allowed to continue as soon as data from/to that node required for the operation have been sent and received.
Blocking/nonblocking: Blocking collective operations keep each node engaged until it has completed its involvement in the operation. Collectives may also be nonblocking, meaning that initiation and completion of the operation are separated. As soon as a node has initiated a nonblocking operation it may perform other tasks, with collective communication logically taking place in the background. Collective communication with nonblocking semantics might be able to leverage this for overlapping communication with other useful (application-level) computation.

The rooted/non-rooted dichotomy can alternatively be captured by the trichotomy all-to-all/all-to-one/one-to-all. Not all commonly found collective communication operations follow these taxonomies. Collective reduction operations can additionally be classified according to the types of operations and transformations that are allowed on the data. Finally, the power of programming interfaces with explicit collective communication is characterized not only by the types and set of collective operations included, but also by the capability to perform collective communication on subsets of nodes.

Commonly Used Collective Communications
The semantics (before-after) of some commonly identified collective communication operations in scientific computing are explained in the following and depicted in Figs. 1–3. Here, x, x(j), and y are vectors of data, and xk, xk(j), and yk are subvectors of x, x(j), and y, respectively. Input vector x(j) is assumed to be owned by node j, which means that this vector is stored in some form in the local memory of node j.

Data Redistribution Operations
The first set of collective communications redistribute or duplicate data from one or more nodes to one or more nodes.
Broadcast: One node (the root) owns a vector of data, x, that is to be copied to all other nodes. After the operation, all nodes have a copy of that data.
Scatter: One node (the root) owns a vector of data, x, that is partitioned into subvectors, x0, . . . , xp−1. Upon completion, node k owns xk, k = 0, . . . , p − 1.
Gather: The inverse of the scatter operation; the subvectors xk are gathered into the complete vector x at the root.

Collective Communication. Fig. 1 Rooted redistribution operations (before/after states of broadcast, scatter, and gather on three nodes). For illustration node 0 is chosen as root

Allgather: The allgather operation is like the gather operation, except that all nodes receive all of the data. The operation is also known as all-to-all broadcast, concatenation, and gossiping, and is equivalent
to a gather to some root followed by a broadcast from that root, or to p (simultaneous) gather operations with roots i = 0, . . . , p − 1, each gathering the same vector x.
All-to-all: Each node i, i = 0, . . . , p − 1 sends subvector xk(i) to node k for all k = 0, . . . , p − 1. It is sometimes emphasized that, in contrast to the allgather operation, this is a personalized all-to-all exchange in that each node contributes a different subvector to each other node. The all-to-all operation is equivalent to p (simultaneous) scatter operations with roots i = 0, . . . , p − 1, or p (simultaneous) gather operations, each scattering or gathering a different vector x(i). The operation is also sometimes referred to as a transpose (of the matrix of subvectors xi(j)).
Permutation: For a given permutation π of the set {0, . . . , p − 1}, each node i, i = 0, . . . , p − 1 sends its vector x(i) to node π(i). After the operation, node π(i) owns vector x(i); equivalently, node i owns vector x(π−1(i)).

Collective Communication. Fig. 2 Non-rooted redistribution operations (before/after states of allgather, all-to-all, and permutation on three nodes). For the permutation collective, π is a permutation of {0, . . . , p − 1} mapping node i to node π(i)

The input subvectors xi and xi(j) are not necessarily required to have the same number of elements. Regular redistribution operations would require subvectors to have the same size, whereas irregular collectives would not make this requirement.
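The rooted redistribution operations above map directly onto library calls in interfaces such as MPI. The following minimal sketch (an illustration only, assuming a contiguous double-precision vector of length n, p dividing n, and no error handling) shows a broadcast, a scatter, and the inverse gather on the same data.

    #include <mpi.h>

    /* Redistribute a vector x of length n owned by node `root`:
     * broadcast the whole vector, scatter it into pieces xk of
     * length n/p, then gather the pieces back into x at the root.
     * All nodes pass buffers x (n elements) and xk (n/p elements). */
    void redistribute_examples(double *x, double *xk, int n,
                               int root, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);
        int nloc = n / p;   /* regular operation: same amount on each node */

        /* Broadcast: afterwards every node has a copy of x[0..n-1]. */
        MPI_Bcast(x, n, MPI_DOUBLE, root, comm);

        /* Scatter: node k receives subvector x_k into xk[0..nloc-1]. */
        MPI_Scatter(x, nloc, MPI_DOUBLE, xk, nloc, MPI_DOUBLE, root, comm);

        /* Gather: the subvectors xk are collected back into x at the root. */
        MPI_Gather(xk, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, root, comm);
    }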
Collective Communication. Fig. 3 Rooted and non-rooted reduction operations (reduce, reduce-scatter, allreduce, and prefix). For illustration of the reduce collective, node 0 is chosen as root. Only the inclusive all prefix sums operation is depicted

Reduction Operations
Nodes often compute partial results that have to be reduced (combined) to yield a final result. This is often abstracted as collective communication in the following
way. Let ⊕ be an associative, binary operator on the set of elements of the input vectors x(j). The operator is extended to full vectors by element-wise application. More precisely, let x(j) for j = 0, . . . , p − 1 be vectors of size n. Then

$$x^{(j)} \oplus x^{(k)} = \begin{pmatrix} x_0^{(j)} \oplus x_0^{(k)} \\ x_1^{(j)} \oplus x_1^{(k)} \\ \vdots \\ x_{n-1}^{(j)} \oplus x_{n-1}^{(k)} \end{pmatrix}$$

for any j and k, j, k = 0, . . . , p − 1.
Collective reduction operations now compute $y = \bigoplus_{j=0}^{k} x^{(j)}$, either for the specific k = p − 1 or for all k = 0, . . . , p − 1. By the associativity of ⊕ this is well defined (brackets can be omitted); it may or may not be assumed/required that the operator ⊕ be commutative. The collective reduction operations further differ in how the elements of y are left distributed among the nodes.
Reduce: A designated root node receives the result y. For emphasis, this collective operation is sometimes called reduction-to-one.
Allreduce: The result y is duplicated to all nodes.
Reduce-scatter: The result y is distributed among the nodes so that node k ends up with a subvector yk of y.
All prefix sums: All prefix sums $y^{(k)} = \bigoplus_{j=0}^{k} x^{(j)}$ are computed and the kth prefix y(k) is stored at node k, k = 0, . . . , p − 1. This collective communication operation is often termed scan.
The requirement that ⊕ must be associative enables parallelization of the operations, since partial results can be computed concurrently and later combined. Likewise, commutativity may make more efficient implementations possible. Because the prefix sum for node k includes the input vector of node k itself, the operation described is sometimes referred to as the inclusive all prefix sums operation. An exclusive prefix-sums operation would compute $y^{(k)} = \bigoplus_{j=0}^{k-1} x^{(j)}$ with some special provision for node 0. Note that the kth inclusive prefix sum can trivially be computed from the kth exclusive prefix sum, but not vice versa, unless an inverse operation is given for ⊕.
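To make the inclusive/exclusive distinction concrete, the sketch below computes both variants with MPI, which exposes the inclusive operation as MPI_Scan and the exclusive one as MPI_Exscan (on node 0, MPI_Exscan leaves the result undefined, which corresponds to the "special provision" mentioned above). The element type and the use of MPI_SUM as ⊕ are illustrative assumptions only.

    #include <mpi.h>

    /* Inclusive and exclusive prefix sums over one value per node:
     * incl on node k becomes x(0)+...+x(k); excl becomes x(0)+...+x(k-1). */
    void prefix_sums(long x, long *incl, long *excl, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        MPI_Scan(&x, incl, 1, MPI_LONG, MPI_SUM, comm);    /* inclusive */
        MPI_Exscan(&x, excl, 1, MPI_LONG, MPI_SUM, comm);  /* exclusive */

        if (rank == 0)
            *excl = 0;   /* special provision for node 0: empty prefix */

        /* The inclusive prefix can always be recovered from the exclusive
         * one by combining it with the node's own contribution:
         * *incl == *excl + x. */
    }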
Alternatively, some of the reduction patterns above could be formulated as operations on row vectors instead. In this formulation, a single element y would be computed by $y = \bigoplus_{j=0}^{p-1} \bigoplus_{i=0}^{n_j-1} x_i^{(j)}$, where n_j is the size of the vector x(j) residing with node j.

Barrier Synchronization
Sometimes applications require nodes to synchronize in the sense that no node shall continue beyond a certain point in its execution, e.g., enter the next stage in a computation before all nodes have reached that stage. A (semantic) barrier to ensure this is typically classified as a collective communication operation, although no data transfer or computation is implied.
Barrier is often used for its temporal side effects, achieving an actual synchronization between a (large) set of nodes, as is sometimes required for benchmarking purposes.

A Motivating Example (continued)
We again discuss y = Ax but now view the p nodes as forming an r × c mesh.

Algorithm 3
1. Partition A, x, and y:

$$A \to \begin{pmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,p-1} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,p-1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{p-1,0} & A_{p-1,1} & \cdots & A_{p-1,p-1} \end{pmatrix}, \qquad x \to \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{p-1} \end{pmatrix}, \qquad \text{and} \qquad y \to \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{p-1} \end{pmatrix},$$

where Aij has n/p rows and columns, and xi and yi are of size n/p.
2. Assume that initially Aij, xi, and yi are assigned to nodes as illustrated in Fig. 4. Each node has p/r × p/c = p submatrices and one subvector of size n/p.
3. Allgather xj's within columns of nodes.
Collective Communication. Fig. 4 Initial distribution of submatrices and vectors. Here p = 9, r × c = 3 × 3, and the boxes represent nodes

Collective Communication. Fig. 5 After allgather of subvectors of x within columns and local matrix-vector multiplication

4. In parallel, compute the local part of each yi on the appropriate node:
$$y_i^{(k)} = \sum_{\text{local } j} A_{ij}\, x_j$$
as illustrated in Fig. 5.
5. Reduce-scatter within rows of nodes to compute $y_i = \sum_k y_i^{(k)}$.

Since the extra local memory and time required for the gathered and reduced subvectors is only O(cn/p) and O(rn/p), respectively, this algorithm can be shown to be essentially scalable by keeping local memory use on each node constant and keeping the ratio of r and c constant. Notice that Algorithm 1 results by taking c = 1 and Algorithm 2 by taking r = 1.
The example illustrates that often collective communications must be performed within subsets (e.g., a row or column of a mesh) of nodes. For parallel interfaces that do not support formation of subsets of nodes, the algorithm is difficult to express.
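The node subsets that Algorithm 3 needs can, for example, be formed with MPI communicators. The following hypothetical sketch shows only the communicator setup and the two collective calls; it assumes a row-major ranking on the r × c mesh, local buffers laid out as in Fig. 4, a caller-supplied routine for the local multiply, and MPI_Reduce_scatter_block (available from MPI-2.2 on; older libraries can use the irregular MPI_Reduce_scatter with equal counts).

    #include <mpi.h>

    /* Sketch of Algorithm 3 on an r x c mesh of p = r*c nodes.  Each
     * node owns one x piece and one y piece of nloc = n/p elements,
     * plus its blocks of A (used inside local_multiply). */
    void matvec_2d(const double *x_piece,  /* nloc elements               */
                   double *x_col,          /* r*nloc: x pieces of a column */
                   double *y_partial,      /* c*nloc: local partial sums   */
                   double *y_piece,        /* nloc: final owned piece      */
                   int nloc, int r, int c, MPI_Comm comm,
                   void (*local_multiply)(const double *, double *))
    {
        int rank;
        MPI_Comm_rank(comm, &rank);
        int myrow = rank / c, mycol = rank % c;   /* assumed row-major ranks */

        /* Form the subsets: one communicator per mesh column and per row. */
        MPI_Comm colcomm, rowcomm;
        MPI_Comm_split(comm, mycol, myrow, &colcomm); /* r nodes per column */
        MPI_Comm_split(comm, myrow, mycol, &rowcomm); /* c nodes per row    */

        /* Step 3: allgather the x subvectors within each column of nodes. */
        MPI_Allgather(x_piece, nloc, MPI_DOUBLE,
                      x_col, nloc, MPI_DOUBLE, colcomm);

        /* Step 4: local multiply with the locally stored blocks of A. */
        local_multiply(x_col, y_partial);

        /* Step 5: reduce-scatter within each row; every node in the row
         * keeps one element-wise summed piece of nloc elements. */
        MPI_Reduce_scatter_block(y_partial, y_piece, nloc,
                                 MPI_DOUBLE, MPI_SUM, rowcomm);

        MPI_Comm_free(&colcomm);
        MPI_Comm_free(&rowcomm);
    }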
Interfaces

Message-Passing Interface (MPI)
A widely used parallel interface with an explicit, rich set of collective communication operations is the Message-Passing Interface (MPI) []. The MPI collectives are widely used in applications. MPI includes all collective communication operations of Figs. 1–3, except for permutation, as well as a barrier synchronization collective. Collective communication is effected by library calls, and for each operation, all nodes (in a set of nodes) eventually have to make the same call, also for the rooted operations for which an explicit root argument (that needs to be given by all nodes involved
in the call) designates the root node. All collectives in MPI are blocking but non-synchronizing, except for MPI_Barrier, meaning that a node (process in MPI terminology) is allowed to continue as soon as it has contributed and received its required data. Collectives like MPI_Allreduce and MPI_Allgather are of course non-synchronizing only in the trivial case where all processes contribute data of size zero. The redistribution collectives, except for MPI_Bcast, come in both regular (e.g., MPI_Allgather) and irregular variants (e.g., MPI_Allgatherv), the latter termed vector variants and designated by the "v" suffix. For the regular collective operations, the size of the data is given at the call of the operation, such that nodes know the size of the data involved in the operations. For the irregular operations, each node only specifies the size of the data to be contributed and received by itself. This means that in irregular all-to-all communication, no node can from the outset know what communication will be entailed between other nodes, which makes it impossible to compute efficient communication schedules without additional communication. MPI incorporates general, powerful, and convenient mechanisms for collective communication with nonhomogeneous data, which again increases the algorithmic and software complexity of MPI library implementations.
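As an illustration of the regular/irregular distinction just described, the following sketch contrasts MPI_Allgather with its vector variant MPI_Allgatherv, where each node may contribute a different number of elements; the counts are first exchanged so that every node can lay out the receive buffer. It is a minimal sketch with error handling and overflow checks omitted.

    #include <mpi.h>
    #include <stdlib.h>

    /* Irregular allgather: node k contributes nk doubles (nk may differ
     * per node).  The counts are exchanged with a regular allgather,
     * then the data with MPI_Allgatherv.  Returns the gathered vector
     * (caller frees) and its total length in *ntotal. */
    double *allgatherv_example(const double *local, int nk,
                               int *ntotal, MPI_Comm comm)
    {
        int p;
        MPI_Comm_size(comm, &p);

        int *counts = malloc(p * sizeof(int));
        int *displs = malloc(p * sizeof(int));

        /* Regular collective: one int per node, same amount everywhere. */
        MPI_Allgather(&nk, 1, MPI_INT, counts, 1, MPI_INT, comm);

        int total = 0;
        for (int i = 0; i < p; i++) {   /* prefix sums give displacements */
            displs[i] = total;
            total += counts[i];
        }

        double *all = malloc((size_t)total * sizeof(double));
        /* Irregular collective: per-node counts and displacements. */
        MPI_Allgatherv(local, nk, MPI_DOUBLE,
                       all, counts, displs, MPI_DOUBLE, comm);

        *ntotal = total;
        free(counts);
        free(displs);
        return all;
    }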
The MPI reduction operations correspond exactly to those listed in Fig. 3 and are all regular, except for MPI_Reduce_scatter, which allows blocks (subvectors) of different sizes to be distributed across the nodes. A regular variant also exists. For the reduction operations, a set of standard arithmetical and logical binary operators is predefined, and it is furthermore possible for the application to introduce its own user-defined operators; these are required to be associative. MPI provides vague quality guarantees on the reductions, mostly relevant for reductions with floating point numbers, for instance, that the same result is computed on all nodes in an MPI_Allreduce, that the result is independent of the physical placement of the nodes in the system, or that all elements of result vectors are computed in the same order, and so forth. Such requirements sometimes complicate the practical design and implementation of reduction collectives.
Finally, MPI has powerful mechanisms for forming named subsets of nodes (namely, process groups and communicators), and collective communication can take place on any such subset of nodes.

Partitioned Global Address Space (PGAS) Languages
Partitioned Global Address Space (PGAS) languages strive to hide the details of communication from users to improve programmability and/or reduce overhead. PGAS languages and implementations nevertheless benefit from explicit collective communication, and such languages provide collective communication either as integral parts of the language, as in X10 [], or as standard libraries, as in Unified Parallel C (UPC) []. The standard UPC library provides the operations listed in Figs. 1 and 2, including a permutation collective (absent in MPI), in regular variants, and restricted to mostly homogeneous data. Placement of input and result vectors is mostly determined implicitly by the so-called affinity of the data elements. In particular, the root node is the node to which the input (for upc_all_broadcast) or result (for upc_all_gather) has affinity. Blocking and synchronization properties can be flexibly controlled by a mode given with each collective call. Unlike MPI, the reduction operations upc_all_reduce and upc_all_prefix_reduce are operations on row vectors that produce a single result or prefix sum for each node. At the time of this writing UPC collectives are over the full set of nodes, i.e., UPC does not support the formation of node subsets over which collective communication can be performed.
Coarray Fortran 2.0 (CAF 2.0) addresses some of these issues and provides a full and very general set of collective operations that can be applied over arbitrary teams of nodes [].

Bulk Synchronous Processing Libraries
Bulk synchronous or coarse-grained processing is a parallel algorithmic design approach that structures applications into phases of local computation on local data interleaved with phases of global exchange or routing. This approach naturally relies on collective communication as explicated by the exchange phases, but can additionally benefit from explicit collective communication for realizing special communication patterns. This is based on the observation that applications often entail specific communication patterns that are
both conveniently expressed as collective communication and much more efficiently implemented by specialized algorithms than by relying on the general routing communication of the next routing phase. Most practical realizations of the Bulk Synchronous Processing (BSP) model included explicit collective communication operations [, , ] as explained here, typically in a nonblocking form. Some BSP libraries [] supported formation of subsets of nodes.

Numerical Libraries
Numerical libraries that explicitly have or expose collective communication operations include ScaLAPACK and PLAPACK. See the entries for these libraries.

Other Collective Interfaces and Frameworks
Other, more specialized collective communication operations than those described in this entry are sometimes included in higher-level interfaces and frameworks, or in application-specific libraries. For instance, libraries like PETSc and Trilinos have collective computation (reduction) operations, and I/O libraries like pnetCDF and HDF5 have collective data movement operations. As an example, algorithmic libraries may include merging and sorting primitives and parallel operations on data structures in their repertoire. The skeletal approach to parallel programming [] expresses applications in terms of instances of higher-level generic patterns, many of which entail collective communication. Such approaches provide considerable expressive flexibility and often a very rich set of collective communication and computation patterns. A concrete example in this direction is the MapReduce framework [].

Algorithms
Algorithmic and further details on some of the most important of the operations described above are given in related entries. Basic techniques for collective communication implementation are found under the Broadcast entry, which should be read first. The symmetric operations allgather and all-to-all are treated in separate entries, which discuss techniques not used by Broadcast. The reduce-to-root operation can often be treated as the inverse of Broadcast (with additional operation). The fundamental all prefix sums collective is treated more extensively in a separate entry, and completes the tour of basic techniques for the implementation of collective communication on distributed memory systems.

Related Entries
Allgather
All-to-all
BSP (Bulk Synchronous Parallelism)
Collective Communication
MPI (Message Passing Interface)
PETSc (Portable, Extensible Toolkit for Scientific Computation)
PGAS (Partitioned Global Address Space) Languages
PLAPACK
ScaLAPACK
Scan for Distributed Memory, Message-Passing Systems

Bibliographic Notes and Further Reading
The benefits of encapsulating certain collective communication patterns as specific collective operations in languages, interfaces, and libraries have been realized from the early days of parallel computing, especially in scientific computing []. For distributed memory systems explicit collective communication is a natural extension to the message-passing paradigm. Convergence between many different message-passing languages and libraries in the early 1990s led to the definition of a common Message-Passing Interface [], well noted for the important role given to collective communication. The advantages of and need for interfaces and libraries with fast and efficient collective communication to better exploit the underlying, concrete communication system were realized and argued in a number of papers, e.g., [, ]. The study of algorithms for efficient collective communication likewise dates back to the early days of distributed memory parallel computing [, ]. Another early approach to systematic collective communication algorithm design can be found in []. A comprehensive discussion of practical implementations for most of the collective communications discussed here can be found in []. There has been a huge amount of research and development work on efficient realization of these collectives for a large variety of (often extremely high-performance) systems. It is a major goal
for MPI implementers to provide well-performing collective communication for the intended target systems. In [] semantic relationships between different collective operations were used to urge implementers to provide consistently well-performing collective operations in order to enhance the performance portability of applications written in MPI.

Bibliography
. Bala V, Bruck J, Cypher R, Elustondo P, Ho A, Ho C-T, Kipnis S, Snir M () CCL: A portable and tunable collective communications library for scalable parallel computers. IEEE Trans Parallel Distrib Syst ():–
. Bonorden O, Juurlink BHH, von Otte I, Rieping I () The Paderborn University BSP (PUB) library. Parallel Comput ():–
. Chan E, Heimlich M, Purkayastha A, van de Geijn RA () Collective communication: theory, practice, and experience. Concurr Comput Pract Exp ():–
. Dean J, Ghemawat S () MapReduce: simplified data processing on large clusters. Commun ACM ():–
. El-Ghazawi T, Carlson W, Sterling T, Yelick K () UPC: distributed shared memory programming. Wiley, Hoboken
. Fox G, Johnson M, Lyzenga G, Otto S, Salmon J, Walker D () Solving problems on concurrent processors, vol . Prentice-Hall, Englewood Cliffs
. Gorlatch S () Send-receive considered harmful: Myths and realities of message passing. ACM Trans Program Lang Syst ():–
. Goudreau M, Lang K, Rao SB, Suel T, Tsantilas T () Portable and efficient parallel computing using the BSP model. IEEE Trans Comput ():–
. Hambrusch SE, Hameed F, Khokhar AA () Communication operations on coarse-grained mesh architectures. Parallel Comput ():–
. Hempel R, Hey AJG, McBryan O, Walker DW () Special issue: message passing interfaces. Parallel Comput ():–
. Hill JMD, McColl B, Stefanescu DC, Goudreau MW, Lang K, Rao SB, Suel T, Tsantilas T, Bisseling RH () BSPlib: The BSP programming library. Parallel Comput ():–
. Mellor-Crummey J, Adhianto L, Scherer III WN, Jin G () A new vision for Coarray Fortran. In: Third conference on Partitioned Global Address Space Programming Models, Ashburn, VA
. Mitra P, Payne DG, Schuler L, van de Geijn R () Fast collective communication libraries, please. In: Proceedings of the Intel supercomputer users' group meeting
. MPI Forum. MPI: A message-passing interface standard, version ..,  Sept . www.mpi-forum.org
. Rabhi FA, Gorlatch S (eds) () Patterns and skeletons for parallel and distributed computing. Springer-Verlag, London
. Saad Y, Schultz MH () Data communication in parallel architectures. Parallel Comput ():–
. Saraswat VA, Sarkar V, von Praun C () X10: Concurrent programming for modern architectures. In: ACM SIGPLAN symposium on principles and practice of parallel programming (PPoPP), San Jose, p 
. Träff JL, Gropp WD, Thakur R () Self-consistent MPI performance guidelines. IEEE Trans Parallel Distrib Syst ():–

Collective Communication, Network Support For
Dhabaleswar K. Panda, Sayantan Sur, Hari Subramoni, Krishna Kandalla
The Ohio State University, Columbus, OH, USA

Synonyms
Inter-process Communication; Network Architecture; Network Offload

Definition
Collective communication involves more than one process participating in one communication operation. Collective communication operations are: broadcast, barrier synchronization, reduction, gather, scatter, all-to-all complete exchange, and scan. Network support for collective communication provides architectural support for efficient and scalable collective operations with low processor overhead.

Discussion

Introduction
Collective communication operations aim at reducing both latency and network traffic with respect to the case where the same operations are implemented with a sequence of unicast messages. The significance of collective communication operations for scalable parallel systems has been emphasized by their inclusion in widely used parallel programming models, such as the Message Passing Interface (MPI) []. A large number of parallel applications depend on the performance benefits provided by collective operations. MPI libraries strive to provide the best possible performance for collective operations, since they are the key for good end
application performance. Consequently, a large number of algorithms and advanced network support for collective communication have been proposed. The commonly used collective communication operations are: broadcast, barrier synchronization, reduction, gather, scatter, all-to-all complete exchange, and scan.
Broadcast is a very common operation needed in a parallel system to distribute code and data from a host process to a set of processes before computation begins. This is especially useful for SPMD-style programming. During the computational phase of distributed memory applications, this operation is used to distribute data among other processes. In the literature, it is also known as a one-to-all operation with non-personalized data movement. If multiple broadcasts happen at the same time from different processes, it is called a many-to-all operation. If the data is distributed to only a subset of processes, the operation is called a multicast. In MPI, this operation is called MPI_Bcast.
Barrier synchronization is a common operation in parallel systems to synchronize the execution of multiple processes. Primarily, it is used to support a producer–consumer relationship on shared data. Functionally, this operation involves many-to-one communication followed by a one-to-many communication operation. In MPI, this operation is called MPI_Barrier.
Reduction operations are used widely in distributed memory systems and consist of computing a single value out of data sent by the different participating processes. Commonly used reduction operations are max, min, sum, or any user-defined operations. The MPI specification includes two forms of reduction, MPI_Reduce and MPI_Allreduce. In MPI_Reduce, the result is available only on the root process, and in MPI_Allreduce, the result is available on all participating processes.
Gather is defined as the gathering of data from a set of member processes by one member. This involves gathering personalized data (different data from different processes). In MPI, it is called MPI_Gather. Scatter is the reverse operation of gather. MPI also provides MPI_Allgather, which performs an all-to-all broadcast.
All-to-all complete exchange combines scatter and gather. In this operation, every member has some personalized data to send to every other group member. A sequence of interleaved gather and scatter steps is used in this operation to achieve the desired data movement. This operation is used in distributed memory systems to redistribute data.
The scan operation involves reduction of data by ranks. Sometimes, this operation is also referred to as parallel prefix computation. Unlike regular reduction, this operation is carried out in such a way that a member of rank i only receives the reduction result associated with members of ranks 0 to i. This operation is used widely in image processing and visualization applications.

Architectural Support for Collective Communication
The importance of collective operations in parallel computing is underscored by the increasing network and architectural support for improving collective operation performance. There are three major types of network support for collective operations: (a) adapter support, (b) switch support, and (c) dedicated networks for collectives. The network support aims to improve latency and bandwidth. Another important performance metric for collective operations is tolerance to system noise. As the number of processes involved in a collective operation increases, short delays at individual processes start to accumulate over the various stages of the collective and adversely affect the overall time taken for the collective; this is called system noise. Tolerance to noise means that the collective operations should be able to proceed regardless of other tasks running on the processor, a technique called application-bypass. These techniques can be enabled through intelligent network adapters, switches, or by having dedicated collective communication networks. In addition to these, MPI libraries have advanced software support for bridging the gap between low-level architectural support and higher-level user collective operations.

Adapter-Based Support
Compute nodes are attached to the network via network adapters. Adapters are the closest to the end processors, and therefore can be used to perform a wide variety of network offload mechanisms. Intelligent adapters with processors can be used to perform complex operations on incoming network data. They can also be used to offload a series of operations that the adapter
then continues to perform, thus freeing up the host processor to return to computation. The following are some of the adapter-based network supports that have been employed by leading high-performance network vendors over the past few decades.

Myrinet
Myrinet is a high-performance full-duplex network which uses NICs with programmable processors. Myrinet provides a user-level message-passing system (GM) which uses the programmable NICs for much of the protocol processing. GM consists of three components: a kernel module, a user-level library, and a control program (MCP) which runs on the NIC processor. The driver loads the MCP onto the NIC when it is loaded. It is possible to modify the MCP and cross-compile it for the network processor. Researchers have modified the control program to support advanced collective operations. The modified MCP can be accessed through new interfaces by the MPI library. Finally, collective operations of the end application can be offloaded to the NIC. In effect, the MCP (through multiple finite state machines) tells the NIC the times at which it should initiate the various send and receive operations which form the basis of the collective operation. Myrinet also provides useful primitives such as multi-send, which allows the user to offload a series of send operations to the NIC. Such primitives are extremely useful for designing collective algorithms such as barrier and broadcast. In tree-based collective algorithms, although operations like multi-send reduce the overhead of the broadcast operations at the root processes, the intermediate nodes would still require involvement from the host processor to progress the collective operation. Researchers have devised application-bypass techniques through a modified MCP that enable intermediate network adapters in the collective tree to forward messages from the NIC itself, thereby mitigating the effects of system noise.

Quadrics
The Quadrics interconnect had very advanced network adapter support. The Quadrics network access library is called Elan. Elan exposes a protected, programmable network interface to end users through vendor-provided programming libraries (QSNet). These libraries are built in a layered fashion, with higher-level programming libraries (Elanlib) built on top of lower-level point-to-point communication primitives. Elanlib provides two barrier functions, elan_gsync() and elan_hgsync(). The latter takes advantage of the hardware broadcast primitive and provides a very efficient and scalable barrier operation. However, it requires that the calling processes are well synchronized in their stages of computation. Otherwise, it falls back on elan_gsync() to complete the barrier with a tree-based gather-broadcast algorithm.

InfiniBand RDMA Enabled Collectives
Remote Direct Memory Access (RDMA) is a technique by which a process can directly access the memory locations of some other process, without participation of the remote process. InfiniBand is one of the networks that provide high-performance RDMA. Using the RDMA primitives, efficient collective operations can be designed. The main benefits of RDMA collectives over regular point-to-point collectives are: (a) communication calls can bypass several intermediate software layers, (b) the number of message copies to and from bounce buffers is reduced (bounce buffers can be directly accessed by InfiniBand), (c) protocol handshakes, such as rendezvous for large MPI messages, are reduced, and (d) the total number of registration operations is reduced, i.e., one larger memory registration is performed instead of several smaller ones.
These mechanisms can further be enhanced for shared memory processor (SMP)-based systems by performing the collective operation in three steps. In the first step, all processes local to a node write their data to a common shared buffer. In the second step, this buffer is exchanged over the network with remote nodes by designated processes called leaders. The third step involves the leader process distributing the data to the local processes through shared memory.
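The node-leader scheme just described can be expressed at the application or library level with ordinary MPI communicators. The sketch below is illustrative only: it assumes an MPI-3 capable library (for MPI_Comm_split_type) and uses plain MPI_Bcast for both levels, whereas an RDMA/shared-memory implementation would replace the intra-node step with writes to a shared buffer.

    #include <mpi.h>

    /* Hierarchical broadcast in the spirit of the leader-based scheme:
     * data moves root -> per-node leaders -> node-local processes. */
    void leader_bcast(void *buf, int count, MPI_Datatype type,
                      int root, MPI_Comm comm)
    {
        int rank;
        MPI_Comm_rank(comm, &rank);

        /* All processes on the same node form one communicator. */
        MPI_Comm nodecomm;
        MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank,
                            MPI_INFO_NULL, &nodecomm);
        int noderank;
        MPI_Comm_rank(nodecomm, &noderank);

        /* Node leaders (local rank 0) plus the root form the inter-node
         * communicator; the key places the root at leader rank 0. */
        int is_leader = (noderank == 0 || rank == root);
        MPI_Comm leadercomm;
        MPI_Comm_split(comm, is_leader ? 0 : MPI_UNDEFINED,
                       rank == root ? 0 : rank + 1, &leadercomm);

        if (is_leader) {
            /* Step 2 of the scheme: inter-node transfer among leaders. */
            MPI_Bcast(buf, count, type, 0, leadercomm);
            MPI_Comm_free(&leadercomm);
        }
        /* Step 3: intra-node distribution from the local leader. */
        MPI_Bcast(buf, count, type, 0, nodecomm);
        MPI_Comm_free(&nodecomm);
    }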
InfiniBand Collective Offload
Collective operations defined in the current MPI standard are blocking operations. This implies that applications need to wait until collective operations complete before doing any other compute tasks. Since collective operations involve many processes, their latency is bound to increase as systems scale. Several researchers have also observed that system noise can potentially affect the latency of blocking collective
operations, and hence the performance of end applications. It is widely believed that hiding the latency of collective operations can strongly benefit the performance of parallel applications. InfiniBand vendors, such as Mellanox, have recently introduced new network adapters, such as the ConnectX-2, that offer network-offload features. Host processors can create arbitrary task-lists comprising send and receive operations and post them to the work request queue of the network interface. The network interface can independently execute these task lists without further intervention from the host processors. This application-bypass technique has the potential advantage of minimizing the effects of system noise along with offering more CPU cycles to applications for performing compute tasks.
MPI libraries are designed in a highly efficient, scalable manner to leverage the full functionality of these network interfaces. Non-blocking collective operations designed in this manner can potentially allow applications to scale better by countering the effects of system noise and by allowing applications to hide the time required for collective operations by overlapping them with compute tasks. A new revision to the MPI standard, MPI-3, is likely to standardize non-blocking collective communication.
Switch-Based Support
Network switches form the backbone of the interconnect. High-performance networks have point-to-point links to maximize performance. Switches are responsible for routing packets from source nodes to destination nodes. Since switches are in a central position, they are a very attractive point at which to introduce network support for collective communication. The following architectural support is available in Quadrics and InfiniBand switches for collective communication operations.

Quadrics
The Quadrics network features hardware support for collective communication. Each process can map a portion of the address space into global memory. These addresses constitute virtual shared memory. The switch provides hardware-based broadcast. This broadcast is reliable, i.e., no software mechanism is required to ensure that broadcast packets reach their destinations. The broadcast mechanism replicates packets to particular switch output ports, with the capability of broadcasting to a subset of processors. Using the broadcast mechanism, both broadcast and barrier collectives can be implemented.

InfiniBand Multicast
Modern networks such as InfiniBand provide a hardware-based multicast operation. The multicast operation gives a process the ability to send a single message to a specific subset of processes which may be on different end nodes. Such a hardware-based multicast capability provides the following benefits: (a) only one send operation is needed to initiate the multicast, greatly reducing the host overhead at the sender, and (b) packets can be duplicated at the switch ports, reducing network traffic.
InfiniBand multicast does not guarantee reliable delivery. It is based on the Unreliable Datagram (UD) transport. Software-based solutions to ensure reliability need to be provided, and MPI library designers must take this into account. As with the RDMA-based approach, this scheme can also be enhanced for the current generation of SMP systems. Such studies have already been done by researchers for IB systems and they show considerable improvement in performance as opposed to a non-SMP-aware approach.

InfiniBand Fabric Channel Accelerator
The Fabric Channel Accelerator (FCA) technology from Voltaire (now merged with Mellanox) was introduced in . The aim of this feature is to enable high-level optimizations in InfiniBand switches by employing a general-purpose CPU on the switch itself. The CPU on the switch can then manage communications to different nodes. The individual nodes can connect to the CPU on the switch over the Unreliable Datagram (UD) transport. The FCA library and algorithms then implement reliability on top of the unreliable transport. MPI libraries can talk with the FCA library to implement collective operations. Several commonly used collectives, such as barrier, broadcast, and reduce, are supported.

Support Through Dedicated Networks
The most advanced support for collective operations is by dedicating entire networks to efficient implementation of collective operations. The communication
requirements of point-to-point and collective operations are often very different. Collective operations are very sensitive to latency and system noise. The following section describes the dedicated architectural support provided in the IBM BlueGene system to accelerate collectives.

BlueGene
The IBM BlueGene system featured some of the most advanced support for scalable collective operations. The compute nodes were connected through five networks: a 3-D torus network for point-to-point messaging, a global collective network, a global barrier and interrupt network, and two other Ethernet networks for control and I/O. The global collective network is useful for speeding up commonly used MPI collective communication constructs, and the global barrier network quickly synchronizes state across all processes in the system. The BlueGene/L system had a collective network with bandwidth up to  MBps with . μs latency. The later BlueGene/P system featured collective network bandwidth of  MBps and . μs per tree traversal. The global barrier and interrupt network has a hardware latency of . μs.
In addition to the hardware support, the IBM MPI libraries provide many advanced collective communication algorithms that map the hardware resources to user collective operations.

Software Optimizations
In addition to network support, software optimizations also play an important role in the performance of collective operations on modern platforms. There are two major ways in which software optimizations may be employed for collective operations: (a) algorithm design based on network topology, and (b) algorithm design for multi-core and many-core systems.
Large supercomputer systems must scale with increasing node count. One of the major problems with scaling is the number of links required to interconnect the system. An all-to-all connectivity crossbar connection requires N² links, which is clearly not scalable. In response, system designers have employed 2-D and 3-D meshes, k-dimensional tori, and other interconnection topologies. The IBM BlueGene and Cray systems have used 3-D tori for their large-scale systems. Algorithms that implement collective operations must be designed with these topologies in mind. The IBM BlueGene and Cray systems feature several such algorithmic optimizations.
Modern compute servers are based on multi-/many-core architectures and offer high compute densities. Interestingly, these architectures have also introduced a new design space for collective algorithms. Since many processes share the same address space within a compute node, they can communicate directly through memory with smaller latency. Consider, for example, an MPI_Bcast operation. One or more processes within a node can be chosen as leaders, and they communicate over the network to receive the data from the root. These leaders then broadcast the data to the other processes that are within the same compute node, directly through memory. Researchers have also extended this concept to consider the network topology to further improve the latency of collective operations. These algorithms are particularly useful on large-scale systems, where compute nodes are organized across multiple racks and inter-rack communication is costlier.

Collectives Today
Owing to their ease of use and portability, collective communication operations are used extensively in scientific parallel applications. Hence, it is critical to improve the performance and scalability of collective operations. Most of the current research in this field is focused on two important aspects: (a) improving the latency of collective operations through enhanced software designs that leverage advanced multi-core architectures and network features, and (b) designing non-blocking collective operations that allow applications to achieve communication/computation overlap.
Many MPI libraries, such as MVAPICH [], Open MPI [], and Intel MPI, use aggressive shared-memory-based designs to optimize the latency of blocking collective operations on modern multi-core architectures. However, the performance of collective operations also strongly depends on the topology of the interconnection network. Supercomputing systems use complex network architectures ranging from fat-trees to 3-D tori and hypercubes. Researchers are exploring alternatives to detect the system topology and design collective algorithms in a topology-aware manner to achieve lower latency.
As parallel applications are scaled out to tens of thousands of processes, it becomes necessary to hide the costs associated with collective operations. Non-blocking collective operations are widely believed to address this requirement. The goal of such an interface is to allow applications to perform compute tasks while the collective operations are in progress. The biggest challenge of designing collective operations in this manner is that they need to be progressed with little intervention from the host processors. Researchers are exploring ways to offload collective operations to modern network interfaces so that the host processors can be used directly to perform application-level compute tasks.

Future Directions
Given the current growth in multiprocessor chip designs, next-generation compute servers are expected to offer hundreds to thousands of compute cores. Next-generation networks are expected to offer a rich set of advanced features, apart from guaranteeing lower communication overheads. Designing algorithms to minimize the communication latency of blocking collective operations will continue to be a critical aspect of any communication library. At the same time, designing an efficient set of non-blocking collective operations that allow parallel applications to hide the communication latency promises to improve application performance. Communication libraries offering a high-performance, scalable suite of collective operations can significantly improve the performance of scientific applications on next-generation systems. The upcoming MPI-3 standard will include non-blocking communication operations. In order to leverage the latency-hiding properties of non-blocking collectives, end parallel applications will need to be redesigned.

Related Entries
Clusters
Collective Communication
Interconnection Networks
Routing (Including Deadlock Avoidance)

Bibliographic Notes and Further Reading
Optimizing collective operations across various networks and systems has been an area of active research for several years. Researchers have investigated optimizing collective communication on the IBM BlueGene supercomputers in [, , , , ]. Designs that leveraged various features offered by Quadrics and Myrinet networks to improve the performance of collective operations were proposed in [, , , , –]. Researchers have explored the challenges associated with offloading collective operations to network adapters in [, , , ]. Software-based optimizations that consider the network topology and multi-core architectures to improve collective operations were proposed by various researchers in [, –, –, , ]. The impact of system noise on the performance of collective operations and real-world applications was studied in [, , , , ]. Recently, researchers have proposed the need for an efficient non-blocking interface for collective operations in [, ].

Bibliography
. Almasi G, Dozsa G, Chris Erway C, Steinmacher-Burow B () Efficient implementation of Allreduce on BlueGene/L collective network. In: Recent advances in parallel virtual machine and message passing interface. Lecture notes in computer science, vol . Springer, Berlin/Heidelberg, pp –
. Almasi G, Heidelberger P, Archer CJ, Martorell X, Chris Erway C, Moreira JE, Steinmacher-Burow B, Zheng Y () Optimization of MPI collective communication on BlueGene/L systems. In: Proceedings of the th annual international conference on supercomputing, ICS ', Cambridge, pp –
. Buntinas D, Panda DK, Duato J, Sadayappan P () Broadcast/multicast over Myrinet using NIC-assisted multidestination messages. In: Proceedings of the th international workshop on network-based parallel computing: communication, architecture, and applications. Springer, London, pp –
. Buntinas D, Panda DK, Sadayappan P () Performance benefits of NIC-based barrier on Myrinet/GM. In: IPDPS, San Francisco, p 
. Faraj A, Kumar S, Smith B, Mamidala A, Gunnels J, Heidelberger P () MPI collective communications on the Blue Gene/P supercomputer: algorithms and optimizations. In: Proceedings of the rd international conference on supercomputing, ICS ', Yorktown Heights, pp –
. Faraj A, Kumar S, Smith B, Mamidala A, Gunnels J () MPI collective communications on the Blue Gene/P supercomputer: algorithms and optimizations. In: Symposium on high-performance interconnects, New York, pp –
. Faraj A, Patarasuk P, Yuan X () A study of process arrival patterns for MPI collective operations. In: Proceedings of the st annual international conference on supercomputing, Seattle, pp –
. Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A, Castain RH,
Collective Communication, Network Support For C 

. Gabriel E, Fagg GE, Bosilca G, Angskun T, Dongarra JJ, Squyres JM, Sahay V, Kambadur P, Barrett B, Lumsdaine A, Castain RH, Daniel DJ, Graham RL, Woodall TS () Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings of the th European PVM/MPI Users' group meeting, Budapest
. Graham R, Poole S, Shamis P, Bloch G, Boch N, Chapman H, Kagan M, Shahar A, Rabinovitz I, Shainer G () Overlapping computation and communication: barrier algorithms and ConnectX- CORE-Direct capabilities. In: Proceedings of the nd IEEE international parallel & distributed processing symposium, workshop on communication architectures for clusters (CAC) ', Atlanta, GA, USA
. Graham R, Poole S, Shamis P, Bloch G, Boch N, Chapman H, Kagan M, Shahar A, Rabinovitz I, Shainer G () ConnectX InfiniBand management queues: new support for network offloaded collective operations. In: CCGrid', Melbourne, – May 
. Graham R, Shipman G () MPI support for multi-core architectures: optimized shared memory collectives. In: Recent advances in parallel virtual machine and message passing interface. Lecture notes in computer science, vol /, Sorrento, pp –
. Gunawan TS, Cai W () Performance analysis of a Myrinet-based cluster, vol . Springer, Netherlands, pp –
. Hoefler T, Mosch M, Mehlan T, Rehm W () CollGM – a Myrinet/GM optimized collective component for Open MPI. In: Proceedings of rd KiCC Workshop , RWTH Aachen
. Hoefler T, Schneider T, Lumsdaine A () The impact of network noise at large-scale communication performance. In: Proceedings of IEEE international symposium on parallel and distributed processing, Rome, pp –
. Hoefler T, Schneider T, Lumsdaine A () Characterizing the influence of system noise on large-scale applications by simulation. In: Proceedings of the rd annual international conference on supercomputing, Greece
. Hoefler T, Squyres J, Bosilca G, Fagg G, Lumsdaine A, Rehm W () Non-blocking collective operations for MPI-. Technical report, Open Systems Lab, Indiana University
. Hoefler T, Squyres JM, Rehm W, Lumsdaine A () A case for nonblocking collective operations. In: Frontiers of high performance computing and networking, ISPA  workshops. Lecture notes in computer science, vol /. Sorrento, Italy, pp –
. Kandalla K, Subramoni H, Panda DK () Designing topology-aware collective communication algorithms for large scale InfiniBand clusters: case studies with Scatter and Gather. In: Workshop on communication architecture for clusters (CAC'), Austin
. Kandalla K, Subramoni H, Santhanaraman G, Koop M, Panda DK () Designing multi-leader-based Allgather algorithms for multi-core clusters. In: Proceedings of the  IEEE international symposium on parallel and distributed processing, IEEE computer society, Washington, DC, pp –
. Karonis NT, de Supinski BR, Foster I, Gropp W, Lusk E, Bresnahan J () Exploiting hierarchy in parallel computer networks to optimize collective operation performance. In: Proceedings of the th international symposium on parallel and distributed processing, Cancun, Mexico, p 
. Kumar S, Sabharwal Y, Garg R, Heidelberger P () Optimization of all-to-all communication on the Blue Gene/L supercomputer. In: Proceedings of the th international conference on parallel processing, Portland, pp –
. Lawrence J, Yuan X () An MPI tool for automatically discovering the switch level topologies of ethernet clusters. In: IPDPS workshop on system management techniques, processes, and services, Miami
. Liu J, Jiang W, Wyckoff P, Panda DK, Ashton D, Buntinas D, Gropp W, Toonen B () Design and implementation of MPICH over InfiniBand with RDMA support. In: Proceedings of international parallel and distributed processing symposium (IPDPS '), Santa Fe
. Mamidala A, Kumar R, Panda DK () MPI collectives on modern multicore clusters: performance optimizations and communication characteristics. In: CCGrid, Lyon
. Mamidala AR, Chai L, Jin HW, Panda DK () Efficient SMP-aware MPI-level broadcast over InfiniBand's hardware multicast. In: IEEE international symposium on parallel and distributed processing, IPDPS , Greece, p 
. Mamidala AR, Vishnu A, Panda DK () Efficient shared memory and RDMA based design for MPI-allgather over InfiniBand. In: th European PVM/MPI user's group meeting, vol . Bonn, – Sept 
. Message Passing Interface Forum () MPI: a message-passing interface standard, Mar 
. Patarasuk P, Yuan X () Bandwidth efficient all-reduce operation on tree topologies. In: HIPS, Long Beach
. Patarasuk P, Yuan X () Efficient MPI Bcast across different process arrival patterns. In: Proceedings of international parallel and distributed processing symposium (IPDPS), Miami, pp –
. Petrini F, Kerbyson DJ, Pakin S () The case of the missing supercomputer performance: achieving optimal performance on the , processors of ASCI Q. In: Proceedings of the  ACM/IEEE conference on supercomputing, Washington, DC, p 
. Scott Hemmert K, Barrett BW, Underwood KD () Using triggered operations to offload collective communication operations. In: Proceedings of the th European MPI users' group meeting conference on recent advances in the message passing interface, EuroMPI', Springer, Berlin/Heidelberg, pp –
. Subramoni H, Kandalla K, Sur S, Panda DK () Design and evaluation of generalized collective communication primitives with overlap using ConnectX- offload engine. In: The th annual symposium on high performance interconnects, HotI , Santa Clara
. Thakur R, Gropp W () Improving the performance of collective operations in MPICH. In: Recent advances in parallel virtual machine and message passing interface. Lecture notes in computer science, vol /. Venice, Italy, pp –
. Tipparaju V, Nieplocha J () Optimizing all-to-all collective communication by exploiting concurrency in modern networks. In: Proceedings of the  ACM/IEEE conference on supercomputing, SC ', Washington, pp –
. Yu W, Buntinas D, Graham RL, Panda DK () Efficient and scalable barrier over Quadrics and Myrinet with a new NIC-based collective message passing protocol. In: th international parallel and distributed processing symposium (IPDPS ') – Workshop , vol . Santa Fe
. Yu W, Buntinas D, Panda DK () High performance and reliable NIC-based multicast over Myrinet/GM-. In: Proceedings of the international conference on parallel processing, Kaohsiung, p 
. Yu W, Sur S, Panda DK, Aulwes RT, Graham RL () High performance broadcast support in LA-MPI over Quadrics. In: Los Alamos computer science institute symposium, LACSI , Santa Fe

COMA (Cache-Only Memory Architecture)
Cache-Only Memory Architecture (COMA)

Combinatorial Search
Laxmikant V. Kalé, Pritish Jetley
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Synonyms
State space search

Definition
Combinatorial search involves the systematic exploration of the space of configurations, or states, of a problem domain. A set of operators can transform a given state to a series of successor states. The objective of the exploration is to find one, all, or optimal goal states satisfying certain desired properties, possibly along with a path from the start state to each goal. Combinatorial search has widespread applications in optimization, logic programming, and artificial intelligence.

Discussion
Given an implicitly defined set, combinatorial search involves finding one or more of its members that satisfy specific properties. More formally, it entails the systematic assignment of discrete values from a finite range to each of a collection of variables. Each member of the set represents a configuration or state that the basic elements of the problem domain can assume. Therefore, the set is also called a state space. Combinatorial optimization problems lend themselves particularly well to expression in this state space search paradigm. The basic components of state space search are the enumeration of states and their evaluation for fitness according to the criteria defined by the problem. Since the state space of these problems can be extremely large, explicit generation of the entire space is often impossible due to memory and time constraints. Therefore, the set of possible states is generated on the fly, by transforming one state into another through the application of a suitable operator. However, in the absence of a mechanism to detect the generation of duplicates, this procedure might begin exploring a path of infinite length, thereby failing to terminate. This is the problem of infinite regress, and can be mitigated by one of two techniques. The first relies on the comparison of a generated state with its ancestors, which are stored either in a distributed table or passed from a parent to its newly-created child state. Another approach constrains the search procedure to consider only cost-bounded solutions. This is the case, for instance, in the Iterative Deepening A* algorithm discussed later. In certain situations, the search may be guided by a heuristic measure of distance between states: those considered closer to the goal may be given priority of consideration over those believed to be more distant.

Searching for All Feasible Solutions
A basic combinatorial search problem is one in which all feasible configurations of the search space are desired. An example is the N-Queens problem: find all configurations of N queens on an N × N chessboard such that no queen attacks, i.e., shares the same row, column, or diagonal, with another. In a particular formulation, a state of the problem is represented by a configuration of the chessboard in which each of the first k rows holds a non-attacking queen and the remaining N − k are empty. The enumeration process can be described as a tree search. The root represents a state in which no queens have been placed on the board. The rest of the search tree is defined implicitly as follows: given a state s with k non-attacking queens placed in the first k rows, its child state has the same arrangement of queens as s, and an additional queen placed in the next empty row.
Defining the search tree in this fashion is advantageous: the tree can be preemptively pruned at nodes that cannot possibly yield solutions. In the running example, child states are only generated if they do not produce conflicts with previously placed queens.
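A minimal sketch of this enumeration, written here in Python purely for illustration (the entry itself prescribes no implementation language), is shown below: partial placements hold one queen per filled row, and children are generated only when the new queen conflicts with none of the queens already placed.

```python
def n_queens(n):
    """Enumerate all feasible placements of n non-attacking queens."""
    solutions = []

    def safe(cols, col):
        # `cols[r]` is the column of the queen in row r; the new queen goes in row len(cols).
        row = len(cols)
        return all(c != col and abs(c - col) != row - r
                   for r, c in enumerate(cols))

    def search(cols):
        if len(cols) == n:                 # all rows filled: a feasible configuration
            solutions.append(tuple(cols))
            return
        for col in range(n):               # children of this node
            if safe(cols, col):            # prune placements that conflict
                search(cols + [col])

    search([])
    return solutions

print(len(n_queens(6)))   # the 6-queens instance has 4 solutions
```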
The search strategy described above is a type of multistep decision procedure (MDP). Each step of the algorithm generates a new state s from its parent p by deciding an element of the configuration space. At step k + 1 of N-Queens, this element is the position of a queen in row k + 1 of the N − k remaining rows. An MDP ensures that no duplicate states are generated. In particular, there can be no recurring states along the path from the start to the current state, so that infinite regress cannot occur.

The tree search defined above is useful even when the configurations are defined differently. Consider the problem of finding a knight's tour on the chessboard: starting at a square and using only legal moves, can a knight visit every square on the board exactly once, returning finally to the starting square? This is a special instance of the Hamiltonian Circuit Problem: given a graph, a circular path must be found that visits every vertex exactly once. Here, the vertices correspond to squares on the chessboard. An edge (u, v) connects vertex u to v such that v can be reached in one knight-move from u. To enumerate all such tours, a tree search might define a state as a knight's path of length k originating at the start. Each child extends the path by adding a move to it.

Variable Selection
Given a particular node in the search tree, there is a choice of which branch to select when considering new children. This is called variable selection. Continuing with the N-Queens example, at step k + 1, any of the remaining N − k rows may be chosen as the recipient of the (k + 1)-th queen, not just the (k + 1)-th row on the chessboard. The size of the tree beneath the current node (and therefore the effort involved) can be significantly affected by the choice of branching variable. A good heuristic is to select the most constrained variable at each step. In the N-Queens example, this would mean placing the next queen in the row with the fewest non-attacked squares.

Parallelization
In a sequential search, the tree is typically explored in its entirety by a depth-first procedure. A stack is used to hold unexplored children at each level. Given a search tree of depth d and branching factor b, this only requires O(b·d) memory, in contrast to the O(b^d) size of the search space. A parallel implementation of this search procedure requires the distribution of work into discrete chunks called tasks. To ensure the efficient execution of these tasks on processors, the inter-related issues of task creation, grainsize control, and load balance must be considered. Grainsize can be defined roughly as the ratio of computation work to the number of messages sent. There is a certain overhead in creating tasks, and a separate overhead if the task description is moved to another processor. Thus, it is important to keep the average grainsize above a certain threshold to limit the impact of parallel overhead. At the same time, no single task should be so large as to make all processors wait for its completion.

A parallel search may define a task as the subtree beneath a single tree node. Grainsize estimation then requires a simple metric which is correlated to the computational cost of a node. In N-Queens this could be the number of queens that remain to be placed. It is not an exact measure: sometimes a node with many queens remaining may still generate very little work under it, because of the impossibility of finding a solution there. With a suitable metric formulated, a threshold amount of work may be set, below which new tasks are not created; such subtrees are explored sequentially. Continuing with the N-Queens example, an efficient serial backtracking mechanism may be employed when, say, only m ≤ N queens remain. The threshold m could be estimated by the exploration of small parts of the search space.

Another technique defines a task as a set of frontier nodes. The exploration proceeds in a depth-first manner, maintaining an explicit stack. At some point, the stack may be split in two (or more) pieces, with each piece assigned to a new task on a possibly different processor. Typically, nodes near the bottom of the stack are used to create a new task. However, vertical splitting has also been proposed, wherein half of the unexplored branches at each level of the stack are picked to form starting positions for a new task.
This strategy works well for irregular search spaces, but can be expensive for deep stacks.

The literature discusses two methods to determine when stacks are split. The first combines grainsize control with the load balancing strategy: in "work-stealing" a processor's stack is split upon receipt of a request for work from an idle processor. The idea was first described by Lin and Kumar [] and later formalized by the Cilk system []. An alternative, due to Kalé et al. [], aims to separate the two issues: the amount of work done by a task is tracked. Once an amount of work above a certain threshold T has been performed, the stack is split into k pieces, each assigned to a new task. This ensures that the grainsize is T/(k + 1). However, care must be taken to avoid long chains of large tasks, i.e., when exactly one child has significant work, whereas others do not.
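The sketch below illustrates, under simple assumptions, how an explicit DFS stack might be carved into tasks; the callbacks `children`, `process`, `should_split`, and `spawn_task` are placeholders that a real runtime or application would supply, not part of any particular system described in this entry.

```python
def split_stack(stack):
    """Donate unexplored nodes near the bottom of the stack (they tend to root
    the largest remaining subtrees) and keep the rest for the current task."""
    cut = len(stack) // 2
    return stack[:cut], stack[cut:]          # (donated, kept)

def search_task(roots, children, process, should_split, spawn_task):
    """Depth-first exploration of a task; occasionally splits its frontier."""
    stack = list(roots)
    while stack:
        if should_split() and len(stack) > 1:
            donated, stack = split_stack(stack)
            spawn_task(donated)              # e.g., hand the new task to an idle processor
        node = stack.pop()
        process(node)
        stack.extend(children(node))
```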
Load Balancing
Early literature classifies load balancers as sender- or receiver-initiated. Work-stealing is a receiver-initiated load balancing technique. When a processor has no work, it steals work from a randomly chosen processor. It has been shown that random stealing of shallow nodes is asymptotically optimal, in that it leads to near-linear speedups []. The stealing mechanism could vary from messaging (on distributed memory systems) to exclusive shared queue access. In contrast, sender-initiated schemes assign newly spawned tasks to some processor, either randomly, or based on a load metric such as queue size.

This taxonomic distinction fails to hold for load balancing schemes where each processor monitors the size of its own queue and that of its "neighbors," periodically balancing them. Initially, a created task is placed on the creating processor's queue. Unlike random assignment, this avoids unnecessary communication. Depending on the thresholds used to exchange queues and the periodicity of load exchanges, such schemes can tune themselves effectively. At one extreme, they can approximate work-stealing, where neighborhoods are global, an empty queue is used as a trigger, and rebalancing is done by selecting a random neighbor. Another advantage is their proactive agility: they increase efficiency by moving work before a processor goes idle. However, these schemes create more communication traffic.

Searching for Any Feasible Solution
Unlike the N-Queens or graph coloring problems where the objective is to identify all ways of satisfying given constraints, certain situations require the generation of any satisfactory configuration. A classic example is the -SAT problem. Value ordering becomes an important heuristic in this context: given b children of a node, explore next the child whose subtree is most likely to contain a solution. Problem-specific figures of merit are used to order children. For instance, a simple -SAT heuristic might rank the two values possible for a variable (true or false) in decreasing order of the number of clauses satisfied by each.
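One way such a value-ordering heuristic could look is sketched below; the clause encoding (tuples of signed variable indices) and the function name are illustrative choices, not taken from the entry.

```python
def order_values(var, clauses, assignment):
    """Return True/False ordered by how many clauses each value satisfies
    under the current partial assignment (clauses with no true literal yet
    simply count as unsatisfied)."""
    def satisfied(value):
        trial = dict(assignment)
        trial[var] = value
        return sum(any(trial.get(abs(lit)) == (lit > 0) for lit in clause)
                   for clause in clauses)
    return sorted([True, False], key=satisfied, reverse=True)

clauses = [(1, -2, 3), (-1, 2, 3), (1, 2, -3)]
print(order_values(1, clauses, {}))   # [True, False]: True satisfies two clauses, False only one
```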
The same parallelization techniques as outlined for the all-solutions search may be used, modifying them to terminate when a solution is found. However, note the speculative nature of the methods in this context: regions of the tree that may not be visited in a sequential procedure are searched concurrently in the hope that a solution may be found quicker. This can lead to anomalies in parallel performance [], because nodes are not visited in the same order by the parallel algorithm. In the worst case, an added processor may generate useless work for other processors in the form of nodes that do not yield solutions. This situation is referred to as a detrimental speedup anomaly. On the other hand, as illustrated in Fig. , this addition of resources could lead to superlinear speedups (the acceleration anomaly). Even on a fixed processor count, variations in the timing of load balancing could lead to widely differing execution spans between runs.

Combinatorial Search. Fig.  Acceleration anomaly in parallel depth-first search: (a) illustrates the state space tree of a problem. Search begins at node ; node  is the goal state. A serial depth-first search for the first solution takes  steps; (b) shows a schedule for the parallel depth-first search of the same tree with three processors. The search takes four steps, for a speedup greater than the number of processors.
Kalé et al. [] obtain consistent speedups by prioritizing node exploration in the following manner: if a node with k children has priority q, each child's priority is obtained by appending its log k-bit rank to q. This leads to a lexicographic ordering of tree nodes. Therefore, all descendants of a node's left child have higher priority than the right child's descendants. A prioritized queue (on a shared memory machine) or a prioritized load balancing scheme is used to steer processors toward high-priority work. This scheme also leads to low memory usage: the search frontier forms a characteristic broom shape that sweeps the state space in accordance with the ordering on the nodes. Assuming a constant branching factor, the total memory usage with p processors is O(p + d) rather than the O(p·d) requirement of the depth-first tree. On distributed memory machines, achieving this bound depends on the quality of the prioritized load balancer. Furthermore, since such schemes move work from the left part of the tree to all processors, they tend to have a sizable communication overhead. Therefore, a simple depth-first search may be more suitable in some cases.

Searching for an Optimal Solution
Some problems assign measures of fitness to feasible solutions. This renders certain solutions "better" than all others by some metric, so that the objective becomes the search for an optimal solution. Examples include integer programming, solving the -puzzle (or Rubik's cube) with the fewest moves, and the search of a graph for a least-cost Hamiltonian cycle. One could use the all-solutions methods discussed previously, and then select the best solution among all found. However, this is typically wasteful, and more efficient search methods are available.

A* Search and Iterative Deepening
In many problems, it is possible to define a heuristic function that, given a node n, computes a lower bound on the cost of any solution in the subtree beneath it. Such a function is called an admissible heuristic. For instance, in the -puzzle, the number of tiles that are out of place is an admissible heuristic: at least that many moves are needed to attain the goal state, since only one tile is shifted per move. (The "Manhattan distance" of each tile from its final position is a stronger heuristic.) The lower bound on the total cost of a node is the sum of this value and the cost of arriving at it from the start. If unopened nodes are processed in ascending order of their lower bounds, it can be shown that the first solution found is optimal []. This is called the A* search procedure. The A* strategy leads to an exponentially sized node queue; by contrast, depth-first search is highly memory efficient. The Iterative Deepening A* (IDA*) technique developed by Korf [] combines the best properties of the two. It is applicable when the solution cost is quantized. For example, the number of moves needed in the -puzzle is an integer. If there is no solution of d moves, the procedure looks for a solution of d + 2 moves. (Because of parity arguments, the number of moves needed for a given starting state is known to be either even or odd.) A depth-first search bounded by a cost d may then be organized. An admissible heuristic is used to stop the search below nodes with cost lower bounds greater than d. If there is no solution of cost d, the bound is increased to the next possible value (d + 2 for the -puzzle), and the search is restarted. Although there is duplication of the higher levels of the search tree, this technique is asymptotically optimal in the amount of work done []. Further, it offers control over the amount of memory used, while maintaining the property that the first solution found is the optimal one. Moreover, this method precludes the problem of infinite regress without the need for duplicate detection.
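A compact sequential IDA* sketch following this description is given below; the toy problem in the usage line (counting down in steps of one or two) is invented solely to show the interface.

```python
def ida_star(start, goal_test, successors, h):
    """Iterative-deepening A*: repeated cost-bounded depth-first searches.
    `h` must be admissible and step costs must be positive, so the cost bound
    itself limits the search depth (no duplicate detection is needed)."""
    bound = h(start)
    path = [start]

    def dfs(node, g, bound):
        f = g + h(node)
        if f > bound:
            return f                      # smallest such overflow sets the next bound
        if goal_test(node):
            return "FOUND"
        smallest = float("inf")
        for child, step_cost in successors(node):
            path.append(child)
            result = dfs(child, g + step_cost, bound)
            if result == "FOUND":
                return "FOUND"
            smallest = min(smallest, result)
            path.pop()
        return smallest

    while True:
        result = dfs(start, 0, bound)
        if result == "FOUND":
            return list(path)             # first solution found is optimal
        if result == float("inf"):
            return None                   # search space exhausted: no solution
        bound = result                    # restart with the next possible bound

# Toy usage: reach 0 from 9 in steps of 1 or 2; h(n) = ceil(n/2) is admissible.
succ = lambda n: [(n - 1, 1), (n - 2, 1)] if n > 1 else ([(0, 1)] if n == 1 else [])
print(ida_star(9, lambda n: n == 0, succ, lambda n: (n + 1) // 2))   # e.g. [9, 8, 6, 4, 2, 0]
```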
An effective parallelization of IDA* is described by Kalé et al. []. It uses bitvector prioritization to overlap execution of multiple iterations with different bounds. This ensures that the last iteration (the one in which the solution is found) is executed in a way that minimizes speculative loss, leading, as before, to consistent and monotonic speedups. In addition, the latter part of one iteration, where it winds down causing low processor utilization, is speculatively overlapped with the start of the next, thus improving efficiency.

Branch-and-Bound
Optimization problems can also be solved using the branch-and-bound technique, wherein properties of partial solutions are used to discard infeasible portions of the search tree. This can reduce the effort expended in finding optimal solutions.
The main components of this mechanism are the branching and bounding procedures. Given a node n in the search tree, the branch procedure generates a finite number of children. The bounding procedure assigns to each child a cost bound that determines when the child is explored, and whether it is explored at all. Nodes are pruned if they need not be explored. In addition, the bounding function must be monotonic: the bound of a node may be no better than the bound of its parent. An exploration mechanism is required to choose between the children of a node generated at each step. Such a mechanism may prioritize children based on their depth or their estimated ability to yield a solution. The procedure terminates when all subproblems have either been explored or pruned.

Consider the Traveling Salesman Problem. Whereas more efficient bounding techniques have been discussed in the literature, the following naïve procedure is presented for the purpose of exposition. A node n represents a partial solution comprising paths between cities such that they form a (possibly incomplete) tour. The children of n are enumerated by listing cities which can be visited from the most recently added destination. A partial solution has a cost bound equal to the length of the path that it represents. This is a monotonic lower bound, since the cost of a path through a child of n is at least as great as the cost of a path through n itself. The algorithm tracks the cost c of the cheapest complete tour encountered up to a certain point in the search. Notice that the tree can be pruned at children with lower bounds greater than c. Starting with an empty tour, all possible tours may be considered using this procedure.
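The sketch below implements this naïve procedure directly; the four-city distance matrix is an arbitrary example, not taken from the entry.

```python
def tsp_branch_and_bound(dist):
    """Naive branch-and-bound TSP: a node is a partial tour, its bound is the
    path length so far (monotonic), and children extend the tour by one city."""
    n = len(dist)
    best_cost = float("inf")
    best_tour = None

    def search(tour, cost):
        nonlocal best_cost, best_tour
        if cost >= best_cost:            # prune: bound already exceeds the cheapest tour
            return
        if len(tour) == n:               # complete tour: close the cycle
            total = cost + dist[tour[-1]][tour[0]]
            if total < best_cost:
                best_cost, best_tour = total, tour[:]
            return
        last = tour[-1]
        for city in range(n):            # branch: cities reachable from the last stop
            if city not in tour:
                search(tour + [city], cost + dist[last][city])

    search([0], 0)                       # fix city 0 as the starting point
    return best_cost, best_tour

dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(tsp_branch_and_bound(dist))        # (18, [0, 1, 3, 2])
```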
The basic data structure in branch-and-bound is a prioritized queue that stores the nodes comprising the frontier of the search. Anomalous speedups can result if these nodes are processed in an unordered fashion. The literature suggests unambiguous heuristic functions [] for shared queues of nodes. Such a function h differentiates between nodes based on their values, so that for nodes n1 and n2, h(n1) = h(n2) ⇔ n1 = n2. Even with consistent speedups, performance is bound by queue locking and access overheads. These can be mitigated by the use of concurrent priority queues.

In distributed memory implementations, the frontier may either be stored in a centralized or distributed manner. The former strategy usually engenders master-slave parallelism, where the master processor distributes work from a central node pool to worker processors. This approach retains complete information about the global state of the search, thereby enabling an optimal exploration of the frontier without any speculative work. However, it is not scalable due to the bottleneck at the master. Distributing the node pool affords more autonomy to individual processors. Efficiency can be increased if a processor broadcasts newly encountered lower bounds to others. This helps other processors discard nodes that cannot yield optimal solutions. However, since it is hard to track node quality, effort might be wasted in exploring infeasible nodes for want of accurate bound information. Quality equalization may be performed to reduce speedup anomalies and distribute useful work equitably. This is either done periodically or when an associated trigger is activated. Equalization involves the movement of promising nodes between processors, which can be done in a hierarchical fashion to reduce communication. Grama and Kumar [] and Kalé et al. [] survey such parallel branch-and-bound techniques.

Bidirectional Search
Problems such as Rubik's Cube stipulate a priori the exact configuration of the goal state. In such problems, a bidirectional search may be employed to construct a path to the goal from the start state. By initiating two paths of search, one moving forward from the start state and the other backward from the goal, the size of the search space explored can be reduced substantially. On average, a unidirectional search of a tree of depth d and branching factor b visits O(b^d) states to find an optimal solution. In contrast, by starting two opposing searches that meet, on the average, at depth d/2, only O(b^(d/2)) states are explored.

There are two main ways of organizing the forward and backward searches. The first method uses a backward depth-first procedure to exhaustively explore the tree starting from the goal, up to a certain height h above it. The states generated by this backward search are stored in the intermediate goal layer, the size of which is limited by the amount of memory available. The forward search is a (memory-efficient) depth-first procedure or a best-first search with iterative deepening. Instead of checking for goal states, the forward search looks for each enumerated state in the intermediate layer. A match indicates the presence of a solution.
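A sequential sketch of this first organization is given below, under the assumption that the application supplies `backward_moves` (predecessor generation) and `forward_moves` (successor generation); the function and parameter names are illustrative only.

```python
def build_goal_layer(goal, backward_moves, h):
    """Map each state within h backward moves of the goal to the move sequence
    (list of successor states) that leads from it to the goal."""
    layer, frontier = {goal: []}, [(goal, 0)]
    while frontier:
        state, depth = frontier.pop()
        if depth == h:
            continue
        for prev in backward_moves(state):
            if prev not in layer:
                layer[prev] = [state] + layer[state]
                frontier.append((prev, depth + 1))
    return layer

def forward_search(start, forward_moves, layer, depth_limit):
    """Memory-efficient depth-first forward search that tests membership in
    the intermediate layer instead of testing for the goal itself."""
    stack = [(start, [start])]
    while stack:
        state, path = stack.pop()
        if state in layer:                      # a match indicates a solution
            return path + layer[state]
        if len(path) <= depth_limit:
            for nxt in forward_moves(state):
                stack.append((nxt, path + [nxt]))
    return None
```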
For distributed memory systems, the intermediate layer may either be replicated on every processor, or distributed across all available processors. Using replicas of the intermediate layer lowers the communication cost of the algorithm, but forces a reduction in the depth of the backward search. Kalé et al. [] describe the use of distributed tables in multiprocessor bidirectional search.

A second class of bidirectional search schemes uses best-first techniques in either direction. This approach has its pitfalls: the frontiers of the forward and backward search may pass each other, resulting in two non-intersecting paths (in opposite directions) from the start to the goal. Therefore, instead of doing less work, the algorithm will have performed more work than a unidirectional search, yielding poor performance. To overcome this situation, wave-shaping algorithms attempt an intersection between the forward and backward search frontiers. Nelson and Toptsis [] survey bidirectional search and provide parallel variants of pioneering uniprocessor techniques. Pohl presented the original non-wave-shaping uniprocessor bidirectional search. De Champeaux and Sint formulated a wave-shaping algorithm which estimates the distance between the advancing frontiers to encourage an intersection. This was improved upon by Politowski and Pohl to reduce the computational complexity of the heuristic.

Kaindl and Kainz [] provide empirical evidence suggesting that bidirectional search is inefficient not because of non-intersecting frontiers, but because of the effort expended in trying to establish the optimality of the various paths constructed upon their meeting. Even with the most efficient implementations, bidirectional search can sometimes fail to provide the expected speedups, because of the structure of the problem's state space. Consider the game of peg-solitaire. The initial states of the game have many pegs but few spaces to allow jumps, so that there are few available moves. As the balance between free spaces and pegs becomes more even, more moves can be made and the branching factor increases. Toward the terminal stages of the game, few pegs remain on the board and the average branching factor once again reduces dramatically. Therefore, the state space of peg-solitaire does not form a tree that fans out with increasing depth. Consequently, a unidirectional search from the start state visits significantly fewer than O(b^d) nodes, and there is no practical advantage to using bidirectional search.

Game Tree Search
Game playing can be represented by trees that trace the sequence of moves made by adversaries. Levels of these trees are marked max and min in an alternating fashion, depicting the moves made by the player and the opponent, respectively. The leaves represent the various outcomes of the game, and are assigned values commensurate with their estimated favorability. However, the enumeration of all possible outcomes and paths to their corresponding leaves can be prohibitively expensive. Chess, for example, has on the order of  states. Therefore, given a current state, modern game-playing strategies limit the search for good states to a certain look-ahead depth d from it. The minimax principle posits that under the assumption of rational play, an optimal strategy can be formulated by picking the most favorable child c of the current node n at each turn. A good approximation of the optimal move at n can be computed by evaluating all nodes under it up to a sufficient depth d, and choosing the best child c. Using a depth-first procedure, this requires O(b^d) time and O(b·d) space, where b is the branching factor.

Alpha-Beta Pruning
The alpha-beta pruning procedure tracks the values of nodes encountered in the left-to-right, depth-first traversal of the tree to reduce the amount of work done in searching for optimal moves. For this purpose, the procedure tracks the lower bound on the projected value of each max node (α) and the upper bound on the projected value of each min node (β). Consider a max node n that has two min children l and r. Suppose that the value of its left child l is found to be v(l). Then, the procedure may prune the tree at any children of r once it processes a child c of r such that v(c) < v(l). This pruning reduces the number of nodes significantly. However, the left-to-right order of tree traversal makes it challenging to formulate an efficient parallelization.
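A standard sequential alpha-beta sketch matching this description is shown below; the tiny two-level tree in the usage lines is an invented example.

```python
def alphabeta(node, depth, alpha, beta, maximizing, children, evaluate):
    """Depth-limited minimax with alpha-beta pruning: alpha is the lower bound
    established at max nodes, beta the upper bound at min nodes; a subtree is
    pruned as soon as the two bounds cross."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    if maximizing:
        value = float("-inf")
        for child in kids:                      # left-to-right traversal
            value = max(value, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:                   # remaining siblings cannot change the result
                break
        return value
    else:
        value = float("inf")
        for child in kids:
            value = min(value, alphabeta(child, depth - 1, alpha, beta,
                                         True, children, evaluate))
            beta = min(beta, value)
            if alpha >= beta:
                break
        return value

tree = {"A": ["B", "C"], "B": [], "C": []}
scores = {"B": 3, "C": 5}
print(alphabeta("A", 2, float("-inf"), float("inf"), True,
                lambda n: tree[n], lambda n: scores.get(n, 0)))   # 5
```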
Parallel forms of the algorithm address the different kinds of node in the tree. Principal Variation Splitting evaluates the leftmost branch of a game tree before allowing the parallel evaluation of the remaining sibling branches. Ideally,
the leftmost branch represents the optimal sequence of moves for the player (game tree branches are generally ordered so that the principal variation is the leftmost branch of the tree) and so yields the tightest bounds for the pruning of the rest of the tree, greatly reducing the amount of work done in parallel. Once the leftmost branch l has been scored, sibling nodes are evaluated in parallel using the bounds obtained from l. The siblings of a node n to its right may be sent refinements of bounds as these are calculated by n. The amount of parallelism is limited by the branching factor of the problem. Further, load imbalance between siblings causes synchronization delays.

The Young Brothers Wait Concept (YBWC) orders nodes similarly: the first child of a node (the eldest brother) must be explored before the others (the younger brothers) are examined. This scheme extends parallel evaluation beyond the principal variation nodes. Processors are said to own nodes if they are evaluating the subtrees beneath them. Initially, all processors except one, p0, are idle. Processor p0 is given ownership of the root node. Idle processors request work from those that have satisfied the YBWC sibling order constraint, i.e., those processors that have evaluated at least one eldest brother n. This establishes a master–slave relationship between the sender and the recipient of work, and marks a split-point at N, which is the parent of n. The master and the slave cooperate to solve the tree under N. Further, a master that has become idle can request work from a slave that has not completed evaluation. Improved bound and cut-off information is shared with collaborating processors. A stronger formulation [] of the parallelism constraint has been used to yield good speedups in chess playing on distributed memory machines. Dynamic Tree Splitting uses a similar collaboration technique in the context of shared memory systems. However, split-points are chosen according to constraints expressed in terms of the α and β values of nodes.

The Deep Blue computer chess system [] used a master-slave approach to parallelism. Upper levels of the tree were evaluated by the master, the slave processors being allotted work as the search grew deeper. To alleviate the performance bottleneck at the master, slave nodes were always kept busy with "on-deck" jobs. Since search was performed using a hybrid software-hardware approach, load was balanced by pushing long hardware searches into software, and by sharing large pieces of work between workers. In , Deep Blue defeated the then-reigning world champion Garry Kasparov.

AND-OR Tree Search
AND-OR trees arise naturally in the execution of logic programs, and in the problem-solving and planning literature within artificial intelligence. The min-max trees used in evaluating two-person games also reduce to AND-OR trees for the special case when the value of a node can only be a win or a loss. Another example of such trees arises in the graph coloring problem, when subgraphs may be colored independently following the coloring of a partitioning layer of vertices.

In logic programming terminology, the top-level query is a conjunction of literals called goals. Typically, multiple clauses are available to solve a given literal. Each clause is a conjunction of literals. The evaluation of a query then naturally leads to an AND-OR tree. There are two kinds of node in an AND-OR tree: an AND-node requires solutions to each of its children, whereas an OR-node requires a solution to at least one of its children. Search procedures used for simple state space search may be extended to this case. In particular, to construct a solution, an AND-node requires that information be sent up from its subtrees. A number of approaches to the effective organization of this search procedure have been surveyed, chiefly in the context of logic programming, by Gupta et al. [].

An interesting issue concerns the dependence between the sub-problems represented by the children of AND nodes. Consider a query such as: p(a, X), q(b, Y), r(c, X, Y, Z). Upper case letters represent variables that are instantiated by a solution. A solution to p may require that X have a value d, and a solution to q may require that Y have a value e. Unless there is a solution to r that also has X = d and Y = e, these solutions to p and q are not useful. Therefore, the search space under r is constrained by using solutions produced by p and q. Whereas the subtrees corresponding to p and q can be explored in parallel, for r, a subtree can only be created for each instance produced by the cross-product of solutions to p and q. This "consumer-instance" parallelism is supported by the REDUCE-OR process model [].
process model []. Further, to avoid wasted work, the . Dechter R, Pearl J () Generalized best-first search strategies
occurrence of duplicate states along distinct paths must and the optimality of A∗ . J ACM ():–
. Korf RE () Depth-first iterative-deepening: an optimal
be addressed. This issue is separate from the problem of
admissible tree search. Artif Intell :–
infinite regress described earlier, which arose due to the
. Li G-J, Wah BW () Coping with anomalies in parallel branch-
generation of duplicates along the same path. To avoid and-bound algorithms. IEEE Trans Comput ():– C
duplication of effort, the newer state may be made a . Grama A, Kumar V () State of the art in parallel search tech-
client for the result generated by the older instance. The niques for discrete optimization problems. IEEE Trans Knowl
issue of duplicates arises in other search patterns as well Data Eng ():–
. Nelson PC, Toptsis AA () Unidirectional and bidirectional
(Game Tree Search, IDA∗ , etc.) In these paradigms, it
search algorithms. IEEE Softw :–
may be advisable to terminate one of these two paths, . Kaindl H, Kainz G () Bidirectional heuristic search reconsid-
depending on their (under)estimated costs. ered. J Artif Intell Res :–
. Feldmann R, Mysliwiete P, Monien B () Studying overheads
More Search Techniques in massively parallel min/max-tree evaluation. In: SPAA ’:
Several search techniques exist in addition to the ones proceedings of the sixth annual ACM symposium on parallel
algorithms and architectures, ACM Press, New York, pp –
described in this chapter. Path finding in the context
. Campbell M, Hoane AJ, Hsu F-H () Deep blue. Artif Intell
of imperfect graph information is done using dynamic (–):–
algorithms such as D* and LPA*. Little work has been . Gupta G, Pontelli E, Ali KAM, Carlsson M, Hermenegildo MV
done on the parallelization of these techniques. () Parallel execution of prolog programs: a survey. ACM
Many metaheuristic techniques have been devel- Trans Program Lang Syst ():–
. Kalé LV () The REDUCE-OR process model for parallel
oped. The metaheuristic approach uses generalized
execution of logic programs. J Logic Program ():–
search mechanisms, usually inspired by physical and
natural phenomena, in lieu of problem-specific heuris-
tics and techniques. Examples include Genetic Algo-
rithms, Simulated Annealing, Local and Tabu Searches,
Commodity Clusters
etc. The use of graphics processors as offload devices to
aid the solution of such state space search problems has Clusters
also been considered recently.

Related Entries
Charm++
Communicating Sequential
Cilk
Processes (CSP)
Logic Languages
CSP (Communicating Sequential Processes)
Bibliography
. Lin Y-J, Kumar V () And-parallel execution of logic pro-
grams on a sharedmemory multiprocessor. J Logic Program
(–):– Community Atmosphere Model
. Blumofe RD, Joerg CF, Kuszmaul BC, Leiserson CE, Randall KH,
Zhou Y () Cilk: an efficient multithreaded runtime system.
(CAM)
J Parallel Distrib Comput ():–
Community Climate System Model
. Kalé LV, Ramkumar B, Saletore V, Sinha AB () Prioritization
in parallel symbolic computing. In: Ito T, Halstead R (eds) Lecture
notes in computer science, vol . Springer-Verlag, Heidelberg,
pp –
. Rao VN, Kumar V () Superlinear speedup in parallel state-
space search. In: Proceedings of the eighth conference on foun-
Community Climate Model (CCM)
dations of software technology and theoretical computer science,
Springer, London, UK, pp – Community Climate System Model
Community Climate System Model

Patrick H. Worley, Mariana Vertenstein, Anthony P. Craig
Oak Ridge National Laboratory, Oak Ridge, TN, USA
National Center for Atmospheric Research, Boulder, CO, USA

Synonyms
Community atmosphere model (CAM); Community climate model (CCM); Community climate system model (CCSM); Community earth system model (CESM); Community ice code (CICE); Community land model (CLM); Model coupling toolkit (MCT); Parallel I/O library (PIO); Parallel ocean program (POP)

Definition
The Community Climate System Model, CCSM, is a parallel climate model consisting of four parallel geophysical component models (atmosphere, land, ocean, and sea ice) that exchange boundary data periodically through a parallel coupler component. It is a freely available community model developed by researchers funded by the US Department of Energy (DOE), the National Aeronautics and Space Administration (NASA), and the National Science Foundation (NSF), and maintained at the National Center for Atmospheric Research. CCSM targets a wide variety of computing platforms, ranging from workstations to the largest DOE, NASA, and NSF supercomputers.

Discussion
Introduction
Investigating the impact of climate change is a computationally expensive process, and as a result modern climate models are designed to take advantage of high performance computing (HPC) architectures. The Community Climate System Model (CCSM) is the best known of a series of climate models that have been developed by and maintained at the National Center for Atmospheric Research (NCAR), with contributions from external researchers funded by the US Department of Energy (DOE), National Aeronautics and Space Administration, and National Science Foundation. In  the model was extended and renamed the Community Earth System Model (CESM). This article focuses on CCSM, reserving a brief discussion of CESM for section "Community Earth System Model".

CCSM consists of a system of four parallel geophysical component models (atmosphere, land, ocean, and sea ice) that exchange two-dimensional boundary data (flux and state information) periodically through a parallel coupler. The coupler coordinates the interaction and time evolution of the component models, and also serves to remap the boundary-exchange data in space. The atmosphere model is CAM, the Community Atmosphere Model. The ocean model is POP, the Parallel Ocean Program. The land model is CLM, the Community Land Model. The sea ice model is CICE, the Community Ice Code.

All component models are hybrid parallel application codes, using MPI, the Message Passing Interface, to define and coordinate distributed-memory parallelism and OpenMP to define and coordinate shared-memory parallelism. The components are all, for the most part, written in Fortran , though this is not a requirement. CCSM is unusual among parallel computational science models in that the hybrid message-passing/shared-memory parallel programming paradigm was implemented within many of the component models early on, and is now supported in all components except the coupler. It has been found that using both message-passing and shared-memory parallelism has been critical for achieving good performance on many past and current HPC architectures. Another unusual aspect of CCSM is that it is a community code that is evolving continually to evaluate and include new science. In consequence it has been very important that CCSM be easy to maintain and port to new systems, and that CCSM performance be easy to optimize for new systems or for changes in problem specification or processor count.

CCSM supports a large range of computational grid resolutions, with current production runs spanning a nominal ° horizontal grid (approximately , grid points per vertical level for the atmosphere and land and , for the ocean and sea ice) for paleoclimate simulations, to .° atmosphere and land horizontal grids (approximately , grid points per level) and .° resolution ocean and sea ice grids (approximately ,, grid points per level) that can resolve global tropical cyclones.
(approximately ,, grid points per level) that can Terminology


resolve global tropical cyclones. Furthermore, the atmo- To help differentiate between MPI and OpenMP-
sphere component can span the range of altitudes from based parallelism, processes will be referred to as MPI
the Earth’s surface to the thermosphere. CCSM also processes or tasks. Computational threads associated
supports a variety of physical process options. Exam- with an MPI process but which are spawned by the C
ples include different atmospheric chemistry packages OpenMP runtime system will be referred to as OpenMP
and the option to run with a full carbon/nitrogen bio- threads.
geochemical cycle in the ocean and land. Figure  is a Each component model utilizes one or more spa-
snapshot of precipitable water from a high resolution tial computational grids with at least two horizontal
simulation. indices, e.g., representing longitude and latitude, and
For low resolution simulations and less expensive one vertical index. All grid points with a given horizon-
physical process options, CCSM can be run on small tal location are referred to as a “vertical column.” For
clusters and even on a laptop computer. For high some of the component models each grid point also rep-
resolution simulations and the more computationally resents a geographical region surrounding that physical
expensive physics options, the largest available super- location. In this case a grid point is also referred to as a
computers are required, both to satisfy the memory “grid cell.”
requirements and to achieve a reasonable throughput Component models approximate the solution at a
rate for the simulations. discrete set of simulation times, with results at one
Each component model is a parallel application simulation time dependent only on data and results
code in its own right, and was developed for the associated with the same or earlier simulation times.
most part independently from the other compo- Moreover, results for earlier simulation times are always
nent models. These are discussed in turn. Common completed before the computation for a later time is
technologies used in multiple components are also begun. The computation for a given simulation time is
described. a “timestep.”

Community Climate System Model. Fig.  A snapshot of the column integrated precipitable water from a
high-resolution CCSM simulation. The snapshot is from early January and shows well-resolved winter-time cyclone
activity in the Indian Ocean. Visualization courtesy of J. Daniel of Oak Ridge National Laboratory
The parallel implementation of each component model is based on one or more decompositions of a computational grid, with each MPI process assigned data and the responsibility of computing results associated with a subset of the grid. Computing the next timestep for a given subset of the grid will sometimes require data and results associated with grid points external to the subset. These external grid points are referred to as the "halo" for the grid subset. For a parallel algorithm based on grid decomposition, a "halo update" is the interprocess communication required to acquire the halo information that resides in the memory space of other processes.
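The sketch below is not CCSM code; it is a generic one-dimensional halo update, assuming mpi4py and NumPy are available, in which each process exchanges one boundary row with each neighbor so that its halo rows mirror the neighbors' edge data.

```python
# Hedged illustration of a halo update for a 1-D block decomposition.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nlocal, ncols = 4, 8                                   # interior rows owned by this process
field = np.full((nlocal + 2, ncols), float(rank))      # rows 0 and -1 are halo rows

up   = rank - 1 if rank > 0 else MPI.PROC_NULL         # PROC_NULL turns boundary exchanges into no-ops
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Send the first interior row up and receive the halo row from below, and vice versa.
comm.Sendrecv(field[1].copy(),  dest=up,   recvbuf=field[-1], source=down)
comm.Sendrecv(field[-2].copy(), dest=down, recvbuf=field[0],  source=up)
```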
Community Climate System Model
The first fully coupled system in the CCSM line of development, version  of the Community Climate System Model (CCSM), was released in . Although CCSM included support for both distributed-memory multiprocessor systems (via message passing) and parallel shared-memory vector systems, it was officially supported only on the NCAR Cray computing systems. A subsequent minor release added support for the SGI Origin  system. CCSM, the Community Climate System Model version , was released in  and was the first version to use CAM, CLM, and POP for the atmosphere, land, and ocean component models, respectively. Target architectures included IBM SP systems, SGI Origin , and Compaq/DEC AlphaServer. Version  (CCSM) was released in June  and version  (CCSM) was released in April . CCSM was the first public release to use CICE for the sea ice component. CCSM theoretically runs on any distributed-memory system with an MPI library and a compiler capable of compiling CCSM and the required libraries, such as netCDF. At the high end, CCSM ran on Cray XT, IBM BlueGene/P, and IBM Power cluster systems in .

For science reasons, the atmosphere, land, and sea ice models are partially serialized in time, limiting the fraction of time when all four CCSM components can execute simultaneously. In the first three versions of CCSM each component model and the coupler were run as separate executables assigned to nonoverlapping processor sets. As of CCSM, the entire system is now run as a single executable and there is greatly increased flexibility to select the component processor layout. It is typical for the atmosphere, land, and sea ice models to run on a common set of processors, while the ocean model runs concurrently on a disjoint set of processors. This is not a requirement, however, and CCSM can now run all components on disjoint processor subsets, all on the same processors, or any combination in between.

Each component model has its own performance characteristics, and the coupling itself adds to the complexity of the performance characterization. The first step in CCSM performance optimization is to determine the optimized performance of each of the component models for a number of different processor counts. This performance information is then used to determine how to assign processors to components in order to optimize the performance of CCSM as a whole. Cartoons of two example configurations for a Cray XT are displayed in Fig. . Figure  describes the performance scaling for the larger of the two problems used in Fig. .

Community Atmosphere Model
CAM and its predecessors were all developed at NCAR. The first version (CCM) was completed in , followed by CCM in , CCM in , and CCM in . CCM and prior versions targeted shared-memory parallel vector systems. An experimental distributed-memory parallel version of CCM was developed through the DOE CHAMMP (Computer Hardware, Advanced Mathematics, and Model Physics) program, targeting the Intel Delta initially and later ported to other systems including the Intel Paragon, IBM SP, and Cray TE. However, CCM was the first official release with a distributed-memory parallelization option. In , a new version was released and the name of the model was changed to the Community Atmosphere Model, to better reflect its role in the fully coupled climate system. CAM was also a significant departure from the CCM in terms of its software architecture, resulting in much improved extensibility, maintainability, portability, and parallel performance.

CAM is characterized by two computational phases: the dynamics, which advances the evolutionary equations for the atmospheric flow, and the physics, which approximates subgrid phenomena such as precipitation processes, clouds, long- and short-wave radiation, and turbulent mixing.
Separate data structures and parallelization strategies are used for the dynamics and physics. The dynamics and physics are executed in turn during each model simulation timestep, requiring that some data be rearranged between the two data structures each timestep.

Community Climate System Model. Fig.  Cartoons of two example configurations for a Cray XT with two hex-core processors per compute node. The left configuration is for a simulation using one degree resolution horizontal grids on , processor cores. The right configuration is for a simulation using quarter degree resolution atmosphere/land grids and tenth degree ocean/sea ice grids on , processor cores. The number of cores used is listed per component, as well as the number of MPI tasks and number of OpenMP threads per task (tasks × threads). The time direction indicates the relative amount of time spent in each component, and whether components run sequentially or concurrently with each other. Note that the coupler will sometimes run sequentially with the ocean, for example, when the ocean and atmosphere are communicating, and sometimes concurrently.

Community Climate System Model. Fig.  Example CCSM performance scaling on a Cray XT with two hex-core processors per compute node. The left graph is the throughput rate (simulation years per day) for a simulation using quarter degree resolution atmosphere/land and tenth degree ocean/sea ice horizontal grids. The right graph is the average wallclock seconds per simulation day for CCSM and for the atmosphere, land, ocean, and sea ice component models. For a given total processor core count, cores were assigned to each component so as to minimize total execution time. The number of cores assigned per component did not always increase with the total number of cores.
CAM includes multiple compile-time options for the dynamics, referred to as dynamical cores or dycores. During – the most utilized dycore was a finite-volume flux-form semi-Lagrangian dynamical core (FV) formulated originally by Lin and Rood []. FV uses a tensor-product latitude × longitude × vertical-level computational grid over the sphere. A Lagrangian vertical coordinate is used to define flux volumes, within which the horizontal dynamics evolve. Vertical transport is modeled through evolution of the geopotential along each vertical column. A conservative Lagrangian surface remap is performed each model time step. Other supported dycores include an Eulerian global spectral method (EUL) and a spectral element method (SEM). Like FV, EUL uses a latitude × longitude horizontal computational grid. In contrast, SEM uses a quasi-uniform "cubed sphere" horizontal grid that avoids the clustering of grid points near the poles that is typical of latitude × longitude grids. All options as of June  are described in [].

The parallel implementation of the FV dycore is based on two-dimensional tensor-product "block" decompositions of the computational grid into a set of geographically contiguous subdomains. Each subdomain is assigned to a single process and no more than one block is assigned to any MPI process. A latitude-vertical decomposition is used for the main dynamical algorithms and a latitude-longitude decomposition is used for the Lagrangian surface remapping and (optionally) the geopotential calculation. Halo updates are the primary MPI communications required by computation for a given decomposition. OpenMP is used for additional loop-level parallelism.

CAM physics is based on vertical columns and dependencies occur only in the vertical direction. Thus computations are independent between columns. The parallel implementation of the physics is based on a fine-grain latitude-longitude decomposition. Each subdomain, referred to as a "chunk," is a collection of vertical columns. Multiple chunks can be assigned to a single MPI process, and OpenMP parallelism is applied to the loops over these chunks.

Transitioning from one grid decomposition to another, for example, latitude-vertical to latitude-longitude or dynamics to physics, may require that information be exchanged between processes. If the decompositions are very different, then every process may need to communicate with every other process; if they are similar, each process may need to communicate with only a small number of other processes (or possibly none at all).

The computational cost in the physics is not uniform over the vertical columns, with the cost for an individual column depending on both geographic location and on simulation time. Some physics decompositions are better than others in equidistributing this cost ("balancing the load") across the processes. A number of predefined physics decompositions are provided that attempt to minimize the combined effect of load imbalance and the communication cost of mapping to/from the dynamics decompositions. The optimal choice is a function of the problem specification, computer system, and number of processors used.
vertical decomposition is used for the main dynamical
algorithms and a latitude-longitude decomposition is
used for the Lagrangian surface remapping and (option- Community Land Model
ally) geopotential calculation. Halo updates are the pri- CLM was developed at NCAR. It evolved from the effort
mary MPI communications required by computation to expand the strictly biogeophysical LSM, the NCAR
for a given decomposition. OpenMP is used for addi- Land Surface Model, to include the carbon cycle, veg-
tional loop-level parallelism. etation dynamics, and river routing. The first version,
CAM physics is based on vertical columns and CLM., was released in . Two major releases of the
dependencies occur only in the vertical direction. Thus model have occurred since then, CLM. and CLM..
computations are independent between columns. The The latter releases use MPI for distributed-memory par-
parallel implementation of the physics is based on a allelism and OpenMP for shared-memory parallelism.
fine-grain latitude-longitude decomposition. Each sub- CLM is a single column (snow-soil-vegetation) model
domain, referred to as a “chunk,” is a collection of verti- of the land surface, and in this aspect it is embarassingly
cal columns. Multiple chunks can be assigned to a single parallel.
MPI process, and OpenMP parallelism is applied to the Spatial land surface heterogeneity in CLM is repre-
loops over these chunks. sented as a nested subgrid hierarchy in which grid cells
Transitioning from one grid decomposition to are composed of multiple landunits, snow/soil columns,
another, for example, latitude-vertical to latitude- and plant functional types (PFTs). Each grid cell can
longitude or dynamics to physics, may require that have a different number of landunits, each landunit can
information be exchanged between processes. If the have a different number of columns, and each column
decompositions are very different, then every process can have multiple PFTs. The landunits represent the
may need to exchange data with every other process. If broadest spatial patterns of subgrid heterogeneity and
Community Climate System Model C 

16
SEM (777,602)
FV (768  1,152)
EUL (512  1,024)
8

Simulation years per day 4


C

0.5

0.25
512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 13,1072
Processor cores

Community Climate System Model. Fig.  CAM throughput as a function of processor core count on an IBM BG/P with
one quad-core processor per compute node for example simulations using the EUL, FV, and SEM dycores, respectively.
Horizontal grid resolution used with each dycore is reported as either total number of grid points or as the dimensions of
a two-dimensional grid. When using the FV dycore, the simulation used the same quarter degree grid as that used for
Fig. . Grid resolutions when using the EUL and SEM dycores were chosen so that all simulations achieved comparable
numerical accuracy. SEM data courtesy of M. Taylor of Sandia National Laboratories

currently are glacier, lake, wetland, urban, and vege- to clumps in a segmented round robin approach with
tated. Columns represent potential variability in the soil the expectation that each clump will contain grid cells
and snow within a single landunit. Finally, PFTs rep- from a variety of geographic locations. Clumps are then
resent differences in functional characteristics between assigned in a round-robin fashion to the processes. This
categories of plants. Currently, up to  possible PFTs decomposition strategy provides performance portabil-
that differ in physiology and structure may coexist on ity across a wide range of computer architectures.
a single column. All fluxes to and from the surface are
defined at the PFT level.
Grid cells are grouped into blocks (called “clumps”) Parallel Ocean Program
of nearly equal computational cost, and these clumps POP is a descendant of the Bryan-Cox-Semtner class of
are subsequently assigned to MPI processes. When run models []. It was developed at Los Alamos National
serially or with MPI-only parallelism, each process has Laboratory (LANL) in , designed specifically to
only one clump. When OpenMP is enabled, the number take advantage of high-performance computer archi-
of clumps per process is set to the maximum number of tectures. POP was written initially in a data paral-
OpenMP threads available. lel formulation using CM Fortran and targeting the
The computational cost of a grid cell is approxi- Thinking Machines CM- and CM- systems, but
mately proportional to the number of plant functional has since moved to a traditional message-passing
types (PFTs) contained within it so some gridcells implementation.
will be higher cost than others. Since similar PFTs POP approximates the three-dimensional primitive
tend to cluster geographically, balancing the workload equations for fluid motions on a generalized orthogo-
across MPI processes requires that geographically dis- nal computational grid on the sphere. Each timestep of
tinct gridcells be assigned to each MPI process to the model is split into two phases. A three-dimensional
improve load balance. In CLM, grid cells are distributed “baroclinic” phase uses an explicit time integration
 C Community Climate System Model

method. A “barotropic” phase includes an implicit solu- smaller blocks and generally more complex communi-
tion of the two-dimensional surface pressure using a cation patterns, can incur higher communication cost
preconditioned conjugate gradient solver. in halo updates. For any given POP resolution, the opti-
The parallel implementation is based on a two- mal block size and decomposition strategy depends on
dimensional tensor-product “block” decomposition of the computer architecture and the processor count.
the horizontal dimensions of the three-dimensional
computational grid. Blocks are then distributed to MPI Community Ice Code
processes. The vertical dimension is not decomposed. The CICE sea ice model was developed at LANL in
The amount of work associated with a block is propor- the mid-s, with version  released to the public
tional to the number of grid cells located in the ocean. in . It was designed to be integrated into a cou-
Grid cells located over land are “masked” and elimi- pled climate model, and, in particular, to be compatible
nated from the computational loops. OpenMP paral- with POP. Version  was selected to be the sea ice com-
lelism is applied to loops over blocks assigned to an MPI ponent in CCSM, replacing the Community Sea Ice
process. The number of MPI processes and OpenMP Model (CSIM) that was used in CCSM. However CSIM
threads to be used are specified at compile-time. This is closely related to CICE and much of the following
information is used to generate enough blocks so that discussion applies to CSIM as well.
all computational threads are assigned work. The CICE sea ice model is formulated on a two-
The parallel implementation of the baroclinic phase dimensional horizontal grid representing the earth’s
requires only limited nearest-neighbor MPI commu- surface. An orthogonal vertical dimension exists to
nication (for halo upates) and performance is domi- represent the sea ice thickness. Similar to POP, the
nated primarily by computation. The barotropic phase parallel implementation decomposes the horizontal
requires both halo updates and global sums (imple- dimensions into equal-sized two-dimensional blocks
mented with local sums and a call to MPI_Allreduce for that are then assigned to processes. The blocks can be
a small number of scalars) for each iteration of the con- distributed using relatively arbitrary algorithms. The
jugate gradient algorithm. The solution of the implicit vertical dimension is not decomposed.
system can require hundreds of iterations, and parallel CICE supports both distributed-memory and shared-
performance of the barotropic phase is dominated by memory parallelism over the same dimension, namely
the communication cost of the halo updates and global grid blocks. The primary interprocess communication
sum operations. operation is a halo update. No global sums are required
Two different approaches to domain decomposition in the prognostic calculation, but global sums are
are supported currently: “cartesian” and “spacecurve.” used when calculating diagnostics. Currently the CICE
The cartesian option decomposes the grid onto a two- decomposition is static and set at initialization. The per-
dimensional virtual processor grid, and then further formance of the CICE model is highly dependent on
subdivides the local subgrids into blocks to provide both the block size and the decomposition.
work for OpenMP threads. This generates an efficient The relative cost of computing on the sea ice grid
decomposition for halo communication. The cartesian varies significantly both spatially and temporally over
option tends to be most efficient when using one block a climate simulation because the sea ice distribution
per thread. The spacecurve option begins by eliminat- is changing constantly. It varies most dramatically on
ing blocks having only “land” grid cells. A space-filling seasonal timescales but can also vary on interannual
curve ordering of the remaining blocks is then cal- timescales. This has a huge impact on the load bal-
culated, and an equipartition of this one-dimensional ance of the sea ice model in a statically decomposed
ordering of the blocks is used to assign blocks to model. The load balance will be generally optimized
processes. The land block elimination step tends to if grid cells from varied geographical locations (high
improve load balance compared to the cartesian option. latitude / tropics, northern / southern hemisphere)
This is most effective if the blocks are small, which are assigned to each process. This is similar to the
can lead to assigning more than one block per thread. CLM and CAM physics decomposition load balance
However, the spacecurve distribution, with potentially issues.
Community Climate System Model C 

In addition, CICE performs regular and frequent largely on the Model Coupling Toolkit (MCT) Library
halo updates with a resultant performance cost that developed at Argonne National Laboratory. The coupler
depends on the amount of data communicated and the released as part of CCSM in  underwent signfi-
number of messages sent. Halo updates are written such cant architectural revisions, but still relied heavily on
that all data communicated between two processes are the MCT library to support decompositions and parallel C
aggregated into a single message, thereby minimizing operations.
the total number of messages needed to carry out a The coupler receives grid information in parallel at
halo update for more unusual decompositions. CICE runtime from all of the model components. Domain
halo communication can be minimized by assigning decompositions are determined on the fly based upon
neighbor blocks to the same process, minimizing edge the model resolutions, the component model decom-
lengths, and minimizing the number of neighbors and, positions, and the processors used by the coupler. The
hence, messages communicated. coupler treats the grids as one-dimensional arrays of
As in POP, the load balance and halo cost compete grid points so relatively arbitrary decompositions can
for optimal static load balance. At lower processor be implemented to optimize performance. The coupler
counts, load balance is more important. As proces- does not require information about grid point connec-
sor count is increased, halo cost becomes increasingly tivity within each domain. Work is either completely
important. As a compromise, strips of neighboring local or data is referenced by global grid indices for non
gridcells that span relatively large swaths of latitude local operations, as specified by input files.
tend to be grouped into blocks using a simple two- Both rearrangement and mapping require inter-
dimensional decomposition. At higher processor count, process communication. Rearrangement is handled
this tends to result in relatively long, skinny blocks by MCT “routers” that define the required message
on the physical grid and relatively high cell aspect exchanges. Routers are created at model initialization
ratios for communication. The optimal decomposition and reused throughout the model run. All required
for any hardware and resolution depends strongly on data communications between a pair of processes are
the processor count. Weighted space-filling curves and aggregated into a single message exchange, and data are
other decompositions have been implemented in CICE rearranged using a minimal number of messages.
and are being explored as a means to improve perfor- Mapping of fields between two grids is done in
mance at higher resolution. In addition, other tech- parallel using the MCT rearrangement logic described
niques such as dynamic load balancing and improved above. The required mapping weights are precomputed,
overlap of work and interprocess communication are read in during initialization, and decomposed onto the
being explored to improve performance and scalability. processes based upon two possible mapping imple-
mentations. The first implementation assigns weights
Coupler defining the mapping from the source grid to the des-
The CCSM coupler is responsible for several actions tination grid using the decomposition of the source
including rearranging data between different process grid. Partial sums computed on the source grid decom-
sets, interpolating (mapping) data between different position are then rearranged to the destination grid
grids, merging data from different components, flux decomposition, where final sums are computed. The
calculations, and diagnostics. Many of the algorithms second implementation rearranges the data from the
are trivially parallel and require no communication source grid onto the destination grid decomposition.
between grid cells. The mapping of the source grid data to the destina-
The first CCSM couplers of the late s were tion grid is then computed using the destination grid
run on a single processor and communication between decomposition. The number of floating point opera-
components was implemented using PVM, the Paral- tions are nearly identical in both cases. The difference
lel Virtual Machine. Eventually, PVM was replaced by is whether data are mapped primarily using the source
MPI and the coupler was parallelized. The first cou- or destination grid decomposition. The most efficient
pler parallelized for MPI distributed-memory paral- mapping method depends mostly on the relative sizes
lelism was released with CCSM in  and was based of the source and destination grids.
 C Community Climate System Model

Parallel I/O and aerosol dynamics, and ocean ecosystems and bio-
An efficient parallel I/O subsystem is a critical compo- geochemical coupling, all necessary for an earth system
nent of a parallel application code. Limiting external model, as distinct from a purely physical model like
storage accesses to a single master process creates a CCSM. While enabling the new CESM model options
serial bottleneck, degrading parallel performance and will change the performance characteristics of a simula-
scalability of the application as a whole, and/or exhaust- tion, increasing the computational cost and amount of
ing local memory. Allowing all processes to access I/O primarily, the prior discussion of the parallel algo-
the external storage, especially access to the same file, rithms in the component models and in the coupled
can lead to failure or very poor performance when model apply to the CESM virtually unchanged.
thousands (or hundreds of thousands) of processes are
involved. To address this need for CCSM, a new parallel Related Entries
I/O library called PIO has been developed and included Collective Communication
in CCSM. Computational Sciences
PIO was initially designed to allow better memory Distributed-Memory Multiprocessor
management for very high resolution simulations, by Fortran  and Its Successors
relaxing the requirement for retaining the memory cor- I/O
responding to the global two-dimensional horizontal Load Balancing, Distributed Memory
resolution on the master I/O task. Since then, PIO has Loop Nest Parallelization
developed into a general purpose parallel I/O library MPI (Message Passing Interface)
that serves as a software interface layer designed to NetCDF I/O Library, Parallel
encapsulate the complexities of parallel I/O and to make OpenMP
it easier to replace the lower level software backend. PIO Space-Filling Curves
has been implemented throughout the entire CCSM Shared-Memory Multiprocessors
system and currently supports serial I/O using netCDF
and parallel I/O using pnetCDF. Bibiographic Notes and Further
PIO calls are collective. An MPI communicator is set Reading
in a call to the PIO initialization routine and all tasks Washington and Parkinson [] is a comprehensive
associated with that communicator must participate in introduction to climate modeling. The  DOE report
all subsequent calls to PIO. One of the key features of “A Science-Based Case for Large-Scale Simulation” []
PIO is that it takes the model’s decomposition and redis- includes estimates of the computational requirements
tributes it to an “I/O-friendly” decomposition on the for continued progress in computational climate sci-
requested number of I/O tasks. In using the PIO library, ence, arguing for the necessity of parallel simulation
the user must specify the number of I/O tasks to be used, models targeting the highest performing computing
the stride or number of tasks between I/O tasks and platforms. A more recent discussion of the need for
whether the I/O will use the serial netCDF or pnetCDF high performance computing in computational climate
library. By increasing the number of I/O tasks, the user science is available in [].
can easily reduce the serial I/O memory bottleneck even Detailed information on CCSM and CESM is avail-
with the use of serial netCDF. able at the URLs
http://www.cesm.ucar.edu/models/ccsm./ and
Community Earth System Model
Version  of the CESM, the Community Earth Sys- http://www.cesm.ucar.edu/models/cesm./
tem Model, was released in June, . The CESM is
a superset of CCSM in that it can be configured to respectively. For descriptions of the climate science
run the same science scenarios as CCSM. However, capabilities of CCSM, see Collins et al. []. Papers
the CESM also contains options for a terrestrial carbon documenting CCSM and CESM are in preparation,
cycle and dynamic vegetation, atmospheric chemistry and will be cited at the above URLs when available.
Community Earth System Model (CESM) C 

The parallel implementation and performance char- . Collins WD, Bitz CM, Blackmon ML, Bonan GB, Bretherton CS,
acteristics of CCSM and of the component models have Carton JA, Chang P, Doney SC, Hack JH, Henderson TB, Kiehl
JT, Large WG, McKenna DS, Santer BD, Smith RD () The
been documented in numerous journal and proceed-
community climate system model version  (CCSM). J Clim
ings articles. Early work was reported in the proceed-
:–
ings of a series of conferences on the use of parallel . Drake J, Jones P, Vertenstein M, White J III, Worley P () C
processing in meterology that were held at the Euro- Software design for petascale climate science. In: Bader D (ed)
pean Center for Medium Range Weather Forecasting Petascale computing: algorithms and applications. Chapman &
beginning in the mid-s, in particular see [, ]. An Hall/CRC Press, New York, pp –, chap 
. Drake JB, Foster IT () Special issue on parallel computing in
overview of many elements of this early work is available
climate and weather modeling. Parallel Comput :–
in a  special issue of Parallel Computing on parallel . Drake JB, Jones PW, Carr G () Special issue on climate
computing in climate and weather modeling []. modeling. Int J High Perform Comput Appl :–
A description of more current aspects of the soft- . Hoffman G-R, Kauranne T (eds) () Parallel supercomput-
ware engineering of CCSM component models is con- ing in atmospheric science. In: Proceedings of the fifth ECMWF
workshop on use of parallel processors in meteorology. World
tained in a  special issue of the International Journal
Scientific, Singapore
of High Performance Computing Applications on climate . Hoffman G-R, Kreitz N (eds) () Making its mark – the use of
modeling []. A recent overview of the software design parallel processors in meteorology. In: Proceedings of the seventh
of CCSM as a whole appeared in []. ECMWF workshop on use of parallel processors in meteorology.
World Scientific, Singapore
. Lin S-J () A ‘vertically lagrangian’ finite-volume dynamical
Acknowledgments core for global models. Mon Weather Rev :–
This work has been coauthored by a contractor of . Neale RB, Chen C-C et al () Description of the NCAR
the US government under contract No. DE-AC- Community Atmosphere Model (CAM .), NCAR Tech Note
OR, and was partially sponsored by the Cli- NCAR/TN-???+STR, June . National Center for Atmospheric
mate and Environmental Sciences Division of the Office Research, Boulder
. Office of Science US, Department of Energy (). A science-
of Biological and Environmental Research and by the
based case for large-scale simulation. http://www.pnl.gov/scales/.
Office of Advanced Scientific Computing Research, Accessed  July 
both in the Office of Science, US Department of Energy, . Washington W, Bader D, Collins W, Drake J, Taylor M, Kirt-
under Contract No. DE-AC-OR with UT- man B, Williams D, Middleton D () Scientific grand
Battelle, LLC. Accordingly, the US government retains challenges: challenges in climate change science and the role
of computing at the extreme scale, Tech. Rep. PNNL-.
a nonexclusive, royalty free license to publish or repro-
Pacific Northwest National Laboratory. http://www.science.doe.
duce the published form of this contribution, or allow gov/ascr/ProgramDocuments/Docs/ClimateReport.pdf
others to do so, for US government purposes. This . Washington W, Parkinson C () An introduction to three-
work used resources of the Oak Ridge Leadership dimensional climate modeling, nd edn. University Science
Computing Facility, located in the National Center for Books, Sausalito
Computational Sciences at Oak Ridge National Labo-
ratory, which is supported by the Office of Science of
the Department of Energy under Contract DE-AC-
Community Climate System
OR, and of the Argonne Leadership Comput-
ing Facility at Argonne National Laboratory, which is
Model (CCSM)
supported by the Office of Science of the US Depart- Community Climate System Model
ment of Energy under contract DE-AC-CH.

Bibliography
. Semtner JAJ () Finite-difference formulation of a world
Community Earth System Model
ocean model. In: O’Brien JJ (ed) Advanced physical oceano- (CESM)
graphic numerical modeling. NATO ASI Series. Reidel, Norwell,
pp – Community Climate System Model
 C Community Ice Code (CICE)

Community Ice Code (CICE) Complex Event Processing


Community Climate System Model Stream Programming Languages

Community Land Model (CLM) Computational Biology

Community Climate System Model Bioinformatics

Compiler Optimizations for Array Computational Chemistry


Languages NWChem

Array Languages, Compiler Techniques for

Computational Models
Compilers Models of Computation, Theoretical

Array Languages, Compiler Techniques for


Banerjee’s Dependence Test
Code Generation Computational Sciences
Dependence Abstractions
Dependences Geoffrey Fox
Loop Nest Parallelization Indiana University, Bloomington, IN, USA
Modulo Scheduling and Loop Pipelining
Omega Test Synonyms
Polyhedron Model Applications and parallelism; Problem architectures
Parallelization, Automatic
Parallelization, Basic Block
Definition
Parallelism Detection in Nested Loops, Optimal
Here it is asked which applications should run in paral-
Parafrase
lel and correspondingly which areas of computational
Polaris
science will benefit from parallelism. In studying this
R-Stream Compiler
it will be discovered which applications benefit from
Speculative Parallelization of Loops
particular hardware and software choices. A driving
Tiling
principle is that in parallel programming, one must map
Trace Scheduling
problems into software and then into hardware. The
Unimodular Transformations
architecture differences in source and target of these
maps will affect the efficiency and ease of parallelism.

Discussion
Complete Exchange Introduction
I have an application – can and should it be imple-
All-to-All mented on a parallel architecture and if so, how should
Computational Sciences C 

this be done and what are appropriate target hardware find languages that preserve key information needed for
architectures, what is known about clever algorithms parallelism while hardware designers design computers
and what are recommended software technologies? Fox that can work around this loss of information. For an
introduced in [] a general approach to this question example, use of arrays in many data parallel languages
by considering problems and the computer infrastruc- from APL, HPF, to Sawzall can be viewed as a way to C
ture on which they are executed as complex systems. preserve spatial structure of problems when expressed
Namely each is a collection of entities and connec- in these languages. In this article, these issues will not
tions between them governed by some laws. The entities be discussed in depth but rather it will be discussed
can be illustrated by mesh points, particles, and data what is possible with “knowledgeable users” mapping
points for problems; cores, networks, and storage loca- problems to computers or particular programming
tions for hardware; objects, instructions, and messages paradigms.
for software. The processes of deriving numerical mod-
els, generating the software to simulate model, com-
Simple Example
piling the software, generating the machine code, and
The simple case of a problem whose complex system
finally executing the program on particular hardware
spatial structure is represented as a D mesh is consid-
can be considered as maps between different complex
ered. This comes in material science when one considers
systems. Many performance models and analyses have
local forces between a regular array of particles or in the
been developed and these describe the quality of map.
finite difference approach to solving Laplace or Poisson’s
It is known that maps are essentially never perfect and
equation in two dimensions. There are many impor-
describing principles for quantifying this is a goal of
tant subtleties such as adaptive meshes and hierarchical
this entry. At a high level, it is understood that the
multigrid methods but in the simplest formulation such
architecture of problem and hardware/software must
problems are set up as a regular grid of field values
match; given this we have quantitative conditions that
where the basic iterative update links nearest neighbors
the performance of the parts of the hardware must be
in two dimensions.
consistent with the problem. For example, if two mesh
If the points are labeled by an index pair (i, j),
points in problem are strongly connected, then band-
then Jacobi’s method (not state of the art but cho-
width between components of hardware to which they
sen as simplicity allows a clear discussion) can be
are mapped must be high. In this discussion, the issues
written
of parallelism are being described and here there are two
particularly interesting general results. Firstly a space
(the domain of entities) and a time associated with a ϕ(i, j) is replaced by (ϕLeft +ϕRight +ϕUp +ϕDown )/ ()
complex system are usually defined. Time is nature’s
time for the complex system that describes time depen- where ϕLeft = ϕ(i, j − ) and similarly for ϕRight , ϕUp ,
dent simulations. However, for linear algebra, time for and ϕDown .
that complex system is an iteration count. Note that for Such problems would usually be implemented on a
the simplest sequential computer hardware there is no parallel architecture by a technique that is often called
space and just a time defined by the control flow. Thus “domain decomposition” or “data parallelism” although
in executing problems on computers one is typically these terms are not very precise. Parallelism is natu-
mapping all or part of the space of the problem onto rally found for such problems by dividing the domain
time for the computer and parallel computing corre- up into parts and assigning each part to a different pro-
sponding case where both problem and computer have cessors as seen in Fig. . Here the problem represented
well defined spatial extent. Mapping is usually never : as a  ×  mesh is solved on a  ×  mesh of processors.
and reversible, and “information is lost” as one maps For problems coming from nature this geometric view
one system into another. In particular, one fundamen- is intuitive as say in a weather simulation, the atmo-
tal reason why automatic parallelism can be hard is sphere over California evolves independently from that
that the mapping of problem into software has thrown over Indiana and so can be simulated on separate pro-
away key information about the space-time structure of cessors. This is only true for short time extrapolations –
original problem. Language designers in this field try to eventually information flows between these sites and
 C Computational Sciences

a few messages as latency can be an important part of


communication overhead.
Note that this type of data decomposition implies
the so-called “owner’s-compute” rule. Here each data
point is imagined as being owned by the processor to
which the decomposition assigns it. The owner of a
given data-point is then responsible for performing the
computation that “updates” its corresponding data val-
ues. This produces a common scenario where parallel
program consists of a loop over iterations divided into
compute-communicate phases:
● Communicate: At the start of each iteration, first
communicate any outside data values needed to
update the data values at points owned by this
processor.
● Compute: Perform update of data values with each
processor operating without need to further syn-
chronize with other machines.
Computational Sciences. Fig.  Communication structure
for D complex system example. The dots are the  This general structure is preserved even in many com-
points in the problem. Shown by dashed lines is the plex physical simulations with fixed albeit irregular
division into  processors. The circled points are the halo decompositions. Dynamic decompositions introduce a
or ghost grid points communicated to processor they further step where data values are migrated between
surround processors to ensure load balance but this is usually
still followed by similar communicate-compute phases.
The communication phase naturally synchronizes the
their dynamics are mixed. Of course it is the communi- operation of the parallel processors and provides an effi-
cation of data between the processors (either directly in cient barrier point which naturally scales. The above
a distributed memory or implicitly in a shared memory) discussion uses a terminology natural for distributed
that implements this eventual mixing. memory hardware or message passing programming
Such block data decompositions typically lead to a models. With a shared memory model like OpenMP,
SPMD (Single Program Multiple Data) structure with communication would be implicit and the “communi-
each processor executing the same code but on differ- cation phase” above would be implemented as a barrier
ent data points and with differing boundary conditions. synchronization.
In this type of problem, processors at the edge of the
( × ) mesh do not see quite the same communica- Performance Model
tion and compute complexity as the “general case” of an Our current Poisson equation example can be used to
inside processor shown in Fig. . For the local nearest illustrate some simple techniques that allow estimates
neighbor structure of Eq. , one needs to communicate of the performance of many parallel programs.
the ring of halo points shown in figure. As compu- As shown in Fig. , the node of a parallel machine is
tation grows like the number of points (grain size) n characterized by a parameter tfloat , which is time taken
in each processor and communication like the num- for a single floating point operation. tfloat is of course

ber on edge (proportional to n), the time “wasted” not very well defined as depends on the effectiveness
communicating decreases as a fraction of the total as the of cache, possible use of fused multiply-add and other
grain size n increases. Further one can usually “block” issues. This implies that this measure will have some
the communication to transmit all the needed points in application dependence reflecting the goodness of the
Computational Sciences C 

Memory n Memory n which defines efficiency ε and overhead f . Note that it


Node A Node B is preferred to discuss overhead rather than speed-up
tcomm or efficiency as one typically gets simpler models for f
CPU tfloat CPU tfloat as the effects of parallelism are additive to f but for
example occur in the denominator of Eq.  for speedup C
Computational Sciences. Fig.  Parameters determining and efficiency. The communication part fcomm of the
performance of loosely synchronous problems overhead f is given by combining Eqs.  and  as

fcomm = tcomm /( n ⋅ tfloat ) ()
match of the problem to the node architecture. We let
n be the grain size – the number of data points owned Note that in many instances, fcomm can be thought of
by a typical processor. Communication performance – as simply the ratio of parallel communication to par-
whether through a shared or distributed memory archi- allel computation. This equation can be generalized to
tecture – can be parameterized as essentially all problems we will later term loosely syn-
chronous. Then in each coupled communicate-compute
Time to communicate Ncomm words
phases of such problems, one finds that the overhead
= tlatency + Ncomm ⋅ tcomm () takes the form:
This equation ignores issues like bus or switch con-
tention but is a reasonable model in many cases. Laten- fcomm = constant ⋅ tcomm /(n/d ⋅ tfloat ) ()
cies tlatency can be around  μs on high performance
Here d is an appropriate (complexity or information)
systems but is measured in milliseconds in a geograph-
dimension, which is equal to the geometric dimension
ically distributed grid. tcomm is time to communicate a
for partial differential based equations or other geomet-
single word and for large enough messages, the latency
rically local algorithms such as particle dynamics. A
term can be ignored which will be done in the following.
particularly important case in practice is the D value
Parallel performance is dependent on load balanc-
d =  when n−/d is just surface/volume in three dimen-
ing and communication and both can be discussed but
sions. However Eq.  describes many non geometrically
here it is focused on communication with problem of
local problems with for example the value d =  for
Fig.  generalized to Nproc processors arranged in an
√ √ the best decompositions for full matrix linear algebra
Nproc by Nproc grid with a total of N mesh points
and d =  for long range interaction problems. The Fast
and the grain size n = N/Nproc . Let T(Nproc ) be the exe-
Fourier Transform FFT finds n/d in Eq.  replaced by
cution time on Nproc processors and two contributions
ln(n) corresponding to d = ∞.
are found to this ignoring small load imbalances from
From Eq. , it can be found that S(Nproc ) increases
edge processors. There is a calculation time expressed as
linearly with Nproc as long as Nproc is increased with
n ⋅ tcalc with tcalc = tfloat as the time to execute the basic
fixed fcomm which implies fixed grain size n, while tcomm
update Eq. . In addition the parallel program has com-
and tfloat are naturally fixed. This is scaled speedup
munication overhead, which adds to T(Nproc ) a term
√ where the problem size N = n ⋅ Nproc also increases lin-
 n ⋅ t comm . Now the speed up formula is found:
early with Nproc . The continuing success of parallel com-
S(Nproc ) = T()/T(Nproc ) puting even on very large machines can be considered

= Nproc /( + tcomm /( n ⋅ tfloat )) () as a consequence of equations like Eq.  as the formula
for fcomm only depends on local node parameters and
It is noted that this analysis ignores the possibility avail-
not on the number of processors. Thus as we scale up
able on some computers of overlapping communication
the number of processors keeping the node hardware
and computation which is straightforwardly included.
and application grain size n fixed, we will get scaling
The above formalism can be generalized most conve-
performance – speedup proportional to Nproc . Note this
niently using the notation that
implies that total problem size increases proportional to
S(Nproc ) = ε ⋅ Nproc = Nproc /( + f ), () Nproc – the defining characteristic of scaled speedup.
 C Computational Sciences

Complex Applications Are Better for The communication overhead is found to decrease
Parallelism systematically as shown in Fig.  as the range of the force
The simple problem described above is perhaps the one increases. The general D result is:
where the parallel issues are most obvious; however it

is not the one where good parallel performance is eas- fcomm ∝ tcomm /(l ⋅ n ⋅ tfloat ) ()
iest to obtain as the small computational complexity of
the update Eq.  makes the communication overhead This is valid for l which is large compared to  but
relatively more important. There is a fortunate general smaller than the length scale corresponding to region
rule that as one increases the complexity of a problem, stored in each processor. In the interesting limit of an
the computation needed grows faster than the com- infinite range (l → ∞) force, the analysis needs to be
munication overhead and this will be illustrated below. redone and one finds the result that is independent of
Jacobi iteration does have perhaps the smallest com- the geometric dimension
munication for problems of this class. However it has
one of largest ratios of communication to computation
fcomm ∝ tcomm /(n ⋅ tfloat ) ()
and correspondingly high parallel overhead. Note one
sees the same effect on a hierarchical (cache) memory
which is of the general form of Eq.  with complexity
machine, where problems such as Jacobi Iteration for
dimension d = . This is the best-understood case where
simple equations can perform poorly as the number of
the geometric and complexity dimensions are differ-
operations performed on each word fetched into cache
ent. The overhead formula of Eq.  corresponds to the
is proportional to number of links per entity and this is
computationally intense O(N  ) algorithms for evolv-
small (four in the -D mesh considered above) for this
ing N-body problems. The amount of computation is so
problem class.
large that the ratio of communication to computation is
As an illustration of the effect varying computa-
extremely small.
tional complexity, it can be seen in Fig.  how the above
analysis is altered as one changes the update formula
of Eq. . The size of the stencil parameterized can now Application Architectures
be systematically increased by an integer l and it can The analysis above can be applied to many SPMD prob-
be found how fcomm changes. In the case where points lems and addresses the matching of “spatial” struc-
are particles the value of l corresponds to the range of ture of applications and computers. This drives needed
their mutual force and in the case of discretization of linkage of individual computers in a parallel system
partial differential equations l measures the order of the in terms of topology and performance of network.
approximation. However this only works if we can match the tempo-
ral structure and this aspect is more qualitative and
perhaps controversial. The simplest ideas here under-
lied the early SIMD (Single Instruction Multiple Data)
machines that were popular some  years ago. These
are suitable for problems where each point of the com-
plex system evolves with the same rule (mapping into
machine instruction) at each time. There are many
such problems including for example the Laplace solver
discussed above. However many related problems do
fcomm ∝ 1/√n
not fit this structure – called synchronous in [] –
1/(2√n) 1/(3√n) 1/(4√n) tcomm/tfloat
with the simplest reason being heterogeneity in sys-
Computational Sciences. Fig.  Communication structure tem requiring different computational approaches at
as a function of stencil size. The figure shows  stencils different points. A huge number of scientific problems
with from left to right, range l = , , ,  fit a more general classification – loosely synchronous.
Computational Sciences C 

Here SPMD applications can be seen which have the which does not need high performance communication
compute-communication stages described above but between different nodes. Parameter searches and many
now the compute phases are different on different pro- data analysis applications of independent observations
cessors. One uses load balancing methods to ensure fall into this class.
that the computational work on each node is balanced From the start, we have seen a fifth class – C
but not on each machine instruction but rather in a termed metaproblems – which refer to the coarse
coarse grain fashion at every iteration or time-step – grain linkage of different “atomic” problems. Here
whatever defines the temporal evolution of the com- synchronous, loosely synchronous, asynchronous and
plete system. Loosely synchronous problems fit natu- pleasingly parallel are the atomic classes. Metaprob-
rally MIMD machines with the communication stages lems are very common and expected to grow in
at macroscopic “time-steps” of the application. This importance. One often uses a two level program-
communication ensures the overall correct synchro- ming model in this case with the metaproblem link-
nization of the parallel application. Thus overhead for- age specified by workflow and the component prob-
mulae like Eqs.  and  describe both communication lems with traditional parallel languages and runtimes.
and synchronization overhead. As this overhead only Grids or Clouds are suitable for metaproblems as coarse
depends on local parameters of the application, it is grain decomposition does not usually require stringent
understood why loosely synchronously can get good performance.
scalable performance on the largest supercomputers. These five categories are summarized in Table 
Such applications need no expensive global synchro- which also introduces a new category MapReduce++
nization steps. Essentially all linear algebra, particle which has recently grown in importance to described
dynamics and differential equation solvers fall in the data analysis. Nearly all the early work on paral-
loosely synchronous class. Note synchronous problems lel computing focused on simulation as opposed to
are still around but they are run on MIMD (Multiple data analysis (or what some call data intensive appli-
Instruction Multiple Data) machines with the SPMD cations). Data analysis has exploded in importance
model. recently [] correspondingly to growth in number of
A third class of problems – termed asynchronous – instruments, sensors and human (the web) sources
consists of asynchronously interacting objects and is of data.
often people’s view of a typical parallel problem. It prob-
ably does describe the concurrent threads in a modern
operating system and some important applications such Summary
as event driven simulations and areas like search in Problems are set up as computational or numerical
computer games and graph algorithms. Shared mem- systems and these can be considered as a “space” of
ory is natural for asynchronous problems due to low linked entities evolving in time. The spatial structure
latency often needed to perform dynamic synchroniza- (which is critical for performance) and the temporal
tion. It wasn’t clear in the past but now it appears this structure which is critical to understand the class of soft-
category is not very common in large scale parallel ware and computer needed were discussed. These were
problems of importance. The surprise of some at the termed “basic complex systems” and characterized by
practical success of parallel computing can perhaps be their possibly dynamic spatial (geometric) and tempo-
understood from people thinking about asynchronous ral structure. The difference between the structure of
problems whereas its loosely synchronous and pleas- the original problem and that of computational system
ingly parallel problems that dominate. The latter class is derived from it have been noted. Much of the past expe-
the simplest algorithmically with disconnected parallel rience can be summarized in parallelizing applications
components. However the importance of this category by the conclusion:
has probably grown since the original  analysis [] Synchronous and Loosely Synchronous problems
when it was estimated as % of all parallel comput- perform well on large parallel machines as long as the
ing. Both Grids and clouds are very natural for this class problem is large enough. For a given machine, there is
 C Computational Sciences

Computational Sciences. Table  Application classification

# Class Description Machine Architecture


 Synchronous The problem class can be implemented with SIMD
instruction level Lockstep Operation as in SIMD
architectures
 Loosely Synchronous (or BSP These problems exhibit iterative MIMD on MPP (Massively
Bulk Synchronous Processing) Compute-Communication stages with Parallel Processor)
independent compute (map) operations for each
CPU that are synchronized with a communication
step. This problem class covers many successful
MPI applications including partial differential
equation solution and particle dynamics
applications.
 Asynchronous Illustrated by Compute Chess and Integer Shared Memory
Programming; Combinatorial Search often
supported by dynamic threads. This is rarely
important in scientific computing but at heart of
operating systems and concurrency in consumer
applications such as Microsoft Word.
 Pleasingly Parallel Each component is independent. In , Fox Grids moving to Clouds
estimated this at % of the total number of
applications [] but that percentage has grown
with the use of Grids and data analysis applications
including for example the Large Hadron Collider
analysis for particle physics.
 Metaproblems These are coarse grain (asynchronous or dataflow) Grids of Clusters
combinations of classes (–) and (). This area has
also grown in importance and is well supported by
Grids and described by workflow of Section ..
 MapReduce++ It describes file(database) to file(database) Data-intensive Clouds
operations which has three subcategories given
(a) Master-Worker or
below and in Table .
MapReduce
(a) Pleasingly Parallel Map Only – similar to (b) MapReduce
category 
(c) Twister
(b) Map followed by reductions
(c) Iterative “Map followed by reductions” –
Extension of Current Technologies that supports
much linear algebra and data mining
The MapReduce++ category has three subdivisions (a) “map only” applications similar to pleasingly parallel category; (b) The classic
MapReduce with file to file operations consisting of parallel maps followed by parallel reduce operations; (c) captures the extended
MapReduce introduced in [–]. Note this category has the same complex system structure as loosely synchronous or pleasingly par-
allel problems but is distinguished by the reading and writing of data. This comparison is made clearer in Table . Note nearly all early
work on parallel computing discussed computing with data on memory. MapReduce and languages like Sawzall [] and Pig-Latin []
emphasize the parallel processing of data on disks – a field that until recently was only covered by database community

a typical sub-domain size (i.e. the grain size or size of size and total size proportional to Nproc . This conclusion
that part of the problem stored on each node) above has been enriched by study of grids and clouds with
which one can expect to get good performance. There an emphasize on pleasingly parallel and MapReduce++
will be a roughly constant ratio of parallel speedup to style problems often with a data intensive focus. These
Nproc if one scales the problem with fixed sub-domain also parallelize well.
Computational Sciences C 

Computational Sciences. Table  Comparison of MapReduce++ subcategories and Loosely Synchronous category

Map-only Classic MapReduce Iterative MapReduce Loosely Synchronous


input
input j
input map()
map() B
C

i A Pi
reduce() reduce()
map()

output output output


● Document conversion ● High Energy Physics ● Expectation maximization ● Many MPI scientific
(e.g. PDF->HTML) (HEP) Histograms algorithms applications utilizing
● Brute force searches in ● Distributed search ● Linear Algebra wide variety of
cryptography ● Distributed sort ● Datamining including communication
● Parametric sweeps ● Information retrieval ● Clustering constructs including
● Gene assembly ● Calculation of Pairwise ● K-means local interactions
● Much data analysis of Distances for ● Multidimensional Scaling ● Solving differential
independent samples sequences (BLAST) (MDS) equations and
● Particle dynamics with
short range forces
◂ Domain of MapReduce and Iterative Extensions ▸ MPI

Bibliographic Notes and Further . Ekanayake J, Li H, Zhang B, Gunarathne T, Bae S, Qiu J,


Reading Fox G () Twister: a runtime for iterative MapReduce. In:
The approach followed here was developed in [, ] Proceedings of the first international workshop on MapReduce
and its applications of ACM HPDC  conference. ACM,
with further details in [, ]. The extension to
Chicago, – Jun . http://grids.ucs.indiana.edu/ptliupages/
include data intensive applications was given in [, ]. publications/hpdc-camera-ready-submission.pdf
There are many good discussions of speedup including . Yingyi B, Howe B, Balazinska M, Ernst MD () HaLoop: effi-
Gustafson’s seminal work [] and the lack of it – cient iterative data processing on large clusters. In: The th
Amdahl’s law []. The recent spate of papers on MapRe- international conference on very large data bases, VLDB Endow-
ment, vol , Singapore, – Sept . http://www.ics.uci.edu/
duce [, ] and its applications and extensions [–, ]
~yingyib/papers/HaLoop_camera_ready.pdf
allow one to extend the discussion of parallelism from . Zhang B, Ruan Y, Tak-Lon W, Qiu J, Hughes A, Fox G
simulation (which implicitly dominated the early work) () Applying twister to scientific applications. In: Cloud-
to data analysis []. Com . IUPUI Conference Center, Indianapolis,  Nov–
 Dec . http://grids.ucs.indiana.edu/ptliupages/publications/
PID.pdf
. Dean J, Ghemawat S () MapReduce: simplified data process-
Bibliography ing on large clusters. Commun ACM ():–
. Fox GC, Williams RD, Messina PC () Parallel computing . Ekanayake J () Architecture and performance of runtime
works!. Morgan Kaufmann, San Francisco. http://www.old-npac. environments for data intensive scalable computing. Ph. D. the-
org/copywrite/pcw/node.html#SECTION  sis, School of Informatics and Computing, Indiana University,
 Bloomington, Dec . http://grids.ucs.indiana.edu/ptliupages/
. Fox GC () What have we learnt from using real parallel publications/thesis_jaliya_v.pdf
machines to solve real problems. In: Fox GC (ed) Third con- . Malewicz G, Austern MH, Bik AJC, Dehnert JC, Horn I, Leiser N,
ference on hypercube concurrent computers and applications, Czajkowski G () Pregel: a system for large-scale graph
vol. . ACM, New York, pp – processing. In: International conference on management of data,
. Gray J, Hey T, Tansley S, Tolle K () The fourth paradigm: Indianapolis, pp –
data-intensive scientific discovery. Accessed  Oct . Avail- . Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I
able from: http://research.microsoft.com/en-us/collaboration/ () Spark: cluster computing with working sets. In: Sec-
fourthparadigm/ ond USENIX workshop on hot topics in cloud computing
Computer Graphics

John C. Hart
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Definition
The field of computer graphics is concerned with developing algorithms for visual simulation, and has both benefitted from parallel computing, especially to achieve real-time rendering rates, and contributed to parallel computing, especially through streaming architectures and programming languages.

Discussion
The field of computer graphics embodies algorithms for generating and manipulating images, representing and constructing shapes, and simulating and capturing motion. These algorithms often lend themselves naturally to parallel approaches due to the regularity of their data, e.g., image pixels, light rays, mesh elements, or moving particles.

Many operations in computer graphics are specializations of more general numerical, scientific, and combinatorial algorithms that have been parallelized in more general settings. The distribution of light in a scene is a recurrent integration, often solved with Monte Carlo methods. Linear system solvers, including iterative methods, conjugate gradients, and multigrid methods, are used for meshing, rendering and animating -D objects and scenes, and a wide variety of powerful image editing operations. Combinatorial optimization algorithms like GraphCut can be used to merge texture elements. N-body solutions have been used to find a light equilibrium and for surface reconstruction from scattered point data. Hence parallel computer graphics is often simply finding the appropriate numerical or combinatorial method, usually from scientific computing, and parallelizing that algorithm for deployment on a graphics platform, sometimes capitalizing on the reduced precision and other approximations afforded to visual simulation.

Much of the remainder of parallel graphics work has focused on rendering, particularly maximizing the quality of rendering while maintaining a desired rate of image production (e.g., real-time). Rendering is the process of taking a computer representation of a scene of one or more shapes and constructing one or more images depicting that scene. Rendering can be photorealistic, based on a physical model of light, or expressive, based on a perceptual model of the viewer. Renderings can be of contrived scenes for entertainment, or of data for visualization, and can range from real time (faster than  images/s) to interactive (around  images/s) to offline (taking minutes, hours, and even days per image). In all cases, parallelism provides a toolbox for achieving greater image generation throughput.

Modern rendering falls into three categories: rasterization, ray tracing, and micropolygons, and the remainder of this discussion describes the parallelism used to accelerate these rendering algorithms to support real-time graphics.

Rasterization
Rasterization has been the primary rendering method for real-time graphics, and parallelism is used to increase
the number and quality of rasterized polygons. To rasterize each polygon in a mesh, the polygon's vertices must first be shaded and transformed from scene coordinates to pixel locations. These operations are pipelined and lend themselves naturally to parallel processing since the vertices can be processed completely independently from each other (Fig. ).

Computer Graphics. Fig.  Vertices from a mesh polygon in a -D scene (a) are shaded and transformed independently in parallel into screen pixel locations (b). Screen pixels between these locations are interpolated by rasterization (c) and processed independently in parallel (d), in this example, saturating the color

The actual rasterization process fills in the screen pixels between the pixel locations of the screen-projected polygon vertices, interpolating any attributes (e.g., color) that accompany the vertices. These pixels are then further processed for texturing and fine-detail shading, which also can be trivially parallelized since the pixel computations are independent of each other.

This vertex and pixel parallelism has been implemented in a variety of graphics architectures over many decades, leading to the modern graphics processing unit (GPU). Since the computations are independent and identical over all vertices, and similarly over all pixels, single-instruction-multiple-data (SIMD) processing is utilized, and evolved into present-day GPU single-program-multiple-data (SPMD) processing. For efficient processing, GPU programs must still remain sensitive to SIMD performance issues, such as control flow divergence. Though we distinguish between vertex shaders and pixel shaders by their position in the graphics pipeline, the same processing array is used for both.

In order to support a wider variety of shading effects, these parallel vertex and pixel processes have supported custom programming, which led to the modern programmable GPU which can support programmable vertex and pixel "shaders." For example, "skinning" is a vertex shader that smoothly deforms an elbow by interpolating at each elbow mesh vertex the rotated coordinate frame of the upper arm with the rotated coordinate frame of the forearm. Pixel shaders were primarily motivated by the need to interpolate surface normals between vertices and compute shading per pixel, instead of computing shading per vertex and interpolating the vertex color across the polygon pixels.

Since these processes are independent and SIMD, their parallel processing is quite efficient, which made the GPU an attractive parallel computational platform for "general purpose" parallel programming, called GPGPU programming. As parallel graphics processing developed, general-purpose programs (e.g., computing a Mandelbrot set or a Voronoi diagram) had to be formulated as multiple passes of polygon rendering programs to be accelerated by the GPU. Modern GPUs do not require disguising general-purpose applications as graphics rasterization tasks. They support a "compute" mode that disables rasterization and allows the SIMD processors to execute directly on input data, and can be programmed directly with general-purpose parallel streaming languages, development environments, and libraries such as Nvidia's CUDA, OpenCL, and Intel's ArBB.

Ray Tracing
By leveraging the power of parallel processing enabled by the GPU and the multicore PC, ray tracing is becoming a viable alternative to rasterization for the real-time rendering needs of video games, virtual environments, and interactive visualization. Whereas rasterization finds for each polygon which pixels it covers, ray tracing finds for each pixel which polygons cover it (in both cases displaying for each pixel the closest polygon in the view direction). For each pixel, ray tracing constructs a line of sight from the viewer through it into the scene, and intersects that line of sight with
the scene's geometry. For curved (non-polygonal) surfaces this intersection is a numerical root-finding solver. The intersection point can yield further ray tracing of shadow, reflection, refraction, and other rays to find further illumination from the light sources. This process can be trivially parallelized over the pixels for MIMD processing, but can lead to load imbalance since some pixels miss the scene entirely while others display complex illumination requiring many recursive ray intersections (Fig. ).

Computer Graphics. Fig.  Stages of tracing a ray. (a) Traversing a spatial data structure to find likely candidate elements for intersection. (b) Intersecting an element. (c) Shading the intersection. (d) Spawning secondary rays for reflection, refraction, and shadowing

Scenes are often complex, and require complex spatial data structures to efficiently determine which scene elements are intersected by which rays to avoid an expensive all-pairs ray scene intersection. Especially for meshed scenes, ray intersection becomes a spatial database query, using hierarchical or grid structures to organize the scene. When parallelized, this structure needs to be shared or distributed among the processors. When the scene is dynamic, then this structure needs to be updated or reconstructed, which itself can rely on parallel speedups.

The intersections of a pair of rays passing through neighboring pixels often share the same computation and can benefit from spatial coherence and accelerated vector processing. However, the rays they spawn to measure secondary illumination are rarely coherent, and their computation can diverge, precluding efficient vector processing. Some approaches seek to cache secondary rays into coherent bundles that can be intersected more efficiently, though the expense of the caching and reordering often outweighs the reward of the parallelism speedup.

The GPU, designed for parallel rasterization-based rendering, can be reprogrammed for parallel ray tracing. One approach created a pair of texture images where each pixel corresponds to a single ray. One texture image's RGB values were the XYZ coordinates of the anchor (origin) of each ray, and a second texture image's RGB values were the XYZ coordinates of the ray directions. Then a single screen-filling quad would be rendered, whose vertices shared the same attribute data, in this case the nine values corresponding to the three coordinates of the three vertices of the triangle. As the screen-filling quad is rasterized, the attribute data is reproduced across each pixel and, when combined with the texture data, provides each pixel's shader with access to the quad's triangle and the pixel's ray, allowing it to compute a ray intersection it could report back as the pixel's color and z-value. While it leverages full-speed SIMD GPU computation of ray-triangle intersections, this approach requires a lot of CPU communication to coordinate spatial data structures, shading, and spawning new rays.

Ray tracing can be performed completely in SIMD on the GPU by treating the process as a state machine. The process of tracing a ray can be decomposed into four states: () traversing a spatial data structure to find which elements might intersect a ray, () intersecting the ray with elements, () shading the resulting intersection, and () spawning new rays to recursively determine illumination. Since the GPU is a SIMD processor (actually a collection of independent SIMD processors), each ray's process must wait until all of the other rays' processors finish the current state before any of them can move to the next state.

Micropolygons
Yet a third rendering approach, called micropolygon rendering, subdivides scene elements in -D into small "micropolygons" that each project to about the size of
a pixel and thus do not need to be rasterized. Micropolygons allow surface geometry to be displaced to more easily model embossed surface detail, and since each micropolygon projects to such a small pixel neighborhood, they can more easily be rendered in parallel.

Micropolygons led to the concept of the programmable "shader," in an application-specific little language called Renderman, that led to the vertex and pixel shaders of the modern GPU. The Renderman Shading Language succeeded because it contained a small collection of powerful high-level operators and data structures devoted specifically to the narrow task of illuminating and texturing surfaces. It inspired parallel GPU shader programming languages including Cg, GLSL, and HLSL, which are in comparison less elegant but sought to broaden the scope of GPU programming, eventually to CUDA, OpenCL, and other GPGPU programming frameworks and domain-specific languages for parallel programming.

Related Entries
NVIDIA GPU
SPMD Computational Model
Stream Programming Languages

Bibliographical Notes and Further Reading
Applications of general scientific computing methods to specific problems in graphics can be found throughout the history of computer graphics literature. Some key examples of these are the Metropolis method for Monte Carlo sampling of light paths [] and solving a symmetric positive definite matrix to find the equilibrium of light in a closed room []. Researchers have framed a number of graphics problems, such as seamlessly copying a portion of one image to another [] or tiling a surface with quadrilaterals [], as a Poisson partial differential equation, which opened a floodgate of "gradient domain" methods in graphics, solved largely interactively or in real time by parallel matrix solvers. The GraphCut network optimization method can be used to stitch pixel swatches into an image []. Both radiosity [] and radial basis function surface reconstruction [] have been accelerated by treating them as N-body problems.

Planar Voronoi diagrams were computed with early parallel graphics hardware by depth-buffered rasterization of cones extending from the data points [], and a Mandelbrot set could be computed by repeated image operations, called multipass programming [].

Parallel ray tracing often suffers from memory management issues, especially when reorganizing rays into coherent bundles []. This plagued the ray engine, which used the GPU to intersect rays and the CPU to manage ray coherence [], while simpler geometry coherence structures, such as a grid, could be implemented on the GPU in a state-based system [] that suffered from excessive waiting, which can be improved through better scheduling [].

The micropolygon approach was pioneered by Pixar as the REYES (Render Everything You Ever Saw) architecture [] that formed the basis for the parallel Pixar Image Computer, which could process each micropolygon independently in parallel. This led to the development of programmable shaders and Renderman [].

Bibliography
. Veach E, Guibas LJ () Metropolis light transport. Proceedings of SIGGRAPH, Los Angeles, pp –
. Goral CM, Torrance KE, Greenberg DP, Battaile B () Modeling the interaction of light between diffuse surfaces. Proceedings of SIGGRAPH, Minneapolis, pp –
. Pérez P, Gangnet M, Blake A () Poisson image editing. Proceedings of SIGGRAPH, San Diego, pp –
. Dong S, Bremer P-T, Garland M, Pascucci V, Hart JC () Spectral surface quadrangulation. Proceedings of SIGGRAPH, pp –
. Kwatra V, Schödl A, Essa I, Turk G, Bobick A () Graphcut textures: image and video synthesis using graph cuts. Proceedings of SIGGRAPH, San Diego, pp –
. Hanrahan P, Salzman D, Aupperle L () A rapid hierarchical radiosity algorithm. Proceedings of SIGGRAPH, Las Vegas, pp –
. Carr JC, Beatson RK, Cherrie JB, Mitchell TJ, Fright WR, McCallum BC, Evans TR () Reconstruction and representation of D objects with radial basis functions. Proceedings of SIGGRAPH, Los Angeles, pp –
. Hoff KE III, Zaferakis A, Lin M, Manocha D () Fast and simple D geometric proximity queries using graphics hardware. Proceedings of Interactive D Graphics, pp –
. Peercy MS, Olano M, Airey J, Ungar PJ () Interactive multipass programmable shading. Proceedings of SIGGRAPH, New Orleans, pp –
. Wald I, Slusallek P, Benthin C, Wagner M () Interactive rendering with coherent ray tracing. Proceedings of Eurographics, pp –
. Carr NA, Hall JD, Hart JC () The ray engine. Proceedings of Graphics Hardware, pp –
. Purcell TJ, Buck I, Mark WR, Hanrahan P () Ray tracing on programmable graphics hardware. Proceedings of SIGGRAPH, San Antonio, pp –
. Parker SG, Bigler J, Dietrich A, Friedrich H, Hoberock J, Luebke D, McAllister D, McGuire M, Morley K, Robison A, Stich M () OptiX: a general purpose ray tracing engine. Proceedings of SIGGRAPH, Los Angeles, Article ,  p
. Cook RL, Carpenter L, Catmull E () The Reyes image rendering architecture. Proceedings of SIGGRAPH, Anaheim, pp –
. Hanrahan P, Lawson J () A language for shading and lighting calculations. Proceedings of SIGGRAPH, Dallas, pp –

Computing Surface
Meiko

Concatenation
Allgather

Concurrency Control
Path Expressions

Concurrent Collections Programming Model

Michael G. Burke, Kathleen Knobe, Ryan Newton, Vivek Sarkar
Rice University, Houston, TX, USA
Intel Corporation, Cambridge, MA, USA
Intel Corporation, Hudson, MA, USA

Synonyms
CnC, TStreams

Definition
Concurrent Collections (CnC) is a parallel programming model, with an execution semantics that is influenced by dynamic dataflow, stream-processing, and tuple spaces. The three main constructs in the CnC programming model are step collections, data collections, and control collections. A step collection corresponds to a computation, and its instances correspond to invocations of that computation that consume and produce data items. A data collection corresponds to a set of data items, indexed by item tags, that can be accessed via put and get operations; once put, data items cannot be overwritten and are required to be immutable. A control collection corresponds to a factory [] for step instances. A put operation on a control collection with a control tag results in the prescription (creation) of step instances from one or more step collections with that tag passed as an input argument. These collections and their relationships are defined statically as a CnC graph in which a node corresponds to a step, data, or item collection, and a directed edge corresponds to a put, get, or prescribe operation.

Discussion
Introduction
Parallel computing has become firmly established since the s as the primary means of achieving high performance from supercomputers. Concurrent Collections (CnC) was developed to address the need for making parallel programming accessible to nonprofessional programmers. One approach that has historically addressed this problem is the creation of domain-specific languages (DSLs) that hide the details of parallelism when programming for a specific application domain.

In contrast, CnC is a model for adding parallelism to any host language (which is typically serial and may be a DSL). In this approach, the parallel implementation details of the application are hidden from the domain expert, but are instead addressed separately by users (and tools) that serve the role of tuning experts.

The basic concepts of CnC are widely applicable. Its premise is that domain experts can identify the intrinsic data dependences and control dependences in their application (irrespective of lower-level implementation choices). This identification of dependences is achieved by specifying a CnC graph for their application. Parallelism is implicit in a CnC graph. A CnC graph has a deterministic semantics, in that all executions are guaranteed to produce the same output state for the same
input. This deterministic semantics and the separation of concerns between the domain and tuning experts are the primary characteristics that differentiate CnC from other parallel programming models.

CnC Specification Graph
The three main constructs in a CnC specification graph are step collections, data collections, and control collections. These collections and their relationships are defined statically. But for each static collection, a set of dynamic instances is created as the program executes.

A step collection corresponds to a specific computation, and its instances correspond to invocations of that computation with different input arguments. A control collection is said to control a step collection – adding an instance to the control collection prescribes one or more step instances, i.e., causes the step instances to eventually execute when their inputs become available. The invoked step may continue execution by adding instances to other control collections, and so on.

Steps also dynamically read (get) and write (put) data instances. The execution order of step instances is constrained only by their producer and consumer relationships, including control relations. A complete CnC specification is a graph where the nodes can be either step, data, or control collections, and the edges represent producer, consumer, and control relationships.

A whole CnC program includes the specification, the step code, and the environment. Step code implements the computations within individual graph nodes, whereas the environment is the external user code that invokes and interacts with the CnC graph while it executes. The environment can produce data and control instances. It can consume data instances and use control instances to prescribe conditional execution.

Collections Indexed by Tags
Within each collection, control, data, and step instances are each identified by a unique tag. In most CnC implementations, tags may be of any data type that supports an equality test and hash function. Typically, tags have a specific meaning within the application. For example, they may be tuples of integers modeling an iteration space (i.e., the iterations of a nested loop structure). Tags can also be points in non-grid spaces – nodes in a tree, in an irregular mesh, elements of a set, etc. Collections use tags as follows:

● A step begins execution with one input argument – the tag indexing that step instance. The tag argument contains the information necessary to compute the tags of all the step's input and output data. For example, in a stencil computation a tag "i,j" would be used to access data at positions "i+1,j+1", "i-1,j-1", and so on. In a CnC specification file, a step collection is written (foo) and the components of its tag indices can optionally be documented, as in (foo: row, col).
● Putting a tag into a control collection will cause the corresponding steps (in all controlled step collections) to eventually execute when their inputs become available. A control collection C is denoted as <C> or as <C:x,y,z>, where x, y, z comprise the tag. Instances of a control collection contain no information except their tag, so the word "tag" is often used synonymously with "control instance."
● A data collection is an associative container indexed by tags. The contents indexed by a tag i, once written, cannot be overwritten (dynamic single assignment). The immutability of entries within a data collection, along with other features, provides determinism. In a specification file a data collection is referred to with square-bracket syntax: [x:i,j].

Using the above syntax, together with :: and → for denoting prescription and production/consumption relations, we can write CnC specifications that describe CnC graphs. For example, below is an example snippet of a CnC specification showing all of the syntax.

// control relationship: myCtrl prescribes instances of myStep
<myCtrl> :: (myStep);
// myStep gets items from myData, and puts tags in myCtrl and items in myData
[myData] → (myStep) → <myCtrl>, [myData];

Further, in addition to describing the graph structure, we might choose to use the CnC specification to document the relationship between tag indices:

[myData: i] → (myStep: i) → <myCtrl: i+1>, [myData: i+1];
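To make the tag-driven execution above concrete, the following self-contained C++ sketch mimics the one-step pipeline of this specification with ordinary containers. It is illustrative only: the Context structure, the serial "scheduler" loop, and the stopping condition are assumptions of this sketch, not part of any CnC implementation (code against an actual CnC C++ API appears later in this entry).

#include <cstdio>
#include <map>
#include <queue>

// Toy stand-ins for the collections in the specification above.
struct Context {
    std::map<int, double> myData;   // [myData]: item collection keyed by tag i
    std::queue<int>       myCtrl;   // <myCtrl>: control collection of tags
};

// (myStep: i) -- gets [myData: i], puts [myData: i+1], prescribes <myCtrl: i+1>
void myStep(int i, Context& c) {
    double x = c.myData.at(i);      // get with the step's own tag i
    if (i < 5) {                    // stopping condition for this toy example
        c.myData[i + 1] = x * x;    // put with tag function i+1
        c.myCtrl.push(i + 1);       // prescribe the next step instance
    }
}

int main() {
    Context c;
    c.myData[0] = 2.0;              // environment puts the initial item ...
    c.myCtrl.push(0);               // ... and the initial control tag
    while (!c.myCtrl.empty()) {     // trivial serial scheduler: run prescribed steps
        int t = c.myCtrl.front(); c.myCtrl.pop();
        myStep(t, c);
    }
    std::printf("myData[5] = %g\n", c.myData.at(5));
}

Because the step touches only items whose tags are computed from its own tag, a real run-time system is free to execute the prescribed instances in any order, or in parallel, once their inputs are available.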
Model Execution
During execution, the state of a CnC program is defined by attributes of step, data, and control instances. (These attributes are not directly visible to the CnC programmer.) Data instances and control instances each have an attribute Avail, which has the value true if and only if a put operation has been performed on it. A data instance also has a Value attribute representing the value assigned to it where Avail is true. When the set of all data instances to be consumed by a step instance and the control instance that prescribes a step instance have Avail attribute value true, then the value of the step instance attribute Enabled is set to true. A step instance has an attribute Done, which has the value true if and only if all of its put operations have been performed.

Instances acquire attribute values monotonically during execution. For example, once an attribute assumes the value true, it remains true unless an execution error occurs, in which case all attribute values become undefined. Once the Value attribute of a data instance has been set to a value through a put operation, assigning it a subsequent value through another put operation produces an execution error, by the single assignment rule. The monotonic assumption of attribute values simplifies program understanding, formulating and understanding the program semantics, and is necessary for deterministic execution.

Given a complete CnC specification, the tuning expert maps the specification to a specific target architecture, creating an efficient schedule. This is quite different from the more common approach of embedding parallel constructs within serial code. Tag functions provide a tuning expert with additional information needed to map the application to a parallel architecture, and for static analysis, they provide information needed to optimize distribution and scheduling of the application.

Example
The following simple example illustrates the task and data parallel capabilities of CnC. This application takes a set (or stream) of strings as input. Each string is split into words (separated by spaces). Each word then passes through a second phase of processing that, in this case, puts it in uppercase form.

<stringTags> :: (splitString); // step 1
<wordTags> :: (uppercaseWord); // step 2
// The environment produces initial inputs and retrieves results:
env → <stringTags>, [inputs];
env ← [results];
// Here are the producer/consumer relations for both steps:
[inputs] → (splitString) → <wordTags>, [words];
[words] → (uppercaseWord) → [results];

The above text corresponds directly to the graph in Fig. . Note that separate strings in [inputs] can be processed independently (data parallelism), and, further, the (splitString) and (uppercaseWord) steps may operate simultaneously (task parallelism).

(Figure: env → <stringTags>, [inputs]; [inputs] → (splitString) → <wordTags>, [words]; [words] → (uppercaseWord) → [results] → env)
Concurrent Collections Programming Model. Fig.  A CnC graph as described by a CnC specification. By convention, in the graphical notation specific shapes correspond to control, data, and step collections. Dotted edges represent prescription (control/step relations), and arrows represent production and consumption of data. Squiggly edges represent communication with the environment (the program outside of CnC)

The only keyword in the CnC specification language is env, which refers to the environment – the world outside CnC, for example, other threads or processes
written in a serial language. The strings passed into CnC from the environment are placed into [inputs] using any unique identifier as a tag. The elements of [inputs] may be provided in any order or in parallel. Each string, when split, produces an arbitrary number of words. These per-string outputs can be numbered  through N – a pair containing this number and the original string ID serves as a globally unique tag for all output words. Thus, in the specification we could annotate the collections with tag components indicating the pair structure of word tags: e.g., (uppercaseWord: stringID, wordNum).

Figure  contains C++ code implementing the steps splitString and uppercaseWord. The step implementations, specification file, and code for the environment together make up a complete CnC application. Current implementations of CnC vary as to whether the specification file is required, can be constructed graphically, or can be conveyed in the host language code itself through an API.

// The execute method "fires" when a tag is available.
// The context c represents the CnC graph containing item and tag collections.
int splitString::execute(const int & t, partStr_context & c) const
{
    // Get input string
    string in;
    c.input.get(t, in);
    // Use C++ standard template library to extract words:
    istringstream iss(in);
    vector<string> words;
    copy(istream_iterator<string>(iss),
         istream_iterator<string>(),
         back_inserter<vector<string> >(words));
    // Finally, put words into an output item collection:
    for (int i = 0; i < words.size(); i++) {
        pair<int,int> wtag(t, i);
        c.wordTags.put( wtag );
        c.words.put( wtag, words[i] );
    }
    return CnC::CNC_Success;
}

// Convert word to upper case form:
int uppercaseWord::execute(const pair<int,int> & t, partStr_context & c) const
{
    string word;
    c.words.get(t, word);
    strUpper(word);
    c.results.put(t, word);
    return CnC::CNC_Success;
}

int main()
{
    partStr_context c;
    for (...)
        c.input.put(t, string); // Provide initial inputs
    c.wait();                   // Wait for all steps to finish
    ...                         // Print results
}

Concurrent Collections Programming Model. Fig.  C++ code implementing steps and environment. Together with a CnC specification file the above forms a complete application
Mapping to Target Platforms
There is wide latitude in mapping CnC to different platforms. For each there are several issues to be addressed: grain size, mapping data instances to memory locations, steps to processing elements, and scheduling steps within a processing element. A number of distinct implementations are possible for both distributed and shared memory target platforms, including static, dynamic, or a hybrid of static/dynamic systems with respect to the above choices.

The way an application is mapped will determine its execution time, memory, power, latency, and bandwidth utilization on a target platform. The mappings are specified as part of the static translation and compilation as well as dynamic scheduling of a CnC program. In dynamic run-time systems, the mappings are influenced through scheduling strategies, such as LIFO, FIFO, work-stealing, or priority-based.

Implementations of CnC typically provide a translator and a run-time system. The translator uses a CnC specification to generate code for a run-time system API in the target language. As of the writing of this entry, there are known CnC implementations for C++ (based on Intel's Threading Building Blocks), Java (based on Java Concurrency Utilities), .NET (based on .NET Task Parallel Library), and Haskell.

Step Execution and Data Puts and Gets: There is much leeway in CnC implementation, but in all implementations, step prescription involves creation of an internal data structure representing the step to be executed. Parallel tasks can be spawned eagerly upon prescription, or delayed until the data needed by the task is ready. The get operations on a data collection could be blocking (in cases when the task executing the step is to be spawned before all the inputs for the step are available) or non-blocking (the run-time system guarantees that the data is available when get is executed). Both the C++ and Java implementations have a rollback and replay policy, which aborts the step performing a get on an unavailable data item and puts the step in a separate list associated with the failed get (a simplified sketch of this policy appears at the end of this section). When a corresponding put is executed, all the steps in a list waiting on that item are restarted. The Java implementation also has a "delayed async" policy [], which requires the user or the translator to provide a boolean ready() method that evaluates to true once all the inputs required by the step are available. Only when ready() for a given step evaluates to true does the Java implementation spawn a task to execute the step.

Initialization and Shutdown: All implementations require some code for initialization of the CnC graph: creating step objects and a graph object, as well as performing the initial puts into the data and control collections.

In the C++ implementation, ensuring that all the steps in the graph have finished execution is done by calling the run() method on the graph object, which blocks until all runnable steps in the program have completed. In the Java implementation, ensuring that all the steps in the graph have completed is done by enclosing all the control collection puts from the environment in a Habanero-Java finish construct [], which ensures that all transitively spawned tasks have completed.

Safety properties: In addition to the differences between step implementation languages, different CnC implementations enforce the CnC graph properties differently. All implementations perform run-time system checks of the single assignment rule, while the Java and .NET implementations also enforce tag immutability. Finally, CnC guarantees determinism as long as steps are themselves deterministic – a contract strictly enforceable only in Haskell.

Memory reuse: Another aspect of CnC run-time systems is garbage collection. Unless the run-time system at some point deletes the items that were put, the memory usage will continue to increase. Managed run-time systems such as Java or .NET will not solve this problem, since an item collection retains pointers to all instances. Recovering memory used by data instances is a separate problem from traditional garbage collection. There are two approaches identified thus far to determine when data instances are dead and can safely be released (without breaking determinism). First, [] introduces a declarative slicing annotation for CnC that can be transformed into a reference counting procedure for memory management. Second, the C++ implementation provides a mechanism for specifying use counts for data instances, which are discarded after their last use. Irrespective of which of these mechanisms is used, data collections can be released after a graph has finished running. Frequently, an application uses CnC for finite computations inside a serial outer loop, thereby reclaiming all memory between iterations.
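As a rough illustration of the rollback-and-replay policy described above under Step Execution and Data Puts and Gets, the following self-contained C++ sketch parks a step that fails a get and replays it after the matching put. All names and data structures here are hypothetical; a real CnC run-time system adds parallel task scheduling, tag hashing, and the safety checks discussed above.

#include <functional>
#include <map>
#include <queue>
#include <string>

struct Miss { int tag; };                        // thrown on a failed get

struct ItemCollection {
    std::map<int, std::string> items;
    std::string get(int tag) const {
        auto it = items.find(tag);
        if (it == items.end()) throw Miss{tag};  // roll the step back
        return it->second;
    }
};

using Step = std::function<void()>;

struct Runtime {
    ItemCollection data;
    std::queue<Step> ready;                      // steps that may run now
    std::multimap<int, Step> waiting;            // steps parked per missing tag

    void put(int tag, const std::string& v) {
        data.items[tag] = v;                     // single assignment assumed
        auto range = waiting.equal_range(tag);   // replay steps parked on this tag
        for (auto it = range.first; it != range.second; ++it)
            ready.push(it->second);
        waiting.erase(range.first, range.second);
    }
    void run() {
        while (!ready.empty()) {
            Step s = ready.front(); ready.pop();
            try { s(); }                         // attempt the step
            catch (const Miss& m) { waiting.insert({m.tag, s}); }
        }
    }
};

int main() {
    Runtime rt;
    rt.ready.push([&rt] { rt.put(1, rt.data.get(0) + "!"); });  // needs item 0
    rt.run();            // the step parks on tag 0
    rt.put(0, "hello");  // the put releases and replays the step
    rt.run();            // item 1 now holds "hello!"
}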
Related Work
Table  is used to guide the discussion in this section. This table classifies programming models according to their attributes in three dimensions: Declarative, Deterministic, and Efficient. A few representative examples are included for each distinct set of attributes. The reader can extrapolate this discussion to other programming models with similar attributes in these three dimensions.

Concurrent Collections Programming Model. Table  Comparison of several parallel programming models

Parallel prog. model         Declarative   Deterministic   Efficient
Intel TBB []                 No            No              Yes
.Net Task Par. Lib. []       No            No              Yes
Cilk                         No            No              Yes
OpenMP []                    No            No              Yes
CUDA                         No            No              Yes
Java Concurrency []          No            No              Yes
Det. Parallel Java []        No            Hybrid          Yes
High Perf. Fortran []        Hybrid        No              Yes
X []                         Hybrid        No              Yes
Linda []                     Hybrid        No              Yes
Asynch. Seq. Processes []    Yes           Yes             No
StreamIt []                  Yes           Yes             Yes
LabVIEW []                   Yes           Yes             Yes
CnC                          Yes           Yes             Yes

A number of lower-level programming models in use today – e.g., Intel TBB [], .Net Task Parallel Library [], Cilk, OpenMP [], Nvidia CUDA, Java Concurrency [] – are non-declarative, nondeterministic, and efficient. Here a programming model is considered to be efficient if there are known implementations that deliver competitive performance for a reasonably broad set of programs. Deterministic Parallel Java [] is an interesting variant of Java; it includes a subset that is provably deterministic, as well as constructs that explicitly indicate when determinism cannot be guaranteed for certain code regions, which is why it contains a "hybrid" entry in the Deterministic column.

The next three languages in the table – High Performance Fortran (HPF) [], X [], Linda [] – contain hybrid combinations of imperative and declarative programming in different ways. HPF combines a declarative language for data distribution and data parallelism with imperative (procedural) statements, X contains a functional subset that supports declarative parallelism, and Linda is a coordination language in which a thread's interactions with the tuple space are declarative. Linda was a major influence on the CnC design. CnC shares two important properties with Linda: both are coordination languages that specify computations and communications via a tuple/tag namespace, and both create new computations by adding new tuples/tags to the namespace. However, CnC also differs from Linda in many ways. For example, an in() operation in Linda atomically removes the tuple from the tuple space, but a CnC get() operation does not remove the item from the collection. This is a key reason why Linda programs can be nondeterministic in general, and why CnC programs are provably deterministic. Further, there is no separation between tags and values in a Linda tuple; instead, the choice of tag is implicit in the use of wildcards. In CnC, there is a separation between tags and values, and control tags are first-class constructs like data items.

The last four programming models in the table are both declarative and deterministic. Asynchronous Sequential Processes [] is a recent model with a clean semantics, but without any efficient implementations. In contrast, the remaining three entries are efficient as well. StreamIt [, ] is representative of a modern streaming language, and LabVIEW [] is representative of a modern dataflow language. Both streaming and dataflow languages have had major influence on the CnC design. The CnC semantic model is based on dataflow in that steps are functional and execution can proceed whenever data is ready.

However, CnC differs from dataflow in some key ways. The use of control tags elevates control to a first-class construct in CnC. In addition, item collections allow more general indexing (as in a tuple space) compared to dataflow arrays (I-structures). CnC is like
streaming in that the internals of a step are not visible from the graph that describes their connectivity, thereby establishing an isolation among steps. A producer step in a streaming model need not know its consumers; it just needs to know which buffers (collections) to perform read and write operations on. However, CnC differs from streaming in that put and get operations need not be performed in FIFO order, and (as mentioned above) control is a first-class construct in CnC. Further, CnC's dynamic put/get operations on data and control collections serve as a general model that can be used to express many kinds of applications that would not be considered to be dataflow or streaming applications.

Future Directions
Future work on the CnC model will focus on incorporating more power in the specification language (module abstraction, libraries of patterns) and integration with persistent data models.

Combinations of static/dynamic treatment of scheduling and step/data distribution will continue to be explored. Run-time strategies, such as for reducing overhead for finer-grained parallelism and for memory management, will be developed.

Static graph analysis will play a role in performance optimization in the future.

Related Entries
Cilk
Dependences
Deterministic Parallel Java
HPF (High Performance Fortran)
Linda
OpenMP

Bibliographic Notes and Further Reading
The basic principles behind the Concurrent Collections programming model are outlined in []. The model is built on past work done at Hewlett Packard Labs on TStreams, described in [].

A technique is presented in [] for memory management of data items with lifetimes that are longer than a single computation step.

[] is a pioneering paper in dataflow languages. For a description of coordination languages and their use, see [].

Many members of the Concurrent Collections community gathered at workshops held in  (http://habanero.rice.edu/cnc) and  (http://cnc.rice.edu/cnc).

Bibliography
. Budimlić Z, Burke M, Cavé V, Knobe K, Lowney G, Newton R, Palsberg J, Peixotto D, Sarkar V, Schlimbach F, Sağnak T (February ) CnC programming model. Technical Report TR-, Rice University
. Denis C, Ludovic H, Bernard PS () Asynchronous sequential processes. Information Comput ():–
. Chandra R, Dagum L, Kohr D, Maydan D, McDonald J, Menon R () Programming in OpenMP. Academic Press, San Diego, California
. Chandramowlishwaran A, Budimlic Z, Knobe K, Lowney G, Sarkar V, Treggiari L () Multi-core implementations of the concurrent collections programming model. In: Proceedings of th international workshop on compilers for parallel computers (CPC). Zurich, Switzerland, Jan 
. Dennis JB () First version of a data flow procedure language. In: Programming symposium, proceedings colloque sur la programmation, Paris, pp –
. Charles P et al () X: an object-oriented approach to non-uniform cluster computing. In: Proceedings of OOPSLA', ACM SIGPLAN conference on object-oriented programming systems, languages and applications. San Diego, California, pp –
. Bocchino RL et al () A type and effect system for Deterministic Parallel Java. In: Proceedings of OOPSLA', ACM SIGPLAN conference on object-oriented programming systems, languages and applications. Orlando, Florida, pp –
. Budimlić Z et al () Declarative aspects of memory management in the concurrent collections parallel programming model. In: DAMP ': the workshop on declarative aspects of multicore programming. Savannah, Georgia, ACM, pp –
. Gamma E, Helm R, Johnson R, Vlissides J () Design patterns: elements of reusable object-oriented software. Addison-Wesley, Reading, Massachusetts
. Gelernter D () Generative communication in Linda. ACM Trans Program Lang Syst ():–
. Gelernter D, Carriero N () Coordination languages and their significance. Commun ACM ():–
. Gordon MI et al () A stream compiler for communication-exposed architectures. In: ASPLOS-X: Proceedings of the th international conference on architectural support for programming languages and operating systems. ACM, New York, pp –
. Gordon MI et al () Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. In: ASPLOS-XII:
Proceedings of the th international conference on architectural support for programming languages and operating systems. ACM, New York, pp –
. Habanero multicore software research project. http://habanero.rice.edu
. Kennedy K, Koelbel C, Zima HP () The rise and fall of high performance Fortran. In: Proceedings of HOPL', Third ACM SIGPLAN history of programming languages conference, San Diego, California, pp –
. Knobe K, Offner CD () TStreams: a model of parallel computation (preliminary report). Technical Report HPL--, HP Labs
. Peierls T, Goetz B, Bloch J, Bowbeer J, Lea D, Holmes D () Java concurrency in practice. Addison-Wesley Professional, Reading, Massachusetts
. Reinders J () Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O'Reilly Media, Sebastopol, California
. Toub S () Parallel programming and the .NET Framework .. http://blogs.msdn.com/pfxteam/archive////.aspx
. Travis J, Kring J () LabVIEW for everyone: graphical programming made easy and fun, rd edn. Prentice Hall, Upper Saddle River, New Jersey

Concurrent Logic Languages
Logic Languages

Concurrent ML

John Reppy
University of Chicago, Chicago, IL, USA

Synonyms
CML

Definition
Concurrent ML (CML) is a higher-order concurrent language embedded in the sequential language Standard ML (SML). CML's basic programming model consists of dynamically created threads that communicate via message passing over dynamically created channels. CML also provides first-class synchronous operations, called event values, which support user-defined synchronization and communication abstractions. While the term "CML" refers to a specific language implementation, it is also used to refer to implementations of its language primitives in other systems.

Discussion
Concurrent ML Basics
Concurrent ML was designed with the goal of supporting high-level concurrent programming. Its design was originally motivated by the idea that user-interface software should be concurrent and that message passing was the best way to construct such concurrent programs [, , ]. Its basic concurrency features were heavily influenced by earlier message-passing designs, including Hoare's Communicating Sequential Processes [] and Cardelli's Amber language []. Applications of CML include a multithreaded GUI toolkit [], a distributed tuple-space implementation [], and a system for implementing partitioned applications in a distributed setting [].

Because CML is embedded in SML, SML specifications are used to introduce CML's features and the code examples are also written in SML. For readers who are unfamiliar with SML syntax, this paragraph provides a brief description of SML specification and type syntax. Specifications in SML are declarations that appear in the signature of a module and are used to describe the types, values, etc., defined by the module. For example, the output function has the specification

val output : (outstream * string) -> unit

which specifies that output is a function that takes a pair of arguments, an outstream and a string, and returns the unit value. Unit is the type with only one value (written as "()"). The symbol "*" is the product type constructor and the symbol "->" is the function type constructor. Lastly, SML is a polymorphic language, which means that specifications can have universally qualified type variables. For example:

type 'a vector
val length : 'a vector -> int

is the specification of the abstract vector type constructor and its length function. The identifier "'a" is an SML type variable. The vector type constructor can
be applied to other types to construct arbitrary vector types, such as "int vector" and "int vector vector" (note that type-constructor application is postfix). The length function will work on any of these vector types, since it is polymorphic.

New threads are created in CML using the spawn function, which has the following SML value specification:

val spawn : (unit -> unit) -> thread_id

Because SML is a strict functional language, the argument to spawn must be encapsulated in a unit to unit function or thunk. The spawn function creates a new thread to evaluate its argument and returns the new thread's ID. CML threads communicate over typed channels that have the following signature:

type 'a chan
val channel : unit -> 'a chan
val recv : 'a chan -> 'a
val send : ('a chan * 'a) -> unit

Message passing is synchronous: both the recv and send operations are blocking. The model is distributed in that threads only interact via messages, but it is usually implemented in shared memory and is quite efficient. In addition to channels, CML provides other communication mechanisms, such as asynchronous channels and synchronous variables. The semantics of these mechanisms is defined in terms of the channel and thread operations, but their implementation is more efficient.

As an illustration of these basic primitives, the following code creates a simple producer/consumer thread pair, which communicate over a shared channel:

let val ch = channel ()
    fun producer () = (
          send (ch, produce ());
          producer ())
    fun consumer () = (
          consume (recv ch);
          consumer ())
in
  spawn producer;
  spawn consumer
end

This code illustrates a couple of idioms that CML inherits from SML. The first is the use of tail-recursion to define threads that loop for ever. The second is the use of lexical scoping to limit access to channels. In this case, the channel ch is visible to both the producer and consumer threads, but not outside the scope of the let binding.

First-Class Synchronous Operations
The distinguishing feature of CML is its support for first-class synchronous operations. Much the same way that higher-order languages support first-class function values, CML supports first-class synchronization values. This design is motivated by two observations:
. Most interprocess interactions involve multiple messages
. Processes need to interact with multiple partners
For example, consider a typical client–server protocol. The client sends a request message to the server and then waits for a reply. But suppose that the client wants to request a result from two different servers, accepting whichever reply it receives first? In that case, the protocol must include a negative acknowledgment mechanism for the case where the client wishes to abort the protocol. Figure  illustrates the two possible outcomes of this scenario. This example assumes the existence of an asynchronous mechanism for transmitting the negative acknowledgments.

The client-side code for this interaction might look something like the following pseudo-CML code, where each server has its own request channel (reqCh1 and reqCh2) and the requests consist of a triple of the request message, a private reply channel, and a signal variable for aborting the transaction:

let val replCh1 = channel()
    val nack1 = cvar()
    val replCh2 = channel()
    val nack2 = cvar()
in
  (* send requests to the servers *)
  send (reqCh1,
        (req1, replCh1, nack1));
  send (reqCh2,
        (req2, replCh2, nack2));
  (* wait for a reply from one of
   * the servers *)
  selectRecv [
      (replCh1, fn repl1 => (
            set nack2; act1 repl1)),
      (replCh2, fn repl2 => (
            set nack1; act2 repl2))
    ]
end

(Figure: two message-sequence diagrams between Server1, Client, and Server2; messages: request, reply/ack, nack)
Concurrent ML. Fig.  Interacting with two servers

This example uses two additional concurrency features that are not part of CML, but which can be easily definable in CML. The "cvars" are unit-valued write-once synchronous variables that are used to implement the asynchronous negative acknowledgements, and the selectRecv function implements a nondeterministic choice of reading from a list of channels paired with message handlers. As can be seen from this code, the interactions with the two servers are quite entangled. The situation would become even more complicated if a third server were added. Furthermore, the client and server sides of the protocol are split apart, which makes changes to the protocol more difficult to implement.

The traditional mechanism of procedural abstraction is not appropriate in this situation, because it hides away the synchronous aspect of the operations and does not handle the negative acknowledgments.

CML solves this problem by introducing a new abstraction mechanism called first-class synchronous operations. It defines a type of abstract values, called events, that represent synchronous operations, such as receiving a message on a channel or a timeout.

type 'a event

An event value represents a potential computation with latent communication and synchronization effects. Its type argument specifies the type of the result when a thread synchronizes on the event, which is done using the following operator:

val sync : 'a event -> 'a

Base-event constructors create event values for the channel communication primitives.

val recvEvt : 'a chan -> 'a event
val sendEvt : 'a chan * 'a -> unit event

Note that these functions are pure – they define event values but have no side effects themselves. They also satisfy the obvious equations:

sync ○ recvEvt = recv
sync ○ sendEvt = send

The power of events comes from the combinators that CML provides for building more complicated values. The first of these is the wrap combinator, which takes an event value and a post-synchronization action and returns a new event that will apply the action after synchronization. It has the type

val wrap : ('a event * ('a -> 'b)) -> 'b event

and satisfies the equation

sync (wrap (ev, f)) = f (sync ev)
for any event value ev and function f that are well typed for the context.

The second combinator is choose, which takes a list of event values and returns a new event that represents the nondeterministic choice of the events in the list. It has the type

val choose : 'a event list -> 'a event

Note that all of the events in the list must have the same result type, but one can use wrap to convert between types as necessary. The sync operator and wrap and choose combinators define the PML subset of CML, which is named after the first language to support first-class synchronization []. Full CML adds two event-generator combinators, which are used to define pre-synchronization operations. The first of these is the guard combinator, which takes an event-valued function and returns an event value that will evaluate the function at synchronization time and then synchronize on its result. It has the type

val guard : (unit -> 'a event) -> 'a event

and satisfies the equation

sync (guard f) = sync (f ())

for any function f that is well typed for the context. Typically the guard combinator is used to package up operations, such as initiating a transaction with a server, that should be done before synchronization. As long as the operations in the function are idempotent, guard is sufficient, but when they are not, some additional mechanism is required for aborting the transaction. The final event combinator provides this mechanism:

val withNack : (unit event -> 'a event) -> 'a event

Like guard, its argument is a function that is evaluated at synchronization time to produce an event value, which is then synchronized on. What is different is that the function is passed an event-valued argument, which is a negative-acknowledgment event that is used to signal when some other event in the synchronization is chosen instead.

A more modular interface to the servers in the client–server example from above can be implemented using CML's abstraction mechanisms. Using event values, one can define an abstract interface that hides the details of the client–server protocol from the client:

type server
val rpcEvt : (server * req) -> repl event

With this interface, the client-side code has a clear separation of concerns between the interactions with the two servers.

sync (
  choose [
      wrap (rpcEvt server1,
        fn repl1 => act1 repl1),
      wrap (rpcEvt server2,
        fn repl2 => act2 repl2)
    ])

The implementation of this interface is straightforward using CML's event combinators. The server type is represented by a request channel that takes a triple of the request message, a private reply channel, and a negative-acknowledgment event that signals when the transaction should be aborted.

type req_msg =
      (req * repl chan * unit event)
type server = req_msg chan

The rpcEvt function is implemented using the withNack combinator.

fun rpcEvt (serverCh, req) =
      withNack (fn nackEvt => let
        val replCh = channel ()
        in
          send (serverCh,
                (req, replCh, nackEvt));
          recvEvt replCh
        end)

Modularity is supported by defining the rpcEvt function in the module that implements the service and only exposing the abstract interface to clients.
Event values also provide a uniform framework for incorporating other types of synchronous operations into the language. For example, synchronizing on thread termination is supported by the combinator

val joinEvt : thread_id -> unit event

and timeouts are supported by

val timeOutEvt : time -> unit event

System services, such as I/O and system-process management, can also be modeled as event-valued operations.

Events can also be used to implement many of the high-level concurrency mechanisms that have been defined over the years, including Ada's rendezvous mechanisms [], Multilisp's futures [], asynchronous RPCs (or promises) [], Linda-style distributed tuple spaces [], and the Join-calculus communication mechanisms []. Many of these implementations are described in the definitive description of CML: Concurrent Programming in ML [].

Semantics and Analysis
Since its earliest days, CML has been the subject of research on the semantics of concurrent languages and on program analyses for concurrent programs [, ].

Because CML provides a mechanism for implementing communication and synchronization abstractions, one might ask what expressiveness limits there are on these abstractions. The answer is that it depends on the underlying communication primitives. In the case of CML, which provides synchronous communication with output guards (i.e., sendEvt), communication mechanisms that require two-way common knowledge can be implemented, but it is not possible to define a three-way synchronization mechanism []. If CML were based on asynchronous message passing (i.e., nonblocking send) or did not have output guards, then even two-way common knowledge would be impossible to achieve.

The problem of program analysis for CML has also been studied by a number of researchers. Nielson and Nielson developed an effects-based analysis for detecting when programs written in a subset of CML have finite topology and thus can be mapped onto a finite processor network []. Debbabi et al. developed a type-based control-flow analysis for a CML subset [], but did not propose any applications for their analysis. Colby developed an abstract interpretation for a subset of CML that is based on a semantics that uses control paths (i.e., an execution trace) to identify threads. Unlike using spawn points to identify threads (as in []), control paths distinguish multiple threads created at the same spawn point, which is a necessary condition to understand the topology of a program. Reppy and Xiao developed an analysis for CML that can be used to specialize channel primitives for better performance in a parallel implementation of CML []. For example, if a channel is only used for point-to-point communication, then it can be implemented using fixed storage and an atomic swap instruction.

Implementations
The idea of awarding first-class status to synchronous operations was first implemented as part of the PML language [], which was the metalanguage for the Pegasus system designed at Bell Laboratories []. These ideas were then reimplemented as Concurrent ML on top of the Standard ML of New Jersey system (SML/NJ) [] and released in the autumn of . As discussed above, CML extended the PML design with two important combinators: guard and a precursor to withNack called wrapAbort []. These primitives greatly increased the expressiveness of the mechanism. CML continues to be distributed as part of SML/NJ.

CML's implementation uses SML/NJ's heap-allocated first-class continuations to implement threads []. This implementation strategy has three important advantages: the implementation is written in a high-level language (SML), threads are extremely lightweight in both space and time, and threads are garbage collected. CML is preemptively scheduled, but, like most of the user-space threading libraries developed in the s and s, it does not support multiprocessors (the main reason for this lack of multiprocessor support is because the underlying SML/NJ system does not support multiprocessors). Because the main motivation for CML was the use of concurrency as a programming idiom, and not parallelism, the lack of multiprocessor support was not a major concern. More recently, there has been an interest in parallel implementations of the
CML primitives. Reppy and Xiao developed an optimistic concurrency algorithm for the asymmetric subset of CML (i.e., CML without output guards) []. This algorithm was later extended to include support for the full range of CML primitives and has been implemented both as part of the Manticore system [] and as a library in C# [].
In addition to the SML/NJ implementation, the CML design has been ported to a number of other systems and languages, including other implementations of SML, other dialects of ML, other functional languages, such as Haskell and Scheme, and other high-level languages, such as Java.

● MLton is a whole-program optimizing compiler for Standard ML, which has its own implementation of the CML primitives []. Like the SML/NJ version, it multiplexes CML threads onto a single system thread and supports preemptive scheduling. Its implementation of threads is stack based, so they are somewhat heavier than the SML/NJ version and are not garbage collected. This implementation is being used by Jagannathan and his students to explore concurrent language design [].
● One of the earliest re-implementations of CML was in the OCaml system, which is another dialect of ML [].
● There have been two implementations of CML’s primitives in Haskell [, ]; both are implemented using the primitives of Concurrent Haskell []. These implementations support multiprocessors, since the underlying Concurrent Haskell implementation runs on multiprocessors.
● The PLT-Scheme system implements CML-style concurrency with an additional mechanism, called kill-safe abstractions, to support safe asynchronous termination of threads []. The Scheme  system also supports CML-style concurrency [].
● Outside of the world of functional languages, the CML primitives have been reimplemented in Java [] and C# [].

Derivatives
CML has also served as the basis for further research into language mechanisms for concurrency. One of the first examples is Krumvieda’s Distributed ML language [, ], which combined the Isis model of atomic multicast [] with CML’s events.
More recently, researchers have been exploring the marriage of software transactions with CML’s events. Donnelly and Fluet generalized the idea of events to allow protocols that involve multiple synchronization points []. They defined a new combinator

    val thenEvt : ’a event * (’a -> ’b event)
                  -> ’b event

with the semantics that the expression thenEvt ev f defines an event value, which synchronizes on ev, passes the result to f, and then synchronizes on the result of f. The key feature of this combinator is that the synchronization is transactional; i.e., either all the synchronizations occur or none. Donnelly and Fluet’s design was implemented in Haskell, but it has been adapted to ML by Effinger-Dean, Kehrt, and Grossman []. A related idea is the notion of stabilizers, which are a checkpointing mechanism for CML programs [].
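As an illustration of the combinator (using the transactional-events interface just described, not CML proper), a request–reply exchange can be packaged so that the send and the matching receive commit as a single all-or-nothing synchronization; reqCh, replyCh, and transactionalRpc are hypothetical names introduced for this sketch.

    (* Illustrative only: with thenEvt, the send and the subsequent receive
     * form one transactional synchronization. *)
    fun transactionalRpc (reqCh, replyCh, req) =
        thenEvt (sendEvt (reqCh, req),
                 fn () => recvEvt replyCh)

    (* sync (transactionalRpc (reqCh, replyCh, q)) performs either both
     * synchronizations or neither. *)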
Bibliography
. Appel AW, MacQueen DB () Standard ML of New Jersey. In: Programming language implementation and logic programming. Lecture notes in computer science, vol . Springer, New York, pp –
. Appel AW () Compiling with continuations. Cambridge University Press, Cambridge
. Berry D, Milner R, Turner DN () A semantics for ML concurrency primitives. In: Conference record of the th annual ACM symposium on principles of programming languages (POPL ’), Albuquerque, Jan , pp –
. Birman KP, Joseph TA () Reliable communication in the presence of failures. ACM T Comput Syst ():–
. Cardelli L, Pike R () Squeak: a language for communicating with mice. In: SIGGRAPH’, San Francisco, July , pp –
. Cardelli L () Amber. In: Combinators and functional programming languages. Lecture notes in computer science, vol . Springer, New York, pp –, July 
. Carriero N, Gelernter D () Linda in context. Commun ACM ():–
. Chaudhuri A () A Concurrent ML library in Concurrent Haskell. In: Proceedings of the th ACM SIGPLAN international conference on functional programming, Aug–Sept . ACM, New York, pp –
. Cooper R, Krumvieda C () Distributed programming with asynchronous ordered channels in Distributed ML. In: Birman KP, Renesse RV (eds) Reliable distributed computing with the Isis toolkit, pp –. IEEE Computer Society Press, Los Alamitos
. Debbabi M, Faour A, Tawbi N () Efficient type-based control-flow analysis of higher order concurrent programs. In: Proceedings of the international workshop on functional and logic programming, IFL’, Sept . Lecture notes in computer science, vol . Springer, New York, pp –
. Demaine ED () Higher-order concurrency in Java. In: Proceedings of the parallel programming and Java conference (WoTUG), Enschede, Apr , pp –. Available from http://theory.csail.mit.edu/~edemaine/papers/WoTUG/
. United States Department of Defense, American National Standards Institute () Reference manual for the Ada programming language. Springer, New York
. Donnelly K, Fluet M () Transactional events. J Funct Program (–):–
. Effinger-Dean L, Kehrt M, Grossman D () Transactional events for ML. In: Proceedings of the th ACM SIGPLAN international conference on functional programming, Sept . ACM, New York, pp –
. Flatt M, Findler RB () Kill-safe synchronization abstractions. In: Proceedings of the SIGPLAN conference on programming language design and implementation (PLDI’), June . ACM, New York, pp –
. Fluet M, Rainey M, Reppy J, Shaw A, Xiao Y () Manticore: a heterogeneous parallel language. In: Proceedings of the ACM SIGPLAN workshop on declarative aspects of multicore programming, Jan . ACM, New York, pp –
. Fournet C, Gonthier G () The reflexive CHAM and the join-calculus. In: Conference record of the rd annual ACM symposium on principles of programming languages (POPL’), Jan . ACM, New York, pp –
. Gansner ER, Reppy JH () A multi-threaded higher-order user interface toolkit. Software trends, vol . Wiley, New York, pp –
. Halstead Jr RH () Multilisp: a language for concurrent symbolic computation. ACM T Program Lang Syst ():–
. Hoare CAR () Communicating sequential processes. Commun ACM ():–
. Jagannathan S, Weeks S () Analyzing stores and references in a parallel symbolic language. In: Conference record of the  ACM conference on lisp and functional programming, June . ACM, New York, pp –
. Kelsey R, Rees J, Sperber M () The incomplete scheme  reference manual. Available from www.s.org
. Krumvieda CD () Distributed ML: abstractions for efficient and fault-tolerant programming. Ph.D. thesis, Department of Computer Science, Cornell University, Ithaca, Aug . Available as Technical Report TR -
. Leroy X () The Objective Caml system (release .), Apr . Available from http://caml.inria.fr
. Liskov B, Shrira L () Promises: linguistic support for efficient asynchronous procedure calls in distributed systems. In: Proceedings of the SIGPLAN’ conference on programming language design and implementation, Atlanta, June , pp –
. MLton: Concurrent ML. http://mlton.org/ConcurrentML
. Nielson HR, Nielson F () Higher-order concurrent programs with finite communication topology. In: Conference record of the st annual ACM symposium on principles of programming languages (POPL ’), Portland, Jan , pp –
. Panangaden P, Reppy JH () The essence of concurrent ML. In: Nielson F (ed) ML with concurrency. Springer, New York (Chap )
. Peyton Jones S, Gordon A, Finne S () Concurrent Haskell. In: Conference record of the rd annual ACM symposium on principles of programming languages (POPL’), Jan . ACM, New York, pp –
. Pike R () A concurrent window system. Comput Syst ():–
. Reppy J, Russo C, Xiao Y () Parallel concurrent ML. In: Proceedings of the th ACM SIGPLAN international conference on functional programming, Aug–Sept . ACM, New York, pp –
. Reppy J, Xiao Y () Specialization of CML message-passing primitives. In: Conference record of the th annual ACM symposium on principles of programming languages (POPL ’), Jan . ACM, New York, pp –
. Reppy J, Xiao Y () Toward a parallel implementation of concurrent ML. In: Proceedings of the ACM SIGPLAN workshop on declarative aspects of multicore programming, Jan . ACM, New York
. Reppy JH () Synchronous operations as first-class values. In: Proceedings of the SIGPLAN’ conference on programming language design and implementation, Atlanta, June , pp –
. Reppy JH () CML: a higher-order concurrent language. In: Proceedings of the SIGPLAN’ conference on programming language design and implementation, June . ACM, New York, pp –
. Reppy JH () An operational semantics of first-class synchronous operations. Technical Report TR -, Department of Computer Science, Cornell University, Ithaca, Aug 
. Reppy JH () Concurrent programming in ML. Cambridge University Press, Cambridge
. Reppy JH, Gansner ER () A foundation for programming environments. In: Proceedings of the ACM SIGSOFT/SIGPLAN software engineering symposium on practical software development environments, Palo Alto, pp –, Dec 
. Russell G () Events in Haskell, and how to implement them. In: Proceedings of the sixth ACM SIGPLAN international conference on functional programming, Florence, Sept , pp –
. Young C, Lakshman YN, Szymanski T, Reppy J, Pike R, Narlikar G, Mullender S, Grosse E () Protium, an infrastructure for partitioned applications. In: Proceedings of the eighth IEEE workshop on hot topics in operating systems (HotOS), Elmau, Jan , pp –
. Ziarek L, Schatz P, Jagannathan S () Stabilizers: a modular checkpointing abstraction for concurrent functional programs. In: Proceedings of the eleventh ACM SIGPLAN international conference on functional programming, Sept . ACM, New York, pp –
Concurrent Prolog
Logic Languages

Configurable, Highly Parallel Computer
Blue CHiP

Congestion Control
Congestion Management

Congestion Management
Pedro J. Garcia
Universidad de Castilla-La Mancha, Albacete, Spain

Synonyms
Congestion control

Definition
Congestion appears in an interconnection network when intense traffic clogs any number of internal network paths, thus slowing down traffic flow. Congestion management refers to any strategy focused on avoiding, reducing, or eliminating network congestion and/or its negative impact on network performance.

Discussion
Introduction
Congestion is a phenomenon that may dramatically degrade network performance. Congestion situations may actually appear in both computer communication networks (like the Internet) and interconnection networks of parallel computing systems. In the former, packet dropping is allowed (“lossy” networks), thus congestion is not a critical problem as congested packets can be discarded. Nevertheless, congestion should be avoided because dropped packets have consumed bandwidth, thus wasting it, and packet retransmissions significantly increase individual packet latency. In the latter, on the contrary, packets usually cannot be discarded (“lossless” networks) due to the prohibitive delay in detection mechanisms and retransmissions for current parallel applications. Moreover, the designers of interconnection networks for modern parallel systems try to use as few network components as possible, due to their high cost and power consumption, but this increases link utilization, thus making congestion situations more likely to happen. Even when interconnection networks are overdimensioned, the actions taken by power management mechanisms tend to bring the network close to saturation, with identical consequences.
Therefore, in modern, high-performance parallel computing systems, thorough strategies should be used in order to solve the problems related to congestion situations that may arise in the network interconnecting processing or storage nodes. In fact, congestion management is one of the most important issues that high-performance interconnect designers face nowadays, always trying to guarantee a certain network performance level, even in congestion situations. The following sections offer an overview of the congestion phenomenon, its effects on network performance, and the most common approaches to deal with congestion in parallel computing systems.

Congestion in Interconnection Networks
As mentioned above, congestion may appear under conditions of intense traffic in the network, when some of its internal paths, or sections of paths, become jammed. However, this idea is just the most basic definition of that phenomenon, while any study of the main congestion management strategies requires a deeper, previous analysis of both the congestion formation process and its consequences.

Congestion Basics
The immediate cause of congestion situations in interconnection networks is contention, which happens when several flows of packets crossing the network simultaneously request access to some network resource (typically, a switch output port). If the internal
speedup of switches (the maximum speed at which packets can be forwarded from input to output ports, relative to link speed) is not enough for granting several requests at the same time, the access to the requested output port will be granted to only one packet, while the rest must wait. Figure a shows a contention situation caused by two incoming flows requesting the same output port at Switch C. In this example, it is assumed that switch speedup is . Note that the example (and the following ones) shows switches with buffers (queues) at both input and output ports, but it could be easily adapted to switches with other queue schemes. Note also that the requested output port may be connected to an end node or to another switch.

[Figure: three panels showing flows through Switches A, B, and C — (a) contention situation, (b) congestion situation, (c) congestion tree with its root, branches, and leaves]
Congestion Management. Fig.  Contention and congestion in switches of an interconnection network
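The arbitration decision just described can be mirrored in a toy model (not from the entry): given the packets contending for one output port and the switch speedup, at most speedup packets are forwarded in one cycle and the rest must wait. The function name and list-based representation are assumptions of this sketch.

    (* Toy model of output-port contention for one arbitration cycle:
     * returns (granted, blocked). *)
    fun arbitrate (speedup : int) (packets : 'a list) : 'a list * 'a list =
        let val n = Int.min (speedup, length packets)
        in (List.take (packets, n), List.drop (packets, n)) end

    (* arbitrate 1 [p1, p2]  ==>  ([p1], [p2])  — p2 must wait, as in Fig. a *)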

When contention persists over time, congestion appears. In these situations, the buffers containing the blocked packets will be filled and the flow control, in lossless networks, will prevent other switches (or end nodes) from sending packets to the congested ports. Although flow control is essential to avoid discarding packets, it will rapidly propagate congestion to other switches, as packets stored at some of their ports will also be blocked. Figure b shows a congestion situation whose origin is the contention situation shown in Fig. a. In this example, it is assumed that the network is a lossless one (so, packets are not discarded when buffers are full), and also that the sources of the contending flows continuously inject packets into the network. These flows contributing to congestion are usually referred to as “congested flows.” Note also that packet-level flow control is assumed, but a similar situation could be easily imagined for networks using flow control at other levels (flit-level, for instance).
In this way, congestion may progressively spread through the network, even reaching the sources of congested flow packets. The whole of the network resources affected by the spreading of congestion is commonly known as a “congestion tree.” In a congestion tree, the “root” is the point where the congested flows finally meet, the “branches” are series of consecutive congested points along any path followed by congested flows, and the “leaves” are the points at the extreme of each branch. Figure c shows the final congestion tree that is formed from the congestion situation shown in the previous examples.
Note that the congestion tree shown in Fig. c is a very simple example, since congestion trees may exhibit very complex dynamics. In fact, congestion trees may actually evolve in very different ways, depending on traffic patterns, routing schemes, and switch architecture, as shown in []. For example, a congestion tree may grow from leaves to root and vice versa. Also, several congestion trees may grow independently and later merge, and it is even possible that some trees completely overlap while being independent.

Congestion Impact on Network Performance
It is a well-known fact that the presence of congestion trees in a network may dramatically degrade network performance. Specifically, it has been exhaustively reported that congestion leads to a decrease in network throughput and to an increase in packet latency. However, if the simple congestion tree described in the previous example is considered, it is not obvious that it could produce these effects in the network. Note that, in fact, the congested output port in the example could forward packets at maximum rate, and congested packets advance at the maximum possible speed for that specific traffic load. Thus, if the presence of congestion trees does not directly imply any negative effect, what is the actual cause of network performance degradation under congestion situations?
The answer to the previous question is that congested flows are actually dangerous when they share network resources (buffers, links) with other, non-congested flows. In these cases, the final effect is non-congested flows advancing at the same speed as congested ones, thereby increasing the latency of non-congested packets and decreasing overall network throughput.
Figure  shows an example of this “real” negative effect of congestion trees. In this example, it is assumed that four sending end nodes (sources s, s, s, and s) inject packets at the maximum speed allowed by the network at any time, while there exist two receiving end nodes (destinations d and d). As can be seen, three sources (s, s, and s) inject packets addressed to the same destination (d), forming a congestion tree whose root is at the output port leading to d. Although there exists another flow in the network (from s to d), it is a non-congested flow, since packets belonging to this flow do not request to cross the root of the tree. Since the three congested flows that create the tree meet at Switch , they must share the bandwidth of the link that connects the root point to d. Therefore, assuming a fair arbitration from the switch scheduler, each congested flow will cross Switch  at a speed equal to one third (%) of the maximum link rate. Moreover, due to the spreading of congestion by means of flow control, all the packets along the congestion tree branches will advance at the same speed, and finally even the sources of these packets will be forced to slow injection down to this rate. The figure indicates link utilization percentages when the network reaches this “stable” state.
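The stable-state percentages in the figure follow from a simple fair-share computation; the restatement below is a sketch for this example only, and the symbol C for the link capacity is notation introduced here, not the entry's.

    % Three congested flows share the root link fairly, and the non-congested
    % flow is dragged down to the same per-packet rate along the shared branch:
    \[
      r_{s1} = r_{s2} = r_{s3} = \tfrac{C}{3} \;(\approx 33\,\%), \qquad
      U_{d_1} = \tfrac{3\cdot C/3}{C} = 100\,\%, \qquad
      U_{d_2} = \tfrac{C/3}{C} \approx 33\,\%
    \]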
Note that the utilization of the link that connects the hot-spot destination (d) to the network is maximum (%), so it is not possible to improve network throughput at this point. In fact, network throughput would be the maximum if only congested flows existed. As mentioned above, this means that the mere existence of the congestion tree does not justify by itself the network throughput degradation observed in congested networks. On the contrary, note that the link leading to d is used only at % of the maximum link rate. However, there is no contention for this link, since only one flow requests to access it. So, why is this link not used at full rate? This is because the non-congested flow requesting this link follows a path partially affected by the congestion tree, in such a way that packets belonging to the non-congested flow share a link and a queue with congested flow packets. Due to this sharing, non-congested flow packets advance at the same speed as the congested flow ones. Note that, without this interaction between congested and non-congested packets, the link leading to d would be used at maximum link rate.

[Figure: network of Switches 1–8 carrying traffic s1→d1, s2→d1, s3→d1 (congested flows) and s4→d2 (non-congested flow); link utilizations of 33%, 66%, and 100% are indicated, with sources shown as 33% sending or 33% blocked]
Congestion Management. Fig.  Congestion tree causing Head-Of-Line (HOL) blocking to a non-congested flow

Therefore, the cause of network performance degradation associated with congestion is that packets belonging to congested flows may prevent other flows from advancing at the injection rate, even if they belong to non-congested flows. Actually, this effect is a generalization of the phenomenon known as Head-Of-Line (HOL) blocking, which, in general, happens whenever a packet ahead of a First In, First Out (FIFO) queue blocks, preventing the rest of packets in the same queue from advancing. In fact, congested packets may produce HOL blocking to non-congested ones in any network point
where both types of packets share network resources. Of course, the wider the congestion tree, the greater the probability of HOL blocking caused by congested packets.
In conclusion, the performance degradation (throughput decreases, latency increases) that networks suffer in the presence of congestion trees is due to the HOL blocking that congested flows cause to non-congested ones when they share network resources. As explained in the following, some congestion management strategies are based on that principle, while others deal with congestion without explicitly considering HOL blocking.

Congestion Management Strategies
As mentioned in the introduction, congestion management is currently an essential issue in the design of the interconnection network of modern parallel computing systems. In fact, it has been a popular research topic for many years, and many approaches to solving the problems related to congestion exist. The most relevant of these approaches are analyzed below. As several specific strategies that differ in some aspects or details may share the same basic approach, different strategies may appear in the following grouped into the same category, although independently described when necessary. Note, however, that taxonomies of congestion management techniques different from the one shown here are perfectly possible. Note also that some of the strategies described below are currently obsolete or unfeasible in modern interconnect technologies, but they have been considered in order to offer a wide overview of both past and present strategies.

Packet Dropping
This approach, as it has already been explained, consists in discarding packets when congestion occurs. Ideally, the discarded packets should be those actually contributing to congestion, i.e., packets belonging to congested flows. Obviously, if congested packets are discarded, congestion trees would disappear or even would never form, depending on the criteria for detecting congestion. Thus, theoretically, this approach would straightforwardly solve the problems congestion may produce.
However, packet dropping presents some drawbacks that make it not suitable for current high-performance interconnection networks. First, it is quite difficult to implement this strategy without discarding packets belonging to non-congested flows. For instance, in a queue shared by congested and non-congested packets, instead of just emptying the buffer at once, it would be necessary to perform some kind of “selective discarding” in the usual FIFO queues, thus introducing significant complexity and probably an additional delay. Second, and more important, for most parallel applications packets that are discarded must be retransmitted, thus increasing final packet latency. Note that in these cases non-congested packets may experience a delay similar to (or even worse than) the one they suffer in the presence of congestion trees; thus, this approach may produce the same effect as the phenomenon it tries to avoid!
Summing up, the congestion detection and packet dropping processes, and the associated retransmissions, may lead to some packets experiencing a very high latency, which may have a direct impact on application execution time in a parallel computer. In particular, parallel applications with Quality of Service (QoS) requirements would be very sensitive to this effect. For that reason, this congestion management approach is invalid for the interconnection networks of current parallel computing systems. In fact, modern high-performance interconnect technologies do not allow packet dropping (they are “lossless” networks). However, packet dropping is allowed in other environments (for instance, computer networks like the Internet) where latency requirements are not strict at all.

Network Overdimensioning
Overdimensioning the network consists in using many more network components than the minimum required for connecting all the system end nodes, with the aim of making the network operate well below the saturation point, thus keeping link utilization low and reducing congestion probability. Obviously, a network offering much more bandwidth than that required by the applications is unlikely to suffer congestion situations. This was a valid approach for parallel systems some time ago, when network utilization in parallel machines and clusters was low. However, overdimensioned networks are becoming inappropriate for current parallel systems due to cost and power consumption constraints. On the one hand, current interconnect
components are very expensive when compared to processors. Thus, the network represents a high percentage of the total system cost. On the other hand, as VLSI technology advances and link speed increases, interconnects are consuming an increasing fraction of the total system power. Moreover, current high-speed links require continuous pulse transmission in order to keep both ends synchronized, even when no data are being transmitted. As a consequence, link power consumption is almost independent of link utilization. Note that dynamic frequency/voltage scaling techniques could be used in order to reduce power consumption, but unfortunately the efficiency of these proposals is not completely satisfactory due to their slow response against traffic variations and the suboptimal frequency/voltage settings during transitions. Anyway, even if these techniques were more efficient, they would not solve the cost problem.
Therefore, a trend in current network design is to reduce the number of network components with the aim of reducing both system cost and power consumption. Obviously, if the system computational power must be maintained, then the only ways to reduce the number of network components are to increase the number of end nodes attached to each switch (note that most current interconnect technologies allow that) or to use a more suitable network topology. However, any of these solutions leads to a higher level of link utilization, thereby driving the network to work closer to its saturation point, and thus increasing congestion probability.
Summing up, overdimensioning the network is practically unfeasible in current parallel systems, which in fact are prone to suffer congestion situations due to the low number of components used to build their interconnection networks, thus requiring specific, efficient congestion management mechanisms.

Avoidance-Based Strategies
This category includes all the techniques based on planning the use of network resources, in order to find a schedule which guarantees that congestion never appears. Following this schedule, the network resources required by each data transmission are reserved before starting it, thus avoiding congestion situations. Some of these techniques are software-based, while others are hardware-oriented.
Note that these strategies need to know in advance both the resource requirements of each transmission and the occupancy of network resources, in order to obtain the optimal resource reservation schedule. However, this knowledge is not always available. Besides, resource reservation incurs significant overhead. Thus, this kind of technique is not suitable for general-purpose congestion management. In fact, these strategies are strongly related to QoS provision, and most of them have been proposed in that sense, thereby only implicitly addressing congestion. An example of these strategies is the one proposed in [].

Prevention-Based Strategies
Prevention-based strategies control the traffic in such a way that congestion trees should not happen. In general, decisions are made “on the fly,” based on limiting or modifying routes (or memory accesses) with the aim of preventing the appearance of congestion situations, as proposed in [].
Like avoidance-based strategies, prevention-based techniques need a knowledge of network status in order to obtain the optimal traffic schedule. Consequently, their drawbacks are basically the same as the aforementioned ones for the former category. Analogously, these techniques can also be software-based or hardware-oriented.
Note that the differences between avoidance-based and prevention-based strategies are quite subtle, and, in fact, in some taxonomies both types are grouped into the same category. In this case, all these strategies are usually referred to as “proactive” strategies, because they all try to solve congestion before it appears.

Reactive Techniques
Reactive (or detection-based) techniques are based on detecting the appearance of congestion, in order to activate some control mechanism that should eliminate any possible congestion tree in the network. Note that in this case, in contrast with the two previous approaches, congestion situations are actually allowed to happen, assuming that the control mechanism will finally solve the problem.
This basic approach has been followed by many proposals, which mainly differ in two aspects: the congestion detection criteria and the mechanism in
charge of eliminating congestion. Regarding congestion detection, some feedback information is usually required. For instance, some proposals locally measure the occupancy of buffers in the switches, while other techniques (especially in multiprocessor systems) monitor the amount of memory access requests.
Regarding congestion elimination mechanisms, most of them involve notifying the end nodes of the congestion so that they throttle the injection of packets or they cease (or reduce) memory access requests. Depending on which end nodes receive congestion notifications, reactive techniques can be divided into three subcategories:

1. Congestion notifications are broadcast to all the end nodes.
2. Notifications are sent only to end nodes contributing to congestion.
3. Notifications are sent only to end nodes attached to the switch where congestion is detected.

It is obvious that if the first (broadcast) notification policy is implemented, a large fraction of the offered network bandwidth is consumed for sending notifications, thus inefficiently using network resources. Another serious drawback, common to the first and second notification policies, is the lack of scalability, as in these cases reaction time is directly proportional to the distance from the point where congestion is detected to the traffic sources. Therefore, if network size and/or link bandwidth increase, the effective delay between congestion detection and reaction increases linearly. Note that during this time, a high number of packets contributing to congestion could be injected, in such a way that, by the time congestion notifications reach the sources, big congestion trees may have already been formed. All that leads to slow response and the typical oscillations that arise in closed-loop control systems with delays in the feedback loop; thus the first two notification policies may not be appropriate for many networks, especially for those with long round trip times (RTTs). The third notification policy (used, for instance, in []), however, does not present the described scalability problems, but, unfortunately, it fails if the sources contributing to congestion are not attached to the switch where congestion is detected. As this is perfectly possible, the efficacy of the third policy would vary depending on congestion trees’ shape (which, as mentioned above, depends on several factors).

HOL Blocking Elimination Techniques
While the aforementioned strategies try to eliminate, reduce, or delay congestion itself, other techniques focus on eliminating the actual cause of performance degradation under congestion situations: HOL blocking. In fact, note that, if the HOL blocking produced by congested packets is eliminated, congestion becomes harmless. Most of the existing HOL blocking elimination proposals are based on having different queues at each switch port, in order to separately store packets belonging to different flows, thus lowering HOL blocking probability. Note that congestion trees would not be eliminated but isolated as much as possible, in order to avoid interaction with non-congested flows. This is the common basic approach followed by several techniques which, on the other hand, differ in many aspects, basically in the number of queues required, in the policy for mapping packets to queues, and in queue management.
For instance, a well-known HOL blocking elimination technique is Virtual Output Queues (VOQs) [], which requires at each port as many queues as destinations in the network. This scheme ensures that, at each port, all the packets addressed to a specific destination are exclusively stored in the queue assigned to that destination, and never share that queue with packets addressed to other destinations, thus completely eliminating HOL blocking. However, as the number of queues per port grows with the number of destinations, this scheme does not scale with network size, becoming unfeasible in medium or large networks (note that each queue requires a minimum silicon area to be implemented).
In order to overcome this problem, a variation of this scheme uses as many queues at each port as output ports in the switch [], and each incoming packet is stored in the queue assigned to its output port. This scheme is usually referred to as VOQ at switch level (VOQsw), in contrast to the former scheme, which may be named VOQ at network level (VOQnet). VOQsw scales with network size, and it eliminates HOL blocking in a switch if it is directly caused by packets contending for output ports in the same switch. On the contrary, in switches affected by congestion
“spreading” from other switches, VOQsw cannot guarantee that congested packets do not share queues with non-congested packets; thus VOQsw eliminates HOL blocking only partially. The Dynamically Allocated Multi-Queues (DAMQs) technique [] uses the same queue scheme, although in this case the size of queues may dynamically vary when required. Virtual Channels [] may also reduce HOL blocking at the switch level but do not eliminate it.
Another scalable solution is the Destination-Based Buffer Management (DBBM) strategy []. This proposal also uses a reduced set of queues at each port, but in this case packets are assigned to queues according to a modulo-mapping function: a packet with destination D is stored in the queue whose number is D mod N, where N is the number of queues in a port. As a result, a set of nonconsecutive destinations is assigned to each queue, thus packets addressed to a congested destination produce HOL blocking only to packets addressed to destinations in the same set. Dynamic Switch Buffer Management (DSBM) also assigns sets of destinations to queues, but in this case it assigns them depending on queue occupancy. Note that these solutions, as VOQsw, eliminate HOL blocking just partially, as congested packets are allowed to share queues with non-congested ones.
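The static mapping policies just described differ only in the function that picks a queue for an incoming packet. The following sketch contrasts them; the function names and integer encodings of destinations and output ports are illustrative assumptions, not taken from the cited proposals.

    (* Illustrative queue-selection functions for the static mappings above. *)

    (* VOQnet: one queue per destination in the network *)
    fun voqNetQueue (dest : int) = dest

    (* VOQsw: one queue per output port of the switch *)
    fun voqSwQueue (outPort : int) = outPort

    (* DBBM: modulo mapping onto a reduced set of N queues per port *)
    fun dbbmQueue (dest : int, numQueues : int) = dest mod numQueues

    (* e.g., with numQueues = 4, destinations 3, 7, 11, ... all map to queue 3,
     * so a hot spot at destination 7 can block only that subset of traffic. *)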
In general, for all the aforementioned HOL blocking elimination techniques, the greater the number of queues per port, the more HOL blocking is eliminated. This is because congested packets are not explicitly identified, and consequently the probability of isolating them grows with the number of queues. Moreover, note that packets are assigned to queues following “static” criteria, independently of network status.
By contrast, other solutions detect congestion in order to explicitly identify congested packets, later isolating them in dynamically allocated queues. Note that these queues would only store packets belonging to congested flows (i.e., those packets that could produce HOL blocking), and if these flows vanish, the corresponding queues could be deallocated and later reallocated for new congested flows. In this way, these techniques are able to eliminate HOL blocking without relying on an unaffordable number of queues, thereby achieving scalability. In that sense, the solution proposed in [] for the ATLAS I assigns queues to isolate packets addressed to congested end nodes. However, in many cases congestion arises inside the network, thus in these cases this solution would not accurately detect congestion and it would not efficiently eliminate HOL blocking. On the contrary, other proposals dynamically assign queues to isolate packets addressed to congested points, either inside the network or at end nodes. This is the case for the Regional Explicit Congestion Notification (RECN) [] and the Flow-Based Implicit Congestion Management (FBICM) [] techniques, which probably achieve the most effective HOL blocking elimination that can be obtained with a reduced set of queues. RECN has been proposed for technologies using source routing, while FBICM targets networks with deterministic distributed routing. Note that both strategies require control memories at ports in order to keep track of congestion information and to manage the dynamic queues (specifically, CAM memories are used). Additionally, special control messages exchanging congestion information between switch ports are necessary. Summing up, these two strategies are very effective and scalable, but their implementation is not simple.

Other Approaches
Other strategies that may help to alleviate congestion or delay its appearance are the use of fully adaptive routing [] or load balancing techniques []. In order to be actually effective against congestion, these strategies should take into account network status for making routing decisions, as the technique proposed for Networks-on-Chip (NoCs) in [] does. Note, however, that these techniques delay the appearance of congestion, but cannot avoid network performance degradation if congestion is reached (especially under heavy traffic loads). Moreover, adaptive routing may cause out-of-order packet delivery, which is unacceptable for some parallel applications.

Future Directions
Despite many proposals for solving the problems related to congestion in interconnection networks, congestion management can still be considered an open issue. On the one hand, some solutions cannot guarantee that network performance will not suffer a significant degradation in congestion situations. On the other hand, other solutions minimize the impact of congestion but are difficult to implement because they either consume too many network resources, or require expensive additional resources, or increase switch complexity. As concerns about network components’ cost and
power consumption are not likely to disappear, cost-effective approaches will probably be the most popular ones in the future.
In that sense, as mentioned above, the strategies currently offering the best relationship among efficacy, feasibility, and scalability are probably the ones based on eliminating HOL blocking, especially those that dynamically assign queues to isolate congested packets. Therefore, it makes sense that future proposals about congestion management should try, at least, to improve the efficacy and/or cost-effectiveness of such strategies, although this does not necessarily imply they should follow the same approach to solve the problems related to congestion. In particular, in some emerging technologies like NoCs, where silicon area and power consumption constraints are quite strong, most existing solutions are not suitable at all, thus new approaches should be proposed to deal with congestion in these environments.

Related Entries
Interconnection Networks
Routing (Including Deadlock Avoidance)
Switch Architecture
Switching Techniques

Bibliographic Notes and Further Reading
As can be drawn from the number and diversity of the aforementioned strategies, there is a huge research body in the congestion management area (although, as mentioned above, not all the strategies are suitable for the interconnection networks of modern parallel systems). Moreover, new proposals focused on or related to congestion are not currently infrequent in journals and conferences. Of course, this is a consequence of both the serious impact of congestion situations on network performance and the aforementioned difficulties of achieving truly efficient congestion management. Taking all that into account, most of the references included below in the bibliography are just a small selection from a vast number of works proposing strategies to solve the problems related to congestion. Note, however, that there exist few studies, like the one shown in [], devoted to the complex dynamics of the congestion phenomenon in interconnection networks.

Bibliography
. Anderson T, Owicki S, Saxe J, Thacker C () High-speed switch scheduling for local-area networks. ACM Trans Comput Syst ():–
. Baydal E, Lopez P () A robust mechanism for congestion control: INC. In: Proceedings of the th international Euro-Par conference, Klagenfurt, 
. Dally WJ, Carvey P, Dennison L () The Avici terabit switch/router. In: Proceedings of the Hot Interconnects , Stanford, 
. Dally WJ () Virtual-channel flow control. IEEE Trans Parallel Distrib Syst ():–
. Duato J () A new theory of deadlock-free adaptive routing in wormhole networks. IEEE Trans Parallel Distrib Syst ():–
. Duato J, Johnson I, Flich J, Naven F, Garcia PJ, Nachiondo T () A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks. In: Proceedings of the th international symposium on high-performance computer architecture (HPCA), San Francisco, 
. Escudero-Sahuquillo J, Garcia PJ, Quiles FJ, Flich J, Duato J () FBICM: efficient congestion management for high-performance networks using distributed deterministic routing. In: Proceedings of the th international conference of high performance computing, Bangalore, 
. Franco D, Garces I, Luque E () A new method to make communication latency uniform: distributed routing balancing. In: Proceedings of the ACM international conference on supercomputing, Rhodes, 
. Garcia PJ, Flich J, Duato J, Johnson I, Quiles FJ, Naven F () Dynamic evolution of congestion trees: analysis and impact on switch architecture. Lect Notes Comput Sci (HiPEAC-), :–
. Gratz P, Grot B, Keckler SW () Regional congestion awareness for load balance in networks-on-chip. In: Proceedings of the th international conference on high-performance computer architecture (HPCA-), Salt Lake City, 
. Ho WS, Eager DL () A novel strategy for controlling hot spot contention. In: Proceedings of the international conference on parallel processing I, pp –, St. Charles, 
. Katevenis M, Serpanos D, Spyridakis E () Credit-flow-controlled ATM for MP interconnection: the ATLAS I single-chip ATM switch. In: Proceedings of the th international symposium on high-performance computer architecture, Las Vegas, 
. Nachiondo T, Flich J, Duato J () Efficient reduction of HOL blocking in multistage networks. In: Proceedings of the th international parallel and distributed processing symposium (IPDPS ), Denver, 
. Tamir Y, Frazier GL () Dynamically-allocated multi-queue buffers for VLSI communication switches. IEEE Trans Comput ():–
. Yew P, Tzeng N, Lawrie DH () Distributing hot-spot addressing in large-scale multiprocessors. IEEE Trans Comput ():–
Connected Components Algorithm
Graph Algorithms

Connection Machine
Guy L. Steele Jr.
Oracle Labs, Burlington, MA, USA

Definition
The term “Connection Machine” refers to massively parallel supercomputers manufactured by Thinking Machines Corporation. The CM- () was a purely SIMD architecture intended for artificial intelligence (AI) applications; , single-bit processors were connected by a hypercube network. The CM- () added -bit-wide floating-point coprocessors and set performance records for numerical applications. The CM- () was MIMD, using Sun Microsystems SPARC processors connected by a fat-tree network to drive proprietary SIMD floating-point coprocessors. The first TOP Supercomputer List (June ) showed that the four fastest supercomputers were CM- systems, and  of the top  were either CM- or CM- systems.

Discussion
Origins
Influential predecessors of the initial Connection Machine architecture include the Illiac IV, the ICL Distributed Array Processor (DAP), and the Goodyear Massively Parallel Processor (MPP). Certain other early designs also merit discussion.
While the Illiac IV [, ] never met its original design goals, it was for a time the world’s fastest supercomputer and was used to solve important real-world scientific problems, and so it may be regarded as the first successful parallel supercomputer. Several designers and implementors who developed language compilers for the Illiac IV later developed similar compilers for the Connection Machine. The Illiac IV had a SIMD architecture, with a central control unit that issued instructions to many processing elements –  as originally designed, though only one “quadrant” of  processors was delivered to NASA. (The first Connection Machines had a similar SIMD architecture, though with thousands of processors rather than dozens, and were likewise divided into quadrants.) The processing elements of the Illiac IV were connected into a square two-dimensional grid such that each processing element could communicate with its four nearest neighbors.
There is an engineering trade-off between the processing element size and the total number of processing elements. Each Illiac IV processing element was a full-fledged floating-point unit. In contrast, many other early parallel machine designs were SIMD architectures with much larger numbers of processing elements, each just  bit wide, all simultaneously executing an instruction stream broadcast by a central controller.
As early as , Unger proposed a SIMD design with -bit processors having only a few bits of state apiece, for purposes of “spatial problems” such as visual pattern recognition. This design was never constructed, but was simulated on a (conventionally sequential) IBM  computer []. The network connecting the processors was a two-dimensional, eight-nearest-neighbors grid, but for some purposes interprocessor links could be “enabled” or “disabled” and then data could be transferred between distant processors in one step through chains of enabled links. Unger (correctly) envisioned the need for such a machine to have many hundreds of processors in order to be useful, and commented “this means thousands of memory elements and tens of thousands of gate inputs in the matrix alone. These are alarming figures … However, progress in the components field is such that it is reasonable to hope that within a few years there may be available manufacturing processes whereby entire blocks of logical circuitry could be constructed in one unit.… While the author has specifically discussed a computer based on a rectangular matrix of modules using one-bit registers, it might be worthwhile to consider other arrangements. Possible variations include matrices in three or more dimensions, registers enlarged to handle multi-bit words, and polar co-ordinate arrays.”
The SOLOMON computer [, ] was designed to connect up to , processors into a  ×  two-dimensional, four-nearest-neighbors grid, though smaller sizes such as  ×  or even  ×  were envisioned (and apparently the only hardware actually
constructed was a demonstration system much smaller than this).
The Illiac III [] was designed for visual pattern recognition and had multiple independent computational units, one of which was a Pattern Articulation Unit containing , SIMD processors arranged as a  ×  two-dimensional, eight-nearest-neighbors grid.
The Staran [] had multiple array modules of  processors each and a rather complex communication network [] capable of permuting  bits in many ways.
The ICL DAP [] had , processors connected into a  ×  two-dimensional, four-nearest-neighbors grid.
The Goodyear MPP [] had , processors connected into a  ×  two-dimensional, four-nearest-neighbors grid.
Supercomputer projects whose gestation was roughly contemporaneous with that of the first Connection Machine designs include the NYU Ultracomputer [], a MIMD architecture with a “fetch-and-add” network; NON-VON [, ], with thousands of -bit processors connected into a complete binary tree, intended to process massive quantities of data kept on secondary storage; and the Caltech Cosmic Cube [], with  conventional microprocessors connected by a six-dimensional hypercube network (however, the notion of connecting thousands of processors using a hypercube network goes back at least to  []).
Many of the software techniques, such as bit-serial arithmetic, developed and used on earlier massively parallel machines with -bit processors were directly applicable to the Connection Machine. The Connection Machine model CM- was distinguished from these predecessors by its inclusion of a hypercube network that provided much faster communication between distant processing elements and much greater overall network bandwidth (and by being the first massively parallel architecture to be put into commercial production, with multiple copies sold to multiple customers). The Connection Machine model CM- introduced hardware floating-point coprocessors that allowed it to be programmed in some ways more like the Illiac IV than like the Goodyear MPP. The Connection Machine model CM- represented a radical departure, with an entirely new architecture that was MIMD, used conventional microprocessors (optionally augmented with vector coprocessors), and had a network topology that was not a hypercube; in some ways it resembled the NYU Ultracomputer.

Connection Machine Model CM-
A full Connection Machine model CM- consisted of  = , processors [], each having:

● An arithmetic-logic unit (ALU) and associated latches
●  = K bits of bit-addressable local memory
● Eight -bit flag registers
● Router interface
● NEWS grid interface

Each ALU was an almost trivial piece of circuitry: a -input, -output logic element that could compute any two Boolean functions of three inputs. The functions were specified by two -bit truth tables, so in effect the ALU consisted simply of two -of- selectors controlled by the three input bits. The three inputs were any two bits from local memory and any of the eight flags; of the two outputs, one was written back to memory and one was written to a specified flag. Finally, the entire operation was conditional on yet another specified flag; if its value was , then the two results were not written after all.
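The two-truth-table ALU just described can be mirrored by a small simulation; the sketch below is my own, with assumed 0/1 bit encodings and 8-entry vectors standing in for the truth tables, and is not the machine's microarchitecture.

    (* Toy simulation of one CM-1 ALU step: each 8-entry truth table is
     * indexed by the three input bits; the whole step is gated by cond. *)
    fun alu (memTable : int vector, flagTable : int vector)
            (a : int, b : int, flagIn : int, cond : int)
            (oldMem : int, oldFlag : int) =
        let val idx = a * 4 + b * 2 + flagIn        (* 1-of-8 selection *)
            val newMem  = Vector.sub (memTable, idx)
            val newFlag = Vector.sub (flagTable, idx)
        in
          if cond = 1 then (newMem, newFlag) else (oldMem, oldFlag)
        end

    (* Example tables for a full-adder step: sum = 3-way XOR, carry = majority. *)
    val sumTable   = Vector.fromList [0,1,1,0,1,0,0,1]
    val carryTable = Vector.fromList [0,0,0,1,0,1,1,1]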
Using this ALU, all data processing was carried out bit-serially within each processor. One of the original inspirations for the Connection Machine was Scott Fahlman’s PhD work on semantic networks and computation within such networks via marker propagation. Fahlman himself had sketched an idea for a hardware implementation []. W. Daniel “Danny” Hillis refined these ideas in his own PhD work at MIT [], and in particular designed a more efficient and more easily controlled network for connecting the processors []. Each processor might then represent one node in a semantic network and contain the addresses of a moderate number of other processors; the network could then be used to propagate markers, each consisting of a single bit or a small number of bits, among the processors. The CM- ALU design was, however, adequate to implement a full adder, and so it was reasonably convenient to implement addition and subtraction in a bit-serial fashion, and then multiplication, division, and all other arithmetic operations of interest.
Consider, for example, conditional addition of two signed integers, each k bits long. The following steps
were carried out simultaneously within each hardware processor. First a bit was loaded from memory into a flag bit to control whether the results of the addition should be stored for that processor. Next a second hardware flag was cleared for use as a carry bit. Next came k iterations of an ALU operation that read one bit of each operand from memory as well as the carry bit from the hardware flag, computed the sum (a three-way exclusive OR) and carry-out (a three-input majority function), and stored the sum back into memory and the carry-out back into the hardware flag. These iterations started with the least significant bits of the operands and proceeded toward the most significant bits. The sum could replace one of the input operands or be stored into a third area of memory. The last of the k iterations stored the carry-out into a third hardware flag rather than the second one; a further computation that compared the second and third flags (the carry-outs from the last two iterations) could then determine whether integer overflow occurred.
This bit-serial strategy was not original with the among processors. While the network is often described
Connection Machine; earlier machines that performed as a -dimensional hypercube, it should be noted that
all arithmetic in bit-serial fashion include the Goodyear the processors were packaged  to a chip. The  =
MPP supercomputer [] and the Digital Equipment ,  chips were connected by a -dimensional hyper-
Corporation PDP-/S minicomputer []. The CM- cube of actual physical wires, and additional multiplex-
could perform one ALU iteration for integer addition ing was done on-chip. Routing of data among chips
in about half a microsecond; taking instruction decod- made use of some of the same hardware flags and mem-
ing and initialization overheads into account, a -bit ory paths as the ALU operations. Each processor that
integer add required about  μs. But with  pro- needed to send a message (possibly all of them) would
cessors performing such additions simultaneously, this first load  bits from memory, one at a time, represent-
produced an aggregate rate of , MIPS (i.e., . bil- ing the address of a destination processor, and then n
lion -bit integer additions per second). By way of bits of data to be sent. The router on each chip could
comparison, the earlier Cray- vector supercomputer accept up to four messages at a time from processors on
() had a peak speed of  MFLOPS when perform- that chip, and would respond to each sending processor
ing multiply-add operations on -bit floating-point with a bit saying whether its message had been accepted;
operands [], and the initial model of the Cray X-MP processors whose messages were not accepted on this
() had a peak speed of  MFLOPS per CPU, routing cycle would try again on the next routing cycle.
where a system could have up to two () or four Message bits traversed the hypercube in pipelined
() CPUs [, ]. While no one expected the bit- fashion; at each of  steps, each accepted message
serial arithmetic of the CM- to outperform Cray super- would have the opportunity to traverse one of the
computers on floating-point computation, it was clear  hypercube dimensions. (A message would need to
that the CM- could be competitive for integer com- traverse a wire along a given dimension if and only if the
putations and perhaps supremely successful for marker address of the destination processor differed in that bit
propagation algorithms. position from the address of the originating processor.)
(It is instructive, however, to do a back-of-the- After those  steps had processed the -bit destina-
envelope calculation for the speed of floating-point tion address in each message, data bits would follow
arithmetic carried out on such a bit-serial processor. For the address bits serially along the paths that had been
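The address-bit rule just described – a message must cross the dimension-d wire exactly when the source and destination addresses differ in bit position d – can be traced with a short sketch. The following Python fragment is used purely as illustration (Connection Machine programs were written in *Lisp, C*, or CM Fortran, and routing was done in hardware); it follows one message through a small hypercube and deliberately ignores wire contention, buffering, and the desperation routing discussed below.

def hypercube_path(src, dst, dims):
    # Dimension-order path of one message through a hypercube.
    # src, dst: integer node addresses in the range [0, 2**dims).
    # On step d the message crosses the dimension-d wire if and only
    # if the two addresses differ in bit d; contention is ignored.
    node, path = src, [src]
    for d in range(dims):
        if (node ^ dst) >> d & 1:      # addresses differ in bit d
            node ^= 1 << d             # traverse that dimension's wire
            path.append(node)
    assert node == dst
    return path

# Small illustrative cube: the message from node 0b0101 to node 0b0110
# crosses exactly the two wires where the addresses differ.
print(hypercube_path(0b0101, 0b0110, 4))   # [5, 4, 6]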
established through the hypercube. As data bits arrived the hypercube, called the NEWS (North/East/West/-
at a destination processor, they were then stored one South) network, that allowed a processor to pass a single
at a time. For sufficiently long messages, on each rout- data bit to any of four nearest neighbors without the
ing iteration a bit would be loaded from memory to be overhead of first loading a destination address.
sent out and a newly arrived bit (from an earlier posi- Instructions were provided to the , processors
tion in the message) would be stored back into memory; from a central source through a two-step process. The
toward the end of the process, the final steps would CM- was not a stand-alone system, but a coproces-
simply store the last arriving bits. sor that was always attached to a front-end computer
Two or more messages might compete for the same (a Symbolics  Lisp Machine). Programs running
hypercube wire, in which case one message would get within the front-end computer would issue instruc-
the use of the wire, and the others would not. These tions, one at a time, to the Connection Machine. These
others would still have the opportunity to traverse other instructions constituted an abstract architecture that
hypercube dimensions, but would not reach their desti- supported integer and floating-point arithmetic, other
nations during the current routing cycle; instead, they operations on multi-bit operands, message-passing,
would be stored into temporary buffers within the reduction operations such as finding the sum or max-
memory of whatever processors they had managed to imum of a collection of values, and most importantly
arrive at, and then forwarded by another round of rout- virtual processors, by which the memory of each hard-
ing. A more subtle point is that, at each of the  dimen- ware processor was split into m regions, and each hard-
sion steps, each router had to be prepared to accept a ware processor was time-sliced so as to simulate m
message on the incoming dimension wire. Therefore, if processors. By the time of the CM-, this instruction set
the router buffers on some chip were all full and none of had been given a name: Paris, for PARallel Instruction
those messages needed to traverse the outgoing dimen- Set. Paris instructions sent from the front-end processor
sion wire, then one message was chosen and sent out on arrived at a microcoded sequencer within the Connec-
that dimension anyway – and therefore that particular tion Machine. Two networks (over and above the hyper-
message would be unable to reach its destination in the cube and the NEWS network) connected the sequencer
current routing cycle – just to make sure that a buffer to the Connection Machine processors: A broadcast net-
would be free to accept an incoming message (this work allowed SIMD nanoinstructions to be sent from the
move was called desperation routing). Software would sequencer to the processors, and a combining network
then repeat the routing process – forwarding mes- took a -bit signal from every processor and delivered
sages that had not reached their destinations because the logical OR of these , signals to the sequencer.
of wire competition or desperation routing, and accept- Finally, the sequencer had the ability to read or write
ing messages that had not been previously accepted – any -bit word within the memory of any Connection
until all messages eventually reached their destinations Machine processor.
[, ]. A single CM- nanoinstruction specified one sub-
The router allowed any processor to establish a data cycle of an ALU or router operation. A nanoinstruction
connection with any other processor with reasonable contained a -bit operation code, a -bit flag number,
speed and efficiency – from this idea came the very a -bit memory address, and one -bit truth table for
name “Connection Machine.” Compare this to the Illiac the Connection Machine ALU. As an example, the basic
IV [, ] or the Goodyear MPP [], each of which ALU iteration for the integer add operation described
connected processors in a two-dimensional grid, each above would consist of three nanoinstructions:
processor able to communicate directly with only its
four nearest neighbors: Transferring data to a distant LOADA: read memory operand A, read flag operand,
processor could take quite a long time. On the other latch one truth table
hand, it was also clear that a two-dimensional grid could LOADB: read memory operand B, read conditional
be particularly attractive and efficient for certain appli- flag, latch other truth table
cations such as image processing. Therefore the CM- STORE: store one result bit in memory, store other
had a second network, completely independent from result bit in flag
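The three-nanoinstruction cycle just listed drives the k-iteration conditional add described earlier: a context flag decides whether results are stored, a second flag holds the running carry, the sum bit is the three-way exclusive OR, the carry-out is the three-input majority function, and overflow is detected by comparing the carry-outs of the last two iterations. A minimal sketch of those semantics, written in Python only for illustration (the machine itself was driven by microcoded Paris instructions, not programmed this way):

def conditional_add(a_bits, b_bits, context):
    # Model of one processor's bit-serial two's-complement addition.
    # a_bits, b_bits: k-bit operands as lists of 0/1, least significant
    # bit first.  context: the flag that enables storing the result.
    # Returns (sum_bits, overflow); with context 0 the operand is
    # returned unchanged, mimicking the suppressed store.
    if not context:
        return list(a_bits), False
    carry = 0                      # second hardware flag, cleared first
    prev_carry = 0                 # carry-out of the previous iteration
    result = []
    for a, b in zip(a_bits, b_bits):              # LSB toward MSB
        result.append(a ^ b ^ carry)              # three-way exclusive OR
        prev_carry, carry = carry, (a & b) | (a & carry) | (b & carry)
    # Signed overflow iff the carry-outs of the last two iterations differ.
    return result, carry != prev_carry

# 4-bit examples: 7 + 1 overflows, -1 + 1 does not.
print(conditional_add([1, 1, 1, 0], [1, 0, 0, 0], 1))   # ([0, 0, 0, 1], True)
print(conditional_add([1, 1, 1, 1], [1, 0, 0, 0], 1))   # ([0, 0, 0, 0], False)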
The nominal clock speed of the CM- was  MHz [], held  processor chips of  processors each and had
so such an ALU cycle would take about half a  red light-emitting diodes mounted on the edge oppo-
microsecond. site the backplane connectors [, p. ]. These lights
In order to allow large machines to be shared among were visible through translucent panels on the front and
multiple users and to allow a range of machine sizes to back of the cabinet; thus , lights were visible on C
be offered for sale, there was actually one sequencer for the front and , on the back. These lights could be
every , processors. A  ×  crossbar switch called used for diagnostic reporting purposes, but also served
the nexus allowed up to four front-end computers to be to give the cabinet a visual “wow factor” intended to
connected to a single CM- system, and such a system communicate to spectators that there was a lot going on
could consist of , , or  quadrants. These quadrants inside that otherwise static-looking black cabinet [].
could be dynamically assigned to front-end processors (Indeed, software was eventually written specifically to
for their use. If a front-end processor were assigned produce good-looking patterns for still photos and for
more than one quadrant, then Paris instructions issued video recording.) The cabinet could be physically split
by that front-end processor were broadcast by the nexus apart into four quadrants (each consisting of two verti-
to multiple sequencers, and in all other ways the ganged cally stacked subcubes) for shipping, and systems were
quadrants were made to behave as if under the control offered in three sizes, consisting of one, two, or four
of a single sequencer. quadrants.
The cabinetry of the CM- (which was also used Most software for the CM- was coded in *Lisp,
for the CM-) had a striking physical design: It looked which was a package of functions that extended Sym-
like a “cube of cubes” that was  inches on a side bolics Zetalisp to provide access to the Connection
and was intended to evoke the hypercube topology of Machine. If the sequencer was instructed to simulate
the router network (see Fig. ). Each of the eight sub- some number n of virtual processors, then *Lisp allowed
cubes was about  by  by  in. and contained  the user to allocate certain data structures within the
vertically oriented printed circuit boards connected by CM- that behaved like vectors of length n that could
a large square backplane. Each printed circuit board perform various elementwise operations simultane-
ously on all elements, as well as finding the largest or
smallest value, the sum of all values, permuting the
elements according to a vector of indices, and so on.
A more elaborate dialect of Lisp called Connection
Machine Lisp (CM-Lisp) was also designed and proto-
typed, but never became a full-fledged product. Both
were aimed at supporting the programming of data
parallel algorithms [].
One example of an artificial intelligence applica-
tion explored on the CM- was memory-based rea-
soning []. Examples of non-AI applications imple-
mented on the CM- include circuit simulation [],
ray-tracing [], and free-text search [].

Connection Machine Model CM-


The CM- was a fairly compatible revision of the CM-
architecture and design [, , ]. A full Connec-
tion Machine model CM- similarly consisted of  =
,  processors, each consisting of:
Connection Machine. Fig.  Connection Machine model ● An arithmetic-logic unit (ALU) and associated
CM- with DataVault and graphic display latches
●  =  K (and later  =  K) bits of bit- with floating-point accelerators would contain ,
addressable local memory floating-point coprocessor chips.
● Four (rather than eight) -bit flag registers The CM- had a separate hardware grid for its
● Router interface NEWS network; in the CM-, circuitry was added to
● NEWS grid interface the processor chips to use the hypercube wires for
● I/O interface NEWS connections. It was already well known that
● Floating-point accelerator interface a two-dimensional grid (or, more generally, a grid of
● Improved error-detection circuitry, including single- any dimension) could be embedded in a hypercube,
error correction, double-error detection (SECDED) and that all processors could be part of such a grid
for memory if the length of each axis of the grid were a power of
two. This implementation strategy reduced the hard-
The four biggest changes in the CM- hardware from ware cost of the CM- while increasing its flexibility to
the CM- were the floating-point accelerators, the efficiently support three-dimensional grids (useful for
NEWS grid implementation, in-router combining of physical simulations []) and four-dimensional grids
messages, and the I/O interface. The CM- also had a (useful for QCD calculations []). The CM- NEWS
slightly higher clock rate,  MHz []. interface also included features to assist the computa-
A CM- could be built with or without floating- tion of parallel prefix operations along any axis of a
point accelerators: For every  Connection Machine multidimensional grid.
processors (i.e., for every pair of Connection Machine In the CM- router, as in the CM- router, if two
processor chips) there was a proprietary custom inter- or more messages were competing for use of the same
face chip and an off-the-shelf floating-point coproces- hypercube wire, then one was chosen to get use of the
sor chip (the Weitek WTL, and later the Weitek wire. However, if any of the other competing messages
WTL []). The main task of the custom inter- had the same destination address, then the CM- router
face chip was to transpose  ×  bit matrices. Each could combine them with the chosen one, rather than
of the  Connection Machine processors in a group simply denying them use of the wire during that rout-
would supply a -bit operand, one bit at a time, to ing step. Messages could be combined in any of four
the floating-point interface chip; thus, in  Connection ways: bitwise OR, choosing the largest (unsigned) value,
Machine ALU cycles,  complete operands could be (unsigned) integer addition, or simply choosing one
transferred to the interface chip, with bit j of each of  arbitrarily (thereby discarding the others). (Software
operands being transferred during cycle j. The interface could also get the effect of bitwise AND or choosing
chip would then present complete -bit operands, one the smallest value by exploiting De Morgan’s laws.) The
at a time, to the floating-point coprocessor. The floating- CM- combined messages in all the same ways, but only
point coprocessor had multiple registers, so operands at the destination processor, by using its ALU as mes-
might reside within the coprocessor while multiple sages were received; in-router combining allowed the
floating-point operations were performed. Eventually CM- to provide the same functionality with greater
results would be transferred from the coprocessor, one efficiency. The CM- router also had a way to fetch
at a time, back to the interface chip, which would then messages from, rather than send messages to, other
transfer the results simultaneously to the  Connec- processors: Request messages were routed in the usual
tion Machine processors, one bit at a time to each. With manner, with a special kind of in-router combining
careful microprogramming of the Connection Machine operation, and then the router was run “backwards”
sequencer, these steps were pipelined so that operand to return data to the requesting processor – wherever
transfer was overlapped with the floating-point opera- in-router combining had occurred in the forward direc-
tions. Later floating-point coprocessor chips also sup- tion, the returned data was replicated in the backward
ported -bit floating-point operations; the process was direction (in effect, the forward operation constructed
much the same, with each operand transferred from a set of “broadcast trees,” one tree for each distinct fetch
chip to chip as two -bit portions. A full CM- system target).
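The operand staging just described is, at bottom, a square bit-matrix transpose: during ALU cycle j each processor in the group delivers bit j of its operand, and the interface chip must regroup those bit-slices into complete words, one per processor, before handing them to the floating-point coprocessor. A sketch of that regrouping (in Python, on a small square matrix; the real interface chip did this in hardware at the width of the machine's operands):

def transpose_bits(serial_cycles):
    # serial_cycles[j][p] is the bit that processor p delivered during
    # cycle j (bit j of its operand).  The result words[p][j] is the
    # complete operand of processor p -- a plain matrix transpose.
    w = len(serial_cycles)                 # assumes a square bit matrix
    return [[serial_cycles[j][p] for j in range(w)] for p in range(w)]

cycles = [[1, 0, 0, 1],    # cycle 0: bit 0 of each of four toy operands
          [0, 1, 0, 1],    # cycle 1: bit 1 of each operand
          [0, 0, 1, 1],
          [1, 1, 1, 1]]
operands = transpose_bits(cycles)
print(operands[0])         # [1, 0, 0, 1] -- all the bits of processor 0's operand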
The I/O interface was originally planned for the every processor, or the floating-point sum of one value
CM- but not implemented. In the CM-, the I/O from every processor), and reduction and parallel pre-
interface provided wide datapaths that connected fix operations along any axis of a multidimensional
directly to the Connection Machine processors, allow- array. Every virtual processor had four flags: a carry
ing high-bandwidth transfers between Connection bit, an overflow bit, a test bit (to receive the result of
Machine memory and peripheral devices. Each CM- comparison operations), and a context bit (which if 
 subcube backplane provided two I/O channels, and suppressed storing of results for most Paris operations).
each channel could accommodate either a framebuffer Paris supported multiple virtual processor sets simulta-
card or an I/O controller card. The , processors in neously within a single program; each virtual processor
each subcube were divided into two banks of , pro- set could have a different NEWS configuration, and
cessors, that is,  processor chips, each having one interprocessor message sends could be used to trans-
I/O line, so I/O transfers to or from the processors fer data between virtual processors in different virtual
consisted of -bit words. The graphic display frame- processor sets. Within each virtual processor set, Paris
buffer card drove a -inch color CRT monitor (typically supported dynamic memory allocation of fields, where
,  × ,  pixels), using either -bit or full -bit a field was simply a number of consecutive bits at
color, and supported data transfer from the Connection the same address within each virtual processor in a vir-
Machine processors at  megabytes per second. The tual processor set. Paris supported both stack (“push”
CM I/O controller, on the other hand, multiplexed each and “pop”) and heap (analogous to “malloc” and “free”)
-bit word into four -bit words for transmission on allocation and deallocation of fields.
the CM I/O bus. With the CM-, Thinking Machines offered a choice
Two devices were initially offered by Thinking of front-end bus interfaces, so that either a Symbolics
Machines Corporation for connection to the CM I/O  Lisp Machine or a Digital Equipment Corporation
bus: the DataVault and a VME I/O interface. The VAX computer could be used as a front-end proces-
DataVault was the first commercially available RAID sor (indeed, a single CM- system might have both
disk system; it supported data transfers of  megabytes kinds of front end attached to its nexus crossbar). A
per second, and four could be driven simultaneously VAX front end was required to execute programs writ-
by the four quadrants of a full CM- system. Each ten in C* or Connection Machine Fortran; either kind
DataVault could hold either  or  off-the-shelf disk of front end could execute programs written in *Lisp or
drives, thereby providing either  G bytes or  G bytes CM-Lisp.
of data storage. Each group of  drives was treated as The cabinetry of the CM- was essentially iden-
 for data, seven for error correction and detection, tical to that of the CM- (see Fig. ). At one point,
and three “hot spares.” The VME I/O interface allowed Danny Hillis investigated the possibility of making the
a computer with a VME bus (which might or might not lights blue rather than red, but discovered that while
also be in use as a front-end computer) to be connected blue LEDs had just become available, their relatively
to the CM I/O interface. high cost and short lifetime at that time made them
Each CM- sequencer had four times as much impractical for use in the CM-.
microcode memory as in the CM-. The Paris instruc- The DataVault cabinet was another striking example
tion set, which was implemented by sequencer of industrial design: Rather than being rectilinear, the
microcode, was greatly expanded for the CM-. Arith- cabinet had a gentle curve that made it look rather like
metic operations included integer and floating-point an information desk or a bartender’s station (Fig. ). The
arithmetic, transcendental and trigonometric functions intent was that a Connection Machine cabinet might
(including hyperbolic functions and their inverses), bit- be encircled by up to half a dozen DataVault cabinets
wise logical operations, interprocessor message sends separated by access paths.
with combining, multidimensional nearest-neighbor By , a version of the CM- called the CM-a was
NEWS communication, global reduction operations offered in a smaller cabinet, approximately one eighth
(such as computing the logical OR of one bit from the size of the original CM-/CM- cabinet:  by 
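The notion of a field – the same run of consecutive bits at the same address inside every virtual processor of a virtual-processor set, allocated and freed in either stack or heap fashion – can be pictured with a small bookkeeping sketch. The allocator below is invented for illustration (in Python) and shows only the stack discipline; it is not the Paris interface.

class VPSetMemory:
    # Toy stack allocator for Paris-style fields.  A field is identified
    # here by (offset, length): the same run of bit addresses inside
    # every virtual processor of the set.
    def __init__(self, bits_per_vp):
        self.bits_per_vp = bits_per_vp
        self.live = []                 # (offset, length) of live fields
        self.top = 0                   # next free bit address in each VP

    def push_field(self, length_in_bits):
        if self.top + length_in_bits > self.bits_per_vp:
            raise MemoryError("out of per-processor memory")
        field = (self.top, length_in_bits)
        self.live.append(field)
        self.top += length_in_bits
        return field

    def pop_field(self):
        offset, length = self.live.pop()
        self.top = offset              # stack discipline frees the topmost field
        return offset, length

mem = VPSetMemory(bits_per_vp=4096)    # illustrative size only
a = mem.push_field(32)                 # a 32-bit field in every virtual processor
b = mem.push_field(64)                 # a 64-bit field stacked above it
print(a, b)                            # (0, 32) (32, 64)
mem.pop_field()                        # frees b; a remains live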
and the reported peak LINPACK performance was


. gigaflops.
The  Gordon Bell prize for absolute computer
performance was won by a seismic modeling applica-
tion running on the CM- that achieved sustained
performance of  GFLOPS []. The same application
was further improved by use of a “stencil compiler” []
that compiled a subroutine written in Connection
Machine Fortran into application-specific microcode
that could be downloaded into the sequencer; it earned
a  Gordon Bell Prize honorable mention for
improving the performance of this application to over
Connection Machine. Fig.  Connection Machine model  GFLOPS [, ].
CM-a with graphic display A variety of other scientific applications were imple-
mented on the CM- [].

by  in., consisting of a single CM- subcube standing Connection Machine Model CM-
on a pedestal that contained power supplies and cooling The CM- was designed to be upward-compatible with
fans (see Fig. ). The CM-a could be populated with CM- software, but its hardware architecture repre-
either , or , Connection Machine processors sented a radical break in nearly every aspect [, ].
(and optionally with the same floating-point acceler- Rather than using proprietary -bit SIMD processors
ators used in the CM-). Three choices of front-end controlled by a proprietary sequencer, it used off-the-
processor for the CM- and CM-a were now avail- shelf RISC processors (SuperSPARC I chips from Sun
able: Symbolics , DEC VAX, and Sun Microsystems Microsystems), thus supporting MIMD programming
Sun-/ series. The CM-Lisp language proved to be as well as the data-parallel style of programming. The
difficult to implement in its full generality and by  original plan was for the CM- to be a purely MIMD
was no longer emphasized in Thinking Machines liter- machine, but the performance of SuperSPARC I chips
ature; *Lisp, C*, and Connection Machine Fortran all turned out to be lower than predicted by the historical
enjoyed continued support. SPARC technology curve; in order to provide competi-
A revised version of the CM-, with a faster clock tive numerical performance, Thinking Machines made
rate and larger memory, might have become the CM-, a late design change that added proprietary floating-
but was eventually shipped under the name CM-, point vector coprocessor chips. As a result, although
to emphasize continuity with the CM- product line. MIMD programming could be exploited for some pur-
There was no CM- product, though that number was poses, the best performance on the CM- was attained
sometimes used to refer to a follow-on SIMD machine only by using the same data-parallel programming
design that was discarded early on in favor of what model that had been used on the CM- and CM-.
became the CM-. Rather than using a hypercube network to connect
The fastest CM- systems ever listed in a TOP these processors, and separate I/O busses to connect
list were two ,-processor machines, one at Bran- processors to I/O devices, the CM- used a fat-tree
deis University and one at Florida State University [, network [] to which processors and I/O devices
June , ranked # and #]; the theoretical aggre- were attached on an equal footing [, , ]. One
gate peak performance of each was  gigaflops, of the (quite plausible) goals of the CM- architec-
and the reported peak LINPACK performance was ture was to allow eventual scaling to teraflop perfor-
. gigaflops. The fastest CM- systems ever listed mance [, , ], thus allowing systems to be offered
in a TOP list were two ,-processor machines, in a software-compatible range of sizes spanning more
both at Los Alamos National Laboratory [, June than three orders of magnitude []. The choice of fat-
, ranked # and #]; the theoretical aggre- tree over hypercube made this vision possible because
gate peak performance of each was  gigaflops, the fat-tree, unlike the hypercube, could be made larger
without increasing the number of wires connected to would be delivered to all processors in the partition.
each processor. The fat-tree also had the advantage that Each broadcast message could be from  to 
the number of processors need not be an exact power -bit words in length. There were actually three
of two. distinct broadcast facilities: user-mode, supervisor-
The fat-tree network was actually used to connect mode, and interrupt. Supervisor-mode and inter- C
three types of nodes: ordinary processing nodes (perhaps Each broadcast message could be from  to 
having vector coprocessors), usually present in large interrupt broadcast could cause every processor in
numbers; a handful of control processor nodes, each of the partition to receive either an interrupt or a hard-
which ran an enhanced version of UNIX and could ful- mode, and interrupt. Supervisor-mode and inter-
fill the role of a CM- front-end processor; and a modest ● Combining: Each processor could (asynchronously)
number of I/O nodes, which were almost identical to inject a message into the control network; after every
processor nodes but had conventional I/O interfaces processor in the partition had done so, a result
instead of vector coprocessor chips. would be delivered to each processor. Four differ-
User-mode code on the SPARC could access the ent combining modes were supported: global reduc-
fat-tree network interface directly, but had to use vir- tion, parallel prefix, parallel suffix, and router-done
tual addresses to identify destinations; the CM- fat-tree (which was simply a specialized logical-OR reduc-
network then provided virtual address translation. In tion intended to assist processors in cooperatively
this way the processors of a CM- system could be determining whether a bulk data-network transmis-
divided into groups called partitions, with each parti- sion phase had been completed). The combining
tion assigned to a different user task and each task pro- operation for the first three modes could be bit-
tected against interference from other tasks. Privileged wise OR, bitwise XOR, signed maximum, signed
(supervisor) code on the SPARC could send messages integer addition, or unsigned integer addition. (Soft-
using either virtual or physical (absolute) addresses to ware could also get the effect of bitwise AND, signed
identify destinations. Typically the operating system minimum, unsigned maximum, or unsigned mini-
would dedicate one control processor to manage each mum by exploiting De Morgan’s laws or related tech-
partition. niques.) Each message could be  to  bits long.
A high-bandwidth I/O device was typically con- ● Global bit: As in the CM- and CM-, each pro-
nected to the CM- by striping a wide bus across a cessor could provide one bit, but the logical OR of
set of I/O node processors that together formed a par- these bits was delivered to all CM- processors in the
tition; typically the operating system would dedicate partition rather than to a sequencer. In the CM-,
one control processor to each I/O partition to manage there were actually three global-bit interfaces: user-
its I/O requests. Yet another advantage of the fat-tree mode synchronous, user-mode asynchronous, and
structure was that if partitions were allocated appro- supervisor-mode asynchronous.
priately, then network traffic within a partition never
interfered (competed for network resources) with traf- The CM- also included a third network, the diag-
fic in any other partition, nor with traffic between nostic network, that was invisible to application pro-
two other partitions (thus I/O activity by one user grammers but allowed the system to self-monitor for
task would not interfere with computation by another hardware failures and to isolate failed components. This
user task). was based on then-emerging standards for on-chip
The ,-way logical-OR signaling network in the testing and connection-control circuitry using a serial
CM- and CM- was replaced in the CM- with a more interface that required only a few extra pins per chip,
comprehensive control network. This was a tree struc- but the CM- implementation was notable for using a
ture that paralleled the fat-tree data network but did tree-structured network to allow testing of thousands
not become fatter near the root of the tree. The CM- of chips in parallel. The CM- diagnostic network could
control network provided these facilities: also diagnose itself by using chips higher in the tree to
test chips lower in the tree [, ].
● Broadcasting: Any processor could inject a message A CM- processor node without vector coproces-
into the control network, and a copy of that message sors consisted of a -bit memory bus connecting three
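The reduction and parallel-prefix combining modes listed above have simple reference semantics: every processor contributes one value; a reduction delivers the single combined result to all of them, while a parallel prefix delivers to processor i the combination of the contributions of processors 0 through i (a parallel suffix is the same computation run from the other end). The sketch below states those semantics in Python; in the machine they were computed by the control-network hardware as the contributions were combined in the tree.

from operator import add

def reduce_all(values, combine):
    # Reduction: every processor receives the same combined result.
    acc = values[0]
    for v in values[1:]:
        acc = combine(acc, v)
    return [acc] * len(values)

def parallel_prefix(values, combine):
    # Scan: processor i receives the combination of contributions 0..i.
    out, acc = [], None
    for v in values:
        acc = v if acc is None else combine(acc, v)
        out.append(acc)
    return out

contributions = [3, 1, 4, 1, 5]
print(reduce_all(contributions, add))        # [14, 14, 14, 14, 14]
print(parallel_prefix(contributions, add))   # [3, 4, 8, 9, 14]
print(parallel_prefix(contributions, max))   # [3, 3, 4, 4, 5]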
chips: the RISC processor chip, a network interface chip various shift-and-mask or shift-and-XOR operations in
that connected to the control network and data net- a single cycle.
work [], and a memory interface that could manage The CM- had an option to integrate a massively
from one to four memory chips of  MB each. A CM- parallel disk array (called SDA, or Scalable Disk Array)
processor node with vector coprocessors consisted of into the CM- processor cabinets: One form of CM-
a -bit memory bus connecting six chips: the RISC I/O node consisted of the usual RISC processor and
processor chip, a network interface chip, and four vec- CM- network interface chip, a data buffer, four SCSI-
tor coprocessor chips, each of which also served as a controllers, and eight .-in. hard disk drives. Another
memory interface that could manage from one to four form of CM- I/O node was similar but had no disk
memory chips of  MB each. Thus a processor node drives and just two SCSI- controllers; these nodes were
without vector coprocessors could have up to  MB of connected to tape drives mounted in a separate cabi-
memory, but a processor node with vector coprocessors net (called ITS, the Integrated Tape System). A CM-
could have up to  MB of memory. HIPPI interface allowed the CM- to be connected to
In , each vector coprocessor had a peak mem- an industry-standard HIPPI (HIgh Performance Par-
ory bandwidth of  MB/s,  megaflops of peak - allel Interface) bus; a CM-HIPPI interface was also
bit floating-point performance, and  mega-ops peak offered for CM- and CM- systems, allowing dif-
-bit integer performance []. (By , these fig- ferent generations of Connection Machine systems to
ures had been increased to  MB/s,  megaflops, communicate at high speed with each other or to share
and  mega-ops [].) The RISC microprocessor could I/O devices. Yet another form of CM- I/O node was
issue instructions to individual vector units or to all The CM-, like the CM- and CM-, sup-
four at once. Each vector unit had  -bit registers peripherals, such as the DataVault, on CM- systems.
(which were also addressable as  -bit registers), a The CM-, like the CM- and CM-, sup-
-bit vector mask, and a -bit vector length register. ported the *Lisp, C*, and Connection Machine Fortran
Each vector instruction could process up to  sets of languages, as well as CMSSL (the Connection Machine
operands. Each register vector operand was specified by Scientific Software Library []) and the Prism pro-
a -bit starting register number and a -bit stride; the gramming environment [, ]. Prism was notable
first element for that vector operand was taken from the for using graphics-intensive visualization techniques
starting register; that register number was then repeat- to display debugging data for thousands of proces-
edly incremented by the given stride to produce register sors [, ] and for being one of the software prod-
numbers containing succeeding elements of the vector ucts that survived the dissolution of Thinking Machines
operand. A large stride had the same effect as a small Corporation [].
negative stride, so vector operands could be processed The operating system for the CM-, which ran in the
in reverse order. Each vector coprocessor provided control processors, was an enhanced version of UNIX
addition, multiplication, memory load/store, indirect called CMOST, for “Connection Machine Operating
register addressing, indirect memory addressing, and SysTem” [, ]. (The more obvious acronym “CMOS”
population count (counting the -bits in a word). Each was passed over because it was, and still is, in wide use
vector-unit instruction could specify at least one arith- to denote the integrated-circuit technology “Comple-
metic operation and an independent memory opera- mentary Metal-Oxide-Semiconductor.”) This operating
tion, allowing pipelining of operands from memory system eventually included a “scalable file system” that
and results to memory while computation was taking extended supported file sizes from -bit integers to
place internal to the coprocessor. Single-cycle multiply- -bit integers [].
add and multiply-subtract operations were supported, The cabinetry of the CM- was quite different from
as well as the fairly unusual combination of an inte- that of the CM- and CM- but no less striking; multiple
ger multiply (high or low part) followed by a bitwise -foot-tall, oblong cabinets, each of which could hold up
Boolean operation – by choosing the high or low part to  processors, were connected in groups of five in a
of the multiply and using an appropriate power of  as staggered “lightning bolt” arrangement that allowed one
the multiplier, the programmer could get the effect of end of each cabinet to display a large rectangular panel
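Two details of the vector units reward a concrete check. First, because register numbers wrap around the register file, a large stride steps through the registers exactly as a small negative stride would, which is what lets a vector operand be read in reverse order. Second, multiplying by a power of two and keeping the low or the high half of the double-width product is the same as a left or a right shift, so the multiply-then-Boolean combination yields shift-and-mask or shift-and-XOR effects in one operation. The sketch below (Python; the register-file size and word width are illustrative parameters, not the machine's actual values) demonstrates both points.

def register_sequence(start, stride, count, num_regs):
    # Register numbers touched by one vector operand: the starting
    # register, then repeated increments by the stride, modulo the
    # size of the register file.
    return [(start + i * stride) % num_regs for i in range(count)]

print(register_sequence(5, 1, 4, num_regs=16))    # [5, 6, 7, 8]
print(register_sequence(5, 15, 4, num_regs=16))   # [5, 4, 3, 2]  (reverse order)

def shift_and_mask(x, shift, mask, word_bits=32):
    # Left shift via multiply-low by 2**shift, then a bitwise AND.
    low_half = (x * (1 << shift)) & ((1 << word_bits) - 1)
    return low_half & mask

def right_shift_via_multiply_high(x, shift, word_bits=32):
    # Logical right shift via multiply by 2**(word_bits - shift),
    # keeping the high half of the double-width product.
    return (x * (1 << (word_bits - shift))) >> word_bits

print(hex(shift_and_mask(0x1234, 8, 0xFF00)))            # 0x3400
print(hex(right_shift_via_multiply_high(0x1234, 8)))     # 0x12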
The fastest CM- system ever listed in a TOP list


was a ,-processor machine at Los Alamos National
Laboratory [, June , ranked #]; each processor
node was rated at  megaflops, for a theoretical aggre-
gate peak performance of  gigaflops, and the reported
peak LINPACK performance was . gigaflops.
Examples of scientific applications implemented on
the CM- include computational fluid mechanics []
and a solution of the Boltzmann equation [] that was
measured as running at . gigaflops.

Related Entries
C*
Cache-Only Memory Architecture (COMA)
Connection Machine Fortran
Connection Machine Lisp
MPP
HPF (High Performance Fortran)
Illiac IV
*Lisp
Connection Machine. Fig.  Four cabinets of a MasPar
Connection Machine model CM-
Bibliography
of the red LEDs that had become the signature look . Anonymous () Lights! Action! Cue the computer! Parallelo-
of the Connection Machine series (see Fig. ). These gram: The international journal of high performance computing,
LEDs were not, however, mounted on the processor  (September/October ). Fitzroy, London, ISSN -,
pp –
boards, and each cabinet had a panel of lights, no matter . Bailie CF, Brickner RG, Gupta R, Johnsson L () QCD with
whether the cabinet contained processors, storage mod- dynamical fermions on the Connection Machine. In: Super-
ules, fat-tree nodes, I/O interfaces, or a mixture. (The computing ’: Proceedings  ACM/IEEE conference on
fact that the lights were now physically separated from supercomputing. ACM, New York, pp –. ISBN ---.
the processors made it less costly to exploit their eerie http://doi.acm.org/./.
. Barnes GH, Brown RM, Kato M, Kuck DJ, Slotnick DL, Stokes RA
look in the  movie Jurassic Park; CM- cabinets –
() The Illiac IV computer. IEEE Trans Comput C-,  August
with the blinking red lights but without processors , pp –. ISSN -. http://dx.doi.org/./TC.
[] – were featured in the “control room” scenes, the .
phrase “Connection Machine” was used in a bit of dia- . Batcher KE () STARAN parallel processor system hardware.
logue, and the Thinking Machines Corporation logo In: AFIPS ’: proceedings national computer conference and
exposition. ACM, New York, pp –. http://doi.acm.org/
appeared prominently in the credits.)
./.
The central cabinet in each group of five could . Batcher KE () The multidimensional access memory in
contain fat-tree nodes sufficient to connect the other STARAN. IEEE Trans Comput ():–. IEEE Comput Soc,
four cabinets to each other and to other lightning Washington, DC. ISSN -. http://dx.doi.org/./TC.
bolts. CM- systems with more than four processor cab- .
inets were arranged on the machine room floor as paral- . Batcher KE () MPP: A supersystem for satellite image pro-
cessing. In: AFIPS ’: proceedings, National computer confer-
lel lighning bolts with their central cabinets connected
ence, – June . ACM, New York, pp –. ISBN --
by overhead cable bridges, each several feet wide, that -X. http://doi.acm.org/./.
supported the numerous physical fat-tree connections . Bouknight WJ, Denenberg SA, McIntyre DE, Randall JM, Sameh
between the five-cabinet groups. AH, Slotnick DL () The ILLIAC IV system. Proceedings IEEE
,  Apr , pp –. ISSN -. http://dx.doi.org/ . Hillis WD () The Connection Machine (computer architec-
./PROC.. ture for the new wave). AI Memo , Massachusetts Institute
. Bromley M, Heller S, McNerney T, Steele GL Jr. () Fortran of Technology Artificial Intelligence Laboratory, Cambridge, MA,
at ten gigaflops: The Connection Machine convolution compiler. September 
In: PLDI ’: proceedings ACM SIGPLAN  conference on . Hillis WD () The Connection Machine. MIT Press,
programming language design and implementation. ACM, New Cambridge. ISBN ---
York, pp –. ISBN ---. http://doi.acm.org/./ . Hillis WD () The Connection Machine. Scientific American
. ():–. Scientific American, New York. ISSN -.
. Cray Research, Inc. () CRAY- computer system hardware http://dx.doi.org/./scientificamerican-
reference manual . Bloomington, MN, November . . Hillis WD, Steele GL Jr. () Data parallel algorithms. Commun
http://www.bitsavers.org/pdf/cray/C CRAY- Hardware ACM ():–. ACM, New York. ISSN -. http://
Reference Nov.pdf doi.acm.org/./.
. Cray Research, Inc. () CRAY X-MP series mainframe refer- . Hillis WD, Tucker LW () The CM- Connection Machine:
ence manual HR-. Mendota Heights, MN, November . A scalable supercomputer. Commun ACM ():–. ACM,
http://www.bitsavers.org/pdf/cray/ R- X-MP MainframeRef New York. ISSN -. http://doi.acm.org/./.
Nov.pdf 
. Cray Research, Inc () CRAY X-MP series model  main- . Johnsson SL () CMSSL: A scalable scientific software library.
frame reference manual HR-. Mendota Heights, MN, August In: Proceedings  conference on scalable parallel libraries.
. http://www.bitsavers.org/pdf/cray/HR- CRAY X-MP IEEE Computer Society Press, Silver Spring, pp –. http://dx.
Series Model  Mainframe Ref Man Aug.pdf doi.org/./SPLC..
. Delany HC () Ray tracing on a Connection Machine. . Johnsson SL () The Connection Machine systems CM-. In:
In: ICS ’: proceedings nd international conference on super- SPAA ’: proceedings fifth annual ACM symposium on parallel
computing. ACM, New York, pp –. ISBN ---. algorithms and architectures. ACM, New York, pp –. ISBN
http://doi.acm.org/./. ---. http://doi.acm.org/./.
. Digital Equipment Corporation () PDP-/S mainte- . Kahle BU, Nesheim WA, Isman M () Unix and the Con-
nance manual, F-S, fourth printing. Maynard, MA, August nection Machine operating system. In: Proceedings workshop
. http://www.bitsavers.org/pdf/dec/pdp/pdps/PDPS on UNIX and supercomputers. USENIX Association, Berkeley,
MaintMan.pdf pp –
. Dongarra J, Karp AH, Kennedy K, Kuck D () Special report: . Leiserson CE () Fat-trees: Universal networks for hardware-
 Gordon Bell prize. Software ():–, . IEEE, Los efficient supercomputing. IEEE Trans Comput ():–.
Alamitos, California. http://dx.doi.org/./MS.. IEEE Computer Society, Washington, DC. ISSN -
. Dongarra JJ, Karp A, Miura K, Simon HD () Gordon Bell prize . Leiserson CE () The networks of the Connection Machine
lectures. In: Supercomputing ’: proceedings  ACM/IEEE CM-. In: Meyer F, Monien B, Rosenberg AL (eds) Parallel archi-
conference on Supercomputing. ACM, New York, pp –. tectures and their efficient use, first Heinz Nixdorf symposium,
ISBN ---. http://doi.acm.org/./. Paderborn, Germany, – November, proceedings. Lecture notes
. Fahlman SE () Design sketch for a million-element NETL in computer science, vol . Springer, Berlin, pp –. ISBN
machine. In: AAAI-: proceedings first national confer- ---. http://dx.doi.org/./---
ence on artificial intelligence. Morgan-Kaufmann, Los Altos, . Leiserson CE, Abuhamdeh ZS, Douglas DC, Feynman CR,
CA, pp –. http://www.aaai.org/Papers/AAAI// Ganmukhi MN, Hill JV, Hillis WD, Kuszmaul BC, St. Pierre MA,
AAAI-.pdf Distr Comput ():–. Elsevier. ISSN -. http://dx.
. Gabriel RP () Massively parallel computers: The Connection work architecture of the Connection Machine CM-. J Parallel
Machine and NON-VON. Science ():–. American Distr Comput ():–. Elsevier. ISSN -. http://dx.
Association for the Advancement of Science, New York. ISSN doi.org/./jpdc..
-. http://dx.doi.org/./science... . Leiserson CE, Abuhamdeh ZS, Douglas DC, Feynman CR, Gan-
. Gottlieb A, Grishman R, Kruskal CP, McAuliffe KP, Rudolph L, mukhi MN, Hill JV, Hillis WD, Kuszmaul BC, St. Pierre MA, Wells
Snir M () The NYU Ultracomputer—Designing an MIMD DS, Wong MC, Yang S-W, Zak R () The network architec-
shared memory parallel computer. IEEE Trans Comput ture of the Connection Machine CM- (extended abstract). In:
():–. IEEE Computer Society, Washington, DC. SPAA ’: proceedings fourth annual ACM symposium on par-
ISSN -. http://dx.doi.org/./TC.. allel algorithms and architectures. ACM, New York, pp –.
. Gregory J, McReynolds R () The SOLOMON computer. IEEE ISBN ---X. http://doi.acm.org/./.
Trans Electronic Comput EC-():–. ISSN -. . Long LN, Myczkowski J () Solving the Boltzmann equation at
http://dx.doi.org/./PGEC..  gigaflops on a -node CM-. In: Supercomputing ’: pro-
. Hillis WD () Multi-dimensional message transfer router. ceedings  ACM/IEEE conference on supercomputing. ACM,
United States Patent ,,. Filed  March . Granted  New York, pp –. ISBN ---. http://doi.acm.org/
September  ./.
. LoVerso SJ, Isman M, Nanopoulos A, Nesheim W, Milne ED, . Sistare S, Allen D, Bowker R, Jourdenais K, Simons J, Title R ()
Wheeler R () sfs: A parallel file system for the CM-. A scalable debugger for massively parallel message-passing pro-
In: USENIX-STC’: proc. USENIX summer  tech- grams. IEEE Parallel Distri Technol ():–. IEEE Computer
nical conference. USENIX Association, Berkeley, CA, pp Society Press, Los Alamitos, CA. ISSN -. http://dx.doi.
– org/./.
. McCormick BH () The Illinois pattern recognition . Sistare S, Dorenkamp E, Nevin N, Loh E () MPI support
computer—ILLIAC III. IEEE Trans Electron Comput EC- in the Prism programming environment. In: Supercomputing
():–. ISSN -. http://dx.doi.org/./PGEC. ’: proceedings  ACM/IEEE conference on supercomputing
. (CDROM). ACM, New York. ISBN ---. http://doi.acm.
. Myczkowski J, Steele G () Seismic modeling at  gigaflops org/./.
on the Connection Machine. In: Supercomputing ’: proceed- . Slotnick DL, Borck WL, McReynolds RC () The SOLOMON
ings  ACM/IEEE conference on supercomputing. ACM, computer. In: AFIPS ’ (Fall): proceedings fall joint computer
New York, pp –. ISBN ---. http://doi.acm.org/ conference. ACM, New York, December , pp. –. http://
./. doi.acm.org/./.
. Negele JW () QCD teraflops computer. Nuclear physics B— . Squire JS, Palais SM () Programming and design consid-
Proceedings supplements  (March ), pp –. ISSN erations of a highly parallel computer. In: AFIPS ’ (Spring):
-. http://dx.doi.org/./-()-O proceedings spring joint computer conference. ACM, New York,
. Palmer J, Steele GL Jr. () Connection Machine model CM- May , pp –. http://doi.acm.org/./.
system overview. In: Proceedings fourth symposium on the fron- . Stanfill C, Kahle B () Parallel free-text search on the
tiers of massively parallel computation. IEEE Computer Society Connection Machine system. Commun ACM ():–.
Press, Los Alamitos, pp –. ISBN ---. http://dx. .
doi.org/./FMPC.. .\penalty-\@M
. Reddaway SF () DAP—A distributed array processor. In: . Stanfill C, Waltz D () Toward memory-based reasoning.
ISCA ’: Proceedings st annual symposium on computer archi- Commun ACM ():–. ACM, New York. ISSN -
tecture. ACM, New York, pp –. http://doi.acm.org/./ . http://doi.acm.org/./.
. . Thiel T () The design of the Connection Machine. Design
. Seitz CL () The Cosmic Cube. Commun ACM ():– issues ():–. MIT Press, Cambridge. ISSN -
. ACM, New York. ISSN -. http://doi.acm.org/./ . Thinking Machines Corporation () Connection Machine
. model CM- technical summary. Technical report HA-.
. Sethian JA () Computational fluid mechanics and massively Cambridge, April 
parallel processors. In: Supercomputing ’: proceedings  . Thinking Machines Corporation () Connection Machine
ACM/IEEE conference on supercomputing. ACM, New York, technical summary, version .. Cambridge, May 
pp –. ISBN ---. http://doi.acm.org/./. . Thinking Machines Corporation (). Prism user’s guide, ver-
 sion .. Cambridge, December 
. Sethian JA, Brunet J-P, Greenberg A, Mesirov JP () Comput- . Thinking Machines Corporation () Connection Machine
ing turbulent flow in complex geometries on a massively parallel CM- technical summary, third edition. Cambridge, November
processor. In: Supercomputing ’: proceedings  ACM/IEEE 
conference on Supercomputing. ACM, New York, pp –. . Thinking Machines Corporation () Programming the NI,
ISBN ---. http://doi.acm.org/./. version .. Cambridge, February 
. Shaw DE, Stolfo SJ, Ibrahim H, Hillyer B, Wiederhold G, Andrews . Top  supercomputer sites () Semiannual lists since 
JA () The NON-VON database machine: A brief overview. of the top  supercomputer sites in the world as measured
Database engineering bulletin: a quarterly bulletin of the IEEE by a LINPACK benchmark. http://www.top.org. Accessed 
Comput Soc Tech Committ Database Eng ():–. IEEE Com- March 
puter Society, Washington, DC. http://sites.computer.org/debull/ . Tucker LW, Robertson GG () Architecture and applications
DEC-CD.pdf of the Connection Machine. Computer ():–. IEEE. ISSN
. Simon HD (ed) Proceedings conference scientific applications of -. http://dx.doi.org/./.
the Connection Machine. World Scientific, Singapore, September . Unger SH () A computer oriented toward spatial problems.
. ISBN --- In: Proceedings IRE ():–. Institute of Radio Engi-
. Sistare S, Allen D, Bowker R, Jourdenais K, Simons J, Title R neers/IEEE. ISSN -. http://dx.doi.org/./JRPROC.
() Data visualization and performance analysis in the Prism .
programming environment. In: Topham NP, Ibbett RN, Bem- . Webber DM, Sangiovanni-Vincentelli A () Circuit simula-
merl T (eds) Proceedings of the IFIP WG . workshop on tion on the Connection Machine. In: DAC ’: proceedings
programming environments for parallel computing, vol. A- of th ACM/IEEE design automation conference. ACM, New York,
IFIP Transactions. North-Holland Publishing, Amsterdam, April pp –. ISBN ---. http://doi.acm.org/./.
, pp –. ISBN ---X 
IMPLICIT NONE statement), and a subset of features


Connection Machine Fortran proposed in the draft ANSI Fortran x standard (draft
S, version ), including some features described
Guy L. Steele Jr.
in that draft as “removed extensions.” The Fortran
Oracle Labs, Burlington, MA, USA
x features adopted by CM Fortran included array-
valued expressions and elemental functions; other
Synonyms intrinsic functions such as reduction operations (such
CM fortran as SUM, PRODUCT, MAXVAL, MAXLOC, ANY, and
ALL), dot product (DOTPROD), and matrix multipli-
Definition cation (MATMUL); array constructor expressions; array
Connection Machine Fortran is a data-parallel ver- sections and vector-valued subscripts; masked array
sion of Fortran developed around  for Connection assignment (the WHERE statement and WHERE con-
Machine supercomputers manufactured by Thinking struct); and the FORALL statement and FORALL con-
Machines Corporation. It consists essentially of For- struct, which are somewhat like a DO loop, but possibly
tran  augmented by array-processing features that having more than one index variable, that executes its
had been proposed for Fortran x (and were eventu- body (a single statement or a block of statements) in
ally adopted as part of the Fortran  and Fortran  a data-parallel fashion for all possible combinations of
standards), additional data-parallel intrinsic functions values of its index variables simultaneously – and if a
(such as for parallel prefix operations), and data dis- logical mask expression is also included, then only com-
tribution directives. It was one of the parallel Fortran binations of index values for which the mask expression
projects that contributed several noteworthy features to is true are used. For example, assume that A and B are
the design of High Performance Fortran. arrays of shape  × ; then the FORALL construct
FORALL (I=1:100, J=1:100, I .NE. J)
Discussion
A(I,J) = 0.0
Connection Machine Fortran (also called CM Fortran)
B(I,J) = C(I,J)
was developed and implemented for the CM- and
C(I,J) = C(J,I)
CM- models of Connection Machine supercomputer.
END FORALL
The language was specified by Thinking Machines Cor-
poration but was initially implemented under contract first sets all the off-diagonal elements of A to 0.0,
by Compass, Inc. []. (That well-known compiler com- and only then copies the off-diagonal elements of C
pany, also known as Massachusetts Computer Asso- into corresponding positions of B, and only after that
ciates, had previously implemented a Fortran compiler transposes (the off-diagonal elements of) the array C.
for the Goodyear MPP [].) It is noteworthy that the FORALL statement and
Of the four programming languages (*Lisp, C*, CM construct were among the “removed extensions” in the
Fortran, and CM-Lisp) provided by Thinking Machines Fortran x draft and were not incorporated into the
Corporation for Connection Machine Systems, CM Fortran  standard; that the FORALL statement and
Fortran was perhaps the most conventional, adhering construct were adopted as part of High Performance
as closely as possible to existing Fortran standards or Fortran [, pp –]; and that as a result of expe-
projected future standards, and yet also the most influ- rience with High Performance Fortran, the FORALL
ential, because of its contributions to High Performance statement and construct were included in the Fortran
Fortran.  standard. However, though the FORALL statement
The version as of April  was described [] and construct were described as part of CM Fortran as
as including all of Fortran  as defined by ANSI early as April  [], they were not yet implemented
standard X.-, as well as two sets of exten- in CM Fortran as of February  [, page ] (see also
sions: those defined by MIL-STD- [] (princi- [, p ]). They do appear to have been implemented by
pally intrinsic functions for bit manipulation, the mid- (an internal design document for implement-
DO WHILE statement, the END DO statement, and the ing FORALL [] is dated November , and published
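The ordering guarantee behind the example above – within a FORALL, each statement is applied to every index combination before the next statement begins, and in particular every right-hand side of an assignment is evaluated before any element is stored – is what lets C(I,J) = C(J,I) transpose an array in place. The sketch below restates that semantics in Python, used here only as executable pseudocode for the Fortran construct; a naive element-by-element loop would overwrite half of the values before reading them.

def forall_assign(array, index_pairs, rhs):
    # FORALL-style assignment: evaluate every right-hand side first,
    # then perform all of the stores.
    values = [(i, j, rhs(i, j)) for (i, j) in index_pairs]   # all reads
    for i, j, v in values:                                   # then all writes
        array[i][j] = v

c = [[1, 2], [3, 4]]
off_diagonal = [(i, j) for i in range(2) for j in range(2) if i != j]
forall_assign(c, off_diagonal, lambda i, j: c[j][i])   # like C(I,J) = C(J,I)
print(c)   # [[1, 3], [2, 4]] -- a true transpose, not a half-overwritten one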
release notes dated October  refer to an “enhance- These layout and alignment directives were eventually
ment” to FORALL that generates parallel code for more adopted, with some modifications and contributions
cases than in “previous releases” [, p ]). Over the next from other parallel Fortran projects [] (notably For-
 years, additional enhancements and optimizations for tran D [–] and Vienna Fortran [, ]), into High
FORALL were introduced for both the CM- and the Performance Fortran [, pp –].
CM- [–]. As implemented for the Connection Machine model
By September , a set of compiler directives had CM-, CM Fortran did not have intrinsic functions
been added to CM Fortran [, pp –]. Following for scan (parallel prefix and parallel suffix) operations;
the practice of Fortran compilers for other supercom- instead, there was a utility library that allowed CM For-
puters such as the Cray series, CM Fortran directives tran programs to invoke individual instructions from
took the form of structured comments – in this case, the PARIS (Connection Machine PARallel Instruction
exploiting the fact that both Fortran comments and the Set), including instructions that performed scan oper-
abbreviation “CMF” for Connection Machine Fortran ations. With the introduction of the model CM-,
begin with the letter “C”. The two directives of greatest a CM Fortran Utility Library was introduced that
historical and technical interest were the LAYOUT and provided a more abstract interface to certain com-
ALIGN directives. The LAYOUT directive specified how putational and communications facilities that could
a Fortran array should be laid out across Connection be supported on both the CM- and CM-. This
Machine processors; for example, library included scan operations with names such
DIMENSION A(64, 64, 64, 100) as CMF_SCAN_ADD and supported segmented scan
CMF$ LAYOUT A(:SEND,:NEWS,:NEWS,:SERIAL) operations [, pp –]. It also included a group
of operations labeled “scatters with combining,” with
specified that the four-dimensional array A should be names such as CMF_SEND_ADD [, pp –]. These
laid out in such a way that elements whose indices dif- library functions were the direct predecessors of intrin-
fered only in the last axis should reside in the same
sic functions such as SUM_PREFIX, SUM_SUFFIX,
virtual processor, and that the other three axes should
and SUM_SCATTER in High Performance Fortran [,
be distributed across virtual processors in such a way
pp –].
that the processor numbering along the first axis cor-
By , Thinking Machines Corporation in collab-
responded to the Connection Machine router’s “send
oration with Applied Parallel Research had produced a
address” ordering, and the processor numbering along
parallelizing translator from Fortran  to CM Fortran
the second and third axes corresponded to the Connec-
called CMAX [].
tion Machine’s NEWS grid ordering (which amounted Figure  shows an early () example of a CM For-
to a Gray encoding of some portion of the router tran program that identifies prime numbers less than
address). The ALIGN directive specified that the lay- , by the method of the Sieve of Eratosthenes,
out of one array should be determined by the layout taken (with one minor correction) from []. This
of another array, so as to maintain minimum com- program is written in what is now entirely conven-
munications cost between corresponding elements. For tional Fortran, except that the expression [1:N] would
example, have to be written [I,I=1,N]. Because Fortran
DIMENSION A(64,64,64,100),B(64,64,64),C(64) arrays normally use -origin indexing (rather than the
CMF$ ALIGN B(I,J,K) WITH A(I,J,K,1)
-origin indexing provided by such languages as C,
CMF$ ALIGN C(M) WITH B(M,M,M)
Java, and Common Lisp), the parameter N used to
causes array B to be aligned with (and therefore allo- specify array length is defined to be 99999 rather
cated in the same virtual processors as) a slice of array than 100000. The first three assignment statements
A, and causes array C to be aligned with the main space set every element of array PRIME to false, every ele-
diagonal of array B. Such directives allowed for explicit ment of array CANDIDATE to true, and the first ele-
programmer control over array allocation decisions that ment of CANDIDATE to false. While CM Fortran
had previously been made automatically by an opti- did have a DO WHILE statement, this example hap-
mization phase in the CM Fortran compiler [, –]. pens to use a statement label 20 and a conditional
SUBROUTINE FINDPRIMES(PRIME)
PARAMETER (N = 99999)
LOGICAL PRIME(N),CANDIDATE(N)
PRIME = .FALSE.
CANDIDATE = .TRUE.
CANDIDATE(1) = .FALSE.
20 NEXTPRIME = MINLOC([1:N],CANDIDATE)
PRIME(NEXTPRIME) = .TRUE.
FORALL (I = 1:N, MOD(I,NEXTPRIME) .EQ. 0) CANDIDATE(I) = .FALSE.
IF (ANY(CANDIDATE)) GO TO 20
RETURN
END

Connection Machine Fortran. Fig.  Example Connection Machine Fortran program for identifying prime numbers

GO TO statement to implement the loop. The standard Fortran intrinsic function MINLOC is used to find the index of the smallest value between 1 and N for which the array CANDIDATE has a true value. This index is then used to determine which element of PRIME to set to true. The FORALL statement processes all elements of CANDIDATE in parallel, setting to false those elements for which the expression MOD(I,NEXTPRIME) .EQ. 0 is true. The standard Fortran intrinsic function ANY returns true if any element of its array argument is true; thus the loop is repeated if any candidate remains. Note that both MINLOC and ANY are reduction operations at heart.

Related Entries
C*
Connection Machine
Connection Machine Lisp
Fortran  and Its Successors
HPF (High Performance Fortran)
*Lisp

Bibliography
. Albert E, Knobe K, Lukas JD, Steele GL Jr () Compiling Fortran x array features for the Connection Machine computer system. In: PPEALS ': Proceedings of the ACM/SIGPLAN conference on parallel programming: Experience with applications, languages and systems, ACM, New York, pp –, June 
. Knobe K, Loveman DB, Marcus M, Wells I () A Fortran compiler for the Massively Parallel Processor. Technical Report CADD--, Massachusetts Computer Associates (COMPASS), Wakefield, Feb 
. Thinking Machines Corporation () Connection Machine model CM- technical summary. Technical Report HA-, Cambridge, MA
. United States Department of Defense () MIL-STD-. Military standard: FORTRAN, DOD supplement to American national standard X.-. Washington, DC, Nov 
. Koelbel CH, Loveman DB, Schreiber RS, Steele GL Jr, Zosel ME () The High Performance Fortran handbook. MIT Press, Cambridge, MA
. Thinking Machines Corporation () Getting started in CM Fortran. version .–., Cambridge, MA
. Thinking Machines Corporation () CM Fortran release notes. version .–., Cambridge, MA
. Mincy J () Forall design. Unpublished document,  pages plus title page, Nov 
. Thinking Machines Corporation () CM Fortran release notes. version . Beta , Cambridge, MA
. Thinking Machines Corporation () CM Fortran release notes: detailed. version . Beta ., Cambridge, MA
. Thinking Machines Corporation () CM Fortran for the CM- release notes. version .., Cambridge, MA
. Thinking Machines Corporation () CM Fortran for the CM- release notes. version .., Cambridge, MA
. Thinking Machines Corporation () CM Fortran . Beta release notes. Cambridge, MA
. Thinking Machines Corporation () CM Fortran reference manual. version .-., Cambridge, MA
. Knobe K, Lukas JD, Steele GL Jr () Massively parallel data optimization. In: Frontiers ': Proc. nd symposium on the frontiers of massively parallel computation, IEEE Computer Society Press, Washington, DC, pp –, October 
. Knobe K, Lukas JD, Steele GL Jr () Data optimization: Allocation of arrays to reduce communication on SIMD machines. J Parallel Distrib Comput ():–
. Knobe K, Lukas JD, Steele GL Jr () Data parallel computers and the FORALL statement. In: Frontiers ': Proceedings of the rd symposium on the frontiers of massively parallel
computation, IEEE Computer Society Press, Los Alamitos, California, pp –, October 
. Albert E, Lukas JD, Steele GL Jr () Data parallel computers and the FORALL statement. J Parallel Distrib Comput ():–
. Steele GL Jr () High Performance Fortran: Status report. Workshop on languages, compilers, and run-time environments for distributed memory multiprocessors. SIGPLAN Notices ():–
. Fox G, Hiranandani S, Kennedy K, Koelbel C, Kremer U, Tseng CW, Wu MY () Fortran D language specification. Tech. Rep. CRPC-TR, Center for Research on Parallel Computation, Rice University, Houston, Texas, December 
. Hiranandani S, Kennedy K, Koelbel C, Kremer U, Tseng CW () An overview of the Fortran D programming system. Tech. Rep. CRPC-TR, Center for Research on Parallel Computation, Rice University, Houston, Texas, March 
. Hiranandani S, Kennedy K, Tseng CW (August ) Compiling Fortran D for MIMD distributed-memory machines. Commun ACM ():–
. Chapman B, Mehrotra P, Zima H () Programming in Vienna Fortran. Sci Program ():–
. Chapman B, Moritsch H, Mehrotra P, Zima H () Dynamic data distributions in Vienna Fortran. In: Supercomputing ': proceedings of the  ACM/IEEE conference on supercomputing, ACM, New York, NY, pp –
. Thinking Machines Corporation () CM Fortran user's guide for the CM-. version .., Cambridge, MA
. Sabot G, Wholey S () Cmax: A Fortran translator for the Connection Machine system. In: ICS ': Proc. th international conference on supercomputing, ACM, New York, pp –

Connection Machine Lisp
Guy L. Steele Jr.
Oracle Labs, Burlington, MA, USA

Synonyms
CM-lisp

Definition
Connection Machine Lisp (CM-Lisp) is a data-parallel version of Lisp developed around  for Connection Machine supercomputers manufactured by Thinking Machines Corporation. Unlike the ∗Lisp language, it drew no sharp distinction between front-end data and parallel data, and provided for parallel processing of S-expressions, not just numbers and bit fields. CM-Lisp introduced an aggregate data type called a xapping, which was essentially a (not necessarily finite) map from S-expressions to S-expressions, and special syntax for performing elementwise operations, reductions, and permutations on these data structures.

Discussion
Of the four programming languages (∗Lisp, C∗, CM Fortran, and CM-Lisp) provided by Thinking Machines Corporation for Connection Machine Systems, CM-Lisp was the most radical in design (requiring the use of non-ASCII characters in its notation, and introducing an associative data structure indexed by non-numeric values) and the most difficult to implement (requiring automatic garbage collection of parallel data structures). Danny Hillis used it in his book about the Connection Machine [] to explain how he imagined programming a massively parallel supercomputer intended for nonnumerical applications in a data parallel style []. While it was featured along with the other three languages in the earliest Connection Machine Technical Summary [], it was dropped from all later versions [, ], and Thinking Machines never offered a complete parallel implementation as a commercial product.
CM-Lisp consists of Common Lisp augmented with an extra data structure, the xapping, essentially an unordered set of ordered index-value pairs suitable for parallel processing, where each element of each pair may be any CM-Lisp data structure and the index of each pair must be unique within that xapping. Here is the CM-Lisp notation for a xapping that maps names of colors to numbers:

{red→14 green→3 purple→93 blue→3.5}

For the common and useful special case that the indices are consecutive integers starting from zero, the xapping is also called a xector, and a square-bracket notation may be used:

[banana apple pear kumquat]

is simply an alternate notation for

{0→banana 1→apple 2→pear 3→kumquat}

A distinctive feature of CM-Lisp is that xappings could be conceptually infinite by having a default value considered to be associated with every index not explicitly mentioned:

{red→14 green→3 purple→93 blue→3.5 →0}
is a xapping that has a pair x→0 for every possible CM-Lisp data structure x other than red, green, purple, and blue.
CM-Lisp also adds three notations for parallel processing. The α notation allows elementwise parallel processing; if two or more xappings are processed by an α-expression, then corresponding values are combined by matching up their indices (a process that can be regarded as a simple form of database join operation – simple because no input xapping will have multiple pairs with the same index, and therefore no output xapping will have multiple pairs with the same index). As a simple example, if x names the xapping

{red→14 green→3 purple→93 blue→3.5}

and y names the xapping

{red→4 yellow→5 green→6}

then the expression α(+ ●x (∗ 2 ●y)) produces the value

{red→22 green→15}

The character "α" indicates that the following expression is to be executed "in parallel, as many copies as needed"; the character "●" in effect says, "but not this part – just evaluate it once and use the value" (i.e., the following expression should be evaluated just once and is expected to produce a xapping). Thus, in the expression α(+ ●x (∗ 2 ●y)), there will be many copies of the addition operation + and the multiplication operation ∗, and also of the constant 2, but x and y already have parallel values. For every possible index, the result xapping will have a pair with that index if and only if all input xappings have a pair with that index. This use of the α and ● characters is syntactically reminiscent of (and intentionally modeled on) the backquote notation of Common Lisp [, ], in which the backquote character " ` " indicates that a copy should be made of the following data structure, and the comma character "," in effect says, "but not here – just evaluate it and use the value."
The second notation uses the character β as a reduction functional: If f is a function of two arguments, then βf is a function that, when given a xapping x as an argument, will use f to combine the values of all the pairs in x pairwise, repeatedly, until a single value results. Thus

β+ is a function that will sum all the values in a xapping, and βmax will return the largest value in a xapping.
The third notation uses the same character β (in an overloaded fashion) as a permutation functional that, in an abstract sense, describes what the Connection Machine router does when passing messages from processor to processor. If f is a function of two arguments, then βf is a function that, when given two xappings x and y as arguments, matches up pairs by their indices just as the α notation does; if x contains a pair p→q and y contains a pair p→r, then the result will contain a pair q→r. If one regards xapping indices as naming Connection Machine processors (or, to turn it around, if one thinks of a xapping as having a Connection Machine processor associated with each pair), then this describes the action of processor p sending to processor q a message containing the value r. If x happens to contain more than one pair whose value is q, then multiple processors will send messages to the same processor q, in which case the function f is used to combine those values into a single value so as to produce a single pair in the result with index q. (This is why the same character β is used to notate two different operations: In the most general case, they each involve reduction of multiple values to a single value.)
The design of CM-Lisp attempts to be completely general in giving meaning to every possible Common Lisp operation when α is applied to it. This leads to a rather intricate theory of control flow and function application that can be described either by a metacircular interpreter or algebraic relationships between α and such Lisp constructs as lambda and if [].
Figure  shows an early () example of a Connection Machine Lisp program that identifies prime numbers less than , by the method of the Sieve of Eratosthenes, taken (with minor alterations) from []. The function find-primes takes an argument n indicating the number of integers to be tested; for this example, it should be called as (find-primes 100000). The function make-xector makes a xector (a xapping whose indices happen to be consecutive integers starting from ); thus the local variable candidate is bound to a xector of length n whose elements are all initially t (true), and the local variable primes is bound to a xector of length n whose elements are all initially nil (false). The function iota
(defun find-primes (n)
  (let ((candidate (make-xector n :initial-element t))
        (primes (make-xector n :initial-element nil))
        (value (iota n)))
    (αsetf candidate '[nil nil])
    (do ((next-prime (position t candidate) (position t candidate)))
        ((null next-prime) primes)
      (setf (xref primes next-prime) t)
      α(setf ●candidate
             (and ●candidate
                  (not (zerop (mod ●value next-prime))))))))

Connection Machine Lisp. Fig.  Example Connection Machine Lisp program for identifying prime numbers

creates a xector that maps integers to themselves; thus the local variable value is bound to a xector such that element k has value k, for  ≤ k < n. The construction αsetf indicates that setf is to be used as many times as needed; the second argument is a xector [nil nil] of length , and so at most two setf operations will be performed (fewer if n is  or ). The effect is to set elements  and  (if they exist) of the xector in variable candidate to nil. The do loop is a conventional Common Lisp do loop; it binds, initializes, and steps the variable next-prime repeatedly, executing the body in between, until the test expression (null next-prime) is true, at which point the result expression primes is evaluated and its result is returned. The function position is the conventional Common Lisp sequence function position, overloaded to process xectors (and implemented using parallel techniques); it returns the index of the leftmost element (i.e., the element of smallest index) whose value matches the given value (in this case t), but returns nil if no element of the xector matches. Each time that next-prime is not nil, the two body expressions are executed. The function xref is analogous to Common Lisp aref, but indexes a xapping rather than an array; thus the expression (setf (xref primes next-prime) t) sets one element of the xector primes, namely the one selected by the index next-prime, to t. (One could select many elements in parallel, thereby producing a new xapping or updating many elements of an existing xapping, by using α with xref in an expression such as α(xref primes ●indices). With elements of a xapping spread across

many Connection Machine virtual processors, this construction would perform interprocessor communication.) The second expression in the body of the do loop updates elements of the candidate xector; because the operands candidate and value are each preceded by a bullet character "●," they are treated as already parallel; on the other hand, the operand next-prime has no bullet in front of it and therefore is automatically replicated (broadcast to all virtual processors). This expression uses the conventional (and idiomatic) Common Lisp special form and, but it could (equally idiomatically) have been written using when:

α(when ●candidate
  (setf ●candidate
    (not (zerop (mod ●value next-prime)))))

Either version has the effect of inactivating virtual processors for which candidate has the value nil, then executing the expression (not (zerop (mod ●value next-prime))) on whatever virtual processors remain active, then reactivating virtual processors disabled in the first step.
By  there was a working implementation of CM-Lisp that included a garbage collector and a compiler, but it restricted the values of xappings stored within Connection Machine processors to be integers, floating-point numbers, or characters, and did not yet support the execution of nested α expressions in parallel []. Programming languages whose designs were influenced by experience with CM-Lisp include Paralation Lisp [], which was by design less abstract than CM-Lisp so
as to admit an efficient compiled implementation [], and NESL [], which successfully implemented nested parallel execution of data-parallel operations on nested vectors of differing size [].

Related Entries
C*
Connection Machine
Connection Machine Fortran
*Lisp
Nesl

Bibliography
. Blelloch GE () Vector models for data-parallel computing. MIT, Cambridge
. Blelloch GE () Programming parallel algorithms. Commun ACM ():–
. Blelloch GE, Hardwick JC, Chatterjee S, Sipelstein J, Zagha M () Implementation of a portable nested data-parallel language. In: PPOPP ': Proceedings of the fourth ACM SIGPLAN symposium on principles and practice of parallel programming, ACM, New York, pp –
. Hillis WD () The Connection Machine. MIT, Cambridge
. Hillis WD, Steele GL Jr () Data parallel algorithms. Commun ACM ():–
. Sabot GW () The paralation model: Architecture-independent parallel programming. MIT, Cambridge
. Steele GL Jr, Hillis WD () Connection Machine Lisp: Fine-grained parallel symbolic processing. In: LFP ': Proceedings of the  ACM conference on LISP and functional programming, ACM SIGPLAN/SIGACT/SIGART, New York, August , pp –
. Steele GL Jr, Fahlman SE, Gabriel RP, Moon DA, Weinreb DL () Common Lisp: The language. Digital, Burlington
. Steele GL Jr, Fahlman SE, Gabriel RP, Moon DA, Weinreb DL, Bobrow DG, DeMichiel LG, Keene SE, Kiczales G, Perdue C, Pitman KM, Waters RC, White JL () Common Lisp: The language, nd edn. Digital, Bedford
. Thinking Machines Corporation () Connection Machine model CM- technical summary, technical report HA-. Cambridge
. Thinking Machines Corporation () Connection Machine technical summary, version .. Cambridge
. Thinking Machines Corporation () Connection Machine CM- technical summary, rd edn. Cambridge
. Wholey S, Steele GL Jr () Connection Machine Lisp: A dialect of common lisp for data parallel programming. In: Kartashev LP, Kartashev SI (eds) Proceedings of the second international conference on supercomputing, vol III, International Supercomputing Institute, Santa Clara, pp –

Consistent Hashing
Peer-to-Peer

Control Data 
John Swensen
CPU Technology, Pleasanton, CA, USA

Synonyms
CDC 

Definition
The Control Data  computer, regarded by many as the world's first supercomputer, was designed by Seymour Cray and James Thornton, introduced in , and featured an instruction issue rate of  MHz, with overlapped, out-of-order instruction execution in multiple functional units and interleaved memory banks. A unique "scoreboard" unit controlled instruction issue and execution so as to match the behavior of traditional, sequential execution.

Discussion
Introduction
The state of the art in scientific computing, immediately prior to the Control Data , was represented by the Univac LARC, introduced in , with K–K -bit words of core memory, a -μs floating-point add time, an -μs floating-point multiply time, and a -μs floating-point divide time. The LARC was closely followed, in , by the IBM  (also called Stretch), with K–K -bit words of core memory, a .-μs floating-point add time, a .-μs floating-point multiply time, and a -μs floating-point divide time. The IBM  executed . million instructions per second (MIPS), on average, and occupied , ft of floor space.
Three years later, the Control Data  computer offered K–K -bit words of core memory, a .-μs floating-point add time, a .-μs floating-point multiply time, and a .-μs floating-point divide time. Its execution rate averaged  MIPS, . times faster than the IBM , but occupied only  ft of floor space.
Architecture
The CDC  instruction-set architecture had several features common to Cray's machines, including:
● A load-store architecture
● Several register types
● An absence of condition-codes
● Three-address instruction formats
● Word-only access to memory
● Non-standard arithmetic
● Delegation of input–output and overhead functions to peripheral processors
These architectural features were chosen to allow Cray's characteristically aggressive clocking, with multiple cycles required, even for simple instructions.

Load-Store Architecture
Load-store architectures require operands to be loaded into registers before they can be operated on, and results must be explicitly stored to memory. With explicit load instructions, support for overlapped instruction execution, and with a sufficient number of registers, it is often possible to schedule long-latency memory loads of operands in advance of their use, so that a computation is not delayed by the access latency. With explicit store instructions, memory bandwidth is only used to save desired intermediate results.
The CDC  was a load-store architecture, but it had no pure load or store instructions. Rather, loads were initiated when one of a subset of address registers was written, and stores were initiated when one of a different subset of address registers was written.

Register Sets
The registers of the CDC  included a set of eight address registers (A registers), a set of eight index registers (B registers, with B holding the constant zero), and a set of computation registers (X registers). The A and B registers were  bits in length, large enough to address the entire K-word memory space. Loop indices and other housekeeping variables could, also, be represented with  bits of precision. The X registers were  bits in length, and were used to hold fixed-point and floating-point operands.
Loading and storing of registers was through an unusual mechanism; an instruction that wrote to one of registers A through A caused the corresponding register X through X to be loaded from memory at the address in the corresponding A register. An instruction that wrote to either A or A caused the contents of either X or X to be stored at the address in the corresponding A register. Registers A and X did not have this behavior.
In addition to the general-purpose A, B, and X registers, the CDC  had an -bit program counter, an -bit base register for all memory accesses, an -bit field-length register for all memory accesses, an exit-enable register, as well as a pair of base and field-length registers for Extended Core Storage accesses.
The program counter could only point to a -bit location in memory, even though up to four instructions could be packed into a memory location. This necessitated that every branch target was to the instruction encoded in the high  or  bits of the destination word.
The base and field-length registers allowed program and data relocation within the memory space of the CDC , supporting efficient sharing of the computer by multiple programs. Other than this linear relocation, no other address translation was performed by the CDC .
The exit-enable register encoded the conditions that could cause a program to exit, and included address-out-of-range, operand-out-of-range, and indefinite-operand conditions. Upon program exit, this register field held the actual conditions at the time of program exit.
All of the registers of the CDC  were saved (and restored) in a -word block of memory by the Exchange Jump operation.

Branch Condition Encoding
Conditional branches in the CDC  were based on the states of specified registers (or pairs of registers), rather than the state of a single, condition-code register. Branch conditions include the zero, negative, in-range, and indefinite status of any X register, as well as equality and inequality relations between arbitrary pairs of B registers.
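The short Python sketch below is a toy model of the two mechanisms just described: memory traffic triggered as a side effect of writing an address register, and branches that test register state directly rather than a condition code. It is not taken from any real CDC simulator; word widths, one's-complement arithmetic, and class and method names are illustrative, and the register index ranges used for loads and stores (1–5 and 6–7) follow the usual description of the machine.

# Toy model of address-register write side effects and branch tests.
class ToyRegisters:
    def __init__(self, memory):
        self.memory = memory            # address -> word (plain Python ints)
        self.A = [0] * 8                # address registers
        self.B = [0] * 8                # index registers (B0 always zero)
        self.X = [0] * 8                # operand registers

    def set_A(self, i, address):
        # Writing an address register is what moves data:
        # low-numbered A registers load the matching X register,
        # the two high-numbered ones store it; A0/X0 are unaffected.
        self.A[i] = address
        if 1 <= i <= 5:
            self.X[i] = self.memory.get(address, 0)   # implicit load
        elif i in (6, 7):
            self.memory[address] = self.X[i]          # implicit store

    def branch_if_zero(self, j):
        # No condition-code register: a branch reads a register directly.
        return self.X[j] == 0

    def branch_if_B_equal(self, i, j):
        return self.B[i] == self.B[j]

mem = {100: 42}
r = ToyRegisters(mem)
r.set_A(1, 100)          # behaves like "load X1 from address 100"
r.X[6] = r.X[1] + 1
r.set_A(6, 101)          # behaves like "store X6 to address 101"
print(r.X[1], mem[101])  # 42 43

The sketch exists only to make these two mechanisms concrete; the following subsections explain why they mattered for overlapped execution.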
In common computer architectures, a single condition-code register is usually implicitly set by the last arithmetic or logical operation, and sometimes by the last load operation, as well. Typically, it encodes whether the last result was zero, was negative, caused a carry-out, or caused an arithmetic overflow. For an implementation with strictly sequential execution, condition codes cause few performance penalties, although determination of a zero result often requires an additional cycle after an arithmetic result is available.
When instruction execution is overlapped, however, the implicitly set, single-condition-code register requires that the condition-setting instruction immediately precede its conditional branch. If branches are not based on the state of a single condition code, useful instructions can often be inserted between a long-executing instruction and the branch depending on its result.
Other approaches to branch conditions include conditional writing of the condition code (for example, in the Sun SPARC architectures), or multiple condition codes (for example, in the IBM Power and PowerPC architectures). These approaches both require additional opcode space to specify the condition-writing behavior.

Three-Address Instruction Format
Most of the CDC  instructions specified three registers: two source registers and a destination register. Although requiring a larger instruction word than a two-address instruction format (where the destination register is, also, one of the source registers), a three-address instruction format does not require an extra instruction to move a result. This is particularly helpful when registers are not completely symmetrical, as was the case with the CDC 's loadable registers X–X and storable registers X and X.
The opcode field implicitly encoded the type of registers addressed by the instruction, so that each register specifier field was only three bits in length, for a total of nine bits for the three operands, leaving six bits for an opcode in the -bit instruction format. A longer, -bit instruction format included a -bit constant field that was combined with one -bit register-specifier field to form an -bit constant for branch targets or immediate operands. Note that the -bit branch target addressed

a -bit word; branches were always to the most significant  or  bits of the destination addresses. The -bit and -bit instruction formats are shown below.

Memory Access Sizes
Loads and stores in the CDC  always moved  bits; there was no direct architectural support for accessing smaller operands. Furthermore, loads only targeted registers X–X, and only registers X and X could be stored. Loading an operand into an A or B register required a load to an X register, followed by a move to the desired register. If other than the low  bits of the memory word were desired, the X register could be shifted, with a separate instruction, before moving the new, low  bits of its contents to the desired A or B register.
The advantages to such inconvenient restrictions were that the distribution networks between memory and the registers could be highly optimized and that fewer load and store opcodes were required. These advantages were minor, however, and the architectural trade-off reflects Cray's philosophy that ease of programming was far less important than speed of execution.

Number Representations
Integer arithmetic in the CDC  was performed using a 's-complement representation, in which arithmetic negation, like logical negation, required inverting each bit of the input operand. Two disadvantages of the 's-complement representation were that additions and subtractions required that the carry-out from the sum be added into the least-significant bit of the sum (an end-around carry), and that there were two representations for zero (in octal, for -bit values, + = , − = ), although the CDC  fixed-point add units never produced a negative-zero result.
Floating-point numbers in the CDC  had a -bit sign, an -bit binary exponent (biased by ,), and a -bit coefficient with the radix point to the right of the least significant bit. Both the exponent and the coefficient were represented as 's complement numbers. The radix point on the right allowed easy conversion of integers to floating-point values. In addition to finite values, the floating point format encoded positive and
negative infinity (that might be generated via arithmetic overflow or division by zero), as well as indefinite values (that might be generated by dividing zero by zero, or by adding positive infinity to negative infinity). These exceptional results could be tested by branch instructions.
The CDC 's floating point representation differed from all other computer families, most notably differing from that of the IBM , with its hexadecimal exponent and sign-and-magnitude fraction. At the time of its introduction, this mattered little, but, over time, the ubiquity of the  and  architectures meant that most of a growing collection of carefully crafted numerical routines could not be used by programmers of the CDC  and its successor machines.

Peripheral Processors
The central processor of the CDC  was dedicated to computation; input–output operations, job scheduling, and other "overhead" operations were delegated to a set of ten peripheral processors, or PPUs. Although implemented in the same technology as the central processor, and housed within the same cabinet, each of the PPUs executed at a fraction of the central processor's speed and had a very different instruction set architecture, along with its own, private memory.
These PPUs were connected to the computer system's input–output devices and controlled them directly. The PPUs interfaced to the central processor via the system memory. The PPUs could interrupt the central processor by using the Exchange Jump operation, which simultaneously saved the current central processor program context to memory and loaded a new context from memory. In one sense, the central processor was a compute-server slave to the collection of PPUs. In a different sense, the central processor was the star performer, freed from mundane responsibilities by its entourage of support processors.

Instruction Set Listing
The instruction set of the CDC  central processor is listed below. Octal notation is used to represent the six-bit opcodes; where three-digit opcodes are specified, they indicate that the i-field of the instruction further expands the opcode. The value jk is the -bit constant formed by the j- and k-fields.

Stop
Return Jump to K
Central Processor-initiated Exchange Jump (special hardware option)
Goto K + Bi
Goto K if Xj is zero
Goto K if Xj is not zero
Goto K if Xj is positive
Goto K if Xj is negative
Goto K if Xj is in range (not infinite)
Goto K if Xj is out of range (infinite)
Goto K if Xj is definite
Goto K if Xj is indefinite (similar to IEEE NaN)
Goto K if Bi == Bj
Goto K if Bi != Bj
Goto K if Bi >= Bj
Goto K if Bi < Bj
Transmit Xj to Xi
Logical Product (AND) of Xj and Xk to Xi
Logical Sum (OR) of Xj and Xk to Xi
Logical Difference (XOR) of Xj and Xk to Xi
Transmit not Xk to Xi
Logical Product of Xj and not Xk to Xi
Logical Sum of Xj and not Xk to Xi
Logical Difference of Xj and not Xk to Xi
Rotate Xi left jk places
Shift Xi right jk places (with sign-extension)
Shift Xi nominally left Bj places (right if Bj negative)
Shift Xi nominally right Bj places (left if Bj negative)
Normalize Xk to Xi and Bj
Round and normalize Xk to Xi and Bj
Unpack Xk to Xi and Bj
Pack Xk and Bj to Xi
Floating point sum of Xj and Xk to Xi
Floating point difference of Xj and Xk to Xi
Floating double sum of Xj and Xk to Xi
Floating double difference of Xj and Xk to Xi
Rounded floating point sum of Xj and Xk to Xi
Rounded floating point difference of Xj and Xk to Xi
Integer sum of Xj and Xk to Xi
Integer difference of Xj and Xk to Xi
Floating point product of Xj and Xk to Xi
Rounded floating point product of Xj and Xk to Xi
Floating double product of Xj and Xk to Xi
Form jk mask in Xi
Floating point divide Xj by Xk to Xi
Rounded floating point divide Xj by Xk to Xi
Nop
Sum of 's (population count) in Xk to Xi
Sum of Aj and K to Ai (load or store Xi if i != )
Sum of Bj and K to Ai (load or store Xi if i != )
Sum of Xj and K to Ai (load or store Xi if i != )
Sum of Xj and Bk to Ai (load or store Xi if i != )
Sum of Aj and Bk to Ai (load or store Xi if i != )
Difference of Aj and Bk to Ai (load or store Xi if i != )
Sum of Bj and Bk to Ai (load or store Xi if i != )
Difference of Bj and Bk to Ai (load or store Xi if i != )
Sum of Aj and K to Bi
Sum of Bj and K to Bi
Sum of Xj and K to Bi
Sum of Xj and Bk to Bi
Sum of Aj and Bk to Bi
Difference of Aj and Bk to Bi
Sum of Bj and Bk to Bi
Difference of Bj and Bk to Bi
Sum of Aj and K to Xi
Sum of Bj and K to Xi
Sum of Xj and K to Xi
Sum of Xj and Bk to Xi
Sum of Aj and Bk to Xi
Difference of Aj and Bk to Xi
Sum of Bj and Bk to Xi
Difference of Bj and Bk to Xi

Implementation Technology
The resistor-transistor logic gates of the CDC  were packaged in "cordwood" modules, approximately .′′ × .′′ × .′′ in size, in which pairs of circuit boards were connected by resistors between the boards. With the silicon transistors on the inside, wiring traces on the outside, off-module wiring on one edge, power and mechanical connections to a cooling plate on the opposite edge, each module implemented a high-density, serviceable logical element.
The core memory used by the CDC  was packaged in blocks, approximately ′′ × ′′ × .′′, each holding , -bit subwords, with five memory modules making up a bank of , -bit words. A fully populated CDC  had  interleaved banks of memory; each bank had an access time of  ns and a cycle time of , ns, and a different bank could be accessed every  ns.
Memory reads to core memory were destructive, so that, following a read, the data had to be written back or it would be lost. This characteristic was exploited by the Exchange Jump operation, which saved the current -word program context, while simultaneously loading a new program context.

Implementation Features
The major implementation features of the CDC  were its:
● Ten functional units capable of overlapped execution
● An Instruction Stack for caching instructions in loops
● A Scoreboard unit to manage the overlapped and out-of-order instruction execution

Functional Units
Functional units of the CDC  included:
● A -bit Boolean unit ( cycles)
● A -bit shift unit ( cycles)
● A -bit fixed (integer) add unit ( cycles)
● A -bit floating point add unit ( cycles)
● Two -bit floating point multiply units ( cycles)
● A -bit floating point divide unit ( cycles or  cycles)
● Two -bit increment (short integer) units (– cycles)
● An -bit branch unit (– cycles)
These functional units all required multiple cycles to execute, as indicated, but they were not pipelined; they could only process one set of operands at a time. Two multiply units were provided, because multiplies are relatively common in scientific programming, but slow to execute. Two increment units were, also, provided, because their operations, which included load and store operations, were very common.
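Because the units were not pipelined, back-to-back operations that need the same unit serialize, while operations routed to different units overlap and can complete out of program order. The Python sketch below makes that concrete; it ignores the register dependencies that the Scoreboard handled (discussed in the next subsections), and the latency values and unit names are invented placeholders rather than the machine's actual cycle counts.

# Sketch: overlap of non-pipelined functional units, no register dependencies.
LATENCY = {"divide": 29, "fp_add": 4, "increment": 3}   # placeholder values

def schedule(instructions):
    # instructions: list of (unit, label); issue is in program order, at most
    # one per cycle, and stalls while the required unit is busy.
    unit_free_at = {u: 0 for u in LATENCY}
    cycle, timeline = 0, []
    for unit, label in instructions:
        issue = max(cycle, unit_free_at[unit])   # wait for the unit to free
        done = issue + LATENCY[unit]
        unit_free_at[unit] = done
        timeline.append((label, issue, done))
        cycle = issue + 1
    return timeline

program = [("divide", "X0 = X1 / X2"),
           ("fp_add", "X3 = X1 + X2"),
           ("fp_add", "X5 = X4 + X3"),
           ("increment", "B1 = B2 + B3")]
for label, issue, done in schedule(program):
    print(f"{label:13s} issues cycle {issue:2d}, completes cycle {done:2d}")

With these placeholder latencies the divide completes long after the instructions that follow it, which is exactly the out-of-order completion the next subsection describes; the two additions, needing the same add unit, serialize.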
One clock cycle was the minimum time required to issue a new instruction, so it was possible, with care, for a program to keep multiple functional units busy simultaneously. Furthermore, differing execution times could cause instructions to complete out of program order. For example, a divide, followed by a series of adds, could finish after all of the following adds finished. In the figure below, the divide issues and begins execution in cycle , completing in cycle . The following floating-point add (to X) issues and begins execution in cycle , completing in cycle . The next add (to X) cannot issue until cycle  because the add unit is busy; its input operand is available, so it executes in cycle  and completes in cycle . The next instruction, an -bit add (to B), issues and begins execution in cycle , completing in cycle . The final add (to X) issues and executes in cycle , when the add unit is not busy, completing in cycle .

X0 = X1 / X2
X3 = X1 + X2
X5 = X4 + X3
B1 = B2 + B3
X7 = X6 + X5

The floating point add and multiply units supported both single and double precision arithmetic, but required two instructions to return the more and less significant parts of a double-precision result.
Conversions between integer and floating point operands used the shift unit, sometimes requiring two instructions to complete a conversion.
The divide unit executed floating point divides, as well as a population-count instruction (which counted the number of -bits in a word) and a no-operation (nop) instruction. Because of the branch-target restrictions, nop instructions were very common in CDC  programs.
Most branch instructions, besides requiring the branch unit, also required the use of an increment unit or the fixed-point add unit for operand comparison or testing.

Instruction Stack
Instructions for the CDC  were cached in an -word instruction buffer (called the Instruction Stack), so up to  instructions could be held in fast-access storage, although loops were limited to, at most,  instructions, plus one branch, if they were to remain in the Instruction Stack.

Scoreboard Unit
The Scoreboard Unit enforced sequential execution semantics, while allowing instructions to execute and complete out of program order. For this discussion, an instruction issues when it is decoded and sent to a functional unit for eventual execution, an instruction executes when it has received all of its operands and begins the specified operation, and it completes when it writes its final result(s) to a memory location or to one or more registers. Instructions in the CDC  always issued in program order, although they often executed and completed out of order.
Because the CDC  functional units were not pipelined, a functional unit remained busy from the time an instruction was issued to it until one cycle after that instruction completed. A result register remained busy from the time its instruction issued until it completed.
During the issue process, each instruction was decoded and its required functional unit(s) and destination register(s) were determined. If a required functional unit was unavailable, or if the destination register was busy, issuing stalled until all were available. These were called first-order conflicts by the designers of the CDC . Once the first-order conflicts disappeared, the instruction was sent to the required functional unit(s), along with the source and destination register specifiers. This enabled the issue process for the following instruction to begin.
If all required input registers were available, execution began; otherwise, the functional unit waited for them to become available. Unavailability of input registers was called a second-order conflict; this tied up a functional unit, but allowed other instructions to continue executing.
Execution proceeded and, one cycle before the result was ready, the functional unit signaled that it was ready to release its result to the destination register. If all release datapaths were busy, or if one or more previously issued instructions had not yet used the older value in the destination register, completion stalled. Inability to write a result was called a third-order conflict; like a second-order conflict, it tied up the functional unit, but
did not affect the execution of instructions that did not depend on the uncompleted instruction.
The Scoreboard Unit operated on state information associated with the functional units and the registers. With each functional unit (FU) was associated:
● Fm – function to be performed by the unit (e.g., integer add, integer subtract)
● Fi – the destination register for the result
● Fj – the first source register for the function
● Fk – the second source register for the function
● Qj – the functional unit producing the result in Fj
● Qk – the functional unit producing the result in Fk
● RFj – a bit indicating that Fj was ready
● RFk – a bit indicating that Fk was ready
● XS – a bit indicating that the FU had begun execution
● RQ – a signal indicating that the FU was requesting to release its result
● RL – a signal indicating that the FU was releasing its result
● AC[] – a set of bits indicating that registers Fj and Fk had not been read
Also, copies of the current input operands were kept within each functional unit.
With each register r in {A–A, B–B, X–X} was associated QR, the functional unit or memory read storage channel that would, eventually, write its data, or an indication that the register had its data. With each register r was associated AC, or All Clear, indicating that no functional unit reading that register had started execution.
The issue process can be described by the following algorithm, operating on each instruction p in succession:
1. Decode instruction p to determine the functional unit FUp, Fm, Fi, Fj, and Fk.
2. Use Fj and Fk to look up values for Qj, Qk in QR[].
3. While FUp is busy or register Fi is busy, stall.
4. Send Fm, Fi, Fj, Fk, Qj, Qk to FUp.
5. If Fj is not waiting for a FU result, set RFj and transmit register Fj to FUp, otherwise clear RFj.
6. If Fk is not waiting for a FU result, set RFk and transmit register Fk to FUp, otherwise clear RFk.
7. Clear AC[FUp][Fj] and AC[FUp][Fk].
8. Set QR[Fi] to FUp.

Notes on the issue process:
● Steps  and  depend on step , but can execute during the same processor cycle as step .
● Step  could repeat for many cycles, given a long chain of instruction dependencies.
● Once Step  is past, Steps , , , , and  can proceed in parallel.

The execute process can be described by the following algorithm, operating in each functional unit in parallel:
1. If RFj is clear and RL[Qj] is active, load operand j with the result on the release datapath and set RFj.
2. If RFk is clear and RL[Qk] is active, load operand k with the result on the release datapath and set RFk.
3. If RFj is set and RFk is set, start execution, set AC[Fj] and AC[Fk], and set XS.

Notes on the execute process:
● Steps  and  execute in parallel.
● The release datapaths for operand j and operand k are either independent, or RL[Qj] and RL[Qk] are asserted on different cycles, so no release datapath conflicts can occur.

The result release process can be described by the following algorithm:
1. One cycle before completing its operation, a functional unit asserts RQ.
2. If any AC[∗][Fi], for all relevant functional units, is clear, delay the grant.
3. If a functional unit of higher priority for the release datapath is asserting RQ, delay the grant.
4. Otherwise, grant the release request.
5. The following cycle, the functional unit asserts RL and writes its result onto the release datapath to the appropriate register.

Instruction formats (field widths in bits): Opcode 6 | i 3 | j 3 | k 3, and Opcode 6 | i 3 | j 3 | K 18
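A heavily simplified Python sketch of this bookkeeping is given below. It keeps only a per-register "who produces this value and when" table (playing roughly the role of QR) and per-unit busy times, so it models first-order conflicts (busy unit or busy destination) at issue and second-order conflicts (operands not yet released) at execution; the release-datapath arbitration and the AC/all-clear checks behind third-order conflicts are omitted, and the latencies are illustrative values rather than quotations from the machine's documentation.

# Simplified scoreboard sketch (illustrative; not the real issue logic).
LATENCY = {"multiply": 10, "add": 4, "increment": 3}

def run(program):
    # program: list of (unit, dest, src1, src2); sources may be literals ("K").
    unit_free = {u: 0 for u in LATENCY}   # cycle at which each unit frees
    produced_at = {}                      # register -> cycle its value is released
    cycle = 0
    for unit, dest, a, b in program:
        # First-order conflict: stall issue while the unit or destination is busy.
        issue = max(cycle, unit_free[unit], produced_at.get(dest, 0))
        # Second-order conflict: the unit waits for its operands before executing.
        execute = max(issue, produced_at.get(a, 0), produced_at.get(b, 0))
        done = execute + LATENCY[unit]
        unit_free[unit] = done + 1        # busy until one cycle after completion
        produced_at[dest] = done
        cycle = issue + 1                 # in-order issue, at most one per cycle
        print(f"{dest} from {unit}({a},{b}): issue {issue:2d}, "
              f"execute {execute:2d}, complete {done:2d}")

run([("multiply",  "X3", "X1", "X2"),
     ("add",       "X5", "X4", "X3"),
     ("increment", "A4", "B2", "K"),   # the real machine would also load X4 here
     ("add",       "X7", "X6", "X5")])

With these latencies the dependent adds line up roughly with the timing example in the figure below, while the independent increment overlaps freely; the real Scoreboard additionally delays releases until the all-clear conditions described above, which this sketch does not model.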
X3 = X1 * X2   No conflicts; issue and execute cycle 0, multiply unit free cycle 5, X3 released cycle 4
X5 = X4 + X3   Issue cycle 1; execute cycle 10 (wait for X3), add unit free cycle 15, X5 released cycle 14
A4 = B2 + K    Issue and execute cycle 2, increment unit free cycle 12, X4 released cycle 11 (wait for X4 all-clear)
X7 = X6 + X5   Issue and execute cycle 15, add unit free cycle 20, X7 released cycle 19

Control Data . Fig.  Issue and execute timing example for dependent and independent instructions
Notes on the release process:
● Because the RQ signal is sent one cycle early, there is time to detect release conflicts and, if there are none, the release can be granted without delaying the functional unit.
● Any other functional units waiting for the result can select it from the release datapath, saving a register read.

A simple example illustrates some of the scoreboard functions. Assume that all four instructions have been loaded into the Instruction Stack. The multiply to X has no conflicts, so it issues and begins execution in cycle . The multiplier remains busy through cycle  (indicated by the light arrow), and execution runs from cycle  through cycle  (indicated by the bold line; please see Fig. ).
The add to X issues in cycle  because both X and the adder are free. However, the operand X is not available until cycle , when the add begins to execute. The adder releases X in cycle  and the unit becomes free in cycle .
The -bit add to A (causing a load to X) issues in cycle  because X, while an input operand to a previous, uncompleted instruction, was copied into the add unit at cycle . The release of X is delayed until after X receives the all-clear signal (in cycle , when the add started execution). X is released in cycle  and the increment unit becomes free in cycle .
Issuing of the add to X stalls from cycle  until cycle , when the add unit becomes free. X is available by then, so execution also begins in cycle , releasing X in cycle , and freeing the adder in cycle .
The scoreboard unit enforced the serial dependency of X -> X -> X, allowed the load to X to proceed in parallel, but prevented the load to X from completing before the add to X began. Sequential execution would have required an additional eight cycles to complete.
For all its apparent complexity, the Scoreboard Unit required fewer gates than the average functional unit in the CDC  []; nevertheless, this was the only computer to use such a unit. A more complex algorithm, the Tomasulo algorithm, used in the IBM  Model , provided faster execution for poorly scheduled and register-allocated code []. A simpler algorithm, which only tracked pending writes to registers, was used in Cray's later, pipelined machines; it allowed faster instruction issuing and, with good instruction scheduling and register allocation, allowed instruction execution as fast as did the Scoreboard Unit of the CDC .

Derivative Machines
The Control Data  employed the same technology as the CDC , but only supported sequential execution of instructions in a single, non-pipelined execution unit. In the Control Data , two  central processors shared a single, interleaved memory system.
In , the Control Data  succeeded the CDC  as the fastest computer in the world. It executed the same instruction set in multiple, fully pipelined, functional units, with an instruction issue rate of  MHz.

Bibliography
. Control Data Corporation () Control data // computer systems reference manual, Publication No. . St. Paul, Minnesota
. Control Data Corporation ()  central processor, vol : functional units, publication No.  (formerly Publication No. ). St. Paul, Minnesota
. Control Data Corporation ()  central processor, vol : control and memory, publication No. . St. Paul, Minnesota
. Thornton JE () Parallel operation in the Control Data . before impact, is transformed into deformation energy.
In: AFIPS Proc. FJCC, pt  vol , pp – Crash simulation indicates the capability of the material
. Thornton JE () Design of a computer the Control Data
used for the design, to absorb this energy and protect
. Scott, Foresman, and Company, Glenview (out of print,
the occupant. Most important results are the decel-
but available, online, at http://www.bitsavers.org/pdf/cdc/cyber/
books/DesignOfAComputer_CDC.pdf) eration felt by the occupants, which must fall below
. Tomasulo RM () An efficient algorithm for exploiting multiple threshold values fixed in legal car safety regulations [].
arithmetic units. IBM J :– To ensure driver safety during a car crash, and meet
the regulations in order to get the official approval and
homologation of a new car model for road services, car
manufacturers need to perform crash tests. Race cars
Coordination need also to meet requirements specified by the FIA,
“Federation Internationale de l”Automobile []. Safety
Path Expressions
regulations require several physical crash tests that must
be performed on a new model. These tests are extremely
expensive; to reduce the number of physical crash tests,
Copy and also the product development lead time and costs,
engineers need to carry out a range of virtual tests,
Broadcast or crash simulations, using crash codes. These codes
are based on explicit finite element method well suited
for analyzing nonlinear dynamic response of struc-
tures. The ability of crash codes is to effectively handle
Core-Duo / Core-Quad material nonlinearity, and nonlinear behavior such as
Processors contact. To perform a crash simulation in a reason-
able time scale, safety engineers use High-Performance
Intel Core Microarchitecture, x Processor Family Computing (HPC) that allows to perform several crash
simulation, during a day, from frontal crash to compo-
nents impact simulation. Nowadays, automotive indus-
try relies heavily on simulations; by using simulation,
COW the number of real test crash is reduced, but computer
Clusters simulations cannot fully replace the crash test, since a
crash test is required in the final stage of development
to validate numerical results. Simulations are compared
to test data using high-speed camera for visualization,
Crash Simulation and accelerometers that are placed at different loca-
tions on the dummy. Before the first prototype is built,
M’hamed Souli , Timothy Prince , Jason Wang a new car model goes through thousands of computer

Université des Sciences et Technologies de Lille, simulations, crash and components system impact sim-
Villeneuve d’Ascq cédex, France ulations. When engineers conduct the physical crash

Intel Corporation, Santa Clara, CA, USA
 tests, the model has already achieved a high standard
LSTC, Livermore, CA, USA
performance through computer simulations. In auto-
motive industry, car crash simulation is the most com-
Definition puter time-consuming task, thus the need to use parallel
During the creation and conception of a new car model, computing. With adoption of HPC technology, new
safety engineers from automotive industry perform systems can perform crash simulations using clusters,
crash simulation, in order to evaluate the level of safety an assembly of several computers running in parallel,
of a car and its occupants. During a car crash, the kinetic “parallel computing,” that can achieve a speed that is
energy of the car, E = /M.v , that the vehicle has proportional to the number of CPU’s in the parallel
system. In a crash simulation, the most critical part networking system. MPP version of crash codes as LS-
of the parallel simulation is the contact handling, new DYNA and other crash codes has been performing with
contact search algorithms based on bucket sort have scalability up to hundred and even thousands of proces-
been developed that lead to better scalability. A detailed sors; however, it has been observed that interconnected
description of the performance of contact algorithm is speed is a significant limiting factor in the scalability of C
described in []. MPP performance [].
When using the MPP version, prior to computer
processing, domain decomposition needs to be per-
Discussion formed to divide the problem. In domain decomposi-
To ensure driver safety during a car crash, impact struc- tion, the model is decomposed into subdomains; each
tures are designed and optimized to absorb the kinetic subdomain is processed by a CPU and uses the memory
energy during the crash, and limit decelerations acting dedicated to that CPU.
on parts of human body like knees and necks. Decelera- Current compilers are not yet capable of automat-
tions values have to meet safety requirements specified ically translating an SMP version of a crash code that
by government regulations. To meet product develop- runs on shared memory into an MPP version that
ment schedule of new car models, it is necessary to use runs efficiently on distributed memory. MPP requires
parallel processing for crash simulation. Vector process- a certain amount of development that needs to be per-
ing has been in use from the beginning of practical crash formed jointly by developers from computer and soft-
simulation. Vector processing which permits the perfor- ware development companies.
mance of several calculations simultaneously was intro- Several computer companies, such as Fujitsu in
duced as an extension to serial processor [], to reduce Japan which has been involved from the beginning in
computer time for crash simulation. Vector process- the vectorization and parallelization of crash simula-
ing techniques, first developed for supercomputers for tion codes, first started developing SMP version of crash
high-performance applications, were commonly used codes using OpenMP language, and then moved to
in crash simulation in the s and s. MPP using MPI library (Message Passing Interface).
HPC remained expensive and limited because of Today, most engineering simulations are performed on
single CPU system. In recent years, multiple CPUs and Quad core or more powerful CPUs.
multiple cores have offered increase of multiple perfor- Few year ago, simple simulation events, such as a
mance and lower cost for crash simulation and other small model using a few dozen elements and repre-
engineering simulation tasks. Later, in order to extend senting an event that lasts  ms, required a day or
execution of data parallel processing constructs as DO more on a vector supercomputer to complete. Today,
loop, IF loop analysis, and nested loops in the Fortran by using low-cost parallel computing, safety engineers
language, SMP (Shared Memory Processing) version of can model large-scale crash car to car simulation with
crash code has been developed. airbags and occupants in just few hours. Nowadays, all
In the SMP technology, multiple CPUs share same car companies have incorporated parallel computing
memory. Since CPUs are accessing shared memory into their car design.
simultaneously, data supplied form memory to CPUs is
likely to be slow, when the number of CPUs increases,
which limits the number of CPUs that can be used Historical Performance Trends
efficiently. However, it has been observed that the scala- The primary goal of crash simulation is to improve
bility of the SMP version of crash codes was observed to safety for human occupants and evaluate design
stop at eight processors []. To overcome this problem, improvements to enable vehicles to score well on crash
and solve the limitation of shared memory process- tests performed by regulatory and insurance agen-
ing, the MPP (Massive Parallel Processing) version of cies. In order to improve the accuracy of crash test
crash codes has been developed, which uses distributed safety analysis and to represent vehicle geometry more
memory. In MPP, each small group of CPUs has its own accurately, the number of finite elements in analysis
memory, and CPUs are interconnected via a high-speed models has been continuously increasing.
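As an illustration of the domain-decomposition approach described above, the following minimal MPI sketch (a hypothetical example in C, not an excerpt from LS-DYNA or any production crash code; the element count and the per-element computation are placeholders) shows each rank owning one subdomain in its own memory, with a global reduction combining the per-subdomain results.

/* Minimal sketch of MPP-style domain decomposition (hypothetical example,
 * not from any production crash code). Each MPI rank owns a contiguous
 * block of elements and works only on its own memory. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N_ELEMENTS 1000000   /* total elements in the model (made-up size) */

int main(int argc, char **argv)
{
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Split the element range into one subdomain per rank. */
    int chunk = (N_ELEMENTS + nranks - 1) / nranks;
    int first = rank * chunk;
    int last  = (first + chunk < N_ELEMENTS) ? first + chunk : N_ELEMENTS;

    /* Each rank allocates and processes only its own subdomain. */
    double *strain = malloc((size_t)(last - first) * sizeof(double));
    double local_energy = 0.0;
    for (int i = 0; i < last - first; i++) {
        strain[i] = 0.0;              /* placeholder element computation */
        local_energy += strain[i];
    }

    /* Combine per-subdomain contributions into a global quantity. */
    double total_energy = 0.0;
    MPI_Reduce(&local_energy, &total_energy, 1, MPI_DOUBLE,
               MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global internal energy = %g\n", total_energy);

    free(strain);
    MPI_Finalize();
    return 0;
}

In a real crash code, each subdomain would also exchange boundary (halo) data with its neighboring subdomains at every time step, which is where the interconnect speed mentioned above becomes the limiting factor.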
Historical Performance Trends
The primary goal of crash simulation is to improve safety for human occupants and to evaluate design improvements that enable vehicles to score well on crash tests performed by regulatory and insurance agencies. In order to improve the accuracy of crash-test safety analysis and to represent vehicle geometry more accurately, the number of finite elements in analysis models has been continuously increasing.
Crash safety numerical simulation is performed primarily by explicit time marching, in which the entire evolution of a crash from initial impact to final resting state, about a tenth of a second, is calculated in steps of about  ms. During a crash simulation, different parts of the car come into contact with themselves and with each other; this is handled by using contact algorithms. Contact analysis is an important feature in crash simulation, and research into efficient parallelization of contact algorithms is still ongoing. A detailed description of the performance of the contact algorithm is given in []. Crash simulation uses the finite difference method for time integration and the finite element method (FEM) for space discretization []. Deformations, velocities, and forces on all relevant parts of the vehicle, and on the dummies used to assess the danger of injury to human occupants, are calculated at each time step. To simulate a crash and occupant safety, crash analysis software, including LS-DYNA, PAM-CRASH, and other commercial and academic codes, must be able to handle large deformations, material nonlinearity, and complex contact conditions among multiple components []. The software must also be able to simulate different types of crash events, such as frontal, side, and rear impact.
In the early s, crash simulation would typically be performed on Symmetrical Multi-Processor supercomputers, with , elements considered a large model. In year , several car manufacturers had adopted MPI cluster computing [] for crash simulation, using CPUs such as the AMD Athlon K [], Intel Xeon DP, or Itanium. At that time, the high-speed interconnect system did not support more than  CPUs efficiently, so a simulation with , elements would be the most detailed model suitable for overnight completion.
Also, by that year, the detail and computational cost of a useful practical analysis far exceeded the Dodge Neon benchmark (, elements), the only benchmark quoted in the first year. The car crash model used in this benchmark is fully described in the next section (number of elements, number of nodes, etc.). The benchmark has been completed on several commercial analysis codes of a similar nature, all available for parallel computing. For this car model, we compare the performance of high-end dual-CPU technical workstations quoted in  against a consumer-product single-CPU quad-core desktop of . From Tables  and , we observe not only that the performance has improved by a factor of six, but also that the computer system in Table  is five times more expensive than the computer used in Table . Consequently, the cost and the performance of minimum parallel computing platforms for crash simulation each improved by more than a factor of  over  years.
These quotations are for the SMP (OpenMP) single-node threaded analysis, at least in the cases where that information was provided in the submission.
Tables  and  compare MPP cluster performance results quoted in  and , using IBM and Intel CPUs; Table  represents the highest performance reported in that year. From these tables, we can conclude that for clusters, performance has been increasing at least by a factor of  between the years  and .
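The explicit time-marching scheme described at the start of this section can be illustrated with a toy one-degree-of-freedom central-difference (leapfrog) loop. This is only a sketch under simplified assumptions (a single spring-mass system with made-up constants), not the integration loop of LS-DYNA, PAM-CRASH, or any other crash code.

/* Toy central-difference (explicit) time integration for a single
 * spring-mass degree of freedom; illustrative only, not production code. */
#include <stdio.h>

int main(void)
{
    const double m  = 1.0;       /* mass (kg), assumed value        */
    const double k  = 1.0e4;     /* stiffness (N/m), assumed value  */
    const double dt = 1.0e-3;    /* time step (s), assumed value    */
    double u = 0.01, v = 0.0;    /* initial displacement and velocity */

    for (int step = 0; step < 100; step++) {
        double f_int = -k * u;   /* internal (elastic) force            */
        double a = f_int / m;    /* acceleration from Newton's F = ma   */
        v += a * dt;             /* velocity update (half-step sense)   */
        u += v * dt;             /* displacement update                 */
        printf("t=%8.4f  u=%10.6f\n", (step + 1) * dt, u);
    }
    return 0;
}

Each step only requires element-local force evaluations, which is why explicit crash codes parallelize well once the contact search and domain decomposition are handled.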
Crash Simulation. Table   Computing Time for HP and IBM with  CPU

Vendor   Computer                      Processor            Total CPU Time (s)   Benchmark   Date
HP       HP-UX Itanium Cluster Rx      Itanium ( GHz)                            Neon
IBM      IBM p                         Power+ ( GHz)                             Neon

Crash Simulation. Table   Computing Time for ARD with  CPU

Vendor   Computer                      Processor            Total CPU Time (s)   Benchmark   Date
ARD      CAi                           Power+ ( GHz)                             Neon
Crash Simulation. Table   Computing Time for IBM with  and  CPU and HP  CPU

Vendor   Computer                      Processor            Total CPU Time (s)   Benchmark   Date
IBM      IBM p                         Power+ ( GHz)                             Neon
IBM      IBM p                         Power+ ( GHz)                             Neon
HP       HP-UX Itanium Cluster RX      Itanium ( GHz)                            Neon

Crash Simulation. Table   Computing CPU Time for SGI and INTEL with  CPU

Vendor   Computer                      Processor                        Total CPU Time (s)   Benchmark   Date
SGI      Altix ICEEX                   Intel Xeon Quad Core X ( GHz)                         Neon
INTEL    Supermicro XDTN               Intel Xeon Quad Core X, X                             Neon

Crash Simulation. Table   Computing Time for INTEL and CRAY with  CPU

Vendor   Computer      Processor                        Total CPU Time (s)   Benchmark   Date
INTEL    SPAL          Intel Xeon Dual Core X                                Carcar
CRAY     CRAY XT       AMD Single Core ( GHz)                                Carcar

Crash Simulation. Table   Computing Time for SGI and CRAY with  CPU

Vendor   Computer       Processor                        Total CPU Time (s)   Benchmark   Date
SGI      Altix ICEEX    Intel Xeon Quad Core X ( GHz)                         Carcar
CRAY     CX             Intel Xeon Quad Core X ( GHz)                         Carcar

In , much larger clusters could be purchased for the performance are demonstrated. In the TopCrunch
price of clusters quoted in . report, the carcar benchmark model using close to six
As reported in TopCrunch [], a project initi- million elements (,, elements) was introduced
ated to track the aggregate performance trends of in  to represent better the detail of simulation
high-performance computer systems and engineer- required in practice at that time.
ing software [], InfiniBandTM adapters and switches In Tables  and , we compare the performance of
had proven cost-effective for cluster communication, realistic size models for clusters installed in  and
and substantial improvements in simulation cost and . In these two tables, it is shown that in  years time,
 C Crash Simulation

computational performance doubled, while cutting the exceed ten million elements. All current simulations
number of cluster nodes in half, using the same total performed on clusters.
number of CPU cores.
In the performance quotations referred in Tables –, Description of the Dodge Neon Benchmark
an increasing number of cores has accounted for a The first test case illustrated here is a vehicle crash-
major part of enhanced performance. This trend may worthiness application of vehicle impacting into a rigid
be expected to continue. barrier. The digitized vehicle model, a Dodge Neon,
Recent development in parallel computing com- is developed by the National Crash Analysis Center at
bines both SMP and MPP technologies, called hybrid the George Washington University. The model is devel-
OpenMP/MP. The hybrid model may see increased oped through reverse engineering process that involves
usage as the number of cores within shared mem- systematic disassembly of the physical vehicle to finite
ory nodes increases, as it should moderate growth of element model of each and every part in the vehicle.
memory size and communications load with increasing The geometry data for each part is converted to an FE
number of threads. (finite element) mesh and carefully reassembled with
the appropriate consideration of connections and con-
straints between elements to create a full vehicle FE
model. A series of material level characterization tests,
Practical Benefit of Improved Parallel
Computation
In , only a few passenger vehicles, aimed toward the
safety conscious market segments, had been designed
by the aid of crash simulation, and achieved good rat-
ings on various governmental and insurance indus-
try tests. By , all vehicles marketed internation-
ally had published crash survival ratings, most of them
substantially improved, while the required tests had
become more comprehensive. Cost-effective parallel
Crash Simulation. Fig.  Car deformation at time t =  ms
computing enabled this degree of success in product
introductions.
CPU vendors like to tell how, if the automotive
industry had made progress equivalent to the comput-
ing industry, we would all be driving vehicles faster and
safer than bullet trains at less than the price of a single
train ride. In fact, the safety record of current vehicles
has improved by several times in less than a decade, due
directly to the success of their manufacturers in com-
putational design and product improvement, with no Crash Simulation. Fig.  Car deformation at time
adverse impact on consumer acceptability. t =  ms
For the simulations, the models have been improved
and details taken into consideration by increasing the
number of elements in the model.
● : First model had , elements
● : . ×  –. ×  elements
● : . ×  –. ×  elements
● :  ×  –. ×  elements
● : . ×  –. ×  elements
In the near future, in order to have an accurate Crash Simulation. Fig.  Car deformation at time
model, the number of elements in a full scale crash will t =  ms
CRAY TE C 

on coupons extracted from various locations of the


vehicle, are performed to gather the required input for Cray SeaStar Interconnect
the material models in FE program. The fully assembled
model of the vehicle consisting of , elements is Cray XT and Cray XT Series of Supercomputers
shown below with model statistics listed. C
Number of components 
Number of nodes ,  CRAY TE
Number of shells , 
Michael Dungworth, James Harrell, Michael
Number of beams 
Levine, Stephen Nelson, Steven Oberlin,
Number of solids ,  Steven P. Reinhardt
Number of elements , 

Run have been performed using LS-DYNA MPP ver-


Synonyms
sion. The following pictures show the car deformation
OS jitter
at time t = , t =  ms, and t =  ms (Figs. –).

Bibliography Definition
. Zienkiewicz OC, Taylor RL () The finite element method for This entry describes the hardware and software archi-
solid and structural mechanics. McGraw Hill, New York. ISBN
tecture of the CRAY TE Massively Parallel Processor
----
. Heimbs S, Strobl F, Middendorf P, Gardner S, Eddington B, Key J
(MPP), a landmark supercomputer system that became
() Crash simulation of an F racing car front impact struc- the first commercially successful MPP, and the first to be
ture. In: th European LS-DYNA Conference, Salzburg, Austria, used in production data centers around the world. It dis-
May  cusses the historical context leading to the development
. LS-DYNA theoretical manual () Livermore Software Tech- of the TE and its predecessor system, the CRAY TD,
nology, Livermore, CA
and the significance of the TE to the computational
. Kondo K, Makino M () Crash simulation of large number of
elements by LS-DYNA on highly parallel computers. Fujitsu Sci science and engineering community.
Tech J ():–
. Yih-Yih L, Wang J () Performance of the hybrid LS-DYNA
on crash simulation with multiple core architecture. In: th Overview
Europena LS-DYNA Conference, Salzburg, Austria
. Li-Xin G, Gong J, Jin-Li L () Three dimensional finite element
The Transition from Vector Systems
modeling and front crash process analysis of car bodywork. Appl to Massively Parallel Systems
Mech Mater –:– In , the first CRAY- supercomputer was shipped to
. Haug E, Scharnhorst T, Du Bois P () FEM-Crash, Berech- Los Alamos National Laboratory, setting a new bench-
nung eines Fahrzeugfrontalaufprall. VDI Berichte :– mark for general purpose supercomputing. Although
. http://www.mpi-forum.org
Cray Research was a very small company with less
. http://en.wikipedia.org/wiki/Athlon
. http://topcrunch.org/ than  employees at the time, Seymour Cray and
a small team had already started down what would
turn out to be a rather difficult path to design a suc-
Cray MTA cessful follow-on product, the CRAY-. The success of
the single-processor CRAY- came from a combination
Tera MTA of high-performance chip and packaging technology
together with an effective streaming vector architec-
ture. Operations achieved a large parallel speedup, pro-
Cray Red Storm vided that long sequences of data could stream through
highly segmented “vector pipes” consisting of special
Cray XT and Seastar -D Torus Interconnect vector pipeline registers feeding highly segmented func-
Cray XT and Cray XT Series of Supercomputers tional units. This architecture was a natural fit for many
 C CRAY TE

problems that simulate extensive n-dimensional com- microprocessors with on-chip multi-level cache. The
putational space. However, successful speed-ups did personal computer and workstation markets embraced
require management of an extra layer of architectural this solution, which drove down production costs by
complexity – the efficient loading and unloading of the orders of magnitude, but there were serious difficul-
vector registers from, and to, main memory. ties for supercomputer designers trying to exploit this
Seymour Cray always saw simplicity as a mark of opportunity. High production volumes precluded any
architectural elegance. So for the new project, he nat- alterations to the internal design of the CPU for the
urally began to think about alternatives to the vector very high performance market. Due to a very limited
registers. Perhaps these registers could be eliminated pin count, the balance of functional unit performance
and performance could be more than made up by a with the delivery of data from outside the device was
large number of CPUs sharing a common memory seriously tilted in the wrong direction. If that were not
space. (In current parlance multiple integrated CPUs enough, there was one more hurdle: The single chip
are now usually called “cores.”) So it was that the first processors were getting high performance in no small
iteration of the CRAY- design would have up to  measure from onboard cache subsystem. This created a
scalar “M” processors, each handing off intermediate nightmare for managing coherence of data across a large
-bit results to similar processors throughout a shared- multiprocessor system. Yet the microprocessors were
memory architecture. I/O work would be managed by becoming impossible to ignore as internal clock rates
a cluster of -bit “A” processors also connected to this were becoming incredibly fast and at ever lower costs as
shared-memory system. Significant architectural and generation after generation followed Moore’s Law. From
detailed logic designs for the “M” and “A” processors the supercomputer designers’ perspective these upstarts
were completed, and both an experimental operating were extremely frustrating!
system and a Fortran compiler were running in a simu- The CRAY TD was the first serious attempt to
lated environment. have at least some success in influencing the I/O prop-
Then one day this entire phase of the CRAY- work erties of a commercial -bit microprocessor so that
came to an abrupt halt. It had become apparent after large numbers of processors could cooperate efficiently.
consultation with users of the CRAY- that there were Additional special purpose hardware around each CPU
huge challenges to effectively program the proposed to fill in the gaps. CRAY TE came incrementally closer
CRAY- for anything but the most regularly organized to the mark. Eventually and inevitably, microproces-
problems. Memory management and data coherence sors for desktop computers hit some of the same fun-
issues were unfamiliar and intimidating. Effective com- damental obstacles that supercomputer designers had
piler technology even for the vector-based CRAY- was faced decades earlier. Desktop (and even notebook)
only at an immature state at the time. The CRAY- CPUs became multi-core. Seymour Cray’s early hard-
project was in very real danger of having operating ware vision was at last shared, and its challenges faced,
hardware available but without nearly adequate system by a much broader spectrum of computer systems
software structures and with no clear timely path to designers.
discovering them. For a small new company in need Equally critical to the evolution of massively parallel
of a strong successor story to help sell its first prod- systems were significant advances in operating system,
ucts, this was not a promising formula for success. There compiler and applications software technology. The suc-
would be more starts and restarts of the CRAY- project cess of the CRAY TE was due in large measure to a
before it could successfully fill a new application space unique collaboration of hardware, system software, and
using a moderate number of vector processors but with user applications designers from academia, government
a competitively larger memory subsystem that was able laboratories, and private industry. The programming
to exploit dense dynamic memory chip technology. In obstacles encountered by Seymour Cray’s first CRAY-
fact, the first production. project had finally been overcome.
Cray designers were sensitive to the notion that
there are CPUs that run “fast” and then there are The Cray TE
computers that solve problems fast, and it was The Cray TE, which first shipped in  to the
becoming clear that fast CPUs would be single chip Pittsburgh Supercomputing Center, was Cray Research,
CRAY TE C 

Inc.’s second-generation MPP supercomputer, the follow- impact of the significant increase in the number of
on product to the first-generation Cray TD. The TD hardware components necessitated by the TE architec-
and TE were the first Cray Research supercomput- ture had generated a requirement that the TE should
ers to use commodity processors instead of in-house- be able to ride through PE failures. But, because UNI-
designed custom CPUs. The TE was also a departure COS/mk was conceived as a distributed system, a com- C
from the TD because it was self-hosted, had a very pute PE failure was easily managed.
different interconnect architecture, and an entirely new The TE software architecture had two main com-
I/O subsystem. ponents: “system” PEs and “application” PEs, whose sole
In a technology world driven by Moore’s Law, the purpose was to run user applications code. There were
TE enjoyed a relatively long service life: It remained three specialized types of system PEs: OS PEs, Com-
in production for over  years with some systems still mand PEs, and I/O PEs. Command PEs were used
in use in . It was popular with users primarily for user login and job launch. The user interface on
because of its good communication characteristics and a command PE was the standard UNIX interface.
its high ratio of bandwidth between processors, rela- The operating system ran only system PEs distributed
tive to the performance of the processors when com- throughout the machine at a ratio of  system PE
pared to the ratios achieved on simple clusters. This for every  application PEs. System PEs could also
made it much easier for users to attain high efficiency flexibly be substituted for failed user PEs as neces-
and good scaling performance on parallel applications. sary to maintain a full complement of application PE’s.
In addition, the large distributed memory proved very The ability to map in replacement PEs and eventually
valuable to users. The TE was also valued by data repair, reboot, and reintegrate failed PEs, without the
center managers for the efficiency of its operating sys- need to reboot the whole system, was a tribute to the
tem and the high utilization that could be sustained effectiveness of the distributed hardware and software
on high-performance computing (HPC) production architecture.
workloads. The CRAY TE was available in two versions that
The TE was “self-hosted,” running a distributed varied by cooling technology and system scale. The
version of Cray’s Unix operating system, UNICOS/mk. liquid-cooled (LC) version scaled to  user process-
UNICOS was the Unix port to Cray hardware that ing elements (PEs) while air-cooled (AC) systems scaled
had been in use since the early s on the CRAY- to  user PEs. The first system was introduced using
, CRAY-, and all later models of Cray parallel-vector the  MHz version of the DEC Alpha EV micropro-
systems. The/mk designation was taken from Chorus, a cessor. This first release was later dubbed the TE-
company that produced a Unix microkernel, which was (referring to the peak MFLOPS performance of each
used as the basis for the Unicos microkernel and server Alpha processor). Subsequent product versions tracked
structure. UNICOS/mk was unique because it incorpo- Moore’s Law improvements in the performance of the
rated some of the latest operating system ideas from microprocessor and the EV was supplanted by the
both microkernels and servers and also maintained the advanced-process EV (A) resulting in increased
UNICOS user interface across a distributed memory. clock speeds of  MHz (TE ),  MHz (TE
Essentially, the system was standard UNICOS reorga- ), and  MHz (TE ). Memory per PE could
nized into servers so as to support different varieties of be configured in sizes ranging from  Mbytes to 
Cray hardware, different scales of system size, and all Gbytes, for a maximum aggregate system memory of
the features needed to meet existing requirements for TBytes.
usability, performance, and manageability. In traditional Cray style, the TE was designed to
The distributed nature of both the operating sys- be a balanced system and, courtesy of the CRAY Model
tem and the hardware allowed UNICOS/mk to pro- F I/O subsystem, could be configured with prodigious
vide improvements in resiliency. Distribution of OS I/O throughput that was commensurate with its com-
services simplified the software that ran on any par- putational capability. This distributed I/O system used
ticular processing element (PE) and made it easier to Cray’s GigaRing network channel to string together
isolate problems. The resultant simplification also made chains of I/O PEs hosting channel adapters attached to
the software more robust. Concerns about the reliability external disks, tapes, and communications networks in
 C CRAY TE

fault-tolerant ring topologies. Each GigaRing channel started over with the DEC EV, but some concepts from
was capable of sustained data transfer rates of over  the MicroUnity experience, notably the aggressiveness
Mbytes/s and the system could employ  GigaRing for of the SCI channel signaling and ring architecture, sig-
every  PEs in the system. nificantly influenced Cray’s designers and were incor-
In , a  PE TE- became the first com- porated into the GigaRing I/O channel and the design
puter to sustain more than  TFLOPS (Trillion Float- of the TE communications link.
ing Point Operations Per Second) on a real application The DEC Alpha EV superscalar processor was a
code, LSMS (Locally Self-Consistent Multiple Scatter- large step over its predecessor. Many of its performance
ing), a first-principles metallic magnetism simulation characteristics that were important to the TE were a
code written by Oak Ridge National Laboratory and result of improvements to its memory interface, includ-
optimized by Pittsburgh Supercomputer Center for the ing support for up to two outstanding  byte cache
TE. An earlier MPP system, known as ASCI Red, was line fills and the addition of a -Kbyte on-chip second-
already installed at Sandia National Laboratory, and in level cache. Since the TD and TE both emphasized
June  was the first system to demonstrate a peak per- bandwidth improvement over latency reduction, no
formance of . TFLOPS, but the TE was able to deliver board-level caches were provided in either system.
sustained performance in excess of  TFLOPS from a
system with a peak performance of . TFLOPS. Global Memory
At the time there was considerable concern about mem-
ory management and memory use in MPP systems, and
Architecture Cray focused considerable attention on issues such as
Processor Choice memory mapping, translation look-aside buffer (TLB)
Initial work on the TE architecture and design started misses, communication performance between PEs and
in , more than a year before the first shipment of I/O. The features that are described here provided a
its predecessor, the TD. Although the TD used the set of tools for both system and applications software
DEC EV microprocessor and Cray was well aware that made the new environment much more easily used
of DEC’s next-generation, more powerful, EV design than expected, and in some ways better than on more
they nonetheless investigated alternative processors for conventional systems.
the TE and initially favored a design being proposed The TE followed in the design footsteps of the
by the start-up company MicroUnity. That proces- TD and implemented a distributed memory architec-
sor, code-named Terpsichore, would have been imple- ture, but with a shared global address space (GAS) that
mented in MicroUnity’s own Bi-CMOS technology and allowed fine-grain communications between processors
was designed to use several advanced single instruction without hardware-imposed cache coherency for remote
multiple data (SIMD) architectural features to enable memory. Conventional distributed memory comput-
very high performance on regular vectors of data. Terp- ers have isolated local memories in each PE that can
sichore was connected to memory and peripheral chips only be directly referenced by the local processor(s),
using multiple high-speed ring networks derived from with data communications between PEs accomplished
the emerging Scalable Coherent Interface (SCI) chan- using an I/O-like mechanism. These designs incur rela-
nel standard. If successful, it would have run at a then tively large transfer overhead and start-up penalties that
unheard-of  GHz clock frequency and would have pro- dramatically reduce both efficiency and performance
vided the very large address space and latency-hiding on small or nonuniform access patterns. For example,
features necessary for large-scale distributed systems. the IBM SP- MPI ping-pong latency was greater than
Unfortunately manufacturing process problems, exces-  μs, compared to about  μs for the TE. The TD
sively high power dissipation in test chips, and pressure and TE were designed to support direct access of other
from majority investors to target wider, more-easily PEs’ memories using loads and stores (or intermediate
attained markets, such as television set-top media pro- “puts” and “gets”), enabling higher payload-to-overhead
cessors, caused delays and ultimately abandonment ratios on the sparse communications patterns that are
of the Terpsichore chip. Consequently, Cray quickly important to many scientific algorithms.
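Ping-pong latency figures such as those quoted above are obtained with a simple two-rank message exchange. The following is a minimal sketch of such a micro-benchmark written against the standard MPI interface; it is a generic illustration, not the original measurement code.

/* Minimal MPI ping-pong latency sketch (generic micro-benchmark).
 * Run with exactly two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    char byte = 0;
    const int reps = 1000;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half of the round-trip time */
        printf("latency = %g us\n", 1.0e6 * (t1 - t0) / (2.0 * reps));

    MPI_Finalize();
    return 0;
}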
CRAY TE C 

The EV (aka, ), the second-generation “Alpha” stores to remote memory used an external address seg-
-bit architecture processor from Digital Equipment ment register set called the annex. Latency hiding was
Corporation, was the world’s most powerful micropro- accomplished by using either a prefetch queue (PFQ) (a
cessor in its day but, like all microprocessors intended set of first-in, first-out registers that could be the targets
for desktops and conventional servers, its memory of “special” loads), or an external block transfer engine C
interface was designed to operate in an environment (BLT) that could move memory autonomously, but
where the memory was physically close (on the same imposed significant overhead penalties. TD memory
printed circuit board), small (a few Gbytes at most), and management also required that all PEs participating in
accessed using address patterns that made profitable a computation have the same local mapping of memory,
use of large caches as a latency-avoidance mechanism. causing a page miss in one processor to require identi-
Unfortunately, none of these characteristics are present cal fix-ups in all PEs. While helpful, these mechanisms
in large-scale distributed global memory architectures could be confusing and poor choices were punished by
like the TE, so a major architectural challenge was to poor performance. Overall, achieving high efficiency
provide external extensions to the address space, the was difficult unless the programmer deeply understood
memory management, and the latency-hiding capabil- and carefully planned around the performance quirks.
ity. A maximum scale TE might have over  Tbytes
of addressable memory, beyond the addressable range
of the EV; conventional memory management trans- E-Registers
lation look-aside buffers (TLBs) could only map a very The TE architecture simultaneously simplified the
small fraction of the TE global address space, making way global shared memory was supported and greatly
significant performance delays due to TLB misses on increased performance through the introduction of a
every memory reference a near-certainty; and, latency large set of registers external to the microprocessor,
hiding was required because the memory of a remote PE called E-registers. This set of  user-level registers plus
might be thousands of clock periods away, thus stalling  system registers was mapped to a section of the
a processor that was waiting on a data load. EV’s memory space normally reserved for memory-
The Global Segments (Global Memory Segments, mapped I/O devices. Data could be moved efficiently
GSEGs) capability was provided to help with remote between E-registers and processor-internal registers
memory access. The operating system was able to con- using standard loads and stores. The EV automatically
vert a user process I/O buffer address to a GSEG and merged sequential loads or stores to I/O space into four
pass it through the service PEs to the GigaRing device -bit word transfers that maximized the external chip
processors directly. The GigaRing devices were then data bandwidth.
able to use the address and do a zero-copy RDMA to A meta-instruction set controlling E-register oper-
or from the user’s buffer directly to a device. This was a ations was defined that could be issued by the proces-
mechanism that was as simple and high performance sor by writing to additional special memory-mapped
as the previous I/O subsystems used by Cray vector addresses. The particular address used represented the
mainframes. appropriate op-code, and instruction arguments could
Although the TE team had to freeze the design very be passed in the data.
early in the life of the TD, much was learned from Possible E-register operations included “gets,” which
the predecessor system and the accumulated experience would accomplish a data load from local or remote
of compiler, library, and application developers as they memory into an E-register, and “puts,” which would
struggled to generate efficient and scalable code for a store data from an E-register into memory. To fetch data
distributed memory architecture. As a first-generation from a remote PE’s memory, the programmer would
design, the TD had provided programmers a variety issue a “get” by storing the desired virtual address in the
of mechanisms to extend the addressing and memory encoded memory address. The destination E-register
interface limitations of the Alpha microprocessor and would be implicitly specified in the op-code. Circuitry
enable a processor to read or write data from a remote external to the EV then translated the global virtual
PE’s memory. Address extension for direct loads and address to determine the physical PE containing the
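The get/put style of one-sided remote access that the E-registers implemented in hardware is the model Cray later exposed to programmers through its SHMEM library. The sketch below uses the OpenSHMEM-style C interface purely as an illustration of that programming model; it is not T3E-specific code, and the variable names are invented.

/* One-sided put/get sketch in OpenSHMEM style, illustrating the
 * programming model only (not T3E or E-register code). */
#include <shmem.h>
#include <stdio.h>

long counter = 0;          /* symmetric: exists on every PE            */
long value   = 0;          /* symmetric target/source for put and get  */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    value = 100 + me;                      /* each PE holds its own value */
    shmem_barrier_all();

    long remote = 0;
    int  right  = (me + 1) % npes;

    /* "get": load a word from the neighbor's memory into a local variable. */
    shmem_long_get(&remote, &value, 1, right);

    /* "put": store a word directly into the neighbor's memory. */
    shmem_long_put(&counter, &remote, 1, right);

    shmem_barrier_all();
    printf("PE %d read %ld from PE %d; my counter is now %ld\n",
           me, remote, right, counter);

    shmem_finalize();
    return 0;
}

Libraries of this kind also expose atomic operations (such as fetch-and-increment) comparable to the memory-based synchronization primitives described in the following sections.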
 C CRAY TE

desired data and forwarded the request over the net- those achieved in server designs using the same pro-
work. The returning data would land in the reserved cessor but with conventional board-level caches (
E-register, to be subsequently loaded by the proces- Mbytes/s compared to  Mbytes/s).
sor from its memory-mapped location. As many “gets”
could be outstanding as there were available E-registers
Synchronization
to act as destinations for the returning data, thus provid-
E-registers enabled advanced memory-based synchro-
ing an efficiently pipelined latency-hiding mechanism.
nization primitives including atomic memory opera-
E-registers also supported advanced memory-based
tions (such as fetch and increment, compare and swap,
synchronization primitives including atomic memory
and masked swap) and user-space messaging queues.
operations (such as fetch and increment, compare and
Synchronization was further enhanced using a virtual
swap, and masked swap) and user-space messaging
barrier network that allowed arbitrary membership, and
queues.
had the ability to send messages over embedded virtual
Using E-registers enabled the TE to sustain
spanning trees in the data interconnect network.
network-saturating data transfer speeds even on global
memory reference patterns with large non-sequential
address “strides.” Network
Perhaps the most remarkable and visible attribute of
the TE was its D torus interconnect network. In this
Memory Management scheme PEs are physically connected to their nearest
The TE implemented a unique global memory man- neighbors in three dimensions and the faces of the
agement and virtual address translation scheme that resulting logical cube are “wrapped around.” The router
supported implicit data distributions provided by lan- in each PE consisted of a fully adaptive router switch
guages like Fortran-D and Cray’s CRAFT, as well as steering traffic among seven pairs of high-speed (
allowing independent and different mappings of each Mbytes/s) channels: carrying data simultaneously in
PE’s local memory, even as it is shared and viewed as both directions of three dimensions, plus a connection
globally contiguous. This was accomplished by using to the PE. Traffic was routed according to predeter-
a hardware “centrifuge,” which could separate PE- mined routing tables in direction order: +x, followed
designating bits and PE-oriented address bits from by +y, followed by +z, followed by −x, −y, and −z. Five
the virtual address, and remote translation of the vir- virtual channels were utilized: Four to prevent dead-
tual address to the physical address within the remote lock and one to enable adaptive routing. The routing
PE’s memory. The remote translation mechanism also algorithm was designed to send each packet along the
implemented an “atomic memory mover” that allowed shortest possible route, but adaptive routing allowed
relocation of memory pages even while the page was packets traveling in the network to route around bro-
being accessed, preventing common memory manage- ken or congested paths by selecting an alternative route.
ment tasks from disrupting computation. The TE’s “minimal adaptive routing” strategy allowed
packets to change the order in which they traversed
the various x, y, and z hops, but the actual path length
Local Memory was not changed. One advantage of a D torus inter-
The TE provided no board-level cache and the EV was connect topology is the ability to route a packet “the
only permitted to cache local memory. Global mem- other way” along any dimension and, in some failure
ory references changing data in local memory were kept cases, “long way around” routing was needed to route
coherent by a back-map that automatically invalidated around faults, but this was accomplished by changing
any cached copy on the processor. Instead of board-level the routing tables.
cache, the TE provided stream buffers that detected The TE network and communications infrastruc-
and pre-fetched sequential streams of data (as are com- ture was optimized for pipelined delivery of fine-grain
mon in bandwidth-intensive scientific applications), messages. Packets traveling through the network could
resulting in sustainable streamed data rates over twice have payloads as small as one -bit word, making the
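The direction-ordered routing just described can be sketched in a few lines. The code below is a simplified illustration that computes shortest-way hop counts on a 3D torus and emits them in +x, +y, +z, −x, −y, −z order; virtual channels, adaptivity, and the actual routing tables are omitted, and the torus dimensions are assumed.

/* Simplified direction-order route computation on a 3-D torus.
 * Illustrative only; the real T3E router used tables, virtual channels,
 * and adaptive shortcuts that are not modeled here. */
#include <stdio.h>

/* Signed hop count along one dimension, taking the shorter way around. */
static int torus_delta(int from, int to, int size)
{
    int d = (to - from + size) % size;
    return (d <= size / 2) ? d : d - size;   /* negative = the other way */
}

int main(void)
{
    int dims[3] = {8, 8, 8};                 /* assumed torus dimensions */
    int src[3]  = {1, 2, 3}, dst[3] = {6, 2, 0};

    int hops[3];
    for (int d = 0; d < 3; d++)
        hops[d] = torus_delta(src[d], dst[d], dims[d]);

    const char axis[3] = {'x', 'y', 'z'};
    /* Positive directions first (+x, +y, +z), then negative (-x, -y, -z). */
    for (int d = 0; d < 3; d++)
        for (int i = 0; i < hops[d]; i++)  printf("+%c ", axis[d]);
    for (int d = 0; d < 3; d++)
        for (int i = 0; i < -hops[d]; i++) printf("-%c ", axis[d]);
    printf("\n");
    return 0;
}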
CRAY TE C 

TE very efficient at random communications support- The LC cold plate was approximately a half-inch
ing sparse matrix algorithms. thick and was configured as a sandwich between two
The router internal clock ran at  MHz, however motherboards supporting a total of  PEs per module.
data was transmitted on the communications link at five AC cold plates supported a motherboard on one side
times the internal frequency ( MHz). The links were and were machined with conventional air-cooling fins C
 bits wide, for a raw bandwidth of  Mbytes/s. Peak on the other, so they only supported  PEs operating at
data payload was around  Mbytes/s after protocol a somewhat elevated temperature, compared to the LC
overheads. The network of the TE provided a bisection design.
bandwidth of over  GBytes/s at  PEs and nearly Standard-package DRAM memory chips were
flat global bandwidth of over  Mbytes/PE over scale mounted separately from the motherboard on four
to that size system. daughter cards per PE. Each daughter card also held an
M-chip, which drove a bank of memory and interfaced
Technology between the DRAM and the C-chip. Pairs of daughter
The TE was implemented using a combination of DEC cards were sandwiched on two thin aluminum plates,
Alpha EV processors, commodity DRAM memory which were then plugged as mezzanine assemblies over
chips, and Cray-designed CMOS application-specific each PE. The memory aluminum cool plates contacted
integrated circuits (ASICs) blended in a custom archi- cold plate plateaus, providing a thermal conduction
tecture using the exotic packaging, power, and cooling path for the heat of the DRAM to the cooling fluid or air,
technologies that were a hallmark of Cray supercom- depending on the LC or AC configuration. No cooling
puters to achieve high-performance in a dense form fans were used in the LC design.
factor. The -PE LC module, including memory mezza-
The custom ASICs were sourced from LSI Logic and nine memory cards, was approximately  by  in. and
used a . micron process and three metal layers to pro- a mere  in. thick. Connectors for the router chan-
vide a maximum of approximately one million gates nel wiring cables ran along both long edges of each
per chip. motherboard, and used a unique zero-insertion-force
A single large printed circuit board (PCB) with more connector design that required insertion of a tapered
than  layers formed the motherboard for a TE mod- metal wedge, after seating the module in the chassis, to
ule. Each PE consisted of a processor chip, a C-chip slide individual pin shuttles into contact with the sock-
that connected the processor to both the local mem- ets on the boards. Each LC module had a blue and red
ory and the interconnect network router, and a router hose (hot/cold) with quick-disconnect inlet and outlet
or R-chip that completed the implementation of a sin- connectors to plug into Fluorinert manifolds mounted
gle PE in the TE’s D torus network. Four PEs were on each side of the rack.
laid out on each motherboard, and motherboards were The AC chassis could support up to  four-PE mod-
attached to a “cold plate” for cooling. In the LC configu- ules per chassis, and two chassis could be connected
ration, the cold plate was an aluminum heat exchanger together to create a max-scale  user PE system. The
through which a chilled inert fluorocarbon liquid (M’s LC system had considerably higher density, with 
“Fluorinert”) was circulated. The cold plate had a series module slots to accommodate up to  user PEs per
of machined “plateaus” on the surface that protruded chassis. Power supplies, pumps, and Fluorinert/chilled
through holes in the PCB beneath each chip, thus mak- water heat exchangers occupied perhaps two thirds of
ing direct contact with the chip. Heat was conducted the chassis, with a single stack of modules in the front.
through the plateau into the cold plate and removed Flat cables, one per router channel, were strung
through an external chilled water heat exchanger. between the modules in the chassis (the design of the
Fluorinert, though quite expensive, was used for torus network was “folded” so no cable ever had to
the closed-loop in-chassis cooling cycle because it has cross over to the opposite module side) on the sides of
the property of being electrically nonconductive and the chassis. In larger configurations (greater than 
noncorrosive, so no damage to the densely packed elec- PEs), multiple chassis would be arrayed in an alternat-
tronics would occur in the event of a leak. ing line, with every-other chassis facing the opposite
 C CRAY TE

direction and overlapping by the depth of the mod- At Cray, the view had always been that the very high-
ule stack, “cheek to cheek.” This allowed torus network est performance systems, pushing the limits of tech-
connections between chassis with the shortest possible nology, would never be as reliable as more common
cable connections as well as enabling side access to the computers. Cray was designing at the leading edge of
pumps, power supplies, and heat exchangers in the rest technology, and reliability was inevitably compromised
of the chassis. From overhead, a large TE looked like a in favor of higher performance, in the same way that
very large zipper. specialist high-performance cars are often less reliable
With no fans, the sound level of a full-scale LC than family automobiles. Traditionally, Cray customers
TE was considerably less than the AC version, or any had an “early-adopter” culture whereby they were will-
contemporary conventional server racks. ing to accept a degree of risk in return for the earliest
possible access to new, more powerful, systems. How-
Operating System Software ever, their users were now raising the bar. The new
The development of a distributed operating system for expectation was that systems had to be for what had
the TE was a significant effort for a small company like previously seemed like long periods of time – now mea-
Cray. The transition from UNICOS on vector systems to sured in months rather than weeks. This trend had been
UNICOS/mk on massively parallel systems was a major in progress for several years, but in the s the market
effort for the Cray software group. In order to meet the requirement was rising sharply. Hardware was becom-
schedule, average of  people worked on the system ing more reliable at the component level, but massively
for more than  years, while a skeleton crew of  senior parallel systems contained many more components, and
people was assigned to manage and support UNICOS the pressure was on software to find methods that would
on Cray’s entire customer base. This was an extremely make the overall system more reliable.
high-risk approach. In effect, the Software Group The advent of UNIX had brought a change to the
was betting on the extremely high reliability of the way programmers thought about systems. The use of a
installed UNICOS systems, which proved to be a sound high-level implementation language (C) and standard-
decision. ized interfaces was a very different approach from the
The competing requirements of supporting a new assembler-programming techniques that most compa-
hardware architecture while also carrying forward an nies and laboratories had used in the s. However,
existing customer base presented significant challenges. the actual evolution of UNIX was surprising to its orig-
Some of the new ideas for the Cray Parallel Program- inators: Rather than collaborate on an industry stan-
ming Environment were very different from the inter- dard, each vendor produced its own flavor of UNIX.
faces and functionality that had been in use on vector All the companies that adopted UNIX were under pres-
mainframes. All of this contributed to the uncertainty sure to differentiate their products, and porting features
experienced by the software system designers as they from a proprietary OS that their customers had come
debated how to prepare for a new machine that was to depend on was one way to differentiate. But this
expected to run a whole new set of applications. approach led to divergence of UNIX implementations,
and at this point, in the early s, there was no Linux
Historical Perspective with its common kernel. There were some industry
By the early s, there was an ongoing sea change standards but they did not cover as many interfaces
in supercomputing system software. The first wave, in as were in common use, and they did not cover the
the mid-s, had been the transition from vendor- machine-dependent variations.
proprietary operating systems to UNIX. Despite early However, in many cases these vendor-specific
skepticism, the Unix operating system and environment additions to UNIX constituted useful improvements.
had successfully replaced the traditional operating sys- For example, the original UNIX did not have the
tems of the mainstream supercomputer vendors. Fur- performance-oriented features needed for supercom-
thermore, as supercomputers moved from research labs puters, such as a fast file system or a tunable process
to production engineering groups, a new expectation of scheduler. UNIX also lacked a number of features, such
higher reliability was sweeping through the industry. as job accounting, that were necessary for managing
CRAY TE C 

large shared computational resources. Operating sys- as MACH did, so it was possible to port Chorus and
tem developers were quick to resolve these problems UNICOS to a Cray vector system. The experiences of the
by adding features to the kernel. Even though the ker- TD team with the difficulties of MACH and the deci-
nel was not really the place where much of this code sion to make TE self-hosted made Chorus the obvious
belonged, expediency and the pressure to deliver per- choice. The final problem with MACH that caused Cray C
formance and resiliency were the rationale for making to abandon it as a potential microkernel for self- hosted
this entire code kernel based. systems was a set of fundamental security issues that
The consequence of all this feature addition was to were also being experienced by the developers of the
take a lightweight kernel, roughly , lines of code Open Software Foundation’s OSF/AD, which was also
in AT&T System V circa , and create multimillion based on MACH.
line kernels by the early s. As a result of its vector and TD experience, Cray
then began to look at a combination of Chorus and
The Evolution of Microkernels UNICOS as not just a potential OS for future MPP sys-
At Cray, by the early s, the UNICOS kernel had tems, but also as a next-generation implementation of
grown to more than , lines of code. This was UNICOS on vector systems. The big difference between
far too much for even the best developers to keep the TD, which comprised an MPP running MACH
in their heads, and ongoing software reliability was a front-ended by a vector machine running UNICOS, and
concern. So the development team began to look at the TE was that the TE MPP would run as a com-
microkernels with the goal of separating the kernel plete single system without front-end support. The TE
components and creating a set of “firewalled” functions would be “self-hosted”; simply running a microkernel
with defined interfaces. Microkernels were a new idea on all the MPP PEs would not suffice. In addition, a
at that time. The Defense Advanced Projects Research primary objective of the TE software design was to
Agency (DARPA) was actively supporting a microker- present a single system image, defined as a single Pro-
nel called MACH, developed at Carnegie Mellon Uni- cess Identifier space (pid space) and single file-name
versity, but there was another microkernel, Chorus, that space, to the user, and to achieve this goal the oper-
had originated at the French national research insti- ating system would have to be distributed across mul-
tute, INRIA, and had been adopted by AT&T. There tiple MPP PEs. Thus the microkernel was conceived
was much discussion at Cray about the relative merits to be the low-level PE OS responsible for providing
of the two microkernels and much speculation about communication between all the PEs. But, the big depar-
the potential performance loss that was expected from ture was to take the rich set of UNICOS services and
using either of them. The performance loss was certain distribute them in “stacks” across MPP PEs depend-
because this architecture required multiple additional ing on the software-defined PE type and the need for
context switches between user, microkernel, and oper- system services. These ideas were speculative at the
ating system services. The question was whether the time. No previous high-performance computing (HPC)
performance loss would be an acceptable price to pay company had delivered a distributed system, and there
for the benefits. were a number of valid concerns starting with the
When the TD project started at Cray, the Software potential performance of a single HPC application dis-
Group was reorganized into multiple OS development tributed across many PEs and the performance impact
teams, based on the TD, the TE and future Vector Sys- of PE-to-PE communication on a mixed scientific
tems, each with its own views of the evolving software workload.
world. The TD developers chose to use a modified ver- The Cray software group as a whole was divided
sion of the MACH microkernel for the back-end MPP, almost into thirds in their views of the wisdom of this
and were successful in producing a viable MPP. The direction. A little more than a third of them wanted
Vector team elected to build a prototype version of UNI- to go forward with the microkernel-based UNICOS
COS (Cray’s proprietary UNIX operating system) on plan; a little less than a third were convinced that this
top of the Chorus microkernel. Chorus did not require technological approach would fail; and about a third
virtual memory (unsupported by Cray vector system) thought the whole MPP concept was doomed anyway.
 C CRAY TE

The people at DARPA who had been funding MPPs Serverization


and the development of MACH were also extremely The following diagram shows the structure of the OS
unhappy that Cray had decided not to continue stack and all the possible servers (Fig. ).
with MACH. This simple example of server stacks shows compute
PEs with minimal services and OS PEs with complete
stacks of services (Fig. ).
Distributed Software Architecture A useful way of demonstrating the contents of both
There were serious technical issues in dividing UNI- service and applications PEs was to use the Cray D
COS into servers (serverization). The organization of viewer. Here are three examples of different views of two
the Unix system was built around the concept of files, support PEs and a command PE in the torus, and the
and separating the process management services from processes running on those PE’s (Figs. –).
the file system, while also creating a simple interface, The system PEs did not allow user processes to
proved difficult. The idea of firewalling the microkernel run. Users logging into the system were automatically
from the servers was dropped because of performance directed to a command PE where the user interface
issues in the context switching. Without the firewall appeared to be standard UNICOS, but the underly-
the concept of being able to reboot individual services ing system was actually distributing the user’s processes
instead of the whole OS (on a node) was not a reality. across multiple command PEs as needed. Compute PEs
This made some of the resiliency hopes less possible. were only able to execute applications that had been
Still, the idea of rebooting individual PEs, instead of the submitted to the resource manager/schedulers.
whole system, was possible and was an important step The distribution of system services meant that some
toward a resilient HPC system. set of PEs became “system PEs.” This was hotly debated
Besides resiliency, there were other compelling rea- because the market expected that a TE with “N” PE’s
sons to go forward with a distributed UNICOS oper- would allow a single application to run on “N” PEs
ating system on the TE. First, the back-end TD-style simultaneously: Cray customers had become accus-
of MPP had been difficult for programmers to run and tomed to the idea that a user application could access
debug applications. A back-end system, by definition, all the user-programmable hardware resources of the
is always more opaque than a system with which the machine in a single execution. Thus system PEs could
user interacts directly when testing a code, and the lim- not be included in the advertised processor count and
itations of the TD in this respect were a concern for needed to be “extras.” The fact that the system PEs could
Cray customers. Second, UNICOS had evolved on vec- not be used by applications was irritating to some users,
tor systems to become a very stable, high-performing, who saw them as pure overhead, and the cost of these
full-featured HPC Unix system. The customers liked it extra PEs also had to be considered.
and the idea of a UNICOS version running on the TE However, reality was that operating systems services
was well received. In addition, the performance of a are essential to the running of any application, but the
microkernel-based UNICOS running on a single Cray cost of such services on conventional mainframes had
vector CPU had proven to be much closer to the perfor- traditionally been accrued as system overhead. Also, on
mance of native UNICOS than most people had thought mainframes, this overhead was a variable number that
possible. Finally, there were not a lot of alternatives. This depended on application characteristics, whereas the
technological direction was undoubtedly risky, but the TE serverization approach essentially allocated a fixed
whole TE program was risky. So the company decided proportion of total system computational resource to
to go forward, and carefully manage the risk by estab- system services. The resolution was to peg the number
lishing well-defined milestones and deliverables. As is of system PEs, as a percentage of total PE’s, to the over-
often the case, there was a protracted debate over a less head of a non-distributed OS on a shared-memory vec-
critical issue, the OS name, but UNICOS/mk stuck. It tor system. As a result Cray decided, based on the mea-
was naming similar to the Chorus-enabled version of sured performance of the installed base of CRAY C
Unix from AT&T and it preserved the UNICOS brand. vector systems, to use no more than one in  TE PEs
CRAY TE C 

CRAY TE. Fig.  The Unicos/mk Software Stack – The microkernel and all the OS servers

(.%) for dedicated system purposes, and to include reduced the number of different activity types on the
these PE’s as “free” PEs that were not included in the OS PEs, and thereby reduced the number of interactions
MPP processor count. Thus a  PE TE would actu- between different services, which had been the source of
ally comprise  computational PE’s and up to  PE’s many failures on vector systems. Focusing the tasks allo-
dedicated to OS services, for a maximum of  PE’s. cated to any given PE and simplifying the services made
All system services ran on the service PEs, includ- the system inherently more resilient and easier to debug.
ing PEs used by users logging into the system on “login” The use of OS PEs also allowed for the develop-
PEs. A few extra PEs were also kept available as replace- ment of Global and Local service divisions. This was
ments for failed compute PEs, which might be lost used extensively in process management. The local
between planned maintenance outages. The operator process manager (PM) could respond on its own to
could repurpose the replacements and, later, a PE reboot a getpid(), but would forward a kill() to the
feature was added that allowed operators to restart Global Process Manager (GPM). This global/local split
failed compute PEs. The TE cabinet to support this sys- was popular because it improved local performance and
tem organization held a maximum of  PEs:  user reduced the amount of communication.
PEs;  system and/or redundant PEs. Another example of local versus global servers was
As part of scaling, and also of distribution, the stack the development of the File System Assistant (FSA).
of OS services on an OS PE was differentiated. The most File system requests are dominated by read and write
obvious difference was “Command” PEs that were used requests. The extra requests to OS servers from compute
for user logins. But other OS services were separated or PEs and the copying of data across multiple PEs was
replicated onto OS PEs. Depending on the system size something to avoid. The FSA communicated the user’s
and configuration, there could be different numbers of open() request to the file servers. In the file server’s
these specialized OS PEs. OS PEs could be allocated to reply to the FSA open request there was a disk server
services based on the utilization of the service within address and a set of disk blocks for the file. Depending
the system. I/O device support was often configured on on the request type there could be multiple disk servers
I/O channel-connected PEs with no other services. This and multiple block allocations. Read and write requests
 C CRAY TE

CRAY TE. Fig.  The Unicos/mk OS deployed on a system showing the bulk of the OS servers on OS PEs and a light
weight set of servers on the Compute PEs

CRAY TE. Fig.  A snapshot of a system in the left-hand window and the actual OS servers in the right-hand window.
This shows the main OS servers on one of the OS PEs
CRAY TE C 

CRAY TE. Fig.  A snapshot of a system in the left-hand window and the actual OS servers in the right-hand window.
This shows the file system, disk, and other I/O OS servers on one of the OS PEs

CRAY TE. Fig.  A snapshot of a system in the left-hand window and the actual OS servers and user processes in the
right-hand window. This shows the OS servers and user processes on one of the Command PEs
 C CRAY TE

from the user could now bypass the file server and be needed to be fast, accurate, and ensure that no compute
sent from the FSA directly to the disk servers. The file PEs were loaded with more or less than a single run-
server was responsible for managing and guaranteeing able process. A simple algorithm was developed to make
coherency. a serial array of the three dimensions of the compute
There were a number of key optimizations of the PEs in the TE torus. This made allocation and man-
TE serverized system. The distribution of global pro- agement much simpler and defied earlier predictions of
cess management and file I/O was primarily via system- poor placement. UNICOS/mk supported space-sharing
call forwarding from either compute or service PEs to from the outset; that is, the ability to split the system into
the process-management and file-server PEs respec- multiple disjoint partitions and run a different job in
tively. The code for these calls was optimized to do as each one. An interesting additional feature was coarse-
few data transfers as possible. The goal was to keep the granularity time-sharing with gang-scheduling within
PE interconnect as free of service requests as possible. a set of PEs. This was the ability to have a partition
The term “piggybacking” was used to describe these load processes from multiple different applications, and
optimizations. Substantial changes were made to the then run simultaneously all the processes from a specific
IPC mechanism to improve performance including the application across the whole partition. This would idle
creation of interrupt threads to reduce the latency of the other applications that happened to share PEs with
thread switches for forwarded requests. Another opti- an active application and, where there was a discontinu-
mization that was important in process management ity in the number of PEs a process used, even more PEs
was the separation of fork from exec. In small Unix might be idled. However, combined with scheduling
systems, often a fork/exec would be the same pro- constraints at the job level, the process/memory sched-
gram replicated, like a shell. In HPC, the exec on a uler was tuned well enough that it was able to overlay
compute PE was guaranteed to be an application code applications across PEs and achieve a remarkably good
not the shell that forked it. This optimization elimi- overall utilization of the compute PEs.
nated the unnecessary copying of process images across
the interconnect network. High-performance applica- Checkpoint/Restart
tion workloads were a performance benefit for the The TE was one of the last systems to have a completely
operating system too. These applications were invari- usable checkpoint/restart capability. At the outset of the
ably compute-bound and made few systems calls, which program, many developers considered this feature to be
helped with reducing overhead and non-application impossible to implement successfully. The number of
communications. processes, the size of the checkpoint image, and the dif-
The amount of available memory on compute PEs ficulty of coordinating the saving and restoring of pro-
was a critical resource. The OS was allowed to use all the cess state all presented daunting challenges. However,
memory on service PEs, but had to be using the absolute the checkpoint feature was an important requirement
minimum on compute PEs. The target was set initially at of several large customers and Cray decided to invest
% of a MB PE and this turned out to be a good tar- in the best implementation possible. This project took
get. Later, other MPP operating systems were found to a long time to complete and was hard to debug com-
be using % or more of compute-PE memory and this pletely, but utilization charts subsequently produced by
became a competitive marketing advantage for Cray. the National Energy Research Scientific Computer Cen-
ter (NERSC), which reported system utilization greater
Schedulers than %, vindicated the effort put into this feature.
Scheduling had been an important part of HPC operat- The interesting technical lesson learned from this
ing systems on vector machines, but the MPP required a project is that a process checkpoint can be viewed as
new scheduler to manage the compute PEs. This sched- a primitive that, once in place, allows the development
uler became important in providing overall system uti- of other features such as process migration. The migra-
lization. There was a performance advantage in locating tion feature had never been considered in the original
processes on PEs that were “close” to the other PEs run- design process, but the checkpoint capability had essen-
ning processes for the same application. The scheduling tially created the potential for the system to package an
CRAY TE C 

active process and restart it at a different location in the issue as application I/O usage definitely lagged behind
MPP array, which is the definition of process migration. the system’s ability to scale.
The availability of process migration is of significant Scaling I/O is a good example of the work done
benefit to a datacenter operation that wants to run an to scale the TE system and user software. The sys-
interactive program development environment during tem could improve scaling by splitting the software C
the daytime hours and a batch processing environment stack onto different PEs. This was “vertical scaling.”
overnight. Many of the big production batch jobs may As discussed previously, this was separating drivers
need to run for several days, and thanks to process onto channel-attached PEs, and file and network ser-
migration they can be automatically suspended, saved, vices onto other OS PEs. Each group of drivers or
and rescheduled in a flexible manner. It also provides file server PEs could be scaled independently, horizon-
an effective mechanism for unscheduled high-priority tally. I/O scaling was also improved by shortcutting
jobs to preempt an existing workload on some or all the global communication between servers whenever
of the compute PEs. This workload scheduling capabil- possible. Finally, in another effort to group requests,
ity was fundamental to the outstanding results achieved users were given new interfaces that allowed lists of I/O
by NERSC. requests across multiple destination PEs.
Overall, the development team was somewhat sur-
System Scaling prised at how easily the system worked. Features that
Despite concerns that the operating system would not they were concerned about mostly seemed to scale with-
be ready in time to support the early hardware deliver- out problems. The hardware was actually helping the OS
ies, the OS software was actually delivered more or less in that the interconnect network was so fast and able
as planned. Because there was market demand for TE to communicate so many small messages quickly that
from a number of TD users and a constrained initial running the inter-service communication mechanism
production capacity, Cray decided to allocate smaller was less of a problem than had been expected. The user
numbers of PEs to each of several early adopters to interface looked like a conventional shared-memory
be followed by upgrade deliveries, rather than allocate vector UNICOS system and was easy to demonstrate
the whole initial production to a single large customer. to customers. Explaining the system organization had
In practice, as the installed systems transitioned from been difficult because the concepts were new, but the
small to larger scale, the software team was able to deal development of a graphical D display of the PEs and
more easily with the inevitable scaling issues than if PE status finally gave a view of the system to users that
they had to deal with a single full-size machine from could be readily understood. Internally, the microker-
day . nel architecture had transformed much of the original
As a general rule of thumb, it seemed that every UNICOS kernel into special processes or threads, thus
factor-of- increment in system size unearthed new allowing the developers to use standard debuggers on
scaling problems. The distribution of process manage- many of the kernel processes. It had made produc-
ment and process running between service and com- tive programmers out of the majority of the Operating
pute PEs scaled pretty easily. There was not a lot of stress Systems Group.
on the process management PEs except when they were Here are a few examples of the D viewer measuring
subjected to specific stress tests. The compute PEs were processor utilization on all the PEs in a system. Note
sufficiently balanced that most applications could run that not all the PEs are running applications in these
without generating much OS overhead and communi- examples. This allowed us to demonstrate activity in the
cation to the process services, even though almost all system and how it changed as applications started and
system calls were forwarded. So, distribution worked stopped (Figs. –).
without a lot of hierarchical scaling, which was a relief.
I/O and networking had been predicted to scale quickly OS Jitter
from one file system PE to many, but this happened One of the clear lessons from the TD related to the
more slowly than expected. There were probably a lot necessity of coping with what later became known as
of reasons for this, but in practice it was not a major OS jitter; that is, the effect that typical general-purpose
 C CRAY TE

CRAY TE. Fig.  A snapshot of a system showing processor activity on different PEs. This system is running several
applications. The snapshot shows application placement and available compute PEs

OS services could have in a thousand-processor col- coarsening the response time, and to synchronize these
lective execution. Because the TD and TE were coarse timer interrupts across all the PEs in the sys-
targeted at tightly coupled problems, the frequency tem. This was made possible by the global clock in
of inter-processor synchronization in applications was the TD/TE, which is no longer typical of high-end
high, often in the range of every  to a few hun- parallel systems. These changes addressed the issue suf-
dred microseconds. At each of these synchronizations, ficiently for the TD, but UNICOS/mk had a decid-
if even one processor was delayed due to some other edly richer set of system services that were used by
OS service, all the remaining processors would wait for daemons, subsystems, etc., and further steps were
the lagging processor to catch up. Even in the TD’s necessary.
constrained back-end configuration and accompany- The TE compute-PE OS stack was limited to the
ing microkernel OS architecture the effect of timer microkernel and a process manager. This reduction
interrupts (used for a variety of purposes, such as in local services ensured that the PE itself would be
dead-man timeouts) that were unsynchronized across “quiet.” This is more important than synchronization.
the whole system often cost a factor of  or more in Synchronizing ensures “lockstep,” but reducing the ser-
performance. vices gives cycles to the user application and reduces
The major changes to address OS jitter on the the need for synchronization. The global clock provided
TD were to drastically reduce the actual frequency synchronization, similar to the TD. This kept time-
of timer interrupts on the compute PEs, in effect related “housekeeping” in parallel, but communications
CRAY TE C 

CRAY TE. Fig.  A snapshot of a system showing multiple applications running and available compute PEs

between services, such as “heartbeat” checks had to interconnect and synchronization meant the building
be carefully examined and reduced as the system scale blocks for a wide variety of approaches could be made
increased. There had always been an ethic at Cray to run fast.
related to ensuring the services did not reduce the time Most of the TD/TE development team (both
and space available for applications. This new environ- hardware and software) came from the legacy of Cray
ment was another lesson in monitoring system over- parallel-vector machines, which had a global address
heads and developing an awareness of their communi- space and shared memory (though not hardware cache-
cations impact. coherence, as there were no caches, and the software
managed the “cache” implemented by the vector reg-
isters). Many at Cray believed that the global address
Programming Environment space provided a significant benefit in both ease of pro-
Programming Models gramming and performance, and wanted to find ways
Cray learned from the TD that there were a variety to expose that appropriately through the prevailing
of programming approaches for MPP systems, and no languages of the day, Fortran and C. This put Cray at
single approach was likely to satisfy everyone, although odds with many other MPP pioneers, who had adopted
message-passing was emerging as the most common. the message-passing approach as being simple and less
Both the TD and TE provided fertile substrates for dependent on specific hardware, compiler, and run-
programming model development and exploration, as time capabilities that might not be available on all
the hardware global address space (GAS) and fast systems.
 C CRAY TE

CRAY TE. Fig.  A snapshot of a system showing applications in different states of starting and completing

Message-Passing transfers ( KB or more), and the ping-pong latency


Cray had implemented the Parallel Virtual Machine was good ( us), but not yet to the  us level achieved in
(PVM) interface on the TD in –, partly because later GAS machines.
of the need for a mechanism not only to communicate
among the distributed processes on the TD but also to SHMEM
communicate between the parallel-vector host system (a SHMEM was a direct result of the fertile hardware com-
CRAYY-MPorCRAYC)andthedistributedprocesses. munication and synchronization infrastructure pro-
Since then the message-passing community had coa- vided by the TD. One of Cray’s MPP performance
lescedandtheMessage-Passing Interface(MPI)standard specialists realized that the TD prefetch queue (PFQ)
was under development. While PVM was still supported could be used to overlap communication with compu-
for backward compatibility with TD customers, most tation, and wrote a small prototype library that exposed
of the TE focus for message-passing development was this capability. He used it in a variety of codes and
on MPI. realized its advantages for fine-grained communica-
The E-register mechanism in the TE hardware was tion. When it came time to implement libraries for the
a much better infrastructure for message-passing imple- TE, the team realized that this interface, now known
mentation than the block transfer engine (BLT) and as SHMEM, was valuable and should be carried for-
prefetch queue (PFQ) of the TD, and the resulting ward. The TE’s E-registers were a major expansion
performance was very good. The MPI libraries could of the capabilities of the PFQ, enabling much more
drive the hardware interconnect at  MB/s for bigger data to be in flight between remote processes and the
CRAY TE C 

requesting process, and so SHMEM was consequently Cray was so constrained by its own financial difficul-
a much more potent performance enhancer. The TE ties that it did not implement HPF in its own compil-
SHMEM implementation exploited the larger num- ers, instead relying on third-party compilers from The
ber of outstanding requests, each for  -bit words Portland Group and others. While HPF’s progress took
of data (compared to a single -bit word for each several years to unfold, ultimately it was unable to pro- C
PFQ request) to deliver the full bandwidth of the TE vide the performance that most MPP users expected,
interconnect ( MB/s) for even small transfers of – and consequently withered away.
 KB and delivering unprecedented -μs latency for However, Cray’s mainstream compiler development
single--bit-word requests. The much better perfor- was not the only source of language-based ideas for
mance of SHMEM reflected in part the close match with using the TE. A TE performance specialist, with help
the underlying hardware and partly Cray’s minimal from the Fortran front-end group, integrated his ideas
experience with optimizing MPI protocols onto GAS for a simple “PE-oriented” approach, first embodied
hardware. in SHMEM, into Fortran in an interface first known
as “F−−”(F minus minus, a parody of C++) and later,
Language-Based Global-Address-Space Models at the request of for a less negative spin, formally
Researchers and developers had proposed and imple- named Co-Array Fortran (a contraction of covariant
mented a variety of language approaches to parallelism. array, a mathematical term). The key innovation was to
OpenMP was becoming a standard by this time, tar- extend indexing operations with another set of square
geted at cache-coherent systems of modest scalabil- brackets that indicated the PE on which data resided;
ity. Vienna Fortran and Fortran D had both proposed indexing without this extra information defaulted to
and implemented early interfaces from Fortran to dis- the local PE. This work was eventually released in
tributed memory systems, and Cray had implemented the late s as part of the Cray Fortran com-
its own version, CRAFT, on the TD. These languages piler, though not fully documented and supported,
had explored different approaches, but had in com- and was later incorporated into the Fortran standard
mon the provision of global variables and a means of in .
specifying their distribution across the PEs of the sys- Similar ideas had been percolating in groups that
tem and the use of those variables. Cray’s experience focused on C more than Fortran. A team from the Insti-
with CRAFT on the TD was mixed: A few develop- tute for Defense Analyses Center for Communications
ers had excellent success in both speed of development and Computing had developed the ac compiler on the
and performance, but the CRAFT implementation was TD, exploiting its PFQ similar to the way SHMEM did,
incomplete in some respects (e.g., there were no col- though integrated with the C language. A team from
lective operations) and not deeply optimized. In addi- Lawrence Livermore National Laboratory had imple-
tion, its portability was limited to Cray systems, and mented the Parallel C Preprocessor (PCP) first on the
many of the most avid user proponents of GAS, espe- BBN TC system and later the TD, and a team
cially those whose workloads were characterized by the at UC Berkeley had implemented the Split-C compiler.
GUPS benchmark, had migrated to C as their base lan- These languages attempted to extend C’s close-to-the-
guage of choice, not Fortran. As Cray was deciding hardware nature with minimal mechanisms support-
what to implement for the TE, the High-Performance ing distributed memory. The ac compiler was ported
Fortran (HPF) group was gaining critical mass, and it to the TE, but the bigger story was that these three
was clear that implementing a GAS Fortran language efforts merged into the Unified Parallel C (UPC) effort,
different from HPF made no sense. Unfortunately, inter- which later delivered compilers that work on a variety
actions between the HPF team and the Cray software of parallel systems.
team were not the best, so HPF did not incorporate all The fertility of the TE infrastructure was proven
the CRAFT lessons, notably the difficulty of optimizing by one last effort, little known at the time, a project
even less-general distributions than were supported by at Lawrence Berkeley Laboratory/NERSC that resulted
HPF and the magnitude of effort required to provide a in the Parallel Problems Server being ported to the
fast comprehensive run-time and libraries. For its part, TE. The tool later was commercialized by Interactive
 C CRAY TE

Supercomputing as Star-P, extending the M language of from the PVP systems, providing a caching layer on
MATLABTM to run on distributed memory top of the typical system libraries. Parallel scientific
clusters. libraries were another strong focus, as the Cray scien-
tific library, LibSci, developed for the PVP systems was
Tools redesigned to a great extent so as to run effectively on
the distributed-memory TE.
Compilers
The TD/TE Fortran and C compilers were hybrid,
Debugger
with the front-ends being those used on Cray’s parallel-
The debugger for the TE was Cray TotalView, which
vector (PVP) systems and the back-end being a heavily
Cray had licensed from BBN for the TD and ported
modified version of the Compass compiler back-end.
to its UNICOS MAX operating system from the orig-
While the use of the Compass back-end quickly deliv-
inal platform on the BBN Butterfly and TC, and
ered the ability to generate decent code for the TD/E
then ported to UNICOS/mk. It supported debugging
PEs, based on the Alpha microprocessors, the shift
capabilities that coped effectively with programs up to
to distributed memory was a more profound step
 cores. Beyond bridging the OS differences, the Cray
for which neither the PVP nor Compass components
development work focused on exposing the TE hard-
were designed. This difficulty was largely side stepped
ware mechanisms and scaling TotalView’s debugging
because of the decision not to support any distributed
capability to larger processor counts.
memory programming language on the TE, so the
compiler developers were able to focus on single-core
Profiler
robustness and performance. The PVP compilers were
Cray’s MPP Apprentice profiler was built on ideas pio-
legendary for their ability to analyze code deeply for
neered by ATExpert, a prior tool developed to optimize
parallelization via vectorization, though they had no
programs written with Cray’s AutotaskingTM software,
technology for optimizing the use of cache (which the
a precursor to OpenMP. The Apprentice dynamically
PVP systems did not have). Conversely, the Compass
instrumented code at the basic block level and could
back-end was not designed to work with a front-end
report timing results at the loop, condition, or other
that could provide such robust information about the
statement blocks with the time spent executing, syn-
parallelism inherent in the input program. The rework
chronizing, and communicating reported separately.
of both segments of the compiler was possible, but
Also, the results could be split for an individual pro-
resource constraints and the compiler group’s need to
cessor or summed for all the processors. Further, this
split its focus between the TE and PVP systems ham-
information was provided in the context of the user’s
pered the delivered performance of code on the TE.
original source code, which was not uniformly true for
Due in part to deficiencies in the use of cache, the TE
parallel performance tools of that era. The instrumen-
compilers never reached the performance levels of the
tation was done so as to be suitable for execution of
compilers for the Alpha processor itself.
very large programs, measured either by lines of code
or length of execution, running on a thousand pro-
Libraries
cessors or more. The development team pioneered a
The strong similarity between the UNICOS and UNI- number of innovations that enabled this advanced level
COS/mk operating systems at the system-call level of information.
drastically simplified the task of providing (single-
processor) libraries for the TE. Of course extensive
Conclusions
parallel libraries needed to be delivered, including syn-
chronization (e.g., barriers, eurekas) as well as the MPI, The User Experience
PVM, and SHMEM libraries. Parallel I/O was sup- While the TD and TE projects embodied many pow-
ported by independent, positionally nonoverlapping erful technical innovations one can argue that the
calls to the parallel I/O infrastructure and file system. proper measure of their significance to users lies not
The Fast File I/O (FFIO) library layer was also ported simply in those innovations but, more importantly, in
CRAY TE C 

the stunning improvement in the applicability of highly the mechanisms that would be necessary for effective
scalable computers to a host of important scientific and scaling of the most challenging HPC applications.
engineering problems, as evidenced by the number of In addition to the substantial investment made
successor systems extending the path pioneered by TD by Cray in developing the TD and TE systems,
and TE. there was a correspondingly sizable user investment in C
Work on TD began against the background of the application reprogramming effort needed to take
microprocessors having advanced beyond simple inte- proper advantage of the new scaling features, especially
ger and text processors to support high-precision distributed memory. The new mechanisms had been
floating-point calculations at useful speeds. Although designed specifically to enable an increase in the num-
powerful individually, they could not compete with ber of compute PEs that could be allocated to a sin-
the purpose-built multiprocessor systems from Cray gle application before reaching the “point of negative
and others that were designed from the ground up return,” where the increased communications and/or
for use on the most challenging scientific and engi- synchronization overhead causes application perfor-
neering applications. Software, in the form of callable mance to decrease as the number of applied processors
libraries, notably Parallel Virtual Machine, was avail- increases. In earlier, fully shared-memory systems those
able to support cooperative work on multiple proces- overheads were controlled by both an extremely costly
sors using the message-passing paradigm. However, for shared-memory interface and a very low bound on the
the most challenging scientific applications, the hard- number of processors. Applications codes were suc-
ware support essential for HPC-level performance did cessfully exploiting the largest of those shared memory
not exist. Well-integrated, high-bandwidth, low-latency systems, but to realize the full potential of the TD and
inter-processor communications capability was simply TE required converting codes to the message-passing
not available on even the best of the microproces- style of work-partitioning. The next step, after the codes
sors. Creating an appropriate interconnect was partic- had been converted, was to drive up their scalability by
ularly difficult because the microprocessors had been artfully improving the efficiency of the message-passing
built, and highly optimized, for working singly on a process and allowing it to overlap with computation.
quite different set of applications. Furthermore, the Both of these techniques had long been used to reduce
design and development effort for those microproces- the I/O overhead in HPC applications, but the message-
sors dwarfed the resource typically available for custom passing problem was more demanding. In some cases,
HPC systems. The large commodity microprocessor the computational algorithms themselves needed to be
market justified that effort, but modification of those modified.
processors specifically for HPC was not economically Moving from a commodity cluster to an HPC-
feasible. capable MPP system required not only the creation of
There had been other efforts to build highly parallel a powerful, scalable message-passing network but also
computers from microprocessors notably the Intel Cor- the provision of an adequate I/O system. I/O capa-
poration iPSC/ and Paragon systems that were solely bility is sometimes an afterthought, but it can often
message-passing and the Thinking Machines Corpora- be the performance-limiting feature for HPC applica-
tion (TMC) CM- and CM-, which were blends of tions requiring the input and output of huge amounts
SIMD and MIMD functionality with significant archi- of data in combination with the computational load.
tectural innovation in global addressing, synchroniza- For the TD, I/O was channeled through its host vec-
tion, and programming (CM Fortran). Each of these tor machine, already endowed with HPC-class I/O;
systems had proved essential points about the ability to but, being self-hosted, the TE required its own I/O
decompose application codes for distributed memory capability.
and explored a variety of system mechanisms to enable For the TE to qualify as a supercomputer-class
the execution of scalable applications. Cray’s MPP MPP it needed not just an extremely flexible and pow-
design team greatly benefited from detailed discussions erful hardware I/O capability, but also the hardware and
with early customers of the Intel and TMC machines software to make that capability a fully shared resource
regarding their experience of using those systems, and able to be focused on one or many jobs independently
 C CRAY TE

of their location within the D torus interconnect struc- performance almost linearly for low numbers of pro-
ture. Segregating I/O handling gave a degree of pre- cessors but quickly reach the point of negative return,
dictability to the performance of application PEs; and where messaging overhead would begin to dominate
the rare applications that heavily stressed the TE’s I/O over the decreasing computational load per proces-
capacity, for a given number of I/O PEs, could be further sor. However, with effort and experience, the range of
isolated by explicitly placing them in the torus close to performance improvements that could be achieved by
the I/O PEs servicing their needs. ever-increasing processor counts could be expanded.
With the development of the TD, Cray quickly real- Typically, impressive results were obtained over a period
ized that application development would be critical to of several months. In many cases, it was possible to
the success of both the TD and TE. Consequently, bring to bear the total processing power and memory
the company fostered and explicitly supported the nec- of a full TD (often – PEs) on a single problem.
essary user activity by instituting a Parallel Applica- Both the aggregate processing power and the aggregate
tion Technology Partnership (PATP) program, which memory were much larger than previously available on
involved researchers and application developers asso- traditional vector supercomputers. Depending on the
ciated with five major HPC centers around the world. application, new scientific ground was broken because
Cray provided access to TD hardware for participat- of the computational power, total memory size, or both.
ing organizations to work alongside Cray applications
staff in effectively parallelizing a host of applications of The TE Contribution
both scientific and commercial interest. This joint effort The history of computing, and especially supercom-
required both programming expertise and the cooper- puting, is a continual co-evolution among hardware
ation of the owners or authors of the codes that were designers, software developers, and end-user scien-
to be modified. The PATP program helped ensure the tists. The scientists constantly demand much higher
success of the TD and TE programs by significantly performance to achieve the breakthroughs they see
contributing to the rapid development of a critical mass just beyond their grasp. The hardware designers work
of applications. within their physical realities and devise architectures
A prominent early adopter of the CRAY TD, and an that better deliver the key performance metrics. Sys-
effective member of the PATP program, was the Pitts- tem software developers simplify access to the hardware
burgh Supercomputing Center, whose scientists report structures from the prevailing languages. This process is
that early achievements included: not linear though, since some new hardware approaches
ultimately prove to be usable with current software
● The first high-resolution model of the Gulf Stream,
technology and some, such as Seymour Cray’s original
reproducing benchmark realistic features
CRAY-, do not. The development of the CRAY TE
● Prediction of the location and structure of severe
is therefore best viewed in the context of research and
storms  h in advance, providing economic benefit
development done in the prior years by Cray Research
to airline companies
and others.
● Microsecond-length simulation of the villin head-
Cray developed the TD and TE against a back-
piece sub-domain and implementation of improved,
ground of fully commoditized but weakly performing
more accurate algorithms
clusters. Superimposing Cray’s engineering capabilities,
● Real-time fMRI: coupling the TE to an fMRI
and its understanding of the needs of high-performance
instrument, allowing observation and analysis of
technical computing, onto the prevailing cluster archi-
functional brain response in real time
tectures enabled the company to pioneer the success-
● Elucidation of the mechanism for HIV Reverse
ful HPC transition to massive parallelism. The critical
Transcriptase
hardware engineering innovation was the low-latency,
● Simulation of unsteady combustion using a com-
high-bandwidth interconnect, and its D torus topol-
mercial code (FLUENT)
ogy. Adopting an expressive programming paradigm
For a typical application, initial conversion to message- (message-passing), which was the result of industry-
passing produced a code that would improve in wide collaboration, helped overcome the programming
Cray Vector Computers C 

obstacles experienced by the original CRAY-. Addi- commodity and high performance, and had become the
tionally, the ability to support substantially larger, albeit industry’s first widely used MPP system. The TE user
distributed, memories benefited a huge range of appli- environment and system tools provided an effective
cations; and the I/O architecture, which separated I/O platform to develop and run production applications at
from message-passing traffic, was a shared resource very high scale. Applications programmers and system C
to the entire system, unlike earlier clusters. Coalesc- managers responded enthusiastically to their new tools,
ing all of these attributes into a single, coherent, easily and the installed base of almost  TE systems estab-
usable, and manageable total system was UNICOS/mk, lished a new baseline of expectations among users and
the first general-purpose distributed operating system. the industry that became the foundation for the next
In the best Cray tradition, the TE was truly balanced phase of the supercomputing roadmap.
in respect of processor power, memory, and I/O, all The singular TE contribution was “proof of con-
combining to deliver maximum performance to the cept.” DARPA, DOE laboratories, NSF researchers, Cray
end-user through an efficient software system. Research, Intel, TMC, and others had collectively for-
Also, for the first time, true supercomputers were on mulated the vision. The CRAY TE provided hard evi-
the same Moore’s Law development curve as micropro- dence that those ideas could indeed be transformed into
cessors. Users could look forward to increased proces- usable, flexible, resilient physical reality at a new, more
sor power becoming available at lower cost every  years attractive, price point.
or so, rather than the -year cycle that had been typical
of earlier supercomputers. Acknowledgments
With the advent of the TE the industry was clearly The authors wish to acknowledge the contribution of
and demonstrably heading toward enabling scientific many former colleagues who made helpful suggestions
users to access unlimited computational power, con- and thoroughly reviewed the text for technical and his-
strained only by their ability to develop highly scal- torical accuracy. Specifically, they wish to thank Mike
able applications programs and the size of their bud- Booth, Bill Minto, Steve Scott, David Wallace, and
get. Contemporary () systems are now achieving William White.
PetaFlop performance levels, barely more than a decade
after ASCI Red and the TE first broke the TeraFlop
application barrier in . ExaScale systems, a million-
fold increase over the TE, are within reach. Oper- Cray Vector Computers
ational systems with more than , PEs (cores)
are available today, and million-core systems are being James L. Schwarzmeier
Cray Inc., Chippewa Falls, WI, USA
contemplated.
Clearly, the continuing successful growth of HPC
computing power, and resultant expansion of the range Synonyms
of beneficial applicability, now forms a “virtuous circle” Pipelining; SIMD (Single Instruction, Multiple Data)
driving that growth faster than Moore’s Law. However, Machine
it is strongly believed in some circles that major proces-
sor and memory changes will be required to take HPC Definition
from the PetaFlop to the ExaFlop range. Those changes Vector processing at the instruction set level is trans-
may dissolve the currently productive marriage of com- lation of a high-level language program into a series
modity components and HPC, conceivably requiring of scalar and vector instructions. Vector instructions
another TE-like development effort. use scalar and vector registers, where each vector reg-
After more than a decade of research and exper- ister holds a maximum number of elements as spec-
imentation, the TE had given the users real lever- ified by implementation-specific MAXVL, typically in
age based on the relentless progress and economics the range –. Vector processing at the hardware
of the microprocessor industry. The TE hardware level executes VL elements of an instruction in parallel
and software had finally bridged the gaps between across an implementation-specific number of lanes or
 C Cray Vector Computers

pipes. Each pipe contains a complete copy of functional improvement of ×– was possible. Fast scalar process-
units and possibly memory data paths and a colocated ing and low memory latency resulted in excellent vector
subset of elements of the vector register file. Vector performance for “short” vector loops compared to com-
Instruction Set Architectures (ISAs) allow easy transla- peting “long” vector machines. The hallmark of Cray
tion by a compiler of loop-level code parallelism into systems has been balance between CPU computation,
compact instructions that are very efficient in terms of memory bandwidth, IO bandwidth, and software stacks
fetch, decode, and issue. Vector ISAs lend themselves to deliver performance.
to straightforward replication of hardware components Government laboratories had in the Cray- a com-
for achieving high levels of parallel execution across puter capable of doing long simulation runs of real-
multiple functional units and multiple pipes. Vector istic physical problems. While  MB memory capacity
processors are an easy and natural way to provide long of the Cray- is tiny by current standards, scientists
periods of uninterrupted pipelined operation for high of the day were able to resolve problems in D and
processor efficiency. D geometries. There also was a base of government
data-analysis users whose codes benefited from fast and
Discussion extensive -bit integer and logical operations and ran-
The discussion of Cray vector systems is presented dom access to memory. Soon scientists and engineers
in three parts. First, a historical perspective of the in many private companies were buying Cray vector
rise, domination, and decline of Cray vector systems computers to gain competitive advantage by reduc-
is presented. The focus is on reasons for this evolu- ing time to market for new products. Early purchasers
tion. Second, a technical discussion of architectural and of Cray supercomputers were from industries such as
performance advantages of vector systems is presented. petroleum, car-crash simulation, computational chem-
These fundamental advantages drive the introduction of istry, aerospace, structural mechanics, electronics, and
vector-enhanced architectures seen today in the x SSE weather/climate. Cray vector machines were so much
and AVX instructions set extensions and SIMD designs faster than alternatives that, despite $– M price tags,
of graphics processing units (GPUs). Finally, a chronol- demand for the technology allowed Cray Research to
ogy of Cray vector systems is presented from the Cray  grow rapidly. Following the Cray  were many genera-
in  to the Cray X in . tions of Cray vector systems with increasing processor
count, memory capacity, clock rate, peak performance,
Introduction and Historical Perspective: and memory bandwidth. During the mid-late s
The Rise, Domination, and Decline of Cray Cray Research owned approximately % of the HPC
Vector Systems market.
Cray vector computers played a unique role in the While Cray systems dominated the HPC market for
history of high-performance computing (HPC) and many years, technology changes in the industry and
advancement of computer architectures. The genius of interest from computing companies much larger than
Seymour Cray as evidenced by his long track record Cray Research began eroding competitiveness of Cray
of designing high-performance computers with Control vector machines. As IC technology advanced during
Data Corporation (CDC) and introduction of the Cray- the late s, some performance advantages of Cray
 in  earned him the accolade of “Father of the vector systems became price-performance disadvan-
Supercomputer Industry.” The Cray- was introduced tages. Cray Research did not have resources to design
with a Reduced Instruction Set Computer (RISC) pro- full-custom logic CMOS chips, as was done by large
cessor and system clock of  MHz, whereas state-of- microprocessor companies. Dense CMOS logic allowed
the-art microprocessors of the time ran with – MHz. microprocessors to have large, low-latency, multilevel
This gave the Cray- a huge advantage even for scalar caches, which are an inexpensive way to provide band-
codes, and all customer codes experienced immediate width. Low latency to cache is important to prevent
and significant speedup compared to microproces- stalls of the processor due to allowing only a small
sors. As Cray’s vectorizing compiler improved and cus- number of outstanding memory references. Micropro-
tomers realized how to write vector loops, an additional cessor companies chose to design memory hierarchies
Cray Vector Computers C 

with a bandwidth profile strongly peaked at lowest level (SNL). The scientists did weak scaling studies and used
cache with modest main memory bandwidth based a distributed memory message passing model to give
on commodity DRAM chips. Cray systems tradition- ×– parallel speedups on a  node NCUBE on
ally were designed with no caches and high-bandwidth three applications from SNL.
crossbar switches to SRAM memory parts that were While early MPP systems had performance issues C
banked across individual double words rather than and lacked production-quality robustness, these sys-
cache lines. This provided a uniform memory access tems were an affordable means to scale the number
(UMA) shared memory system and was ideal for appli- of processors and amount of memory to levels much
cations with no spatial or temporal locality. For appli- higher than supported on shared memory vector sys-
cations with good locality of reference it became more tems. However, it can be challenging for even MPP
cost-effective to run on microprocessor systems. systems to run well when processor counts are in the
Fast, low-density ECL chips and SRAM memory range  – . Cray Research was a late comer to the
parts resulted in Cray vector machines with many chips MPP marketplace, but used its experience in deliver-
on many printed circuit boards and many wires con- ing HPC systems to contribute to this market segment.
necting them – expensive components. New micro- During the s the Cray TD and TE MPP systems
processors of the era, such as the IBM Power  and using the DEC Alpha EV and EV series of proces-
DEC Alpha series, were getting close enough in absolute sors soon dominated the MPP marketplace, largely due
performance to low-end vector machines that price/per- to unique features supported in the Cray proprietary
formance became a significant part of the customer’s network interface chip and router.
buying decision. This trend continues to the present. There is nothing that requires the processor in a dis-
The HPC market is much smaller than the general tributed memory MPP to be a scalar processor. Cray
microprocessor market. It was difficult for Cray sys- engineers designed several scalable, distributed mem-
tems to realize economic advantages of large-volume ory systems with powerful vector processors as com-
production. This problem was exacerbated as low-end pute elements. The Cray SV system introduced in 
customers began shifting from writing vector code to was an early version of multi-threaded vector opera-
scalar code, since their applications were being targeted tion. Users could specify with a compiler flag that four
for the larger volume superscalar market. This was espe- vector processors on different modules could operate
cially true of Independent Software Vendor (ISV) codes. either as four independent vector processors, or as a
By the mid-s application codes without significant single, four-way multi-threaded vector processor. The
vector content ran faster on microprocessors than on Cray SV was followed the Cray X system introduced
vector systems, as dictated by Amdahl’s Law. in  and the Cray X system introduced in . The
In the mid-s there were two other intertwined Cray X improved multi-threading by placing the four
innovations in the computer industry that signaled a vector processors on a Multi-Chip Module with tight
shift away from vector systems. First, massively parallel synchronization. These systems combine massive par-
processor (MPP) machines began entering the mar- allelism across many compute elements and within each
ketplace. These systems used hundreds or thousands compute element.
of commodity microprocessors with locally attached While traditional shared memory Cray vector sys-
DRAM memory and connected with low-bandwidth tems are no longer produced, vector hardware concepts
networks. Second, what made these systems usable are key to two advances in recent HPC architectures.
and ultimately led to a new de facto standard archi- First, x processors are moving toward improving per-
tecture for HPC systems was a standard distributed formance through SSE and AVX instructions, which are
memory inter-processor programming model. The ini- akin to fixed vector lengths of , , or more elements per
tial standard model was the Processor Virtual Machine register. Second, graphics processing units (GPUs) gang
(PVM) interface. In time PVM gave way to today’s stan- together a moderate number (–) of threads of execu-
dard – the Message Passing Interface (MPI) model. tion in a single-instruction-multiple-data (SIMD) man-
The  Gordon Bell Award was awarded to Benner, ner. Equally important to vector hardware concepts are
Gustafson, and Montry at Sandia National Laboratory vector compiler technologies for efficient use of SSE and
 C Cray Vector Computers

AVX instructions and GPUs. Cray compiler techniques one instruction,  values of y(i) with one instruction,
for handling outer loop, four-way thread parallelism, do  multiplies of a vector times a scalar with a single
and inner-loop vector parallelism with predicated exe- instruction, etc. For the final iteration vector length is
cution of the Cray X and X are directly applica- reduced as necessary, on architectures that support it.
ble to compiling for multi-threaded vector parallelism Vector parallelism also could be considered data level
of GPUs. parallelism (DLP).
The historical perspective of Cray vector systems At the third level is loop-level task or thread paral-
ends on the note that vector processing will continue to lelism. Here iterations of the loop contain independent
be an important component of future HPC systems. For streams of either purely scalar instructions or mixed
this reason, it is helpful to understand in more detail the scalar and vector instructions. For example, a task-
breadth and inherent advantages of vector architectures. parallel TRIAD loop is

Do j = 1,M ← task parallel over do-j


Architectural and Performance Do i = 1,N ← vector parallel over
Advantages of Cray Vector Systems do-i
z(i,j) = x(i,j) + a∗ y(i,j)
Parallel Hardware Needs Parallelizing enddo ! i
Compilers and Parallel Application Codes enddo ! j
For HPC users to benefit from increased parallel capa-
bility of hardware, HPC application codes must be This thread parallel loop could be detected with
written so as to allow compilers to identify at multi- an auto-parallelizing compiler, or could be identified
ple loop levels task and vector parallelism. The triad by user-inserted OpenMP directives before the do-j
of parallel hardware, parallelizing compilers, and par- loop. A vectorizing compiler will translate the inner
allel application codes are equally dependent on one loop into scalar–vector instructions, whereas a non-
another for efficient utilization of modern HPC systems. vectorizing compiler translates the loop into purely
The Fortran programming language introduced by John scalar instructions.
W. Backus of IBM in  was well-suited for express- The highest level parallelism in HPC codes today is
ing numerical discretization and similar algorithms that inter-processor parallelism, usually programmed with
an auto-parallelizing and vectorizing compiler could the MPI model.
analyze.
Application programs can contain parallelism at
Pipelining and Vector Processing
several levels. At the lowest level, a compiler generates
The design of efficient HPC processors is inexorably
many instructions for each line of code written with a
linked with the concept of pipelining. Vector archi-
high-level language. Generally groups of these instruc-
tectures are a natural and effective way to implement
tions involve different registers and can be issued in
pipelining.
parallel on the same clock cycle. This is instruction level
Hardware pipelining is the decomposition of logic
parallelism (ILP), and is supported on all processors
blocks and data paths into a series of stages that
today. Vector processors target intermediate-level data
allow back-to-back transmission of intermediate results
parallelism when operations between elements of vec-
or final data packets associated with a computation
tors or arrays are independent and can be expressed as
through the processor and memory system. The goal of
a single instruction. An example of a simple vector loop
pipelining is that at every clock cycle functional units
is a TRIAD operation
and data paths can accept new input data, perform an
Do i=1,N ← vector parallel over do-i operation, and advance output data to the next stage of
z(i) = x(i) + a∗ y(i)
the pipeline. For example, in processors today a float-
enddo
ing point multiply operation might be broken into four
A vectorizing compiler with maximum vector length  stages, which leads to a four clock latency for a single
breaks this loop in chunks of up to  iterations. Vec- multiply to complete. Moreover, in a pipelined design
tor instructions for TRIAD load  values of x(i) with a the first and subsequent stages are able to accept new
Cray Vector Computers C 

input operands every clock. For example, in a hypo- (B/dword)∗ (. GHz) =  GB/s. The bandwidth at
thetical pipelined implementation with  clock multiply the banks was over-provisioned compared to peak pro-
latency the time to complete a stream of  MULTIPLY cessor request rate by factor / = ×.. This was to
operands is  clocks –  clocks for the first result to compensate for inevitable contention between proces-
finish the last stage of the multiply unit, plus  clocks sors in crossbars and memory banks. In practice a fully C
for each successive result to finish the last stage of the loaded Cray C/ could compute TRIAD with aggre-
multiply unit. It is apparent that a vector register con- gate bandwidth of about  GB/s, which is % of the
taining  elements provides a natural implementation maximum processor request rate.
for providing a “pipelined stream of  operands.”
Advantages of Vector Processors
Vector Pipes and “Chime” Time Traditional vector processors have significant advan-
To boost performance, vector processors can be tages over traditional superscalar microprocessors for
designed with multiple pipes. An n-pipe implementa- HPC applications. These include: (a) reduced instruc-
tion contains /n of the maximum number of vector tion issue bandwidth requirements, (b) high processor
elements of all vector registers colocated with a com- concurrency for latency tolerance, (c) good fit with IC
plete copy of all functional units and possibly data paths technology for functional units and registers, and (d)
to the memory system. In a vector processor with two vector chaining.
pipes, a pipelined stream of  operands would com- (a) Reduced instruction issue bandwidth requirements
plete in  clocks. The chime time of a vector implemen-
tation is A significant advantage of vector ISAs is reduction in
instruction issue bandwidth requirements. This low-
Vector chime time = MAXVL/#pipes, ers power and silicon area requirements for instruction
where MAXVL is the maximum number of elements fetch and decode. Vector ISAs afford an easy way for
in a vector register. The biggest challenge for a pro- the compiler to convey data parallelism to hardware.
cessor to maintain pipelined operation is in the mem- For example, with MAXVL =  vector load/store
ory system – it is difficult and expensive to design instructions have one of two generic formats
a memory system that provides load operands in a
fully pipelined manner. Pipelining of a memory sys- v1 [a1, a2] ← constant stride load
[a1, a2] v1 ← constant stride store
tem extends from address translation, to cache coher- v1 [a1, v2] ← gather (indexed load)
ence directory consultation (if caches are present), to [a1, v2] v1 ← scatter (indexed store)
crossbars and banks of the memory system, and back
to processor registers. A well-pipelined memory sys- For the constant stride case, the base address of the load
tem can be illustrated with a Cray C, first intro- is in address register a and the stride of the load/store
duced in . The Cray C/ was a -processor is in register a. For the gather/scatter case (indirect
shared memory system that ran at nominal  MHz addressing), vector register v contains  integer val-
with  SRAM memory banks, each capable of provid- ues that are word-offsets relative to the base address
ing a  bit double word (dword) every  clock cycles. of each successive element of the vector reference. The
This provided a peak bandwidth at the banks of ( destination register of the load or operand register of
banks)∗ (B/bank/ clocks)∗ (. GHz) =  GB/s. the store is v. A single vector load instruction loads
Each Cray C processor had two vector read ports and  words from memory with all  addresses generated
one vector write port with the goal of supporting high automatically in vector hardware – element-by-element
performance for memory-intensive loops like TRIAD. address increment instructions are not required. A
The maximum memory bandwidth a single processor scalar processor requires  load instructions and 
could ask for was four dwords load data and two dwords address increment instructions for a total of  instruc-
store data per clock for total of  dwords per clock. tions. On a vector processor scalar instruction issue
 processors of a Cray C/ had a maximum mem- bandwidth can be relaxed, since scalar execution in
ory request rate of ( proc)∗ (dwords/clock/proc)∗ support of vector loops can be “hidden” under vector
 C Cray Vector Computers

chime time. The Cray- had length  vector registers logic is more complex, since register and FUG avail-
and one vector pipe, so its chime time was  clocks. ability and data path conflicts have to be examined to
This gave ample opportunity for scalar instructions to determine how many instructions can issue in the next
issue without interrupting back-to-back issue of vector clock. Complexity of scalar register files grows quadrat-
instructions. The reduced instruction issue requirement ically with number of registers. Vector register file size
of vector processing is taken advantage of when the Intel can grow with no increase in complexity by increasing
x ISA added SSE and AVX instructions, which have maximum vector length.
fixed vector length of , , or , and in GPUs which exe-
cute SIMD manner with an equivalent maximum vector (d) Vector “chaining”: vector processing of dependent
length in the range –. instruction streams

(b) High concurrency for latency tolerance Vector register files and chime time lead to an architec-
tural advantage of vector processing related to pipelin-
A second advantage of vector processing is the large
ing called chaining. Chaining occurs when an output
concurrency available to hide memory latency. The reg-
vector register of an instruction executed in one func-
ister file in a Cray X processor introduced in  has
tional unit is pipelined as an input vector register into
 vector registers with  elements each for a total
a dependent instruction in a different functional unit.
of ∗  =  doubleword (quadword for x) ele-
Register bypass paths between functional units are stan-
ments. The implementation of the Cray X allows each dard in microprocessors, but chaining extends this
processor to have up to  outstanding dword load concept with large vector length. From a compiler per-
elements. The Cray X has hardware support for direct spective, chaining allows vectorization across indepen-
load/store access to any address in the system. Each dent sets of dependent instructions. This architectural
processor has sufficient concurrency to tolerate local advantage of vector processing merits exploration with
memory latency of  ns and even remote memory a concrete example.
latencies of several microseconds.
The top of Fig.  shows a loop of  iterations con-
(c) Good fit of vector hardware to IC technology taining a line of code with pairwise-dependent MUL-
TIPLY/ADD instructions. Figure a shows instruction
A third advantage of vector processors is they are a good sequences for Y(I) on a superscalar processor (left half
match to IC technology in terms of functional units and of table) and vector processor with MAXVL =  (right
register files. The vector register file is large (good for half of table). Not shown are load instructions (pre-
pipelined operation) and simple to control. Each vec- sumed to hit in cache) that fill input registers R on
tor processor of the Cray X has eight vector pipes. the scalar side and S on the vector side. Calculation
From a logic design perspective, adding vector pipes is of Y(I) is parallel (vectorizable) over I, but assume reg-
a straightforward replication of functional unit groups ister pressure from other statements in the loop body
(FUGs) for each pipe with each FUG having two read prevents a superscalar compiler from unrolling or soft-
ports and one write port into its per-pipe registers. For ware pipelining calculation of Y for different I. The
example, FUG on the Cray X contains FP Add, INT assumption of no pipelining on the scalar processor may
ADD, a LOGICAL unit, COMPARE, etc. FUG con- seem unnecessarily restrictive, but the point is to illus-
tains FP MULTIPLY, INT MULTIPLY, and a SHIFT. trate the advantage of the length of registers rather than
The vector load/store unit also has ports into and out of the number of registers. This effect has been seen in
the per-pipe register file. One vector instruction from real application codes. Consider execution of this line
each FUG can issue per chime, as long as there are of code on an out of order (OOO), -way superscalar
no issue conflicts. In contrast, a superscalar micropro- processor and an in-order, single-pipe vector processor
cessor has a more complex and smaller register file. with MAXVL . Assume both processors can load or
In a -way-issue machine, in each clock four instruc- store one doubleword per clock from cache with  clock
tions four instructions” for correctness. need full access load latency, and both processors have MULTIPLY and
into any individual element of the register file. Issue ADD latencies of  clocks.
Cray Vector Computers C 

DO I = 1,64
... ¬ code with long-lived registers
Y(I) = X1(I)+A*( X2(I)+A*X3(I) ) )
... ¬ code with long-lived registers
ENDDO
C
Instruction # Instruction Issue clock Instruction Issue clock
(i=1) [A in R0] (I=1:64) [A in S0]
1 R3 [load X3(1)] 0 V3 [load X3(1:64)] 0
2 R10 R3*R0 3 V10 V3*S0 3
3 R2 [load X2(1)] 1 V2 [load X2(1:64)] 1+64=65
4 R11 R2+R10 3+4=7 V11 V2+V10 65+3=68
5 R1 [load X1(1)] 2 V1 [load X1(1:64)] 65+1+64=130
6 R12 R11*R0 7+4=11 V12 V11*S0 130+1=131
7 R10 R1+R12 11+4=15 V11 V1+V12 131+4=135
8 [store Y(1)] R10 15+4=19 [store Y(1:64)] V12 135+4+64=203
...
(i=64)
1+8*63=504 R3 [load X3(64)] 63*19=1197
505 R10 R3*R0 1197+3=1200
...
523 [store Y(1)] R10 1197+19=1216
a

Instruction # Explanation of ‘Issue clock’ from (a)


Superscalar processor Vector processor
1 Issue @ clock 0 Issue @ clock 0
2 Issue @ 3, cache latency 3 clocks Issue @ 3, first load element from
cache enters MUL unit
3 issue @ 2 since OOO and can issue Issue @ 65, 2nd load cannot start
one load or store per clock until previous load completes
4 Issue @ 7, MUL latency 4 clocks. R2 Issue @ 68, in-order issue plus
is ready before R10 cache latency of 3 clocks
5 Issue @ 2 since OOO Issue @ 130, after last element of
previous load completes
6 Issue @ 11, ADD latency 4 clocks Issue @ 131, since in-order issue
7 Issue @ 15, MUL latency 4 clocks Issue @ 136, MUL latency 4 clocks
8 Issue @ 19, ADD latency 4 clocks Issue @ 139, but add 64 clocks
since store busies cache before
subsequent load can issue
. . .
504 Issue @ 504 (64th iteration)
507 Issue @ 507, cache latency 3 clocks
. . .
523 Issue @ 511, 504+19=523
b

Cray Vector Computers. Fig.  Illustration of vector chaining for loop with dependent pairs of MUL/ADD instructions
whose loads hit cache with  clock latency. Floating point functional unit latency is  clocks. (a) Instructions with issue
times for OOO superscalar processor (left) and in order, MAXVL =  vector processor (right). (b) Explanation of issue times
for each instruction and processor

Figure a is a table showing when each instruction performance between the two processor types. Figure b
issues on the superscalar and vector processors. The is a table explaining what determines issue time of each
timing table reveals differences in issue constraints and instruction on the two processors. On the superscalar
 C Cray Vector Computers

processor loads of one scalar value can issue every clock improvements over the years. The following are high-
and issue OOO. The left half of Figure a shows that by lights of Cray vector (and non-vector) systems begin-
clock  performance is limited by the  clock functional ning with the Cray-. Some specifications are nominal,
unit latencies of MUL and ADD instructions, leading to since different models of a given system can ship with
a total of  clocks for each iteration I. Total time for this slightly different parameters.
one line of code for  iterations is approximately 
clocks. Cray-S/M
On the other hand, on the vector processor there The Cray ISA was one of the first Reduced Instruc-
is a single vector iteration of length  for calculat- tion Set Computers (RISC) in a commercially available
ing Y(:). Vector loads or stores can execute only computer. There were  integer-only address registers
once every  clocks, since each vector memory ref- (-bit) A–A, eight general-purpose scalar registers
erence busies the data path to cache for  clocks. As (-bit) S-S, and  vector registers (-bit) V-V
can be seen from the right half of Figure a, func- with MAXVL = .  B and T registers were used
tional unit latency on the vector processor just adds a to do block loads of scalar data from memory and
few extra clocks to execution time. Issue timings (col- as spill space for A and S registers, respectively. The
umn  of Fig. ) for instructions , , , and  show clock speed was  MHz for a peak performance of
that performance is limited by memory system band-  MFLOPS. The Cray- was a single processor sys-
width – functional unit latencies are basically hidden tem with no caches and up to  MB of bipolar memory
by pipelining from vector registers. To vector execution and supported a load path and store path to mem-
time (see green “” in instruction  of Fig. a) is added ory. Memory latency was  clock cycles. Memory was
time for the vector store to complete, as the first vec- dword-addressable (rather than byte-addressable) and
tor load following calculation of Y(:) cannot begin real as opposed to virtual. Figure  shows Seymour
until the store relinquishes the data path to the cache. Cray and the Cray- supercomputer. The iconic “C”
Total time for this one line of code for  iterations is shaped footprint allowed routing short-length wires
approximately  clocks. between processor and memory modules. The padded
The difference in performance between the vector seat perimeter housed power supplies and plumbing for
and superscalar processors for calculation of Y(:) is the liquid cooling system. The IO system was housed in
approximately × (/). This dramatic speedup is a separate cabinet. Dedicated IO processors could read
due to vector architecture, even though the processors and write main memory and disk without diverting the
have equal peak performance and cache bandwidth. It is CPU from computation. An optional solid state disk
much more efficient to stream  elements from cache (SSD) cabinet delivered much higher performance for
and pipeline them through MULTIPLY and ADD func- codes that repeatedly accessed datasets too large to fit
tional units, than to individually fetch words from cache in central memory. First customer ship was in .
and pay MULTIPLY and ADD latency separately for
each iteration. For the complete loop in Fig. , the differ- Cray XMP/
ence in performance between the two processors will be The Cray XMP came in – processor models. The
less than ×, depending on how much work is in parts Cray- ISA was expanded to add  “shared” and 
of the loop not shown. If arrays X(I), X(I), X(I), and “semaphore” registers. These registers allowed fast syn-
Y(I) are not in cache, performance on both systems will chronization between processors and IO resources.
be limited by memory bandwidth and possibly mem- A later version of the machine was the Cray XMP-EA
ory latency (superscalar case, due to low concurrency). (Extended Architecture). Expansion of A and B regis-
Of course, to the extent scalar pipelining is possible, the ters to  bits permitted the architecture to address up to
relative speedup of the vector case will be reduced.  GWords (-bit double words) main memory. Clock
speeds up to  MHz gave peak performance of 
Chronology of Cray Vector Systems MLOPS per processor. The number of paths to memory
While the basic Cray- ISA persisted within Cray for increased to two loads and one store path per proces-
nearly  years, there have been many implementation sor. Memory latency was approximately  clock cycles.
Cray Vector Computers C 

. m circle and its height was . m. The computer


used total immersion Flourinert cooling, with plumb-
ing to a distantly located cooling tower. The processor
ran at  MHz for a peak performance of  MFLOPS
per processor. The Cray- also introduced a  KB, C
compiler-managed, local memory. The Cray- system
ran the UNICOS operating system, which started the
shift toward UNIX-based OS within Cray. The Cray-
was introduced in . Like the Cray-, the Cray- sys-
tem was designed by Seymour Cray. Follow-ons Cray
 and Cray  were built by Seymour Cray as part of
Cray Computer Company in Colorado Springs during
the early s. The Cray  and Cray  pushed minia-
turization of packing and were difficult to manufacture
in volume.

Cray YMP/
The Cray YMP was the follow-on to the Cray XMP and
had from  to  processors. Clock speed of  MHz gave
peak performance of  MLOPS per processor. Mem-
ory latency was approximately  clock cycles. There
Cray Vector Computers. Fig.  Seymour Cray and the were several versions of the YMP designed for differ-
Cray- supercomputer ent market needs. The baseline system came with a  GB
SSD and  MB memory, but for memory-intensive
applications a DRAM memory part option allowed up
Flexible chaining was introduced to allow more fre- to  GB memory. For low-end markets the YMP-EL
quent overlap of functional units. Pioneering work was was an air-cooled CMOS implementation of the YMP.
done on the Cray software stack to implement Macro-
TaskingTM libraries that harnessed multiple CPUs on a Cray C/
single user program. A second version of paralleliz- The Cray C had up to  processors. Clock speed was
ing software, optimized for finer granularity parallel a nominal  MHz. An implementation change rela-
work, was Micro-TaskingTM . User-inserted directives tive to the YMP was to have two vector pipes with vector
were inserted before a parallel loop, which the com- registers  elements each, which boosted performance
piler translated into efficient work-sharing code using to  GFLOP per processor. To maintain processor and
low-latency shared registers. It was common to get par- memory bandwidth balance, each vector pipe had two
allel speedups greater than . on a four-processor Cray load and one store paths to memory. Memory latency
XMP. The Cray XMP was introduced in . was about  clocks, and memory capacity was .–
 GB. The C was introduced in .
Cray-/
The Cray- was renowned for two innovations. First Cray TD MPP
main memory was expanded up to  GB, which allowed Though not a vector system, the TD was Cray’s first
for much higher resolution scientific computing. Large MPP system. The TD was hosted by a Cray YMP
memory was made possible by use of high density but Model E system, which ran the Cray UNICOS operat-
low bandwidth DRAM chips. This allowed scientists ing system and provided I/O and most system services
and engineers to begin D simulations. Second, the sys- to the MPP. DEC Alpha EV  MHz processors were
tem was very compact – its cabinet footprint was a the compute engine of a node and ran a microkernel
 C Cray Vector Computers

operating system. Each node contained two EV pro- system in which vector memory references no longer
cessors connected to a proprietary network interface had synchronized access to data paths, but requests and
chip (NIC). The NIC arbitrated access of the two pro- responses were individual packets for each element of
cessors to a BLock Transfer engine (BLT), which per- the vector reference. Vector loads no longer completed
formed direct memory access (DMA) transfers to/from in order, and the processor could issue multiple vec-
the local node to remote nodes via a link to a proprietary tor loads (however, still only  vector registers) without
router chip. The router chip supported bidirectional D stalling unless a destination register in an instruction
torus connections to other router chips in the system. was not completely full. A T/ could execute the
The first Cray TD system was shipped in  to the STREAMS benchmark at  GB/s. The clock rate was
Pittsburgh Supercomputer Center (PSC).  MHz, which required Fluorinert cooling. Like the
Cray C, the T had two vector pipes, so its peak
Cray CS compute rate was . GFLOPS per processor. The T
Cray Research acquired the assets of Floating Points was introduced in .
Systems in , which led to formation of subsidiary
Cray Research Superservers in . These systems were Cray TE MPP
based on Sun Microsystems’ SuperSPARC micropro- The Cray TE system was a self-hosted follow-on to
cessor and utilized high-bandwidth buses in a fault- the Cray TD system. The D torus network was main-
tolerant, partitionable SMP configuration. The Cray tained from the TD, although the TD used dimen-
CS team was developing the Starfire system when sion-order routing while the TE used direction-order
it was sold to Sun Microsystems after SGI purchased routing. The latter routing enabled better network fault
Cray Research in . The Starfire became Sun’s very tolerance. Link bandwidth and functionality of the TE
successful Enterprise , system. router chip were enhanced over the TD. Two TE
router enhancements were adaptive routing of pack-
Cray J/ ets as network contention was detected (also improved
The Cray J had , , or  processor models. This fault tolerance) and a virtual barrier network. The TE
machine was the follow-on the Cray YMP-EL, so it used a single DEC Alpha EV or EV processor con-
was an air-cooled CMOS implementation. The J had nected to the NIC. One of the most significant new
two noteworthy innovations. First the CPU was split features of the TE NIC was “E registers.” There were
into two chips, one for scalar processing and one for  user-accessible E registers that were manipulated by
vector processing. Initially both chips ran at  MHz, the local processor to do direct, un-cached DMA trans-
but the Cray Jse (scalar enhanced) model ran with fers to/from memory of remote nodes. The ability to
 MHz on the scalar chip. The second innovation (for easily use E registers led to the successful SHMEM com-
Cray) was presence of a  KB scalar cache. Presence munication model, which implemented the first direct,
of a cache-based processor introduced new issues for high-performance GET/PUT semantic communication
Cray designers, such as maintaining coherence between model for MPPs. The first Cray TE system was deliv-
scalar and vector memory transactions. The cache was ered to PSC in . The TE’s performance and scal-
software managed, so the compiler was responsible for ability led it to become the gold standard of the MPP
hardware invalidations when required. The J began market in the late s.
shipping in .
Cray SV/
Cray T/ Silicon Graphics purchased Cray Research in . The
The Cray T had , , , and  processor models. The Cray SV was completed in the Cray Research division
T had two innovations. First was addition of a mode of Silicon Graphics. The SV was binary compatible with
that would use either IEEE or traditional Cray float- the Cray YMP and J, so computed solely using tra-
ing point format. IEEE mode was particularly impor- ditional Cray floating point arithmetic and had vector
tant to ISVs who wanted single source code to run registers with  elements each. There were three inno-
on many platforms. Second was a redesigned memory vations introduced in the SV. First each SV cabinet
Cray Vector Computers C 

had a  KB software-coherent merged scalar–vector of conditional code using predicated execution.
cache. For vectors the cache primarily served to fil- Furthermore, almost all vector instructions were
ter bandwidth rather than reduce latency. Second the subject to predicate execution as specified by VM
concept of “Multi-Streaming” was introduced. Here, registers, element by element. A relaxed memory
under control of software, groups of four processors of model was defined with a rich set of synchronization C
an SV cabinet could be ganged together at the task and Atomic Memory Operation (AMO) instruc-
level of parallelism in a program to serve as a single, tions were defined.
-way multi-threaded vector processor. This was the . There were two processor modes. In Single Stream
first attempt within Cray to develop a more powerful Processor (SSP) mode, each vector/scalar CPU was
“single processor,” but the concept was little used due a separate processor. In Multi-Streaming Proces-
to lack of key hardware support. The third innovation sor (MSP) mode, a group of four SSPs packaged
was multiple SV/ cabinets could be combined with on an MCM was treated by the compiler as a
a Cray proprietary Giga-RingTM interconnect to form -way multi-threaded vector processor. Applica-
a “Scalable Vector” system up  SV/ “nodes.” The tions were compiled either in SSP mode or MSP
original SV systems ran at  MHz, and the SVex mode via a compilation flag. The compiler could
follow-on ran at  MHz. Peak performance of a sin- either automatically detect outer loop task-level
gle processor was  GFLOP and  GFLOPS for the  parallelism with vector code generation on inner
processor node of SVex. A full -node system had loops, or the compiler could key off user-inserted
a peak of . TFLOPS. The Cray SV began shipping Cray Streaming Directives (CSDs) to schedule SSP
in . threads across loops possessing subroutine call
trees. Each SSP thread could discover vector par-
Cray XMT allelism as the thread traversed the call tree. A
In  Tera Computer Company bought the Cray fast MSYNC synchronization instruction was both
Research division of Silicon Graphics. The new com- a control and memory barrier between SSPs in
pany was renamed Cray Inc. The experience of Tera an MSP.
engineers in designing and building highly multi- . Each SSP processor was a two-pipe CPU with vec-
threaded systems led to the Cray XMT system. This tor registers of  elements. The scalar part of the
system was uniquely able to satisfy market needs for SSP ran at  MHz while the vector unit ran at
applications with abundant levels of non-predictable,  MHz. This led to a peak -bit rate of .
graph-based parallelism. GFLOPS, and -bit mode had peak of . GFLOPS.
. Each SSP was operated in a fully decoupled manner.
The first level of decoupling was that scalar execu-
Cray X
tion ran “ahead” of vector execution by hundreds
The Cray X was the first truly scalable vector-based
of clock cycles in each SSP. The scalar processor
system designed by Cray. It shipped in late . As
issued, completed, and marked for graduation scalar
described below the Cray X system had a limited ver-
instructions early while vector instructions were
sion of coarse-grained multi-threaded vector operation,
dispatched to deep queues to await scalar operands,
which shares with Cray XMT systems some of the same
if needed. The second level of decoupling was early,
compiler techniques for loop-based task and vector par-
non-blocking processing of vector load addresses
allelism. The Cray X had many innovations over past
into the memory system to prevent into load buffers
Cray vector machines.
data that would be moved into a vector register
. There was a major upgrade of the Cray- ISA: The when the load executed at the vector execution
Cray X ISA supported only IEEE arithmetic;  pipeline. Vector store addresses were also decoupled
-bit A registers;  -bit S registers;  - from vector store data to allow for pre-allocating
bit V registers with  elements each.  vector store requests in cache.
mask (VM) registers, each with  elements and . The MSP had peak -bit flop rate of . GFLOPS.
one-bit wide, were added to allow rapid evaluation The four SSPs on the MCM shared a  MB L cache,
 C Cray Vector Computers

while each SSP had a  KB scalar cache. High technology will continue to play a large role in deliv-
bandwidth directory caches in the Memory (M) ering performance on future HPC systems.
controller chips processed cache coherence mes-
sages without degrading performance.
Related Entries
. Each of four MCMs on a node was connected to
NVIDIA GPU
 M chips with peak nodal memory bandwidth of
Pipelining
 GB/s. Each M chip had two bidirectional cache-
Vector Extensions, Instruction-Set Architecture
line address sliced ports into a proprietary Cray
(ISA)
X network. The network had multiple D slices
between cabinets, and had cross-bar routing among
nodes in a cabinet. Bibliographic Notes and Further
. The Cray X was scalable up to  nodes: fully Reading
cache coherent system but only caching node-local Vector processing and related architectures preceded
data; entire system is globally addressable – proces- founding of Cray Research in . The ILLIAC IV
sor and network support native ISA remote scalar from Burroughs was an early SIMD computer. A control
and vector memory references with “remote transla- unit broadcasted an instruction to an Array Subsys-
tion” (map all system memory so there were no net- tem containing  processing units with local memory
work TLB misses). The latter feature was borrowed and mode bit control []. Two vector machines that
from the Cray TE NIC. appeared in the early s were the Texas Instruments
ASC [] and the Control Data Corporation (CDC)
Cray X STAR- []. However, it was the Cray-, introduced
The Cray X was the follow-on to the Cray X with in , that combined a vector RISC ISA with tech-
a number of noteworthy differences. First the con- nology improvements that made vector processing so
cept of MSP was dropped. On the X some applica- dominant in the first  years of the modern HPC era.
tions ran very well in MSP mode, but the market- The Cray XMP and YMP design teams were led by Steve
place generally favored SSP mode as being more effi- Chen [].
cient – Amdahl’s Law exacts a large penalty if parts Many academic investigators have studied vector
of an application do not have task-level parallel loops architectures. James E. Smith, while at Cray Research
to use the  SSP threads. Second, each X CPU was in the early s and at the University of Wisconsin-
an -pipe vector unit with vector registers of  ele- Madison, published extensively on vector processing
ments each, double the length in the Cray X. Scalar and [, ]. The Berkeley Intelligent RAM (IRAM) project
vector clock rates were  MHz and . GHz, respec- included design of a processor-in-memory chip that
tively. This resulted in peak performance of  GFLOPS sought to address the growing imbalance between fast
per processor. Each CPU had a . MB vector/scalar processors and low memory bandwidth []. The chip
cache. Vector Atomic Memory Operations (AMOs) took advantage of traditional strengths of vector pro-
were added to allow high performance on certain loops cessing, such as low power for issue and control, low
with potential memory conflicts. Another difference design complexity, mature vector compiler technology,
was the X used a fat-tree topology for the network. etc., and combined those with on-chip DRAM mem-
The Cray X, like the X, was globally addressable ory bandwidth. One target application area for such
across the system. The proprietary router processed sin- chips is on-demand media processing. The first edi-
gle dword or cache line requests at very high bisection tion of the classic text on computer architectures by
bandwidth. Hennessy and Patterson [] is recommended for a
The Cray X was the latest and last proprietary vec- quantitative approach to computer architectures and
tor computer designed at Cray Inc. However, many of comparisons between scalar and vector processing.
ideas of vector processing apply to future versions of Detailed performance models are presented extending
the x ISA and implementations and programming from processor to memory system. The authors empha-
models of GPUs. In particular, vectorizing compiler size importance of scalar performance and low memory
Cray XMT C 

latency leading to overall low start-up time for vec- . Padua D, Wolfe M () Advanced compiler optimization for
tor loops. Vector machines and SIMD machines share supercomputers. Commun ACM ():–
. Abts D, Bataineh A, Scott S, Faanes G, Schwarzmeier J, Lund-
many features and can be programmed similarly. The
berg E, Bye M, Schwoerer G () The cray BlackWidow: a
book by Guy Blelloch gives in-depth examples of how highly scalable vector multiprocessor. In: Best paper SC , Reno,
algorithms with irregular memory access can be coded Nevada, November . IEEE, ACM, New York C
in vector/SIMD manner []. Studies of vectorizing com-
piler techniques, important for usability of vector sys-
tems, can be found in [–]. A detailed description
of many advancements of the Cray X system is given
in []. Cray XMT
Larry Kaplan
Bibliography Cray Inc., Seattle, WA, USA
. Siewiorek DP, Bell CG, Newell A () Computer structures:
principles and examples, McGraw-Hill, New York Definition
. Watson W () The TI-ASC, A highly modular and flexible
supercomputer architecture. In: Proceedings of the AFIPS, vol ,
The Cray XMT is a shared memory parallel computer
pt . AFIPS Press, Montvale, pp – consisting primarily of aggressively multi-threaded pro-
. Hintz RG, Tate DP () Control data STAR- pro- cessors derived directly from the Tera Multi-Threaded
cessor design. In: Proceedings of the Compcon , New Architecture (MTA) (see Tera MTA) []. It is pack-
York. IEEE Computer Society Conference, Washington DC, aged using the same infrastructure and support as the
pp –
Cray XT (see Cray XT and Cray XT Series of Super-
. Chen S () Large-scale and high-speed multiprocessor system
for scientific applications. In: Proceedings of the NATO advanced computers). The main difference in hardware from the
research work on high speed computing, Research Center, Julich, XT is the replacement of the socket  Opterons on
West Germany (Also in Hwang K (ed) () Supercomputers: the compute nodes with Cray Threadstorm processors.
design and applications. IEEE, Washington, DC) These processors implement the Multi-Threaded Archi-
. Espasa R, Valero M, Smith JE () Vector architectures, past,
tecture with support for  hardware threads in each
present, and future. In: International conference on supercom-
puting, Melbourne, Australia. ACM, New York
processor. Systems of up to  Threadstorm processors
. Smith JE, Faanes G, Sugumar R () Vector instruction sup- are supported.
port for conditional loops. In: Proceedings of the th annual
international symposium on computer architecture, June ,
Vancouver, BC
Discussion
. Kozyrakis C, Gebis J, Martin D, Williams S, Mavroidis I, Pope S, Introduction
Jones D, Patterson D, Yelick K () Vector IRAM: a media-
Shared memory parallel programming can provide
oriented vector processor with embedded DRAM. In: th hot
chips conference, Palo Alto, CA, August  simplicity to the programmer by allowing the place-
. Hennessy J, Patterson D () Computer architecture, a ment of data to be ignored. All memory is essentially
quantitative approach, Morgan Kaufmann Publishers, San Fran- equally distant in such a system. However, implement-
cisco, CA ing a shared memory system in a scalable fashion has
. Blelloch GE () Vector models for data-parallel computing,
several challenges. The biggest challenge is making a
MIT Press, ISBN X
. Callahan D, Dongarra J, Devine D () Vectorizing compil-
distributed memory implementation of the memory
ers: a test suite and results. In: Supercomputing’, Orlando, FL, appear as uniformly shared to the programmer by hid-
November, . ACM/IEEE, pp – ing the latency to the memory as seen by the pro-
. Allen R, Kennedy K () Automatic translation of FOR- cessors. In addition, several other important features
TRAN programs to vector form. ACM Trans Program Lang Syst are required to make best use of the shared memory
():–
environment including support for fine-grained syn-
. Kuck D, Budnik PP, Chen S-C, Lawrie DH, Towle RA, Strebendt
RE, Davis EW, Jr., Han J, Kraska PW, Muraoka Y () Measure-
chronization. The Cray XMT implements these features
ments of parallelism in ordinary FORTRAN programs. Computer using the custom Threadstorm processor and SeaStar
():– interconnect.
 C Cray XMT

Note that in addition to the main Threadstorm here is specific to the multi-threaded compute portion
multi-threaded processors on the XMT compute nodes, of the system.
an XMT also contains service nodes that use standard
AMD Opterons. The system hardware architecture is Threadstorm
shown in Fig. . The Threadstorm processor is a direct descendant of the
These service nodes provide external connectivity MTA- processor. As with its predecessor, Threadstorm
via PCI-X cards, and a login environment with com- contains  hardware threads or streams multiplexed
pilation and other programmer tools. These nodes can onto a single execution pipeline on a cycle-by-cycle
also be programmed using standard Linux tools. Except basis. By context-switching every cycle, the execution
where explicitly noted, most of the discussion presented latency of any individual instruction can be effectively

Compute Service and IO

MTK Linux

PCI-X
10 GigE
Network
PCI-X

Fiber channel
RAID controllers

Service partition Compute partition


• Linux OS MTK (BSD)
• Specialized Linux nodes

Login PEs
IO server PEs
Network server PEs

FS metadata server PEs


Database server PEs

Cray XMT. Fig.  XMT architecture


Cray XMT C 

reduced or hidden from the point of view of the ● Data TLB increased in size to cover  terabytes
processor. This is especially important for memory (was only  terabytes)
operations in instructions because the latencies for ● Amount of memory supported per node increased
these operations can be relatively long in terms of pro- from  gigabytes to  gigabytes
cessor clock cycles. C
In addition, each stream within the processor, in SeaStar
concert with the compiler, implements a release consis- The Cray SeaStar -D Torus interconnect (see Cray
tent memory model [] and can have up to eight mem- XT and Seastar -DTorus Interconnect) was originally
ory references (e.g., loads and/or stores) outstanding designed for use in the Cray Redstorm system that ulti-
simultaneously. This allows for a total of , memory mately became the Cray XT. It is primarily designed
references to be outstanding from a given Threadstorm for message passing types of communication through
processor. the use of a direct memory access (DMA) engine and
Threadstorm also includes an integrated DDR an embedded processor. In order to support the XMT,
memory controller, primarily due to the fact that it sits a Remote Memory Access (RMA) block was added that
in an Opteron socket , which expects such function- could process the memory reference style of commu-
ality to be present. In addition, Threadstorm contains a nication used by Threadstorm. The resulting ASIC is
HyperTransport (HT) interface over which it commu- known as SeaStar and is shown in Fig. .
nicates with SeaStar, which implements the high-speed RMA transactions are used exclusively between
interconnect. Threadstorm nodes. Message passing transactions are
The processor and memory controller, which in used to communicate with the Opteron-based service
Threadstorm is on the same die as the processor, sup- nodes. Portals [] is used to drive the message-passing
ports fine-grained synchronization through the use transactions though only a small subset of the Portals
of state bits stored in memory. Conceptually each API is implemented on the compute nodes.
-bit memory location supports four additional bits
that together are used to implement various forms of Programming Model
synchronization. Unlike in the original MTA, some The Cray XMT inherits the majority of its program-
assumptions on the use of those bits are leveraged to ming model from the Tera MTA and presents a flat,
allow the implementation to only have two extra bits shared memory to the programmer that is accessed
per memory location. The directly supported forms of
synchronization include mutual exclusion locks, single
word producer–consumer, among others.
Because the processor is latency-tolerant, most Cray SeaStar Chip
memory used in the Cray XMT is distributed across all
of the memory units that reside at every processor. A
pseudo-random distribution is used to avoid any stride R RMA HyperTransport
o link
access conflicts or other memory reference patterns that
u
might create hot memory units, assuming the references t DMA
e engine
are not accessing the exact same memory word (or small RAM
r
set of words).
Some other differences between the MTA- proces- Processor
sor implementation and Threadstorm include:
Link to L0 controller
● Number of processors supported increased from
 to ,
● Memory distribution changed from every word to
every eight words Cray XMT. Fig.  SeaStar
 C Cray XMT

using enhanced C and C + + languages via a compiler service and compute nodes. TCP/IP is also supported
that provides automatic parallelization []. Fortran is between the two node types.
not supported on the Cray XMT.
The Cray XMT programming model is supported Infrastructure and Administration
by an advanced runtime library that is linked in with As previous described, from a packaging standpoint,
the application. Due to certain aspects of the hard- the Cray XMT is simply a Cray XT system where the
ware design, this runtime is able to assume various Opterons on the compute nodes have been replaced
responsibilities normally associated with an operating by Threadstorm processors (and SeaStar interconnect
system []. Such responsibilities include thread-level ASICs are present). As such, all of the other infras-
scheduling and exception handling. The MTA hardware tructure and support for XT systems, other than the
allows for thread creation and destruction in user mode XT compute node programming environment, applies
and delivers all exceptions directly to the privilege level to XMT. By using the XT infrastructure, the Cray
in which they are raised. Events such as floating point XMT can leverage the reliability, manufacturability, and
and memory synchronization exceptions are delivered economies of scale that XT provides.
directly to user mode if they are raised there. Cray XMT systems are administered in a very simi-
Various tools from the MTA have also been updated lar fashion to XT. They have the same System Mainte-
and repackaged for the Cray XMT. Debugging is pro- nance Workstation (SMW) and Hardware Supervisory
vided by the mdb debugger that is based on gdb []. System (HSS) as is used in XT, with some extensions to
The compiler analysis tool Canal is available. Tracing specifically support the Threadstorm processor. As with
is provided by Tview. Bprof provides block-level profil- XT, users log into Opteron-based service nodes for
ing. Cray Apprentice provides a graphical interface for access to the system and programming environment.
viewing the output of these tools []. Jobs are then launched from the service, or login, node
and run on the compute nodes.
Operating System
The compute processors of the Cray XMT run an oper- Target Applications
ating system called MTK, which is derived from the . The Cray XMT excels at applications that operate on
Berkeley Software Distribution (BSD) of Unix with a large volumes of unstructured data such that the data
custom microkernel. The microkernel handles most of does not fit into the memory of a single node of a typi-
the hardware-specific processor functionality, includ- cal computer and that data is also not easy to partition
ing memory allocation and process-level scheduling, for locality to multiple nodes of such a computer. Some
while the BSD layer provides a familiar environment application areas that have this type of data include
for applications. The operating system treats all of the social media analysis, power grid contingency analysis,
XMT compute processors as a single instance with a high-throughput video analysis, and some forms of text
Uniform Memory Access (UMA) model of memory. document processing.
The main exception to this treatment is that instructions
are replicated to every Threadstorm processor in order Follow-on Designs
to make it easier to deliver the required instruction fetch The concepts contained in XMT and Threadstorm have
bandwidth. been considered for some other designs.
The service processors of the XMT are identical
to those in the Cray XT and run a version of Linux Scorpio
derived from SuSE Linux Enterprise Server (SLES). As part of the research for the DARPA HPCS program,
Each service node runs its own instance of Linux. Cray investigated the design of a highly multi-threaded,
Built on top of the low-level Portals communication multi-core processor called Scorpio that planned to
protocol used between the compute and service nodes, include several of the features present in the Thread-
the Lightweight User Communication (LUC) library is storm processor. These features included the multi-
available for user programs to transfer data between the plexing of many hardware streams onto an execution
Cray XT and Cray XT Series of Supercomputers C 

pipeline and the implementation of extra memory state


bits on every -bit word of memory for fine-grained Cray XT Series
synchronization (though the bit definitions were some-
what different than with XMT). Scorpio has never been Cray XT and Cray XT Series of Supercomputers
Cray XT and Seastar -D Torus Interconnect
manufactured. C

XMT Cray XT


A new design of the Threadstorm processor and XMT
system is being investigated by Cray to address some of Cray XT and Cray XT Series of Supercomputers
the current shortcomings of the XMT. The main areas Cray XT and Seastar -D Torus Interconnect
of improvement being considered include:

● DDR memory controller (instead of DDR)


● Larger per-node memory sizes Cray XT and Cray XT Series of
● Use of XT infrastructure (rather than XT)
Supercomputers
Jeff Brooks , Gerry Kirschner
Related Entries 
Cray Inc., St. Paul, MN, USA
Cray XT and Cray XT Series of Supercomputers 
Cray Incorporated, St. Paul, MN, USA
Cray XT and Seastar -D Torus Interconnect
Multi-threaded Processors
Tera MTA
Synonyms
Cray red storm; Cray SeaStar Interconnect; Cray XT
series; Cray XT; Cray XT; Cray XT; Cray XT; MPP

Bibliography Definition
. Alverson A et al () The Tera computer system. In: Pro-
ceedings of the th international conference on supercomputing.
The Cray XT supercomputer is a large-scale Massively
ACM Press Parallel Processing (MPP) supercomputer from Cray
. Sarita V, Adve KG () Shared memory consistency models: a Inc. The design for the Cray XT is based on the Red
tutorial. In: IEEE Comput, December  Storm supercomputing system which was designed in
. Brightwell R et al () Portals .: protocol building blocks for cooperation with Sandia National Laboratory (Sandia).
low overhead communication. In: Proceedings of the international
The system was announced in November of . Sub-
parallel and distributed processing symposium, IEEE 
. Cray Inc () Cray XMTTM programming environment user’s sequent products have included the Cray XT, Cray
guide, S--, /, http://docs.cray.com/books/S--/ XT, and Cray XT supercomputers, each based on suc-
S--.pdf. Accessed December  cessive versions of AMD Opteron processor, but essen-
. Alverson G et al () Scheduling on the Tera MTA. In: Pro- tially retaining the basic Cray XT architecture. The
ceedings of the workshop on job scheduling strategies for parallel
Cray XT series of supercomputers has proven to be a
processing, Springer
. Cray Inc () Cray XMTTM debugger reference guide, S--.
very successful product line, selling over , cabinets
http://docs.cray.com/books/S--/S--.pdf. Accessed in a -year period.
December 
. Cray Inc () Cray XMTTM performance tools user’s guide, Discussion
S--. http://docs.cray.com/books/S--/S--.pdf.
In , the U.S. Department of Energy’s National
Accessed December 
. Cray Inc () Cray XMTTM system overview, S--.
Nuclear Security Administration (NNSA) set aside
http://docs.cray.com/books/S--/S--.pdf. Accessed funding for Sandia to obtain a new high-end computa-
December  tional capability to address mission needs. Sandia sent
 C Cray XT and Cray XT Series of Supercomputers

out a Request for Information; however, no existing booting, monitoring, and managing Cray XT sys-
architecture was able to satisfy the lab’s scalability and tem components. The SMW communicates with all
cost requirements. Subsequently, a Request for Quo- nodes and RAID disk controllers via a private Ethernet
tation was issued by Sandia. Two suppliers responded network.
with proposals; however, neither was able to fully meet
the requirements laid out in the proposed Statement of Processor
Work with existing or planned products. The AMD OpteronTM processor was selected for the
The proposal from Cray indicated a willingness Red Storm project. There were several reasons for this:
to custom engineer a supercomputer to meet Sandia’s
needs within the budgetary envelope which was approx- . The processor featured an open, high-speed inter-
imately $M. The project would require a custom face called HyperTransportTM . This would allow
ASIC, custom packaging and cooling, a custom clas- the processor to be tightly integrated with a cus-
sified/unclassified interconnect switch (known as the tom high-performance interconnect chip. Because
“red-black” switch), a custom hardware supervisory HyperTransport was an open interface, there was
system (HSS), and a custom software stack. Despite extensive third-party IP available. This would serve
this, the Red Storm project was completed and deliv- to reduce risk and development time for the inter-
ered in only  months. Concurrently, with this initial connect ASIC.
delivery, Cray launched the Cray XT supercomputer . The processor is fully compatible with the X pro-
in November of  in Pittsburgh, PA, at the SC cessor architecture. This ensured compatibility with
conference. a vast quantity of existing software, including com-
pilers, libraries, and applications. This would serve
to reduce risk associated with operating system and
Architecture application software.
The Cray XT system is a massively parallel system . The processor featured extensions to the X archi-
consisting of two types of nodes, grouped into two tecture specified by AMD allowing applications to
partitions, respectively: use -bit addressing. This mode is called X–.
Applications can be ported to -bit mode, yet
● Compute nodes run application programs. All com- full compatibility and performance is retained with
pute nodes run a Cray XT light weight kernel. -bit applications. Intel ultimately followed this
● Service nodes handle support functions such as convention, ensuring its market relevance for the
login management, I/O, and network management. long term.
All service nodes run a full Linux-based Cray XT . The AMD Opteron pulled the function of the
operating system. Northbridge chip onto the processor die itself. In
The nodes are tightly coupled with an interconnection particular, the memory controller is on the Opteron
network that supports fast MPI traffic, as well as a fast itself. This resulted in a very low-latency interface to
I/O to a global, shared-file system. The interconnection memory (∼ ns) and a dedicated high-bandwidth
network is based on a -D torus topology that com- interface for each processor (. GB/s).
bines HyperTransport and proprietary protocols. The . Since the Northbridge is in the Opteron itself, there
Cray SeaStar chip on each node functions as the router is considerable savings in components (one inte-
chip. It has seven bidirectional ports, six for the -D grated circuit per node) and power (approximately
interconnection network and one for the HyperTrans-  W/node) since no separate Northbridge chips
port link to the node’s processor. Figure  shows the would be necessary. This served to reduce compo-
basic architecture of a Cray XT system. nent count and increases reliability.
Each Cray XT system also includes a Cray RAS and
Management System (CRMS). The CRMS includes a Node Memory
System Management Workstation (SMW), which func- Each node has four slots for memory DIMMs. These
tions as the administrator’s single-point interface for slots provide – GB of local memory for the node.
Cray XT and Cray XT Series of Supercomputers C 

GigE
Login
server(s)

C
10 GigE
Network(s)

GigE SMW
Interconnection
network:
3D Torus in Fibre
each dimension channels
RAID
Y subsystem (s)

X
Compute node (Microkernel)
Compute node
Z
Service nodes (Full Linux OS)
Login node

Network node

I/O node

Boot node

Cray XT and Cray XT Series of Supercomputers. Fig.  Cray XT system architecture

Each DIMM has  bits:  data bits and  error- as blade control processor) interface which is used for
correcting code (ECC) bits. system management.
Memory is protected using single-symbol correc-
tion/ double-symbol detection (SSC/DSD) enhanced Service Nodes
ECC. This enhanced ECC can detect and correct single- Figure  shows the architecture of a service node.
symbol errors, or can detect two-symbol errors, but Service nodes use the same processors, memory,
cannot correct them. and SeaStar chips as compute nodes. In addition to
these components, the service node has an AMD-
Compute Node , HyperTransport PCI-X tunnel chip that drives
Figure  shows a block diagram of a Cray XT com- two PCI-X slots. Each slot is on its own indepen-
pute node. Each compute node consists of one processor dent -bit/ MHz PCI-X bus. PCI-X cards plug into
socket populated with minimum of a . GHz single- the PCI-X slots and interface to external I/O devices.
core AMD Opteron processor, four DIMM slots which Later versions of the Cray XT series implemented PCI-
provide – GB local memory, a Cray SeaStar chip that Express slots on these nodes.
connects the processor to the high-speed interconnec- Each Cray XT system includes several types of
tion network, and an L controller (also referred to service nodes. Each type performs dedicated function
 C Cray XT and Cray XT Series of Supercomputers

Local DRAM
Cray SeaStar chip Opteron processor (socketed) (1–8 GB)

DDR memory D D
Interconnection DMA Hyper controller I I
network engine transport M M
ports M M
Router AMD64
Hyper core D D
transport I I
M M
RAM Caches M M
L0 controller SSI
(CRMS) block
Processor

Cray XT and Cray XT Series of Supercomputers. Fig.  Compute node architecture

Local DRAM
Cray SeaStar chip Opteron processor (socketed) (1–8 GB)

DDR memory D D
Interconnect DMA controller
Hyper I I
network engine transport M M
ports
Router M M
AMD64
Hyper core D D
transport I I
RAM M M
Caches
M M
L0 controller SSI
(CRMS) block
Processor Bridge PCI-X slot
I/O
devices
Bridge PCI-X slot
AMD-8131
I/O
PCI-X tunnel

Cray XT and Cray XT Series of Supercomputers. Fig.  Service node architecture

and requires a minimum of one PCI-X card. The types card. The Fibre Channel ports connect to the sys-
include: tem’s RAID storage.
● Boot Node: Each system requires one boot node.
● Login Node: Users login into the system via login A boot node contains one FC HBA and one GigE
nodes. Each login node includes one or two single- PCI-X card. The FC HBA connects to the RAID and
port Gigabit Ethernet PCI-X card that connects the GigE card connects to the System Management
to a user workstation. Copper and fiber cards are Workstation of the CRMS.
available.
● Network Service Node: Each Network service node Interconnection Network
contains one  Gigabit Ethernet PCI-X card that The Cray XT system uses its interconnection network
can be connected to customer network storage to link nodes in a D torus topology and to facilitate
devices. the system’s high-communication bandwidth. Physi-
● I/O Node: Each I/O node contains one or two dual- cally, this network includes the Cray SeaStar routers (the
port Fibre Channel (FC) Host Bus Adapter (HBA) heart of the network), router ports, and cables.
Cray XT and Cray XT Series of Supercomputers C 

Key performance data for the interconnection network are:
● A bidirectional injection bandwidth of more than  GB/s per direction. This is the bandwidth from the processor core into the SeaStar router.
● A sustainable bandwidth of . GB/s for each of the six directions of the torus. The SeaStar router supports an aggregate bandwidth of almost  GB/s.
● An MPI latency of less than  μs between a processor pair.

This Cray SeaStar network allows efficient MPI messaging and also high sustainable bandwidth to the Object Storage Servers (OSS) driving the disk systems for the global Lustre filesystem.

A Cray SeaStar chip provides a HyperTransport connection to the node's AMD processor; its six ports connect to the D torus. The X, Y, and Z dimensions use two ports each. The network uses a Cray proprietary protocol.

Network connections are made three ways: within the blade, on a backplane that the blades plug into, and with groups of network cables. The network cables can be configured in multiple ways to create various topologies based on system size.

Configuration tables are used for data flow mapping across the interconnection network. These tables are loaded at system boot time. They can also be reloaded to reconfigure the interconnection network following a hardware failure.

Cray SeaStar Chip
There is one Cray SeaStar chip for each Opteron processor in the system. The chip connects the Opteron processor to the D interconnection network. It offloads all send/receive work so that compute work is not interrupted. Here is a block diagram of the SeaStar chip (Fig. ):

[Figure: Cray SeaStar chip – interconnection network ports and router, DMA engine, HyperTransport interface, RAM, processor, and L0 controller SSI (CRMS) block.]
Cray XT and Cray XT Series of Supercomputers. Fig.  Cray SeaStar chip

The SeaStar chip is a system-on-chip design that combines third-party hardware components and interface definitions with Cray custom logic and functionality. It contains the following major blocks:
● A HyperTransport interface that provides a bidirectional link between the Opteron processor and the SeaStar chip. The HyperTransport link has a maximum data transfer rate of . GB/s in each direction between the Opteron processor and the SeaStar chip.
● A custom router that enables the Opteron to communicate with other nodes via the interconnection network. The router has six ports that connect to the interconnection network. The peak bidirectional bandwidth of each port is . GB/s, with a measured sustained bandwidth of . GB/s.
● A communication interface that consists of a Direct Memory Access (DMA) engine, an on-chip processor, and on-chip RAM connected to a local bus. These components function as a message processor that routes send/receive data packets to their appropriate destinations.
● A synchronous serial interface (SSI) that connects the module controller (known as the L) to the internal SeaStar local bus. This connection allows the L controller access to the Opteron memory and status registers. The L controller is part of the CRMS, which is used for booting, maintenance, and monitoring of the Cray XT system.

In order to achieve the lowest possible system latency, it is necessary to provide a path directly from the application to the communication hardware without the time-consuming traps and interrupts associated with traversing a protected operating system kernel. The SeaStar chip uses the Portals interface to provide this path.

More information about the Cray SeaStar interconnection network is available in the entry on Cray XT and Seastar -D Torus Interconnect.
Packaging

Cray XT Compute Cabinet
Each compute cabinet contains three module chassis as follows:
● Each chassis is populated with any combination of eight blades (compute or service).
– A compute blade contains four compute nodes, providing four Opteron processors, up to  GB of DIMM memory per processor, and four SeaStar chips. A single compute cabinet can hold up to  compute nodes.
– A service blade consists of two service nodes, including two Opteron processors and – GB of memory per blade, depending on the functions of the service nodes. The blade has four SeaStar chips to allow for a common board design and to simplify primary interconnect configurations. Some service blades also contain PCI-X slots for external connections. A single compute cabinet can hold up to  service nodes.
● The cabinet contains all power and cooling equipment for these blades; no external equipment is required.
● Each cabinet is air-cooled. A single blower assembly, located toward the front, below the cages, cools all components within the cabinet. The blower pulls underfloor air into the cabinet and forces air vertically through the three chassis. Warm air exhausts through the top of the cabinet. A blower speed controller varies the speed of the blower to maintain the correct air pressure and temperature.

Ethernet switches and the cabinet controller (known as the L controller) are located at the rear of the cabinet. These components are part of the CRMS.

The following figure shows the component locations of a compute cabinet. A single mechanical assembly contains three identical chassis that house blades. Blades plug into the front of the chassis. Each chassis has a backplane assembly at the back (Fig. ).

Software
Cray XT system software is based on software from the Sandia Red Storm project, plus the addition of commercially available tools and features. This creates a system that, to the application programmer, is an optimized parallel processing machine and, to the system administrator, features a single boot image and a single root file system.

The software allows programmers to optimize applications that have fine-grain synchronization requirements, large-scale processor counts, and significant communication requirements.

Cray XT system software falls into four major areas:
● Operating system
● File system
● Programming Environment
● RAS Management System software

Cray XT Operating Systems
The Cray XT operating system functions fall into three categories, identified by the role of the components on which they run:
1. Service nodes run the full Linux Cray XT operating system.
2. Compute nodes run a Light Weight Kernel called Catamount.
3. CRMS components (the L and L controllers) run an embedded real-time Linux kernel.

SuSE Linux Operating System
Cray XT system service nodes run a full Linux operating system based upon SuSE Enterprise Server .. This was later upgraded to SuSE Enterprise Server release . in Q (Fig. ).

The operating system supports six types of service nodes:
● Login Nodes
● I/O Server Nodes
● Network Server Nodes
● File System metadata Server Nodes
● Database Server (Admin) Nodes
● Boot Nodes

SuSE Linux Overview
With SuSE Linux, the Cray XT system provides users with all standard Linux interfaces and working environments. SuSE Linux was selected for the Cray XT operating system for a number of reasons:
Cray XT and Cray XT Series of Supercomputers C 

[Figure: compute cabinet component locations – air exhausts at the top; three cages (Cage 0, Cage 1, Cage 2) each hold eight blades and have a cage backplane assembly, cage ID controller, power connectors, and VRMs, with interconnect network cables and Ethernet switches at the rear; the SCM/L1 controller, power supplies, PDU assembly, and blower assembly sit below the cages, with air intake at the bottom (front and rear views shown).]
Cray XT and Cray XT Series of Supercomputers. Fig.  Compute cabinet

[Figure: node functions – the compute partition runs the lightweight kernel OS on compute nodes, while the service and I/O partition runs the Linux OS on specialized service nodes (login nodes, I/O server nodes, network server nodes, file system metadata server nodes, and database server nodes), which connect through PCI-X to a  GigE network and through PCI-X Fibre Channel to RAID controllers.]
Cray XT and Cray XT Series of Supercomputers. Fig.  Node functions


● Open Source. The availability of source code for the operating system allows it to be customized. This is not possible with proprietary operating systems.
● Widely Supported. Several vendors provide Linux support, in addition to the legion of programmers who are constantly enhancing the operating system.
● Familiar to Users. Many HPC users develop codes on Linux desktops. Most users are more comfortable with this familiar operating system than they would be with a proprietary solution.
● -bit aware. At the time it was chosen for the Cray XT system, SuSE was one of the few companies actively targeting the -bit marketplace. SuSE software was already ported to the AMD Opteron architecture, so its close working relationship with AMD made it the obvious choice.
● Cooperation. SuSE worked with Cray to further the development of SuSE Linux in the high-performance marketplace.

Single System View (SSV)
The Cray XT system presents a Single System View (SSV) to users and system administrators. The following SSV features are provided:
● Single point of administration: An administrator can perform system administration tasks from the system boot node, without needing to log into additional nodes. Any administrative file in the machine can be modified from the root node.
● Global file space: User home directories and data directories are visible across the entire machine and may be accessed from anywhere using the same path.
● Single user process space: All user processes on the system are visible from any node. Each has a unique PID and can be signaled from any node.

Lightweight Kernel (LWK)
Because the Cray XT system was designed to scale to very large processor counts, it runs a computational environment that has very little system overhead, using a lightweight kernel (LWK) named Catamount. This LWK, developed jointly with Sandia National Laboratories, is a small, low-complexity OS that manages access to the physical node resources and infrastructure for application execution, yet keeps system overhead at a minimum. It provides virtual memory addressing, physical memory allocation, memory protection, access to a high-performance message-passing layer, and a scalable job loader. While parallel jobs retain an agent (yod) on the login node that launched them, actual computation runs under the LWK on the compute nodes. And, as compute regions do not need to run standard Linux daemons, all standard requests are executed from the service partition. This significantly reduces the demands on the LWK, and more compute cycles are reserved for the user. The ultimate benefit is that parallel jobs run in shorter times.

Some system calls are handled by the LWK; others are forwarded either to the yod agent running on the login node, or directly to I/O nodes for services that require a high degree of concurrency. The Catamount LWK is made up of two parts: the quintessential kernel (QK) and the Process Control Thread (PCT).

QK – The Quintessential Kernel
The QK provides an operating system kernel for the compute nodes that is resilient and scalable with minimal system overhead.
Resilient: Its simple design provides only essential functions required to run applications, much less than a full OS. It has been designed, tested, and evolved over two generations of MPPs.
Scalable: The QK design eliminates synchronization or communication between QKs on different nodes. This independence allows the system to scale to thousands of processors. Performance is achieved by using a simple and minimal system protocol that minimizes overhead for invoking kernel functions. Except for the PCT, there are no processes or daemons that can interrupt programs and degrade performance of tightly coupled applications.

Process Control Thread (PCT)
On top of the QK, a special user-level process, the Process Control Thread (PCT), performs services on behalf of application processes. Its primary responsibility is to start user applications, track them, schedule them, and relay completion and signal information between the application and yod.
Cray XT and Cray XT Series of Supercomputers C 

LWK and Message Passing
The native messaging protocol for the Cray XT system is Portals version .. It is a low-latency, low-overhead protocol, ideal for scalable, high-performance network communications. It is connectionless (i.e., it does not stay connected across consecutive communications), so the amount of system RAM used to provide buffers for message passing is independent of both the number of compute nodes in the system and the number of compute nodes assigned to a particular job.

The LWK is designed to support a high-performance MPI implementation. It lays applications out in memory in a linear fashion so that virtual memory can be mapped to real memory in a relatively simple offset scheme. As a result, Portals entries can span large chunks of real memory, and it is possible to map all memory using the available Portals table entries in the Cray SeaStar chip.

Kernels on the compute nodes and on the service nodes implement the Portals communication primitives. The Cray SeaStar chip includes firmware that offloads the Portals message-passing overhead from the Opteron processor. At the user level, application processes communicate with one another by linking libraries that support the Portals interface.

Programming Environment
The Cray XT Programming Environment provides comprehensive support for -bit application development under an MPI model. It includes the following:
● Compilers
● Application libraries
● Application and system state monitors
● Application launch utilities
● Debugger
● Performance API
● Modules utility

Single Copies of Tools
On the Cray XT System, a single instance of each programming environment tool (compilers, debuggers, and so on) can be accessed from any login node on the system as if it were loaded on a disk connected to that node. Likewise, the Lustre file system provides a consistent view from anywhere within the system so that any files that are required for program execution will have an identical path name no matter where the job is run.

Product Enhancements

XT Dual Core
The peak performance of Cray XT systems was significantly increased with the introduction of AMD Opteron dual-core processors. The AMD Opteron dual-core design increases application performance without having to redesign the socket or board.

Figure  depicts the processor die with two CPU cores, each core having  MB of L cache. The caches are, in some sense, shared: if one CPU misses in its associated cache but the data are available in the other CPU's cache, the data are loaded directly from there through the System Request Interface (SRI). The AMD Opteron was designed to add a second core, with a port already existing on the crossbar/SRI. The dual-core part drops into existing AMD Opteron AM sockets, thus also enabling upgrades of single-core systems. The two CPU cores share the same memory and HT resources found in the single-core part. The integrated memory controller and HT links route out the same as today's implementation.

[Figure: AMD Opteron dual-core die organization – two CPU cores (CPU0, CPU1), each with  MB of L cache, attached to the System Request Interface and crossbar switch, which connect to the memory controller and HyperTransport links HT0, HT1, and HT2.]
Cray XT and Cray XT Series of Supercomputers. Fig.  Interface for second processor in the AMD Opteron dual-core design
Almost all of Cray's initial single-core Cray XT customers took advantage of this upgrade.

Cray Linux Environment
The original operating system for the Cray XT was based on SuSE Linux for the Service Nodes and Catamount for the Compute Nodes. With quad-core processors coming, Cray needed a more full-featured OS on the compute node but needed to retain the scalability properties of Catamount. Cray transitioned to an operating system for Cray XT systems called the Cray Linux Environment (CLE) that is Linux based throughout.

The Cray Linux Environment (CLE) uses a lightweight kernel operating system for compute nodes. The lightweight kernel includes a runtime environment based on the SUSE SLES distribution. It provides outstanding performance and scaling characteristics, matching the performance seen on the Catamount software stack.

CLE on Compute Nodes offers the full functionality of a Linux kernel, including:
● POSIX system calls (as supported by the SUSE Linux kernel)
● Programming models, including OpenMP, MPI (version ., based on MPICH), Cray SHMEM, and CAF. (Note: Other programming models work as well. These include Global Arrays (the communication layer used by NWCHEM) and Charm++ (used by NAMD). Cray does not distribute this software but interacts closely with its developers (PNNL and the University of Illinois, respectively) to make sure they work on the Cray XT systems.)
● Application networking (Sockets)
● POSIX threads

PCI-Express SIO Blades
To improve the performance of Service and I/O Nodes (SIO) on Cray XT systems, Cray transitioned from PCI-X to PCIe interfaces. The following figure shows the architecture of a Cray XT dual-core service node. Service nodes use an AMD Opteron dual-core processor, but the same DDR memory and interconnect SeaStar processors as compute nodes. In addition to these components, the service node has a Broadcom HT- HyperTransport PCIe tunnel chip that drives two PCIe slots.

The Broadcom HT- modular bay on which the tunnel chip is mounted provides two PCIe slots for cards to interface to external I/O devices:
● a x PCIe slot provides  GB/s full duplex communication
● an x PCIe slot provides  GB/s full duplex communication

Each PCIe slot has independent buffers and shares (fairly) the bandwidth back to the attached Opteron. The HT- HyperTransport PCIe tunnel chip is connected to the dual-core Opteron via a -GB/s HT link (Fig. ).

Follow-on Products

Cray XT
The Cray XT represented the second system in the Cray XT line of supercomputers. A new compute blade was designed to accommodate the new processors from AMD, and the rest of the Cray XT system (service blades, boot infrastructure, software stack, cabinets, backplanes, power supplies, etc.) was retained. The network router chip, SeaStar, was enhanced and renamed SeaStar. The Cray XT provided a significant performance increase over the Cray XT system in the same floor space. Initial Cray XT systems shipped with dual-core AMD Opteron processors and DDR memory.

Later enhancements of the Cray XT compute nodes include the use of Quad-Core AMD Opteron processors. These processors were also capable of four floating-point results per cycle per processor core, which provided a large increase in peak performance. Service and I/O nodes (SIO) were upgraded to dual-core AMD Opteron processors. Cray XT customers could upgrade to the Cray XT system by simply replacing compute blades.

The largest Cray XT system installed is "Franklin," a -cabinet system installed at the National Energy Research Scientific Computing Center (NERSC).

Cray XT
The Cray XT provided another significant improvement in performance over the previous generation Cray XT. The processor technology for the Cray XT included the quad-core "Barcelona" and "Shanghai" processors from AMD, and also the Six-Core "Istanbul" processor.
Cray XT and Cray XT Series of Supercomputers C 

[Figure: Cray XT dual-core service node – an AMD 940-socket dual-core Opteron (two CPU cores with L2 caches, System Request Interface, crossbar switch, and DDR1 memory controller with local DDR1 SDRAM) connects by HyperTransport to the Cray SeaStar2 chip (interconnect network ports, DMA engine, router, RAM, and L0 controller SSI (CRMS) block) and to a Broadcom HT-2100 PCIe tunnel that bridges to a 16x PCIe slot and an 8x PCIe slot for external I/O devices.]
Cray XT and Cray XT Series of Supercomputers. Fig.  Cray XT service node architecture (PCIe interface)

In a significant change from the Cray XT and Cray XT, the Cray XT system packed four dual-socket nodes on a single compute blade. The form factor for the compute blade was retained, resulting in an overall doubling of compute density. Each system cabinet contains up to  AMD Opteron processors.

The system cabinet was enhanced to accommodate the increase in power and density. The backplane was designed to handle higher current, and an enhanced power supply and blower system was designed to power and cool the system.

The compute node on the Cray XT system is either - or -core, depending on the Opteron used. The interconnect router chip was also improved for the Cray XT. The new SeaStar+ chip yields higher bandwidth and lower latencies than the previous generation SeaStar and SeaStar routers.

The service blades and service infrastructure of the Cray XT were retained from the Cray XT system.

The following figure shows a block diagram of a Cray XT compute node. Each compute node consists of two processor sockets populated with AMD Opteron quad-core (or six-core) processors, eight DIMM slots, a Cray SeaStar ASIC that connects one processor to the high-speed interconnection network, an L controller interface used for system management, and coherent HyperTransport connections between the two Opteron sockets. With the quad-core sockets, this provides an eight-way NUMA shared memory node (Fig. ).

Cray ECOphlex Cooling
The Cray XT also introduced a new phase-change heat-removal system known as "Cray ECOphlex" cooling. Unlike traditional water-cooling solutions available in the market, the heat generated by the compute blades is rejected to a fluid refrigerant – Ra – via a liquid–vapor phase change (evaporation), as illustrated in the figure below. The bottom-to-top airflow on the Cray XT results in a limited surface area that can be used for heat removal at the top of the system cabinets. Since phase-change cooling is about an order of magnitude more effective per unit area compared with traditional water coils, it represented an ideal match for the Cray XT system (Fig. ).

The Cray ECOphlex system includes one or more Heat Exchange Units (HEU, also known as XDP). This HEU is connected to the building water circuit, and its purpose is to recondense the Ra, which runs in a closed loop between the evaporators in each system cabinet and the condenser coil in the HEU.
[Figure: Cray XT compute node – two AMD F1207 processor sockets, each with four CPU cores (core 0–core 3) having per-core L2 caches and a shared L3 cache, a System Request Interface and crossbar switch, HyperTransport links, and a DDR2 memory controller with local DDR2 SDRAM; one socket connects by HyperTransport to the Cray SeaStar2 chip (interconnection network ports, DMA engine, router, RAM, and L0 controller SSI (CRMS) block).]
Cray XT and Cray XT Series of Supercomputers. Fig.  Cray XT compute node architecture ( processors)

[Figure: liquid–vapor phase change – the hot air stream passes through the evaporator and rejects heat to R134a via a liquid–vapor phase change (evaporation); liquid-phase R134a enters and gas-phase R134a exits, and the R134a absorbs energy only in the presence of heated air.]
Cray XT and Cray XT Series of Supercomputers. Fig.  Liquid-vapor phase change

A single blower located in the bottom of each system cabinet is used to move air over the system blades. The blower pulls computer-room air into the cabinet and forces air through the three chassis and ultimately through the Ra evaporators. A blower speed controller varies the speed of the blower to maintain the correct air pressure and temperature, and control logic within the XDP ensures that the Ra is never below the dew point.
Cray XT and Cray XT Series of Supercomputers C 

[Figure: schematic of a Cray XT cabinet with ECOphlex cooling – 3,200 CFM of air exits through exit evaporators at the top of the cabinet, two 1,600 CFM streams pass up through the cages past the R134a piping, and an inlet evaporator sits at the bottom; no plenum air is required.]
Cray XT and Cray XT Series of Supercomputers. Fig.  Schematic of Cray XT cabinet with ECOphlex cooling

Additionally, this liquid cooling allows customers to potentially reduce yearly cooling costs dramatically by leveraging the "Free Cooling" concept. "Free Cooling" means taking advantage of cool outdoor air to help save energy in data center chilled-water cooling systems.

In a data center facility, a typical mechanically generated chilled-water system consists of the following equipment that uses electrical energy:
● Water chiller
● Computer room air-conditioning units
● Chilled water pumps
● Condenser water pumps
● Cooling tower fans

The water chiller is by far the biggest energy user (approximately %) in a chilled-water system. If the water chiller's compressor(s) can be shut down during cool weather, the outside ambient temperature can be used to help save energy in the chilled-water system. The Cray XDP cooling system can operate with a higher chilled-water set-point temperature than a typical computer room air-conditioning unit. Thus, it is possible to bypass the chiller more with ECOphlex cooling compared to traditional air cooling.

A traditional approach to free cooling is an indoor, water-cooled chiller connected to an outdoor closed-loop cooling tower. Typically, automatic valves and crossover piping are employed to bypass the chiller when outside conditions are suitable.

An additional benefit of free cooling is extending the useful life of the water chiller by reducing its operating hours (Fig. ).

The largest Cray XT system delivered is the -cabinet system at Oak Ridge National Laboratory, known as "Jaguar." Jaguar holds the distinction of being the fastest system in the world as of November of  (Fig. ).

Cray XT
Cray introduced the Cray XT in March of . The Cray XT provided several major enhancements over the previous generation Cray XT.
1. The Cray XT Compute Node uses two socket G ( nm technology) AMD "Magny Cours" processors with eight or twelve cores per socket. Compute nodes on the Cray XT use two processors and hence have  or  cores per node.
Cray XT and Cray XT Series of Supercomputers. Fig.   cabinet ORNL Jaguar system with Cray ECOphlex cooling

2. The Cray XT Compute Node memory uses DDR technology rather than the DDR memory of previous generation Cray XT systems.
3. A new system cabinet was developed (the Series- cabinet). Enhancements included a more efficient ECOphlex evaporator design, a new system blower, and an enhanced power distribution system.
4. The Cray Linux Environment  (CLE). This version of the system software includes a "Cluster Compatibility Mode" which allows users to install and run Independent Software Vendor (ISV) packages with no changes. CLE retained the "Extreme Scalable Mode" as the default OS image for native applications. A user can select either partition at job submittal time.

Shipments of the Cray XT started in the first half of . At the time of this writing, the largest Cray XT system is installed at HECToR, the United Kingdom's National Supercomputer Service. This system has over , processor cores.

Cray XT
Cray XT and Cray XT Series of Supercomputers
Cray XT and Seastar -D Torus Interconnect

Cray XT and Seastar -D Torus Interconnect
Dennis Abts
Google Inc., Madison, WI, USA

Synonyms
Cray red storm; Cray XT series; Cray XT; Cray XT; Cray XT; Cray XT; Interconnection network; Network architecture

Definition
The Cray XT system is a distributed memory multiprocessor combining an aggressive superscalar processor (AMD) with a bandwidth-rich -D torus interconnection network that scales up to  K processing nodes. This chapter provides an overview of the Cray XT system architecture and a detailed discussion of its interconnection network.

Discussion
The physical sciences are increasingly turning toward computational techniques as an alternative to the traditional "wet lab" or destructive testing environments for experimentation. In particular, computational sciences can be used to scale far beyond traditional experimental methodologies, opening the door to large-scale climatology and molecular dynamics, for example, which encompass enough detail to accurately model the dominant terms that characterize the physical phenomena being studied [].

model the dominant terms that characterize the phys- compute or system and IO (SIO) nodes. SIO nodes are
ical phenomena being studied []. These large-scale where users login to the system and compile/launch
applications require careful orchestration among coop- applications.
erating processors to ply these computational tech-
niques effectively.
Topology C
The genesis of the Cray XT system was the collabo-
The Cray XT interconnect can be configured as either
rative design and deployment of the Sandia “Red Storm”
a k-ary n-mesh or k-ary n-cube (torus) topology. As
computer that provided the computational power nec-
a torus, the system is implemented as a folded torus
essary to assure safeguards under the nuclear Stock-
to reduce the cable length of the wrap around link.
pile Stewardship Program which seeks to maintain and
The seven-ported SeaStar router provides a proces-
verify a nuclear weapons arsenal without the use of
sor port, and six network ports corresponding to
testing. It was later renamed the Cray XT and sold
+x, −x, +y, −y, +z, and −z directions. The port assign-
commercially in configurations varying from hundreds
ment for network links is not fixed, any port can cor-
of processors, to tens of thousands of processors. An
respond to any of the six directions. The noncoherent
improved processor, faster processor–network inter-
HyperTransport (HT) protocol provides a low latency,
face, along with further optimizations to the software
point-to-point channel used to drive the Seastar net-
stack and migrating to a lightweight Linux kernel
work interface.
prompted the introduction of the Cray XT; however,
Four virtual channels are used to provide point-
the underlying system architecture and interconnection
to-point flow control and deadlock avoidance. Using
network remained unchanged.
virtual channels avoids unnecessary head-of-line (HoL)
blocking for different network traffic flows; however, the
System Overview
extent to which virtual channels improve network uti-
The Cray XT system scales up to k nodes using a
lization depends on the distribution of packets among
bidirectional -D torus interconnection network. Each
the virtual channels.
node in the system consists of an AMD superscalar
processor connected to a Cray SeaStar chip [] (Fig. )
which provides the processor–network interface, and Routing
six-ported router for interconnecting the nodes. The The routing rules for the Cray XT are subject to several
system supports an efficient distributed memory mes- constraints. Foremost, the network must provide error-
sage passing programming model. The underlying mes- free transmission of each packet from the source node
sage transport is handled by the Portals [] messaging identifier (NID) to the destination. To accomplish this,
interface. the distributed table-driven routing algorithm is imple-
This chapter focuses on the Cray XT interconnec- mented with a dedicated routing table at each input port
tion network that has several key features that set it apart that is used to look up the destination port and vir-
from other networks: tual channel of the incoming packet. The lookup table
at each input port is not sized to cover the maximum
● Scales up to K network endpoints
K node network since most systems will be much
● High injection bandwidth using HypterTransport
smaller, only a few thousand nodes. Instead, a hier-
(HT) links directly to the network interface
archical routing scheme divides the node name space
● Reliable link-level packet delivery
into global and local regions. The upper three bits of
● Multiple virtual channels for both deadlock avoid-
the destination field (given by the destination[:] in
ance and performance isolation
the packet header) of the incoming packet are com-
● Age-based arbitration to provide fair access to net-
pared to the global partition of the current SeaStar
work resources
router. If the global partition does not match, then the
Subsequent sections cover these topics in more detail. packet is routed to the output port specified in the global
There are two types of nodes in the Cray XT lookup table (GLUT). The GLUT is indexed by destina-
system. Endpoints (nodes) in the system are either tion[:] to choose one of eight global partitions. Once
[Figure: high-level block diagram of the SeaStar interconnect chip – the AMD64 processor attaches through a 16-bit HyperTransport 1.x cave to the DMA engine; the seven-ported SeaStar router provides 12-bit network links in the +x, −x, +y, −y, +z, and −z directions; the chip also contains 384k of scratch memory, an embedded PowerPC 440, and a RAS controller and maintenance interface.]
Cray XT and Seastar -D Torus Interconnect. Fig.  High-level block diagram of the SeaStar interconnect chip

Once the packet arrives at the correct global region, it will precisely route within a local partition of , nodes given by the destination[:] field in the packet header.

The tables must be constructed to avoid deadlocks. Glass and Ni [] describe turn cycles that can occur in k-ary n-cube networks. However, torus networks are also susceptible to deadlock that results from overlapping virtual channel dependencies (this only applies to k-ary n-cubes, where k > ), as described by Dally and Seitz []. Additionally, the SeaStar router does not allow ○ turns within the network. The routing algorithm must both provide deadlock freedom and achieve good performance on benign traffic. In a fault-free network, a straightforward dimension-ordered routing (DOR) algorithm will provide balanced traffic across the network links. In practice, however, faulty links will occur, and the routing algorithm must route around the bad link in a way that preserves deadlock freedom and attempts to balance the load across the physical links. Furthermore, it is important to optimize the buffer space within the SeaStar router by balancing the number of packets within each virtual channel.

Avoiding Deadlock in the Presence of Faults and Turn Constraints
The routing algorithm rests upon a set of rules to prevent deadlock. In the turn model, a positive-first (x+, y+, z+ then x−, y−, z−) rule prevents deadlock and allows some routing options to avoid faulty links or nodes. The global/local routing table adds an additional constraint for valid turns. Packets must be able to travel to their local area of the destination without the deadlock rule preventing free movement within the local area. In the Cray XT network, the localities are split with yz planes.
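To make the baseline dimension-ordered routing concrete before the fault-handling rules that follow, the sketch below walks a packet across a small 3-D torus in strict x-then-y-then-z order. It is only an illustration of DOR in a fault-free network under assumed conventions (shortest way around each ring, ties broken toward the positive direction); the SeaStar realizes routing through per-port lookup tables, and the function and variable names here are hypothetical.

# Illustrative sketch of plain dimension-ordered (XYZ) routing on a 3-D torus
# in a fault-free network. Names are hypothetical; the real SeaStar uses
# per-port routing tables, not this code.

def dor_next_hop(cur, dst, dims):
    """Return the next-hop direction ('x+', 'x-', ...), or None if cur == dst.

    cur and dst are (x, y, z) coordinates; dims holds the torus dimensions.
    The lowest dimension is corrected first, taking the shorter way around
    the ring (ties broken toward the positive direction).
    """
    for axis, name in enumerate("xyz"):
        k = dims[axis]
        delta = (dst[axis] - cur[axis]) % k       # hops needed in the + direction
        if delta == 0:
            continue                              # this ordinate already matches
        return f"{name}+" if delta <= k - delta else f"{name}-"
    return None                                   # arrived at the destination

# Example: route hop by hop from (0, 0, 0) to (3, 1, 2) on a 4 x 4 x 4 torus.
cur, dst, dims = [0, 0, 0], (3, 1, 2), (4, 4, 4)
path = []
while (d := dor_next_hop(tuple(cur), dst, dims)) is not None:
    axis = "xyz".index(d[0])
    cur[axis] = (cur[axis] + (1 if d[1] == "+" else -1)) % dims[axis]
    path.append(d)
print(path)   # ['x-', 'y+', 'z+', 'z+']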
Cray XT and Seastar -D Torus Interconnect C 

To allow both x+ and x− movement without restricting later directions, the deadlock avoidance rule is modified to (x+, x−, y+, z+ then y+, y−, z+ then z+, z−). Thus, free movement is preserved. Note that missing or broken X links may induce a non-minimal route when a packet is routed via the global table (since only y+ and z+ are "safe"). With this rule, packets using the global table will prefer to move in the X direction, to get to their correct global region as quickly as possible. In the absence of any broken links, routes between compute nodes can be generated by moving in the x dimension, then y, then z. Also, when y = Ymax, it is permissible to dodge y− then go x+/x−. If the dimension is configured as a mesh – there are no y+ links, for example, anywhere at y = Ymax – then a deadlock cycle is not possible.

In the presence of a faulty link, the deadlock avoidance strategy depends on the direction prescribed by dimension-order routing for a given destination. In addition, toroidal networks add dateline restrictions. Once a dateline is crossed in a given dimension, routing in a higher dimension (e.g., X is "higher" than Y) is not permitted.

Routing Rules for X Links
When x+ or x− is desired, but that link is broken, y+ is taken if available. This handles crossing from compute nodes to service nodes, where some X links are not present. If y+ is not available, z+ is taken. This z+ link must not cross a dateline. To avoid this, the dateline in Z is chosen so that there are no nodes with a broken X link and a broken y+ link. Even when the desired X link is available, the routing algorithm may choose to take an alternate path: when the node at the other side of the X link has a broken y+ and z+ link (note the y+ might not be present if configured as a mesh), then an early detour toward z+ is considered. If the X link crosses a partition boundary into the destination partition, or the current partition matches the destination partition and the current Y matches the destination Y coordinate, route in z+ instead. Otherwise, the packet might be boxed in at the next node, with no safe way out.

Routing Rules for Y Links
When the desired route follows a Y link that is broken, the preference is to travel in z+ to find a good Y link. If z+ is also broken, it is feasible to travel in the opposite direction in the Y dimension. However, the routing in the node in that direction must now look ahead to avoid a ○ turn if it were to direct a packet to the node with the faulty links. When the desired Y link is available, it is necessary to check that the node at that next hop does not have a z+ link that the packet might prefer (based on XYZ routing) to follow next. That is, if the default direction for this destination in the next node is z+ and the z+ link is broken there, the routing choice at this node would be changed from the default Y link to z+.

Routing Rules for Z Links
When the desired route follows a z+ link that is broken, the preference is to travel in y+ to find a good z+ link. In this scenario, the Y-link look-ahead is relied upon to prevent the node at y+ from sending the packet right back along y−. When the y+ link is not present (at the edge of the mesh), the second choice is y−. When the desired route is to travel in the z− direction, the logic must follow the z− path to ensure there are no broken links at all on the path to the final destination. If one is found, the route is forced to z+, effectively forcing the packet to go the long way around the Z torus.

Flow Control
Buffer resources are managed using credit-based flow control at the data-link level. The link control block (LCB) is shown at the periphery of the SeaStar router chip in Fig. . Packets flow across the network links using virtual cut-through flow control – that is, a packet does not start to flow until there is sufficient space in the receiving input buffer. Each virtual channel (VC) has dedicated buffer space. A -bit field (Fig. ) in each flit is used to designate the virtual channel, with a value of all s representing an idle flit. Idle flits are used to maintain byte and lane alignment across the plesiochronous channel. They can also carry VC credit information back to the sender.

SeaStar Router Microarchitecture
Network packets are comprised of one or more -bit flits (flow control units). The first flit of the packet (Fig. ) is the header flit and contains all the necessary routing fields (destination[:], age[:], vc[:]) as well as a tail (t) bit to mark the end of a packet.
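As a minimal illustration of the credit-based, virtual cut-through flow control described under "Flow Control" above, the sketch below tracks per-virtual-channel credits on one link and forwards a packet only when the downstream buffer can hold it in full. The class name, buffer depth, and packet size are assumptions chosen for the example, not SeaStar parameters.

# Minimal sketch of credit-based flow control with virtual cut-through:
# a packet is forwarded only when the downstream virtual-channel buffer
# has room for the whole packet, and credits return as flits drain.
# Buffer depths and names are illustrative, not SeaStar values.

class VirtualChannelLink:
    def __init__(self, num_vcs=4, buffer_depth=16):
        # One credit per free flit slot in the downstream input buffer of each VC.
        self.credits = [buffer_depth] * num_vcs

    def try_send(self, vc, packet_flits):
        """Virtual cut-through: send only if the whole packet fits downstream."""
        if self.credits[vc] >= packet_flits:
            self.credits[vc] -= packet_flits     # consume credits as flits leave
            return True
        return False                             # stall; wait for credits

    def credit_return(self, vc, flits_drained=1):
        """Called as the receiver forwards flits onward (e.g., credits carried
        back to the sender on idle flits)."""
        self.credits[vc] += flits_drained


link = VirtualChannelLink()
print(link.try_send(vc=0, packet_flits=9))   # True: header + 8 data flits fit
print(link.try_send(vc=0, packet_flits=9))   # False until credits come back
link.credit_return(vc=0, flits_drained=9)
print(link.try_send(vc=0, packet_flits=9))   # True again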
[Figure: (a) SeaStar system chip block diagram – the HyperTransport cave (16-bit receive and send links) and the PowerPC with MMRs and CAM attach, together with the six 12-bit SeaStar links (each terminated by an LCB), to per-port input queues and output queues with arbiters (Arb) around a central crossbar (XBAR); (b) SeaStar die photo.]

Cray XT and Seastar -D Torus Interconnect. Fig.  Block diagram of the SeaStar system chip
[Figure: SeaStar packet format – flits are  bits wide (bits 67–0). The header flit carries t (tail), vc, Destination[14:0], dt, k, V, Length, S, TransactionID[11:0], Source[14:7], R, Source[6:0], u, and Age[10:0]. Each following flit carries t, vc, and Data[63:0], with up to 8 data flits (64 bytes) of payload per packet.]
Cray XT and Seastar -D Torus Interconnect. Fig.  SeaStar packet format

Since most XT networks are on the order of several thousand nodes, the lookup table at each input port is not sized to cover the maximum k node network. To make the routing mechanism more space-efficient, the -bit node identifier is partitioned to allow a two-level hierarchical lookup: a small eight-entry table identifies a region, and a second table precisely identifies the node within the region. The region table is indexed by the upper  bits of the destination field of the packet, and the low-order  bits identify the node within a k-entry table. Each network port has a dedicated routing table and is capable of routing a packet each cycle. This provides the necessary lookup bandwidth to route a new packet every cycle. However, if each input port used a k-entry lookup table, it would be sparsely populated for modest-sized systems and would use an extravagant amount of silicon area.

A two-level hierarchical routing scheme is used to efficiently look up the egress port at each router. Each router is assigned a unique node identifier corresponding to its destination address. Upon arrival at the input port, the packet destination field is compared to the node identifier. If the upper three bits of the destination address match the upper three bits of the node identifier, then the packet is in the correct global partition. Otherwise, the upper three bits are used to index into the -entry global lookup table (GLUT) to determine the egress port. Conceptually, the k possible destinations are split into eight k partitions denoted by bits destination[11:0] of the destination field.

The SeaStar router has six full-duplex network ports and one processor port that interfaces with the Tx/Rx DMA engine (Fig. ). The network channels operate at . Gb/s × lanes over electrical wires, providing a peak of . GB/s per direction of network bandwidth. The link control block (LCB) implements a sliding-window go-back-N link-layer protocol that provides reliable chip-to-chip communication over the network links.
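The two-level lookup just described can be sketched as follows: the upper three bits of the 15-bit destination select one of eight global partitions, and either the eight-entry GLUT or a local table indexed by the low-order bits supplies the egress port. The table contents, port names, and helper function below are illustrative assumptions, not the SeaStar table format.

# Sketch of the two-level hierarchical route lookup described above: the upper
# 3 bits of the 15-bit destination select one of eight global partitions; if
# they match this router's partition, the low 12 bits index a local table,
# otherwise the GLUT supplies the egress port toward the correct partition.
# Table contents and port names are made up for the example.

def route_lookup(dest_nid, my_nid, glut, local_table):
    """Return (egress_port, vc) for a packet addressed to dest_nid."""
    dest_global, my_global = dest_nid >> 12, my_nid >> 12     # upper 3 bits
    if dest_global != my_global:
        return glut[dest_global]                              # 8-entry global table
    return local_table[dest_nid & 0xFFF]                      # 2**12-entry local table

# Toy tables for one router: foreign partitions go out x+, the local node is
# delivered to the processor port, and other local nodes are sent via y+.
my_nid = (2 << 12) | 0x005
glut = [("x+", 0)] * 8
local_table = {nid: ("y+", 0) for nid in range(4096)}
local_table[0x005] = ("proc", 0)

print(route_lookup((5 << 12) | 0x0AB, my_nid, glut, local_table))  # ('x+', 0)
print(route_lookup((2 << 12) | 0x005, my_nid, glut, local_table))  # ('proc', 0)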
Cray XT and Seastar -D Torus Interconnect C 

The router switch is both input-queued and output-queued. Each input port has four -entry buffers (one for each virtual channel), with each entry storing one flit. The input buffer is sized to cover the round-trip latency across the network link at . Gb/s signal rates. There are  staging buffers in front of each output port, one for each input source (five network ports and one processor port), each with four VCs. The staging buffers are only  entries deep and are sized to cover the crossbar arbitration round-trip latency. Virtual cut-through [] flow control into the output staging buffers requires them to be at least nine entries deep to cover the maximum packet size.

[Figure: offered load (0.0–1.0) versus latency for an ideal M/D/ queue model; latency grows sharply as the offered load approaches saturation.]
Cray XT and Seastar -D Torus Interconnect. Fig.  Offered load versus latency for an ideal M/D/ queue model
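The buffer-depth reasoning above (input buffers sized to cover the link round trip, staging buffers sized for the crossbar-arbitration round trip plus at least one maximum-size packet) can be expressed as a small back-of-the-envelope calculation. All numeric inputs below are hypothetical placeholders; only the nine-flit maximum packet (header plus eight data flits) follows from the packet format described earlier.

# Rough buffer-sizing sketch for credit-based links: a buffer must hold at
# least the flits that can be in flight during one credit round trip, and a
# virtual cut-through staging buffer must also hold a maximum-size packet.
# All numbers below are illustrative placeholders, not SeaStar figures.

import math

def min_buffer_entries(round_trip_ns, link_gbps, flit_bits, max_packet_flits=0):
    flits_per_ns = link_gbps / flit_bits          # Gb/s is bits per nanosecond
    in_flight = math.ceil(round_trip_ns * flits_per_ns)
    return max(in_flight, max_packet_flits)

# Hypothetical example: a 20 ns round trip on a link that delivers one 68-bit
# flit roughly every 4 ns, and a 9-flit maximum packet (header + 8 data flits).
print(min_buffer_entries(round_trip_ns=20, link_gbps=17, flit_bits=68))   # 5
print(min_buffer_entries(round_trip_ns=8,  link_gbps=17, flit_bits=68,
                         max_packet_flits=9))                             # 9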

Age-Based Output Arbitration
Packet latency is divided into two components: queueing and router latency. The total delay (T) of a packet through the network with H hops is the sum of the queueing and router delay:

T = HQ(λ) + Htr ()

where tr is the per-hop router delay (which is ≈  ns for the SeaStar router). The queueing delay, Q(λ), is a function of the offered load (λ) and is described by the latency–bandwidth characteristics of the network. An approximation of Q(λ) is given by an M/D/ queue model (Fig. ):

Q(λ) = 1/(1 − λ) ()

When there is very low offered load on the network, the Q(λ) delay is negligible. However, as traffic intensity increases and the network approaches saturation, the queueing delay will dominate the total packet latency.

As traffic flows through the network it merges with newly injected packets and traffic from other directions in the network (Fig. ). This merging of traffic from different sources causes packets that have further to travel (more hops) to receive geometrically less bandwidth. For example, consider the -ary -mesh in Fig. a where processors P through P are sending to P. The switch allocates the output port by granting packets fairly among the input ports. With a round-robin packet arbitration policy, the processor closest to the destination (P is only one hop away) will get the most bandwidth – / of the available bandwidth. The processor two hops away, P, will get half of the bandwidth into router node , for a total of / × / = / of the available bandwidth. That is, every two arbitration cycles node  will deliver a packet from source P, and every four arbitration cycles it will deliver a packet from source P. A packet will merge with traffic from at most n other ports, since each router has n network ports with n −  from other directions and one from the processor port. In the worst case, a packet traveling H hops and merging with traffic from n other input ports will have a latency of:

Tworst = L/(n)H ()

where L is the length of the message (number of packets), and n is the number of dimensions. In this example, P and P each receive / of the available bandwidth into node , a factor of  times less than that of P. Reducing the variation in bandwidth is critical for application performance, particularly as applications are scaled to increasingly higher processor counts. Topologies with a lower diameter will reduce the impact of merging traffic. A torus is less affected than a mesh of the same radix (Fig. a and b), for example, since it has a lower diameter. With dimension-order routing (DOR), once a packet starts flowing on a given dimension it stays on that dimension until it reaches the ordinate of its destination.

Key Parameters Associated with Age-Based Arbitration
The Cray XT network provides age-based arbitration to mitigate the effects of this traffic merging, as shown in Fig. , thus reducing the variation in packet delivery time.
router node , for a total of / × / = / of the available time. However, age-based arbitration can introduce a
 C Cray XT and Seastar -D Torus Interconnect

[Figure: (a) 8-ary 1-D mesh – processors P0–P7 attach to nodes 0–7; with all nodes sending to one endpoint, the per-source bandwidth fractions fall off geometrically as 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128. (b) 8-ary 1-D torus – the wrap-around links each carry 1/2, and the per-source fractions are 1/4, 1/8, 1/16, 1/32, 1/16, 1/8, 1/4.]
Cray XT and Seastar -D Torus Interconnect. Fig.  All nodes are sending to P and merging traffic at each hop

However, age-based arbitration can introduce a starvation scenario whereby younger packets are starved at the output port and cannot make forward progress toward the destination. The details of the algorithm along with performance results are given by Abts and Weisser []. There are three key parameters for controlling the aging algorithm:
● AGE_CLOCK_PERIOD – a chip-wide -bit countdown timer that controls the rate at which packets age. If the age rate is too slow, it will appear as though packets are not accruing any queuing delay; their ages will not change, and all packets will appear to have the same age. On the other hand, if the age rate is too fast, packet ages will saturate very quickly – perhaps after only a few hops – at the maximum age of , and packets will not generally be distinguishable by age. The resolution of AGE_CLOCK_PERIOD allows anywhere from  ns to more than  s of queuing delay to be accrued before the age value is incremented.
● REQ_AGE_BIAS and RSP_AGE_BIAS – each hop that a packet takes increments the packet age by the REQ_AGE_BIAS if the packet arrived on VC/VC, or by RSP_AGE_BIAS if the packet arrived on VC/VC. The age bias fields are configurable on a per-port basis, with a default bias of .
● AGE_RR_SELECT – a -bit array specifying the output arbitration policy. A value of all s will select round-robin arbitration, and a value of all s will select age-based arbitration. A combination of s and s will control the ratio of round-robin to age-based. For example, a value of  ⋯  will use half round-robin and half age-based.

When a packet arrives at the head of the input queue, it undergoes routing by indexing into the LUT with destination[:] to choose the target port and virtual channel. Since each input port and VC has a dedicated buffer at the output staging buffer, there is no arbitration necessary to allocate the staging buffer – only flow control. At the output port, arbitration is performed on a per-packet basis (not per flit, as wormhole flow control would). Each output port is allocated by performing a -to- VC arbitration along with a -to- arbitration to select among the input ports. Each output port maintains two independent arbitration pointers – one for round-robin and one for age-based. A -bit counter is incremented on each grant cycle and indexes into the AGE_RR_SELECT bit array to choose the per-packet arbitration policy.

Related Entries
Anton, A Special-Purpose Molecular Simulation Machine
Clusters
Cray XT and Cray XT Series of Supercomputers
Distributed-Memory Multiprocessor
Infiniband . Dally WJ () Performance analysis of k-ary n-cube intercon-


Petascale Computer nection networks. IEEE Trans Comput ():–
. Dally WJ, Seitz CL () Deadlock-free message routing in
SoC (System on Chip)
multiprocessor interconnection networks. IEEE Trans Comput
Top
():–
. Dally WJ, Towles B () Principles and practices of intercon- C
nection networks. Morgan Kaufmann, San Francisco
Bibliographic Notes and Further . Glass CJ, Ni LM () The turn model for adaptive routing. In:
Reading ISCA ’: Proceedings of the th annual international sympo-
sium on computer architecture, pp –, 
The genesis of the Cray XT system was the collabora-
. Hoisie A, Johnson G, Kerbyson DJ, Lang M, Pakin S () A
tive design and deployment of the Sandia “Red Storm” performance comparison through benchmarking and modeling
computer that provided the computational power nec- of three leading supercomputers: blue gene/l, red storm, and pur-
essary to assure safeguards under the nuclear Stockpile ple. In: SC ’: Proceedings of the  ACM/IEEE conference
Stewardship Program which seeks to maintain and ver- on Supercomputing, ACM, p , New York, 
. Kermani P, Kleinrock L () Virtual cut-through: a new
ify a nuclear weapons arsenal without the use of testing.
computer communication switching technique. Comput Netw
Brightwell et al. [] provide an early look at the SeaStar :–
interconnection network used by the Sandia Red Storm
supercomputer.
Hoisie et al. [] use common high-performance
computing (HPC) benchmarks and modeling to com- Cray XT
pare performance of three leading supercomputers: the
Cray XT (Red Storm), IBM BlueGene/L, and ASCI Cray XT and Cray XT Series of Supercomputers
Purple supercomputers. Cray XT and Seastar -D Torus Interconnect
Dally presents a performance analysis of k-ary n-
cube networks []. While a more comprehensive anal-
ysis, with several examples from industry, is found in
Dally and Towles []. Cray XT
Cray XT and Cray XT Series of Supercomputers
Bibliography Cray XT and Seastar -D Torus Interconnect
. Abts D, Weisser D () Age-based packet arbitration in
large-radix k-ary n-cubes. In: SC ’: Proceedings of the 
ACM/IEEE conference on supercomputing, ACM, pp –,
New York, 
. Alam SR, Kuehn JA, Barrett RF, Larkin JM, Fahey MR,
Critical Race
Sankaran R, Worley PH () Cray xt: an early evaluation
Race Conditions
for petascale scientific simulation. In: SC ’: Proceedings of the
 ACM/IEEE conference on Supercomputing, ACM, pp –,
New York, 
. Brightwell R, Lawry B, MacCabe AB, Riesen R () Portals .:
protocol building blocks for low overhead communication. In: Critical Sections
IPDPS ’: Proceedings of the th international parallel and dis-
tributed processing symposium, IEEE Computer Society, p , Synchronization
Washington, 
. Brightwell R, Pedretti K, Underwood KD () Initial perfor-
mance evaluation of the cray seastar interconnect. In: HOTI ’:
Proceedings of the th symposium on high performance inter-
connects, IEEE Computer Society, pp –, Washington,  Crossbar
. Brightwell R, Pedretti KT, Underwood KD, Hudson T ()
Seastar interconnect: balanced bandwidth for scalable perfor- Buses and Crossbars
mance. IEEE Micro ():– Networks, Multistage
CS-
Meiko

CSP (Communicating Sequential Processes)
A. W. Roscoe, Jim Davies
Oxford University, Oxford, UK

Synonyms
Communicating sequential processes (CSP)

Definition
Communicating Sequential Processes (CSP) is a mathematical notation for describing patterns of interaction. It has been used in the analysis of concurrent behavior in a variety of applications; it has inspired the design of concurrency mechanisms and primitives in several programming languages; it remains a focus for research and development in both academia and industry.

Process Language
In the CSP notation, processes are used to specify the behavior of components, to express assumptions about behavior, or to characterize behavioral properties. In each case, behavior is described in terms of the occurrence and availability of abstract entities called events: these are transactions, or atomic, multi-way synchronizations, between combinations of processes.

Processes may be combined using a number of operators, denoting choice, sequencing, and parallelism. The resulting language has a range of algebraic equivalences, and is often referred to as a process algebra. These equivalences help to explain the meaning of the operators, and support the automatic transformation of process descriptions into forms more convenient for analysis or implementation.

The alphabet of a process is the set of all events that require its participation. If an event appears in the alphabet of two or more processes in a parallel combination, then we say that this event is shared between them. A shared event cannot occur without the simultaneous participation of all of the processes involved, and so each process added to a parallel combination represents a further constraint upon the order in which shared events may occur.

The simplest operator is Stop, which describes a process that can do nothing, and hence prevents the occurrence of any event in its alphabet. The prefix operator → is used to introduce an event into a process description: if a is any event, and P is any process, then a → P is a process that is ready to engage in event a and, should a occur, will then behave as P. This choice of language is deliberate: a may be shared with other processes, and so a → P cannot simply "do" a; indeed, if it is sharing a with Stop, then a will never occur.

A generalized prefix notation encompasses a process that offers the choice of a set of events: ?x : A → P(x) is prepared to communicate any event a ∈ A and then behaves like the corresponding process P(a). You can think of A as being a menu of alternatives that the process offers.

This choice is further generalized to one between processes by the external choice operator P ◻ Q. This offers the first events of P and Q and behaves accordingly, so that

a → P(a) ◻ b → P(b) = ?x : {a, b} → P(x)

In the case where P and Q have overlapping sets of initial events, P ◻ Q acts nondeterministically: in (a → P) ◻ (a → Q), once a has been communicated then the process can behave like P or Q, at its internal choice.

Because internal choices like this arise in CSP, both for this reason and others, there is an operator P ⊓ Q that expresses internal, or nondeterministic, choice. This process may choose to behave like either P or Q, and neither the user nor other processes have any influence over which. Thus

(a → P) ◻ (a → Q) = a → (P ⊓ Q)

The sequential composition P ; Q behaves as P until that process reaches a point in its behavior described by Skip, denoting successful termination; the subsequent behavior of the combination is then that of Q:

(a → Skip) ; Q = a → Q
for any event a and any process Q. Skip is distinct from Stop, which does not even terminate successfully.

In the parallel composition P ∥ Q, the two processes P and Q will evolve independently, while sharing in the occurrence of any event that is common to the two alphabets. If a and b are not shared between the two processes, then

a → P ∥ b → Q = a → (P ∥ b → Q) ◻ b → (a → P ∥ Q)

that is, a and b may be observed in either order.

If P and Q can neither communicate independently nor agree on a shared event, then P ∥ Q is deadlocked, and equivalent to Stop. For example, if a and b are shared between the two processes, then

a → b → P ∥ b → a → Q = Stop

the left-hand component of the parallel composition is able to perform b, but only after a has occurred; the right-hand process is able to perform a, but only after b has occurred; each event requires the cooperation of both processes, so no progress is possible.

The interleaving operator P ∣∣∣ Q, on the other hand, allows each of P and Q to communicate freely: no communications are synchronized even when they belong to the alphabets of P and Q; it follows that P ∣∣∣ Q = P ∥ Q when P and Q have disjoint alphabets.

The hiding operator P/X removes all actions in the set X from both the sight and control of the external environment. Thus

(a → P)/{a} = P/{a}

It is frequently used to conceal the internal communications of a parallel composition, as in (P ∥ Q)/(αP ∩ αQ), where αP and αQ are respectively the alphabets of P and Q.

If R is a relation between the set Σ of all visible events and itself, then P[[R]] is a process that can perform b whenever P can perform a, for any pair (a, b) related by R. Sometimes, the special cases where R is either a function f or an inverse function f − are considered separately.

The language constructs above are usually considered the "core" of the notation. However, Hoare [] also introduces interrupt: P △ Q behaves like P until an initial event of Q occurs, after which Q takes over. This is typically used to handle externally occurring exceptions, as in

resettable(P) = P △ ( e → resettable(P) )

where e is an event representing an external exception. Roscoe [] introduces an operator for handling internal exceptions: P Θerror Q behaves like P until the event error occurs, at which point Q takes over. Thus

(a → P) Θa Q = a → Q

Process Semantics
The language of processes described above can be given a formal meaning, or semantics. Any process can be associated with a unique set of sequences, or traces, comprising every sequence of events, within its alphabet, that this process will allow. For example, the process Stop allows only the empty sequence ⟨⟩, while the prefix process a → P will allow any sequence in which the first event, if there is one, is a, and for which the remaining events could be performed by P.

The trace semantics is enough for many applications, but does not provide for an adequate treatment of nondeterminism. For example, the processes a → Stop and (a → Stop) ⊓ Stop have the same set of traces, the set {⟨⟩, ⟨a⟩}. Yet the first of these must be ready to perform a, whereas all that can be said for the second is that it may be ready to perform a (or it may deadlock).

For nondeterministic processes, another level of semantic information may be required. Each trace of a process may be associated with a set of sets of events, or refusal sets, comprising every combination that may be blocked by the process once that trace has been performed. The combination of a trace and a single refusal set is called a failure: if the process, having performed that trace, were required to participate in one of the events from the refusal set, then no further progress would be made.

To provide a full characterization of the process language, the failures semantics must be augmented by an appropriate treatment of undefined expressions. Where the behavior of a process P is determined solely by an equation such as P = P, or even P = P ⊓ (a → Stop), then there is not enough information to uniquely determine a set of failures; such a process is said to be divergent. Processes like AS/{a}, where AS = a → AS, are also divergent.
a set of divergences: traces leading to a point where the subsequent behavior may be described by a divergent process. Besides traces and failures–divergences, there are a variety of other models. See [] for details.

In analyzing the semantics of CSP, it is useful to have an additional form of choice: P ⊳ Q may choose to offer the initial actions of P, but must offer the initial choices of Q. It is equivalent to (P ◻ a → Q)/{a} for a an event that neither P nor Q performs.

Any finite CSP program (one whose traces are of bounded length) is equivalent to one written using only Stop, Skip, prefix-choice, internal choice ⊓, sliding choice ⊳, and a constant div representing a divergent process: one equivalent to AS/{a} that simply performs internal (τ) actions for ever. This fact tells us that, in some sense, the other operators are not language primitives and that every program is equivalent to a sequential one. The fact that this transformation can be accomplished using algebraic laws is the core of an algebraic semantics for CSP (see Chap.  of []).

Refinement and Abstraction
In applications of CSP, processes are used to describe components, assumptions, or properties, and then compared in terms of their semantics. The usual method of comparison is based on the notion of refinement, which corresponds to subsetting or containment of the semantics: a process P is a refinement of another process Q, written Q ⊑ P, if every behavior of P – every trace, or every failure and divergence – is also a behavior of Q.

In the trace semantics, refinement can be used to check whether a particular sequence of interaction is possible, or to check whether every possible sequence of interaction is acceptable according to some specification. However, for nondeterministic processes, it cannot be used to demonstrate that a sequence must be possible: for this, refinement in the failures–divergences semantics is required.

Most comparisons will be made between processes described at different levels of abstraction: one of the processes will describe a design or implementation, the other a property or specification, using a smaller set of events. To effect such a comparison using refinement, the events not mentioned in the property description must be abstracted.

CSP provides a number of mechanisms for describing abstraction, including renaming, hiding, and lazy abstraction: LA(P) describes how the process looks to a user who cannot see the events in A but assumes there is another user who can see and control them. This is identical to hiding in the traces model, but subtly different in other models since P/A assumes that the events of A become automatic and uncontrolled.

A key feature of the process language is that all of the operators are monotonic with respect to the refinement ordering: if Q ⊑ P, then C[P] refines C[Q] where C[⋅] is any CSP context, namely, a piece of CSP syntax with a free process variable.

Tools and Applications
The multiplication of states and entities that is inherent in concurrent behavior means that (fully) manual analysis of process semantics is infeasible. Effective tool support is required: to examine the properties of a process through step-by-step rewriting, or to check whether or not one process is a refinement of another, through exhaustive exploration.

A machine readable syntax for the language has been defined, called CSPM: process descriptions written in this syntax can be checked for type consistency using the Checker tool, and rewritten step-by-step using a tool called ProBE. A refinement checking tool called Failures–Divergences Refinement (FDR) is also available: the refinement of processes with millions of control states can be checked in less than a minute; designs with billions of states have been checked successfully. Processes with larger, finite state spaces can be checked using compression operators and related techniques.

A number of people, for example [], have embedded theories of CSP into theorem provers such as Isabelle and PVS: such embeddings make it easier to prove general results such as language equivalences and healthiness conditions, but are not as effective as FDR at proving results about specific systems.

A key application area for CSP has been in the analysis of cryptographic protocols. Gavin Lowe discovered that these could be conveniently modeled in CSP and checked for security on FDR, and he and others developed tools – most notably a front end for FDR called Casper [] – that can input a protocol in a natural notation and either find an attack on it or prove there is no such attack. The fact that Lowe, early on in this work, found a famous attack on a well-known protocol [] led
to an explosion of work around the world on the formal verification of security protocols.

Apart from its use in security, CSP has been used successfully for the development of critical systems for avionics, automotive, military, and embedded systems. Some of these are based on modeling directly in CSPM, and some, like Casper, rely on front ends that translate another notation to CSPM.

Language Versions
There are several different versions of the process language, the best known being the mathematical notation used in Hoare's  book Communicating Sequential Processes [], used here. An extended version of that and the "machine readable" variant CSPM are used in Roscoe's  book The Theory and Practice of Concurrency []. CSPM is a significant extension of the  notation, with more powerful notions of sequencing and abstraction, as well as directives for expediting the checking of traces and failures–divergences requirements. CSPM also contains a functional programming language related to Haskell [] that is used both for building processes' events and parameters, and for laying out networks. These same two versions of CSP are used in Roscoe's  book Understanding Concurrent Systems.

The main difference between the Hoare and Roscoe presentations is that whereas Hoare assumes that every process has an intrinsic alphabet for use in the parallel operator, Roscoe uses versions of parallel that specify alphabets explicitly: for example P[∣A∣]Q (in the CSPM syntax) runs P and Q with interface A.

Hoare introduced an early version of the notation in , in the Communications of the ACM []. This was a language of programs rather than constraints or patterns of behavior; however, it embodied two of the characteristic features of the later notation: that the behavior of a process should be characterized entirely by its external interaction, and that parallel composition should be a primary means of adding structure to process descriptions.

The CSP notation in both its  and process algebra versions was the inspiration for the programming language occam [], implemented on the inmos Transputer: both the language and the processor are discussed elsewhere in this volume. It has also inspired the design of concurrency mechanisms for the Java language [] and the Go language released in  by Google [].

An account of the history of CSP up to the mid-s and its influences on occam and the Transputer can be found in [], a scientific biography of Hoare.

Timed, prioritized, and probabilistic extensions have been suggested for the CSP notation, and attempts have been made to integrate it with object modeling and state-based notations such as B []. Of these, there has been considerable industrial use of CSP for timed and state-based systems, but always by translation from the extended notation into CSPM and then checking on FDR. For example, the passage of time can be represented by the regular occurrence of a tock event.

Discussion
Many other process algebras have been proposed for the study of concurrency, most notably Milner's CCS and its many derivatives, including the π-calculus. Among these, CSP is characterized by a rigorous adherence to a small, coherent set of design principles:
– Each event is an atomic, symmetric, multi-way synchronization, corresponding to an abstract transaction involving a fixed collection of processes.
– Each process gives only a partial account of when an event may occur: until closure is achieved through abstraction, other processes may further constrain its occurrence.
– Processes are characterized completely by their externally observable behavior: there are no "internal" or "hidden" events in the language.
– There is no explicit treatment of priority, timing, or degree of parallelism; these require more specific interpretations of interaction and communication.
– The semantics is captured by models which describe a process as one or more sets of observations of linear, as opposed to branching, behaviors.

As a result, the language has certain properties not found in other algebras: most notably, processes describing components, assumptions, or properties at different levels of abstraction may be compared using notions of refinement.

In other languages, comparisons between processes are made only at the same level of abstraction, using various notions of simulation; analyses across different levels of abstraction are conducted using separate
formalisms, such as temporal logic. Such logics are often excellent at expressing succinct properties of systems, but using CSP descriptions as refinement specifications frequently has an advantage when capturing a more complete description of a system []. It is possible to express some subsets of Linear Temporal Logic in CSP, and aspects of fairness, but any property that can be checked over standard CSP semantics must be determined by a process's linear as opposed to branching behavior.

On a theoretical note, it can be shown that every closed property R of CSP models, and many others, can be characterized by refinement checks of the form F(P) ⊑ G(P) for CSP contexts F and G []. Here, R is closed if, whenever R(P) does not hold, there is some n such that there is no Q which both satisfies R and is equivalent to P for the first n steps of its behavior.

Recent results (see Chap.  of []) have demonstrated that the CSP notation is in some sense a universal language for concurrency: every language whose operational semantics satisfies the conditions to be CSP-like can be simulated (at the level of strong bisimulation) by CSP. Any such language is automatically given compositional semantics over CSP models and a theory of refinement. One such language is π-calculus: see [].

Related Entries
Bisimulation
Deadlocks
Determinacy
Formal Methods-Based Tools for Race, Deadlock, and Other Errors
Pi-calculus
Process Algebras

Bibliographic Notes and Further Reading
A comprehensive account of CSP is presented in Roscoe's book Understanding Concurrent Systems, published by Springer in  [], a significant update of his  book The Theory and Practice of Concurrency []: this includes a guide to the latest version of the "machine readable" language, and a detailed account of tools, applications, and extensions. It explains the links with other ideas from concurrency such as shared variables, data-flow and buffered channels, mobility, time, and priority by showing how to model these in CSP. The vision for the process language, and the failures–divergences semantics, is set out in Hoare's  book, Communicating Sequential Processes; key stages in its development can be seen in Hoare's  paper of the same name [] and the  paper by Brookes et al. []. Schneider [] is a book on CSP that emphasizes the theory of Timed CSP (see also [, ]). The application of CSP to security protocols is covered in [].

Bibliography
1. Abrial JR () The B-book: assigning programs to meanings. Cambridge University Press, Cambridge
2. Bird R () An introduction to functional programming using Haskell. Prentice-Hall, Hertfordshire, UK
3. Brookes SD, Hoare CAR, Roscoe AW () A theory of communicating sequential processes. J ACM ():–
4. Davies J () Specification and proof in real-time CSP. Cambridge University Press, Cambridge
5. Google () The Go programming language. http://golang.org/. Accessed July 
6. Hoare CAR () Communicating sequential processes. Commun ACM ():–
7. Hoare CAR () Communicating sequential processes. Prentice Hall (Available from http://www.usingcsp.com)
8. Inmos Ltd. () Occam reference manual. Prentice Hall
9. Isobe Y, Roggenbach M () A generic theorem prover of CSP refinement. TACAS , Springer
10. Jones CB, Roscoe AW () Insight, innovation and collaboration. Reflections on the work of C.A.R. Hoare. Springer, London
11. Lea D () Concurrent programming in Java: design principles and patterns. Addison Wesley, Reading
12. Leuschel M, Currie A, Massart T () How to make FDR Spin: LTL model checking of CSP by refinement. FME . Springer
13. Lowe G () Breaking and fixing the Needham-Schroeder public-key protocol using FDR. In: Proceedings of TACAS '. Lecture notes in computer science, vol . Springer, Berlin
14. Lowe G () Casper: a compiler for the analysis of security protocols. Proceedings of CSFW 
15. Reed GM, Roscoe AW () A timed model for communicating sequential processes. Theor Comput Sci :–
16. Roscoe AW () The theory and practice of concurrency. Prentice Hall, Hertfordshire, UK
17. Roscoe AW () On the expressive power of CSP refinement. Form Asp Comput ():–
18. Roscoe AW () CSP is expressive enough for π. In: Reflections on the work of C.A.R. Hoare. Springer, London
19. Roscoe AW () Understanding concurrent systems. Springer, London
20. Ryan PYA, Schneider SA, Goldsmith MH, Lowe G, Roscoe AW () The modelling and analysis of security protocols: the CSP approach. Addison-Wesley, Reading
21. Schneider SA () Concurrent and real-time systems: the CSP approach. Wiley, New York
Cyclops

Monty Denneau
IBM Corp., T.J. Watson Research Center, Yorktown Heights, NY, USA

Synonyms
Cyclops-

Definition
Cyclops is a targeted-purpose, highly parallel, massively multicore family of computers developed at the IBM T.J. Watson Research Center. The design is characterized by several novel architectural features and by an aggressive exploitation of ASIC chip technology. Machines under construction range in performance from tens of teraflops to a full petaflop.

Discussion

Introduction
Cyclops falls into the category of targeted high-performance computing systems. These machines are built with an intentionally unconventional balance of resources in order to optimize cost-performance for specific application domains. Cyclops was designed to target proprietary optimization codes for a select group of IBM clients.

Five observations of the execution of these codes on commercial machines motivated several of the key architectural decisions:

1. The majority of instructions were not used most of the time. This was true even for so-called RISC processors. Instructions were found that were never emitted by any compiler.
2. Much of the architectural, design, and implementation work that went into optimizing single-thread performance was wasted. Many processor resources were idle much of the time, wasting clock and leakage power.
3. Data caches consumed a large part of the silicon resources, but often increased the energy per operation and, in some cases, even decreased performance.
4. The traditional splitting of the general-purpose register file into fixed and floating-point components almost always resulted in the wrong balance. For applications without floating point data, all of the floating-point registers were wasted. For applications with very high floating point content, many of the fixed-point registers went unused.
5. When processors were running the same small program, the dedication of a large instruction cache to just one or two processors wasted silicon resources.

Thread Unit
The fundamental Cyclops computing element is the -bit general-purpose thread unit, the instruction set architecture for which was developed without any particular tie to previous machines.

The thread unit contains a -port ( read,  write)  ×  register file and a -KB SRAM (Static Random Access Memory). It executes one scalar instruction per cycle. Instructions and data are, respectively,  and  bits wide. There are only  basic instruction types. Instructions are executed from a -element prefetch buffer (PIB). While execution is not speculative, instructions are prefetched using a -entry -bit branch history table (BHT).

The name thread unit is historical – a thread unit is actually a complete processor that can execute a fixed point instruction every cycle.

Processor
Two thread units and a shared floating-point unit constitute a processor. The floating-point unit can start one double precision add and one double precision multiply every cycle. Underflow is truncated to zero and overflow is forced to infinity – Cyclops has no interrupts on arithmetic exceptions. The multiply unit can also start a  ×  -> integer multiply-accumulate operation every cycle.

A processor also has -bit outgoing and incoming ports to the chip-wide crossbar described below.

Instruction Cache Group
Five processors (ten thread units) and a shared Instruction Cache constitute an Instruction Cache Group.
The -KB shared Instruction Cache is -way associative and has -byte lines. Replacement is Least-Recently-Used.

On each cycle the Instruction Cache can provide one -instruction line. These are aligned to -byte (-instruction) boundaries. Requests to the same line are combined and satisfied simultaneously.

Crossbar
On-chip Cyclops communication is provided by a large pipelined crossbar. This avoids the locality constraints that accompany, for example, nearest-neighbor interconnects. A full crossbar ensures that programmers do not have to consider the particular positions of processors on the chip.

The crossbar has  ports, each approximately  bits wide. It is pipelined so that, in the absence of blocking, transactions can be initiated on every cycle. Transactions can encompass either single items or blocks of items. Forward and return transactions share the same wires, but otherwise have separate facilities in order to prevent deadlock.

Of the  crossbar ports,  connect to the processors,  to the Instruction Caches,  to the DRAM controllers,  to the D router,  to the combining router, and  to the parallel host interface.

Loads and Stores through the Cyclops crossbar have an important ordering property, which is best illustrated by example: If a processor stores data into global memory and sets a flag to indicate that the data has been written, then another processor observing that the flag has been set will subsequently read the correct data. Unlike most other multiprocessors, Cyclops does not require or have a synch instruction.
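This is the familiar flag-publication idiom. The sketch below is illustrative Python only (it is not Cyclops code, and all names are invented); it shows the pattern that the ordering property supports. On a machine without such a guarantee, an explicit synchronization instruction would be needed between the two assignments in the producer; the Python runtime happens to provide the needed ordering here, and the point is the pattern rather than the mechanism.

# Illustrative producer/consumer flag pattern (hypothetical names).
import threading, time

data = 0
flag = False

def producer():
    global data, flag
    data = 42          # store the payload into shared memory first
    flag = True        # then publish it by setting the flag

def consumer():
    while not flag:    # wait until the flag is observed as set
        time.sleep(0.001)
    return data        # ordering guarantees the payload is already visible

t = threading.Thread(target=producer)
t.start()
print(consumer())      # prints 42
t.join()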
The unloaded crossbar latency is  cycles.

Storage
Each of the  Cyclops processors contains  KB of SRAM. The SRAM is partitioned at runtime into two parts. The first is local scratchpad, which is most often directly accessed by the owning processor, but which can also be explicitly accessed by any other processor on the chip.

The second is a contribution to global interleaved SRAM. This memory segment is interleaved across all participating processors, so that the first  bytes are in the SRAM of the first processor, the next  bytes in the SRAM of the second processor, and so on. In order to enforce ordering, access to global interleaved memory is always done over the crossbar, even if a particular item is local to the accessing processor.

Attached to each Cyclops chip is one gigabyte of external DDR DRAM, implemented as four independent quadrants, each with its own port to the chip. Data is interleaved in -byte blocks, first across the quadrants, and then within the internal banks of the DRAM modules. The peak bandwidth into external DRAM is  GB/s.

Only the processors on a chip can directly address the memory associated with that chip. Remote access is handled by message routing hardware described below and SHMEM utilities.

Atomic Memory Operations
All levels of Cyclops storage support atomic memory operations. These include both fetch-and-add and fetch-and-logic with any -input logical operation. Alternative versions perform the update without returning data.
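The essential semantics of these operations, an indivisible read-modify-write that returns the old value, can be sketched in a few lines of Python. This is an illustrative software model only, not Cyclops hardware behavior; the lock stands in for the atomicity the memory system provides, and the names are invented.

# Illustrative semantics of fetch-and-add and fetch-and-logic.
import threading

class AtomicCell:
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def fetch_and_add(self, delta):
        with self._lock:                 # read-modify-write is indivisible
            old = self._value
            self._value = old + delta
            return old                   # the pre-update value is returned

    def fetch_and_logic(self, op, operand):
        with self._lock:
            old = self._value
            self._value = op(old, operand)   # any two-input logical operation
            return old

counter = AtomicCell()
print(counter.fetch_and_add(1))   # 0
print(counter.fetch_and_add(1))   # 1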
Signal Bus
The signal bus is a -bit wide -way pipelined OR tree used to support fast barrier operations. To preserve ordering, processors write to the signal bus over the crossbar, but read from it locally. Each processor observes the logical OR of the contributions of all  processors. Three bits are typically used to implement each barrier.

A-Switch
Each Cyclops chip contains a simple -port D router referred to as the A-Switch. Chips are physically connected in a D mesh. The petaflop version of Cyclops is a  ×  ×  cube.

Each of the six external channels simultaneously sends and receives data at a rate of  GB/s, for an aggregate external data bandwidth of  GB/s. A small amount of additional bandwidth is present for control.

Each channel consists physically of  differential pairs (lanes), each running at  Gbits/s. Nine of the lanes are for payload, two for block error correction, and one is reserved as a spare. The block error correction is strong enough to allow total failure of any single lane.
There are four classes of traffic. Two are for normal forward and reply messages, and two are for forward and reply messages that must be rerouted to avoid a bad chip.

Messages typically travel first along the x-axis, then along the y-axis, and finally along the z-axis of the mesh. The number of hops along each axis is contained in the header.

If chip A cannot reach chip B because of a failing chip along the route, then chip C is found such that A can reach C and C can reach B. As the message passes C, its class is changed to the reroute class. This mechanism prevents deadlock.

In order for processors to collectively participate in the creation of messages, traffic injected into the A-Switch is mediated by six outgoing channel controllers, one for each direction. Processors assemble components of messages in memory, and then atomically queue a pointer to the appropriate channel controller.

Similarly, there are six incoming channel controllers responsible for dejecting data from the A-Switch. Data is written to FIFOs allocated in memory. Processors remove message pointers from an incoming work FIFO, process the corresponding data, and then notify a channel controller when done so that it can free the space.

B-Switch
In a machine with over , processor chips, collective operations such as floating-point reductions, global max/min, and barriers can, if implemented in software, negatively affect performance. The Cyclops processor chip incorporates a component referred to as the B-Switch, which accelerates these operations.

Traffic to and from the B-Switch uses the same physical links as the A-Switch, with B-Switch traffic always taking priority. Processors directly inject packets into and deject packets from the B-Switch. B-Switch transactions are strictly between neighbors in the D mesh.

Central to the unit is a set of  trigger tags. Each tag is programmed with criteria for activation based on B-Switch packets received from the six neighboring chips. The tag can either require that all members of some particular subset of the neighbors have sent B-Switch packets, or that any member of a particular subset has sent a packet.

In addition to the activation criteria, the tag also contains the location of an applet in a small instruction memory. When a tag is activated, it posts a notification of that event on an activation queue.

The B-Switch contains a simple embedded processor that is a subset of the primary Cyclops processor. Only instructions important for collective operations are implemented. For example, floating-point addition and comparison are supported, but multiplication is not.

The B-Switch processor executes, one by one, the applets on the activation queue. During execution, the processor typically sends new packets to adjacent chips and also notifies the sources that an operation is complete and resources are now free.

A typical B-Switch operation is to determine a maximum floating point value across all chips in a  ×  ×  machine, and then to broadcast that value to all participants. This operation is done in two phases, each with an associated tag. First, the comparisons are done in a B-Switch tree that terminates near the center of the cube. Then, the chip at the root of the tree changes the tag and, in the reverse direction, broadcasts the maximum value to all participants.

An overview of the Cyclops chip is shown below (Fig. ).

Single-Step
Cyclops has a unique single-step property. It can be started from cycle , run M cycles, stopped and examined, then run an additional N cycles, and it will behave exactly as if it were run M + N cycles without interruption.

Several mechanisms were required in order to enable single-step. There is a unique central oscillator, and the rising edges of the clock can be thought of as being numbered. All chips update their state in the same way on edge N, even though they may receive edge N at different times.

The major challenges to single-step were the DRAM and HSS (High Speed SerDes) subsystems. On any given cycle these can have transactions or data in flight. When the machine pauses, Cyclops sinks in-flight results to completion buffers. Upon starting again, the machine sources results from the completion buffers in such a way that they appear on the same clock cycles that they would have if no pause had occurred.
[Figure: block diagram of the Cyclops chip, showing processors (each with two thread units, two 32 KB SRAMs, a shared floating-point/MAC unit, and a shared port to the crossbar), shared instruction caches (32 KB SRAM each), the pipelined crossbar switch, the A- and B-Switches, four DDR2 memory controllers with four external memory banks, the host port for message passing, and high speed I/O ports to nearest-neighbor nodes, disk, Gigabit Ethernet, the control tree, and an alert FPGA.]

Cyclops. Fig.  Overview of the Cyclops chip

Processor Sparing
Cyclops was the first chip designed within IBM to exploit large scale processor sparing in order to reduce silicon costs. Even in a mature technology, yields of perfect die the size of the  mm ×  mm Cyclops die would not be high. A single failed logic transistor or wire results in the die being discarded.

The Cyclops ASIC can tolerate any number of failed processors, with the sole consequence being a linear reduction in performance and SRAM capacity. This is achieved by mapping out the failed processors and compressing the remaining interleaved memory space so that it appears to be contiguous.

The compression is done in two steps. First, fast arithmetic hardware divides the address by the number of good banks and produces the quotient and remainder. The quotient is the address in the target bank. Next, the remainder is used to index a remap table that lists, without gaps, the good banks. The output of the remap table is the target bank.
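The same two-step calculation can be written out directly. The Python sketch below is illustrative only; the particular set of surviving banks is hypothetical, and the hardware performs the divide and the table lookup rather than software.

# Illustrative version of the quotient/remainder address compression.
good_banks = [0, 1, 2, 4, 5, 7]          # hypothetical: banks 3 and 6 mapped out

def remap(address):
    quotient, remainder = divmod(address, len(good_banks))
    target_bank = good_banks[remainder]  # remap table lists only the good banks
    return target_bank, quotient         # quotient is the address within that bank

# Consecutive addresses spread across the surviving banks with no holes:
print([remap(a) for a in range(8)])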
Mechanical Package
Each Cyclops ASIC is contained on its own ′′ × ′′ water-cooled blade. The blade contains one gigabyte of off-the-shelf DRAM, an FPGA for external Ethernet
access and hard disk drive control, power supply components, and support for an integral hard disk drive.

Water-cooling is implemented by means of a cold plate constructed from fused sheets of etched copper. The junction temperature of the ASIC is typically held to under  °C, enhancing reliability and improving system speed.

Twenty-four blades are mounted on each side of a midplane, and three midplanes make up a rack of  blades. The petaflop installation of Cyclops has  rows of  racks each.

Midplanes are connected within rows using high-performance flex. Row-to-row connections are made with X Infiniband cable.

Power consumption of the petaflop machine is under two megawatts.

Programming Model
Within a chip, Cyclops is a -way SMP. In the simplest mode of operation, a single program is loaded into shared memory. Processor identity determines which parts of the program to execute and on what data to operate. There is also a lightweight thread library similar to pthreads.
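A minimal sketch of this single-program, identity-driven style is shown below. It is illustrative Python rather than Cyclops code; the thread count, data, and function names are invented, and Python threads stand in for the hardware thread units.

# Illustrative SPMD-style work division: each thread uses its identity to pick
# the slice of shared data it operates on.
import threading

NUM_THREADS = 4
data = list(range(100))          # stands in for data loaded into shared memory
partial = [0] * NUM_THREADS

def worker(tid):
    chunk = data[tid::NUM_THREADS]   # processor identity selects the work
    partial[tid] = sum(chunk)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NUM_THREADS)]
for t in threads: t.start()
for t in threads: t.join()
print(sum(partial))              # 4950, the same as a sequential sum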
Between chips there are simple message-sending and Shmem-like shared storage libraries. There is no support for MPI.

The primary software for Cyclops was developed under the direction of Dr. Guang Gao at the University of Delaware.

Performance
While Cyclops was not designed for high floating point performance, it still achieves over  GFlops/s multiplying double-precision matrices stored in external DRAM. The corresponding power is  watt/GFlop.

The chip achieves approximately . GUPS (giga updates per second) to external DRAM.

Design
The Cyclops ASIC, a -MHz -nm design, was the largest die ever processed by the IBM ASICS division. It had the densest logic implementation of any ASIC, the highest content of low threshold voltage logic, the highest latch count, and the highest peak power. As a result, it posed many unique challenges to the existing design methodology and tools.

Work on early versions of Cyclops began in the late s to address first the challenges of raytracing, and later protein-folding. The ASIC was designed by very small teams at the IBM T.J. Watson Research Center, the IBM Burlington and Fishkill facilities, and by Dr. Gregory Chudnovsky at the Polytechnic Institute of New York University. Packaging and electronics were developed by Central Scientific Services at IBM Research under the direction of James Speidell.

Future Work
Aside from the expected improvements in speed and processor count resulting from the expected course of technology, the next generation of Cyclops will take advantage of two major advances. First, multiple processor and memory die will be tightly integrated on the same module, greatly increasing the bandwidth between them. Second, cost-effective optics and different topologies will enable a large increase in global long-hop bandwidth.

Probably the single most distinctive and valuable feature of the Cyclops architecture is the very large pipelined crossbar. This is expected to persist, with increasing size and bandwidth, through all future generations.

Appendix
The Cyclops die is shown in the photograph below (Fig. ). Colors are due to diffraction. The photograph was taken before fabrication of the upper metal layers.

On the left edge are nine -bit HSS (High Speed SerDes) receivers, with a phase-locked loop in the center. On the right edge are the corresponding nine -bit HSS transmitters.

To the right of the HSS receivers are the A-Switch and B-Switch.

On the top and bottom edges are four DRAM controllers.

Running horizontally across the center is the -port pipelined crossbar.

The two thin columns approximately one-third and two-thirds of the way across the die are  Instruction Caches.
Cyclops. Fig.  The Cyclops die

Finally, most of the die is filled by the  ×  array of processors, each of which contains two thread units and a floating-point unit. Two blocks of SRAM can be seen attached to each processor.

Cyclops-
Cyclops

Cydra 

Michael Schlansker
Hewlett-Packard Inc., Palo Alto, CA, USA

Definition
The Cydra  was a mini-supercomputer designed by Cydrome and completed in . The Cydra  heterogeneous multiprocessor consisted of a general-purpose shared-memory multiprocessor based on Motorola
 processors along with a numeric processor that used Cydrome's Directed Dataflow™ architecture. The numeric processor achieved about one half the performance of a contemporary supercomputer at about one tenth of the price.

Discussion

Introduction
Cydrome (originally known as Axiom) was founded by chief architect Bob Rau, along with Arun Kumar, Ross Towle, Dave Yen, and Wei Yen. The key design goal for the Cydra  was to provide near-supercomputer levels of computing performance at departmental prices. The Cydra  was designed to accelerate existing unmodified FORTRAN applications commonly used for scientific computing. Goals included providing  () MFLOPS of peak single (double) precision floating-point performance and achieving greater computation efficiency by sustaining a higher fraction of peak performance than competing vector and multiprocessor architectures.

System Architecture
A primary feature of the Cydra  was the Directed Dataflow architecture that provided specialized hardware for parallel loop execution. The Cydra  supported compiler-directed parallelization of a very general class of loops, thus providing greater flexibility for accelerating existing codes than competing vector and multiprocessor architectures. The architecture combined the MultiConnect register file, predicated execution, and a specialized branch to support software-pipeline loops without the code replication needed by other hardware architectures. In addition to supporting traditional vectorizable loops, this architecture efficiently supported a very general class of loops with recurrences, conditionals, arbitrary fixed stride, and randomly scattered memory references.

Software pipelining [, ] exploits parallelism by overlapping the execution of a sequence of loop iterations. When loop iterations are executed sequentially, latencies between the operations within each iteration limit loop parallelism. This problem is alleviated by initiating subsequent loop iterations before the completion of prior iterations. Software pipelining executes identical copies of the loop body at periodic intervals. An optimized periodic schedule is identified to minimize the delay between successive loop iterations. The time delay between successive loop iterations is known as the Initiation Interval (II). The II measures the throughput of loop execution, as one new loop iteration is completed every II cycles.

VLIW processors evolved from earlier microprogrammed processors that employed complex data paths to carry operands from registers to and from function units. The processing efficiency was greatly reduced by a number of obstacles. A key obstacle was the need to develop operation schedules that jointly accommodate all resources. Complex schedules often required a distinct and coordinated schedule for both function unit use and register use. Processors often contained multiple specialized register files that held operands as inputs to and outputs from function units. A single bus often supported multiple register files or multiple function units. This increased scheduling complexity and reduced efficiency, limiting program parallelism. Such machines could not be effectively exploited by compilers and were best programmed by clever humans. Another key challenge was to accommodate register lifetimes that often limit parallelism by introducing bottleneck dependences arising from register reuse. The lifetime of a variable formalizes the need to hold a value for an extended period of time and is the period of time starting when a variable is computed and terminating on the variable's last use. A register cannot be reused to hold a new variable until the end of the prior variable's lifetime.

The Cydra  architecture addressed the above difficulties using the Context Register Matrix (CRM), or MultiConnect as it was called inside Cydrome. The CRM provided a number of features to allow compiler technology to automate the construction of compact and efficient software pipelines for a broad class of loops. The CRM evolved from the earlier PolyCyclic interconnect []. The polycyclic interconnect required the shifting of data as new values were created. This required custom memory structures and could not be implemented using high-performance RAM. The CRM integrated standard RAMs with novel register addressing units to provide similar benefits.

The Cydra  system contained multiple General-Purpose Processors (GPPs) and a numeric processor (NP) integrated as a heterogeneous multiprocessor running the Unix System  operating system. While
the primary role of the NP was to accelerate scientific applications, the GPPs were used to run nonnumeric applications. This architecture alleviated troublesome limitations of specialized attached processors such as the Floating Point Systems AP b [] that ran numeric programs with no general-purpose or I/O capability. Such architectures required careful rewriting of applications to stream data from a main program running on the host into numeric kernel programs running on an attached processor. Unlike these machines, the Cydra  appeared as a single processor that allowed applications to be compiled either for the GPPs or for the NP.

The general-purpose processors provided an extensible general-purpose computing capability based on . MHz Motorola  processors. Up to seven GPPs could be added to expand the general-purpose computing capability. In addition to hosting compilations needed for the Numeric Processor, the GPPs supported a large variety of conventional computing tasks under the Cydrix operating system, which was derived from AT&T System V Unix.

Cydra  Numeric Processor
The numeric processor was a Very Long Instruction Word (VLIW) [] processor implemented in Emitter Coupled Logic (ECL) gate arrays for very high performance. The NP achieved its -MHz clock speed using simple hardware to perform performance-critical tasks and, where possible, relied on compiler and runtime software to implement complex operations that could be performed in software without excessive performance loss.

The numeric processor provided a -bit wide instruction container that was interpreted in either UniOp or MultiOp execution modes. In UniOp mode, the VLIW instruction was interpreted as six sequential instructions with explicitly encoded delays between those instructions as needed to satisfy interoperation dependences and resource constraints. The UniOp mode was used outside of high-performance loops and provided a relatively compact encoding for sequential code. In MultiOp mode the VLIW instruction was interpreted as a single instruction that allowed the issue of seven operations in a single cycle. Simultaneous operations included two arithmetic (floating or integer), two address, two memory, and one branch operation. MultiOp mode was primarily used to accelerate highly parallel innermost loop code but could also be used outside of accelerated loops when sufficient parallelism was available. The combination of the Cydra 's wide instruction issue along with its deep pipelining produced large amounts of instruction-level parallelism. The total number of operations in flight (issued but not complete) could exceed  concurrent operations.

The Cydra  NP relied on explicit compile-time scheduling to eliminate complex hardware resource-dependence checking and arbitration. Hardware offered no instruction interlocks or ability to dynamically delay dependent operations, and the compiler was responsible for inserting explicitly encoded delays needed to enforce a correct program schedule. Instructions provided capabilities needed to program explicit delays. Resource usage was modeled with reservation tables [] that encoded the pattern of resource use over a number of machine cycles relative to the time of instruction issue. A machine description database formally defined the hardware's characteristics, capturing both resource needs and latencies of all operations, and precisely defined constraints that must be met for correct execution. This database was used by the compiler to ensure that all resource and dependence constraints are met.
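A reservation-table check of this kind reduces to bookkeeping over (cycle, resource) slots. The Python sketch below is illustrative only; it is not the Cydra  machine description or compiler, and the operation names, resources, and offsets are hypothetical.

# Illustrative reservation-table conflict check: each operation lists the
# (cycle offset, resource) pairs it occupies relative to its issue time.
def conflicts(schedule, reservation_tables):
    # schedule maps operation name -> issue cycle
    busy = set()
    for op, issue in schedule.items():
        for offset, resource in reservation_tables[op]:
            slot = (issue + offset, resource)
            if slot in busy:
                return op, slot          # two operations claim the resource at once
            busy.add(slot)
    return None

tables = {
    "load1": [(0, "mem_port"), (1, "result_bus")],
    "load2": [(0, "mem_port"), (1, "result_bus")],
    "fadd":  [(0, "fp_adder")],
}
print(conflicts({"load1": 0, "load2": 0, "fadd": 1}, tables))  # clash on mem_port
print(conflicts({"load1": 0, "load2": 1, "fadd": 1}, tables))  # None: legal schedule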
The Cydra 's NP datapath is shown in Fig. . This datapath was designed to sustain a peak bandwidth of two floating point operations on every cycle for important numeric loops. Two floating point units, two memory units, and two address units were needed to satisfy this goal. The machine executed six simultaneous non-branch operations, each capable of reading two inputs and writing one output on each cycle. Twelve input ports and six output ports were needed to sustain full performance in parallel code regions. A single file with so many ports could not be constructed using available ECL gate arrays. In order to reduce complexity for providing needed register file ports, the Numeric Processor is divided into data and address clusters.

The CRM holds operands flowing into and out of all function units. Each row consists of cross-point registers that are replicated as a multiported file (indicated by small squares). When a value is written into a row, it is broadcast across the row so that cross-point registers hold identical values.
[Figure: the Cydra  NP datapath. The Iteration Control Registers (ICR) and General Purpose Registers (GPR) feed rows of cross-point registers connected to the function units: a floating point adder/integer ALU, a multiplier/floating point divide and square root/integer unit, main memory ports 1 and 2, an instruction and miscellaneous register unit, an address adder/multiplier, an address adder/bit reverse unit, and a displacement adder, grouped into a data cluster and an address cluster joined by the GPR input bus.]

Cydra . Fig.  Cydra 's NP datapath
Thus, each function unit has dedicated write access and can generate a new value into its dedicated row on every cycle. Columns provide distinct read access ports for two inputs from any row on each cycle.

Often, the length of variable lifetimes must be greater than the initiation interval. When a software pipeline is executed on a conventional horizontal architecture, computed operands are systematically overwritten, after looping to the next iteration, by the same operation that computed the operand on prior iterations. This overwrite often occurs even before the end of the operand's lifetime. This problem can be alleviated by copying the operand before the operand is overwritten or by unrolling code and renaming registers. Each of these techniques has complex limitations.

The CRM addressed this problem using an Iteration Frame Pointer (IFP) that relocated loop-variant values on each iteration. Loop-variant register operands are addressed by adding a displacement from the instruction to the IFP, modulo the register file size, to determine the physical location of the selected register. The IFP is decremented every loop iteration to relocate subsequent register reads and writes. This allowed each operation in the loop body to generate overlapped sequences of register lifetimes that spanned multiple loop iterations. The compiler was responsible for allocating and tracking these values as loop execution progressed.

General-purpose registers (GPRs) were designed to solve a number of problems. Loop invariant values are values that are computed prior to loop execution and repeatedly referenced on each iteration. Loop invariants are supported using GPRs that are addressed as conventional registers and do not participate in register relocation. The use of separate address and data clusters limits the flow of operands between clusters. For example, a value loaded from memory and stored within its CRM row cannot be read directly by an operation in the address cluster. To provide flexible connectivity, all units can write or read the GPRs. GPRs are typically written outside of innermost loops and allow flexible communication between clusters.

Predicated Execution
Many loops have conditionals that preclude parallelization by previous vector and multiprocessor techniques. The Cydra  used predicated execution to allow the execution of nested if-then-else conditionals within a loop body. If-conversion is a code transformation that uses predicates to eliminate branches. After simple if-then-else code is if-converted, a compare computes complementary Boolean predicates for the then and else clauses. Operations within each clause reference a predicate operand that provides the correct clause conditional. Each predicated operation executes only if its predicate input is true; otherwise it performs no action. This correctly implements conditional actions without program branches. Predicated execution also allows the nested if-conversion of conditionals within conditionals. After if-conversion, conditional and unconditional code is uniformly scheduled within a software pipeline. The execution of multiple if-then-else conditionals can be overlapped and scheduled as if the code were unconditional code.

Specialized Loop Control
Software pipelining produces code schedules that are periodic only after a prolog period during which the software pipeline is filled and before an epilog period during which the pipeline is drained []. Without specialized hardware, code replication is needed for both prolog and epilog code before a periodic kernel loop is reached. Efficient support for loops with small trip counts exacerbates the complexity of code generation and scheduling, as prolog code may enter epilog code without ever reaching the kernel.

The Cydra  combined the CRM with predicated execution and a specialized branch to greatly simplify this problem. A brtop loop branch operation supported software pipelines. The brtop was itself a pipelined branch having two branch delay slots in which additional brtop operations might be scheduled. The brtop referenced a number of registers, including the Loop Counter (LC), the Epilog Stage Counter (ESC), the Iteration Frame Pointer (IFP), as well as the Iteration Control Register (ICR) predicate file.

The structure of a software-pipeline loop executing on the Cydra  is shown in Fig. . Each row illustrates operations executed during II cycles of machine code execution. Operations from a single iteration of the original loop body are divided into lettered stages, each stage requiring exactly II execution cycles. The vertical (time sequential) execution of code within stages A, B, C, D, and E completes a single iteration of the source
loop. These loop iterations are overlapped, and each subsequent copy of the actively executed loop body is displaced downward in time by II cycles. The example illustrates the execution of seven source iterations of a loop body having five stages in its software pipeline schedule. The execution of code stages is lettered, and lower case indicates that a stage is nullified (executed with predicate 0). Upper case indicates that a stage is active (executes with predicate 1).

Iteration  Loop
number     counter  ESC  Stages executed             ICR predicates
initial    6        4                                 00001
1          6        4    e d c b A     Prolog code    00001
2          5        4    e d c B A                    00011
3          4        4    e d C B A                    00111
4          3        4    e D C B A                    01111
5          2        4    E D C B A     Kernel code    11111
6          1        4    E D C B A                    11111
7          0        4    E D C B A                    11111
           0        3    E D C B a     Epilog code    11110
           0        2    E D C b a                    11100
           0        1    E D c b a                    11000
           0        0    E d c b a                    10000
                                                       00000

Cydra . Fig.  The structure of a software-pipeline loop

The number of stages measures the amount of overlap among successive iterations. The kernel code region is shown in gray. In this region, horizontal (simultaneous issue) code from all five stages of the loop executes every cycle. Each kernel stage corresponds to a single loop iteration at the machine instruction level that executes the steady-state pattern in the heart of the highly optimized loop. This is the period during which full parallelism is achieved.

Prior to entry in the kernel, the loop is in its prolog. New iterations are initiated every II cycles, but there are an insufficient number of prior iterations to fill the pipeline. In the first row or iteration of the machine code loop, only the first stage A of the first source iteration actively executes. In the second row, B from the first source iteration and A from the second source iteration execute simultaneously. As the pipeline fills, additional stages become active. The pattern of predicates shown on the right correctly enables corresponding stage code. During the epilog, old iterations continue to finish but no new iterations are initiated. Again the epilog pattern is controlled using the pattern of ICR stage predicates shown on the right.

The brtop operation implements all actions needed to control such loops. Each execution of the brtop decrements the iteration frame pointer in order to relocate all register values (including predicates) to allow overlap of lifetimes. The brtop operates on both loop count (LC) and epilog stage count (ESC) registers to correctly calculate the branch decision and stage predicate needed for the next II cycles. While the LC counts the number of source loop iterations, the ESC counts additional iterations needed to fill the software pipeline. A newly calculated stage predicate shifts into view from the right-hand side every II cycles. Operations within each stage reference a corresponding stage predicate. Using this architecture, a single copy of the loop body executes highly optimized software pipelines.
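The predicate pattern in Fig.  can be reproduced with a short simulation. The Python sketch below is illustrative only (it is not Cydra  hardware or compiler code): it counts down a loop counter and an epilog stage counter and shifts one new stage predicate in per stage, printing the prolog, kernel, and epilog rows of the ICR predicate column for the seven-iteration, five-stage example.

# Illustrative reproduction of the prolog/kernel/epilog predicate pattern.
def stage_predicates(trip_count, stages):
    lc, esc = trip_count, stages - 1     # LC: source iterations; ESC: drain stages
    predicates = [0] * stages            # predicates[i] gates stage i of the body
    rows = []
    while lc > 0 or esc >= 0:
        new_bit = 1 if lc > 0 else 0     # start a new iteration only while LC > 0
        predicates = [new_bit] + predicates[:-1]
        rows.append("".join(map(str, reversed(predicates))))
        if lc > 0:
            lc -= 1
        else:
            esc -= 1
    return rows

for row in stage_predicates(trip_count=7, stages=5):
    print(row)     # 00001, 00011, ..., 11111, 11110, ..., 00000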
Cydra  Memory
Instead of relying on a data cache for high bandwidth, the Cydra 's NP provided full bandwidth to DRAM-based main memory. A -way interleaved memory provided up to  MB of storage, and two access ports allowed two memory operations on every cycle. Memory module collisions, memory refresh, and other difficult-to-predict events introduced nondeterministic delays in performance. However, the numeric processor provided no register interlocks to stall dependent operations for tardy input operands, and memory operands
needed to return at a statically scheduled time. This was accomplished using the memory collating buffers and the memory latency register (MLR).

Memory collating buffers allowed operands to return from memory early and out of order within the memory subsystem while those operands are delivered on time and in order to the NP. The collating buffer holds returning memory operands until the exact delivery time. The memory latency register allowed the programming of the latency time at which results from memory were delivered. If a memory result could not be delivered within the requisite time as defined by the MLR, the NP was stalled until that operand was available. From the viewpoint of the NP, all memory operands are delivered exactly on time.

A key memory system goal was to support the efficient access to data for programs having arbitrary stride or complex scatter-gather access patterns. Conventional interleaved memories suffer a large performance loss when the stride of the access pattern and the degree of memory interleave have common factors. For a worst-case stride  access into a -way interleaved memory, only a single memory module is used and memory bandwidth drops significantly. The Cydra  uses a pseudo-random address transformation [] that randomizes the memory access pattern so that common address patterns appear as a random pattern to the memory. The memory is designed to deliver full bandwidth for such random patterns.
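A toy version of such a transformation illustrates the effect. The Python sketch below is illustrative only; the XOR-fold hash is arbitrary and is not the Cydra 's actual transformation, and the bank count and stride are hypothetical. A strided stream that hammers a single module under plain interleaving spreads across many modules once the bank index is scrambled.

# Illustrative bank selection: plain interleaving vs. a hashed mapping.
from collections import Counter

BANKS = 32

def plain_bank(addr):
    return addr % BANKS

def hashed_bank(addr):
    return (addr ^ (addr >> 5) ^ (addr >> 11)) % BANKS   # arbitrary scramble

stride = 32                                   # worst case for a 32-way interleave
addrs = [i * stride for i in range(1024)]
print(Counter(plain_bank(a) for a in addrs))    # every access hits bank 0
print(len(set(hashed_bank(a) for a in addrs)))  # accesses spread over many banks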
The lowest possible latency for the Cydra  NP memory is  cycles. For scalar code, the MLR was set to  to minimize the wait time for scalar operands arriving from underutilized memory that occurs out of loop. Within loops, this minimal MLR value leads to reduced performance, as collisions from memory reference sequences resulted in delayed memory access. An optimal value for loops was selected as  cycles, which balanced the needs for high bandwidth and low latency, especially for short trip count loops.

Exception Handling
The Cydra 's exception handling followed the philosophy of relying on software rather than hardware for complex operations that are not critical to performance. The Cydra  provided a Program Counter History Queue (PCHQ) to record the recent history of instruction addresses. Exceptions are not generated at the time of instruction issue but instead occur a number of cycles after issue at specific exception latencies. For example, a memory exception is reported after a failed address translation, three cycles after operation issue. Hardware directs execution to a software exception handler that provided a direct indication of the unit on which the exception occurred and the cause of the exception. However, hardware does not directly indicate the actual operation that caused the exception. This operation issued a number of cycles in the past and may have been followed by one or more intervening branches. Exception-handling software processes the PCHQ and uses known exception latencies to identify the exact operation that caused the exception. If needed, software recovers input operands from the input registers for the exception-causing operation. This provides all inputs needed to process exceptions in software and resume execution.

Cydra  Compiler Technology
A large burden was placed on the compiler to identify and exploit large amounts of parallelism. After classical optimizations, the compiler was responsible for identifying all interoperation dependences to expose parallelism as a program graph. The graph was transformed to enhance parallelism in order to provide adequate parallelism for the NP. Then the compiler's scheduler created a schedule that packed operations into instructions and time. Specialized compiler technology for software pipelining was developed to exploit the NP hardware. While the Cydra  compiler also utilized specialized VLIW compilation techniques for scalar code, this is beyond the scope of this discussion and can be studied in more detail in [, ].

The Cydra  software-pipelining technology demonstrated optimal sustained performance for most numeric loops. Specialized loop optimization was applied to software-pipeline loops to transform code into an optimized form for software pipeline execution. The compiler used an intermediate form to represent loops in various optimization stages. The intermediate form used the concept of Expanded Virtual Registers (EVRs) to reference operands that have been computed in the current iteration or in previous iterations that may still be available for use in the current iteration. Thus, while an operand "t" represents a temporary computed in this iteration, t[] represents the value of the same operand
as computed in the prior loop iteration. This intermediate form was used to express a number of loop optimizations, including load/store elimination, common sub-expression elimination, and height-reducing transformations needed to produce optimized code. Each of these optimizations reused values computed from prior iterations, and EVRs precisely define this operand reuse.

Optimization is followed by scheduling for software pipeline execution. Performance bounds on the maximal execution rate were calculated and used to assist in loop scheduling. For most loops, loop schedules were identified that exactly match those bounds, achieving optimal program schedules for long trip count loops. The Initiation Interval II measures the minimal time delay in cycles between successive iterations in a software pipeline. The II may be limited either by insufficient hardware resources or by the latency between dependent operations in successive iterations.

The limited availability of hardware resources provides a bound known as ResMII. Assume that a loop body has a number of operations n_t of a given operation type t (e.g., four operations of type memory in the loop body). Assume that hardware provides a number f_t of function units that execute operations of this type (e.g., two memory units). For operation type t, res_t = n_t / f_t provides a resource bound for that type. This bound provides a lower bound on the initiation interval and an upper bound on the throughput for this loop. In this example, a loop body having four memory operations that executes on a processor having two memory units has II ≥ . For multifunction VLIWs, the largest res_t over all resource types provides the worst-case ResMII bound. The initiation interval measures a number of clock cycles as an integer quantity. When the ResMII resource bound is fractional, loop unrolling can often achieve the fractional bound. For example, if the ResMII = ., two unrolled loop iterations may be executed every five cycles.

Dependences between operations in adjacent iterations also limit parallelism. A bound known as RecMII is calculated directly from the program intermediate form. The intermediate form represents a program graph that captures dependences both within a loop iteration and between loop iterations. Inter-iteration dependences are marked with a distance that captures the number of iterations spanned by the dependence. Legal program graphs contain cycles that describe dependence relationships between operations within a single loop iteration as well as between iterations.
is calculated directly the program intermediate form. The Cydra  compiler applies height-reducing trans-
The intermediate form represents a program graph that formations to alleviate the performance limiting effects
captures dependences both within a loop iteration and of inter-iteration dependences. Figure  illustrates vec-
between loop iterations. Inter-iteration dependences tor sum pseudo code for the Cydra . To simplify the
are marked with a distance that captures the number example, latencies of  for address add, and  for float-
of iterations spanned by the dependence. Legal pro- ing add replace actual Cydra  latencies of , and . Each
gram graphs contain cycles that describe dependence operation in the loop body can be executed as a single
 C Cydra 

2.5

1.5

0.5

0
100 200 300 400

Cydra . Fig.  Achieved performance for over  loops

machine operation on the Cydra . The EVR references be scheduled. Typically, the scheduler finds an optimal
provide an iteration distance as a register reference r[i] schedule in this process. However, sometimes, when
indicates the register value r from the i’th prior loop iter- all scheduling choices are exhausted or after excessive
ation. A value produced in the current iteration r[] is compile time is expended searching for a schedule, the
abbreviated as r. In this non-height-reduced form, the II is raised to search for a lower performance loop
floating point sum recurrence limits loop performance schedule.
to one iteration every three cycles. Figure  illustrates the performance of over 
Figure  shows code after height reducing transfor- loops taken from Numerical Recipes, Linpack, and the
mations [] are applied to alleviate limitations from Livermore FORTRAN Kernels. The performance each
recurrences. First-order recurrences are replaced by loop is plotted as a ratio of the II for the achieved
second- and third-order recurrences for address and loop schedule divided by the II that is estimated using
floating sums, respectively. While summation terms are the MAX(ResMII, RecMII) bound. Thus, a value of
reordered, a correct final value is reproduced as the one indicates the loop scheduler has achieved optimal
same terms are finally summed. Figure  illustrates performance. Loops are sorted on order of increas-
the program graph for the height reduced loop. Two ing ratio of (achieved II)/(estimated II) and plotted.
recurrences are shown, and each limits performance Note that over  of the loops achieve optimal per-
by its path length divided by the iteration distance. formance. On the right, a single worst case loop having
After height reduction, a new loop iteration is initiated unusually complex resource patterns and dependences
every cycle and, in the final loop schedule (not shown), achieves a throughput . times slower than the desired
all three operations are scheduled in a single machine bound.
instruction that executes every cycle.
The Cydra  loop scheduler relied on an accurate
estimate of the desired II to begin the scheduling pro- Bibliographic Notes and Further
cess. ResMII and RecMII bounds are calculated, and Reading
the larger of the two is used as the starting II for loop At the time Cydra  was introduced, scientific comput-
scheduling. The actual procedure to calculate RecMII ing was dominated by Cray corporation’s vector proces-
is somewhat complex and beyond the scope of this sors []. Other key contemporary processors included
discussion. After an initial II is determined, the sched- the Alliant multiprocessor and the Convex vector pro-
uler uses a branch and bound search algorithm to cessor. VLIW architectures were explored in research
explore the placement of operations within a cyclic by Fischer [], and this work resulted in the Multiflow
schedule. Each newly scheduled operation is placed architecture and compiler [] that was contemporary
to satisfy both resource availability and interoperation with the Cydra . The Multiflow architecture relied on
dependences from previously scheduled operations. Trace Scheduling to identify and replicate important
The scheduler backtracks when new operations cannot program code regions.
Cydra  C 

Bibliography . Rau BR, Schlansker MS, Tirumalai PP () Code generation


. Lam M () Software pipelining: an effective scheduling tech- schema for modulo scheduled loops. In: Proceedings of the th
nique for VLIW machines. In: ACM SIGPLAN ‘ conference annual international symposium on microarchitecture, Portland
on programming language design and implementation, Atlanta, . Rau BR, Schlansker MS, Yen DWL () The Cydra Tm  stride-
pp – insensitive memory system. In: Proceedings of the  Interna-
. Rau BR () Iterative modulo scheduling: an algorithm for tional Conference on Parallel Processing, The Pennsylvania State C
software pipelining loops. In: Proceedings of the th annual University, University Park
symposium on microarchitecture, San Jose, pp – . Dehnert JC, Towle RA () Compiling for the Cydra . J Super-
. Rau BR, Glaeser CD, Greenawalt EM () Architectural comput :–
support for the efficient generation of code for horizon- . Ellis JR () Bulldog: a compiler for VLIW architectures. MIT
tal achitectures. In: Symposium on architectural support for Press, Cambridge, MA
programming languagges and operating systems, Palo Alto, . Schlansker M, Kathail V () Acceleration of first and higher
pp – order recurrences on processors with instruction level parallel-
. Charlesworth AE () An approach to scientific array pro- sim. In: Sixth international workshop on languages and compilers
cessing: the architectural design of the AP-B/FPS- family. for parallel computing, Portland
Computer ():– . Tang J, Davidson ES () An evaluation of the Cray X-MP
. Fisher JA () Very long instruction word architectures and the performance on vectorizable Livermore FORTRAN kernels. In:
the ELI-. In: Proceedings of the tenth annual symposium on Proceedings of the nd International conference on Supercom-
computer architecture, Stockholm, pp – puting, ACM, St. Malo
. Davidson ES, Shar LE, Thomas AT, Patel JH () Effective con- . Lowney PG, Freudenberger SM, Karzes TJ, Lichtenstein WD,
trol for pipelined computers. In: Proceedings of the COMPCON, Nix RP, O’Donnell JS, Ruttenberg JC () The multiflow trace
IEEE, New York, pp – scheduling compiler. J Supercomput :–
D
The exponential growth in information technology
DAG Scheduling made possible by advancements in semiconductor fab-
rication for the past several decades has been met or
Task Graph Scheduling surpassed by the growth in demand for computing.
Data Mining The net result is that computing platforms hosted in
machine rooms have become denser (e.g., have a higher
computational and storage capability per occupied unit
volume), and machine rooms have been growing in
Data Analytics size, increasingly hosting large collections of server-
class machines referred to as clusters or server farms.
Data Mining Today, information technology is an indispensable
pillar for a modern society in general and indus-
try in particular. Companies from start-ups to multi-
nationals all have IT departments which are large
Data Centers organizations with many machine room installations
that provide computing as a service internally to the
Babak Falsafi
company. Because most of the activity within an IT
Ecole Polytechnique Fédérale de Lausanne, Lausanne,
Switzerland department is centered around processing and storage
of data belonging to an organization, machines rooms
have evolved into entities that are increasingly referred
Synonyms to as Data Centers.
Internet data centers Today, Data Centers range in size from installations
that have about thousands of processing cores and a few
Definition petabytes storage dissipating a few hundred kilowatts
A Data Center is a facility hosting large collections of of electricity to hundreds of thousands of processing
servers that are physically co-located to facilitate one or cores, hundreds of petabytes, and tens of megawatts of
more combinations of Internet connectivity, operation electricity.
infrastructure (such as power delivery and cooling),
management, and providing access security.
Physical Layout
As in machine rooms, a Data Center typically refers to
Discussion a physical site. A site can be a shipping-container-sized
Introduction entity, a room, one or more floors of a building, or a
Large organizations (e.g., enterprises, governmental number of collocated buildings. The minimal physical
agencies, research and academic institutions) have his- unit to which network connectivity, power, and cool-
torically used machine rooms to host and operate a ing are delivered to is referred to by some vendors as
collection of server- and supercomputer-class comput- a POD. A POD is a collection of racks of hardware that
ing platforms. The machine rooms were designed to are collocated and sit adjacent to each other.
leverage the cost of building, hosting and operating a Containers are PODs that are optimized for trans-
collection of high-performance computers. port and mobility and are sized exactly to be the size

David Padua (ed.), Encyclopedia of Parallel Computing, DOI ./----,


© Springer Science+Business Media, LLC 
 D Data Centers

of a shipping container. PODs in a room are sized for allow for serviceability. Today’s low PUE installations
maximum energy efficiency. circulate chilled water over rack tops and use forced air
down to cool the racks. Modern cooling infrastructure
and POD installations also obviate the need for raised
Servers
floors which played a central role in air circulation in
The IT equipment in a Data Center consists of com-
more conventional setups.
pute nodes or compute clusters (for processing), stor-
Power accounts for the second largest source of
age nodes or storage clusters (for storing), the network
overhead after cooling. To prevent a single point of fail-
nodes (for communicating both within and to the out-
ure, all elements of the electrical system are replicated.
side), and a tape library (for backup). Compute, storage,
This replication dramatically adds to the cost and elec-
and network hardware today appear in the form of
tricity usage of the power delivery system. There are
rack-mounted platforms. Tape libraries often appear in
provisions for backup power that consist of universal
the form of tightly integrated collection of tapes that are
power supplies and diesel generators.
operated internally by robots. Nodes of the same type
are often collocated due to the diverse cooling and relia-
bility requirements of the node types. Node form factors Tiers
continue to evolve based on improvements in fabri- Data Centers have diverse reliability and availability
cation technology, server density, and power/cooling requirements based on usage. As such, they are cate-
delivery. gorized into “tiers” ranging from Tier  with the least
stringent reliability requirements to Tier  with the
most stringent reliability requirements. Even within the
Hosting Infrastructure
same organization, applications may demand the usage
Hosting infrastructure includes outside network con-
of Data Centers from varying tiers. For instance, a
nectivity, power delivery, cooling, and security (includ-
telecommunications company may use a Tier  data
ing both fire protection and secured access). The site
center for high availability in their data network service
must provide network connectivity commensurate to
and a Tier  data center with less stringent availability
its size. Fortunately, with advances in communication
for bill processing. The various tiers’ availability require-
technologies in line with improvements in processing
ments are .% for Tier , .% for Tier , .%
speed and storage capacity, network bandwidth has
for Tier , and .% for Tier . The difference in
kept pace with the demand on IT installation sites.
the infrastructure (power delivery and cooling) cost
Cooling and power delivery account for a substantial
between Tier  and Tier  can be a factor of two or
fraction of electricity in Data Centers. Power Usage
higher [].
Efficiency (PUE) is a metric that expresses the energy
efficiency of a Data Center. PUE is the total elec-
tricity used by a site divided by the electricity used Energy
only by the IT equipment. PUEs in today’s Data Cen- The main scalability impediment to Data Centers in
ters can be as high as two indicating a % over- the future is energy, with both economic and environ-
head in electricity to deliver cooling and power. Ven- mental implications. Demand for Data Center scala-
dors are now marketing installations with PUEs as bility has shot up through the years, resulting in an
low as .. unprecedented level of electricity usage in Data Centers.
Historically, machines in a room have been air- On the one hand, the demand for ubiquitous com-
cooled with computer room air conditioning (CRAC) puting and data access by all computer users has sky-
units that cool the room and consequently the equip- rocketed. On the other hand, the cost of owning and
ment. As Data Centers have become denser through operating IT equipment has caused many start-ups and
the years and semiconductor technologies are reaching smaller enterprises to buy processing and storage of data
their limits in energy efficiency, designers are resort- as a utility. The latter is loosely referred to as Cloud
ing to CRAC-less cooling technologies that are more Computing. Cloud Computing is leading to unprece-
energy-efficient. Modern PODs are airtight from all dented growth in server-side computing and Data
sides and have access panels/doors from the sides to Centers.
Data Centers D 

The Energy Star report to the US government indi- of the traffic to disks, requiring lower bandwidth disk
cates that Data Center electricity usage in the USA dou- configurations, and saving energy at the same time.
bled from  to , and is slated to double again
by . The net result is that in , operation cost in Future Directions
terms of electricity over the lifetime of volume servers With regards to energy efficiency, there are a num-
will cost % more than the capital cost of purchasing ber of opportunities to exploit to scale Data Centers.
the server in . The carbon footprint of Data Centers As voltages in semiconductors chips have leveled off, D
in the USA in  is slated to equal that of the airline while chip integration continues, designers must pur-
industry. sue solutions that minimize the number of joules per
While projections of conventional semiconductor operation in a computing stack. To scale energy for
fabrication technologies indicate scalability in chip inte- another  years, the solutions must achieve two orders
gration levels and cost for another decade, chip energy of magnitude of improvement in energy efficiency.
efficiency has dramatically slowed down. The net result Mobile platforms historically have been designed
is tokened “the economic meltdown of Moore’s law” for for low power due to battery usage concerns. Servers
servers and datacenters by Kenneth Brill of the Uptime must incorporate such technologies to minimize power
Institute. without sacrificing the multitude of abstraction layers
There are a number of short-term solutions to mit- in modern server software. Specialization and vertical
igate the Data Center energy efficiency problem. Idle integration of lower-level software and hardware for
power is quite high in servers, reaching up to % of common services in a server (e.g., OS, databases, web,
peak power. Virtualization helps avoid idling by con- search) are likely candidates to achieve such orders of
solidating multiple servers into a single physical plat- magnitude in energy reduction.
form. Virtualization is quite effective in eliminating idle Air cooling is highly inefficient as air is not an effec-
power, but soon reaches diminishing returns as utiliza- tive medium in absorbing heat. Conventionally, super-
tion reaches %. computers have used liquid cooling to achieve much
Cooling efficiency is another low hanging fruit. higher cooling efficiencies at albeit higher technology
Modern installations use cold-isle rather than the con- costs. Technologies, such as two-phase liquid cool-
ventional hot-isle setup where the racks are cooled ing, are promising approaches to improve Data Center
inside the air-tight POD and the hot air is circulated PUE’s to levels below ..
out. Cooling efficiency can dramatically reduce elec- To achieve optimal temperature and minimal cool-
tricity consumption. Improvements in overall energy ing, thermal sensing, management, and cooling must
consumption can come about for installations (e.g., be coordinated. Modern servers include internal knobs
in Europe) that tightly integrate outdoors air/water at (e.g., voltage/frequency scaling) for thermal/power
cooler temperatures to help reduce cooling costs, and management. As servers are overprovisioned (to pro-
use the generated heat to warm buildings. vide performance guarantees), modern servers do not
Microprocessors are increasingly incorporating a exploit these mechanisms. Moreover, modern platforms
larger number of simpler processing cores. The lat- include crude temperature sensing and management
ter are more energy-efficient but require an increase technologies that do not lend themselves well to accu-
in the degree of parallelism in the software. A rate thermal management. There are proposals for inte-
number of conventional server softwares are highly grated and accurate thermal management and cooling.
parallel and as such amenable to such organiza- There are a number of proposals to reduce replica-
tions. Low parallelism in code and the diminishing tion levels in the electrical system and power delivery
returns on how simple cores can get will eventually while not compromising the availability. Such proposals
limit energy efficiency in conventional microprocessor dramatically reduce the electricity usage in the power
organizations. delivery.
Solid-state disks (SSD) are emerging as a cost- and
energy-effective approach to augment the storage hier- Related Entries
archy. A layer of SSD between memory and disks can Clusters
serve as an effective cache for the storage, filtering much Power Wall
 D Data Distribution

Bibliographic Notes and Further restricted to arrays whose subscripts are affine expres-
Reading sions, and to distributions that can be described as
The Datacenter as a Computer [] is a short book affine expressions in one or two variables. These restric-
that glosses over modern datacenter design issues and tions are important because they allow the problems
trends. The data center page on Wikipedia [] includes above to be targeted with Diophantine equations while
more detailed information about conventional data- handling common data distributions and program
center installations. The Uptime Institute [] and the idioms.
Energy Star [] reports include careful analysis on
energy costs and projections for future Data Center Overview
scalability. ITRS [] includes white papers on technol-
ogy projections. Computational Model and Notation
Figure a shows the array A block distributed across
Bibliography four processors. The distribution results in the block
. Data Center Site Infrastructure Tier Standard: Topology ()
of elements  . . .  being placed on the first proces-
Uptime Institute LLC sor, elements  . . .  on the second processor, ele-
. Barosso LA, Hölzle U () The datacenter as a computer: an ments  . . .  on the third processor, and elements
introduction to the design of warehouse-scale machines. Syn-  . . .  on the fourth processor.
thesis series on computer architecture. Morgan & Claypool, Given a set α of array elements whose values will be
San Rafael
computed by a processor p, it is often the case that values
. Wikipedia. http://en.wikipedia.org/wiki/Data_center
. Brill K () The economic meltdown of Moore’s law. Uptime not distributed onto p are needed to perform the com-
Institute and the Green Data Center. USENIX Keynote. http:// putation. A popular way of determining the elements
www.usenix.org/event/lisa/tech/brill_talk.pdf in α is the owner computes rule [], i.e., each proces-
. US Environmental Protection Agency () Energy star program, sor computes new values for the array elements that
report to congress on server and datacenter energy efficiency.
reside on it, i.e., that are owned by it. To compute the
Public Law –. http://www.energystar.gov
. Yearly Reports. The international technology roadmap for semi-
new values of array elements in α, it is necessary for
conductors. http://www.itrs.net/reports.html the processor to receive from other processors the input
values that reside on these processors. Thus in Fig. a,
to compute A[], processor p must acquire the value
of A[] from processor p . Because values are typically
Data Distribution communicated via two-sided MPI primitives, it is also
necessary for the processor that owns the needed values
Samuel Midkiff to know that it must send them. Thus, in the exam-
Purdue University, West Lafayette, IN, USA ple, it is necessary for p to send the value of A[] to
processor p .
Definition
Data distribution refers to the process of spreading Notation Needed by Data Distribution
the elements of a data structure, typically an array, Let A be an array of rank DA , that is A has DA dimen-
across two or more processors of a distributed memory sions. For every dimension d,  ≤ d < DA , the lower
machine. This is done to enable parallel computation on bound of A in dimension d will be zero, and the upper
the distributed structure. Distributing an array across bound will be denoted UAd . The processors performing
multiple processors leads directly to the problems of the computation can also be arranged as an array with
how to address the elements of the array within a pro- rank DP , using a similar notation. Individual processors
cessor, how to determine the processor on which some will be described as elements in the processor grid, i.e.,
element of an array has been placed, and the related pj ,j ,...,jDP − is element P[j , j , . . . , jDP − ] in the processor
question of what elements a given processor owns. Top- grid P. For simplicity, in this article, one-dimensional
ics related to distributing non-array data structures are processor grids will be assumed unless otherwise
not covered in this article. Moreover, the discussion is stated.
Data Distribution D 

A: 0 1 2 ... 49 A: 50 51 52 ... 99
Local and Global Spaces
The literature often refers to the local and global spaces
P0 P1
of an array. In the example of Fig. , the global space of
a The layout of block distributed array A on four processors the array contains all of the elements of the array, and
double A(200); distribute A(BLOCK) onto P(4)
valid index values are  . . . . The local space varies
for(i = 0; i < 199; i ++) { with each processor. Thus the local space of the array
A[i] = A[i] + A[i+1])/2; on p , p , p , and p is  . . . . In languages like Fortran D
}
 that support nonzero lower bounds and array sec-
tions, the distributed array elements on some processor
A: 100 101 102 ... 149 A: 150 151 152 ... 199
can be addressed in the global space. In languages like
P2 P3 C, C++, and Java, they must be accessed in the local
b A loop performing a computation on A that is not space, and a translation analogous to what Fortran 
distributed.
does going from a reference in global space to accessing
Data Distribution. Fig.  An example illustrating data a locally distributed element must be performed by the
distribution and related problems programmer. For simplicity, in this article, it is assumed
that all accesses are in global space, i.e., A[] is used to
access the first element of the A array on processor p in
Fig. a.
References to arrays are typically contained in loop
nests. The loops within a nest are subscripted by index Data Distributions
variables named i , i , . . . , in , where the i subscript is the This section describes what elements of an array are
nest level of the loop, i.e., i is the outermost loop. The placed on what processors, and how this placement
index variable for an entire loop nest is called I, i.e., I = can be represented by an affine function in one or two
[i , i , . . . , in ]. For each dimension d of array A, there is variables.
an affine index or subscript function σAd (I) = c ∗ i + After discussing the simple one-dimensional case,
c  ∗ i  + . . . cn ∗ i n + c  . the case of multidimensional data arrays being dis-
Upper (lower) bounds will be referred to as U (L), tributed onto multidimensional processor grids will be
respectively, with a subscript indicating the program considered.
structure (e.g., a loop, or a data array or processor
grid) and the dimension whose bounds are being repre- Replication of Data
sented. Thus, Ui is the upper bound of the loop whose The simplest form of data distribution is to replicate
index variable is i , LAd is the lower bound of dimension the structure across the different processors []. Thus,
d of an array, and LAd ,pj is the lower bound of dimension a copy of the distributed array is placed on each pro-
d of the portion of the array A on processor pj . The num- cessor as shown in Fig. a. The function △(e) =  ≤
ber of elements in an array (processor grid) dimension e < UA describes the elements of A contained on each
is UAd +  (UP,d + ). processor.
The block size Bs is the number of elements of a
block of an array dimension that is either block or block- Block Distribution
cyclic distributed. As with upper and lower bounds, With a block distribution, blocks of contiguous array
subscripts can be used to further identify the block elements are distributed onto each processor [], as
being described. For each distribution, there is an affine shown in Fig. a. Given an array A of  ele-
function △Ad ( ) in one or two variables that describes ments block distributed onto four processors, the
the distribution of elements of dimension d of array first processor would receive elements A[], A[], . . . ,
A onto a processor grid. A function σ ( ) is an affine A[], the second processor would receive elements
subscript function for some dimension of an array ref- A[], A[], . . . , A[], and so forth. This distribution
erence. Subscripts on σ (and other quantities) may be can be defined more formally. Given the UP +  proces-
dropped if the meaning is clear from the context. sors onto which an array dimension is distributed, the
 D Data Distribution

A: 0 1 2 ... 199 A: 0 1 2 ... 199


block of length Bs the element is in. The block contain-
ing element e is ⌊ e+
Bs
⌋. Finally, △Ad (e) = LAd ,pj ≤ e ≤
p0 p1
UAd ,pj .

A: 0 1 2 ... 199 A: 0 1 2 ... 199


Cyclic Distribution
a p2 p3 With a cyclic distribution, elements of the array are dis-
tributed in a round-robin fashion as shown in Fig. b.
A: 0 4 8 ... 196 A: 1 5 9 ... 197 Each processor pj receives elements
p0 p1 A[j], A[j + UP + ], A[j + ⋅(UP + )], . . .
A[j + UA,pj ⋅(UP + )],
A: 2 6 10 ... 198 A: 3 7 11 ... 199
p2
where UA,pj , the number of elements distributed to pro-
b p3
U +−j
cessor pj , is ⌈ UA P + ⌉. An affine function describing the
... elements of A that reside on some pj is △Ad (pj ) = (UP +
A: 0, 1, 2 12, 13, 14 188, 189, 190
)⋅(k − ) + j,  ≤ k ≤ ⌈ UUA +−j
P +
⌉. The processor that owns
p0
element A[σAd (I)] is given by σAd (I) mod (UP + ).
A: 3, 4, 5 15, 16, 17 ... 191, 192, 193
Block-Cyclic
p1 The block-cyclic (or cyclic(Bs)) distribution combines
features of both block and cyclic distributions. Whereas
A: 6, 7, 8 18, 19, 20 ... 194, 195, 196 the cyclic distribution (which is a block-cyclic distri-
p2 bution with Bs = ) spreads single elements of an
array across the processors in a round-robin fashion, the
A: 9, 10, 11 21, 22, 23 ... 197, 198, 199 block-cyclic distribution spreads blocks of contiguous
array elements across the processor grid. With block-
p3
c cyclic, Bs is the programmer specified size of a block,
A +
and numBs = ⌈ UBs ⌉ is the number of blocks in the
Data Distribution. Fig.  Examples of replicated, cyclic,
array. To find the elements contained by processor pj ,
and block-cyclic distributions. (a) The layout of a replicated
it is necessary to first find the blocks that are contained
array A on four processors. (b) The layout of a cyclically num −j
by pj . A processor pj contains numBs,pj = ⌈ UP Bs+ ⌉
distributed array A on four processors. (c) The layout of a
blocks. Note that the formula for the number of blocks
block-cyclically distributed array A on four
is similar to that for the number of elements on a pro-
processors
cessor with the cyclic distribution. Number the b blocks
as , , . . . numBs,pj . Then the first element in block b of
array Ad is LAd ,b,pj = Bs⋅j + Bs⋅(UP + )⋅(b − ). Thus, the
number of elements on each processor (i.e., the block set of doubles
+
size for each processor) is Bs = ⌈ UUAP + ⌉ Processor pj will {[max(LAd , LAd ,b,pj ) : min(LA d, LAd ,b,pj + Bs − )] ∣
get the jth block, and thus the lowest element of A on
 ≤ b ≤ numB,pj }
pj is LAd ,pj = j⋅Bs. The highest element on pj will be one
less than the start of the block on the next processor, describe the elements of the A on processor pj .
or (j + ) ⋅ Bs − . To ensure that only elements in the The elements contained on a processor pj for
declared array are distributed, the upper bound on pj the replicated, cyclic, and block distributions can be
must be less than the maximum element in the array, described as an affine function in one variable because
and thus UAd ,pj = min(UAd , (j + ) ⋅ Bs − ). To deter- the distance between the e’th and e + ’th element of an
mine what processor pj contains, or owns, some element array residing on pj , in the global space of A, is constant
e = A[σAd (I)], it is only necessary to determine which and independent of the value of e. For block-cyclic, this
Data Distribution D 

is not true – if elements e and e +  are within a block, a A[0:3, 0:24] A[4:7, 0:24] A[8:11, 0:24] A[12:15, 0:24]

the distance between them is one, and if e and e +  are A[0:3, 0:3] A[4:7, 0:3] A[8:11, 0:3] A[12:15, 0:3]
in different blocks, the distance is at least (UP + )⋅Bs. A[0:3, 4:7] A[4:7, 4:7] A[8:11, 4:7] A[12:15, 4:7]
This means that the distribution cannot be described A[0:3, 8:11] A[4:7, 8:11] A[8:11, 8:11] A[12:15, 8:11]
b A[0:3, 12:15] A[4:7, 12:15] A[8:11, 12:15] A[12:15, 12:15]
as an affine function in one variable, a situation that
caused much difficulty in handling block cyclic distri- A[0:3, 0:3] A[0:3, 4:7] A[0:3, 8:11] A[0:3, 12:15]
butions in HPF [, ] in the early s. However, once A[4:7, 0:3] A[4:7, 4:7] A[4:7, 8:11] A[4:7, 12:15 D
it is realized that the distribution can be described as an A[8:11, 0:3] A[8:11, 4:7] A[8:11, 8:11] A[8:11, 12:15]
A[15:12, 0:3] A[12:15, 4:17] A[12:15, 8:11] A[12:15, 12:15]
affine function in two variables [, , ], it is possible to c
manipulate these distributions as described above. One A[0:3] A[4:7] A[8:11] A[12:15]
variable, b, moves from block to block, and the other, e, A[0:3] A[4:7] A[8:11] A[12:15]
A[0:3] A[4:7] A[8:11] A[12:15]
traverses the elements within a block.
d A[0:3] A[4:7] A[8:11] A[12:15]

Distributing Data Along Multiple Data Distribution. Fig.  Examples of multidimensional


Dimensions distributions. (a) Distribution of an array onto a processor
The distribution of multidimensional array A onto mul- grid with fewer dimensions. (b) Distribution of an array
tidimensional processor grid P is now described. There onto a processor grid with the same number of
exist three cases: () the array rank is greater than the dimensions. (c) Distribution of rows (columns) of an array
processor rank; () the array and processor rank are onto columns (rows) of a processor grid. (d) Distribution of
equal; and () the processor rank is greater than the an array onto a processor grid with a larger number of
array rank. dimensions
In the first case, DP dimensions of the array are
distributed onto the processor grid, and DP −DA dimen-
sions are replicated, as shown in Fig. a. In the second A cyclic distribution is often used in situations
case shown in Fig. b, DA = DP dimensions of the array where the amount of work being done on each pro-
are distributed onto the processor grid. Figure b shows cessor may change from iteration to iteration, as is the
array rows (columns) distributed onto processor rows case with triangular loops and some of sparse matrices.
(columns), and Fig. c shows array columns (rows) dis- Using a cyclic distribution spreads consecutive itera-
tributed onto processor rows (columns). In the third tions across the processors and leads to better load
case, DA dimensions of the array are distributed onto DP balancing at the cost of increased communication when
dimensions of the processor grid, and the distributed adjacent array elements (in the global space) are needed.
array is spread (replicated) across the remaining DP −DA A block distribution can lead to a greater load imbalance
dimensions of the processor grid, as seen in Fig. d. but less communication. The block-cyclic distribution is
The computations required for determining what a hybrid of the two with the benefits of both at the cost
elements along a dimension a processor might own can of more complex code to access elements of the array
be computed independently for every dimension. Given and to determine loop bounds.
an element of an array, it resides on a processor if the
index for that element along every dimension of the Iteration Space Partitioning
array resides on the processor. After data has been distributed onto processors, it is
necessary to partition, or shrink, the iteration space of
Trade-Offs of Different Distributions loops that execute on each processor to only access the
When a value is needed on every processor, and the elements of the array that are owned by the processor.
computation of the value costs significantly less than To partition the iteration space for a processor pj , infor-
communicating it (e.g., via a broadcast or complicated mation is needed about the distribution, original loop
point-to-point messaging), the best solution is often to bounds, and the subscript that is accessing the array.
compute the value on every processor. This requires For the distribution, the information required is the
replicating the data across the processors. first and last elements of the array that are distributed
 D Data Distribution

onto the processor (UA,pj , LA,pj ), the block-size Bs, the the bounds computed above, and the list of elements
number of processors (UP + ), the affine function △ on the processor, specified by △ and its bounds. This
describing elements of A that are present on a proces- is done by solving the Diophantine equation σ = △. If
sor, and the type of the distribution. For simplicity, the there is no solution to the equation, then no elements
present discussion will restrict itself to subscript expres- of A accessed by the subscript are owned by the pro-
sions that are affine functions in one variable, i, i.e., cessor. If there exist solutions, then for each variable
e = σ (i). This restriction is later lifted, and the general x , x , . . . , xn there is a parametric equation x = κ + κ  ⋅t
case of multiple references in a loop is also discussed. []. For block and cyclic distributions, the variable i used
in σ (i) is the loop index, and the variable (k for cyclic,
Partitioning on Subscript Expressions b for block) for △ specifies the different elements. Solv-
i,p j ,σ ≤ κ  + κ  ⋅ t ≤ Ui,pj ,σ gives the range of the t
ing Linit init
in a Single Variable
Partitioning the loop bounds logically proceeds in two needed to generate the needed values of the x that cor-
steps. First, the bounds are initially shrunk to encom- responds to the index variable i. Plugging these values
pass the range of elements that are owned by the proces- of t into the formula above gives the range of i val-
sor. Next, the elements resident on a processor that are ues, and therefore the bounds of i for the loop on each
accessed by the subscript are identified, and the range processor.
of the index variable is modified to only access those For block-cyclic distributions, the same technique
elements. is used, but the △ function has two unknowns, and
Initial bounds partitioning. The initial partitioning gives consequently the i loop must be logically replaced by a
lower and upper bounds of the loop index i on processor pair of nested loops that iterate over the elements of the
pj for some reference A[σ (I)] as array.
− −
i,p j ,σ = max(Li , min(σ (LA,pj ), σ (UA,p j )))
Linit
Partitioning on Subscript Expressions
and in Many Variables
init
Ui,p j ,σ
= min(Ui , max(σ − (LA,pj ), σ − (UA,pj ))), It is often the case that a subscript function will be a
function of two or more index variables in a loop nest.
respectively. The function σ − is the inverse of the sub- Consider a loop nest of depth n, and the analysis of a
script function, i.e., e = σ (σ − (e)). Thus the lower subscript expression
bound of the partitioned loop i must be at least as great
as the least of the lower and upper bounds of the original σ (I) = c ⋅i + c ⋅i + . . . + cq− ⋅iq−
loop. Moreover, for a monotonically increasing sub- + cq ⋅iq + cq+ ⋅iq+ + . . . cn ⋅in + c
script value, it should be no less than the value needed
to partition the bounds of the iq loop. At runtime, when
to access the lowest numbered element of the array
the iq loop begins to execute, values of the i , i , . . . , iq−
partition owned by the processor. The explanation of
loop indices are known, and the sum of products of
the expression for monotonically decreasing subscript
these and their coefficients can be folded into the con-
values and for the initial loop upper bound is similar.
stant c , giving c′ = c +∑j= cj ⋅ij . For the index variables
q−

Finding elements on pj accessed by σ (i). The technique iq+ , . . . , in , the maximum and minimum values of the
above is not sufficient to find the bounds of the loop. function
Why this is true can be seen in the example of Fig. a.
σ ′ (iq+ , . . . , in ) = cq+ ⋅iq+ + . . . cn ⋅in
In the example, some processors only contain odd ele-
ments of A, and some only contain even elements. If in the iteration space of the iq+ , . . . , in index variables
the array is accessed with σ (i) =  ⋅ i, the processors can be found. The minimum and maximum values that
containing only odd elements of A will not access any σ ′ can have are found for each processor by evaluat-
elements. ing it at upper and lower bounds of the iteration space
To determine precisely which elements are accessed, on that processor, and intersecting it with the array
it is necessary to intersect the set of elements accessed elements owned by the processor. Once these bounds
by the subscript for each reference, as specified by σ and are determined, the lower bound can replace the terms
Data Distribution D 

iq+ , . . . , in in σ to find the lower bound as described for (i = max(LA , min(LA ,i , LA ,i , . . . , LA r ,i )),
in section “Partitioning on Subscripts Expressions in i ≤ min(max(UA ,i , UA ,i , . . . , UAr ,i )),
Single Variables”, and the upper bound can be used to i+ = gcd(SA  ,i , SA ,i , . . . , SAr ,i ) {
replace the terms to find the upper bound. if (i ∈ max(LA , LA ,i ) : min(UA , UA ,i ) : SA ) {
A = . . .
Finding Iteration Spaces for Loops with }
Multiple References if (i ∈ max(LA , LA ,i ) : min(UA , UA ,i ) : SA ) { D
Let LA ,i , LA ,i , . . . , LAr ,i be the lower bounds on i for A = . . .
the r array references, UA ,i , UA ,i , . . . , UAr ,i be the upper }
bounds, and SA ,i , SA ,i , . . . , SAr ,i be the strides. (The …
strides result from the solutions of the parametric equa- if (i ∈ max(LA , LAr ,i ) : min(UA , UA r ,i ) : SAr ) {
tions described in section “Partitioning on Subscripts Ar = . . .
Expressions in Single Variables”.) The final iteration }
space must contain all of the iterations needed by all of }
the references. Thus, the lower bound of the resulting
loop is max(Li , min(LA ,i , LA ,i , . . . , LAr ,i )), the upper Data Distribution. Fig.  Example of code needed with
bound is min(Ui , max(UA ,i , UA ,i , . . . , UAr ,i )), and the multiple computed references to distributed arrays in a
stride is gcd(SA ,i , SA ,i , . . . , SAr ,i ). Because the iteration loop
space contains, in general, more iterations than needed
by any single reference, each computed reference must Bibliography
be guarded by an if statement that checks if the current . Banerjee U () Loop transformations for restructuring com-
pilers: the foundations. Kluwer, Norwell
value of i is within the upper and lower bound for that
. Banerjee UK () Dependence analysis. Kluwer, Norwell
reference, and if the stride is a divisor of the stride for a . Gupta M, Banerjee P () A methodology for high-level syn-
reference, as shown in Fig. . thesis of communication on multicomputers. In: ICS ’: Pro-
ceedings of the th international conference on supercomputing,
Communication Sets ACM Press, New York, pp –
If cross-iteration dependences exist in the program, . Gupta M, Midkiff S, Schonberg E, Seshadri V, Shields D, Wang
K-Y, Ching W-M, Ngo T () An HPF compiler for the IBM
communication among processors is necessary to trans- SP. In: Supercomputing ’: proceedings of the  ACM/IEEE
fer data that is needed by processor pr but is owned conference on Supercomputing (CDROM), ACM, New York, p 
by another processor ps . With the bounds of the par- . Kennedy K, Koelbel C, Zima HP () The rise and fall of High
titioned loop on some processor pr , the elements of a Performance Fortran: an historical object lesson. In: Proceedings
read array reference accessed by pr can be determined of the third ACM SIGPLAN history of programming languages
conference (HOPL-III), pp –
and expressed as an affine function. The affine func-
. Kennedy K, Kremer U () Automatic data layout for
tion describing the read elements is simply the subscript distributed-memory machines. ACM Trans Program Lang Syst
expression σ for the reference, bounded by the upper ():–
and lower bounds of the partitioned loop. By inter- . Kennedy K, Nedeljkovic N, Sethi A () Communication
secting this affine function with the affine functions △ generation for cyclic(k) distributions. Kluwer, Troy, New York,
describing elements of the array owned by each proces- pp –
. Koelbel C, Mehrotra P () Compiling global name-space par-
sor, the elements that must be sent from each processor allel loops for distributed execution. IEEE Trans Parallel Distrib
to other processors can be determined. This intersection Syst ():–
is the solution of the Diophantine equation σ = △. . Koelbel CH, Loveman D, Schreiber RS () The High Perfor-
Naively, this intersection must be performed by mance Fortran handbook. MIT Press, Cambridge
every processor. However, compiler analyses can deter- . Midkiff SP () Local iteration set computation for block-cyclic
distributions. ICPP :–
mine communication patterns for pairs of references
. Rogers A, Pingali K () Process decomposition through local-
[, ]. These patterns will often limit the number of ity of reference. In: Proceedings of the  ACM conference
processors for which communication sets must be on programming language design and implementation, Portland,
computed. OR, pp –
 D Data Flow Computer Architecture

. Wang L, Stichnoth JM, Chatterjee S () Runtime performance The following sections discuss static and dynamic
of parallel array assignment: an empirical study. In: Supercom- data flow architecture, then data structures in data flow
puting ’: proceedings of the  ACM/IEEE conference on
computers, and multithreading architectures inspired
supercomputing (CDROM), IEEE Computer Society, Washing-
ton, DC, p 
by data flow principles.

Static Data Flow Architecture


The basic scheme of a static data flow multiprocessor is
Data Flow Computer Architecture illustrated in Fig. . A data flow program is stored in the
machine as a collection of Activity Templates divided
Jack B. Dennis
among the several processing elements and held in the
Massachusetts Institute of Technology, Cambridge,
Activity Store of each. The operand fields of Activity
MA, USA
Templates are placeholders for operand values expected
as results of executing other instructions; destination
Definition fields specify the target Activity Template, which may
Data Flow Computer Architecture is the study of special
reside locally or in a remote processor, and the role that
and general purpose computer designs in which per-
the result plays at the target. The Instruction Queue con-
formance of an operation on data is triggered by the
tains Activity Store addresses of instructions, for which
presence of data items.
operands needed for execution have been placed in the
corresponding Activity Template, and delivers them to
Discussion the Fetch Unit. In turn, the Fetch Unit accesses the com-
Introduction pleted Activity Template and forwards it to a Function
This article discusses several forms of data flow archi- Unit appropriate for the requested operation. When a
tecture that have been studied in university research result is obtained, the Function Unit constructs one
groups and industrial laboratories beginning around or more Result Packets, each consisting of a copy of
 []. The architectures covered all use some form the result value and one destination address from the
of data flow graph programming model to enable the Activity Template, and passes these to the Send Unit.
exploitation of parallelism. The Send Unit either directs the packet through the
In its original form, data flow architecture envi- Interprocessor Network to the Update Unit of a remote
sioned single instructions operating on individual data processing element, or delivers the packet to the local
elements, integers or floating point numbers for exam- Update Unit through the Receive Unit. The Update Unit
ple, as the units of computation, and this has character- enters the result in an operand field of the target Activ-
ized most of the design proposals and projects. Other ity Template, and checks whether all needed operands
designs have used blocks of instructions or modules of are present. If so, it enters the address of the Activ-
code as the independently executed unit. ity Template in the Instruction Queue. Conditional and
Two major lines of development of data flow archi- iteration program elements may be implemented using
tecture have been pursued. In static data flow machines, relational instructions that produce Boolean control
the hardware holds a single unchanging representation packets; the Update Unit may use such a control value to
of a data flow graph loaded prior to program execu- choose which of two operands a Function Unit should
tion. In dynamic data flow, means are incorporated so use. Reference [] gives details of one proposal.
that computation proceeds as though copies of a data Although implementations of static data flow have a
flow graph were generated during program execution, variety of forms, the circular pipeline structure evident
providing direct support of arbitrarily nested function in Fig.  is a common feature, and supports the contin-
activations. A general form of dynamic data flow also uous processing of several instructions simultaneously.
permits many instances of a loop body to have effec- Static data flow concepts are well-suited to signal pro-
tively separate data flow representations for each cycle cessing applications and to scientific computation using
of loop execution so that overlapped execution may linear algebra methods, especially when array sizes are
occur. known at the start of computation. Several industrial
Data Flow Computer Architecture D 

Activity template Data flow multiprocessor

Opcode
Operand A Inter-
processor
Operand B network
Destination 1
a Destination 2 D
c

Data flow processing element


Function
units Send

Inter-
Instruction queue processor
network

Fetch Update
Receive
unit Activity unit
store
b
Data Flow Computer Architecture. Fig.  Scheme of a static data flow multiprocessor

projects have constructed static data flow processors, lie in the way instructions are stored and activated. In
and the principles have found use in signal processing dynamic data flow, operands cannot be held with the
applications. instruction that operates on them. Instead, a special
component, called the Token Matching Store, is used
Dynamic Data Flow Architecture to hold operand values that must wait for a second
In a dynamic data flow computer, the values passed operand to arrive before an operation on the pair may
between actors of a data flow program graph are be performed. Then the Matching Store forwards the
tagged to indicate which instance of a variable is rep- pair of values to Instruction Fetch, which accesses the
resented. To handle arbitrary nests of function calls instruction and passes the combination to Instruction
and unknown loop iteration counts, the number of Execute. The Form Token unit creates a token using
distinct value instance existing in any snapshot of the computed result value, and the operand tag and
an ongoing computation is unknown until after the destination address passed through from Instruction
computation has begun. Hence, the information in the Fetch. Tokens are either entered in the local Token
tag of a value cannot be coded in the structure of the Queue or dispatched to a remote processor according
(finite) graph. The tag of a value must define to which to their tag values.
function activation the value pertains, and to which A pioneering implementation of dynamic data flow
cycle within each nested loop of the function body was built at Manchester University, England [], and
the value applies. The semantics of an execution model it used a hardware associative memory for the Token
known as the unraveling interpreter were published in Matching Store. Several variations on the scheme
 [, ] and supplied the initial impetus for dynamic of Fig.  have been developed at research institu-
data flow. tions, some with industrial support. The variations
Figure  shows the general scheme of a dynamic data are mainly concerned with replacing the associative
flow multiprocessor. The dynamic data flow scheme has memory used in the Manchester University machine
similarities to the static scheme: tokens carrying data with innovative schemes that are substantially more
values flow in a circular pipeline structure. Differences cost-effective.
 D Data Flow Computer Architecture

a Token format b Tagged token processing element

Value
Instruction Form
Tag execute token Inter-
processor
network
Token
Program Instruction
queue
memory fetch

Token matching store

Data Flow Computer Architecture. Fig.  Scheme of a dynamic data flow multiprocessor

Data Structures in Data Flow


Performing computations involving structured data F: Fork
F
such as arrays of values presents a challenge for data Spawn a thread
flow architecture because it seems wasteful to carry a for each child F

complete data structure from a producer actor to a con-


J: Join B
sumer. On the other hand, if the data structure is held
separately from the data flow graph, then hazards can Collect results from B B B
arise from races between conflicting memory opera- each child thread
tions. One solution to this quandary is to use special
data structures known as I-Structures for which a single B: Block of code
J
write order sets a value and several synchronized reads Sequential code to
may safely access it []. Other approaches involve par- perform a task J
tial copy-on-write operations, and special data parallel
treatment of array construction operations.
Data Flow Computer Architecture. Fig.  Multithread
tasking scheme inspired by data flow concepts
Data Flow Ideas in Multithreading
Several workers familiar with data flow principles have
employed data flow concepts in the construction of sys- behavior so long as the restriction of code segments to
tems able to multithread large numbers of concurrent read-only shared data is honored.
activities with a guarantee of determinate execution.
The principles used are illustrated in Fig. , where the Future Prospects
links indicate flow of control (threads). Here four code Currently, increased attention is being given to mul-
blocks are arranged to execute concurrently through the tithread computer architectures inspired by data flow
use of fork and join operators. It is expected that a com- principles to yield higher performance by dynamically
puter system will schedule threads to execute the code scheduling sequences of instructions as independent
blocks as short sequential program segments. In execu- units of computation []. It is known how to build
tion, the code blocks may have internal variables, but static dataflow computers and streaming processors that
all shared data is read- only. At a fork operator, the par- can perform well. The extension to data structures and
ent thread makes specific data objects accessible to each dynamic control structure, such as recursive functions,
child task. At a join-operator result values produced has remained a challenge for demonstrating levels of
by each child are combined and made available to the performance that would attract widespread interest.
continuation thread. Several variations of this scheme Because dataflow graphs are a determinate rep-
have been implemented for shared-memory multipro- resentation for parallel programs that do not make
cessor systems, and provide a guarantee of determinate essential use of nondeterminacy, it is feasible to build
Data Flow Computer Architecture D 

computers using data flow principles that can provide the user with a guarantee of determinate operation, while achieving highly parallel computation. This was accomplished in the Barton/Davis DDM machine [, ]. However, reluctance to adopt the functional programming style favored by data flow architecture has limited the prospects for this achievement.

Related Entries
Data Flow Graphs
Determinacy
Functional Languages
Multi-Threaded Processors

Bibliographic Notes and Further Reading
Data flow computer architecture was inspired by the prospect of using data flow graphs as a programming model, thereby permitting exploitation of parallelism to be readily accomplished. The first proposals for the hardware organization of dataflow computers were published in  [] and  [], originating in the Computation Structures Group of MIT's Project MAC. The earliest working model of a computer using data flow principles was the DDM designed by Al Davis and Robert Barton at the University of Utah, built at the Burroughs Corporation and described in  [, ]. It was a very innovative machine, supporting a dynamic programming model including a tree-structured heap for memory objects. It relied on the serial processing of text packets representing instructions and data.
Static data flow architectures were developed as a direct implementation in hardware of the data flow graph program model [, ]. Experimental static data flow computers were built by Texas Instruments [], the ESL division of TRW [], Hughes [], and in France [].
In Japan, NEC conducted two significant static data flow projects. The NEDIPS processor was developed in the Radio division for processing radio astronomy data []. The second project developed a data flow processing chip that could be used in cascade to implement signal processing functions and was envisioned to be the basis for an advanced photocopying machine that could scale and rotate digitized images [, ]. Many other Japanese manufacturers also built experimental data flow machines; Veen [] provides details.
The best known early project to build a machine using the tagged token principle was the Manchester University data flow computer that introduced the waiting-matching store []. Since then the principle has been developed to introduce more efficient realizations in projects in Japan and at MIT in the USA. The most advanced and successful of these are the Sigma  project at the Electrotechnical Laboratory in Japan [, ], and the Monsoon project at the MIT Laboratory for Computer Science [, ]. These projects have demonstrated the ability of data flow architecture to exploit parallelism in important codes for scientific computation.
Multithreading projects inspired by data flow concepts include the Threaded Abstract Machine [], and others described in []. A recent revival of interest in applying dataflow principles in computer architecture has led to projects at the University of Texas, Austin [] and the University of Washington, Seattle [].
The  volume edited by Gaudiot and Bic [] includes chapters written by authors personally involved in many of the data flow research efforts at its time of publication. Good treatments of multithreaded computer architectures and their relation to data flow research may be found in []. The papers collected in [] describe early work on advanced programming models in the context of sequential computers that foresees later work on dynamic data flow. The book by Sharp [] provides a summary of approaches to data flow computer architecture and includes a good bibliography of early work in the field, and the Veen survey [] explains details of the many dataflow hardware projects of the late s and s.

Bibliography
Arvind, Gostelow KP () The U-interpreter. IEEE Computer February:–
Arvind, Nikhil RS, Pingali KK () I-structures: data structures for parallel computing. ACM Trans Program Lang Syst ():–
Comte D, Hifdi N, Syre J () The data driven LAU multiprocessor system: results and perspectives. In: Proceedings of IFIP congress , Tokyo, Japan, Oct , pp –
Cornish M () The TI data flow architectures: the power of concurrency for avionics. In: Proceedings of the third conference on digital avionics systems. IEEE, New York, pp –
Culler DE, Sah A, Schauser KE, von Eicken T, Wawrzynek J () Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine. In: Proceedings of the fourth international conference on architectural support for
programming languages and operating systems. ACM, New York, pp –
Davis AL () The architecture and system method of DDM: a recursively structured data driven machine. In: Fifth international symposium on computer architecture, pp –
Davis AL () DDM. In: AFIPS conference proceedings , pp –
Dennis JB, Misunas DP () A computer architecture for highly parallel signal processing. In: Proceedings of the ACM national conference, New York, NY. ACM, New York, pp –
Dennis JB, Misunas DP () A preliminary architecture for a basic data-flow processor. In: Proceedings of the second annual symposium on computer architecture. ACM, pp –. See also the retrospective in Sohi GS (ed) ()  years of the international symposia on computer architecture. ACM, pp –
NEC Electronics, Inc. (December ) μPD: image pipelined processor. Preliminary data sheet. NEC Electronics, Inc, Mountain View, CA
Gaudiot J-L, Bic L () Advanced topics in data-flow computing. Prentice Hall, New York
Gostelow KP, Arvind () A computer capable of exchanging processors for time. In: Information processing '. North Holland, New York
Gurd JR, Kirkham CC, Watson I (January ) The Manchester dataflow prototype computer. Commun ACM ():–
Hiraki K, Shimada T, Nishida K () Hardware design of the SIGMA-: a data-flow computer for scientific applications. In: International conference on parallel processing, Bellaire, MI. IEEE, pp –
Hiraki K, Sekiguchi S, Shimada T () Status report of SIGMA-: a data-flow supercomputer. In: Advanced topics in data-flow computing. Prentice Hall, New York, pp –
Hoganauer EB, Newbold RF, Inn YJ () DSSP: a data flow computer for signal processing. In: International conference on parallel processing, Bellaire, MI. IEEE, pp –
Iannucci RA (ed), Gao GR, Halstead RH, Smith B () Multithreaded computer architecture: a summary of the state of the art. International series in engineering and computer science, Springer
Kurokawa H, Matsumoto K, Temma T, Iwashita M, Nukiyama T () The architecture and performance of image pipeline processor. In: VLSI ': VLSI design of digital systems: proceedings of the IFIP TC /WG . international conference on very large scale integration, Trondheim, Norway. IFIP/Elsevier, pp –
Mercaldi M, Swanson S, Petersen A, Putnam A, Schwerin A, Oskin M, Eggers S () Instruction scheduling for a tiled dataflow architecture. In: Architectural support for programming languages and operating systems. ACM, New York, pp –
Papadopoulos GM, Culler DE () Monsoon: an explicit token-store architecture. In: Seventeenth international symposium on computer architecture, IEEE Computer Society/ACM, New York, pp –. See also the retrospective in Sohi GS (ed) ()  years of the international symposia on computer architecture. ACM, pp –
Papadopoulos GM, Traub KR () Multithreading: a revisionist view of dataflow architectures. In: Proceedings of the th international symposium on computer architecture, ACM, New York, pp –
Sankaralingam K, Nagarajan R, Gratz P, Desikan R, Gulati D, Hanson H, Kim C, Liu H, Ranganathan N, Sethumadhavan S, Sharif S, Shivakumar P, Yoder W, McDonald R, Keckler SW, Burger DC () Distributed microarchitectural protocols in the TRIPS prototype processor. In: Thirty-ninth international symposium on microarchitecture, IEEE, Washington, DC, pp –
Sharp JA () Data flow computing. Wiley, New York
Temma T, Mizoguchi M, Hanaki S () Template-controlled image processor TIP-: performance evaluation. In: Proceedings of CVPR – Computer Vision and Pattern Recognition. IEEE
Tou J, Wegner P (eds) (February ) Data structures in programming languages. ACM SIGPLAN Notices :–. See especially the papers by Wegner, Johnston, Berry and Organick
Vedder R, Campbell M, Tucker G () The Hughes data flow multiprocessor. In: Proceedings of the fifth international conference on distributed computing, Denver. IEEE, pp –
Veen AH () Dataflow machine architecture. ACM Comput Surv ():–

Data Flow Graphs

Jack B. Dennis
Massachusetts Institute of Technology, Cambridge, MA, USA

Synonyms

Definition
A data flow graph is a graph model for computer programs that expresses possibilities for concurrent execution of program parts. In a data flow graph, nodes, called actors, represent operations (functions) and predicates to be applied to data objects, and arcs represent channels for data objects to move from a producing actor to a consuming actor. In this way, control and data aspects of a program are represented in one integrated model. When data objects are available at input ports of an actor and certain conditions are satisfied, the actor is said to be enabled. Because the set of enabled actors may be chosen to fire simultaneously, or sequentially in any order, data flow models expose much of the
tially in any order, data flow models expose much of the
parallelism available in the computation represented, and this parallelism may be exploited in an implementation of the model.

Discussion

Introduction
Although many versions of data flow graphs (DFGs) have been studied in the literature, they share some significant common features.
1. A DFG is a directed graph representation in which an arc is a path over which data are passed from a producing node to a consuming node.
2. Dynamically, a node of a DFG acts (by firing, as in a Petri net) by accepting one or more data items from its inputs, performing some computation, and delivering resulting data items to its outputs.
3. Action by a node is triggered by the presence of input data.
The study of data flow graphs has focused mainly on three well-defined formal models: the static data flow model, dynamic data flow, and synchronous data flow. The various models differ in such aspects as how many data items are permitted to occupy an arc, whether actors are permitted to have internal state, and other details.
Basic forms of static DFGs and their properties are described below, including determinacy; this is followed by a discussion of dynamic data flow and the synchronous model. Uses of DFGs are summarized.

Basic Simple DFGs
Figure  shows a basic simple static data flow graph, which is an acyclic directed graph (DAG) in which the nodes are drawn as circles and represent actors that perform functional operations on a set of input values (possibly empty) to produce a set of one or more output values. The arcs of the graph are called links, and convey values from an output port of one actor to an input port of another, or from an input port of the DFG to an actor input, or from an actor output to an output port of the DFG.
To discuss the semantics of a DFG, it is useful to associate tokens with arcs of the graph. Each token carries a value, which may be a scalar value such as an integer, or, in some studies, a data structure. For most work with static data flow graphs, links are permitted to hold just one token. For simple static DFGs, each arc holds either a single token or is empty. The firing rule for advancing the state of a DFG is:
Firing rule: An actor of a DFG is enabled if a token is present on each input link of the actor, and no token is present on any output link. An enabled actor may be fired by removing one token from each input link and placing a token on each output link. The values held by the output tokens are the result of applying the operator of the actor to the set of values held by the input tokens.
Asynchronous firing of actors is permitted, and the time taken for any actor to fire may be arbitrary, so several actors in a DFG may be firing simultaneously. In Fig. , a state of execution is shown wherein actors  and  are both enabled and may be chosen to fire in either order, or to fire simultaneously. After actor  has fired, actor  will be enabled because its output link will then be empty.
The input ports of a DFG may be regarded as special actors that place tokens on the input links of the graph when values are supplied to the ports from the external environment of the DFG. Similarly, output ports make the values held by tokens on output links available to the environment and remove the tokens when the values are taken.

Data Flow Graphs. Fig.  Simple static data flow graphs
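The firing rule above can be captured directly in code. The sketch below is a simplified illustration rather than a full interpreter: it represents a simple static DFG as actors with one-place input and output links and repeatedly fires any enabled actor; the two-input arithmetic actors are an assumption made only to keep the example small.

#include <stdio.h>

/* A one-place link: holds at most one token (full = 1) and its value. */
typedef struct { int full; double value; } Link;

/* A two-input, one-output actor applying a functional operator. */
typedef struct {
    Link *in0, *in1, *out;
    double (*op)(double, double);
} Actor;

static double add(double a, double b) { return a + b; }
static double mul(double a, double b) { return a * b; }

/* Firing rule: enabled iff every input link holds a token and the output
   link is empty.  Firing consumes the inputs and produces the output. */
static int try_fire(Actor *a)
{
    if (!a->in0->full || !a->in1->full || a->out->full)
        return 0;                                   /* not enabled */
    a->out->value = a->op(a->in0->value, a->in1->value);
    a->out->full = 1;
    a->in0->full = a->in1->full = 0;
    return 1;
}

int main(void)
{
    /* Graph for (x + y) * z with input links x, y, z and output link r. */
    Link x = {1, 2.0}, y = {1, 3.0}, z = {1, 4.0}, t = {0, 0.0}, r = {0, 0.0};
    Actor actors[2] = { { &x, &y, &t, add }, { &t, &z, &r, mul } };

    int fired;
    do {                        /* fire enabled actors until none remain */
        fired = 0;
        for (int i = 0; i < 2; i++)
            fired |= try_fire(&actors[i]);
    } while (fired);
    printf("result: %g\n", r.value);                /* prints 20 */
    return 0;
}

Because every firing only consumes its own inputs and writes its own output link, the final result does not depend on the order in which enabled actors are chosen, which is the determinacy property discussed below.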
A DFG is activated by supplying data values at each input port, and terminates when values are available at each output port. A DFG is said to be well-behaved if its activity terminates following each presentation of values.
The DFG in Fig. a represents the standard formula for computing one of the roots of the second-degree polynomial ax² + bx + c:

x = (−b + √(b² − 4ac)) / 2a

This DFG is well-behaved, as are all simple static DFGs if their actors have operators that are total functions. Also, simple DFGs can represent pipelined computation, that is, more than one set of input values may be accepted by the DFG before its activity has terminated for the earlier set of input values. The pipelined operation of DFGs is closely related to the use of coroutines as a program structure for parallel computing. Some workers with DFGs view each arc as a FIFO queue that can hold either a bounded or unbounded number of value-carrying tokens. This may be shown on the graph by permitting multiple tokens on any arc, with the stipulation that order is preserved: tokens are removed from the arc by a consuming actor in the same order as they were placed by the producing actor.
A simple data flow graph is determinate. Determinacy is an important concept in parallel computing. It captures the idea that a system can have repeatable behavior while at the same time being nondeterministic in its operation. The property of being well-behaved guarantees that the value produced at each output port is a fixed composition of the functions defined by the actors of the DFG. Basic simple DFGs are determinate whether or not they are well-behaved.

Compound Simple DFGs
A complete system for describing computations requires means for describing conditionals and iteration. In simple DFGs, these features are provided using DFG structures motivated by the principles of structured programming advocated by Dijkstra and Hoare that have had a strong influence on modern programming language design.
For ease of exposition, the manner of expressing conditional and iteration expressions as DFGs used here is adapted from their use in several compilers for functional programming languages. Figure  illustrates a typical way of representing an if...then...else construction as a data flow graph. The diamond-shaped decider actor accepts a token carrying a boolean value from one of the input ports and generates event signals (tokens) that activate gating structures, shown as rectangular boxes, that select input values for passing to the two component DFGs: the T-Arm activated for a true decision and the F-Arm for false. Any combination of input ports may be selected as inputs to the arm graphs. However, the output ports of the arm graphs must be in one-to-one correspondence with output ports of the conditional graph so that output values are defined for every possible execution. A conditional DFG is determinate, and will be well-behaved if the two arm graphs are well-behaved.
Figure  illustrates one way of representing an iterative loop as a data flow graph, in this case a computation of the form while...do... The body of the iteration is represented by a DFG that takes input values from a set of loop variables and produces output values that provide updated definitions for the loop variables. Here, the gating structures serve to provide initial values for the loop variables and then to redefine them from the outputs of the loop body for each cycle. The decider uses the value of a boolean loop variable to control the gating.

Data Flow Graphs. Fig.  A conditional DFG
Data Flow Graphs. Fig.  Iteration as a data flow graph

Note that the initial state of this graph must have a token carrying an event signal for gating initial values to the loop variables. If the body DFG is well-behaved, then so will the compound DFG, provided the iteration terminates. Note that the compound DFG is determinate even if it might not terminate for some inputs.

The Apply Actor
In comparison with programming languages, the DFG constructions introduced so far are limited. There is no means for expressing function invocation, and no means for building and analyzing data structures. For function invocation, DFGs use the apply actor which has the semantics shown in Fig. . The Apply actor identifies a separate DFG that defines a computation to be performed whenever the Apply actor is executed. It is executed by copying values from the input ports of the apply actor to the input ports of the specified DFG. When the function body terminates, the values at its output ports are transferred to the output links of the Apply actor.

Data Flow Graphs. Fig.  The apply actor

There are two ways of elaborating the semantics of the Apply actor: by substitution and by using tagged (colored) tokens. For substitution, the meaning of a DFG containing apply actors is the simple DFG obtained by copying the specified DFG in place of each apply actor. Of course, this reaches limitations for recursive functions where a potentially unbounded cascade of copying is necessary.
The second approach is to avoid the copying of graphs by permitting multiple tokens with distinct tags to be present on each DFG link. The tags are chosen so that all tokens that carry values associated with a specific activation of a function graph have identical tags. The use of tagged tokens also permits iteration to be represented in a way that permits actors from more than one cycle of an iteration to be active concurrently. The tagged-token model of DFG semantics has been the inspiration for at least two projects for building computers using data flow principles, and constitutes the essence of dynamic data flow.
An important and useful property of data flow graphs is that they not only express the dependence relations among operations of a computation, but also provide a recipe for execution – how to perform the computation – specifically the firing rule given above. The firing rule specifies a data-driven order of performance, that is, each operator acts just when all data it needs have arrived. In this scheme, the firing of an Apply actor, causing execution of the body of the applied function, is delayed until all inputs (function arguments) are present (evaluated). A consequence is that if the applied function does not make use of all of its arguments, unnecessary computation will be performed, and execution of such a program might not terminate if evaluation of an unneeded argument executes an endless loop.
This issue can be avoided by using demand-driven evaluation wherein a data flow actor is executed only when its result is needed. It is known how to convert a data flow graph into a new graph for which data-driven execution performs the actors of the original graph exactly as for demand-driven execution.

Data Structures
The simplest way to include data structures in a data flow model is to use read and write actors that access a memory that is not part of the DFG. The read operation accesses a specified location in the memory, and the write operation updates a memory location with a new value. The drawback of this approach is that the DFG is no longer necessarily determinate because read and write operations for the same memory location may be concurrent in the DFG.
A better way to introduce data structures is by permitting data objects, representing array or record values, to be carried by tokens in a DFG. Three new actor types are added: create, append, and select. The create actor, when fired, produces a null data structure having no components – representing an empty record or an array with no defined elements. The append actor performs the operation

object ← append (object, index, element)

where the object produced has a component named index with element as its value. (The element could be a structured value such as a row of a matrix or a scalar value.) The object operated on by an append actor may be a null object from a create actor, or a structure from another append actor with one or more defined components. The select actor

element ← select (object, index)

yields the component identified by index in the data structure represented by object. DFGs constructed using the select and append actors have functional (determinate) semantics as in pure Lisp. Other work on data flow models has treated arrays of values as sequences of tokens passed between actors.
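A minimal sketch of create, append, and select with functional (copy-on-append) semantics is given below; the fixed-size record representation and the linear copying are simplifying assumptions used only to make the determinate, side-effect-free behavior explicit.

#include <stdlib.h>

#define MAX_FIELDS 16   /* no overflow check, for brevity */

/* A functional record/array object: append never modifies its argument,
   it returns a new object, so every token carries an immutable value. */
typedef struct {
    int    count;
    int    index[MAX_FIELDS];
    double element[MAX_FIELDS];
} Object;

/* create: a null structure with no defined components. */
Object *create(void)
{
    return calloc(1, sizeof(Object));
}

/* append: yields a copy of obj extended (or overridden) at the given index. */
Object *append(const Object *obj, int index, double element)
{
    Object *o = malloc(sizeof(Object));
    *o = *obj;                          /* copy, leaving the input unchanged */
    for (int i = 0; i < o->count; i++)
        if (o->index[i] == index) { o->element[i] = element; return o; }
    o->index[o->count] = index;
    o->element[o->count] = element;
    o->count++;
    return o;
}

/* select: the component identified by index (0.0 if undefined, for brevity). */
double select_component(const Object *obj, int index)
{
    for (int i = 0; i < obj->count; i++)
        if (obj->index[i] == index) return obj->element[i];
    return 0.0;
}

Because append returns a fresh object, two actors may consume the same structure token concurrently without interfering, which is what preserves determinacy in this style of data structure handling.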
Control Flow and Data Flow
A subject of importance for compiler design is the choice of an intermediate representation (IR) on which program transformations may be performed. Data flow graphs are an attractive IR because they both represent dependence information and provide a complete executable semantics of the computation. Therefore, being able to use this form for programs defined by traditional imperative languages would be highly beneficial. It is straightforward to represent imperative programs by control flow graphs, and means for converting control flow representations to data flow graphs are known. Hence, it is feasible to employ data flow graphs as an IR for optimizing transformations in compilers for imperative languages.

Synchronous Data Flow
In the synchronous data flow (SDF) model, arcs of a DFG are permitted to hold multiple tokens, and a generalized form of firing rule is used. Moreover, the firing of actors happens at definite points in time, as if at periodic cycles defined by a system clock. For each actor, the designer of the model may specify a number of tokens required to be present at each input port, how many are removed when the actor fires, and how many tokens are placed on output arcs. These numbers do not change during operation of the model. The links in an SDF model are treated as FIFO buffers, and their sizes, along with the numbers of tokens placed or removed by the firing of nodes, must be chosen carefully to avoid deadlock. Conditional constructions and while loops are not permitted in SDFs as these would negate the property that an SDF graph describes a system with a constant rate of operation. The effect of a conditional may be achieved by incorporating the conditional within a node, such that the duration of its firing is invariant. Feedback paths may be present in an SDF graph and are important to represent continuous processes that have internal state using only state-free (functional) operators.
The synchronous data flow (SDF) model is intended as a representation of digital signal processing (DSP) computations that continue in execution for an unbounded time interval, consuming a stream of input data items and producing an output stream. A feature of the SDF model is that DSP systems with components operating at different clock rates can be accurately modeled.
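As a small illustration of fixed token rates, consider a hypothetical two-actor SDF graph in which actor A produces 2 tokens per firing on an arc from which actor B consumes 3 per firing. Balancing the rates (2·nA = 3·nB) gives repetition counts nA = 3 and nB = 2, so the periodic schedule A A B A B returns the buffer to its initial state. The sketch below simply checks such a schedule against the declared rates; both the rates and the schedule are invented for the example.

#include <stdio.h>

/* One SDF arc with fixed production/consumption rates (assumed example:
   A produces 2 tokens per firing, B consumes 3 per firing). */
#define PRODUCE 2
#define CONSUME 3

int main(void)
{
    const char schedule[] = "AABAB";   /* one period of a static schedule */
    int buffer = 0;                    /* tokens currently on the arc */

    for (int i = 0; schedule[i] != '\0'; i++) {
        if (schedule[i] == 'A') {
            buffer += PRODUCE;                     /* A fires */
        } else {
            if (buffer < CONSUME) {                /* B needs 3 tokens */
                printf("deadlock: B cannot fire at step %d\n", i);
                return 1;
            }
            buffer -= CONSUME;                     /* B fires */
        }
        printf("step %d: %c fires, buffer = %d\n", i, schedule[i], buffer);
    }
    /* A valid periodic schedule leaves the buffer where it started. */
    printf("tokens left after one period: %d\n", buffer);
    return 0;
}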
Applications
The early work on data flow models inspired several attempts in universities and industry to build practical
computers embodying data flow concepts. See the article on Data Flow Computer Architecture in this encyclopedia. The programming languages Id, Val, Sisal, and pH are functional programming languages specifically designed to permit natural expression of the parallelism exposed in data flow models. Applications in signal processing include the system modeling tool LabVIEW and work on synthesizing digital signal processing systems, as in the Ptolemy project at UC Berkeley. An important use of data flow graphs is as an intermediate form for programs processed by a compiler written in either a functional or an imperative programming language.

Related Entries
Actors
CSP (Communicating Sequential Processes)
Data Flow Computer Architecture
Dependences
Functional Languages
Message Passing Interface (MPI)
Petri Nets

Bibliographic Notes and Further Reading
Precursors to data flow models may be found in simulation models for industrial systems, engineering descriptions of the processing of information packets, especially for communications systems, as reviewed in the survey by Davis and Keller []. However, the first works using the data-driven concept as a precise representation for computer programs and systems were published in the s, independently by Karp and Miller at the IBM T. J. Watson Research Center [], by Rodriguez at MIT [] and by Duane Adams at Stanford University []. These early models were "static" in that they did not embody a natural way of describing arbitrary nests of possibly recursive function definitions. The  paper of Dennis [] includes an introduction to static DFGs.
The Graph/Heap model for dynamic data flow using colored tokens was described by Dennis in  []. The work of Arvind, Gostelow and Plouffe in  [] formulated the dynamic data flow model in the form of the "unraveling interpreter" using tagged tokens. The semantics of dynamic data flow computations by graph copying was described by Dennis in . The synchronous data flow model was introduced and its properties studied in the  paper of Edward Lee []. In  the IEEE journal Computer published a special issue devoted to concepts and research on data flow models []. It includes an introduction by editors Tilak Agerwala and Arvind, a paper on data flow languages by William Ackerman, the survey of data flow program graphs by Alan Davis and Robert Keller, a description of the unraveling interpreter by Arvind and Kim P. Gostelow, and a description of the Manchester tagged token data flow computer by Ian Watson and John Gurd. A "Second Opinion on Data Flow" by Gajski, Padua, Kuck, and Kuhn completes the issue. The book by Sharp [] provides a later review of the major forms of data flow program models and the types of computer program execution models inspired by them.
Several programming languages have been developed or inspired by data flow concepts, including Val [], Sisal [], Lucid [], and pH []. The use of data flow graphs as a program representation in Sisal compilers has been described by Dennis [] and by Skedzielewski and Glauert [], although there have been other unreported uses. The expression of demand-driven computations as demand-driven DFGs was described by Pingali and Arvind [], and the use of DFGs in representing computations written in imperative languages is discussed by Beck, Johnson and Pingali [].
For a discussion of the influence of data flow concepts on computer architecture and programming languages, see the encyclopedia entries on Data Flow Architecture and Parallel Functional Languages.

Bibliography
Adams DA () A model for parallel computations. In: Hobbs LC (ed) Parallel processor systems, technologies and applications. Spartan Books, New York, pp –
Agerwala T, Arvind A (eds) () IEEE Computer , . Special issue devoted to data flow
Arvind A, Gostelow KP, Plouffe W () An asynchronous programming language and computing machine. Technical report a, Department of Information and Computer Science, University of California, Irvine
Ashcroft EA, Wadge WW () Lucid, a nonprocedural language with iteration. Commun ACM ():–
Beck M, Johnson R, Pingali K () From control flow to dataflow. ACM Trans Progr Lang Syst :–
Davis AL, Keller RM () Dataflow program graphs. IEEE Comput ():–
. Dennis JB () First version of a dataflow procedure language. repositories or data streams. Parallel data mining refers
In: Lecture notes in computer science LCNS : programming to the use of parallel computing to reduce time to exe-
symposium, Springer, Berlin, pp – cute a data-mining algorithm.
. Dennis JB () Data flow supercomputers. IEEE Comput
():–
. Dennis JB () The paradigm compiler: mapping a functional Discussion
language for the Connection Machine. In: Simon H (ed) Scien-
tific applications of the connection machine, World Scientific, Introduction
Singapore, pp – Advances in data collection and storage technologies
. Karp RM, Miller RE () Properties of a model for parallel com- have allowed organizations to gather, collect, and dis-
putations: determinacy, termination, queuing. SIAM J Appl Math
tribute increasing amounts of data. Spurred by these
():–
. Lee EA, Messerschmitt DG () Synchronous data flow. Proc
advances, the field of data mining has emerged, merg-
IEEE ():– ing ideas from statistics, machine learning, databases,
. McGraw JR () The VAL language: description and analysis. and high performance computing. The main challenge
ACM T Progr Lang Syst ():– in data mining is to extract knowledge and insight from
. McGraw JR, Skedzielewski SS, Allan R, Oldehoeft R, Glauert J, massive data sets in a fast and efficient manner. This
Kirkham C, Noyce B, Thomas R () SISAL: streams and iter-
process is iterative in nature and involves a human in
ation in a single assignment language: reference manual version
.. Technical report M-, Rev. . Lawrence Livermore National the loop. Therefore, to facilitate effective data under-
Laboratory, Livermore, CA standing and knowledge discovery, it is imperative to
. Nikhil R, Arvind A () Implicit parallel programming in Ph. minimize execution time of a data-mining query. Par-
Morgan Kaufman, San Francisco, CA allel data mining refers to the use of parallel computing
. Pingali K, Arvind A (, ) Efficient demand-driven eval-
to reduce the time required to execute a data-mining
uation. ACM T Progr Lang Syst ():– (Part ), and ():
– (Part )
query.
. Rodriguez JE () A graph model for parallel compu- Data mining commonly involves four classes of
tations. MIT technical report ESL-R-, and MAC-TR-, tasks: association rule mining, classification, clustering,
Cambridge, MA and regression. These classes will be described next
. Sharp JA () Data flow computing. Ellis Horwood, Chichester together with a brief summary of parallel algorithms for
. Skedzielewski S, Glauert J () IF – an intermediate form for
each task.
applicative languages. Technical report M-, Lawrence Liver-
more National Laboratory, Livermore, CA
Association Rule Mining
Association rule mining is a popular method for dis-
covering interesting relations between variables in large
databases. Agrawal et al. introduced association rules
Data Mining [] for discovering patterns involving products in trans-
Amol Ghoting
actional data. An example of transactional data is
IBM Thomas. J. Watson Research Center, Yorktown point-of-sale (POS) data that is produced in super-
Heights, NY, USA markets, which lists the items purchased by each cus-
tomer during a visit. An association rule found in such
data could indicate that if a customer purchased items
Synonyms A and B together, the customer is likely to also pur-
Data analytics; Knowledge discovery chase item C. An example of an association rule is
(milk,sugar) → cereal (%), which means that if a cus-
Definition tomer buys milk and sugar together, he or she is likely to
Data mining, also popularly referred to as knowl- also buy cereal % of the time. Note that the antecedent
edge discovery from data, is the automated or con- of the rule can contain more than two items. Such rules
venient extraction of patterns representing knowledge are often used to enable various marketing activities
implicitly stored or catchable in large data sets, data such as promotional pricing and product placements.
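To make the quoted percentage concrete with invented counts (not figures from this entry): suppose 1,000 transactions contain both milk and sugar, and 800 of those also contain cereal. Then

support({milk, sugar}) = 1,000,  support({milk, sugar, cereal}) = 800
confidence((milk, sugar) → cereal) = 800 / 1,000 = 80%

so the rule would be reported with 80% confidence; the minimum support and confidence thresholds applied are parameters of the mining run.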
In addition to the above example from market basket
analysis, association rules are also employed in many other application areas, including web usage mining, intrusion detection, and bioinformatics.
Finding association rules requires determining all frequent patterns in a transactional data set. A post-processing step can then find all association rules using the set of frequent patterns. The frequent pattern mining problem was first formulated by Agrawal et al. []. Briefly, the problem description is as follows: Let I = {i1, i2, …, in} be a set of n items, and let D = {T1, T2, …, Tm} be a set of m transactions, where each transaction Ti is a subset of I. An itemset i ⊆ I of size k is known as a k-itemset. The support of i is ∑_{j=1}^{m} (1 : i ⊆ Tj), or informally speaking, the number of transactions in D that have i as a subset. The frequent pattern mining problem is to find all i ∈ D that have support greater than a minimum support value, minsupp. The frequent pattern mining step is the most time-consuming step in association rule mining, and much work has been done to make it efficient.
Agrawal et al. [] presented Apriori, the first efficient algorithm to solve the frequent pattern mining problem. The Apriori algorithm traverses the itemset search space in a breadth-first manner. At level k, candidate itemsets of size k are generated by using frequent itemsets of size k − 1. These candidates are then validated against the database (usually by a full scan) to obtain frequent itemsets of size k. Apriori-based algorithms speed up the computation by using the anti-monotone property and by keeping some auxiliary state information. The anti-monotone property states that if a size k-itemset is not frequent, then any size (k + 1)-itemset containing it cannot be frequent.
Agrawal and Shafer [] were the first to address the challenge of frequent pattern mining on shared-nothing architectures, proposing Count Distribution (CD) and Data Distribution (DD). Both CD and DD are parallelizations of the Apriori algorithm.
CD reduces execution time by parallelizing the data set scans. The data set is partitioned across all nodes, and each node scans its local data set to obtain frequencies of candidates. The frequency information is then accumulated to determine the global frequencies of the candidate itemsets. Once the global frequencies are obtained, infrequent itemsets are pruned, and each node generates candidates for the next level. Candidate generation is sequential, and not parallel, as each node generates the same set of candidate itemsets. This process repeats until a level is reached where there are no more candidate itemsets. CD thus reduces the communication between processors at the cost of performing redundant computations (in the form of candidate generation) in parallel. However, it scales poorly as the number of candidates increases.
DD distributes the data set to each node. At the start of each level, each node takes turns broadcasting its local data set to every other node, and each node then generates a mutually disjoint partition of candidates. Furthermore, frequent itemsets are communicated so that each processor can asynchronously generate its share of candidates for the next level. DD overcomes the bottleneck of serial candidate generation by partitioning generation amongst the processors. However, it incurs very high communication overhead when broadcasting the entire data set, which is especially costly in the case of large data sets. In their study, the authors find the costs of communication outweigh the improvements to parallel candidate generation. In both CD and DD, processors need to be synchronized at the end of each step, which can potentially cause processors to idle if the data is skewed.
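The communication step at the heart of Count Distribution can be sketched as follows: every node counts candidate supports against its own partition of the transactions, and a single global reduction produces the global counts. The bitmask itemset encoding and the invented data below are assumptions made to keep the sketch self-contained; only the local-count-then-allreduce pattern is the point.

#include <mpi.h>
#include <stdio.h>

#define NCAND   3
#define NTRANS  4

int main(int argc, char **argv)
{
    /* Itemsets encoded as bitmasks over items 0..31; these local transactions
       stand in for this node's partition of the database (invented data). */
    unsigned int trans[NTRANS] = { 0x3, 0x7, 0x5, 0x6 };
    unsigned int cand[NCAND]   = { 0x3, 0x5, 0x6 };   /* candidate 2-itemsets */
    long local[NCAND] = {0}, global[NCAND];
    int rank;

    MPI_Init(&argc, &argv);

    /* Local counting: a transaction supports a candidate if it contains it. */
    for (int t = 0; t < NTRANS; t++)
        for (int c = 0; c < NCAND; c++)
            if ((trans[t] & cand[c]) == cand[c])
                local[c]++;

    /* Count Distribution: one global sum yields the global support counts;
       every node then prunes and regenerates candidates identically. */
    MPI_Allreduce(local, global, NCAND, MPI_LONG, MPI_SUM, MPI_COMM_WORLD);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        for (int c = 0; c < NCAND; c++)
            printf("candidate %d: global support %ld\n", c, global[c]);

    MPI_Finalize();
    return 0;
}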
Agrawal and Shafer [] were the first to address the weakness in CD is that the global candidate list may
challenge of frequent pattern mining on shared-nothing exceed the size of available main memory for a single
architectures, proposing Count Distribution (CD) and node. To address this problem while minimizing com-
Data Distribution (DD). Both CD and DD are paral- munication and computation overheads, HD combines
lelizations of the Apriori algorithm. the advantages of both the CD and IDD algorithms by
CD reduces execution time by parallelizing the data dynamically grouping processors and partitioning the
set scans. The data set is partitioned across all nodes, candidate set accordingly to maintain good load bal-
and each node scans its local data set to obtain fre- ance. Each group is large enough to hold the candidate
quencies of candidates. The frequency information is list in main memory and executes the IDD algorithm.
then accumulated to determine the global frequencies Communication across groups is performed as in CD.
of the candidate itemsets. Once the global frequencies Parthasarathy et al. considered the parallelization
are obtained, infrequent itemsets are pruned, and each of the Apriori algorithm on shared memory systems
node generates candidates for the next level. Candidate []. They showed that the hash-based data structures
generation is sequential, and not parallel, as each node used by Apriori during the candidate generation and
counting phases result in poor data locality and false sharing on shared-memory systems. To address these issues, the authors proposed several memory placement techniques to improve performance.
Unlike the above approaches that traverse the search space in a breadth-first manner, Eclat [] is an algorithm that traverses the search space in a depth-first manner. The algorithm uses itemset clustering techniques to approximate the set of potentially maximal frequent itemsets. A maximal frequent itemset is a frequent itemset that cannot have a frequent superset itemset. Once these sets have been identified, the algorithm makes use of efficient depth-first traversal techniques to generate frequent itemsets contained in each cluster. The advantages of this approach compared with the Apriori-style approaches are improved locality and a natural structure for parallelization – each cluster can be processed independently. The structure of this algorithm naturally lends itself to both shared-memory and distributed-memory parallelizations.
Han et al. proposed the FPGrowth [] algorithm that improves performance through the use of a smart data structure known as the FP-tree and by employing a projected data set during the depth-first traversal of the search space. However, while delivering excellent sequential performance, a limitation to parallelizing this algorithm is its reliance on a dynamic, pointer-based data structure and the fact that the data structure can potentially be out-of-core for very large data sets.
Cong et al. [] proposed a sampling-based framework for parallel itemset mining using FPGrowth on shared-nothing systems. They first selectively sample the transactions by discarding a fraction of the most frequent items from each transaction. Based on the mining time on the reduced data set, the computation is divided into tasks needing roughly the same amount of execution time. In practice, they found that sample mining times were quite representative of actual mining times. Using these timing results, tasks could be assigned to machines statically, while affording excellent load balancing. A drawback of their approach, particularly when mining very large dense data sets, is that it assumes that the FP-tree fits in core, and that the data is present on a single central node, which might not be true in many real world scenarios. To address the limitation of the approach by Cong et al., Buehrer et al. presented a parallel algorithm [] based on FPGrowth – the salient features of the algorithm include a serialization and merging strategy for computing global trees from local trees, efficient tree pruning for better use of available memory space, and mechanisms to efficiently handle out-of-core data sets and data structures.
Ghoting et al. [, ] presented a cache-conscious approach for the effective parallelization of FPGrowth on shared-memory systems. They showed that poor cache performance limits scalability on shared-memory systems and demonstrated that through a careful architecture-conscious redesign these bottlenecks can be removed, affording significant improvements. The authors developed smart data placement and computation restructuring techniques to improve both spatial as well as temporal locality during execution.

Classification
Data classification is a supervised learning procedure that, given a training data set consisting of data points and their corresponding labels, learns an unknown function that is capable of mapping the input data elements to their corresponding labels based on some characteristic inherent in the input data set. For instance, consider the problem of filtering email spam. Here, one has some representation of an email and a label indicating whether the email is "Spam" or "Non-Spam." The goal of data classification would be to learn a function that can label an email as "Spam" or "Non-Spam." This function can then be applied to new data items, in this case, new email messages, to classify them as "Spam" or "Non-Spam." A variety of data classification algorithms are used in data mining. Examples of classifiers are neural networks (multilayer perceptrons), support vector machines, k-nearest neighbor classifiers, naive Bayes, and decision trees. These algorithms differ in the nature of functions they use to approximate the target function. Describing all these algorithms and their parallelizations is beyond the scope of this entry. For many of these algorithms, computation is structured and often specified using linear algebra, allowing for a parallel implementation using popular libraries such as LAPACK []. Due to the unstructured nature of its computations, decision trees are described – a family of extremely popular and scalable classification algorithms.
Decision trees approximate the target function as a tree. Each interior node in this tree corresponds to one
of the input variables; there are edges to children for each of the possible values of that input variable. Each leaf represents a value of the target variable given the values of the input variables represented by the path from the root to that leaf. Decision trees are popular as they can be used to generate a set of classification rules (each path from the root to the leaf represents a classification rule) that are easy to interpret. An example of such a classification rule is: if temperature > , humidity < , and outlook = sunshine, then weather is good for playing tennis. A decision tree is typically learnt using a recursive procedure. During each step in this recursion, one determines an (attribute, value) pair that can be best used to split a data set. Here, "best" is defined by how well the (attribute, value) pair splits the data set into subsets that have similar values on the target label. Finding the best split involves sorting each attribute on its value so as to estimate the distribution of the target label for the potential splits. Starting with the input data set, the process is repeated on each derived subset until it is determined that splitting no longer provides any improvement to the predictions.
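The split search just described can be illustrated for a single numeric attribute: sort the (value, label) pairs, then evaluate every cut point by the weighted Gini impurity of the two resulting subsets. The two-class labels and the tiny in-memory arrays are simplifications for the sketch, not part of any particular decision tree system.

#include <stdio.h>
#include <stdlib.h>

typedef struct { double value; int label; } Rec;   /* label is 0 or 1 */

static int cmp(const void *a, const void *b)
{
    double d = ((const Rec *)a)->value - ((const Rec *)b)->value;
    return (d > 0) - (d < 0);
}

/* Gini impurity of a subset of n records, pos of which have label 1. */
static double gini(int pos, int n)
{
    if (n == 0) return 0.0;
    double p = (double)pos / n;
    return 2.0 * p * (1.0 - p);
}

int main(void)
{
    Rec r[] = { {75, 1}, {68, 1}, {85, 0}, {80, 0}, {70, 1}, {90, 0} };
    int n = sizeof(r) / sizeof(r[0]), total_pos = 0;

    qsort(r, n, sizeof(Rec), cmp);                 /* presort the attribute */
    for (int i = 0; i < n; i++) total_pos += r[i].label;

    double best = 1e9, best_cut = 0.0;
    int left_pos = 0;
    for (int i = 1; i < n; i++) {                  /* cut between i-1 and i */
        left_pos += r[i - 1].label;
        double score = ((double)i / n) * gini(left_pos, i)
                     + ((double)(n - i) / n) * gini(total_pos - left_pos, n - i);
        if (score < best) {
            best = score;
            best_cut = (r[i - 1].value + r[i].value) / 2.0;
        }
    }
    printf("best split: attribute < %.1f (weighted Gini %.3f)\n", best_cut, best);
    return 0;
}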
SLIQ [] was one of the first scalable algorithms for the tree, while task parallelism is more efficient when
decision tree induction. The approach first sorts each processing nodes near the leaf of the tree. This is the
attribute in the data set separately. Each attribute is basis of the pClouds approach [] for decision tree con-
then maintained as a list of sorted values together with struction on shared-nothing systems. Similar strategies
the corresponding record identifier. Presorting means have also been utilized for decision tree construction on
that attributes do not need to be sorted through each shared-memory systems [].
step in the recursion. The tree is built level-by-level (or
bread-first), each level requiring a scan of the input data Clustering
set. Unfortunately, SLIQ uses a memory-resident data Data clustering, or simply clustering, is the assignment
structure that stores the class labels of each record. This of a set of observations into subsets (called clusters)
data structure limits the size of the data sets SLIQ can so that observations in the same cluster are similar
handle. in some sense. It is a form of unsupervised learning.
Shafer et al. [] presented a more memory-efficient Data clustering is vastly popular and has several real-
version of SLIQ called SPRINT and also gave par- world applications. One example is the assignment of
allel versions of SPRINT for shared-nothing systems. customers into segments or groups based on some cus-
This algorithm maintains the class label along with the tomer traits (e.g., spending habits) to better understand
record identifier for each value in the attribute lists. The groups as well as the relationships between groups.
data set is first partitioned across all processors in a row- Another example is the grouping of web pages into
wise fashion. Then, a global sort operation is performed genres to automatically categorize web pages. One of
at the end of which each processor has an equal and the most important steps in clustering is to define a
contiguous portion of each attribute list. When deter- distance measure between points to measure similar-
mining split points, each processor determines the best ity. The nature of the distance measure dictates the
split points for all the records that were assigned to shape of the discovered clusters as well as the quality
it. This is followed by an all-to-all broadcast operation of the clustering. Clustering algorithms can be roughly
put into three categories – partitional, hierarchical, and density-based algorithms.
Partitional algorithms typically determine all clusters at once. An example of a partitional clustering algorithm is the popular kMeans clustering algorithm []. Given a data set and the desired number of centers (k), the algorithm first generates k random cluster centers. Next, each point in the data set is assigned to its nearest cluster center, where "nearest" is defined with respect to one of the distance measures. Finally, the new cluster centers are recomputed as the means of the points that were previously assigned to the centers. These steps are repeated until some convergence criterion is met (usually that the cluster centers haven't changed). The main advantage of the kMeans algorithm is that it is easy to implement and parallelize, and thus can be used to process large data sets. Its parallelizations on shared-memory and shared-nothing systems are very similar. The task of assigning data points to one of k centers is the most time-consuming step and is easy to parallelize. Each processor is responsible for assigning a portion of the data points to one of k centers. Each processor maintains arrays for the means of all points and the number of points that were assigned to the centers. These arrays are then summed up and used to find new k centers.
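A minimal sketch of the parallel assignment-and-update step just described, in the shared-nothing style: each process labels its own block of points, accumulates per-cluster sums and counts, and a global reduction produces the new centers. The one-dimensional points, fixed k, and invented data are assumptions made only to keep the example short.

#include <mpi.h>
#include <math.h>
#include <stdio.h>

#define K       2          /* number of cluster centers (assumed) */
#define NLOCAL  5          /* points held by this process (invented data) */

int main(int argc, char **argv)
{
    double points[NLOCAL] = { 0.5, 1.2, 7.9, 8.4, 0.9 };  /* local partition */
    double centers[K]     = { 0.0, 10.0 };                /* current centers */
    double sum[K] = {0}, gsum[K];
    long   cnt[K] = {0}, gcnt[K];
    int rank;

    MPI_Init(&argc, &argv);

    /* Assignment step: each process labels only its own points and
       accumulates per-cluster sums and counts locally. */
    for (int i = 0; i < NLOCAL; i++) {
        int best = 0;
        for (int c = 1; c < K; c++)
            if (fabs(points[i] - centers[c]) < fabs(points[i] - centers[best]))
                best = c;
        sum[best] += points[i];
        cnt[best] += 1;
    }

    /* Update step: the local arrays are summed across all processes,
       after which every process computes the same new centers. */
    MPI_Allreduce(sum, gsum, K, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(cnt, gcnt, K, MPI_LONG,   MPI_SUM, MPI_COMM_WORLD);
    for (int c = 0; c < K; c++)
        if (gcnt[c] > 0)
            centers[c] = gsum[c] / gcnt[c];

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("new centers: %.2f %.2f\n", centers[0], centers[1]);

    MPI_Finalize();
    return 0;   /* in practice this is repeated until the centers converge */
}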
Hierarchical algorithms find successive clusters using previously established clusters []. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each data point as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole data set (as one cluster) and proceed to divide it into successively smaller clusters. Most hierarchical clustering algorithms employ a greedy approach to divide or merge clusters for efficiency reasons, i.e., during each iteration they either divide ("top-down") or merge ("bottom-up") clusters that are deemed to provide the best clustering. During each iteration, a distance measure between not just a pair of points but groups of points needs to be defined. Due to a wide range of such possible distance functions, a variety of hierarchical clustering variants exist today. All these algorithms present worse than linear scalability, making hierarchical clustering computationally demanding. The reader can look at the work by Olson [] for an analysis of parallel hierarchical clustering on various parallel abstraction machines. Due to the sequential nature of the computations, few practical parallel algorithms for hierarchical clustering exist today.
Density-based clustering algorithms attempt to cluster data points based on their local densities []. Such clustering algorithms are capable of discovering clusters with arbitrary shapes. Clusters are groups of points that have dense neighborhoods and are connected in some sense. The primary computation performed in density-based clustering is the nearest neighbor search, and such clustering algorithms have been parallelized on shared-nothing systems by using a distributed index structure [].

Regression
Regression analysis is used to model the relationship between a target variable and a set of predictor variables. This relationship is expressed as a function that predicts the target variable using the predictor variables as inputs. Regression analysis is widely used for prediction and forecasting. Regression is similar to classification in that both approaches attempt to predict a target variable. Unlike classification, where the target variable is categorical, regression attempts to predict a continuous variable. A variety of regression algorithms are used in data mining. Techniques for classification such as neural networks (multilayer perceptrons), support vector machines, k-nearest neighbors, and decision trees have been adapted to the problem of regression, and their parallelizations are structurally similar. Parametric approaches to regression such as linear regression and logistic regression are often specified using matrix operations, and they can be implemented in parallel using popular libraries such as BLAS [] and LINPACK [].

Bibliographic Notes and Further Reading
The earlier portion of this entry concentrated on association rule mining. Over the past decade, a variety of extensions to association rules such as sequential rules and graph patterns have emerged. The reader can look at the book by Zaki and Ho [] and papers by Buehrer et al. [] for their parallelizations.
Until this last decade, most data sets that were used in data mining could be processed on a single machine. With the advent of the world wide web and advances
in data collection technologies, institutions are increasingly seeing the need to implement parallel data-mining algorithms, and an emerging research area is that of building infrastructures that are specifically aimed at implementing parallel data mining algorithms. Technologies such as the IBM Parallel Machine Learning Toolbox [] and the rapidly growing Apache Hadoop project [] are indicators of this trend.
Another recent research direction in parallel data mining is that of executing data-mining algorithms on emerging multicore architectures such as the commodity GPU [] and the STI Cell processor []. Due to the nature of these architectures, researchers have targeted data mining algorithms that have structured computation and data access patterns.

Bibliography
Apache Hadoop. http://apache.hadoop.org
Agrawal R, Srikant R () Fast algorithms for mining association rules in large databases. In: International conference on very large databases, Santiago, pp –
Agrawal R, Shafer JC () Parallel mining of association rules. IEEE T Knowl Data Eng ():–
Agrawal R, Imielinski T, Swami AN () Mining association rules between sets of items in large databases. In: ACM international conference on management of data, Washington, pp –
BLAS. http://www.netlib.org/blas
Buehrer G, Parthasarathy S, Chen Y () Adaptive parallel graph mining for CMP architectures. In: Sixth international conference on data mining ICDM', Hong Kong, pp –
Buehrer G, Parthasarathy S, Ghoting A () Out-of-core frequent pattern mining on a commodity PC. In: International conference on knowledge discovery and data mining, Philadelphia, pp –
Buehrer G, Parthasarathy S, Goyder M () Data mining on the cell broadband engine. In: Proceedings of the nd annual international conference on supercomputing, Island of Kos, pp –. ACM, New York
Catanzaro B, Sundaram N, Keutzer K () Fast support vector machine training and classification on graphics processors. In: Proceedings of the th international conference on machine learning, Helsinki, pp –. ACM, New York
Cong S, Han J, Hoeflinger J, Padua DA () A sampling-based framework for parallel data mining. In: International conference on principles and practice of parallel programming, Chicago, pp –
Ester M, Kriegel H, Sander J, Xu X () A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of international conference on knowledge discovery and data mining, Portland, vol , pp –
Ghoting A, Buehrer G, Parthasarathy S, Kim D, Nguyen A, Chen Y, Dubey P () Cache-conscious frequent pattern mining on a modern processor. In: Proceedings of the st international conference on very large data bases, Trondheim, pp –
Ghoting A, Buehrer G, Parthasarathy S, Kim D, Nguyen AD, Chen Y-K, Dubey P () Cache-conscious frequent pattern mining on modern and emerging processors. VLDB J ():–
Han E-H, Karypis G, Kumar V () Scalable parallel data mining for association rules. IEEE T Knowl Data Eng ():–
Han J, Pei J, Yin Y () Mining frequent patterns without candidate generation. In: International conference on management of data, Dallas, pp –
Hartigan J () Clustering algorithms. Wiley, New York
Johnson S () Hierarchical clustering schemes. Psychometrika ():–
IBM Parallel Machine Learning Toolbox. http://www.alphaworks.ibm.com/tech/pml
Joshi M, Karypis G, Kumar V () ScalParC: a new scalable and efficient parallel classification algorithm for mining large datasets. In: IPPS ': proceedings of the th international parallel processing symposium, Orlando. IEEE Computer Society, Washington, DC, pp –
LAPACK. http://www.netlib.org/lapack
Mehta M, Agrawal R, Rissanen J () SLIQ: a fast scalable classifier for data mining. In: Advances in database technology EDBT', Avignon, pp –
Olson C () Parallel hierarchical clustering. Technical report, University of California, Berkeley
Parthasarathy S, Zaki MJ, Li W () Memory placement techniques for parallel association mining. In: International conference on knowledge discovery and data mining, New York, pp –
Shafer JC, Agrawal R, Mehta M () SPRINT: a scalable parallel classifier for data mining. In: VLDB, Mumbai, pp –
Sreenivas M, Alsabti K, Ranka S () Parallel out-of-core divide-and-conquer techniques with application to classification trees. In: Proceedings of the th international symposium on parallel processing and the th symposium on parallel and distributed processing, San Juan, pp –. IEEE Computer Society, Los Alamitos
Xu X, Jäger J, Kriegel H () A fast parallel clustering algorithm for large spatial databases. In: Guo Y, Grossman R (eds) High performance data mining. Kluwer, New York, pp –
Zaki M, Ho C () Large-scale parallel data mining. Springer, New York
Zaki MJ, Parthasarathy S, Ogihara M, Li W () New algorithms for fast discovery of association rules. In: KDD, Newport Beach, pp –
Zaki M, Ho C, Agrawal R () Parallel classification for data mining on shared memory multiprocessors. In: ICDE ': proceedings of the th international conference on data engineering, Sydney. IEEE Computer Society, Washington, DC
Data Race Detection
Intel Parallel Inspector
Race Detection Techniques

Data Starvation Crisis
Memory Wall

Dataflow Supercomputer
SIGMA-

Data-Parallel Execution Extensions
Vector Extensions, Instruction-Set Architecture (ISA)

Deadlock Detection
Deadlocks
Intel Parallel Inspector

Deadlocks
Roy H. Campbell
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Synonyms
Deadlock detection; Gridlock; Hang; Impass; Stalemate

Definition
A deadlock is a condition that may happen in a system composed of multiple processes that can access shared resources. A deadlock is said to occur when two or more processes are waiting for each other to release a resource. None of the processes can make any progress.

Discussion
Deadlocks are a problem in parallel computing systems because of the use of software or hardware synchronization resources or locks to provide mutual exclusion for shared data and process coordination.
In general, a deadlock has four necessary conditions:
. A resource cannot be used by more than one process at a time, commonly called mutual exclusion.
. A process using a resource may request another resource, a hold and wait condition.
. A no preemption condition applies to the resources held by a process, and they cannot be released without an action of the process.
. Two or more processes are waiting for each other to release a resource in a circular wait or chain of dependency.
In parallel programming, resources can have many different forms. Some examples are:
. Memory that must be updated as if by an atomic or indivisible operation
. Devices like tape drives that can only be used by one process at a time
. Synchronization primitives that coordinate access to the use of resources
. Messages that must be sent and received

Example
Consider two sets of memories M and M. If process P has exclusive access to the contents of M and needs exclusive access to M to complete its operations but process P has exclusive access to the contents of M and needs exclusive access to M to complete its own operations, then this is a deadlock. An I/O device like a tape drive may be considered as a substitute for the memory.
In shared memory parallel programming scenarios, semaphores can be used to coordinate access to a resource. Each memory location (or tape drive) would be associated with a semaphore. An operation P() would represent a request to have access to the resource and an operation V() would represent its release. So a program for process P might be written:

P: S.P(); /∗ get exclusive access to memory M ∗/
   S.P(); /∗ get exclusive access to memory M ∗/
/∗ update M and M ∗ / for example, making memory access read-only elim-


S.V();/∗ release exclusive access to memory M ∗ / inates the need for mutual exclusion. Similarly,
S.V();/∗ release exclusive access to memory M ∗ / devices may be shared as in if a tape device is
used in append only mode. Non-blocking synchro-
Semaphore S is associated with access to M and S
nization primitives may be used that return an
to M. The value of a semaphore represents whether
error code if a resource is in use. Messages may
its associated resource is available. In this case, each
semaphore is given an initial value of  for its associated
be sent in asynchronous fashion so that a message D
receive does not block a process or synchronization
memory location. A semaphore value of  represents
primitives.
that a process has exclusive access to memory M. The
Programming synchronization and coordina-
operation P() decreases the value of a semaphore, while
tion solutions without blocking primitives use “non-
the operation V() increases it. If the semaphore is , a
blocking synchronization algorithms.” However,
P() operation makes the requesting process wait for the
such algorithms can suffer from livelock, a similar
semaphore before continuing.
problem to deadlock in which processes make no
Process P is written in a similar manner:
progress but are never blocked.
P: S.P();/∗ get exclusive access to memory M ∗ / . The hold and wait condition can be eliminated by
S.P();/∗ get exclusive access to memory M ∗ / forcing a process either to acquire all the resources it
/∗ update M and M ∗ / needs in one operation or to release all the resources
S.V();/∗ release exclusive access to memory M ∗ / it holds when it acquires more resources. Job control
S.V();/∗ release exclusive access to memory M. ∗ / in early IBM  systems imposed such conditions.
In general, this is inefficient as it leads to a process
Then the two processes P and P deadlock if pro-
holding more resources than it needs or to an undue
cess P executes its first semaphore operation S.P() and
number of acquire and release operations.
process P also executes S.P().
. Allowing resources to be removed from a pro-
Coordination of processes can create a deadlock as
cess eliminates the no preemption condition. For
when process P waits for a message from process P
example, processes often require additional physical
before sending a message to P and process P either
memory and virtual memory algorithms allow the
never sends a message to P or P waits for a message
physical memory of another process to be released
from P first and then sends a message to P.
in order to prevent a deadlock arising from mem-
Each example satisfies the four deadlock conditions
ory allocation needs. Another approach often used
and may be extended to more processes and resources.
to preempt a resource is to roll the process hold-
ing the resource back to a previous state prior to
Deadlock Detection, Prevention, and the time it acquired the resource. This leads to opti-
Avoidance mistic concurrency control, lock-free, wound-and-
The set of four deadlock conditions are necessary and
wait, and wait-free algorithms for resource alloca-
sufficient to create a deadlock. Detecting, prohibit-
tion. In general, preemption of resources can lead to
ing, or avoiding one of the conditions is thus a basis
inefficiencies and thrashing.
for deadlock detection, prevention, or avoidance [].
. Ordering the acquisition of resources eliminates the
Deadlock detection requires analyzing whether the four
circular wait condition and can be implemented in
conditions apply to a set of processes and resources.
a number of ways. For example, the address of a
Prevention ensures that one or more of the conditions
resource can be used to create an order in which
cannot hold. Avoidance analyzes whether granting a
resources are acquired or the resources can be orga-
resource request could eventually lead to a deadlock and
nized into a hierarchy.
blocks the process from making that request if such a
possibility exists. In turn:
As an example above of ordering resources, if the
. Allowing a resource to be used by more than one address of M is less than M, then this can be used to
process at a time removes mutual exclusion. Thus, impose an ordering on the semaphores such that both
process and process always perform a P.S before P.S. This requires rewriting process P as:

P: S.P(); /∗ get exclusive access to memory M ∗/
   S.P(); /∗ get exclusive access to memory M ∗/
   /∗ update M and M ∗/
   S.V(); /∗ release exclusive access to memory M ∗/
   S.V(); /∗ release exclusive access to memory M ∗/

Now, whichever process performs S.P() first can perform the S.P() and continue until it can release its access to M and M and perform S.V() and S.V().

Detection
Given an arbitrary parallel program, detecting whether that program will deadlock is undecidable except in very specific circumstances, in the same manner as the halting problem is undecidable. There is much literature concerning the undecidability of deadlocks. Brand and Zafiropulo [] describe the computational complexity of deadlock detection in communicating state machines, and Gold [] describes the computational complexity of deadlock avoidance algorithms. However, deadlocks are detectable at runtime. Since the processes involved in a deadlock are waiting, deadlock detection must be performed by a monitoring algorithm that tracks whether the four conditions have been met for the given set of processes and resources. Deadlock detection can be achieved by building a resource allocation graph [] that represents the state of the system, in which a deadlock is observed as a cycle in the graph. Detection algorithms have a runtime complexity of at least O(MxN) or O(Mˆ), where M is the number of resources and N is the number of processes (Holt ). Kim and Koh found a complexity of O(MxN) using trees.

Avoidance
Deadlock detection works well for processes that are already deadlocked. It is possible, under certain restrictions, to determine whether allowing a process to acquire a resource might eventually lead to deadlock. If a deadlock is unavoidable, the process and its request can be suspended pending the release of other resources, leading to a variety of avoidance algorithms.
The banker's algorithm [] is one such avoidance technique, and it works with a given number of processes and resources, each resource being of a particular type. It imposes the restriction that every process declares the maximum number of resources it might need of a particular type. When a resource is requested, the algorithm determines whether to allocate that resource or suspend the process based on whether that allocation will keep the system in a safe state.
A safe state is one in which there is some sequence of allocations of resources to processes such that each process can eventually acquire its maximum number of resources needed so that it may complete and release those resources. This algorithm is conservative in that it avoids situations that are unsafe but may not lead to deadlock. For example, a process may decide not to request its maximum number of resources, but the banker's algorithm has no way of accommodating this information.
If it is inexpensive to kill and restart a process, then another avoidance approach is to allow processes to continue until a request is made that will cause a deadlock. Every process is given an age using a timestamp. In the Wait/Die algorithm, if the process requesting the resource is older than the process holding the resource, it waits. Otherwise, the requesting process dies. In the Wound/Wait algorithm, if the process requesting the resource is younger than the process holding the resource, it waits. Otherwise, the process holding the resource is preempted (wounded) and dies. In both algorithms, the aging scheme ensures that progress is made by preferentially killing the younger process.

Distributed Deadlock
The problems of deadlock extend to distributed systems. The solutions are similar, but require distributed implementations. Distributed deadlock detection solutions decompose the global resource allocation graph into local wait-for graphs that are then combined globally to detect cycles indicating circular wait using graph reduction []. An alternative approach to detecting the cycles is to trace the edges of the wait-for graphs from one node to another to determine if they form a cycle.
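As a concrete illustration of the two-process example and the resource-ordering remedy described above, the following sketch expresses the same scenario in C using POSIX threads and mutexes (an illustrative assumption made here; the entry's semaphore notation maps onto any mutual-exclusion primitive). Thread p1 acquires the two locks in one order and thread p2 in the opposite order, so the program can deadlock.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Two "memories" M1 and M2, each protected by its own mutex
 * (standing in for the semaphores S1 and S2 of the example above). */
static pthread_mutex_t s1 = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t s2 = PTHREAD_MUTEX_INITIALIZER;
static int m1 = 0, m2 = 0;

/* p1 acquires s1 and then s2. */
static void *p1(void *arg)
{
    pthread_mutex_lock(&s1);      /* get exclusive access to M1 */
    sleep(1);                     /* widen the window for the deadlock */
    pthread_mutex_lock(&s2);      /* get exclusive access to M2 */
    m1++; m2++;                   /* update M1 and M2 */
    pthread_mutex_unlock(&s2);
    pthread_mutex_unlock(&s1);
    return NULL;
}

/* p2 acquires s2 and then s1 -- the opposite order, so each thread can
 * end up holding one lock while waiting for the other. Acquiring s1
 * before s2 here as well (a single global order) removes the circular
 * wait and hence the deadlock. */
static void *p2(void *arg)
{
    pthread_mutex_lock(&s2);      /* get exclusive access to M2 */
    sleep(1);
    pthread_mutex_lock(&s1);      /* get exclusive access to M1 */
    m1++; m2++;                   /* update M1 and M2 */
    pthread_mutex_unlock(&s1);
    pthread_mutex_unlock(&s2);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);       /* with opposite lock orders, this join never returns */
    pthread_join(t2, NULL);
    printf("m1=%d m2=%d\n", m1, m2);
    return 0;
}

Because p1 and p2 request the locks in opposite orders, all four deadlock conditions can be satisfied simultaneously; changing p2 to acquire s1 before s2 eliminates the circular wait, which is exactly the ordering remedy described above.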

Discussion
In practice, deadlocks have not proven to be as difficult an issue as at first thought. The cost of restarting a process or system when a deadlock is discovered is often not too great compared with the cost incurred from an implementation of deadlock prevention or avoidance.

Bibliography
. Belik F () An efficient deadlock avoidance technique. IEEE Trans Computers ():–
. Brand D, Zafiropulo P () On communicating finite-state machines. IBM Res Report RZ  (#)
. Bracha G, Toueg S () Distributed deadlock detection. Distrib Comput :–
. Coffman E, Elphick M, Shoshani A () System deadlocks. Comput Surveys :–
. Dijkstra E () Cooperating sequential processes. Technical Report EWD-, Technological University, Eindhoven, The Netherlands
. Dijkstra E () The mathematics behind the banker's algorithm. In: Dijkstra EW, published as pages – of Edsger W. Dijkstra, Selected writings on computing: a personal perspective. Springer, Berlin
. Ezpeleta J, Tricas F, Garcia V, Colom J () A banker's solution for deadlock avoidance in FMS with flexible routing and multiresource states. IEEE Trans Robot Autom ():–
. Gold E () Deadlock prediction: easy and difficult cases. SIAM J Comput ():–
. Habermann A () Prevention of system deadlocks. Commun ACM ():–, 
. Havender J () Avoiding deadlock in multitasking systems. IBM Syst J ():–
. Holt R () Comments on prevention of system deadlocks. Commun ACM ():–
. Holt R () Some deadlock properties of computer systems. ACM Comput Surveys :–
. Kim J, Koh K () An O() time deadlock detection scheme in single unit and single request multiprocess system. In: IEEE TENCON', New Delhi, India, pp –
. Lang S () An extended banker's algorithm for deadlock avoidance. IEEE Trans Softw Eng ():–
. Lee J, Mooney V () Hardware-software partitioning of operating systems: focus on deadlock detection and avoidance. In: IEE proceedings on computers and digital techniques
. Leibfried T () A deadlock detection and recovery algorithm using the formalism of a directed graph matrix. Operating Syst Rev –
. Shoshani A, Coffman E () Detection, prevention and recovery from deadlocks in multiprocess, multiple resource systems. In: th annual Princeton conference on information sciences and systems, Princeton, NJ

Debugging
Christof Klausecker, Dieter Kranzlmüller
Ludwig-Maximilians-Universität München, Munich, Germany

Definition
Debugging and testing are integral parts of the software development process, which provides a model for the development of software products from the initial planning to deployment and maintenance. However, debugging and testing are, although closely related, two completely different tasks which must be clearly distinguished. Software testing is used to verify the reliability and functionality of software products and individual components. This is accomplished by checking their adherence to specified requirements, thus allowing an assessment of the quality of software. If any discrepancies occur during the testing phase, debugging is used to track the reasons why the software fails and to correct the mistakes. Concurrency in applications adds additional error sources and effects, making debugging parallel programs a challenging task. As such, parallel program debugging has to deal with increased complexity, large amounts of data, and effects like race conditions and deadlocks.

Discussion
Overview
Debugging of parallel programs is a complex topic. Therefore, Section "Bugs" starts by clarifying the meaning and origin of the term "bug" itself. Section "The Debugging Process" describes a commonly used debugging approach and the difficulties posed by parallel applications. Section "Breakpointing in Parallel Programs" presents the purpose of breakpointing, one of the essential features provided by interactive debuggers, together with different breakpoint types for concurrent programs. Debuggers, that is, the tools commonly used to track down bugs, will be discussed and classified in Sect. "Parallel Debugging Tools". Current issues and future directions will be presented in Sect. "Future Directions".
Bugs
A "bug," which in computing can occur in hardware or software, causes unexpected or unintended behavior that diverges from the product's specification. A common belief is that the term originated from the finding of a real living bug. While the moth that was found in one of the relays of the Mark II computer [] may have actually been the first real bug to be found in a computer, the term "bug" was already used at the end of the nineteenth century to describe technical glitches [].

Definition  Bug
A bug is the commonly used term for an error with originally unknown location and reason [].

According to Definition , the location of a bug is typically initially unknown. Its causes are small mistakes, design errors, or hardware and software failures. Their effects, however, can vary greatly in their extent. Bugs crashing a program immediately may seem to be the more severe ones, but they are also more likely to get noticed, and thus usually have a limited life span. The serious bugs are often those having initially only limited effects, or those occurring only sporadically, because they are located in seldom used functions or appear only under special circumstances – for instance only when running a parallel program with a higher number of processes or with specific input data. Such bugs remain undetected for a long time, often until an application is used in production. A product containing bugs is not only annoying for the users, but may also cause consequences such as financial loss and bad publicity for the developer.
While bugs can be avoided to a certain extent by carefully planning and designing using the established software development processes and by practicing good code style, the more complex a program gets, the more likely it is to contain bugs. It is practically impossible to write larger pieces of bug-free code. The reason for this is that software is written by humans, who are likely to make "dumb mistakes" []. Parallel programs are especially error-prone since they are often complex due to their concurrent nature and the resulting necessary communication and synchronization steps. As a consequence, software should be tested extensively for unintended behavior, and upon detection the causes need to be found and the errors corrected.
The definition of a bug given above stresses the fact that its location is initially unknown, so in order to fix it, it needs to be tracked down first. The process of finding and fixing bugs is called debugging and is usually supported by corresponding tools.

The Debugging Process
As stated before, the best approach is to avoid producing bugs in the first place, which is unfortunately rarely possible. When a programmer encounters bugs, the goal of the debugging process (Definition ) is to reduce the number of bugs, and consequently make an application execute and behave according to its specification.

Definition  Debugging
Debugging [] is the process of locating, analyzing, and correcting suspected errors.

According to the definition of a bug (Definition ), its location is initially unknown; therefore, to remove it, it needs to be located. In practice approximately % of debugging time is spent on locating the origin [].
The typically applied debugging strategy (in both the sequential and the parallel case) is the cyclic debugging approach [] as shown in Fig. . The underlying principle is as follows: After a bug has been detected, the developer reruns the program under control of a debugger to narrow down the location of the bug. The same input data is used to reproduce the same faulty behavior. The developer uses breakpoints to stop a program at execution states where the origin of errors is suspected or where additional knowledge about the program's behavior can be obtained. If the information gathered during a cycle is not enough to isolate the bug, the procedure starts again by setting new breakpoints.
A different approach is reverse debugging, which, based on recorded information, makes it possible to execute a target program in reverse to a limited extent. This functionality has been recently introduced in some debuggers [, ].
The cyclic debugging approach implies that a program behaves in the same way every time it is executed. While this is often the case for sequential programs, it is not necessarily directly applicable to parallel programs. Due to the introduction of concurrency, a whole new class of problems and error sources appears. The reasons
why parallel debugging differs from sequential debugging [] can be summarized as follows:
● Increased complexity
Because of the multiple processes or threads involved and the interactions between them, parallel applications can quickly become complex. Therefore, it is hard to understand their behavior and to locate bugs by only using tools which operate on the source code level.
● Amount of debugging data
The amount of information which accumulates during debugging parallel applications is significantly larger in comparison to sequential applications. Thus, it can get difficult for the programmer to find the proverbial needle in the haystack, especially if the haystacks are increasing.
● Additional anomalous effects
Due to parallelism and communication, error sources which do not exist in sequential applications are introduced. Those additional anomalous effects include serious problems like deadlocks and nondeterminism, or related events, where concurrent execution of processes and their interaction is involved.
● Scalability
When the limit of frequency scaling was reached roughly in the year , a switch from single- to multi-core architectures took place. This development influenced not only the desktop market, but also made high-end parallel computers with higher core counts possible and necessary – a fact posing scalability problems not only for applications, but also for the debugging process and the associated tools (which usually are parallel programs themselves).

Debugging. Fig.  Cyclic Debugging [] (flowchart: Begin, Instrument program, Set breakpoint, Execute program, Inspect state, Bug detected? (N/Y), Correct error, End)

A mean source of errors in parallel programs is nondeterministic behavior. This poses serious problems for the cyclic debugging approach introduced above, because nondeterministic applications do not necessarily produce the same outcome or execution path for the same input data on repeated runs. This so-called irreproducibility effect sometimes renders reruns of an application, intended to narrow down the location of a bug, useless.
There are various reasons for nondeterministic behavior of an application – a common and usually intended source is, for example, a random number generator. An often unintended effect is a race condition. As a result of varying execution speeds of the parallel processes or network jitter, if two processes send a message to a third one, the outcome can vary depending on which message arrives first.
A possible solution for the elimination of the irreproducibility effect is offered by mechanisms such as instant replay []. During a so-called record phase, information from the initial program run, containing ordering information for critical events, is gathered. Afterward, this information is used to replay the previously observed behavior, thus ensuring that the same execution path is used for repeated application runs.
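The following C sketch illustrates the record-and-replay idea on the two-senders race described above. It is only a minimal illustration of the principle, not the mechanism of an actual instant replay implementation; the program structure, the log file name, and the use of MPI are assumptions made here for the example. In the record phase, the receiving process logs which sender actually won the race; in the replay phase, it posts its receives for exactly those senders, forcing the originally observed message order.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
    int rank;
    int replay = (argc > 1 && strcmp(argv[1], "replay") == 0);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 receives one message each from ranks 1 and 2. */
        FILE *log = fopen("order.log", replay ? "r" : "w");
        if (log == NULL)
            MPI_Abort(MPI_COMM_WORLD, 1);
        for (int i = 0; i < 2; i++) {
            int value, source;
            MPI_Status status;
            if (replay)
                fscanf(log, "%d", &source);   /* sender recorded in the first run */
            else
                source = MPI_ANY_SOURCE;      /* let the race decide */
            MPI_Recv(&value, 1, MPI_INT, source, 0, MPI_COMM_WORLD, &status);
            if (!replay)
                fprintf(log, "%d\n", status.MPI_SOURCE);  /* record who won the race */
            printf("received %d from rank %d\n", value, status.MPI_SOURCE);
        }
        fclose(log);
    } else if (rank == 1 || rank == 2) {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Run with at least three processes, once without arguments to record and once with the argument "replay" to reproduce the recorded order. Real record-and-replay tools log considerably less information, and do so far more carefully, but the principle is the same: record the outcome of each race during the initial run and enforce that outcome on every rerun.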

Breakpointing in Parallel Programs
One of the essential functions used during cyclic debugging is breakpointing – the process of setting breakpoints. A breakpoint can be defined as follows:

Definition  Breakpoint
A breakpoint [] is a controlled way to force a program to stop its execution. Breakpointing may occur on software interrupt calls, on calls to a program subfunction, or on selected points within the program.

Breakpoints are set intentionally to interrupt a program's execution at a state interesting to the developer. During this interruption, the environment and the state of the application can be arbitrarily inspected and even modified. Furthermore, it is possible to set additional breakpoints for future interruptions.
There are two common types of breakpoints: instruction breakpoints and data breakpoints. The former offer the possibility to interrupt the execution of an application before a specified instruction is reached. The latter, also called watchpoints, allow stopping when a memory address is accessed, or a specified value is assigned to a variable. In other words, instruction breakpoints enable a control-flow-oriented debugging strategy, while watchpoints allow data-oriented debugging []. Additionally, some debuggers allow the user to specify conditions for breakpoints. In the context of breakpointing, a condition is a boolean expression which is evaluated every time the breakpoint is reached. The application is stopped only if the expression is true. The so-called ignore count is a special case of a breakpoint condition, which only stops the application when a breakpoint has been hit a certain, previously specified, number of times [].
Setting breakpoints in sequential applications is straightforward: When a breakpoint is reached and the conditions are met, the program is stopped. Afterward, when the programmer has finished inspecting and modifying the application's state, the application can be resumed. When it comes to parallel programs consisting of multiple processes, breakpointing gets more complicated since the programmer may want to inspect the state and variables of multiple processes. The question arising is: which of the processes to stop, and where.
Parallel debuggers typically allow the developer to identify individual processes of a parallel application using numbers reflecting the process number assigned by the parallel runtime environment.

Debugging. Fig.  Classification of parallel breakpoints [] (a tree dividing parallel breakpoints into local breakpoints, message breakpoints, and global breakpoints, with global breakpoints further divided into single global breakpoints and a global breakpoint set)

Breakpoints for parallel applications can be divided into three main classes (Fig. ) [], which can be described as follows:
● Local breakpoints
A local breakpoint is only applied to a single process; all other processes are not affected and will continue their execution. This is identical to the sequential case. However, in parallel programs, there are usually relations and dependencies to other processes.
● Message breakpoints
A message breakpoint stops all processes involved in a single communication event. In point-to-point communication this typically involves two processes. However, in case of collective operations any number of processes can be part of a message breakpoint.
● Global breakpoints
A global breakpoint involves multiple processes, and its classification can be further divided into a single global breakpoint and a global breakpoint set.
– Single global breakpoint
A single global breakpoint has exactly one owner process, the respective process on which it was hit. This owner process is stopped at the location or state defined by the breakpoint. The other processes of the parallel application are stopped immediately afterward. They are, however, stopped at undefined states, because of factors like varying execution speed.
– Global breakpoint set
In contrast to the single global breakpoint, the global breakpoint set consists of multiple local breakpoints. For this reason, when the global breakpoint is reached, the involved processes are stopped at defined states. However, to accomplish this, each local breakpoint needs to be set at a specific location.

Breakpointing is an important concept in the process of pinpointing the origin of a bug and therefore a central aspect of the debugging process.

Parallel Debugging Tools
While for relatively short and simple programs it may be sufficient to carefully read the code or to insert print statements for locating mistakes, this is not adequate for complex programs like large-scale parallel scientific simulations. As mentioned above, parallel applications are not only more error-prone, but locating the error source is also more challenging. For this reason, more sophisticated tools than the so-called printf debugger are required.
There is a variety of tools available for debugging parallel applications. Debugging tools, which intend to provide useful knowledge about a program's execution and the occurring program state changes [] to the developer, can be divided into three major classes:
● Run-time checking tools
● Program visualization tools
● Traditional debuggers
Run-time checking tools provide automated mechanisms to check for errors during the execution of a program. These tools are typically designed for certain programming interfaces or specific error classes. The used parallel programming interface often adds considerable complexity to the development of concurrent applications. A common example is the Message Passing Interface [, ] (MPI), which provides developers with great freedom but also leaves much room for mistakes. As a consequence, run-time tools checking the correct usage of MPI have been developed. Prominent examples for this class of tools are Marmot [], Umpire [] and their successor MUST [], which detect invalid arguments for MPI function calls, race conditions, and deadlocks, but also portability issues between different MPI implementations. Another example class of run-time tools is that of so-called memory checkers. The Intel Parallel Inspector [] or the Valgrind tool suite [] help detect potential memory bugs, like the usage of uninitialized memory, memory leaks, and allocation/deallocation errors.
Program visualization tools provide developers with a way to cope with the increased complexity of parallel applications resulting from communication between concurrent processes. Depicting application flow and relations of parallel processes as a space-time diagram, like, for instance, the VAMPIR [] performance analysis tool does, can aid program understanding, and help to detect errors in an application's communication behavior.
The Trace Viewer [] is a visualization tool based on the space-time diagram and focused on debugging. It implements the event graph paradigm [], and makes it possible to visualize and analyze communication of message passing programs based on traces [] recorded during run-time. Figure  illustrates the dependencies and the message flow of a parallel message passing program. The vertices represent the events which are located at specific points in time and highlight state changes of a process. The edges represent the continuous transition from one state to another. This includes not only local events, but also the communication flow between processes.
Traditional debuggers are tools that allow the user to look at and to analyze other programs interactively while they are running []. As such, these debuggers support programmers in searching for the reasons for an application's abnormal and unexpected behavior and can be defined as follows:
Definition  Debugger
A debugger [] is a tool to help track down, isolate, and remove bugs from software programs.

Debugging. Fig.  Event graph visualization []

Rosenberg further proposes several principles a good debugger should at least try to adhere to []. These principles can be summarized as follows:
● Observing an application with a debugger inevitably changes the debuggee's behavior (sometimes called the Heisenberg principle []). A debugger, however, should minimize its intrusion on the debuggee.
● The second principle asserts that a debugger must provide truthful information. It is very important that the debugger can be trusted and that it does not lead the programmer in a wrong direction.
● Another prerequisite for a good debugger is that it presents the provided debugging information to the user, together with information about where in the execution the program is and how it got there.
● The final principle states the unfortunate fact that debuggers are usually not as technologically advanced as the programmer would need them to be.

The functionality provided by traditional interactive debuggers allows starting and stopping an application under their control []. Furthermore, it is possible to attach a debugger to already running applications, but also to disconnect it, thus ending the influence of the debugger on the application. After inspection, the debugger can, on behalf of the user, signal the program to continue, to be aborted, or even to perform a single instruction or execute a source code statement – making it possible to trace the execution path of an application. As such, debuggers can be classified into machine-level debuggers and source-level debuggers. As the name already indicates, the former operate on a very low level, providing only assembly information. Source-level debuggers need the debuggee to contain additional symbolic information, hence they are also called symbolic debuggers. This additional information can be added during the compilation of the application, usually by adding a specific compiler flag. Source-level debuggers provide, in addition to the assembly view, means to track the execution progress in the original source code.
Concerning support for parallelism, one needs to distinguish between thread and process level parallelism. While most of today's interactive debuggers support threaded applications, only a few provide support for multiple processes. However, there are both open source and commercial debuggers with support for process level parallelism available, for example, for MPI programs. The most widespread parallel debuggers include Allinea's DDT [] and Rogue Wave's Totalview [],
both commercial products. Well-known open source solutions include the Eclipse Parallel Tools Platform (PTP) [], and the g-Eclipse [] debugging functionality. Additionally, several vendor-specific debuggers, usually distributed together with and optimized for certain hardware, are available.
The listed debuggers provide functionality similar to single process debuggers; some of them are actually merely a scalable frontend to multiple instances of a single process debugger, like the popular open source GNU Project Debugger [] (GDB). Parallel debuggers, however, allow performing debugging operations like breakpointing, stopping, stepping, and variable and array inspection not only on individual processes or threads but on defined process groups or the whole parallel application. They provide comparison means to analyze differences in the content of variables and in program states between processes of a parallel application.
Parallelizing applications usually requires partitioning the data to be processed. Array inspection and visualization tools (Fig. ) provide useful insight into distributed multidimensional arrays, and assist in the search for data distribution errors []. Additionally, support for filtering and gathering statistical information on data array values, a useful feature for locating bugs in computation, is provided by the aforementioned parallel debuggers.

Debugging. Fig.  Array visualization using the MAD environment []

Future Directions
The future of parallel program debugging is clearly connected to the evolution of parallel computing itself. To exploit future computational resources, applications will have to adapt to hierarchical hardware structures, as well as utilize accelerators. Developers will have to incorporate a variety of programming models simultaneously into their applications, and the mixture of these models will make the already demanding task of developing parallel applications even harder and more error-prone. With multi- and many-core architectures coming to the desktop, parallel program debugging will also become more important to a wider audience. Faced with such complexities, debugging will become a major problem, in particular at scale. This poses scalability problems, not only for tool launching and internal communication [], but also for the analysis and presentation of the debugging information.
Current approaches include using filtering tree-based communication structures [] for scalable tool communication; however, even if the debugging tools scale, the developer, who has to sift through the data manually, might not. Visualization of program runs with more than a few thousand processes is also not an option without further abstraction techniques, like for instance pattern matching algorithms supporting communication analysis and debugging [].
Another issue will be that the traditional interactive debugging approach with several reruns is hardly applicable, if not impossible, on large program runs, since it is unmanageable and simply too expensive in terms of time, power, and ultimately money. The Stack Trace Analysis Tool [] (STAT) tries to address this problem using a lightweight approach for sampling and merging stack traces, with the goal of providing a preselection of processes for further analysis using a traditional interactive debugger.
The focus of future debugging tools must lie on algorithms that automatically analyze the gathered data. Only this will make debugging tools ready for upcoming computing systems by preprocessing and
abstracting the presented information, assisting and guiding the developer through the error detection and error removal processes.

Related Entries
Checkpointing
Deadlocks
Intel Parallel Studio
Parallel Tools Platform
Race Conditions
Tracing

Bibliographic Notes and Further Reading
Further general information on debugging can be found in [], [] and []. Approaches for debugging and visualization of parallel message-passing applications are presented in []. A documentation and introduction to using the widespread GNU Project Debugger is freely available []. Information on specific parallel debuggers can be found in [], [] and [].

Bibliography
. Allinea DDT () http://www.allinea.com. Accessed  Nov 
. Arnold DC, Pack GD, Miller BP () Tree-based overlay networks for scalable applications. International Parallel and Distributed Processing Symposium, Long Beach, CA
. Arnold DC, Ahn DH, de Supinski BR, Lee GL, Miller BP, Schulz M () Stack trace analysis for large scale debugging. International Parallel and Distributed Processing Symposium, Long Beach, CA
. Copperman M, Thomas J () Poor man's watchpoints. SIGPLAN Notices ():–
. DiMarzio JF () The debugger's handbook. Auerbach Publications, Boston, MA
. Gottbrath C () Reverse debugging with the TotalView debugger. In: Cray Users Group Conference Proceedings, Helsinki, Finland
. Hilbrich T, Schulz M, de Supinski BR, Müller MS () MUST: a scalable approach to runtime error detection in MPI programs. In: Müller M, Resch M, Schulz A, Nagel W (eds) Tools for high performance computing, proceedings of the rd international workshop on parallel tools for high performance computing, Dresden, September . ZIH, Springer, Berlin
. Hovemeyer D, Pugh W () Finding bugs is easy. SIGPLAN Notices ():–
. Intel Parallel Inspector () http://software.intel.com/en-us/articles/intel-parallel-inspector/. Accessed  Nov 
. Kacsuk P () Systematic macrostep debugging of message passing parallel programs. Future Gen Comput Sys ():–
. Kidwell PA () Stalking the elusive computer bug. IEEE Ann Hist Comput ():–
. Klausecker C, Köckerbauer T, Preissl R, Kranzlmüller D () Debugging MPI programs on the grid using g-eclipse. In: Resch MM, Keller R, Himmler V, Krammer B, Schulz A (eds) Tools for high performance computing, proceedings of the nd international workshop on parallel tools for high performance computing, Stuttgart, July . HLRS, Springer, Berlin
. Knüpfer A, Brendel R, Brunst H, Mix H, Nagel WE () Introducing the open trace format (OTF). In: Vassil A, van Albada G, Sloot P, Dongarra J (eds) Computational science ICCS , vol  of lecture notes in computer science. Springer, Berlin, pp –
. Köckerbauer T, Klausecker C, Kranzlmüller D () Scalable parallel debugging with g-eclipse. In: Müller MS, Resch MM, Schulz A, Nagel WE (eds) Tools for high performance computing . Springer, Berlin, pp –
. Krammer B, Bidmon K, Müller MS, Resch MM () MARMOT: an MPI analysis and checking tool. In: Joubert GR, Nagel WE, Peters FJ, Walter WV (eds) Parallel computing - software technology, algorithms, architectures and applications, volume  of Advances in parallel computing, North-Holland, Amsterdam, pp –
. Kranzlmüller D () Event graph analysis for debugging massively parallel programs. PhD thesis, GUP Linz, Joh. Kepler University Linz, Austria. http://www.mnm-eam.org/~kranzlm/documents/phd.pdf. Accessed  Nov 
. Kranzlmüller D, Rimnac A () Parallel program debugging with MAD: a practical approach. In: Proceedings of the  international conference on computational science, ICCS'. Springer, Berlin, pp –
. Krawczyk H, Wiszniewski B () Analysis and testing of distributed software applications. Taylor & Francis, Bristol
. LeBlanc TJ, Mellor-Crummey JM () Debugging parallel programs with instant replay. IEEE Trans Comput :–
. LeDoux CH, Parker DS () Saving traces for Ada debugging. In: Proceedings of the  annual ACM SIGAda international conference on Ada, SIGAda '. Cambridge University Press, New York, pp –
. Lee GL, Ahn DH, Arnold DC, de Supinski BR, Legendre M, Miller BP, Schulz M, Liblit B () Lessons learned at K: towards debugging millions of cores. SC, Austin
. McDowell CE, Helmbold DP () Debugging concurrent programs. ACM Comput Surv ():–
. Message Passing Interface Forum: MPI: a message-passing interface standard () http://www.mpi-forum.org/docs/mpi-.ps. Accessed  Nov 
. Message Passing Interface Forum: MPI-: extensions to the message-passing interface () http://www.mpi-forum.org/docs/mpi-.ps. Accessed  Nov 
. Myers GJ, Sandler C, Badgett T, Thomas TM () The art of software testing, nd edn. Wiley, New Jersey
. Nagel WE, Arnold A, Weber M, Hoppe HC, Solchenbach K () VAMPIR: visualization and analysis of MPI resources. Supercomputer :–
. PTP – Parallel Tools Platform () http://www.eclipse.org/ptp. Accessed  Nov 
. Rosenberg JB () How debuggers work: algorithms, data structures, and architecture. Wiley, New York
. Seward J, Nethercote N, Weidendorfer J () Valgrind . – advanced debugging and profiling for GNU/Linux applications. Network Theory Ltd, Bristol
. Stallman R, Pesch R, Shebs S et al () Debugging with GDB, th edn. Free Software Foundation, Boston
. Stitt M () Debugging – creative techniques and tools for software repair. Wiley Professional Computing Series. Wiley, New York
. TotalView () http://www.totalviewtech.com. Accessed  Nov 
. Vetter JS, de Supinski BR () Dynamic software testing of MPI applications with Umpire. In: Proceedings of the  ACM/IEEE conference on Supercomputing (CDROM), Supercomputing '. IEEE Computer Society, Washington
. Von Kaenel PA () A debugger tutorial. SIGCSE Bulletin ():–

DEC Alpha
Joel Emer, Tryggve Fossum
Intel Corporation, Hudson, MA, USA

Synonyms
Microprocessors

Definition
The DEC Alpha was an architecture and line of microprocessors created by Digital Equipment Corporation (DEC). These processors, which began shipping in the early s, were among the most innovative and highest performing of their era. The -bit Alpha processors were the successor to DEC's highly successful VAX product line and ran VAX legacy code, through translation, on new incarnations of the company's VMS operating system. Alphas also ran an OSF-derived version of Unix called Tru, and Microsoft Windows NT ran natively on the Alphas with support for x programs via binary translation.

Background for Alpha
The Alpha architecture arose inside DEC in the late s out of recognized challenges with the company's then-current VAX product line. VAX design projects were facing difficulties in meeting their schedules and the designs were becoming overly complex for the relatively small DEC design teams. As a result, VAX designs, which had had their heyday during the s, were starting to lag behind their competitors on cost and performance. But the VMS operating system and a substantial software ecosystem, which ran on VAXes, were very important to DEC, representing a large customer base and a significant revenue stream.
The VAX architecture had been developed around a philosophy that was premised on code being written in assembly language. In the late s when the VAX architecture was being specified, most large software systems were written in assembly language to achieve adequate efficiency. The VAX instruction set made this easier by having instructions which performed sophisticated operations and provided a universally available set of addressing modes for each operand of the instruction. Improved compilers, however, undermined the assembly language premise on which the VAX was based. In this evolution, the VAX was at a disadvantage because the VAX instructions were over-specified, and it was difficult for compilers to map programming language idioms into the available instructions. Even worse, the VAX instructions provided semantics beyond that needed by the compiler but which had to be paid for in the implementation. This resulted in designs that were more complex and slower than needed.
The VAX architecture with its sophisticated instructions and powerful addressing modes became known as a complex instruction set computer or CISC-style architecture. But the s also had seen the beginning of the reduced instruction set computer or RISC revolution. RISC architectures were based on the notion that a computer architecture should be a good target for compilers by providing exactly the primitive operations that a compiler needs and that also map naturally to simple and efficient hardware implementations. The RISC work was spearheaded by research at IBM, Stanford, and Berkeley, who each developed simple, powerful processor chips quickly. This work led eventually to IBM's PowerPC processors, spawned MIPS as a company, and SPARC as the basis for Sun's computer line.
The attraction of RISC-based architectures was not lost on DEC's engineers. As a result, there were several early RISC projects at DEC. These included Titan,
a RISC-based research project; SAFE, a RISC architecture explored within the development organization; and Prism, which formed a RISC-style architectural basis for an entire software system that was an evolutionary reimplementation of the VMS operating system. DEC also built and marketed systems based on the MIPS architecture. Those systems used MIPS-designed chips, but there were MIPS processor designs in flight at DEC when all RISC development efforts were redirected toward Alpha.
A flaw common to all of DEC's initial RISC-style design efforts was that they were not believed to be able to preserve the VAX/VMS-based market, so a task force was formed to address that deficiency. Thus, the primary charter of the Extended VAX, or EVAX, taskforce was to propose a means of modernizing the VAX architecture in a way that preserved the software base in VMS and in DEC's Ultrix (Unix) market. Modernization, of course, implied that it had to gain the efficiency/complexity benefits of the competing RISC architectures, and thus this group was often referred to as the RISCy VAX taskforce.
The RISCy VAX taskforce considered a number of design alternatives. Since a key criterion was preserving software compatibility, a number of the alternatives also preserved the VAX architecture directly. For example, one alternative was to have a heterogeneous design in which new systems would have a small VAX core running the OS and legacy applications, with a more modern RISC core running performance critical applications, while sharing virtual memory. While such a design preserved compatibility, it added to a RISC design the cost and complexity of a VAX design whose performance would still be a critical component of overall performance.
To mitigate some of the cost of multiple heterogeneous cores, another proposal examined multi-execution path cores. In this variation a single core could execute both VAX and RISC instructions by providing two front ends: one interpreting VAX instructions, and another dealing with the new RISC ISA. This was how VAX evolved from the PDP-, but in that case the microcode complexity was already being paid for, which was something not needed for a pure RISC core.
Some years earlier DEC had already made a simplification of the VAX architecture through the removal of some of its most complex operations. This simplification facilitated the creation of single-chip-microprocessor VAXes. These were the highly successful MicroVAXes. Since reducing the ISA had been successful before, why not try it again?
The fixed length instruction VAX subset proposal suggested carefully selecting and implementing only a subset of VAX instructions, all fixed length and with other RISC-like attributes. Note, however, that limiting instructions to only -bits (like other RISC architectures) would have resulted in a load with an inadequate -bit offset, so a small number of different length instructions were included. Using such a subset architecture, DEC could design a simpler, faster VAX. As with MicroVAX, this scheme dealt well with backward compatibility, but would require some form of emulation for forward compatibility of instructions no longer implemented in hardware. Unfortunately, analysis showed that a number of factors, such as the small architectural register file set and the implementation complexity of multiple instruction sizes, would destine this proposal for subpar performance.
A final hardware solution proposed a hardware unit that would do hardware instruction translation of VAX instructions into RISC instructions and then execute those instructions in a pure RISC-style back end. This strategy was approximated in some VAX designs, leaving much of the CISC operations to the front end for simple instructions, while more complex instructions were handled in microcode. Such a scheme was used in the mid-s by Intel to implement out-of-order execution in its Pentium Pro processor and has been the basis of many subsequent Intel processors.
After discussing this set of less than perfect hardware solutions to the problem, it was decided to save the hardware of the hardware-based instruction translation scheme and leave it to software to facilitate the migration from VAX to Alpha. Through a combination of interpretation, static and dynamic binary translation, compilation of Macro assembly language using a new Macro "compiler," and recompiling of higher-level language programs, software would solve the problem. Relegating migration to software freed the architecture team to focus on what they believed to be the best RISC architecture ever without significant legacy constraints.
To make migration to Alpha easier, a few hardware hooks were architected in for VAX support, but
not many. Several were easy to incorporate, such as identical protection rings, memory protection codes, and interrupt priority levels. The VAX page size of  bytes was already resulting in less efficient page translation and complicated first level cache design, so it was a sticky point. Therefore, it was decided to immediately start aligning VAX pages on  KB boundaries in the linker two years before Alphas were to start shipping, to allow Alpha to be architected with a minimum page size of  KB.
Many compatibility issues were also handled with special code called PALcode. A program enters PALcode with a special instruction that dispatches into the entry point for an implementation-specific PALcode routine. PALcode, which looks like regular instructions, but with extra privileges, some implementation-specific instructions, and special registers, could perform special operations needed for various system activities but also could be used to support complex VAX operations. Special instructions could even be used to bracket sequences of operations that were to be executed atomically.
PALcode also provided an easy place to provide various other system functions and implementation-specific functions. This use is similar to Itanium's PAL code.

Alpha Architecture
Like all RISC architectures, the Alpha architecture was designed with a set of simple register-based operations and separate instructions to move memory values to and from the registers. As with many other RISC architectures, the instructions are of a fixed length of  bits. Alpha instructions came in just four forms:
● Register-to-register operations
● Memory operations
● Branches
● Dispatches to PALcode
The Alpha instructions were also designed to map efficiently to a hardware implementation and hence were extremely regular. To keep them regular, the operands are in fixed locations. Furthermore, instructions have at most two source operands and no more than one destination. This greatly simplified the register file, operand tracking tables, and pipeline organization, since there never was a need to access, track, or transport more than two input values and one output value for an instruction.
A recurring issue related to the two-input/one-output principle was the proposal for adding a Floating Multiply and Add instruction (FMA). FMA allows for more efficient implementation of its constituent operations and saves instruction issues for an operation sequence that occurs frequently in some codes. But FMA also requires a minimum of three sources and a destination. Thus it violated the basic design principles and was left out of the architecture.
With -bit instructions, providing a reasonable number of operations and no more than three register operands, the number of architectural registers is restricted to . However, by using separate -entry register files for integer and floating point instructions, Alpha gained an extra implicit register specifier bit, resulting in a total of  architectural registers for computation.
The majority of Alpha architectural state is the  integer and floating point registers. Two of these registers, R and F, were hardwired to zero. Thus reads of these registers always returned a zero and writes were ignored. Allowing writes to these registers did, however, allow for special instruction semantics. For example, loads to R served as a memory prefetch. Additional architectural state consisted of shadow registers for integer registers that appeared in kernel mode. These saved the costs of explicit register saving by providing some scratch space on entry to kernel routines.
Another piece of architectural state is a floating point mode control register. This register was a compromise that involved saving the use of precious bits in each floating point instruction to specify the mode, e.g., rounding mode, and having library routines inherit a mode from the caller (with unpredictable accuracy results) or check/save/restore the mode (at a significant performance penalty for small routines).
The program counter (PC) is also architectural state, but unlike on the VAX the PC does not appear as a numbered general purpose register. Alphas also have a status register that holds information about such things as the current and previous ring level and interrupt priority level. But in contrast to the VAX, the Alpha architecture does not have an explicit register to hold condition codes. This both saves on the bottleneck of having a single register receiving results from many
instructions and preserves the principle that an instruction only reads two registers and writes one register, thus simplifying dependency tracking.

Avoid Complex Hardware Features
In a reaction to the complexity of VAX instructions, the Alpha architecture was designed to avoid features that would result in complex hardware. For an operation to be included in the Alpha ISA as an instruction, it had to be shown to occur frequently in important programs, that it could not be done effectively by sequences of other instructions, and that it fit the Alpha model of simple hardware implementation. Integer multiply was debated, but was finally included. Integer divide was found to be a rare operation, often dividing by a constant, equivalent to a handful of shifts and adds, and hence not included.
A significant, and maybe the most controversial, decision based on complexity versus frequency of occurrence was the decision to leave out load and store operations for less than -bit values (longwords). This meant that -bit values (bytes) and -bit values (words) were not accessible except as part of a longword. Stores of a byte required a read, modify, write sequence. The statistics showed byte memory operations to be rare, and that the shifts involved in lining up an individual byte in a -bit datapath would slow down loads and stores in general. Byte access can also complicate the implementation of error checking, for example, via ECC. The absence of  and  bit load/store support turned out to be more costly than had originally been envisioned. The compiler's lack of ability to determine memory aliasing of the bytes being gratuitously read (and re-written) near a byte of interest was a major impediment to code reordering. Thus instructions to load and store both bytes and words were added in the shrink of the second generation Alpha, EV.
Arithmetic exceptions on the Alpha are imprecise. This lets later instructions complete before overflow and underflow conditions are known. To give software the opportunity to rationally respond to such conditions, a barrier instruction was provided that would force instruction issue to wait for all pending exceptions to be resolved. Thus software could achieve the effect of precise exceptions by placing a barrier instruction at the end of an idempotent sequence of instructions that followed any instruction that might cause an exception. This approach simplifies hardware, but does complicate software design. It is interesting to note that with out-of-order execution, register renaming made precise exceptions practically free.
The Alpha design teams also considered hardware support for speculative operations. This tends to create some complexities, mainly around exceptions. But there were some benefits from being able to move operations past branches (control speculation), and to a lesser degree from moving loads past stores (data speculation). Ultimately, it was found that most of these benefits could be achieved without hardware support, but by dealing with spurious exceptions in software. Such a framework was developed in Tru UNIX.

Hardware Implementations
The Alpha architecture was implemented in a series of hardware designs. The principal numeric designation for these designs was <n>, where the  stood for the twenty-first century (since Alpha was touted to be the architecture for the next  years), the  indicated that this was a -bit architecture, and <n> indicated the generation of the design starting with .
Internally, the Alpha designs were referred to by code names in the form EV<n>. The EV prosaically stood for Extended VAX, although it has been suggested that EV stood for other more colorful things. The <n> stood for the DEC CMOS process generation, thus
Floating point square root instructions and support EV (which was sold as the ) was implemented
for video encoding were added in the third genera- in CMOS, an . μ process. When a design was largely
tion Alpha, EV. Square root was chosen because it remapped into a later process the process generation
was a common operation and could be executed by a of the new process was appended to the code name.
small tweak to the floating point divide hardware. Video Thus, EV was a shrink (with some design changes)
and multimedia support had emerged as performance- of the EV design into the CMOS process. Unfortu-
critical applications. A few, simple instructions turned nately, as the goal of having a complete new design in
out to make a big difference for performance. each process generation proved unsustainable, the <n>
In order to simplify exception handling hardware, became simply a monotonically increasing designation
arithmetic exceptions are architected to be imprecise in of new designs. Still the EV nomenclature conveys more
DEC Alpha D 

DEC Alpha. Table  Alpha timeline performance of EV made it an eye opener for the indus-
Year Processor Process Frequency Features try. It started an era of rapid improvements in CMOS
EV . μ Software circuit speed. An updated version of EV ran at  MHz.
development Overall, the software transition from VAX to Alpha
vehicle went smoothly due to the careful preparation of tools
 EV . μ  MHz Two-wide to assist in porting both OS functions and applications.
superscalar Several systems were designed around EV, including D
 EV . μ  MHz Four-wide an Alpha PC, multiple workstations, and low and high
superscalar end servers.
 EV . μ  MHz Out-of-order, Both UNIX and VMS were supported. Introducing
Multimedia a -bit architecture turned out to be complex. When
 EV . μ . GHz Integrated VAX was introduced as a -bit architecture  years
system earlier, the market had clearly outgrown  bits. The
functions
need for  bits was less pressing. DEC decided on a
EV . μ . GHz Eight-wide
split strategy: -bit VMS, and -bit UNIX. Pioneering
superscalar,
SMT  bits was heavy lifting for UNIX, as many applica-
tions did not port easily from  bits. This had a mixed
EV . μ Vector
processing impact on Alpha as reduced application availability
clearly slowed Alpha acceptance. The larger memory
footprint due to the increased size of address pointers
information about a design, so this article will typically also had a detrimental impact on some applications.
use it for describing the various Alpha implementations On the other hand, a -bit address space proved a
(Table ). valuable asset in some application domains, such as
databases, where Alpha did shine. Alpha’s pioneering
EV: Software Development Vehicle porting efforts also clearly aided today’s -bit architec-
To assist software development and allow exploration tures by getting applications ready.
of high-speed circuit techniques, a non-commercialized Considering the classic formula that expresses pro-
implementation of the Alpha architecture called EV was cessor performance as a function of frequency of opera-
developed. Following the naming convention described tion, cycles per instruction and number of instructions,
above, EV was implemented in DEC CMOS pro- the Alpha architecture was most focused on facilitat-
cess, which was a . μ process. The EV design largely ing high-frequency operation with instructions that
matched the first commercially sold Alpha, EV, but in required few cycles to execute. Designs exhibiting such
order to fit in a single chip in the less spacious CMOS characteristics were dubbed “speed demons” and the
process, some nonessential functions were left out or EV exhibited these characteristics through a design
reduced in size. For example, floating point operations philosophy of providing speed through high frequency
were emulated in software. The caches were also only implementation in a simple, relatively shallow pipeline.
 KB. EV was built into the Alpha Development Unit Although the EV frequency of operation was a
(ADU), which was available for early software develop- function of many components, considerable effort was
ment. EV proved very valuable in the successful launch expended on creating a -bit adder that could be
of Alpha as the fastest microprocessor in the industry. cycled as fast as possible and was intended to set the
frequency of the design. Thus, other less frequent oper-
EV: First Commercially Available Design ations, such as shifts, were allowed to take multiple
The first commercially available Alpha microproces- cycles. Considerable design effort was also expended
sor was EV. It was launched at an astonishing clock on the high frequency clocking network to minimize
frequency of  MHz, in DEC’s CMOS . μ pro- clock skew across the chip – as a consequence approxi-
cess at a time when the industry standard was well mately half the power of the design went into the clock
below MHz.Thefunctionality,frequency,andoverall network.
 D DEC Alpha

High frequency operation was also facilitated by It does, however, have the implicit assumption that the
avoiding stage-to-stage flow control in the main destination register is always physically in the same
pipeline. This eliminated complicated cascading stall location. But even in an in-order pipeline the destina-
signals and reduced delay in each stage of the pipeline tion value of a CMOV is not always in the same place
and thus allowed for more computation logic in when compared to other arithmetic operations. There-
each stage. fore CMOV instruction results cannot be bypassed and
To control the pipeline, an end-to-end flow control have a latency greater than that of other simple arith-
was implemented in which instructions in the early metic operations.
stages of the pipeline always flow systolically and an exe- In an out-of-order pipeline, the assumptions on
cution unit stage would notice the need for a stall. But which the CMOV instruction was premised are even
rather than stall the immediately preceding stage of the less true. In an out-of-order design with register renam-
pipeline the execution unit would redirect the front of ing, the newly assigned destination register does not
the pipeline to stall and buffer up the next few instruc- automatically contain the old value when the condi-
tions coming down the pipeline and replay them when tion is false. Thus the old value in effect becomes
the stall condition had been resolved. This buffer was a third source, violating a key Alpha design princi-
referred to as a “skid buffer.” ple. This became an issue in EV and EV, the out-
Since the design of the Alpha architecture and of-order Alpha designs. Rather than require costly
the EV/ micro-architecture were designed contem- extra register ports and other changes to the exe-
poraneously, a number of micro-architectural consid- cution pipeline, the CMOV instruction was bro-
erations entered into the architecture. This included ken into two separate instructions in the instruction
the placement of operand fields in the instruction fetch stage.
and assignment of operation codes that were easy to Despite these implementation challenges the CMOV
map into a programmable logic array or PLA. The instruction was found valuable by software. Since the
instruction set was also divided into classes that used CMOV instruction can be viewed as a specific case of
disjoint sets of function units to facilitate the cre- a predicated operation, since a little predication was
ation of superscalar designs that could issue multi- found to be valuable the team considered adding more
ple instructions simultaneously. Superscalar implemen- predication. At the time, there were several papers pub-
tation was also simplified through the absence of a lished describing significant IPC improvements from
MIPS-style branch delay slot or a VAX-style condition predicating all instructions. Additional support for
code register. predicated operations proposed for inclusion in the
EV was a superscalar design, issuing up to two architecture. Within the Alpha team, performance stud-
instructions per cycle. There were no redundant func- ies of such more extensive predication were conducted,
tional units, so the two instructions issued had to be but found only negligible benefits beyond what was
a combination of single load/store, integer, branch, obtained with the CMOV instruction alone.
or floating point instructions. The chip had an  KB Finally, application performance depends on more
instruction cache and an  KB data cache on die, with than just the microprocessor. The system included
support for a board level cache. industry standard memory chips and IO components.
A challenge in accommodating specific micro- The raw execution advantage of EV was somewhat
architectural optimizations into the specification of an diluted by these system components. Not until EV
architecture is the impact such features have on future did Alpha microprocessors incorporate major advances
micro-architectural innovations. The Alpha’s condi- in memory system, followed by the innovative system
tional move instruction (CMOV) is an interesting case design in EV.
in this regard.
The CMOV instruction takes a source operand and EV: Compaction of EV into .μ CMOS
a condition, and copies the source to the destination Technology
if the condition is true. In an in-order design, this Initially, the Alpha design teams maintained a cadence
seems to fit the two sources and a destination model. of scaling the previous design into a new process,
DEC Alpha D 

followed by a new design in the same process. EV was design in the. μ CMOS  process resulted in a dra-
the CMOS version of the EV design. A similar strat- matic frequency improvement from  to  MHz.
egy at Intel is referred to as the Tick-Tock model. EV With the smaller feature size, came smaller die size
kept the dual issue pipeline of EV, but the increased and lower power, making it easier to manufacture and
size of first level cache from to  KB and the clock design into systems. Compilers were getting better at
frequency to  MHz. dealing with the idiosyncrasies of the microarchitec-
ture. A fairly major change in EV was the addition D
EV: The Next Generation of byte and word access to memory. This change did
EV was a major new Alpha design. It was four- not have a negative impact on the cycle time, as had
wide superscalar, but only if two of these were float- been feared. It had a positive impact on porting appli-
ing point instructions, an addition and a multiplica- cation, which often accessed memory in small chunks,
tion. For integer operations, it was two-wide issue. EV and the previously required costly read-modify-write
did relax some of the integer issue rules from EV, sequences.
by allowing two load operations or one store through
an L cache which was implemented as two copies of EV: Speed Demon and Brainiac
the data to allow parallel reads. Stores wrote to both EV was the third major Alpha design. EV attempted
copies. to maintain the high frequency design of its prede-
EV was an ambitious design. With a total transistor cessors. Like EV, it was superscalar with a peak per-
budget of . million devices, it crammed a lot of func- formance of four instructions per cycle. But the EV
tionality into the chip. This was partly made possible by team aimed to not only preserve those “speed demon”
relying on the compiler to schedule code for optimal characteristics of the previous designs but also pro-
execution. Multi issue works only when independent vide sophisticated microarchitectural features of the
instructions are aligned on natural boundaries. Studies so-called “brainiacs” of that era.
showed that more complex issue logic might improve There were several important microarchitecture fea-
performance in cycles per instruction, but would cre- tures included in EV. Most striking was the ability
ate tighter timing constraints and make it more difficult to execute instructions out of order. Instructions were
for the compiler to schedule for multiple issue. EV also fetched, destination registers were renamed, and source
relied on the compiler to schedule for resource conflicts. registers mapped. They were then placed in an instruc-
For example, a store followed by a load instruction to the tion queue, and issued based on when their source
same address in the following cycle causes a replay trap. operands were ready.
EV took the Alpha philosophy of executing good code With a deeper pipeline and more operations in
very well, while sometimes tripping over legal, but less flight, the branch predictor became more important.
than optimal code. EV had a tournament branch predictor, with a global
EV coincided with DEC doing research in VLIW (path history) and local (history of this branch) predic-
compiler technology. In some ways, it was useful tor dueling. A Chooser picked the winner based on the
to view EV as a VLIW. Having the compiler con- path history. EV’s branch predictor was far larger and
sider low-level code scheduling had a big impact on more sophisticated than any of its contemporaries, yet
performance. in the design post mortem it was felt that even more size
The cache hierarchy was revamped from EV. The and effort should have been devoted to it.
first level caches, I and D, remained at  KB each, but The complex EV branch predictor took multiple
they were backed up by an on-die  KB, three-way set cycles to make a prediction and calculate an expected
associative, unified L cache. It continued to support an next address. Since the machine needed to fetch a block
off chip cache as a third level. of instructions each cycle EV incorporated a structure
referred to as a ‘line predictor’ that could quickly predict
EV: Compacting EV into .μ Technology a new next instruction block every cycle. This elimi-
The successor, EV, was possibly the biggest success nated bubbles in the pipeline waiting for the branch
for the Alpha product line. Re-implementing the EV predictor. If the branch predictor disagreed with the
 D DEC Alpha

line predictor, then a quick redirect of the front end EV: Integrating System Functions
occurred. Of course, if the branch predictor was found Alpha processors achieved good performance on
to disagree with the actual instruction during execution benchmarks like SPEC, both integer and floating point.
then a later redirect of the front end would occur. In larger applications, some of the processor perfor-
EV could issue four integer operations in a sin- mance advantage was lost in time spent outside the
gle cycle, with up to two Load or Store instructions. A processor chip itself. With the extra transistors avail-
remarkable feature was the ability to double pump the able through the new . μ process, it became practical
first level cache. This allowed for a fully dual ported to integrate the traditional functions of the system chip,
cache, without bank conflicts. This double pumping such as the memory controller, the network switch, and
mechanism also allowed for reading dirty data out of the network router. In addition, EV added a . MB
the cache in the same cycle a block was filled. The virtu- on die second level cache, and a direct interface to an IO
ally indexed cache had a three cycle load to use latency. chip. This made for an almost glueless multiprocessor
It was backed up with an off-chip cache with a dedi- system.
cated interface, eliminating conflicts with traffic to main The memory was based on RAMBUS technology,
memory and IO. with eight channels connected directly to the CPU chip.
To keep the instructions flowing, EV relied on This was a further improvement in memory latency
speculative execution. It would let issue-ready loads and bandwidth, up to . GB/s. The memory channels
issue before earlier stores, and replay if the addresses could operate in RAID mode, XOR’ing in a redundant
conflicted. To prevent future conflicts, a PC-indexed channel for high reliability. To keep the caches coher-
table kept track of past problems and kept loads in order ent, there was an in-memory directory which kept track
when necessary. of which cache lines were present in a cache some-
Another interesting issue-related mechanism existed where in the shared memory system. This scheme is
for optimizing the issue of instructions dependent on relatively simple, but has the drawback of requiring
a value from a load instruction. While the latency of memory bandwidth for directory updates. To reduce
most instructions is deterministic, the latency of a load this effect, EV did not use the directory for local cache
depends on whether it hits or not. To rendezvous with a accesses. This works very well for programs with good
hit optimally requires issuing before the machine knows locality. Remote references had to snoop the local cache
if the load will hit in the first level cache. Since guessing for possible hits.
a load is going to hit and having it miss resulted in a The core design itself was an EV with only minor
costly replay, EV had a predictor for the behavior of changes. The EV systems pioneered a new trend in
future loads controlling this load-hit speculation. system design. With point-to-point socket connections
The EV team set out to combine the raw execu- through high bandwidth links, the performance scaled
tion rate of the processor with a matching memory and very well up to  sockets with shared, cache coherent
system interface. EV introduced several new instruc- memory. This was partly due to a high ratio of memory
tions to the Alpha architecture. Besides Square Root, and link bandwidth ( GB/s per link) relative to indi-
and multimedia operations, there were new versions of vidual socket performance. The plan was to follow up
the memory prefetch instruction, making writes and this breakthrough in system design with a new core,
streaming references more efficient. EV, with increased core performance. In practice, the
For EV, the Tsunami chipset made big advances in industry trend became multi-core processors, with high
improving memory access. Wide interfaces and point- aggregate performance, putting pressure on the system
to-point connections resulted in several Gigabytes per functions such as memory and link bandwidth.
second of memory bandwidth. This design style fore- EV moved the board level cache onto the die. The
shadowed the link-based systems that followed. resulting cache was smaller than the  MB cache used
The EV design was compacted into new processes: in many EV systems. Sometimes this impacted perfor-
EV in . μ, and EV in . μ. It was also used as mance negatively, but with the higher memory band-
the compute engine in EV. width and shorter latencies, most applications achieved
DEC Alpha D 

good performance improvement with EV systems. competing for resources and issue in the same cycle.
EV showed the importance of overall system design in With multiple threads the instruction window was
achieving robust application performance. Even though much more likely to contain issue ready instructions.
EV entered the market when Alpha investments were This was an early implementation of SMT, breaking
being reduced, it had a long product life due to its new ground while solving some interesting problems.
scalable system design. A few new instructions were added to support SMT.
To keep idling threads from consuming resources, the D
EV: Araña instruction pair of ARM and QUIESCE set up a mem-
EV was aimed at pushing Alpha processors to ever ory address to be monitored for updates and allowed a
higher single stream performance. The Alpha architec- thread to suspend itself while waiting for a write to that
ture goal of enabling superscalar implementations with location.
simple instruction decode and dependence checking With SMT, it became much more practical to utilize
would be tested as the EV goal was to fortify Alpha’s CPU resources. Performance scaled well up to four
position as the world’s highest performing computer threads. For high IPC applications, there was a dropoff
architecture. The total transistor budget was now several in improvement from three to four threads, but for oth-
hundred million transistors and the goal for EV was ers, such as transaction processing, the improvement
to make good use of them all. The path-finding phase was almost linear with thread count. SMT is an effi-
aimed at developing features that would increase the cient latency hiding technique. While one thread waits
IPC, with relatively little concern for power and area. for memory, other threads are free to use all the pipeline
Performance modeling was aimed at finding the lim- resources. The extra threads can, of course, put pressure
its for instruction level parallelism. The design could on the shared resources, and will especially increase
issue eight instructions per cycle, with a wide range of the cache miss rate due to supporting multiple pro-
functional units to support an assortment of instruc- grams when there is little locality. For EV, the SMT
tion mixes. design became a careful balancing act between multi-
Keeping an eight-wide issue unit busy is a big chal- threaded performance and the cost of supporting the
lenge, which EV designers tried hard to meet. The front extra threads in area and power.
end could fetch up to  instructions per cycle in two There were two principal points in the pipeline
chunks of eight aligned instructions, possibly with a with arbitration between threads. Every cycle, a thread
taken branch instruction between them. An extensive selector would pick a thread to fetch instructions
branch predictor could make up to sixteen predictions from, based on which thread had the fewest instruc-
in a cycle. A large instruction window tried to feed the tions active in the early stages of the pipeline. This
functional units with issue ready instructions, usually allowed threads with high IPC to move right along,
issued out of order. while allowing slower threads access without clogging
One of the messy problems with out-of-order exe- up pipeline resources. Another thread selector picked
cution results from address conflicts when loads are threads for mapping registers based on the register
hoisted above stores. The EV team invented the “store renaming scheme. Once instructions had been mapped,
set” mechanism, which associated set of loads with sets they were put into an instruction issue window shared
of potentially conflicting stores, and used the spare reg- by all the threads. From there they were selected for
ister number in the micro-architectural implementation issue based on age among issue ready instructions.
of these operations to encode the dependency. While SMT required dedicated architectural regis-
With performance modeling, it became clear that ter state for each thread, the renaming registers required
many interesting applications would not significantly to fill the instruction pipeline was shared among the
benefit from eight-wide instruction issue. The amount four threads. Wherever possible, pipeline structures
of ILP was just not there, so the architects added “simul- and caches were shared dynamically between the four
taneous multi-threading” (SMT) to the design. Four threads for optimal throughput performance. Only a
threads could execute simultaneously in the pipeline, few safeguards had to be part of the allocation to prevent
 D DEC Alpha

threads from being starved for resources. Overall, SMT . Alpha Architecture Reference Manual () In: R.L. Sites (ed)
proved to more than double through-put performance Digital Press, Bedford
with less than % increase in hardware resources. . Bhandarkar DP () Alpha architecture and implementations.
Digital Press, Bedford
. Sites RL, Chernoff A, Kirk MB, Marks MP, Robinson SG ()
EV: Vectors and Multicore Binary translation. Commun ACM ():–
Although EV never shipped, research work was in . Chernoff A, Herdeg M, Hookway R, Reeve C, Rubin N, Tye T,
progress on a variety of successor designs. One alter- Yadavalli SB, Yates J () FX!: a profile-directed binary trans-
native, called EV, was targeted at high-performance lator. IEEE Micro ():–
technical computing and added an integrated vector . McLellan E () The alpha AXP architecture and  proces-
sor. IEEE Micro :–
unit. Following the naming scheme begun with EV,
. Dobberpuhl DW, Witek RT et al () A -MHz -bit dual-
which was called araña (spider in Spanish) alluding to issue CMOS microprocessor. Digital Tech J ():–
both the eight-way issue and SMT threads, EV was also . Edmondson JH, Rubinfeld P, Preston R, Rajagopalan V ()
called tarantula. The Cray X continued this naming Superscalar instruction execution in the  alpha micropro-
tradition with a code name of “black widow.” cessor. IEEE Micro ():–
. Gronowski P, Bowhill WJ, Donchin DR, Blake-Campos RP,
Another trajectory examined putting a large num-
Carlson DA, Equi ER, Loughlin BJ, Mehta S et al () A
ber of smaller processors on a ring-based intercon- -MHz -b quad-issue RISC microprocessor. IEEE J Solid-
nect. This work was pioneered by DEC’s research lab State Circuits ():–
in Palo Alto, code-named Piranha project. The initial . Gwennap L () Brainiacs, speed demons and farewell. Micro-
design suggested that replacing a single Alpha core opti- processor Report ():–
mized for single stream performance with eight simpler . Kessler RE, McLellan EJ, Webb DA () The alpha  micro-
processor architecture. In: Proceedings of the international con-
cores could improve overall performance on applica-
ference on computer design: VLSI in computers and processors,
tions such as online transaction processing. Austin, pp –
. Kessler RE, McLellan EJ, Webb DA () The alpha  micro-
Alpha Systems processor architecture. In: Proceedings of the international con-
Alpha processors were the compute engines in a long ference on computer design: VLSI in computers and processors,
Austin, pp –
array of computer systems. DEC produced systems
. Kessler RE () The alpha  microprocessor. IEEE Micro
ranging from PC’s and workstations to high end server ():–
systems. TurboLaser was a bus based system with more . Bannon P et al () Alpha : A Single-Chip Shared Mem-
than  GB/s of bus bandwidth used to connect multiple ory Multiprocessor. In Government Microcircuits Applications
processors with memory and IO. Wildfire was the first Conf. : –
link-based system, connecting groups of four proces- . Mukherjee SS, Bannon P, Lang S, Spink A, Webb D ()
The alpha  network architecture. IEEE Micro ():–
sor blocks with a two-level switch interconnect. Marvel
. Mukherjee SS, Silla F, Bannon P, Emer J, Lang S, Webb D () A
was an almost glueless MP system design for EV and comparative study of arbitration algorithms for the Alpha 
EV, using a D Torus. The network router and switches pipelined router. In: Proceedings of the th international con-
were integrated into the processor chips along with the ference on architectural support for programming languages and
RAMBUS memory controllers. operating systems, San Jose, – Oct 
. Tullsen DM, Eggers SJ, Emer JS, Levy HM, Lo JL, Stamm RL
Alphas were integrated into Cray’s TD and TE
() Exploiting choice: instruction fetch and issue on an imple-
designs. These systems combined the fast Alpha micro- mentable simultaneous multithreading processor. In: Proceed-
processor with an external memory system. The nodes ings of the rd annual international symposium on computer
were connected with a D toroidal mesh network. The architecture, Philadelphia, pp –, – May 
ASCI Q system at Los Alamos consisted of a network of . Emer J () EV: the post-ultimate alpha. In: Keynote PACT
a thousand four-socket SMP systems based on . GHz , Barcelona
. Emer J () Simultaneous multithreading: multiplying alpha
EV processors connected by network switches.
performance. In: Proceedings of microprocessor forum, San Jose
. Chrysos GZ, Emer JS () Memory dependence prediction
Bibliography using store sets. In: Proceedings of the th annual international
. Robert M () Supnik, Digital’s Alpha project. Commun ACM symposium on computer architecture, Barcelona, pp –,
():–  June– July 
Denelcor HEP D 

. Seznec A, Felix S, Krishnan V, Sazeides Y () Design tradeoffs Discussion


for the alpha EV conditional branch predictor. In: Proceed-
ings of the th annual international symposium on computer Initial Design
architecture, Anchorage, – May  The HEP computer system was originally conceived as a
. Espasa R, Ardanaz F, Gago J, Gramunt R, Hernandez I, Juan T, digital replacement for the analog computers that were
Emer J, Felix S, Lowney G, Mattina M, Seznec A () Tarantula: Denelcor’s traditional products. The objective was to
a vector extension to the alpha architecture. In: Proceedings of the
achieve comparable performance in doing what ana-
th annual international symposium on computer architecture
log computers do, namely solving nonlinear ordinary
D
(ISCA ‘), Anchorage
. Barroso LA, Gharachorloo K, McNamara R, Nowatzyk A, differential equations (ODEs). Algorithms for ODEs
Qadeer S, Sano B, Smith S, Stets R, Verghese B () Piranha: exhibit only fine-grained parallelism and are poorly
a scalable architecture based on single-chip multiprocessing. In: suited for vector computation. These characteristics
Proceedings of the th annual international symposium on com-
account for the several similarities between the HEP
puter architecture, Vancouver
and data flow architectures.
Initially, the HEP was a collection of heterogeneous
pipelined functional units connected by a high-speed
bus. It was implemented in Schottky TTL logic tech-
Decentralization nology. The planned functional units included an alge-
Peer-to-Peer braic module that implemented the basic -bit floating
point operations, a -bit wide Runge-Kutta integra-
tor unit, a first-in first-out buffer unit to help data
scheduling among the other functional units, a small
Decomposition memory for holding constants, and analog and digital
input and output units. A control processor called the
METIS and ParMETIS scheduler explicitly moved data among the several func-
tional units over the -bit high-speed bus. Synchro-
nization between the scheduler and the functional units
used full-empty bits provided with the bus-addressable
Deep Analytics input and output locations of each functional unit so
that the scheduler could stall until both producer and
Massive-Scale Analytics consumer were ready. Later, it was observed that in-
place atomic operations could be synchronized with
full/empty bits by consuming the old value and then
producing a new one.
Denelcor HEP A prototype of this system was built with funding
from BRL, the US Army Ballistics Research Labora-
Burton Smith tory (now called the Army Research Laboratory) in
Microsoft, Redmond, WA, USA
Aberdeen, Maryland. A follow-on contract with them
called for more robust technology, higher performance,
Synonyms and programmability in a high-level language (tacitly
Heterogeneous element processor assumed to be Fortran). These requirements resulted in
major architectural changes.
Definition
The HEP was a general-purpose shared memory mul- Process Execution Modules
tiprocessor system manufactured by Denelcor, Inc. of To improve performance, multi-threaded algebraic
Denver, Colorado. Design began in  and a total of modules (renamed Process Execution Modules or
six systems were delivered between  and , the PEMs) were designed that could concurrently execute
year Denelcor closed its doors. multiple instruction sequences independent of the
 D Denelcor HEP

scheduler. They had -bit data paths throughout registers of the PEM as shared variables collected into a
and were implemented entirely in K series ECL. A special register common block. Full-empty synchroniza-
PEM had  hardware instruction streams (threads) tion was invoked by prefixing the variable name with
to tolerate both functional unit and synchronization a currency symbol ($); this turned stores into produce
latency. An additional  threads were implemented operations and loads into consumes. Subscripted vari-
for the operating system. User traps and system calls able references used the register index and needed a
blocked the trapping thread and started a supervi- four-instruction sequence for the address computation.
sor thread at the appropriate trap handler entry point. Subroutines could be either called or created. The latter
A PEM implemented eight protection domains, one would create a hardware thread to run the subroutine,
of which was used by the operating system; the rest or trap if the number of hardware threads became too
were made available to run independent user jobs close to .
concurrently. Excluding division, the maximum functional unit
 -bit general-purpose registers, each equipped latency in a PEM was eight -ns cycles. Programs for
with a full/empty bit, and  -bit constant regis- this prototype one-PEM system that did very little syn-
ters were provided in each PEM. Both general register chronization or division exhibited no speed-up above
and constant register addresses could be indexed using eight hardware threads. Division was not pipelined,
a per-thread register index or constant index, and were but a PEM could have from one to eight divide mod-
further translated by a pair of base addresses associ- ules. Division took about  cycles. The floating point
ated with the protection domain to which the thread arithmetic format was compatible with the IBM .
belonged. There was a limit register as well for gen- A few unusual features were provided such as unnor-
eral register addresses. Instructions were  bits wide malized add and subtract operations and a loss-of-
and were stored in a dedicated per-PEM program mem- significance trap.
ory with base and limit registers for each protection The PEM hardware was hosted by an Interdata /
domain. The integrator and first-in first-out units had minicomputer, which had compatible arithmetic and
been subsumed by the PEMs and disappeared, leaving was responsible for compilation as well as operating sys-
only the scheduler and input-output functions to be tem functions. Several papers describe computational
dealt with. experiments with this system for ODEs [], linear
To address the high-level language requirement, algebra[], and graph shortest-path problems [], and
advice was sought from a compiler development com- the system was demonstrated successfully to BRL in
pany, Massachusetts Computer Associates. They recom- November .
mended replacing the scheduler with a multi-ported
shared data memory that all PEMs could access in par- Shared Memory
allel. Denelcor agreed with this assessment but insisted The multi-threaded PEMs were already able to toler-
that all shared data memory locations would need to be ate both functional unit and synchronization latency.
equipped with full-empty bits. New PEM instructions Since functional unit latency was variable, unlike the
would be required, as well as memory base and limit peripheral processor “barrel” of the CDC  and
registers per protection domain. Given all the changes  systems, the thread issue order was not round-
to the original HEP design, BRL decided to require a robin but out-of-order, requiring only that the pre-
single PEM be built and demonstrated running a high- ceding operation of that hardware thread had been
level language before the shared memory design could completed. Now that the time had come to design a
begin in earnest. multi-ported shared memory, the natural approach was
to extend latency tolerance to the memory system by
Register Fortran implementing a fully pipelined packet switched net-
The requirement to implement Fortran without data work connected to multiple pipelined memory banks
memory was a challenging one. An existing Fortran rather than to use the circuit switches then popular
compiler was modified to use most of the  general in existing systems. Each three-port packet routing
Denelcor HEP D 

node occupied two circuit boards and could handle Software


a new memory request or response packet every  The Fortran compiler was modified to address the new
ns on each port. The clock period was  ns, and so memory with little difficulty. The memory full/empty
a packet took four clocks to transit any point in the bits were invaluable, and had unexpected uses. Loops
network. were executed in parallel with an arbitrary number of
To make the routers implementable, there was no hardware threads by self-scheduling [] the loop itera-
buffering to accommodate conflicts for output ports; tions: A thread seeking more iterations to execute would D
instead, each packet was immediately routed to some consume the iteration counter value, produce a new
output port, which might be the wrong one if there was value incremented by a constant quantum, and then
a routing conflict. This scheme, called “hot-potato” or execute the iterations it had just acquired. The program-
“pure deflection” routing, was invented by Paul Baran in mer wrote something like
 for the Internet. The HEP adapted the idea to com-
I = $C
puter interconnect. A misrouted packet had its priority
$C = I + K
increased to make it more likely to win future conflicts,
and once a packet reached maximum priority, the rout- The quantum K would be chosen big enough to amor-
ing tables were programmed to send these packets an tize scheduling overhead.
Euler tour of the entire network to guarantee no further At the end of most loops, it was necessary to make
conflict losses. Gore-Tex twisted pair cables were used sure all iterations were complete; this was accomplished
as in the Cray , permitting about  m of cable between by having each thread atomically decrement a counter
routing nodes. The latency of a routing node and its and then wait at another location until the last thread,
associated cabling was  ns, which meant the graph the one that decremented the counter to zero, freed all
of the network to be -colorable to guarantee correct the rest. Harry F. Jordan named the underlying abstrac-
timing. tion barrier synchronization [] after the barrier used
The newly designed functional unit that connected to start horse races because in both cases, all must
the PEM to memory via the network was called the SFU, enter before any may leave. Later, a pattern of pair-
this abbreviation variously standing for Storage Func- wise produces and consumes resembling the wiring of
tion Unit or Scheduler Function Unit (since it indeed an interconnection network was used to scale to larger
replaced the original scheduler). Informally, the SFU numbers of threads. This was popularly called the but-
was fondly known as the Strange Function Unit, chiefly terfly barrier. Loops containing linear recurrences were
because it functioned in an unusual way. Instead of parallelized in a similar way using parallel cyclic reduc-
reporting synchronization failure, e.g., from an attempt tion [] or one of the schemes due to Richard Ladner
to consume an empty memory location, back to the and Michael Fischer [].
thread scheduling logic so the instruction could be reis- Harry Jordan developed a set of macros in M for
sued, it would repetitively retry the operation itself. automating much of the HEP Fortran programmer’s
Since a hardware thread would not be allowed to issue task including self-scheduling and barrier synchroniza-
its next instruction until the SFU indicated the preced- tion. This language extension was called The Force []
ing one was complete, the semantic effect was the same. and was the first Single Program, Multiple Data (SPMD)
The major difference was that synchronization con- language. It was later implemented for other systems
sumed a small amount of memory bandwidth and no including Flex/, Encore Multimax, Sequent Balance,
instruction issue bandwidth at all. If there were enough Alliant FX series, Cray , Cray Y-MP, and Convex C.
hardware threads, typically –, executing in parallel IBM research adopted these ideas for the RP [],
and tolerating all latencies, the HEP would steadily exe- and they achieved widespread adoption in OpenMP.
cute very close to one instruction per clock per PEM. Argonne National Laboratory wrote their own collec-
This made each PEM the rough equivalent of a CDC tion of M macros to implement a programming model
 or IBM / and about half the scalar speed of a based on monitors and message passing. This package
Cray . was simply known as the “Argonne Macros” [] and
 D Denelcor HEP

evolved into the P and PARMACS languages for both MIGs. MBB started with one PEM but increased the size
C and Fortran. of the system to three PEMs during the mid-s.
The first implementation of the functional language BRL, the US Army Ballistics Research Laboratory
Sisal was done on the HEP by Rodney R. Oldehoeft that funded HEP development, acquired a four-PEM
and his colleagues []. Sisal was also implemented on system around . Their applications included not
Cray vector processors and on the Manchester Data only physical simulations but also ray-tracing for ren-
Flow Machine. The HEP let full/empty bits be used dering three-dimensional computer-aided design files.
in a way in which loads did not consume but left the The application that resulted from this work was named
memory location full; this permitted a direct imple- BRL-CAD []. BRL was also a pioneer in using the Unix
mentation of Sisal’s single-assignment variables, known operating system for high-performance computing and
as I-structures [] in the data flow community. That strongly encouraged Denelcor to port Unix to the HEP.
community in turn adopted the producer-consumer The resulting system, HEP/UPX, was delivered in early
and atomic update semantics of full/empty bits in the . It was based on AT&T System  Unix, ran sym-
concept of M-structures []. metrically on all PEMs, and was highly parallel inter-
Parallel divide-and-conquer looked very simple in nally. Another four-PEM system was delivered to the
HEP Fortran but was often not simple at all; the sys- US Department of Defense, and a one-PEM system was
tem would run out of memory due to an excess of delivered to Shoko Ltd. in Japan. Both of these systems
parallelism and its associated state. This problem was were acquired for evaluation of the general potential of
solved by judiciously limiting recursion depth. A much the HEP architecture for parallel computing.
more elegant solution that needed no assist from the
programmer was later devised and implemented by Related Entries
Robert H. Halstead in Multilisp []. CDC 
Data Flow Computer Architecture
Flynn’s Taxonomy
Applications and Customers IBM System/ Model 
There were six HEP machines delivered to customers. MIMD (Multiple Instruction, Multiple Data) Machines
The first was a one-PEM system that went to Los Alamos Multi-Threaded Processors
National Laboratory, which was interested in using the Networks, Direct
system for their nonvector work. Monte Carlo parti- Networks, Multistage
cle simulation was one such application. The problem Sisal
of parallel random number generation was first solved SPMD Computational Model
in using the HEP in addressing this kind of computa- Synchronization
tion [] and was quickly applied to vector processors as Tera MTA
well. Argonne National Laboratory also acquired a one-
PEM HEP system and used it for automated reasoning Bibliography
applications. Argonne also taught a summer school in . Allan SJ, Oldehoeft RR () HEP SISAL: parallel functional
parallel computing, and many students were exposed to programming. In: Parallel MIMD computation: HEP supercom-
HEP programming there. puter and its applications. Massachusetts Institute of Technology,
Cambridge, MA, pp –
A system was delivered to Messerschmitt-Bolkow-
. Boyle J, Butler R, Disz T, Glickfeld B, Lusk E, Overbeek R, Patter-
Blohm (MBB) outside Munich for simulating the flight son J, Stevens R () Portable programs for parallel processors.
dynamics of the Tornado fighter in real time. The Holt, Rinehart, and Winston, New York
original mission of the HEP, solving ordinary differ- . BRL-CAD main website: BRL-CAD ∣ Open Source Solid
ential equations, was part of the job. The rest of it Modeling
. Darema F, George DA, Norton VA, Pfister GF () A single-
involved interfacing to physical hardware, in particu-
program-multiple-data computational model for Epex/Fortran.
lar the avionics equipment for controlling the fighter’s Parallel Comput ():–
flight dynamics, the cockpit controls, and the vision sys- . Deo N, Pang CY, Lord RE () Two parallel algorithms for
tem that provided the heads-up display and the view of shortest path problems. In Proceedings of the  international
the adversary aircraft, which in this case were Russian conference on parallel processing. New York
Dense Linear System Solvers D 

. Frederickson P, Hiromoto R, Jordan TL, Smith B, Warnock T systems, and based on the block-Cholesky factoriza-
() Pseudo-random trees in Monte Carlo. Parallel Comput tion for symmetric positive definite systems or symmet-
():– ric indefinite systems. They take advantage of parallel
. Halstead RH (Oct ) MULTILISP: a language for concur-
architectures exclusively through efficient implementa-
rent symbolic computation. ACM Trans Program Lang Syst ():
– tion of vector and matrix primitives on such computing
. Jordan HF () Performance measurements on HEP: a platforms. In fact, most ScaLAPACK [] procedures are
pipelined MIMD computer. In: Proceedings of the th annual generated by replacing the BLAS in the correspond- D
international symposium on computer architecture. Stockholm, ing LAPACK procedures by their parallel counterparts
Sweden, pp –
PBLAS.
. Jordan HF () Structuring parallel algorithms in an
MIMD, shared memory environment. Parallel Comput ():
–
Nonsymmetric Systems
. Kumar SP, Lord RE () Solving ordinary differential equations
on the HEP computer. In: Kowalik JS (ed) Parallel MIMD com-
Consider first the nonsymmetric system Ax = f where
putation: HEP supercomputer and its applications. Massachusetts A ∈ Rn×n is nonsingular. For the sake of illustration,
Institute of Technology, Cambridge, MA, pp – let n = k where k is the chosen block size. The first
. Ladner RE, Fischer MJ () Parallel prefix computation. J ACM step in the procedure is to obtain the LU-factorization
():– of the leading k columns with partial pivoting using the
. Lord RE, Kowalik JS, Kumar SP () Solving linear algebraic
classical unblocked Gaussian elimination scheme, for
equations on an MIMD computer. J ACM ():–
. Nikhil RS, Arvind () Implicit parallel programming in pH. example, see []. Thus if,
Morgan Kaufman, San Francisco
. Smith BJ () Architecture and applications of the HEP multi- ⎛A A ⎞
 A
processor computer system. In: Proceedings of SPIE Real-Time ⎜  ⎟
⎜ ⎟
Signal Processing IV. vol , New York, pp – ⎜
P ∗ A = ⎜ A A A ⎟ ()

. Sweet RA () A parallel and vector variant of the cyclic reduc- ⎜ ⎟
⎜ ⎟
tion algorithm. SIAM J Sci Stat Comput ():– ⎝ A A A ⎠

in which P is the permutation matrix affecting the


partial pivoting in the LU-factorization of the first k
Dense Linear System Solvers columns of A, then

Bernard Philippe , Ahmed Sameh ⎛A ⎞ ⎛L ⎞


⎜  ⎟ ⎜  ⎟

Campus de Beaulieu, Rennes, France ⎜ ⎟ ⎜ ⎟
⎜ A ⎟ = ⎜ L ⎟ ∗ U , ()
 ⎜  ⎟ ⎜  ⎟
Purdue University, West Lafayette, IN, USA ⎜ ⎟ ⎜ ⎟
⎜ ⎟ ⎜ ⎟
⎝ A ⎠ ⎝ L ⎠
Synonyms
Cholesky factorization; Direct schemes; Gaussian elim- where L and U are k × k unit lower triangular, and
ination; Linear equations solvers upper triangular matrices, respectively. The remainder
of the first block row of the upper triangular factor of A
is obtained by solving a triangular system with multiple
Definition
right-hand sides,
A dense system is given by
Ax = f , () L (U , U ) = (A , A ) ()
where A ∈ R and b ∈ R , with A having relatively
n×n n

very few zero elements. using the BLAS- primitive DTRSM. Setting

Discussion ⎛A A ⎞ ⎛ A A ⎞ ⎛ L ⎞


⎜  ⎟ := ⎜ ⎟ − ⎜ ⎟ (U , U ) , ()
These are direct schemes based on block-Gaussian ⎜ ⎟ ⎜ ⎟ ⎜ ⎟
⎝ A A ⎠ ⎝ A A ⎠ ⎝ L ⎠
elimination with partial pivoting, for nonsymmetric
 D Dense Linear System Solvers

which is obtained via the rank-k update BLAS- prim- and setting
itive DGEMM, the matrix P ∗ A may be expressed as
(L , L ) := P (L , L ) ()
⎛ L  ⎞ ⎛U U ⎞ resulting in the completed LU-factorization of A,
 U
⎜  ⎟ ⎜  ⎟
⎜ ⎟⎜ ⎟
P ∗ A = ⎜ ⎟⎜
⎜ L I  ⎟ ⎜  A A ⎟ .
⎟ () ⎛L   ⎞ ⎛ U U U ⎞
⎜ ⎟⎜ ⎟ ⎜  ⎟⎜ ⎟
⎜ ⎟⎜ ⎟ ⎜ ⎟⎜ ⎟
⎝ 
L  I ⎠⎝  A A  ⎠
Π ∗ A = ⎜ L L  ⎟ ⎜  U U ⎟
⎜ ⎟ ⎜
⎟ ()
 ⎜ ⎟⎜ ⎟
⎜ ⎟⎜ ⎟
One is now ready to obtain the unblocked LU- ⎝ L L L ⎠ ⎝   U ⎠
factorization of the leading k columns of the lower prin-
in which the permutation Π is given by
cipal submatrix of order k less, that is, of order n − k,
⎛I  ⎞ ⎛ Ik  ⎞
Π=⎜
k
⎛A ⎞ ⎛L ⎞ ⎟⎜ ⎟ P . ()
 ⎜ ⎟⎜ ⎟
P ⎜ ⎟ = ⎜  ⎟ ∗ U . () ⎝  P  ⎠ ⎝  P ⎠
⎜ ⎟ ⎜ ⎟
⎝ A ⎠ ⎝ L ⎠
This scheme is described in many references but the
Setting presentation above is closest to that in [], where it
⎛A ⎞ ⎛A ⎞ is called the right-looking algorithm. It is the scheme
⎜  ⎟ := P ⎜  ⎟ ∗ U and () implemented in the routines DGETRS of LAPACK and
⎜ ⎟ ⎜ ⎟
⎝ A ⎠ ⎝ A ⎠ PDGETRS of ScaLAPACK [].
The solution of Ax = f is then obtained by block-
⎛L ⎞ ⎛L ⎞ forward and block-backward sweeps using the routines
⎜  ⎟ := P ⎜  ⎟ , ()
⎜ ⎟ ⎜ ⎟ DGETRF of LAPACK and PDGETRF of ScaLAPACK.
⎝ L ⎠ ⎝ L ⎠
one can now obtain the remainder of the second block Symmetric Positive-Definite Systems
row of the upper triangular factor of A by solving the For symmetric positive definite (s.p.d.) systems, one
triangular systems adopts the Cholesky factorization, A = U T U, where U
is upper triangular, through a block-oriented algorithm.
L U = A , ()
For the sake of illustration, let A be partitioned as
using the corresponding BLAS- primitive. Setting
⎛A A ⎞
 A
A := A − L U , () ⎜  ⎟
⎜ ⎟
A=⎜ T ⎟
which is also obtained via BLAS-, one obtains the ⎜ 
A A  A  ⎟ , ()
⎜ ⎟
factorization ⎜ ⎟
T T
⎝ A A A ⎠
⎛L  ⎞
⎛I  ⎞ ⎜  ⎟ in which Aij ∈ Rk×k with k being the chosen block size.
⎜ ⎟
⎜ k ⎟ ∗ P ∗ A = ⎜ L L ⎟
⎜ ⎟ ⎜    ⎟
⎝  P ⎠ ⎜ ⎟
⎜ ⎟ Step :
⎝ L L I ⎠ Obtain the Cholesky factorization of the s.p.d. subma-
trix A using the unblocked Cholesky algorithm, using
⎛U U ⎞
 U BLAS- and BLAS-, for example, see [],
⎜  ⎟
⎜ ⎟
⎜  U ⎟ A = UT U ,
⎜  U ⎟ . () ()
⎜ ⎟
⎜ ⎟
⎝   A ⎠ where U is upper triangular of order k.

Now, for n = k, the last step consists of the unblocked


Step :
LU-factorization
The remainder of the first block row of U is obtained
P A = L U () by solving a triangular system with multiple right-hand
Dense Linear System Solvers D 

sides (Level- of BLAS), data is accessed only when absolutely needed, for exam-
ple, for pivoting, and (b) the Crout algorithm. Charac-
UT (U , U ) = (A , A ) . () teristics of these versions are discussed by Dongarra,
Duff, Sorensen, and van der Vorst in their survey [].
Step : The partial pivoting strategy guarantees a sound fac-
Setting torization but it limits the parallel efficiency due to loss
of data locality. As a remedy, Gallivan, Plemmons, and D
⎛A A ⎞ ⎛ A A ⎞ ⎛ U
T⎞ Sameh discuss the pairwise-pivoting strategy in []. An
⎜  ⎟ := ⎜ ⎟−⎜ ⎟ (U , U ) , ()
⎜ ⎟ ⎜ ⎟ ⎜ ⎟ error analysis of pairwise pivoting is given by Sorensen
⎝ A A ⎠ ⎝ A A ⎠ ⎝ U ⎠
T T T
in []. While the upper bound on the error of the result-
ing factorization is worse than that of the classical LU-
which is obtained using the BLAS- primitive DGEMM, factorization with partial pivoting, extensive numerical
one has the partial factorization experiments show that both factorization schemes yield
almost identical solutions of the linear system.
⎛ UT   ⎞ ⎛U U ⎞
 U The above block-oriented algorithms are quite
⎜  ⎟ ⎜  ⎟
⎜ ⎟⎜ ⎟
A=⎜ T ⎟ ⎜ ⎟ robust and realize reasonably high performance on a
⎜ U I  ⎟ ⎜  A A ⎟ . ()
⎜ ⎟⎜ ⎟ variety of parallel architectures. In the early days of
⎜ ⎟⎜ ⎟
T T
⎝ U  I ⎠ ⎝  A A ⎠ parallel computing, some work was done regarding
the theoretical limit of parallelism inherent in Gaus-
Similar to the LU-procedure for nonsymmetric systems, sian elimination. Csanky [] made use of Leverrier’s
steps – are repeated twice to complete the block- method [] to show that a dense system of order n
Cholesky factorization A = U T U with U being of the can be solved in O(log  n) parallel arithmetic opera-
form tions employing O(n ) processors. Later, Preparata and
⎛U U ⎞ Sarwate reduced the number of required processors to
 U
⎜  ⎟ O(n. /log  n) taking advantage of properties of New-
⎜ ⎟
U = ⎜  U U ⎟

⎟. () ton identities [], and using Strassen’s method for matrix
⎜ ⎟
⎜ ⎟ multiplication [].
⎝   U ⎠

Again the solution of Ax = f is obtained by block-


Bibliography
forward and block-backward sweeps implemented in
. Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J,
routines of LAPACK and ScaLAPACK based on Level- Dhillon I, Don-garra J, Hammarling S, Henry G, Petitet A,
of BLAS. Stanley K, Walker D, Whaley RC () ScaLAPACK: A linear
algebra library for message-passing computers. In-formation
bridge: DOE scientific and technical information, #,
Indefinite Symmetric Systems available at http://www.osti.gov/bridge/servlets/purl/-
Unlike the symmetric positive definite case (Cholesky dBCNwX/webviewable/.pdf, 
factorization), one needs a pivoting strategy . Bunch JR, Parlett BN () Direct methods for solving symmet-
(Bunch–Parlett strategy []) for obtaining the factoriza- ric indefinite systems of linear equations. SIAM J Numer Anal
tion U T DU, where U is unit upper triangular and D is :–
. Csanky L () Fast parallel matrix inversion algorithms. SIAM J
block diagonal with  − by −  and  − by −  diago-
Comput :–
nal blocks. Aside of the pivoting scheme, and row and . Dongarra JJ, Duff IS, Sorensen DC, Van der Vorst H ()
column pivoting, the main procedure steps are almost Numerical linear algebra for high performance computers. SIAM,
identical to the schemes discussed above. Philadelphia, PA
. Gallivan KA, Plemmons RJ, Sameh AH () Parallel algorithms
for dense linear algebra computations. In: Gallivan KA, Heath MT,
Other Parallel Algorithms Ng E, Ortega JM, Peyton BW, Plemmons RJ, Romine CH, Sameh
Variants of the algorithm above include (a) the left- AH, Voigt RG (ed) Parallel algorithms computations. SIAM, Phi-
looking version of block Gaussian elimination in which lapdelphia, PA
 D Dependence Abstractions

. Golub GH, Van Loan CF () Matrix computations, rd edn. used in all optimizing compilers as use-def chains, as
John Hopkins, Baltimore, MD defined by Aho et al. []. Let’s consider the sequence:
. Householder A () The theory of matrices in numerical analy-
sis. New York, Dover S1: x = 2;
. Sorensen DC () Analysis of pairwise pivoting in Gaussian S2: y = 3*x+1;
elimination. IEEE Trans Comput :–
. Strassen V () Gaussian elimination is not optimal. Numer The compiler cannot guarantee that the value of y
Math :– is the same after executing S1;S2; or S2;S1;. This
is pretty obvious in a sequence, but S1 and S2 may be
executed many times during the program execution, for
instance because they are located in loops. So the com-
Dependence Abstractions piler has to deal with all instances of Statements S1 and
François Irigoin
S2, in all possible execution traces.
MINES ParisTech/CRI, Fontainebleau, France Furthermore, the memory locations they access may
be computed dynamically when pointers or arrays or
Synonyms function calls are used as in:
Dependence accuracy; Dependence approximation; S1: x[i] = foo(x, y, 2);
Dependence cone; Dependence direction vector; Depen- S2: y[j] = 3*x[k]+1;
dence level; Dependence polyhedron; Exact depen-
dence; Use-def chains Without additional information, it is impossible to
decide whether S2 could be executed before S1. It is
Definition unlikely, but the decision hinges on the ranges of Vari-
A data dependence between two computer instruc- ables i, j, and k and on the behavior of Function foo
tions is due to some value passing or resource con- (See Banerjee’s Dependence Test, Dependences,
flict. Dependencies usually cannot be known exactly Dependence Analysis, GCD Test, Omega Test).
at compile time. They must be modelized and over- To sum up, an exact knowledge of statement depen-
approximated with different abstractions to be used dencies depends on the program input and its resulting
statically by compilers and to decide if program trans- trace on one hand, and on our ability to specify statically
formations and optimizations, such as loop paralleliza- the dynamic statement instances and memory loca-
tion or locality improvement, can be applied without tions accessed on the other. Since the exact dependence
changing the program results. The characteristics of information is not computable in general, abstractions
the dependence abstractions are linked to the program must be used to make the problem tractable in a com-
transformations whose legality must be proved. All piler. These abstractions must be over-approximations
dependence abstractions defined up to  are fami- to over-constrain the program optimizations and to
lies of convex affine sets, but different kinds of restric- guarantee their legality.
tions apply to the affine constraints. They are ordered To start with, program inputs are ignored. Then
from a less accurate to a most accurate abstraction, and a decision must be made about the representation of
they are all optimal with respect to at least one kind of statement instances. Finally, a decision must be made
transformation: the set of legal transformations cannot about the representation of sets of arcs between state-
be increased by increasing the dependence abstraction ment instances. These decisions must be made accord-
accuracy. ing to the program transformations that the compiler
implements.
Discussion
When two statements of a program use and/or define
Set Abstractions
the same memory location, their execution order can-
Finite sets can be represented exactly in extension, but
not be modified by the compiler to optimize the execu-
unbounded sets such as the number of iterations of a
tion without modifying the program’s result when the
loop:
semantics of the operations is unknown. This sufficient
constraint is known as Bernstein’s condition [] and is for(i=0; i<n; i++) {}
Dependence Abstractions D 

can only be represented implicitly via predicates. Dif- pieces of code, which include sequences of loops, are
ferent classes of predicates can be used, but Pres- called static control code (See Polyhedron Model):
burger arithmetic is the most powerful framework
that is still computable. It subsumes less powerful for(i=0; i<n; i++) {
frameworks such as convex polyhedra, i.e., affine con- S1: a[i] = 0.;
straints, affine subspaces, i.e., affine equations, and their for(j=0; j<m; j++) {
sub-frameworks obtained by reducing the number of if(i!=j) D
nonzero coefficients in the constraints, e.g., one or two, S2: a[i] += b[i][j];
and/or the magnitude of the coefficients, e.g., −, , or . }
Bounded lists of abstract sets can be used to increase the }
accuracy of a given framework by providing a limited
exact set union. Within a loop nest, two statements S1 and S2 may
appear at different depths and/or may be nested within
Sets of Statement Instances different loops. Their iteration vectors can be limited to
Program parallelization has primarily been focused on their common nesting loops. Such an iteration vector
loop parallelization because its optimization poten- does not always represent one instance of S1 or S2, but
tial linked to the iteration counts is greater than task, may represent a whole set of them:
instruction level, and superword level parallelizations.
The scope of interest is then limited to one loop for(i=0; i<n; i++) {
or to a loop nest. Each instance of a statement in the for(j=0; j<m1; j++) {
loop body can be known exactly by its iteration vector, S1: a[i][j] = 0.;
whose components are the values of the surrounding for(j=0; j<m2; j++) {
loop indices, from the outermost to the innermost one. S2: b[i][j];
Note that if some control structure is used within the }
loop body, some abstract instances may not exist in a }
real execution trace because some condition is true or
false, but the set of abstract instances is greater than the Here S1 and S2 have only one common loop, the
set of effective instances, which is safe for most analyses i loop. Abstract instances of S1 or S2 represented by i
and transformations. Note also that the nesting depths contain each a set, possibly empty, of effective instances
of two statements S1 and S2 may be different because controlled by the j loops.
all statements appearing within the loop nest do not When all statements in a loop body appear at the
always have the same nesting loops. The iteration (or same depth, the loop nest is said to be perfectly nested
instance) vectors i for S1 and i for S2 may not be com- and statement instances are accurately represented by
parable. A set abstraction must be chosen to define the the loop-nest iteration vector, at least when the loop
sets of instances of each statement. bounds are compatible with the set abstraction cho-
sen (See Loop Nest Parallelization and Unimodular
for(i=0; i<n; i++) {
Transformations).
S1: a[i] = 0.
It is sometimes convenient to give a number to each
for(j=0; i<m; j++) {
statement and to integrate this number in its itera-
S2: a[i] += b[i][j];
tion vector. Thus, the set of all statement instances is
}
abstracted by a set of vectors, which are easier to handle
}
mathematically than several sets.
In some cases, statement instances may be defined To sum up, some set abstraction must be chosen,
exactly over a small piece of source code because all iter- for instance convex sets, as well as some coding rules
ation sets are defined exactly and because all conditions to handle sequences, tests, and loop nesting. Control
are predicates over the iteration vectors. The loop bounds structures such as goto and exception preclude such
and the test conditions must belong to the chosen set instance abstractions and hence loop parallelization and
abstraction, e.g., affine constraints and convex sets. Such optimization where they occur.
 D Dependence Abstractions

Sets of Dependence Arcs ● The dependence polyhedron, DP


Basically, Bernstein’s conditions for Statements S1 and ● The dependence set D, a.k.a. dependence distance
S2 must be checked for every instance pair of S1 when the dependences are constant
and S2. In the compiler, a collecting semantics is used ● The dependences between iterations, DI
to project the information about statement instances
onto source statements. Information about arc instances They have been listed here from the less accurate to the
is built on the abstract instance chosen, usually a most accurate one.
vector, using a new arc predicate, i.e., yet another The dependence between iterations, DI, is the most
abstract set. accurate one and it stands out because the predicate
Each instance of S1 are defined by an integer vec- associated is a predicate over two iteration vectors, i
tor i and a set predicate P (i ). Each instance of S2 and i . It is the most flexible as it includes the modeliza-
are defined by a vector i and a set predicate P (i ). tion of the loops surrounding i , of the loops surround-
Dependence arcs between S1 and S2 are pairs (i , i ) ing i , and of the relationship between i and i . It has
meeting Conditions P (i ) and P (i ) and some arc been used by Feautrier [] and Maydan et al. [] (See
Polyhedron Model).
dependence condition A(i , i ), where A belongs to
some set abstraction. All other dependence abstractions assume that i
So the dependence abstraction is built with an and i are comparable vectors, i.e., they are related to
instance abstraction and an arc abstraction. Predicates the same surrounding loops. Since we need to know if
P and P are usually easy to derive from the program i occurred before or after i , i and i can be replaced
internal representation. The dependence predicate A by the dependence vector d, with d = i − i . Arcs (i , i )
requires some dependence test (See Banerjee’s Depen- are replaced by dependence distances d.
dence Test, GCD Test, Omega Test). D is the exact set of dependence distances. It can be
Note that the arc set abstraction may be as simple used when its cardinal is bounded because the depen-
as  and , empty set and non-empty set. The empty set dencies are constant. For instance, the loop nest:
is modelized by , i.e., no instance of S2 ever depends for(i=0; i<n; i++)
on any instance of S1, and the non-empty sets by , S: a[i] = a[i-1] + b[i];
i.e., there may exist at least one instance of S2 that
depends on at least one instance of S1. Again, a safe contains dependences between instances of S and other
over-approximation is used. In case of doubt, a depen- instances of S. The i instance of S depends on the
dence arc is assumed. This safety property must be result of the i −  instance of S. This was used
insured by the dependence test. by Karp [] for scheduling computations and intro-
Note also that S1 and S2 may denote the same state- duced by Muraoka [] for loop parallelization (See
ment S, as some instances of S can depend on other Parallelization, Automatic). The systolic community
instances of S. also used this representation (See Systolic Arrays).
The dependence polyhedron, DP, is derived from
DI, when Predicates P , P , and A are affine, by adding
Dependence Abstractions
the affine equation d = i − i and by projecting i and i .
Many dependence abstractions have been defined since
The dependence cone DC, introduced by Irigoin and
the early seventies because there is a relationship
Triolet [], is the transitive closure of DP. So it is a little
between the program transformations applied by a
bit less accurate, but easier to use because its dual rep-
compiler and the necessary accuracy of the dependence
resentation is simplified. Vertices are transformed into
information.
rays, and only rays have to be dealt with.
In [], Yang et al. survey the usual dependence
The dependence direction vector abstraction, defined
abstractions:
by Wolfe in [], is also a convex set, but the acceptable
● The dependence levels, DL constraints are signed constraints such as c >  (i.e.,
● The dependence direction vectors, DDV c ≥ ), c ≥ , c = , c <  (i.e., −c ≥ ), and c ≤  (i.e.,
● The dependence cone, DC −c ≥ ), where c is any component of the dependence
Dependence Abstractions D 

vector d. As mentioned earlier, this is a special case of transformation is introduced, no dependence approxi-
affine convex set: the constraints are limited to one vari- mation might be available that is accurate enough for it
able, the coefficient is either  or −, and the constant and a new one has to be designed.
term is either  or . Yang et al. [] shows that the dependence level,
The dependence level abstraction used by Allen [] DL, is the best abstraction for loop reversal and loop
is even simpler. The only constraints that can be used parallelization, but dependence direction vectors are
for the components of d are c =  and c ≥ . necessary to perform loop permutation (a.k.a. loop D
Other intermediate abstractions include the depen- interchange) successfully. For unimodular loop trans-
dence vector abstraction of Sarkar et al. [] which adds formations, for scalable loop tiling (see Tiling), and
one-variable equations such as c = n, where n is a for multidimensional affine scheduling, the dependence
numerical constant to the dependence direction frame- cone is the best dependence abstraction. Some trans-
work, and the abstraction by Wolf et al. [] which also formations require the dependence set D or the depen-
extends the dependence direction vector and Sarkar’s dence polyhedron DP because they rely on the minimal
abstraction by using constant terms different from  size of a dependence arc. For instance, the odd and
and . This leads to constraints bounding c such as n ≤ even iterations in the following loop can be executed
c ≤ . With n = n , Sarkar’s constraints are obtained concurrently:
and hence the abstraction order.
for(i=0; i<n;i++) {
Note that simpler less accurate abstractions can eas-
a[i] = a[i-2]+b[i];
ily be derived from more complex more accurate ones
}
with simple algebraic transformations. Note also that
dependence tests must be developed for each abstrac- For transformations dealing with sequences of loops
tion and that their results depend on the subscript and global affine schedules as defined by Feautrier [, ],
expressions used, on the pointer-based references, and information about the dependence arc, DI, is necessary,
on the procedure calls. Interprocedural dependence although, in the important case of loop fusion, some
testing requires some more abstraction to modelize the sufficient dependence information D can be computed a
effect of procedure calls on the local variables. posteriori since the statements S1 and S2 end up nested
Note also that more complex set abstractions such in the same loop by definition of the transformation.
as Presburger arithmetic, implemented by Pugh [] in Locality transformations, such as privatization or
the Omega library, and Feautrier’s QUAST [], which array scalarization, which may require information
uses finite lists to increase the accuracy of the underly- about the precision of a particular abstraction with
ing convex affine representation, are not covered here respect to the exact information, are not covered here
(see Polyhedron Model). (see Parallelization, Automatic).

Relationships Between Dependence Future Directions


Abstractions and Program Dependence abstractions presented here are useful for
Transformations loop nest transformations. Better information is needed
Initial contributions, such as Karp’s [], Muraoka’s [], to schedule larger piece of codes, such as sequences of
and Lamport’s [], were based on exact constant depen- loop nest: the information about the source and sink
dences, also called uniform dependences because they iterations must be preserved, when possible. This is part
do not depend on the iterations. No abstraction of of the polyhedron model.
dependence was used, but the applicability is too lim-
ited because dependences are not always constant and Conclusion
dependence sets are not finite. Several abstraction steps must be taken to make depen-
One may wonder why so many different dependence dence analysis useful in an optimizing compiler. These
abstractions were introduced over the years. In fact, abstraction steps are often overlooked when a particular
each of them is linked to a program transformation or program transformation and dependence abstraction
to a class of program transformations. So, when a new are introduced, but a quick survey is useful to make
 D Dependence Accuracy

them understandable to newcomers to the automatic . Irigoin F, Triolet R () Computing dependence direction vec-
parallelization field. tors and dependence cones with linear systems, Technical Report
However, when designing an optimizing compiler, CAIE, CRI, MINES ParisTech
. Karp R, Miller R, Winograd S () The organization of compu-
it is important to understand the underlying affine base
tations for uniform recurrence equations. J ACM ():–
common to all abstractions and the impact of the choice . Lamport L () The parallel execution of DO loops. Commun
of dependence abstractions on accuracy and compu- ACM ():–
tational requirements. The operator complexity is also . Maydan D, Amarasinghe S, Lam M () Array dataflow anal-
often used for decision making, although the practi- ysis and its use in array privatization. POPL ’: Proceedings of
the th ACM SIGPLAN-SIGACT symposium on principles of
cal complexity is polynomial for most operations on
programming languages. pp –
unrestricted affine convex sets related to dependence . Muraoka Y () Parallelism exposure and exploitation in pro-
analysis. grams. PhD thesis, Report , Department of Computer Science,
UIUC, Feb 
. Pugh W (Aug ) A practical algorithm for exact array depen-
Related Entries dence analysis. Commun ACM ():–
Banerjee’s Dependence Test . Sarkar V, Thekkath R () A general framework for iteration-
Dependences reordering loop transformations. ACM SIGPLAN  con-
Loop Nest Parallelization ference on programming language design and implementation
(PLDI), ACM SIGPLAN Notices, ():–, San-Francisco
Omega Test
. Wolf M, Lam M () Maximizing parallelism via loop trans-
Parallelization, Automatic
formations. In rd Programming Languages and Compilers for
Polyhedron Model Parallel Computing, Irvine, – Aug 
Systolic Arrays . Wolfe M () Optimizing supercompilers for supercomputers.
Tiling PhD thesis, Department of Computer Science, UIUC
Unimodular Transformations . Yang Y-Q, Ancourt C, Irigoin F () Minimal data dependence
abstractions for loop transformations. Int J Parallel Program
():–
Bibliographic Notes and Further
Reading
The only paper dealing explicitly with the relation-
ships between different dependence abstractions is Dependence Accuracy
Yang’s []. Other papers deal only with one kind of
Dependence Abstractions
dependence. The latest editions of the textbook by
Aho et al. [] presents several uses of dependences,
e.g., scheduling, parallelization and memory allocation.
Dependence Analysis
Bibliography
. Aho V, Lam M, Sethi R, Ullman J () Compilers. Princi- Banerjee’s Dependence Test
ples, techniques & tools. Pearson Education, Addison-Wesley, Dependence Abstractions
Boston, MA Dependences
. Allen R, Kennedy K (Oct ) Automatic translation of FOR- Omega Test
TRAN programs to vector form. ACM Trans Program Languages
Syst (TOPLAS) ():–
. Bernstein AJ (Oct ) Analysis of programs for parallel process-
ing. IEEE Trans Comput EC-(), – Dependence Approximation
. Feautrier P () Dataflow analysis of scalar and array references.
Int J Parallel Program ():–
Dependence Abstractions
. Feautrier P (Oct ) Some efficient solutions to the affine
scheduling problem. Part I, one-dimensional time. Int J Parallel
Program ():–
. Feautrier P (Oct ) Some efficient solutions to the affine Dependence Cone
scheduling problem. Part II, multi-dimensional time. Int J Parallel
Program ():– Dependence Abstractions
Dependences D 

and a parallel program has a partial execution order. In


Dependence Direction Vector fact, an embarassingly parallel program, in which every-
thing can be done in parallel, has the empty execution
Dependence Abstractions order.
Methods for specifying operations and their execu-
tion order differ from one program style to another –
Dependence Level loop programs, recursive programs, and functional D
programs all have different conventions.
Optimizing compilers, and in particular paralleliz-
Dependence Abstractions
ing compilers, try to transform the source program
into an equivalent program that is better adapted to
the target architecture, or runs faster, or has more
Dependence Polyhedron parallelism. The problem is that in general, program
equivalence is undecidable. A possible solution is to
Dependence Abstractions restrict the class of admissible transformations to a
decidable subset. One of the classes of choice consists
of reordering transformations. The operations of the
transformed program are the same as in the source,
Dependences but they may be executed in a different order (includ-
ing parallel execution). The necessary information for
Paul Feautrier
validating such a transformation is the dependence
Ecole Normale Supérieure de Lyon, Lyon, France
relation:

Synonyms Two operations u and v are independent if the


Hazard (in hardware) programs u; v and v; u give the same results.
One can prove that a sufficient condition for the
Definition equivalence of two terminating programs which exe-
One needs a paradigm shift when reasoning about par- cute the same operations is that dependent operations
allel programs. The usual approach, as for instance in are executed in the same order in both programs.
denotational semantics, is to consider each statement According to this definition, the dependence rela-
as a function from memory state to memory state, and tion is symmetric. If one of the programs above is the
to consider two programs as equivalent if they imple- reference program (e.g., if it is the sequential program
ment the same function. For instance, x = ; x = x + ; to be parallelized), one usually orients the dependence
and x =  are deemed equivalent. However, if these two relation according to the execution order of the ref-
programs are run in parallel with x = , they may give erence program. With this convention, a transformed
different results. Similarly, asking whether a statement program is equivalent to the reference program if its
can be run in parallel with itself does not make sense. execution order is an extension of the dependence
However, the statement may be enclosed in a context – relation.
for instance, a loop – which causes it to be executed
repeatedly. Each repetition is an instance or operation,
and it makes sense to ask whether all or some of these
Discussion
operations can be executed in parallel. Bernstein’s Conditions
These considerations motivate the following defini- Deciding if the execution order of two operations can
tions. A program is a specification for a set of opera- be reversed can be arbitrarily difficult. Bernstein [] has
tions. Each operation is executed only once, and must devised a simple test, which gives a sufficient condi-
have a unique name. The program must also specify the tion for independence. This test depends only on the
order in which operations are executed. It is easy to see memory cells that are accessed by the operations to be
that a sequential program has a total execution order, tested.
 D Dependences

Let R(u) be the set of memory cells that are read which one considers only regular DO loops, and sub-
by u. Similarly, let W(u) be the set of memory cells scripts which are affine functions of the surrounding
which are written by u. Then u and v are indepen- loop counters and of constant parameters. A function is
dent if the three sets: R(u) ∩ W(v), R(v) ∩ W(u), and affine if it is the sum of a linear part and a constant. One
W(u) ∩ W(v) are all empty. may construct a dependence relation among operations
Note that in most languages, each operation writes (i.e., loop iterations) in the following way. To name an
at most one memory cell: W(u) is a singleton. However, iteration, one may use its iteration vector, whose coordi-
there are exceptions: multiple and parallel assignments, nates are the surrounding loop counters, ordered from
vector assignments among others. outside inward. Iterations are ordered according to the
Bernstein’s conditions reflect a kind of worst-case lexicographic order of iteration vectors:
reasoning. Consider:
⃗ı ≺ ⃗ȷ ≡ i < j ∨ (i = j ∧ i < j ) ∨ . . . ()
y = e; y = f;
The final value of y is given by the operation that exe- where ≺ (read “before”) is the execution order.
cuted last; in this case, this value is the value of f . If the Consider two iterations ⃗ı and ⃗ȷ of some loop nest,
operations are reversed, y will get the value of e, and in and two accesses to the same array, A[ f (⃗ı )] and
the absence of any information on the respective values A[ g(⃗ȷ)], one of them at least being a write. f and g
of e and f , one must conclude that the two operations are subscript functions, which relate subscripts to the
are dependent. surrounding iteration vectors. To be in dependence, the
If the dependence relation is computed according two iterations must satisfy the following conditions:
to Bernstein’s conditions, the above property can be
● They must access the same array cell: f (⃗ı ) = g(⃗ȷ).
extended in two directions. Firstly, it applies now also to
● They must be legal iterations, that is, each loop
nonterminating programs: one can show that the history
counter must be within the corresponding loop
of each variable (the succession of values it holds dur-
bounds, which define the iteration domains of the
ing execution) is the same for both programs. Secondly,
operations.
it also holds if one of the programs has real parallelism.
● The direction of the dependence is given by the
The reason is that two operations that write to the same
execution order ⃗ı ≺ ⃗ȷ.
location are always in dependence (this is the third
Bernstein condition) and hence can never be executed For programs that fit in the polyhedral model,
in parallel. Hence, two writes to the same memory cell the subscript equations and the iteration domain con-
never occur at the same time. straints are conjunctions of affine equalities or inequal-
ities. According to (), the ordering constraint can be
Scalars, Arrays, and Beyond split in several conjunctions. Hence, the set of iterations
Computing dependences for scalar variables is easy. The in dependence is the union of several disjoint depen-
sets R and W above are finite, and computing inter- dence polyhedra. Each polyhedron is characterized by
sections is trivial. The only real difficulty is caused by the number of equations at the begining of the execu-
aliasing, when two distinct identifiers refer to the same tion order, the so-called dependence depth, which goes
memory cell. The detection of aliasing is a difficult prob- from  to the number of loops that enclose both array
lem. However, it is easy to detect cases in which there is references. For some authors, the dependence depth (or
no aliasing, for example, among the local variables of a level) starts at  and goes up to infinity. The notation
procedure, and to handle all other cases conservatively. ⃗ı δ p ⃗ȷ is commonly used to state that there is a depen-
Orienting each dependence is easy, at least within linear dence from iteration ⃗ı to ⃗ȷ at depth p, that is, the first p
code (basic blocks). coordinates of ⃗ı and ⃗ȷ are equal.
The case of arrays is more problematic. Arrays usu- For more general programs, one has to resort to
ally occur in loops, and their subscripts usually differ approximations: a constraint is omitted or approxi-
from one iteration to the next in complex ways. This has mated if not affine. This has the effect of enlarging the
lead to the invention of the polyhedral model [, ], in dependence polyhedra, and hence reducing the amount
Dependences D 

of parallelism. However, the correctness of the gener- It can even be shown that some of the flow depen-
ated program is still guaranteed. dences are spurious. Let us consider a memory cell and
When a program uses pointers, the computation of an operation v that reads it. There may be many writes
dependences becomes much more difficult. The reason to this cell. Intuitively, the only one on which v really
is that, depending on the source language, a pointer depends is the last one u which is executed before v.
can point anywhere in memory, and that it is very The dependence from u to v is a direct dependence. If
difficult to decide whether two pointers point to the the source program is sufficiently regular, direct depen- D
same memory cell. The best that can be done is to dences can be computed using linear programming
associate to each pointer (conservatively) a region in tools []. It is, however, difficult to find conservative
memory, and to decide a dependence when two regions approximations for more general programs.
intersect.
Dependence Tests
Classification Early automatic parallelizers were concerned only with
While in Bernstein’s conditions the two operations play the existence or nonexistence of dependences. For
the same role, the symmetry is broken as soon as the instance, to decide that a loop is parallel, one has only
direction of the dependence is taken into account. One to show that there are no dependences between its iter-
usually distinguishes: ations, that is, all related dependence polyhedra are
empty.
● Flow dependences (or true dependences, or Produc-
Since dependences occur between a pair of state-
er/Consumer dependences [PC], or Read after Write
ments, it follows that the number of dependences
hazards [RAW]), in which the write to a memory cell
increases more or less as the square of the size of the
occurs before the read.
program. Hence, the scientists of the s initiated a
● Output dependences (or PP dependences, or WAW
quest for fast but approximate dependence tests. One
hazards) in which both accesses are writes.
may, for instance, observe that some of the dependence
● Anti-dependences (or CP dependences, or WAR
constraints are linear equations whose unknowns, the
hazards) in which the read occurs before the write.
loop counters, are integers. Such an equation can have
● In some contexts (e.g., when discussing program
solutions only if the gcd of the unknown’s coefficients
locality), one may consider Consumer/
divides the constant term. This observation is the basis
Consumer dependences, which do not constrain
of the very fast gcd test.
parallelization.
The Banerjee test [] is based on the observation
If the objective is just to decide which operations that in many cases, when solving an equation f (x) = ,
can be executed in parallel, all three kinds are equiva- one knows a lower bound a and an upper bound b for
lent. Differences appear as soon as one considers mod- x, which come, most of the time, from loop bounds.
ifying the source code for increased parallelism. It is Now, the equation has solutions only if f (a) and f (b)
easy to see that in a flow dependence, a value which are of opposite signs. The approximation comes from
is computed by the first operation is reused later by the observation that the eventual solution is not neces-
the second operation. Hence, a flow dependence cor- sarily integral, hence the idea of coupling the gcd and
responds to the naming of an intermediate result, is Banerjee test, for increased precision. This idea can be
intrinsic to the code algorithm, and cannot be removed extended in the following way: the inequalities of the
except by modifying it. In contrast, an output depen- problem dictate that x belongs to some polyhedron, and
dence simply indicates that a memory cell is reused the condition for the existence of a solution is that the
when the value it holds becomes useless. Such a depen- maximum of f over this polyhedron be positive, and
dence can be removed simply by using two distinct cells the minimum be negative. Now the extrema of an affine
for the two values. Lastly, in a correct program, where function over a polyhedron are located at some of its
all memory cells are defined before being used, remov- vertices. Hence, one has only to test the sign of f at
ing output dependences has the side effect of removing a finite set of points. This is especially efficient if the
anti-dependences. vertices are in evidence.
 D Dependences

Many other tests were invented with the aim of han- may generate false negatives. However, this occurs suf-
dling systems of equations instead of a single equation. ficiently seldom that it is usually ignored.
For instance, in the Lambda test [] the Banerjee test
is applied to well chosen linear combinations of equa- Dependence Depth
tions. If the answer is negative, then the original system The dependence depth abstraction is a natural byprod-
has no solution. uct of the decomposition of the lexical order into dis-
However, it was realized around  that the ques- joint polyhedra as in (). One simply has to record at
tion of the emptiness of the dependence polyhedron which depth a dependence occurs, instead of taking the
can be solved using classical linear programming algo- disjunction of the test results for several depths. Knowl-
rithms. This approach was originally deemed too costly, edge of the dependence depth allows one to decide, in
but with improved algorithms and a ,-fold increase a loop nest, which loop must be kept sequential and
in processing power (Moore’s law), the argument has which one can be executed in parallel. This informa-
lost its strength. tion is necessary for the Allen and Kennedy algorithm,
One possibility is to use the Fourier–Motzkin algo- which uses loop splitting to find more parallel loops.
rithm [], in which the unknowns of the problem are
eliminated by combining the inequalities that define Dependence Distances
the dependence polyhedron, until one obtains numer- The dependence distance is the difference between the
ical inequalities, which can be decided by inspection. iteration vectors of two dependent operations. One can
Programming the Fourier–Motzkin algorithm is quite define a distance polyhedron as:
simple, but its complexity is doubly exponential, which ⃗ ı , ⃗ȷ : ⃗ı δ p ⃗ȷ, d⃗ = ⃗ȷ − ⃗ı }.
D = {d∣∃⃗
is not critical since dependence testing problems are
usually small. The standard Fourier–Motzkin algorithm A projection algorithm is needed to eliminate the exis-
finds rational solutions or proves that none exists. An tential quantifiers: one may use the Omega test or
extension, the Omega test [], considers only integral parametric integer programming. Observe also that the
solutions. Other extensions are the I test [] and the dependence distance is always lexicopositive.
Power test []. Computing the distance polyhedron is especially
Another possibility is to use the Simplex algorithm, interesting when it reduces to a single constant vec-
which is more complex, but which runs most often in tor. One says in that case that the dependence is uni-
time proportional to the third power of the number form. Many parallelization algorithms are especially
of inequalities, and hence scales better for large prob- simple when the source program has only uniform
lems. The algorithm can also be extended to the integer dependences.
case using Gomory cuts, and can even solve parametric
problems []. Dependence Direction Vectors
In the presence of nonuniform dependences, one may
Dependence Approximations further abstract the distance polyhedron by considering
Early parallelization algorithms (e.g., deciding if a loop only the signs of the components of the distance vectors.
has parallelism) did not need a precise knowledge of The components of a Dependence Direction Vector
dependence polyhedra. Just testing them for emptiness (DDV) are either integers (meaning that the component
was enough. As the sophistication of parallelization is constant) or one of the symbols <, >, ≥, ≤ (mean-
algorithms increased, more precise descriptions of ing that the component has the corresponding sign)
dependences were needed. All such approximations can or ∗, meaning that the component may have an arbi-
be described as supersets of dependence polyhedra. On trary sign. DDVs are usually computed by adding the
the relations between dependence approximations and corresponding constraint to the definition of the depen-
program transformations, see []. dence polyhedron and testing for feasibility. While sim-
The simplest approximation consists in ignoring the pler to compute than the distance polyhedron, DDVs
fact that iteration vectors have integer components. give enough information to check the legality of some
When testing a transformed program for legality, this transformations, like loop permutation.
Determinacy D 

Dependence Cones . Lengauer C () Loop parallelization in the polytope model.


The dependence cone is simply the cone generated by In: Best E (ed) CONCUR’, LNCS vol , pp –, Springer,
the dependence distance vectors. The simplest way of Berlin
. Li Z, Yew P-C, Zhu C-Q () An efficient data dependence anal-
computing the dependence cone is to compute first
ysis for parallelizing compilers. IEEE Trans Parallel Distrib Syst
the vertices of the distance polyhedron, d , . . . , dn . The :–
dependence cone is then: . Pugh W () The omega test: a fast and practical integer pro-

n
gramming algorithm for dependence analysis. In: Supercomput- D
ing, 
C = {∑ λ k dk ∣λ k ≥ } . . Triolet R, Irigoin F, Feautrier P () Automatic parallelization
k=
of FORTRAN programs in the presence of procedure calls. In:
Knowledge of the dependence cone is especially Robinet B, Wilhelm R (eds) ESOP , LNCS , Springer,
Berlin
useful when tiling a loop nest.
. Wolfe M, Tseng C-W () The power test for data dependence.
IEEE Trans Parallel Distrib Syst ():–
Related Entries . Yang Y-Q, Ancourt C, Irigoin F () Minimal data depen-
dence abstractions for loop transformations. In: LCPC’: Pro-
Banerjee’s Dependence Test
ceedings of the th International Workshop on Languages
Bernstein’s Conditions and Compilers for Parallel Computing, Springer, London,
Dependence Abstractions pp –
Omega Test
Polyhedra scanning

Bibliographic Notes and Further


Readings Detection of DOALL Loops
There is a large literature on pointer analysis, which may
Parallelism Detection in Nested Loops, Optimal
be useful for program parallelization, albeit its orien-
tation is more toward program verification and safety.
A good starting point is [] ( references!).
For an attempt at parallelization of recursive pro-
grams, see []. Determinacy

Bibliography Christoph von Praun


Georg-Simon-Ohm University of Applied Sciences,
. Amiranoff P, Cohen A, Feautrier P () Beyond iteration vec-
Nuremberg, Germany
tors: instancewise relational abstract domains. In: Static Analysis
Symposium (SAS’), Seoul, Corea
. Banerjee U () Loop transformations for restructuring com-
pilers: Dependence analysis. Kluwer Academic Publishers
. Bernstein AJ () Analysis of programs for parallel processing.
Synonyms
IEEE Transactions on Electronic Computers, EC-:– Determinism
. Feautrier P () Parametric integer programming. RAIRO
Recherche Opérationnelle :–
. Feautrier P () Dataflow analysis of scalar and array references.
Definition
Int J Parallel Program ():– Determinacy refers to the property of repeatability. A
. Feautrier P () Automatic parallelization in the polytope parallel program is determinate, if all feasible execu-
model. In: Perrin G-R, Darte A (eds) The data-parallel program- tions of the program, when started on the same input,
ming model. LNCS vol , pp –, Springer, Heidelberg generate the same result.
. Hind M () Pointer analysis: haven’t we solved this problem
Likewise, the result of a non-determinate parallel
yet? In: PASTE’. ACM, Snowbird
. Kong X, Klappholz D, Psarris K (July ) An efficient data
program can vary for different executions, depending
dependence analysis for parallelizing compilers. IEEE Trans Par- on the timing of different threads. The reason for non-
allel Distrib Syst ():– determinacy are race conditions.
 D Determinacy

Discussion Internal Determinacy


A parallel program is called internally determinate if the
Introduction
sequence of intermediate values taken by each shared
This entry discusses the concept of determinacy in
location is the same in all executions. Figure  gives an
the context of parallel programming. The examples
example.
and classification of different kinds of determinacy are
The notion of internal determinacy is stronger than
adopted from Emrath and Padua [, Sect. ]. The discus-
external determinacy, since not only final values but also
sion focuses on explicitly parallel programs with threads
intermediary values of shared variables are considered.
that share memory, although the concepts of determi-
Naturally, programs that are internally determinate are
nacy can be generalized to other parallel programming
also externally determinate.
models.
Internal determinacy does not imply that the overall
Sequential programs are naturally determinate
shared state of the program is transformed in a prede-
(Exceptions exist, namely languages that offer a choice
termined sequence. It may well be that some executions
operator, where the choice of program-flow is not
of the example program in Fig.  initialize variables
directly specified by the programmer but chosen by
at lower indices first, while other executions initialize
the run-time system; such methods are, for example,
variables in some other order.
embodied in backtracking systems.) If parallel pro-
Internal determinacy is however not required to
gramming is used to accelerate the performance of
achieve external determinacy. Figure  gives an example.
a sequential program without intention of changing
The internal non-determinacy in Fig.  is due to the
the functional behavior, determinacy is a correctness
race conditions among concurrent iterations of the sec-
requirement. In general, however, determinacy is nei-
ond loop when accessing the critical section. The order
ther a sufficient nor a necessary condition for the cor-
in which different iterations enter the critical section is
rectness of a parallel
program.
The determinacy of a program can depend on the
int[] arr = new arr[N]
program input. An example is given in Fig. . doall (i <- 0 to N-1)
arr[i] = i + 1
enddoall
External Determinacy Determinacy. Fig.  Internal determinacy: The sequence
Determinacy that refers to the repeatability of input– of states for each shared variable arr[i] is the same in all
output behavior is called external determinacy. executions, that is, irrespective of the timing of parallel
iterations. Every execution transits the state of location
arr[i] from its initial value to value i+

int i = read()
if (i != 0)
int[] arr = new arr[N]
cobegin int sum = 0
i = 1 || doall (i <- 0 to N-1)
i = 2 arr[i] = i + 1
coend enddoall
endif doall (i <- 0 to N-1)
print(i) critical
sum += arr[i]
Determinacy. Fig.  Determinacy of this program is endcritical
enddoall
input-dependent: On input , the program will always
output ; for other inputs, the program is non-determinate Determinacy. Fig.  External determinacy but internal
and may output  or . The reason for the non-determinacy non-determinacy: The sequence of intermediate values of
is the data race among the write accesses to variable i in variable sum can differ in different executions. However,
the parallel block the final value of all shared variables is always the same
Determinacy D 

not predetermined. There is no data race however, since Complete Non-Determinacy


simultaneous updates of variable sum cannot occur. External non-determinacy may occur for reasons other
External determinacy arises here, since the integer than the special case of associative non-determinacy
addition performed on the shared integer variable sum discussed in the previous paragraph. Programs with
is commutative and associative. such form of external non-determinacy are called com-
Benefits: Internal non-determinacy can be exploited to pletely non-determinate.
accelerate parallel collective reduction and scan opera- When executed on a given input, the final state D
tions []. of a completely non-determinate program depends on
Debugging: Internal non-determinacy complicates debug- the timing of threads and their operations on shared
ging since a programmer may observe different sequences variables. An example for such a program is given in
of operations and intermediate values of shared vari- Fig. .
ables in different program executions. The non-determinacy is due to a race condition
among the write operation to variable j in different loop
iterations. In this particular case, the race condition is a
Associative Non-Determinacy data race, since there could be executions with simulta-
A computer represents real numbers typically as float- neous write accesses. But the kind of race is immaterial
ing point values with limited precision. Unlike their here. If even the execution of write operations were
mathematical counterpart, operations such as addition ordered due to a critical section, the precise order is
may not be generally associative and distributive on the unspecified in the program and thus the program is
floating point completely non-determinate.
representations.
The program in Fig.  seems similar to the pro-
grams in Fig. : It is internally non-determinate, since Analysis of Non-Determinacy
the sum is computed in some order that is not spec- Non-determinacy is caused by race conditions. Hence,
ified a priori by the program. This program is how- tools for race detection are used to analyze cases of non-
ever not externally determinate, since the add opera- determinacy. When tool are specialized on the detection
tion is computed on the floating point, not the inte- of data races, they will not be successful to identify
ger domain. The result of this computation may vary causes of non-determinacy since general races that are
slightly across different runs. For some applications, this not data races can also cause non-determinacy.
deviation and thus the external non-determinacy may
be acceptable, hence the program could be considered Determinacy in Other Parallel
correct. Programming Models
External non-determinacy that is solely due to the Non-determinacy may be introduced in any explicitly
lack of associativity and commutativity of floating point parallel programming model that is susceptible to race
operations is called associative non-determinacy. conditions. Besides multithreading with shared mem-
ory, these are, for example, task-based programming
models such as Intel TBB [] and also message-passing
int N = 100000 models such as MPI [].
double sum = 0
doall (i <- 0 to N)
critical
sum += 1.0/i int N = 100000
endcritical int j = 0
enddoall doall (i <- 0 to N)
j = i
Determinacy. Fig.  Associative non-determinacy: The enddoall
print(j)
final value of variable sum can vary slightly in different
executions due to the non-associativity of floating point Determinacy. Fig.  Program that is completely
numbers non-determinate
 D Determinacy

Determinate Parallel Programs yet been adopted widely. Possible reasons are that the
Parallel programs that are completely non-determinate models restrict the solution space too much, such that
are rarely useful in practice. Hence, programming common algorithmic or parallel programming idioms
model have been designed that prevent or at least reduce cannot be expressed. Moreover, the performance impact
the risk of creating a parallel programs with undesired due to run-time checking for forbidden side effects is
non-determinacy. deemed too high or not well understood.
Autoparallelization: Starting point is a sequential pro-
gram from which a parallel form is derived automatically. Determinate Parallel Execution
The resulting parallel programs are typically externally Deterministic replay: Deterministic replay is a debugging
determinate. aid and controlled run-time environment that enables
Autoparallelization may involve the reordering of repeating a particular parallel program execution of an
arithmetic operations. The associativity of arithmetic otherwise non-determinate parallel program. A debug-
operations is defined in the language report; in partic- ging methodology for data races using deterministic
ular, floating point operations may be non-associative. replay was first conceived and formalized by Choi and
The user of a parallelization tool can typically con- Min [, Sect. .]. They prove that an iterative debug-
figure if the parallelization should comply with the ging process of data race detection and replay will result
non-associativity for floating point operations or if such eventually in a program execution that is data-race-free.
operations may be reordered. The latter may lead to The process requires a data race detection tool with the
programs with associative non-determinacy. following property: If an execution has data races, the
Speculative parallelization: Similar to autoparalleliza- tool reports some but not necessarily all races.
tion, speculative parallelization avoids non-determinacy Systems that support deterministic replay, operate
by design, however not at compile time, but at run time. in two phases: First, the recording phase, where pro-
A run-time system with support for speculative par- gram input and the synchronization order of a parallel
allelization detects harmful race condition, and hence program execution are memorized. Second, the replay
sources of non-determinacy []. When a possible vio- phase, where recorded information is used by a con-
lation of determinacy is detected, the results of the trolled run-time environment to reproduce the original
speculative parallel execution are discarded, followed program execution.
by a re-execution. An example for a determinate pro- The original and the replayed executions typi-
gramming model based on speculative parallelization is cally correspond to each other at least according to
given in []. the requirements of internal determinacy, that is, the
Programming languages for asynchronous parallel com- sequence of values taken by individual variables must be
putations: Some parallel programming models are the same in both executions. Such guarantee is typically
designed to limit the space of permissible programs to given by software-based implementations, for exam-
deterministic ones. Examples are Kahn Networks [] ple []. Some systems are able to maintain a global
and an experimental programming language described order of all updates, which is an even stronger notion
by Steele []. The key idea of the latter language is to of correspondence among executions [].
limit the available forms of asynchrony and at the same Computer architectures for determinate parallelism:
time to restrict the side effects that each asynchronous Recently, computer architectures [] have been pro-
computation can have on shared memory. Different posed that constrain the possible execution schedules at
models for the validation of side effects are feasible, the level of individual memory operations, or groups of
ranging from purely dynamic to static analysis. Unin- memory operations called chunks. Constraints are such
tended non-determinacy due to race conditions is one that inter-processor communication becomes determi-
of the most frequent errors in parallel programs. Thus, nate. At a programming level, this means, for example,
a language-enforced programming discipline to avoid that the order in which a lock is taken by different pro-
non-determinacy seems to be useful. However, such cessors is the same in different executions. Since this
determinate parallel programming models have not architecture permits only one execution timing, it lets
any parallel program, determinate or not, execute in a determinate manner.

Checking Determinacy
Determinacy can be asserted at run time, by validating that an execution is free from race conditions. The property of race freedom is, however, too strong to be useful, since many correct parallel programs with internal non-determinacy would be flagged as erroneous. Hence, checking determinacy means checking external determinacy while allowing internal non-determinacy.

Burnim and Sen [] developed a set of program annotations, called bridge assertions, to mark blocks of code that should behave externally determinate. These assertions validate external determinacy by relating the functional behavior of such a block with observations from previous executions of the same block. The method is capable of describing associative non-determinacy.

An alternative technique is proposed by Sadowski et al. []. Their run-time checker verifies the absence of interference while allowing race conditions due to certain synchronization idioms that are known to preserve determinacy. As in [], code blocks for which (external) determinacy should be validated have to be specified explicitly.
Related Entries
Deterministic Parallel Java
Parallelization, Automatic
Race Conditions
Race Detection Techniques

Bibliography
. Burnim J, Sen K () Asserting and checking determinism for multithreaded programs. ESEC/FSE ’: Proceedings of the th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering on European Software Engineering Conference and Foundations of Software Engineering Symposium, ACM, New York, pp –
. Choi J-D, Alpern B, Ngo T, Sridharan M, Vlissides J () A perturbation-free replay platform for cross-optimized multithreaded applications. IPDPS : Proceedings of the th International Parallel and Distributed Processing Symposium (IPDPS’), IEEE Computer Society, Washington, DC
. Choi J-D, Min SL () Race frontier: reproducing data races in parallel-program debugging. PPOPP : Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, pp –
. Intel Corporation. Intel Threading Building Blocks. http://www.threadingbuildingblocks.org/
. Devietti J, Lucia B, Ceze L, Oskin M () DMP: deterministic shared memory multiprocessing. ASPLOS : Proceedings of the th International Conference on Architectural Support for Programming Languages and Operating Systems, ACM, New York, pp –
. Emrath PA, Padua DA () Automatic detection of nondeterminacy in parallel programs. Proceedings of the ACM Workshop on Parallel and Distributed Debugging, University of Wisconsin, Madison, Wisconsin, pp –
. Message Passing Interface Forum. MPI: A message passing interface standard. http://www.mpi-forum.org/, June 
. Kahn G () The semantics of a simple language for parallel programming. Information Processing : Proceedings of the IFIP Congress, Stockholm, Sweden, North-Holland, pp –
. Montesinos P, Ceze L, Torrellas J () DeLorean: recording and deterministically replaying shared-memory multiprocessor execution efficiently. ISCA ’: Proceedings of the th International Symposium on Computer Architecture, IEEE Computer Society, Washington, DC, pp –
. Owens J () Data-parallel algorithms and data structures. SIGGRAPH ’: ACM SIGGRAPH  courses, ACM, New York, p 
. Rauchwerger L, Padua DA () The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization. IEEE Trans Parallel Distrib Syst ():–
. Sadowski C, Freund SN, Flanagan C () Singletrack: a dynamic determinism checker for multithreaded programs. ESOP ’: Proceedings of the th European Symposium on Programming Languages and Systems, Springer, Berlin/Heidelberg, pp –
. Steele GL Jr () Making asynchronous parallelism safe for the world. POPL ’: Proceedings of the th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM, New York, pp –
. von Praun C, Ceze L, Caşcaval C () Implicit parallelism with ordered transactions. PPoPP ’: Proceedings of the th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM, New York, pp –


Determinacy Race
Race Conditions


Determinism
Determinacy
Deterministic Parallel Java

Robert L. Bocchino, Jr.
Carnegie Mellon University, Pittsburgh, PA, USA

Synonyms
DPJ

Definition
Deterministic Parallel Java (DPJ) is a parallel programming language based on Java and developed at the University of Illinois at Urbana-Champaign. DPJ uses a type and effect system to provide deterministic-by-default semantics. “Deterministic” means that for a particular input, a given program always produces the same output, regardless of the parallel schedule chosen. “By default” means that the program is deterministic unless the programmer explicitly requests nondeterministic behavior. DPJ guarantees this semantics at compile time for the programs it can express.

Discussion
Introduction
Many parallel programs are intended to have deterministic behavior. Typically, these programs are transformative: they accept some input, do a computation in memory, and produce some output, in contrast to programs such as servers that accept input throughout the computation. There are many examples of deterministic programs in domains such as scientific computing, graphics, multimedia, and artificial intelligence.

A programming model that can guarantee deterministic behavior for such programs has many advantages. Parallel programs become easier to reason about, because a key source of complexity (nondeterministic parallel schedules) has been removed, along with attendant difficulties such as unwanted data races and complex memory models. Testing and debugging are easier because, as in the sequential case, it suffices to test only one schedule per input.

Well-known programming models that guarantee determinism include dataflow, data parallel, and pure functional programming. While these models have the advantages described above, they are restrictive in that they disallow sharing of mutable data through references. In contrast, widely used imperative languages such as Java that support references to mutable data typically provide no guarantee that a parallel program is deterministic or even race-free.

Deterministic Parallel Java (DPJ) is a Java-based programming language that provides deterministic-by-default semantics for parallel programs, while supporting references to mutable objects. The key feature of DPJ that allows this guarantee is a type and effect system that expresses what parts of memory are read and written by different parts of the program, so that parallel tasks can be checked for noninterference. DPJ has first-class support for a fork-join style of parallelism. Specifically, it includes a foreach statement for parallel loops and a cobegin block for parallel statements, most often used to express parallel recursion. Object-oriented frameworks, discussed below, provide a way to express other forms of parallel control.
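DPJ's own syntax for these constructs appears in the figures below. Purely as a point of reference (a plain-Java analogue, not DPJ code), the same fork-join structure can be expressed with the standard java.util.concurrent and stream APIs:

    import java.util.concurrent.CompletableFuture;
    import java.util.stream.IntStream;

    public class ForkJoinSketch {
        public static void main(String[] args) {
            int[] data = new int[8];

            // cobegin { s1; s2; } analogue: run two statements in parallel,
            // then wait for both before continuing.
            CompletableFuture<Void> s1 = CompletableFuture.runAsync(() -> data[0] = 1);
            CompletableFuture<Void> s2 = CompletableFuture.runAsync(() -> data[1] = 2);
            CompletableFuture.allOf(s1, s2).join();

            // foreach (int i in 2, data.length) analogue: a parallel loop whose
            // iterations touch disjoint array cells.
            IntStream.range(2, data.length).parallel().forEach(i -> data[i] = i * i);

            System.out.println(java.util.Arrays.toString(data));
        }
    }

Unlike DPJ, plain Java gives no static guarantee that the parallel statements or loop iterations are noninterfering; providing that guarantee is the role of the effect system described next.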
The DPJ Effect System
The DPJ effect system [] allows the programmer to partition the heap into regions, and to provide method effect summaries stating the effects of methods in terms of accesses (reads and writes) to those regions. The compiler uses simple, intraprocedural analysis to check two things:

1. Correctness of method summaries. Every method effect summary includes all the actual effects of the method it summarizes, as well as the effects of any methods overriding that method.
2. Noninterference. Any two memory accesses, one from each of two tasks that may run in parallel, do not conflict (i.e., they are both reads, or they operate on disjoint memory).

The programmer can omit the region and effect annotations for sequential code (an ordinary sequential Java program is a correct DPJ program) but must add the annotations to code that is executed in a parallel task.

Regions and Effects
Figure  gives a simple example showing the use of regions and effects in DPJ. Line  declares region names First and Second. Lines  and  place instance fields first and second in regions First and Second.
    class Pair {
      region First, Second;
      int first in First;
      int second in Second;
      void setFirst(int first) writes First {
        this.first = first;
      }
      void setSecond(int second) writes Second {
        this.second = second;
      }
      void setBoth(int first, int second) {
        cobegin {
          setFirst(first);   /* writes First */
          setSecond(second); /* writes Second */
        }
      }
    }

Deterministic Parallel Java. Fig.  Regions and effects in DPJ (DPJ additions to Java are highlighted in bold)

The region names have static scope, for example, all first fields in any instance of the Pair class reside in the same region Pair.First. (DPJ uses region parameters, explained below, to assign different regions to different object instances.) DPJ also provides local regions (declared within a method scope) for expressing effects on objects that do not escape the method.

Lines  and  illustrate the use of method effect summaries. The summary writes First in line  states that method setFirst writes region First, and similarly for writes Second in line . In general, a method effect summary has the form reads region-list writes region-list. If a method has no effect on the heap, it may be annotated pure. A method effect summary may be omitted entirely (as in line ); in this case the compiler infers the most conservative effect (“writes the entire heap”) for the method. This default is generally used only for methods that are never called inside parallel tasks, so it is not important to know their precise effects.

A few simple rules govern the use of method effect summaries. First, all the actual effects of the method must be present in the summary. For example, the only heap effect in the body of setFirst is the write to field first in line , so the summary is correct. If the effect in line  were pure, the compiler would issue an error. (Line  also reads the method parameter first. However, method parameters and other local variables in Java cannot have their addresses taken and never alias. Therefore, effects on such variables are handled automatically by the compiler and are ignored by the programmer-visible effect system.) Second, the summary may be conservative, that is, it may specify more effects than the method actually has. In particular, write effects subsume read effects, so it is permissible (but conservative) to say writes R in the summary when the method only reads region R. Finally, the effects of an overridden method must include the effects of any overriding method. This rule is similar to how plain Java handles throws clauses; it ensures sound reasoning about effects in the presence of polymorphic method dispatch.

Lines – illustrate the expression of deterministic parallelism. The cobegin block (line ) says to execute each component statement in parallel. The compiler accumulates the effect of each component statement and checks that the effects are pairwise noninterfering. Here, the effect of invoking setFirst in line  is writes First (from the definition of method setFirst in line ); and similarly the effect in line  is writes Second. Because First and Second are distinct names, the writes are to disjoint regions, so they are noninterfering. If the effects in lines  and  were interfering (e.g., they both wrote to the same region), then the compiler would issue a warning.

Region Parameters
DPJ allows classes and methods to be written with region parameters that become bound to actual regions when the class is instantiated into a type, or the method is invoked. Figure  illustrates the use of class region parameters. Line  declares a class SimpleTree with one region parameter P. Region parameter declarations coexist with Java generic type parameters and use a similar syntax; the keyword region distinguishes region parameters from type parameters. Line  places the instance field data in region P. When the class SimpleTree is instantiated with a type, and an object of that type is created with new, the type specifies the actual region of the storage for data, as illustrated in lines –.

The compiler computes effects on fields by using the region specified in the class, after substituting actual regions for formal region parameters. This is illustrated
in lines –. The effect of line  is writes Left, as shown, because the write is to field left.data. Line  says that data is in region P, and line  says that the type of left has P = Left. (There is also a read of field left in region Left, but this read is subsumed by the effect writes Left.) Similarly, the effect of line  is writes Right. Because Left and Right are distinct regions, the cobegin statement in lines – is legal. The DPJ type system ensures that this kind of reasoning is sound: for example, it is a type error to attempt to assign a reference of type SimpleTree<Left> to a variable of type SimpleTree<Right>.

Region parameters can have disjointness constraints. For example, <region P1, P2 | P1 # P2> declares two parameters P1 and P2 and constrains them to be disjoint. The compiler uses the constraint to check noninterference in parallel code that uses the parameters. If the disjointness constraint is not satisfied by the actual region arguments at the point where a class is instantiated or a method is invoked, the compiler issues a warning.

    class SimpleTree<region P> {
      region Left, Right;
      int data in P;
      SimpleTree<Left> left in Left = new SimpleTree<Left>();
      SimpleTree<Right> right in Right = new SimpleTree<Right>();
      void updateChildren(int leftData, int rightData) {
        cobegin {
          left.data = leftData;   /* writes Left */
          right.data = rightData; /* writes Right */
        }
      }
    }

Deterministic Parallel Java. Fig.  Class region parameters

Region Path Lists (RPLs) and Nested Effects
In deterministic parallel computations, it is often essential to express a nesting relationship among regions. For example, to do a parallel update traversal on a binary tree, one must specify effects on “the left subtree” or “the right subtree.” A natural way to do this is to put a nesting structure on the regions that mirrors the nesting structure of the tree. DPJ represents this kind of nesting structure using region path lists, or RPLs.

An RPL is a colon-separated list of names beginning with Root, such as Root:Left:Right. RPLs naturally form a tree, where the RPL specifies the path in the tree from the root to the node that it represents. For example, Root:Left:Right is a child of Root:Left. In the execution semantics of DPJ, every region is represented as an RPL; a bare region name like Left is equivalent to Root:Left. RPLs may be partially specified by using * to stand in for any sequence of zero or more names, for example, Root:*:Left or Root:Left:*. This is useful in specifying sets of regions in types and effects.

    class Tree<region P> {
      region Left, Right;
      int data in P;
      Tree<P:Left> left in P:Left;
      Tree<P:Right> right in P:Right;
      void increment() writes P:* {
        ++data; /* writes P */
        cobegin {
          /* writes P:Left:* */
          if (left != null) left.increment();
          /* writes P:Right:* */
          if (right != null) right.increment();
        }
      }
    }

Deterministic Parallel Java. Fig.  A tree class

Figure  illustrates the use of RPLs to write a tree that can be traversed in parallel to update its elements. The key feature that makes this work is a parameterized
RPL: an RPL can begin with a parameter P, as shown in lines –. In the execution semantics, the parameters are erased via left recursive substitution. Figure  illustrates this procedure for the root node of a tree, and its two children. Each node of the tree has its data field in a distinct region, the RPL of which reflects the position of the node in the tree. Further, the DPJ type system enforces this structure: for example, it would be illegal to assign the right child to the left field of the root, because the types do not match.

Deterministic Parallel Java. Fig.  Runtime representation of the tree (the root node has P = Root, with data in Root, left in Root:Left, and right in Root:Right; the two children have P = Root:Left and P = Root:Right, with their fields in the correspondingly nested RPLs)

Lines – show how to write an increment method that traverses the tree recursively in parallel and updates the data fields of the nodes. In line , the summary writes P:* says that the method writes P and all regions under P. The effect of line  is writes P, because field data is declared in region P (line ). Line  generates two effects: a read of P:Left due to the read of field left, and a write to P:Left:* obtained from the effect summary of the recursively invoked method increment, after substituting P = P:Left from the type of left. Because writes subsume reads, both effects may be summarized as writes P:Left:*, as shown in line . Similarly, the effect of line  is writes P:Right:*.

As described above, the compiler has to check two things: first, that the method summary is correct; and second, that the statements inside the cobegin (line ) are noninterfering. The first check passes, because all the statement effects are included in the effect writes P:* stated in the summary in line . As to the second check, the effects of the two statements in the cobegin are writes P:Left:* and writes P:Right:*. Because of the tree structure of RPLs, these two effects are on disjoint sets of regions for any common binding to P. Therefore the effects are noninterfering, so this check passes as well.

Arrays
The RPL mechanism is useful for trees, as shown above. It also supports two common patterns of parallel computation on arrays that are not well handled by previous type and effect systems: parallel updates of objects through arrays of references and divide-and-conquer updates to arrays.

Parallel updates of objects through arrays of references. DPJ supports this pattern with two features. First, an RPL may include an element [e], where e is an integer expression, representing cell e of an array. This is called an array RPL element. Second, the region in the type of an array cell may be parameterized by the index of the cell. This is called an index-parameterized array. Together, these features allow the programmer to specify an array of references such that the object pointed to by each reference is instantiated with a distinct region.

Figure  shows an example. The class Body has one region parameter P and a field force in region P. The instance method computeForce computes a force and writes it into force. The static method computeForces takes an array of bodies and iterates over it in parallel, calling computeForce on each body. The type of bodies, shown in line , is an index-parameterized array type. The #i declares a fresh variable i in scope over the whole array type. The element type Body<[i]> says that for any natural number n, the array element at index n is a reference of type Body<Root:[n]>.

    class Body<region P> {
      double force in P;
      void computeForce() reads Root writes P {
        ...
      }
      static void
      computeForces(Body<[i]>[]#i bodies) {
        foreach (int i in 0, bodies.length) {
          /* reads Root writes Root:[i] */
          bodies[i].computeForce();
        }
      }
    }

Deterministic Parallel Java. Fig.  Code using an index-parameterized array

Deterministic Parallel Java. Fig.  Runtime representation of the index-parameterized array (for example, bodies[10] references an object of type Body<Root:[10]> whose force field is in region Root:[10], and bodies[90] references a Body<Root:[90]>)

Figure  shows how the array might look at runtime. The assignment rules in the DPJ type system ensure that the types are correct: for instance bodies[10] must point to an object of type Body<Root:[10]>. Therefore, all the force fields are in distinct regions, and the parallel updates in line  of Fig.  are noninterfering.

Divide-and-conquer updates to arrays. To support this pattern, DPJ allows dynamic array partitioning: an array may be divided into two (or more) disjoint parts that are updated in parallel. Figure  illustrates how this works for a simple version of quicksort. The quicksort method (line ) has a region parameter P and takes a DPJArray<P>, which points into a contiguous subset of a Java array. In lines –, the array is
split at index p, creating a DPJPartition object holding references to two disjoint subarrays. The cobegin in lines – calls quicksort recursively on these two subarrays.

    static <region P>
    void quicksort(DPJArray<P> A) writes P:* {
      /* Ordinary quicksort partition */
      int p = quicksortPartition(A);
      /* Split array into two disjoint pieces */
      final DPJPartition<P> segs =
        new DPJPartition<P>(A, p);
      cobegin {
        /* writes segs:[0]:* */
        quicksort(segs.get(0));
        /* writes segs:[1]:* */
        quicksort(segs.get(1));
      }
    }

Deterministic Parallel Java. Fig.  Parallel quicksort

Figure  shows what this partitioning looks like at runtime, for the root of the tree and its children. A DPJPartition object referred to by the final local variable segs points to two DPJArray objects, each of which points to a segment of the original array. The DPJArray objects are instantiated with owner RPLs segs:[0]:* and segs:[1]:* in their types. An owner RPL is like an RPL, except that it begins with a final local reference variable instead of Root. This allows different partitions of the same array to be represented. Again because of the tree structure of DPJ regions, segs:[0]:* and segs:[1]:* are disjoint region sets, and so the effects in lines  and  are noninterfering. Also, as in ownership type systems, region segs is under the region P bound to the first parameter of its type, so the effect summary writes P:* in line  covers the effects of the method body.

Deterministic Parallel Java. Fig.  Runtime representation of the dynamically partitioned array (segs has type DPJPartition<Root> and points to two DPJArray objects, of types DPJArray<segs:[0]:*> and DPJArray<segs:[1]:*>, covering the segments on either side of the split point p)

Commutativity Annotations
Sometimes, a parallel computation is deterministic, even if it has interfering reads and writes. For example, two inserts to a concurrent set can go in either order, and preserve determinism, even though there are interfering writes to the set object. DPJ supports this kind of computation with a commutativity annotation on methods (typically API methods for a concurrent data structure). For example:

    interface Set<type T, region R> {
      commutative void add(T e) writes R;
    }
This annotation is provided by a library or framework programmer and trusted by the compiler (it is not checked by the DPJ effect system). It means that cobegin { add(e1); add(e2); } is equivalent to doing the add operations in sequence (isolation) and that both sequence orders are equivalent (commutativity).

When the compiler encounters a method invocation, it generates an invocation effect that records both the method that was invoked, and the read-write effects of that method:

    foreach (int i in 0, n) {
      /* invokes Set.add with writes R */
      set.add(A[i]);
    }

The compiler uses the commutativity annotations to check interference of effect. For example, the effect invokes Set.add with writes R is noninterfering with itself (because add is declared commutative). However, the same effect interferes with invokes Set.size with reads R.
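As a plain-Java illustration of why such an annotation is reasonable (a sketch using the standard library, not the DPJ API), two concurrent add calls on a thread-safe set commute, since both orders produce the same final set, whereas size does not commute with a concurrent add:

    import java.util.Set;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;

    public class CommutativeAdds {
        public static void main(String[] args) {
            Set<Integer> set = ConcurrentHashMap.newKeySet();

            // The two adds interfere on the set's internal state, but they
            // commute: either order yields the same final set {1, 2}.
            CompletableFuture<Void> a = CompletableFuture.runAsync(() -> set.add(1));
            CompletableFuture<Void> b = CompletableFuture.runAsync(() -> set.add(2));
            CompletableFuture.allOf(a, b).join();
            System.out.println(set.size());  // always 2

            // By contrast, reading size() while another task adds an element
            // is order-sensitive, so it would not be declared commutative.
        }
    }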
Controlled Nondeterminism
DPJ is being extended to support controlled non-
Expressivity and Limitations determinism [], for algorithms such as branch and
DPJ can express a wide variety of algorithms that use bound search, where several correct answers are equally
a fork-join style of parallelism and access data through acceptable, and it would unduly constrain the schedule
a combination of () read-only sharing; () disjoint to require the same answer on every execution. The key
updates to arrays and trees; () commutative operations new language features for controlled nondeterminism
on concurrent data structures like sets and maps; and are () foreach_nd and cobegin_nd statements that
() accesses to task-private heap objects. DPJ has been operate like foreach and cobegin, except they allow
used to write the following parallel algorithms, among interference among their parallel tasks; () atomic
others: IDEA encryption, sorting, k-means clustering, statements supported by software transactional mem-
an n-body force computation, collision detection algo- ory (STM); and () atomic regions and atomic effects for
rithm from a game engine, and a Monte Carlo financial reasoning about which memory regions may have inter-
simulation. The main advantages of the DPJ approach fering effects, and where effects occur inside atomic
are () a compile-time guarantee of determinism in statements.
all executions; and () no overhead from checking Together, these features provide several strong
determinism at runtime, as the region information is guarantees for nondeterministic programs. First, the
erased before parallel code generation. Further, the extended language is deterministic by default: the
effect annotations provide checkable documentation program is guaranteed to be deterministic unless
and enable modular analysis, and the defaults allow the programmer explicitly introduces nondetermin-
incremental transformation of sequential to parallel ism with foreach_nd or cobegin_nd. Second, the
programs. extended language provides both strong isolation
DPJ does have some limitations. One limitation, (i.e., the program behaves as if it is a sequential inter-
mentioned above, is that currently only fork-join par- leaving of atomic statements and unguarded memory
allelism can be expressed. A second limitation is that accesses) and data race freedom. This is true even if
the underlying STM provides only weak isolation (i.e., it allows interleavings between unguarded accesses and atomic statements). Third, foreach and cobegin are guaranteed to behave as isolated statements (as if they are enclosed in an atomic statement), even inside a foreach_nd or cobegin_nd. Finally, the extended type and effect system allows the compiler to boost the STM performance by removing unnecessary synchronization for memory accesses that can never cause interference.

Object-Oriented Parallel Frameworks
DPJ is being extended to support object-oriented parallel frameworks. Current support focuses on two kinds of frameworks: () collection frameworks like Java’s ParallelArray [] that provide parallel operations such as map, reduce, filter, and scan on their elements via user-defined methods; and () a framework for pipeline parallelism, similar to the pipeline framework in Intel’s Threading Building Blocks []. The key challenge is to prevent the user’s methods from causing interference when invoked by the parallel framework. For example, a user-defined map function must not do an unsynchronized write to a global variable. The OO framework support incorporates several extensions to the DPJ type and effect system for expressing generic types and effects in framework APIs, with appropriate constraints on the effects of user-defined methods.

Frameworks represent one possible solution to the problem of array reshuffling and other operations that the type system cannot express. Such operations can be encapsulated in a framework. The type and effect system can then check the uses of the framework, while the framework internals are checked by more flexible (but more complex and/or weaker) methods such as program logic or testing. This approach separates concerns between framework designer and user, and it fosters modular checking. It also enables different forms of verification with different trade-offs in complexity and power to work together. In addition, frameworks provide a natural way to extend the fork-join model of DPJ by adding other parallel control abstractions, including finer-grain synchronization (e.g., pipelined loops) or domain-specific abstractions (e.g., kd-tree querying and sparse matrix computations).

Inferring DPJ Annotations
Research is underway to develop an interactive tool that helps the user write DPJ programs by partially inferring annotations. A tool called DPJizer has been developed [] that uses interprocedural constraint solving to infer method effect summaries for DPJ programs annotated with the other type, region, and effect information. DPJizer simplifies DPJ development by automating one of the more tedious, yet straightforward, aspects of writing DPJ annotations. The tool also reports the effect of each program statement, helping the programmer to isolate and eliminate effects that are causing unwanted interference.

The next step is to infer region information, given a program annotated with parallel constructs. This is a more challenging goal, because there are many more degrees of freedom. The current strategy is to have the programmer provide partial information (e.g., partition the heap into regions and put in the parallel constructs), and then have the tool infer types and effects that guarantee noninterference.

Runtime Checks
Research is underway to add runtime checking as a complementary mode for guaranteeing determinism, in addition to the compile-time type and effect checking. The advantage of runtime checking is that it is more precise. The disadvantages are () it adds overhead, so it is probably useful only for testing and debugging; and () it provides only a fail-stop check, instead of a compile-time guarantee of correctness. One place where runtime checking could be particularly useful is in relaxing the prohibition on inconsistent type assignments (e.g., in reshuffling an index-parameterized array). In particular, if the runtime can guarantee that a reference is unique, then a safe cast to a different type can be performed at runtime. Using both static and runtime checks, and providing good integration between the two, would offer the user a powerful set of options for managing the trade-offs among the expressivity, performance, and correctness guarantees of the different approaches.

Related Entries
Determinacy
Formal Methods–Based Tools for Race, Deadlock, and Other Errors
Transactional Memories
Bibliographic Notes and Further Reading
For further information on the DPJ project and its goals, see the DPJ web site [] and the position paper presented at HotPar  []. The primary reference for the DPJ type and effect system is the paper presented at OOPSLA  []. A paper on the nondeterminism work will appear at POPL  []. The Ph.D. thesis of Robert Bocchino [] presents the full language, together with a formal core language and effect system, soundness results for the core system, and proofs. A paper presented at ASE  [] describes the work on inferring method effect summaries.

The main influence on DPJ has been the prior work on type and effect systems, including FX [] and the various effect systems based on ownership types [, ]. Other influences include work on design-by-contract for framework APIs [], work on type and effect inference [], and the vast body of literature on transactional memory [].

Bibliography
. http://gee.cs.oswego.edu/dl/jsr/dist/extraydocs/index.html?extray/package-tree.html
. http://dpj.cs.uiuc.edu
. Bocchino RL Jr () An effect system and language for deterministic-by-default parallel programming. PhD thesis, University of Illinois, Urbana-Champaign
. Bocchino RL Jr, Adve VS, Adve SV, Snir M () Parallel programming must be deterministic by default. In: HOTPAR ’: USENIX workshop on hot topics in parallelism, Berkeley, March 
. Bocchino RL Jr, Adve VS, Dig D, Adve SV, Heumann S, Komuravelli R, Overbey J, Simmons P, Sung H, Vakilian M () A type and effect system for deterministic parallel Java. In: OOPSLA ’: Proceedings of the th ACM SIGPLAN conference on object-oriented programming systems, languages, and applications, New York, pp –
. Bocchino RL Jr, Heumann S, Honarmand N, Adve S, Adve V, Welc A, Shpeisman T, Ni Y () Safe nondeterminism in a deterministic-by-default parallel language. In: POPL ’: Proceedings of the th ACM SIGACT-SIGPLAN symposium on principles of programming languages, New York
. Cameron NR, Drossopoulou S, Noble J, Smith MJ () Multiple ownership. In: OOPSLA ’: Proceedings of the nd ACM SIGPLAN conference on object-oriented programming systems and applications, New York, pp –
. Clarke D, Drossopoulou S () Ownership, encapsulation and the disjointness of type and effect. In: OOPSLA ’: Proceedings of the th ACM SIGPLAN conference on object-oriented programming, systems, languages, and applications, New York, pp –
. Harris T, Larus J, Rajwar R () Transactional memory (Synthesis lectures on computer architecture), nd edn. Morgan & Claypool Publishers, San Rafael
. Lucassen JM, Gifford DK () Polymorphic effect systems. In: POPL ’: Proceedings of the th ACM SIGPLAN-SIGACT symposium on principles of programming languages, New York, pp –
. Reinders J () Intel threading building blocks: outfitting C++ for multi-core processor parallelism. O’Reilly Media, Sebastopol
. Talpin J-P, Jouvelot P () Polymorphic type, region and effect inference. J Funct Program :–
. Thomas P, Weedon R () Object-oriented programming in Eiffel, nd edn. Addison-Wesley Longman Publishing Co., Inc., Boston
. Vakilian M, Dig D, Bocchino R, Overbey J, Adve V, Johnson R () Inferring method effect summaries for nested heap regions. In: ASE ’: Proceedings of the th IEEE/ACM International Conference on Automated Software Engineering, Auckland, pp –


Direct Schemes
Dense Linear System Solvers


Distributed Computer
Clusters
Distributed-Memory Multiprocessor
Hypercubes and Meshes


Distributed Hash Table (DHT)
Peer-to-Peer


Distributed Logic Languages
Logic Languages


Distributed Memory Computers
Clusters
Distributed-Memory Multiprocessor
Hypercubes and Meshes
Distributed Process Management
Single System Image


Distributed Switched Networks
Networks, Direct
Networks, Multistage


Distributed-Memory Multiprocessor

Marc Snir
University of Illinois at Urbana-Champaign, Urbana, IL, USA

Synonyms
Distributed computer; Distributed Memory Computers; Massively parallel processor (MPP); Multicomputers; Multiprocessors; Server farm

Definition
A computer system consisting of a multiplicity of processors, each with its own local memory, connected via a network. Load or store instructions issued by a processor can only address the local memory, and different mechanisms are provided for global communication.

Discussion
Terminology
A Distributed-Memory Multiprocessor (DMM) is built by connecting nodes, which consist of uniprocessors or of shared memory multiprocessors (SMPs), via a network, also called Interconnection Network (IN) or Switch. While the terminology is fuzzy, Cluster generally refers to a DMM mostly built of commodity components, while Massively Parallel Processor (MPP) generally refers to a DMM built of more specialized components that can scale to a larger number of nodes and is used for large, compute-intensive tasks – in particular, in scientific computing. A server farm is a cluster used as a server, in order to provide more performance or better cost-performance than a shared memory multiprocessor server. A high-availability (HA) cluster is a cluster with software and possibly hardware that supports fast failover so as to tolerate component failures. Finally, a cloud is a (generally very large) cluster that supports dynamic provisioning of computing resources, usually virtualized, to remote users, via the Internet.

Historical Background
High-performance computing was focused until the late s on SMPs and vector computers built of bipolar gates. The fast evolution of VLSI technology indicated the possibility of achieving much better cost-performance by assembling a larger number of less performing, but much cheaper, microprocessors. The  node Cosmic Cube built at Caltech in the early s was an early demonstration of this technology []. Several companies started manufacturing such systems in the mid-s; a , node nCUBE system was installed at Sandia National Lab in the late s and a  Transputer nodes system was running at Edinburgh in . The success of these systems, which was documented in the influential talk “The Attack of the Killer Micros” by Eugene D. Brooks III at the Supercomputing  Conference, led to a flurry of commercial and governmental activities. Companies such as Thinking Machines (CM-), Intel (iPSC, Paragon), and Meiko (CS-) started offering Distributed-Memory Multiprocessor (DMM) products in the mid-s, followed by Fujitsu (AP), IBM (SP, SP), and Cray (TD, TE) in the early s. DMMs have dominated the TOP list of fastest supercomputers since the early s.

The use of clusters evolved in parallel, and largely independently, in the commercial world, driven by several trends: Clusters were used in the early s for server consolidation, replacing multiple, distinct servers with one cluster, thus reducing operating costs. The explosive growth of the Internet in the late s led to a need for large clusters with a small footprint (to fit in downtown buildings) that can be quickly expanded, to serve Internet companies. Web requests can easily be served on a DMM, as each TCP/IP request is a small, independent action and request failures can be tolerated.
DMMs were also used to extend SMP architectures beyond the point where hardware could conveniently support coherent shared memory. Thus, IBM introduced in  the parallel sysplex architecture that allowed coupling of multiple S/ mainframes into a larger cluster. The use of such clusters became more widespread as software vendors developed distributed computing firmware and adapted applications to run on clusters.

Clusters have become increasingly used since the late s for providing fault tolerance. Rather than using very specialized hardware, such as in the early Tandem computers, High Availability (HA) Clusters provide fault tolerance through hardware replication that eliminates single points of failure, and suitable software. Many vendors (IBM, Microsoft, HP, Sun, etc.) offer HA cluster software solutions.

Utility computing, namely the provisioning of computing resources (CPU time and storage) using a pay-per-use model, has been pursued by multiple companies over decades. Changes in technology (low-cost, high-bandwidth networking, decline in hardware cost and relative increase in operation cost, increased use of off-the-shelf software and improved support for hardware virtualization) and in market needs (very rapid and unpredictable growth of the compute needs of Internet companies) have led to the recent success of this model, under the name of cloud computing. Cloud computing platforms are large clusters with software that supports the dynamic provisioning of virtualized compute and storage resources (hardware as a service – HaaS) and, possibly, firmware and application software (software as a service – SaaS).

Architecture
At its simplest, a DMM can be assembled by connecting personal computers or workstations with a commercial network such as Ethernet, as shown in Fig. . The Beowulf system built in  is an early example []. A more compact and more maintainable cluster can be assembled from rack-mounted servers, as shown in Fig. . In addition to the network that is used for communication between nodes, such a cluster will have a control network that is used for controlling and monitoring each node from a centralized console.

Distributed-Memory Multiprocessor. Fig.  Beowulf cluster

The network of a DMM can be (a) a commodity Local Area Network (LAN), such as Ethernet; or (b) a higher performance, third party switched network, such as Infiniband or Myrinet; or (c) a vendor-specific network, such as available on IBM and Cray systems. Communication over Ethernet will typically use the TCP/IP protocols; communication over non-Ethernet fabrics can support lower latency protocols. Adapters for networks of type (a) and (b) will connect to a standard I/O bus, such as PCIe; vendor-specific networks can connect to the (nonstandard) memory bus, thus achieving higher bandwidth and lower latency.

The basic communication protocol supported by any of these networks is message passing: Two nodes communicate when one node sends a message to the second node and the second node receives the message. Networks of type (b) and (c) often support remote direct memory access (rDMA): A node can directly access data in the memory of another node, using a get operation, or update the remote memory, using a put operation.
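The following sketch (a schematic shared-memory analogue in Java, not an actual DMM networking API) contrasts the two styles: a two-sided send/receive pair over a channel, versus a one-sided put that deposits data directly into the target's buffer without the target taking part in the transfer:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class TwoSidedVsOneSided {
        public static void main(String[] args) throws InterruptedException {
            // Two-sided message passing: the sender sends, the receiver receives.
            BlockingQueue<int[]> channel = new ArrayBlockingQueue<>(1);
            Thread sender = new Thread(() -> channel.offer(new int[]{1, 2, 3}));
            Thread receiver = new Thread(() -> {
                try {
                    int[] msg = channel.take();  // blocks until the message arrives
                    System.out.println("received " + msg.length + " words");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            sender.start(); receiver.start();
            sender.join(); receiver.join();

            // One-sided "put": the initiator writes directly into the target's
            // buffer; the target performs no matching receive.
            int[] remoteBuffer = new int[3];
            Thread initiator = new Thread(() -> remoteBuffer[0] = 42);
            initiator.start(); initiator.join();
            System.out.println("remoteBuffer[0] = " + remoteBuffer[0]);
        }
    }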
Distributed-Memory Multiprocessor. Fig.  Rack-mounted cluster

Support for such operations in user space requires an adapter that can translate virtual addresses to physical addresses. Oftentimes, such an adapter will also support read-modify-write operations on operands in remote memory, such as increment; this avoids the need for a round-trip for such operations and facilitates their atomic execution.

Networks of type (c) may have hardware support for collective operations, such as barrier, broadcast, and reduce.

The topology of the interconnection network impacts its performance: Networks with a low-dimension mesh topology (2D or 3D) can be packaged so that only short wires are needed; this enables the use of cheaper copper cables while maintaining high bandwidth. Radix- network topologies, such as butterfly or fat-tree, reduce the average number of hops between nodes, but require longer wires []. Optical cables are needed for high-speed transmission over long distances.

Commercial clusters (server farms and clouds) typically use Ethernet networking technology and communicate using TCP/IP protocols. The nodes often share storage, using Storage Area Networks (SAN) or Network Attached Storage (NAS) technologies. Some commercial clusters, such as the IBM Parallel Sysplex, provide clustering hardware that accelerates data sharing and synchronization across nodes.

Software
DMMs reuse, on each node, software designed for node-sized uniprocessors or SMPs. This includes the operating system, programming languages, compilers, libraries, and tools. An additional layer of parallel software is built atop the node software, in order to integrate the multiple nodes into one system.

This includes, for compute clusters:
● Software for system initialization, monitoring and control. This enables the control of a large number of nodes from one console; concurrent booting of a large number of nodes; network firmware initialization; error detection; etc.
● Scheduler and resource manager for allocating resources to parallel jobs, loading them and starting them. Parallel jobs often run on dedicated partitions; the scheduler has to allocate nodes in large chunks, while ensuring high node utilization and reducing waiting time for batch jobs. The resource manager has to reduce job-loading time (e.g., by using a parallel tree broadcast algorithm) and ensure that all nodes of a job start correctly.
● Parallel file system. Such a system integrates a large number of disks into one system and provides scalable file coherence protocols. Files are striped across many nodes so as to support parallel access to different parts of the same file and ensure load balancing. Software RAID schemes may be used to enhance the reliability of such a large file system.
● Checkpoint/restart facilities. On large DMMs, the mean time between failures is often shorter than the running time of large jobs. To ensure job completion, jobs periodically checkpoint their state; if a failure occurs, then job execution is restarted from the last checkpoint. A checkpoint facility supports coordinated checkpoint or restart by many nodes.
● Message passing libraries for supporting communication across nodes. MPI is the de facto standard for message passing on large DMMs.
● Tools for debugging and performance monitoring of parallel jobs. These are often built atop node-
level debugging and performance monitoring; they enable users to control many nodes in debug mode, and to aggregate information from many nodes in debug or monitoring nodes.

The use of conventional software on the nodes of DMMs is limiting; in particular, the use of conventional operating systems on each node results in jitter: Background OS activities may slow down nodes at random times; applications where barriers are used frequently can slow down significantly as all threads wait at each barrier for the slowest thread. Some systems, e.g., the IBM Blue Gene system, are using on their compute nodes specialized light-weight kernels (LWK) that provide only critical services and offload to remote server nodes heavier or more asynchronous system services.

Commercial clusters use TCP/IP for inter-node communication, and distributed file systems for file sharing. They leverage firmware and application software designed for distributed systems, such as support for remote procedure calls (or remote method invocations), distributed transaction processing monitors, messaging frameworks, and “shared-nothing” databases.

A critical component of server farms that handle Web requests is a load balancer that distributes TCP/IP requests coming on one port to distinct nodes.

HA clusters have software that monitors the health of the system components and its applications (heartbeat) and reports failures; software to achieve consensus after failures and to initiate recovery actions; and application-specific recovery software to restart each application in a consistent state. Disk state is preserved, either through mirroring, or through the sharing of highly reliable RAID storage via multiple connections.

Clouds often use virtual machine software to virtualize processors and storage, thus facilitating their dynamic allocation. They need a resource management infrastructure that can allocate virtual machines on demand.

Research
Early DMMs had directly connected networks; the network topology was part of the programming model. Theoretical research focused on the mapping of parallel algorithms to specific topologies; in particular, hypercubes and related networks (the primary conference on DMMs in the late s was the conference on Hypercube Concurrent Computers and Applications). The abstract model used was of unit time communication on each edge, with communications either being limited to one incident edge per node at a time, or occurring concurrently on all edges incident to a node. For example, the problem of sorting on a hypercube is treated by several tens of publications.

As indirect networks became more prevalent, a significant amount of research focused on the topology of routing networks and cost-performance tradeoffs, and routing protocols for different topologies that are efficient and deadlock free, and that can handle failures.

Modern DMMs largely hide the network topology from the user and may not provide mechanisms for controlling the mapping of processes and communications onto the network nodes and edges. Therefore, the topology is not part of the programming model. The abstract programming models used to represent such systems assume that communication time between two nodes is a function of the size of the message communicated. Two of the most frequently used models are the postal model [] and the logP model [].
System research on DMMs has focused on communication protocols, operating systems, file systems, programming environments, and applications. Some of the larger efforts in this area are the Network of Workstations (NOW) effort at Berkeley [], and the SHRIMP project at Princeton [].

Related Entries
Checkpointing
Clusters
Connection Machine
Cray TE
Distributed-Memory Multiprocessor
Ethernet
Hypercubes and Meshes
IBM RS/ SP
InfiniBand
Meiko
Moore’s Law
MPI (Message Passing Interface)
Myrinet
Network Interfaces
Network of Workstations
Networks, Direct
Networks, Multistage
Parallel Computing
PCI Express
Routing (Including Deadlock Avoidance)

Bibliographic Notes and Further Reading
The book of Fox et al. [] covers the early history of MPPs, their hardware and software, and the applications enabled on them. The book of Culler and Singh [] is a good introduction to MPP architectures and their networks, with a detailed description of the Cray TD and IBM SP networks. The book of Pfister [] is a general introduction to clusters, their design, and their use, while the two books of Buyya [, ] provide in-depth coverage of cluster hardware, software, and applications. The book of Dally and Towles [] provides a very good coverage of IN theory and practice. Finally, the book of Leighton [] is an excellent introduction to the theory results that are relevant to DMMs and INs.

Bibliography
. Anderson TE, Culler DE, Patterson DA, the NOW team () A case for NOW (Networks of Workstations). IEEE Micro ():–
. Bar-Noy A, Kipnis S () Designing broadcasting algorithms in the postal model for message-passing systems. In: Proceedings of the fourth annual ACM symposium on parallel algorithms and architectures SPAA ’. San Diego, California, United States, June –July , . ACM, New York, pp –
. Blumrich MA, Li K, Alpert R, Dubnicki C, Felten EW, Sandberg J () Virtual memory mapped network interface for the SHRIMP multicomputer. In: Sohi GS (ed)  years of the international symposia on computer architecture (selected papers) (ISCA ’). ACM Press, New York, pp –
. Buyya R () High performance cluster computing: architectures and systems, vol . Prentice-Hall, New Jersey
. Buyya R () High performance cluster computing: programming and applications, vol . Prentice-Hall, New Jersey
. Culler DF, Singh JP () Parallel computer architecture: a hardware/software approach. Morgan Kaufman, San Francisco
. Culler D, Karp R, Patterson D, Sahay A, Schauser EK, Santos E, Subramonian R, von Eicken T () LogP: towards a realistic model of parallel computation. In: Proceedings of the fourth ACM SIGPLAN symposium on principles and practice of parallel programming, San Diego, California, United States, – May , pp –
. Dally WJ, Towles BP () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco
. Fox GC, Williams RD, Messina PC () Parallel computing works. Morgan Kaufman, San Francisco
. Leighton FT () Introduction to parallel algorithms and architectures: arrays, trees, hypercubes. Morgan Kaufman, San Francisco
. Pfister GF () In search of clusters, nd edn. Prentice Hall, New Jersey
. Sterling T, Becker DJ, Savarese D, Dorband JE, Ranawake UA, Packer CV () BEOWULF: a parallel workstation for scientific computation. In: Proceedings of the th international conference on parallel processing. Oconomowoc, Wisconsin, pp –


Ditonic Sorting
Bitonic Sort


DLPAR
Dynamic Logical Partitioning for POWER Systems


Doall Loops
Loops, Parallel


Domain Decomposition

Thomas George, Vivek Sarin
IBM Research, Delhi, India
Texas A&M University, College Station, TX, USA

Synonyms
Functional decomposition; Grid partitioning

Definition
Domain decomposition, in the context of parallel computing, refers to partitioning of computational work among multiple processors by distributing the computational domain of a problem, in other words, data associated with the problem. In the scientific computing literature, domain decomposition mainly refers to techniques for solving partial differential equations (PDE) by iteratively solving subproblems corresponding to smaller subdomains. Although the evolution of these
techniques is motivated by PDE-based computational simulations, the general methodology is applicable in a number of scientific domains not dominated by PDEs.

Introduction
One of the key steps in parallel computing is partitioning the problem to be solved into multiple smaller components that can be addressed simultaneously by separate processors with minimal communication. There are two main approaches for dividing the computational work. The first one is domain decomposition, which involves partitioning the data associated with the problem into smaller chunks, so that each parallel processor works on a portion of the data. The second approach, functional decomposition, on the other hand, focuses on the computational steps rather than the data and involves processors executing different independent portions of the same program concurrently. Most parallel program designs involve a combination of these two complementary approaches.

In this entry, we focus on domain decomposition techniques, which are based on data parallelism. Figure  shows a -D finite element mesh model of the human heart [] used for simulation of the human cardiovascular system and in particular, the flow of blood through the various chambers and valves (http://www.andrew.cmu.edu/user/jessicaz/medical_data/Heart_Valve_new.htm). Computation is performed repeatedly for each point on the mesh using Darcy’s law [], which describes the flow of a fluid through a porous medium and is widely used in blood flow simulations. Figure  also illustrates decomposition of this mesh model into  subdomains (indicated by different colors) that are relatively easier to model due to their simpler geometry.

Domain Decomposition. Fig.  Tetrahedral meshing of the heart. The figure on the right shows the mesh partitioned into subdomains indicated with different colors

Domain decomposition is vital for high-performance scientific computing. In addition to providing enhanced parallelism, it also offers a practical framework for solving computational problems over complex, heterogeneous and irregular domains by decomposing them into homogeneous subdomains with regular geometries. The subdomains are each modeled separately using faster, simpler computational algorithms, and the solutions are then carefully merged to ensure consistency at the boundary cases. Such decomposition is also useful when the modeling equations are different across subdomains as in the case of singularities and anomalous regions. Lastly, it is also useful for developing out-of-core (or external memory) solution techniques for large computational problems with billions of unknowns by facilitating the partitioning of the problem domain into smaller chunks, each of which can fit into main memory.

Domain Decomposition for PDEs
Scientific computing applications such as weather modeling, heat transfer, and fluid mechanics involve modeling complex physical systems via partial differential equations. Though there exist many different classes of PDEs, the most widely used ones are second-order elliptic equations, for example, Poisson’s equation, which manifests in many different forms in a number of applications. For example, the Fourier law for thermal conduction, Darcy’s law for flows in porous media, and Ohm’s law for electrical networks all arise as special cases of Poisson’s equation.

Example . For the sake of illustration, we consider a problem that requires solving the Poisson equation over a complex domain Ω with Γ denoting the boundary:

    ∇²u = f in Ω,
    u = uΓ on Γ.
In practice, the domain is discretized using either finite difference or finite element methods, which yields a large sparse symmetric positive definite linear system, Au = f, where A is determined by the discretization and u is a vector of unknowns. Domain decomposition for this problem involves partitioning the domain Ω into subdomains {Ωi}, i = 1, …, s, which may be overlapping or nonoverlapping depending on the nature of the problem. The partitioning into subdomains is sometimes referred to as the coarse grid, with the original discretization being described as the fine grid.
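For instance (a standard construction, stated here only for concreteness), a second-order finite difference discretization on a uniform two-dimensional grid with spacing h approximates the Laplacian at an interior grid point by the familiar five-point stencil,

    ∇²u(xᵢ, yⱼ) ≈ (u_{i−1,j} + u_{i+1,j} + u_{i,j−1} + u_{i,j+1} − 4u_{i,j}) / h²,

and writing one such equation per interior grid point produces the sparse linear system Au = f referred to above.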
Given a discretized computational problem as in the example, there are four main steps that need to be addressed in order to obtain an efficient parallel solution.

● The first step is to identify the preferred characteristics of subdomains for the given problem. This includes choosing the number of subdomains, the type of partitioning (element/edge/vertex-based), and the level of overlap between the subdomains (with disjoint subdomains being the extreme case). Each of these choices depends on a variety of factors such as size, type, and geometry of the problem domain, the number of parallel processors, and the PDE/linear system being solved.
● The next step is to partition the problem domain into the desired subdomains. This can be achieved by using a variety of geometric and graph partitioning algorithms. For certain applications involving adaptive mesh refinement, one may choose to iteratively refine the partitioning based on the load imbalance.
● Given the subdomains, the next step involves solving the PDE for the entire problem domain. This consists of:
  – Solving the PDE over each subdomain, that is, ∇²u = f in Ωi, for 1 ≤ i ≤ s, or, equivalently, the linear systems Ai ui = fi, for 1 ≤ i ≤ s, where Ai, ui, and fi are restrictions of A, u, and f to Ωi, respectively
  – Ensuring consistency at the interfaces (or the overlap in case of overlapping regions) of the subdomains, which is referred to as the coarse solve
● Depending on the overlap between the subdomains, there are two classes of solution approaches. Specifically, Schur’s complement techniques are suitable for subdomains with no overlap and Schwarz alternating techniques are suited for overlapping subdomains. Since these solution techniques iteratively solve the constituent subproblems, these subproblems need not be solved accurately in each step, and different choices of approximation over the subproblems lead to variants of the overall algorithms.
● Given a parallel architecture, the final step is to map the subdomains and the interfaces to the individual processors and translate the domain-decomposition-based solution approach to a parallel program that can be implemented efficiently on this architecture. This mapping depends on the relative computational and communication requirements for solving the interfaces.

Each of the above steps will be discussed in the following sections; a small illustrative sketch of the overall flow is given below.
type of partitioning (element/edge/vertex-based),
and the level of overlap between the subdomains
(with disjoint subdomains being the extreme case). Nature of Subdomains
Each of these choices depend on a variety of fac- The effectiveness of any domain decomposition approach
tors such as size, type, and geometry of the problem critically depends on the relative suitability of the sub-
domain, the number of parallel processors, and the domains both with respect to the computational prob-
PDE/linear system being solved. lem (PDE/linear system) being solved and the parallel
● The next step is to partition the problem domain architecture being employed. Below, we discuss certain
into the desired subdomains. This can be achieved key characteristics of subdomains that influence the
by using a variety of geometric and graph partition- overall performance as well as some general guidelines
ing algorithms. For certain applications involving proposed in literature for choosing them for a given
adaptive mesh refinement, one may chose to iter- problem.
atively refine the partitioning based on the load
imbalance. Number of Subdomains
● Given the subdomains, the next step involves solv- The choice of the number of subdomains has a signif-
ing the PDE for the entire problem domain. This icant effect on both the computational time and the
consists of : quality of the final solution. In general, a large number
– Solving the PDE over each subdomain, that is, of subdomains, or alternately a small size for the coarse
∇ u = f in Ω i , for  ≤ i ≤ s, or, equivalently grid, results in a better, but more expensive solution
the linear systems Ai ui = fi , for  ≤ i ≤ s, where since a large number of subdomain problems need to
Ai , ui , and fi are restrictions of A, u, and f to Ω i , be solved and there is much more communication. On
respectively the other hand, choosing a small number of subdomains
– Ensuring consistency at the interfaces (or the has the opposite effect.
overlap in case of overlapping regions) of the Given a fine grid, that is, a discretized problem, it
subdomains, which is referred to as the coarse has been observed empirically [] that there exists an
solve optimal choice for the number of subdomains that min-
● Depending on the overlap between the subdo- imizes the overall computational time. However, theo-
mains, there are two classes of solution approaches. retical results [] exist only for fairly simple scenarios
Number of Subdomains
The choice of the number of subdomains has a significant effect on both the computational time and the quality of the final solution. In general, a large number of subdomains, or alternately a small size for the coarse grid, results in a better, but more expensive, solution since a large number of subdomain problems need to be solved and there is much more communication. On the other hand, choosing a small number of subdomains has the opposite effect.

Given a fine grid, that is, a discretized problem, it has been observed empirically [] that there exists an optimal choice for the number of subdomains that minimizes the overall computational time. However, theoretical results [] exist only for fairly simple scenarios where the same solver is used for all the subdomain problems as well as the original problem, and the convergence rate is independent of the coarse grid size. For example, for a uni-processor setting on a structured n^d grid with fine grid scale h and a solver with complexity O(n^α), the optimal choice for the coarse grid size is given by

    H_opt = (α/(α − d))^{1/(2α − d)} · h^{α/(2α − d)}.

Practical scientific computing problems, however, often involve fairly complex domains as in Fig.  and this requires taking into account a number of other considerations. Firstly, it is preferable to partition the domain into subdomains with regular geometry on which faster solvers can be deployed. Secondly, in a parallel setting, an ideal choice is to assign each subdomain to an individual processor to minimize communication costs. Such a choice might, however, lead to load imbalance in case the number of subdomains is less than that of the available processors or there is a vast difference in the relative sizes of the subdomains. For a parallel scenario, there exist limited theoretical results. For example, for a structured n^d grid where all the subdomain solves are performed in parallel followed by an integration phase performed by one of the solvers, it can be shown that the optimal number of processors is n^{d/2}, but this analysis only considers load distribution and ignores the communication costs. For a real-world application, one needs to empirically determine an appropriate partitioning size that optimizes both the load balance as well as the communication costs.
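The closed form above is reconstructed from a typographically damaged line, so the exponents should be read with some caution. As a sanity check, the following short sketch numerically minimizes one plausible uniprocessor cost model — (1/H)^d subdomain solves of per-dimension size H/h at cost (H/h)^α, plus a coarse solve of per-dimension size 1/H at cost (1/H)^α — and compares the minimizer with the closed form; the values of α, d, and h are arbitrary illustrations, not taken from the article.

import numpy as np

# Illustrative parameters (assumptions): 2-D problem, cubic-cost solver, h = 0.01.
alpha, d, h = 3.0, 2.0, 0.01

def total_cost(H):
    """One plausible uniprocessor cost model for coarse grid size H."""
    return (1.0 / H) ** d * (H / h) ** alpha + (1.0 / H) ** alpha

H_grid = np.linspace(2 * h, 0.5, 100_000)
H_numeric = H_grid[np.argmin(total_cost(H_grid))]
H_closed = (alpha / (alpha - d)) ** (1.0 / (2 * alpha - d)) * h ** (alpha / (2 * alpha - d))
print(H_numeric, H_closed)   # the two values agree to grid resolution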
Type of Partitioning
There are three main types of partitioning depending on the constraints imposed on subdomain interfaces. The first type is element-based partitioning, common in case of finite-element discretizations, where each element is solely assigned to a single subdomain. The second type is edge-based partitioning, more suitable for finite-volume methods, where each edge is solely assigned to a single subdomain so as to simplify the computation of fluxes across edges. The last type is vertex-based partitioning, which is the least restrictive and only requires each vertex to be assigned to a unique subdomain. The choice of partitioning type determines how the subdomain interface values are computed and can have a significant effect on the complexity and convergence of the algorithm; for example, the Schur complement method can be implemented much more efficiently for edge-based partitionings than for the other two types []. For most problems, the appropriate type of partitioning is determined by the nature of the computational problem (PDE) and the discretization itself, but where there is flexibility, it is important to make a judicious choice in order to optimize the computation time for the coarse solve.

Level of Overlap
Overlap between the subdomains is another important characteristic that determines the solution approach. In particular, Schur complement–based approaches are applicable only in case of nonoverlapping subdomains, whereas Schwarz alternating procedures (SAP) are applicable in case of overlapping subdomains. The appropriate level of overlap is often dependent on the geometry of the problem domain and the PDE to be solved. A decomposition into subdomains with regular geometry (e.g., circles, rectangles) is often preferable irrespective of the level of overlap. In general, SAP-based techniques are easier to implement, in addition to providing close to optimal convergence rates and being more robust. However, extra computational effort is required for the regions of overlap, and they are not suitable for handling discontinuities.
Partitioning Algorithms
For scientific computing tasks, the problem domain is invariably discretized, making it natural to use a graph (or in certain cases hypergraph) representation. Decomposing the problem domain into subdomains, therefore, becomes a graph partitioning problem. For parallel computing with homogeneous processors, it is desirable to have a partitioning such that the resulting subdomains require comparable amounts of computational effort, that is, balanced load, and there are minimal communication costs for handling the subdomain interfaces. In other words, the subdomains need to be of similar size with few edges between them. In case of structured and quasi-uniformly refined grids, such a partitioning can be obtained with relative ease. However, for unstructured grids resulting from most real-world applications, it is an NP-complete problem []. A number of algorithms have been proposed in the literature to address this problem, and [, , ] provide a detailed survey of these algorithms. Some of the key approaches are discussed below.

Geometric Approaches
Geometric partitioning approaches rely mainly on the physical mesh coordinates to partition the problem domain into subdomains corresponding to regions of space. Such algorithms are preferable for applications such as crash and particle simulations, where geometric proximity between grid points is much more critical than the graph connectivity.

Recursive bisection methods form an important class of geometric approaches that involve computing a separator that partitions a region into two subregions in a desired ratio (usually equal). This bisection process is performed recursively on the subregions until the desired number of partitions is obtained. The simplest of these techniques is Recursive Coordinate Bisection (RCB) [], which uses separators orthogonal to the coordinate axes. A more effective technique is Recursive Inertial Bisection (RIB) [], which employs separators orthogonal to the principal inertial axes of the geometry. Over the last decades, a number of extensions [] as well as hybrid techniques [] have been proposed that combine RIB with local search methods such as the Kernighan–Lin method in order to simultaneously minimize the edge cut between the subdomains.
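The following sketch conveys the flavor of recursive coordinate bisection on a toy point set; it is not the implementation of the cited work, and it simply splits at the median of the widest coordinate axis, recursing until the requested number of parts is reached.

import numpy as np

def recursive_coordinate_bisection(points, ids, nparts):
    """RCB sketch: cut the point set at the median of its widest axis,
    then recurse on each half until nparts subdomains are produced."""
    if nparts == 1:
        return [ids]
    axis = int(np.argmax(points.max(axis=0) - points.min(axis=0)))
    order = np.argsort(points[:, axis])
    left_parts = nparts // 2
    cut = (len(ids) * left_parts) // nparts     # split in proportion to parts
    left, right = order[:cut], order[cut:]
    return (recursive_coordinate_bisection(points[left], ids[left], left_parts)
            + recursive_coordinate_bisection(points[right], ids[right],
                                             nparts - left_parts))

rng = np.random.default_rng(0)
pts = rng.random((1000, 2))                     # a toy 2-D "mesh"
parts = recursive_coordinate_bisection(pts, np.arange(len(pts)), 4)
print([len(p) for p in parts])                  # four subdomains of ~250 points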
Space-filling curve (SFC)-based partitioning approaches [] are another class of geometric techniques. In these methods, a space-filling curve maps an n-dimensional space to a single dimension, which provides a linear ordering of all the mesh points. The mesh is then partitioned by dividing the ordered points into contiguous sets of desired sizes, which correspond to the subdomains.
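As an illustration of this idea (and only of the idea — the cited methods may use other curves, such as Hilbert curves, and more careful weighting), the sketch below orders points by Morton (Z-order) keys and cuts the resulting ordering into equal contiguous chunks.

import numpy as np

def morton_key(ix, iy, bits=16):
    """Interleave the bits of integer cell coordinates to get a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

def sfc_partition(points, nparts, bits=16):
    """Order 2-D mesh points along a Z-order curve and split the ordering
    into nparts contiguous chunks of (nearly) equal size."""
    scale = (1 << bits) - 1
    mins, maxs = points.min(axis=0), points.max(axis=0)
    cells = ((points - mins) / (maxs - mins + 1e-12) * scale).astype(int)
    keys = np.array([morton_key(ix, iy, bits) for ix, iy in cells])
    order = np.argsort(keys)
    return np.array_split(order, nparts)        # each chunk is one subdomain

rng = np.random.default_rng(1)
subdomains = sfc_partition(rng.random((1000, 2)), 4)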
The explicit use of geometric information and the incremental nature of the above techniques make them highly beneficial for applications where geometric locality is critical and there are frequent changes in the geometry requiring dynamic load balancing. However, these techniques have limited applicability for computational problems where one needs to consider graph connectivity and communication costs.
Coordinate-Free Approaches
Coordinate-free partitioning approaches include spectral and graph-theory-based techniques that rely only on the connectivity structure of the problem domain. These techniques are widely used in a number of scientific computing applications since they can provide better control of communication costs and do not require an embedding of the problem in a geometric space. There currently exist a large number of such algorithms [], but most of them have been derived from combinations of three basic techniques: (a) Recursive Graph Bisection, (b) Recursive Spectral Bisection, and (c) the Greedy approach. Of these, the recursive graph bisection method [] begins by finding a pseudo-peripheral vertex (i.e., one of a pair of vertices on the longest graph geodesic) and computing the graph distance from this vertex to all the other vertices. This distance is then used to sort the vertices, which are then split into two groups in the desired ratio (usually equal) such that one group is closer to the peripheral vertex and the other one is farther away, and this process is repeated recursively till the desired number of partitions is obtained. The second technique, that is, the recursive spectral bisection method [], makes use of the Fiedler vector, which is defined as the eigenvector corresponding to the second-lowest eigenvalue of the Laplacian matrix of the graph. The size of this vector is the same as the number of vertices in the graph, and the difference between the Fiedler coordinates is closely related to the graph distance between the corresponding vertices. The Fiedler vector, thus, provides a linear ordering of the vertices, which can be used to divide the vertices into two groups and recursively deployed as in the case of SFC-based partitioning algorithms. The greedy approach typically uses a breadth-first search and uses graph-growing heuristics to create partitions [].
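A minimal illustration of the spectral-bisection step is given below: it forms the graph Laplacian, extracts the Fiedler vector with a dense symmetric eigensolver, and splits at its median. Production codes would use sparse eigensolvers (e.g., Lanczos) and apply the step recursively; the toy path graph is an assumption for demonstration only.

import numpy as np

def spectral_bisection(adjacency):
    """One recursive-spectral-bisection step: split a graph in two using the
    Fiedler vector (eigenvector of the second-smallest Laplacian eigenvalue)."""
    degrees = adjacency.sum(axis=1)
    laplacian = np.diag(degrees) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)   # ascending eigenvalues
    fiedler = eigvecs[:, 1]                        # second-lowest eigenvalue
    median = np.median(fiedler)
    part_a = np.where(fiedler <= median)[0]
    part_b = np.where(fiedler > median)[0]
    return part_a, part_b

# A 6-vertex path graph splits into its two halves: {0,1,2} and {3,4,5}.
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1
print(spectral_bisection(A))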
There are also a number of fast multilevel techniques that are especially suitable for large problem domains. These techniques typically consist of three phases: coarsening, partitioning, and refining. In the first phase, a sequence of increasingly coarse graphs is constructed by collapsing vertices that are tightly connected to each other. The coarsened graph is then partitioned into the desired number of subdomains in the second phase. In the third phase, the graph is progressively uncoarsened and refined using local search methods such as the Kernighan–Lin method [], which improves the edge cut via a series of vertex-pair exchanges.

Dynamic Approaches
When the computational structure is relatively invariant throughout a simulation, it is sufficient to consider a static partitioning of the problem domain. However, for applications where the structure changes dynamically (e.g., in crash simulations with geometric deformations or in case of adaptive mesh refinement methods), it is critical to dynamically repartition the problem domain to ensure load balance. Such a repartitioning needs to account for not only the load distribution and communication costs, but also the costs of data redistribution across the partitionings []. Furthermore, in case of large problems, since it is expensive to load the entire mesh onto a single processor, these partitionings themselves need to be computed in parallel. Currently, there exist a number of dynamic partitioning algorithms such as [, , ], which can be viewed as extensions of existing static partitioning algorithms that incorporate an additional objective term based on the data redistribution costs. These techniques often involve a clever remapping of subdomain labels to reduce the redistribution costs. In case of multilevel methods such as Locally-Matched Scratch-Remap (LMSR) [], the coarsening phase is similar to that in the static case, the partitioning (i.e., repartitioning) phase leads to tentative new subdomains, and the final uncoarsening/refinement phase is adapted so as to additionally optimize the data redistribution costs.

PDE Solution Techniques
There are two main classes of approaches for solving PDEs over a partitioned problem domain [] depending on the overlap, namely, (a) Schur complement–based approaches and (b) Schwarz alternating procedures, which we discuss below.
Schur Complement–Based Approaches
This class of approaches is based on the notion of a special matrix called the Schur complement, which enables one to solve for the subdomain interface values independent of the interior unknowns. The computation of the unknowns in the interior of each of the subdomains is then performed in parallel independently.

Consider a computational problem corresponding to a second-order elliptic PDE as in Example 1. As mentioned earlier, a discretization leads to the linear system Au = f, u ∈ Ω. In case of nonoverlapping subdomains, that is, Ω_i ∩ Ω_j = ∅, and edge-based partitioning on a finite difference mesh, this linear system can be expressed as

    ⎡ B_1                 D_1 ⎤ ⎡ u_1^int ⎤   ⎡ f_1^int ⎤
    ⎢      B_2            D_2 ⎥ ⎢ u_2^int ⎥   ⎢ f_2^int ⎥
    ⎢           ⋱         ⋮   ⎥ ⎢    ⋮    ⎥ = ⎢    ⋮    ⎥ ,        (1)
    ⎢               B_s   D_s ⎥ ⎢ u_s^int ⎥   ⎢ f_s^int ⎥
    ⎣ E_1  E_2  ⋯   E_s   C   ⎦ ⎣ u^bnd   ⎦   ⎣ f^bnd   ⎦

where each u_i^int denotes the vector of unknowns interior to subdomain Ω_i and u^bnd denotes the vector of interface unknowns.

Figure  shows the nonzero structure for such a linear system. This can be equivalently expressed in a block form as

    ⎛ B   D ⎞ ⎛ u^int ⎞   ⎛ f^int ⎞
    ⎜       ⎟ ⎜       ⎟ = ⎜       ⎟ ,        (2)
    ⎝ E   C ⎠ ⎝ u^bnd ⎠   ⎝ f^bnd ⎠

where B, D, E, f^int, and u^int are aggregations over the corresponding subdomain-specific variables. From Eq. (2), by substituting for u^int, we obtain a reduced system

    (C − E B⁻¹ D) u^bnd = f^bnd − E B⁻¹ f^int.

[Figure — Domain Decomposition. Fig. : Nonzero structure of the resulting linear system]
Algorithm 1: Schur complement method
1. Form the reduced system based on the Schur complement matrix
2. Solve the reduced system for the interface unknowns
3. Solve the decoupled systems for the interior unknowns by back-substituting the interface unknowns

The matrix S = (C − E B⁻¹ D) is called the Schur complement matrix. This matrix can be computed directly from the original coefficient matrix A given the partitioning and can be used to compute the interface values u^bnd. Back-substitution of u^bnd is then used to solve the smaller decoupled linear systems B_i u_i^int = f_i^int − D_i u^bnd.

In the case of vertex-based and element-based partitionings, the coefficient matrix A exhibits a more complex block structure, unlike in Eq. (1). However, even for these cases, one can construct a reduced system based on a Schur complement matrix assembled from local (subdomain-specific) Schur complements and the interface coupling matrices, which can be used to solve for the interface unknowns.

Algorithm 1 describes the key steps for this class of techniques irrespective of the partitioning type. Depending on the choice of the methods (e.g., direct vs. iterative methods, exact vs. approximate solve) used for forming the Schur complement and solving the reduced systems, there are multiple possible variants.
i Ri ( f − Au).

Schwarz Alternating Procedures Since the updates over each subdomain are applied
Schwarz alternating procedures refer to techniques that sequentially, it is equivalent to a multiplicative compo-
involve alternating between overlapping subdomains sition and this procedure is referred to as Multiplicative
and in each case, performing the corresponding subdo- Schwarz Alternation.
main solve assuming boundary conditions based on the To make this approach amenable to paralleliza-
most recent solution. tion, an alternate procedure Additive Schwarz Alter-
Consider the discretized elliptic PDE problem in nation [] was proposed, which involves using the
Example  for the case where each subdomain Ω i over- same residual ( f − Au) to compute the incremen-
laps with its neighboring subdomains (as in Fig. ). Let tal updates over each subdomain in a single sweep
Γij denote the boundary of Ω i that is included in Ω j . The and the incremental updates are then aggregated to
original solution method proposed by Schwarz [] in obtain the new solution. Algorithm  shows the main
 was a sequential method where each step consists steps for the basic additive Schwarz alternation and
of solving the restriction of Au = f over the subdomain each of the δ i computations in a single iteration are
Ω i assuming the boundary conditions over Γij , ∀j. independent of each other and can be performed in
Since the approach is iterative and sweeps through parallel. Algorithm  corresponds to the most basic ver-
each subdomain multiple times, it suffices to perform sion of additive Schwarz. Real applications, however
Algorithm 2: Additive Schwarz method
  Choose an initial guess for u
  repeat
    for i = 1, …, s do
      Compute δ_i = R_i^T A_i^{−1} R_i (f − Au)
    u_new = u + Σ_{i=1}^{s} δ_i
  until convergence

Algorithm 2 corresponds to the most basic version of additive Schwarz. Real applications, however, employ variants of this approach with similar independent additive updates, but are based on more complex preconditioners [].
Mapping to Parallel Processors
A natural way to map a domain decomposition algorithm to a parallel architecture is to assign each subdomain to a separate processor. In this case, the subdomain solves can be performed independently in parallel, but the coarse solve (handling of interface values) is nontrivial since the relevant interface data is scattered among all the processors. For example, in Schur complement approaches, solving the reduced system based on the Schur complement, in the general case, requires access to data from all the subdomains.

There are three common ways to address the above problem. The first is to keep the data in place and perform the solve in parallel with the necessary communication. Such an approach would require an intelligent grouping of the interface unknowns with minimal dependencies across groups, which can then be assigned to different subdomains so as to minimize the communication cost. The second option is to collect all the data on a single processor, perform the coarse solve, and broadcast the output, while the third approach is to replicate the relevant data on all the processors and perform the coarse solve in parallel. There have been empirical studies [] that evaluated the relative benefits of these approaches, but the appropriate choice typically depends heavily on the characteristics of the specific problem and the parallel architecture.

Advanced Methods
More recent work on domain decomposition has focused on efficient techniques for specific topics. These include advanced methods for graph partitioning and for solving PDEs.

In graph partitioning, researchers have developed hypergraph algorithms [, ] that can be used for graphs with edges that involve more than two nodes. Such graphs arise naturally in many applications such as higher-order finite elements, and can also be constructed from regular graphs. Empirical evidence points to improvement in the efficiency of partitioning algorithms as well as in the quality of partitions. It has also been recognized for a while that large graph problems that require domain decomposition techniques often arise in a parallel setting, and as a result, work has focused on developing parallel partitioning methods. These methods are, in general, straightforward parallelization schemes, with additional care taken to ensure quality of the partitions. Another important area of work involves adaptive partitioning and refining [, ] for applications where the graph changes frequently during the simulation. A key component of these algorithms is the ability to repartition quickly and efficiently starting from an initial partition. The new partitioning must ensure good load balance without increasing communication requirements or introducing large overhead.

The domain decomposition step has implications for the numerical procedure used to solve the PDE. In particular, the rate of convergence of iterative solvers can be improved significantly by considering domain decomposition as a "preconditioning" step that reduces the number of iterations. In this regard, there has been some work on improving the Schur complement approach by using Schwarz-type methods for the reduced linear system. This can be useful when the Schur complement is complex and large enough to require a parallel solver. Another approach that is motivated by solving PDEs in parallel is the Finite Element Tearing and Interconnecting (FETI) [, ] method, in which splitting of the domain into subdomains is accompanied by modification of the linear operator to achieve faster convergence of the iterative solver. After splitting the domain, one needs to enforce continuity across subdomain boundaries using Lagrange multipliers. The solution process determines these Lagrange multipliers via an iterative method that requires repeated subdomain solves that can be done in parallel.
Application Areas
Domain decomposition has been used in a wide variety of disciplines. These include aerospace, nuclear, mechanics, electrical, materials, petroleum, and others. The benefits of the method are most apparent when used for applications that require large systems to be solved. In such cases, the solver phase can consume up to % or more of the total running time of the application. The domain decomposition concept can be applied in a fairly straightforward manner once the subdomains are identified. However, it may be important to fine-tune the parameters for each application in order to get the best performance.

Future Directions
The advent of multicore processors and the availability of large multiprocessor machines pose significant challenges to researchers and developers in the area of domain decomposition. Challenges that arise from large-scale compute capability involve the ability to use a large number of cores concurrently and efficiently. It will be much harder to achieve good partitioning and load balance on such platforms. The ability to perform asynchronous computation and to adapt to changing computational structure is equally important. In addition, it is imperative to build fault tolerance into the algorithmic structure to protect against inevitable architectural instabilities. One must also consider relaxing the numerical requirements on the computations in favor of obtaining an approximate but acceptable solution quickly. Another area of interest is the automatic tuning of parameters required by the algorithms to ensure good performance. It is also important to understand that domain decomposition is often used as a low-level computational tool in applications where the main goal is to optimize within a design space or to conduct sensitivity studies. A tighter integration of the domain decomposition step with other computational modules is necessary to realize the full benefit of this technique for an application.

Related Entries
Graph Partitioning

Bibliography
. Berger MJ, Bokhari SH () A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans Comput ():–
. Çatalyürek UV, Boman EG, Devine KD, Bozdag D, Heaphy RT, Riesen LA () A repartitioning hypergraph model for dynamic load balancing. J Parallel Distrib Comput ():–
. Chan TF, Shao JP () Parallel complexity of domain decomposition methods and optimal coarse grid size. Parallel Comput ():–
. Devine KD, Boman EG, Heaphy RT, Hendrickson BA, Teresco JD, Faik J, Flaherty JE, Gervasio LG () New challenges in dynamic load balancing. Appl Numer Math (–):–
. Devine KD, Boman EG, Heaphy RT, Bisseling RH, Çatalyürek UV () Parallel hypergraph partitioning for scientific computing. In: Proceedings of the th parallel and distributed processing symposium, Rhodes Island, Greece,  pp
. Dryja M, Widlund OB () Domain decomposition algorithms with small overlap. SIAM J Sci Comput ():–
. Farhat C, Lesoinne M () Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics. Int J Numer Methods Eng ():–
. Farhat C () A simple and efficient automatic FEM domain decomposer. Comput Struct ():–
. Farhat C, Pierson K, Lesoinne M () The second generation FETI methods and their application to the parallel solution of large-scale linear and geometrically non-linear structural analysis problems. Comput Methods Appl Mech Eng (–):–
. Farhat C, Roux F-X () A method of finite element tearing and interconnecting and its parallel solution algorithm. Int J Numer Meth Eng :–
. Fjällström P-O () Algorithms for graph partitioning: a survey, vol , part . Linköping University Electronic Press, Sweden
. Gropp WD, Keyes DE () Domain decomposition on parallel computers. IMPACT Comput Sci Eng ():–
. Gropp WD, Smith BF () Experiences with domain decomposition in three dimensions: overlapping Schwarz methods. In: Proceedings of the th international symposium on domain decomposition, AMS, Providence, pp –
. Hendrickson B, Kolda TG () Graph partitioning models for parallel computing. Parallel Comput ():–
. Karypis G, Kumar V () Parallel multilevel k-way partitioning scheme for irregular graphs. In: Proceedings of the th ACM/IEEE conference on supercomputing (CDROM) (Supercomputing '), Article , Pittsburgh, PA, USA
. Karypis G, Kumar V () Multilevel -way hypergraph partitioning. In: Proceedings of the th design automation conference, New Orleans, LA, USA, pp –
. Kernighan BW, Lin S () An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J ():–
. Leland R, Hendrickson B () An empirical study of static load balancing. In: Proceedings of the scalable high performance computing conference, pp –
. Neuman SP () Theoretical derivation of Darcy's law. Acta Mech ():–
. Oliker L, Biswas R () PLUM: parallel load balancing for adaptive unstructured meshes. J Parallel Distrib Comput ():–
. Pothen A, Simon HD, Liou K () Partitioning sparse matrices with eigenvectors of graphs. SIAM J Matrix Anal Appl ():–
. Saad Y () Iterative methods for sparse linear systems, nd edn. SIAM, Philadelphia, PA
. Schamberger S, Wierum J () Partitioning finite element meshes using space-filling curves. Future Gen Comp Syst ():–
. Schloegel K, Karypis G, Kumar V () Wavefront diffusion and LMSR: algorithms for dynamic repartitioning of adaptive meshes. IEEE Transactions on Parallel and Distributed Systems ():–
. Schwarz HA () Über einen Grenzübergang durch alternierendes Verfahren. Vierteljahrsschrift der Naturforschenden Gesellschaft in Zürich :–
. Simon HD () Partitioning of unstructured problems for parallel processing. Comput Syst Eng (–):–
. Sohn A, Simon H () Jove: a dynamic load balancing framework for adaptive computations on an SP- distributed multiprocessor. Technical report, CIS, New Jersey Institute of Technology, NJ
. Sun K, Zhou Q, Mohanram K, Sorensen DC () Parallel domain decomposition for simulation of large-scale power grids. In: ICCAD, pp –
. Zhang Y, Bajaj C () Finite element meshing for cardiac analysis. Technical Report –, ICES, University of Texas at Austin, TX

DPJ
Deterministic Parallel Java

DR
Dynamic Logical Partitioning for POWER Systems

Dynamic Logical Partitioning for POWER Systems
Joefon Jann
T. J. Watson Research Center, IBM Corp., Yorktown Heights, NY, USA

Synonyms
DLPAR; DR; Dynamic LPAR; Dynamic reconfiguration

Definition
Dynamic Logical Partitioning (DLPAR) is the nondisruptive reconfiguration of the real or virtual resources made available to the operating system (OS) running in a logical partition (LPAR). Nondisruptive means that no OS reboot is needed.

Discussion
Introduction
A Logical PARtition (LPAR) is a subset of hardware resources of a computer upon which an operating system instance can run. One can define one or more LPARs in a physical server, and an OS instance running in an LPAR has no access and visibility to the other OS instances running in other LPARs of the same server. In October , IBM introduced the DLPAR technology on POWER-based servers with the general availability of AIX .. This DLPAR technology enables resources such as memory, processors, disks, network adaptors, and CD/DVD/tape drives to be moved nondisruptively (i.e., without requiring a reboot of the source and target operating systems) between LPARs in the same physical server. In simple terms, one can perform the following basic nondisruptive dynamic operations with DLPAR:

● Remove a unit of resource from an LPAR.
● Add a unit of resource to an LPAR.
● Move a unit of resource from one LPAR to another in the same physical server.

For POWER systems, the granularity/unit of resource for a DLPAR operation was one CPU for processors, one Logical Memory Block (LMB) of size  MB for memory, and one I/O slot for I/O (disks, network, and media) adapters. Those units of processor, memory, and I/O slots that are not assigned to any LPAR reside in a "free pool" of the physical server, and can be allocated to new or existing LPARs via DLPAR operations later on. Similarly, when a unit of resource is DLPAR-removed from an LPAR, it goes into the free pool.

The fundamentals of DLPAR are described in [].

AIX has supported DLPAR removal of memory since /; however, DLPAR removal of memory was not available for Linux on POWER until , with the availability of SUSE SLES . On POWER and follow-on POWER systems, the minimum size of an LMB (the unit of DLPAR removal or add) can be as small as  MB. This minimum size is a function of the total amount of physical memory in the physical server.

Since the availability of POWER, all POWER servers are partitioned into one or more LPARs. These LPARs are defined via the POWER HYPervisor firmware (PHYP). An LPAR can be a dedicated-processor LPAR or a Shared-Processor LPAR (SPLPAR). A dedicated-processor LPAR is an LPAR which contains
an integral number of CPUs. SPLPAR was introduced with AIX . on POWER systems in /. One can define a pool of CPUs to be shared by a set of LPARs, which will be called shared-processor LPARs (SPLPARs). The PHYP controls the time-slicing and allocation of CPU resources across the SPLPARs sharing the pool. An SPLPAR has the following attributes:

(a) Minimum, assigned, and maximum number of Virtual Processors (VPs). The number of VPs is the degree of processor-concurrency made available to the OS instance.
(b) Minimum, assigned, and maximum number of processor-units of Capacity. Capacity is the amount of cumulative real-CPU-equivalent resources allocated to the OS instance in a reasonable interval of time (e.g.,  ms).
(c) Capped or uncapped mode. A capped SPLPAR cannot use excess CPU resources in the pool beyond its maximum capacity.
(d) If uncapped, an SPLPAR has an uncapped-weight value (currently, an integer between  and ), which is used for prioritizing the SPLPARs in a pool.

The DLPAR technology was then enhanced to be able to dynamically move units of an SPLPAR's CPU resources nondisruptively from one SPLPAR to another in the same SPLPAR pool. More precisely, DLPAR technology can nondisruptively and dynamically

● Remove or add an integral number of Virtual Processors from or to an SPLPAR, or move such from one SPLPAR to another
● Remove or add Capacity (in processor-units) from or to an SPLPAR, or move capacity from one SPLPAR to another
● Change the processor Mode of an SPLPAR to Capped or Uncapped
● Change the Weight of an Uncapped SPLPAR (between a value of  and )

DLPAR Technology

DLPAR Removal of a CPU
DLPAR enables the dynamic removal of a CPU from a running OS instance in one LPAR, without requiring a reboot of the OS instance. At a very high level, the following sequence of operations initiated from the OS is used to accomplish a DLPAR removal of a CPU:

1. Notify all applications and kernel extensions that have registered to be notified of CPU DLPAR events, so that they may remove dependencies on the CPU to be removed. This typically involves unbinding their threads that are bound to the CPU being removed.
2. Migrate threads that are bound to the CPU being removed to another CPU in the same LPAR.
3. Reroute existing and future hardware interrupts that are directed to the CPU being removed. This involves changing the interrupt controller data structures.
4. Migrate the timers and threads from the CPU being removed to another CPU in the same LPAR.
5. Notify the hypervisor/firmware to complete the removal task.

DLPAR Addition of a CPU
Dynamic addition of a CPU involves the following tasks initiated from the OS:

1. Create a process (waitproc) to run an idle loop on the incoming CPU, before it starts to do real work.
2. Set up various hardware registers (e.g., GPR for the kernel stack, GPR for the kernel Table Of Contents) of the incoming CPU.
3. Allocate and/or initialize the processor-specific kernel data structures (e.g., runqueue, per-processor data area, interrupt stack) for the incoming CPU.
4. Add support for the incoming CPU to the interrupt subsystem.
5. Notify the DLPAR-registered applications and kernel extensions that a new CPU has been added.

Dynamic addition and removal of processors to and from an LPAR allow it to adapt to varying workloads. Figure  shows that the throughput (in requests-per-second) of the WebSphere Trade benchmark scales almost linearly with the number of CPUs in a dedicated-processor LPAR, as CPUs were removed and added to the LPAR with DLPAR operations [].
[Figure: throughput in requests per second (0–900) versus elapsed time in hundreds of seconds, with the number of CPUs stepping 8 → 1 → 8 along the upper axis.]
Dynamic Logical Partitioning for POWER Systems. Fig.  WebSphere Trade benchmark: throughput (in requests-per-second) as a function of the number of CPUs (the upper x-axis labels), and as a function of elapsed time (the lower x-axis labels)
DLPAR Addition of a Memory Block (LMB)
When a DLPAR memory-add request arrives at the OS, the latter has to perform two tasks:

1. Allocate and initialize software page descriptors that will hold metadata for the incoming memory.
2. Distribute the incoming memory among the framesets of a mempool (mempools are described two paragraphs below).

The challenges encountered in implementing these two tasks in AIX, and how they were resolved, are described as follows.

Prior to implementing DLPAR, software page descriptors could be accessed in translation-off mode (where the virtual address is used as the real address) while trying to reload a page mapping into the hardware page table. If page descriptors are allowed to be accessed in translation-off mode, then the memory allocated for the incoming new descriptors has to be physically contiguous to the memory for the existing descriptors, which implies that memory has to be reserved at boot time for enough descriptors for the maximal amount of memory that the OS instance can grow into. This reservation can incur much memory wastage and performance degradation, particularly if not utilized. We avoided this wastage by changing the AIX kernel so that software page descriptor data structures are always accessed in translation-on mode.

The challenge of distributing the incoming memory across different page replacement daemons, so that each daemon handles a roughly equal load, was resolved as follows. In AIX, memory is hierarchically represented by the data structures vmpool, mempool, and frameset. A vmpool represents an affinity domain of memory. A vmpool is divided into multiple mempools, and each mempool is managed by a single page replacement LRU (least recently used) daemon. Each mempool is further subdivided into one or more framesets that contain the free-frame lists, so as to improve the scalability of free-frame allocators. When new memory is added to AIX, the vmpool that it should belong to is defined by the physical location of the memory chip. Within that vmpool, we wanted to distribute the
memory across all the available mempools to balance the load on the page replacement daemons. However, the kernel assumed that a mempool consisted of physically contiguous memory. Thus, to be able to break up the new memory (LMB) into several parts and distribute them across different mempools, the kernel was modified to allow mempools to be made up of discontiguous sections of memory.

DLPAR Removal of a Memory Block (LMB)
This is by far the hardest to implement of all the DLPAR operations, particularly when there are ongoing DMA operations in the OS instance. For the purpose of DLPAR memory removal, we classify the memory page-frames in AIX into five categories: unused, page-able, pinned, DMA-mapped, and translation-off memory. The approach taken to remove a page-frame (4,096 bytes) in each of these categories is as follows:

1. A page-frame containing a free/unused page is simply removed from its free-list.
2. A page-frame containing a page-able page can be made to either page out its contents to disk or to migrate its contents to a different free page-frame.
3. A page-frame containing a pinned page will have its contents migrated to a different page-frame; also, the page-fault reload handler had to be made to spin during the migration.
4. A page-frame containing a DMA-mapped page cannot be removed nor have its contents migrated until all the accesses to the page are blocked. Here, the term "DMA-mapped page-frame" is used generically to mean a page-frame whose physical address is subject to read or write by an external (to the kernel) entity such as a DMA engine. The contents of a DMA-mapped page-frame are migrated to a different page-frame with a new hypervisor call (h_migrate_dma) that will selectively suspend the bus traffic while it is modifying the TCEs (translation control entries) in system memory used by the bus unit controller for DMA accesses. Other page-frames whose physical addresses are exposed to external entities are handled by invoking preregistered DLPAR callback routines and then waiting for completion of the removal of their dependencies on the page.
5. A page-frame containing translation-off pages will not be removed by DLPAR in AIX .. Fortunately, there is just a small amount of these page-frames, and they are usually collocated in low memory.

The design and implementation of DLPAR memory removal is structured as three modular functions, such that one can mix and match these functions in several possible ways, adapting to the state of the system at the time of memory removal, thus achieving the desired end result via the most efficient path. These three modular functions perform, respectively, the following three tasks on the memory (LMB) being removed:

(a) Remove its free and clean pages from the regular use of the VMM (Virtual Memory Manager)
(b) Page out its page-able dirty pages
(c) Migrate the contents of each remaining page-frame in the LMB to a free page-frame outside the LMB

Memory removal can be implemented with any one of the sequences: abc, ac, bc, or just c. For example, the decision to either invoke page-out (task b) or to migrate (task c) all the pages depends on the load in the LPAR at that particular time. If the LMB being removed contains a lot of dirty pages that belong to highly active threads, then it does not make sense to invoke task b, because these pages will be paged back in almost immediately, hence negatively impacting the efficiency of the system.

As a second example, if there are not enough free frames in other LMBs to migrate the pages to, then the memory removal procedure can invoke task b before invoking task c, so that there will be far fewer pages left that need to be migrated in task c.

DLPAR of I/O Slots
The methods to dynamically configure or unconfigure a device have been introduced as early as AIX version . The changes required in the kernel design for the DLPAR of I/O slots were not on the same scale as those for the DLPAR of processors, and particularly as those for the DLPAR of memory. The reason is that the onus of configuring or unconfiguring a device lies with the device driver software, which operates in the kernel extension environment. The kernel just acts as a provider of serialization mechanisms for devices accessing common resources, and as an intermediary between the applications and the device drivers.
Values and Uses of DLPAR
DLPAR is the foundation of many advanced technologies for ensuring scalability, reliability, and full utilization of the resources in POWER Systems servers. It opens up a whole set of possibilities for great flexibility in dealing with dynamically changing workload demands, server consolidations of workloads from different time zones, and high availability for applications.

Some Obvious Values Enabled by DLPAR
● Ability to dynamically move processors and memory from a test LPAR to a production LPAR in periods of peak production workload demand, then move them back as demand decreases.
● Ability to programmatically and dynamically move processors and memory from less loaded LPARs to busy LPARs, whenever the need arises []. This is particularly useful for economically providing resources to LPARs that service workloads in different time zones or even different continents.
● Ability to dynamically move an infrequently used I/O device between LPARs, such as CD/DVD/tape drives for installations, applying updates, and for backups. Occasionally, one may also want to move Ethernet adaptors and disk drives dynamically from one LPAR to another.
● Ability to dynamically release a set of processor, memory, and I/O resources into the "free pool," so that a new LPAR can be created using the resources in the free pool.
● Providing high availability for the applications in an LPAR. For example, you can configure a set of minimal LPARs on a single system to act as "failover" backup LPARs to a set of primary LPARs, and also keep some set of resources free. If one of the associated primaries fails, then its backup LPAR can pick up the workload, and resources can be dynamically added to the backup LPAR as needed.

Some Non-Obvious Values Enabled by DLPAR
● Dynamic CPU Guard, which is the automatic and graceful de-configuration of a processor which has intermittent errors.
● Dynamic CPU Sparing, which is the Dynamic CPU Guard feature enhanced with automatic enablement of a spare CPU to replace a defective CPU.
● CUoD (Capacity Upgrade on Demand), which allows a customer to activate preinstalled but inactive and unpaid-for processors as resource needs arise, as soon as IBM is notified and enables the corresponding license keys. Many enterprises with seasonally varying business activities extensively utilize this feature in their IT infrastructure.
● Hot repair of PCI devices, and possibly future hot repair of MCMs (Multi-Chip Modules).
● The failover of an LPAR to another LPAR in another physical server, used with the IBM HACMP LPP (Licensed Program Product). This is detailed in the -page IBM Redpaper titled "HACMP v., Dynamic LPAR, and Virtualization" []. HACMP is an acronym for High Availability Cluster Multiprocessing. It is IBM's solution for high-availability clusters on the AIX and Linux POWER platforms.

DLPAR-Safe, DLPAR-Aware, and DLPAR-Friendly Programs/Applications
A DLPAR-safe program is one that does not fail as a result of DLPAR operations. Its performance may suffer when resources are taken away, and performance may not scale up when new resources are added. However, by default, most applications are DLPAR-safe, and only a few applications are expected to be impacted by DLPAR.

A DLPAR-aware program is one that has code that is designed to adjust its use of system resources commensurate with the actual resources of the LPAR, which is expected to vary over time. This may be accomplished in two ways. The first way is by regularly polling the system to discover changes in its LPAR resources. The second way is by registering a set of DLPAR code that is designed to be executed in the context of a DLPAR operation. At a minimum, DLPAR-aware programs are designed to avoid introducing conditions that may cause DLPAR operations to fail. They are not necessarily concerned with the impact that DLPAR may have on their performance.

A DLPAR-friendly program is a DLPAR-aware application or middleware that automatically tunes itself in response to the changing LPAR resources. AIX provides a set of DLPAR scripts and APIs for applications to dynamically resize. Notable vendor applications
that are DLPAR-friendly include IBM DB [], Lotus Domino [], Oracle [], etc.

More information about these scripts and APIs can be found in chapter  (Dynamic Logical Partitioning) of the online IBM manual "AIX Version . General Programming Concepts" [].

To maintain expected levels of application performance when memory is removed, buffers may need to be drained and resized. Similarly, when memory is added, buffers may need to be dynamically increased to gain performance.

Similar treatment needs to be applied to threads, whose numbers, at least in theory, need to be dynamically adjusted to the changes in the number of online CPUs. However, thread-based adjustments are not necessarily limited to CPU-based decisions; for example, the best way to reduce memory consumption in Java programs may be to reduce the number of threads, since this should reduce the number of active objects that need to be preserved by the Java Virtual Machine's garbage collector.
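The polling style of DLPAR awareness can be sketched in a few lines. The snippet below is not AIX's DLPAR API (a real AIX program would use the registered DLPAR scripts or reconfiguration handlers mentioned above); it is a generic, portable illustration that rebuilds a worker pool when the observed online-CPU count changes. The function names and the 30-second polling interval are assumptions.

import os
import time
from concurrent.futures import ThreadPoolExecutor

def current_online_cpus():
    # os.cpu_count() is a portable stand-in for querying the LPAR's
    # currently online processors.
    return os.cpu_count() or 1

def run_with_adaptive_pool(work_items, process, poll_seconds=30):
    """Poll the online-CPU count and rebuild the worker pool when it changes."""
    cpus = current_online_cpus()
    pool = ThreadPoolExecutor(max_workers=cpus)
    last_check = time.monotonic()
    for item in work_items:
        if time.monotonic() - last_check > poll_seconds:
            last_check = time.monotonic()
            new_cpus = current_online_cpus()
            if new_cpus != cpus:                  # a CPU was DLPAR-added/removed
                pool.shutdown(wait=True)          # drain, then resize the pool
                cpus, pool = new_cpus, ThreadPoolExecutor(max_workers=new_cpus)
        pool.submit(process, item)
    pool.shutdown(wait=True)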
Future Directions
Future uses of DLPAR are already being explored; among these are:

● Power/energy conservation for the server by dynamically DLPAR-removing unused CPUs and memory
● Dynamic hot repair of server components, even MCMs (Multi-Chip Modules), for P & P

Related Entries
IBM Power Architecture
Multi-Threaded Processors

Bibliography
. Jann J, Browning L, Burugula RS () Dynamic reconfiguration: basic building blocks for autonomic computing in IBM pSeries servers. IBM Syst J (Spec Issue Auton Comput) ()
. Jann J, Dubey N, Pattnaik P, Burugula RS () Dynamic reconfiguration of CPU and WebSphere on IBM pSeries servers. Softw Pract Exp J (SPE Issue) ():–
. Lynch J Dynamic LPAR – the way to the future. http://www.ibmsystemsmag.com/aix/junejuly/administrator/p.aspx
. Shah P () DB and dynamic logical partitioning. http://www.ibm.com/developerworks/eserver/articles/db_dlpar.html
. Bassemir R, Faurot G () Lotus Domino and AIX DLPAR. http://www.ibm.com/developerworks/aix/library/au-DominoandDLPARver.html
. Shanmugam R () Oracle database and Oracle RAC gR on IBM AIX. http://www.ibm.com/servers/enable/site/peducation/wp/a/a.pdf
. IBM manual "AIX Version .: General Programming Concepts: Writing and Debugging Programs" SC--, /, pp –. http://publib.boulder.ibm.com/infocenter/aix/vr/topic/com.ibm.aix.genprogc/doc/genprogc/genprogc.pdf
. Quintero D, Bodily S, Pothier P, Lascu O () HACMP v., dynamic LPAR, and virtualization. IBM Redpaper, http://www.redbooks.ibm.com/redpapers/pdfs/redp.pdf
. Bodily S, Killeen R, Rosca L () PowerHA for AIX cookbook. IBM Redbook, http://www.redbooks.ibm.com/redbooks/pdfs/sg.pdf
. Matsubara K, Guérin N, Reimbold S, Niijima T () The complete partitioning guide for IBM eServer pSeries servers. IBM Redbook, SG--, http://portal.acm.org/citation.cfm?id= ISBN:
. Irving N, Jenner M, Kortesniemi A () Partitioning implementation for IBM eServer p servers. IBM Redbook, SG--, http://www.redbooks.ibm.com/redbooks/pdfs/sg.pdf
. Hales C, Milsted C, Stadler O, Vågmo M () PowerVM virtualization on IBM System p: introduction and configuration, th edn. IBM Redbook, http://www.redbooks.ibm.com/redbooks/pdfs/sg.pdf
. Dimmer I, Haug V, Huché T, Singh AK, Vågmo M, Venkataraman AK () Ch : DLPAR. In: PowerVM virtualization managing & monitoring. IBM Redbook, http://www.redbooks.ibm.com/redbooks/pdfs/sg.pdf
. Jann J, Burugula RS, Dubey N. IBM DLPAR tool set for pSeries systems. http://www.alphaworks.ibm.com/tech/dlpar (Contact joefon@us.ibm.com)
. Wikipedia entry on DLPAR: http://en.wikipedia.org/wiki/DLPAR

Dynamic LPAR
Dynamic Logical Partitioning for POWER Systems

Dynamic Reconfiguration
Dynamic Logical Partitioning for POWER Systems
Earth Simulator
Ken'ichi Itakura
Japan Agency for Marine-Earth Science and Technology (JAMSTEC), Yokohama, Japan

Synonyms
ES

Definition
The Earth Simulator is a highly parallel vector supercomputer system consisting of  processor nodes and an interconnection network.

Discussion
The Earth Simulator, which was developed, as a national project, by three governmental agencies, the National Space Development Agency of Japan (NASDA), the Japan Atomic Energy Research Institute (JAERI), and the Japan Marine Science and Technology Center (JAMSTEC), was led by Dr. Hajime Miyoshi. The ES is housed in the Earth Simulator Building (approximately  m ×  m ×  m). The fabrication and installation of the ES at the Earth Simulator Center of JAMSTEC by NEC was completed at the end of February . It was first on the Top list for two and a half years starting in June  (Fig. ).

Hardware
The ES is a highly parallel vector supercomputer system of the distributed-memory type, and consisted of  processor nodes (PNs) connected by  ×  single-stage crossbar switches. Each PN is a system with a shared memory, consisting of eight vector-type arithmetic processors (APs), a -GB main memory system (MS), a remote access control unit (RCU), and an I/O processor. The peak performance of each AP is  Gflops. The ES as a whole thus consists of , APs with  TB of main memory and a theoretical performance of  Tflops.

Each AP consists of a four-way super-scalar unit (SU), a vector unit (VU), and a main memory access control unit on a single LSI chip. The AP operates at a clock frequency of  MHz, with some circuits operating at  GHz. Each SU is a super-scalar processor with  KB instruction caches,  KB data caches, and  general-purpose scalar registers. Branch prediction, data prefetching, and out-of-order instruction execution are all employed. Each VU has  vector registers, each of which has  vector elements, along with eight sets of six different types of vector pipelines: addition/shifting, multiplication, division, logical operations, masking, and load/store. Pipelines of the same type work together under a single vector instruction, and pipelines of different types can operate concurrently. The VU and SU support the IEEE  floating-point data format. The LSI chip technology is  nm CMOS with eight-layer copper interconnection. The die size is . mm × . mm. It has  million transistors and , pins. The maximum power consumption is  W.

The overall MS is divided into , banks, and the sequence of bank numbers corresponds to increasing addresses of locations in memory. Therefore, the peak throughput is obtained by accessing contiguous data which are assigned to locations in increasing order of memory address.

The RCU is directly connected to the crossbar switches and controls internode data communications at a . GB/s bidirectional transfer rate for both sending and receiving data. Thus the total bandwidth of the internode network is about  TB/s. Several data-transfer modes, including access to three-dimensional (D) sub-arrays and indirect access modes, are realized in hardware. In an operation that involves access to the data of a sub-array, the data is moved from one PN to
[Figure: bird's-eye view of the installation — disks, interconnection network (IN) cabinets (65), processor node (PN) cabinets (320), air conditioning system, power supply system, and a double floor for cables, in a room of about 65 m (71 yd) × 50 m (55 yd).]
Earth Simulator. Fig.  Bird's-eye view of the Earth Simulator system
another in a single hardware operation, and relatively little time is consumed for this processing.

Each Interconnection Network (IN) cabinet has two  ×  switches. The signal from a PN is converted from serial to parallel form and fed into a switching circuit; the signal coming out of the switching circuit undergoes parallel-to-serial conversion and is then sent on to a PN cabinet. The number of cables connecting the PN cabinets and the IN cabinets is  ×  = ,, and their total length is , km.

MDPS
In October , the Mass Data Processing System (MDPS) was installed as a mass data storage system that replaces the tape library system. It consists of four file service processors (FSPs),  TB of hard disk drives, and the currently used . PB cartridge tape library (CTL). MDPS was adopted with the aim of improving data transfer throughput and accessibility.

1. The transfer speed of saving and extracting data between the ES and the storage became two to five times faster. This improvement was realized by expanding the transfer cable capacity and replacing a tape archive with disks.
2. The user view is one very large disk, and the usual UNIX commands and tools can access the MDPS data on the login server. The data I/O procedure becomes easy.
3. MDPS enables users to access the results computed by the ES remotely, because it can transfer the data to a dedicated server outside the ES LAN (the ES-Network FTP Server).

Operating System
The operating system running on the ES is an enhanced version of NEC's UNIX-based OS called "SUPER-UX" that is developed for NEC's SX Series supercomputers. To support ultra-scale scientific computations, SUPER-UX was enhanced mainly in the following two points:

1. Extending scalability
2. Providing special features for the ES

Extending scalability up to the whole system ( PNs) is the major requirement for the OS of the ES. All functions of the OS, such as process management, memory management, file management, etc., are fully optimized to fulfill this requirement. For example, any OS task costing order n, such as scattering data in sequence over all PNs, is replaced with an equivalent one costing order log n, such as a binary-tree copy, where possible (n being the number of PNs).

[Figure: the crossbar network connecting the TSS cluster (PN #0–PN #15) and the back-end clusters (PN #16–PN #639), with cluster control stations CCS #0–CCS #39.]
Earth Simulator. Fig.  Super cluster system of ES: a hierarchical management system is introduced to control the ES.
Every  nodes are collected as a cluster system and therefore, there are  sets of cluster in total. A set of cluster is called
an “S-cluster” which is dedicated for interactive processing and small-scale batch jobs. A job within one node can be
processed on the S-cluster. The other sets of cluster is called “L-cluster” which are for medium-scale and large-scale batch
jobs. Parallel processing jobs on several nodes are executed on some sets of cluster. Each cluster has a cluster control
station (CCS) which monitors the state of the nodes and controls electricity of the nodes belonged to the cluster. A super
cluster control station (SCCS) plays an important role of integration and coordination of all the CCS operations.

[Figure: PN#0, PN#1, PN#2, ..., PN#n, each with APs and a File Access Library (FAL); I/O requests and read/write operations go to the file servers on the nodes, which together hold the parallel file.]
Earth Simulator. Fig.  Parallel file system (PFS): A parallel file, i.e., file on PFS is striped and stored cyclically in the
specified blocking size into the disk of each PN. When a program accesses to the file, the File Access Library (FAL) sends a
request for I/O via IN to the File Server on the node that owns the data to be accessed

On the other hand, the OS provides some special features which aim for efficient use or administration of such a large system. The features include internode high-speed communication via the IN, a global address space among PNs, the super cluster system (Fig. ), the batch job environment, etc.

Parallel File System
If a large parallel job running on  PNs reads from or writes to one disk installed in a PN, each PN accesses the disk in sequence, and performance degrades terribly. Although local I/O, in which each PN reads from or writes to its own disk, solves the problem, it is very hard to manage such a large number of partial files. Therefore, parallel I/O is greatly demanded in ES from the point of view of both performance and usability. The parallel file system (PFS) provides the parallel I/O features to ES (Fig. ). It enables handling multiple files, which are located on separate disks of multiple PNs, as logically one large file. Each process of a parallel program can read/write distributed data from/to the parallel file concurrently with one I/O statement to achieve high performance and usability of I/O.

Job Scheduling
ES is basically a batch-job system, and Network Queuing System II (NQSII) is introduced to manage the batch jobs (Fig. ). There are two types of queues: the L batch queue and the S batch queue. The S batch queue is aimed at pre-runs and post-runs for large-scale batch jobs (making initial data, processing results of a simulation, and other processes), and the L batch queue is for production runs. Users choose an appropriate queue for their jobs.
The S batch queue is designed for single-node batch jobs. In the S batch queue, the Enhanced Resource Scheduler (ERSII) is used as the scheduler and does job scheduling based

Running L-system Node number


Login server
Waiting J1

Batch job entry J2


Multi-node batch job
(qsub)
J2 J3

J1
J3

Current Time
L batch queue
S-system AP number

S batch queue

Single-node batch job in S-system

Current Time

Earth Simulator. Fig.  Shows the queue configuration of ES

[Figure: job execution environment. Single-node batch jobs run on the nodes for single-node batch jobs and on the interactive nodes, which NFS-mount the S-disk and the user's home and data disks; multi-node batch jobs run on the nodes for multi-node batch jobs, each with a work disk that is staged in/out automatically before/after job execution from the user disks, MDPS (M-disk), and the tape library (MT). The user logs in to the login server remotely, writes a batch script, and submits the jobs.]
Earth Simulator. Fig.  Job execution. The user writes a batch script and submits the batch script to ES. The node
scheduling, the file staging, and other processing are automatically processed by the system

on CPU time. On the other hand, the L batch queue is for multi-node batch jobs. In this queue, a scheduler customized for ES is used. This scheduler has been developed with the following strategies:
. The nodes allocated to a batch job are used exclusively for that batch job.
. The batch job is scheduled based on elapsed time instead of CPU time.
Strategy () makes it possible to estimate the job termination time and to allocate nodes to the next batch jobs in advance. Strategy () contributes to efficient job execution: the job can use the nodes exclusively, and the processes in each node can be executed simultaneously. As a result, large-scale parallel programs can be executed efficiently.
PNs of the L-system are prohibited from accessing the user disk in order to ensure sufficient disk I/O performance. Therefore, the files used by a batch job are copied from the user disk to the work disk before the job execution. This process is called "stage-in." It is important to hide this staging time in the job scheduling.
The main steps of the job scheduling are summarized as follows:
. Node allocation
. Stage-in (copies files from the user disk to the work disk automatically)
. Job escalation (rescheduling for an earlier estimated start time if possible)
. Job execution
. Stage-out (copies files from the work disk to the user disk automatically)

Earth Simulator. Table  Programming model on ES
            "Hybrid"                  "Flat"
Inter-PN    HPF/MPI                   HPF/MPI
Intra-PN    Microtasking/OpenMP       HPF/MPI
AP          Automatic vectorization   Automatic vectorization

CPU CPU CPU CPU CPU CPU

Microtask/OpenMp Microtask/OpenMp
HPF/MPI

Hybrid parallelization

CPU CPU CPU CPU CPU CPU

HPF/MPI HPF/MPI
HPF/MPI

Flat parallelization

Earth Simulator. Fig.  Two types of parallelization of ES



When a new batch job is submitted, the scheduler searches for available nodes (Step ). After the nodes and the estimated start time have been allocated to the batch job, the stage-in process starts (Step ). The job then waits until the estimated start time after the stage-in process is finished. If the scheduler finds an earlier start time than the estimated start time, it allocates the new start time to the batch job; this process is called "Job Escalation" (Step ). When the estimated start time has arrived, the scheduler executes the batch job (Step ). The scheduler terminates the batch job and starts the stage-out process after the job execution is finished or the declared elapsed time is over (Step ) (Fig. ).

Programming Model in ES
The ES hardware has a three-level hierarchy of parallelism: vector processing in an AP, parallel processing with shared memory in a PN, and parallel processing among PNs via the IN. To bring out the high performance of ES fully, you must develop parallel programs that make the most use of such parallelism.
As shown in Table , the three-level hierarchy of parallelism of ES can be used in two manners, which are called hybrid and flat parallelization, respectively (Fig. ). In the hybrid parallelization, the internode parallelism is expressed by HPF or MPI, and the intranode parallelism by microtasking or OpenMP; you must, therefore, consider the hierarchical parallelism in writing your programs. In the flat parallelization, both inter- and intranode parallelism can be expressed by HPF or MPI, and it is not necessary to consider such complicated parallelism. Generally speaking, the hybrid parallelization is superior to the flat in performance, and vice versa in ease of programming. Note that the MPI libraries and the HPF runtimes are optimized to perform as well as possible in both the hybrid and the flat parallelization.

Bibliography
. Sato T, Kitawaki S, Yokokawa M () Earth simulator running. In: th international supercomputer conference ISC, Heidelberg
. Kitawaki S () The development of the earth simulator. In: LACSI symposium  (Keynote speech), Santa Fe
. Habata S, Kitawaki S, Yokokawa M () The earth simulator system. NEC Res Dev ():–
. Inasaka J, Ikeda R, Umezawa K, Ko Y, Yamada S, Kitawaki S () Hardware technology of the earth simulator. NEC Res Dev ():–

Eden

Rita Loogen
Philipps-Universität Marburg, Marburg, Germany

Definition
Eden is a parallel functional programming language that extends the non-strict functional language Haskell with constructs for the definition and instantiation of parallel processes. The programmer is freed from managing synchronization and data exchange between processes while keeping full control over process granularity, data distribution, and communication topology. Eden is geared toward distributed settings, that is, processes do not share any data. Common and sophisticated parallel communication patterns and topologies, that is, algorithmic skeletons, are provided as higher-order functions in a skeleton library written in Eden.
Eden is implemented on the basis of the Glasgow Haskell Compiler GHC [], a mature and efficient Haskell implementation. While the compiler frontend is almost unchanged, the backend is extended with a parallel runtime system (PRTS) []. This PRTS uses suitable middleware (currently PVM or MPI) to manage parallel execution.

Discussion
Introduction
Functional languages are promising candidates for parallel programming because of their high level of abstraction and, in particular, because of their referential transparency. In principle, any sub-expression could be evaluated in parallel. As this implicit parallelism would lead to too much overhead, modern parallel functional languages allow the programmers to specify parallelism explicitly. The underlying idea of Eden is to enable programmers to specify process networks in a declarative way. Processes map input to output values. Inputs and outputs are transferred via unidirectional one-to-one channels. A comprehensive definition of Eden including a discussion of its formal semantics and its implementation can be found in the journal paper []. We describe only the essentials here.

Basic Eden Constructs
The basic Eden coordination constructs are process abstraction and instantiation:

process :: (Trans a, Trans b) ⇒ (a → b) → Process a b
( # )   :: (Trans a, Trans b) ⇒ Process a b → a → b

The function process embeds functions of type a → b into process abstractions of type Process a b, while the instantiation operator ( # ) takes such a process abstraction and input of type a to create a new process which consumes the input and produces output of type b. Thus, the instantiation operator leads to a function application, with the side effect that the function is evaluated by a remote child process. Its argument is evaluated concurrently, that is, by a new thread, in the parent process and sent to the child process which, in turn, fully evaluates and sends back the result of the function application. Both processes use implicit communication channels established between child and parent process on process instantiation.
The type context (Trans a, Trans b) states that both a and b must be types belonging to the type class Trans of transmissible values. In general, Haskell type classes provide a structured way to define overloaded functions. Trans defines implicitly used communication functions which by default transmit normal form values in a single piece. The overloading is used twofold. Lists are communicated element-wise as streams, and tuple components are evaluated concurrently. An independent thread will be created for each component of an output tuple, and the result of its evaluation will be sent on a separate channel. The connection points of channels to processes are called inports on the receiver side and outports on the sender side. There is a one-to-one correspondence between the threads and the outports of a process, while data that is received via the inports is shared by all threads of a process. Analogously, several threads will be created in a parent process for tuple inputs of a child process. During its lifetime, an Eden process can thus contain a variable number of threads.
The demand-driven (lazy) evaluation of Haskell is an obstacle for parallelism. A completely demand-driven evaluation creates a parallel process only when its result is already needed to continue the main evaluation. This suppresses any real parallel evaluation. Although Eden overrules the lazy evaluation, for example, by evaluating inputs and outputs of processes eagerly and by sending those values immediately to the corresponding recipient processes, it is often necessary to use explicit demand control in order to start processes speculatively before their result values are needed by the main computation. Evaluation strategies [] are used for that purpose. We will not go into further details. In the following, we will use the (predefined) Eden function spawn to eagerly and immediately instantiate a finite list of process abstractions with their corresponding inputs. Appropriate demand control is incorporated in this function. Neglecting demand control, spawn would be defined as follows:

spawn :: (Trans a, Trans b) ⇒ [Process a b] → [a] → [b]
-- definition without demand control
spawn = zipWith (#)

The variant spawnAt additionally locates the created processes on given processor elements (identified by their number).

spawnAt :: (Trans a, Trans b) ⇒ [Int] → [Process a b] → [a] → [b]

Example: A parallel variant of the function map :: (a → b) → [a] → [b] which creates a process for each application of the parameter function to an element of the input list can be defined as follows:

parMap :: (Trans a, Trans b) ⇒ (a → b) → [a] → [b]
parMap f = spawn (repeat (process f))

The Haskell prelude function repeat :: a → [a] yields a list infinitely repeating the parameter element. ⊲

To model many-to-one communication, Eden provides a nondeterministic function merge that merges a list of streams into a single stream.
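As a minimal usage sketch (not taken from the entry), the following complete program applies parMap to a small list. The module name Control.Parallel.Eden and the helper function expensive are assumptions chosen for illustration; the program would be compiled with the Eden-enabled GHC and started on several processor elements.

-- A hypothetical usage sketch of parMap, assuming the constructs described
-- above are exported by a module such as Control.Parallel.Eden.
import Control.Parallel.Eden (Trans, process, spawn)

-- one Eden process per list element, as defined in the text above
parMap :: (Trans a, Trans b) => (a -> b) -> [a] -> [b]
parMap f = spawn (repeat (process f))

-- a deliberately costly function standing in for real work
expensive :: Int -> Integer
expensive n = sum [1 .. 100000 + fromIntegral n]

main :: IO ()
main = print (parMap expensive [1 .. 8])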

merge :: Trans a ⇒ [[a]] → [a]

This function can, for example, be used by the master in a master–worker system to receive the results of the workers as a single stream in the time-dependent order in which they are provided.
Although merge is of great worth, because it is the key to specifying many reactive systems, one has to be aware that functional purity and its benefits are lost when merge is being used in a program. Fortunately, functional purity can be preserved in most portions of an Eden program. In particular, it is, for example, possible to use sorting in order to force a particular order of the results returned by a merge application and thus to encapsulate merge in a deterministic context.
Example: A simple master–worker system with the following functionality can easily be defined in Eden as shown in Fig. . The master process distributes tasks to worker processes, which solve the tasks and return the results to the master:

[Diagram: the master process (masterWorker np pf f) receives the task list and produces the results; it sends task streams via toWs to the worker processes 1 … np (each evaluating tag_map f) and receives their results via fromWs.]

The Eden function masterWorker (evaluated by the "master" process) takes four parameters: np specifies the number of worker processes that will be spawned, prefetch determines how many tasks will initially be sent by the master to each worker process, the function f describes how tasks have to be solved by the workers, and the final parameter tasks is the list of tasks that have to be solved by the whole system. The auxiliary pure Haskell function distribute :: Int -> [a] -> [Int] -> [[a]] (code not shown) is used to distribute the tasks to the workers. Its first parameter determines the number of output lists, which become the input streams for the worker processes. The third parameter is the request list reqs which guides the task distribution. The same list is used to sort the results according to the original task order (function orderBy :: [[b]] -> [Int] -> [b], code not shown).
Initially, the master sends as many tasks as specified by the parameter prefetch in a round-robin manner to the workers (see definition of initReqs). Further tasks are sent to workers which have delivered a result. The list newReqs is extracted from the worker results tagged with the corresponding worker id, which are merged according to the arrival of the worker results. This simple master–worker definition has the advantage that the tasks need not be numbered to reestablish the original task order on the results. Moreover, worker processes need not send explicit requests for new work together with the result values.

Defining Non-hierarchical Process Networks in Eden
With the Eden constructs introduced up to now, communication channels are only established between parent and child processes during process creation. This results in purely hierarchical process topologies. Eden further provides functions to create and use explicit channel connections between arbitrary processes. These functions are rather low level and it is not easy to use them appropriately. Therefore, we will not show them here, but explain two more abstract approaches to define non-hierarchical process networks in Eden: remote data and graph specifications.

Remote data of type a is represented by a handle of type RD a with the following interface functions []:
release :: a -> RD a
fetch   :: RD a -> a
The function release yields a remote data handle that can be passed to other processes, which will in turn use the function fetch to access the remote data. The data transmission occurs automatically from the process that releases the data to the process which uses the handle to fetch the remote data.
Example: We show a small example where the remote data concept is used to establish a direct channel connection between sibling processes. Given functions f and g, one can calculate (g ○ f) a in parallel, creating a process for each function.

masterWorker :: (Trans a, Trans b) ⇒ Int → Int → (a → b) → [a] → [b]
masterWorker np prefetch f tasks
  = orderBy fromWs reqs
  where
    fromWs   = spawn workers toWs
    workers  = [process (tag_map f) ∣ n ← [1..np]]
    toWs     = distribute np tasks reqs
    newReqs  = merge [[i ∣ r ← rs] ∣ (i, rs) ← zip [1..np] fromWs]
    reqs     = initReqs ++ newReqs
    initReqs = concat (replicate prefetch [1..np])

Eden. Fig.  Eden definition of a simple master–worker system

[Figure: (a) indirect connection: main instantiates the processes for f and g, and both inp and the result of f pass through main; (b) direct connection: using release . f and g . fetch, the result of f is sent directly to the g process.]
Eden. Fig.  A simple process graph

Simply replacing the function calls by process instantiations

(process g # (process f # inp))

leads to the process network in Fig. a, where the process evaluating the above expression is called main. Process main instantiates a first child process calculating g, thereby creating a concurrent thread to evaluate the input for the new child process. This thread instantiates the second child process for calculating f. It receives the remotely calculated result of the f process and passes it to the g process. The drawback of this approach is that the result of the f process will not be sent directly to the g process. This causes unnecessary communication costs.
In the second implementation, we use remote data to establish a direct channel connection between the child processes:

process (g ○ fetch) # (process (release ○ f) # inp)

It uses function release to produce a handle of type RD a for data of type a. Calling fetch with remote data returns the value released before. The use of remote data leads to a direct communication of the actual data between the processes evaluating f and g (see Fig. b). The remote data handle is treated like the original data in the first version, that is, it is passed via the main process from the process computing f to the one computing g. ⊲
The remote data concept enables the programmer to easily build complex topologies by combining simpler ones [].

Grace
(Graph-based communication in Eden) is a library that allows a programmer to specify a network of processes as a graph, where the graph nodes represent processes and the edges represent communication channels []. The graph is described as a Haskell data structure ProcessNetwork a, where a is the type of the result computed by the process network. A function build is used to define a process network, while a function start instantiates such a network and automatically sets up the corresponding process topology, that is, the processes are created and the necessary communication channels are installed.

build :: forall f a g r p n e.
         -- type context
         (Placeable f a g r p, Ord e, Eq n) ⇒
         (n, f)     →   -- main node
         [Node n]   →   -- node list
         [Edge n e] →   -- edge list
         ProcessNetwork r
start :: (Trans a) ⇒ ProcessNetwork a → a

The function build transforms a graph specification into a value of type ProcessNetwork r. The graph is defined by a list of nodes of type [Node n] and a list of edges of type [Edge n e]. Type variables n and e represent the types of node and edge labels, respectively. To allow functions with different types to be associated with graph nodes, the type f of node functions is existentially quantified and only explicitly given for the main node, because the function placed on the main node determines the result type r of the process network. This is also the reason why the main node is not a member of the node list but provided as a separate parameter. The function build uses multiparameter classes with functional dependencies and explicit quantification to achieve flexibility in the specification of graphs, for example, by placing arbitrary functions on nodes and nevertheless using standard Haskell lists to represent node lists. The type context Ord e and Eq n ensures that edges can be ordered by their label and that nodes can be identified by their label. Placeable is a multiparameter type class with dependent types, which is used by the Grace implementation to partition user-supplied function types into their parts, for example, parameter and result types. This is needed to create individual channels for these parts. Suitable instances will be derived automatically for every possible function. In the type context Placeable f a g r p the function's type f determines the other types: a is the type of the function's first argument, g is the remaining part of the function's type without the first argument. The final result type of the function after applying all parameters is r. Finally, p is a type-level list of all the parameters.
The main benefit of Grace is a clean separation between coordination and computation. The network specification encapsulates the coordination aspects. The graph nodes are annotated with functions describing the computations of the corresponding processes.
Example: We consider again a simple process network which is similar to the one in Fig. b and show its specification in Grace. The behavior of the three processes is specified by functions f, g, and root. The communication topology is defined by a graph consisting of three nodes (a root node and two additional nodes) and three edges. The specification is shown in Fig. . ⊲

Algorithmic Skeletons
Algorithmic skeletons [] capture common patterns of parallel evaluations like task farms, pipelines, divide-and-conquer schemes, etc. The application programmer only needs to instantiate a skeleton appropriately, thereby concentrating on the problem-specific matters and trusting the skeleton with respect to all parallel details.
The small introductory example functions parMap and masterWorker shown above are examples of skeleton definitions in Eden. In Eden and other parallel functional languages, skeletons are no more than polymorphic higher-order functions which can be applied

example = start network
  where
    -- fct. spec. of process behaviour
    f, g, root :: (Trans a) => [a] -> [a]
    ...
    -- process graph specification
    network = build ("root", root) nodes edges
    nodes   = [N "nd_f" f, N "nd_g" g]
    edges   = [E "nd_f" "nd_g" 0 nothing,
               E "nd_g" "root" 0 nothing,
               E "root" "nd_f" 0 nothing]

[Diagram: the corresponding process graph with the nodes root, f, and g connected in a cycle.]

Eden. Fig.  Grace specification of a simple process graph



with different types and parameters. Thus, programming with skeletons follows the same principle as programming with higher-order functions. An important issue is that skeletons can be both used and implemented in Eden. In other skeletal approaches, the creation of new skeletons is considered as a system programming task, or even as a compiler construction task. Skeletons are implemented by using imperative languages and/or parallel libraries. Therefore, these systems offer a closed collection of skeletons which the application programmer can use, but without the possibility of creating new ones, so that adding a new skeleton usually implies a considerable effort.
The Eden skeleton library contains a big collection of different skeletons. Various kinds of skeleton definitions in Eden together with cost models to predict the execution time of skeleton instantiations have been presented in a book chapter []. Parallel map implementations have been analyzed in []. An Eden implementation of the large-scale map-and-reduce programming model proposed by Google [] has been investigated in []. Hierarchical master–worker schemes with several layers of masters and submasters can elegantly be created by nesting a simple single-layer master–worker scheme. More elaborate hierarchical master–worker systems have been discussed in [].
Example: (A regular fixed-branching divide and conquer skeleton) As a nontrivial example of a skeleton definition we present a dynamically unfolding divide-and-conquer skeleton []. The essence of a divide-and-conquer algorithm is to decide whether the input is trivial and, in this case, to solve it, or else to decompose nontrivial input into a number of subproblems, which are solved recursively, and to combine the output. A general skeleton takes parameter functions for this functionality, as shown here:

type DivideConquer a b
  = (a → Bool) →  -- trivial?
    (a → b)    →  -- solve
    (a → [a])  →  -- split
    ([b] → b)  →  -- combine
    a → b          -- input / result

The resulting structure is a tree of task nodes where successor nodes are the sub-problems, the leaves representing trivial tasks. A fundamental Eden skeleton which specifies a general divide-and-conquer algorithm structure can be found in []. Here, we show a version where every nontrivial task is split into a fixed number of sub-tasks. Moreover, the process tree is created in a distributed fashion: one of the tree branches is processed locally, the others are instantiated as new processes, as long as PEs are available. These branches will recursively produce new parallel subtasks, resulting in a distributed expansion of the computation. In the following binary tree of task nodes, the boxes indicate which task nodes will be evaluated by the eight processes:

[Figure: a binary task tree in which boxes group the task nodes assigned to each of the eight processes.]

In this setting, explicit placement of processes is essential to ensure that processes are not placed on the same processor element while leaving others unused. In [], this kind of skeleton is compared with a flat expansion version, where the main process unfolds the tree up to a given depth, usually with more branches than available processor elements (PEs). The resulting subtrees can then be evaluated by a farm of parallel worker processes, and the main process combines the results of the subprocesses. A master–worker system (as shown above) can be used to implement this version.
Figure  shows the distributed expansion divide-and-conquer skeleton for k-ary task trees. Besides the standard parameter functions, the skeleton takes the branching degree and a ticket list with PE numbers to place newly created processes. The left-most branch of the task tree is solved locally; other branches are instantiated using the Eden function spawnAt, which instantiates a collection of processes (given as a list), with respective input, on explicitly specified PEs. Results are combined by the combine function.
Explicit Placement via Tickets: The ticket list is used to control the placement of newly created processes. First, the PE numbers for placing the immediate child

dcN :: (Trans a, Trans b) ⇒


Int → [Int] → -- branch degree/tickets
DivideConquer a b
dcN k tickets trivial solve split combine x
|null tickets = seqDC x
|trivial x = solve x
|otherwise = ... -- code for demand control omitted
combine (myRes:childRes ++ localRess)
where
-- sequential computation
seqDC x = if trivial x then solve x
else combine (map seqDC (split x))
-- child process generation
childRes = spawnAt childTickets childProcs procIns
childProcs = map (process ◦ rec_dcN) theirTs
rec_dcN ts = dcN k ts trivial solve split combine
-- ticket distribution
(childTickets, restTickets) = splitAt (k-1) tickets
(myTs: theirTs) = unshuffle k restTickets
-- input splitting
(myIn:theirIn) = split x
(procIns, localIns) = splitAt (length childTickets) theirIn
-- local computations
myRes = ticketF myTs myIn
localRess = map seqDC localIns

Eden. Fig.  Distributed expansion divide and conquer skeleton for k-ary task trees

processes are taken from the ticket list. Then, the remaining tickets are distributed to the children in a round-robin manner using the function unshuffle :: Int -> [a] -> [[a]], which unshuffles a given list into as many lists as the first parameter tells. Child computations will be performed locally when no more tickets are available. The explicit process placement via ticket lists is a simple and flexible way to control the distribution of processes as well as the recursive unfolding of the task tree. If too few tickets are available, computations are performed locally. Duplicate tickets can be used to allocate several child processes on the same PE. ⊲

Implementation
Eden's implementation follows a microkernel approach. The kernel implements a few general control mechanisms and provides them as primitive operations (see Fig. ). The more complex high-level language constructs are implemented in libraries using the primitive base constructs. Apart from simplifying the implementation, this approach has important advantages with respect to productivity and maintainability of the system.
Details on Eden's implementation, especially on the primitives provided by the interface of the parallel runtime system (PRTS) and the implementation of the Eden module, can be found in [, ].

EdenTV – The Eden Trace Viewer Tool
The Eden trace viewer tool (EdenTV) [] provides a postmortem analysis of program executions on the level of the computational units of the PRTS. The latter is instrumented with special trace generation commands activated by a runtime option. In the space-time diagrams generated by EdenTV, machines (i.e., processor elements), processes, and threads are represented by horizontal bars, with time on the x-axis. The diagram bars have segments in different colors, which indicate the activities of the respective logical unit in a period during the execution. Bars are grey when the logical unit is running, light grey when it is runnable but currently not running, and dark grey when the unit is blocked. Messages between processes or machines are optionally shown by gray arrows which start from the sending unit bar and point at the receiving unit bar. The representation of messages is very important for programmers, since they can observe

[Figure: implementation layers, from top to bottom: the Eden program (language level); the Eden module and libraries together with the Haskell system libraries; the Eden primitives; the parallel RTS on top of the sequential RTS (kernel level); and a suitable middleware.]
Eden. Fig.  Implementation layers


[Figure: EdenTV space-time diagram. Zoom of the final communication phase of the hyperquicksort algorithm when sorting 5M elements on 17 machines of a Beowulf cluster; messages are shown as black arrows.]
Eden. Fig.  Trace of hypercube communication topology of dimension 

hot spots and inefficiencies in the communication during the execution as well as control communication topologies.
Figure  shows in grayscale (light grey = yellow, grey = green, dark grey = red) the trace of a recursive parallel quicksort in a hypercube of dimension  with message traffic, exposing the well-known butterfly communication pattern imposed by the hypercube structure.

Future Directions
Lines of future study are the following:
● It is planned for the future to maintain a common parallel runtime environment for Eden, Glasgow parallel Haskell (GpH), and other parallel Haskells. Such a common environment is highly desirable and would improve the availability and acceptance of parallel Haskells.
● Recent results [] show that the Eden system, although tailored for distributed memory machines, behaves equally well on workstation clusters and on multi-core machines. It needs to be investigated whether the Eden system can be optimized for multi-core machines by exploiting the shared memory and thereby avoiding unnecessary communications.
● We plan to design and investigate further skeletons. The flexibility of defining individual skeletons for several algorithm classes has not been completely exploited yet. Much more experience in the design and use of skeletons is necessary.
● Recent studies on remote data and the Grace system have shown that high-level parallel programming notations simplify parallel programming a lot. More

research is necessary to find even more abstract parallel programming constructs.
● Supporting tools like profiling, analysis, visualization, and debugging tools are essential for real-world parallel program development. Much more work is needed to further develop existing, and to design new, powerful tools which help the programmers to analyze and tune parallel programs.

Related Entries
Concurrent ML
Functional Languages
Glasgow Parallel Haskell (GpH)
Multilisp
NESL
Parallel Skeletons
Sisal

Glasgow parallel Haskell (GpH) is another parallel Haskell extension that is less explicit about parallelism than Eden. It provides a simple parallel combinator par to annotate sub-expressions that should be evaluated in parallel if free processing elements are available. In contrast to Eden, GpH assumes a global shared memory (at least virtually), and the parallel runtime system decides whether an annotated sub-expression will be evaluated in parallel or not.
NESL is a strict data-parallel functional language that supports nested parallel array comprehensions. Parallelism is hidden and concentrated in data-parallel operations.
SISAL (Streams and Iteration in a Single Assignment Language) and MultiLisp are early strict parallel functional languages. SISAL is a first-order functional language with implicit parallelism, a rich set of array operations, and stream support. It has been designed for numerical applications. MultiLisp extends the LISP dialect Scheme with explicit parallel constructs, in particular so-called futures. Futures initiate parallel evaluations without forcing the main computation to wait for the result. In this respect, futures lead to a special form of laziness within a strict computation language.
Concurrent ML extends the strict functional language SML (Standard ML) with concurrency primitives like thread creation and synchronous message passing over explicit channels. Blocking send and receive functions are provided for channel communication. First-class synchronous operations, called events, however, allow one to hide channels and complex communication and synchronization protocols behind appropriate abstractions. In contrast to Eden and other parallel functional languages, Concurrent ML has been designed for concurrent programming, that is, with a focus on structuring software and not with the goal of speeding up computations. Recently, however, a shared-memory parallel implementation has been developed [].
Manticore is a strict parallel functional language that supports different kinds of parallelism. In particular, it combines explicit concurrency in the style of Concurrent ML with various implicitly parallel constructs which provide fine-grain data parallelism.

Bibliographic Notes and Further Reading
The seminal book on research directions in parallel functional programming [] edited by Hammond and Michaelson covers not only fundamental issues but also provides summaries about selected research areas. Various parallel and distributed variants of the functional programming language Haskell are discussed in a journal paper by Trinder et al. []. A comprehensive overview on patterns and skeletons for parallel and distributed programming can be found in a book edited by Gorlatch and Rabhi []. The three parallel functional languages, Glasgow parallel Haskell, PMLS (a parallel version of ML), and Eden, have been compared with respect to programming methodology and performance in [].
Comprehensive and up-to-date information on Eden is provided on its web site
http://www.mathematik.uni-marburg.de/∼eden
Basic information on its design, semantics, and implementation as well as the underlying programming methodology can be found in []. Details on the parallel runtime system and Eden's concept of implementation can best be found in [, ]. The technique of layered parallel runtime environments has been further developed and generalized by Berthold et al. [, ]. The Eden trace viewer tool EdenTV is available on Eden's web site. A short introductory description is given in []. Another tool for analyzing the behavior of Eden programs has been developed by de la Encina et al. [, ]

by extending the tool Hood (Haskell Object Observation Debugger) for Eden. Extensive work has been done on skeletal programming in Eden. An overview of various skeleton types has been presented as a chapter in the mentioned book by Gorlatch and Rabhi []. Definitions and applications of specific skeletons can, for example, be found in the following papers: parallel map [], topology skeletons [], adaptive skeletons [], Google map-reduce [], hierarchical master–worker systems [], divide-and-conquer schemes []. Special skeletons for computer algebra algorithms are developed with the goal of defining the kernel of a computer algebra system in Eden []. An operational and a denotational semantics for Eden have been defined by Ortega-Mallén and Hidalgo-Herrero [, ]. These semantics have been used to analyze Eden skeletons [, ]. A non-determinism analysis has been presented by Segura and Peña [].
Eden is intensively being used in teaching parallel programming and algorithms and as a platform for investigating high-level parallel programming concepts, techniques, methodology, and parallel runtime systems. Up to now, its main use in practice has been the implementation of parallel computer algebra components [, ].

Acknowledgments
The author is grateful to Yolanda Ortega-Mallén, Jost Berthold, Mischa Dieterle, Thomas Horstmeyer, and Oleg Lobachev for their helpful comments on previous versions of this essay.

Bibliography
. Berthold J () Towards a generalised runtime environment for parallel Haskells. Computational Science – ICCS', LNCS . Springer (Workshop on practical aspects of high-level parallel programming – PAPP )
. Berthold J, Dieterle M, Lobachev O, Loogen R () Distributed memory programming on many-cores – a case study using Eden divide-&-conquer skeletons. ARCS , Workshop on Many-Cores. VDE Verlag
. Berthold J, Dieterle M, Loogen R () Implementing parallel Google map-reduce in Eden. Europar', LNCS , Springer, pp –
. Berthold J, Dieterle M, Loogen R, Priebe S () Hierarchical master–worker skeletons. Practical Aspects of Declarative Languages (PADL ), LNCS , Springer, pp –
. Berthold J, Klusik U, Loogen R, Priebe S, Weskamp N () High-level process control in Eden. EuroPar  – Parallel Processing, LNCS , Springer, pp –
. Berthold J, Loogen R () Skeletons for recursively unfolding process topologies. Parallel Computing: Current & Future Issues of High-End Computing, ParCo , NIC Series, vol , pp –
. Berthold J, Loogen R () Parallel coordination made explicit in a functional setting. Implementation and Application of Functional Languages (IFL ), Selected Papers, LNCS , Springer, pp – (awarded best paper of IFL')
. Berthold J, Loogen R () Visualizing parallel functional program runs – case studies with the Eden trace viewer. Parallel Computing: Architectures, Algorithms and Applications, ParCo , NIC Series, vol , pp –
. Berthold J, Marlow S, Zain AA, Hammond K () Comparing and optimising parallel Haskell implementations on multicore. rd International Workshop on Advanced Distributed and Parallel Network Applications (ADPNA-). IEEE Computer Society
. Berthold J, Zain AA, Loidl H-W () Scheduling light-weight parallelism in ArtCoP. Practical Aspects of Declarative Languages (PADL ), LNCS , Springer, pp –
. Cole M () Algorithmic skeletons: structured management of parallel computation. MIT Press, Cambridge
. Dean J, Ghemawat S () MapReduce: simplified data processing on large clusters. Commun ACM ():–
. Dieterle M, Horstmeyer T, Loogen R () Skeleton composition using remote data. Practical Aspects of Declarative Programming (PADL ), LNCS , Springer, pp –
. Encina A, Llana L, Rubio F, Hidalgo-Herrero M () Observing intermediate structures in a parallel lazy functional language. Principles and Practice of Declarative Programming (PPDP ), ACM, pp –
. Encina A, Rodríguez I, Rubio F () pHood: a tool to analyze parallel functional programs. Implementation of Functional Languages (IFL'), Seton Hall University, New York, pp –. Technical Report, SHU-TR-CS---
. Hammond K, Berthold J, Loogen R () Automatic skeletons in template Haskell. Parallel Process Lett ():–
. Hammond K, Michaelson G (eds) () Research directions in parallel functional programming. Springer, Heidelberg
. Hidalgo-Herrero M, Ortega-Mallén Y () An operational semantics for the parallel language Eden. Parallel Process Lett ():–
. Hidalgo-Herrero M, Ortega-Mallén Y () Continuation semantics for parallel Haskell dialects. Asian Symposium on Programming Languages and Systems (APLAS ), LNCS , Springer, pp –
. Hidalgo-Herrero M, Ortega-Mallén Y, Rubio F () Analyzing the influence of mixed evaluation on the performance of Eden skeletons. Parallel Comput (–):–
. Hidalgo-Herrero M, Ortega-Mallén Y, Rubio F () Comparing alternative evaluation strategies for stream-based parallel functional languages. Implementation and Application of Functional
 E Eigenvalue and Singular-Value Problems

Languages (IFL ), Selected Papers, LNCS , Springer, either eigenvalues: these are the roots λ of the nth-
pp – degree characteristic polynomial:
. Horstmeyer T, Loogen R () Graph-based communication in
Eden. In: Trends in Functional Programming, vol . Intellect det(A − λI) = . ()
. Klusik U, Loogen R, Priebe S, Rubio F () Implementation
skeletons in Eden – low-effort parallel programming. Implemen-
tation of Functional Languages (IFL ), Selected Papers, LNCS
The set of all eigenvalues is called the spectrum of
, Springer, pp – A: Λ(A) = {λ  , . . . , λn }. Eigenvalues can be real or
. Lobachev O, Loogen R () Towards an implementation of a complex, whether the matrix is real or complex.
computer algebra system in a functional language. AISC/Calcule- or eigenpairs: in addition to an eigenvalue λ, one seeks
mus/MKM , LNAI , pp – for the corresponding eigenvector x ∈ Cn − {}
. Loidl H-W, Rubio Diez F, Scaife N, Hammond K, Klusik U,
which is the solution of the singular system
Loogen R, Michaelson G, Horiguchi S, Pena Mari R, Priebe S,
Portillo AR, Trinder P () Comparing parallel functional lan-
guages: programming and performance. Higher-Order Symbolic (A − λI)x = . ()
Computation ():–
When the matrix A is real symmetric or complex her-
. Loogen R, Ortega-Mallén Y, Peña R, Priebe S, Rubio F ()
Parallelism abstractions in Eden. In [], chapter , Springer,
mitian, the eigenvalues are real.
pp – Given a matrix A ∈ Rm×n (or A ∈ Cm×n ), the
. Loogen R, Ortega-Mallén Y, Peña-Marí R () Parallel func- singular-value decomposition (SVD) of A consists of
tional programming in Eden. J Funct Prog ():– computing orthogonal matrices U ∈ Rm×m and V ∈
. Peña R, Segura C () Non-determinism analysis in a parallel-
Rn×n (or unitary matrices U ∈ Cm×m and V ∈ Cn×n )
functional language. Implementation of Functional Languages
(IFL ), LNCS , Springer
such that the m × n matrix
. Rabhi FA, Gorlatch S (eds) Patterns and skeletons for parallel and
distributed computing. Springer, Heidelberg Σ̃ = U T AV ()
. Reppy J, Russo CV, Yiao Y () Parallel concurrent ML. Inter-
national Conference on Functional Programming (ICFP) , (or Σ̃ = U H AV) has zero entries except for its diago-
ACM nal leading p × p real submatrix Σ = diag(σ , . . . , σp )
. The GHC Developer Team. The Glasgow Haskell Compiler. with σ ≥ . . . ≥ σp ≥ . The first p columns of
http://www.haskell.org/ghc
U = [u , . . . , um ], and of V = [v , . . . , vn ], are called the
. Trinder P, Hammond K, Loidl H-W, Peyton Jones S () Algo-
rithm + strategy = parallelism. J Funct Prog ():– left and right singular vectors, respectively. The singular
. Trinder PW, Loidl HW, Pointon RF () Parallel and distributed triplets are (σi , ui , vi ) for i = , . . . , p.
Haskells. J Funct Prog ( & ):–
. Zain AA, Berthold J, Hammond K, Trinder PW () Orches-
trating production computer algebra components into portable Discussion
parallel programs. Open Source Grid and Cluster Conference
Mathematical Fundamentals
(OSGCC)
Eigenproblems. Equation  has n complex roots, some
of them possibly multiple. For every eigenvalue λ ∈ C,
Eigenvalue and Singular-Value the dimension of the eigenspace Xλ = ker(A − λI) is
Problems at least . The algebraic multiplicity of λ (its multiplic-
ity as root of ( )) is larger or equal to its geometric
Bernard Philippe , Ahmed Sameh multiplicities (the dimension of Xλ ). When all the geo-

Campus de Beaulieu, Rennes, France metric multiplicities are equal to the algebraic ones, a

Purdue University, West Lafayette, IN, USA basis X = [x , . . . , xn ] ∈ Cn×n of eigenvectors (say right
eigenvectors) can be built by concatenating bases of the
invariant subspaces X λ where λ ∈ Λ(A). In such a situ-
Definition ation, the matrix A is said to be diagonalizable since by
Given a matrix A ∈ Rn×n or A ∈ Cn×n , an eigen- expressing the same operator in the basis X,
value problem consists of computing eigen-elements of
A which are: D = X − AX = diag(λ , . . . , λn ). ()
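As a concrete illustration of these definitions (an added worked example, not part of the original entry), consider a small symmetric matrix whose eigenpairs can be computed by hand:

% Worked 2x2 example (illustrative, not from the entry).
% The eigenvalues are the roots of det(A - lambda I) = 0, and the
% orthonormal eigenvector basis X diagonalizes A.
\[
A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}, \qquad
\det(A - \lambda I) = (2-\lambda)^2 - 1 = (\lambda - 1)(\lambda - 3),
\]
\[
\Lambda(A) = \{1, 3\}, \qquad
X = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix}, \qquad
X^{-1} A X = \operatorname{diag}(1, 3).
\]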

By denoting X^−H = Y = [y1, . . . , yn], Eq.  becomes D = Y^H AX with Y^H X = I. For i = 1, . . . , n, the vector yi is the left eigenvector corresponding to the eigenvalue λi.
The well-known Jordan normal form generalizes the diagonalization when Eq.  does not hold. However, the problem of determining such a decomposition is ill-posed. Transforming it into a well-posed problem can be done by proving the existence of matrices with an assigned block structure in a neighborhood of the matrix. For more details see Kagström and Ruhe's contribution in [].
A decomposition which is numerically computable is called the Schur decomposition, in which a unitary matrix U exists such that the complex matrix

T = U^H AU = ⎛ λ1  τ12  ⋯  τ1n    ⎞
             ⎜     ⋱    ⋱   ⋮     ⎟
             ⎜          ⋱  τn−1,n ⎟
             ⎝              λn    ⎠ ,  ()

is upper triangular. Eigenvectors can then be computed from T. When the matrix A is Hermitian – or real symmetric – so is the matrix T, which becomes diagonal and real. The symmetric eigenvalue problem deserves a lot of attention since it arises frequently in practice.
When the matrix A is real unsymmetric, pairs of complex conjugate eigenvalues may exist. In order to avoid complex arithmetic in that case, the real Schur decomposition extends the Schur decomposition by allowing 2 × 2 diagonal blocks with pairs of conjugate eigenvalues.

SVD problems. Since the complex case is similar to the real one and since the SVD of A^T is obviously obtained from the SVD of A, the study is restricted to the situation A ∈ Rm×n with m ≤ n. Let A have the singular-value decomposition (); then the symmetric matrix

B = A^T A ∈ Rn×n,  ()

has eigenvalues σ1^2 ≥ . . . ≥ σn^2 ≥ 0, with corresponding eigenvectors (vi), (i = 1, . . . , n). B is called the normal matrix, while the augmented symmetric matrix

Aaug = ⎛ 0    A ⎞
       ⎝ A^T  0 ⎠  ()

has eigenvalues ±σ1, . . . , ±σn, corresponding to the eigenvectors

(1/√2) ⎛ ui  ⎞ ,  i = 1, . . . , n.
       ⎝ ±vi ⎠

Every method for computing singular values of A is based on computing the eigenpairs of B or Aaug.

The Symmetric Eigenvalue Problem
Three classes of methods are available for solving problems () and () when A is symmetric:
Jacobi iterations: The oldest method, introduced by Jacobi in  [], consists of annihilating successively off-diagonal entries of the matrix via orthogonal similarity transformations: A0 = A and Ak+1 = Rk^T Ak Rk, where Rk is a plane rotation. The scheme is organized into sweeps of n(n−1)/2 rotations to annihilate every off-diagonal pair of symmetric entries once. One sweep involves  n^3 + O(n^2) operations when symmetry is exploited in the computation. The method was abandoned due to its high computational cost but has been revived with the advent of parallelism. However, depending on the parallel architecture, it remains more expensive than other solvers.
QR iterations: The QR algorithm relies on the following iteration:

A0 = A and Q0 = I
For k ≥ 0,
  (Qk+1, Rk+1) = qr(Ak)  (QR factorization)
  Ak+1 = Rk+1 Qk+1,  ()

where the QR factorization is defined in [Linear Least Squares and Orthogonalization]. Under weak assumptions, the matrix Ak approaches a diagonal matrix for sufficiently large k. A direct implementation of () would involve O(n^3) arithmetic operations at each iteration. If,
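For reference, the standard choice of the plane rotation in a single Jacobi step is the following textbook formula (added here for clarity, not quoted from the entry):

% Standard 2x2 Jacobi rotation (illustrative addition).
% R_k acts on rows and columns i and j of the symmetric matrix A_k and is
% chosen so that the (i,j) entry of A_{k+1} = R_k^T A_k R_k becomes zero.
\[
R_k(i,j) =
\begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}
\quad \text{(embedded in the identity)}, \qquad
\tan 2\theta = \frac{2\, a_{ij}}{a_{ii} - a_{jj}} .
\]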

however, one obtains the tridiagonal matrix T = Q^T AQ via orthogonal similarity transformations at a cost of O(n^3) arithmetic operations, the number of arithmetic operations of a QR iteration is reduced to O(n^2), or even O(n) if only the eigenvalues are needed. To accelerate convergence, a shifted version of the QR algorithm is given by

T0 = T and Q0 = I
For k ≥ 0,
  (Qk+1, Rk+1) = qr(Tk − μk I)  (QR factorization)
  Tk+1 = Rk+1 Qk+1 + μk I,  ()

where μk is an approximation of the smallest eigenvalue of Tk.
Sturm sequence evaluation: The Sturm sequence in λ ∈ R is defined by p0(λ) = 1 and, for k = 1, . . . , n, by pk(λ) = det(Ak − λ Ik), where Ak and Ik are the principal matrices of order k of A and of the identity matrix. The number N(λ) of sign changes in the sequence is equal to the number of eigenvalues of A which are smaller than λ. N(λ) can also be computed as the number of negative diagonal entries of the matrix U in the LU factorization of the matrix A − λI = LU. As for the QR iteration, the algorithm is applied to the tridiagonal matrix T, so that each Sturm sequence computation has complexity O(n). Sturm sequences allow one to partition a given interval [a, b] as desired to obtain each of the eigenvalues in [a, b] to a desired accuracy. When needed, the eigenvectors are then obtained by inverse iterations.
Many references exist which describe the theory underlying the three methods outlined above, e.g., see [].

Parallel Jacobi algorithms. A parallel version of the cyclic Jacobi algorithm was given by Sameh []. It is obtained by the simultaneous annihilation of several off-diagonal elements by a given orthogonal matrix Uk, rather than only one rotation as is done in the serial version. For example, let A be of order  (see Table ) and consider the orthogonal matrix Uk as the direct sum of four independent plane rotations simultaneously determined. An example of such a matrix is

Uk = Rk(, ) ⊕ Rk(, ) ⊕ Rk(, ) ⊕ Rk(, ),

where Rk(i, j) is the rotation which annihilates the (i, j) off-diagonal element (⊕ indicates that the rotations are assembled in a single matrix and extended to order n by the identity). Let one sweep be the collection of such orthogonal similarity transformations that annihilate the element in each of the n(n − 1)/2 off-diagonal positions (above the main diagonal) only once; then, for a matrix of order , the first sweep will consist of seven successive orthogonal transformations, each one annihilating distinct groups of at most four elements simultaneously, as described in Table . For the remaining sweeps, the structure of each subsequent transformation Uk, k > , is chosen to be the same as that of Rj, where j =  + (k − ) mod . In general, the most efficient annihilation scheme consists of (2r − 1) similarity transformations per sweep, where r = ⌊(n + 1)/2⌋, in which each transformation annihilates ⌊n/2⌋ different off-diagonal elements (see []). Several other annihilation schemes are possible which are based on round-robin techniques. Luk and Park [] have demonstrated that various parallel Jacobi rotation ordering schemes are equivalent to the sequential row-ordering scheme, and hence share the same convergence properties.
The above algorithms are well suited for shared memory computers. While they can also be implemented on distributed memory systems, their efficiency on such systems may suffer due to communication costs. In order to increase the granularity of the computation (i.e., to increase the number of floating point operations between two communications), block algorithms are considered. A parallel Block-Jacobi algorithm is given by Gimenez et al. in []. This algorithm takes advantage of the symmetry of the matrix.

Eigenvalue and Singular-Value Problems. Table  Annihilation scheme as in [] (first regime). [The table marks the diagonal entries of the  ×  matrix with x and labels each off-diagonal position (i, j), i < j, with the index of the transformation of the first sweep that annihilates it; the numeric labels were lost in extraction.]

Tridiagonalization of a symmetric matrix by Householder reductions. The goal is to compute a symmetric tridiagonal matrix T, orthogonally similar to A: T = Q^T AQ, where Q = H1⋯Hn−2 combines Householder reductions. These transformations are defined in [Linear Least Squares and Orthogonalization]. The transformation H1 is chosen such that n − 2 zeros are introduced in the first column of H1 A, the first row being unchanged:

H1 A = ⎛ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⎞
       ⎜ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
       ⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
       ⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
       ⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
       ⎝ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎠ .

Therefore, by symmetry, applying H1 on the right results in the following pattern:

A1 = H1 AH1 = ⎛ ⋆ ⋆ 0 0 0 0 ⎞
              ⎜ ⋆ ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
              ⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
              ⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
              ⎜ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎟
              ⎝ 0 ⋆ ⋆ ⋆ ⋆ ⋆ ⎠  ()

In order to take advantage of the symmetry, the reduction A1 = H1 AH1 can be implemented by a symmetric rank-two update, which is a routine of the Level 2 BLAS []. The tridiagonal matrix T is obtained by repeating successively similar reductions on columns 2 to n − 2. The total procedure involves O(n^3) arithmetic operations, and assembling the matrix Q = H1 . . . Hn−2 requires O(n^3) additional operations.
As indicated in [Linear Least Squares and Orthogonalization], successive applications of Householder reductions can be compacted into blocks which allow the use of routines of the Level 3 BLAS []. Such a reduction is implemented in routine DSYTRD of LAPACK [], which takes advantage of the symmetry of the process. A parallel version of the algorithm is implemented in routine PDSYTRD of ScaLAPACK [].

Parallel QR for the symmetric tridiagonal eigenvalue problem. The QR scheme is the method of choice when all the eigenvalues and the eigenvectors are sought. A Divide-and-Conquer algorithm was introduced by Dongarra and Sorensen in []. It consists of partitioning the tridiagonal matrix by rank-one tearings.
A tridiagonal matrix T of order n can be partitioned as follows:

T = ⎛ T1  0  ⎞ + ρ f f^T,
    ⎝ 0   T2 ⎠

where T1 and T2 are two tridiagonal matrices of order n1 and n2 such that n = n1 + n2, and where f is a vector with zero entries except in positions n1 and n1 + 1. If the Schur decompositions T1 = Q1 D1 Q1^T and T2 = Q2 D2 Q2^T are already known, the problem is reduced to computing the eigenvalues of D̃ = Q^T TQ = D + ρ z z^T, where

Q = ⎛ Q1  0  ⎞ ,  D = ⎛ D1  0  ⎞ ,  and z = Q^T f.
    ⎝ 0   Q2 ⎠        ⎝ 0   D2 ⎠

The eigenvalues of T are interleaved with the diagonal entries of D and satisfy a simple secular equation obtained from the characteristic polynomial of D̃. The eigenvectors are then explicitly known, but they must be expressed back in the original basis by pre-multiplying them by Q. The back transformation may even include the initial orthogonal transformation which reduces the original symmetric matrix to the tridiagonal form.
By p recursive tearings, the problem is reduced to 2^p independent eigenvalue problems of order m = n/2^p. Then, the technique described above is recursively applied in parallel to update the eigendecompositions. The total number of arithmetic operations is O(n^3).

Parallel Sturm sequences. Sturm sequences are often used when only part of the spectrum, in an interval [a, b], is sought. Several authors discussed the parallel computation of Sturm sequences for a tridiagonal matrix T, among them Lo et al. in []. It is hard to parallelize the LU factorization of (T − λI), which provides the number of sign changes in the Sturm sequences, because the diagonal entries of U are obtained by a nonlinear recurrence. One
 E Eigenvalue and Singular-Value Problems

possible approach is to consider the linear three-term Also, successive applications of Householder reduc-
recursion which computes the Sturm sequence terms tions can be compacted into blocks which allow use of
pk (λ) = det(Tk − λ Ik ), for k = , . . . , n. Since Level  BLAS []. The algorithm is implemented via
this computation corresponds to solving a triangular routine DGEHRD of LAPACK []. A parallel version of
banded system of three diagonals, it can be paral- the algorithm is implemented via routine PDGEHRD of
lelized, as for instance, by the algorithm described by ScaLAPACK [].
Chen et al. in []. An analysis of the situation is given
in []. Because the sequence (pk (λ)),n could suffer
from overflow or underflow, Zhang proposed the com- Parallel solution of the Hessenberg eigenproblem. Sev-
putation of a scaled sequence in [] and provided the eral attempts for generalizing the Divide-and-Conquer
backward error of the sequence. approach used in the symmetric case have been con-
To parallelize the computation, a more beneficial sidered, including Dongarra and Sidani in [], Adams
approach is to consider simultaneous computation of and Arbenz in []. In these algorithms, the Hessenberg
Sturm sequences by replacing bisection of intervals by matrix is partitioned into a sum of a two-block upper-
multisection. However, multisections are efficient only triangular matrix and a rank-one update. Compared
when most of the created subintervals contain eigen- to the symmetric case, there are two major drawbacks
values. Therefore, in [] a strategy in two steps is that limit the benefit of the tearing procedure: () the
proposed: () isolating all the eigenvalues with disjoint condition number of the eigenproblems defined by the
intervals, () extracting each eigenvalue from its inter- two diagonal blocks of the first matrix can be much
val. Multisections are used for step () and bisections or higher than that of the original eigenproblem as shown
other root finders are used for step (). This approach by Jessup in []. () the Newton iterations involved in
proved to be very efficient. When the eigenvectors are the updating process include solving triangular or Hes-
needed, they are computed independently by Inverse senberg linear systems rendering the procedure rather
Iterations. A difficulty could arise if one wishes to com- more expensive than without tearing.
pute all the eigenvectors corresponding to a cluster of Henry et al. discussed in [] existing approaches
very poorly separated eigenvalues. for realizing a parallel QR procedure and ended up
Demmel et al. discussed in [] the reliability of the considering an improved implicit multishifted strat-
Sturm sequence computation in floating point arith- egy which allows the use of Level  BLAS []. The
metic where the sequence is no longer monotonic. algorithm is implemented in subroutine PDLAHQR
Therefore, in very rare situations, it is possible to obtain of ScaLAPACK [].
a wrong answer with regular bisection. A robust algo-
rithm, called MRRR developed by Dhillon et al. [], The Singular-Value Problem
is implemented via LAPACK with routine DSTEMR for As mentioned above, methods for solving the symmet-
the computation of high-quality eigenvectors. Several ric eigenvalue problems can be reconsidered as methods
approaches are discussed in [] for possible parallel for solving SVD problems. In this section, the study is
implementations of the scheme in DSTEMR. restricted to the case where A ∈ Rm×n with m ≥ n.

The Unsymmetric Eigenvalue Problem Parallel Jacobi algorithms for SVD. In the special case
Reduction to the Hessenberg form. Similar to the sym- where m ≫ n, one should first perform an initial
metric case, the reduced form is obtained by House- “skinny” QR factorization of A = QR where Q ∈ Rm×n
holder reductions. But here, lack of symmetry implies is orthogonal and R ∈ Rn×n is triangular. Next, the SVD
that, unlike the picture in (), zeros are not introduced of R = ŨΣV T provides the SVD of A = UΣV T with
in the upper part of the matrix. Therefore, the pro- U = QŨ. Computing the QR factorization in parallel
cess results in an upper Hessenberg matrix. The total has been outlined in [linear least squares and
procedure involves n / + O(n ) arithmetic opera- orthogonalization 12]. The complete process is
tions. Assembling the matrix Q = H . . . Hn− requires obtained at the cost of mn + O(mn + n ) arithmetic
n / + O(n ) additional operations. operations.
Eigenvalue and Singular-Value Problems E 

One-sided and two-sided Jacobi algorithms. The first point operations between two communications), block
one-sided algorithm was introduced by Hestenes in algorithms are considered. For one-sided algorithms,
[]. Applying a rotation R to both sides of B = AT A for each processor is allocated a block of columns instead
annihilating the entry βij is equivalent to post-multiply of a single column. The computation remains the same
A by R for making the columns i and j of AR orthog- as discussed above with the ordering of the rotations
onal. Therefore, any parallel Jacobi algorithm for solv- within a sweep is as given in [].
ing a symmetric eigenvalue problem can be considered For the two-sided version, the allocation manipu-
as a parallel one-sided Jacobi algorithm for solving a lates -D blocks instead of single entries of the matrix.
SVD problem. Each sweep is a sequence of orthogo- Bečka and Vajtešic propose in [] a modification of the E
nal transformations Vk made of a set of independent basic algorithm in which one annihilates, in each step,
rotations that are successively applied on the right side two off-diagonal blocks by performing a full SVD on a
by Ak+ = Ak ṼkT (A = A). When the right singu- small-sized matrix. While reasonable efficiencies is real-
lar vectors are needed, they are accumulated by Vk+ = ized on distributed memory systems, this block strat-
Vk Ṽk = (V = I). At convergence, the singular values egy increases the number of sweeps needed to achieve
are given by the norms of the columns of Ak and the convergence.
normalized columns are the corresponding left singular We note that parallel Jacobi algorithms can only
vectors. Each sweep performs mn + O(mn) arith- surpass the speed of the bidiagonalization schemes of
metic operations. Brent and Luk [] and Sameh [] ScaLAPACK when the number of processors available
described possible implementations of this algorithm are much larger than the size of the matrix under con-
on multiprocessors. sideration.
Charlier et al. [] demonstrate that an implemen-
tation of Kogbetliantz’s algorithm for computing the Bidiagonalization of a matrix by Householder reductions.
SVD of upper-triangular matrices is quite effective on The classical algorithm which reduces A into an upper-
a systolic array of processors. Therefore, the first step bidiagonal matrix B, via the orthogonal transformation
in a Jacobi SVD scheme is the QR factorization of A, B = U T AV, was introduced by Golub and Kahan (see for
followed by computing the SVD of ∈ Rn×n . The Kog- instance []). The algorithms requires (mn −n /)+
betliantz’s method for computing the SVD of a real O(mn) arithmetic operations. U and V can be assem-
square matrix A mirrors the method for computing the bled in (mn − n /) + O(mn) and n / + O(n )
eigenpairs of symmetric matrices, in that the matrix A additional operations, respectively. A parallel version of
is reduced to the diagonal form by an infinite sequence this algorithm is implemented in routine PDGEBRD of
of plane rotations ScaLAPACK.
The one-sided reduction for bidiagonalizing a
Ak+ = Uk Ak VkT , k = , , . . . , () matrix was proposed by Raha []. It appears to be
where A ≡ A, and Vk = Vk (i, j, ϕ kij ),
Uk = Uk (i, j, θ ijk ) better suited for parallel implementation. It consists of
are plane rotations. It follows that Ak approaches the determining in a first step the orthogonal matrix V
diagonal matrix Σ = diag(σ , σ , . . . , σn ), where σi is the which would appear in the tridiagonalization of AT A:
ith singular value of A, and the products (Uk . . . U U ), T = V T AT AV = BT B. The matrix V can be obtained by
(Vk . . . V V ) approach matrices whose ith column is using Householder or Givens reductions, without first
the respective left and right singular vector correspond- forming AT A explicitely. The second step performs a QR
ing to σi . factorization of F = AV: F = QR. Since F T F = RT R = T
is tridiagonal, its Cholesky factor R is upper-bidiagonal
which means that any two nonadjacent columns of F are
Block Jacobi algorithms. The above algorithms are well- orthogonal. This allows a simplified QR factorization.
suited for shared memory architectures. While they can Bosner and Barlow introduced in [] two adaptations
also be implemented on distributed memory systems, of Raha’s approach: a block version which allows use of
their efficiency on such systems will suffer due to com- Level  BLAS routines and a parallel version.
munication costs. In order to increase the granularity of Once the bidiagonalization is performed, all that
the computation (i.e., to increase the number of floating remains is to compute the SVD of B. When only
 E Eigenvalue and Singular-Value Problems

the singular values are sought, it can be done by . Chen SC, Kuck DJ, Sameh AH () Practical parallel band
the iterative Golub and Kahan algorithm where the triangular system solvers. ACM Trans Math Software :–
number of arithmetic operations is O(n) per itera- . Demmel J, Dhillon I, Ren H () On the correctness of some
bisection-like parallel eigenvalue algorithms in floating point
tion (see for instance []) which is performed only
arithmetic. ETNA :–
on uniprocessor. The whole Singular-Value Decom- . Demmel J () Applied numerical linear algebra. SIAM,
position is implemented in routine PDGESVD of Philadelphia
ScaLAPACK. . Dhillon D, Parlett B, Vömel C () The design and imple-
mentation of the MRRR algorithm. ACM Trans Math Software
:–
. Dongarra JJ, Duff IS, Sorensen DC, Van der Vorst H ()
Bibliographic Notes and Further Numerical linear algebra for high performance computers. SIAM,
Reading Philadelphia
Eigenvalue problems and singular-value decomposi- . Dongarra JJ, Sidani M () A parallel algorithm for the nonsym-
tion: [] metric eigenvalue problem. SIAM J Sci Comput :–
Algorithms: [] . Dongarra JJ, Sorensen DC () A fully parallel algorithm for
the symmetric eigenvalue problem. SIAM J Sci Stat Comput
Linear algebra: [, , ]
:s–s
Parallel algorithms for dense computations: [, ] . Dongarra JJ, Van De Gein R () Parallelizing the QR algo-
rithm for the unsymmetric algebraic eigenvalue problem: myths
and reality. SIAM J Sci Comput :–
Bibliography . Gallivan KA, Plemmons RJ, Sameh AH () Parallel algorithms
. Adams L, Arbenz P () Towards a divide and conquer algo- for dense linear algebra computations. In: Gallivan KA, Heath
rithm for the real nonsymmetric eigenvalue problem. SIAM J MT, Ng E, Ortega JM, Peyton BW, Plemmons RJ, Romine CH,
Matrix Anal Appl ():– Sameh AH, Voigt RG (eds) Parallel algorithms computations.
. Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Don- SIAM, Philapdelphia
garra J, Du Croz J, Greenbaum A, Hammarling S, McKenney A, . Gimenez D, Hernandez V, van de Gein R, Vidal AM () A
Sorensen D () LAPACK users’ guide, rd edn. Society for Block Jacobi method on a mesh of processors. Conc Pract Exp
Industrial and Applied Mathematics, Philadelphia, Library avail- –:–
able at http://www.netlib.org/lapack/ . Golub GH, Van Loan CF () Matrix computations, rd edn.
. Bai Z, Demmel J, Dongarra J, Ruhe A, van der Vorst H (eds) Johns Hopkins University Press, Baltimore
() Templates for the solution of algebraic eigenvalue prob- . Henry G, Watkins D, Dongarra J () A parallel implementa-
lems – a practical guide. SIAM, Software–Environments–Tools, tion of the nonsymmetric QR algorithm for distributed memory
Philadelphia architectures. SIAM J Sci Comput :–
. Bečka M, Vajtešic M () Block-Jacobi SVD algorithms for dis- . Hestenes MR () Inversion of matrices by biorthogonalization
tributed memory systems. Parallel algorithms and applications, and related results. J Soc Ind Appl Math :–
Part I in , –, Part II in , – . Jacobi CGJ () Uber ein Leiches Vehfahren Die in der
. Blackford LS, Choi J, Cleary A, D’Azevedo E, Demmel J, Dhillon I, theorie der Sacular-storungen Vorkom-mendern Gleichungen
Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Numerisch Aufzuosen. Crelle’s J für reine und angewandte Math-
Walker D, Whaley RC () ScaLA-PACK users’ guide. Society ematik :–
for Industrial and Applied Mathematics, Philadelphia, Available . Jessup ER () A case against a divide and conquer approach
on line at http://www.netlib.org/lapack/lug/ to the nonsymmetric eigenvalue problem. J Appl Num Math :
. Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, –
Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo . Kagström B, Ruhe A () An algorithm for numerical computa-
R, Remington K, Whaley RC () An updated set of basic lin- tion of the Jordan normal form of a complex matrix. ACM Trans
ear algebra subprograms (BLAS). ACM Trans Math Soft –: Math Software :–
– . Lo SS, Philippe B, Sameh A () A multiprocessor algorithm
. Bosner N, Barlow JL () Block and parallel versions of one- for the symmetric tridiagonal eigenvalue problem. SIAM J Sci Stat
sided bidiagonalization. SIAM J Matrix Anal Appl :– Comput :s–s
. Brent RP, Luk FT () The solution of singular-value and sym- . Luk F, Park H () On parallel Jacobi orderings. SIAM J Sci Stat
metric eigenvalue problems on multiporcessor arrays. SIAM J Sci Comput :–
Stat Comput :– . Parlett BN () The symmetric eigenvalue problem. SIAM,
. Charlier J-P, Vanbegin M, Van Dooren P () On efficient Classics in Applied Mathematics, Philadelphia
implementations of kogbetliantz’s algorithm for computing the . Ralha R () One-sided reduction to bidiagonal form. Linear
singular value decomposition. Numer Math :– Algebra Appl :–
EPIC Processors E 

. Sameh A () On Jacobi and Jacobi-like algorithms for a parallel EPIC processors combine the relative hardware
computer. Math Comp :– simplicity of Very Long Instruction Word (VLIW)
. Sameh A () Solving the linear least squares problem on a processors and the runtime adaptability of superscalar
linear array of processors. In: Snyder L, Gannon DB, Siegel HJ
processors. VLIW processors require the compiler to
(eds) Algorithmically specialized parallel computers. Academic,
H.J. Siegel specify a POE consisting of a cycle-by-cycle descrip-
. Trefethen LN, Bau D III () Numerical linear algebra. SIAM, tion of independent operations in each functional unit,
Philadelphia and the hardware follows the POE to execute the
. Zhang J () The scaled sequence computation. In: Proceedings operations. While VLIW considerably simplifies the
of the SIAM Conference on Applied Linear Algebra. July –,
hardware, it also results in machine-code incompati- E

bility between implementations of a VLIW architec-
ture. Due to the strict adherence to the compiler’s POE,
code compiled for one implementation will not exe-
cute correctly on other implementations with differ-
EPIC Processors ent hardware resources or latencies. Superscalar pro-
cessors discover ILP and construct a POE entirely in
David I. August, Arun Raman
hardware. This means that code compiled for a par-
Princeton University, Princeton, NJ, USA
ticular superscalar architecture will execute correctly
on all implementations of that architecture. However,
Definition ILP extraction in hardware entails significant com-
Explicitly Parallel Instruction Computing (EPIC) refers plexity, resulting in diminishing performance returns
to architectures in which features are provided to and an adverse impact on clock rate. EPIC processors
facilitate compiler enhancements of instruction-level retain VLIW’s philosophy of statically exposing ILP and
parallelism (ILP) in all programs, while keeping hard- constructing the POE, thereby keeping the hardware
ware complexity relatively low. Using ILP-enhancement complexity relatively low, but incorporate the ability of
techniques such as speculation and predication, the superscalar processors to cope with dynamic factors
compiler identifies the operations that can execute in such as variable memory latency. Combined with tech-
parallel in each cycle and communicates a plan of exe- niques to provide binary compatibility for correctness,
cution to the hardware. EPIC processors provide a means to achieve high lev-
els of ILP across diverse applications while reducing the
hardware complexity needed for the same.
Discussion
Introduction Design Principles
The performance of modern processors is dependent In EPIC architectures, the compiler is responsible for
on their ability to execute multiple instructions per designing a POE. This means that the compiler must
cycle. In the early s, instruction-level parallelism identify the inherent parallelism of a sequential pro-
(ILP) was the only viable approach to achieve higher gram and map it efficiently to the parallel hardware
performance without rewriting software. Programs for resources in order to minimize the program’s execu-
ILP processors are written in a sequential program- tion time. The main challenge faced by the compiler
ming model, with the compiler and hardware being in determining a good POE is on account of dynamic
responsible for automatically extracting the parallelism events that determine the program’s runtime behavior.
in the program. Explicitly Parallel Instruction Comput- Will a load-store pair access the same memory location?
ing (EPIC) refers to architectures in which features are If yes, then the operations must be executed sequen-
provided to facilitate compiler enhancements of ILP in tially; otherwise, they may be scheduled in any order.
all programs. These architectures allow the compiler to Will a conditional branch be taken? If yes, then opera-
generate an effective plan of execution (POE) of a given tions along the taken path may be scheduled in parallel
program, and provide mechanisms to communicate the with operations before the branch; otherwise, they must
compiler’s POE to the hardware. not be executed. In such situations, the compiler can
 E EPIC Processors

speculate that a load-store pair will not alias, or that a control transfer operations. All operations are assumed
branch will be taken, with the compiler’s confidence in to have a latency of one cycle, with the exception of
the speculation typically being determined by the out- memory loads that have a latency of two cycles.
comes of profiling runs on representative inputs. EPIC Figure a shows the scheduled, assembly code.
architectures support compiler speculation by provid- The compiler conservatively respects all dependences,
ing mechanisms in hardware to ensure correctness if resulting in a rather sparse schedule with several slots
speculation fails. Finally, the architecture must be rich going to waste. The execution time along either branch
enough for the compiler to express its POE; specifically, is ten cycles.
which operations are to be issued in parallel and which
hardware resources they must use. Speculation
Section “ILP Enchancement” presents the major Compiler-controlled speculation refers to breaking
techniques used to enhance ILP of programs. Sec- inherent programmatic dependences by guessing the
tion “Enabling Architectural Features” describes the outcome of a runtime event at compile time. As a
micro-architectural features that enable these and other result, the available ILP in the program is increased
ILP enhancement techniques. by reducing the height of long dependence chains
and by increasing the scheduling freedom among the
ILP Enhancement operations.
The code example in Fig.  is used to illustrate the
main compiler transformations for ILP enhancement Control Speculation
that are made possible by EPIC architectures. In the Many general-purpose applications are branch-
example, a condition is evaluated to alternatively update intensive. The latency to resolve a branch directly affects
a memory location p4 with the contents of location ILP since potentially multiple issue slots may go to
p2 or p3. Outside the if-construct, the variable val waste. It is crucial to overlap other instructions with
is updated with the value at location p5, with the lat- branches. Control speculation breaks control depen-
ter potentially aliasing with p4. The processor model dences which occur between branches and other oper-
assumed for illustration purposes, shown in Table , is ations. An operation is control dependent on a branch
a six-issue processor capable of executing one branch if the branch determines whether the operation will
per cycle, with no further restrictions on the combi- be executed at runtime. By guessing the direction of a
nation of operations that may be concurrently issued. branch, the control dependence may be broken effec-
Conditional branches require separate comparison and tively making the operation’s execution independent
of the branch. Control speculation allows operations
to move across branches, thus reducing dependence
EPIC Processors. Table  Assumed processor model to height and resulting in a more compact schedule.
illustrate ILP enhancement techniques
Issue width Six instructions Data Speculation
Latency of conditional branches Two cycles Many general-purpose applications exhibit irregular
Latency of memory loads Two cycles memory access patterns. Often, a compiler is conser-
Latency of other instructions One cycle vative because it respects a memory dependence that
occurs infrequently or because it cannot determine that
if (*p1 == 0) the dependence does not actually exist. In either case,
data speculation breaks data flow dependences between
*p4 = *p2;
else memory operations. Two memory operations are flow
*p4 = *p3;
dependent on one another if the first operation writes a
val += *p5; value to a memory location and the second operation
potentially reads from the same location. By guessing
EPIC Processors. Fig.  Example C code to illustrate ILP that the two memory operations will access different
enhancement techniques locations, the load operation may be hoisted above the
EPIC Processors E 

0 (1) r1 = MEM[p1]
1
2 (2) c1 = (r1 == 0)
3 (3) JUMP c1, ELSE

4 (4) r2 = MEM[p2]

5
6 (5) MEM[p4] = r2 (6) JUMP CONTINUE
ELSE:
0 (7) r3 = MEM[p3] E
1
2 (8) MEM[p4] = r3

CONTINUE:
0 (9) r5 = MEM[p5]
1
2 (10) r4 = r4 + r5
a

0 (1) r1 = MEM[p1] (4) r2 = MEM[p2] <CS> (7) r3 = MEM[p3] <CS> (9) r5 = MEM[p5] <DS>
1
2 (2) c1 = (r1 == 0) (10) r4 = r4 + r5 <DS>
3 (3) JUMP c1, ELSE

4 (4′) CHECK r2

5 (5) MEM[p4] = r2 (6) JUMP CONTINUE


ELSE:
0 (7′) CHECK r3
1 (8) MEM[p4] = r3
CONTINUE:
0 (9′) CHECK r5
b

0 (1) r1 = MEM[p1]
1
2 (11) pT, pF = (r1 == 0)
3 (4) r2 = MEM[p2] <pT> (7) r3 = MEM[p3] <pF>

4
5 (5) MEM[p4] = r2 <pT> (8) MEM[p4] = r3 <pF>
6 (9) r5 = MEM[p5]
7
8 (10) r4 = r4 + r5
c

0 (1) r1 = MEM[p1] (4) r2 = MEM[p2] <CS> (7) r3 = MEM[p3] <CS> (9) r5 = MEM[p5] <DS>
1
2 (11) pT, pF = (r1 == 0) (10) r4 = r4 + r5 <DS>
3 (4′) CHECK r2 <pT> (7′) CHECK r3 <pF>

4 (5) MEM[p4] = r2 <pT> (8) MEM[p4] = r3 <pF>

5 (9′) CHECK r5
d
EPIC Processors. Fig.  Predication and speculation have a synergistic relationship: applying speculation after
predication results in the best improvement in ILP. (a) Sequential schedule. (b) Schedule using control and data
speculation alone. (c) Schedule using predication alone. (d) Schedule using predication and speculation combined
 E EPIC Processors

store. EPIC provides for more aggressive code motion: to detect if a dependence actually existed for this exe-
operations dependent on the load may also be hoisted cution and initiates repair if required. In general, all
above potentially aliasing stores. data-speculative and all potentially excepting control-
Applying speculation to the code in Fig.  results in speculative operations require checks. In Fig. b, opera-
the tighter schedule shown in Fig. b. <CS> and <DS>, tions ′ , ′ , and ′ are the previously discussed symbolic
respectively, denote operations which have been spec- check operations.
ulated with respect to control or data. The resultant
increase in ILP is achieved primarily by applying spec- Predication
ulation to the loads (operations , , and ). Note that Predicated execution is a mechanism that supports con-
loads from either side of a branch may be speculatively ditional execution of individual operations based on
executed. If the branch is highly biased, then the com- Boolean guards, which are implemented as predicate
piler may choose to speculatively execute just the loads register values. A compiler converts control flow into
from the biased path. Operations  and  are control predicates by applying if-conversion []. If-conversion
dependent on operation . Control speculation enables translates conditional branches into predicate defin-
the compiler to break the dependences and move  and ing operations and guards operations along alterna-
 to the top of the block. Operation  is memory depen- tive paths of control under the computed predicates.
dent on operations  and : these ambiguous memory A predicated operation is fetched regardless of its pred-
dependences arise because the compiler is unable to icate value. An operation whose predicate is TRUE
prove that pointer p5 does not point to the same loca- is executed normally. Conversely, an operation whose
tion as pointer p4. The net result of applying specula- predicate is FALSE is prevented from modifying the
tion is that the dependence height of the code segment processor state. With if-conversion, complex nests of
is reduced to seven cycles. branching code can be replaced by a straight-line
While speculation enhances ILP, it requires hard- sequence of predicated code. Predication increases ILP
ware support to handle exceptions. Exceptions gener- by allowing separate control flow paths to be over-
ated by speculated operations can either be genuine, lapped and simultaneously executed in a single thread
reflecting exception conditions present in the original of control.
code, or spurious, resulting from unnecessary execu- Figure c illustrates an if-conversion of the code seg-
tion of speculative operations. Since speculative opera- ment from Fig. . The predicate for each operation is
tions, like ordinary operations, may cause long-latency shown within angle brackets. For example, operation 
exceptions such as page faults and TLB misses, time is predicated on pT. The absence of a predicate indicates
is wasted if spurious exceptions are repaired. Spuri- that the operation is always executed.
ous exceptions are eliminated by taking exceptions only Predicates are computed using predicate define
when the results of speculative operations are used operations, such as operation . The predicate pT is
nonspeculatively, since it is guaranteed at that point that set to  if the condition evaluates to true, and to  if
the speculated code would have executed in the orig- the condition evaluates to false. The predicate pF is the
inal program. A symbolic operation, called a check, is complement of pT. Operations  and , and operations
responsible for detecting any problems that occurred  and  are executed concurrently with the appropri-
in a previous speculative execution. When an error is ate ones taking effect based on the predicate values. The
detected by a check instruction, either an exception is net result of applying predication is that the dependence
reported or repair is initiated. By positioning the check height of the code segment is reduced to nine cycles.
at the point of the original operation, the error detection
and repair is guaranteed to occur only when the original Combining Speculation and Predication
operation would have been executed by a nonspeculated Both speculation and predication provide effective
version of the program. means to increase ILP. However, the example shows that
For data speculation, repair is necessary when an their means of improving performance are fundamen-
actual data dependence existed between the speculated tally different. Speculation allows the compiler to break
load and one or more stores predicted to be indepen- control and memory dependences, while predication
dent at compile time. The check queries the hardware allows the compiler to restructure program control flow
EPIC Processors E 

and to overlap separate execution paths. The problems First, speculative operations must be distinguished
attacked by both techniques often occur in conjunction; from nonspeculative operations, since the latter should
therefore, the techniques can be mutually beneficial. report exceptions immediately. Also, loads that have not
Figure d illustrates the use of predication and been speculated with regard to data dependence need
speculation in combination. If-conversion removes the not interact with memory conflict detection hardware.
branch resulting in a stream of predicated operations. For this, each operation that can be speculated has an
As before, data speculation breaks the dependences extra bit, called the S-bit, associated with it. This bit is set
between operations  and , and operations  and , on operations that are either control speculated or are
allowing the compiler to move operation  to the top of data dependent on a data-speculative load. Each load E
the block. Even though no branches remain in the code, has an additional bit, called the DS-bit. This bit is set
control speculation is still useful to break dependences only on data-speculative loads.
between predicate definitions and guarded instructions. Second, to defer the reporting of exceptions aris-
In this example, the control dependences between oper- ing from speculative operations until it is clear that
ations  and , and operations  and  are elimi- the operation would have been executed in the origi-
nated by removing the predicates on operations  and nal program, a -bit tag called the NaT (Not a Thing)
. This form of speculation in predicated code is called bit is added to each register. When a speculative load
promotion. As a result of this speculation, the compiler causes an exception, the NaT bit is set instead of imme-
can hoist operations  and  to the top of the block to diately raising the exception. Speculative instructions
achieve a more compact schedule of height six cycles. that use a register with its NaT bit set propagate the bit
In summary, predication is only % faster and spec- through their destination registers until a check oper-
ulation is only % faster than the original code seg- ation determines whether speculation failed in which
ment. As the example shows, speculation in the form of case the deferred exception is suppressed.
promotion can have a greater positive effect on perfor- Third, a mechanism must be provided to store the
mance after if-conversion than before. This synergistic source addresses of data-speculative loads until their
relationship between predication and speculation yields independence with respect to intervening stores can be
a combined speedup of % over the original code established. This functionality can be provided by the
segment. Memory Conflict Buffer (MCB) [] or the Advanced
Load Address Table (ALAT) []. The MCB temporarily
associates the destination register number of a specula-
Enabling Architectural Features
tive load with the address from which the value in the
Since EPIC delegates the responsibilities of enhancing
register was speculatively loaded. Destination addresses
ILP and planning an execution schedule to the compiler,
of subsequent stores are checked against the addresses
the architecture must provide features:
in the buffer to detect memory conflicts. The MCB is
. To overcome the worst impediments to a compiler’s queried by explicit data speculation check instructions,
ability to enhance ILP, namely, frequent control which initiate recovery if a conflict was detected.
transfers and ambiguous memory dependences
. To specify an execution schedule
Predication Support
A set of predicate registers are used to store the results
Speculation Support of compare operations. For example, the Intel Ita-
As discussed in Section “ILP Enhancement”, an archi- nium processor family has a set of  (-bit) predi-
tecture that supports speculation must provide cate registers. An EPIC processor must also implement
mechanisms to detect potential exceptions on control- some form of predicate squash logic, which prevents
speculative operations as they occur, to record infor- instructions with false predicates from committing their
mation about data-speculative memory accesses as they results. The IMPACT EPIC architecture [] also pro-
occur, and then to check at an appropriate time whether vides predication-aware bypass/interlock logic, which
an exception should be taken or data-speculative repair forwards results based on the predicate values associ-
should be initiated. ated with the generating instructions.
 E EPIC Processors

Static Scheduling Support can schedule other operations that are anti- or output-
EPIC provides many features for enabling the com- dependent on the operation in question, resulting in
piler to specify a high-quality Plan Of Execution (POE). tighter schedules.
These features include the ability to specify multiple As mentioned earlier, EPIC processors may still
operations per instruction, the notion of architecturally have interlocking to account for the deviation of actual
visible latencies, and an architecturally visible register latencies of operations from their assumed latencies.
structure. However, MultiOp and NUAL permit an in-order
implementation of the architecture while still extracting
Multiple Operations per Instruction (MultiOp) substantial ILP.
EPIC architectures allow the compiler to specify explicit
information about independent operations in a pro- Register Structure
gram. A MultiOp is a specification of a set of operations ILP processors require a large number of registers.
that are issued simultaneously to be executed by the Specifically in EPIC processors, techniques such as
functional units. Exactly one MultiOp is issued per cycle control and data speculation cause increased register
of the virtual time. The virtual time is the schedule built pressure. Fundamentally, since multiple operations are
by the compiler, and may differ from the actual time if executing in parallel, there is a demand for a larger store
the hardware inserts stalls due to dynamic events. for temporary values. Superscalar processors architec-
For example, the Itanium architecture [] con- turally expose only a small fraction of the physical reg-
sists of instruction groups that have no read-after-write isters, and rely on register renaming to generate parallel
(RAW) or write-after-read (WAR) register dependen- schedules. However, this requires complex hardware for
cies, delimited by stops. The Itanium implementation dynamic scheduling. Since EPIC demands a schedule
consists of -bit instruction bundles comprising three from the compiler in order to simplify the hardware,
-bit opcodes and a -bit template that includes intra- it becomes necessary to expose a larger set of architec-
bundle stop bits. The use of template bits greatly sim- tural registers to generate the same parallel schedule as
plifies instruction decoding. The use of stops enables achieved through hardware register renaming.
explicit specification of parallelism. EPIC architectures also provide a mechanism to
rotate the register file to speed up inner-loop execution.
Architecturally Visible Latencies Speeding up loops involves executing instructions from
A lot of the hardware complexity of a superscalar pro- multiple loop iterations in parallel. One way to do this
cessor arises from the need to maintain an illusion is by using Modulo Scheduling []. Modulo scheduling
of sequential execution (i.e., each operation completes schedules one iteration of the loop such that when this
before another begins) in the face of non-atomicity of schedule is repeated at regular intervals, no intra- or
the operations. To avoid the complexity, EPIC proces- inter- iteration dependence is violated, and no resource
sors expose operations as being nonatomic; instead, conflict arises between the dynamic operations within
read-and-write events of an operation are viewed as and across loop iterations. While generating code, it is
the atomic events. There is a contractual specifica- important that false dependences do not arise owing to
tion of assumed latencies of operations that must be dynamic instances on different iterations of the same
honored by both the hardware and the compiler. operation writing to the same registers. While the loop
A MultiOp that takes more than one cycle to generate could be unrolled, followed by static register renaming,
all the results is said to have non-unit assumed latency this typically results in substantial code growth. EPIC
(NUAL). The guarantee of NUAL affords multiple ben- architectures implement a rotating register file that pro-
efits to EPIC processors. Since the hardware is certain vides register renaming such that successive writes to
that the read event of an operation will not occur before the same virtual register actually go to different phys-
the write event of the operation that produces the value ical registers. This is accomplished through the use of
to be read, there is no need for stall/interlock capability. a rotating register base (RRB) register. The physical reg-
Since the compiler is certain that an operation will not ister number accessed by each operation is the sum of
write its result before its assumed latency has elapsed, it the virtual register number specified in the instruction
EPIC Processors E 

and the number in the RRB register, modulo the num- Related Entries
ber of registers in the rotating register file. The RRB Cydra 
is updated at the end of each iteration using special Dependences
operations. In this manner, EPIC architectures provide Instruction-Level Parallelism
compiler-controlled dynamic register renaming. Modulo Scheduling and Loop Pipelining
Another interesting feature is the register stack that Multiflow Computer
is used to simplify and speed up function calls. Instead Parallelization, Automatic
of spilling and filling all the registers at a procedure call
and return sites, EPIC processors enable the compiler E
to rename the virtual register identifiers. The physical
Bibliographic Notes and Further
register accessed is determined by renaming the virtual
Reading
register identifier in the instruction through a base reg-
The EPIC philosophy was primarily developed by HP
ister. The callee gets a physical register frame that may
researchers in the early s [, ]. As mentioned in the
overlap with the caller’s frame; the overlap constitutes
introduction, the EPIC philosophy was fundamentally
the registers that are passed as parameters.
influenced by the VLIW philosophy developed in the
s in the Multiflow and Cydrome processors [, ].
EPIC inherited various features from Cydrome, most
significantly:
Cache Management
EPIC architectures allow the compiler to explicitly con- ● Predicated execution
trol data movement through the cache hierarchy by ● Support for software pipelining of loops in the form
exposing the hierarchy architecturally. Generating a of rotating register files
high-quality schedule often requires the compiler to be ● A rudimentary MultiOp instruction format
able to accurately predict the level in the cache hier- ● Hardware support for speculation including specu-
archy where a load operation will find its data. This is lative opcodes, a speculative tag bit in each register,
normally achieved by profiling or analytical means, and and deferred exception handling
the source-cache specifier is encoded as part of the load
Many of the ideas on speculation and its hardware
instruction. This informs the hardware of the expected
support were developed independently by researchers
location of the datum as well as the assumed latency of
in the IMPACT group at the University of Illinois []
the load.
and IBM Research []. The most famous commercial
EPIC architectures also provide a target cache spec-
EPIC processor is the Itanium processor family jointly
ifier for load/store operations that allow a compiler to
developed by HP and Intel. The first processor in the
instruct the hardware of where the datum should be put
family was released in , with successors having
in the cache hierarchy for subsequent references to that
improved memory subsystems, multiple cores on the
datum. This allows the compiler to explicitly control the
same die, and improved power efficiency.
content of the caches. The compiler could, for example,
More information on the EPIC philosophy and
prevent pollution of the caches by excluding data with
implementation can be found in the technical reports
poor temporal locality from the higher-level caches, and
and manuals on the topic [, ].
remove data from the last level cache after the last use.
Many streaming programs operate on data without
accessing them again. To improve the latency of such Bibliography
accesses, EPIC architectures provide a data prefetch . Allen JR, Kennedy K, Porterfield C, Warren J () Conversion
cache. Prefetch load instructions can be used to prefetch of control dependence to data dependence. In: Proceedings of the
th ACM Symposium on Principles of Programming Languages,
data with poor temporal locality into the data prefetch
pp –, January 
cache. This prevents the displacement of first-level data . Gallagher DM, Chen WY, Mahlke SA, Gyllenhaal JC, Hwu WW
cache contents that may potentially have much better () Dynamic memory disambiguation using the memory con-
temporal locality. flict buffer. In: Proceedings of th International Conference on
 E Erlangen General Purpose Array (EGPA)

Architectural Support for Programming Languages and Operat- developed, mainly by W. Händler. Such a system is a
ing Systems, pp –, October  collection of nodes where each node consists of one pro-
. Intel Corporation () Intel Itanium architecture software cessor and one shared memory unit at least. The nodes
developer’s manual. Application architecture, vol , Revision ..
are organized in several hierarchically ordered planes
Santa Clara, CA
. August DI, Connors DA, Mahlke SA, Sias JW, Crozier KM, Cheng (layers) and one top-node. Each plane is a collection of
B, Eaton PR, Olaniran QB, Hwu WW () Integrated pred- m identical nodes arranged in an m × m grid, where
ication and speculative execution in the IMPACT EPIC archi- each node is connected to its four nearest neighbors via
tecture. In: Proceedings of the th International Symposium on shared memory. At the border of the plane, the nodes
Computer Architecture, pp –, June 
are interconnected in such a way that the layer forms a
. Rau BR () Iterative modulo scheduling: an algorithm for soft-
ware pipelining loops. In: Proceedings of the th International torus. Each upper plane is smaller than the layer imme-
Symposium on Microarchitecture, pp –, December  diately below. It has a quarter of the nodes of the lower
. Schlansker MS, Rau BR () “EPIC: An architecture for plane. Each node, which is not part of the lowest plane,
instruction-level parallel processors,” Technical Report HPL- has access to the shared memory of four child nodes
-, Compiler and Architecture Research HP Laboratories
of the layer immediately below. The base layer, called
Palo Alto, February 
. Schlansker MS, Rau BR () EPIC: Explicitly Parallel Instruc-
working plane, is that part of the system where most of
tion Computing. Computer ():– the real work is to be done. The upper layers and the
. Fisher JA () Very long instruction word architectures and the top-node feed the working plane with data and offer
ELI-. In: ISCA ’: Proceedings of the th Annual Interna- support functions. In larger systems, only some of the
tional Symposium on Computer Architecture, ACM, New York, lower planes are realized. In this case, the top-node is
pp –
connected to the highest realized plane via bus.
. Beck GR, Yen DW, Anderson TL () The Cydra  minisu-
percomputer: architecture and implementation. J Supercomput A special EGPA system is MEMSY. In such a system,
:– only the two lower planes and the top-node are realized.
. Mahlke SA, Chen WY, Hwu WW, Rau BR, Schlansker MS () Furthermore, by adding a bus system, long distance
Sentinel scheduling for VLIW and superscalar processors. In: Pro- communication via messages is possible. Consequently,
ceedings of the th International Conference on Architectural
several programming models, not only shared memory,
Support for Programming Languages and Operating Systems,
pp –, October 
are supported.
. Ebcioglu K () Some design ideas for a VLIW architecture From  to , three such systems of the EGPA
for sequential-natured software. In: Cosnard M, Barton MH, class have been realized: the Pilot Pyramid (+ nodes),
Vanneschi M (eds) Parallel Processing (Proc. IFIP WG . Work- the DIRMU Pyramid, and the MEMSY Prototype ( +
ing Conference on Parallel Processing, Pisa, Italy), pp –, North  +  nodes both).
Holland, 

Discussion
Introduction
Erlangen General Purpose In , W. Händler published first ideas [A] and pro-
Array (EGPA) posed one year later together with F. Hofmann and H.J.
Schneider a new subclass of multiprocessors []: the
Jens Volkert
Johannes Kepler University Linz, Linz, Austria Erlangen General Purpose Array (EGPA, Erlangen is
an university city near Nürnberg in Bavaria). (Many
of the following text is taken from [, , ].) The
essential features and design objectives of these MIMD-
Synonyms computers are:
MEMSY
. Homogeneity: There is only one identical type of
element – the node often called PMM (processor-
Definition memory-module).
During the s, the concept of the multiprocessor . Memory-coupling: Connection between nodes is
class Erlangen General Purpose Array (EGPA) was via memory.
Erlangen General Purpose Array (EGPA) E 

. Restricted neighborhood: No node has access to physics, applied geophysics, and others. The corre-
the whole memory, and no memory block can be sponding models of physical phenomena are either con-
accessed by all nodes. tinuum models or many-body models.
. Hierarchy: The PMMs are arranged in several layers Essential is that there is an inherent “locality” in
and these layers form a total order. these approaches. The designer of EGPA kept this fact
. Regularity: The nodes of one layer are ordered like a in mind when they developed their machines accord-
grid. ing to the possibilities the computer technology offered
at that time, the s.
W. Händler explained these features as follows: The rea- E
sons for these demands are speed (memory coupling),
Design Goals
programmability (regularity, homogeneity), control
The EGPA architecture was defined with the following
(hierarchy), technical limits (restricted neighborhood),
design goals in mind:
and expandability (all).
In the following years, several EGPA systems were Scalability
built. The first one, the pilot pyramid, was assembled The architecture should be scalable with no theo-
in  and was used until . It consisted of four retical limit. The communication network should
PMMs at the working plane level  (A-processors) and grow with the number of processing elements in
one PMM at level  (B-processor). order to accommodate the increased communica-
From  to , DIRMU (Distributed Recon- tion demands in larger systems.
figurable Multiprocessor Kit []) was used for test- Portability
ing applications written for EGPA systems. DIRMU Algorithms should run on EGPA systems of any size.
was a kit consisting of PMMs,  based, where Cost effectiveness
processor-ports of one PMM could be interconnected As far as possible, off-the-shelf components should
with memory-ports of other PMMs by cables. Using be used.
several of these PMMs, different topologies could be Flexibility
realized: rings, trees, and so on. Especially an EGPA The architecture should be usable for a great variety of
pyramid with three layers ( +  +  PMMs) could user problems.
be built. Efficiency
From  to , a new EGPA system started The computing power of the system should be big
running in Erlangen. It was called MEMSY (Modular enough to handle real problems which occur in
Expandable Multiprocessor System). In this multipro- scientific research.
cessor, three layers were realized. It consisted of  nodes
in the working plane and  nodes in the next layer. System Topology
These machines were used to demonstrate the feasi- An EGPA system consists of identical PMMs (nodes):
bility and usefulness of EGPA systems. one processing unit and one attached communication
In the following, the principles of EGPA systems are (multiport) memory form one PMM (Processor Mem-
discussed, and the pilot pyramid and MEMSY serve as ory Module). The processing element consists of one or
examples. more processors. Optionally, there can exist a copro-
cessor and/or a private memory only accessible by the
processing unit itself.
Motivation The PMMs are hierarchically arranged in several
EGPA systems are general purpose machines, but they layers (planes). An example with four layers (A, B, C, D)
were especially designed for numerical simulations. is given in Fig. a. At the topmost layer, there is only a
With the emergence of powerful computer systems, the single PMM which serves as a front-end to the system.
interest in computational science was awakened. This At each lower layer, each PMM is connected to exactly
computational approach in natural science and engi- four neighbors at the same layer, e.g., each processor
neering sciences was and is very successful in context has access to the communication memories of its neigh-
with fluid dynamics, condensed matter physics, plasma boring PMMs. Thus, at each plane, the hardware has a
 E Erlangen General Purpose Array (EGPA)

D
Processor- Child NE
Memory-Module 1111...
Child NW
C (PMM) 1110...
Child SW
Symmetric 1101...
Child SE
multiport-memory 1100...
Neighbour E
connection 1011...
Neighbour N
1010...
B Asymmetric Neighbour W
1001...
multiport memory Neighbour S
1000...
connection
between PMMs of
different
hierarchical level Private
I/O communication memory
supported by I/O-
A processor
0001...
Shared Memory 0000...

Erlangen General Purpose Array (EGPA). Fig.  (a) EGPA with  PMMs in the working plane (b) address space of a node

grid-like structure. As in each row, the most left node The overall arrangement of an EGPA system is pyra-
is connected to the most right PMM, and in each col- midal. Each pyramid can be extended downward to any
umn, the corresponding nodes are connected too, each size by adding new levels. Such an extension signifi-
layer forms a torus. (In Fig. a, for clarity, only two of cantly increases the computing power of the system.
these turn-around connections are drawn.) In addition An EGPA system is a distributed memory multiproces-
to the horizontal accesses, each node, except the very sor. Remote communication between nodes is done via
lowest, has access to four PMMs at the next lower layer. chains of copies and not via messages as usual.
Consequently, the address space of a processor looks
like in Fig. b. Each access with an address lying in the
address space part of a child or neighbor will be trans- System Software
lated by hardware to a shared memory address of that Each node has its own local operating system that
child or neighbor and that location of the child or neigh- exactly renders those services the resources of which
bor will be accessed (e.g., the address  . . . causes are available. Now, in case a process needs a system ser-
an access to the cell  . . . of the neighbor in the vice, this service either will be rendered at once or the
north). addressed operating system cannot perform it and now
The vertical connections are equipped to broadcast informs his superior processor, i.e., its operating system,
data downward to all four of the lower PMMs simulta- of the order.
neously, say, to transmit code segments. On the other In the pilot pyramid, this technique was realized as
hand, though the lower nodes do not have memory- follows. For each A-processor, a shadow process existed
access to those at higher layers, they are able to interrupt in the B-processor. At the four A-processors, only a user
their supervising PMM. These substructures, consisting interface to the operating system of the B-processor
of four nodes and their supervisor, are called elementary was implemented. If there was an order to the operat-
pyramid. Several of them are highlighted in Fig. a. Each ing system of the B-processor, this interface communi-
elementary pyramid has an I/O processor with corre- cated with its shadow process. Then this shadow process
sponding I/O devices, mainly disks. The I/O processor sent the request to the proper operating system which
which has access to all of its elementary pyramid memo- executed the desired function afterward.
ries is controlled by the supervising PMM. The topmost This approach can be extended to larger EGPA
PMM is connected to a network containing a software computer; since each such system is composed of ele-
development system. mentary pyramids (see Fig. a). Each B-processor is
Erlangen General Purpose Array (EGPA) E 

part of an elementary pyramid one level higher, and so on. Consequently, the technique of shadow processes could be used in the upper elementary pyramids as well.

The proposed user interface for programming such a multiprocessor was, in addition to several tools for testing parallel programs, the EGPA monitor. This monitor consists of an EGPA frame program and a pool of subroutines. If a parallel program is executed, the frame program runs first, installing the necessary system processes and then starting the user routines. The subroutines of the monitor are tools for controlling user programs. They allow the user to distribute programs and data to the system, to control the processes, and to execute coordinating tasks (see []). Two very important tools are EXECUTE (PROGNAME, PROC), which starts routine PROGNAME at the indicated processor – child or neighbor – and WAIT (PROGNAME, PROC), which waits until the routine has finished at that processor.
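
As a hedged illustration of how these two monitor calls were combined in practice, the following C-style fragment starts the same routine on all four children of a supervising PMM and then waits for each of them. The identifiers execute(), wait_for(), and the processor constants are hypothetical stand-ins for the monitor's EXECUTE and WAIT, not the real bindings.

    /* Illustrative sketch only: assumed C bindings for the EGPA monitor calls. */
    enum proc_id { NORTHWEST, NORTHEAST, SOUTHWEST, SOUTHEAST };
    void execute(const char *progname, enum proc_id proc);   /* EXECUTE (PROGNAME, PROC) */
    void wait_for(const char *progname, enum proc_id proc);  /* WAIT (PROGNAME, PROC)    */

    /* Start routine "RELAX" on the four children and wait for all of them. */
    void run_step_on_children(void)
    {
        static const enum proc_id children[4] = { NORTHWEST, NORTHEAST, SOUTHWEST, SOUTHEAST };
        int i;

        for (i = 0; i < 4; i++)
            execute("RELAX", children[i]);
        for (i = 0; i < 4; i++)
            wait_for("RELAX", children[i]);
    }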
The EGPA monitor was realized for the Pilot Pyramid, but the concept can be carried over to larger systems. It supports only one parallel program at a time, in contrast to MEMSOS (see later) used in the MEMSY pilot.

The Pilot Pyramid

The pilot pyramid, equivalent to an elementary pyramid, was assembled in  and was used until  (see Fig. ). It consisted of four PMMs at the working plane level  (A-processors) and  PMM at level  (B-processor). All nodes were commercially available computers AEG /, which offered a multiport memory. The characteristics of these processors were internal processor cycles of , , and  ns and a word length of  bits. The access time to a word was – μs. Each A-processor had a memory of  KB and the B-processor  KB. For more details, see [].

The operating system was an expansion of the originally available operating system MARTOS. The EGPA monitor could be called from user programs written in FORTRAN or SL (a subset of ALGOL).

Application Programming with the EGPA Monitor

EGPA was designed mainly for solving problems using data partitioning. The corresponding programming model is known as SPMD (Single Program Multiple Data).

[Figure omitted: a node of the Pilot Pyramid with processor, memory, I/O control, and I/O devices.]
Erlangen General Purpose Array (EGPA). Fig.  Pilot Pyramid

Before implementing a parallel program, the user has to partition his problem domain in such a way that as much work as possible can be done in parallel. Furthermore, he has to assign the data to the processing elements himself and, in contrast to a global shared-memory system, the EGPA user had to take into account that the needed data should not only exist within the system but also be in the right memory at the right time. The synchronization and the control of data were done by using the EGPA monitor. The allocation of data was effected by segments, and most of the data transports happened under user control. There were two possibilities for synchronization: under system control using messages, or under user control by special routines of the monitor. The latter was a factor of a thousand faster, but deadlocks were possible if the user made a mistake. Nevertheless, it was preferred by nearly all users, and the speedups gained were much better.

One very important area for using EGPA systems is partial differential equations. Under the assumption that the region of interest is a square (or a cube) represented by a grid for numerical purposes, and that a grid-point-wise relaxation method is performed in red-black order, the partitioning is very simple. To each processor one part is assigned. Each processor handles its portion of the grid and needs only values produced by neighboring processors. In Fig. a, the access area of one processor is hatched.
On an EGPA system, one iteration step performed at a processor p is as follows:

1. Relax red points which do not depend on points of any neighbor.
2. Wait until all neighbors have signaled that their values can be used, e.g., the western neighbor has set the variable "ready" to the iteration number (p accesses this variable using the name "ready&western").
3. Relax red points which depend on black points of neighbors. These are points at the border of p.
4. Relax black points.
5. Ready := Iteration number + ; /* signal to neighbors that own borders are ready for the next iteration */
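
A compact C sketch of one such iteration step is given below. It assumes a shared iteration counter per node, read through the shared-memory windows of the four neighbors, and helper routines relax_red_interior(), relax_red_border(), and relax_black() whose bodies are omitted; none of these names are taken from the EGPA software, they only restate the five steps above.

    /* Illustrative helpers; their bodies (sweeps over the local grid part) are omitted. */
    void relax_red_interior(void);
    void relax_red_border(void);
    void relax_black(void);

    /* This node's counter, visible to neighbors, and the neighbors' counters,
     * read through the shared-memory windows (all names are assumptions). */
    extern volatile int ready;
    extern volatile const int *ready_west, *ready_east, *ready_north, *ready_south;

    void relaxation_step(int iteration)
    {
        relax_red_interior();                  /* 1. red points independent of neighbors     */

        while (*ready_west  < iteration ||     /* 2. wait until all neighbors have published */
               *ready_east  < iteration ||     /*    their borders for this iteration        */
               *ready_north < iteration ||     /*    (cf. "ready&western")                   */
               *ready_south < iteration)
            ;                                  /*    busy-wait                               */

        relax_red_border();                    /* 3. red points that use neighbor values     */
        relax_black();                         /* 4. all black points                        */
        ready = iteration + 1;                 /* 5. own borders ready for the next step     */
    }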
For this simple case, speedups near the number of processors in the working layer were achieved. If it was possible to map a rectangle onto a non-square physical space (see, e.g., Fig. b), the results were equivalent.

If the domain under investigation is not so favorable, the programmer had to search for other solutions. If, e.g., the domain is a circle, it is not possible to give all nodes the same load and, consequently, the efficiency of the system is smaller. Sometimes it helps to divide the region into several blocks. The grid is then folded onto the multiprocessor in such a way that each processor operates on a part of several blocks and each block is calculated by several processors simultaneously (Fig. c). Thereby, the torus structure of the layer is necessary. In Fig. d, this is illustrated for a circle. Each square is distributed to all nodes of the working plane.

[Figure omitted: four panels (a)–(d).]
Erlangen General Purpose Array (EGPA). Fig.  Data grid partitioning for pointwise relaxation method

In many cases, the application designer has to search for a new algorithm to use the power of the system. Thereby, he has to take into account the interconnection structure of an EGPA system. Matrix multiplication (H = F · G) with m × m nodes in the working plane shall serve as an example. At first glance, the interconnection topology of an EGPA seems to be a serious restriction for the calculation of any new element, since data stored in remote memories (in the sense of the restricted neighborhoods) are needed and have to be transported using the communication memories. The following algorithm is based on an idea of Cannon [].

At the beginning, all matrices are distributed as equally as possible over the lowest plane and H is set to . Hij, Fij, and Gij are stored in node pij (see Fig. , start position for m = ). The start position is transformed by more or less rotation of the partial matrices Fij and Gij in the rows respectively columns of the working plane until the data are distributed as depicted in Fig. , step  for m = . For example, p reads G and copies it into the communication memory of processor p; this data is read and written into its own shared memory by p. These data transports can be done in parallel by the hardware. Thereby, the toroidal connections are used.

Step : Each processor pij multiplies the partial matrices Fik and Gkj stored in its memory at the beginning of step i and adds the resulting matrix to Hij. Subsequently, the processor writes the matrix Fik into the memory of the processor "west" of it and the matrix Gij into the memory of the processor "north" of it. After m steps, the result H is distributed on the working plane as shown in Fig.  (Result).

Parts of the program, written in SL (a subset of ALGOL) for demonstration purposes, are shown below. It is one SPMD program per layer: the same code runs at each processor of one layer. For that reason, relative names with the following syntax were used.
[Figure omitted: six panels (Start position, Step 1, Step 2, Step 3, Step 4, Result), each a 4 × 4 array showing which partial matrices Fik and Gkj – and, in the Result panel, Hij – are stored in node pij of the working plane after the corresponding step.]

Erlangen General Purpose Array (EGPA). Fig.  Matrix multiplication for m =  =  nodes: partial matrix distribution

<relative name> ::= <name>&<neighbour> | <name>&<child>
<neighbour>     ::= NORTH | WEST | SOUTH | EAST                     /* the 4 neighbours */
<child>         ::= NORTHWEST | NORTHEAST | SOUTHWEST | SOUTHEAST   /* the 4 children   */

Allocation of data at each A-processor:

    /* installation of a global (shared) segment with the name "MATRIX"
       in its own shared memory */
    &SEG (MATRIX, GLOBAL);
    /* consequently there exist segments "MATRIX" installed in the shared
       memory of all its neighbours. It needs access only to those segments
       of its western and northern neighbour. Therefore */
    &SEG (MATRIX&WEST, GLOBAL);
    &SEG (MATRIX&NORTH, GLOBAL);
    /* declaration part */
    GLOBAL (MATRIX) [] REAL FMATRIX, GMATRIX, HMATRIX, INT FSEM, GSEM;
    VALGLOB (MATRIX&WEST)  REF [] REAL FMATRIX&WEST,  REF INT FSEM&WEST;
    VALGLOB (MATRIX&NORTH) REF [] REAL GMATRIX&NORTH, REF INT GSEM&NORTH;

Essential part of the main program running at each A-processor:

    /* P x P is the number of A-processors */
    FSEM := GSEM := ;
    FOR STEP FROM 1 TO m
    DO /* FSEM = STEP and GSEM = STEP are indicators that the data needed
          for step STEP are already stored */
       WHILE FSEM /= STEP OR GSEM /= STEP
       DO WAIT (T) OD;          /* T for avoiding too many memory accesses if coherent
                                   caches are used - not in the Pilot Pyramid, where
                                   WAIT is not necessary */
       MULT (HMATRIX, FMATRIX, GMATRIX);
                                /* local matrix multiplication followed by addition of
                                   the result to HMATRIX */
       FSEM := GSEM := -STEP;   /* indicator that the data are used and can be
                                   overwritten */
       WHILE FSEM&WEST /= -STEP
       DO WAIT (T) OD;          /* western neighbour has not finished */
       COPY (FMATRIX, FMATRIX&WEST);
                                /* subroutine which copies FMATRIX into FMATRIX of
                                   the western neighbour */
       FSEM&WEST := STEP + 1;   /* indicator that data has been transported */
       WHILE GSEM&NORTH /= -STEP
       DO WAIT (T) OD;
       COPY (GMATRIX, GMATRIX&NORTH);
       GSEM&NORTH := STEP + 1
    OD

The codes for gaining the data distribution at the beginning of Step  (see Fig. ), as well as for COPY and for MULT, are not represented. These routines need the size of the matrices, which depends on the position of
the processor. Therefore, the EGPA monitor provides each processor with its position (column, row) in the working plane together with the value of m.
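
For completeness, a hedged C sketch of what such a start-position routine could compute is given below. It only determines, for a node at position (row, col) in an m × m working plane, which partial matrices the node should hold after the initial alignment of Cannon's algorithm; the function name and the convention that row i of F is rotated left by i positions while column j of G is rotated up by j positions are illustrative assumptions, not part of the EGPA listings.

    /* Initial alignment (skew) of Cannon's algorithm on an m x m torus (sketch).
     * After the start position is transformed, node p(row,col) holds
     * F(row, (row+col) mod m) and G((row+col) mod m, col), so that in every
     * step it can multiply matching partial matrices, then shift F west and
     * G north over the toroidal connections. Indices are 0-based here. */
    typedef struct { int f_row, f_col, g_row, g_col; } block_ids;

    block_ids cannon_start_position(int row, int col, int m)
    {
        block_ids b;
        b.f_row = row;
        b.f_col = (row + col) % m;   /* row i of F rotated left by i positions  */
        b.g_row = (row + col) % m;   /* column j of G rotated up by j positions */
        b.g_col = col;
        return b;
    }

With row, col, and m supplied by the EGPA monitor as described above, and the blocks moved with COPY over the torus, this corresponds to the distribution depicted for the first step in Fig. .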
MEMSY

One essential result of all the investigations concerning EGPA was the following. For many kinds of computations, the main computing load was carried by the lowest layer. The higher levels were mainly used for distributing partial tasks and data and for the realization of communication mechanisms. The consequence was that the higher layers were left underutilized. To remove this problem, the first idea was to taper the pyramid so that each lower level in an elementary pyramid has  or even , rather than merely four, PMMs. Experiences with a test system showed that excessive tapering (more than nine nodes in the elementary pyramid) can lead to bottlenecks at the higher layers. During these investigations, the best solution was found: plane B must be powerful enough (elementary pyramids with five nodes) that the upper levels can be removed if the topmost node is connected to the B-plane by a bus. Such a special EGPA system is called MEMSY.

[Figure omitted: A-plane, B-plane, and C-plane of the MEMSY prototype; memory sharing within a plane, memory sharing between planes, and a common bus.]
Erlangen General Purpose Array (EGPA). Fig.  MEMSY prototype

A prototype with  nodes in the A-plane and  nodes in the B-plane had been realized. The theoretical peak performance was  GMIPS or  MFLOPS (double precision); the measured performance was up to  MFLOPS [, ]. (Most of the following information presented in this chapter is taken from [].)

The node structure of MEMSY differed from that of the pilot pyramid. Each node consisted of a Motorola multiprocessor board MVEME ( MC,  MHz clock rate) and additional hardware. It contained a local memory, an attached shared memory, and an I/O bus with access to different devices.

The programming model of MEMSY was defined as a set of library calls which could be called from C and C++. In contrast to the EGPA model, it was not based only on the shared-memory paradigm of EGPA. Instead, the model defined a variety of different mechanisms for communication and coordination. From these mechanisms, the application programmer could pick those calls which were best suited for his particular problem.

The use of the shared memory was based on the concept of "segments," very much like the original shared memory mechanism provided by UNIX System V. A process which wants to share data with another process (possibly on another node) first has to create a shared memory segment of the needed size. To have the operating system select the correct location for this memory segment, the process has to specify with which neighboring nodes this segment needs to be shared. After the segment has been created, other processes may map the same segment into their address space by means of an "attach" operation. The segments may also be unmapped and destroyed dynamically.
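
The following C fragment sketches this create/attach/detach pattern. The calls mem_create(), mem_attach(), mem_detach(), and the sharing mask NEIGHBOUR_WEST are invented placeholders; the real MEMSY library's function names and signatures are not reproduced here, only the usage pattern described above.

    #include <stddef.h>

    /* Hypothetical MEMSY-style calls (placeholders, not the real library). */
    int   mem_create(const char *name, size_t size, int shared_with_mask);
    void *mem_attach(const char *name);
    int   mem_detach(void *addr);

    #define NEIGHBOUR_WEST 0x1    /* illustrative bit mask of sharing partners */

    void publish_border(void)
    {
        /* Create a segment that the operating system must place so that the
         * western neighbor can also map it, fill it, and unmap it again. */
        mem_create("grid_border", 4096, NEIGHBOUR_WEST);
        double *border = mem_attach("grid_border");
        border[0] = 1.0;
        mem_detach(border);
    }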
For coordination purposes, three mechanisms had been provided: short (two-word) messages, semaphores, and spinlocks. There was a second message type for high-volume and fast data transfers between any two processors of the system.

To express parallelism within a node, the programmer had to create multiple processes on each processing element by a special variant of the system call "fork." Parallelism between nodes was handled by a configuration description, which forced the application environment to start one initial process on each node that the application should run on. In this starting phase, the application was consistent with the SPMD
model. Through the execution of fork-calls, a configu- examples which were well-dimensioned for the used
ration could be established which is beyond this model. system. This means the problem size is maximal and,
The operating system of the MEMSY prototype was e. g., in case of the relaxation, the region under investi-
MEMSOS, it was based on UNIX V/ Release  of gation is a square and not a circle or something else (see
Motorola, which was adapted to the multiprocessor Fig. d). In a few cases, the dependency of the speedup
architecture of the processor board []. on the problem size is depicted (see Gauss-Seidel and
In MEMSOS, most of the users’ needs were sup- Table ).
ported by the implementation of the “application Using the Pilot pyramid, the maximum speedup
concept.” An application was defined as the set of all pro- was . Here are some of the measured speedups. The E
cesses belonging to one single user program. A process, example “text formatting” demonstrates that not all
belonging to an application, could be identified by (a) a types of tasks will run well on an EGPA system.
globally unique application id, (b) a task id which was
Linear algebra [, ]:
assigned to each process (called task) of an application
Matrix inversion: Gauss-Jordan (.), Column
and which was locally unique for this application and
Substitution (.)
processor node and (c) a globally unique node id which
Matrix multiplication (.)
identified any node in the system. This concept allowed
Gauss-Seidel iteration ().
a very simple steering and controlling of applications.
Speedup/dimension of the dense matrix: /,
MEMSOS could distinguish different applications and
./, ./, ./, ./
was able to run several of them concurrently in “space
sharing” as well as in “time sharing” style. Differential equations []:
Additionally, a very important new concept, called Relaxation (.)
“gang scheduling,” had been implemented. The purpose Image processing and graphics []:
of this was to assure the concurrent execution of tasks Topographical representation (.)
of one application on all nodes at the same time. This Illumination of topographical model (.),
was achieved by creating a runqueue for each applica- Vectorizing a grey level matrix (.),
tion on each allocated node. A global component, called distance transformation
applserv, retrieved information about the nodes and the – Each processor is working on a fixed part of data
applications started and running there. It established a (.),
ranking of applications which were to be scheduled. – Dynamic assignment of varying parts of data
(.).
Exemplary Applications Nonlinear programming []:
In order to validate the concepts of EGPA, certain appli- Search for a minima of a multi-dimensional objective
cations from different scientific areas were adapted to function (.)
the programming models of the EGPA monitor and of Graph theory
MEMSOS. In some cases, it was necessary to develop Network flow with neighborhood aid (.)
new parallel algorithms. Text formatting [] (.)
The corresponding programs were written a lot of
years ago. The listings of them are no more available. Erlangen General Purpose Array (EGPA). Table  Results
Therefore, it is not possible to give an impression of with  processors in the working plane []
the amount for writing such a program, e.g., in lines of
Poisson equation with (/), (./),
code. Only the matrix multiplication (see page ) gives Dirichlet conditions: (./)
an idea:  lines (an estimation of the not shown part is (speedup/number of cells in
included). finest grid)
The authors of such parallel programs wrote in their Steady state Stokes (./), (./),
papers mainly about the principles of the underlying equation with staggered (./)
parallel algorithms and the runtimes. In many cases, grid: (speedup/number of
cells in finest grid)
they only published the maximal speedups received by
In order to have a larger system, the DIRMU (Dis- Related Entries


tributed Reconfigurable Multiprocessor based on  Computational Sciences
with coprocessor ) was used for building an EGPA Distributed-Memory Multiprocessor
with three layers. Among other things, that multipro- Metrics
cessor was used to investigate multigrid methods which SPMD Computational Model
were established in the industry at that time. Though in Topology Aware Task Mapping
coarser grids the work per processor is very small, the
results were very good (see Table ). In this context, it is
worth mentioning that on a system, which only realized Bibliographic Notes and Further
a working plane consisting of  DIRMU-PMMs, adap- Reading
tive multigrid approaches were implemented with high Further papers dealing with the EGPA pilot are [, , ,
efficiency []. , , , ]. Publications concerning MEMSY can be
On the MEMSY Prototype, several algorithms were found on the page: www.informatik.uni-erlangen.de/
realized with great success. The measured efficiency was Projects/MEMSY/papers.html
mostly greater than %. This is true for all the above-
mentioned algorithms which were ported from pilot or Bibliography
DIRMU to MEMSY [], but also for the following real . Bode A, Fritsch F, Händler W, Henning W, Hofmann F, Volk-
world applications []: ert J () Multi-grid oriented computer architecture. ICPP,
–
● Calculation of the electronic structure of polymers . Cannon LE () A cellular computer to implement the Kalman
● Parallel molecular dynamic simulation filter algorithm. PhD thesis, Montana State University
● Parallel processing of seismic data by wave analogy . Dal Cin M, Hohl W, Dalibor S, Eckert T, Grygier A,
of the common depth point Hessenauer H, Hildebrand U, Hönig J, Hofmann F, Linster
● Mapping parallel program graphs into MEMSY by C.-U, Michel E, Pataricza A, Sieh V, Thiel T, Turowski S ()
Architecture and realization of the modular expandable multi-
the use of a hierarchical parallel algorithm. processor system MEMSY. In: Proceedings of parallel systems
processor system MEMSY. In: Proceedings of parallel systems
● Parallel application for calculating one-/two- and fair of the th international parallel processing symposium (IPPS
ecp-electron integrals for polymers ‘), Cancun, pp –
. Finnemann H, Brehm J, Michel E, Volkert J () Solution
Only for sparse matrix algorithms, efficiencies near % of the neutron diffusion equation through multigrid methods
were measured. implemented on a memory-coupled -processor system. Parallel
comput (–):–
Outlook . Finnemann H, Volkert J () Parallel multigrid algorithms
implemented on memory-coupled multiprocessors. Nucl Sci Eng
At the same time when the mentioned EGPA sys-
:–
tems were realized, in Japan a very similar system, . Fritsch G, Volkert J () Multiprocessor systems for large
named PAX [], has been built. Also in Japan, the Sys- numerical applications. Parcella –
tem HAP represented a lot of the principles of EGPA . Fritsch G, Henning W, Hessenauer H, Klar R, Linster CU,
systems []. Oelrich CW, Schlenk P, Volkert J () Distributed shared-
memory architecture MEMSY for high performance parallel
In the s, large commercial multiprocessors
computations. Comput Archit News ():–
came into the market. Distributed memory MIMDs . Fritsch G, Müller H () Paralleliation of a minimation problem
were now available which consisted of much more pro- for multiprocessor system. In: CONPAR . Lecture Notes Com-
cessors than, e.g., the MEMSY prototype. Even global puter Science, vol . Springer, Berlin/Heidelberg/New York,
shared memory systems with cache coherence were pp –
larger with respect to the number of processors. Espe- . Geus L, Henning W, Vajtersic M, Volkert J () Matrix inversion
algorithms for the a pyramidal multiprocessor system. Comput
cially the latter multiprocessors were easier to program
Artif Intell (Nr ):–
than EGPA systems with their restricted neighbor- . Gossmann M, Volkert J, Zischler H () Matrix image process-
hoods. Consequently, the research on EGPA machines ing and graphics on EGPA. EGPA-Bericht, IMMD, Universität
was stopped even in Erlangen. Erlangen-Nürnberg
. Händler W () A unified associative and von-Neumann pro- . Rusnock K, Raynoha P () Adapting the Unix operating sys-
cessor EGPP and EGPP array. In: Sagamore Computer Confer- tem to run on a tightly coupled multiprocessor system. VMEbus
ence, pp – Syst ():–
. Händler W., Klar R. () Fitting processors to the needs of a . Vajtersic M () Parallel Poisson and biharmonic solvers imple-
general purpose array (EGPA). MICRO : – mented on the EGPA multiprocessor. In: ICPP 
. Händler W, Herzog U, Fridolin Hofmann, Hans Jürgen Schnei-
der: Multiprozessoren für breite Anwendungsbereiche: Erlangen
General Purpose Array. ARCS :–
. Händler W, Bode A, Fritsch G, Henning W, Volkert J () ES
A tightly coupled and hierarchical multiprocessor architecture. E
Comput Phy Commun :– Earth Simulator
. Händler W, Fritsch G, Volkert J () Applications implemented
on the erlangen general purpose array. In: Parcella’. Akademie-
Verlag, Berlin, pp – Ethernet
. Händler W, Herzog U, Hofmann F, Schneider HJ () A general
purpose array with a broad spectrum of applications. In: Infor- Miguel Sanchez
matik Fachberichte vol . Springer, Berlin/Heidelberg/New York, Universidad Politécnica de Valencia, Valencia, Spain
pp –
. Händler W, Maehle E, Wirl K () Dirmu multiprocessor
configurations. In: Proceedings of the international conference on
parallel processing, pp –
Synonyms
. Henning W, Volkert J () Programming EGPA Systems. In: IEEE .; Interconnection network; Network archi-
Proceedings of the th International conference on distributed tecture; Thick ethernet; Thin ethernet
computing systems, Denver, pp –
. Henning W, Volkert J () Multi grid algorithms implemented Definition
on EGPA multiprocessor. ICPP, pp –
Ethernet is the name of the first commercially successful
. Hercksen U, Klar R, Kleinöder W () Hardware-
measurements of storage access conflicts in the processor
Local Area Network technology invented by Robert
array EGPA. In: ISCA ’, Proceedings of the th annual sympo- Metcalfe and David Boggs while working at Xerox’s
sium on computer architecture, ACM Press, New York, La Baule, Palo Alto Research Center (PARC) in . While the
pp –   prototype developed worked at . Mbps, the com-
. Hofmann F, Dal Cin M, Grygier A, Hessenauer H, Hildebrand mercial successor was called Ethernet and worked at
U, Linster CU, Thiel T, Turovski S () MEMSY – a modu-
 Mbps. Several faster versions have been developed
lar expandable multiprocessor system. In: Proceedings of parallel
computer architectures: theory, hardware, software, applications.
and marketed since then.
Lecturer notes computer science, vol . Springer, London, Ethernet networks are based on the idea of a shared
pp – serial bus and the use of Carrier Sense Multiple Access
. Hoshino T, Kamimura T, Kageyama T, Takenouchi K, Abe H with Collision Detection (CSMA/CD) technique for
() Highly parallel processor array “PAX” for wide scien- network peers to communicate. Ethernet media access
tific applications. In: Proceedings of the international confer-
is fully distributed. Ethernet evolution brings the use of
ence on parallel processing, IEEE Computer Society, Columbus,
pp –
switching instead of CSMA/CD and full-duplex links.
. Linster CU, Stukenbrock W, Thiel T, Turowski S () Das At the same time, coaxial cable is first replaced by
Multiprozessorsystem MEMSY. In: Wedekind (Hrsg.): verteilte twisted pair and then by fiber optics.
systeme. Wissenschaftsverlag, Mannheim, Leipzig Wien Zürich,
pp – Discussion
. Momoi S, Shamada S, Koboyashi M, Tshikawa T () Hier-
archical array processor system (HAP). In: Proceedings of the
conference on algorithms and hardware for parallel processing on
Introduction
COMPAR , pp – The world of Computing has always been a scenario of
. Rathke M () Paralleliseren ordnungserhaltender Programm- fast changes where state of the practice is challenged on
systeme für hierarchische Multiprozessorsysteme. PhD thesis. a daily basis. Most of the advances in our field have come
Arbeitsberichte des IMMD, Universität Erlangen () from this attitude of challenging the existing wisdom.
Thanks to that, we moved from the mainframe to the developed at the University of Hawaii and first deployed
mini, and from it to the Personal Computer and now to in  by Norman Abramson and others. They used
embedded systems. amateur radios to create a computer network so tele-
When we lived in a world of a few computers, types at different locations could access a time-sharing
networking seemed not to be in big demand, but as the system. It was called the ALOHA network. All termi-
number of computers in enterprise grew, a new need nals use a shared frequency to transmit to the host.
started to develop to make efficient use of companies’ Simultaneous transmissions destroy each other but if a
computing resources. It was the dawn of the Local Area transmission was successful then it was acknowledged
Network (LAN). from the host.
In the early days of computing, most of the inter- A network technology was needed to link together
connection focused on adding dumb terminals to the Alto computers with printers and other devices. No
existing mainframe computers. For that purpose, dif- existing technology was found suitable so a team at
ferent variations of serial connections sufficed. How- PARC was set to develop a new technology. That was
ever, as first minicomputers and then personal com- the embryo of Ethernet.
puters started to spread on the enterprise there was a
need of another type of interconnection. In this new Carrier Sense Multiple Access with
environment, the computing power started to move Collision Detection
from a central location to be distributed to each desk. The basic principle of the ALOHA network that allowed
Together with that, the need of keeping systems inter- a node to transmit at any time was simple but not very
connected became apparent, especially at those loca- efficient. Harvard University student Robert Metcalfe
tions that have a significant number of mini or personal sought to improve the basic idea, and so he did his PhD
computers. on a better system and joined PARC. The main improve-
Xerox Corporation played an important role in ment was that network nodes should check for an ongo-
the development of both the personal computer and ing transmission before starting their own. The way to
LANs. More specifically, Xerox founded Palo Alto do this was to check whether a carrier signal was present
Research Center (PARC) in  to conduct indepen- or not. This technique is also called Listen Before Talk
dent research. Many current technologies can be traced (LBT). This gave way to a technique called Carrier Sense
back to PARC, from the personal computer to the Multiple Access (CSMA).
Graphical User Interface to Ethernet LANs. Xerox Alto CSMA could be improved further by adding another
was the first personal computer. It was developed in mechanism called Collision Detection (CD). It con-
 (Fig. ). sisted of stopping an ongoing transmission as soon as
The advent of Ethernet as a successful wired net- a collision was detected. A collision happens when two
working technology is based on a wireless precursor or more network nodes transmit at the same time. The
purpose of this second improvement was to reduce the
duration of collisions. This newer version of the access
method was called Carrier Sense Multiple Access with
Collision Detection or CSMA/CD and it was the base
for the Ethernet network [].
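
This access method can be summarized in a short C sketch. The functions carrier_sensed(), start_transmission(), collision_detected(), send_jam(), and backoff_delay() are hypothetical stand-ins for the physical-layer primitives, and the limit of 16 attempts is an assumption rather than a value stated in this entry; the retransmission delay follows the binary exponential backoff described in the next paragraph.

    #include <stdbool.h>

    /* Hypothetical physical-layer primitives (assumed, not a real API). */
    bool carrier_sensed(void);
    void start_transmission(const unsigned char *frame, int len);
    bool collision_detected(void);
    void send_jam(void);
    void backoff_delay(int attempt);   /* random slots, binary exponential backoff */

    /* One CSMA/CD transmission: listen before talk, abort on collision,
     * back off, and retry a bounded number of times. */
    int csma_cd_send(const unsigned char *frame, int len)
    {
        int attempt;
        for (attempt = 0; attempt < 16; attempt++) {
            while (carrier_sensed())
                ;                           /* 1. wait until the medium is idle  */
            start_transmission(frame, len); /* 2. transmit                       */
            if (!collision_detected())
                return 0;                   /* 3. success                        */
            send_jam();                     /* 4. make the collision unambiguous */
            backoff_delay(attempt);         /* 5. wait a random number of slots  */
        }
        return -1;                          /* give up after too many attempts   */
    }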
A third improvement over ALOHA was what to
do after an unsuccessful transmission (collision). While
ALOHA called for a random (but short) timer, Met-
calfe’s proposal was to set the retransmit timer as a
random slot number chosen from a special set. Eth-
ernet’s Binary Exponential Backoff algorithm calls for
random numbers to be selected from a range that starts
Ethernet. Fig.  Original Ethernet drawing by Robert as {, } and doubles after each collision. The maximum
Metcalfe range is { . . . }. This algorithm ensures that the
probability of a new collision after a collision is halved. The availability of broadband coaxial cable on some
While CSMA/CD works on continuous time, Binary premises for cable TV distribution enabled another
Exponential Backoff works on slotted time, each slot variation for Ethernet, but now using broadband signals
being the time required to transmit  bits. and these call for a change as transmission is here unidi-
Robert Metcalfe was helped by Stanford student rectional. This broadband version of Ethernet was never
Dave Boggs and they created a working system by  very successful but it allowed a maximum distance of
networking together several Alto computers. , m.
In a quest for cheaper alternatives, coaxial cable
was replaced by twisted pair and the bus topology gave E
Ethernet Topologies way to a star topology. While the first attempt at this
Original Ethernet design used a bus topology with a only worked at a fraction of the coaxial data rate, the
coaxial cable as the shared media for all the trans- same speed was achieved with a later version. The main
missions. Data transmission used a baseband signal advantage of using twisted pair was its wide availability
using Manchester encoding. A bus was found as a suit- in office setups as it was used for voice communications.
able way to broadcast information to all the network Twisted pair was also cheaper and easier to install and to
nodes and no switching was needed. The network bus handle. But the use of star topology called for a central
was terminated with a resistive load on each end to multi-port repeater or hub. Hub device was, in essence,
reduce signal reflection. Maximum segment length was a bus-in-a-box. However, the star topology meant that if
up to  m. Repeaters could be use to connect two a cable was severed only one network node was affected.
or more segments together. No more than four con- In a bus topology, if the coaxial cable is severed the
secutive repeaters were allowed. Maximum network whole network stops working. The star topology proved
diameter was , m (Fig. ). to be much more robust than the bus counterpart but
Coaxial cable, specially the thick cable, was diffi- each cable run was limited to  m giving a maximum
cult to use in an office setup due to its size and it was distance of  m for each segment.
expensive. This topology had the main drawback of a Support for fiber optic media came from the need
bus being a single point of failure. for both longer distance links and electrically isolated
In order to reduce the cost of the network some links. The first use of fiber optic media in Ethernet net-
revisions of Ethernet called for a thinner cable. This works was for the interconnection of repeaters. It later
so-called Thin Ethernet was still based on a bus topol- developed as an alternative media to coaxial or twisted
ogy but using a thinner cable would make it not only pair. While more expensive than copper, fiber optic
cheaper but also easier to install cable runs. Unfor- media allowed longer distances of up to , m. Ether-
tunately, the new setup with thinner cable was not net fiber optic links were point to point and that meant
without its own troubles as the attachment of each a star topology. A passive-star over fiber was proposed
station was made using so-called T-connectors that but it was never implemented.
required the coaxial cable to be cut at each joint and
required two connectors to be added. Both the number
of connectors and the physical characteristic of the cable The First Ethernet, and the First
(RG-) limited the length of Thin Ethernet segments Commercial Ethernet Success
to  m. The first version of Ethernet was deployed in PARC in
 and it ran at . Mbps and used -bit values for
source and destination addresses. The first  bits of
the data field were reserved for a type identifier, so dif-
1 1 0 0 1 1 0 0
ferent protocols could be used over the same network
each one using a different value for the type identifier.
Data frames were protected against errors by means of
Ethernet. Fig.  Manchester encoding as used on a Cyclic Redundancy Check of  bits placed at the end
Ethernet of the frame.
The original name for this network was the Alto IEEE . also developed several choices for
Aloha Network, as it was developed for interconnecting physical media and used a naming convention that
Alto computers together and with laser printers. One of prepended the data rate to a word describing the mod-
the original prototype boards can be found now at the ulation scheme (baseband or broadband) and next the
Smithsonian. physical media used. BASE described a  Mbps
However, it was soon recognized that such a name version of Ethernet using Manchester encoding over
suggested it would only work with Alto computers. RG-X coaxial cable. BASE-T described a similar net-
The Ethernet name was a vendor-neutral name that work but over twisted-pair cable.
was based on an old concept of wireless transmissions. Eventually twisted pair became the cabling of
During nineteenth century, it was believed that wireless choice, mostly using two of the four pairs of Category
signals propagate using an invisible media called ether,  and Category  cable. While twisted-pair setups
although physicists Michelson and Morley rejected that required the use of expensive active hubs, the savings
misconception in . for using cheaper cable and the improved reliability of
the network compensated for the additional costs.
DEC, Intel and Xerox
In , Xerox, DEC and Intel joined to develop a CSMA/CD Ethernet Modeling and Problems
common specification for Ethernet products: The first One of the early criticisms of Ethernet was the fact
commercially successful Ethernet network was born. that media access is probabilistic (nondeterministic).
The specification was published as the “blue book” []. That means transmission latency cannot be known
The new Ethernet specification, also known as DIX beforehand. The deterministic behavior of other media
(acronym based on the three companies names), called access algorithms, like IEEE . (Token Ring) was
for  Mbps data rate, -bit addresses, -bit type considered superior, but average transmission latency
field, and -bit Cyclic Redundancy Check (CRC) field is much lower on Ethernet than on a token-passing
(Fig. ). The data field was of variable size of up to network (except under very heavy network loads).
, bytes. It was a new and improved version of the Ethernet (and CSMA/CD by extension) was found
first Ethernet and now it was set to become a viable to exhibit what was called “capture effect,” that happens
commercial product. At this same time, the market when two stations contending for the medium have a
for personal computers blossomed with many differ- collision and both have more frames to transmit. The
ent models. While this first commercially available ver- winner will transmit successfully and it will keep an
sion of Ethernet sold well, there were some concerns edge on future transmissions against the station that
about the potential lock-in by the three companies. lost the transmission slot after the collision. This effect
The Institute of Electrical and Electronic Engineers causes a transient channel-access unfairness that in turn
(IEEE) came into play. Ethernet would be standard- can interfere with higher layer protocols, like TCP, that
ized within the networking group , who met for the can amplify the effect.
first time on February . Ethernet became the IEEE A second cause of concern was the behavior of heavy
. standard in  []. The availability of the . loaded Ethernet networks [, ]. As traffic grows beyond
standard eased the doubts about the maturity level of a certain limit, the chances of a transmission attempt
Ethernet networks. At the same time, many different to end up in a collision grow too. Having an increased
companies started selling compatible products many chance of a collision when traffic load is high introduces
of them intended for the — by then — new IBM PC a positive feedback that can bring the network to a
computer. congestion state. But as Ethernet data rate goes up, so

Ethernet. Fig.  Ethernet frame format
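
A hedged C rendering of the frame layout described in this section is shown below. The field widths follow standard Ethernet (48-bit addresses, a 16-bit type field, up to 1,500 data bytes, and a 32-bit CRC); the struct is for illustration only, since a real implementation must serialize the fields in network byte order and cannot rely on compiler-dependent structure packing.

    #include <stdint.h>

    /* DIX Ethernet frame as described in the text (illustrative layout only). */
    struct ethernet_frame {
        uint8_t  destination[6];  /* 48-bit destination address              */
        uint8_t  source[6];       /* 48-bit source address                   */
        uint16_t type;            /* 16-bit type identifier (protocol)       */
        uint8_t  data[1500];      /* variable-length payload, up to 1,500 B  */
        uint32_t crc;             /* 32-bit Cyclic Redundancy Check          */
    };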


does the overload limit. Again, the use of full-duplex From the logical point of view, both exhibit the same
Ethernet avoids this problem. behavior working at the physical layer. A repeater
retransmits a digital signal so it can reach a longer dis-
Next Stop:  Mbps Ethernet tance. A repeater does not create a separation of the
Once the  Mbps Ethernet IEEE standard was approved, collision domain.
the  group was tasked to develop a  Mbps ver- A bridge is an interconnection device that works
sion. This faster version was known as Fast Ethernet. at the Link Layer. Bridges connect two or more cable
Different standards allowed this faster speed over differ- runs called network segments. Bridges understand
ent types of physical media using both copper and fiber MAC addresses and create several independent colli- E
optic media. Twisted-pair cable is not very well suited sion domains. Only traffic intended for other network
for high speed communication so different modulation segment is actually forwarded by a bridge. The aggre-
schemes were used over different types of twisted-pair gate bandwidth of a network interconnected by a bridge
cables to give way to several physical layer specifi- is higher than if the interconnection uses a repeater, but
cations: BASE-TX use BB MLT- coded signals so is latency.
over Category  twisted-pair cable while BASE-T The term “bridge” was usually associated with the
uses BT OAM- coded signals over four twisted interconnection of large and usually coax-based seg-
pairs of Category . Similarly, BASE-FX is Fast Eth- ments while the term “switch” referred mostly to inter-
ernet media access for multimode fiber. Depending connection devices using point-to-point links, where
on whether half-duplex or full-duplex links are used, each port would have a workstation or server. From
 m or  km is the maximum link distance respec- the logical point of view, bridges and switches are
tively for fiber media. similar.
Little by little, bridges were replaced by new devices
Switched Ethernet called switches that offered similar functionality but
Ethernet evolved not only becoming faster but also many ports per unit. Switches started to replace network
adding new components to the original architecture. hubs. There was a performance increase but the cost was
A shared bus proved to be not enough for large net- high too.
works. The maximum distance constraints together The different topologies of Ethernet had to be loop-
with traffic considerations opened the way to bridges. free. But given that bridges or switches were used to
One way to increase the capacity of a network seg- interconnect network segments there was a risk of creat-
ment was to split it into two smaller segments, each one ing network loops. On the other hand, having network
now having fewer computers competing for the avail- loops is a way to introduce redundancy to the topology
able capacity. But these two segments need now to be of a network, a feature that could be very useful if it did
interconnected so distributed applications can still work not interfere with Ethernet media access. The problem
as before. was that a new component was needed for Ethernet not
The original bus topology gave way to a star topol- to have problems in the presence of loops in the network
ogy using twisted-pair cable. Hubs provided the inter- topology or frames would be forwarded forever within
connection between multiple stations but there was a the loops.
problem: All the stations had to work at the same data The brilliant solution to the problem was the
rate on the same collision domain. Bridges could be Spanning-Tree protocol [], invented by Radia Perlman
used to split one collision domain into several inter- while working at Digital Equipment Corporation and
connected network segments where only intersegment ratified as the IEEE .d standard. This protocol run-
traffic was shared. ning on switches and bridges creates a spanning tree
Therefore, both hubs and repeaters enabled the out of the existing network topology. Frames are only
interconnection of two or more network segments. forwarded on network segments that belong to the
The term “hub” is used only for the interconnection of spanning tree and, being a tree a loop-free topology,
point-to-point links while repeaters could be used for the problem of forwarding loops is solved. Whenever
interconnecting two coaxial-cable Ethernet segments. one of the links in use stops working, a new spanning
tree is calculated making use of the available redundant able to accept data sooner than expected, a new pause
links. Bridges and switches exchange control frames frame with zero wait-time can be transmitted.
periodically to maintain the spanning tree.
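
The heart of the protocol is the comparison of the configuration messages the bridges exchange, which can be sketched in C. The simplified priority vector below (root identifier, cost to the root, sending bridge identifier) follows the usual textbook presentation of the algorithm, not the full IEEE .d message encoding, so it should be read as an illustration only.

    #include <stdint.h>
    #include <stdbool.h>

    /* Simplified spanning-tree priority vector carried in configuration
     * messages (BPDUs): lower values win at every position. */
    struct stp_vector {
        uint64_t root_id;     /* who the sender believes the root bridge is */
        uint32_t root_cost;   /* the sender's cost of reaching that root    */
        uint64_t bridge_id;   /* identifier of the sending bridge           */
    };

    /* Return true if message a is better than message b. A bridge keeps the
     * best message seen on each port and blocks every port that ends up being
     * neither its root port nor a designated port, which yields the loop-free
     * tree described above. */
    bool stp_better(const struct stp_vector *a, const struct stp_vector *b)
    {
        if (a->root_id   != b->root_id)   return a->root_id   < b->root_id;
        if (a->root_cost != b->root_cost) return a->root_cost < b->root_cost;
        return a->bridge_id < b->bridge_id;
    }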
A switch, like a bridge, works on Link Layer Ethernet Gets Faster
and creates an independent collision domain on each In May , the Gigabit Ethernet Alliance was founded
port. Switches are store-and-forward devices that can by eleven companies interested in the development of
easily handle different data rates on different ports. Gigabit Ethernet (i.e., , Mbps Ethernet). The orig-
Switches can handle simultaneous frame forwarding inal aim of this new Ethernet version, faster than Fast
on different ports, an event that would have caused Ethernet, was to make server and backbone connec-
a collision should a hub were used. Different pro- tions even faster. Gigabit Ethernet was covered in IEEE
prietary flow-control mechanisms were developed for .z and .ab standards and it was approved in
half-duplex switches based on back pressure congestion . Gigabit Ethernet can use either multi-mode fiber,
management: A switch may transmit a carrier signal on single-mode fiber, or twisted-pair copper cable.
those ports it wants to alleviate congestion. Gigabit Ethernet over cooper (known as
Ethernet switches may use different approaches to BASE-T) poses a special challenge for twisted-pair
switching. The simpler way was to use the store-and- connections that could use Category  cable only if it
forward approach where a frame is stored in a buffer meets some restrictions. This is the reason for Category
to be retransmitted once it is received entirely by the E or Category  copper cable. Different techniques
switch. Cut-through switches only store the very first were used for making possible to get gigabit speed over
bytes of the frame and once the destination address is copper (e.g., QAM-modulation to more efficiently uti-
known the forwarding of the frame can start. This sec- lize the channel bandwidth).
ond approach leads to a significant reduction of the Many changes were needed beyond speed. Half-
switching latency experienced by frames. duplex operation was still allowed and so a technique
An Ethernet network that uses switches instead of called carrier extension was added to allow Ethernet
hubs enables a new improvement: full-duplex links. to keep CSMA/CD and the ordinary  m maximum
Provided that the physical media has two different network distance. An extension field is appended as
circuits for simultaneous transmission and reception needed to ensure a minimum frame length of 
(e.g., two different copper pairs on BASE-TX), it bytes. Alternatively, another optional improvement was
is possible for a station to transmit and receive at the frame bursting: It allows the transmission of consecutive
same time. While full-duplex Ethernet does work at data frames after one MAC arbitration on half-duplex
the same data rate as half-duplex, higher throughput channels.
can be achieved with full-duplex. Besides, collisions In June ,  Gigabit Ethernet was ratified. With
do not occur in a full-duplex point-to-point link. A this new version of Ethernet the performance level is
collision-free link is going to be more effective as the reaching or surpassing other Metropolitan Area Net-
time devoted to collisions can now be put to use to work (MAN) and Wide Area Network (WAN) tech-
transmit more data. However, network adapters still nologies such as OC- and OC-, SONET or SDH
need to keep the CSMA/CD functionality for backward technologies. This is the first version of Ethernet only
compatibility. supporting fiber optic media (GBASE-CX barely
Together with full-duplex operation came the .x supports  m of copper cable, just for attaching a switch
flow-control mechanism. Sometimes, a receiver on a to a router). A  Gbps link range is between  m and
full-duplex link may not be able to handle frames fast  km. Both multi-mode and single-mode fibers are
enough. Ethernet flow-control is based on the receiver supported.
transmitting a special pause frame to a multicast MAC
address. Such a frame reports the amount of time (using Ethernet and Beyond
 bit-time units) the sender on that full-duplex link New versions of Ethernet are coming as  Gbps Ether-
must wait before the next transmission. If a receiver is net and  Gbps Ethernet. The IEEE .ba standard
has been ratified in June . A bit unusual was the fact Bibliography
that an interim Gbps version was developed largely as . Metcalfe RM, Boggs DR () Ethernet: distributed packet switch-
a step towards Gbps Ethernet. The .ba calls for ing for local computer networks. Commun ACM ():–
full-duplex fiber optic links. . The Ethernet () A local area network: data link layer and
physical layer specifications (version .). Digital Equipment
Data Center Ethernet (DCE) extends Ethernet
Corporation, Intel, Xerox
architecture to improve networking and management . ANSI/IEEE Standard .– Carrier sense multiple access
in data centers. Several features like Priority-Based with collision detection. IEEE, October 
Flow-Control, Enhanced Transmission Selection (stan- . Boggs DR, Mogul JC, Kent CA () Measured capacity of an
dardized as .Qaz), Multipathing or Congestion Ethernet: myths and reality. In: Proceedings of ACM sigcomm ’ E
symposium of communications architecture and protocols, Stan-
Management (.Qau) focus on improving network
ford, CA, (Also as Computer Commun Rev (), August )
features and manageability. The vision of DEC is a uni- . Takagi H, Kleinrock L () Throughput analysis for persistent
fied fabric where LAN, storage area network (SAN), and CSMA systems. IEEE Trans Commun COM-():–
inter-process communication traffic (IPC) converge. . Perlman R () Interconnections: bridges, routers, switches, and
Technologies like Fiber Channel over Ethernet internetworking protocols (a ed. edición). Addison-Wesley, Read-
ing, MA
(FCoE) enable data centers to consolidate interconnec-
. Spurgeon CE () Ethernet—the definitive guide. O’Reilly &
tion and internetworking technologies into a single net- Associates, Inc, Sebastopol
work, which may simplify connectivity and reduce data
center power consumption.
Whether this is the last stop of Ethernet’s network
technology or not, it is unclear. However, if past is an
indication of what may be ahead, an even faster version Event Stream Processing
of Ethernet should be expected in a few years time.
Stream Programming Languages

Related Entries
Buses and Crossbars
Clusters
Eventual Values
Distributed-Memory Multiprocessor
Network Interfaces Futures
Network of Workstations

Bibliographic Notes and Further Exact Dependence


Reading
Access to many standards has been fee-based. IEEE  Dependence Abstractions
standards started a move a few years ago by opening
their documents to the public, for free. The documents
of published standards are available for free down-
load  months after they are published. That means Exaflop Computing
all details about any approved Ethernet standard ver-
sion is easily available visiting http://standards.ieee.org/ Exascale Computing
getieee/..html
Unfortunately, standards are usually written so all
details are covered but many times they are not reader-
friendly. For getting a good understanding of Ethernet Exaop Computing
technology, the Charles Spurgeon’s [] guide is a very
appropriate text. Exascale Computing
A list of the Top  supercomputers in the world


Exascale Computing has been maintained since June  and is a useful indi-
cator of trends in the capabilities of the largest scientific
Ravi Nair machines. The list shows that there has indeed been a
IBM Thomas J. Watson Research Center, Yorktown trend of doubling of Linpack performance every year,
Heights, NY, USA
but that the manufacturer of the highest-performing
system and the architecture of the system have been
Synonyms changing over time. While shared-memory computers
Exaflop computing; Exaop computing and even uniprocessors made the list in earlier days,
today the list is dominated by cluster supercomputers
Definition and massively parallel processors, the latter being dis-
Exascale computing refers to computing with systems tinguished from the former by their highly customized
that deliver performance in the range of  (exa) oper- interconnection between nodes.
ations per second. With growing interest in the use of large-scale
machines for applications that are considered
Discussion commercial (or, more appropriately, nonscientific), the
term exascale is also being used to refer to systems
Introduction which can carry out  operations per second (
In computing literature, it is customary to measure exaops). There continues to be ambiguity in what con-
progress in factors of . The term exa, standing stitutes an operation in this terminology. An operation
for  or  , is derived from the Greek word `´ε ξ, is variously defined as an instruction, an arithmetic or
meaning six. Thus exascale computing refers to sys- logical operation, an operation intrinsic to the execu-
tems that execute between  and  opera- tion of an algorithm, and so on. This has led to one
tions per second. This performance level represents a abstract definition of an exascale system to be one that
-fold increase in capability over petascale comput- has a performance capability roughly  times more
ing, which itself represents a -fold increase over than a system solving the same problem in .
terascale computing. In any case, the needs of the two worlds, scien-
There is often confusion over how to denote the tific and commercial, appear to be converging especially
performance of a system. The most common measure from the points of view of compute capability, system
is the number of double-precision floating-point oper- footprint, energy requirements, and cost. It remains to
ations that are completed every second (flops) when be seen whether, in comparison to systems of ,
the system is running the Linpack dense linear alge- the first exascale system will execute only Linpack 
bra benchmark. By this measure, an exascale system times faster, or whether it will solve a variety of prob-
is one which can execute Linpack at the rate of  lems, both scientific and commercial, faster by a factor
exaflops or higher. For a given system, this is usually of .
less than the peak double-precision floating-point exe-
cution rate, typically by –%, because of practical
The Importance of Exascale
limitations in exploiting the full floating-point capabil-
Large supercomputers are already being used to solve
ity in a system. In many systems that implement only
important problems in a variety of areas. The Interna-
single-precision floating point in hardware, the peak
tional Exascale Software Project (IESP) has categorized
single-precision floating-point capability may be more
the present areas of application of supercomputers as
than double the flops as measured for Linpack. Petas-
follows:
cale computing was ushered in with the introduction
of the IBM Roadrunner system in May , when it ● Material Science
ran Linpack at . petaflops. If trends continue as in the ● Energy Science
past, it is reasonable to expect an exascale system to be ● Chemistry
operational around . ● Earth Systems
● Astrophysics and Astronomy of physical systems in the presence of both systematic


● Biology/Life Systems and random variations.
● Health Sciences The list of areas that could benefit from the use of
● Nuclear and High Energy Physics supercomputers has been increasing steadily. Applica-
● Fluid Dynamics tions for supercomputing are also emerging in areas
that were not traditionally considered scientific areas.
Examples of these are visualization, finance, and opti-
Each of these areas has specific problems and a set of mization. Supercomputers have been used in the past
algorithms customized to solve those problems. How- by large movie production companies to render ani- E
ever, there is an overlap in the use of algorithms across mated movies frame-by-frame over long periods of
these domains, because of similarity in the basic nature time. However, for typical scientific applications, the
of some of the problems across domains. Many of these results produced by supercomputers were often pack-
problems could benefit from even larger supercomput- aged and shipped to a significantly smaller computer
ers, but the expectation is that exascale systems will also which the scientist used to analyze and visualize the
bring a qualitative change to problem solving, allow- results. The point where it will be too expensive to ship
ing them to be used in new types of problems and in the vast amount of data produced by supercomputers
new areas. is fast approaching. Increasingly, supercomputers will
There are two ways in which larger supercomputers be used to process the results of their computation and
can help in the solution of existing problems. The first produce visualization data to be shipped over to the
is called strong scaling, where the problem remains the scientist.
same, but the solution takes less time. Thus, a weather The area of business analytics, traditionally consid-
calculation that forecasts the weather a week ahead is ered a commercial application, is fast embracing the
more useful if it takes  h to solve rather than  days. The use of massive computers. The potential use of mas-
other form of scaling is called weak scaling, where the sive computation in such areas suggests that exascale
size of the problem that can be solved is vastly increased computers are likely to represent the point at which
with the help of exascale computers with its significantly supercomputers shift from being the largely simulation-
larger resources. Thus the quality of a weather forecast oriented scientific machines that they have traditionally
can be improved if it is based on global data rather been to more versatile machines solving fundamen-
than local data, or if the forecast is made based on data tally different types of problems for a larger section of
from a larger number of more closely located sensors, humanity.
or if the model is increased in complexity, for exam-
ple, through the use of more types of meteorological The Challenges of Exascale
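These two regimes are commonly formalized by Amdahl's law (strong scaling)
and Gustafson's law (weak scaling). The short C sketch below is not part of
the original entry; the 1% serial fraction and the processor counts are
illustrative assumptions only.

#include <stdio.h>

/* Strong scaling (Amdahl): fixed problem, serial fraction s.
   Speedup(p) = 1 / (s + (1 - s) / p)                          */
static double amdahl_speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

/* Weak scaling (Gustafson): problem size grows with p, runtime stays fixed.
   Scaled speedup(p) = s + (1 - s) * p                                      */
static double gustafson_speedup(double s, int p) {
    return s + (1.0 - s) * p;
}

int main(void) {
    double s = 0.01;                      /* assumed 1% serial fraction */
    int procs[] = { 1000, 1000000, 100000000 };
    for (int i = 0; i < 3; i++)
        printf("p=%9d  strong: %8.1f   weak: %14.1f\n",
               procs[i], amdahl_speedup(s, procs[i]),
               gustafson_speedup(s, procs[i]));
    return 0;
}

Even a 1% serial fraction caps the strong-scaling speedup near 100, while the
weak-scaling view keeps growing with the processor count, which is one reason
exascale applications are expected to rely heavily on weak scaling.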
Beyond just scaling of existing applications, large-scale computing enables
new types of computation. It becomes possible to model in a single
application a physical phenomenon at different scales in time, space, or
complexity simultaneously, for example, at the atomic, molecular, granular,
component, or structural scale, or at the molecular, cellular, tissue, organ,
or organism scale. This capability is already becoming evident at the
petascale level and will be further exploited at the exascale level. Another
new and interesting class of applications that will be further exploited in
exascale computing is the quantification of uncertainty in mathematical
models. These applications involve the use of massive parallel computing to
understand the behavior of physical systems in the presence of both
systematic and random variations.

The list of areas that could benefit from the use of supercomputers has been
increasing steadily. Applications for supercomputing are also emerging in
areas that were not traditionally considered scientific areas. Examples of
these are visualization, finance, and optimization. Supercomputers have been
used in the past by large movie production companies to render animated
movies frame-by-frame over long periods of time. However, for typical
scientific applications, the results produced by supercomputers were often
packaged and shipped to a significantly smaller computer which the scientist
used to analyze and visualize the results. The point where it will be too
expensive to ship the vast amount of data produced by supercomputers is fast
approaching. Increasingly, supercomputers will be used to process the results
of their computation and produce visualization data to be shipped over to the
scientist.

The area of business analytics, traditionally considered a commercial
application, is fast embracing the use of massive computers. The potential
use of massive computation in such areas suggests that exascale computers are
likely to represent the point at which supercomputers shift from being the
largely simulation-oriented scientific machines that they have traditionally
been to more versatile machines solving fundamentally different types of
problems for a larger section of humanity.

The Challenges of Exascale
The principal engine for the advances made by supercomputers over the past
decades has been Dennard scaling through which the electronics industry has
enjoyed the perfect combination of increased density, better performance, and
reduced energy per operation with successive improvements in silicon
technology. When all dimensions of an electronic circuit on a chip, for
example, the transistor gate width, the diffusion depth, and the oxide
thickness, are reduced by a factor α, and when the voltage at the circuit is
also reduced by a factor α, it is possible to increase the density of packing
of devices on the chip by a factor of α² and increase the frequency of
operation of the device by a factor of α, with a reduction in energy per
operation by a factor of α³. With process technologists providing lithography
improvements that
allowed a doubling of transistor density roughly every  years, with
microarchitects discovering new ways to expose parallelism at the instruction
level, and with system architects engineering sophisticated techniques to
combine multiple processors into ever larger parallel systems, it was not
difficult to maintain a steady doubling of supercomputer system performance
every year at almost constant cost.
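The classical Dennard scaling relations summarized above can be written
compactly as follows; the per-generation factor α ≈ 1.4 in the closing
comment is an illustrative assumption, not a figure taken from the entry.

% Classical Dennard scaling under a linear shrink factor \alpha > 1
% (all dimensions and the supply voltage reduced by 1/\alpha):
\begin{align*}
  \text{device density}       &\propto \alpha^{2}\\
  \text{switching frequency}  &\propto \alpha\\
  \text{energy per operation} &\propto 1/\alpha^{3}\\
  \text{power density}        &\propto \alpha^{2}\cdot\alpha\cdot\frac{1}{\alpha^{3}} = \text{const.}
\end{align*}
% For \alpha \approx 1.4 (one lithography generation): roughly 2x density,
% 1.4x frequency, and 2.8x lower energy per operation at constant power density.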
Going forward from the current  nm technology node, it will become
increasingly difficult to keep up with Dennard scaling. With oxide thickness
down to a few atoms, voltages cannot be scaled proportionally without serious
compromise to the reliability and predictability of the operation of devices.
It is no longer reasonable to expect the same frequency improvement for
constant power density that was enjoyed in the past. At the same time, the
nature of typical application programs makes the detection and exploitation
of additional parallelism within a single thread of execution difficult.
Thus, scaling the total number of threads in a system remains the main option
to increase the performance of future large-scale systems. With current
petascale systems already sporting hundreds of thousands of cores, the
expected number of cores in an exascale system is likely to be in hundreds of
millions. Due to various restrictions that will be enumerated below, most of
this increase will have to come from using whatever lithography advances are
possible to increase the density of cores on a chip.

The task of achieving the goal of building an exascale system by  is
daunting. There are challenges in implementing the hardware, in designing the
software environment, in programming the system, and in keeping the machine
running reliably for long periods at a stretch. Listed below are brief
descriptions of some of these challenges.

Power
A quick glance at the Top500 list shows that the power ratings for the top
two petascale systems are, respectively,  MW/Petaflops for the Cray system
and . MW/Petaflops for the IBM Roadrunner system. At this rate, an exascale
system could consume as much as – MW of power, a quantity that is larger
than the amount of power consumed by a medium-sized US city. The cost of
acquiring the largest supercomputer has remained largely constant over the
years at around $– million. This is the cost of about – MW-years of energy.
Thus it is unlikely that exascale installations will be willing to provide
more than about – MW of power, a third of which can be expected to be lost
in cooling and power distribution. There is thus a gap of x–x between the
power efficiency of today's machines and the desired efficiency of exascale
machines. Traditional CMOS scaling would have bridged that gap in about four
technology generations, but with the slowing down of CMOS scaling and with
longer intervals between introductions of technology generations, it is
imperative to look for other avenues to achieve the desired power efficiency.

This gap must be bridged for all components of the system – the processor,
the memory system, and the interconnection network. At the processor level,
it is likely that exascale systems will include domain-specific accelerators
because it is easier to build hardware that is power efficient in a specific
domain rather than across a general range of applications. Vector units,
which perform the same computation on multiple data simultaneously, could
play an even bigger role in exascale systems precisely for this reason.
Unlike what is seen in desktop and server processors today, processor
designers will be forced to take a minimalist view in every part of the
processor, paring overhead down to a minimum without significant impact on
performance. Power-efficient optics technology will need to be deployed in
interconnection networks if the total system bandwidth has to be scaled
proportionally from what exists today. Memory system designs, both at the
main memory level and at the cache level, will have to be reinvented with the
above stringent power-efficiency requirements in mind.

Memory
One of the trends seen in the top supercomputers is that the fraction of the
total cost of the system and the fraction of the total system power going
into memory has been steadily rising, even though the ratio of memory
capacity to the computational performance (bytes/flops) has been decreasing.
In the Top500 machines, this ratio has decreased from  byte/flop to as low as
. byte/flop in some cases. The result of this is that there is a greater
pressure on applications
to adapt to the reduced amount of memory by carefully managing the locality
of applications, and a pressure on processor designers to provide
ever-growing cache capacities and deeper cache hierarchies, both of which add
to the challenge of streamlining the processor design to achieve a low
power-performance ratio for exascale systems.

Another problem is the bandwidth at the memory interface. DRAM designs have
been catering mainly to the commodity market and hence, while their
capacities have been keeping up with Moore's Law, their bandwidths have not
kept up with increases in processor performance. Cost considerations have
prevented supercomputer vendors from designing custom DRAM solutions.
However, the rampant data explosion in all types of information processing is
forcing even commodity memory vendors to rethink the designs of memory
interfaces and modules. It is conceivable that such new designs along with
new storage-class memory technologies and new packaging technologies will do
a better job in meeting the needs of exascale systems.

Communication
An important aspect of all large-scale scientific applications is the amount
of time spent communicating between computation nodes. Applications impose a
certain communication pattern which often reflects the locality of
interactions of objects in the physical world. Thus the topologies of
interconnection networks in large systems tend to be optimized for such
communication patterns. Increasingly, however, the nature of algorithms and
the types of problems being solved are imposing greater randomness in the
pattern of communications and hence greater pressure on the available
bandwidth.

One measure used to characterize the behavior of systems for nonlocal
communication patterns is the bisection bandwidth. This is the least
bandwidth available between any two partitions of the system each having half
the total number of computing nodes. Petascale systems for the High
Productivity Computing Systems (HPCS) program are required to have a
bisection bandwidth of at least . PB/s. But few organizations will either
need or will be able to afford such a high bandwidth per flops ratio.
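For a regular topology the bisection bandwidth follows directly from the
wiring. The sketch below is not from the entry; it uses the standard result
that a minimal bisection of a k-ary n-cube torus (even radix k) cuts
2·k^(n-1) channels, and the radix, dimension, and per-link rate in main()
are illustrative assumptions.

#include <stdio.h>
#include <math.h>

/* Bisection bandwidth of a k-ary n-cube (torus) with even radix k:
   a minimal bisection cuts 2 * k^(n-1) channels.                   */
static double torus_bisection_bw(int k, int n, double link_bw_gbs) {
    return 2.0 * pow((double)k, n - 1) * link_bw_gbs;
}

int main(void) {
    /* illustrative: 32-ary 3-cube (32,768 nodes), 100 Gb/s per link */
    double bw = torus_bisection_bw(32, 3, 100.0);
    printf("bisection bandwidth: %.1f Tb/s\n", bw / 1000.0);
    return 0;
}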
At the exascale level, costs will force the bandwidth per flops ratio to be
even lower. The ExaScale Computing Study sponsored by DARPA estimated the
bisection bandwidth requirements at the exascale level to be in a range
between  PB/s and  EB/s. Such a high bandwidth will force a predominant use
of optics technology across the system, not just between racks, but also at
the board level and perhaps even at the chip level. Exascale computing would
benefit from lower costs through commoditization of optics components and
from further advances in optics technology, like silicon photonics.

Packaging
As mentioned earlier, the power goal of – MW for an exascale system will
impose challenging requirements in all aspects of the system. One such aspect
is packaging. The number of racks in an exascale system cannot be allowed to
increase dramatically from today because of limitations in the size and cost
of installations. Thus each rack will have to accommodate more boards and
each board will have to accommodate more chips than at present. Increased
packaging density will increase power density at various levels of the system
and will make testing, debugging, and servicing of the system more
challenging.

Two technologies that are maturing and that could help in this respect are 3D
stacking of silicon dies and the use of silicon carriers. 3D stacking not
only helps in providing a compact footprint but also reduces the latency of
communication between circuits on different layers. With judicious use of
through-silicon vias (TSVs), 3D technology could also provide greater
bandwidth between sections of the processor design without having to resort
to large, expensive dies.

In comparison to the  mm pitch of printed-circuit board technology, silicon
carrier technology allows mounting of dies closer than  μm, wiring pitch of
– μm, and an I/O pitch of – μm. The higher signaling rates, the bandwidth,
and the higher I/O density promised by silicon carrier technology at lower
power and in a compact footprint will go a long way toward meeting exascale
goals.

Supercomputers have always been at the forefront in packaging; exascale
promises to be no different.

Reliability
The number of components in a system has a direct bearing on the reliability
of the system. Today's petascale systems, even with their redundancy and
error-correcting techniques at the cache and memory levels, have a mean time
between failures (MTBF) of less than a month. A direct extrapolation for a
-fold increase in the number of components brings this number down to a few
minutes. The conventional method of handling failures is to take a checkpoint
of the relevant state of an application at an interval well under the MTBF of
the system. When a failure occurs, the system is interrupted, the state
reverted back to a previously checkpointed state, and execution resumed. The
maintenance of checkpoints requires the system to be periodically interrupted
and typically involves copying of processor and memory state to some other
part of memory or to storage. The overhead of this checkpointing process
could impede forward progress of the application if it has to be performed
too often. New techniques for detection of faults and for recovery from
faults are needed for future exascale systems.
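A common first-order rule for choosing the checkpoint interval is Young's
approximation τ ≈ sqrt(2·δ·M), where δ is the time to write one checkpoint
and M is the system MTBF. It is not mentioned explicitly in the entry, but it
quantifies the trade-off just described; the checkpoint cost and MTBF values
in main() are illustrative assumptions.

#include <stdio.h>
#include <math.h>

/* Young's first-order approximation of the optimal checkpoint interval:
   tau_opt ~ sqrt(2 * checkpoint_cost * MTBF).                          */
static double optimal_interval(double ckpt_cost_s, double mtbf_s) {
    return sqrt(2.0 * ckpt_cost_s * mtbf_s);
}

/* Fraction of machine time spent writing checkpoints alone. */
static double checkpoint_overhead(double ckpt_cost_s, double interval_s) {
    return ckpt_cost_s / (interval_s + ckpt_cost_s);
}

int main(void) {
    double ckpt_cost = 600.0;          /* assumed: 10 min to write a checkpoint */
    double mtbfs[] = { 24.0 * 3600.0,  /* one day                               */
                        3600.0,        /* one hour                              */
                         300.0 };      /* minutes-scale MTBF feared at exascale */
    for (int i = 0; i < 3; i++) {
        double tau = optimal_interval(ckpt_cost, mtbfs[i]);
        printf("MTBF %8.0f s -> interval %7.0f s, checkpoint overhead %5.1f%%\n",
               mtbfs[i], tau, 100.0 * checkpoint_overhead(ckpt_cost, tau));
    }
    return 0;
}

When the MTBF approaches the time needed to write a checkpoint, the overhead
dominates useful work, which is precisely the scalability problem described
above.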
While coding and redundancy techniques could be used to improve the MTBF of
the system, they come at a cost – in area, in power, and often in
performance. With the exascale goal already a significant challenge on all
these fronts, there is a growing belief that reliability techniques can no
longer be application agnostic, and that applications will need to guide when
and how checkpointing needs to be done in future large systems.

Usability
The eventual success of exascale computing will be measured not by whether it
can execute the Linpack benchmark at an exaflops rate but by how successfully
it is embraced by the community and how fundamentally it transforms both
science and commerce. The preceding discussion about the challenges of
implementing an exascale system suggests that the role of software will be
important in the effective exploitation of such a system. On the one hand,
life for application developers will be made easier if the system supports
today's programming models without change, but on the other hand, software
will have to work around the necessary restrictions in hardware due to
reasons mentioned above.

Perhaps the most effective way in which this can be accomplished is through
widespread adoption of a new exascale programming model. Neither MPI, the
standard message-passing paradigm, nor OpenMP, the standard shared-memory
paradigm, alone can exploit a system almost certainly likely to include a
cluster of shared-memory SIMD processors. There is growing interest in a
hybrid model, for example, one that uses OpenMP at the local chip level and
MPI at higher levels. New programming languages classified as Partitioned
Global Address Space (PGAS) languages like X10, UPC, Co-Array Fortran, and
Titanium provide the programmer the view of one large shared address space
for the entire system, but still allow the programmer to specify partitions
of the address space local to each thread.
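A minimal sketch of the hybrid style mentioned above – MPI across nodes,
OpenMP within a node – is shown below. It is illustrative only; the local
problem size is an assumption, and it requires an MPI library supporting
MPI_THREAD_FUNNELED.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* MPI between nodes; only the thread that calls MPI uses it. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const long n = 1000000;            /* illustrative local problem size */
    double local = 0.0, global = 0.0;

    /* OpenMP inside the node: each rank reduces its own slice in parallel. */
    #pragma omp parallel for reduction(+:local)
    for (long i = 0; i < n; i++)
        local += 1.0 / (double)(rank * n + i + 1);

    /* MPI across nodes: combine the per-node partial results. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               nranks, omp_get_max_threads(), global);
    MPI_Finalize();
    return 0;
}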
Various parts of the software stack will also be expected to incorporate
features to tolerate the expected high rate of failures as discussed in the
previous section. While this could be handled directly through the
programming language, ease-of-use considerations are likely to force
reliability to be taken care of either at the compiler level or in the
runtime system, as in the Map-Reduce programming model.

The role of compilers and runtime systems also becomes important if the
system is heterogeneous like some of today's petascale systems or if
domain-specific accelerators and field-programmable gate-arrays (FPGAs)
become part of the system hardware. While a small number of proficient users,
the so-called top-gun programmers, would enjoy programming with such a
diverse array of components, the vast majority of users would prefer a
compiler or the system to decide when and how to employ these components to
deliver optimal performance for their programs.

A new organization, the International Exascale Software Project (IESP), has
been formed with the recognition that unless a coordinated global effort is
begun now, the programming community will not be ready to utilize exascale
systems effectively when they are deployed. The IESP has taken upon itself
the task of outlining a roadmap for the development of systems software
(including operating systems, runtime systems, I/O systems, systems
management software), of development environments (including programming
models, frameworks, compilers, libraries, and debugging tools), of algorithm
support (including data management, analysis, and visualization), and also of
algorithms for common applications.
Conclusion
Exascale systems promise to usher in a new era of computing where massive
supercomputers will be used not only for traditional simulations of large
scientific problems but also to process the vast amount of data that is being
produced especially from new types of ubiquitous devices. The success of such
large systems, however, will be determined by how effectively they are
utilized. Exascale systems can be expected to blaze new trails in
power-efficient hardware design, in a system-wide approach to ensuring
availability despite expected failures, and in new software approaches to
maximize the utilization of the available computation power.

Related Entries
Chapel (Cray Inc. HPCS Language)
Checkpointing
Clusters
Coarray Fortran
Distributed-Memory Multiprocessor
Fault Tolerance
Hybrid Programming With SIMPLE
LINPACK Benchmark
Massive-Scale Analytics
Metrics
Networks, Direct
Networks, Fault-Tolerant
Networks, Multistage
PGAS (Partitioned Global Address Space) Languages
Power Wall
Reconfigurable Computers
TOP500
UPC

Bibliographic Notes and Further Reading
The most comprehensive description of the challenges of exascale systems can
be found in the report of a study [] under the sponsorship of the DARPA
Information Processing Techniques Office. The software challenges are
detailed in a similar study described in []. The International Exascale
Software Project [] is also actively pursuing an agenda to ensure adequate
software exploitation of future exascale systems. DARPA also sponsored a
study on reliability aspects of exascale systems and the resulting whitepaper
is available at [].

The Top500 Web site [] gives a historical trend of the performance and
capabilities of the highest-performing scientific computers. The limits of
CMOS scaling and its effect on power consumption in future technologies are
described well in []. An excellent overview of the architecture, power
consumption, and economics of large datacenters may be found in [].

There are several new technologies that promise to supplant or complement
nonvolatile storage. A good description of storage-class memories may be
found in []. There is recent interest also in the use of such memories as
extensions to traditional DRAM main memory [].

A comprehensive description of challenges in packaging and potential
solutions can be found in []. The promise of 3D technology is discussed in
[]. An introduction to recent advances in silicon photonics appears in [].

Bibliography
. Barroso LA, Hölzle U () The datacenter as a computer: an introduction to
  the design of warehouse-scale machines. In: Hill M (ed) Synthesis lectures
  on computer architecture. Morgan and Claypool Publishers, San Rafael, CA.
  Available: http://www.morganclaypool.com/doi/abs/./SEDVYCAC.
  Accessed  Feb 
. Elnozahy EN (ed) () System resilience at extreme scale. Available:
  http://www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Resiliency.pdf.
  Accessed  Feb 
. Frank D () Power-constrained CMOS scaling limits. IBM J Res Dev (/):–
. Freitas RF, Wilcke WW () Storage-class memory: the next storage system
  technology. IBM J Res Dev (/):–
. International Exascale Software Project () Main page. Available:
  http://www.exascale.org/iesp/Main_Page. Accessed  Feb 
. Knickerbocker JU et al () Development of next-generation
  system-on-package (SOP) technology based on silicon carriers with
  fine-pitch chip interconnection. IBM J Res Dev (/):–
. Kogge PM (ed) () ExaScale computing study: technology challenges in
  achieving exascale systems. Available:
  http://www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Initial.pdf.
  Accessed  Feb 
. Qureshi MK, Srinivasan V, Rivers JA () Scalable high performance main
  memory system using phase-change memory technology. In: Proceedings of
  ISCA-. Austin, TX, USA
. Sarkar V (ed) () Exascale software study: software challenges in extreme
  scale systems. Available:
  http://www.darpa.mil/ipto/personnel/docs/ExaScale_Study_Software.pdf.
  Accessed  Feb 
. Top500 Supercomputer Sites () Home page. Available: http://top500.org/.
  Accessed  Feb 
. Topol AW et al () Three-dimensional integrated circuits. IBM J Res Dev
  (/):–
. Tsybeskov L, Lockwood DJ, Ichikawa M () Silicon photonics: CMOS going
  optical. Proc IEEE ():–

Execution Ordering
Scheduling Algorithms

Experimental Parallel Algorithmics
Algorithm Engineering

Extensional Equivalences
Behavioral Equivalences
F

Fast Fourier Transform (FFT)
FFT (Fast Fourier Transform)

Fast Multipole Method (FMM)
N-Body Computational Methods

Fast Poisson Solvers
Rapid Elliptic Solvers

Fat Tree
Networks, Multistage

Fault Tolerance
Hans P. Zima, Allen Nikora
California Institute of Technology, Pasadena, CA, USA

Definition
Fault tolerance denotes the capability of a system to adhere to its
specification and deliver correct service in the presence of faults.

Discussion
Introduction
Modern society relies increasingly on highly complex computing systems across
a wide spectrum of applications, ranging from commercial transactions to the
control of transportation and communication systems, the electrical power
grid, nuclear reactors, and space missions. In many cases, human lives and
huge material values depend on the well-functioning of these systems even
under adverse conditions. Despite the successes in verification and
validation technology over past decades, theoretical as well as practical
studies convincingly demonstrate that large systems typically do contain
design faults, even when subject to the strictest development disciplines.
Moreover, even a perfectly designed system may be subject to external faults,
such as radiation effects and operator errors. As a consequence, it is
essential to provide methods that avoid system failure and maintain the
functionality of a system, possibly with degraded performance, even in the
case of faults. This is called fault tolerance.

A fault is defined as a defect in a system that may cause an error during its
operation. If an error affects the service to be provided by a system, a
failure occurs. Fault-tolerant systems were built long before the advent of
the digital computer, based on the use of replication, diversified design,
and federation of equipment. In an article on Babbage's difference engine
published in , Dionysius Lardner wrote []: "The most certain and effectual
check upon errors which arise in the process of computation is to cause the
same computations to be made by separate and independent computers; and this
check is rendered still more decisive if they make their computation by
different methods." Early fault-tolerant computers include NASA's
Self-Testing-and-Repairing (STAR) computer, developed for a -year mission to
the outer planets in the s, and the computers onboard the Jet Propulsion
Laboratory's Voyager systems. Today, highly sophisticated fault-tolerant
computing systems control the new generation of fly-by-wire aircraft, such as
the Airbus and Boeing airliners, protecting against design as well as
hardware faults. Perhaps the most widespread use of fault-tolerant computing
has been in the area of commercial transactions systems, such as automatic
teller machines and airline reservation systems.
The focus of this article is on software methods for dealing with faults that
may arise in hardware or software systems. Before entering into a discussion
on fault tolerance, the broader concept of dependability is explored below.

Dependability
Fault tolerance is one important aspect of a system's dependability, a
property that has been defined by the IFIP 10.4 Working Group on Dependable
Computing and Fault Tolerance as the "trustworthiness of a computing system
which allows reliance to be justifiably placed on the service it delivers."
Whereas fault tolerance deals with the problem of ensuring correct system
service in the presence of faults, the notion of dependability includes also
methods for removing faults during system design, for example, via program
verification, or to prevent faults a priori, e.g., by imposing coding rules
that outlaw common design faults.

Basic Attributes of Dependability
The attributes of dependability specify a set of properties that can be used
to assess how a system satisfies its overall requirements. A system's
dependability specification needs to include the requirements for each of the
individual attributes in terms of the acceptable frequency and severity of
service failures for specified classes of faults and the system's operating
environment. Short definitions for a typical list of attributes are provided
below; a detailed discussion can be found in [, , ].

The reliability, R(t), of a system at a time, t, is specified as the
probability that the system will provide correct service up to time t. For
example, the reliability required for aircraft control systems has been
specified as R(t) > 1 − 10⁻⁹ for t = 1 h (this is often informally
characterized as "nine nines"). The mean time to failure (MTTF) is the
expected time that a system will operate correctly before failure. System
availability at a time, t, is defined as the probability that the system can
provide correct service at time t. The safety of a system is characterized by
the absence of catastrophic consequences on its user and the environment it
is operating in. More formally, it can be defined as the probability to fail
in a controlled, or safe manner (see section "Controlled Failure").
Confidentiality is a system's capability to prevent the unauthorized
disclosure of information. Maintainability characterizes the ability of a
system to undergo modifications, including repairs: one specific aspect is
the time it takes to restore a failed system to normal operation.
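Under the common assumption of a constant failure rate λ (an exponential
failure model, which the entry does not state explicitly), these attributes
are related as follows:

% Constant-failure-rate (exponential) model:
\begin{align*}
  R(t) &= e^{-\lambda t}, \qquad
  \mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt = \frac{1}{\lambda},\\
  R(1\,\mathrm{h}) &> 1 - 10^{-9}
  \;\Longleftrightarrow\; \lambda \lesssim 10^{-9}\ \text{per hour}
  \;\Longleftrightarrow\; \mathrm{MTTF} \gtrsim 10^{9}\ \mathrm{h}.
\end{align*}
% With repair, the steady-state availability is A = MTTF / (MTTF + MTTR).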
Threats
A system is associated with a specification, which describes the service the
system is intended to provide to its user (a human or another system). It
consists of one or more interacting components. A threat is any fact or event
that negatively affects the dependability of a system. Threats can be
classified as faults, errors, or failures; their relationship is illustrated
by the fault-error-failure chain [] shown in Fig. 1.

A fault is a defect in a system. Faults can be dormant – e.g., incorrect
program code that is not executed – and have no effect. When activated during
system operation, a fault leads to an error, which is an illegal system
state. A fault inside a component is called internal; an external fault is
caused by a failure propagated from another component, or from outside the
system. Errors may be propagated through a system, generating other errors.
For example, a faulty assignment to a variable may result in an error
characterized by an illegal value for that variable; the use of the variable
for the control of a for-loop can lead to ill-defined iterations and other
errors, such as illegal accesses to data sets and buffer overflows. A failure
occurs if an error reaches the service interface of a system, resulting in
system behavior that is inconsistent with its specification.

The execution of a system can be modeled by a sequence of states, with state
transitions being caused as the result of atomic actions. In a first
approximation, the set of all system states can be partitioned into correct
states and error states. By separating those error states that allow a
recovery from those that represent system failure, the set of all error
states is further partitioned into tolerated error states and failure states
(Fig. 2). The transitions between these state categories can be described by
classifying unambiguously each action as a correct, fault, or recovery
action []:
● A correct action, executed in a correct state, results in a correct state
● A fault action, executed in a correct state, results in an error state
  (tolerated or failure)
● A correct or fault action, executed in a tolerated error state, results in
  an error state (tolerated or failure)
● A recovery action, executed in a tolerated error state, results in a
  correct state.
● For a class of systems that are referred to as fail-controlled, recovery
  from failure is possible using special protocols (see section "Controlled
  Failure").

Fault Tolerance. Fig. 1 Threats: the fault-error-failure chain

Fault Tolerance. Fig. 2 State-space partitioning

A system is fault tolerant if it never enters a failure state: this means
that errors may occur, but they never reach the service boundary of the
system and always allow recovery to take place. The implementation of fault
tolerance in general implies three steps: error detection, error analysis,
and recovery.

The main categories of faults are physical, design, and interaction faults. A
physical fault manifests itself by malfunctioning of hardware, which is not
the result of a design fault. Examples include a permanent fault caused by
the breakdown of a processor core due to overheating, a transient fault
caused by radiation and resulting in a bitflip in a register or memory, or a
fault caused by a human operator.
Design faults may originate in either hardware or software. An illegal
condition in a while loop leading to infinite iteration or an assignment
involving incompatible variable types are examples of software design errors.

Interaction faults occur during system operation and are caused by the
environment in which the system operates. They include illegal input data or
operator errors as well as natural faults caused by radiation hitting the
system.

In addition to these main categories, faults can be characterized along a set
of additional properties, often across the above classes. This includes the
domain of a fault – hardware or software, its intent – accidental or
malicious (e.g., faults caused by viruses, worms, or intrusion), or its
persistence – hard or soft. A hard fault (such as a device breakdown) makes a
component permanently unusable. In contrast, a soft fault is a temporary
fault that can be transient or intermittent. A component affected by such a
fault can be reused after the resulting error has been processed. For
example, a bitflip in a register caused by a proton hitting a circuit is a
transient fault referred to as a Single Event Upset (SEU).

Reflecting the diversity of faults, the failures caused by them can be
characterized by a broad range of nonorthogonal properties. These include
their domain – value or timing, the capability for control, signalling –
signalled or unsignalled failures, and their consequences, ranging from minor
to catastrophic.

Fault Prevention
Fault prevention addresses methods that prevent faults from being
incorporated into a system. In the software domain, such methods include
restrictive coding standards that avoid common programming faults, the use of
object-oriented techniques such as modularization and information hiding, and
the provision of high-level programming languages and abstractions. Firewalls
are mechanisms for preventing malicious faults. An example for hardware fault
prevention includes shielding against radiation-caused faults [].

Fault Removal
Despite the existence of fault tolerance methods, a software system's
dependability will tend to increase with the number of faults identified and
removed during its development. Fault removal refers to the set of techniques
that eliminate faults during the design and development process. Verification
and Validation (V&V) are important in this context: Verification refers to
methods for the review, inspection, and test of systems, with the goal of
establishing that they conform to their specification. Validation checks the
specification in order to determine if it correctly expresses the needs of
the system's users.

Verification methods can be classified as either dynamic or static, depending
on whether or not they involve execution of the underlying program. Both
techniques require that a system's functional and behavioral specification be
transformed into a mathematically based specification language, such as
first-order logic []. System developers also need to specify a set of
correctness properties, which must be satisfied. For example, a correctness
property for a camera mounted on a robotic planetary exploration spacecraft
would be that the camera's lens cover must be closed if the camera is pointed
within a given angle of the Sun.

Static verification of a program implies a static proof – performed before
execution of the program – that for all legal inputs the program conforms to
its specification. Early techniques include Hoare's logic and Dijkstra's
predicate transformers [, ]. Theorem provers use mathematical proof
techniques to provide a rigorous argument that the correctness properties are
satisfied. Some theorem provers, such as the early versions of the
Boyer-Moore theorem prover [], are designed to operate in a completely
automated fashion. Others, such as the Prototype Verification System (PVS)
[], are interactive, requiring the user to "steer" the proof. Model checkers
determine whether a system's correctness properties can be violated by
exploring its computational state space. If a violation is detected, the
model checker returns a counterexample describing how the violation can
occur. This also means that model checkers can be used to generate test cases
for the implemented system []. Model checkers are able to identify types of
faults associated with multithreaded systems such as deadlocks, lack of
progress cycles, and race conditions. The more widely used model checkers
include the SPIN model checker [], UPPAAL [], and SMV (Symbolic Model
Verifier) [].
Although static verification techniques can identify faults early in a
system's development, these techniques face a number of theoretical and
practical challenges and limitations. Specifically, many decision problems
are undecidable, that is, there is no general method that provides a complete
solution []. Other problems, for which solution algorithms exist – including
many graph problems – are NP-complete, that is, computationally intractable
due to exponential complexity. Finally, methods such as model checking often
need to deal with the scalability challenge, that is, an exploding state
space.

Furthermore, substantial effort can be required to create the formal
specifications to which these tools will be applied. Learning the
specification language can require substantial effort as well; even more
important is learning the skill of abstracting from nonessential details of
the system's functionality and behavior. For these reasons, these techniques
are usually applied only to the critical elements of a system (e.g., the
fault detection, identification, and repair component of onboard spacecraft
control software, or fault response systems for nuclear power plants).

Dynamic verification technologies are based on the actual execution of the
program. Most important are test methods. The test of a program for a
specific input either demonstrates correctness of the program for this
specific input, or results in an error. Thus, a specific test execution can
either prove that the program operates correctly for the selected input, or
it identifies an error. Thus, as already noticed by Edsger Dijkstra in ,
tests can prove or disprove the existence of a fault but never the absence of
all faults [].

Controlled Failure
Non-fault tolerant systems may be fail-controlled, meaning that they fail
only in specific, predefined modes, and only to a manageable extent, avoiding
complete disruption. Rather than providing the capability of resuming normal
operation, a failure in such a system puts it into a state from which
recovery is possible after the failure has been detected and identified.

As an example, the onboard software controlling robotic planetary exploration
spacecraft for those portions of a mission during which there is no critical
activity (such as detumbling the spacecraft after launch or descent to a
planetary surface) can be organized as a fail-controlled system. When a fault
is detected during operation, all active command sequences are terminated,
components inessential for survival are powered off, and the spacecraft is
positioned into a stable, sun-pointed attitude. Critical information
regarding its state is transmitted to ground controllers via an emergency
link; rebooting and restoring the health of the spacecraft is then delegated
to controllers on Earth.

Systems that preserve continuity of service can be significantly more
difficult to design and implement than fail-controlled systems. Not only is
it necessary to determine that a fault has occurred, the software must be
able to determine the effects of the fault on the system's state, remove the
effects of the fault, and then place the system into a state from which
processing can proceed.

Fault Tolerance
General Methods for Software Fault Tolerance
Ensuring fault tolerance, that is, continuity of service in the presence of
faults, requires replication of functionality. Many of the techniques used to
achieve replication of functionality can be categorized as either design
diversity or as self-checking software. Using design diversity to tolerate
faults in the design of a software system requires that there be at least two
variants of a system (i.e., different designs and/or implementations based on
a common specification), a decider to provide an error-free result from the
variants' execution, and an explicit specification of decision points in the
common specification (i.e., specification of when decisions have to be
performed, and specification of the data upon which decisions have to be
performed). Two well-known types of design diversity are recovery blocks and
N-version programming.

Recovery Blocks
Recovery blocks [] make use of sequentially applied design diversity. In this
case, the decider is an acceptance test that checks the computation results
against context-specific criteria to determine whether the result can be
accepted as correct. If the acceptance test decides
that the result is correct, the system continues execution and does not
attempt to apply any corrective action. On the other hand, if the acceptance
test indicates that the result is incorrect, the result is discarded and a
variant (termed an "alternate" for recovery blocks) is applied to the inputs
to retry the computation. If all of the alternates have been attempted and no
results have met the acceptance criteria, additional steps (such as exception
handling, see section "Exception Handling") are required to recover from the
fault. A Petri-net model of a recovery block is shown in Fig. 3.
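The control structure of a recovery block can be sketched as follows; the
type and function names are hypothetical, chosen only for illustration.

#include <stdbool.h>
#include <string.h>

typedef struct { double data[64]; } State;

typedef bool (*variant_fn)(const State *in, State *out);
typedef bool (*accept_fn)(const State *in, const State *out);

/* Recovery block: run the primary and then each alternate on the saved
   input state until one result passes the acceptance test.            */
static bool recovery_block(const State *checkpoint, State *result,
                           variant_fn variants[], int n_variants,
                           accept_fn acceptance_test) {
    for (int i = 0; i < n_variants; i++) {
        State in;                                 /* restore checkpointed input */
        memcpy(&in, checkpoint, sizeof(State));
        if (variants[i](&in, result) &&           /* primary or alternate i     */
            acceptance_test(checkpoint, result))  /* decider: acceptance test   */
            return true;                          /* accepted: continue normally */
    }
    return false;   /* no alternate left: exception handling is required */
}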
N-Version Programming
For N-version programming [], design diversity is applied by executing all of
the variants (termed versions) in parallel. The decider then votes on all of
the results: if a majority of the results meet the decider's criteria for
correctness, the system is considered to be operating normally and continues
execution. Otherwise, further action is required to recover from the fault. A
Petri-net model for N-version programming is shown in Fig. 4.

A variant of N-version programming, known as back-to-back testing [], keeps
only one of the N versions in the fielded system, but uses all of the
versions during testing. If a defect is found in one of the N versions during
test, the other versions are analyzed to determine whether they also have
that defect. The goal of back-to-back testing is to improve the reliability
of a fielded system without incurring the memory space or execution time
penalties that might be associated with N-version programming. However,
fielding only a single version removes the fault-tolerance capability
associated with N-version programming.
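A majority-voting decider of the kind described above might look as follows;
this is only a sketch, and the number of versions and the equality tolerance
eps are assumptions of the example.

#include <math.h>

#define N_VERSIONS 3

/* Majority vote over the outputs of N independently developed versions.
   Returns 1 and stores the agreed value if some result is matched by a
   strict majority of the versions, 0 otherwise.                        */
static int vote(const double result[N_VERSIONS], double eps, double *agreed) {
    for (int i = 0; i < N_VERSIONS; i++) {
        int matches = 0;
        for (int j = 0; j < N_VERSIONS; j++)
            if (fabs(result[i] - result[j]) <= eps)
                matches++;
        if (2 * matches > N_VERSIONS) {   /* strict majority agrees with i */
            *agreed = result[i];
            return 1;
        }
    }
    return 0;  /* no majority: further recovery action is required */
}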

N-Self-Checking Programming
N-self-checking programming [] combines the idea of an acceptance test from
recovery blocks with the parallel execution of N ≥ 2 variants. In this case,
each of the variants has an acceptance test associated with it. The
acceptance tests associated with each variant or the comparison algorithms
associated with a pair of variants can be the same, or specifically derived
for each of them from a common specification. As the variants are executed,
the acceptance test associated with each variant determines whether the
result meets the criteria for correctness. From the set of correct results,
one is selected as the collective output for all the variants and execution
proceeds normally. If the set of correct results is empty, the system is
considered to have failed and further action is required to restore it to
nominal operation. A Petri-net model for N-self-checking software is shown in
Fig. 5.

Fault Tolerance. Fig. 3 Petri-net model for recovery block
Fault Tolerance. Fig. 4 Petri-net model for N-version programming

Fault Tolerance. Fig. 5 Petri-net model for self-checking programming

For self-checking software components that are based on the association of
two variants, one variant only may be written for the purpose of fulfilling
the functions expected from the component, while the other variant can be
written as an extended acceptance test. For example, it could perform the
required computations with a different accuracy, compute the inverse function
from the primary variant's results (if possible), or exploit
application-specific intermediate results of the primary variant.
Algorithm-based fault tolerance
(ABFT) belongs to this category []. Many of the fault-tolerant systems
encountered in real life are based on self-checking software. For example,
the software for the Airbus A and A flight control systems and the Swedish
railways' interlocking system are based on parallel execution of two variants
– in the latter case, the variants' results are compared, and operation is
stopped if the results of the variants disagree.
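Algorithm-based fault tolerance, mentioned above, typically augments a data
structure with checksums that the algorithm itself preserves; the classic
formulation extends a matrix with row and column sums. The sketch below is
not from the entry and only shows single-error detection and location; the
matrix size and tolerance are illustrative assumptions.

#include <math.h>

#define N 4
#define EPS 1e-9

/* Locate a single corrupted element of an N x N matrix by recomputing the
   row and column checksums stored when the matrix was produced. Returns 1
   with the element's indices if exactly one row and one column disagree;
   returns 0 if the matrix is consistent.                                 */
static int abft_locate(const double a[N][N],
                       const double row_sum[N], const double col_sum[N],
                       int *bad_i, int *bad_j) {
    int bi = -1, bj = -1;
    for (int i = 0; i < N; i++) {
        double s = 0.0;
        for (int j = 0; j < N; j++) s += a[i][j];
        if (fabs(s - row_sum[i]) > EPS) bi = i;
    }
    for (int j = 0; j < N; j++) {
        double s = 0.0;
        for (int i = 0; i < N; i++) s += a[i][j];
        if (fabs(s - col_sum[j]) > EPS) bj = j;
    }
    if (bi >= 0 && bj >= 0) { *bad_i = bi; *bad_j = bj; return 1; }
    return 0;
}

Because the checksum difference also gives the magnitude of the error, a
single corrupted value can even be corrected, which is what makes this
technique attractive for linear-algebra kernels.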
It is important to note that the actual effectiveness of fault-tolerance
techniques based on multiple versions may not be as high as the theoretical
limits, because even independent programming teams may make similar errors
[]. However, recent work indicates that in spite of this limitation,
N-version techniques can still be a viable way of tolerating faults [].

Introspection-Based Adaptive Fault Tolerance
Different components of a system may have different requirements for fault
tolerance. For example, a space mission may, at one end of the spectrum,
contain program components that perform low-level image processing with
minimal fault-tolerance requirements (it does no harm if one in a few million
pixels is lost), while at the other end the software for spacecraft control
and communication needs maximum protection. Such systems require support for
adaptive fault tolerance, based on a fault model and a characterization of
applications and their components with respect to their overall importance,
exposure to faults, and requirements for fault tolerance. An
introspection-based approach can support such an environment by providing a
capability for execution-time monitoring and self-checking, analysis, and
feedback-oriented recovery []. An introspection module, as illustrated in
Fig. 6, is the core element of such a system, connected to the application
via a set of software sensors for receiving input from, and a set of
actuators for generating feedback to an application module. An inference
engine connected to a knowledge base controls the operation of the
introspection module. Introspection modules are organized into an
introspection graph that reflects the organization and fault-tolerance
requirements of the application.

The application-specific error detection, analysis, and recovery actions of
an introspection module can be controlled via assertions inserted in a
program, directions for generation of specific checks, and corresponding
recovery actions. This information can be provided by a user or automatically
generated from the source program or a pre-existing knowledge base
characterizing properties of the application [, ]. A detailed analysis of the
range of fault-tolerance requirements for a complex real-time environment is
provided in [].

Fault Tolerance. Fig. 6 Introspection module
Fault Tolerance in Concurrent Systems

Safety and Liveness Properties
A concurrent system is introduced as an interconnected collection of a finite
number of autonomous processes. Processes can either interact via a global
shared memory, or communicate via message-passing over a network. No
assumptions are made about individual and relative speeds of processes and
delays involved in access to shared memory or message-passing communication,
except for the requirement that the speed of active processes is finite and
positive, and message delays are finite. (Thus, this model does not cover
real-time constraints.) An execution of such a system can be modeled by a
sequence of system states, with a transition between successive states taking
place as a result of an atomic action in a process or a communication step.

Lamport [] introduced two key types of correctness properties for parallel
systems: safety properties and liveness properties. A safety property can be
described by a predicate over the state that must hold for all executions of
the system. It rules out that "bad things happen." An example for a safety
property is the requirement of an airline reservation system to provide
mutual exclusion for accesses to any record representing a specific flight on
a given day. A more mundane example is the requirement for a system
controlling the traffic lights of a street intersection that the lights for
two crossing streets may never be green at the same time. A liveness property
requires that each process, which in principle is able to work, will be able
to do so after a finite time. This property can be expressed via a predicate
that must be eventually satisfied, guaranteeing that "a good thing will
finally happen." For instance, if a set of processes compete for a resource,
each of the processes should be able to acquire it after a finite amount of
time. Examples for violation of liveness include the prevention of a process
to reach regular termination or to provide a specified service. Another
example is a deadlock involving two or more processes, which cyclically block
each other indefinitely in an attempt to access common resources. Alpern and
Schneider have shown that every possible property of the set of executions
can be expressed as a conjunction of safety and liveness properties [].

The manifestation of faults in a concurrent system, and related methods for
their toleration, depend largely on detailed system properties and the
programming model under consideration. In the next subsection, a number of
typical problems in concurrent systems are shortly discussed; a comprehensive
treatment of such issues can be found in []. In addition to the applicability
of the general methods discussed in section "General Methods for Software
Fault Tolerance", fault tolerance can often be achieved using specialized
techniques.

Example Problems
Mutual exclusion constrains access to a shared resource – for example, a
variable, record, file, or output device – such that at any time at most one
process may access the resource (a safety property). Any correct solution
must also guarantee that each process requesting the resource is granted
access after a finite waiting period (a liveness property). A simple method
for implementing mutual exclusion in a shared memory system is via a binary
semaphore [] that controls access to the resource. The corruption of the
semaphore value (by any process or as a result of a hardware fault) destroys
the safety property of the algorithm. An error in the semaphore
implementation may destroy the liveness property. The mutual exclusion
problem is discussed in detail in [].
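A minimal shared-memory illustration of the binary-semaphore scheme, using
POSIX threads and an unnamed POSIX semaphore; the protected counter and the
thread count are only examples.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static sem_t mutex;          /* binary semaphore guarding the resource */
static long shared_counter;  /* the shared resource                    */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);    /* enter critical section            */
        shared_counter++;    /* at most one thread executes this  */
        sem_post(&mutex);    /* leave critical section            */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    sem_init(&mutex, 0, 1);                       /* initial value 1 = free */
    for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", shared_counter, 4 * 100000);
    sem_destroy(&mutex);
    return 0;
}

A bitflip that set the semaphore's value to 2 would silently admit two
threads at once – exactly the safety violation described above.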
State-based synchronization makes the progress of a process dependent on a
condition in its environment. For example, in a simple version of the
producer-consumer problem, the producer and consumer are autonomous processes
operating asynchronously, with a cyclic buffer between them acting as a
temporary storage for data items: in each cycle of the producer, it generates
one data item and places it into the buffer; conversely, the consumer removes
in each cycle one item from the buffer and then processes it. Wrong
synchronization (as a result of a design error or a corruption of variables
used for the coordination) can lead to a number of errors, including the
corruption of a buffer element, an attempt of the producer to write into a
full buffer, or an attempt of the consumer to read from an empty buffer. The
result is a violation of the safety property, and possibly of liveness.
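The standard guarded version of this cyclic buffer uses two counting
semaphores (a mutex is also needed if there are several producers or
consumers). A compact sketch for one producer and one consumer, with an
illustrative buffer size and item count:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define SLOTS 8

static int buffer[SLOTS];
static int in_pos, out_pos;          /* cyclic buffer indices     */
static sem_t empty, full;            /* free slots / filled slots */

static void *producer(void *arg) {
    (void)arg;
    for (int item = 0; item < 32; item++) {
        sem_wait(&empty);            /* block if the buffer is full  */
        buffer[in_pos] = item;
        in_pos = (in_pos + 1) % SLOTS;
        sem_post(&full);
    }
    return NULL;
}

static void *consumer(void *arg) {
    (void)arg;
    for (int n = 0; n < 32; n++) {
        sem_wait(&full);             /* block if the buffer is empty */
        int item = buffer[out_pos];
        out_pos = (out_pos + 1) % SLOTS;
        sem_post(&empty);
        printf("consumed %d\n", item);
    }
    return NULL;
}

int main(void) {
    pthread_t p, c;
    sem_init(&empty, 0, SLOTS);      /* initially all slots are free */
    sem_init(&full, 0, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

If the initialization of empty were corrupted to SLOTS + 1, the producer
could overwrite an unconsumed slot – the buffer-corruption error mentioned
above.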
common resources. Alpern and Schneider have shown A deadlock is a system state where two or more pro-
that every possible property of the set of executions can cesses block each other indefinitely in an attempt to
be expressed as a conjunction of safety and liveness access a set of resources. In a typical example, process
properties []. p holds resource r and requests resource r , while pro-
The manifestation of faults in a concurrent system, cess p holds r and requests r . This is a violation of the
and related methods for their toleration depend largely liveness requirement. It can be addressed in a number
of ways, including deadlock prevention, for example by prioritizing resources
and allowing access requests only in rising priority order, or by
periodically checking for a deadlock and taking a recovery action if it
occurs. A different example for a liveness violation is represented by the
famous dining philosophers problem, where two processes conspire to block a
third process indefinitely from accessing a common resource. Problems of this
kind can be avoided by defining an upper limit for the number of unsuccessful
resource requests, and prioritizing the blocked process once this limit is
reached.

In a system where processes communicate over a network, a network fault can
propagate into a number of different faults in the process system. These
include (1) transmission of wrong message content, (2) sending a message to
the wrong destination, and (3) total loss of the message, resulting in a
possibly indefinite delay of the intended receiver. A way to deal with the
last of these faults is the use of a watchdog timer that regularly checks the
progress of processes.

Byzantine faults are arbitrary value or timing faults that can occur as a
result of radiation hitting a circuit, or of a malicious intruder. Under
certain conditions, such faults can be tolerated using a sophisticated form
of replication [].

Checkpointing and Rollback Recovery in Message-Passing Systems
This section deals with fault tolerance for long-running applications in
message-passing systems implemented on top of a reliable communication layer
(which has to be provided by the underlying hardware and software layers of
the network). The state of such systems consists of the set of local states
of the participating processes, extended by the state of the communication
system, which records the messages that are in-flight at any given time.
Consistency of the global state implies that the dependences created between
processes as a result of message transmissions must be observable in the
state: more specifically, at the time a process records receipt of a message
the state of the sender must reflect the previous sending of that message.

A recovery approach based on checkpointing and rollback requires a stable
storage device that is not affected by faults in the system. At a checkpoint,
a process saves recovery information to this storage device during fault-free
operation; after occurrence of an error (e.g., as a result of a failing
process), the information stored at the checkpoint can be used to restart the
process. Recovery can be managed by the application; here a system-supported
approach is discussed.

There are two categories of recovery-based protocols. The first relies only
on the use of checkpoints, whereas the second approach in addition performs
logging of all nondeterministic events, which allows the deterministic
regeneration of an execution after the occurrence of a fault (which is not
possible if only checkpoints are used). This is important in applications
that heavily communicate with the outside world (which cannot be rolled
back!). However, from the viewpoint of today's large-scale scientific
applications, the first approach, which will just be denoted as
checkpointing, is more relevant, and will be the sole focus in the rest of
this section.

There are two main methods for performing checkpointing, which are
characterized as uncoordinated or coordinated. In uncoordinated
checkpointing, each process essentially decides autonomously when to take
checkpoints. During recovery, a consistent global checkpoint needs to be
found taking into account the dependences between the individual process
checkpoints. Uncoordinated checkpointing has the advantage of allowing a
process to take into account local knowledge, for example, for minimizing the
amount of data that needs to be saved. However, there are serious
disadvantages, including the domino effect that can result in rollback
propagation all the way to the beginning of the computation and the need to
record multiple checkpoints in each process, with the necessity of periodical
cleanups via garbage collection.

As a result of these shortcomings, the dominating method used in today's
systems, and particularly in supercomputers, is coordinated checkpointing. In
this approach, a global synchronization step involving all processes is used
to form a consistent global state. Only one checkpoint is needed in each
process, and no garbage collection is required. A straightforward centralized
two-stage protocol for the creation of a checkpoint requires blocking of all
communication while the protocol executes. In a first step, a coordinator
process broadcasts a quiesce message to all processes requesting them to
stop. After receiving an acknowledgment for this request from all processes,
the coordinator broadcasts a checkpoint message that requests processes to
take their checkpoint and acknowledge completion. Finally, after receipt of
this acknowledgment from all processes, the coordinator broadcasts a proceed
message to all processes, allowing them to continue their work. On a failure
in a process, the whole application rolls back to the last checkpoint and
resumes execution from there.
Coordinated checkpointing provides a relatively of modern software-controlled fault injectors. It allows
simple approach to recovery. Drawbacks include the automated campaigns of random uniformly distributed
overhead for communication in the checkpointing and bitflip faults into application registers and/or memory
recovery protocols, and the time required for storing the space and registers. In order to characterize the fault
relevant data sets to a stable storage device, with the injection results, it is necessary to verify the final result
obvious consequences for scalability. In fact, a recent and classify its effect on the system with respect to its F
study on exascale computing [] shows that check- liveness properties and the functional correctness of the
pointing along the above protocol seems no longer to output. The main goal of this methodology is to char-
be feasible, due to the existence of billions of threads in acterize an application’s sensitivity to transient faults.
such a system. A potential solution would require new For example, it is possible to determine whether the
storage technology for saving the recovery information. application is more likely to crash or produce incorrect
A comprehensive treatment of rollback-recovery results, or which subroutines of the program are most
protocols, including log-based approaches, can be vulnerable.
found in []. Coordinated checkpointing for large-
scale supercomputers is discussed in detail in [], Exception Handling
including a study of scalability and an approach to deal An exception is a special event that signals the occur-
with failures during the checkpointing and recovery rence of one among a number of predetermined faults in
phases. a program. Exception handling can be considered as an
additional fault-tolerance technique in that it provides
Fault Injection Testing a structured response to such events.
Fault injection testing refers to a technique for revealing There are two types of exception handling, resump-
defects in a system’s ability to tolerate faults []. Fault tion and termination []. In resumption, if an exception
injection inserts faults into a system or mimicks their is encountered at a given command in a program, the
effects, combined with a strategy that specifies the type program is temporarily halted and control is transferred
and number of faults to be injected into system compo- to a handler associated with that program. The han-
nents, the frequency at which faults are to be injected, dler performs the computations required to mitigate
and dependencies among fault types. the effects of the exception and then transfers control
A number of fault injectors have been developed back to the program. The handler may return to the
[, , ], which allow faults to be inserted into a com- command immediately following the one at which the
puting system’s memory or processor registers while exception was encountered, or to a different location
the system is running, in a manner that is transparent that may depend on the details of the program’s state
to the executing application and system software. This when the exception was signalled. By contrast, termi-
allows developers of fault-tolerant systems to observe nation simply leads to the (controlled) halting of the
the system operating in an environment closely related program in which the exception was observed.
to the expected environment during fielded use. As Although resumption has a potential advantage in
faults are injected into the system, system engineers that the interrupted program may be resumed upon
are able to observe the immediate effects of each fault successful execution of the handler, there are also
as well as the way in which it propagates through the advantages associated with choosing termination [].
system. When conducting fault injection testing, it is For example, with termination a programmer is encour-
necessary to design test strategies that will cause the aged to recover a consistent state of a module in which
system under test to execute its fault-handling capabil- an exception is detected before signalling it, so that
ities under a wide variety of conditions. One particular further calls to the module’s procedures find the mod-
strategy, implemented by the Ballista automated testing ule state consistent. With resumption, a programmer
 F Fault Tolerance

does not know if control will come back after sig- Exascale Computing
nalling an exception. In addition, recovering a consis- Race Detection Techniques
tent state before signalling (e.g., for example by undo- Synchronization
ing all changes made since the procedure start) defeats
the purpose of resumption, which is to save the work
Bibliographic Notes and Further
done so far between the procedure entry and the detec-
Reading
tion of the exception. If a consistent state is not recov-
The basic concepts of dependability are discussed in
ered, the handler may never resume execution of the
detail in the seminal work of Avizienis, Laprie, Randell,
module after the signalling command, and the module
and Landwehr [, ]. A comprehensive discussion of
will remain in the intermediate, most likely inconsis-
software fault tolerance can be found in the collection
tent, state that existed when the exception was detected,
of articles edited by Lyu []. In [] dependability and
meaning that further module calls can lead to additional
fault tolerance are discussed with a focus on hardware
exceptions and failures.
issues. Faults and fault tolerance for algorithms in dis-
tributed systems, with a detailed treatment of provably
Future Directions solvable and unsolvable issues can be found in [].
As computing systems become increasingly embedded The premier conference on dependability is the
in and critical to society, more effective capabilities to annual IEEE/IFIP International Conference on Depend-
deal with faults induced by exceptional environmen- able Systems and Networks (DSN), with DSN- held
tal conditions, specification and design defects, or both, in Chicago in . This conference was established
will be required. As an example, consider implantable in by combining the IEEE-sponsored International
medical devices such as insulin pumps or defibrillators. Symposium on Fault-Tolerant Computing (FTCS), which
These systems already execute safety-critical software; had been held since , with the Working Confer-
future versions of these devices intended to respond ence on Dependable Computing for Critical Applica-
to a wider set of medical conditions will be substan- tions (DCCA) sponsored by IFIP WG .. Other con-
tially more complex than those in operation today. ferences related to dependability include the Inter-
New challenges for fault tolerance will also come from national Symposium on Software Reliability Engineer-
the increased degree of automation in transportation ing (ISSRE) organized by the IEEE Computer Soci-
systems (railways, airplanes, automobiles) as well as ety and the Reliability, Availability, and Maintainability
extreme-scale systems for large-scale simulation with Symposium (RAMS) organized by the IEEE Reliability
billion-way parallelism []. Last but not least, future Society.
robotic space missions in our solar system and beyond In addition to many journals discussing fault tol-
will demand an unprecedented degree of autonomy erance in the context of more general or related areas
when operating in environments in which direct com- such as software engineering, programming languages
munication with Earth will no longer be possible within and systems, and system security, the IEEE Transac-
practical time intervals. The additional complexity of tions on Dependable and Secure Computing (TDSC) have
these systems and the uncertainties of the environments a strong focus on dependability and fault tolerance.
in which they operate will introduce more opportunities The bibliography in [] provides a good overview of
for specification and design defects [], for which effec- publications in the field until .
tive fault-tolerant techniques must be developed and
deployed.
Bibliography
. Alpern B, Schneider FB () Defining liveness. Inf Process Lett
Related Entries ():–
. Amnell A, Behrmann G, Bengtsson J, D’Argenio PR, David A,
Checkpointing
Fehnker A, Hune T, Jeannet B, Larsen KG, MÄoller MO, Pettre-
Compilers son P, Weise C, Wang Yi () UPPAAL–now, next, and future.
Deadlocks In: Proceedings of modelling and verification of parallel processes
Distributed-Memory Multiprocessor (MOVEP’k), June . LNCS tutorial , pp –
Fault Tolerance F 

. Avizienis A, Laprie JC, Randell B () Fundamental concepts on dependable systems and networks, (DSN-), Goeteborg,
of dependability, UCLA CSD Report No. , University of Sweden, June/July 
California, Los Angeles . Hoare CAR () An axiomatic basis for computer program-
. Avizienis A, Laprie JC, Randell B, Landwehr C () Basic con- ming. Commun ACM ():–
cepts and taxonomy of dependable and secure computing. IEEE . Holzmann GJ () The SPIN model checker. Primer and refer-
Transactions on Dependable and Secure Computing ():– ence manual. Addison-Wesley, New York
. Avizienis AA () The methodology of N-version pro- . Hopcroft JE, Motwani R, Ullman JD () Introduction to
gramming. In: Lyu MR (ed) Software fault tolerance. Wiley, automata theory, languages, and computation, rd edn. Addison
Chichester Wesley Higher Education, New York
. Boyer RS, Kaufmann M, Moore JS () The Boyer-Moore theo- . James M, Shapiro A, Springer P, Zima H () Adaptive fault
rem prover and its interactive enhancement. Comput Math Appl tolerance for scalable cluster computing in space. IJHPCA ():
():– –
. Cai X, Lyu MR, Vouk MA () An experimental evaluation . Kalbarczyk ZT, Iyer RK, Bagchi S, Whisnant K () Chameleon:
F
of reliability features of N-version programming. In: Proceed- a software infrastructure for adaptive fault tolerance. IEEE Trans
ings of the IEEE international symposium on software reliability Parallel Distrib Syst ():–
engineering, pp – . Knight JC, Leveson NG () An experimental evaluation of the
. Cristian F () Exception handling and tolerance of software assumption of independence in multiversion programming. IEEE
faults. In: Lyu MR (ed) Software fault tolerance. Wiley, New York, Trans Softw Eng ():–
pp – . Koopman P () Ballista design and methodology, http://www.
. Crow J, Owre S, Rushby J, Shankar N, Srivas M () ece.cmu.edu/koopman/ballista/reports/desmthd.pdf
A tutorial introduction to PVS. http://www.csl.sri.com/papers/ . Lamport L () Proving the correctness of multiprocess pro-
wift-tutorial/ grams. IEEE Trans Softw Eng ():–
. Dijkstra EW () Co-operating sequential processes. In: . Laprie JC, Arlat J, Beounes C, Kanoun K () Definition and
Genuys F (ed) Programming languages. Academic, New York, analysis of hardware- and software-fault-tolerant architectures.
pp – Computer ():–
. Dijkstra EW () Notes on structured programming. In: Dahl . Lardner D () Babbages’s calculating engine. Edinburgh
OJ, Dijkstra EW, Hoare CAR (eds) Structured programming. Review, July . Reprinted. In: Morrison P, Morrison E (eds)
Academic, London, pp – Charles Babbage and his calculating engines. Dover, New York
. Dijkstra EW () A discipline of programming, nd edn. Pren- . Lyu MR (ed) () Software fault tolerance, vol . Wiley,
tice Hall, Englewood Cliffs Chichester
. Elnozahy ENM, Alvisi L, Wang Yi-Min, Johnson DB () A . McCluskey EJ, Mitra S () Fault tolerance. In: Tucker AB (ed)
survey of rollback recovery protocols in message-passing systems. Computer science handbook, nd edn. Chapman and Hall/CRC
ACM Comput Surv ():– Press, Chapter , London, UK
. Enderton HB () A mathematical introduction to logic, nd . McMillan KL () Symbolic model checking – an approach to
edn. Academic, San Diego the state explosion problem. Ph.D. Dissertation, Carnegie Mellon
. Kogge P et al () ExaScale computing study: technology University
challenges in achieving ExaScale systems. Technical report, . Nikora AP, Munson JC () Building high-quality software
DARPA information processing techniques O±ce (IPTO). http:// predictors. Softw Pract Exper ():–
www.notur.no/news/inthenews/files/exascale_final_report., . Randell B, Xu J () The evolution of the recovery block concept.
September  In: Lyu MR (ed) Software fault tolerance. Wiley, Hoboken, New
. Gargantini A, Heitmeyer CL () Using model checking to gen- Jersey, USA, pp –
erate tests from requirements specifications. In: Proceedings of . Schneider FB () On concurrent programming. Springer,
the joint th european software engineering conference and the New York
th ACM SIGSOFT symposium on the foundations of software . Shirvani PP () Fault-tolerant computing for radiation envi-
engineering, Toulouse, France, September  ronments. Technical report -. Ph.D. Thesis, Center for Reliable
. Gärtner FC () Fundamentals of fault-tolerant distributed Computing, Stanford University, Stanford, California , June
computing in asynchronous environments. ACM Comput Surv 
():– . Critical Software () csXCEPTION. http://www.criti-
. Goldberg D, Li M, Tao W, Tamir Y () The design and imple- calsoftware.com/products services/csxception/
mentation of a fault-tolerant cluster manager. Technical report . Some RR, Agrawal A, Kim WS, Callum L, Khanoyan G, Shamilian
CSD-, Computer Science Department, University of Cali- A () Fault injection experiment results in space-borne par-
fornia, Los Angeles, California, USA, October  allel application programs. In: Proceedings  IEEE aerospace
. Gunnels JA, Katz DS, Quintana-Ort ES, van de Geijn RA conference, Big Sky, MT, USA, March 
() Fault-tolerant high-performance matrix multiplication: . Some RR, Kim WS, Khanoyan G, Callum L, Agrawal A, Beahan J
theory and practice. In: Proceedings of international conference () A software-implemented fault injection methodology for
 F Fences

design and validation of system fault tolerance. In: Proceedings Discussion


of the  international conference on dependable systems and
networks (DSN), Goteborg, Sweden, pp –, July  Introduction
. Stott DT, Floering B, Kalbarczyk Z, Iyer RK () A frame- The discrete Fourier transform (DFT) is a ubiqui-
work for assessing dependability in distributed systems with tous tool in science and engineering including in
lightweight fault injectors. In: Proceedings of the th international digital signal processing, communication, and high-
computer performance and dependability symposium (IPDS’),
performance computing. Applications include spectral
IEEE Computer Society, Washington, DC, USA, pp –
. Gerard Tel () Introduction to distributed algorithms, nd analysis, image compression, interpolation, solving par-
edn. Cambridge University Press, Cambridge, UK tial differential equations, and many other tasks.
. Vouk MA () On back-to-back testing. In: Proceedings of the Given n real or complex inputs x , . . . , xn− , the DFT
third annual conference on computer assurance (COMPASS’), is defined as
pp –
. Wang L, Pattabiraman K, Kalbarczyk Z, Iyer RK () Model- yk = ∑ ω kℓ
n xℓ ,  ≤ k < n, ()
ing coordinated checkpointing for large-scale supercomputers. In: ≤ℓ<n
Proceedings of the  international conference on dependable √
systems and networks (DSN’) with ω n = exp(−πi/n), i = − . Stacking the
. Zima HP, Chapman BM () Supercompilers for parallel and xℓ and yk into vectors x = (x , . . . , xn− )T and y =
vector computers. ACM Press Frontier Series, New York (y , . . . , yn− )T yields the equivalent form of a matrix–
vector product:

y = DFTn x, n ]≤k,ℓ<n .
DFTn = [ω kℓ ()
Fences Computing the DFT by its definition () requires Θ(n )
many operations. The first fast Fourier transform algo-
Memory Models
rithm (FFT) by Cooley and Tukey in  reduced the
Synchronization
runtime to O(n log(n)) for two-powers n and marked
the advent of digital signal processing. (It was later dis-
covered that this FFT had already been derived and used
by Gauss in the nineteenth century but was largely for-
FFT (Fast Fourier Transform) gotten since then [].) Since then, FFTs have been the
topic of many publications and a wealth of different
Franz Franchetti , Markus Püschel algorithms exist. This includes O(n log(n)) algorithms

Carnegie Mellon University, Pittsburgh, PA, USA for any input size n, as well as numerous variants opti-

ETH Zurich, Zurich, Switzerland
mized for various computing platform and computation
requirements. The by far most commonly used DFT is
for two-power input sizes n, partly because these sizes
Synonyms permit the most efficient algorithms.
Fast algorithm for the discrete Fourier transform (DFT) The first FFT explicitly optimized for parallelism
was the Pease FFT published in . Since then special-
Definition ized FFT variants were developed with every new type
A fast Fourier transform (FFT) is an efficient algorithm of parallel computer. This includes FFTs for data flow
to compute the discrete Fourier transform (DFT) of an machines, vector computers, shared and distributed
input vector. Efficient means that the FFT computes memory multiprocessors, streaming and SIMD vec-
the DFT of an n-element vector in O(n log n) opera- tor architectures, digital signal processing (DSP) pro-
tions in contrast to the O(n ) operations required for cessors, field-programmable gate arrays (FPGAs), and
computing the DFT by definition. FFTs exist for any graphics processing units (GPUs). Just like Pease’s FFT,
vector length n and for real and higher-dimensional these parallel FFTs are mainly for two-powers n and
data. Parallel FFTs have been developed since the advent are adaptations of the same fundamental algorithm to
of parallel computing. structurally match the target platform.
FFT (Fast Fourier Transform) F 

On contemporary sequential and parallel machines Matrix formalism and parallelism. The n × n iden-
it has become very hard to obtain high-performance tity matrix is denoted with In , and the butterfly matrix
DFT implementations. Beyond the choice of a suit- is a DFT of size :
able FFT, many other implementation issues have to be
 
addressed. Up to the s, there were many public FFT DFT = [ ]. ()
 −
implementations and proprietary FFT libraries avail-
able. Due to the code complexity inherent to fast imple- The Kronecker product of matrices A and B is defined as
mentations and the fast advances in processor design,
today only a few competitive open source and vendor A ⊗ B = [ak,ℓ B], for A = [ak,ℓ ].
FFT libraries are available in the parallel computing
space. It replaces every entry ak,ℓ of A by the matrix ak,ℓ B. Most F
important for FFTs are the cases where A or B is the
FFTs: Representation identity matrix. As examples consider
Corresponding to the two different ways () and ()
⎡ ⎤
of representing the DFT, FFTs are represented either ⎢  ⎥
⎢ ⎥
⎢ ⎥
as sequences of summations or as factorizations of the ⎢ − ⎥
⎢ ⎥
transform matrix DFTn . The latter representation is ⎢ ⎥
⎢   ⎥
⎢ ⎥
adopted in Van Loan’s seminal book [] on FFTs and ⎢  − ⎥
⎢ ⎥
used in the following. To explain this representation, I ⊗ DFT = ⎢ ⎥,
⎢ ⎥
⎢   ⎥
assume as example that DFTn in () can be factored into ⎢ ⎥
⎢  − ⎥
four matrices ⎢ ⎥
⎢ ⎥
⎢  ⎥
DFTn = M M M M . ⎢ ⎥
() ⎢ ⎥
⎢  −⎥
⎣ ⎦
Then () can be computed in four steps as
⎡ ⎤
t = M x, u = M t, v = M u, y = M v. ⎢  ⎥
⎢ ⎥
⎢ ⎥
⎢   ⎥
If the matrices Mi are sufficiently sparse (have many ⎢ ⎥
⎢ ⎥
⎢   ⎥
zero entries) the operations count compared to a direct ⎢ ⎥
⎢ ⎥
computation is decreased and () is called an FFT. For ⎢  ⎥
DFT ⊗I = ⎢

⎥,

example, DFT can be factorized as ⎢ − ⎥
⎢ ⎥
⎢ ⎥
⎡ ⎤⎡ ⎤⎡ ⎤⎡ ⎤ ⎢  − ⎥
⎢  ⎥ ⎢ ⎥ ⎢  ⎥ ⎢ ⎥ ⎢ ⎥
⎢ ⎥ ⎢ ⎥⎢ ⎥⎢ ⎥ ⎢ ⎥
⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥ ⎢ − ⎥
⎢  ⎥ ⎢ ⎥ ⎢
⎥ ⎢  ⎥ ⎢ − ⎥ ⎢ ⎥ ⎢  ⎥
DFT = ⎢ ⎥ ⎢  ⎥, () ⎢


⎢ ⎥⎢ ⎥⎢
⎢ − ⎥ ⎢  ⎥ ⎢
⎥⎢ ⎥
⎥ ⎣ −⎥

⎢ ⎥⎢ ⎥⎢  ⎥ ⎢
⎥⎢  ⎥

⎢ ⎥⎢ ⎥⎢ ⎥⎢ ⎥
⎢  −⎥ ⎢ i⎥ ⎢  −⎥ ⎢ ⎥
⎣ ⎦⎣ ⎦⎣ ⎦⎣ ⎦ with the corresponding dataflows shown in Fig. . Note
where omitted values are zero. This example also that the dataflows are from right to left to match the
demonstrates why the matrix-vector multiplications in order of computation in (). I ⊗ DFT clearly expresses
() are not performed using a generic sparse linear alge- block parallelism: four butterflies computing on con-
bra library, but, since the Mi are known and fixed, by a tiguous subvectors; whereas DFT ⊗I expresses vector
specialized program. parallelism: four butterflies operating on interleaved
Conversely, every FFT can be written as in () (with subvectors which is the same as one vector butterfly
varying numbers of factors). The matrices Mi in FFTs operating on vectors of length four as emphasized in
are not only sparse but also structured, as a glimpse on Fig. (b). More precisely, consider the code for DFT
() illustrates. This structure can be efficiently expressed (i.e., y = DFT x):
using a formalism based on matrix algebra and also y[0] = x[0] + x[1];
clearly expresses the parallelism inherent to an FFT. y[1] = x[0] - x[1];
 F FFT (Fast Fourier Transform)

Then code for DFT ⊗I is obtained by replacing matrix, stored in row-major order, then Lmn m performs
every scalar operation by a four-way vector operation: a transposition of this matrix. Further, if P is a permu-
y[0:3] = x[0:3] + x[4:7]; tation (matrix), then AP = P− AP is the conjugation of
y[4:7] = x[0:3] - x[4:7]; A with P.
Cooley–Tukey FFT. The fundamental algorithm at
Here, x[a:b] denotes (Matlab or FORTRAN style)
the core of the most important parallel FFTs derived
the subvector of x starting at a and ending at b. These
in the literature is the general-radix decimation-in-time
examples illustrate how the tensor product captures
Cooley–Tukey type FFT expressed as
parallelism. To summarize:
block parallelism (n blocks): In ⊗ A, () DFTn = (DFTk ⊗Im )Tm
n
(Ik ⊗ DFTm )Lnk , n = km.
vector parallelism (n-way): A ⊗ In , () ()

where A is any matrix. n


Here, k is called the radix and Tm is a diagonal matrix
The stride permutation matrix Lmn
m permutes the ele- containing the twiddle factors. The algorithm factors the
ments of the input vector as in + j ↦ jm + i,  ≤ i < DFT into four factors as in (), which shows the spe-
m,  ≤ j < n. If the vector x is viewed as an n × m cial case n =  =  × . Two of the four factors in ()
contain smaller DFTs; hence the algorithm is divide-
and-conquer and has to be applied recursively. At each
y x y x step of the recursion the radix is a degree of freedom.
For two-power sizes n =  ℓ , () is sufficient to recurse
up to n = , which is computed by definition ().
Figure  shows the special case  =  ×  as
matrix factorization and as corresponding data-flow
graph (again to be read from right to left). The smaller
DFTs are represented as blocks with different shades
of gray.
A straightforward implementation of () suggests
a y = (I4 ⊗ DFT2)x b y = (DFT2 ⊗ I4)x four steps corresponding to the four factors, where two
steps call smaller DFTs. However, to improve locality,
FFT (Fast Fourier Transform). Fig.  Dataflow (right to left) the initial permutation Lnk is usually not performed but
of a block parallel and its “dual” vector parallel construct interpreted as data access for the subsequent compu-
n
(figure from []) tation, and the twiddle diagonal Tm is fused with the

y x

DFT4 ⊗ I4 T 16
4 I4 ⊗ DFT4 L16
4

DFT16 =

a Matrix factorization b Data-flow graph

FFT (Fast Fourier Transform). Fig.  Cooley–Tukey FFT () for  =  ×  as matrix factorization and as (complex) data-flow
graph (from right to left). Some lines are bold to emphasize the strided access (figure from [])
FFT (Fast Fourier Transform) F 

subsequent DFTs. This strategy is chosen, for example, FFT (Fast Fourier Transform). Table  Formula identities
in the library FFTW .x and the code can be sketched to manipulate FFTs. A is n × n, and B and C are m × m. A⊺ is
as follows: the transpose of A
(BC)⊺ = C ⊺ B⊺
void dft(int n, complex *y, complex *x) {
(A ⊗ B)⊺ = A⊺ ⊗ B⊺
int k = choose_factor(n);
// t1 = (I_k tensor DFT_m)L(n,k)*x Imn = Im ⊗ In
for(int i=0; i < k; ++i) A⊗B = (A ⊗ Im )(In ⊗ B)
dft_iostride(m, k, 1, t1 + m*i, x + i);
In ⊗ (BC) = (In ⊗ B)(In ⊗ C)
// y = (DFT_k tensor I_m)diag(d(j))*t1
(BC) ⊗ In = (B ⊗ In )(C ⊗ In )
for(int i=0; i < m; ++i)
A⊗B = n (B ⊗ A)Lm
Lmn mn
dft_scaled(k, m, precomp_d[i], y + i,
−
F
t1 + i); (Lmn
m ) = Lmn
n
} (Lkn
Lkmn n ⊗ Im )(Ik ⊗ Ln )
mn
n =
// DFT variants needed
Lkmn
km = (Ik ⊗ Lmn
m )(Lk ⊗ Im )
kn
void dft_iostride(int n, int istride,
int ostride, complex *y, complex *x); Lkmn
k = Lkmn kmn
km Lkn
void dft_scaled(int n, int stride,
complex *d, complex *y, complex *x);

The DFT variants needed for the smaller DFTs are Iterative FFTs
implemented similarly based on (). There are many The historically first FFTs that were developed and
additional issues in implementing () to run fast on a adapted to parallel platforms are iterative FFTs. These
nonparallel platform. The focus here is on mapping () algorithms implement the DFT as a sequence of nested
to parallel platforms for two-power sizes n. loops (usually three). The simplest are radix-r forms
(usually r = , , ), which require an FFT size of
n = r ℓ ; more complicated mixed-radix radix variants
Parallel FFTs: Basic Idea always exist. They all factor DFTn into a product of ℓ
The occurrence of tensor products in () shows that matrices, each of which consists of a tensor product
the algorithm has inherent block and vector paral- and twiddle factors. Iterative algorithms are obtained
lelism as explained in () and (). However, depending from () by recursive expansion, flattening the nested
on the platform and for efficient mapping, the algo- parentheses, and other identities in Table .
rithm should exhibit one or both forms of parallelism The most important iterative FFTs are discussed
throughout the computation to the extent possible. To next, starting with the standard version, which is not
achieve this, () can be formally manipulated using optimized for parallelism but included for complete-
well-known matrix identities shown in Table . ness. Note that the exact form of the twiddle factors
The table makes clear that there is a virtually unlim- differs in these FFTs, even though they are denoted with
ited set of possible variants of (), which also explains the same symbol.
the large set of publications on FFTs. These variants Cooley–Tukey iterative FFT. The radix-r iterative
hardly differ in operations count but in structure, which decimation-in-time FFT
is crucial for parallelization. The remainder of this entry ℓ− ℓ ℓ
introduces the most important parallel FFTs derived in DFTr ℓ = (∏(Iri ⊗ DFTr ⊗Ir ℓ−i− ) Dri ) Rrr , ()
i=
the literature. All these FFTs can be derived from ()
using Table . The presentation is divided into itera- is the prototypical FFT algorithm and shown in Fig. .

tive and recursive FFTs. Each FFT is visualized for size Rrr is the radix-r digit reversal permutation and the

n =  in a form similar to () (and again from right diagonal Dri contains the twiddle factors in the ith
to left) to emphasize block and vector parallelism. In stage. The radix- version is implemented by Numer-
these visualizations, the twiddle factors are dropped ical Recipes using a triple loop corresponding to the
since they do not affect the dataflow and hence pose no two tensor products (inner two loops) and the product
structural problem for parallelization. (outer loop).
 F FFT (Fast Fourier Transform)

Stage 3 Stage 2 Stage1 Stage 0 Bit reversal

((I1 ⊗ DFT2 ⊗ I8) D16


0 ) ((I2 ⊗ DFT2 ⊗ I4) D1 ) ((I4 ⊗ DFT2 ⊗ I2) D2 ) ((I8 ⊗ DFT2 ⊗ I1) D3 ) R2
16 16 16 16

FFT (Fast Fourier Transform). Fig.  Iterative FFT () for n =  and r = 

Formal transposition of () yields the iterative the Pease FFT also requires the digit reversal permu-
decimation-in-frequency FFT: tation. Each stage of the Pease algorithm consists of
ℓ− the twiddle diagonal and a parallel butterfly block, fol-
ℓ ℓ
DFTrℓ = Rrr ∏ Dri (Ir ℓ−i− ⊗ DFTr ⊗Iri ) . () lowed by the same data exchange across parallel blocks
i= specified through a stride permutation. The Pease FFT
Both () and () contain the bit reversal permutation was originally developed for parallel computers, and

Rrr . The parallel and vector structure of the occurring its regular structure makes it a good choice for field-
butterflies depends on the stage. Thus, even though programmable gate arrays (FPGAs) or ASICs. Formal
every stage is data parallel, the algorithm is neither transposition of () yields a variant with the bit-reversal
well suited for machines that require block parallelism in the end.
nor vector parallelism. For this reason, very few paral- Korn–Lambiotte FFT. The Korn–Lambiotte FFT is
lel triple-loop implementations exist and compiler par- given by
allelization and vectorization tend not to succeed in ℓ−
ℓ ℓ ℓ
producing any speedup when targeting the triple-loop DFTrℓ = Rrr (∏ Lrr ℓ− Dri (DFTr ⊗Ir ℓ− )) , ()
algorithm. i=

Pease FFT. A variant of () is the Pease FFT and is the algorithm that is dual to the Pease FFT in
ℓ− the sense used in Fig. . Namely, it has also constant
ℓ ℓ ℓ
DFTr ℓ = (∏ Lrr (Ir ℓ− ⊗ DFTr ) Dri ) Rrr , () geometry, but maximizes vector parallelism as shown
i= in Fig.  for r = . Each stage contains one vector
shown in Fig.  for r = . It has constant geometry, that butterfly operating on vectors of length n/r, and a twid-
is, the control flow is the same in each stage, and maxi- dle diagonal. As last step it performs the digit reversal
mizes block parallelism by reducing the block sizes to permutation. The Korn–Lambiotte FFT was developed
r on which single butterflies are computed. However, for early vector computers. It is derived from the Pease
FFT (Fast Fourier Transform) F 

Stage 3 Stage 2 Stage 1 Stage 0 Bit reversal


Comm Parallel Comm Parallel Comm Parallel Comm Parallel communication

2 (I8 ⊗ DFT2) D0 ) (L2 (I8 ⊗ DFT2) D1 ) (L2 (I8 ⊗ DFT2) D2 ) (L2 (I8 ⊗ DFT2) D3 ) R2
(L16 16 16 16 16 16 16 16 16

FFT (Fast Fourier Transform). Fig.  Pease FFT in () for n =  and r = 

Bit reversal Stage 3 Stage 2 Stage 1 Stage 0


Communication Comm Vbutterfly Comm Vbutterfly Comm Vbutterfly Comm Vbutterfly

2 (L8 D0 (DFT2 ⊗ I8)) (L8 D1 (DFT2 ⊗ I8)) (L8 D2 (DFT2 ⊗ I8)) (L8 D3 (DFT2 ⊗ I8))
R16 16 16 16 16 16 16 16 16

FFT (Fast Fourier Transform). Fig.  Korn–Lambiotte FFT in () for n =  and r = . Vbutterfly = vector butterfly
 F FFT (Fast Fourier Transform)

algorithm through formal transposition followed by the Stockham FFT. The formal transposition of () is also
translation of the tensor product from a parallel into a called Stockham FFT.
vector form.
Stockham FFT. The Stockham FFT Recursive FFT Algorithms
The second class of Cooley–Tukey-based FFTs are
ℓ−
DFTr ℓ = rℓ r ℓ−i
∏(DFTr ⊗Ir ℓ− ) Di (Lr ⊗ Iri ) , () recursive algorithms, which reduce a DFT of size n =
i= km into k DFTs of size m and m DFTs of size k. The
advantage of recursive FFTs is better locality and hence
is self-sorting, that is, it does not have a digit reversal better performance on computers with deep memory
permutation. It is shown in Fig.  for r = . Like the hierarchies. They also can be used as kernels for iter-
Korn–Lambiotte FFT, it exhibits maximal vector paral- ative algorithms. For parallelism, recursive algorithms
lelism but the permutations change across stages. Each are derived, for example, to maximize the block size for
of these permutations is a vector permutation, but the multicore platforms, or to obtain vector parallelism for
vector length increases by a factor of r in each stage a fixed vector length for platforms with SIMD vector
(starting with ). Thus, for most stages a sufficiently extensions. The most important recursive algorithms
long vector length is achieved. The Stockham FFT was are discussed next.
originally developed for vector computers. Its struc- Recursive Cooley–Tukey FFT. The recursive,
ture is also suitable for graphics processors (GPUs), and general-radix decimation-in-time Cooley–Tukey FFT
indeed most current GPU FFT libraries are based on the was shown before in (). Typically, k is chosen to be

Stage 3 Stage 2 Stage 1 Stage 0


Vbutterfly VShuffle Vbutterfly VShuffle Vbutterfly VShuffle Vbutterfly VShuffle

((DFT2 ⊗ I8) D16


0 (L2 ⊗ I8)) ((DFT2 ⊗ I8) D1 (L2 ⊗ I4)) ((DFT2 ⊗ I8) D2 (L2 ⊗ I2)) ((DFT2 ⊗ I8) D3 (L2 ⊗ I1))
2 16 4 16 8 16 16

FFT (Fast Fourier Transform). Fig.  Stockham FFT in () for n =  and r = . Vbutterfly = vector butterfly, VShuffle =
vector shuffle
FFT (Fast Fourier Transform) F 

Recursive FFT Accumulated


shuffles

(DFT2 ⊗ I8) T 16
8 (I2 ⊗ ((DFT2 ⊗ I4) T 4 (I2 ⊗ ((DFT2 ⊗ I2) T 2 (I2 ⊗ DFT2) L2)) L2)) L2
8 4 4 8 16

FFT (Fast Fourier Transform). Fig.  Recursive radix- decimation-in-time FFT for n = 

small, with values up to . If () is applied to n = r ℓ recursion level the working set will be small enough to
recursively with k = r the algorithm is called radix-r fit into a certain cache level, a property sometimes called
decimation-in time FFT. As explained before, the initial cache oblivious. Both () and () contain both vec-
permutation is usually not performed but propagated as tor and parallel blocks and stride permutations. Thus,
data access into the smaller DFTs. For radix- the algo- despite their inherent data parallelism, they are not
rithm is shown in Fig. . Note that the dataflow is equal ideal for either parallel or vector implementations. The
to Fig. , but the order of computation is different as following variants address this problem.
emphasized by the shading. Four-step FFT. The four-step FFT is given by
Formal transposition of () yields the recursive
decimation-in-frequency FFT DFTn = (DFTk ⊗Im )Tm Lk (DFTm ⊗Ik ),
n n
n = km,
()
DFTn = Lnm (Ik ⊗ DFTm )Tm
n
(DFTk ⊗Im ), n = km.
and shown in Fig. . It is built from two stages of vector
()
FFTs, the twiddle diagonal and a transposition. Typi-

Recursive application of () and () eventually leads cally, k, m ≈ n is chosen (also called “square root
to prime sizes k and m, which are handled by a special decomposition”). Then, () results in the longest possi-
prime-size FFT. For two-powers n the butterfly matrix ble vector operations except for the stride permutation
DFT terminates the recursion. in the middle.
The implementation of () and () is more involved The four-step FFT was originally developed for
than the implementation of iterative algorithms, in vector computers and the stride permutation (or
particular in the mixed-radix case. The divide-and- transposition) was originally implemented explicitly
conquer nature of () and () makes them good choices while the smaller FFTs were expanded with some
for machines with memory hierarchies, as at some other FFT—typically iterative. The transposition can
 F FFT (Fast Fourier Transform)

Vector FFT Shuffe Vector FFT

(((DFT2 ⊗ I2) T 42 (I2 ⊗ DFT2) L42) ⊗ I4) L16


4 T 4 (((DFT2 ⊗ I2) T 2 (I2 ⊗ DFT2) L2) ⊗ I4)
16 4 4


FFT (Fast Fourier Transform). Fig.  Four-step FFT for n =  and k = m = n =

be implemented efficiently using blocking techniques. multiple memory spaces and require explicit data move-
Equation () can be a good choice on parallel machines ment, like message passing, off-loading to accelerators
that execute operations on long vectors well and on (GPUs and FPGAs), and out-of-core computation.
which the overhead of a transposition is not too high. Multicore FFT. The multicore FFT for a platform
Examples includes vector computers and machines with with p cores and cache block size μ is given by
streaming memory access like GPUs.
Six-step FFT. The six-step FFT is given by kp
((L p ⊗Im/pμ )⊗I μ )
DFTn = (Ip ⊗ (DFTk ⊗Im/p )) n
Tm
DFTn = Lnk (Im ⊗ DFTk )Lnm Tm
n
(Ik ⊗ DFTm )Lnk , n = km, n/p
× (Ip ⊗ (Ik/p ⊗ DFTm )Lk/p )
()
pm
× ((Lp ⊗ Ik/pμ ) ⊗ I μ ) , n = km, ()
and shown in Fig. . It is built from two stages of
parallel butterfly blocks, the twiddle diagonal, and
three global transpositions (all-to-all data exchanges). and is a version of () that is optimized for homoge-
() was originally developed for distributed mem- neous multicore CPUs with memory hierarchies. An
ory machines and out-of-core computation. Typically, example is shown in Fig. . () follows the recur-

k, m ≈ n is chosen to maximize parallelism. The sive FFT () closely but ensures that all data exchanges
transposition was originally implemented explicitly as between cores and all memory accesses are performed
all-to-all communication while the smaller FFTs were with cache block granularity. For a multicore with cache
expanded with some other FFT algorithm—typically block size μ and p cores, () is built solely from per-
iterative. As in (), the required matrix transposition mutations that permute entire cache lines and p-way
can be blocked for more efficient data movement. () parallel compute blocks. This property allows for par-
can be a good choice on parallel machines that have allelization of small problem sizes across a moderate
FFT (Fast Fourier Transform) F 

Communication Parallel DFTs Communication Parallel DFTs Communication

4 (I4 ⊗ ((DFT2 ⊗ I2) T 2 (I2 ⊗ DFT2) L2)) L4 T 4 (I4 ⊗ ((DFT2 ⊗ I2) T 2 (I2 ⊗ DFT2) L2)) L4
L16 4 4 16 16 4 4 16


FFT (Fast Fourier Transform). Fig.  Six-step FFT for n =  and k = m = n =

Block exchange Parallel DFTs Block exchange Parallel DFTs Block exchange

(L84 ⊗ I2) (I2 ⊗ ((DFT2 ⊗ I2) T 42 (I2 ⊗ DFT2) L42) ⊗ I2) (L82 ⊗ I2) T 16
4 (I2 ⊗ (I2 ⊗ (DFT2 ⊗ I2) T 2 (I2 ⊗ DFT2)) R 2) (L2 ⊗ I2)
4 8 8

FFT (Fast Fourier Transform). Fig.  Multicore FFT for n =  , k = m = , p =  cores, and cache block size μ = 
 F FFT (Fast Fourier Transform)

Vector FFTs In-Register Shuffes Vector FFTs Vector Shuffe

(((DFT2 ⊗ I2) T 42 (I2 ⊗ DFT2) L42) ⊗ I2) T 16


4 (I2 ⊗ (L 2
4 ⊗ I2) (I2 ⊗ L 42) (((DFT2 ⊗ I2) T 42 (I2 ⊗ DFT2) L42) ⊗ I2) (L 82 ⊗ I2)

FFT (Fast Fourier Transform). Fig.  Short vector FFT in () for n =  , k = m = , and (complex) vector length ν = 


number of cores. Implementation of () on a cache- ν-way vectorized is the stride permutations Lνν , which
based system relies on the cache coherency proto- can be implemented efficiently using in-register shuffle
col to transmit cache lines of length μ between cores instructions. () requires the support or implementa-
and requires a global barrier. Implementation on a tion of complex vector arithmetic and packs ν complex
scratchpad-based system requires explicit sending and elements into a machine vector register of width ν. A
receiving of the data packets, and depending on the variant that vectorizes the real rather than the complex
communication interface additional synchronization dataflow exists.
may be required. Vector recursion. The vector recursion performs a
The smaller DFTs in () can be expanded, for exam- locality optimization for deep memory hierarchies for
ple, with the short vector FFT (discussed next) to opti- the first stage (I k ⊗ DFTm )Lnk of (). Namely, in this
mize for vector extensions. stage DFTm is further expanded using again () with
SIMD short vector FFT. For CPUs with SIMD m = m m and the resulting expression is manipulated
ν-way vector extensions like SSE and AltiVec and a to yield
memory hierarchy, the short vector FFT is defined as
(Ik ⊗ DFTm )Lnk = (Ik ⊗ (DFTm  ⊗Im  )Tm
m

)
DFTn = ((DFTk ⊗Im/ν ) ⊗ Iν ) Tm
n
(Ik/ν ⊗ (Lm
ν ⊗ Iν )

× (Lkm
k

⊗ Im  ) (Im ⊗ (Ik ⊗ DFTm  ) Lkm
k )

n/ν
× (Im/ν ⊗ Lνν )(DFTm ⊗Iν )) (Lk/ν ⊗ Iν ) , n = km,
m  ⊗ Ik ) .
× (Lm ()
()

and can be implemented using solely vector arithmetic, While the recursive FFT () ensures that the working
aligned vector memory access, and a small number set will eventually fit into any level of cache, large
of vector shuffle operations. An example is shown in two-power FFTs induce large -power strides. For
Fig. . All compute operations in () have complex caches with lower associativity these strides result in
ν-way vector parallelism. The only operation that is not a high number of conflict misses, which may impose
FFT (Fast Fourier Transform) F 

Vector stage Vector stage Vector shuffle Recursive FFTs Vector shuffle

8 (I2 ⊗ (DFT2 ⊗ I4) T 4) (L2 ⊗ I4) (I2 ⊗ ((DFT2 ⊗ I2) T 2 (I2 ⊗ DFT2))) R 2) (L2 ⊗ I2)
(DFT2 ⊗ I8) T 16 8 4 4 8 8

FFT (Fast Fourier Transform). Fig.  Vector recursive FFT for n =  . The vector recursion is applied once and yields
vector shuffles, two recursive FFTs, and two iterative vector stages

a severe performance penalty. For large enough two- prime), and Bluestein or Winograd (any n). In practice,
power sizes, in the first stage of () every single load these are mostly used for small sizes < , which then
will result in a cache miss. The vector recursion allevi- serve as building blocks for large composite sizes via ().
ates this problem by replacing the stride permutation The exception is Bluestein’s algorithm that is often used
in () by stride permutations of vectors, at the expense to compute large sizes with large prime factors or large
of an extra pass through the working set. Since () prime numbers.
matches (Ik ⊗ DFTn ) Lkn k , it is recursively applicable DFT variants and other FFTs. In practice, sev-
and will eventually produce child problems that fit into eral variants of the DFT in () are needed including
any cache level. The vector recursion produces algo- forward/inverse, interleaved/split complex format, for
rithms that are a mix of iterative and recursive as shown complex/real input data, in-place/out-of-place, and oth-
in Fig. . ers. Fortunately, most of these variants are close to the
standard DFT in (), so fast code for the latter can be
adapted. An exception is the DFT for real input data,
Other FFT Topics which has its own class of FFTs.
So far the discussion has focused on one-dimensional Multidimensional FFT algorithms. The Kronecker
complex two-power size FFTs. Some extensions are product naturally arises in D and D DFTs, which
mentioned next. respectively can be written as
General size recursive FFT algorithms. DFT algo-
rithms fundamentally different from () include prime- DFTm×n = DFTm ⊗ DFTn , ()
factor (n is a product of coprime factors), Rader (n is DFTk×m×n = DFTk ⊗ DFTm ⊗ DFTn . ()
 F FFT (Fast Fourier Transform)

For a D DFT, applying identities from Table  to Bibliography


() yields the row-column algorithm . Bailey DH () FFTs in external or hierarchical memory. J.
Supercomput :–
DFTm×n = (DFTm ⊗In )(Im ⊗ DFTn ). () . Cooley JW, Tukey JW () An algorithm for the machine calcu-
lation of complex Fourier series. Math Comput :–
The D vector-radix algorithm can also be derived with
. Franchetti F, Püschel M () Short vector code generation
identities from Table  from (): for the discrete Fourier transform. In: Proceedings of the th
I ⊗L rn International Symposium on Parallel and Distributed Processing,
r ⊗Is
DFTmn×rs = (DFTm×r ⊗Ins ) m (Tnmn ⊗ Tsrs ) pp –, Washington, DC, USA
I ⊗L rn
r ⊗Is . Franchetti F, Voronenko Y, Püschel M () FFT program gener-
× (Imr ⊗ DFTn×s ) m (Lmn
m ⊗ Lr ) .
rs
()
ation for shared memory: SMP and multicore. In: Proceedings of
Higher-dimensional versions are derived similarly, and the  ACM/IEEE conference on Supercomputing, New York,
the associativity of ⊗ gives rise to more variants. NY, USA
. Franchetti F, Püschel M, Voronenko Y, Chellappa S, Moura JMF
() Discrete Fourier transform on multicore. IEEE Signal
Related Entries Proc Mag, special issue on “Signal Processing on Platforms with
ATLAS (Automatically Tuned Linear Algebra Soft- Multiple Cores” ():–
ware) . Frigo M, Johnson SG () FFTW: An adaptive software archi-
tecture for the FFT. In: Proceedings of the IEEE International
FFTW
Conference on Acoustics, Speech, and Signal Processing, vol ,
Spiral pp –, Seattle, WA
. Frigo M, Johnson SG () The design and implementation
Bibliographic Notes and Further of FFTW. Proc IEEE special issue on “Program Generation,
Optimization, and Adaptation” ():–
Reading . Harris DB, McClellan JH, Chan DSK, Schuessler HW () Vec-
The original Cooley–Tukey FFT algorithm can be found tor radix fast Fourier transform. In: Proc Inter Conf Acoustics,
in []. The Pease FFT in [] is the first FFT derived and Speech, and Signal Processing. Conference Proceedings (ICASSP
represented using the Kronecker product formalism. ’), pp –, Los Alamitos, IEEE Comput. Soc. Press
The other parallel FFTs were derived in [] (Korn– . Heidemann MT, Johnson DH, Burrus CS () Gauss and
the history of the fast fourier transform. Arch Hist Exact Sci
Lambiotte FFT), [] (Stockham FFT), [] (four-step
:–
FFT), [] (six-step FFT). The vector-radix FFT algo- . Korn DG, Lambiotte JJ, Jr. () Computing the fast Fourier
rithm can be found in [], the vector recursion in [], transform on a vector computer. Math Comput ():
the short vector FFT in [], and the multicore FFT in []. –
A good overview on FFTs including the classical parallel . Norton A, Silberger AJ () Parallelization and performance
variants is given in Van Loan’s book [] and the book by analysis of the Cooley-Tukey FFT algorithm for shared-memory
architectures. IEEE Trans Comput ():–
Tolimieri, An and Lu []; both are based on the formal-
. Nussbaumer HJ () Fast Fourier transformation and convolu-
ism used here. Also excellent is Nussbaumer FFT book tion algorithms, nd ed. Springer, New York
[]. An overview on real FFTs can be found in []. . Pease MC () An adaptation of the fast Fourier transform for
At the point of writing the most important fast parallel processing. J ACM ():–
DFT libraries are FFTW by Frigo and Johnson . Press WH, Flannery BP, Teukolsky SA, Vetterling WT ()
[, ], Intel’s MKL and IPP, and IBM’s ESSL and PESSL. Numerical recipes in C: the art of scientific computing, nd ed.
Cambridge University Press, Cambridge
FFTE is currently used in the HPC Challenge as Global . Püschel M, Moura JMF, Johnson J, Padua D, Veloso M, Singer BW,
FFT benchmark reference implementation. Most CPU, Xiong J, Franchetti F, Gačić A, Voronenko Y, Chen K, Johnson RW,
GPU, and FPGA vendors maintain DFT libraries. Some Rizzolo N () SPIRAL: Code generation for DSP transforms.
historic DFT libraries like FFTPACK are still widely Proc IEEE special issue on “Program Generation, Optimization,
used. Numerical Recipes [] provides C code for an and Adaptation” ():–
. Schwarztrauber PN () Multiprocessor FFTs. Parallel Comput
iterative radix- FFT implementation. BenchFFT pro-
:–
vides up-to-date FFT benchmarks of about  single- . Tolimieri R, An M, Lu C () Algorithms for discrete Fourier
node DFT libraries. The Spiral system is capable of gen- transforms and convolution, nd ed. Springer, New York
erating parallel DFT libraries directly from the tensor . Van Loan C () Computational framework of the fast Fourier
product-based algorithm description [, ]. transform. SIAM, Philadelphia, PA, USA
File Systems F 

. Voronenko Y, de Mesmay F, Püschel M () Computer gen-


eration of general size linear transform libraries. In: Proceedings File Systems
of the th annual IEEE/ACM International Symposium on Code
Generation and Optimization, pp –, Washington, DC, USA Robert B. Ross
. Voronenko Y, Püschel M () Algebraic signal processing the- Argonne National Laboratory, Argonne, IL, USA
ory: Cooley-Tukey type algorithms for real DFTs. IEEE Trans
Signal Proc ():–

Synonyms
Cluster file systems
Fast Algorithm for the Discrete
Fourier Transform (DFT) F
Definition
FFT (Fast Fourier Transform) A parallel file system (PFS) is a system software com-
ponent that organizes many disks, servers, and network
links to provide a file system name space that is acces-
sible from many clients; distributes data across many
FFTW devices to enable high aggregate bandwidth; and coor-
dinates changes to file system data so that clients’ views
of the data are kept coherent.
FFTW is a C library for computing the Discrete Fourier
Transform (DFT). It was originally developed at MIT
by Matteo Frigo and Steven G. Johnson. FFTW is an Discussion
example of autotuning in two ways. First, the library
is adaptive by allowing for different recursion strate-
Introduction
File systems are the traditional mechanism for read-
gies. The best one is chosen at the time of use by a
ing and writing data on persistent storage. The cur-
feedback-driven search. Second, the recursion is ter-
rent standard interface for file systems, part of the
minated by small optimized kernels (called codelets)
Portable Operating System Interface (POSIX) standard,
that are generated by a special purpose compiler. The
was first defined in  []. As networking of comput-
library continues to be maintained by the developers
ers became common, sharing storage between multiple
and is widely used due to its excellent performance.
computers became desirable. In  [], the Network
The name FFTW stands for the somewhat whimsical
File System (NFS) standard was created, allowing a file
“Fastest Fourier Transform in the West.”
system stored locally on one computer to be accessed
by other computers. With the advent of NFS, shared file
Related Entries name spaces were possible – many computers could see
Autotuning and access the same files at the same time.
FFT (Fast Fourier Transform) Figures a and b show a high-level view of a local and
Spiral network file system, respectively. In the local file system
case, a single software component manages local storage
Bibliography resources (typically one or more disks) and allows pro-
. Frigo M, Johnson SG () FFTW: An adaptive software archi- cesses running on that computer to store and retrieve
tecture for the FFT. In: Proceedings IEEE international conference data organized as files and directories. In the network
acoustics, speech, and signal processing (ICASSP), , pp – file system case, the software allowing access to stor-
. Frigo M () A fast Fourier transform compiler. In Proceed- age is split from the software managing the storage.
ings of the ACM SIGPLAN  conference on programming
The file system client presents the traditional view of
language design and implementation (PLDI ‘). ACM, New York,
pp –
files and directories to processes on a computer, while
. Frigo M, Johnson SG () The design and implementation of the file system server software manages the storage and
FFTW. Proceedings of the IEEE ():– coordinates access from multiple clients.
 F File Systems

Application Application Application and technologies are used in PFS deployments, from
storage area networks and InfiniBand to Gigabit Ether-
File system File system net and TCP/IP.
client client
File system This discussion focuses on parallel file system
software
deployments where processes accessing the file system
Interconnection network are on compute nodes distinct from the nodes that man-
age storage of data (storage nodes). This model is the
most common in high-performance computing (HPC)
File system
a Local file system server deployments, because (a) there are typically many more
compute nodes than needed to provide storage services,
(b) placing disks in all compute nodes has an impact on
b Networked file system their reliability and power consumption, (c) servicing
I/O operations for some other process on a node exe-
Parallel application Parallel application Parallel application cuting an HPC application can perturb performance of
that application, and (d) keeping storage nodes distinct
Parallel FS Parallel FS Parallel FS enables higher-reliability components to be used and
client client client
helps colocate storage resources in the machine room.
That said, Beowulf-style commodity clusters and
Interconnection network clusters used for Data Intensive Scalable Computing
(DISC) applications are typically configured with local
drives, and organizing those drives with a parallel file
Parallel FS Parallel FS
server server system can be an effective way to provide a low-cost,
high-performance, cluster-wide storage resource.

c Parallel file system Data Distribution


A number of approaches are used for managing data
File Systems. Fig.  Local, network, and parallel file distribution in parallel file systems. Figures a and b
systems show a simple example file system name space and a
distribution of that name space onto multiple storage
As distributed-memory parallel computers became nodes, respectively. In the name space, two HPC appli-
popular, the utility of a shared file name space became cations have generated three files. A fusion code has
obvious as a way for parallel programs to store and written a large checkpoint, while a DISC application is
retrieve data without concern for the actual location analyzing a pair of relatively small images from obser-
of files and file data. However, accessing data stored vations of the night time sky. The checkpoint has been
on a single server quickly became a bottleneck, driving split into a set of blocks distributed across multiple stor-
the implementation of parallel file systems – file sys- age nodes, while the data for each of the smaller images
tems that opened up many concurrent paths between resides on a single server. In addition to distributing the
computers and persistent storage. file data across storage nodes, in this example the file
Figure c depicts a parallel file system. With multi- system metadata has also been distributed, as depicted
ple network links, servers, and local storage units, high by directory data residing on different storage nodes.
degrees of concurrent access are possible. This enables File system metadata is the data that represents the
very high aggregate I/O rates for the parallel file sys- directory tree as well as information such as the owners
tem as a whole when it is accessed by many processes and permissions of files.
on different nodes in the system, especially when those File data is typically divided into fixed-size units
accesses are relatively evenly distributed across the stor- called stripe units and distributed in a round-robin man-
age servers. A wide variety of network configurations ner over a fixed set of servers. This approach is similar
File Systems. Fig.  Data distribution in a parallel file system (PFS): (a) parallel file system name space; (b) distribution of data across parallel file system servers

File data is typically divided into fixed-size units called stripe units and distributed in a round-robin manner over a fixed set of servers. This approach is similar to the way in which data is distributed over disks in a disk array. How PFSes keep track of where data is stored will be covered in the section on Metadata Storage and Access.

In a configuration such as the one shown here, the file system deployment is said to be symmetric, meaning that metadata and data are both distributed and stored on the same storage nodes. Not all parallel file systems operate this way, and those that keep metadata on separate servers are considered asymmetric. There are advantages to both approaches. The symmetric approach provides a high degree of concurrency for metadata operations, which can be important for applications that generate and process many files. On the other hand, storing metadata on a smaller set of nodes leads to greater locality of access for metadata and simplifies fault tolerance for the name space. Both approaches are commonly used, and some parallel file systems, such as GPFS [] and PVFS [], can be deployed with either configuration depending on workload expectations.

Access Patterns

The designs of parallel file systems have been heavily influenced by the characteristics and access patterns of HPC applications. These applications execute at very large scale, with current systems allowing over , application processes to execute at once. Potentially all of those processes might initiate an I/O operation at any time, including all at once. Additionally, HPC applications are primarily scientific simulations. These codes operate on complex data models, such as regular or irregular grids, and these data models are reflected in the data that they store. The end result is that HPC access patterns are often highly concurrent and complex, including noncontiguity of access []. This often arises from access to subarrays of large global arrays.

A primary use case for parallel file systems in HPC environments is checkpointing. Checkpointing is the writing of application state to persistent storage for the purpose of enabling an application to restart at some point in the event of a system failure. This state is typically referred to as a “checkpoint.” If a node on which the application is running crashes, then the application can be restarted using the checkpoint to avoid having to recalculate early results. Typically an application will compute for some amount of time, and then all the processes will pause and write their checkpoint so that the checkpoint represents the same point in time across all the processes.

Users are often surprised that their serial workloads are not faster when run on a parallel file system. The cost of coordination in the PFS, in conjunction with different caching policies and the latency of communicating with file servers, can result in performance no different from, or even worse than, that of local storage. Serial workloads are simply not the primary focus of PFS designs.

Consistency

In the context of PFSes, consistency is about the guarantees that the file system makes with respect to concurrent access by processes, particularly processes on different compute nodes. The stronger these guarantees, the more coordination is generally necessary inside the PFS to ensure them.
For example, most parallel file systems ensure that data written by one process is visible to all other processes immediately following the completion of the write operation. If no other copies of file data are kept in the system, then this is easy to enforce: the server owning the data simply always returns data from the most recent completed write.

Some parallel file systems make a stronger guarantee, that of atomicity of writes. This means that other processes accessing the PFS will always see either all, or none, of a write operation. This guarantee requires coordination, because if a write spans multiple servers, then the servers need to agree on which data to return in reads if reading is allowed while a write is in progress. Alternatively, the PFS could force readers to wait when a write is ongoing. Many parallel file systems do guarantee atomicity of writes, and this approach of delaying reading when writes are ongoing is commonly used to provide this guarantee. These PFSes include a distributed locking component. Distributed locking systems in parallel file systems are typically single-writer, multi-reader, meaning that they allow many client nodes to simultaneously read a resource but permit only a single client node access when writing is in progress. Scalable parallel file systems perform locking on ranges of bytes in files, usually rounded to some block size. This allows concurrent writes to different ranges from different compute nodes.

The use of distributed locking in parallel file systems has two significant impacts on designs that rely on them. First, the granularity of locking places a limit on concurrency and leads to false sharing during writes. False sharing occurs when two processes on different compute nodes attempt to write to different bytes of a file that happen to map to a single lock. While the two processes are not really performing shared access to the same bytes, the locking system behaves as if they were, serializing their writes. Given the nature of HPC access patterns, false sharing is common. Second, availability is impacted. If a compute node fails while holding a lock, access to the locked resource is halted until the lock can be reclaimed. Determining the difference between a busy compute node and a failed one can be difficult at large scale, especially in the presence of lightweight, non-preemptive kernels. For these reasons, some PFS designs eschew the use of distributed locking, instead slightly relaxing consistency guarantees to a level that may be supported without locking infrastructure [].

Parallel File System Design

Five major mechanisms shape a PFS design:
● Underlying storage abstraction
● Metadata storage and access
● Caching
● Tolerance of failures
● Enforcement of consistency semantics

Consistency semantics have been discussed. The other four topics are covered next.

Underlying Storage Abstraction

Some early parallel file systems relied on storage area networks (SANs) to provide shared access to physical drives. In that approach, data was accessed directly by block locations on drives or logical volumes. Eventually designers realized that interposing storage servers had some advantages, but this block access model was retained. IBM’s General Parallel File System (GPFS) uses this model, with software on the server side providing access to storage over an interconnection network such as Gigabit Ethernet. Figure a shows the division of functionality in such a parallel file system. On the client side, PFS software translates application I/O operations directly into block accesses on specific file servers and obtains permission to access those blocks. File servers receive block access requests and perform them on locally accessible storage.
An important development that shaped more recent PFS designs was the concept of object storage devices. Object storage devices (OSDs) are storage devices (e.g., disks, disk arrays) that allow users to store data in logical containers called objects. By abstracting away block locations, file systems built using this abstraction are simpler to build, and blocks can be managed locally, at the device itself. Recently developed parallel file systems, such as Lustre [], PVFS [], and PanFS [], all use object-based access. Figure b shows how functionality is divided in such a design. Application accesses are translated into accesses to objects stored on file servers. File servers translate these accesses into locations on locally accessible storage and manage allocation of space when needed.

File Systems. Fig.  The two most common approaches to data storage in parallel file systems are block-based (left) and object-based (right): (a) block-based data access; (b) object-based data access

A less obvious advantage of the object-based approach is the possibility of algorithmic distribution of file data. This relates to file system metadata.

Metadata Storage and Access

There are three major components of PFS metadata: directory contents, file and directory attributes, and file data locations. Directory contents are simply the listing of files and directories residing in a single directory. Typically the contents for a single directory are stored on a single server. This can become a bottleneck for workloads that generate a lot of files in a single directory, so some parallel file systems will split the contents of a single directory across multiple servers (e.g., GPFS []).

File and directory attributes are pieces of information such as the owner, group, modification time, and permissions on a file or directory. Some of this information changes infrequently (e.g., owner), while other information can change rapidly (e.g., modification time). Some parallel file systems keep all these attributes on a single server, while others distribute the attributes for different files to different servers to allow concurrent access.

File data locations are the metadata necessary to map a logical portion of a file to where it is stored in the file system. For a block-based PFS, this information generally consists of a list of blocks or (server, block) pairs, similar to the traditional inode structure in a local file system. For an object-based PFS, a list of objects or (server, object) pairs is kept along with some parameters that define how data is mapped from the logical file into the objects. In a typical round-robin distribution, a stripe unit parameter defines how much data is placed in each object for a single stripe, and this pattern is repeated as necessary over the set of objects. This algorithmic approach to data distribution has the advantages that file data location metadata does not grow over time and does not need to be modified as the file grows.
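A minimal sketch of such an algorithmic mapping is shown below (in C, with illustrative parameter and structure names; it is not the layout code of any particular file system). Given a logical file offset, the stripe unit size, and the number of servers, it computes which object holds the byte and at what offset within that object:

#include <stdint.h>

/* Illustrative round-robin striping map: given a logical file offset,
 * compute which object (one per server) holds the byte and where.
 * stripe_unit and nservers would come from the file's distribution
 * metadata; the names here are hypothetical. */
struct stripe_loc {
    uint32_t object;        /* index of the (server, object) pair */
    uint64_t object_offset; /* byte offset inside that object     */
};

static struct stripe_loc map_offset(uint64_t file_offset,
                                    uint64_t stripe_unit,
                                    uint32_t nservers)
{
    struct stripe_loc loc;
    uint64_t stripe      = file_offset / stripe_unit; /* global stripe unit index    */
    uint64_t full_rounds = stripe / nservers;         /* complete round-robin passes */

    loc.object        = (uint32_t)(stripe % nservers);
    loc.object_offset = full_rounds * stripe_unit + file_offset % stripe_unit;
    return loc;
}

Because the location of any byte can be recomputed from these few parameters, no per-block table needs to be stored or updated as the file is extended, which is exactly the property noted above.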
Updates to metadata are usually handled in one of two ways. First, if the PFS already implements a distributed locking system, that system might be used to synchronize access to metadata by clients. Clients can then read, modify, and write metadata themselves in order to make updates. This approach has the advantage of allowing caching of metadata as well. Second, the PFS might implement a set of atomic operations that allow clients to update metadata without locks. For example, a special “insert directory entry” operation can be used by a client to add a new file. The server receiving this message is then responsible for ensuring atomicity of the operation, blocking read operations, or returning previous state until it completes its modifications. This approach reduces network traffic for modifications, but it does not lend itself to caching of modifiable data on the compute node side.

Caching

Caching of file system data contributes greatly to the observed performance of local file systems, in part because access patterns on local file systems (e.g., editing files, compiling code, surfing the Internet) often focus on a relatively small subset of data with multiple accesses to the same data in a short period of time. Implementing caching of file system data in the local file system context is also fairly straightforward, because the file system code sees all accesses and can ensure that the cache is kept consistent with what is on storage without any communication.

Caching in the context of a parallel file system is more complex when consistency semantics are taken into account. Keeping multiple copies of file system data and metadata consistent across multiple compute nodes requires the same sort of infrastructure as needed to enforce serialization of changes to file data and to ensure notification of changes. Typically parallel file systems that allow caching of data on compute nodes rely on a distributed locking system to ensure that all cached copies are the same, or coherent. Thus the caches are organized much the same as a software distributed memory system.

The benefits of caching in parallel file systems are highly dependent on the applications using the system. Checkpointing operations do not typically benefit from caching, because the amount of data written usually exceeds data cache sizes, and the data is not accessed again soon after. Likewise, large analysis applications that read datasets in parallel often exhibit access patterns that confuse typical cache read-ahead algorithms. Nevertheless, if the parallel file system is also used for home directory storage, or if applications access many small files, then caching can be beneficial.

Fault Tolerance

Leading HPC systems use parallel file systems with hundreds of servers and thousands of drives. With that many components, failures are inevitable. In order to ensure that data is not lost, and that the file system remains available in the event of component failures, parallel file systems rely on a collection of fault tolerance techniques.

Nearly all parallel file systems rely on redundant array of independent disks (RAID) techniques to cope with drive failures, usually employed only on drives directly accessible by servers (i.e., not between multiple servers). These techniques use parity or erasure codes to allow data to be reconstructed in the event of a failure using the data that remains. Typically, RAID is performed on blocks of storage, but some systems perform RAID on an object-by-object basis [].

Server failures are often handled through the use of failover techniques. Figures a and b depict a set of PFS servers configured with shared storage to allow for failover. In Fig. a, no failures have occurred. The system is running in an active–passive configuration, meaning that an additional server is sitting idle in case a failure occurs. When a server does fail (Fig. b), the idle server is brought into active service to take over where the previous server failed. The active–passive configuration allows for a certain number of failures to occur without degrading service. Alternatively, in an active–active configuration all servers are active all the time, with some server “doubling up” and handling the workload of a failed server when a failure occurs. This can lead to a significant degradation of performance.
File Systems. Fig.  Typical PFS configuration using shared backend storage to enable server failover, before (a) and after (b) a server failure: (a) fault-tolerant storage configuration with active–passive failover, no failures; (b) fault-tolerant storage configuration after failure, with the previously idle server having taken over. Compute nodes have been omitted from the figure

While not depicted in the figures, this configuration can also tolerate storage controller failures, because all drives are accessible through either of the storage controllers. If a storage controller were to fail, the servers on the remaining functional storage controller would take over the responsibilities of the servers that could no longer see storage, maintaining accessibility but with obvious performance penalties.

Designs exist that do not share access to storage on the back-end [, ]. As opposed to the approach described above, these designs apply RAID techniques across multiple servers. This approach eliminates the need for expensive enterprise storage on the back-end, but it requires additional coordination between clients and servers in order to correctly update erasure code blocks. In the presence of concurrent writers to the same file region (e.g., writing to different portions of the same file stripe), performance can be significantly reduced.

Parallel File System Interfaces

Most applications access parallel file systems through the same interface provided for local file systems, and the industry standard for file access is the POSIX standard. As mentioned previously, the POSIX standard was developed primarily with local file systems and single-user applications in mind, and as a result it is not an ideal interface for parallel I/O. One reason is its use of the open model for translating a file name into a reference for use in I/O. This model forces name translation to take place on every process, which can be expensive at scale.
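As a rough illustration of this, the sketch below (plain C, with a hypothetical file name and block size) shows a checkpoint-style write in which every process opens the same shared file by name and writes its own disjoint region. Under the POSIX model, each process performs its own name translation inside open(); an MPI-IO code would instead open the file once collectively with MPI_File_open and amortize that work across the communicator.

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Minimal sketch: process "rank" (0..nprocs-1) writes its own
 * CHUNK-byte block of a shared checkpoint file.  Every process calls
 * open() on the same path, so name translation happens nprocs times. */
#define CHUNK (4 * 1024 * 1024)

int write_checkpoint(const char *path, int rank, const char *buf)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644); /* per-process name lookup */
    if (fd < 0)
        return -1;

    off_t offset = (off_t)rank * CHUNK;            /* disjoint region per process */
    ssize_t n = pwrite(fd, buf, CHUNK, offset);

    close(fd);
    return (n == CHUNK) ? 0 : -1;
}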
Additionally, the POSIX read and write I/O calls allow for description only of contiguous I/O regions. Thus, complex I/O patterns cannot be described in a single call, leading instead to a long sequence of small I/O operations. Moreover, the POSIX consistency semantics require atomicity of writes and immediate visibility of changes on all nodes. These semantics are stronger than are needed for many HPC applications, and require communication to enforce. An effort is underway to define a set of extensions to the POSIX standard for HPC I/O. This effort includes a set of new function calls that address the three issues outlined here. Implementations of these calls have not yet begun to appear in operating systems, but a few custom implementations do exist in specific parallel file systems, and early results are encouraging.

The MPI-IO interface standard is the only current alternative to POSIX for low-level file system access.

Related Entries
Checkpointing
Clusters
Cray XT and Cray XT Series of Supercomputers
Cray XT and Seastar -D Torus Interconnect
Exascale Computing
Fault Tolerance
HDF
IBM Blue Gene Supercomputer
MPI-IO
NetCDF I/O Library, Parallel
Roadrunner Project, Los Alamos
Synchronization
Software Distributed Shared Memory

Bibliographic Notes and Further Reading

Early Parallel File Systems
The Vesta parallel file system [, ] was a file system built at the user level (i.e., not in the kernel) and tested on IBM SP systems. AIX Journaled File System volumes were used for storage by file system servers. Vesta is notable for using an algorithmic distribution of data, exposing a two-dimensional view of files, and providing numerous collective I/O modes and control over consistency semantics.

The Frangipani [] parallel file system and Petal [] virtual block device were originally developed at Digital Equipment Corporation and are well-documented examples of one way of organizing a block-based parallel file system. The Petal system combines a collection of block devices into a single, large, replicated pool that the Frangipani system uses as its underlying storage.

Current Parallel File Systems
Currently the largest HPC systems tend to use one of four parallel file systems: IBM’s General Parallel File System (GPFS); Sun’s Lustre; Panasas’s PanFS; and the Parallel Virtual File System (PVFS), a community-built parallel file system whose development is led by Argonne National Laboratory and Clemson University.

GPFS [] grew out of the Tiger Shark multimedia file system. It uses a block-based approach, and can operate either with clients physically attached to a storage area network or, more typically, with clients accessing storage attached to file system servers via a shared disk model. A lock-based approach is used for coordinating access to blocks of data, and through its locking system GPFS is able to cache data on the client side and provide full POSIX semantics.

Lustre [] is an asymmetric parallel file system that uses software-based object storage servers (OSSes) to provide data storage. It is the file system chosen by Cray for use on the Cray XT// systems. File data is striped across objects stored on multiple OSSes for performance. A locking subsystem is used to coordinate file access by clients and allows for coherent, client-side caching. Failure tolerance is provided by using shared storage and traditional server failover.

PanFS [] is an asymmetric parallel file system that uses a clustered metadata system in conjunction with lightweight data storage blades that provide data access via the ANSI T- object storage device (OSD) protocol [, ]. It is the file system used on the IBM Roadrunner system at Los Alamos National Laboratory. PanFS is notable in that it does not rely on shared storage on the back-end; instead clients compute and update parity information directly on data storage blades. Parity is calculated on a per-file basis, and objects in which file data is stored are spread across all storage blades to enable declustering of rebuilds in the event of a failure.
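For the designs that compute redundancy across servers rather than inside a disk array, the simplest scheme is single-parity protection, in which the parity unit of a stripe is the bytewise XOR of its data units. The sketch below is a generic illustration of that computation, not the PanFS or PVFS implementation; a client updating part of a stripe must also recompute or incrementally patch this parity unit, which is the extra client–server coordination mentioned above.

#include <stddef.h>
#include <string.h>

/* Illustrative single-parity computation for one stripe: the parity
 * unit is the bytewise XOR of the k data stripe units.  Losing any one
 * data unit can then be rebuilt as the XOR of the parity with the
 * surviving units. */
void compute_parity(const unsigned char *data[], size_t k,
                    size_t unit_size, unsigned char *parity)
{
    memset(parity, 0, unit_size);
    for (size_t u = 0; u < k; u++)
        for (size_t b = 0; b < unit_size; b++)
            parity[b] ^= data[u][b];
}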
The PVFS [] design team chose to specifically target HPC workloads with their design. It is one of the two parallel file systems deployed on the IBM Blue Gene/P system at Argonne National Laboratory; GPFS is the other. Coordinated caching and locking were not used, eliminating false sharing and locking overheads, and API extensions are available that allow complex noncontiguous accesses to be described with single I/O operations. Data is distributed algorithmically across a user-specifiable number of servers, and an object-based model is used that allows servers to locally allocate storage space. Failure tolerance is provided via shared storage between file system servers.

Acknowledgments
This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC-CH.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC-CH. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.

Bibliography
. Schmuck F, Haskin R () GPFS: a shared-disk file system for large computing clusters. In: First USENIX Conference on File and Storage Technologies (FAST’), Monterey, January , pp –
. Lang S, Carns P, Latham R, Ross R, Harms K, Allcock W () I/O performance challenges at leadership scale. In: Proceedings of Supercomputing. Portland, November 
. Nieuwejaar N, Kotz D, Purakayastha A, Ellis CS, Best M () File-access characteristics of parallel scientific workloads. IEEE T Parall Distr ():– [Online]. Available: http://www.computer.org/tpds/td/abs.htm
. Carns PH, Ligon III WB, Ross RB, Thakur R () PVFS: a parallel file system for Linux clusters. In: Proceedings of the th Annual Linux Showcase and Conference. USENIX Association, Atlanta, October , pp –. [Online]. Available: http://www.mcs.anl.gov/thakur/papers/pvfs.ps
. Braam PJ () The Lustre storage architecture. Cluster File Systems, Tech. Rep.,  [Online]. Available: http://lustre.org/docs/lustre.pdf
. Welch B, Unangst M, Abbasi Z, Gibson G, Mueller B, Small J, Zelenka J, Zhou B () Scalable performance of the Panasas parallel file system. In: Proceedings of the th USENIX Conference on File and Storage Technologies (FAST). San Jose, 
. Weil SA, Brandt SA, Miller EL, Long DDE, Maltzahn C () Ceph: a scalable, high-performance distributed file system. In: Proceedings of the th USENIX Symposium on Operating Systems Design and Implementation. Berkeley, , pp –
. Corbett PF, Feitelson DG () The Vesta parallel file system. ACM Trans Comput Syst ():–
. Corbett PF, Feitelson DG, Prost J-P, Baylor SJ () Parallel access to files in the Vesta file system. In: Proceedings of Supercomputing ’. IEEE Computer Society Press, Portland, , pp –
. Thekkath C, Mann T, Lee E () Frangipani: a scalable distributed file system. In: Proceedings of the Sixteenth ACM Symposium on Operating System Principles (SOSP). Saint-Malo, October 
. Lee EK, Thekkath CA () Petal: distributed virtual disks. In: Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems. Cambridge, October , pp –
. ANSI INCITS T Committee, Project t/-d working draft: Information technology – SCSI object-based storage device commands (OSD). July 
. Weber RO () SCSI object-based storage device commands- (OSD-), revision a. INCITS Technical Committee T/-D working draft, work in progress, Tech. Rep., January 
. POSIX, IEEE () (ISO/IEC) [IEEE/ANSI Std ., Edition] Information technology – Operating system interface (POSIX) – Part: System application program interface (API) [C language]. IEEE, New York, NY, USA
. Nowicki B () NFS: Network File System protocol specification, RFC , Sun Microsystems, Inc. March . http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc.html

Fill-Reducing Orderings
METIS and ParMETIS

First-Principles Molecular Dynamics
Car-Parrinello Method

Fixed-Size Speedup
Amdahl’s Law
FLAME
libflame

Floating Point Systems FPS-B and Derivatives
Chris Hsiung
Hewlett Packard, Palo Alto, CA, USA

Synonyms
SIMD (single instruction, multiple data) machines

Definition
FPS AP-B was the first commercially successful attached array processor product, designed and manufactured by Floating Point Systems Inc. in . It is designed to be attached to a host computer, such as the DEC PDP-, as a number-crunching accelerator. The key innovation behind this family of array processors is the horizontal micro-coding of the wide instruction set architecture with pipelined execution units to achieve high parallelism. Floating Point Systems Inc. later introduced several derivative products extending to -bit arithmetic and floating point calculations, by employing higher speed silicon technology, VLSI, and multiple floating point coprocessors.

Discussion
Introduction
Floating Point Systems (FPS) Inc. was founded in  by former Tektronix engineer C. Norm Winningstad in Beaverton, Oregon, to manufacture low-cost, high performance attached floating point accelerators, mainly for minicomputers, targeting signal processing applications.

FPS AP-B was the first attached processor product marketed under the company’s own brand. It was co-designed by George O’Leary and Alan Charlesworth, and first delivered in . Its pipelined architecture gave it a peak performance of  MFLOPS (Million FLoating-point Operations Per Second), with -bit floating point calculations. By , roughly , machines had been delivered []. Since the FPS-B was designed for mini-computer host machines, such as the DEC PDP-, a larger memory version, the FPS-L, was also introduced for mainframe hosts such as the IBM- series.

In order to expand to more general purpose scientific and engineering applications, FPS announced the FPS- system in . The word length was extended from -bit to -bit. The addressing capability was expanded from -bit to -bit, and the memory size grew to . MW (Million -bit Words). Even though the peak speed went down a bit, from  MFLOPS to  MFLOPS, the size of the problem and the precision of the calculations both increased. The silicon technology remained the same, TTL.

In , FPS announced the first -bit machine at higher speed with an enhanced architecture, the FPS-/MAX. This was the first VLSI design, which incorporated matrix accelerator (MAX) boards. This is a SIMD (Single Instruction stream and Multiple Data stream) design that can incorporate up to  MAX boards, each board having computation power equivalent to that of two FPS- processors. With a total of  FPS- CPUs, the peak computational power reached  MFLOPS.

In , FPS announced the first ECL version, the FPS-, with a peak speed of  MFLOPS. In the meantime, the FPS- also went through a makeover with the new VLSI technology, and could now be attached to both MicroVAX and Sun workstations.

Another derivative of the FPS-B system is a MIMD (Multiple Instruction and Multiple Data streams) version called the FPS-, announced in . It incorporated a shared-memory bus architecture among a control processor, three arithmetic coprocessors, and an I/O processor, connected to a host. The control processor is the traditional AP-B, and the arithmetic coprocessors, XP, were a new design, but employing the same concept as the AP-B. Since this MIMD architecture approach digressed from the array concept, we will not go into any length in this discussion.

One word of clarification regarding AP. In the high performance computing circle, AP stands for Array Processors, even though, unlike the ICL DAP family, there was not necessarily any “array” of processors presented in the architecture. So, realistically, AP can be viewed as Attached Processors, or we can interpret “array processor” as the attached processor designed
especially for efficient array operations. For this family of systems, the common grounds are the horizontal micro-coding concept (each instruction contains multiple fields of micro-instructions, which are decoded in parallel) and the pipelined architecture.

FPS AP-B

Hardware
The FPS AP-B was a micro-coded, pipelined, array processor design, attached to host computers such as the DEC PDP-/. Data transfer between host and AP was done via DMA (Direct Memory Access). Its cycle time was  ns ( MHz), with a peak speed (multiply-add) of  MFLOPS. For a detailed architecture block diagram, please consult Hockney and Jesshope’s book, Fig. ., p. , in [].

The whole idea of micro-coding was based on the concept of the (synchronous) parallel execution of multiple execution units. A single -bit instruction word was divided into an arithmetic field (address calculation), floating point fields, and two data pad register fields, all under the control of the decoding and execution unit. The address and logic unit (ALU) was -bit wide. There were  -bit S-registers for the ALU. The floating point units (FMUL and FADD) were -bit wide. There were two  data pad registers (DX and DY), each -bit wide. They were both for receiving data from memory and were used as operands in floating point calculations. The FADD was a two-stage adder, and the FMUL was a three-stage multiplier. In order to keep the pipelines moving, they need to be explicitly “pushed.” For example, in order to push the result of a FADD operation into its destination register, we need to add another FADD operation in the next cycle, either with a valid floating add of another pair of operands (FADD DX, DY), or with another FADD operation (FADD) without any operands. The same principle is true with FMUL.

There were three kinds of memory units: the program memory, the main (data) memory, and the table memory. The program memory was K × -bit. The table memory was K × -bit. It contained approximations for sine/cosine and other special functions, and was used to calculate intrinsic function values. The main memory was interleaved into two banks, odd and even. For the fast memory option ( ns access time), each bank was capable of producing one result every other cycle (two-cycle latency to data pad register). Hence, in the case of fast memory, sequential accesses could be done at each cycle with no delay. For slow memory ( ns access time), each bank can only deliver one result every three cycles.

The floating point format of the -bit data was a bit unconventional in its word length. The exponent field had  bits in Radix , and the mantissa had  bits in ’s complement format. Thus, the exponent had a dynamic range of  to the power ±, and the mantissa had -digit accuracy. Hence, it provided much better range and accuracy compared to the IBM -bit format. Extra guard bits were also kept in the calculation in order to improve precision. Format conversion between host and array was done on the fly during the data move.

Data moving into, or out of, the AP-B was performed by an I/O port (IOP) or a general programmable I/O port (GPIOP) using DMA operations, stealing cycles if necessary. There were two data widths, -bit and -bit. The bandwidth of moving into AP memory was . MW/s, and that of moving out was . MW/s. Special features to support real-time audio and video processing were also provided.

Software and Performance
In order to achieve optimal performance on pipelined and horizontally micro-coded machines, it was best to write compact loops, based on skillful arrangement of the critical path. If done right, it was truly a thing of beauty.

In carrying out matrix-vector types of operations, for example, the main loop body sometimes could be written in a one-line (single instruction) loop structure. Within that one line, there could be a floating multiply-add pair (possibly unrelated ops, but typically “chained” together), indexing, memory ops, and a conditional test and branch. Sometimes, an extra cycle could even be saved by storing/accessing some temporary data in the table memory, since the table memory had an independent data path. Of course, once the loop body was set, the programmer needed to wrap the initialization part and the winding-down code around the loop in order to complete the whole execution. The whole philosophy was very much like coding for vector machines, except that there were multiple instruction fields in the horizontally micro-coded instruction set.
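The following fragment is only a modern C rendering of the kind of kernel being described (a dot-product step), not actual AP-B microcode; the comments indicate which fields of a single horizontally micro-coded instruction the work of one iteration would occupy when hand-scheduled.

/* Illustrative only: the body of this loop corresponds to the work a
 * skilled AP-B programmer would pack into one horizontally
 * micro-coded instruction, with each piece handled by a separate field:
 *   x[i] * y[i]          -> FMUL field (three-stage multiplier pipeline)
 *   sum + product        -> FADD field (two-stage adder pipeline)
 *   loads of x[i], y[i]  -> memory / data pad (DX, DY) fields
 *   i++ and the loop test-> address ALU and branch fields
 * Software pipelining and explicit "pushes" of FADD/FMUL keep the
 * pipelines full; none of that scheduling is visible in C. */
double dot_product(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}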
There was no hardware support for divide in the machine. The loop body for a vector divide was three cycles, for example. A scalar divide was much longer.

Initially, in order to use the array processor, users had to use library procedures from the main program running on the host machine. Users used library procedures to transfer data to and from the attached machine, and then library procedures to do the computation. Library procedures were mainly used to execute vector (array) operations including intrinsic functions, basic arithmetic operations, a signal processing library, an image processing library (including D FFT, convolution, filtering), and an advanced math library (incl. sparse matrix solution and eigenvalues, Runge–Kutta integration).

The next step was the availability of a cross Fortran compiler on the host machine, to compile Fortran IV code into object code for the AP, to be executed on the array processor side. With this offering, the AP became much more versatile.

But, since the AP-B did not provide a real-time clock, most timing analyses were done on the host side, through APSIM, a timing simulator. Of course, with the use of a simulator, which was , times slower, this could only be useful for shorter running codes.

FPS-
The FPS-, announced in , was a -bit enhancement of the -bit AP-B. It incorporated some VLSI components in its design. The cycle time was  ns, a bit slower than its predecessor. It made for  MFLOPS peak speed. Other than the double precision design, there were other improvements as well. First, the instruction memory (program memory) and the data memory were merged into one large main memory. An instruction cache was loaded automatically from the instruction memory as needed. Hence this instruction cache (size: , -bit) replaced the program memory of the AP-B. There was also a stack memory, consisting of  -bit entries used to store subroutine return addresses, to expedite procedure calls. The table memory, also called auxiliary memory, primarily became random access memory, used to store intermediate data from registers (e.g., as a register overflow area). However, the first K of memory space was read-only, used to store commonly used constants and table values.

The main memory consisted of  memory modules. Each module had two boards (hence two banks). The memory size ranged from the initial . MW (with K-bit dynamic NMOS) to . MW later (with K-bit dynamic NMOS). All memory accesses were done in a pipelined fashion, both for main memory and table memory. Memory mapping and protection were also available.

The FPS- did provide a programmable real-time clock and a CPU timer, which enabled accurate timing of program executions.

FPS-/MAX
The FPS-/MAX was introduced in . The “MAX” here stood for matrix algebra accelerator. The machine consisted of one FPS- processor and up to  MAX boards. Each board was capable of two FPS- CPUs. Besides, four vector registers were added, each with , elements, at each CPU. In total, the computing power consisted of  -CPUs, and had a peak performance of  MFLOPS.

The MAX boards could execute just a handful of vector operations (SDOT, vector multiply, vector add, scalar vector multiply-add, etc.). The logic was implemented with CMOS VLSI technology. The vector registers were memory mapped to the top  MW of the  MW main memory. That means vector operations were executed simply by reading from, or writing to, appropriate memory locations.

Of course, in order to take advantage of the extra computing power, the arrays had to be very long. Hence, this type of SIMD architecture was more of a special purpose machine.

FPS-
The FPS-, announced in , was a (custom air-cooled Fairchild K) ECL implementation of the (Schottky TTL technology) FPS-. It shortened the clock cycle time from  to  ns, a speedup of more than three times. Other improvements in the instruction cache and data memory (static NMOS) boasted a total improvement of x to x over the FPS- generation.
Bibliographic Notes and Further Reading
For interested readers, the best textbook that covers the complete family of FPS products can be found in []. A complete history of all parallel computers can be found in [].

In , FPS brought to market its long anticipated T-Series product, a hypercube, distributed memory, message passing multicomputer, with Occam software. Because of its arcane software environment and non-general-purpose nature, it was not very successful in the marketplace. Since then, the company changed direction and brought the FPS- product family to the market. These developments are beyond the scope of this entry. Interested readers can find more details in “FPS Computing: A History of Firsts,” by Howard Thraikill, former President and CEO of FPS, in [].

Bibliography
. Hockney RW, Jesshope CR () Parallel computers : architecture, programming, and algorithms. IOP, Bristol
. Wilson GV The history of the development of parallel computing. http://ei.cs.vt.edu/~history/Parallel.html
. Ames KR, Brenner A () Frontiers of supercomputing II: a national reassessment. University of California Press, Berkeley

Flow Control
José Flich
Technical University of Valencia, Valencia, Spain

Synonyms
Backpressure

Definition
Flow control is a synchronization protocol for transmitting and receiving units of information. It determines the advance of information between a sender and a receiver, enabling and disabling the transmission of information. Since messages are usually buffered at intermediate switches, flow control also determines how resources in a network are allocated to messages traversing the network.

Discussion
Flow control is defined, in its broad sense, as a synchronization protocol that dictates the advance of information from a sender to a receiver. Flow control determines how resources in a network are allocated to packets traversing the network. First, basic definitions and types of flow control are introduced. Then, the differentiation between flow control and switching is highlighted. Finally, basic flow control mechanisms are described and briefly compared.

Messages, Packets, Flits, and Phits
Usually, the sender needs to transmit information to the receiver. To do this, the sender transmits blocks of information, referred to as messages, through a path. A header including the routing information is prepended to the message. Depending on the switching technique used (see below and the “switching” entry), the message may be further divided into packets. A packet is made up of a header (with the proper information to route the packet through the network) and a payload.

Every message or packet is then further split into flits. The flit (flow control bit) term is defined as the minimum amount of information that is flow controlled. Flits can then be divided into smaller data units called phits (physical units). A phit is usually the amount of information that is forwarded in a cycle through a network link. Depending on the link width, a certain number of phits will be required to transmit a flit. Phits are not flow controlled; they only exist because of the link width. Figure  shows the breakdown of a message into packets, flits, and phits.

Flow Control. Fig.  Messages, packets, flits, and phits

Not all communication systems use these four types of data units. Indeed, in some networks, packets are not needed and the entire message is transmitted (see Wormhole Switching in the Switching Techniques entry). Also, in some networks, the flit size equals the phit size, and thus there is no need to split a flit into smaller phits. In addition, as will be highlighted later, the flit size can span from a few bits to the entire packet size.

Types of Flow Control
The sender and receiver can be connected in different ways; for instance, they can be directly connected through a link, or they can be connected through a
network (made of routers, switches, and links) where several routers/switches are visited, and links traversed, prior to reaching the receiver. In both cases, a flow control protocol is defined to set up the proper transmission of information. Therefore, different levels of flow control can be used at the same time: end-to-end flow control and link-level flow control.

Bufferless Flow Control Versus Buffered Flow Control
Usually, flow control is used to prevent the overflow of buffering resources when transmitting information. When the queue at the receiver gets full, the flow control must stop the transmission of information at the sender. Once there is room at the queue, the flow control resumes the transmission of information. Therefore, flow control is closely related to buffer management.

However, there are cases where buffers (or queues) are not used along the transmission path of messages. In that case a bufferless flow control is defined. This is the simplest form of flow control. In that case information is simply forwarded, and potentially the information gets discarded or misrouted (due to contention along the path from the sender to the receiver). A bufferless flow control is also used in the so-called lossy networks where packets may be dropped (discarded) at the receiver when the buffer fills up. In such networks, the sender will need to retransmit the packet (via time-out or a NACK control packet sent back from the receiver from upper communication layers, e.g., the transport layer).

However, in buffered networks, some flow control mechanism is usually used to prevent packet dropping, and such networks are known as lossless (or flow controlled) networks. This is an important distinction of types of networks with huge implications. Indeed, the type of network (lossy or lossless) constrains the solutions for packet routing, congestion management, and deadlock issues. Also, network performance will be significantly different. Typically, storage/system-area networks (SAN) used in parallel systems and on-chip networks are lossless, and local area networks (LAN) and wide area networks (WAN) are lossy.

Flow Control Versus Switching
Flow control is also closely related to switching. Switching determines when and how the information advances through a switch or router. Nowadays, the most frequently used switching mechanisms are wormhole and virtual cut-through. One of the key differences between wormhole switching and virtual cut-through switching lies in the size of buffers and messages. In wormhole switching buffers at routers are usually smaller than message sizes, having slots for only a few flits (and messages can be tens of flits in size or even much larger). However, in virtual cut-through switching, buffers must be larger than packets so as to store an entire packet in a router. Indeed, in virtual cut-through, whenever a packet blocks in the network the packet needs to be stored completely at the blocking router. On the contrary, in wormhole switching the message is spread along its path across several routers. This fact greatly affects the flow control mechanism and, indeed, affects the flit size of the flow control protocol. In wormhole switching, the flit size is a small part of the message, whereas in virtual cut-through, the flit size is the size of the packet. Because original messages from the sender could be larger than buffers, packetization (splitting a message into bounded packets) is required in virtual cut-through.
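As a small numeric illustration of this breakdown (all sizes are invented for the example and do not describe any particular network), the following C fragment computes how many packets, flits, and phits a message turns into when it must be packetized; header overhead is ignored for simplicity.

#include <stdio.h>

/* Illustrative breakdown of a message into packets, flits, and phits.
 * All sizes are example values; header flits are ignored. */
int main(void)
{
    unsigned msg_bytes   = 4096; /* application message                */
    unsigned max_payload = 256;  /* payload carried by one packet      */
    unsigned flit_bytes  = 16;   /* unit of flow control               */
    unsigned phit_bytes  = 4;    /* transferred per cycle (link width) */

    unsigned packets        = (msg_bytes + max_payload - 1) / max_payload;
    unsigned flits_per_pkt  = (max_payload + flit_bytes - 1) / flit_bytes;
    unsigned phits_per_flit = (flit_bytes + phit_bytes - 1) / phit_bytes;

    printf("%u packets, %u flits/packet, %u phits/flit\n",
           packets, flits_per_pkt, phits_per_flit);
    return 0;
}

In a wormhole network the message would not be packetized at all, and in networks where the flit and phit sizes coincide the last level of the breakdown disappears.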
As information may travel across multiple routers, the switching inside a router is also seen as a means of flow control. Indeed, some authors do not distinguish between flow control and switching [], and they describe the most used flow control mechanisms as wormhole flow control and virtual cut-through flow control. However, a basic distinction exists depending on where the flow control is applied. If applied inside a router or switch, then a switching mechanism is used; if applied over a link connecting two nodes, then flow control is used. Keep in mind also the end-to-end flow control as a third possibility. Figure  shows a pair of connecting devices (source and destination nodes) connected through a series of switches and links. The figure illustrates the different levels of flow control (end-to-end and link-level) and where switching is performed (inside the switches). Taking into account this basic differentiation between flow control and switching, the following sections describe the most frequently used link-level flow control mechanisms. They are described together with the switching mechanisms they usually work with.

Flow Control. Fig.  Different levels of flow control

Another key property that distinguishes flow control from switching is that flow control is used as a means to provide backpressure so as not to overflow receiving buffers. Instead, switching decides when and how information is switched from the input ports of a switch/router to its output ports.

Impact of Flow Control and Buffer Size
Flow control is a key component in a network. An optimized flow control protocol will keep a large fraction of the buffers in use most of the time and will minimize the time flits sit waiting in queues, thus achieving high performance in the network. However, a suboptimal flow control protocol may keep resources underutilized and flits blocked most of the time waiting at buffers, thus ending up in a network with low performance (low throughput and high latencies).

In a flow control protocol, the buffer resources at the receiver end must be efficiently used to keep maximum performance. To do this, the buffer depth must be sized according to the latency of the path between the transmitter and the receiver (the time it takes for a flit to travel from the transmitter to the receiver). As both receiver and transmitter will exchange flow control information, the buffer must be sized, at minimum, to allow for the round-trip time of the link. This value is the path latency multiplied by two (two-way transmission) plus the time it takes to generate and process the flow control command at both sides. Notice that this is needed in any type of flow-controlled transmission, either link-level or end-to-end.

Flow control is also highly related to congestion management. Congestion management deals with high levels of blocking in the network among different messages. To solve such a situation, a congestion management technique may reduce the injection of traffic from a router or from an end node, thus resembling flow control methods (as they dictate when flows advance). However, the reader must be aware that in a congestion management technique the goal is to reduce the blocking experienced in the network, whereas in a flow control method the goal is not to overrun a buffer.
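The round-trip argument above translates directly into a sizing rule. The following sketch, with invented example latencies, computes the minimum receiver buffer depth in flits needed so that the link can stay busy while flow control information travels back to the sender:

#include <math.h>
#include <stdio.h>

/* Minimum receiver buffer depth (in flits) to cover the link round trip:
 * twice the one-way latency plus the time to generate and process the
 * flow control command, divided by the flit injection time.
 * Example numbers only. */
int main(void)
{
    double t_flit = 4.0;  /* ns to inject one flit on the link         */
    double t_prop = 20.0; /* ns one-way propagation (wires + logic)    */
    double t_ctrl = 8.0;  /* ns to generate and process the command    */

    double round_trip = 2.0 * t_prop + t_ctrl;
    int    min_depth  = (int)ceil(round_trip / t_flit);

    printf("round trip = %.1f ns -> at least %d flit slots\n",
           round_trip, min_depth);
    return 0;
}

The same margin has to be honored by each of the protocols described in the next section, whether it is expressed as a number of credits or as the distance between the Xon and Xoff thresholds.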


Link-Level Flow Control
The most commonly used link-level flow control protocols are Xon/Xoff, credit-based, and Ack/Nack. However, there are other solutions that, although basic and probably low in performance, help to better understand the concept. This is the case for a handshake protocol. Next, each of these flow control protocols is described.

Handshake protocol. The most straightforward method to implement flow control between two devices is by setting up a handshake protocol. As an example, a sender may assert a request signal to the receiver, thus notifying it has a flit to send. The receiver (depending on its availability for storing new flits) may assert a grant signal to let the transmitter know that there is a slot at the queue. Then, the transmission of the flit may begin. For the next flit the handshake protocol needs to be repeated (the handshake works at the flit granularity). Although such a protocol is correct and ensures that no queue will overrun, the problem is the low performance achieved in the transmission. Indeed, the delay incurred in the control signal exchange is not overlapped with the transmission of flits, and therefore there are bubbles introduced along the transmission path. Another potential problem is the excessive exchange of control information (request, grant signals) between the transmitter and the receiver for every transmitted flit. The following link-level flow control methods reduce such overhead.

Xon/Xoff flow control. Xon/Xoff flow control is also known as Stop&Go. It is mostly used in networks with wormhole switching. Basically, the receiver notifies the sender either to stop or to resume sending flits once high and low buffer occupancy levels are reached, respectively. To do this, two queue thresholds are set at the receiving side. When the queue fills and passes the Xoff threshold, then the receiver side sends an off flow control signal to the transmitter to stop transmission (in order to prevent buffer overrun). At the transmitter side, when the off control signal is received the transmission is stopped and the sender enters the off mode. Transmission will be resumed only when notified by the receiver. Indeed, when the queue drains and reaches the Xon threshold, the receiver sends an on flow control signal to the transmitter. The transmitter enters again the on mode and resumes transmission.

Xon and Xoff thresholds are critical from the performance point of view. Indeed, the Xoff threshold must be set so as not to overrun the receiving queue, and the Xon threshold must be set so as not to introduce bubbles along the transmission path (once forwarding of flits is resumed). Therefore, the round-trip time must be taken into account when setting the thresholds. In addition, the Xoff and Xon thresholds must be far enough apart so as not to send many off and on control signals unnecessarily. It is common to set the Xoff threshold at / and the Xon threshold at / of buffer capacity. Control signals may be sent back to the sender either by using special control signals (thus not using the link data path) or by using special control packets through the link data path. With appropriate thresholds, Xon/Xoff significantly reduces the amount of flow control information exchanged. However, it requires large buffer memories to avoid buffer overruns.

However, there are other approaches to setting the thresholds that may be beneficial. This is the case for the Prizma switch [], where both thresholds are set to the same value. This is efficient if the link length is short enough (round-trip time is short), as is the case for the Prizma switch. The benefit in this case is the use of a single control signal that when set means Go, and when reset means Stop.

Credit-based flow control. Credit-based flow control is usually linked to networks using virtual cut-through switching. Indeed, the flit size is the packet size and, therefore, packet size is limited to a maximum (buffer size needs to be larger than or equal to the packet size). In credit-based flow control the sender has a counter of credits. The counter is initially set to the number of flits/slots available at the receiver side. Typically, a credit means a slot for a flit is granted at the receiver side. There are, however, systems where a credit means a chunk of bytes, not necessarily the size of the packet (e.g., in the InfiniBand network a credit means a slot for  bytes).

The sender transmits a flit whenever it has credits available. If the credit counter is different from zero, then a flit can be transmitted, and the counter is decremented by one. If the counter reaches zero then the transmitter cannot send more flits to the receiver.

The receiver keeps track of receptions and transmissions. Whenever a flit is consumed the receiver sends a new credit upstream. This is notified by a flow
control signal (again, this information can travel decoupled from the transmission link or as a control packet through the transmission link). Alternatively, and to reduce the overhead in transmitting control information, the receiver may collect credits and send them in a batch. However, in that case, performance may suffer since forwarding of flits may be delayed.

Credit-based flow control requires less than half the buffer size required by Xon/Xoff flow control, thus being more effective regarding resource requirements. In addition, the link reaction time is shorter in credit-based flow control. When the sender runs out of credits, it stops sending flits. When a single credit is returned from the receiver, the transmission is resumed. In Xon/Xoff the go threshold needs to be reached before sending the go signal, thus incurring higher transmission latencies. Figure  shows a comparison between the Stop&Go flow control protocol (Xon/Xoff) and credit-based flow control. As can be seen, the credit-based flow control protocol is able to resume forwarding flits sooner than Stop&Go.

Flow Control. Fig.  Flow control comparison (Xon/Xoff vs credits)

Ack/Nack flow control. Flow control is also linked to fault tolerance in message transmissions. Indeed, Ack/Nack flow control allows the receiver to notify the sender that the last flit sent was correct (received without error) or erroneous (due to a failure or even because there was no buffer available). There is no state information at the transmitter side. Indeed, the transmitter is optimistic and keeps sending flits to the receiver. In this sense, the Ack/Nack protocol is also known as optimistic flow control.

This flow control method, although it reduces the latency (the sender just transmits the flit when it is available), has a poor bandwidth utilization (sending flits that are dropped). In addition, the transmitter needs logic to deal with timeouts in order to trigger retransmissions (in the absence of either an ACK or NACK signal). Also, it needs to keep a copy of the flit until it receives an ACK signal. Thus, the protocol incurs an inefficient buffer utilization.

Xon/Xoff and credit-based flow control methods do not support failures. Indeed, these protocols only care about buffer flooding. When using such protocols, in the face of a failure the system needs to rely on a higher-level flow control protocol, most of the time an end-to-end flow control protocol.
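The following sketch is a minimal software model of the credit-based protocol just described: one credit counter on the sender side and one queue on the receiver side. Signal transport, virtual channels, and error handling are omitted, and all names are illustrative.

#include <stdbool.h>

/* Minimal model of credit-based link-level flow control.
 * One credit == one flit slot in the receiver queue. */
struct credit_link {
    int credits;  /* sender side: flits it may still inject          */
    int occupied; /* receiver side: flits currently sitting in queue */
    int depth;    /* total flit slots at the receiver                */
};

void link_init(struct credit_link *l, int depth)
{
    l->credits  = depth; /* counter starts at the receiver's capacity */
    l->occupied = 0;
    l->depth    = depth;
}

/* Sender: may transmit only while the credit counter is nonzero. */
bool sender_try_send(struct credit_link *l)
{
    if (l->credits == 0)
        return false;   /* stall until a credit comes back */
    l->credits--;
    l->occupied++;      /* the flit eventually lands in the receiver queue */
    return true;
}

/* Receiver: consuming (forwarding) a flit frees a slot and returns one
 * credit upstream; batching credits would trade control traffic for
 * some added latency, as noted in the text. */
void receiver_consume(struct credit_link *l)
{
    if (l->occupied > 0) {
        l->occupied--;
        l->credits++;   /* models the credit arriving back at the sender */
    }
}

With a queue depth chosen as in the round-trip sizing example earlier, the sender never stalls against an idle receiver, which is the behavior contrasted with Stop&Go in Fig. .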


 F Flow Control

Other Specific Flow Control Protocols
Depending on the type of network used, other flow control protocols may be used, trying to add new functionalities like higher performance, lower flit latencies, fault coverage, etc. One example is the T-ERR [] flow control protocol, proposed for on-chip networks. T-ERR increases the operating frequency of a link with respect to a conventional design. In such situations, timing errors become likely on the link. These faults are handled by a repeater architecture that leverages a second clock to resample input data, so as to detect any inconsistency. Therefore, T-ERR obtains higher link throughput at the expense of introducing timing errors.

Flow Control and Virtual Channels
An important issue related to flow control is how to deal with virtual channels (Switch Architecture). Flow control can be instantiated for each virtual channel so that individual virtual channels do not get overflowed. However, this leads to the assumption that buffer resources are statically assigned to each virtual channel, leading to low memory utilization. The problem with this approach comes from the fact that every virtual channel must be deep enough to cover the round-trip time of the link and avoid introducing bubbles in the transmission line. To reduce such overhead, it is common to design a single (and shared) landing pad at the input port of the switch. The landing pad is flow controlled. Flits arriving at the landing pad are then moved to the appropriate virtual channel.

More interesting approaches rely on the dynamic assignment of memory resources to virtual channels. Thus, virtual channels may grow in depth depending on the traffic demands. In order to cope with such dynamicity, it is common to combine two flow control mechanisms. One example is the use of the Stop&Go flow control protocol per virtual channel and the credit-based flow control protocol for the entire memory at the input port of the switch. Thus, whenever a switch wants to forward a flit to the next switch, it needs to check whether there are global credits available at the input port and whether the receiving virtual channel is in the go state. Notice that with the proper setting of thresholds, a particular virtual channel will not consume the entire memory resources, thus leaving room for other virtual channels.

End-to-End Flow Control
Link-level flow control protocols can, in principle, be applied in end-to-end scenarios. Indeed, nothing prevents them from working there. This is the case for the Ack/Nack protocol implemented at end nodes, in order to guarantee the reception of messages at the destination. However, since the round-trip time is longer, large buffers are required at both sides to transmit multiple messages while waiting for ACK signals. Usually this is not a major problem, since end nodes implement much larger buffer resources than routers and switches.

There are, however, other protocols designed at that level. Indeed, these flow control protocols are also known as message protocols. Two examples are the eager protocol and the rendezvous protocol.

The eager protocol just assumes the receiver will have enough allocated space in its buffers to store the transmitted message. In case the receiver was not expecting the message, it has to allocate more buffering and copy the message. During that time, the network may fill up. Thus, the eager protocol may impact performance.

In contrast, the rendezvous protocol does not send a message unless the receiver is aware of it. This is of particular interest when the amount of information to send is large. When using the rendezvous protocol, the transmitter first sends a single short message requesting buffer storage at the receiving side, so that a large buffer can be allocated. The receiver, after allocating the buffer, sends an acknowledgment to the sender. Upon reception, the sender injects all the messages, which include a sequence number in their header. The receiver just stores every incoming message at the correct buffer slot, depending on the sequence number the message has. Upon reception of all the messages, the data is at the receiver and ordered.

The use of the eager or the rendezvous protocol may also affect the way programs are coded. This is the case for MPI programming, where different commands can be implemented with different end-to-end flow control protocols. Also, out-of-order delivery of messages can be avoided with the rendezvous protocol. As there are applications that do not work in networks with unordered delivery of messages (e.g., using adaptive routing), the rendezvous protocol may be a good choice for exploiting the benefits of adaptive routing without suffering out-of-order delivery.
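The eager/rendezvous distinction surfaces directly in MPI programming. The small sketch below is illustrative only: MPI implementations typically send small messages eagerly and switch to a rendezvous handshake above an implementation-dependent size threshold, and MPI_Ssend always completes with rendezvous-like (synchronous) semantics. The message sizes chosen here are arbitrary assumptions.

    /* Hedged illustration of eager vs. rendezvous behavior in MPI. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        enum { SMALL = 64, LARGE = 1 << 20 };
        double *buf = malloc(LARGE * sizeof(double));

        if (rank == 0) {
            /* Small message: typically sent eagerly and buffered at the receiver. */
            MPI_Send(buf, SMALL, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            /* Large message: typically transferred with a rendezvous handshake. */
            MPI_Send(buf, LARGE, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
            /* Synchronous send: does not complete until the receive is posted. */
            MPI_Ssend(buf, SMALL, MPI_DOUBLE, 1, 2, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(buf, SMALL, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(buf, LARGE, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(buf, SMALL, MPI_DOUBLE, 0, 2, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }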
Related Entries
Congestion Management
Switching Techniques

Bibliographic Notes and Further Reading
There are several books that cover the field of interconnection networks. High-performance interconnection networks (used in massively parallel systems and/or clusters of computers) are covered in the books of Duato [] and Dally []. Both offer a good text for flow control and in some cases complementary views. In particular, Dally's book couples flow control with switching, providing a common text for both terms. In Duato's book the term is clearly differentiated, and the book is more focused on switching (rather than flow control). For more specific and emerging scenarios, like on-chip networks, the book by De Micheli and Benini [] provides alternative flow control protocols (link-level flow control in Chap.  and end-to-end flow control protocols in Chap. ). For computer networks (LANs and WANs) the reader should focus on books like Tanenbaum [] (Chap.  for the transport layer and Chap.  for the data link layer).

Bibliography
1. Dally W, Towles B () Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco, CA
2. Duato J, Yalamanchili S, Ni N () Interconnection networks: an engineering approach. Morgan Kaufmann, San Francisco, CA
3. de Micheli G, Benini L () Networks on chips: technology and tools. Morgan Kaufmann, San Francisco, CA
4. Tanenbaum AS () Computer networks. Prentice-Hall, Upper Saddle River, NJ
5. Tamhankar R, Murali S, Micheli G () Performance driven reliable link for networks-on-chip. In: Proceedings of the Asian Pacific Conference on Design Automation, Shanghai
6. Denzel WE, Engbersen APJ, Iliadis I () A flexible shared-buffer switch for ATM at Gb/s rates. Comput Netw ISDN Syst ()

Flynn's Taxonomy
Michael Flynn
Stanford University, Stanford, CA, USA

Synonyms
SIMD (single instruction, multiple data) machines

Definition
Flynn's taxonomy is a categorization of forms of parallel computer architectures. From the viewpoint of the assembly language programmer, parallel computers are classified by the concurrency in processing sequences (or streams) of data and instructions. This results in four classes: SISD (single instruction, single data), SIMD (single instruction, multiple data), MISD (multiple instruction, single data), and MIMD (multiple instruction, multiple data).

Discussion
Introduction
Developed in  [] and slightly expanded in  [], this is a methodology to classify general forms of parallel operation available within a processor. It was proposed as an approach to clarify the types of parallelism either supported in the hardware by a processing system or available in an application. The classification is based on the view of either the machine or the application by the machine language programmer. It implicitly assumes that the instruction set accurately represents a machine's micro-architecture.

At the instruction level, parallelism means that multiple operations can be executed concurrently within a program. At the loop level, consecutive loop iterations are ideal candidates for parallel execution, provided that there is no data dependency between subsequent loop iterations. Parallelism available at the procedure level depends on the algorithms used in the program. Finally, multiple independent programs can obviously execute in parallel. Different instruction sets (or instruction set architectures or, simply, computer architectures) have been built to exploit this parallelism. In general, an instruction set architecture consists of instructions each
of which specifies one or more interconnected processor elements that operate concurrently, solving a single overall problem.

The Basic Taxonomy
The taxonomy uses the stream concept to categorize the instruction set architecture. A stream is simply a sequence of objects or actions. There are both instruction streams and data streams, and there are four simple combinations that describe the most familiar parallel architectures []:

1. SISD – single instruction, single data stream; this is the traditional uniprocessor (Fig. ), which includes pipelined, superscalar, and VLIW processors.
2. SIMD – single instruction, multiple data stream, which includes array processors and vector processors (Fig.  shows an array processor).
3. MISD – multiple instruction, single data stream, which includes systolic arrays, GPUs, and dataflow machines (Fig. ).
4. MIMD – multiple instruction, multiple data stream, which includes traditional multiprocessors (multi-core and multi-threaded) as well as the newer work on networks of workstations (Fig. ).

As the stream description of instruction set architecture is the programmer's view of the machine, there are limitations to the stream categorization. Although it serves as useful shorthand, it ignores many subtleties of an architecture or an implementation. As pointed out in the original work in , even an SISD processor can be highly parallel in its execution of operations. This parallelism need not be visible to the programmer at the assembly language level, but generally becomes visible at execution time through improved performance and, possibly, programming constraints.

Flynn's Taxonomy. Fig.  SISD – single instruction operating on a single data unit

SISD: Single Instruction, Single Data Stream
The SISD class of processor architecture includes most commonly available workstation-type computers. Although a programmer may not realize the inherent parallelism within these processors, a good deal of concurrency can be present. Obviously, pipelining is a powerful concurrency technique and was recognized in the  original work. Indeed, in that work, the idea of simultaneously decoding multiple instructions was mentioned but was regarded as infeasible for the technology of the time. (That paper referred to the issuance of more than one instruction at a time as a bottleneck, and later publications sometimes refer to this problem as Flynn's bottleneck.) Of course, by now almost all machines use pipelining and many machines use one or another form of multiple instruction issue. These multiple-issue techniques aggressively exploit parallelism in executing code, whether it is declared statically or determined dynamically from an analysis of the code stream. During execution, a SISD processor executes one or more operations per clock cycle from the instruction stream. Strictly speaking, the instruction stream consists of a sequence of instruction words. An instruction word is a container that represents the smallest execution packet visible to the programmer and executed by the processor. One or more operations are contained within an instruction word. In some cases, the distinction between instruction word and operations is crucial to distinguish between processor behaviors. When there is only one operation in an instruction word, it is simply referred to as an instruction.

Scalar and superscalar processors execute one or more instructions per cycle, where each instruction contains a single operation. VLIW processors, on the other hand, execute a single instruction word per cycle, where this instruction word contains multiple operations.
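As a small, hypothetical illustration of that distinction (not from the original entry), the three independent operations below could be issued one per cycle by a scalar pipeline, grouped at run time by a superscalar's dynamic scheduler, or packed by a VLIW compiler into a single wide instruction word at compile time.

    /* Three independent operations: no data dependences between them.
     * - scalar pipeline: issues them one instruction per cycle
     * - superscalar:     hardware detects the independence and issues them together
     * - VLIW:            the compiler bundles them into one instruction word
     * (Illustrative only; actual scheduling depends on the machine and compiler.) */
    void independent_ops(int *a, int *b, int *c) {
        a[0] = a[0] + 1;    /* operation 1 */
        b[0] = b[0] << 2;   /* operation 2 */
        c[0] = c[0] * 3;    /* operation 3 */
    }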
Flynn’s Taxonomy F 

Flynn's Taxonomy. Fig.  SIMD – array processor type using PEs with local memory

Flynn's Taxonomy. Fig.  MISD – a streaming processor example

The amount and type of parallelism achievable in a SISD processor has four primary determinants:

1. The number of operations that can be executed concurrently on a sustainable basis.
2. The scheduling of operations for execution. This can be done statically at compile time, dynamically at execution, or possibly both.
3. The order that operations are issued and retired relative to the original program order – these operations can be in order or out of order.
4. The manner in which exceptions are handled by the processor – precise, imprecise, or a combination.

On the occurrence of an exception (or interrupt), a precise exception handler will (usually) complete processing of the current instruction and immediately process the exception. An imprecise handler may have already completed instructions that followed the instruction that caused the exception and will simply signal that the exception has occurred, leaving it to the operating system to manage the recovery.

Most SISD processors implement precise exceptions, although a few high-performance architectures allow imprecise floating-point exceptions.
Flynn's Taxonomy. Fig.  MIMD – multiple instruction

Scalar (Including Pipelined) Processors
A scalar processor is a simple (usually pipelined) processor that processes a maximum of one instruction per cycle and executes a maximum of one operation per cycle. These processors process instructions sequentially from the instruction stream. The next instruction is not processed until the execution of the current instruction is complete and its results have been stored. The semantics of the instruction determine the sequence of actions that must be performed to produce the desired result (instruction fetch, decode, data or register access, operation execution, and result storage). These actions can be overlapped, but the result must appear in the specified serial order. This sequential execution behavior describes the sequential execution model, which requires each instruction to be executed to completion in sequence. In the sequential execution model, execution is instruction-precise if the following conditions are met:

1. All instructions (or operations) preceding the current instruction (or operation) have been executed and all results have been committed.
2. All instructions (or operations) after the current instruction (or operation) are unexecuted or no results have been committed.
3. The current instruction (or operation) is in an arbitrary state of execution and may or may not have completed or had its results committed.

For scalar and superscalar processors with only a single operation per instruction, instruction-precise and operation-precise executions are equivalent. The traditional definition of sequential execution requires instruction-precise execution behavior at all times, mimicking the execution of a nonpipelined sequential processor.

Most scalar processors directly implement the sequential execution model. The next instruction is not processed until all execution for the current instruction is complete and its results have been committed.

Pipelining is based on concurrently performing different phases (instruction fetch, decode, execution, etc.) of processing an instruction, with a maximum of one instruction being decoded each cycle. Ideally, these phases are independent between different operations and can be overlapped; when this is not true, the processor stalls the process to enforce the dependency. Thus, multiple operations can be processed simultaneously, with each operation at a different phase of its processing.

For a simple pipelined machine, only one operation occurs in each phase at any given time. Thus, one operation is being fetched, one operation is being decoded, one operation is accessing operands, one operation is in execution, and one operation is storing results. The most rigid form of a pipeline, sometimes called the
Flynn’s Taxonomy F 

static pipeline, requires the processor to go through all stages or phases of the pipeline whether required by a particular instruction or not. A dynamic pipeline allows the bypassing of one or more of the stages of the pipeline, depending on the requirements of the instruction. There are at least two basic types of dynamic pipeline processors:

1. Dynamic pipelines that require instructions to be decoded in sequence and results to be executed and stored in sequence. This type of dynamic pipeline provides a small performance gain over a static pipeline. In-order execution requires the actual change of state to occur in the order specified in the instruction sequence.
2. Dynamic pipelines that require instructions to be decoded in order, but the execution stage of operations need not be in order. In these organizations, the address generation stage of load and store instructions must be completed before any subsequent ALU instruction does a write back. The reason is that the address generation may cause a page fault and affect the execution sequence of following instructions. This type of pipeline can result in the imprecise exceptions mentioned earlier without the use of special order-preserving hardware.

Superscalar Processors
Even the most sophisticated scalar dynamic pipelined processor is limited to decoding a single operation per cycle. Superscalar processors decode multiple instructions in a cycle and use multiple functional units and a dynamic scheduler to process multiple instructions per cycle. These processors can achieve execution rates of several instructions per cycle (usually limited to  or , but more is possible in favorable applications). A significant advantage of a superscalar processor is that processing multiple instructions per cycle is done transparently to the user and provides binary code compatibility. Compared to a dynamic pipelined processor, a superscalar processor adds a scheduling instruction window that analyzes multiple instructions from the instruction stream each cycle. Although processed in parallel, these instructions are treated in the same manner as in a pipelined processor. Before an instruction is issued for execution, dependencies between the instruction and its prior instructions must be checked by hardware. Advanced superscalar processors usually include order-preserving hardware to ensure precise exception handling, simplifying the programmer's model. Because of the complexity of the dynamic scheduling logic, high-performance superscalar processors are usually limited to processing four to eight instructions per cycle.

VLIW Processors
VLIW processors also decode multiple operations per cycle and use multiple functional units. A VLIW processor executes operations from statically scheduled instruction words that contain multiple independent operations. In contrast to superscalar processors, which use hardware-supported dynamic analysis of the instruction stream to determine which operations can be executed in parallel, VLIW processors rely on static analysis by the compiler. VLIW processors are thus less complex than superscalar processors and have, at least, the potential for higher performance.

For applications that can be effectively scheduled statically, VLIW implementations offer high performance. Unfortunately, not all applications can be so effectively scheduled, as execution does not proceed exactly along the path defined by the compiler code scheduler. Two classes of execution variations can arise and affect the scheduled execution behavior:

1. Delayed results from operations whose latency differs from the assumed latency scheduled by the compiler.
2. Exceptions or interrupts, which change the execution path to a completely different and unanticipated code schedule.

While stalling the processor can control delayed results, this can result in significant performance penalties. The most common execution delay is a data cache miss. Most VLIW processors avoid all situations that can result in a delay by avoiding data caches and by assuming worst-case latencies for operations. However, when there is insufficient parallelism to hide the worst-case operation latency, the instruction schedule can have many incompletely filled or empty operation slots in the instructions, resulting in degraded performance.
SIMD: Single Instruction, Multiple Data Stream
The SIMD class of processor architecture includes both array and vector processors. The SIMD processor is created around the use of certain regular data structures, such as vectors and matrices. From the reference point of an assembly-level programmer, programming a SIMD architecture appears to be very similar to programming a simple SISD processor, except that some operations perform computations on these aggregate data structures. As these regular structures are widely used in scientific programming, the SIMD processor has been very successful in these environments.

The array processor and the vector processor differ both in their implementations and in their data organizations. An array processor consists of interconnected processor elements, each having its own local memory space. A vector processor consists of a single processor that references a single global memory space and has special functional units that operate specifically on vectors.

Array Processors
The array processor consists of multiple processor elements connected via one or more networks, possibly including local and global inter-element communications and control communications. Processor elements operate in lockstep in response to a single broadcast instruction from a control processor. Each processor element has its own private memory, and data is distributed across the elements in a regular fashion that is dependent on both the actual structure of the data and also on the computations to be performed on the data. Direct access to global memory or another processor element's local memory is expensive, so intermediate values are propagated through the array through local interprocessor connections, which requires that the data be distributed carefully so that the routing required to propagate these values is simple and regular. It is sometimes easier to duplicate data values and computations than it is to effect a complex or irregular routing of data between processor elements.

As instructions are broadcast, there is no means local to a processor element of altering the flow of the instruction stream; however, individual processor elements can conditionally disable instructions based on local status information – these processor elements are idle when the specified condition occurs. Often, implementations consist of an array processor coupled to a SISD general-purpose control processor that executes scalar operations and issues array operations that are broadcast to all processor elements in the array. The control processor performs the scalar sections of the application, interfaces with the outside world, and controls the flow of execution; the array processor performs the array sections of the application as directed by the control processor.

The programmer's reference point for an array processor is typically the high-level language level; the programmer is concerned with describing the relationships between the data and the computations but is not directly concerned with the details of scalar and array instruction scheduling or the details of the interprocessor distribution of data within the processor. In fact, in many cases, the programmer is not even concerned with the size of the array processor. In general, the programmer specifies the size and any specific distribution information for the data, and the compiler maps the implied virtual processor array onto the physical processor elements that are available and generates code to perform the required computations.

Vector Processors
A vector processor is a single processor that resembles a traditional SISD processor except that some of the function units (and registers) operate on vectors – sequences of data values that are processed as a single entity. These functional units are deeply pipelined and have a high clock rate; although the vector pipelines have as long or longer latency than a normal scalar function unit, their high clock rate and the rapid delivery of the input vector data elements result in a significant throughput that cannot be matched by scalar function units. Early vector processors processed vectors directly from memory. The primary advantage of this approach was that the vectors could be of arbitrary length and were not limited by processor registers or resources; however, the high startup cost, limited memory system bandwidth, and memory system contention were significant limitations.

Modern vector processors require that vectors be explicitly loaded into special vector registers and stored back into memory from these registers, the same course that modern scalar processors have taken for similar reasons. The vector register file consists of several sets of
Flynn’s Taxonomy F 

vector registers (with perhaps  registers in each set). Vector processors have several features that enable them to achieve high performance. One feature is the ability to concurrently load and store values between the vector registers and main memory while performing computations on values in the vector register file. This feature is important because the limited length of vector registers requires that long vectors be processed in segments. Not being able to overlap memory accesses and computations would pose a significant performance bottleneck. Just like SISD processors, vector processors support a form of result bypassing – in this case called chaining – that allows a follow-on computation to commence as soon as the first value is available from the preceding computation. Thus, instead of waiting for the entire vector to be processed, the follow-on computation can be largely overlapped with the preceding computation that it is dependent on. Sequential computations can be efficiently compounded and behave as if they were a single operation, with a total latency equal to the latency of the first operation plus the pipeline and chaining latencies of the remaining operations, but none of the start-up overhead that would be incurred without chaining.

As with the array processor, the programmer's reference point for a vector machine is the high-level language. In most cases, the programmer sees a traditional SISD machine; however, as vector machines excel on vectorizable loops, the programmer can often improve the performance of the application by carefully coding the application, in some cases explicitly writing the code to perform register memory overlap and chaining, and by providing hints to the compiler that help to locate the vectorizable sections of the code. As languages are defined (such as Fortran  or High Performance Fortran) that make vectors a fundamental data type, the programmer is exposed less to the details of the machine and to its SIMD nature.

MISD: Multiple Instruction, Single Data Stream
The MISD architecture is a pipelined ensemble where the functions of the individual stages are determined by the programmer before execution begins; then data is streamed through the pipeline, forwarding results from one (stage) function unit to the next. On the micro-architecture level this is exactly what the vector processor does. However, in the vector pipeline the programmer initially organizes the data into a suitable data structure, then streams the data elements into stages where the operations are simply fragments of an assembly-level operation, as distinct from being a complete operation. Surprisingly, some of the earliest attempts in the s could be seen as MISD computers (they were actually programmable calculators, as they were not "stored program machines" in the sense that the term is now used). They used plug boards for programs, where data in the form of a punched card was introduced into the first stage of a multistage processor. A sequential series of actions was taken, where the intermediate results were forwarded from stage to stage until, at the final stage, a result would be punched into a new card.

While the MISD term is not commonly used, there are interesting uses of the MISD organization. A frequently used synonym for MISD is "streaming processor." Some of these processors are commonly available. One example is the GPU (graphics processing unit); early and simpler GPUs provided limited programmer flexibility in selecting the function or operation of a particular stage (ray tracing, shading, etc.). More modern GPGPUs (general-purpose GPUs) are more truly MISD, with a more complete operational set at the individual stages in the pipeline. Indeed, many modern GPGPUs incorporate both MISD and MIMD, as these systems offer multiple MISD cores.

Other MISD-style architectures include dataflow machines. In these machines, the source program is converted into a dataflow graph, each node of which is a required operation. Data is then streamed across the implemented graph. Each path through the graph is a MISD implementation. If the graph is relatively static for particular data sequences, then the path is strictly MISD; if multiple paths are invoked during program execution, then each path is MISD and there are MIMD instances in the implementation. Such dataflow machines have been realized by FPGA implementations.

MIMD: Multiple Instruction, Multiple Data Stream
The MIMD class of parallel architecture consists of multiple processors (of any type) and some form of interconnection. From the programmer's point of view, each processor executes independently, but the processors cooperate to solve a single problem, although some
form of synchronization is required to pass information and data between processors. Although no requirement exists that all processor elements be identical, most MIMD configurations are homogeneous, with all processor elements identical. There have been heterogeneous MIMD configurations that use different kinds of processor elements to perform different kinds of tasks, but these configurations are usually for special-purpose applications. From a hardware perspective, there are two types of homogeneous MIMD: multi-threaded and multi-core processors; frequently, implementations use both.

Multi-Threaded Processors
In multi-threaded MIMD, a single base processor is extended to include multiple sets of program and data registers. In this configuration, separate threads or programs (instruction streams) occupy each register set. As resources are available, the threads continue execution. Since the threads are independent, so is their resource usage. So, ideally, the multiple threads make better use of resources such as floating-point units and memory ports, resulting in a higher number of instructions executed per cycle. Multi-threading is a quite useful complement to large superscalar implementations, as these have a relatively extensive set of expensive floating-point and memory resources. Critical threads (tasks) can be given priority to ensure short execution, while less critical tasks avail of the underused resources. As a practical matter, these designs must ensure a minimum service for every task to avoid a type of task lockout. Otherwise, the implementation is straightforward, since all tasks can share data caches and memory without any interconnection network.

Multi-Core and Multiple Multi-Core Processor Systems
At the other end of the MIMD spectrum are multiple multi-core processors. These implementations must communicate results through an interconnection network and coordinate task control. Their implementation is significantly more complex than the simpler MIMD or even the more complex SIMD array processor.

The interconnection network in the multiprocessor passes data between processor elements and synchronizes the independent execution streams between processor elements. When the memory of the processor is distributed across all processors and only the local processor element has access to it, all data sharing is performed explicitly using messages and all synchronization is handled within the message system.

When communications between processor elements are performed through a shared memory address space – either global or distributed between processor elements (called distributed shared memory to distinguish it from distributed memory) – there are two significant problems that arise. The first is maintaining memory consistency: the programmer-visible ordering effects of memory references, both within a processor element and between different processor elements. This problem is usually solved through a combination of hardware and software techniques. The second is cache coherency – the programmer-invisible mechanism that ensures that all processor elements see the same value for a given memory location. This problem is usually solved exclusively through hardware techniques.

The primary characteristic of a large MIMD multiprocessor system is the nature of the memory address space. If each processor element has its own address space (distributed memory), the only means of communication between processor elements is through message passing. If the address space is shared (shared memory), communication is through the memory system. The implementation of a distributed memory machine is easier than the implementation of a shared memory machine, when memory consistency and cache coherency are taken into account, while programming a distributed memory processor can be more difficult. In recent times, large-scale (more than , cores) multiprocessors have used the distributed memory model based on implementation requirements.

Related Entries
Cray Vector Computers
Data Flow Computer Architecture
EPIC Processors
Fujitsu Vector Computers
IBM Blue Gene Supercomputer
Illiac IV
Superscalar Processors
VLIW Processors
Bibliographic Notes and Further Reading
Parallel processors have been implemented in one form or another almost as soon as the first uniprocessor implementation. Most early implementations were straightforward MIMD multiprocessors. With the advent of SIMD array processors, such as Illiac IV [], the need for a distinction became clear. Because the taxonomy is simple, numerous extensions have been proposed; some fairly general, such as those of Handler [], Kuck [], and El-Rewini []. Other extensions were of a more specialized nature, such as those of Göhringer [] and Chemij []. Additional comments on the taxonomy and elaborations can be found in the standard texts on parallel architecture such as Hwang [], Hockney [], and El-Rewini []. Xavier [] and Quinn [] are helpful texts which view the taxonomy from the programmer's viewpoint.

Bibliography
1. Flynn MJ () Very high speed computers. Proc IEEE :–
2. Flynn MJ () Some computer organizations and their effectiveness. IEEE Trans Comput ():–
3. Flynn MJ, Rudd RW () Parallel architectures. ACM Comput Surv ():–
4. Barnes GH, Brown RM, Kato M, Kuck D, Slotnick DL, Stokes RQ () The ILLIAC IV computer. IEEE Trans Comput C-:–
5. Handler W () Innovative computer architecture – how to increase parallelism but not complexity. In: Evans DJ (ed) Parallel processing systems, an advanced course. Cambridge University Press, Cambridge, pp –
6. Kuck D () The structure of computers and computation, vol . Wiley, New York
7. El-Rewini H, Abd-El-Barr M () Advanced computer architecture and parallel processing. Wiley, New York
8. Göhringer D et al () A taxonomy of reconfigurable single-/multiprocessor systems-on-chip. Int J Reconfigurable Comput :
9. Chemij W () Parallel computer taxonomy. MPhil, Aberystwyth University
10. Hwang K, Briggs FA () Computer architecture and parallel processing. McGraw-Hill, London, pp –
11. Hockney RW, Jesshope CR () Parallel computers . Adam Hilger/IOP Publishing, Bristol
12. Xavier C, Iyengar SS () Introduction to parallel algorithms. Wiley, New York
13. Quinn MJ () Designing efficient algorithms for parallel computers. McGraw-Hill, London

Forall Loops
Loops, Parallel

FORGE
John M. Levesque, Gene Wagenbreth
Cray Inc., Knoxville, TN, USA
University of Southern California, Topanga, CA, USA

Synonyms
Interactive parallelization; Parallelization; Vectorization; Whole program analysis

Definition
FORGE is an interactive program analysis package built on top of a database representation of an application. FORGE was the first commercial package that addressed the issue of performing interprocedural analysis for Program Consistency, Shared and Distributed Parallelization, and Vectorization. FORGE was developed by a team of computer scientists employed by Pacific Sierra Research and later Applied Parallel Research.

Discussion
Introduction
When FORGE was marketed in , whole program parallelization was not quick, easy, or convenient. Vectorization was a mature user and compiler art. Many users could write vectorizable loops that a compiler could translate to efficient vector instructions. Distributed and shared memory multiprocessors could be treated as vector processors and parallelized in the same manner. This is frequently very inefficient. Parallelization introduces much more overhead than vectorization. To overcome this overhead, parallelization requires a much higher granularity than vectorization. Parallelization of outer loops containing nested subroutine calls is usually the best approach. Achieving this high granularity was very difficult for a user to do by hand, and practically impossible for a compiler to accomplish automatically. FORGE was created to bridge
this gap, allowing the user to interactively initiate compiler analysis and view the results. The user controlled the parallelization process using high-level program knowledge. FORGE performed the time-consuming, tedious analysis that computers are better at than programmers.

At the time, a group at Rice University, under the direction of Ken Kennedy, was developing a "Whole Program Analysis" software package called Parascope. At the same time, a few of the developers of the VAST vectorizing pre-compiler began a project to develop FORGE, based on a database representation of the input application. In addition to the source code, runtime statistics could be entered into the database to identify important features derived during the execution of the application. Upon this database representation of the program an interactive, windows-based environment was built that allowed the user to interactively:

1. Peruse the database to find inconsistencies in the program
   a. COMMON block analysis
   b. Tracing of variables through the call chain, showing aliases from subroutine calls, equivalences, and COMMON blocks
   c. Identification of variables that are used and not set
   d. Identification of variables that are set and not used
   e. Queries to identify variables that satisfied certain user-defined templates
2. Interactively Vectorize Application
   a. Perform loop restructuring
      i. Loop splitting
      ii. Routine inlining – pull a routine into a loop
      iii. Routine outlining – push a loop into a routine
      iv. Loop inversion
3. Interactive Shared Memory Parallelization
   a. Array scoping
   b. Storage conflicts due to COMMON blocks
   c. Identification of Reduction function variables
   d. Identification of Ordered regions due to data dependencies
4. Interactive Distributed Memory Parallelization
   a. User-directed loop parallelization
   b. User-directed array distribution
   c. Automatic loop parallelization based on array distribution
   d. Automatic array distribution based on parallel loop access
   e. Iteration of c. and d. for whole program parallelization
   f. Generate HPF
5. Interactive Program Restructuring
   a. Loop Inversion
   b. Loop splitting
   c. Array index reordering
   d. Routine inlining
   e. Routine outlining
   f. TASK COMMON generation
6. Interactive CACHE View tool
   a. Given a hardware's Level  and Level  Cache sizes and runtime statistics
      i. Instrument code
      ii. Identify Cache alignment issues
      iii. Identify Cache overflows
      iv. Propose Blocking strategies

General Structure
Forge was written in the C programming language. The parser was constructed using the compiler-compiler tools LEX and YACC. The windowing system used was originally a custom interface written for SUN, APOLLO, and DEC workstations. Later a MOTIF-based X Windows system was used.

The Database
To perform the aforementioned functions, FORGE needed significant information concerning the input application. At the end of the development of FORGE in , the database could input the application source code (only Fortran ) and runtime statistics. From the source code, the following data was extracted:

1. For each routine
   a. Variable information
      i. Aliasing from COMMON, EQUIVALENCE, SUBROUTINE arguments
      ii. Read and Write references
      iii. Data type and size
      iv. Array dimensions
      v. Array indexing information
   b. Control structures within the routine – IF's and DO's
2. Call chain information
   a. Static Call chain – all possible calling sequences
   b. Dynamic Call chain – derived from the runtime statistics
3. Variable Aliasing through call chain, both static and dynamic
4. Basic Block (control flow) information for the entire application
   a. All IF constructs controlling the execution of the basic block
   b. All reaching definitions for variables used in the basic block
   c. All uses of variables set in the basic block
5. Array section analysis
   a. Information on every looping structure
      i. Section of array that is accessed by each DO loop
6. Integer value information (FACTS) for array index and data dependency analysis

Database Design
The FORGE database was designed and implemented
by the developers. An off-the-shelf database was not
used. In design of the database, the generation of the
database was performed incrementally with the ability
to automatically update the database when the source
code was modified. The intent was to spend the time
building the database so the interactive analysis within
the FORGE environment could be performed quickly.
FORGE stored a program as a Package. Each pack-
age contained a list of the source files in a program,
along with other information such as directories to
search, compiler options to use, and target hardware.
Brute force whole program analysis using the hard-
ware available at the time was very time consuming for
convenient interactive program analysis. For that rea-
son, a database was created containing only data needed
for parallelization of a program. Each program unit
(subroutine, function, main program) was analyzed and
stored separately in the database, on disk. Last modified
dates were saved for each file. A checksum was saved for
each program unit in a file. When the modification date
for a file indicated that the database was out of date, the
file was rescanned and the checksums for each routine
compared to the checksums saved with the database.
The database was regenerated only for modified rou-
tines. Building the initial database for a large program
could take an hour. After the initial creation, updating
the database for source file changes usually took a few
seconds.
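The modification-date and checksum scheme just described can be sketched roughly as follows. This is a hedged illustration only: the types, sizes, and helper functions are hypothetical assumptions, and FORGE's actual on-disk format is not documented in this entry.

    /* Hedged sketch of modification-date + per-routine checksum invalidation. */
    #include <time.h>

    typedef struct {
        char          name[64];
        unsigned long checksum;      /* checksum of the routine's source text */
        /* ... analysis tables for this program unit ... */
    } routine_db_t;

    typedef struct {
        time_t       saved_mtime;    /* file modification date stored in the database */
        int          nroutines;
        routine_db_t routines[256];
    } file_db_t;

    /* Assumed helpers (not shown): query file mtime, rescan a file for routine
     * names and checksums, and fully re-analyze one routine. */
    time_t file_mtime(const char *path);
    int    scan_routines(const char *path, routine_db_t scanned[], int max);
    void   reanalyze_routine(const char *path, routine_db_t *entry);

    void update_file_db(const char *path, file_db_t *db) {
        if (file_mtime(path) <= db->saved_mtime)
            return;                                  /* database still up to date */

        routine_db_t scanned[256];
        int n = scan_routines(path, scanned, 256);   /* cheap rescan: checksums only */

        /* Regenerate analysis only for routines whose checksum changed.
         * (Matching by position for brevity; a real tool would match by name.) */
        for (int i = 0; i < n && i < db->nroutines; i++)
            if (scanned[i].checksum != db->routines[i].checksum)
                reanalyze_routine(path, &db->routines[i]);

        db->saved_mtime = file_mtime(path);
    }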
When the user selected a package for analysis, tables
were built in memory combining the databases for
called and calling routines. This process typically took
a few seconds. The efficiency of the incremental update of the database and the handling of called routines in memory are the key features that made interactive whole program analysis feasible.

The database was designed to quickly supply information to the Vectorizer and Parallelizer when performing array section and scalar dependency analysis. All of the analysis features used by the parallelization analysis were made available to the user for program browsing. This only required the implementation of menus, since all the analysis was already implemented. The browser ended up as one of the most popular and powerful features of FORGE.

Users could develop templates to define routines that were called from the application and whose source was not available. These templates simply supplied the information about variables that were either passed into the routine through the argument list or through COMMON blocks.

Runtime Statistics
FORGE had an option to instrument the application to gather information about important routines, DO loops, IF constructs, and Cache utilization. The instrumented program could then be run with an important dataset, and an ASCII output would be generated that could be fed into the database to assist in performing analysis. There was an option to ignore static information in favor of the dynamic information when performing the aforementioned analysis. The following runtime statistics were gathered:

– CPU time of each routine, inclusive and exclusive
– Lengths of DO loops
– CPU time of each DO loop, inclusive and exclusive
– Frequency of IF constructs (optional)
– Array addresses (optional)

Interactive Tools available to the User
Query Database
A popular tool named Query allowed the user to find all variables in a program that matched criteria defined in templates. Templates specified criteria for:

● Variable name
● Variable type
● Variable storage
● Statement context
● Statement type
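As a purely hypothetical sketch (FORGE's internal representation is not described in this entry), such a template can be thought of as a set of optional predicates that each variable occurrence in the database is matched against; every name and enumeration below is an illustrative assumption.

    /* Hypothetical sketch of template-based variable queries. */
    #include <stdbool.h>
    #include <string.h>

    typedef enum { ANY_TYPE, T_INTEGER, T_REAL, T_DOUBLE }      var_type_t;
    typedef enum { ANY_STORAGE, S_LOCAL, S_COMMON, S_ARGUMENT } var_storage_t;
    typedef enum { ANY_CONTEXT, C_DO_LOOP, C_IF_BLOCK }         stmt_context_t;
    typedef enum { ANY_STMT, ST_READ, ST_WRITE, ST_ASSIGN }     stmt_type_t;

    typedef struct {                 /* one variable occurrence in the database */
        const char    *name;
        var_type_t     type;
        var_storage_t  storage;
        stmt_context_t context;
        stmt_type_t    stmt;
    } occurrence_t;

    typedef struct {                 /* a query template: ANY_*/NULL means "don't care" */
        const char    *name_prefix;
        var_type_t     type;
        var_storage_t  storage;
        stmt_context_t context;
        stmt_type_t    stmt;
    } template_t;

    static bool matches(const occurrence_t *o, const template_t *t) {
        if (t->name_prefix &&
            strncmp(o->name, t->name_prefix, strlen(t->name_prefix)) != 0) return false;
        if (t->type    != ANY_TYPE    && o->type    != t->type)    return false;
        if (t->storage != ANY_STORAGE && o->storage != t->storage) return false;
        if (t->context != ANY_CONTEXT && o->context != t->context) return false;
        if (t->stmt    != ANY_STMT    && o->stmt    != t->stmt)    return false;
        return true;
    }

For example, a template leaving the name blank but requiring COMMON storage and read accesses would correspond roughly to "show all COMMON variables that are read."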
Many common templates were supplied, or the user could create their own templates:

● Show all variables read into the program
● Show all variables written out
● Show all variables used within a particular DO loop
● Show all variables set under a condition and used unconditionally

Cache Vu Tool
The Cache Vu tool, another tool within the FORGE system, was extremely powerful and informative. The runtime statistics needed to support this feature were very large and had to be collected in a controlled manner. FORGE was able to visually show the user which arrays were overlapping in the cache and being evicted from the cache due to cache associativity.

COMMON Block Grid
A grid was displayed with routines down the side and COMMON blocks across the top. When a COMMON block was selected, the variables within that COMMON block were shown across the top, showing how each variable was referenced in each routine.

USE-DEF and DEF-USE
This was extremely useful in debugging an application. Plans to attach FORGE to a debugger never materialized.

– The user could click on a variable in the source and drop down a menu to select, show reaching definitions, or show subsequent uses. The uses and/or sets were shown within the control structures controlling the execution of the line.

Tracing Variables Through the Call Chain
All occurrences of a variable were shown interspersed within the call chain, including IF constructs. Aliasing was shown where it occurs due to EQUIVALENCE, COMMON block, or SUBROUTINE call.

Shared Memory Parallelizer
If runtime statistics were available, the window would present the user with the highest-level DO loop that is a potential candidate for parallelization. If the DO loop contained I/O or an unknown routine, the user would be alerted to that. The user would then say:

1. Parallelize Automatically
2. Parallelize Interactively

The automatic mode would complete without interaction and show the user if there were any dependency regions, storage conflicts, or other inhibitors. The Interactive mode would converse with the user, showing each inhibitor as encountered. The user could then instruct the parallelizer to ignore dependencies, if appropriate, and continue.

Finally, the user could output the parallelized code in OpenMP, Cray Micro-tasking, or as a parallelized routine to be run with FORGE's shared memory runtime library. The latter was accomplished by outlining the DO loop into a subroutine call that got called in parallel. Shared variables were passed as arguments and private variables were made local to the outlined routine.

Distributed Memory Parallelizer
The user could select a set of arrays to decompose and decompose the arrays through a menu – only single-index decomposition was supported. FORGE would then identify every DO loop that accessed the decomposed dimension, and the user could have FORGE distribute the loop iterations using the owner-execute rule automatically. During the loop parallelization, FORGE searched for additional arrays to decompose and loops to parallelize. This cycle of array decomposition and loop distribution continued until no new transformations were found.

Parallelization was saved in an internal format between sessions. The user could choose to save the parallelization in a new copy of the source program with HPF directives.

Executable parallel code could be generated either with the interactive parallelizer or with a batch translator named XHPF. The translated code contained calls to a runtime library. Storage for distributed arrays was allocated at runtime as determined by the size of the array and the number of distributed CPUs used. Ghost cells were allocated as needed. Parallel loop bounds were computed at runtime. Internode communication was performed pre- and post-loop as required. Communication was inserted in serial code as required. One node was used to perform all I/O operations.
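The loop-outlining transformation described above for the shared-memory runtime can be illustrated with a small sketch. FORGE operated on Fortran DO loops; for consistency with the other sketches in this article, the illustration below uses C and POSIX threads, and all names, the thread count, and the work division are hypothetical, not FORGE output.

    /* Hedged illustration of loop outlining: the loop body is moved into a
     * routine, shared data is passed by argument, and private data becomes
     * local to the outlined routine. */
    #include <pthread.h>

    #define NTHREADS 4

    typedef struct {               /* shared variables passed as arguments */
        const double *a, *b;
        double       *c;
        int           lo, hi;      /* this thread's iteration range */
    } outlined_args_t;

    /* The outlined loop body: one call per thread, executed in parallel. */
    static void *outlined_loop(void *p) {
        outlined_args_t *arg = p;
        for (int i = arg->lo; i < arg->hi; i++) {   /* i is private (local) */
            double t = arg->a[i] + arg->b[i];       /* t is private (local) */
            arg->c[i] = t * t;
        }
        return 0;
    }

    /* Original serial loop: for (i = 0; i < n; i++) { t = a[i]+b[i]; c[i] = t*t; } */
    void parallel_loop(const double *a, const double *b, double *c, int n) {
        pthread_t       tid[NTHREADS];
        outlined_args_t args[NTHREADS];
        int chunk = (n + NTHREADS - 1) / NTHREADS;

        for (int t = 0; t < NTHREADS; t++) {
            int lo = t * chunk;
            int hi = (t + 1) * chunk < n ? (t + 1) * chunk : n;
            args[t] = (outlined_args_t){ a, b, c, lo, hi };
            pthread_create(&tid[t], 0, outlined_loop, &args[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], 0);
    }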
Program Restructurer
The parallelized code was available to the user for
all the program transformations executed by FORGE.
When the source code was modified, it would be
presented to the user and the user could accept
the new version and the database would be updated
accordingly.

Parallel Program Generation


FORGE used the result of analysis to generate paral-
lel programs. It performed source-to-source translation.
The input could contain compiler parallelization direc-
tives. Syntax defined by most major vendors of the time
(IBM, NEC, FUJITSU, DEC, INTEL, NCUBE) were
supported, as well as HPF and OMP standards. FORGE
could take any of these as input, combine them with
interactive or automatic analysis, and generate direc-
tives. FORGE could also generate a parallel program
containing calls to a FORGE runtime library. Versions
of the library existed for shared and distributed memory
parallelization for the major vendors. The user would
compile the FORGE output with the vendor supplied
F compiler, link with the appropriate FORGE library,
and run the parallel application.
Instrumented versions of the FORGE parallel
libraries were provided. When linked with the instru-
mented libraries and run, timing data was created. The
timing data could be viewed interactively with FORGE, superimposed on a dynamic call chain. Efficiencies
and “hotspots” were identifiable and correctable.
Batch tools, SPF – shared memory parallelizer, and
XHPF (translate HPF) performed the same translation
functions as FORGE. A user could modify program
directives with a conventional editor and run these
tools as precompilers as part of the normal compilation
stack.

Future Directions
There were three major deficiencies in Forge. First,
Forge handled only a subset of Fortran . Second,
Forge was developed with a Unix X Windows interface.
A convenient native MS Windows interface was needed.
Last, the display of parallel loop inhibitors due to
loop carried dependencies often contained many false
dependencies. The minimization of the false dependencies and a convenient method for the user to view and ignore dependencies were problems for Forge, as well as for all other parallelization tools then and now. Work continued in these areas through the year , when Applied Parallel Research went out of business.

Bibliography
1. Balasundaram V, Kennedy K, Kremer U, McKinley K, Subhlok J () The ParaScope editor: an interactive parallel programming tool. In: Conference on High Performance Networking and Computing: Proceedings of the  ACM/IEEE Conference on Supercomputing. ACM, New York
2. Kushner EJ () Automatic parallelization of grid-based applications for the iPSC/. In: Lecture notes in computer science, vol , pp –
3. Gernt M () Automatic parallelization of a crystal growth simulation program for distributed-memory systems. In: Lecture notes in computer science, vol , pp –
4. Applied Parallel Research. FORGE  Baseline system user's guide
5. Applied Parallel Research. FORGE high performance Fortran xHPF user's guide
6. Applied Parallel Research. FORGE  Distributed memory parallelizer user's guide
7. Song ZW, Roose D, Yu CS, Berlamont J () Parallelization of software for coastal hydraulic simulations for distributed memory parallel computers using FORGE . In: Power H (ed) Applications of high-performance computing in engineering IV, pp –
8. Frumkin M, Hribar M, Jin H, Waheed A, Yan J () A comparison of automatic parallelization tools/compilers on the SGI Origin. In: Proceedings of the  ACM/IEEE SC Conference (SC'), Orlando. IEEE Computer Society, Washington
9. Bergmark D () Optimization and parallelization of a commodity trade model for the IBM SP/, using parallel programming tools. In: Proceedings of the th International Conference on Supercomputing, Barcelona
10. Kennedy K, Koelbel C, Zima H () The rise and fall of High Performance Fortran: an historical object lesson. In: Proceedings of the Third ACM SIGPLAN Conference on History of Programming Languages, San Diego. ACM, New York
11. Saini S () NAS experiences of porting CM Fortran codes to HPF on IBM SP and SGI Power Challenge. NASA Ames Research Center, Moffett Field
Formal Methods–Based Tools for Race, Deadlock, and Other Errors
Peter Müller
ETH Zurich, Zurich, Switzerland

Definition
Formal methods–based tools for parallel programs use mathematical concepts such as formal semantics, formal specifications, or logics to examine the state space of a program. They aim at proving the absence of common concurrency errors such as races, deadlocks, livelocks, or atomicity violations for all possible executions of a program. Common techniques applied by formal methods–based tools include deductive verification, model checking, and static program analysis.

Discussion
Introduction
Concurrent programs are notoriously difficult to test. Their behavior depends not only on the input but also on the scheduling of threads or processes. Many concurrency errors occur only for very specific schedulings, which makes them extremely hard to find and reproduce.

An alternative to testing is to apply formal methods. Formal methods use mathematical concepts to examine all possible executions of a program. Therefore, they are able to guarantee certain properties for all inputs and all possible schedulings of a program. Most formal methods–based tools for concurrent programs focus on races and deadlocks, the two most prominent concurrency errors. Some of them also detect other errors such as livelocks, atomicity violations, undesired nondeterminism, and violations of specifications provided by the user.

This entry covers tools based on three kinds of formal methods. () Tools for deductive verification use mathematical logic to construct a correctness proof for the input program. () Tools for state space exploration (model checkers) check properties of a program by exploring all its executions. () Tools for static program analysis approximate the possible dynamic behaviors of a program and then check properties of these approximations. This entry summarizes the foundations and applications of these approaches, discusses their strengths and weaknesses, and names a few representative tools for each approach.

The evaluation of the three approaches focuses on the following criteria:

Soundness. A verification or analysis tool is sound if it produces only results that are valid with respect to the semantics of the programming language. In other words, a sound tool does not miss any errors. Soundness is the key property that distinguishes formal methods from testing. However, some formal methods–based tools sacrifice soundness in favor of reducing false positives or improving efficiency.

Completeness. A verification or analysis tool is complete if it does not produce false positives, that is, each error detected by the tool can actually occur according to the semantics of the programming language. Minimizing the number of false positives is crucial for the practical applicability of a tool because each reported error needs to be subjected to further investigation, which usually involves human inspection.

Modularity. A verification or analysis tool is modular if it can analyze the modules of a software system independently of each other and deduce the correctness of the whole system from the correctness of its components. Modularity allows a tool to analyze program libraries and improves scalability.

Automation. A verification or analysis tool has a high degree of automation if it requires little or no user interaction. Typical forms of user interaction include providing the specification to be checked by the tool, assisting the tool through program annotations, and constructing proofs manually. A high degree of automation is necessary to make the application of formal methods–based tools economically feasible.

Efficiency. A verification or analysis tool is efficient if it can analyze large programs in a relatively short amount of time and space. In general, the number of states and executions of a program grows exponentially in the size of the program (for instance, the number of variables, branches, and threads). Therefore, tools must be highly efficient in order to handle practical programs.

Deductive Verification

Overview
Tools for deductive verification construct a formal proof that demonstrates that a given program satisfies its specification. Common ways of constructing the correctness proof are (1) to use a suitable calculus such as the weakest precondition calculus to compute verification conditions – logical formulas whose validity entails the correctness of the program – and then prove their validity in a theorem prover, or (2) to use symbolic execution to compute symbolic states for each path through the program and then use a theorem prover to check that the symbolic states satisfy the specification. Both approaches employ a formal semantics of the programming language. The specification to be verified typically includes implicit properties (such as deadlock freedom) as well as properties explicitly specified by the user (for instance through pre- and postconditions).

Deductive verifiers for concurrent programs apply various verification methodologies, which are illustrated by the Java example below. Monitor invariants (see entry Monitors, Axiomatic Verification of) allow one to express properties of the monitor variables (here, the fields in class Account) that hold when the monitor is not currently locked by any thread. For instance, one could express that the transaction count is a nonnegative number. Rely-guarantee reasoning [] allows one to reason about the possible interference of other threads. For instance, one could prove that no thread ever decreases the transaction count. Deadlock freedom can be proved by establishing a partial order on locks and proving that each thread acquires locks in ascending order (see entry on Deadlocks). For instance, one could specify that the order on the locks associated with Account objects is given by the less-than order on the accounts' numbers. The check that locks are acquired in ascending order would fail for the example because the locks of this and to are acquired regardless of that order. Therefore, the concurrent execution of x.transfer(y, a) and y.transfer(x, b) might deadlock.

class Account {
  int number;
  int balance;
  int transactionCount;

  void transfer(Account to, int amount) {
    if(this == to) return;
    acquire this;
    acquire to;
    balance = balance - amount;
    to.balance = to.balance + amount;
    transactionCount = transactionCount + 1;
    to.transactionCount = to.transactionCount + 1;
    release to;
    release this;
  }

  // constructors and other methods omitted
}

The absence of data races can be proved using a permission system that maintains an access permission for each shared variable. The permission gets created when the variable comes into existence; it can be transferred among threads and between threads and monitors, but not duplicated. Each shared variable access gives rise to a proof obligation that the current thread possesses the permission for that variable. Since access permissions cannot be duplicated, this scheme prevents data races. In the example above, one could store the permissions for balance and transactionCount in the monitor of their object. By acquiring the monitors for this and to, the current thread obtains those permissions and, therefore, is allowed to access the fields. When the monitors are released, the permissions are transferred back to the monitors. Such a permission system is for instance applied by concurrent separation logic [].

An alternative technique for proving the absence of data races is through object ownership []. Ownership schemes associate zero or one owner with each object. An owner can be another object or a thread. A thread becomes the owner of an object by creating it or by locking its monitor. Each access to an object's field gives rise to a proof obligation that the object is (directly or transitively) owned by the current thread. Since an object has at most one owner, this scheme prevents data races.

Besides preventing data races, both permissions and ownership guarantee that the value of a variable does not change while a thread holds its permission or owns the object it belongs to. Therefore, they also give an atomicity guarantee, which allows verifiers to apply sequential reasoning techniques in order to prove additional program properties.
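To illustrate the lock-ordering methodology discussed above, the following sketch (not part of the original entry) rewrites transfer in plain Java, replacing the pseudo-statements acquire and release with synchronized blocks and acquiring the two monitors in ascending order of the account numbers. Under the assumption that account numbers are distinct and never change, every thread acquires Account locks in the same global order, so the deadlock between x.transfer(y, a) and y.transfer(x, b) can no longer occur; this is the kind of program a lock-ordering verifier would accept.

class Account {
    final int number;          // assumed unique and immutable
    int balance;
    int transactionCount;

    Account(int number) { this.number = number; }

    void transfer(Account to, int amount) {
        if (this == to) return;
        // Establish a global lock order: always lock the account
        // with the smaller number first.
        Account first  = (this.number < to.number) ? this : to;
        Account second = (first == this) ? to : this;
        synchronized (first) {
            synchronized (second) {
                this.balance -= amount;
                to.balance += amount;
                this.transactionCount++;
                to.transactionCount++;
            }
        }
    }
}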

Discussion
Program verifiers are sound unless there is a flaw in the underlying logic or the implementation of the verifier.

In principle, program verification is complete if the underlying logic is complete with respect to the semantics of the programming language. However, program verifiers typically enforce some kind of verification methodology as described above. These methodologies simplify the verification of certain programs, but fail for others. For instance, a tool that requires lock ordering for deadlock prevention might not be able to handle programs that prevent deadlocks by other strategies. Another source of incompleteness is the limitations of theorem provers. Since the validity of verification conditions is in general undecidable, program verifiers based on automatic provers such as SMT solvers report spurious errors when the prover cannot confirm the validity of a verification condition.

A number of program verifiers support modular reasoning. For instance, a permission system removes the necessity to inspect the whole program in order to detect data races. Nevertheless, modular verification is still challenging for certain program properties such as global invariants and liveness properties.

Deductive program verifiers are typically not fully automatic. Users have to provide the specification the program is verified against. Even if the property to be proved is the absence of a simple error such as data races, users typically have to annotate the code with auxiliary assertions such as loop invariants. Moreover, interactive verifiers require users to guide the proof search.

The efficiency of automatic verifiers (those verifiers where the only user interaction is by writing annotations) depends on the number and complexity of the generated verification conditions as well as on the performance of the underlying automatic theorem prover. The former aspect is positively influenced by applying a modular verification methodology; the latter benefits from the enormous progress in the area of automatic theorem proving.

In summary, since program verifiers are sound and since modular verifiers scale well, they are well suited for software projects with the highest quality requirements, especially safety-critical software. However, the overhead of writing specifications as well as limitations of existing verification methodologies and automatic theorem provers have so far prevented program verification from being applied routinely in mainstream software development.

Examples
Most verifiers for concurrent programs are in the stage of research prototypes. Microsoft's VCC is an automatic verifier for concurrent, low-level C code. It has been used to verify the virtualization layer Hyper-V. VCC applies a verification methodology based on ownership []. Permission systems are used by verifiers based on separation logic such as VeriFast for C programs [], SmallfootRG [], and Heap-Hop [], as well as by Chalice []. Both Heap-Hop and Chalice support shared-variable concurrency and message passing. Lock-free data structures have been verified using the interactive verifier KIV [].

State Space Exploration

Overview
Software model checkers consider all possible states that may occur during the execution of a program. They check properties of individual states or sequences of states (traces) through an exhaustive search for a counterexample. This search is either performed by enumerating all possible states of a program execution (explicit-state model checking) or by representing states through formulas in propositional logic (symbolic model checking). In both approaches, programs are regarded as transition systems whose transitions are described by a formal semantics of the programming language. The specifications to be checked are given as formulas in propositional logic for individual states (assertions) or as formulas in a temporal logic (typically LTL or CTL) for traces. The use of temporal logic allows model checkers in particular to check liveness properties such as "every request will eventually be handled." Liveness properties are important for the correctness of all parallel programs, but especially for those based on message passing (such as MPI programs) because they often apply sophisticated protocols. Other errors such as deadlocks can be detected without giving an explicit specification. For instance, for the account example above, a model checker detects that for some input values and thread interleavings, method transfer does not reach a valid terminal state.

The inputs and interleavings are then reported as a counterexample that can be inspected and replayed by the user.

The main limitation of software model checking is the so-called state space explosion problem. The state space of a program execution grows exponentially in the number of variables and the number of threads. Therefore, even small programs have a very large state space and require elaborate reduction techniques to be efficiently checkable. Partial order reduction reduces the number of relevant interleavings by identifying operations that are independent of actions of other threads (for instance, local variable access).

Abstraction is used to simplify the transition system before checking it. It is common that many aspects of a program execution are not relevant for a given property to be checked. For instance, the values of amount and balance in the transfer example are not relevant for the existence of a deadlock. Therefore, they can be abstracted away, which reduces the state space significantly. Yet another approach to reduce the state space is to limit the search to program executions with a small number of data objects, threads, or thread preemptions. Such limits compromise soundness, but practical experience shows that many concurrency errors may be detected with a small number of threads and preemptions. For instance, detecting the deadlock in method transfer requires only two Account objects, two threads, and one preemption.

Discussion
For finite transition systems (i.e., programs with a finite number of states such as programs with an upper bound on the number of threads, the number of dynamically allocated data objects, and the recursion depth, in particular, terminating programs), model checking is sound since all possible states and traces can be checked. However, some model checkers explore only a subset of the possible states of a program execution based on heuristics that are likely to detect errors; this approach trades soundness for efficiency.

Model checking for finite transition systems is in principle complete, although abstraction may lead to spurious errors. Counterexample-guided abstraction refinement [] is a technique to recover completeness by using the counterexample traces of spurious errors to iteratively refine the abstraction. In contrast to deductive verification, model checking is not limited by the expressiveness of a verification methodology. Model checkers are thus capable of analyzing arbitrary programs, including for instance optimistic concurrency, benevolent data races, and interlocked operations. In practice, software model checking can be inconclusive because the state space is too large, causing the model checker to abort.

Although there is work on compositional model checking, most model checkers are not modular because it is difficult to summarize temporal properties at module boundaries.

Model checking, including abstraction, is fully automatic. Users need to provide only the property to be checked. When the checking is not feasible with the available resources, users also need to manually simplify the program or provide upper limits on the number of objects, threads, etc.

Due to the large state space of programs (in particular of parallel programs and programs that manipulate large data structures) and the non-modularity of the approach, model checking is typically too slow to be applied routinely during software development (in contrast to type checking and similar static analyses). However, for control-intensive programs with rather small data structures such as device drivers or embedded software, model checking is an efficient and very effective verification technique.

Examples
Even though most software model checkers focus on sequential programs, there are a number of tools available that check concurrency errors. JavaPathFinder [] is an explicit-state model checker for concurrent Java programs that has been applied in industry and academia. It detects data races and deadlocks as well as violations of user-written assertions and LTL properties.

MAGIC [] checks whether a labeled transition system is a safe abstraction of a C procedure. The C procedure may invoke other procedures which are themselves specified in terms of state machines, which enables compositional checking. MAGIC uses counterexample-guided abstraction refinement to reduce the state space of the system.
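As a concrete illustration of the kind of exhaustive schedule exploration described above, the following self-contained driver (a minimal sketch, not part of the original entry) sets up the two-account, two-thread scenario from the transfer example; the pseudo-statements acquire and release of the entry's code are realized here with nested synchronized blocks. An explicit-state model checker for Java, such as the JavaPathFinder tool mentioned above, can systematically enumerate the interleavings of t1 and t2 and report the one in which each thread holds one lock and waits forever for the other, whereas ordinary testing may never hit that schedule.

// Hypothetical, self-contained deadlock driver (illustration only).
public class TransferDeadlockDriver {

    static class Account {
        int balance;
        int transactionCount;

        void transfer(Account to, int amount) {
            if (this == to) return;
            synchronized (this) {          // "acquire this"
                synchronized (to) {        // "acquire to"
                    this.balance -= amount;
                    to.balance += amount;
                    this.transactionCount++;
                    to.transactionCount++;
                }                          // "release to"
            }                              // "release this"
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Account x = new Account();
        Account y = new Account();
        Thread t1 = new Thread(() -> x.transfer(y, 10));
        Thread t2 = new Thread(() -> y.transfer(x, 20));
        t1.start();
        t2.start();
        t1.join();   // under the deadlocking interleaving, neither join returns
        t2.join();
    }
}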

BLAST [] finds data races in C programs. It com- Static data flow analyses typically apply the lockset
bines counterexample-guided abstraction refinement algorithm to detect races. They compute the set of locks
and assume-guarantee reasoning to infer a suitable each thread holds when accessing any given variable.
model for thread contexts, which results in a low rate If for each variable the locksets of all threads are not
of false positives. disjoint then there is mutual exclusion on the variable
The Zing tool [] checks for data races, deadlocks, accesses (see entry on Race Detection Techniques for
API usage rules, and other safety properties. Its dis- details). In the example above, the static analysis would
tinctive features are the use of procedure summaries to render the field accesses safe if it can show that for all
make the model checking compositional and the use accesses to x.balance and x.transactionCount,
of Lipton’s theory of reduction [] to prune irrelevant the executing thread holds at least the lock associated
thread interleavings. with the object x.
CHESS [] is a tool for systematically enumerat- Static analyses detect atomicity violations using Lip-
ing thread interleavings in order to find errors such ton’s theory []. Via a classification of operations into
as data races, deadlocks, and livelocks in Win DLLs right and left movers, an analysis can show that a given
and managed.NET code. The analysis is complete, code block is atomic, that is, any execution of the block
that is, every error found by CHESS is possible in is equivalent to an execution without interference from
an execution of the program. CHESS uses preemp- other threads. For this purpose, the analysis must show
tion bounding, that is, limits the number of thread that variable accesses are both left and right movers,
preemptions per program execution to reduce the which amounts to showing the absence of data races. In
state space. Modulo this preemption bound, CHESS is the example, the conditional statement does not access
sound; all remaining errors require more preemptions any shared variable and is, thus, a both-mover. The
than the bound. Soundness is achieved by capturing all acquire statements are right movers, the release
sources of non-determinism, including memory model statements are left movers, and the field accesses are
effects. both-movers provided that there are no data races. Con-
Model checkers for message-passing concurrency sequently, the method is atomic.
include for instance MPI-Spin [], which check Deadlocks can be detected by computing a static
Promela models of MPI programs, and ISP [], which lock-order graph and checking this graph to be
operates directly on the MPI/C source code. Both tools acyclic. A deadlock analysis would complain about the
detect deadlocks and assertion violations; MPI-Spin transfer example if it cannot rule out that two
also checks liveness properties. threads execute transfer concurrently with reversed
arguments.
Most data flow analyses infer the information
needed to check for concurrency errors. This also
Static Program Analysis
involves the computation of a whole range of addi-
Overview tional program properties, which are required for these
Static analyzers compute static approximations of the checks. For instance, the analyses require alias infor-
possible behaviors of a program and then check prop- mation in order to decide whether to expressions refer
erties of these approximations. Common forms of static to the same object at run time. A may-alias analysis
analysis include, among others, data flow analysis and is for instance used for sound deadlock checking (to
type systems. Both techniques are typically described check whether two threads potentially compete for the
formally as sets of inference rules that allow the ana- same lock), whereas a must-alias analysis is required for
lyzer to infer properties of execution states, for instance, sound race checking (to check whether two locksets def-
the set of possible values of a variable. When annota- initely overlap). Another common auxiliary analysis is
tions are given by the programmer, these rules are also thread-escape analysis which distinguishes thread-local
used to check the annotations. Both data flow analysis variables from shared variables; this is for instance use-
and type systems have been applied extensively to detect ful to suppress warnings when accesses to thread-local
concurrency errors. variables are not guarded by any lock.
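The core of the lockset computation described above can be summarized in a few lines. The sketch below is illustrative only and does not reproduce any particular tool's implementation: for every shared variable it intersects the sets of locks held at that variable's accesses, whether those per-access lock sets come from a static data flow analysis or from run-time instrumentation. An empty final set means no single lock consistently guards the variable, so a potential race is reported.

import java.util.*;

// Illustrative lockset refinement: the candidate set of a variable starts as
// the locks held at its first access and is intersected with the locks held
// at every later access. An empty candidate set flags a potential race.
class LocksetChecker {
    private final Map<String, Set<String>> candidate = new HashMap<>();

    void access(String variable, Set<String> locksHeld) {
        Set<String> c = candidate.get(variable);
        if (c == null) {
            candidate.put(variable, new HashSet<>(locksHeld)); // first access
        } else {
            c.retainAll(locksHeld);                            // intersection
        }
    }

    List<String> potentialRaces() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : candidate.entrySet()) {
            if (e.getValue().isEmpty()) result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        LocksetChecker checker = new LocksetChecker();
        // x.balance is always accessed with the lock of x held: no warning.
        checker.access("x.balance", Set.of("lock(x)"));
        checker.access("x.balance", Set.of("lock(x)", "lock(y)"));
        // counter is accessed once without any lock: its lockset becomes empty.
        checker.access("counter", Set.of("lock(x)"));
        checker.access("counter", Set.of());
        System.out.println("Potential races: " + checker.potentialRaces());
    }
}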

Discussion
Static analyses are in principle sound. However, many existing tools sacrifice soundness in favor of efficiency. For instance, several data race analyses omit a must-alias analysis, which is necessarily flow-sensitive, and, therefore, cannot check soundly whether two locksets overlap.

Static analyses are by nature incomplete since they over-approximate the possible executions of a program. Another source of incompleteness is that static analyses – much like deductive verification – check whether programs comply with a given methodology, such as lock-based synchronization or static lock ordering. Programs that do not comply might lead to false positives.

Whether or not a static analysis is modular depends mostly on the existence of program annotations. Most data flow analyses are whole-program analyses, but there exist a number of analyses that compute procedure summaries to enable modular checking. By contrast, type systems typically require annotations, which are then checked modularly.

Aside from annotations that may be required from programmers, static analyses are fully automatic. However, they often produce a large number of false positives, which need to be inspected manually or filtered aggressively (which may compromise soundness).

Static analyses are extremely efficient. Many static analyzers have been applied to programs consisting of hundreds of thousands or even millions of lines of code. This scale is orders of magnitude larger than for deductive verification and model checking.

In summary, static analysis tools typically favor efficiency over soundness and completeness. This choice makes them ideal for finding errors in large applications, whereas deductive verification and model checking are more appropriate when strong correctness guarantees are needed.

Examples
Warlock [], RacerX [], and Relay [] use the lockset algorithm to detect data races in C programs. RacerX also computes a static lock-order graph to detect deadlocks. It uses various heuristics to improve the analysis results, for instance, to identify benign races. Relay is a modular analysis that limits its unsoundness to four very specific sources, one of them being the syntactic filtering of warnings to reduce the number of false positives. RacerX and Relay have been applied to millions of lines of code such as the Linux kernel.

Chord [] is a context-sensitive, flow-insensitive race checker for Java that supports three common synchronization idioms: lexically scoped lock-based synchronization, fork/join, and wait/notify. Since Chord does not perform a must-alias analysis, it is not sound. Chord has been applied to hundreds of thousands of lines of code.

rccjava [] uses a type system to check for data races in Java programs. The annotations required from the programmer include declarations of thread-local variables and of the locks that guard accesses to a shared variable. The type system has been extended to atomicity checking []. rccjava scales to hundreds of thousands of lines of code.

Related Entries
Deadlocks
Determinacy
Intel Parallel Inspector
Monitors, Axiomatic Verification of
Race Conditions
Race Detectors for Cilk and Cilk++ Programs
Race Detection Techniques
Owicki-Gries Method of Axiomatic Verification

Bibliographic Notes and Further Reading
This entry focuses entirely on tools that detect concurrency errors present in programs. Two related areas are (1) tools that prevent concurrency errors during the construction of the program and (2) tools that detect concurrency errors in other artifacts.

Tools of the first kind support correctness by construction. Programs are typically constructed through a stepwise refinement of some high-level specification to executable code. Since the code is derived systematically from a formal specification, errors including concurrency errors are prevented. Interested readers may start from Abrial's book [].

Tools of the second kind do not operate on programs but on other representations of software and hardware. Races and deadlocks may occur for instance also in workflow graphs; they can be detected by the approaches described here, but also through Petri-net and graph algorithms [].

Similarly, hardware can be verified not to contain concurrency errors such as deadlock [], an area in which model checking has been applied with great success.

Acknowledgments
Thanks to Felix Klaedtke and Christoph Wintersteiger for their helpful comments on a draft of this entry.

Bibliography
. Abrial J-R () Modeling in Event-B. Cambridge University Press, Cambridge
. Andrews T, Qadeer S, Rajamani SK, Xie Y () Zing: exploiting program structure for model checking concurrent software. In: Gardner P, Yoshida N (eds) Concurrency theory (CONCUR). Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Bryant R, Kukula J () Formal methods for functional verification. In: Kuehlmann A (ed) The best of ICCAD:  years of excellence in computer aided design. Kluwer, Norwell, pp –
. Calcagno C, Parkinson MJ, Vafeiadis V () Modular safety checking for fine-grained concurrency. In: Nielson HR, Filé G (eds) Static analysis (SAS). Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Chaki S, Clarke EM, Groce A, Jha S, Veith H () Modular verification of software components in C. IEEE Trans Softw Eng ():–
. Clarke EM, Grumberg O, Jha S, Lu Y, Veith H () Counterexample-guided abstraction refinement for symbolic model checking. J ACM ():–
. Cohen E, Dahlweid M, Hillebrand M, Leinenbach D, Moskal M, Santen T, Schulte W, Tobies S () VCC: a practical system for verifying concurrent C. In: Berghofer S, Nipkow T, Urban C, Wenzel M (eds) Theorem proving in higher order logics (TPHOLs ). Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Engler DR, Ashcraft K () RacerX: effective, static detection of race conditions and deadlocks. In: Scott ML, Peterson LL (eds) Symposium on operating systems principles (SOSP). ACM, New York, pp –
. Flanagan C, Freund SN () Type-based race detection for Java. In: Proceedings of the ACM conference on programming language design and implementation (PLDI). ACM, New York, pp –
. Flanagan C, Freund SN, Lifshin M, Qadeer S () Types for atomicity: static checking and inference for Java. ACM Trans Program Lang Syst ():–
. Henzinger TA, Jhala R, Majumdar R () Race checking by context inference. In: Pugh W, Chambers C (eds) Programming language design and implementation (PLDI). ACM, New York, pp –
. Jacobs B, Leino KRM, Piessens F, Schulte W, Smans J () A programming model for concurrent object-oriented programs. ACM Trans Program Lang Syst ():–
. Jacobs B, Piessens F () The VeriFast program verifier. Technical Report CW-, Department of Computer Science, Katholieke Universiteit Leuven
. Jones CB () Specification and design of (parallel) programs. In: Proceedings of IFIP'. North-Holland, pp –
. Leino KRM, Müller P, Smans J () Verification of concurrent programs with Chalice. In: Aldini A, Barthe G, Gorrieri R (eds) Foundations of security analysis and design V. Lecture Notes in Computer Science, vol . Springer, Berlin, pp –
. Lipton RJ () Reduction: a method of proving properties of parallel programs. Commun ACM ():–
. Musuvathi M, Qadeer S, Ball T, Basler G, Nainar PA, Neamtiu I () Finding and reproducing heisenbugs in concurrent programs. In: Draves R, van Renesse R (eds) Operating systems design and implementation (OSDI). USENIX Association, pp –
. Naik M, Aiken A, Whaley J () Effective static race detection for Java. In: Schwartzbach MI, Ball T (eds) Programming language design and implementation (PLDI). ACM, pp –
. O'Hearn PW () Resources, concurrency, and local reasoning. Theor Comput Sci (–):–
. Siegel SF () Model checking nonblocking MPI programs. In: Cook B, Podelski A (eds) Verification, model checking, and abstract interpretation (VMCAI). Lecture Notes in Computer Science, vol , pp –
. Sterling N () Warlock – a static data race analysis tool. In: USENIX Winter, pp –
. Tofan B, Bäumler S, Schellhorn G, Reif W () Verifying linearizability and lock-freedom with temporal logic. Technical report, Fakultät für Angewandte Informatik der Universität Augsburg
. van der Aalst WMP, Hirnschall A, Verbeek HMWE () An alternative way to analyze workflow graphs. In: Pidduck AB, Mylopoulos J, Woo CC, Özsu MT (eds) Advanced information systems engineering (CAiSE). Lecture Notes in Computer Science, vol . Springer, pp –
. Villard J, Lozes É, Calcagno C () Tracking heaps that hop with Heap-Hop. In: Esparza J, Majumdar R (eds) Tools and algorithms for the construction and analysis of systems (TACAS). Lecture Notes in Computer Science, vol . Springer, pp –
. Visser W, Havelund K, Brat GP, Park S, Lerda F () Model checking programs. Autom Softw Eng ():–
. Vo A, Vakkalanka S, DeLisi M, Gopalakrishnan G, Kirby RM, Thakur R () Formal verification of practical MPI programs. In: Principles and practice of parallel programming (PPoPP). ACM, pp –
. Voung JW, Jhala R, Lerner S () Relay: static race detection on millions of lines of code. In: Crnkovic I, Bertolino A (eds) European software engineering conference and foundations of software engineering (ESEC/FSE). ACM, pp –

Fortran  and Its Successors

Michael Metcalf
Berlin, Germany

Definition
Fortran is a high-level programming language that is widely used in scientific programming in the physical sciences, engineering, and mathematics. The first version was developed by John Backus at IBM in the s. Following an initial, rapid spread in its use, the language was standardized, first in  and, after many intermediate standards, most recently in . The language is a procedural, imperative, compiled language with a syntax well suited to a direct representation of mathematical formulae. Individual procedures may either be grouped into modules or compiled separately, allowing the convenient construction of very large programs and of subroutine libraries. Fortran contains features for array processing, abstract and polymorphic data types, and dynamic data structures. Compilers typically generate very efficient object code, allowing an optimal use of computing resources. Modern versions of the language provide extensive facilities for object-oriented and parallel programming. Earlier versions have relied on auxiliary standards to facilitate parallel programming.

Discussion

The Development of Fortran
Historical Fortran
Backus, at the end of , began the development of the Fortran programming language (the name being an acronym derived from FORmula TRANslation), with the objective of producing a compiler that would generate efficient object code. The first version, known as FORTRAN I, contained early forms of constructs that have survived to the present day: simple and subscripted variables, the assignment statement, a DO-loop, mixed-mode arithmetic, and input/output (I/O) specifications. Many novel compiling techniques had to be developed.

Based on the experience with FORTRAN I, it was decided to introduce a new version, FORTRAN II, in . The crucial differences between the two were the introduction of subprograms, with their associated concept of shared data areas, and separate compilation.

In , FORTRAN IV was released. It contained type statements, the logical-IF statement, and the possibility to pass procedure names as arguments. In , a standard for FORTRAN, based on FORTRAN IV, was published. This was the first programming language to achieve recognition as an American and subsequently international standard, and is now known as FORTRAN .

The permissiveness of the standards led to a proliferation of dialects, and a new revision was published in , becoming known as FORTRAN . It was adopted by the International Standards Organization (ISO) shortly afterwards. The new standard brought with it many new features, for instance the IF... THEN... ELSE construct, a character data type, and enhanced I/O facilities. Although only slowly adopted, this standard eventually gave Fortran a new lease of life that allowed it to maintain its position as the most widely used scientific programming language.

Modern Fortran
Fortran 
Fortran's strength had always been in the area of numerical, scientific, engineering, and technical applications. In order that it be brought properly up to date, an American subcommittee, working as a development body for the ISO committee ISO/IEC JTC/SC/WG (or simply WG), prepared a further standard, now known as Fortran . It was developed in the heyday of the supercomputer, when large-scale computations using vectorized codes were being carried out on machines like the Cray  and CDC Cyber , and the experience gained in this area was brought into the standardization process by representatives of computer hardware and software vendors, users, and academia. Thus, one of the main features of Fortran  was the array language, built on whole array operations and assignments, array sections, intrinsic procedures for arrays, and dynamic storage. It was designed with optimization in mind. Another was the notion of abstract data types, built on modules and module procedures, derived data types (structures), operator overloading and generic interfaces, together with pointers. Also important were new facilities for numerical computation including a set of numeric inquiry functions, the parameterization of the intrinsic types, new control constructs (SELECT CASE and new forms of DO), internal and recursive procedures, optional and keyword arguments, improved I/O facilities, and new intrinsic procedures.

The resulting language was a far more powerful tool than its predecessor and a safer and more reliable one too. Subsequent experience showed that Fortran  compilers detected errors far more frequently than previously, resulting in a faster development cycle. The array syntax in particular allowed compact code to be written, in itself an aid to safe programming. Fortran  was published by ISO in .

Fortran 
Following the publication of Fortran , the Fortran standards committees continued to operate, and decided on a strategy whereby a minor revision of Fortran  would be prepared, establishing a principle of having major and minor revisions alternating. At the same time, the High-Performance Fortran Forum (HPFF) was founded.

The HPFF was set up in an effort to define a set of extensions to Fortran such that it would be possible to write portable, single-threaded code when using parallel computers for handling problems involving large sets of data that can be represented by regular grids, allowing efficient implementations on SIMD (Single-Instruction-Multiple-Data) architectures. This version of Fortran was to be known as High Performance Fortran (HPF), and it was quickly decided, given its array features, that Fortran  should be its base language. Thus, the final form of HPF was that of a superset of Fortran , the main extensions being in the form of directives. However, HPF included some new syntax too and, in order to avoid the development of divergent dialects of Fortran, it was agreed to include some of this new syntax into Fortran . Apart from this syntax, only a small number of other pressing but minor changes were made, and Fortran  was adopted as a formal standard in . It is currently the most widely used version of Fortran.

Auxiliary Standards
With few exceptions, no Fortran standard up to and including Fortran  included any feature intended directly to facilitate parallel programming. Rather, this has had to be achieved through the intermediary of ad hoc industry standards, in particular HPF, MPI, OpenMP, and Posix Threads.

The HPF directives, already introduced, take the form of Fortran  comment lines that are recognized as directives only by an HPF processor. An example is

!HPF$ ALIGN WITH b :: a1, a2, a3

to align three conformable arrays with a fourth, thus ensuring locality of reference. Further directives allow, for instance, aligned arrays to be distributed over a set of processors.

MPI is a library specification for message passing. OpenMP supports multi-platform, shared-memory parallel programming and consists of a set of compiler directives, library routines, and environment variables that determine run-time behavior. Posix Threads is again a library specification, for multithreading.

In any event, outside Japan, HPF met with little success, whereas the use of MPI and OpenMP has become widespread.

Fortran 
The following language standard, Fortran , was published in . Its main features were: the handling of floating-point exceptions, based on the IEEE model of floating-point arithmetic; allowing allocatable arrays as structure components, dummy arguments, and function results; interoperability with C; parameterized data types; object-orientation via constructors/destructors, inheritance, and polymorphism; derived type I/O; asynchronous I/O; and procedure variables. Although all this represented a major upgrade, especially given the introduction of object-oriented programming, there was no new development specifically related to parallel programming.

Fortran 
Notwithstanding the fact that Fortran -conformant compilers had been very slow to appear, the standardization committees thought fit to proceed with yet another standard, Fortran . In contrast to the previous one, its single most important new feature is directly related to parallel processing – the addition of coarray handling facilities. Further, the DO CONCURRENT form of loop control and the CONTIGUOUS attribute are introduced. These are further described below.

Other major new features include: sub-modules, enhanced access to data objects, enhancements to I/O and to execution control, and more intrinsic procedures, in particular for bit processing. Fortran  was published in .

Selected Fortran Features
Given that all the actively supported Fortran compilers on the market offer at least the whole of Fortran , that version is the one used as the basis of the descriptions in this section, except where otherwise stated. Only those features are described that are relevant to vectorization or parallelization. Under Fortran , achieving effective parallelization requires the exploitation of a combination of efficient Fortran serial execution on individual processors with, say, MPI controlling the communication between processors.

The DO Constructs
Many iterative calculations can be carried out in the form of a DO construct. A simplified but sufficient nested form is illustrated by:

outer: DO
   inner: DO i = j, k, m      ! from j to k in steps of m (m is optional)
      :
      IF (...) CYCLE          ! take next iteration of construct 'inner'
      :
      IF (...) EXIT outer     ! exit the construct 'outer'
   END DO inner
END DO outer

(The Fortran language is case insensitive; however, to enhance the clarity of the code examples, Fortran keywords are distinguished by being written in upper case.) As shown, the constructs may be optionally named so that any EXIT or CYCLE statement may specify which loop is intended.

Potential data dependencies between the iterations of a DO construct can inhibit optimization, including vectorization. To alleviate this problem, Fortran  introduces the DO CONCURRENT form of the DO construct, as in

DO CONCURRENT (i = 1:m)
   a(k + i) = a(k + i) + scale * a(n + i)
END DO

that asserts that there is no overlap between the range of values accessed (on the right-hand side of the assignment) and the range of values altered (on the left-hand side). The individual iterations are thus independent.

Array Handling
Array Variables
Arrays are variables in their own right; typically they are specified using the DIMENSION attribute. Every array is characterized by its type, rank (dimensionality), and shape (defined by the extents of each dimension). The lower bound of each dimension is by default 1, but arbitrary bounds can be explicitly specified. The DIMENSION keyword is optional; if omitted, the array shape must be specified after the array-variable name. The examples

REAL :: x(100)
INTEGER, DIMENSION(0:1000, -100:100) :: plane

declare two arrays: x is rank-1 and plane is rank-2. Fortran arrays are stored with elements in column-major order. Elements of arrays are referenced using subscripts, for example,

x(n)   x(i*j)

Array elements are scalar. The subscripts may be any scalar integer expression.

A section is a part of an array variable, referenced using subscript ranges, and is itself an array:

x(i:j)              ! rank one
plane(i:j, k:l:m)   ! rank two
x(plane(7, k:l))    ! vector subscript
x(5:4)              ! zero length

Array-valued constants and variables (array constructors) are available, enclosed in (/ ... /) or, in Fortran , in square brackets, [ ]:

(/ 1, 2, 3, 4, 5 /)
(/ ( (/ 1, 2, 3 /), i = 1, 10) /)
(/ (i, i = 1, limit, 2) /)
[ (0, i = 1, 1000) ]
[ (0.01*i, i = 1, 100) ]

They may make use of an implied-DO loop notation, as shown, and can be freely used in expressions and assignments.

A derived data type may contain array components. Given

TYPE point
   REAL, DIMENSION(3) :: vertex
END TYPE point
TYPE(point), DIMENSION(10)     :: points
TYPE(point), DIMENSION(10, 10) :: points_array

points(2) is a scalar (a structure) and points(2)%vertex is an array component of a scalar. A reference like points_array(n,2) is an element (a scalar) of type point. However, points_array(n,2)%vertex is an array of type real, and points_array(n,2)%vertex(2) is a scalar element of it. An array element always has a subscript or subscripts qualifying at least the last name. Arrays of arrays are not allowed.

The general form of subscripts for an array section is

[lower] : [upper] [:stride]

where [ ] indicates an optional item. Examples for an array x(10, 100) are

x(i, 1:n)            ! part of one row
x(1:m, j)            ! part of one column
x(i, : )             ! whole row i
x(i, 1::3)           ! every third element of row i
x(i, 100:1:-1)       ! row i in reverse order
x((/1, 7, 3/), 10)   ! vector subscript
x(:, 1:7)            ! rank-2 section

A given value in an array can be referenced both as an element and as a section:

x(1, 1)     ! scalar (rank zero)
x(1:1, 1)   ! array section (rank one)

depending on the circumstances or requirements.

Operations and Assignments
As long as they are of the same shape (conformable), scalar operations and assignments are extended to arrays on an element-by-element basis. For example, given declarations of

REAL, DIMENSION(10, 20) :: x, y, z
REAL, DIMENSION(5) :: a, b

it can be written:

x = y                  ! whole array assignment
z = x/y                ! whole array division and assignment
z = 10.0               ! whole array assignment of scalar value
b = a + 1.3            ! whole array addition to scalar value
b = 7/a + x(1:5, 5)    ! array division, and addition to section
z(1:8, 5:10) = x(2:9, 5:10) + z(1:8, 15:20)
                       ! array section addition and assignment
a(2:5) = a(1:4)        ! overlapping section assignment

An array of zero size is a legitimate object, and is handled without the need for special coding.

Elemental Procedures
In an assignment like

y = SQRT(x)

an intrinsic function, SQRT, returns an array-valued result for an array-valued argument. It is an elemental procedure.

Elemental procedures may be called with scalar or array actual arguments. The procedure is applied to array arguments as though there were a reference for each element. In the case of a function, the shape of the result is the shape of the array arguments.

Most intrinsic functions are elemental, and Fortran  extended this feature to non-intrinsic procedures. It is a further aid to optimization on parallel processors. An elemental procedure requires the ELEMENTAL attribute and must be pure (i.e., have no side effects, see below).
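As an illustration (a minimal sketch, not taken from the original entry, with a hypothetical function name), a user-defined elemental function needs only the ELEMENTAL prefix and scalar dummy arguments with declared intent; it can then be applied elementwise to conformable arrays, just like SQRT:

MODULE elemental_demo
CONTAINS
   ELEMENTAL FUNCTION relu(v) RESULT(r)   ! hypothetical example function
      REAL, INTENT(IN) :: v
      REAL             :: r
      r = MAX(v, 0.0)
   END FUNCTION relu
END MODULE elemental_demo

PROGRAM use_elemental
   USE elemental_demo
   REAL :: s, t(4)
   t = (/ -2.0, -1.0, 1.0, 2.0 /)
   s = relu(-3.5)        ! scalar reference
   t = relu(t)           ! applied element by element to an array
   PRINT *, s, t
END PROGRAM use_elemental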

Array-Valued Functions and Arguments
Users can write functions that are array valued; they require an explicit interface and are usually placed in a module. This example shows such a module function, accessed by a main program:

MODULE show
CONTAINS
   FUNCTION func(s, t)
      REAL, DIMENSION(:) :: s, t              ! An assumed-shape array, see below
      REAL, DIMENSION(SIZE(s)) :: func
      func = s * t**2
   END FUNCTION func
END MODULE show

PROGRAM example
   USE show
   REAL, DIMENSION(3) :: x = (/ 1., 2., 3. /), &
                         y = (/ 3., 2., 1. /), s
   s = func(x, y)
   PRINT *, s
END PROGRAM example

In order to allow for optimization, especially on parallel and vector machines, the order of expression evaluation in executable statements is not specified. However, some potential optimizations might be lost if it is not certain that an array is in contiguous storage; for instance, if the array is an assumed-shape dummy argument array or a pointer array. In Fortran , it is open to a programmer to assert that an array is contiguous by adding the CONTIGUOUS attribute to its specification, as in

REAL, CONTIGUOUS, DIMENSION(:, :) :: x

The many intrinsic functions that accept array-valued arguments should be regarded as an integral part of the array language. A brief summary of their categories appears in Table .

Fortran  and Its Successors. Table   Classes of intrinsic procedures, with examples

Class of intrinsic array function    Examples
Numeric                              abs, modulo
Mathematical                         acos, log
Floating-point manipulation          exponent, scale
Vector and matrix multiply           dot_product, matmul
Array reduction                      maxval, sum
Array inquiry                        allocated, size
Array manipulation                   cshift, eoshift
Array location                       maxloc, minloc
Array construction                   merge, pack
Array reshape                        reshape

Assumed-Shape Arrays
Given an actual argument in a procedure reference, as in:

REAL, DIMENSION(0:10, 0:20) :: x
:
CALL calc(x)

the corresponding dummy argument specification must define the type and rank of the array, but it may omit the shape. Thus in

SUBROUTINE calc(d)
REAL, DIMENSION(:, :) :: d

it is as if d were dimensioned (11, 21). The shape, not the bounds, is passed, meaning that the default lower bound is 1 and the default upper bound is the corresponding extent. However, any lower bound can be specified and the array maps accordingly. Assumed-shape arrays allow great flexibility in array argument passing.

Automatic Arrays
Automatic arrays are useful for defining local, temporary work space in procedures, as in

SUBROUTINE exchange(x, y)
REAL, DIMENSION(:) :: x, y
REAL, DIMENSION(SIZE(x)) :: work
work = x
x = y
y = work
END SUBROUTINE exchange

Here, the array work is created on entering the procedure and destroyed on leaving. The actual storage is typically maintained on a stack.
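The following usage sketch (not from the original entry) shows how the assumed-shape mechanism and the automatic array in exchange above work together: the dummies receive only the shape of whatever is passed, so whole arrays and strided, non-contiguous sections can both be swapped with the same subroutine.

PROGRAM use_exchange
   IMPLICIT NONE
   REAL :: u(6), v(6)
   u = (/ 1., 2., 3., 4., 5., 6. /)
   v = (/ 6., 5., 4., 3., 2., 1. /)

   CALL exchange(u, v)                 ! swap whole arrays (SIZE(x) is 6)
   CALL exchange(u(1:5:2), v(2:6:2))   ! swap strided sections (SIZE(x) is 3)
   PRINT *, u
   PRINT *, v
CONTAINS
   SUBROUTINE exchange(x, y)           ! as defined in the entry above
      REAL, DIMENSION(:) :: x, y
      REAL, DIMENSION(SIZE(x)) :: work
      work = x
      x = y
      y = work
   END SUBROUTINE exchange
END PROGRAM use_exchange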

Allocatable Arrays
Fortran provides dynamic allocation of storage, commonly implemented using a heap storage mechanism. An example, for establishing a work array for a whole program, is

MODULE global
   INTEGER :: n
   REAL, DIMENSION(:,:), ALLOCATABLE :: work
END MODULE global
PROGRAM analysis
   USE global
   READ (*, *) n
   ALLOCATE(work(n, 2*n), STAT=status)
   :
   DEALLOCATE (work)

The work array can be propagated throughout the whole program via a USE statement in each program unit. An explicit lower bound may be specified and several entities may be allocated in one statement. Deallocation of arrays is automatic when they go out of scope.

Masked Assignment
Often, there is a need to mask an assignment. This can be done using either a WHERE statement:

WHERE (x /= 0.0) x = 1.0/x

an example that avoids division by zero by replacing only non-zero values by their reciprocals, or a WHERE construct:

WHERE (x /= 0.0)
   x = 1.0/x
   y = 0.0              ! all arrays assumed conformable
ELSEWHERE
   x = HUGE(0.)
   y = 1.0 - y
END WHERE

The ELSEWHERE statement may also have a mask clause, but at most one ELSEWHERE statement can be without a mask, and that must be the final one; WHERE constructs may be nested within one another.

The FORALL Statement and Construct
When a DO construct is executed, each successive iteration is performed in order and one after the other – an impediment to optimization on a parallel processor. To help in this situation, Fortran  introduced the FORALL statement and construct, which can be considered as an array assignment expressed with the help of indices. An example is

FORALL(i = 1:n) x(i, i) = s(i)

where the individual assignments may be carried out in any order, and even simultaneously. In

FORALL(i = 1:n, j = 1:n, y(i,j) /= 0.) x(j,i) = 1.0/y(i,j)

the assignment is subject also to a masking condition. The FORALL construct allows several assignment statements to be executed in order, as in

FORALL(i = 2:n-1, j = 2:n-1)
   x(i,j) = x(i,j-1) + x(i,j+1) + x(i-1,j) + x(i+1,j)
   y(i,j) = x(i,j)
END FORALL

Assignment in a FORALL is like an array assignment: it is as if all the expressions were evaluated in any order, held in temporary storage, then all the assignments performed in any order. The first statement in a construct must fully complete before the second can begin. Procedures referenced within a FORALL must be pure (see below).

In general, the FORALL has been poorly implemented and the DO CONCURRENT is seen as a better alternative: using DO CONCURRENT requires the programmer to ensure that the iterations are independent, whereas, with FORALL, the compiler is expected to determine whether a temporary copy is needed, which is often unrealistic.

Pure Procedures
This is another Fortran  feature expressly for parallel computing, ensuring that execution of a procedure referenced from within a FORALL statement or construct (or, in Fortran , from within a DO CONCURRENT construct) cannot have any side effects (whereby the result of one reference could cause a change to the result of another). Any such side effects in a function could impede optimization on a parallel processor – the order of execution of the assignments could affect the results. To control this situation, a PURE keyword may be added to the SUBROUTINE or FUNCTION statement – an assertion that the procedure: alters no global variable; performs no input/output operations; has no saved variables; and, in the case of functions, alters no argument.

These constraints are designed such that they can be verified by a compiler.

All the intrinsic functions are pure.

The FORALL statement and construct and the PURE keyword were adopted from HPF.

Coarrays (Fortran  Only)
The objective of coarrays is to distribute over some number of processors not only data, as in an SIMD model, but also work, using the SPMD (Single-Program-Multiple-Data) model. The syntax required to implement this facility has been designed to make the smallest possible impact on the appearance of a program and to require a programmer to learn just a modest set of new rules.

Data distribution is achieved by specifying the relationship among memory images using an elegant new syntax similar to conventional Fortran. Any object not declared using this syntax exists independently in all the images and can be accessed only from within its own image. Objects specified with the syntax have the additional property that they can be accessed directly from any other image. Thus, the statement

REAL, DIMENSION()[*] :: a, b

specifies two coarrays, a and b, that have the same size () in each image. Execution by an image of the statement

a(:) = b(:)[j]

causes the array b from image j to be copied into its own array a, where square brackets are the notation used to access an object on another image. On a shared-memory machine, an implementation of a coarray might be as an array of a higher dimension. On a distributed-memory machine with one physical processor per image, a coarray is likely to be stored at the same address in each physical processor.

Work is distributed according to the concept of images, which are copies of the program that each have a separate set of data objects and a separate flow of control. The number of images is normally chosen on the command line, and its fixed value is available at execution time via an inquiry function, NUM_IMAGES. The images execute asynchronously and the execution path in each may differ, possibly under the control of an image index to which the programmer has access (via the THIS_IMAGE function). When synchronization between two images is required, use can be made of a set of intrinsic synchronization procedures, such as SYNC_ALL or LOCK. Thus, it is possible in particular to avoid race conditions whereby one image alters a value still required by another, or one image requires an altered value that is not yet available from another. Handling this is the responsibility of the programmer. Between synchronization points, an image has no access to the fresh state of any other image. Flushing of temporary memory, caches, or registers is normally handled implicitly by the synchronization mechanisms themselves. In this way, a compiler can safely take advantage of all code optimizations on all processors between synchronization points without compromising data integrity.

Where it might be necessary to limit execution of a code section to just one image at a time, a critical section may be defined using a CRITICAL ... END CRITICAL construct.

The codimensions of a coarray are specified in a similar way to the specifications of assumed-size arrays, and coarray sub-objects may be referenced in a similar way to sub-objects of normal arrays.

The following example shows how coarrays might be used to read values in one image and distribute them to all the others:

REAL :: val[*]
...
IF(THIS_IMAGE() == 1) THEN
   ! Only image 1 executes this construct
   READ(*, *) val
   DO image = 2, NUM_IMAGES()
      val[image] = val
   END DO
END IF
CALL SYNC_ALL()
! Execution on all images pauses here
! until all images reach this point

Coarrays can be used in most of the ways that normal arrays can. The most notable restrictions are that they cannot be automatic arrays, cannot be used for a function result, cannot have the POINTER attribute, and cannot appear in a pure or elemental procedure.

Related Entries
Code Generation
Data Distribution
Dependences
HPF (High Performance Fortran)
Loop Nest Parallelization
MPI (Message Passing Interface)
OpenMP
Parallelization, Basic Block
Run Time Parallelization

Bibliographic Notes and Further Reading
The development of Fortran  was very contentious and the entire issue of the journal [] is devoted to its various aspects.

Fortran  is defined in the ISO publication []. Fortran  was published in  as []. Fortran  and Fortran  are informally but completely described in [] and [] and, with the latest additions contained in Fortran , in [].

Coarrays have evolved from a programming model initially intended for the Cray-TD. They were extended to Fortran  by Numrich and Steidel [] and subsequently defined for Fortran  by Numrich and Reid []. Their formal definition is in the Fortran  standard [] and they are described informally also in [].

Bibliography
. Adams JC et al () The Fortran  handbook. Springer, London/New York
. () Comput Stand Interfaces 
. ISO/IEC -:. ISO, Geneva
. ISO/IEC -:. ISO, Geneva
. Metcalf M, Reid J, Cohen M () Fortran / explained. Oxford University Press, Oxford/New York
. Metcalf M, Reid J, Cohen M () Modern Fortran explained. Oxford University Press, Oxford/New York
. Numrich RW, Steidel JL () F−−: a simple parallel extension to Fortran . SIAM News ():–
. Numrich RW, Reid JK () CoArray Fortran for parallel programming. ACM Fortran Forum : and Rutherford Appleton Laboratory report RAL-TR--. ftp://ftp.numerical.rl.ac.uk/pub/reports/nrRAL.pdf

Fortran, Connection Machine

Connection Machine Fortran

Fortress (Sun HPCS Language)

Guy L. Steele Jr., Eric Allen, David Chase, Christine Flood, Victor Luchangco, Jan-Willem Maessen, Sukyoung Ryu
Oracle Labs, Burlington, MA, USA
Oracle Labs, Austin, TX, USA
Google, Cambridge, MA, USA
Korea Advanced Institute of Science and Technology, Daejeon, Korea

Definition
Fortress is a programming language designed to support parallel computing for scientific applications at scales ranging from multicore desktops and laptops to petascale supercomputers. Its generality and features for user extension also make it a useful framework for experimental design of domain-specific languages. The project originated at Sun Microsystems Laboratories in  as part of their work on the DARPA HPCS (High Productivity Computing Systems) program [] and became an open-source project in .

Discussion

Design Principles
Fortress provides the programmer with a single global address space, an automatically managed heap, and automatically managed thread parallelism. Programs can also spawn threads explicitly.

The three major design principles for Fortress are:

● Mathematical syntax: Fortress syntax is inspired by mathematical tradition, using the full Unicode character set and two-dimensional notation where relevant. The goal is to simplify the desk-checking of code against what working scientists and mathematicians tend to write in their notebooks or on their whiteboards.

● Parallelism by default: Rather than grafting parallel features onto a sequential language design, Fortress uses parallelism wherever possible and reasonable. The design intentionally makes sequential idioms a little more verbose than parallel idioms, and intentionally makes side effects a little more verbose than "pure" (side effect free) code.
uses parallelism wherever possible and reasonable. The design intentionally makes sequential idioms a little more verbose than parallel idioms, and intentionally makes side effects a little more verbose than “pure” (side effect free) code.
● A growable language []: As few features as possible are built into the compiler, and as many language design decisions as possible are expressed at the source-code level. In many ways, Fortress is a framework for constructing domain-specific languages, and while the first such language is aimed at scientific programmers, the intent is that radically different design choices can also be supported.

Fortress is a strongly typed language. The type system is object oriented, with parameterized polymorphic types but also with type inference so that programmers frequently need not declare the types of bound variables explicitly. The objects enjoy multiple inheritance of code but not of data fields. A Fortress trait is much like an interface in the Java™ programming language but can contain concrete implementations of methods that can be inherited. A Fortress object can declare data fields and in other respects is also much like a Java class, but a Fortress object can extend (and therefore inherit from) only traits, not other objects. Methods may be overloaded; overload resolution is performed by multimethod dispatch using the dynamic types of all arguments. (In contrast, the Java programming language uses the dynamic type of the invocation target “before the dot” but the static types of the arguments in parentheses.)
Fortress supports and favors the use of pure, functional object-oriented programming, but also supports side effects, including assignment to variables and to fields of objects that have been declared mutable. Atomic blocks are used to manage the synchronization of side effects.
Fortress also favors divide-and-conquer approaches to parallelism (as opposed to, say, pipelining); generators split aggregate data structures and numeric ranges into portions that can be processed in parallel, and reducers combine or aggregate the results for further processing. Generators and reducers are classified according to their algebraic properties, such as associativity and commutativity; this classification is expressed through, and enforced by, specific traits defined by a standard library, enabling a generator to choose from a variety of implementation strategies.

Salient Features
Syntax
Fortress syntax is designed to resemble standard mathematical notation as much as possible while integrating programming-language notions of variable binding, assignment, and control structures. All mathematical operator symbols in the Unicode character set [], such as ⊙ ⊕ ⊗ ≤ ≥ ∩ ∪ ⊓ ⊔ ⊆ ⊑ ≺ ≽ ∑ ∏, may be defined and overloaded as user-defined infix, prefix, and postfix operators; a large number of bracketing symbols such as { } ⟨ ⟩ ⌜ ⌝ ⌞ ⌟ ⌊ ⌋ ⌈ ⌉ ∣ ∣ ∥ ∥ are likewise available for library or user (re-)definition. Relational operators may be chained, as in  ≤ n ≤  ; the compiler treats it as if it had been written ( ≤ n) ∧ (n ≤ ) but arranges to evaluate the expression n just once. Simple juxtaposition is itself a binary operator that can be defined and overloaded by Fortress source code; the standard library defines juxtaposition to perform function application (as in f (x, y) or print x ), numerical multiplication (as in  n and  x y ), and string concatenation (as in “My name is” myName “.” ).
Unicode characters may be used directly or represented in ASCII through a Wiki-inspired syntax that is reasonably readable in itself. Fortress also provides concise syntax to mimic mathematical notation: semicolons and parentheses may often be omitted, and curly braces { } denote sets (rather than blocks of statements). The notation is whitespace sensitive but not indentation sensitive. Fortress is expression oriented, and it favors the use of immutable bindings over mutable variables, requiring a slightly more verbose syntax for declaration of mutable variables and for assignment.
The left-hand side of Fig.  shows a sample Fortress function that performs a matrix/vector “conjugate gradient” calculation (based on the NAS “CG” conjugate gradient benchmark for parallel algorithms) []. The function conjGrad takes a matrix and a vector, performs a number of calculations, and returns two values: a result vector and a measure of the numerical error. The right-hand side of Fig.  presents the original
Fortress code (left-hand side of the figure):

    conjGrad⟦E extends Number, nat N,
             V extends Vector⟦E, N⟧,
             M extends Matrix⟦E, N, N⟧⟧(A: M, x: V): (V, E) = do
      z: V := 0
      r: V := x
      ρ: E := r⋅r
      p: V := r
      for j ← seq( # ) do
        q = A p
        α = ρ/(p⋅q)
        z += α p
        ρ₀ = ρ
        r −= α q
        ρ := r⋅r
        β = ρ/ρ₀
        p := r + β p
      end
      (z, ∥x − A z∥)
    end

Original NAS CG pseudocode (right-hand side of the figure):

    z = 0
    r = x
    ρ = rᵀ r
    p = r
    DO i = , 
      q = A p
      α = ρ/(pᵀ q)
      z = z + α p
      ρ₀ = ρ
      r = r − α q
      ρ = rᵀ r
      β = ρ/ρ₀
      p = r + β p
    ENDDO
    compute residual norm explicitly: ∥r∥ = ∥x − A z∥

Fortress (Sun HPCS Language). Fig.  Sample Fortress code for a conjugate gradient calculation (left) and the original NAS CG pseudocode specification (right)

specification of the conjugate gradient calculation to show how the Fortress program resembles the problem specification. The principal differences are that the Fortress code provides type declarations – including a generic (type-parametric) function header – and distinguishes between binding and assignment of variables (within the loop body, q and α and ρ₀ are read-only variables that are locally bound using = syntax, whereas z and r and ρ and p are mutable variables that are updated by the assignment operator := and the compound assignment operators += and −= ).
Fortress is designed to grow over time to accommodate the changing needs of its users via a syntactic abstraction mechanism [] that, in effect, allows source-code declarations to augment the language grammar by specifying rewriting rules for new language constructs. Parsing of new constructs is done alongside parsing of primitive constructs, allowing programmers to detect syntax errors in use sites of new constructs early. Programs in domain-specific languages can be embedded in Fortress programs and parsed as part of their host programs. Moreover, the definition of many constructs that are traditionally defined as core language primitives (such as for loops) can be moved into Fortress’ own libraries, thereby reducing the size of the core language.

Type System
Fortress has a rich type system designed to give programmers the expressive power they need to write libraries that provide features often built into other languages. The Fortress type system is a trait-based, object-oriented system that supports parametrically polymorphic named types (also known as generic traits) organized as a multiple-inheritance hierarchy. Fortress code is statically type-checked: it is a static (“compile-time”) error if the type of an expression is not a subtype of the type required by its context.
Objects may be declared to extend traits, from which they can inherit method declarations and definitions. If an object inherits an abstract method declaration, then it has the obligation to provide (or inherit) a concrete implementation of that method. A trait may also extend one or more other traits. Extension creates a subtype relationship; if type A extends type B, then every
object belonging to type A (that is, every instance of A) necessarily also belongs to type B.
In addition to subtyping, Fortress allows the declaration of exclusion and comprises relationships among types. If type A excludes type B, then no value can be an instance of both A and B. If a type T comprises a set of types { U₁, U₂, . . . , Un }, then every instance of T is an instance of some Ui (possibly more than one). This allows a more flexible description of important relationships among types than simply the ability to declare that a class is final.

Types, values, and expressions  Fortress provides several kinds of types, including trait types, tuple types, and arrow types. At the top of the type hierarchy is the type Any, which comprises three types, Object, Tuple, and () (pronounced “void”); at the bottom of the hierarchy is the type BottomType. Every expression in Fortress has a static type. Every value in Fortress is either an object, a tuple, or the value (), and has a (run-time) tag, or ilk, which is a “leaf” in the type hierarchy: the only strict subtype of a value’s tag is BottomType. Fortress guarantees that when an expression is evaluated, the ilk of the resulting value is a subtype of the type of the expression.

Traits  Like an interface in the Java programming language, a trait is a named program construct that declares a set of methods, and which may extend one or more other traits, inheriting their methods. However, unlike Java interfaces, whose methods are always abstract (i.e., they do not provide bodies), the methods of Fortress traits may be either abstract or concrete (having bodies). Like functions, methods may be overloaded, and a call to such a method is resolved using the run-time tags of its arguments.

Objects  Objects form the leaves of the trait hierarchy: no trait may extend an object trait type. Thus object declarations are analogous to final classes in the Java programming language. In addition to methods, objects may have fields, in which state may be stored.
Numbers and booleans and characters are all objects. This does not imply that they are always heap-allocated. Objects declared with the value keyword have no object identity, must have only immutable fields, and may be freely copied by the implementation.

Tuples  A tuple is written as a comma-separated series of expressions within parentheses, for example, (a, , b + ); they are convenient for passing multiple arguments and for returning multiple results. Tuples are first-class values but are not objects. Tuple types are covariant: if type A extends type B, and C extends D, and E extends F, then the tuple type (A, C, E) extends the tuple type (B, D, F). The type of a variable, of a field, or of array elements may be a tuple type. Tuples of variables may appear on the left-hand side of binding and assignment constructs.

Functions and Methods  Functions are also first-class values, and furthermore, they are objects. However, their types are not traits, but arrow types, constructed from other types using the arrow combinator →. A function that is declared to return an instance of type B whenever it is passed an instance of type A as an argument is an instance of the arrow type A → B. Because a function can be overloaded, having multiple definitions with different parameter types, it can be an instance of several different arrow types (Fig. ). Because arrow types are covariant in their return types and contravariant in their parameter types, every arrow type is a subtype of BottomType → Any and a supertype of Any → BottomType.
Fortress provides two kinds of methods that may be declared in objects and traits: dotted methods, which are similar to methods in other object-oriented languages and are invoked using the syntax target.methodName(arguments), and functional methods, which are overloadings of global function names and therefore are invoked using the syntax methodName(arguments), but one of the arguments is treated as the target just as for a dotted method (Fig. ).
Ignoring the possibility of automatic coercion, a function declaration is applicable to a call to that function if its parameter type is a supertype of (possibly equal to) the run-time tag of the argument, and it is more specific than another declaration if its parameter type is a strict subtype of the parameter type of the other. (If a declaration of a function or method takes more than one argument, then its “parameter type” (singular) is considered to be a tuple type.)
The Java programming language allows any set of overloaded method declarations, resolves overloading based on the static types of the arguments and the
(* The function pad is an instance of the arrow type String → String and also of the arrow type Z → Z ,
neither of which is a subtype of the other. *)
pad(x: String): String = “ ” x
pad(y: Z): Z = y

(* Arrow types can include the types of exceptions that can be thrown. The function pad′ in
an instance of the arrow type String → String throws EmptyString and also of the arrow type
Z → Z throws ZeroNumber . *)
pad′ (x: String): String throws EmptyString =
if x = “” then throw EmptyString else “ ” x “ ” end
pad′ (y: Z): Z throws ZeroNumber =
if y =  then throw ZeroNumber else y end

Fortress (Sun HPCS Language). Fig.  Two examples of a function that belongs to more than one arrow type

(* This example object has one field, v, and four method definitions. The method add is a conventional
“dotted method”; the method name maximum is overloaded with three functional method definitions.
The parameter self indicates that a method is a functional method that is invoked using functional
rather than dot syntax. *)

object Example(v: Z)


add(n: Z) = n + v ⊛ Dotted method (no self parameter)
maximum(self, n: Z) = ⊛ Functional method

if n > v then n else v end


maximum(n: Z, self) = ⊛ Functional method
maximum(self, n)
maximum(self, other: Example) = ⊛ Functional method
maximum(self, other.v)
end
do
o = Example()
p = Example()
⊛ Examples of method invocation

println o.add() ⊛ Prints “”

println p.add() ⊛ Prints “”

println maximum(o, ) ⊛ Prints “”

println maximum(, o) ⊛ Prints “”

println maximum(o, p) ⊛ Prints “”

end

Fortress (Sun HPCS Language). Fig.  Example definition of a dynamically parameterized object
dynamic type of the target, and requires this resolution be unambiguous for calls that actually appear in the program (i.e., ambiguity is separately checked at every call site). In contrast, when a call to an overloaded Fortress function or method is executed, the most specific applicable declaration is selected based on the tags of the arguments passed to the function; that is, Fortress performs full multimethod dynamic dispatch on the target and all arguments. Furthermore, Fortress imposes rules on the set of overloaded declarations so that dispatch is always unambiguous (i.e., ambiguity is checked for each set of declarations and no separate check is needed at each call site). In particular, any pair of declarations of overloaded functions or methods must satisfy one of three conditions:

Subtype rule: The parameter type of one declaration is a strict subtype of the parameter type of the other, and the return type of the first declaration is a subtype of (possibly the same as) the return type of the second declaration.
Exclusion rule: The parameter types of the two declarations exclude each other.
Meet rule: The parameter types of the two declarations are incomparable (there is neither a subtype nor an exclusion relationship between them) and there is another declaration of that function whose parameter type “covers” the intersection of the parameter types of the two incomparable declarations.

If all pairs of declarations satisfy one of these conditions, there can be no ambiguity for any particular call to that function or method: in the first case, one declaration is more specific than the other; in the second case, there is no call to which both declarations are applicable; and in the third case, whenever both declarations are applicable, there is another declaration that is more specific than both that is also applicable, so neither of the two incomparable declarations can be the most specific applicable declaration. These rules are how the design of Fortress solves the problem of multiple inheritance of methods.
Access to fields of objects is abstract: the field access syntax x.fieldName actually invokes a getter method that takes no arguments but returns a value. Declaring a field by default provides a trivial getter method that simply fetches the contents of the field, but this method can be suppressed or displaced by an explicit getter declaration. Similarly, the assignment operator := actually invokes a setter method, and the declaration of a mutable field by default provides a trivial setter method. In this manner, the behavior of field access and assignment is completely programmable.

Parametric polymorphism (or Generic types and functions)  Fortress supports generic traits (and objects), as well as generic functions (and methods). All these entities can have static parameters of various kinds: type parameters (for which any type can serve as an actual argument), nat and int and bool parameters (for which static arithmetic and Boolean expressions can serve as actual arguments), unit and dim parameters (for which declared physical dimensions and units can serve as actual arguments), and opr parameters (for which operator symbols can serve as actual arguments). Type parameters may be given one or more bounds, thereby requiring that they be instantiated only with subtypes of all the bounds. If no bound is given, it is implicitly assumed to be bound by Object. Note that Object is not the root of the Fortress type hierarchy; a type parameter that may be instantiated with a tuple type, for example, must be given an explicit bound (Fig. ). This unusual default for the bounds of type parameters helps to reduce confusion when functions and methods are overloaded with different numbers of parameters.
The Java programming language has a system of generic types that erases type parameter information at run time. In contrast, Fortress types are not erased, and it is possible to perform the equivalent of an instanceof check to distinguish a list of strings from a list of booleans. Thus, there is no significant disparity between the static type system and the run-time tags associated with Fortress values: tags are simply a special class of types corresponding to leaves in the type hierarchy.
In addition to standard parametric polymorphism, Fortress provides where clauses, in which hidden type variables can be declared and used within a trait (or object) or function declaration; they are called “hidden” because they are not static parameters of the trait or function – rather, they are universally quantified within their declared bounds.
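Fortress performs the multimethod dispatch described above natively through its overloading rules. As a rough analogue only (plain Scala with hypothetical type names, not Fortress semantics), ordinary single dispatch combined with pattern matching on the remaining argument likewise picks behavior from the run-time classes of all arguments:

    // Rough emulation of dispatch on the run-time types of *both* arguments.
    sealed trait Shape
    final case class Circle(r: Double) extends Shape
    final case class Square(s: Double) extends Shape

    object Collide {
      def collide(a: Shape, b: Shape): String = (a, b) match {
        case (_: Circle, _: Circle)                          => "circle/circle"
        case (_: Circle, _: Square) | (_: Square, _: Circle) => "circle/square"
        case (_: Square, _: Square)                          => "square/square"
      }
    }

    object CollideDemo extends App {
      val x: Shape = Circle(1.0)
      val y: Shape = Square(2.0)
      // The branch is chosen from the run-time classes of x and y.
      println(Collide.collide(x, y))   // prints "circle/square"
    }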
(* The trait LinkedList can contain elements of type T. Because the declared bound on the type T is Any, the actual element type may be a tuple type. Because of the comprises clause, only instances of the object types Cons⟦T⟧ and Empty can be instances of LinkedList⟦T⟧. *)
trait LinkedList⟦T extends Any⟧ comprises { Cons⟦T⟧, Empty } end
(* A Cons instance has two fields, first and rest. This object declaration implicitly declares a constructor function called Cons that takes two arguments and creates a new instance with its two fields initialized to the given arguments. *)
object Cons⟦T extends Any⟧(first: T, rest: List⟦T⟧) extends LinkedList⟦T⟧ end
⊛ The single object Empty belongs to all LinkedList types.
object Empty extends LinkedList⟦T⟧ where { T extends Any } end

Fortress (Sun HPCS Language). Fig.  Example definitions of a statically parameterized trait and two objects that
implement it
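The comprises clause above has a loose analogue in Scala's sealed traits; the sketch below (hypothetical names, not Fortress semantics) shows how a closed set of subtypes lets a compiler reason about exhaustiveness and disjointness:

    // A sealed trait approximates a trait with a comprises clause: all direct
    // subtypes must appear in the same file, so the hierarchy is closed.
    sealed trait LinkedList[+T]
    final case class Cons[+T](first: T, rest: LinkedList[T]) extends LinkedList[T]
    case object Empty extends LinkedList[Nothing]

    object ListDemo extends App {
      def length[T](xs: LinkedList[T]): Int = xs match {
        case Cons(_, rest) => 1 + length(rest)   // exhaustive: only Cons and Empty exist
        case Empty         => 0
      }
      println(length(Cons(1, Cons(2, Empty))))   // prints 2
    }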

Having to write out all static type arguments and complete type signatures for all functions and methods, especially local “helper” functions, can be tedious, and often results in code that is difficult to read and maintain. To mitigate these problems, Fortress (like ML and Haskell) has a type-inference mechanism that allows many types and static parameters to be elided in practice.

Coercion  A trait T may contain one or more coercion declarations; such a declaration specifies a computation that, when given any instance of another type U, will return a corresponding instance of type T. Such declarations should be used with care, only in situations where it is desirable that every instance of type U also be regarded as, in effect, an instance of type T in every computational context whatsoever. Typically this is used to convert among specialized numerical representations, and in particular to convert numeric literals to other numeric types. (Fortress is unusual in that the type of a literal such as  is not any of the standard computational integer types such as Z or Z or Z, but a special numeric-literal trait.)
Coercion comes into play during multimethod dispatch: if there is no applicable function or method declaration for a given call, then the net is cast wider by considering declarations that would be applicable if the arguments were automatically coerced to some other type; if there is a unique most specific applicable declaration in this larger set, then that is the one that will be invoked. To help prevent surprises, coercions in Fortress are not transitive.

Parallelism
Many constructs in the Fortress language give rise to multiple implicit tasks to execute subexpressions, such as the components of a tuple, blocks of the do-also construct, function arguments, operands of binary operators, and the target and arguments of a method call (Fig. ). This is fork-join parallelism: all tasks for a construct execute to completion before execution of the construct itself can complete. Tasks are implicitly scheduled and provide no fairness guarantees; while tasks may execute concurrently, a Fortress compiler or run-time may choose to schedule tasks for sequential execution instead. This allows flexible allocation of hardware resources.
Comprehensions, reduction expressions, and for loops are all syntactic sugar for uses of generators and reducers. A generator is an object capable of supplying some number of separate values to a reducer through one or more library-defined protocols; a reducer is an object that can accept values from a generator and produce a combined or aggregated result.
For example, the nested for loop in Fig.  contains two generator expressions  :  and (i + ) :  representing parallel counted ranges; the loop desugars into a nested pair of calls to their respective loop methods.
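As a sketch of the divide-and-conquer style that such a loop method can use, the following plain Scala (futures standing in for Fortress implicit tasks; CountedRange is a hypothetical stand-in for the generator shown in a later figure) splits a range and processes the two halves as potentially concurrent tasks:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Applies `body` to every index in [lo, hi], splitting the range and
    // running the two halves as tasks that may execute concurrently.
    final case class CountedRange(lo: Int, hi: Int) {
      def loop(body: Int => Unit): Unit = {
        def go(lo: Int, hi: Int): Future[Unit] =
          if (lo == hi) Future(body(lo))
          else if (lo < hi) {
            val mid   = (lo + hi) / 2
            val left  = go(lo, mid)       // the two recursive halves play the role
            val right = go(mid + 1, hi)   // of do ... also do ... end blocks
            for (_ <- left; _ <- right) yield ()
          } else Future.unit
        Await.result(go(lo, hi), Duration.Inf)
      }
    }

    object LoopDemo extends App {
      // Output order may vary from run to run, just as the temporal order of
      // implicitly parallel loop iterations is unspecified.
      CountedRange(1, 10).loop(i => println(i))
    }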
(* Five examples of situations in which expressions such as f (x) and g(y) and h(z) may be executed in
parallel as concurrent tasks. *)
(a, b, c) = ( f (x), g(y), h(z)) ⊛ Elements of a tuple
do ⊛ Separate blocks of a do -also construct

f (x)
also do
g(y)
also do
h(z)
end
p( f (x), g(y), h(z)) ⊛ Arguments of a function call
f (x) + g(y) ⊛ Operands of a binary operator

f (x).m(g(y), h(z)) ⊛ Target and arguments of a method call

Fortress (Sun HPCS Language). Fig.  Implicit parallelism in Fortress

The actual parallelism is provided by using implicit tasks in the implementation of the loop method, generally using a divide-and-conquer coding style. Most Fortress data structures are generators and support structure-respecting parallel traversal. It is a library convention that generators are implicitly parallel by default; the seq(self) method conventionally constructs a sequential version of a generator. Definitions of BIG operators typically define the behavior of reductions (Fig. ) and comprehensions (Fig. ) in terms of an underlying monoid.
Note that while generators run computations in parallel, there is still a well-defined spatial order to the results produced by a generator (Fig. ).
Fortress also provides explicit parallelism using the spawn construct (Fig. ). Spawned threads are scheduled fairly with respect to one another, and implicit tasks of separately spawned threads are scheduled fairly with respect to one another.
The programmer is responsible for wrapping concurrent accesses to shared data in an atomic block. The semantics of an atomic block is that either all the reads and writes to memory happen simultaneously, or none of them do; in effect, all the memory reads and memory writes of an atomic block appear to all other tasks to occur as a single atomic memory transaction. Such transactions are in fact strongly atomic (transactional and non-transactional accesses can coexist). Implicit parallelism may be exploited inside of a transaction. Inner transactions may be retried independently of the outer transaction (“closed nesting”). Contention may be managed by the run-time system, or at the Fortress language level using the tryAtomic construct.
The current implementation of Fortress uses an automatic dynamic load-balancing algorithm (based on work done at MIT as part of the Cilk project) that uses work-stealing to move pending tasks from one processor to another.

Programming in the Large
Fortress supports a number of features for programming in the large.

Components and APIs  Fortress source code is encapsulated in components, which contain definitions of program elements. Each component imports a collection of APIs (Application Programming Interfaces) that declare the program elements used by that component. Each component also exports a collection of APIs. If a component C exports an API A, then C must include a definition for every element declared in A. Components can be linked together into larger components. If a component C exports an API A that is imported by a component D, and C and D are linked, then references in D to the declarations in A are resolved to the corresponding definitions in C.
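A very loose Scala sketch of the component/API idea (hypothetical names; real Fortress components are compiled, linked artifacts, not merely interfaces) is a trait that declares the elements clients may use, an object that exports an implementation of it, and client code written only against the trait:

    trait VectorApi {                            // "API": declarations used by clients
      def dot(a: Array[Double], b: Array[Double]): Double
    }

    object VectorComponent extends VectorApi {   // "component" exporting VectorApi
      def dot(a: Array[Double], b: Array[Double]): Double =
        a.zip(b).map { case (x, y) => x * y }.sum
    }

    object Client {                              // another component, written against the API only
      def norm2(v: Array[Double], api: VectorApi): Double =
        math.sqrt(api.dot(v, v))
    }

    object LinkDemo extends App {
      // "Linking" resolves the API reference to a concrete component.
      println(Client.norm2(Array(3.0, 4.0), VectorComponent))   // prints 5.0
    }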
(* This for loop controls two index variables in nested fashion. The : operator, given two integers, produces a CountedRange object that can serve as a generator of integer index values. *)
for i ←  : , j ← (i + ) :  do
  a[i,j] :=  i −  j
end
(* The for loop is desugared into nested calls to higher-order methods that take functions as arguments. These functions bind the index variables. The fn syntax is analogous to Lisp’s lambda expression syntax for anonymous functions. *)
( : ).loop(fn (i) ⇒
  ((i + ) : ).loop(fn (j) ⇒
    do a[i,j] :=  i −  j end))
(* The loop method of the CountedRange generator object uses a recursive divide-and-conquer strategy that exploits implicit task parallelism (do-also) to permit concurrent execution of loop iterations. *)
object CountedRange(lo: Z, hi: Z) extends Generator⟦Z⟧
  ...
  loop(body: Z → ()): () =
    if lo = hi then body(lo)
    elif lo < hi then
      mid = ⌊(lo + hi)/2⌋
      do
        CountedRange(lo, mid).loop(body)
      also do
        CountedRange(mid + 1, hi).loop(body)
      end
    end
  ...
  seq(self) = . . .
end

Fortress (Sun HPCS Language). Fig.  Implementation of for loops through desugaring

Contracts  Fortress functions and methods can be annotated with contracts that declare preconditions and postconditions (Fig. ).

Test code, properties, and automated unit testing  Fortress includes support for defining tests inline in programs alongside the data and functionality they test. Similar to tests are property declarations, which document conditions that a program is expected to obey. The parameters in a property declaration denote the types of values over which the property is expected to hold. When test data is provided with a property, the property is checked against all values in the test data when the program tests are run.
One use of the design-by-contract features is to document and enforce algebraic relationships that are necessary for correct parallel execution of generators and reducers (Fig. ).

Innovations
● Fortress introduced comprises and excludes clauses that allow the type checker to deduce that two types are disjoint; this matters for certain kinds
⊛ Summation is one kind of reduction expression.
∑[i ←  : , j ← (i + ) : ] a[i,j]

⊛ The summation is desugared into nested calls to higher-order methods.
∑(( : ).nest(fn (i) ⇒
    ((i + ) : ).map(fn (j) ⇒ a[i,j])))

(* The unary ∑ operator takes a generator and gives its reduce method an object that implements the standard reduction protocol using addition. The reduce method, like the loop method, provides implicit parallelism. This ∑ operator is generic; it can be used with any data type T that implements the Additive trait, meaning that type T implements a binary + operator and can supply an identity value named zero. *)
opr ∑⟦T extends Additive⟦T⟧⟧(g: Generator⟦T⟧): T =
  g.reduce(SumReduction⟦T⟧)
object SumReduction⟦T extends Additive⟦T⟧⟧ extends Reduction⟦T, T⟧
  empty(): T = zero              ⊛ Additive⟦T⟧ means type T has a zero value
  join(a: T, b: T): T = a + b    ⊛ that is the identity for the + operator on type T.
  singleton(a: T): T = a         ⊛ The sum of one number is the number itself.
end

Fortress (Sun HPCS Language). Fig.  Implementation of reduction operations through desugaring
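The empty/join/singleton protocol sketched above can be mimicked in ordinary Scala (hypothetical names; written sequentially here, although an associative join with an identity is exactly what permits an implementation to split the data and reduce the pieces in parallel before joining):

    trait Reduction[T, R] {
      def empty: R                     // identity element
      def singleton(a: T): R           // lift one value
      def join(a: R, b: R): R          // associative combination
    }

    object SumReduction extends Reduction[Int, Int] {
      def empty: Int                = 0
      def singleton(a: Int): Int    = a
      def join(a: Int, b: Int): Int = a + b
    }

    object Reduce {
      def reduce[T, R](xs: Seq[T], r: Reduction[T, R]): R =
        xs.foldLeft(r.empty)((acc, x) => r.join(acc, r.singleton(x)))

      def main(args: Array[String]): Unit =
        println(reduce(1 to 10, SumReduction))   // prints 55
    }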

⊛ List comprehension ⟨ . . . ∣ . . . ⟩ is one kind of comprehension expression.
⊛ It produces an ordered list of values.
⟨ a[i,j] ∣ i ←  : , j ← (i + ) :  ⟩
(* Other kinds of comprehension brackets construct arrays, sets, or multisets. They are desugared in much the same manner as reduction expressions. *)
(* The meaning of a comprehension is defined by library code in the same way as for a reduction operator. *)
opr BIG ⟨⟦T⟧ g: Generator⟦T⟧⟩: List⟦T⟧ =
  g.reduce(ListReduction⟦T⟧)
(* List aggregation is essentially reduction of singleton lists using the list concatenation operator ∥ . The library definitions of the ∥ operator and the (noncomprehension) list constructors ⟨ ⟩ and ⟨ x ⟩ are not shown here. *)
object ListReduction⟦T⟧ extends Reduction⟦T, List⟦T⟧⟧
  empty(): List⟦T⟧ = ⟨ ⟩                           ⊛ Produce an empty list.
  join(a: List⟦T⟧, b: List⟦T⟧): List⟦T⟧ = a ∥ b    ⊛ Concatenate two lists.
  singleton(a: T): List⟦T⟧ = ⟨ a ⟩                 ⊛ Produce a singleton list.
end

Fortress (Sun HPCS Language). Fig.  Implementation of comprehension expressions through desugaring
of code optimization. Such declarations can also improve code maintainability.
● Lexical syntax is inspired in part by Wiki notation: the visual appearance of “rendered” code depends on whether an identifier is lowercase, uppercase, or mixed case, and whether underscore characters are used in particular ways. Many single-character operators and brackets can be represented by multicharacter ASCII sequences, but there is no single “escape” character (such as backslash) that introduces such sequences. As a rule, if an operator has a standard name in TeX or LaTeX, then that same name converted to uppercase, omitting the backslash, is a Fortress name for that operator. See Fig. . Text in comments uses a markup syntax based on Wiki Creole . []. See Fig. .
● Most programming languages handle operator precedence by imposing a total order on operators (sometimes by mapping them to specific integer values). Fortress uses only precedence rules that would

(* The squares of the nine values , , , , , , , ,  may be computed concurrently or in any temporal order, but the logical or spatial order of the results within the ordered list will always be ⟨ , , , , , , , ,  ⟩ . *)
⟨ k² ∣ k ←  :  ⟩

Fortress (Sun HPCS Language). Fig.  The distinction between logical order and temporal order

do
  ⊛ The spawn construct forks a fair concurrent thread
  ⊛ and immediately returns a handle object for the thread.
  firstThread = spawn f (x)
  secondThread = spawn g(y)
  ⊛ Other computations here will run concurrently with
  ⊛ the spawned threads. The val method waits for a
  ⊛ spawned thread to complete and then produces whatever
  ⊛ value or exception the thread may have produced.
  firstThread.val + secondThread.val
end

Fortress (Sun HPCS Language). Fig.  Explicit spawning of fair concurrent threads

(* This definition of factorial has a contract that requires the argument to be nonnegative. The contract also
guarantees that the result will be nonnegative. *)
factorial(n) requires { n ≥ 0 } ensures { outcome ≥ 0 } =
  if n = 0 then 1 else n factorial(n − 1) end
(* This test-only code (indicated by the keyword test ) calls the factorial function for all values from  to
, and checks that the result for each is not less than the argument. *)
test factorialNotLessThanInput [x ←  : ] = (x ≤ factorial (x))
(* This property declaration requires that the result of factorial be not less than the argument for every
argument of type Z . Test data for spot-checking a property may be provided separately. *)
property factorialNeverLessThanInput = ∀(x: Z) (x ≤ factorial(x))

Fortress (Sun HPCS Language). Fig.  Examples of contracts, test code, and declared properties
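Scala's require and ensuring offer a rough, library-level flavor of the same contracts (a comparison sketch, not the Fortress mechanism, and without its separate test and property declarations):

    // Precondition and postcondition for factorial, plus a spot-check of the
    // property that the result is never less than the input.
    def factorial(n: BigInt): BigInt = {
      require(n >= 0, "argument must be nonnegative")
      (if (n == 0) BigInt(1) else n * factorial(n - 1))
        .ensuring(result => result >= 1)
    }

    object FactorialCheck extends App {
      assert((0 to 10).forall(x => BigInt(x) <= factorial(x)))
      println("property holds on the test data")
    }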
(* This trait imposes upon any trait T that extends BinaryOperator⟦T, ⊙⟧ the requirement to provide a concrete implementation of the specified binary operator ⊙ . Note that both T and ⊙ are parameter names, for which an actual type and actual operator are substituted by an invocation such as BinaryOperator⟦Q, +⟧ or BinaryOperator⟦Z, MAX⟧ . *)
trait BinaryOperator⟦T, opr ⊙⟧ comprises T
  opr ⊙(self, other: T): T
end
⊛ An associative operator is a binary operator with the associative property.
trait Associative⟦T, opr ⊙⟧ comprises T
  extends { BinaryOperator⟦T, ⊙⟧, EquivalenceRelation⟦T, =⟧ }
  property ∀(a: T, b: T, c: T) ((a ⊙ b) ⊙ c) = (a ⊙ (b ⊙ c))
end
⊛ A monoid is a type with an operator that is associative and has an identity.
trait Monoid⟦T, opr ⊙⟧ comprises T
  extends { Associative⟦T, ⊙⟧, HasIdentity⟦T, ⊙⟧ }
end

Fortress (Sun HPCS Language). Fig.  Use of declared properties to enforce algebraic relationships

be familiar from high-school or college mathematics courses. A consequence is that operator precedence in Fortress is not transitive (Fig. ).
● The “meet rule” eliminates the need for arbitrary “tie-break” or “ordering” rules found in other multiple-inheritance type systems.
● Fortress addresses the “self-type problem” common to many type systems with inheritance by using generic traits that each have a static parameter that is a type that must extend the generic trait; the comprises clause is used to enforce this requirement. The parameter name then serves as a self-type adequate to solve the “binary method problem.”
● The Fortress libraries use the self-type idiom to encode and enforce (through use of design-by-contract features) a hierarchy of algebraic properties of data types and operators ranging from commutativity and associativity all the way up to partial and total orders, monoids and groups, and Boolean algebras. Type-checking ensures that, for example, a comparison operator given to a sort method actually implements a correct ordering criterion; overloaded method dispatch allows the sort method to use different algorithms depending on whether the order is partial or total.
● Fortress supports units and dimensions with a relatively small amount of built-in mechanism. All the Fortress compiler knows is that units and dimensions may be declared (Fig. ), that they form a free Abelian group under multiplication, that these abstract values may be used as static arguments to generic types, and that such a generic type may optionally be declared to absorb units. That you can multiply apples and oranges, that apples may be added only to apples, and whether you can exclusive-or apples and apples, are all at the discretion of the library programmer. Dimensioned types can ensure that data is correctly scaled (Fig. ).
● The Maybe type (instances are either Just x or Nothing), borrowed from Haskell, can be used as a generator of at most one item. A syntactic convenience allows it to be used in an if statement so as to bind a variable to the item in just the then clause (Fig. ).
● The Java programming language allows a method of the superclass to be called using the super keyword. A language with multiple inheritance requires a more general feature. Using a type assumption expression x asif T as an argument causes method dispatch to use the specified static
ASCII          rendered
z              z            normal variable (italic)
z_             z            roman
_z             z            bold
_z_            z            bold italic
ZZ             Z            blackboard
ZZ_            Z            calligraphic
_ZZ            Z            sans serif
z_bar          z̄            math accent
z_hat          ẑ            math accent
z_dot          ż            math accent
z’             z′           math accent
_z_vec         ⃗z            math accent
z_max          zmax         roman subscript
z13            z₁₃          numeric subscript
_z13_bar’      z̄₁₃′         combination of features
and            and          normal variable (italic)
And            And          type name (roman)
AND            ∧            operator name
OPLUS          ⊕            operator name as in TeX
SQCAP          ⊓            operator name as in TeX
<=             ≤            less than or equal to
>=             ≥            greater than or equal to
<-             ←            left arrow
[\ \]          ⟦ ⟧          white brackets
<| |>          ⟨ ⟩          angle brackets

Fortress (Sun HPCS Language). Fig.  Examples of rendering ASCII identifiers as mathematical symbols
This ASCII Fortress source code includes Wiki markup in the comment:
(* Compute the logical exclusive OR of two Boolean values.
=Arguments
; ‘self‘ :: a first Boolean value
; ‘other‘ :: a second Boolean value
=Returns
> The exclusive OR of the two arguments.
This is ‘true‘ if **either** argument is ‘true‘,
but not if **both** arguments are ‘true‘.
> | ||||| ‘other‘ |||
| |||||-------------------||
| ‘OPLUS‘ ||||| false | true ||
|=========================================||
| ‘self‘ || false ||| false | true ||
| || true ||| true | false ||
|-----------------------------------------||
*)
opr OPLUS(self, other: Boolean): Boolean
and it is rendered as follows:
(* Compute the logical exclusive OR of two Boolean values.
Arguments
self a first Boolean value
other a second Boolean value
Returns
The exclusive OR of the two arguments. This is true if either argument is true, but not if both arguments
are true.
                    other
   ⊕            false    true
   self  false  false    true
         true   true     false
*)
opr ⊕(self, other: Boolean): Boolean

Fortress (Sun HPCS Language). Fig.  Example of rendering Wiki-style markup in Fortress comments

type T for the argument expression x during method lookup. This general mechanism solves the multiple supertype problem as a special case (Fig. ).
● Fortress provides an object-oriented implicit coercion feature. It differs from a conventional method such as toString in that it is declared in the type converted to rather than the type converted from, and
a+b>c+d ⊛ Correct: + has higher precedence than >


p>q∧r >s ⊛ Correct: > has higher precedence than ∧

w+x∧y+z ⊛ Incorrect: no defined precedence between + and ∧

(w + x) ∧ (y + z) ⊛ Correct: parentheses specify desired grouping

w + (x ∧ y) + z ⊛ Correct: parentheses specify desired grouping

Fortress (Sun HPCS Language). Fig.  Operator precedence in Fortress is not transitive

⊛ Declarations of several physical dimensions and their default units
dim Mass default kilogram
dim Length default meter
dim Time default second
dim ElectricCurrent default ampere
⊛ Declarations of derived physical dimensions
dim Velocity = Length/Time
dim Acceleration = Length/Time²
dim Force = Mass⋅Acceleration
dim Energy = Force⋅Length
⊛ Declarations of basic units and their abbreviations
unit kilogram kg: Mass
unit meter m: Length
unit second s: Time
unit ampere A: ElectricCurrent
⊛ Declarations of derived units
unit newton n: Force = meter⋅kilogram/second²
unit joule J: Energy = newton meter
⊛ Declarations of derived units that require scaling factors
unit gram g: Mass = 10⁻³ kilogram
unit kilometer km: Length = 10³ meter
unit centimeter cm: Length = 10⁻² meter
unit nanosecond ns: Time = 10⁻⁹ second
unit inch inches in: Length = 2.54 cm
unit foot feet ft: Length = 12 inches

Fortress (Sun HPCS Language). Fig.  Example declarations of physical dimensions and units

in that it is invoked implicitly by method dispatch if necessary. Among method overloadings, one that requires no coercion to be applicable is always considered more specific than one that requires one or more coercions. Coercion declarations may also be marked as widening; automatic rewriting rules on parts of the expression tree that include widening coercions provide the effect of “widest need evaluation” [], but in a manner that is fully under user (or library) control.
⊛ Declarations in terms of dimensions implicitly use default units.
kineticEnergy(m: R Mass, v: R Velocity): R Energy = (m v²)/2
⊛ Alternatively, declarations may specify units explicitly.
kineticEnergy(m: R kg, v: R m/s): R J = (m v²)/2
do
  mySpeed = . feet per second
  ⊛ The function kineticEnergy requires a speed measured in
  ⊛ meters per second, so mySpeed by itself is not a valid argument.
  ⊛ The in operator multiplies by the necessary scale factor,
  ⊛ which in this case is 12 × 2.54 × 10⁻². The scale factor
  ⊛ is determined automatically from the declarations of the units.
  kineticEnergy( kg, mySpeed in m/sec)
end

Fortress (Sun HPCS Language). Fig.  Examples of the use of declared physical dimensions and units
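A hand-rolled Scala sketch (hypothetical names; nothing like the generality of the Fortress unit system, which derives scale factors from the unit declarations themselves) shows the basic idea of carrying units in types and converting with an explicit scale factor:

    final case class MetersPerSec(value: Double) extends AnyVal
    final case class FeetPerSec(value: Double)   extends AnyVal

    object Units {
      // 1 ft = 12 in and 1 in = 2.54 cm, so the scale factor is 12 * 2.54e-2.
      def toMetersPerSec(v: FeetPerSec): MetersPerSec =
        MetersPerSec(v.value * 12 * 2.54e-2)

      // The parameter type forces callers to supply a speed already in m/s.
      def kineticEnergyJoules(massKg: Double, v: MetersPerSec): Double =
        0.5 * massKg * v.value * v.value
    }

    object UnitsDemo extends App {
      val mySpeed = FeetPerSec(5.0)
      // Units.kineticEnergyJoules(2.0, mySpeed)          // would not compile
      println(Units.kineticEnergyJoules(2.0, Units.toMetersPerSec(mySpeed)))
    }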

⊛ Assume the position method returns a value of type Maybe⟦Z⟧.


if p ← myString.position(‘=’) then
myString[  : p ] ⊛ Here p is bound to a value of type Z .

else
myString
end

Fortress (Sun HPCS Language). Fig.  Use of a generator expression in an if statement
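Scala's Option plays a role similar to Maybe; a rough analogue of the idiom above (a hypothetical example, not Fortress syntax) binds the contained value only in the branch where it exists:

    object PositionDemo extends App {
      val myString = "key=value"
      val position: Option[Int] = Some(myString.indexOf('=')).filter(_ >= 0)

      position match {
        case Some(p) => println(myString.substring(0, p))   // p is bound only here; prints "key"
        case None    => println(myString)                    // no '=' in the string
      }
    }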

● Fortress uses where clauses to bind additional type parameters and to constrain relationships among static type parameters. In particular, the constraint may be a subtype relationship. This general mechanism suffices to express covariance and contravariance (Fig. ), which are important type relationships that require more specialized mechanisms in other languages.
● MPI allows a program to inquire how many hardware processors there are, but little else about the execution environment. High Performance Fortran provides alignment and distribution directives that describe how arrays should be laid out across multiple processors, but the processors must be organized as a flat Cartesian grid, and the class of distributions (such as block, cyclic, and block-cyclic) is fixed. Fortress defines a hierarchical data structure (the region) that can describe hardware resources (both processors and memory) and relative communication costs, and can be queried by a running program. A Fortress distribution maps the parts of an aggregate data structure to one or more regions. The class of distributions is open ended and library programmers can define new distributions.

Influences from Other Programming Languages
Aspects of the Fortress type system, including inheritance and overloaded multimethod dispatch, were influenced by the ML, Haskell, Java, NextGen, Scala, CLOS, Smalltalk, and Cecil programming languages, among others. Inspirations for the use of mathematical syntax and a character set beyond ASCII include
trait FloorWax
advertise(self) = print “Great shine!”
end
trait DessertTopping
advertise(self) = print “Great taste!”
end
trait Shimmer extends { FloorWax, DessertTopping }
advertise(self) = do
⊛ Type assumption controls which “super-method” to call.

advertise(self asif FloorWax)


advertise(self asif DessertTopping)
end
end

Fortress (Sun HPCS Language). Fig.  Example use of asif expressions
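Scala's qualified super is a close analogue of the asif idiom above: it selects which supertrait's implementation of a conflicting method to invoke (a comparison sketch, not the Fortress mechanism):

    trait FloorWax       { def advertise(): Unit = println("Great shine!") }
    trait DessertTopping { def advertise(): Unit = println("Great taste!") }

    class Shimmer extends FloorWax with DessertTopping {
      // The conflict must be resolved explicitly; qualified super picks the parent.
      override def advertise(): Unit = {
        super[FloorWax].advertise()
        super[DessertTopping].advertise()
      }
    }

    object ShimmerDemo extends App { (new Shimmer).advertise() }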

⊛ List is covariant in its type parameter T.
trait List⟦T⟧ extends List⟦U⟧ where { T extends U }
  ...
end
⊛ OutputStream is contravariant in its type parameter E.
trait OutputStream⟦E⟧ extends OutputStream⟦G⟧ where { G extends E }
  ...
end

Fortress (Sun HPCS Language). Fig.  Use of where clauses to express covariance and contravariance of type parameters
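For comparison, Scala builds the same two relationships directly into variance annotations rather than expressing them with where clauses (a sketch with hypothetical names):

    trait MyList[+T]                                    // covariant, like List⟦T⟧ above
    trait MyOutputStream[-E] { def write(e: E): Unit }  // contravariant, like OutputStream⟦E⟧

    object VarianceDemo extends App {
      class Animal
      class Dog extends Animal

      val dogs: MyList[Dog] = new MyList[Dog] {}
      val animals: MyList[Animal] = dogs                // covariance: a list of dogs is a list of animals

      val animalSink: MyOutputStream[Animal] =
        new MyOutputStream[Animal] { def write(e: Animal): Unit = () }
      val dogSink: MyOutputStream[Dog] = animalSink     // contravariance: an animal consumer accepts dogs
      dogSink.write(new Dog)
    }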

APL, MADCAP, MIRFAC, the Klerer-May System, and COLASL. The idea of having “publication” (beautifully rendered) textual form as well as one or more “implementation” forms goes back to Algol . Notions of data encapsulation, comprehensions, generators, and reducers can be traced to CLU, Alphard, KRC, Miranda, and Haskell. In the area of functional programming, higher-order functions, and divide-and-conquer parallelism, many languages had an influence, but especially Lisp, Haskell, and APL. The approach to data distribution was strongly influenced by experience with High Performance Fortran and its predecessors.

Future Directions
The current Fortress compiler targets the Java Virtual Machine (JVM) and uses a custom class loader to instantiate generic types as they are needed. The JVM provides good multiplatform native code generation, threads, and garbage collection, and hosts libraries implementing both work-stealing and transactions. The Fortress type system is embedded into the Java type system where it is convenient, but tuple types and arrow types are handled specially. Fortress value types are eligible for compilation into an unboxed (primitive) representation. Fortress traits compile to Java interfaces, and Fortress objects compile to Java final classes.
A straightforward analysis of traits comprising object singletons can demonstrate finiteness, which permits the general application of unboxing to new types.
A specialized virtual machine (VM) and run-time system may someday be desirable, but at this time the benefits of targeting the JVM far outweigh the overhead.
The design of regions and distributions for managing the mapping of data and threads to hardware resources has been worked out on paper, but has not yet been implemented.
The original design of Fortress includes a form of where clause that conditionally gates such type relationships as extension, comprising, and exclusion. For example, one might define a generic aggregate data structure A⟦E⟧ with a multiplication operator that is defined in terms of multiplication for the element type E, and one might well wish to declare that the multiplication of aggregates is commutative if and only if multiplication of the elements is commutative. This may be achieved by declaring

    trait A⟦E⟧ extends { Commutative⟦A⟦E⟧, ×⟧
                         where { E extends Commutative⟦E, ×⟧ } }

The implications of such conditional type relationships for the implementation of a practical type checker are still a subject of research.
An object-oriented pattern-matching facility for easily binding multiple variables to the fields of an object, similar to the pattern-matching facilities in Haskell and Scala but with more provisions for user extensibility, has been sketched out but only recently implemented.

Related Entries
Chapel (Cray Inc. HPCS Language)
HPF (High Performance Fortran)
PGAS (Partitioned Global Address Space) Languages

Bibliography
1. Allen E, Chase D, Flood C, Luchangco V, Maessen JW, Ryu S, Steele Jr GL () Project Fortress Community website. http://projectfortress.java.net
2. Allen E, Chase D, Hallett J, Luchangco V, Maessen JW, Ryu S, Steele Jr GL, Tobin-Hochstadt S () The Fortress Language Specification, Version .. http://research.sun.com/projects/plrg/fortress.pdf, March . See also http://research.sun.com/projects/plrg/Publications/fortress.beta.pdf, March 
3. Allen E, Culpepper R, Nielsen JD, Rafkind J, Ryu S () Growing a syntax. In: ACM SIGPLAN Foundations of Object-Oriented Languages workshop, Savannah, 
4. Bailey DH, Barszcz E, Barton JT, Browning DS, Carter RL, Fatoohi RA, Frederickson PO, Lasinski TA, Simon HD, Venkatakrishnan V, Weeratunga SK () The NAS parallel benchmarks. Technical report, Int J Supercomput Appl ():–
5. Corbett RP () Enhanced arithmetic for Fortran. ACM SIGPLAN Notices ():–, http://doi.acm.org/./.
6. High Productivity Computer Systems program of the Defense Advanced Research Projects Agency (United States Department of Defense). http://www.highproductivity.org/ and see also //www.darpa.mil/IPTO/programs/hpcs/hpcs.asp
7. Steele Jr GL () Growing a Language. Invited talk. Abstract in OOPSLA ’ Addendum: Addendum to the  Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications. ISBN ---. ACM, New York, http://doi.acm.org/./. Transcript in Higher-Order and Symbolic Computation ,  (October ), –. Video at http://video.google.com/videoplay?docid=-#
8. The Unicode Consortium () The Unicode Standard, Version .. Addison-Wesley, Boston
9. Wiki Creole project and website. Sponsored by the Wiki Symposium. http://www.wikicreole.org/wiki/Creole.

Forwarding
Routing (Including Deadlock Avoidance)

Fujitsu Vector Computers
Kenichi Miura
National Institute of Informatics, Tokyo, Japan

Synonyms
Fujitsu vector processors; Fujitsu VPP systems; SIMD
Definition
Fujitsu Vector Machines are the series of supercomputers developed by Fujitsu from  to . They are based on the pipeline architecture, and the VPP systems have also incorporated a high degree of parallelism with distributed memory in the system architecture, utilizing a full crossbar network for non-blocking system interconnect.

Discussion
FACOM – Array Processor Unit (APU)
The first vector machine in Japan was the FACOM – Array Processing Unit (APU), which was installed at the National Aerospace Laboratory (which later became the Japanese Aerospace Exploration Agency (JAXA) in October ) in  [].
This system was organized as an asymmetric multiprocessor system consisting of the mainframe computer FACOM – and the Array Processor Unit. The clock was  ns, and the peak performance of the APU was  Mflop/s for addition/subtraction and  Mflop/s for multiplication. This system was a pipeline-based array processor with  data registers (scalar) and one vector register with , words ( bits/word) for vector operations. The capacity of the main memory was  M words with -way interleave. AP-Fortran, an extended Fortran language, was developed for the – APU to allow parallelism description.

FACOM VP/ Series
The vector processors FACOM VP and FACOM VP [–] were announced in July . The major goals of the VP/ Series were to achieve very high performance, ease of use, and a good affinity with the general-purpose computing environment of the FACOM M-Series mainframe computers. State-of-the-art semiconductor technology was incorporated in the system: ECL LSIs with a gate delay time of  ps (Fig. ) and  and , gates per chip, and high-speed RAM with an access time of . ns for the vector registers. As for the memory technology, a main memory unit with high capacity and high data transfer capability was realized by using  Kbit SRAM with an access time of  ns.

Fujitsu Vector Computers. Fig.  VP LSI packaging technology. (Photo Credit: Fujitsu Limited)

Figure  illustrates the architecture of the VP/ Series system. It consists of the scalar unit, vector unit, and main storage unit. The scalar unit fetches and decodes all the instructions and executes the scalar instructions. The vector unit consists of six functional pipeline units, vector registers, and mask registers. The functional pipes are an add/logical pipe, a multiply pipe, a divide pipe, a mask pipe, and two load/store ports, which are also pipelined. The first three pipes are for arithmetic operations. The load/store pipes support contiguous access, strided access, and indirect addressing (gather/scatter) for flexible vector operations. The mask pipes are used to handle conditional branches within loops. One of the most distinctive features of the VP/ Series is the dynamically reconfigurable vector registers. The total capacity of the vector registers for the VP is  K words ( bits), but they can take such configurations as  (vector length) × (number of vector registers),  × ,  × , . . . ,  × .
The major features of the VP/ Series are summarized in Table . Figure  shows the cabinet of the VP system.
The first delivery of a VP-Series system took place in January , that is, a VP- at the Institute for Plasma Physics in Nagoya. Later, the high-end model FACOM VP-, with improved performance, was added, and a processing performance exceeding  Gflop/s was achieved. Also, the market was expanded by adding lower-end models: the FACOM VP and FACOM VP.
The VP-Series E Models, enhanced versions of the VP/ Series, were announced in July .
Fujitsu Vector Computers. Fig.  VP architectural block diagram (main storage unit; vector unit with mask registers, reconfigurable vector registers, and add/logical, multiply, divide, mask, and load/store pipes; scalar unit with cache, scalar execution unit, general-purpose and floating-point registers; and channels)

Fujitsu Vector Computers. Table  VP Series specifications

Model:                      VP | VP | VP | VP
Announcement:               April,  | July,  | July,  | April, 
Machine cycle:              . ns | . ns | . ns | . ns
Max. performance:            Mflop/s |  Mflop/s |  Mflop/s | , Mflop/s
Total vector regs. size:     KB |  KB |  KB |  KB
Basic vector length:         elements |  elements |  elements |  elements
Total mask regs.:            kbits |  kbits |  kbits |  kbits
Logical no. × multiplicity of vector pipes:
  Add/logical:              × | × | × | ×
  Multiply:                 × | × | × | ×
  Divide:                   × | × | × | ×
  Mask:                     × | × | × | ×
  Load/Store:               × | × | × | ×
Ckt. technology (ECL) & delay:   gates/chip,  ps |  gates/chip,  ps |  gates/chip,  ps |  gates/chip,  ps
Memory technology:           kbits SRAM,  ns access |  kbits SRAM,  ns access |  kbits SRAM,  ns access |  kbits SRAM,  ns access
Main memory size:           , ,  MB | , ,  MB | , ,  MB | , ,  MB
Fujitsu Vector Computers. Fig.  VP system. (Photo Credit: Fujitsu Limited)

Fujitsu Vector Computers. Fig.  VP architectural block diagram (system storage unit; main storage unit; vector unit with mask registers, vector registers, multiplication & addition/logical pipes, load/store pipes, and a division pipe; scalar unit with buffer storage and scalar execution unit; channel processor)
* Model VP2000 provides separate pipelines for loading/storing and addition/subtraction/logic calculations.
** Uni-processor models have one scalar unit.


Fujitsu Vector Computers. Table  VP Series specifications

Models:                              VP/, VP/ | VP/, VP/ | VP/, VP/ | VP/, VP/
Announcement:                        Dec.  | Dec.  | Dec.  | Dec. 
Max. performance:                     Mflop/s |  Gflop/s |  Gflop/s |  Gflop/s
No. scalar units / vector units:     – /  | – /  | – /  | – / 
Total capacity of vector registers:   KB |  KB |  KB |  KB
Total capacity of mask registers:     kbits |  kbits |  kbits |  kbits
No. vector pipes & multiplicity:      |  |  | 
Ckt. technology (ECL) & delay/gate:   K gates,  ps |  K gates,  ps |  K gates,  ps |  K gates,  ps
Ckt. technology (capacity & delay):   kbit SRAM, . ns |  kbit SRAM, . ns |  kbit SRAM, . ns |  kbit SRAM, . ns
Capacity of main memory:              MB– GB |  MB– GB |  MB– GB |  MB– GB
Memory technology / delay:            Mbit SRAM /  ns |  Mbit SRAM /  ns |  Mbit SRAM /  ns |  Mbit SRAM /  ns
Capacity of system storage:          ,,,, GB |  ,,,, GB | ,,,, GB | ,,,, GB
I/O capabilities:                    Max.  GB/s | Max.  GB/s | Max.  GB/s | Max.  GB/s
No. channels:                        – | – | – | –
Block Mpx Ch. / Optical Ch.:         . MB/s / . MB/s | . MB/s /  MB/s | . MB/s / . MB/s | . MB/s / . MB/s

There were five models: VP E, VP E, VP E, VP E, and VP E. These E models achieved performance . times more powerful than that of the previous VP/ Series machines by installing high-speed vector memory and an additional arithmetic pipeline for fused multiply-add operations.

FUJITSU VP Series
The VP Series [–] was announced in December  as the successor to Fujitsu’s VP-E Series supercomputers. Figure  illustrates the architecture of VP Series systems. It carries most of the VP/ architectural features with various enhancements in the vector unit, such as two sets of fused multiply-add pipelines, doubled mask pipelines, and a faster division circuit. The most notable feature of the VP Series is that it can be configured with one or two scalar units sharing one vector unit. This feature, called Dual Scalar Processors (DSP), may be regarded as a two-threaded system. The objective of DSP is to utilize the vector unit more efficiently when the vectorization ratio of users’ application programs is not very high [].
The key features of the VP Series system are summarized in Table .

Numerical Wind Tunnel and VPP-Series
Numerical Wind Tunnel
The National Aerospace Laboratory (NAL) of Japan developed the Numerical Wind Tunnel (NWT) [] parallel supercomputer system jointly with Fujitsu. The system architecture was a distributed memory vector-parallel computer, in which vector supercomputers
were connected with the non-blocking crossbar network. The NWT's final specifications were  processing elements, each with . Gflop/s of processing power, to give a peak performance of  Gflop/s with . GB of total main memory.
The NWT's processing element consisted of a CPU and memory. The CPU is a vector supercomputer by itself, comprising three types of LSIs: GaAs, ECL, and BiCMOS,  in total. It is one of the very few commercially successful GaAs-based systems. The primary cooling method was water cooling, but the forced air-cooling method was also used for the memory units.
Figure  shows the NWT as installed at NAL. The NWT, which went into operation in January , was rated as the most powerful supercomputer site in the world in the TOP list from  to . It should be noted that the NWT was the technology precursor to the VPP, which will be described in more detail in the next section.

Fujitsu Vector Computers. Fig.  NWT system. (Photo Credit: JAXA)

FUJITSU VPP
The VPP was Fujitsu's third generation supercomputer system [, –]. As stated in the previous section, it was a commercial version of the NWT. Fujitsu announced the VPP in September  as the world's most powerful supercomputer at that point in time, with a speed of  Gflop/s. Fujitsu developed processor elements (PEs), each capable of vector processing and also able to operate in a highly parallel fashion to obtain this performance. The maximum configuration was  processing elements, reaching  Gflop/s.
As shown in Fig. , each processing element consisted of a scalar unit, a vector unit, a main storage unit, and a Data Transfer Unit (DTU), built with Gallium Arsenide LSIs, Bipolar LSIs, BiCMOS LSIs, and ultrahigh-density circuit boards, and was able to deliver a maximum vector performance of . Gflop/s. The scalar unit has a VLIW architecture and may execute up to four operations in parallel. In order to achieve very high system performance, PEs were connected with a high-speed crossbar network which allows non-blocking data transfer. The data transfer rate was  MB/s from one PE to another. Since a PE can simultaneously send and receive data, the aggregate data transfer rate per PE was  MB/s. Figure  illustrates the system configuration of the VPP, based around the crossbar network. In order to fully utilize the powerful capability of the interconnecting network, the Data Transfer Unit (DTU) was incorporated, which allows the mapping between local and global addresses, very fast synchronization across PEs, and the actual data transfer with a proprietary message passing protocol.

[Figure: VPP architectural block diagram, showing the network connected through a Data Transfer Unit (DTU) to a vector unit (mask registers, mask, multiply, add/logical, and divide pipelines, load and store paths, and vector registers), a main storage unit, and a scalar unit (cache, scalar execution unit, general purpose registers, and floating point registers).]
Fujitsu Vector Computers. Fig.  VPP architectural block diagram

[Figure: VPP system configuration, showing processor elements PE0 through PEn, each comprising a vector unit (VU), scalar unit (SU), data transfer unit (DTU), and memory (MEM), with I/O processors (IOP) attached to some PEs, all connected by a crossbar interconnect.]
Fujitsu Vector Computers. Fig.  VPP system configuration

Various data transfer modes were supported by the DTU. They were: contiguous, constant-strided, sub-array, and indirect addressing (random gather/scatter).
Figure  shows the inside of one cabinet of the VPP, which contains  PEs. Plumbing for water cooling is also visible in this photo.
The specifications of the VPP system and the follow-on products are summarized in Table .

New Parallel Language
Fujitsu developed a new parallel language, called VPP Fortran, which can define a single name space across the distributed memory units of the VPP. The ability to define array data in Fortran programs, distributed across multiple processing elements, as a single global space greatly simplified the porting and tuning of existing codes to the VPP. VPP Fortran may be regarded as an early implementation of PGAS (Partitioned Global Address Space). More details are described in [].

VPP/VPP/VPP Series
After a great success with the VPP, Fujitsu developed CMOS versions of the vector-parallel system.

The Fujitsu VPP Series and VX Series (the single CPU model), which used CMOS (complementary metal oxide semiconductor) technology, were introduced in February , the Fujitsu VPP Series in March , and the VPP Series in April . All VPP Series supercomputers after the VPP Series used CMOS technology. The packaging technology of the VPP is shown in Fig. .

Fujitsu Vector Computers. Fig.  VPP cabinet. (Photo Credit: Fujitsu Limited)

The VPP was the successor to the former VPP/VPPE systems (with E for extended, i.e., a clock cycle of . instead of  ns). The clock cycle has been halved, and the vector pipes are able to deliver fused multiply-add results. With a multiplicity of  for these vector pipes,  floating-point results per clock cycle can be generated. In this way, a fourfold increase in speed per processor can be attained with respect to the VPPE.
The architecture of the VPP nodes is almost identical to that of the VPP, but one of the major enhancements is in the performance of noncontiguous memory access, by incorporating more address matching circuits.

Conclusion
While vector processing with very sophisticated hardware pipelines together with a highly interleaved memory subsystem was the dominant supercomputer architecture in the s and s, the wave of highly (or massively) parallel scalar systems has been gaining

Fujitsu Vector Computers. Table  VPP Series specifications (typical models)

Models VPP VPPE VPP


Announcement September,  February,  April, 
Ckt. technology GaAs/ECL/Bipolar CMOS(.μ) CMOS(.μ)
Level of integration  K Tx’s (GaAs)  M Tx’s/chip  M Tx’s/chip
Delay time/gate  ps  ps  ps
Cooling method Liquid (water) Forced air Forced air
Clock frequency  MHz  MHz  MHz
Max. perf./PE . Gflop/s . Gflop/s . Gflop/s
Mem. capacity/PE  GB  GB – GB
Memory technology  Mbit SRAM  Mbit SDRAM  Mbit SDRAM,  Mbit SSRAM
I/O capability/PE  MB ×   MB ×  . GB × 
Max. configuration  PEs  PEs  PEs
Max. system perf.  Gflop/s . Tflop/s . Tflop/s
CPU architecture  bit  bit  bit

Fujitsu Vector Computers. Fig.  VPP PE/Memory board. (Photo Credit: Fujitsu Limited)

more popularity in this century. The major reasons may be stated as follows:

. Pipelining has become a common technique even in scalar microprocessor design.
. Larger market opportunities have arisen with the advent of high-volume production of powerful scalar microprocessors.
. The highly interleaved memory structure of the vector architecture has become a more expensive implementation, as compared with the cache-based approach of the scalar architecture, for coping with the widening speed gap between CPU and memory.
. More and more application programs have been tailored toward scalar architecture.

Thus, Fujitsu decided to change direction from vector architecture to highly parallel scalar Symmetric Multiprocessor (SMP) systems and their constellations. The VPP system was Fujitsu's last vector product.
But it should also be pointed out that the recent trends toward SIMD instructions in general purpose microprocessors and the popularity of GPGPU (General Purpose Graphic Processing Unit) computing could be regarded as "the return of vectors," in the sense that the programming models and compiler technology developed for the vector architecture continue to be vital technologies in the new approaches.

Related Entries
PGAS (Partitioned Global Address Space) Languages
TOP
VLIW Processors

Bibliography
. Uchida K, Seta Y, Tanakura Y () The FACOM - Array Processor System. In: Proc rd USA-JAPAN Computer Conference, San Francisco, CA. AFIPS Press, Montvale, NJ, pp –
. Miura K, Uchida K () FACOM vector processor VP-/VP-. In: Kowalik JS (ed) High-speed computation, NATO ASI Series F: computer and systems sciences, vol . Springer, Berlin, pp –
. Miura K () Fujitsu's supercomputer: FACOM vector processor system. In: Fernbach S (ed) Supercomputers: class VI systems, hardware and software. North-Holland, Amsterdam, pp –
. Matsuura T, Miura K, Makino M () Supervector performance without toil: Fortran implemented vector algorithms on the VP-/. Comput Phys Commun :–

. Miura K () Vectorization and parallelization of transport


Monte Carlo simulation codes. In: NATO ASI Series vol F, Functional Decomposition
pp –
. Uchida N et al () Fujitsu VP Series. In: Proceedings Domain Decomposition
of COMPCON Spring’, San Francisco, CA. IEEE Computer
Society Press, Washington, DC, pp –
. Miura K, Nagakura H, Tamura H () VP Series dual
scalar and quadruple scalar models supercomputer systems – A
new concept in vector processing. In: Proceedings of COMP-
Functional Languages
CON Spring’, San Francisco, CA. IEEE Computer Society Press,
Washington, DC, pp – Philip Trinder , Hans-Wolfgang Loidl ,
. Hwang K () Section .. Fujitsu VP and VPP. In: Kevin Hammond
Advanced computer architecture: parallelism, scalability, pro- 
Heriot-Watt University, Edinburgh, UK
grammability. McGraw-Hill, Inc., New York, pp – 
University of St. Andrews, St. Andrews, UK
. Miyoshi H et al () Development and achievement of NAL
numerical wind tunnel (NWT) for CFD computations. In: Pro-
ceedings of  ACM/IEEE conference on supercomputing
(SC’), Washington, DC. IEEE Computer Society Press (), Definition
Washington, DC, pp – Parallel functional languages are parallel variants of
. Utsumi T, Ikeda M, Takamura M () Architecture of the functional languages, that is, languages that treat com-
VPP parallel supercomputer. In: Proceedings of supercom-
putation as the evaluation of mathematical functions
puting’, Washington, DC. IEEE Computer Society Press, Wash-
ington, DC, pp –
and avoid state and mutable data. Functional languages
. Nakashima Y, Kitamura T, Tamura H, Takiuchi M, Miura K () are founded on the lambda calculus. The majority
The scalar processor of the VPP parallel supercomputer. In: of parallel functional languages add a small number
Proceedings of the th ACM international conference on super- of high-level parallel coordination constructs to some
computing (ICS’), Manchester, England. ACM, New York, functional language, although some introduce paral-
pp –
lelism without changing the language.
. Nakanishi M, Ina H, Miura K () A high perfor-
mance linear equation solver on the VPP parallel super-
computer. In: Proceedings of  ACM/IEEE conference on Discussion
supercomputing (SC’), Washington, DC. IEEE Computer
Society Press, Washington DC, pp – Introduction
. Iwashita H et al () VPP Fortran and parallel programming on The potential of functional languages for parallelism has
the VPP supercomputer. In: Proceedings of  international
been recognized for over  years, for example [].
symposium on parallel architectures, algorithms and networks
(ISPAN’), Kanazawa, Japan, – Dec , pp –
The key advantage of functional languages is that they
. Nodomi A, Ikeda M, Takamura M, Miura K () Hardware per- mainly, or solely, contain stateless computations. Sub-
formance of the VPP parallel supercomputer. In: Dongarra J, ject to data dependencies between the computations,
Grandinetti L, Joubert G, Kowalik J (eds) High performance com- the evaluation of stateless computations can be arbi-
puting: technology, method and applications, vol . Advances in trarily reordered or interleaved while preserving the
Parallel Computing. Elsevier, Amsterdam, pp –
sequential semantics.
Parallel programs must not only specify the com-
putation, that is, a correct and efficient algorithm, it
Fujitsu Vector Processors must also specify the coordination, for example, how the
program is partitioned, or how parts of the program
Fujitsu Vector Computers are placed on processors. As functional languages pro-
vide high level constructs for specifying computation,
for example, higher-order functions and sophisticated
Fujitsu VPP Systems type systems, they typically provide correspondingly
high level coordination sublanguages. A wide range
Fujitsu Vector Computers of parallel paradigms and constructs have been used,
for example, data-parallelism [] and skeleton-based parallelism [].
As with computation, the great advantage of high-level coordination is that it frees the programmer from specifying low-level coordination details. Where low-level coordination constructs encourage programmers to construct static, simple, or regular coordination, higher-level constructs encourage more dynamic and irregular coordination. The challenges of high-level coordination are that automatic coordination management complicates the operational semantics, makes the performance of programs opaque, requires a sophisticated language implementation, and is frequently less effective than hand-crafted coordination. Despite these challenges many new nonfunctional parallel programming languages also support high-level coordination; examples include Fortress [], Chapel [], and X [].
Coordination may be managed statically by the compiler as in PSML [], dynamically by the run-time system as in GpH [], or by both as in Eden []. Whichever mechanism is chosen, the implementation of sophisticated automatic coordination management is arduous, and there have been many more parallel language designs than well-engineered implementations.
The combination of a high-level computation and coordination language requires specialized parallel development tools and methodologies, as discussed below.

Strict and Non-strict Parallel Functional Languages
Most programming languages strictly evaluate the arguments to a function before evaluating the function. However, as purely functional languages have significant freedom of evaluation order, some use non-strict or lazy evaluation. Here the arguments to a function are evaluated only if and when they are demanded by the function body. In consequence more programs terminate, it is easy to program with infinite structures, and programming concerns are separated [].
Like most languages, many parallel functional languages are strict, for example, NESL [] or Manticore []. However there are many parallel variants of lazy languages like Haskell [] and Clean []. At first glance this is surprising as lazy evaluation is sequential and performs minimum work, but non-strict languages have a number of advantages for parallelism []. They can be easily married to many different coordination languages as the execution order of expressions is immaterial; they naturally support highly dynamic coordination where evaluation is performed and data communicated on demand; and their implementations already have many of the mechanisms required for parallel execution. Laziness also exists in the language implementations, for example, distributing work and data on demand [].
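As a small illustration of non-strict evaluation (a sketch written for this entry, not taken from any of the systems cited above), the Haskell fragment below defines a conceptually infinite list but terminates, because only the prefix demanded by the consumer is ever evaluated.

```haskell
-- A sketch of non-strict (lazy) evaluation in Haskell.
-- The infinite list is never built in full; only the five
-- elements demanded by 'take' are ever evaluated.
squares :: [Integer]
squares = [ n * n | n <- [1 ..] ]   -- conceptually infinite

main :: IO ()
main = print (take 5 squares)       -- prints [1,4,9,16,25]
```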
Paradigms
Parallel functional languages have adopted a range of paradigms. For the main parallel paradigms this section outlines the correspondence with the functional paradigm, and describes a representative language. Some languages support more than one paradigm, for example, Manticore [] combines data and task parallelism, and Eden [] combines semi-explicit and skeleton parallelism.

Skeleton-Based Languages
Algorithmic skeletons [, ] are a popular parallel coordination construct, and as higher-order functions they fit naturally in the functional model. Often these higher-order functions work over compound data structures like lists or vectors, and consequently the resulting parallel code often resembles data parallel code, as discussed in the next section.
Example skeleton-based functional languages include SCL [], PL [], Eden [], PSML [], and HDC. HDC [] is a strictly evaluated subset of the Haskell language with skeleton-based coordination. HDC programs are compiled using a set of skeletons for common higher-order functions, like fold and map, and several forms of divide-and-conquer. The language supports two divide-and-conquer skeletons and a parallel map, and the system relies on the use of these higher-order functions to generate parallel code.
Data Parallel Languages
Data parallel languages [] focus on the efficient implementation of the parallel evaluation of every element
in a collection. Functional languages fit well with the data parallel paradigm as they provide powerful operations on collection types, and in particular lists. Indeed, all of the languages discussed here use some parallel extension of list comprehensions and implicitly parallel higher-order functions such as map. Compared to other approaches to parallelism, the data parallel approach makes it easier to develop good cost models, although it is notoriously difficult to develop cost models for languages with a non-strict semantics.
Example data parallel functional languages include NESL [] and Data Parallel Haskell []. NESL is a strict, strongly typed language with implicit parallelism and implicit thread interaction. It has been implemented on a range of parallel architectures, including several vector computers. A wide range of algorithms have been parallelized in NESL, including a Delaunay algorithm for triangularization [], several algorithms for the n-body problem [], and several graph algorithms.
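For a flavor of the data parallel style in Haskell (a sketch, rather than NESL or Data Parallel Haskell code), an implicitly parallel map can be expressed with the parMap combinator from the Strategies library of GHC's parallel package; every element computation is independent and may be evaluated in parallel.

```haskell
import Control.Parallel.Strategies (parMap, rdeepseq)

-- Apply an expensive function to every element of a collection;
-- parMap evaluates the results in parallel, element-wise.
norms :: [[Double]] -> [Double]
norms = parMap rdeepseq (sqrt . sum . map (^ 2))

main :: IO ()
main = print (norms [[3, 4], [5, 12], [8, 15]])  -- [5.0,13.0,17.0]
```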
Semi-explicit Languages
Semi-explicit parallel languages provide a few high-level constructs for controlling key coordination aspects, while automatically managing most coordination aspects statically or dynamically. Historically, annotations were commonly used for semi-explicit coordination, but more recent languages provide compositional language constructs. As a result, the distinction between semi-explicit coordination and coordination languages is now rather blurred, but the key difference in the approach is that semi-explicit languages aim for minimal explicit coordination and minimal change to language semantics.
Example semi-explicit parallel functional languages include Eden [] and GpH []. GpH is a small extension of Haskell with a parallel (par) composition primitive. par returns its second argument and indicates that the first argument may be executed in parallel. In this model the programmer only has to expose expressions in the program that can usefully be evaluated in parallel. The runtime system manages the details of the parallel execution such as thread creation, communication, etc. Experience in implementing large programs in GpH shows that the unstructured use of par and seq operators often leads to obscure programs. This problem is overcome by evaluation strategies: lazy, polymorphic, higher-order functions controlling the evaluation degree and the parallelism of a Haskell expression. They provide a clean separation between coordination and computation. The driving philosophy behind evaluation strategies is that it should be possible to understand the computation specified by a function without considering its coordination.
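The Haskell sketch below (assembled for this entry, not taken from the GpH papers) shows both styles: raw par/pseq annotations, and the same parallelism expressed with an evaluation strategy so that the coordination is attached separately from the computation. It relies on GHC's parallel package.

```haskell
import Control.Parallel (par, pseq)
import Control.Parallel.Strategies (using, parList, rseq)

-- Unstructured coordination with par/pseq: evaluate x in parallel
-- with y, then combine the results.
parPair :: Int -> Int -> Int
parPair a b = x `par` (y `pseq` x + y)
  where
    x = fib a
    y = fib b

-- The same idea with an evaluation strategy: the computation
-- (map fib) is unchanged; `using` attaches the coordination.
parFibs :: [Int] -> [Int]
parFibs ns = map fib ns `using` parList rseq

fib :: Int -> Int
fib n | n < 2     = n
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = print (parPair 30 31, sum (parFibs [20 .. 28]))
```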
Coordination Languages
Parallel coordination languages [] are separate from the computation language and thereby provide a clean distinction between coordination and computation. Historically, Linda [] and PCN [] have been the most influential coordination languages, and often a coordination language can be combined with many different computation languages, typically Fortran or C. Other systems such as SCL [] and PL [] focus on a skeleton approach for introducing parallelism and employ sophisticated compilation technology to achieve good resource management.
Example coordination languages using functional computation languages include Haskell-Linda [] and Caliban [, ]. Caliban has constructs for explicit partitioning of the computation into threads, and for assigning threads to (abstract) processors in a static process network. Communication between processors works on streams, that is, eagerly evaluated lists, similar to Eden.

Dataflow Languages
Historically a number of functional languages adopted the dataflow paradigm and some, like SISAL [], achieved very impressive performance.
Example dataflow functional languages include SISAL [] and the pHluid system []. SISAL [] is a first-order, strict functional language with implicit parallelism and implicit thread interaction. Its implementation is based on a dataflow model and it has been ported to a range of parallel architectures. Comparisons of SISAL code with parallel Fortran code show that its performance is competitive with Fortran, without adding the additional complexity of explicit coordination [].

Explicit Languages
In contrast to the high-level coordination found in the majority of parallel functional languages, a number of languages support low-level explicit coordination. For example, there are several bindings for explicit message passing libraries, such as PVM [] and MPI [], for languages like Haskell and OCaml [, ].
These languages use an open system model of explicit parallelism with explicit thread interaction. Since the coordination language is basically a stateful (imperative) language, stateful code is required at the coordination level. Although the high availability and portability of these systems are appealing, the language models suffer from the rigid separation between the stateful and purely functional levels.
Other functional languages support explicit coordination, for example, Erlang []. Erlang is probably the most commercially successful functional language, and was developed in the telecommunications industry for constructing distributed, real-time, fault tolerant systems. Erlang is strict, impure, weakly typed, and relatively simple, omitting features such as currying and higher order functions. However the language has a number of extremely useful features, including the OTP libraries, hot loading of new code into running applications, explicit time manipulation to support soft real-time systems, message authentication, and sophisticated fault tolerance.

Tools and Methodologies
Development methodologies for parallel functional programs exploit the amenability of the languages to derivation and analysis. For example, a programmer can reason about costs during program design using abstract cost models like BMF-PRAM []. Guided by the cost information, the program design can relatively easily be transformed to reduce resource consumption before implementation, for example []. Similarly, automated static resource analysis can provide information to improve coordination; for example, predicted task execution time can improve scheduling.
Parallel functional languages favour high-level and often dynamic coordination in contrast to detailed static coordination. As a consequence the parallel behaviour of programs is often far from obvious, and hence hard to tune. Hence suites of profiling and visualization tools are very important for many parallel functional languages, for example [, , ].
Implicit parallelism, often promised in the context of functional languages, offers the enticing vision of parallel execution without changes to the program. In reality, however, the program must be designed with parallelism in mind to avoid unnecessary sequentiality. In theory, program analyses such as granularity, sharing, and usage analysis can be used to automatically generate parallelism. In practice, however, almost all current systems rely on some level of programmer control.
Current development methodologies, like [], have several interesting features. The combination of languages with high-level coordination and good profiling tools facilitates the prototyping of alternative parallelisations. Obtaining good coordination at an early stage of parallel software development avoids expensive redesigns. In later development stages, detailed control over small but crucial parts of the program may be required, and profiling tools can help locate expensive parallel computations. During performance tuning the high level of abstraction may obscure key low-level features that could be usefully controlled by the programmer.

Parallel Functional Languages Today
Parallel functional languages are used in a range of domains including numeric, symbolic, and data intensive []. Commercially, Erlang has been enormously successful for developing substantial telecoms and banking applications [], and parallel Haskell for providing standardized parallel computational algebra systems.
The high-level coordination and computation constructs in parallel functional languages have been very influential. For example, algorithmic skeletons appear in MPI [] and are the interface for cloud parallel search engines like Google MapReduce. Similarly, stateless threads or comprehensions occur in modern parallel languages like Fortress [], Chapel [], or X []. Moreover the dominance of multicore and emergence
of many-core architectures has focused attention on the importance of stateless computation.
Some trends are as follows. As clock speeds plateau but register width increases, data parallelism becomes attractive for functional, and other, parallel languages. In the functional community there are often multiple parallelizations of a single base language; for example, GpH and Eden are both Haskell variants. Moreover, a single compiler often supports more than one parallel version; for example, the Glasgow Haskell Compiler has four experimental multicore implementations [].

Related Entries
Data Flow Graphs
Data Flow Computer Architecture
MPI (Message Passing Interface)
Parallel Skeletons
Profiling

Languages
Chapel (Cray inc. HPCS Language)
Concurrent ML
Eden
Fortress (Sun HPCS Language)
Glasgow Parallel Haskell
Multilisp
NESL

Bibliographic Notes and Further Reading
A sound introduction to parallel functional language concepts and research can be found in []. A comprehensive survey of parallel and distributed languages based on Haskell is available in []. A recent series of workshops studies declarative aspects of multicore programming, for example [].

Bibliography
. Allen E, Chase D, Flood C, Luchangco V, Maessen J-W, Steele S, Ryu GL () Project fortress: a multicore language for multicore processors. Linux Magazine, September 
. Armstrong J () Programming Erlang: software for a concurrent world. The Pragmatic Bookshelf, Raleigh, NC
. Bacci B, Danelutto M, Orlando S, Pelagatti S, Vanneschi M () P L: a structured high level programming language and its structured support. Concurrency Practice Exp ():–
. Berthold J, Loogen R () Visualizing parallel functional program runs – case studies with the Eden trace viewer. In: ParCo': Proceedings of the parallel computing: architectures, algorithms and applications, Jülich, Germany
. Blelloch GE () Programming parallel algorithms. Commun ACM ():–
. Breitinger S, Loogen R, Priebe S () Parallel programming with Haskell and MPI. In: IFL' – International Workshop on the implementation of functional languages, University College London, UK, September . Draft proceedings, pp –
. Blelloch GE, Miller GL, Talmor D () Developing a practical projection-based parallel Delaunay algorithm. In: Symposium on Computational Geometry. ACM, Philadelphia, PA
. Blelloch GE, Narlikar G () A practical comparison of N-body algorithms. In: Parallel Algorithms, volume  of Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, University of Tennessee, Knoxville, TN
. Cann D () Retire Fortran? A debate rekindled. Commun ACM ():–
. Chamberlain BL, Callahan D, Zima HP () Parallel programmability and the Chapel language. Int J High Perform Comput Appl ():–
. Charles P, Donawa C, Ebcioglu K, Grothoff C, Kielstra A, von Praun C, Saraswat V, Sarkar V () X: An object-oriented approach to non-uniform cluster computing. In: OOPSLA'. ACM Press, New York
. Carriero N, Gelernter D () How to write parallel programs: a guide to the perplexed. ACM Comput Surv ():–
. Chakravarty MMT (ed) () ACM SIGPLAN Workshop on Declarative Aspects of Multicore Programming (DAMP'). ACM Press, New York
. Cole M () Algorithmic skeletons. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer-Verlag, New York, pp –
. Darlington J, Guo Y, To HW () Structured parallel programming: theory meets practice. In: Milner R, Wand I (eds) Research Directions in Computer Science. Cambridge University Press, Cambridge, MA
. Ellmenreich N, Lengauer C () Costing stepwise refinements of parallel programs. Comp Lang Syst Struct (–):– (special issue on semantics and cost models for high-level parallel programming)
. Flanagan C, Nikhil RS () pHluid: the design of a parallel functional language implementation on workstations. In: ICFP' – International Conference on functional programming, ACM Press, Philadelphia, PA, pp –
. Foster I, Olson R, Tuecke S () Productive parallel programming: The PCN approach. J Scientific Program ():–
. Fluet M, Rainey M, Reppy J, Shaw A () Implicitly-threaded parallelism in Manticore. In: ICFP' – International Conference on functional programming, ACM Press, Victoria, BC

. Glasgow Haskell Compiler. WWW page, . http://www.haskell.org/ghc/
. Herrmann C, Lengauer C () HDC: a higher-order language for divide-and-conquer. Parallel Processing Lett (–):–
. Hammond K, Michaelson G () Research directions in parallel functional programming. Springer-Verlag, New York
. Hughes J () Why functional programming matters. Computer J ():–
. Jones Jr D, Marlow S, Singh S () Parallel performance tuning for Haskell. In: Haskell ': Proceedings of the nd ACM SIGPLAN symposium on Haskell. ACM Press, New York
. Kelly PHJ () Functional programming for loosely-coupled multiprocessors. Research monographs in parallel and distributed computing. MIT Press, Cambridge, MA
. Kelly P, Taylor F () Coordination languages. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer-Verlag, New York, pp –
. Sisal Performance Data. WWW page, January . http://www.llnl.gov/sisal/PerformanceData.
. Loogen R, Ortega-Mallen Y, Pena R () Parallel functional programming in Eden. J Funct Program ():–
. MPI-: Extensions to the message-passing interface. Technical report, University of Tennessee, Knoxville, TN, July 
. Chakravarty MMT, Leshchinskiy R, Jones SP, Keller G () Partial vectorisation of Haskell programs. In: DAMP': ACM SIGPLAN workshop on declarative aspects of multicore programming, San Francisco, CA
. Michaelson G, Horiguchi S, Scaife N, Bristow P () A parallel SML compiler based on algorithmic skeletons. J Funct Program ():–
. Nöcker EGJMH, Smetsers JEW, van Eekelen MCJD, Plasmeijer MJ () Concurrent Clean. In: PARLE' – parallel architectures and languages Europe, LNCS . Springer-Verlag, Veldhoven, the Netherlands, pp –
. O'Donnell J () Data parallelism. In: Hammond K, Michaelson G (eds) Research directions in parallel functional programming. Springer-Verlag, New York, pp –
. Jones SLP, Hughes J, Augustsson L, Barton D, Boutel B, Burton W, Fasel J, Hammond K, Hinze R, Hudak P, Johnsson T, Jones M, Launchbury J, Meijer E, Peterson J, Reid A, Runciman C, Wadler P () Haskell : a non-strict, purely functional language. Electronic document available on-line at http://www.haskell.org/, February 
. Peterson J, Trifonov V, Serjantov A () Parallel functional reactive programming. In: PADL' – practical aspects of declarative languages, LNCS . Springer-Verlag, New York, pp –
. Parallel Virtual Machine Reference Manual. University of Tennessee, August 
. Rabhi FA, Gorlatch S (eds) () Patterns and skeletons for parallel and distributed computing. Springer-Verlag, New York
. Skillicorn DB, Cai W () A cost calculus for parallel functional programming. J Parallel Distrib Comput ():–
. Taylor FS () Parallel functional programming by partitioning. PhD thesis, Department of Computing, Imperial College, London
. Trinder PW, Hammond K, Loidl H-W, Jones SLP () Algorithm + Strategy = Parallelism. J Funct Program ():–
. Trinder PW, Hammond K, Mattson Jr JS, Partridge AS, Jones SLP () GUM: a portable implementation of Haskell. In: PLDI' – programming language design and implementation, Philadelphia, PA, May 
. Trinder PW, Loidl H-W, Pointon RF () Parallel and distributed Haskells. JFP, . Special issue on Haskell
. Weber M () hMPI – Haskell with MPI. WWW page, July . http://www-i.informatik.rwth-aachen.de/~michaelw/hmpi.html
. Wegner P () Programming languages, information structures and machine organisation. McGraw-Hill, New York

Futures
Cormac Flanagan
University of California at Santa Cruz, Santa Cruz, CA, USA

Synonyms
Eventual values; Promises

Definition
The programming language construct "future E" indicates that the evaluation of the expression "E" may proceed in parallel with the evaluation of the rest of the program. It is typically implemented by immediately returning a future object that is a proxy for the eventual value of "E." Attempts to access the value of an undetermined future object block until that value is determined.

Discussion
Introduction
The construct "future E" permits the run-time system to evaluate the expression "E" in parallel with the rest of the program. To initiate parallel evaluation, the run-time system forks a parallel task that evaluates "E," and also creates an object known as a future object that will eventually contain the result of that computation of "E." The future object is initially undetermined, and becomes
determined or resolved once the parallel task evaluates "E" and updates (or resolves/fulfils/binds) the future object to contain the result of that computation.
Parallel evaluation arises because the future object is immediately returned as the result of evaluating "future E," without waiting for the future object to become determined. Thus, the continuation (or enclosing context) of "future E" executes in parallel with the evaluation of "E" by the forked task.
The result of performing a primitive operation on a future object depends on whether that operation is considered strict or not. Some program operations, such as assignments, initializing a data structure, passing an argument to a procedure, or returning a result from a procedure, are non-strict. These operations can manipulate (determined or undetermined) future objects in an entirely transparent fashion, and do not need to know the underlying value of that future object.
In contrast, strict program operations, such as addition, need to know the value of their arguments, and so cannot be directly applied to future objects. Instead, the arguments to these operations must first be converted to a regular (non-future) value, by forcing or touching that argument.
Forcing a future object that has been determined extracts its underlying value (which may in turn be a future, and so the force operation must operate recursively). Forcing an undetermined future object blocks until that future is determined, and so force operations introduce synchronization between parallel tasks.

Implicit Versus Explicit Futures
A central design choice in a language with futures is whether forcing operations should be explicit in the source code, or whether they should be performed implicitly by strict operations, such as addition, that need to know the values of their arguments. Explicit futures can be implemented as a library that introduces an additional data type. For example, the polymorphic type Future<T> could denote a future object containing values of type T. However, explicit futures require explicit forcing operations that are tedious and error-prone for programs that use futures heavily.
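As a concrete, deliberately minimal sketch of such an explicit-futures library, the Haskell fragment below builds a Future type on top of lightweight threads and synchronizing variables; the names Future, future, and force are illustrative and not part of any of the systems discussed in this entry.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newEmptyMVar, putMVar, readMVar)
import Control.Exception (evaluate)

-- A future object is a proxy for a value that a forked task will
-- eventually produce.
newtype Future a = Future (MVar a)

-- "future E": fork a task that runs the action and resolves the
-- future object with its result.
future :: IO a -> IO (Future a)
future act = do
  cell <- newEmptyMVar
  _ <- forkIO (act >>= putMVar cell)
  return (Future cell)

-- Forcing (touching) a future: blocks until the future is
-- determined, then returns its underlying value.
force :: Future a -> IO a
force (Future cell) = readMVar cell

main :: IO ()
main = do
  f <- future (evaluate (sum [1 .. 1000000 :: Integer]))
  -- other work could proceed here, in parallel with the forked task
  force f >>= print
```

With implicit futures the force operation would instead be inserted automatically wherever a strict primitive might encounter a future, which is exactly the overhead discussed next.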
In contrast, implicit futures require language support, since future objects should be indistinguishable from their final resolved value. Implicit futures do introduce additional overheads, since every strict primitive operation must now check whether its operands are futures.
As an optimization, a static analysis can be used to translate from a language with implicit futures into one with explicit futures. Essentially, the translation introduces explicit touch operations on those primitive operations that, according to a static analysis, may be applied to futures. In the context of the Gambit compiler, this optimization reduces the overhead of implicit futures from roughly % to less than % []. As an orthogonal optimization, a copying garbage collector could replace a determined future object by its underlying value, saving space and reducing the cost of subsequent touch operations.

Task Scheduling
Future annotations are often used to introduce parallel evaluation into recursive, divide-and-conquer style computations such as quicksort. In this situation, there may be exponentially more tasks than processors, and consequently task scheduling has a significant impact on overall performance. A naïve choice of round-robin scheduling would interleave the executions of all available tasks and result in an essentially breadth-first exploration of the computation tree, which may require exponentially more memory than an equivalent sequential execution.
Unfair scheduling policies significantly reduce this memory overhead, by better matching the number of concurrently active tasks to the number of processors. In the Multilisp implementation of this technique [], each task has two possible states: active and pending, and each processor maintains a LIFO queue of pending tasks. When evaluating "future E," the newly forked task evaluating "E" is marked active, while the parent task is moved to the pending queue. In the absence of idle processors, the parent task remains in the pending queue until the task evaluating "E" terminates, resulting in the same evaluation order (and memory footprint) as sequential execution, and the pending task queue functions much like the control stack of a sequential execution.
The benefit of the pending task queue is that if another processor becomes idle, it can steal and start evaluating one of these pending tasks. In this manner, the number of active tasks in the system is roughly
dynamically balanced with the number of available processors. Each processor typically has just one active task. Once that task terminates, it removes and makes active the next task in its pending queue. If the processor's pending task queue becomes empty, it tries to steal and activate a pending task from one of the other processors in the system.
An active task may block by trying to force an undetermined future object. In this situation, the task is typically put on a waiting list associated with that future. Once the future is later determined, all tasks on the waiting list are marked as pending, and will become active once an idle processor is available.
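To make the divide-and-conquer setting concrete, the sketch below (again only an illustration, written against the widely used async package for Haskell rather than Multilisp) spawns a future for one recursive call at each level of a tree-shaped computation; with a fine-grained base case this creates far more tasks than processors, which is exactly the situation the scheduling policies described above are designed to handle.

```haskell
import Control.Concurrent.Async (async, wait)

-- A divide-and-conquer computation in which one recursive call is
-- wrapped in a future (an Async), while the parent continues with
-- the other call, mirroring the "future E" annotation style.
pfib :: Int -> IO Integer
pfib n
  | n < 20    = return $! fib n        -- coarse sequential base case
  | otherwise = do
      f <- async (pfib (n - 1))        -- forked task ("future E")
      y <- pfib (n - 2)                -- parent continues in parallel
      x <- wait f                      -- force/touch the future
      return $! x + y

fib :: Int -> Integer
fib n | n < 2     = fromIntegral n
      | otherwise = fib (n - 1) + fib (n - 2)

main :: IO ()
main = pfib 32 >>= print
-- Compile with -threaded and run with +RTS -N to use multiple cores.
```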

Lazy Task Creation
Lazy task creation [, ] provides a mechanism for reducing task creation overhead. Under this mechanism, the evaluation of "future E" does not immediately create a new task. Instead, the expression "E" is evaluated as normal, except that the control stack is marked to identify that the continuation of this future expression is implicitly a pending task.
Later, an idle processor can inspect this control stack to find this implicit pending task, create the appropriate future object, and start executing this pending task. The control stack is also modified so that after the evaluation of "E" terminates, its value is used to resolve that future object.
In the common case, however, the evaluation of "E" will terminate without the implicit pending task having been stolen by a different processor. In this situation, the result of "E" will be returned directly to the context of "future E," which still resides on the control stack, with little additional overhead and without any need to create a new task or future object.

Futures in Purely Functional Languages
Programs in purely functional languages offer numerous opportunities for executing program components in parallel. For example, the evaluation of a function application could spawn a parallel task for the evaluation of each argument expression. Applying such a strategy indiscriminately, however, leads to many parallel tasks whose overhead outweighs any benefits of parallel execution.
Futures provide a simple method for taming the implicit parallelism of purely functional programs. A programmer who believes that the parallel evaluation of some expression outweighs the overhead of creating a separate task may annotate the expression with the keyword "future." In a purely functional language, these future annotations are transparent in that they have no effect on the final values computed by the program, and only influence the amount of parallelism exposed during the program's computation. Consequently, purely functional programs with future are deterministic, just like their sequential counterparts.

Futures in Mostly Functional Languages
Futures have been used effectively in "mostly functional" languages such as Multilisp [] and Scheme [], where most computation is performed in a functional style and assignment statements are limited to situations where they provide increased expressiveness. In this setting, futures are not transparent, as they introduce parallelism that may change the order in which assignments are performed. To preserve determinism, the programmer needs to ensure there are no race conditions between parallel computations, possibly by introducing additional forcing operations that block until a particular task has terminated and all of its side effects have been performed.
Mostly functional languages include a side effect–free subset with substantial expressive power, and so the future construct is still effective for exposing parallelism within functional sub-computations written in these languages.

Deterministic Futures in Imperative Languages
Futures have been combined with transactional memory techniques to guarantee determinism even in the presence of side effects. Essentially, each task is considered to be a transaction, and these transactions must commit in a fixed, as-if-sequential order. Each transaction tracks its read and write sets. If one transaction accesses data that another transaction is mutating, then the later transaction is rolled back and restarted [].

Futures in Distributed Systems
Future objects have been used in distributed systems to hide the latency associated with a remote procedure call (RPC). Instead of waiting for a message send and a matching response from a remote machine, an RPC can immediately return a future object as a proxy for the eventual result of that call. Note that this future could be returned even before the first RPC message is sent over the network. Consequently, if one machine performs multiple RPCs to the same remote machine, then these calls can all be batched into a single message; this technique is referred to as promise pipelining.

Thread-Specific Futures, Resolvable Futures, and I-vars
The programming language construct "future E" described above does not expose the ability to resolve the future to an arbitrary value; it can only be resolved to the result of evaluating the expression "E."
An I-var is analogous to a future object in that it functions as a proxy for a value that may not yet be determined. The key difference is that an I-var does not have an explicit associated thread that is computing its value (as is the case for future objects). Instead, any thread can update or resolve the I-var. Thus, an I-var is essentially a single-entry, one-shot queue. Attempts to access an I-var before it is determined block, as with future objects.
Somewhat confusingly, the terms "future" and "promise" are sometimes used to mean an I-var object that also exposes this resolve operation. The terms "thread-specific future" and "resolvable future" help disambiguate these distinct meanings.
ambiguate these distinct meanings.
compiler [], Mul-T [], Butterfly Lisp, Portable Stan-
dard Lisp, Act , Alice ML, Habanero Java, and many
Futures and Lazy Evaluation others.
Futures are closely related to lazy evaluation, in that Libraries supporting futures have been imple-
both immediately return a placeholder for the result of a mented for several languages, including (to name just
pending computation. A major difference is that future a few of many examples) Java, C++x, OCaml, python,
expressions typically must be evaluated before program Perl, and Ruby.
termination, but lazy expressions need not be evaluated The single-assignment I-var structure originated in
before program termination. This semantic difference Id [] and was included in subsequent languages such
means that preemptive evaluation of lazy expressions as parallel Haskell and Concurrent ML.
is considered speculative, whereas evaluation of future
expressions is considered mandatory. Bibliography
Sometimes, as in the language Alice ML, the term . Halstead R (Oct ) Multilisp: A language for concurrent sym-
“lazy future” is used to denote a future whose expression bolic computation. ACM Trans Program Lang Syst ():–

. Feeley M () An efficient and general implementation of . Welc A, Jagannathan S, Hosking A () Safe futures for Java.
futures on large scale shared-memory multiprocessors. PhD In: Proceedings of the th annual ACM SIGPLAN conference
thesis, Department of Computer Science, Brandeis University on object-oriented programming, systems, languages, and appli-
. Flanagan C, Felleisen M () The semantics of future and cations (OOPSLA ), ACM, pp –
its use in program optimization. In: Proceedings of the ACM . Baker H, Hewitt C (Aug ) The incremental garbage col-
SIGPLAN-SIGACT symposium on the principles of program- lection of processes. In: Proceedings of symposium on AI
ming languages. POPL’, pp – and programming languages, ACM SIGPLAN Notices ():
. Moreau L () Sound evaluation of parallel functional pro- –
grams with first-class continuations. PhD thesis, Universite . Moreau L () The semantics of scheme with future. In: Pro-
de Liege ceedings of the first ACM SIGPLAN international conference on
. Kranz D, Halstead R, Mohr E () Mul-T: A high performance functional programming, pp –
parallel lisp. In: Proceedings of the ACM SIGPLAN  Con- . Mohr E, Kranz D, Halstead R () Lazy task creation: a tech-
ference on Programming language design and implementation, nique for increasing the granularity of parallel programs. In:
F
Portland, pp – Proceedings of the  ACM conference on LISP and functional
. Knueven P, Hibbard P, Leverett B (June ) A language sys- programming, pp –
tem for a multiprocessor environment. In: Proceedings of the . Arvind, Nikhil RS, Pingali KK () I-Structures: Data struc-
th international conference on design and implementation of tures for parallel computing. ACM Trans Program Lang Syst
algorithmic languages, Courant Institute of mathematical Studies, ():–
New York, pp –
. Friedman DP, Wise DS () The impact of applicative pro-
gramming on multiprocessing.  International conference on
parallel processing, pp –
G

GA
Global Arrays Parallel Programming Toolkit

Gather
Collective Communication

Gather-to-All
Allgather

Gaussian Elimination
Dense Linear System Solvers
Sparse Direct Methods

GCD Test
Banerjee's Dependence Test
Dependences

Gene Networks Reconstruction
Systems Biology, Network Inference in

Gene Networks Reverse-Engineering
Systems Biology, Network Inference in

Generalized Meshes and Tori
Hypercubes and Meshes

Genome Assembly
Ananth Kalyanaraman
Washington State University, Pullman, WA, USA

Synonyms
Genome sequencing

Definition
Genome assembly is the computational process of deciphering the sequence composition of the genetic material (DNA) within the cell of an organism, using numerous short sequences called reads derived from different portions of the target DNA as input. The term genome is a collective reference to all the DNA molecules in the cell of an organism. Sequencing generally refers to the experimental (wetlab) process of determining the sequence composition of biomolecules such as DNA, RNA, and protein. In the context of genome assembly, however, the term is more commonly used to refer to the experimental (wetlab) process of generating reads from the set of chromosomes that constitutes the genome of an organism. Genome assembly is the computational step that follows sequencing with the objective of reconstructing the genome from its reads.

Discussion
Introduction
Deoxyribonucleic acid (or DNA) is a double-stranded molecule which forms the genetic basis in most known organisms. The DNA along with other molecules, such as the ribonucleic acid (or RNA) and proteins, collectively constitute the subject of study in various
branches of biology such as molecular biology, genetics, genomics, proteomics, and systems biology. A DNA molecule is made up of two equal length strands with opposite directionality, with the ends labeled from ′ to ′ on one and ′ to ′ on the other. Each strand is a sequence of smaller molecules called nucleotides, and each nucleotide contains one of the four possible nitrogenous bases – adenine (a), cytosine (c), guanine (g), and thymine (t). Therefore, for computational purposes, each strand can be represented in the form of a string over the alphabet {a, c, g, t}, expressed always in the direction from ′ to ′. Furthermore, the base at a given position in one strand is related to the base at the corresponding position in the other strand by the following base-pairing rule (referred to as "base complementarity"): a ↔ t, c ↔ g. Therefore, the sequence of one strand can be directly deduced from the sequence of the other. For example, if one strand is ′agaccagttac′, then the other is ′gtaactggtct′.
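This deduction is the reverse complement of a strand, and it is a one-line computation; the Haskell sketch below (illustrative only) reproduces the example just given.

```haskell
-- Reverse complement of a DNA strand given in the 5'-to-3' direction:
-- complement each base and reverse the result, so that the returned
-- string also reads 5' to 3' on the opposite strand.
revComp :: String -> String
revComp = reverse . map comp
  where
    comp 'a' = 't'
    comp 't' = 'a'
    comp 'c' = 'g'
    comp 'g' = 'c'
    comp b   = b   -- leave ambiguity codes such as 'n' unchanged

main :: IO ()
main = putStrLn (revComp "agaccagttac")  -- prints "gtaactggtct"
```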
The length of a genome is measured in the num- could take advantage of this information to resolve
ber of its base pairs (“bp”). Genomes range in length repeats that are smaller than the specified range.
from being just a few million base pairs in microbes to Genome assembly is a classical computational prob-
several billions of base pairs in many eukaryotic organ- lem in the field of bioinformatics and computational
isms. However, all sequencing technologies available till biology. More than two decades of research has yielded a
date, since the invention of Sanger sequencing in late number of approaches and algorithms and yet the prob-
s, have been limited to accurately sequencing DNA lem continues to be actively pursued. This is because of
molecules no longer than ∼ Kbp. Consequently, scien- several reasons. While there are different ways of formu-
tists have had to deploy alternative sequencing strate- lating the genome assembly problem, all known formu-
gies for extending the reach of technology to genome lations are NP-Hard. Therefore, researchers continue to
scale. The most popular strategy is the whole genome work on developing improved, efficient approximation
shotgun (or WGS) strategy, where multiple copies of and heuristic algorithms. However, even the best serial
a single long target genome are shredded randomly algorithms take tens of thousands of CPU hours for
into numerous fragments of sequenceable length and eukaryotic genomes owing to their: large genomic sizes,
the corresponding reads sequenced individually using which increase the number of reads to be analyzed; and
any standard technology. Another popular albeit more high genomic repeat complexity, which adds substan-
expensive strategy is the hierarchical approach, where tial processing overheads. For instance, in the Celera’s
a library of smaller molecules called Bacterial Artifi- WGS version of the human genome assembly published
cial Chromosomes (or BACs) is constructed. Each BAC in , over  million reads representing .x cover-
is ∼ Kbp in length and they collectively provide age over the genome were assembled in about , 
a minimum tiling path over the entire length of the CPU h.
genome. Subsequently, the BACs are sequenced individ- A paradigm shift in sequencing technologies has
ually using the shotgun approach. further aggravated the data-intensiveness of the problem
In both approaches, the information of the relative and thereby the need for continued algorithmic
ordering among the sequenced reads is lost during development. Until the mid-s, the only available
sequencing either completely or almost completely. technology for use in genome assembly projects was
Therefore, the primary information that genome assem- Sanger sequencing, which produces reads of approx-
blers should rely upon is the end-to-end sequence imate length  Kbp each. Since then, a slew of
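As a concrete illustration of the base-complementarity rule stated above (this sketch is ours, not part of the original entry), the following Python fragment derives the 5'-to-3' sequence of one strand from the other; the function name and the example strand are illustrative only.

# Derive the opposite strand under a<->t, c<->g complementarity. Both strands
# are conventionally written 5' to 3', so the complement is read in reverse.
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def reverse_complement(strand: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(strand))

assert reverse_complement("agaccagttac") == "gtaactggtct"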
high-throughput sequencing technologies, collectively a survey of tools for genome assembly. Rather, it will
referred to as the next-generation sequencing tech- focus on the parallelism in the problem and related
nologies, have emerged, significantly revitalizing the efforts in parallel algorithm development. In order to
sequencing community. Examples of next-generation set the stage, the key ideas from the corresponding serial
technologies include Roche  Life Sciences sys- approach will also be presented as necessary. The bulk of
tem’s pyrosequencing (read length ∼ bp), Illumina the discussion will be on de novo assembly, which is typi-
Genome Analyzer and HiSeq (read length ∼– bp); cally the harder task of reconstructing a genome from its
Life Technologies SOLiD (read length ∼ bp) and reads assuming no prior knowledge about the sequence
Helicos HeliScope (read length ∼ bp). While these of the genome except for its reads and as available, their
instruments generate much shorter reads than Sanger, pair-end sequencing information.
they do so at a much faster rate effectively producing
several hundred millions of reads in a single experi- Algorithmic Formulations of Genome
ment, and at significantly reduced costs (about – Assembly
times cheaper). These attractive features are essentially The genome assembly problem: Let R = {r , r . . . rm }
democratizing the sequencing process and broaden- denote a set of m reads sequenced from an unknown
ing community contribution to sequenced data. From target genome G of length g. The problem of genome
a genome assembly perspective, a shorter read length assembly is to reconstruct the sequence of genome G
could easily deteriorate the assembly quality because the from R.
reads are more likely to exhibit false or insufficient over- As a caveat, for nearly all input scenarios, the out-
laps. To offset this shortcoming, sequencing is required come expected is not a single assembled sequence but
at a much higher coverage (x–x) than for Sanger a set of assembled sequences for a couple of reasons.
sequencing. The higher coverage has another desirable Typically, the genome of a species comprises of multiple
effect in that, because of its built-in redundancy, it could chromosomes (e.g.,  pairs in the human genome) and
aid in a more reliable identification of real genomic therefore each chromosome can be treated as an indi-
variations and their differentiation from experimental vidual sequence target. Note that, however, the infor-
artifacts. Detecting genomic variations such as single mation about the source chromosome for a read is lost
nucleotide polymorphisms (SNPs) is of prime impor- during a sequencing process such as WGS and it is left
tance in comparative and population genomics. for the assembler to detect and sort these reads by chro-
This combination of high coverage and short read mosomes. Furthermore, any shotgun sequencing pro-
lengths makes the short read genome assembly prob- cedure tends to leave out “gaps” along the chromosomal
lem significantly more data-intensive than for Sanger DNA during sampling, and therefore it is possible to
reads. For instance, any project aiming to reconstruct reconstruct the genome only for those sampled sections.
a mammalian genome (e.g., human genome) de novo It is expected that, through incorporation of pair-end
using any of the current next-generation technologies information, at least a subset of these assembled prod-
would have to contend with finding a way to assemble ucts (called “contigs” in genome assembly parlance) can
several billion short reads. This changing landscape in be partially ordered and oriented.
technology and application has led to the development Notation and terminology: Let s denote a sequence over
of a new generation of short read assemblers. Although a fixed alphabet Σ. Unless otherwise specified, a DNA
traditional developmental efforts have been targeting alphabet is assumed – i.e., Σ = {a, c, g, t}. Let ∣s∣ denote
serial computers, the increasing data-intensiveness and the length of s; and s[i . . . j] denote the substring of s
the increasing complexity of genomes being sequenced starting and ending at indices i and j respectively, with
have gradually pushed the community toward parallel the convention that string indexing starts at . A pre-
processing, for what has now become an active branch fix i (alternatively, suffix i) of s is s[ . . . i] (alternatively,
within the area. s[i . . . ∣s∣]). Let n = Σi=1..m ∣ri∣ and ℓ = n/m. The sequencing
In what follows, an overview of the assembly prob- coverage c is given by n/g. The terms string and sequence
lem, with its different formulations and algorithmic are used interchangeably. Let p denote the number of
solutions is presented. This entry is not intended to be processors in a parallel computer.
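To make the notation concrete, the following small Python sketch (ours, with toy values only) computes the quantities defined above: the total input size n, the mean read length ℓ = n/m, and the coverage c = n/g for a genome of length g.

def sequencing_stats(reads, genome_length):
    # n = sum of read lengths, l = n / m, c = n / g
    n = sum(len(r) for r in reads)
    mean_read_length = n / len(reads)
    coverage = n / genome_length
    return n, mean_read_length, coverage

n, l, c = sequencing_stats(["agaccagttac", "gtcagtcaacgg", "cagttcggtca"], genome_length=40)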
The most simplistic formulation of genome assem- were gaps during sequencing, then one Hamilto-
bly is that of the Shortest Superstring Problem (SSP). nian Path is sought for every connected component
A superstring is a string that contains each input read in G (if it exists) and the genome can be recov-
as a substring. The SSP formulation is NP-complete. ered as an unordered set of pieces corresponding
Furthermore, its assumptions are not realistic in prac- to the sequenced sections. This formulation using
tice. Reads can contain sequencing errors and hence the overlap graph model has been shown to be
they need not always appear preserved as substrings NP-Hard.
in the genome; and genomes typically are longer than () De Bruijn graph: Let a k-mer denote a string of
the shortest superstring due to presence of repeats and length k. Construct a De Bruijn graph, where the
nearly identical genic regions (called paralogs). vertex set is the set of all k-mers contained inside
Among the more realistic models for genome the reads of R; and a directed edge is drawn from
assembly, graph theoretic formulations have been in the vertices i to j, iff the (k − 1)-length suffix of the k-mer
forefront. In broad terms, there are three distinct ways i is identical to the (k − 1)-length prefix of the k-mer j.
to model the problem (refer to Fig. ): Put another way, each edge in E represents a unique
(k + 1)-mer that is found in at least one of the reads
() Overlap graph: Construct graph G(V, E) from R,
in R. Given a De Bruijn graph G, genome assem-
where each read is represented by a unique ver-
bly is the problem of finding a shortest Euler tour
tex in V and an edge is drawn between (ri , rj ) iff
in which each read is represented by a sub-path in
there is a significant suffix–prefix overlap between
the tour. While finding an Euler tour is polynomially
ri and rj . Overlap is typically defined using a pair-
solvable, the optimization problem is still NP-Hard
wise sequence alignment model (e.g., semi-global
by reduction from SSP.
alignment) and related metrics to assess the qual-
() String graph: This model is a variant of the overlap
ity of an alignment. Given G, the genome assem-
graph in which the edges represent reads, and ver-
bly problem can be reduced to the problem of
tices represent branching of end-to-end overlaps of
finding a Hamiltonian Path, if one exists, pro-
the adjoining reads. For example, if a read ri overlaps
vided that there were no breaks (called sequencing
with both rj and rk then it is represented by a vertex
gaps) along the genome during sequencing. If there

[Figure: a hypothetical unknown genome containing a repeat region, the reads sampled from it, and the corresponding overlap graph, De Bruijn graph, and string graph (reads r1–r4) representations.]
Genome Assembly. Fig.  Illustration of the three different models for genome assembly. The top section of the figure shows a hypothetical unknown genome and the reads sequenced from it. The dotted box shows a repetitive region within the genome. In the overlap graph, solid lines indicate true overlaps and dotted lines show overlaps induced by the presence of the repeat. The De Bruijn graph shows the graph for an arbitrary section involving three reads. The string graph shows an example of a branching vertex with multiple possible entry and exit paths
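The De Bruijn formulation sketched in the figure can be stated very compactly in code. The following serial Python fragment is our illustration (the toy reads and the choice of k are assumptions): vertices are the k-mers occurring in the reads, and every (k + 1)-mer observed in a read adds a directed edge from its length-k prefix to its length-k suffix.

from collections import defaultdict

def de_bruijn_graph(reads, k):
    edge_multiplicity = defaultdict(int)        # (k+1)-mer -> number of occurrences
    for read in reads:
        for i in range(len(read) - k):
            edge_multiplicity[read[i:i + k + 1]] += 1
    adjacency = defaultdict(list)               # k-mer vertex -> successor k-mers
    for kmer1 in edge_multiplicity:
        adjacency[kmer1[:-1]].append(kmer1[1:])
    return adjacency, edge_multiplicity

adjacency, edges = de_bruijn_graph(["agtcagt", "cagttcgg"], k=4)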
which has an inbound edge corresponding to ri and which is expected to sample roughly equal number of
two outbound edges representing rj and rk . Further- reads covering each genomic base and it is typically only
more, the graph is pruned off transitively inferable reads that originate from the same locus that tend to
edges – e.g., if both ri and rj overlaps with rk , whereas overlap with one another, with the exception of reads
ri also overlaps with rj , then the overlap from rj to rk from repetitive regions of the genome.
is inferable and is therefore removed. Given a string A more scalable approach to detect pairwise over-
graph G, the assembly problem becomes one that is laps is to first shortlist pairs of sequences based on the
equivalent of finding a constrained least cost Chi- presence of sufficiently long exact matches (i.e., a fil-
nese Postman tour of G. This formulation has also ter built on a necessary-but-not-a-sufficient condition)
been shown to be NP-Hard. and then compute pairwise comparisons only on them.
Methods vary in the manner in which these “promising
All the three formulations incorporate features to
pairs” are shortlisted. One way of generating promising
handle sequencing errors. In the overlap graph model,
pairs is to use a string lookup table data structure that
the pairwise alignment formulation used to define edges
is built to record all k-mer occurrences within the reads
automatically takes into account character substitu-
in linear time. Sequence pairs that share one or more
tions, insertions, and deletions. In the other two models,
k-mers can be subsequently considered for alignment-
an “error correction” procedure is run as a preprocess-
based evaluation. While this approach supports a simple
ing step to mask off anomalous portions of the graph
implementation, it poses a few notable limitations. The
prior to performing the assembly tours.
size complexity of the lookup table data structure con-
tains an exponential term O(∣Σ∣k ). This restricts the
Parallelization for the Overlap Graph value of k in practice; for DNA alphabet, physical mem-
Model ory constraints demand that k ≤ . Whereas, reads
Algorithms that use the overlap graph model deploy could share arbitrarily long matches, especially if the
a three-stage assembly approach of overlap–layout– sequencing error rates are low. Another disadvantage of
consensus. In the first stage, the overlap graph is the lookup table is that it is only suited to detect pairs
constructed by performing all-against-all pairwise with fixed length matches. An exact match of an arbi-
comparison of the reads in R. As a result, a layout for the trary length q will reveal itself as q − k +  smaller k-mer
assembly is prepared using the pairwise overlaps and in matches, thus increasing computation cost. Further-
the third stage, a multiple sequence alignment (MSA) more, the distribution of entries within the lookup table
of the reads is performed as dictated by the layout. is highly data dependent and could scatter the same
Given the large number of reads generated in a typi- sequence pair in different locations of the lookup table,
cal sequencing project, the overlap computation phase making it difficult for parallelization on distributed
dominates the run-time and hence is the primary target memory machines. Methods that use lookup tables sim-
for parallelization and performance optimization. ply use a high end shared memory machine for serial
A brute-force way of implementing this step will be generation and storage, and focus on parallelizing only
to compare all (m choose 2) possible pairs. This approach can the pairwise sequence comparison step.
also be parallelized easily, as individual alignment tasks An alternative albeit more efficient way of enumer-
can be distributed in a load-balanced fashion given ating promising pairs based on exact matches is to use
the uniformity expected in the read lengths. However, a string indexing data structure such as suffix tree or
an implementation may not be practically feasible for suffix array that will allow detection of variable length
larger number of reads due to the quadratic increase matches. A match α between two reads ri and rj is
in alignment workload. Each alignment task executed called left-maximal (alternatively, right-maximal) if the
serially could take milliseconds of run-time. In fact, for characters preceding (alternatively, following) it in both
the assembly problem, the nature of sampling done dur- strings, if any, are different. A match α between reads
ing sequencing guarantees that performing all-against- ri and rj is said to be a maximal match between the two
all comparisons is destined to be highly wasteful. This is reads if it is both left- and right-maximal. Pairs contain-
because of the random shotgun approach to sequencing ing maximal matches of a minimum length cutoff, say
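One common way to implement the exact-match filter discussed above is a k-mer lookup table. The following serial Python sketch is ours (the parameter values and read strings are illustrative, and strand handling is omitted): it records all k-mer occurrences and reports read pairs sharing at least one k-mer as promising pairs for subsequent alignment.

from collections import defaultdict
from itertools import combinations

def promising_pairs(reads, k):
    table = defaultdict(set)                    # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    pairs = set()
    for indices in table.values():
        pairs.update(combinations(sorted(indices), 2))
    return pairs                                # candidate (i, j) pairs with i < j

candidates = promising_pairs(["agtcagtcaac", "cagtcaacggt", "ttttttttt"], k=5)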
ψ, can be detected in optimal run-time and linear space whereby the prefix length for bucketing is iteratively
using a suffix tree data structure. A generalized suffix extended until the size of the bucket fits in the local
tree of a set of strings is the compacted trie of all suf- memory.
fixes in the strings. The pair enumeration algorithm first S. (comm, comp) Each bucket in the output corre-
builds a generalized suffix tree over all reads in R and sponds to a unique subtree in the generalized suffix
then navigates the tree in a bottom-up order. tree rooted exactly k characters below the root. The
subtree corresponding to each bucket is then locally
Parallel Generalized Suffix Tree constructed using a recursive bucket sort–based
Construction method that compares characters between suffixes.
For constructing a suffix tree, there are several linear However, this step cannot be done strictly locally
time construction serial algorithms and optimal algo- because it needs the sequences of reads whose suf-
rithms for the CRCW-PRAM model. However, optimal fixes are being compared. This can be achieved by
construction under the distributed memory machine building the local set of subtrees in small-sized
model, where memory per processor is assumed to be batches and performing one round of communica-
too small to store the input sequences in R, is an open tion using Alltoallv() to load the reads required to
problem. Of the several approaches that have been stud- build each batch from other processors. In order to
ied, the following approach is notable as it has demon- reduce the overhead due to read fetches, observe that
strated linear scaling to thousands of processors on the aggregate memory on a relatively small group
distributed memory supercomputers. of processors is typically sufficient to accommodate
The major steps of the underlying algorithm are as the read set for most inputs in practice. For example,
follows. Each step is a parallel step. The steps marked as even a set of  million reads each of length Kbp
comm are communication-bound and the steps marked needs only of the order of  GB. Even assuming
comp are computation-bound.  GB per processor, this means the read set will fit
within  processors. Higher number of processors
S. (comp) Load R from I/O in a distributed fashion such is required only to speedup computation. One could
that each processor receives O( np ) characters and no take advantage of this observation by partitioning
read is split across processor boundaries. the processor space into subgroups of fixed, smaller
S. (comp) Slide a window of length k ≤ ψ over the size, determined by the input size, and then have
set of local reads and bucket sort locally all suffixes the on-the-fly read fetches to seek data from pro-
of reads based on their k length prefix. Note that cessors from within each subgroup. This improved
for DNA alphabet, even a value as small as  for scheme could distribute the load of any potential hot
k is expected to create over a million buckets, suffi- spots and also the overall communication could be
cient to support parallel distribution. An alternative faster as the collective transportation primitive now
to local generation of buckets is to have a master– operates on a smaller number of processors.
worker paradigm where the reads are scanned in
small-sized batches and have a dedicated master The above approach can be implemented to run in
distribute batches to workers in a load-balanced O( nℓ
p
) time and O( np ) space, where ℓ is the mean read
fashion. length. The output of the algorithm is a distributed rep-
S. (comm) Parallel sort the buckets such that each resentation of the generalized suffix tree, with each pro-
processor receives approximately ∼ np suffixes and cessor storing roughly np suffixes (or leaves in the tree).
no bucket is split across processor boundaries.
While the latter cannot be theoretically guaranteed, The cluster-then-assemble Approach
real-world genomic data sets typically distribute the A potential stumbling block in the overlap–layout–
suffixes uniformly across buckets thereby virtually consensus approach is the large amount of memory
ensuring that no bucket exceeds the O( np ) bound. If required to store and retrieve the pairwise overlaps so
a bucket size becomes too big to fit in the local mem- that they can be used for generating an assembly layout.
ory, then an alternative strategy can be deployed For genomes which have been sequenced at a uniform
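A concrete serial view of the bucketing step used in the parallel generalized suffix tree construction is sketched below; this is our illustration rather than the original implementation, and the later bucket-redistribution and subtree-construction steps are elided. Every suffix of every read is placed in a bucket keyed by its first k characters; suffixes shorter than k are ignored in this sketch.

from collections import defaultdict

def bucket_suffixes(reads, k):
    buckets = defaultdict(list)                 # k-length prefix -> [(read id, suffix offset)]
    for rid, read in enumerate(reads):
        for offset in range(len(read) - k + 1):
            buckets[read[offset:offset + k]].append((rid, offset))
    return buckets

buckets = bucket_suffixes(["agtcagt", "cagttcgg"], k=3)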
coverage c, the expectation is that each base along the should correspond to one genomic island, although
genome is covered by c reads on an average, thereby the presence of genomic repeats could collapse reads
implying (c choose 2) overlapping read pairs for every genomic belonging to different islands together. Once clustered,
base. However, this theoretical linear expectation holds each cluster represents an independent subproblem for
for only a fraction of the target genome, whereas factors assembly, thereby breaking a large problem with m
such as genomic repeats and oversampled genic regions reads into numerous, disjoint subproblems of signif-
could increase the number of overlapping pairs arbitrar- icantly reduced size small enough to fit in a serial
ily beyond the expected level. An example case in point computer. In practice, tens of thousands of clusters are
is the gene-enriched sequencing of the highly repetitive generated making the approach highly suited for trivial
∼. billion bp maize genome. parallelization after clustering.
One way to overcome this scalability bottleneck The primary challenge is to implement the clus-
is to use clustering as a preprocessing step to assem- tering step in parallel, in a time- and space-efficient
bly. This approach, referred to as cluster-then-assemble, manner. In graph-theoretic terms, the output produced
builds upon the assumption that any genome-scale by a transitive closure clustering is representative of
sequencing effort is likely to undersample the genome, connected components in the overlap graph, thereby
leaving out sequencing gaps along the genome length allowing the problem to be reduced to one of connected
and thereby allowing the assemblers to reconstruct the component detection. However, generating the entire
genome in pieces – one piece for every contiguous graph G prior to detection would contradict the purpose
stretch of the genome (aka “contig” or “genomic island”) of clustering from a space complexity standpoint.
that is fully sampled through sequencing. This assump- Figure  outlines an algorithm to perform sequence
tion is not unrealistic as tens of thousands of sequenc- clustering. The algorithm can be described as follows:
ing gaps have always resulted in almost all eukaryotic Initialize each read in a cluster of its own. In an iterative
genomes sequenced so far using the WGS approach. process, generate promising pairs based on maximal
The cluster-then-assemble takes advantage of this matches in a non-increasing order of their lengths using
assumption as follows: The set of m reads is first parti- suffix trees (as explained in the previous section). Before
tioned into groups using a single-linkage transitive clo- assigning a promising pair for further alignment pro-
sure clustering method, such that there exists a sequence cessing, a check is made to see whether the constituent
of overlapping pairs connecting every read to every reads are in the same cluster. If they are part of the same
other read in the same cluster. Ideally, each cluster cluster already, then the pair is discarded; otherwise it

Algorithm 1 Read Clustering

Input: Read set R = {r1, r2, . . ., rm}
Output: A partition C = {C1, C2, . . ., Ck} of R, 1 ≤ k ≤ m
1. Initialize clusters:                                          ⇒ (master)
       C ← { {ri} | 1 ≤ i ≤ m }
2. FOR each pair (ri, rj) with a maximal match of length ≥ ψ,
       generated in non-increasing order of maximal match length ⇒ (worker)
       Cp ← Find(ri)                                             ⇒ (master)
       Cq ← Find(rj)                                             ⇒ (master)
       IF Cp ≠ Cq THEN                                           ⇒ (master)
           overlap quality ← Align(ri, rj)                       ⇒ (worker)
           IF overlap quality is significant THEN                ⇒ (master)
               Union(Cp, Cq)                                     ⇒ (master)
3. Output C                                                      ⇒ (master)

Genome Assembly. Fig.  Pseudocode for a read clustering algorithm suited for parallelization based on the master–worker model. The site of computation is shown in brackets. Operations on the set of clusters are performed using the Union–Find data structure
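The clustering logic of the algorithm in the figure can be written compactly around a Union–Find structure. The following serial Python sketch is ours: the generation of promising pairs and the alignment test are passed in as placeholder callables, since those components are described elsewhere in this entry.

def cluster_reads(num_reads, promising_pairs, align_is_significant):
    parent = list(range(num_reads))             # Union-Find forest over read ids

    def find(x):                                # Find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Pairs are assumed to arrive in non-increasing order of maximal match length.
    for i, j in promising_pairs:
        if find(i) != find(j) and align_is_significant(i, j):
            union(i, j)

    clusters = {}
    for r in range(num_reads):
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())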
is aligned. If the alignment results show a satisfactory Short Read Assembly


overlap (based on user-defined alignment parameters), The landscape of genome assembly tools has signifi-
then merge the clusters containing both reads into one cantly transformed itself over the last few years with
larger cluster. The process is repeated until all promising the advent of next-generation sequencing technolo-
pairs are exhausted or until no more merges are possi- gies. A plethora of new-generation assemblers that
ble. Using the union–find data structure would ensure collectively operate under the banner of “short read
the Find and Union calls to run in amortized time pro- assemblers” is continuing to emerge. Broadly speak-
portional to the Inverse Ackerman function – a small ing, these tools can be grouped into two categories:
constant for all practical inputs. () those that follow or extend from the classical over-
There are several advantages to this clustering strat- lap graph model; and () those that deploy either
egy. Clustering can be achieved in at most m −  merg- the De Bruijn graph or the string graph formulation.
ing steps. Checking if a read pair is already clustered The former category includes programs such as Edena,
before alignment is aimed at reducing the number of Newbler, PE-Assembler, QSRA, SHARCGS, SSAKE,
pairs aligned. Generating promising pairs in a non- and VCAKE. The programs that fall under the lat-
increasing order of their maximal match lengths is a ter category are largely inspired by two approaches –
heuristic that could help identify pairs that are more the De Bruijn graph formulation used in the EULER
likely to succeed the alignment test sooner during exe- assembler; and the string graph formulation proposed
cution. Furthermore, promising pairs are processed as by Myers. As of this writing, these programs include
they are generated, obviating the need to store them. EULER-SR, ALLPATHS, Velvet, SOAPdenovo, ABySS,
This coupled with the use of the suffix tree data struc- and YAGA. Either of these lists is likely to expand
ture implies an O(n) serial space complexity for the in the coming years, as new implementations of short
clustering algorithm. read assemblers are continuing to emerge as are new
The parallel algorithm can be implemented using a technologies.
master–worker paradigm. A dedicated master proces- In principle, the techniques developed for assem-
sor can be responsible for initializing and maintaining bling Sanger reads do not suit direct application for
the clusters and also for distributing alignment work- short read assembly due to a combination of factors
load to the workers in a load-balanced fashion. The that include shorter read length, higher sequencing cov-
workers at first can generate a distributed representation erage, an increased reliance on pair-end libraries, and
of the generalized suffix tree in parallel (as explained idiosyncrasies in sequencing errors.
in the previous section). Subsequently, they can gener- A shorter read length implies that the method has
ate promising pairs from their local portion of the tree to be more sensitive to differences to avoid potential
and send them to the master. To reduce communication mis-assemblies. This also means providing a robust sup-
overheads, pairs can be sent in arbitrary-sized batches port for incorporating the knowledge available from
as demanded by the situation of the work queue buffer pair-end libraries. In case of next-generation sequenc-
at the master. The master processor can check the pairs ing, pair-end libraries are typically available for different
against the current set of clusters, filter out pairs that do clone insert sizes, which translates to a need to imple-
not need alignment, and add only those pairs that need ment multiple sets of distance constraints.
alignment to its work queue buffer. The pairs in the work A high sequencing coverage introduces complexity
queue buffer can be redistributed to workers in fixed at two levels. The number of short reads to assem-
size batches, to have the workers compute alignments ble increases linearly with increased coverage, and this
and send back the results. Communication overheads number can easily reach hundreds of millions even
can be masked by overlapping alignment computation for modest-sized genomes because of shorter read
with communication waits using non-blocking calls. length. For instance, a mid-size genome such as that
The PaCE software suite implements the above paral- of Arabidopsis (∼ Mbp) or fruit fly (∼ Mbp)
lel algorithm, and it has demonstrated linear scaling sequenced at x coverage either using Illumina or
to thousands of processors on a distributed memory SOLiD would generate a couple of hundred million
supercomputer. reads. Secondly, the average number of overlapping read
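A shared-memory approximation of the master–worker scheme described above can be expressed with a process pool standing in for the distributed workers. The sketch below is ours; the scoring function is a trivial placeholder rather than a real alignment, and the pre-check that skips already-clustered pairs before dispatch is omitted for brevity.

from multiprocessing import Pool

def score_pair(task):
    i, j, ri, rj = task
    return i, j, sum(a == b for a, b in zip(ri, rj))   # placeholder, not a real aligner

def master_cluster(reads, candidate_pairs, min_score=5, workers=4):
    parent = list(range(len(reads)))                   # master-side Union-Find
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    tasks = [(i, j, reads[i], reads[j]) for i, j in candidate_pairs]
    with Pool(workers) as pool:                        # call under an `if __name__ == "__main__":` guard
        for i, j, score in pool.imap_unordered(score_pair, tasks, chunksize=64):
            if score >= min_score and find(i) != find(j):
                parent[find(i)] = find(j)              # only the master merges clusters
    return parent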
pairs at every given genomic location is expected to all steps in the method are parallel and some steps are
grow quadratically (∝ (c choose 2)) with the coverage depth (c). disk-bound. Experimental results show that the parallel
This particularly affects the time and memory scalabil- efficiency drops drastically beyond three threads.
ity of methods that use the overlap graph model not ABySS and YAGA are two parallel methods that
only because they operate at the level of pairwise read implement the De Bruijn graph model and work for
overlaps, but also because many tools assume that such distributed memory computers. The De Bruijn and
overlaps can be stored and retrieved from local mem- string graph formulations are conceptually better suited
ory. Consequently, such methods take between – h for short read assembly. These formulations allow the
and several gigabytes of memory even for assembling algorithm to operate at a read or read’s subunit (i.e.,
small bacterial genomes. From a parallelism perspec- k-mer) level rather than the pairwise overlap level.
tive, a high coverage in sampling could also poten- Repetitive regions manifest themselves in the form of
tially mean fewer sequencing gaps, as the probability special graph patterns, and error correction mecha-
of a genomic base being captured by a read improves nism could detect and reconcile anomalous graph sub-
with coverage by the Lander–Waterman model. There- structures prior to performing assembly tours, thereby
fore, divide-and-conquer–based techniques such as the reducing the chance of misassemblies. The assembly
cluster-then-assemble may not be as effective in break- itself manifests in the form of graph traversals. Con-
ing the initial problem size down. structing these graphs and traversing them to produce
Short read assemblers also have to deal with tech- assembly tours (although not guaranteed to be opti-
nology specific errors. The error rates associated with mal due to intractability of the problem) are problems
next-generation technologies are typically in the –% with efficient heuristic solutions on a serial computer.
range, and cannot be ignored as differentiating them The methods ABySS and YAGA provide two different
from real differences could be key to capturing natural approaches to implement the different stages in paral-
variations such as near identical paralogs. lel using De Bruijn graphs. While these methods dif-
In the current suite of tools specifically built for fer in their underlying algorithmic details and in the
short read assembly, only a handful of tools support degree offered for parallelism, the structural layout of
some degree of parallelism. These tools include PE- their algorithms is similar consisting of these four major
Assembler, ABySS, and YAGA. Others are serial tools steps: () parallel graph construction; () error correc-
that work on desktop computers and rely on high- tion; () incorporation of distance constraints due to
memory nodes for larger inputs. Even the tools that pair-end reads; and () assembly tour and output.
support parallelism do so to varying degrees. PE- In what follows, approaches to parallelize each of
Assembler, which implements a variant of the over- these major steps are outlined, with the bulk of exposi-
lap graph model, limits parallelism to node level and tion closely mirroring the algorithm in YAGA because
assumes that the number of parallel threads is small its algorithm more rigorously addresses parallelization
(< ). It deploys a “seed-extend” strategy in which a for all the steps, and takes advantage of techniques that
subset of “reliable” reads are selected as seeds and other are more standard and well understood among dis-
reads that overlap with each seed are incrementally tributed memory processing codes (e.g., sorting, list
added to extend and build into a consensus sequence ranking). By contrast, parallelization is described only
in the ′ direction. Parallelism is supported by select- for the initial phase of graph construction in ABySS.
ing multiple seeds and launching multiple threads that Therefore, where appropriate, variations in the ABySS
assume responsibility of extending these different reads algorithmic approach will be highlighted in the text.
in parallel. While the algorithm has the advantage of not Other than differences in their parallelization strate-
having to build and manage large graphs, it performs gies, the two methods also differ particularly in the type
a conservative extension primarily relying on pair-end of De Bruijn graph they construct (directed vs. bidi-
data to collapse reads into contigs. From a parallelism rected) and their ability to handle multiple read lengths.
perspective, the algorithm relies on shared memory Such details are omitted in an attempt to keep the text
access for managing the read set, which works well if focused on the parallel aspects of the problem. An inter-
the number of threads is very small. Furthermore, not ested reader can refer to the individual papers for details
pertaining to the nuances of the assembly procedure An alternative to this edge-centric representation
and the output quality of these assemblies. is a vertex-centric representation, which is followed in
ABySS and in an earlier version of YAGA. Here, the
k-mer set corresponding to each Ri is generated by
Parallel De Bruijn Graph Construction the corresponding pi . The vertices adjacent to a given
and Compaction vertex could be generated remotely by one or more
Given m reads in set R, the goal is to construct in processors. Therefore, this approach necessitates pro-
parallel a distributed representation of the correspond- cessors to communicate with one another in order to
ing De Bruijn graph built out of k-mers, for a user- check the validity of all edges that could be theoretically
specified value of k. Note that, once a De Bruijn graph drawn from its local set of vertices. While the number
representation is generated, it can be transformed into of such edge validation queries is bounded by  per ver-
a corresponding string graph representation by com- tex ({a, c, g, t} on either strand orientation), the method
pressing paths whose label when concatenated spells runs the risk of generating false edges, e.g., if a k − -mer,
out the characters in a read, and by accordingly intro- say α, occurs in exactly two places along the genome
ducing branch nodes that capture overlap continuation as aαc and gαt, then the vertex-centric approach will
between adjoining reads. This can be done in a suc- erroneously generate edges for aαt and gαc, which are
cessive stage of graph compaction, a minor variant of k + -mers nonexistent in the input. Besides this risk,
which is described in the later part of this section. The care must be taken to guarantee an even redistribution
computing model assumed is p processors (or equiva- of vertices among processors by the end of the construc-
lently, processes), each with access to a local RAM, and tion process. For instance, a static allocation scheme in
connected through a network interconnect and with which a hash function is used to map each k-mer to a
access to a shared file system where the input is made destination processor, as it is done in ABySS, runs the
available. risk of producing unbalanced distribution as the k-mer
To construct the De Bruijn graph, the reads in R are concentration within reads is input dependent.
initially partitioned and loaded in a distributed manner Graph compaction: Once a distributed edge-centric rep-
such that each processor receives O( np ) input charac- resentation of the De Bruijn graph is generated, the next
ters. Let Ri refer to the subset of reads in processor pi . step is to simplify it by identifying maximal segments of
Through a linear scan of the reads in Ri , each pi enumer- simple paths (or “chains”) and compact each of them
ates the set of k-mers present in Ri . However, instead into a single longer edge. Edges that belong to a given
of storing each such k-mer as a vertex of the De Bruijn chain could be distributed and therefore this problem
graph, the processor equivalently generates and stores of removing chains becomes a two-step process: First,
the corresponding edges connecting those vertices in to detect the edges (or equivalently, the vertices) which
the graph. In other words, there is a bijection between are part of chains and then perform compaction on the
the set of edges and the set of distinct k + -mers in Ri . individual chains. To mark the vertices that are part
In this edge-centric representation, the vertex informa- of some chain, assume without loss of generality that
tion connecting each edge is implicit and the count of each edge is stored twice as < u, v > and < v, u >. Sort-
reads containing a given k + -mer is stored internally ing the edges in parallel by their first vertex it would
at that edge. Note that after this generation process, the bring all edges incident on a vertex together on some
same edge could be potentially generated at multiple processor. Only those vertices that have a degree of two
processor locations. To detect and merge such dupli- can be part of chains and such vertices can be easily
cates, the algorithm simply performs a parallel sort of marked with a special label. In the next step, the prob-
the edges using the k + -mers as the key. Therefore, in lem of compacting the vertices along each chain can be
one sorting step, a distributed representation of the De treated as a variant of segmented parallel list ranking. A
Bruijn graph is constructed. Standard parallel sort rou- distributed version of the list ranking algorithm can be
tines such as sample sort can be used here to ensure used to calculate the relative distances of each marked
even redistribution of the edges across processors due vertex from the start and end of its respective chain.
to sorting. The same procedure can also be used to determine the
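The edge-centric construction described above can be mimicked serially in a few lines; the sketch below is ours, with the distributed sample sort replaced by an ordinary in-memory sort. Each (k + 1)-mer occurrence becomes one edge record, and sorting by the (k + 1)-mer key brings duplicates together so they can be merged into a single edge with a multiplicity count.

from itertools import groupby

def edge_centric_de_bruijn(reads, k):
    records = []
    for read in reads:
        for i in range(len(read) - k):
            records.append(read[i:i + k + 1])   # one record per (k+1)-mer occurrence
    records.sort()                              # stand-in for the parallel sample sort
    # key -> (source k-mer, destination k-mer, multiplicity)
    return {key: (key[:-1], key[1:], sum(1 for _ in group))
            for key, group in groupby(records)}

edges = edge_centric_de_bruijn(["agtcagt", "cagttcgg", "agtcagt"], k=4)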
vertex identifiers of those two boundary vertices for Incorporation of Distance Constraints
each marked vertex. Compaction then follows through Using Pair-End Information
the operations of concatenating the edge labels along The goal of this step is to map the information provided
the chain at one of the terminal edges, removing all by read pairs that are linked by the pair-end library onto
internal edges, and aggregating an average k + -mer the compacted and error corrected De Bruijn graph.
frequency of the constituent edges along the chain. All Once mapped, this information can be used to guide
of these operators are binary associative to allow being the assembly tour of the graph consistent (to the extent
implemented using calls to a segmented parallel prefix possible) with the distance constraints imposed by the
routine. The output of this step is a distributed represen- pair-end information. To appreciate the value added by
tation of the compacted graph. this step to the assembly procedure, recall that the pair-
end information consists of a list of read pairs of the
form < ri , rj > that have originated from the same clonal
insert during sequencing. In genomic distance parlance,
Error Correction and Variation Detection this implies that the number of bases separating the
In the De Bruijn graph representation, errors due to two reads is bounded by a minimum and maximum.
sequencing manifest themselves as different subgraph Also recall that an assembly from a De Bruijn graph
motifs which could be detected and pruned. For ease corresponds to a tour of the graph (possibly multiple
of exposition, let us informally call an edge in the tours if there were gaps in the original sequencing).
compacted De Bruijn graph as being “strongly sup- Now consider a scenario where a path in the De Bruijn
ported” (alternatively, “weakly supported”) if its aver- graph branches into multiple separate subpaths. A cor-
age k + -mer frequency is relatively high (alternatively, rect assembly tour would have to decide which of those
low). Examples of motifs are follows: () Tips are weakly branches reflect the sequence of characters along the
supported dead ends in the graph created due to a base unknown genome. It should be easy to see how pair-end
miscall occurring in a read at one of its end positions. information can be used to resolve such situations.
Because of compaction such tips occur as single edges To incorporate the information provided in the
branching out of a strongly supported path, and can form of read pairs by such pair-end libraries, the
be easily removed; () Bubbles are detours that pro- YAGA algorithm uses a cluster summarization proce-
vide alternative paths between two terminal vertices. dure, which can be outlined as follows: First, the list of
Due to compaction, these detours will also be single read pairs provided as input by the pair-end informa-
edges. There are two types of bubbles – weakly sup- tion is delineated into a corresponding list of constituent
ported bubbles are manifestations of single base miscall k + -mer pairs. Note that two k + -mers from two reads
occurring internal to a read and need to be removed; of a pair can map to two different edges on the dis-
whereas, bubbles that originate from the same vertex tributed graph. Furthermore, different positions along
and are supported roughly to the same degree could the same edge could be paired with positions emanat-
be the result of natural variations such as near identical ing from different edges. To this effect, the algorithm
paralogs (i.e., copies of the same gene occurring at dif- attempts to compute a grouping of edge pairs based on
ferent genomic loci). Such bubbles need to be retained. their best alignment with the imposed distance con-
() Spurious links connect two otherwise disparate paths straints. To achieve this, an observed distance interval
in the graph. Weakly supported links are manifesta- is computed between every edge pair on the graph
tions of erroneous k + -mers that happen to match the linked by a read pair, and then overlapping intervals
k + -mer present at a valid genomic locus, and they that capture roughly the same distances along the same
can be severed by examining the supports of the other orientation are incrementally clustered using a greedy
two paths. heuristic. This last step is achieved using a two-phase
The first pass of error correction on the compacted clustering step that primarily relies on several rounds of
graph could reveal new instances of motifs that can be parallel sorting tuples containing edge pair and interval
pruned through iterative passes subsequently until no distance information. The formal details pertaining to
new instances are observed. this algorithmic step has been omitted here for brevity.
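As one concrete example of the motif-pruning idea described above, the sketch below (ours, and deliberately simplified) removes weakly supported tips from a compacted graph given as a list of (u, v, average support) edges; real implementations additionally require a tip to be shorter than a read length, and handle bubbles and spurious links with analogous checks.

def remove_tips(edges, min_support=2.0):
    # edges: list of (u, v, avg_support) over vertex identifiers u and v.
    changed = True
    while changed:                              # iterate until no new tips appear
        out_deg, in_deg = {}, {}
        for u, v, _ in edges:
            out_deg[u] = out_deg.get(u, 0) + 1
            in_deg[v] = in_deg.get(v, 0) + 1
        def is_tip(u, v, support):
            dead_end = out_deg.get(v, 0) == 0 or in_deg.get(u, 0) == 0
            return dead_end and support < min_support
        kept = [e for e in edges if not is_tip(*e)]
        changed = len(kept) != len(edges)
        edges = kept
    return edges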
Completing the Assembly Tour and more complex genomes in the pipeline for sequenc-
An important offshoot of the above summarization pro- ing (e.g., wheat, pine, metagenomic communities), this
cedure is that redundant distance information captured scenario is about to change, and more strongly cou-
by edge pairs in the same cluster can be removed, pled parallel codes are expected to become part of the
thereby allowing significant compression in the level mainstream computing in genome assembly.
of information needed to store relative to the original In addition to de novo sequencing, next-generation
graph. This compression, in most practical cases, would sequencing is also increasingly being used in genome
allow for the overall tour to be performed sequentially resequencing projects, where the goal is to assemble a
on a single node. The serial assembly touring procedure genome of a particular variant strain (or subspecies) of
begins by using edges that have significantly longer edge an already sequenced genome. The type of computation
labels than the original read length as “seeds,” and then that originates is significantly different from that of de
by extending them in both directions as guided by the novo assembly. For resequencing, the reads generated
pair-end summarization traversal constraints. from a new strain are compared against a fully assem-
The YAGA assembler has demonstrated scaling on bled sequence of a reference strain. This process, some-
 IBM BlueGene/L processors and performs assembly times called read mapping, only requires comparison
of over a billion synthetically generated reads in under of reads against a much larger reference. Approaches
 h. The run-time is dominated by the pair-end cluster that capitalize on advanced string data structures such
summarization phase. As of this writing, performance- as suffix trees are likely to play an active role in these
related information is not available for ABySS. algorithms. Also, as this branch of science becomes
more data-intensive, reaching the petascale range, dif-
Future Trends ferent parallel paradigms such as MapReduce need to
Genome sequencing and assembly is an actively be explored, in addition to distributed memory and
pursued, constantly evolving branch of bioinformat- shared memory models. The developments in genome
ics. Technological advancements in high-throughput sequencing can be carried over to other related applica-
sequencing have fueled algorithmic innovation and tions that also involve large-scale sequence analysis, e.g.,
along with a compelling need to harness new com- in transcriptomics, metagenomics, and proteomics.
puting paradigms. With the adoption of ever faster, Fine-grain parallelism in the area of string matching
cheaper, and massively parallel sequencing machines, and sequence alignment has been an active pursued
this application domain is becoming increasingly data- topic over the last decade and there are numerous
and compute-intensive. Consequently, parallel process- hardware accelerators for performing sequence align-
ing is destined to play a critical role in genomic ment on various platforms including General Pur-
discovery. pose Graphical Processing Units, Cell Broadband
Numerous large-scale sequencing projects includ- Engine, Field-Programmable Gate Arrays, and Multi-
ing the  assembly of the human genome to the cores. These advances are yet to take their place in
most recent  assembly of the maize genome have mainstream sequencing projects.
benefited from the use of parallel processing, although Genome sequencing is at the cusp of revolution-
in different ad hoc ways. Heterogeneous clusters com- ary possibilities. With a rapidly advancing technology
prising of a mixture of a few high-end shared memory base, the possibility of realizing landmark goals such
machines along with numerous compute nodes have as personalized medicine and “$, genome” do not
been used to “farm” out tasks and accelerate the over- look distant or far-fetched. The well-advertised $ mil-
lap computation phase in particular. This is justified lion Archon Genomics X PRIZE will be awarded to
because nearly all of these large-scale projects used the first team that sequences  human genomes in
the more traditional Sanger sequencing. A few special-  days, at a recurring cost of no more than $,
purpose projects such as the  maize gene-enriched per genome. In , the cost for sequencing a human
sequencing used more strongly coupled parallel codes genome plummeted below $, using technologies
such as PaCE. However, with an aggressive adoption from Illumina and SOLiD. Going past these next-
of next-generation sequencing for genome sequencing generation technologies, however, companies such as
Pacific Biosciences are now releasing a third-generation conducted as part of the maize genome sequencing con-
(“gen-”) sequencer that uses an impressive approach sortium demonstrated scaling of this method to over a
called single-molecule sequencing (or “SMS”) [], and million reads generated from gene-enriched fractions of
have proclaimed grand goals such as sequencing a the maize genome on a , node BlueGene/L super-
human genome in  min for less than $ by . computer []. The parallel suffix tree construction algo-
These are exciting times for genomics, and the field rithm described in this entry was first described in []
is likely to continue serving as a rich reservoir for and a variant of this method was later presented in [].
new problems that pose interesting compute- and An optimal algorithm to detect maximal matching pairs
data-intensive challenges which can be addressed only of reads in parallel using the suffix tree data structure is
through a comprehensive embrace of parallel comput- presented in [].
ing. The NP-completeness of the Shortest Superstring
Problem (SSP) was shown by Gallant et al. []. The
De Bruijn graph formulation for genome assembly was
Related Entries
first introduced by Idury and Waterman [] in the
Homology to Sequence Alignment, From
context of a sequencing technique called sequencing-
Suffix Trees
by-hybridization, and later extended to WGS based
approaches in the EULER program by Pevzner et al.
Bibliographic Notes and Further []. The string graph formulation was developed by
Reading Myers []. The proof of NP-Hardness for the overlap–
Even though DNA sequencing technologies have been layout–consensus is due to Kececioglu and Myers [].
available since the late s, it was not until the The proofs of NP-Hardness for the De Bruijn and
s that they were applied at a genome scale. The string graphs models of genome assembly are due to
first genome to be fully sequenced and assembled was Medvedev et al. [].
the ∼. Mbp bacterial genome of H. influenzae in Since the later part of s, various next-generation
 []. The sequencing of the more complex, ∼ bil- sequencing technologies such as Roche 
lion bp long human genome followed []. Several (http://www.genome-sequencing.com/), SOLiD (http://
other notable large-scale sequencing initiatives followed www.appliedbiosystems.com/), Illumina (http://www.
in the new millennium including that of chimpanzee, illumina.com/), and HeliScope (http://www.helicosbio.
rice, and maize (to cite a few examples). All of these com/) have emerged along side serial assemblers.
used either the WGS strategy or hierarchical strat- A “third” generation of machines that promise a
egy coupled with Sanger sequencing, and their assem- brand new way of sequencing (by single-molecule
blies were performed using programs that followed the sequencing) are also on their way (e.g., Pacific Bio-
overlap–layout–consensus model. The National Cen- sciences (http://www.pacificbiosciences.com/)). Con-
ter for Biotechnology Information (NCBI) (http://www. sequently, the development of short read assemblers
ncbi.nlm.nih.gov) maintains a comprehensive database continue to be in hot pursuit. Edena, Newbler (http://
of all sequenced genomes. More than a dozen programs www..com), PE-Assembler [], QSRA [], SHAR-
exist to perform genome assembly under the overlap– CGS [], SSAKE [], and VCAKE [] are all examples
layout–consensus model. Notable examples include of programs that operate using the overlap graph model.
PCAP [], Phrap (http://www.phrap.org/), Arachne EULER-SR [], ALLPATHS [], Velvet [], SOAPden-
[], Celera [], and TIGR assembler []. For a detailed ovo [], ABySS [], and YAGA [] are programs that
review of fragment assembly algorithms, refer to use the De Bruijn graph formulation. Of these tools,
[, ]. The parallel algorithm that uses the cluster- PE-Assembler, ABySS, and YAGA are parallel imple-
then-assemble approach along with the suffix tree data mentations, although to varying degrees as described in
structure for assembly was implemented in a program the main text.
called PaCE and was first described in [] in the con-
text of clustering Expressed Sequence Tag data and then Acknowledgment
later adapted for genome assembly []. Experiments Study supported by NSF grant IIS-.
Bibliography . Myers EW () The fragment assembly string graph.


. Ariyaratne P, Sung W () PE-assembler: de novo assembler Bioinformatics, (Suppl ):ii–ii
using short paired-end reads. Bioinformatics ():– . Myers EW, Sutton GG, Delcher AL, Dew IM et al () A Whole-
. Batzoglou S, Jaffe DB, Stanley K, Butler J et al () ARACHNE: Genome assembly of drosophila. Science ():–
a whole-genome shotgun assembler. Genome Res ():– . Pevzner PA, Tang H, Waterman M () An eulerian path
. Bryant D, Wong W, Mockler T () QSRA – a quality-value approach to DNA fragment assembly. In: Proceedings of the
guided de novo short read assembler. BMC Bioinform (): national academy of sciences of the United States of America, vol
. Butler J, MacCallum L, Kleber M, Shlyakhter IA et al () ALL- , pp –
PATHS: de novo assembly of whole-genome shotgun microreads. . Pop M () Genome assembly reborn: recent computational
Genome Res :– challenges. Briefings in Bioinformatics ():–
. Chaisson MJ, Pevzner PA () Short read fragment assembly of . Simpson J, Wong K, Jackman S, Schein J et al () ABySS:
bacterial genomes. Genome Res :– a parallel assembler for short read sequence data. Genome Res
. Dohm J, Lottaz C, Borodina T, Himmelbaurer H () SHAR- :–
CGS, a fast and highly accurate short-read assembly algorithm for . Sutton GG, White O, Adams MD, Kerlavage AR () TIGR
de novo genomic sequencing. Genome Res ():– assembler: a new tool for assembling large shotgun sequencing
. Emrich S, Kalyanaraman A, Aluru S () Chapter : algo- projects. Genome Sci Technol ():–
rithms for large-scale clustering and assembly of biological . Venter C, Adams MD, Myers EW, Li P et al () The sequence
sequence data. In: Handbook of computational molecular biol- of the human genome. Science ():–
ogy. CRC Press, Boca Raton . Warren P, Sutton G, Holt R () Assembling millions of short
. Fleischmann R, Adams M, White O, Clayton R et al () Whole- DNA sequences using SSAKE. Bioinformatics :–
genome random sequencing and assembly of Haemophilus . Zerbino DR, Velvet BE () Algorithms for de novo short read
influenzae rd. Science ():– assembly using de bruijn graphs. Genome Res :–
. Flusberg BA, Webster DR, Lee JH, Travers KJ et al () Direct
detection of DNA methylation during single-molecule, real-time
sequencing. Nat Methods :–
. Gallant J, Maier D, Storer J () On finding minimal length Genome Sequencing
superstrings. J Comput Syst Sci :–
. Ghoting A, Makarychev K () Indexing genomic sequences
Genome Assembly
on the IBM blue gene. In: Proceedings ACM/IEEE conference on
supercomputing. Portland
. Huang X, Wang J, Aluru S, Yang S, Hiller L () PCAP: a whole-
genome assembly program. Genome Res :–
. Idury RM, Waterman MS () A new algorithm for DNA GIO
sequence assembly. J Comput Biol ():–
. Jackson BG, Regennitter M, Yang X, Schnable PS, Aluru S PCI Express
() Parallel de novo assembly of large genomes from high-
throughput short reads. In: IEEE international symposium on
parallel distributed processing, pp –
. Jeck W, Reinhardt J, Baltrus D, Hickenbotham M et al () Glasgow Parallel Haskell (GpH)
Extending assembly of short DNA sequences to handle error.
Bioinformatics :– Kevin Hammond
. Kalyanaraman A, Aluru S, Brendel V, Kothari S () Space and University of St. Andrews, St. Andrews, UK
time efficient parallel algorithms and software for EST clustering.
IEEE Trans Parallel Distrib Syst ():–
. Kalyanaraman A, Emrich SJ, Schnable PS, Aluru S () Assem- Synonyms
bling genomes on large-scale parallel computers. J Parallel Distrib
GpH (Glasgow Parallel Haskell)
Comput ():–
. Kececioglu J, Myers E () Combinatorial algorithms for DNA
sequence assembly. Algorithmica (–):– Definition
. Li R, Zhu H, Ruan J, Qian W et al () De novo assembly of
Glasgow Parallel Haskell (GpH) is a simple parallel
human genomes with massively parallel short read sequencing.
Genome Res ():–
dialect of the purely functional programming language,
. Medvedev P, Georgiou K, Myers G, Brudno M () Com- Haskell. It uses a semi-explicit model of parallelism,
putability of models for sequence assembly. Lecture notes in where possible parallel threads are marked by the pro-
computer science, vol . Springer, Heidelberg, pp – grammer, and a sophisticated runtime system then
decides on the timing of thread creation, allocation to the application domain, and it can lead to code that is
processing elements, migration. There have been several specialized to a specific parallel architecture, or class of
implementations of GpH, covering platforms ranging architectures. It also violates a key design principle for
from single multicore machines through to compu- most functional languages, which is to provide as much
tational grids or clouds. The best known of these is isolation as possible from the underlying implementa-
the GUM implementation that targets both shared- tion. GpH is therefore designed to allow programmers
memory and distributed-memory systems using a to provide information about parallel execution while
sophisticated virtual shared-memory abstraction built delegating issues of placement, communication, etc. to
over a common message-passing layer, but there is the runtime system.
a new implementation, GHC-SMP, which directly
exploits shared-memory systems, and which targets History and Development of GpH
multicore architectures. Glasgow Parallel Haskell (GpH) was first defined in
. GpH was designed to be a simple parallel exten-
G
Discussion sion to the then-new nonstrict, purely functional lan-
guage Haskell [], adding only two constructs to
Purely Functional Languages and sequential Haskell: par and seq. Unlike most earlier lazy
Parallelism functional languages, Haskell was always intended to be
Because of the absence of side effects in purely parallelizable. The use of the term “non-strict” rather
functional languages, it is relatively straightforward to than lazy reflects this: while lazy evaluation is inherently
identify computations that can be run in parallel: any sequential, since it fixes a specific evaluation order for
sub-expression can be evaluated by a dedicated parallel sub-expressions, under Haskell’s non-strict evaluation
task. For example in the following very simple function model, any number of sub-expressions can be evaluated
definition, each of the two arguments to the addition in parallel provided that they are needed by the result of
operation can be evaluated in parallel. the program.
f x = fibonacci x + f act o ri al x
The original GpH implementation targeted the
GRIP novel parallel architecture, using direct calls to
This property was already realized by the mid-s, low-level GRIP communications primitives. GRIP was
when there was a surge of interest both in parallel a shared-memory machine, using custom microcoded
evaluation in general, and in the novel architectural “intelligent memory units” to share global program data
designs that it was believed could overcome the sequen- between off-the-shelf processing elements (Motorola
tial “von-Neumann bottleneck.” In fact, the main issue  processors, each with a private  MB memory),
in a purely functional language is not extracting enough developed in an Alvey research project which ran from
parallelism – it is not unusual for even a short-running  to . Initially, GpH built on the prototype
program to produce many tens of thousands of paral- Haskell compiler developed by Hammond and Pey-
lel threads – but is rather one of identifying sufficiently ton Jones in . It was subsequently ported to the
large-grained parallel tasks. If this is not done, then Glasgow Haskell Compiler, GHC [], that was devel-
thread creation and communication overheads quickly oped at Glasgow University from  onward, and
eliminate any benefit that can be obtained from parallel which is now the de-facto standard compiler for Haskell
execution. This is especially important on conventional (since the focus of the maintenance effort was moved
processor architectures, which historically provided lit- to Microsoft Research in , GHC has also become
tle, if anything, in the way of hardware support for known as the Glorious Haskell Compiler). In ,
parallel execution. The response of some parallel func- the communications library was redesigned to give
tional language designers has therefore been to provide what became the highly portable GUM implementa-
very explicit parallelism mechanisms. While this usu- tion []. This allowed a single parallel implementation
ally avoids the problem of excessive parallelism, it places to target both commercial shared-memory systems and
a significant burden on the programmer, who must the then-emerging class of cost-effective loosely cou-
understand the details of parallel execution as well as pled networks of workstations. Initially, GUM targeted
system-specific communication libraries, but it was subsequently ported to PVM. There are now UDP, PVM, MPI, and MPICH-G instances of GUM, as well as system-specific implementations, and the same GUM implementation runs on multicore machines, shared-memory systems, workstation clusters, computational grids, and is being ported to large-scale high-performance systems, such as the ,-core HECToR system at the Edinburgh Parallel Computing Centre. Key parts of the GUM implementation have also been used to implement the Eden [] parallel dialect of Haskell, and GpH is being incorporated into the latest mainstream version of GHC.

Although the two parallelism primitives that GpH uses are very simple, it became clear that they could be packaged using higher-order functions to give much higher-level parallel abstractions, such as parallel pipelines, data-parallelism, etc. This led ultimately to the development of evaluation strategies []: high-level parallel structures that are built from the basic par and seq primitives using standard higher-order functions. Because they are built from simple components and standard language technology, evaluation strategies are highly flexible: they can be easily composed or nested; the parallelism structure can change dynamically; and the applications programmer can define new strategies on an as-needed basis, while still using standard strategies.

The GpH Model of Parallelism
GpH is unusual in using only two parallelism primitives: par and seq. The design of the par primitive dates back to the late s [], where it was used in a parallel implementation of the Lazy ML (LML) compiler. The primitive, higher-order function par is used by the programmer to mark a sub-expression as being suitable for parallel evaluation. For example, a variant of the function f above can be parallelized using two instances of par.

f x =
    let r1 = fibonacci x;
        r2 = factorial x in
    let result = (r1, r2) in
    r1 `par` (r2 `par` result)

r1 and r2 can now both be evaluated in parallel with the construction of the result pair (r1, r2). There is no need to specify any explicit communication, since the results of each computation are shared through the variables r1 and r2. The runtime system also decides on issues such as when a thread is created, where it is placed, how much data is communicated, etc. Very importantly, it can also decide whether a thread is created. The parallelism model is thus semi-explicit: the programmer marks possible sites of parallelism, but the runtime system takes responsibility for the underlying parallel control based on information about system load, etc. This approach therefore eliminates significant difficulties that are commonly experienced with more explicit parallel approaches, such as raw MPI. The programmer needs to make sure there is enough scope for parallelism, but does not need to worry about issues of deadlock, communication, throttling, load-balancing, etc. In the example above, it is likely that only one of r1 or r2 (or neither) will actually be evaluated in parallel, since the current thread will probably need both their values. Which of r1 or r2 is evaluated first by the original thread will, however, depend on the context in which it is called. It is therefore left unspecified to avoid unnecessary sequentialization.

Lazy Thread Creation
Several different versions of the par function have been described in the literature. The version used in GpH is asymmetric in that it marks its first argument as being suitable for possible execution (the expression is sparked), while continuing sequential execution of its second argument, which forms the result of the expression. For example, in par s e, the expression s will be sparked for possible parallel evaluation, and the value of e will be returned as the result of the current thread. Since the result expression is always evaluated by the sparking thread, and since there can be no side effects, it follows that it is completely safe to ignore any spark. That is, unlike many parallel systems, the creation of threads from sparks is entirely optional. This fact can be used to throttle the creation of threads from sparks in order to avoid swamping the parallel machine. This is a lazy thread creation approach (the term "lazy task creation" was coined later by Mohr et al. [] to describe a similar mechanism in MultiLisp). A corollary is that, since sparks do not carry any execution state, they can be very lightweight. Generally, a single pointer is adequate to record a sparked expression in most implementations of GpH.
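The following sketch (not part of the original entry) shows the usual way this throttling style is exploited in GpH-like code. In current GHC distributions par is exported by the Control.Parallel module with the type par :: a -> b -> b; seq (discussed further below) has the same type and is part of the Prelude. The cut-off value here is an arbitrary assumption: calls below it create no sparks at all, and any spark that is never converted into a thread is simply evaluated by its parent when its value is demanded.

import Control.Parallel (par)

-- A divide-and-conquer Fibonacci in the GpH style: only the larger
-- recursive calls are sparked, keeping the number of sparks manageable.
pfib :: Int -> Integer
pfib n
  | n < 25    = sfib n                              -- assumed threshold
  | otherwise = nf1 `par` (nf2 `seq` (nf1 + nf2))
  where
    nf1 = pfib (n - 1)
    nf2 = pfib (n - 2)

sfib :: Int -> Integer
sfib n | n < 2     = toInteger n
       | otherwise = sfib (n - 1) + sfib (n - 2)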
Sparks may be chosen for conversion to threads using a number of different strategies. One common approach is to use the oldest spark first. If the application is a divide-and-conquer program, this means that the spark representing the largest amount of work will be chosen. Alternatively, an approach may be used where the youngest (smallest) spark is executed locally and the oldest (largest) spark is offloaded for remote execution. This will improve locality, but may increase thread creation costs, since many of the locally created threads might otherwise be subsumed into their parent thread.

Parallel Graph Reduction
A second key issue that must be dealt with is that of thread synchronization. Efficient sequential non-strict functional language implementations generally use an evaluation technique called graph reduction, where a program builds a graph data structure representing the work that is needed to give the result of a program and gradually rewrites this using the functional rules defined in the program until the graph is sufficiently complete to yield the required result. This rewriting process is known as reducing the graph to some normal form (in fact weak head normal form). Each node in the graph represents an expression in the original program. Initially, this will be a single unevaluated expression corresponding to the result of the program (a "thunk"). As execution proceeds, thunks are evaluated, and each graph node is overwritten with its result. In this way, results are automatically shared between several consumers. Only graph nodes that actually contribute to the final result need to be rewritten. This means that unnecessary work can be avoided, a process that, in the sequential world, allows lazy evaluation.

The same mechanism naturally lends itself to parallel evaluation. Each shared node in the graph represents a possible synchronization point between two parallel threads. The first thread to evaluate a graph node will lock it. Any thread that evaluates the node in parallel with the thread that is evaluating it will then block when it attempts to read the value of the node. When the evaluating thread produces the result, the graph node will be updated and any blocked threads will be awoken. In this way, threads will automatically synchronize through the graph representing the computation, not just at the root of each thread, but whenever they share any sub-expressions.

Figure  shows a simple example of a divide-and-conquer parallel program, where the root of the computation, f 9, is rewritten using three threads: one for the main computation and two sub-threads, one to evaluate f 8 and one to evaluate f 7. These thunks are linked into the addition nodes in the main computation. Having evaluated the sparked thunks, the second and third threads update their root nodes with the corresponding result. Once the main thread has evaluated the remaining thunk (another call to f 8, which we have assumed is not shared with Thread #), it will incorporate these results into its own result. The final stage of the computation is to rewrite the root of the graph (the result of the program) with the value 89.

[Figure: successive snapshots of the program graph rooted at f 10 (thunks for f 9, f 8, and f 7; intermediate results 21, 34, and 55) as it is rewritten by Threads #1, #2, and #3, ending with the value 89.]
Glasgow Parallel Haskell (GpH). Fig.  Parallel graph reduction

Evaluate-and-Die
As a thread evaluates its program graph, it may encounter a node that has been sparked. There are three possible cases, depending on the evaluation status of the sparked node. If the thunk has already been evaluated, then its value can be used as normal. If the thunk has not yet been evaluated, then the thread will evaluate it as normal, first annotating it to indicate that it is under evaluation. Once a result is produced, the node will be overwritten with the result value, and any blocked threads will be reawakened. Finally, if the node is actually under evaluation, then the thread must be blocked until the result is produced and the node updated with this value. This is the evaluate-and-die execution model []. The advantage of this approach is that it automatically absorbs sparks into already executing threads, so increasing their granularity and avoiding thread creation overheads. If a spark refers to a thunk that has already been evaluated then it may be discarded without a thread ever being created.
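The three cases can be modeled concretely. The toy sketch below is not the real GHC/GUM runtime (names such as Closure, Node, and enterNode are invented for illustration, and an MVar stands in for the node lock), but it captures the case analysis a thread performs when it demands a possibly-sparked graph node.

import Control.Concurrent.MVar

data Closure a
  = Thunk (IO a)          -- unevaluated expression
  | UnderEval (MVar a)    -- being evaluated by another thread; wait here
  | Value a               -- already evaluated (weak head normal form)

type Node a = MVar (Closure a)

enterNode :: Node a -> IO a
enterNode node = do
  c <- takeMVar node
  case c of
    Value v     -> putMVar node (Value v) >> return v        -- reuse the result
    UnderEval w -> putMVar node (UnderEval w) >> readMVar w  -- block until updated
    Thunk comp  -> do
      w <- newEmptyMVar
      putMVar node (UnderEval w)    -- annotate the node as under evaluation
      v <- comp                     -- evaluate the thunk
      _ <- swapMVar node (Value v)  -- overwrite the node with its value
      putMVar w v                   -- wake any blocked threads
      return v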
The seq Primitive
While par is entirely adequate for identifying many forms of parallelism, Roe and Peyton Jones discovered [] that by adding a sequential combining function, called seq, much tighter control could be obtained over parallel execution. This could be used both to reduce the need for detailed knowledge of the underlying parallel implementation and to encode more sophisticated patterns of parallel execution. For example, one of the par constructs in the example above can be replaced by a seq construct, as shown below.

f x =
    let r1 = fibonacci x;
        r2 = factorial x in
    let result = (r1, r2) in
    r1 `par` (r2 `seq` result)
    -- was r1 `par` (r2 `par` result)

The seq construct acts like ; in a conventional language such as C: it first evaluates its first argument (here r2), and then returns its second argument (here result). So in the example above, rather than creating two sparks as before, now only one is created. Previously, depending on the order in which threads were scheduled, either the parent thread would have blocked when it evaluated r1 or r2, because these were already under evaluation, or one or both of the sparked threads would have blocked, because they were being evaluated (or had been evaluated) by the parent thread. Now, however, the only possible synchronization is between the thread evaluating r1 and the parent thread. Note that it is not possible to use either of the simpler forms of par r1 (r1, r2) or par r2 (r1, r2) to achieve the same effect, as might be expected. Because the implementation is free to evaluate the pair (r1, r2) in whichever order it prefers, there is a % probability that the spark will block on the parent thread.

This is not the only use of seq. For example, evaluation strategies (discussed below) also make heavy use of this primitive to give precise control over evaluation order. While not absolutely essential for parallel evaluation, it is thus very useful to provide a finer degree of control than can be achieved using par alone. However, there are two major costs to the use of seq. The first is that the strictness properties of the program may be changed – this means that some thunks may be evaluated that were previously not needed, and that the termination properties of the program may therefore be changed. The second is that parallel programs may become speculative.
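The change in termination behavior is easy to demonstrate. The tiny example below (not from the original entry) forces a value that the result never actually needs, turning a terminating expression into a diverging one:

loop :: Int
loop = loop                         -- a non-terminating computation

fine :: Int
fine = fst (42, loop)               -- terminates: loop is never demanded

hangs :: Int
hangs = loop `seq` fst (42, loop)   -- diverges: seq demands loop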
Speculative Evaluation
As described above, in GpH it is safe to spark any sub-expression that is needed by the result of the parallel computation. However, nothing prevents the programmer from sparking a sub-expression that is not known to be needed. This therefore allows speculative evaluation. Although there was some early experimentation with mechanisms to control speculation, these were found to be very difficult to implement, and current implementations of GpH do not provide facilities to kill created threads, change priorities dynamically, etc., as is necessary to support safe speculation. For example,

spec x y = x `par` y

defines a new function spec that sparks x and continues execution of y. If x is not definitely needed by y, or is a completely independent expression, this will spark x speculatively. If a thread is created to evaluate x, it will terminate either when it has completed evaluation of x, or when the main program terminates, having produced its own result, or if it evaluates an undefined value. In the latter case, it may cause the entire program to fail. Moreover, it is theoretically possible to prevent any progress on the main computation by flooding the runtime system with useless speculative threads. For these reasons, any speculative evaluation has to be treated carefully: if the program is to terminate, speculative sub-expressions should terminate in finite time without an error, and there should not be too many speculative sparks.

Evaluation Strategies
As discussed above, it is possible to construct higher-level parallel abstractions from basic seq/par constructs using standard Haskell programming constructs. These parallel abstractions can be associated with computation functions using higher-order definitions. This is the evaluation strategy approach. The key idea of evaluation strategies is to separate what is to be evaluated from how it could be evaluated in parallel []. For example, a simple sequential ray tracing function could be defined as follows:

raytracer :: Int -> Int -> Scene -> [Light] -> [[Vector]]
raytracer xlim ylim scene lights =
    map traceline [0 .. ylim - 1]
  where traceline y = [tracepixel scene lights x y | x <- [0 .. xlim - 1]]

Given a visible image with maximum x and y dimensions of xlim and ylim, the raytracer function applies the traceline function to each line in the image (by mapping it across each value of y in the range 0..ylim-1), and hence applies the tracepixel function to each pixel using the specified scene and lighting model. The raytracer function may be parallelized, for example, by adding the parMap strategy to parallelize each line as follows:

raytracer :: Int -> Int -> Scene -> [Light] -> [[Vector]]
raytracer xlim ylim scene lights =
    map traceline [0 .. ylim - 1] `using` parMap rnf

Here, parMap is a strategy that specifies that each element of its value argument should be evaluated in parallel. It is parameterized on another strategy that specifies what to do with each element. Here, rnf indicates that each element should be evaluated as far as possible (rnf stands for "reduce to normal form"). The using function simply applies its second argument (an evaluation strategy) to its first argument (a functional value), and returns the value. It can be easily defined using higher-order functions and the seq primitive as follows:

using :: a -> Strategy a -> a
x `using` s = s x `seq` x

So, when a strategy is applied to an expression by the using operation, the strategy is first applied to the value of the expression, and once this has completed, the value is returned. This allows the use of both parallel and sequential strategies.
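The strategies themselves are ordinary Haskell definitions built from par and seq. The sketch below is modeled on the original strategy formulation (the library distributed with GHC today uses a slightly different, Eval-monad-based interface, and rnf additionally requires a normal-form class, so rwhnf is used here to keep the sketch self-contained); the parMap strategy used above can be built in the same way.

import Control.Parallel (par)

type Strategy a = a -> ()

rwhnf :: Strategy a                   -- reduce to weak head normal form
rwhnf x = x `seq` ()

parList :: Strategy a -> Strategy [a] -- spark a strategy application per element
parList _     []     = ()
parList strat (x:xs) = strat x `par` parList strat xs

parMapF :: Strategy b -> (a -> b) -> [a] -> [b]
parMapF strat f xs = map f xs `using` parList strat
  where
    y `using` s = s y `seq` y         -- as defined earlier in this entry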
In fact, any list strategy may be used to parallelize the raytracer function instead of parMap. For example, a task farm could be used, or the list could be divided into equally sized chunks as shown below.

raytracer :: Int -> Int -> Scene -> [Light] -> [[Vector]]
raytracer xlim ylim scene lights = ... `using` parListChunk chunkSize rnf

Evaluation strategies have proved to be very powerful and flexible, having been successfully applied to several large programs, including symbolic programs with irregular patterns of parallelism []. Typically, only a few lines need to be changed at key points by adding appropriate using clauses. In abstracting parallel patterns from sequential computations, evaluation strategies have some similarities with algorithmic skeletons. The key differences between evaluation strategies and typical skeleton approaches are that evaluation strategies do not mandate a specific parallel implementation (since they rely on semi-explicit parallelism, they provide hints to the runtime system rather than directives); and that they are completely user-programmable – everything above the par and seq primitives is programmed using standard Haskell code. Unlike most skeleton approaches, they may also be easily composed to give irregular nested parallel structures or phased parallel computations, for example.

The GUM Implementation of GpH
GUM is the original and most general implementation of GpH. It provides a virtual shared-memory abstraction for parallel graph reduction, using a message-passing implementation to target both physically shared-memory and distributed-memory systems. In the GUM model, the graph representing the program that is being evaluated is distributed among the set of processor elements (PEs) that are available to execute the program. These PEs usually correspond to the cores or processors that are available to execute the parallel program, but it is possible to map PEs onto multiple processes on the same core/processor, if required. Each PE manages its own execution environment, with its own local pools of sparks and threads, which it schedules as required. Sparks and threads are offloaded on demand to maintain good work balance.

A key feature of GUM is that it uses a two-level memory model: each PE has its own private heap that contains unshared program graph created by local threads. Within this heap, some graph nodes may be shared with other PEs. These nodes are given global addresses, which identify the owner of the graph node. The advantage of this model is that it allows completely independent memory management: by keeping tables of global in-pointers, which are used as garbage collection roots into the local heap, each PE can garbage-collect its own local heap independently of all other PEs. This local garbage collection is conservative, as required, since even if a global in-pointer is no longer referenced by any other PE, it will still be treated as a garbage collection root. In order to collect global garbage, a separate scheme is used, based on distributed reference counting. Since the majority of the program graph never needs to be shared, this approach brings major efficiency gains over typical single-level memory management. In addition to the reduced need for synchronization during garbage collection, there is also no need to maintain global locks across purely local graph. Within each PE, GUM uses the same efficient (and sequential) garbage collector as the standard GHC implementation. This is currently a stop-and-copy generational collector based on Appel's collector.

Visualizing the Behavior of GpH Programs
Good visualization is an essential part of understanding parallel behavior. Runtime information can be visualized at a variety of levels to give progressively more detailed information. For example, Fig.  visualizes the overall and per-PE parallel activity for a -PE system running the soda application (a simple crossword-puzzle solver). Apart from the clear phase-transition about % into the execution, very few threads are blocked; and there are very few runnable, but not running, threads. Those that are runnable are migrated to balance overall system load. The profile also reveals that relatively little time is spent fetching nonlocal data. The per-PE profile gives a more detailed view. It is obvious that the workload is well balanced and that the work-stealing mechanism is effective in distributing work. The example also shows where PEs are idle for repeated periods, and that only the main PE is active at the end of the computation (collating results to be written as the output of the program). It thus clearly identifies places where improvements could be made to the parallel program.
[Figure: two GrAnSim profiles for soda_mg +RTS -bP -bp32 -bM -b-G -by1 -bl400. The overall activity profile plots the number of tasks against runtime (average parallelism 14.8, runtime 493.2 k cycles) with task states running, runnable, fetching, blocked, and migrating; the per-PE profile shows the corresponding activity for PEs 0 to 31.]
Glasgow Parallel Haskell (GpH). Fig.  Sample overall and per-PE activity profiles for a -PE machine
The GHC-SMP Implementation of GpH
A recent development is the GHC-SMP [] implementation of GpH from Microsoft Research Labs, Cambridge, UK, which is integrated into the standard GHC distribution. This provides an implementation of GpH that specifically targets shared-memory and multicore systems. It uses a similar model of PEs and threads to the GUM implementation, including spark pools and runnable thread pools, and implements a similar spark-stealing model. The key difference from GUM is that GHC-SMP uses a physically shared heap, rather than a virtual shared heap with an underlying message-passing implementation. This heap is garbage-collected using a global stop-and-copy collector that requires the synchronization of all PEs, but which may itself be executed in parallel using the available processor cores. Also, unlike GUM, GHC-SMP exports threads to remote PEs based on the load of the local core, rather than on demand.

GRID-GUM and the SymGrid System
By replacing the low-level communications library with MPICH-G, it is possible to execute GUM (or Eden – see below) not only on standard clusters of workstations, but also on wide-area computational grids, coordinated by the standard Globus grid middleware. The GRID-GUM and grid-enabled Eden implementations form the basis for the SymGrid-Par middleware [], which aims to provide high-level support for computational grids for a variety of symbolic computing systems. The middleware coordinates symbolic computing engines into a coherent parallel system, providing high-level skeletons. The system uses a high-level data-exchange protocol (SCSCP) based on the standard OpenMath XML format for mathematical data. Results have been very promising to date, with superlinear performance being achievable for some mathematical applications, without changing any of the sequential symbolic computing engines.

The GranSim Simulator
The GranSim parallel simulator was developed in  [] as an efficient and accurate simulator for GpH running on a sequential machine, specifically to expose granularity issues. It is unusual in modifying the sequential GHC runtime system so that evaluating graph nodes also triggers the addition of sparks, and simulates inter-PE communication. It thus follows the actual GUM implementation very precisely. It allows a range of communication costs to be simulated for a specific parallel application, ranging from zero (ideal communication) to costs that are similar to those of real parallel machines. Program execution times are also simulated, using a cost model that takes architectural characteristics into account. One of the key uses of GranSim is as the core of a parallel program development methodology, where a GpH program is first simulated under ideal parallel conditions, and then under communication cost settings for specific parallel machines: shared-memory, distributed-memory, etc. This allows the program to be gradually tuned for a specific parallel setting without needing access to the actual parallel machine, and in a way that is repeatable, can be easily debugged, and which provides detailed metrics.

GdH and Mobile Haskell
GdH [] is a distributed Haskell dialect that builds on GpH. It adds explicit constructs for task creation on specific processors, with inter-processor communication through explicit mutable shared variables that can be written on one processor and read on another. Internally, each processor runs a number of threads. The GdH implementation extends the GUM implementation with a number of explicit constructs. The main construct is revalIO, which constructs a new task on a specific processor. In conjunction with mutable variables, this can be used to construct higher-level constructs. For example, the showSystem definition below prints the names of all available PEs. The getAllHostNames function maps the hostName function over the list of all processor identifiers, returning an IO action that obtains the environment variable HOST on the specified PE. This is used in the showSystem IO operation which first obtains a list of all host names, then outputs them as a sorted list, with duplicates eliminated using the standard nub function.

getAllHostNames :: IO [String]
getAllHostNames = mapM hostName allPEId

hostName :: PEId -> IO String
hostName pe = revalIO (getEnv "HOST") pe

showSystem = do { hostnames <- getAllHostNames; showHostNames hostnames }
  where
    showHostNames names = putStrLn (show (sort (nub names)))
At the language level, GdH is similar to Concurrent Haskell, which is designed to execute explicit concurrent threads on a single PE. The key language difference is the inclusion of an explicit PEId to allocate tasks to PEs, supported by a distributed runtime environment. GdH is also broadly similar to the industrial Erlang language, targeting a similar area of distributed programming, but is non-strict rather than strict and, since it is a research language, does not have the range of telecommunications-specific libraries that Erlang supports in the form of the Erlang/OTP development platform. Mobile Haskell [] similarly extends Concurrent Haskell, adding higher-order communication channels (known as Mobile Channels) to support mobile computations across dynamically evolving distributed systems. The system uses a bytecode implementation to ensure portability, and serializes arbitrary Haskell values including (higher-order) functions, IO actions, and even mobile channels.

Eden
Eden [] (described in a separate encyclopedia entry) is closely related to GpH. Like GpH, it uses an implicit communication mechanism, sharing values through program graph nodes. However, it uses an explicit process construct to create new processes and explicit process application to pass streams of values to each process. Unlike GpH, all data that is passed to an Eden process must be evaluated before it is communicated. There is also no automatic work-stealing mechanism, task migration, granularity agglomeration, or virtual shared graph mechanism. Eden does, however, provide additional threading support: each output from the process is evaluated using its own parallel thread. It also provides mechanisms to allow explicit communication channels to be passed as first-class objects between processes.

Eden has been used to implement both algorithmic skeletons and evaluation strategies, where it has the advantage of providing more controllable parallelism but the disadvantage of losing adaptivity. One particularly useful technique is to use Eden to implement a master–worker evaluation strategy, where a set of dynamically generated worker functions is allocated to a fixed set of worker processors using an explicitly programmed scheduler. This allows Eden to deal with varying thread granularities, without using a lazy thread creation mechanism, as in GpH.

While it does not need the advanced adaptivity mechanisms that GUM provides, the Eden implementation shares several basic components; in particular, the communication library and scheduler have very similar implementations.

Current Status
GpH has been adopted as a significant part of the Haskell community effort on parallelism. The evaluation strategies library has just been released as part of the mainstream GHC compiler release, building on the GHC-SMP implementation, and there has also been significant recent effort on visualization with the release of the EdenTV and ThreadScope visualizers for Eden and GHC-SMP, respectively. Work is currently underway to integrate GUM and GHC-SMP to give a wide-spectrum implementation for GpH, and Eden will also form part of this effort. The SymGrid-Par system is being developed as part of a major UK project to provide support for high-performance computing on massively parallel machines, such as HECToR.
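In that mainstream strategies library, the par/seq combinations described above are expressed through an Eval monad. The sketch below is an assumption about the current Control.Parallel.Strategies interface rather than text from the original entry; it shows roughly how the earlier two-component example looks in that style, with rpar playing the role of par and rseq the role of seq.

import Control.Parallel.Strategies (runEval, rpar, rseq)

pair :: (Integer, Integer)
pair = runEval $ do
  a <- rpar expensive1        -- sparked, like `par`
  b <- rseq expensive2        -- forced now, like `seq`
  return (a, b)
  where
    expensive1 = sum     [1 .. 1000000]
    expensive2 = product [1 .. 20]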
Future Directions
The advent of multicore computers and the promise of manycore computers have changed the nature of parallel computing. The GpH model of lightweight multithreaded parallelism is well suited to this new order, offering the ability to easily generate large amounts of parallelism. The adaptivity mechanisms built into the GUM implementation mean that the parallel program can dynamically, and automatically, change its behavior to increase parallelism, improve locality, or throttle back parallelism as required.

Future parallel systems are likely to be built more hierarchically than present ones, with processors built from heterogeneous combinations of general-purpose cores, graphics processing units, and other specialist units, using several communication networks and a complex memory hierarchy. Proponents of manycore architectures anticipate that a single processor will involve hundreds or even thousands of such units. Because of the cost of maintaining a uniform memory model across large numbers of systems, when deployed on a large scale these processors are likely to be combined hierarchically into multiple levels of clusters. These systems may then be used to form high-performance "clouds." Future parallel languages and implementations must therefore be highly flexible and adaptable, capable of dealing with multiple levels of communication latency and internal parallel structure, and perhaps fault tolerance. The GpH model with evaluation strategies, supported by adaptive implementations such as GUM, forms a good basis for this, but it will be necessary to extend the existing models and implementations to cover more heterogeneous processor types, to deal with multiple levels of parallelism and additional program structure, and to focus more directly on locality issues.

Related Entries
Eden
Fortress (Sun HPCS Language)
Functional Languages
Futures
MPI (Message Passing Interface)
MultiLisp
NESL
Parallel Skeletons
Processes, Tasks, and Threads
Profiling
PVM (Parallel Virtual Machine)
Shared-Memory Multiprocessors
Sisal
Speculation, Thread-Level

Bibliographic Notes and Further Reading
The main venues for publication on GpH and other parallel functional languages are the International Conference on Functional Programming (ICFP) and its satellite events including the Haskell Symposium; the International Symposium on Implementation and Application of Functional Languages (IFL); the International Conference on Programming Language Design and Implementation (PLDI); and the Symposium on Trends in Functional Programming (TFP). Papers also frequently appear in the Journal of Functional Programming (JFP). A survey of Haskell-based parallel languages and implementations, as of , can be found in [], and a general introduction to research in parallel functional programming, as of , can be found in []. Some examples of the use of GpH in larger applications can be found in []. Much of this entry is based on private notes, e-mails, and final reports on the various research projects that have used GpH. The main paper on evaluation strategies is []. The main paper on the GUM implementation is [], and the main paper on the GranSim simulator is []. Many subsequent papers have used these systems and ideas. For example, one recent paper describes the SymGrid-Par system []. Further material may be found on the GpH web page at http://www.macs.hw.ac.uk/~dsg/gph/, on the GdH web page at http://www.macs.hw.ac.uk/~dsg/gdh/, on the GHC web page at http://www.haskell.org/ghc, on the SCIEnce project web page at http://www.symbolic-computation.org, on the Eden web page at http://www.mathematik.uni-marburg.de/~eden, and on the contributor's web page at http://www-fp.cs.st-andrews.ac.uk/~kh.

Bibliography
. Hammond K, Loidl H-W, Partridge A () Visualising granularity in parallel programs: a graphical winnowing system for Haskell. Proceedings of the HPFC' – Conference on High Performance Functional Computing, Denver, pp –, – April
. Hammond K, Michaelson G (eds) () Research directions in parallel functional programming. Springer, Heidelberg
. Hammond K, Zain AA, Cooperman G, Petcu D, Trinder PW () SymGrid: a framework for symbolic computation on the grid. Proceedings of the EuroPar ': th International EuroPar Conference, Springer LNCS , Rennes, France, pp –
. Loidl H-W, Trinder P, Hammond K, Junaidu SB, Morgan RG, Peyton Jones SL () Engineering parallel symbolic programs in GpH. Concurrency Pract Exp ():–
. Loogen R, Ortega-Mallén Y, Peña Mar R () Parallel functional programming in Eden. J Functional Prog ():–
. Marlow S, Jones SP, Singh S () Runtime support for multicore Haskell. Proceedings of the ICFP ': th ACM SIGPLAN International Conference on Functional Programming, ACM Press, New York
. Mohr E, Kranz DA, Halstead RH Jr () Lazy task creation: a technique for increasing the granularity of parallel programs. IEEE Trans Parallel Distrib Syst ():–
. Peyton Jones S, Clack C, Salkild J () High-performance parallel graph reduction. Proceedings of the PARLE' – Conference
on Parallel Architectures and Languages Europe, Springer LNCS , pp –
. Peyton Jones S, Hall C, Hammond K, Partain W, Wadler P () The Glasgow Haskell compiler: a technical overview. Proceedings of the JFIT (Joint Framework for Information Technology) Technical Conference, Keele, UK, pp –
. Peyton Jones S (ed), Hughes J, Augustsson L, Barton D, Boutel B, Burton W, Fasel J, Hammond K, Hinze R, Hudak P, Johnsson T, Jones M, Launchbury J, Meijer E, Peterson J, Reid A, Runciman C, Wadler P () Haskell  language and libraries. The revised report. Cambridge University Press, Cambridge
. Pointon R, Trinder P, Loidl H-W () The design and implementation of Glasgow distributed Haskell. Proceedings of the IFL' – th International Workshop on the Implementation of Functional Languages, Springer LNCS , Aachen, Germany, pp –
. Rauber Du Bois A, Trinder P, Loidl H () mHaskell: mobile computation in a purely functional language. J Univ Computer Sci ():–
. Roe P () Parallel programming using functional languages. PhD thesis, Department of Computing Science, University of Glasgow
. Trinder P, Hammond K, Loidl H-W, Peyton Jones S () Algorithm + strategy = parallelism. J Funct Prog ():–
. Trinder P, Hammond K, Mattson J Jr, Partridge A, Peyton Jones S () GUM: a portable parallel implementation of Haskell. Proceedings of the PLDI' – ACM Conference on Programming Language Design and Implementation, Philadelphia, pp –
. Trinder P, Loidl H-W, Pointon R () Parallel and distributed Haskells. J Funct Prog (&):–. Special Issue on Haskell

Global Arrays

Global Arrays Parallel Programming Toolkit

Global Arrays Parallel Programming Toolkit

Jarek Nieplocha†, Manojkumar Krishnan, Bruce Palmer, Vinod Tipparaju, Robert Harrison, Daniel Chavarría-Miranda
Pacific Northwest National Laboratory, Richland, WA, USA
Oak Ridge National Laboratory, Oak Ridge, TN, USA
† deceased.

Synonyms
Global arrays; GA

Definition
Global Arrays is a high-performance programming model for scalable, distributed-memory, parallel computer systems. Global Arrays is based on the concept of globally accessible dense arrays that are logically shared, yet physically distributed onto the memories of a parallel distributed computer system (Fig.  illustrates this concept).

[Figure: the same Global Array shown both as physically distributed data (one block held by each process) and as a single, shared data structure.]
Global Arrays Parallel Programming Toolkit. Fig.  Dual view of Global Arrays data structures

Discussion

Introduction
Global Arrays (GA) is a high-performance programming model for scalable, distributed-memory, parallel computer systems. GA is a library-based Partitioned Global Address Space (PGAS) programming model. The underlying supported sequential languages are Fortran, C, C++, and Python. GA provides global view access to very large dense arrays through API functions implemented for those languages, under a Single Program Multiple Data (SPMD) execution environment.

GA was originally developed as part of the underlying software infrastructure for the US Department of Energy's NWChem computational chemistry software package. Over time, it has been developed into a standalone package with a rich set of API functions (+) that cater to many needs in scientific application development. GA has been used to enable scalable
parallel execution for several major scientific applications including NWChem (computational chemistry, specifically electronic structure calculation), STOMP (Subsurface Transport Over Multiple Phases, a subsurface flow and transport simulator), ScalaBLAST (a more scalable, higher-performance version of BLAST), Molpro (quantum chemistry), TETHYS (unstructured, implicit CFD and coupled fluid/solid mechanics finite volume code), Pagoda (Parallel Analysis of Geodesic Data), COLUMBUS (computational chemistry), and GAMESS-UK (computational chemistry).

GA's development has occurred over the last two decades. For this reason, the number of people involved and their contributions is large. GA's original development occurred as a co-design effort between the NWChem team and the computer science team focused on GA. The main designer and original developer of GA was Jarek Nieplocha. Robert Harrison led the main effort in the development of NWChem.

Basic Global Arrays
There are three classes of operations in Global Arrays: core operations, task parallel operations, and data-parallel operations. These operations have multiple language bindings, but provide the same functionality independent of the language. The current GA library contains approximately  operations that provide a rich set of functionality related to data management and computations involving distributed arrays. GA is interoperable with MPI, enabling the development of hybrid programs that use both programming models.

The basic components of the Global Arrays toolkit are function calls to create global arrays, copy data to, from, and between global arrays, and identify and access the portions of the global array data that are held locally. There are also functions to destroy arrays and free up the memory originally allocated to them. The basic function call for creating new global arrays is nga_create. The arguments to this function include the dimension of the array, the number of indices along each of the coordinate axes, and the type of data (integer, float, double, etc.) that each array element represents. The function returns an integer handle that can be used to reference the array in all subsequent operations. The allocation of data can be left completely to the toolkit, but if it is desirable to control the distribution of data for load balancing or other reasons, additional versions of the nga_create function are available that allow the user to specify in detail how data is distributed between processors. The basic nga_create call provides a simple mechanism to control data distribution via the specification of an array that indicates the minimum dimensions of a block of data on each processor.

One of the most important features of Global Arrays is the ability to easily move blocks of data between global arrays and local buffers. The data in the global array can be referred to using a global indexing scheme and data can be moved in a single function call, even if it represents data distributed over several processors. The nga_get function can be used to move a block of distributed data from a global array to a local buffer. The arguments consist of the array handle for the array that data is being taken from, two integer arrays representing the lower and upper indices that bound the block of distributed data that is going to be moved, a pointer to the local buffer or a location in the local buffer that is to receive the data, and an array of strides for the local data. The nga_put call is similar and can be used to move data in the opposite direction.

The number of basic GA operations is fairly small and many parallel programs can be written with just the following ten routines:

● GA_Initialize(): Initialize the GA library.
● GA_Terminate(): Release internal resources and finalize execution of a GA program.
● GA_Nnodes(): Return the number of GA compute processes (corresponds to the SPMD execution environment).
● GA_Nodeid(): Return the GA process ID of the calling compute process; this is a number between 0 and GA_Nnodes() – 1.
● NGA_Create(): Create an n-dimensional globally accessible dense array (global array instance).
● NGA_Destroy(): Deallocate memory and resources associated with a global array instance.
● NGA_Put(): Copy data from a local buffer to an array section within a global array instance in a one-sided manner.
● NGA_Get(): Copy data from an array section within a global array instance to a local buffer in a
one-sided manner (see Fig.  for a detailed description of how get() operates).
● GA_Sync(): Synchronize compute processes via a barrier and ensure that all pending GA operations are complete (in accordance with the GA consistency model).
● NGA_Distribution(): Return the array section owned by a specified compute process.

[Figure: left, a flow chart of GA_Get: P2 calls GA_Get to get part of the global array into its local buffer; the library determines ownership and data locality; issues multiple non-blocking ARMCI_NbGet calls, one for the data on each remote destination; and waits for completion of all data transfers, after which P2 has the data and GA_Get is complete. Right, an example in which the requested chunk is distributed among four owning processes.]
Global Arrays Parallel Programming Toolkit. Fig.  Left: GA_Get flow chart. Right: An example: Process P issues GA_Get to get a chunk of data, which is distributed (partially) among P, P, P, and P (owners of the chunk)

Example GA Program
We present a parallel matrix multiplication program written in Global Arrays using the Fortran language interface. It uses most of the basic GA calls described before, in addition to some more advanced calls to create global array instances with specified data distributions. The program computes the result of C = A × B. Some variable declarations have been omitted for brevity.
1   program matmul
2   integer :: sz
3   integer :: i, j, k, pos, g_a, g_b, g_c
4   integer :: nproc, me, ierr
5   integer, dimension(2) :: dims, nblock, chunks
6   integer, dimension(1) :: lead
7   double precision, dimension(:,:), pointer :: lA, lB
8   double precision, dimension(:,:), pointer :: lC

9   call mpi_init(ierr)

10  call ga_initialize()
11  nproc = ga_nnodes()
12  me = ga_nodeid()

13  if (me .eq. 0) then
14  write (*, *) 'Running on: ', nproc, ' processors'
15  end if

16  dims(:) = sz
17  chunks(:) = sz/sqrt(dble(nproc)) ! only runs on a perfect square number of processors
18  nblock(1) = sz/chunks(1)
19  nblock(2) = sz/chunks(2)
20  allocate(dmap(nblock(1) + nblock(2)))

21  pos = 1
22  do i = 1, sz - 1, chunks(1) ! compute beginning coordinate of each partition in the 1st dimension
23  dmap(pos) = i
24  pos = pos + 1
25  end do

26  do j = 1, sz - 1, chunks(2) ! compute beginning coordinate of each partition in the 2nd dimension
27  dmap(pos) = j
28  pos = pos + 1
29  end do

30  ret = nga_create_irreg(MT_DBL, ubound(dims), dims, 'A', dmap, nblock, g_a) ! create a global array instance with specified data distribution
31  ret = ga_duplicate(g_a, g_b, 'B') ! duplicate same data distribution for array B
32  ret = ga_duplicate(g_a, g_c, 'C') ! and C

33  allocate(lA(chunks(1), chunks(2)), lB(chunks(1), chunks(2)), lC(chunks(1), chunks(2)))
34  lA(:, :) = 1.0
35  lB(:, :) = 2.0
36  lC(:, :) = 0.0
37  lead(1) = chunks(2)

38  call nga_distribution(g_a, me, tcoordsl, tcoordsh)

39  ! initialize global array instances to respective values
40  call nga_put(g_a, tcoordsl, tcoordsh, lA(1, 1), lead)
41  call nga_put(g_b, tcoordsl, tcoordsh, lB(1, 1), lead)
42  call nga_put(g_c, tcoordsl, tcoordsh, lC(1, 1), lead)

43  ! obtain all blocks in the row of the A matrix
44  tcoordsl1(1) = 1
45  tcoordsl1(2) = tcoordsl(2)
46  tcoordsh1(1) = chunks(1)
47  tcoordsh1(2) = tcoordsh(2)

48  ! obtain all blocks in the column of the B matrix
49  tcoordsl2(1) = tcoordsl(1)
50  tcoordsl2(2) = 1
51  tcoordsh2(1) = tcoordsh(1)
52  tcoordsh2(2) = chunks(2)

53  do pos = 1, nblock(1) ! matrix is square
54  call nga_get(g_a, tcoordsl1, tcoordsh1, lA(1, 1), lead)
55  call nga_get(g_b, tcoordsl2, tcoordsh2, lB(1, 1), lead)

56  do j = 1, n
57  do k = 1, n
58  do i = 1, n
59  lC(i, j) = lC(i, j) + lA(i, k) * lB(k, j)
60  end do
61  end do
    end do

62  ! advance coordinates for blocks
63  tcoordsl1(1) = tcoordsl1(1) + chunks(1)
64  tcoordsh1(1) = tcoordsh1(1) + chunks(1)

65  tcoordsl2(2) = tcoordsl2(2) + chunks(2)
66  tcoordsh2(2) = tcoordsh2(2) + chunks(2)
67  end do
68  ! lC contains the final result for the block owned by the process
69  call nga_put(g_c, tcoordsl, tcoordsh, lC(1, 1), lead)

70  ! do something with the result
71  call ga_print(g_c)
72  deallocate(dmap, lA, lB, lC)
73  ret = ga_destroy(g_a)
74  ret = ga_destroy(g_b)
75  ret = ga_destroy(g_c)
76  call ga_terminate()
77  call mpi_finalize(ierr)

78  end program matmul

Discussion on the Example Program
The program is (mostly) a fully functional GA code, except for omitted variable declarations. It creates global array instances with specific data distributions, illustrates the use of the nga_put() and nga_get() primitives, as well as locality information through the nga_distribution() call. The code includes calls to initialize and terminate the MPI library, which are needed to provide the SPMD execution environment to the GA application. (It is possible to write a GA application that does not call the MPI library through the use of the TCGMSG simple message-passing environment included with GA.) The code includes the creation and use of local buffers to be used as sources and targets for put and get operations (lA, lB, lC), which in this case were allocated as Fortran  dynamic arrays.

Lines – contain the principal part of the example code and illustrate several of the features of GA: data is being accessed using global coordinates (lines – and –) that correspond to the global size of the global array instances that were created previously; in addition, the access to the global array data for arrays g_a and g_b in lines  and  is done without requiring the participation of the process where that data is allocated (one-sided access). Figure  illustrates the concept of one-sided access.

[Figure: processes X, Y, and Z each issue independent ga_get calls on different sections of the same global array, e.g., ga_get(a, 100, 200, 17, 20, buf, 100), ga_get(a, 180, 210, 23, 40, buf, 30), and ga_get(a, 175, 185, 19, 70, buf, 10).]
Global Arrays Parallel Programming Toolkit. Fig.  Any part of GA data can be accessed independently by any process at any time

Global Arrays Concepts
GA allows the programmer to control data distribution and makes the locality information readily available to be exploited for performance optimization. For example, global arrays can be created by: () allowing the library to determine the array distribution, () specifying the decomposition for only one array dimension and allowing the library to determine the others, () specifying the distribution block size for all dimensions, or () specifying an irregular distribution as a Cartesian product of irregular distributions for each axis. The distribution and locality information is always available through interfaces that allow the application developer to query: () which data portion is held by a given process, () which process owns a particular array element, and () a list of processes and the blocks of data owned by each process corresponding to a given section of an array.

The primary mechanisms provided by GA for accessing data are block copy operations that transfer data between layers of memory hierarchy, namely, global memory (distributed array) and local memory. Further, extending the benefits of using blocked data accesses and copying remote locations into contiguous local memory can improve cache performance by reducing both conflict and capacity misses []. In addition, each process is able to access directly the data held in a section of a Global Array that is locally assigned to that process. Data representing sections of the Global Array owned by other processes on SMP clusters can also be accessed directly using the GA interface, if desired. Atomic operations are provided that can be used to implement synchronization and assure correctness of an accumulate operation (floating-point sum reduction that combines local and remote data) executed concurrently by multiple processes and targeting overlapping array sections.

GA is extensible as well. New operations can be defined exploiting the low-level interfaces dealing with distribution, locality, and providing direct memory access (nga_distribution, nga_locate_region, nga_access, nga_release, nga_release_update). These, e.g., were used to provide
Global Arrays Parallel Programming Toolkit G 

Global Arrays Memory Consistency Model
In shared-memory programming, one of the issues central to performance and scalability is memory consistency. Although the sequential consistency model [] is straightforward to use, weaker consistency models [] can offer higher performance on modern architectures and they have been implemented on actual hardware. GA's nature as a one-sided, global-view programming model requires similar attention to memory consistency issues. The GA approach is to use a weaker-than-sequential consistency model that is still relatively straightforward to understand by an application programmer. The main characteristics of the GA approach include:
● GA distinguishes two types of completion of the store operations (i.e., put, scatter) targeting global shared memory: local and remote. The blocking store operation returns after the operation is completed locally, i.e., the user buffer containing the source of the data can be reused. The operation completes remotely after either a memory fence operation or a barrier synchronization is called. The fence operation is required in critical sections of the user code, if the globally visible data is modified.
● The blocking operations (get/put) are ordered only if they target overlapping sections of global arrays. Operations that do not overlap or access different arrays can complete in arbitrary order.
● The nonblocking get/put operations complete in arbitrary order. The programmer uses wait/test operations to order completion of these operations, if desired.

Global Arrays Extensions
To allow the user to exploit data locality, the toolkit provides functions identifying the data from the global array that is held locally on a given processor. Two functions are used to identify local data. The first is the nga_distribution function, which takes a processor ID and an array handle as its arguments and returns a set of lower and upper indices in the global address space representing the local data block. The second is the nga_access function, which returns an array index and an array of strides to the locally held data. In Fortran, this can be converted to an array by passing it through a subroutine call. The C interface provides a function call that directly returns a pointer to the local data.
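For illustration, a minimal C sketch of this local-access pattern might look as follows; it assumes a two-dimensional global array of doubles identified by a handle g_a, and the exact calling conventions (header name, values returned for processes that own no data) may differ between GA releases.

#include "ga.h"   /* Global Arrays C interface header (name may vary by installation) */

/* Scale the locally owned patch of a 2-D global array of doubles in place. */
void scale_local_block(int g_a, double factor)
{
    int me = GA_Nodeid();          /* rank of this process */
    int lo[2], hi[2], ld[1];
    double *buf;

    /* Query which block of the global index space this process owns. */
    NGA_Distribution(g_a, me, lo, hi);
    if (lo[0] < 0) return;         /* convention: negative bounds signal no local data */

    /* Obtain a direct pointer to the local block (C interface). */
    NGA_Access(g_a, lo, hi, &buf, ld);

    for (int i = 0; i <= hi[0] - lo[0]; i++)
        for (int j = 0; j <= hi[1] - lo[1]; j++)
            buf[i * ld[0] + j] *= factor;

    /* Mark the patch as modified so GA can maintain consistency. */
    NGA_Release_update(g_a, lo, hi);
}

Called after the global array has been created and filled collectively, each process touches only the block it owns, which is exactly the locality the text describes.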
In addition to the communication operations that support task parallelism, the GA toolkit includes a set of interfaces that operate on either entire arrays or sections of arrays in the data-parallel style. These are collective data-parallel operations that are called by all processes in the parallel job. For example, movement of data between different arrays can be accomplished using a single function call. The nga_copy_patch function can be used to move a patch, identified by a set of lower and upper indices in the global index space, from one global array to a patch located within another global array. The only constraints on the two patches are that they contain equal numbers of elements. In particular, the array distributions do not have to be identical, and the implementation can perform, as needed, the necessary data reorganization (the so-called MxN problem []). In addition, this interface supports an optional transpose operation for the transferred data. If the copy is from one patch to another on the same global array, there is an additional constraint that the patches do not overlap.
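A hypothetical use of this interface through the C bindings is sketched below; the handles g_a and g_b, the patch bounds, and the 'N' (no transpose) flag are illustrative values only.

#include "ga.h"   /* Global Arrays C interface header (assumed) */

/* Collectively copy a patch of g_a into a patch of g_b.  The bounds are
   illustrative; the two patches must hold equal numbers of elements but
   may be distributed differently -- GA reorganizes the data as needed. */
void copy_patch_example(int g_a, int g_b)
{
    int alo[2] = {0, 0},  ahi[2] = {99, 49};    /* 100 x 50 source patch */
    int blo[2] = {50, 0}, bhi[2] = {149, 49};   /* 100 x 50 target patch */

    NGA_Copy_patch('N', g_a, alo, ahi, g_b, blo, bhi);  /* 'N': no transpose */
}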
Historical Development and Comparison with Other Programming Models
The original GA package [–] offered basic one-sided communication operations, along with a limited set of collective operations on arrays in the style of BLAS []. Only two-dimensional arrays and two data types were supported. The underlying communication mechanisms were implemented on top of vendor-specific interfaces. In the course of  years, the package evolved substantially and the underlying code was completely rewritten. This included separation of the GA internal one-sided communication engine from the high-level data structure. A new portable, general, and GA-independent communication library called ARMCI was created []. New capabilities were later added to GA without the need to modify the ARMCI interfaces. The GA toolkit evolved in multiple directions:
● Adding support for a wide range of data types and virtually arbitrary array ranks.
● Adding advanced or specialized capabilities that address the needs of some new application areas, e.g., ghost cells or operations for sparse data structures.
● Expansion and generalization of the existing basic functionality. For example, mutex and lock operations were added to better support the development of shared-memory-style application codes. They have proven useful for applications that perform complex transformations of shared data in task parallel algorithms, such as compressed data storage in the multireference configuration interaction calculation in the COLUMBUS package [].
● Increased language interoperability and interfaces. In addition to the original Fortran interface, C, Python, and a C++ class library were developed.
● Developing additional interfaces to third-party libraries that expand the capabilities of GA, especially in the parallel linear algebra area: ScaLAPACK [] and SUMMA []. Interfaces to the TAO optimization toolkit have also been developed [].
● Developed support for multilevel parallelism based on processor groups in the context of a shared-memory programming model, as implemented in GA [, ].
These advances generalized the capabilities of the GA toolkit and expanded its appeal to a broader set of applications. At the same time, the programming model, with its emphasis on a shared-memory view of the data structures in the context of distributed memory systems with a hierarchical memory, is as relevant today as it was in  when the project started.

Comparison with Other Programming Models
The two predominant classes of programming models for parallel computers are distributed-memory, shared-nothing, and Uniform Memory Access (UMA) shared-everything models. Both the shared-everything and fully distributed models have advantages and shortcomings. The UMA shared-memory model is easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in modern computers, this characteristic can have a negative impact on performance and scalability. Careful code restructuring to increase data reuse and replacing fine-grained load/stores with block access to shared data can address the problem and yield performance for shared memory that is competitive with message passing []. However, this performance comes at the cost of compromising the ease of use that the UMA shared-memory model posits. Distributed, shared-nothing memory models, such as message-passing or one-sided communication, offer performance and scalability but they are more difficult to program. The classic message-passing paradigm not only transfers data but also synchronizes the sender and receiver. Asynchronous (nonblocking) send/receive operations can be used to diffuse the synchronization point, but cooperation between sender and receiver is still required. The synchronization effect is beneficial in certain classes of algorithms, such as parallel linear algebra, where data transfer usually indicates completion of some computational phase; in these algorithms, the synchronizing messages can often carry both the results and a required dependency. For other algorithms, this synchronization can be unnecessary and undesirable, and a source of performance degradation and programming complexity.
The Global Arrays toolkit [–] attempts to offer the best features of both models. It implements a global-view programming model, based on one-sided communication, in which data locality is managed by the programmer. This management is achieved by calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide, e.g., an explicit acquire/release protocol []. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be specified by the programmer and hence managed. GA is related to the global address space languages such as UPC [], Titanium [], and, to a lesser extent, Co-Array Fortran []. In addition, by providing a set of data-parallel operations, GA is also related to data-parallel languages such as HPF [], ZPL [], and Data Parallel C []. However, the Global Array programming model is implemented as a library that works with most languages used for technical computing and does not rely on compiler technology for achieving parallel efficiency. It also supports a combination of task and data parallelism and is fully interoperable with the message-passing (MPI) model. The GA model exposes to the programmer the hierarchical memory of modern high-performance computer systems [], and by recognizing the communication overhead for remote data transfers, it promotes data reuse and locality of reference.
Virtually all scalable architectures possess nonuniform memory access characteristics that reflect their multilevel memory hierarchies. These hierarchies typically comprise processor registers, multiple levels of cache, local memory, and remote memory. Over time, both the number of levels and the cost (in processor cycles) of accessing deeper levels have been increasing. Scalable programming models must address the memory hierarchy since it is critical to the efficient execution of applications.

Related Entries
Coarray Fortran
MPI (Message Passing Interface)
PGAS (Partitioned Global Address Space) Languages
UPC

Bibliographic Notes and Further Reading
A more detailed version of this entry has been published in the International Journal of High Performance Computing Applications, vol. , no. , May , by SAGE Publications, Inc., All rights reserved. © .

Bibliography
. Lam MS, Rothberg EE, Wolf ME () Cache performance and optimizations of blocked algorithms. In: Proceedings of the th international conference on architectural support for programming languages and operating systems, Santa Clara, – Apr 
. Blackford LS, Choi J, Cleary A, D'Azevedo E, Demmel J, Dhillon I, Dongarra J, Hammarling S, Henry G, Petitet A, Stanley K, Walker D, Whaley RC () ScaLAPACK: a linear algebra library for message-passing computers. In: Proceedings of eighth SIAM conference on parallel processing for scientific computing, Minneapolis
. Scheurich C, Dubois M () Correct memory operation of cache-based multiprocessors. In: Proceedings of th annual international symposium on computer architecture, Pittsburgh
. Dubois M, Scheurich C, Briggs F () Memory access buffering in multiprocessors. In: Proceedings of th annual international symposium on computer architecture, Tokyo, Japan
. CCA-Forum. Common component architecture forum. http://www.cca-forum.org
. Nieplocha J, Harrison RJ, Littlefield RJ () Global arrays: a portable shared memory programming model for distributed memory computers. In: Proceedings of Supercomputing, Washington, DC, pp –
. Nieplocha J, Harrison RJ, Littlefield RJ () Global arrays: a nonuniform memory access programming model for high-performance computers. J Supercomput :–
. Nieplocha J, Harrison RJ, Krishnan M, Palmer B, Tipparaju V () Combining shared and distributed memory models: evolution and recent advancements of the Global Array Toolkit. In: Proceedings of POHLL' workshop of ICS-, New York
. Dongarra JJ, Croz JD, Hammarling S, Duff I () Set of level  basic linear algebra subprograms. ACM Trans Math Softw :–
. Nieplocha J, Carpenter B () ARMCI: a portable remote memory copy library for distributed array libraries and compiler run-time systems. In: Proceedings of RTSPP of IPPS/SDP', San Juan, Puerto Rico
. Dachsel H, Nieplocha J, Harrison RJ () An out-of-core implementation of the COLUMBUS massively-parallel multireference configuration interaction program. In: Proceedings of high performance networking and computing conference, SC', Orlando
. VanDeGeijn RA, Watts J () SUMMA: scalable universal matrix multiplication algorithm. Concurr Pract Exp :–
. Benson S, McInnes L, Moré JJ. Toolkit for Advanced Optimization (TAO). http://www.mcs.anl.gov/tao
. Nieplocha J, Krishnan M, Palmer B, Tipparaju V, Zhang Y () Exploiting processor groups to extend scalability of the GA shared memory programming model. In: Proceedings of ACM computing frontiers, Italy
. Krishnan M, Alexeev Y, Windus TL, Nieplocha J () Multilevel parallelism in computational chemistry using common component architecture and global arrays. In: Proceedings of Supercomputing, Seattle
. Shan H, Singh JP () A comparison of three programming models for adaptive applications on the Origin. In: Proceedings of Supercomputing, Dallas
. Zhou Y, Iftode L, Li K () Performance evaluation of two home-based lazy release consistency protocols for shared virtual memory systems. In: Proceedings of operating systems design and implementation symposium, Seattle, pp –
. Carlson WW, Draper JM, Culler DE, Yelick K, Brooks E, Warren K () Introduction to UPC and language specification. Center for Computing Sciences CCS-TR--, IDA Center for Computing Sciences, Bowie
. Yelick K, Semenzato L, Pike G, Miyamoto C, Liblit B, Krishnamurthy A, Hilfinger P, Graham S, Gay D, Colella P, Aiken A () Titanium: a high-performance Java dialect. Concurr Pract Exp :–
. Numrich RW, Reid JK () Co-array Fortran for parallel programming. ACM Fortran Forum :–
. High Performance Fortran Forum () High Performance Fortran Language Specification, version .. Sci Program ():–
. Snyder L () A programmer's guide to ZPL. MIT Press, Cambridge
. Hatcher PJ, Quinn MJ () Data-parallel programming on MIMD computers. MIT Press, Cambridge
. Nieplocha J, Harrison RJ, Foster I () Explicit management of memory hierarchy. Adv High Perform Comput –
Gossiping
Allgather

GpH (Glasgow Parallel Haskell)
Glasgow Parallel Haskell (GpH)

GRAPE
Junichiro Makino
National Astronomical Observatory of Japan, Tokyo, Japan

Definition
GRAPE (GRAvity PipE) is the name of a series of special-purpose computers designed for the numerical simulation of gravitational many-body systems. Most GRAPE machines consist of hardwired pipeline processors to calculate the gravitational interaction between particles and programmable computers to handle all other work. GRAPE-DR (Greatly Reduced Array of Processor Elements with Data Reduction) replaced the hardwired pipeline by simple SIMD programmable processors.

Discussion
Introduction
The improvement in the speed of computers has been a factor of  in every decade, for the last  years. In these  years, however, the computer architecture has become more and more complex. Pipelined architectures were introduced in the s, and vector architectures became the mainstream in the s. In the s, a number of parallel architectures appeared, but in the s and s, distributed-memory parallel computers built from microprocessors took over.
The technological driving force of this evolution of computer architecture has been the increase of the number of available transistors in integrated circuits, at least after the invention of integrated circuits in the s. In the case of CMOS LSIs, the number of transistors in a chip doubles every  months. Under the assumption of so-called CMOS scaling, this means that the switching speed doubles every  months. In the last  years, the number of transistors available on an LSI chip increased by roughly a factor of one million.
One way to make use of this huge number of transistors is to implement application-specific pipeline processors in a chip. The GRAPE series of special-purpose computers is one such effort to make efficient use of the large number of transistors available on LSIs.
In many scientific simulations, it is necessary to solve N-body problems numerically. The gravitational N-body problem is one such example, which describes the evolution of many astronomical objects from the solar system to the entire universe. In some cases, it is important to treat non-gravitational effects such as the hydrodynamical interaction, radiation, and magnetic fields, but gravity is the primary driving force that shapes the universe.
To solve the gravitational N-body problem, one needs to calculate the gravitational force on each body (particle) in the system from all other particles in the system. There are many ways to do so, and if relatively low accuracy is sufficient, one can use the Barnes–Hut tree algorithm [] or FMM []. Even with these schemes, the calculation of the gravitational interaction between particles (or particles and multipole expansions of groups of particles) is the most time-consuming part of the calculation. Thus, one can greatly improve the speed of the entire simulation, just by accelerating the speed of the calculation of particle–particle interaction. This is the basic idea behind GRAPE computers.
The basic idea is shown in Fig. . The system consists of a host computer and special-purpose hardware, and the special-purpose hardware handles the calculation of gravitational interaction between particles. The host computer performs other calculations such as the time integration of particles, I/O, and diagnostics.

[Figure: the host computer sends positions and masses to GRAPE; GRAPE returns accelerations and potentials.] GRAPE. Fig.  Basic structure of a GRAPE system
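For concreteness, the kernel that the special-purpose hardware evaluates is the direct summation of pairwise gravitational interactions, sketched below in C; the softening length eps is an illustrative parameter commonly used to avoid the singularity at zero separation, and units with G = 1 are assumed.

#include <math.h>

/* Direct O(N^2) summation of gravitational accelerations -- the kind of
   particle-particle kernel that a GRAPE pipeline evaluates in hardware. */
void compute_accelerations(int n, const double pos[][3], const double mass[],
                           double acc[][3], double eps)
{
    for (int i = 0; i < n; i++) {
        acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = pos[j][0] - pos[i][0];
            double dy = pos[j][1] - pos[i][1];
            double dz = pos[j][2] - pos[i][2];
            double r2 = dx * dx + dy * dy + dz * dz + eps * eps;
            double inv_r3 = 1.0 / (r2 * sqrt(r2));
            acc[i][0] += mass[j] * dx * inv_r3;   /* acceleration contribution of j on i */
            acc[i][1] += mass[j] * dy * inv_r3;
            acc[i][2] += mass[j] * dz * inv_r3;
        }
    }
}

In a GRAPE system the host performs everything outside this double loop, while the hardware pipelines evaluate the inner interaction and accumulate the sums.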
This architecture accelerates not only the simple algorithm in which the force on a particle is calculated by taking the summation of forces from all other particles in the system, but also the Barnes–Hut tree algorithm and FMM. Moreover, it can be used with individual timestep algorithms [], in which particles have their own times and timesteps and are integrated in an event-driven fashion. The use of individual timesteps is critical in simulations of many systems including star clusters and planetary systems, where close encounters and physical collisions of two particles require very small timesteps for a small number of particles.

History
The GRAPE project started in . The first machine completed, the GRAPE- [], was a single-board unit on which around  IC and LSI chips were mounted and wire-wrapped. The pipeline processor of GRAPE- was implemented using commercially available IC and LSI chips. This was a natural consequence of the fact that project members lacked both the money and the experience to design custom LSI chips. In fact, none of the original design and development team of GRAPE- had knowledge of electronic circuits beyond what is learned in a basic undergraduate course for physics students.
For GRAPE-, an unusually short word format was used, to make the hardware as simple as possible. The input coordinates are expressed in -bit fixed-point format. After subtraction, the result is converted to -bit logarithmic format, in which  bits are used for the "fractional" part. This format is used for all following operations except for the final accumulation. The final accumulation was done in -bit fixed point, to avoid overflow and underflow. The advantage of the short word format is that ROM chips can be used to implement complex functions that require two inputs. Any function of two -bit words can be implemented by one ROM chip with a -bit address input. Thus, all operations other than the initial subtraction of the coordinates and the final accumulation of the force were implemented by ROM chips.
The use of an extremely short word format in GRAPE- was based on a detailed theoretical analysis of error propagation and numerical experiments []. There are three dominant sources of error in numerical simulations of gravitational many-body systems. The first one is the error in the numerical integration of the orbits of particles. The second one is the error in the calculated accelerations themselves. The third one comes from the fact that in many cases, the number of particles used is much smaller than the number of stars in the real systems such as a galaxy.
Whether or not the third one should be regarded as a source of error depends on the problem one wants to study. If the problem is, for example, the merging of two galaxies, which takes place on a relatively short timescale (compared to the orbital period of typical stars in galaxies), the effect of the small number of particles can be, and should be, regarded as a numerical error.
On the other hand, if we want to study the long-term evolution of a star cluster, which takes place on a timescale much longer than the orbital timescale, the evolution of the orbits of individual stars is driven by close encounters with other stars. In this case, the effect of the small (or finite) number of particles is not a numerical error but what is there in real systems.
Thus, the required accuracy of the pairwise force calculation depends on the nature of the problem. In the case of the study of the merging of two galaxies, the average error can be as large as % of the pairwise force, if the error is guaranteed to be random. The average error of interactions calculated by GRAPE- is less than %, which was good enough for many problems. In the number format used in GRAPE-, the positions of particles are expressed in the -bit fixed-point format, so that the force between two nearby particles, both far from the origin of coordinates, is still expressed with sufficient accuracy. Also, the accumulation of the forces from different particles is done in the -bit fixed-point format, so that there is no loss of effective bits during the accumulation. Thus, the primary source of error with GRAPE- is the low-accuracy pairwise force calculation. It can be regarded as random error.
Strictly speaking, the error due to the short word format cannot always be regarded as random, since it introduces correlation both in space and time. One could eliminate this correlation by applying random coordinate transformations, but the quantitative study of such transformations has not been done yet.
GRAPE- used the GPIB (IEEE-) interface for the communication with the host computer. It was fast enough for use with simple direct summation. However, when combined with the tree algorithm, faster communication was necessary.
communication was necessary. GRAPE-A used VME the first computer for scientific calculation to achieve
bus for communication, to improve the performance. the peak speed higher than  Tflops. Also, in  and
The speed of GRAPE- and A was around  Mflops, , it was awarded the Gordon Bell Prize for peak
which is around / of the speed of fastest vector super- performance, which is given to a real scientific calcu-
computers of the time. Hardware cost of them was lation on a parallel computer with the highest perfor-
around /, of supercomputers, or roughly equal mance. Technical details of machines from GRAPE-
to the cost of low-end workstations. Thus, these first- through GRAPE- can be found in [] and references
generation GRAPEs offered price-performance two therein.
orders of magnitude better than that of general-purpose GRAPE- [] was an improvement over GRAPE-.
computers. It integrated two full pipelines which operate on 
GRAPE- is similar to GRAPE-A, but with much MHz clock. Thus, a single GRAPE- chip offered the
higher numerical accuracy. In order to achieve higher speed eight times more than that of the GRAPE- chip,
accuracy, commercial LSI chips for floating-point arith- or the same speed as that of an eight-chip GRAPE-
metic operations such as TI SNACT and Ana- board. GRAPE- was awarded the  Gordon Bell
log Devices ADSP/ were used. The pipeline Prize for price-performance. The GRAPE- chip was
of GRAPE- processes the three components of the fabricated with . μ m design rule by NEC.
interaction sequentially. So it accumulates one inter- Table  summarizes the history of GRAPE project.
action in every three clock cycles. This approach was Figure  shows the evolution of GRAPE systems and
adopted to reduce the circuit size. Its speed was around general-purpose parallel computers. One can see that
 Mflops, but it is still much faster than workstations evolution of GRAPE is faster than that of general-
or minicomputers at that time. purpose computers.
GRAPE- was the first GRAPE computer with a The GRAPE- was essentially a scaled-up version of
custom LSI chip. The number format was the combina- GRAPE- [], with the peak speed of around  Tflops.
tion of the fixed point and logarithmic format similar The peak speed of a single pipeline chip was  Gflops. In
to what were used in GRAPE-. The chip was fabri- comparison, GRAPE- consists of , pipeline chips,
cated using  μm design rule by National Semiconduc- each with  Mflops. The increase of a factor of  in
tor. The number of transistors on chip was  K. The speed was achieved by integrating six pipelines into one
chip operated at  MHz clock speed, offering the speed chip (GRAPE- chip has one pipeline which needs three
of about . Gflops. Printed-circuit boards with eight cycles to calculate the force from one particle) and using
chips were mass-produced, for the speed of . Gflops three times higher clock frequency. The advance of the
per board. Thus, GRAPE- was also the first GRAPE device technology (from  μm to . μm) made these
computer to integrate multiple pipelines into a sys- improvements possible. Figure  shows the processor
tem. Also, GRAPE- was the first GRAPE computer to chip delivered in early . The six pipeline units are
be manufactured and sold by a commercial company. visible.
Nearly  copies of GRAPE- have been sold to more Starting with GRAPE-, the concept of virtual mul-
than  institutes (more than  outside Japan). tiple pipeline (VMP) is used. VMP is similar to simul-
With GRAPE-, a high-accuracy pipeline was inte- taneous multithreading (SMT), in the sense that a
grated into one chip. This chip calculates the first time single pipeline processor behaves as multiple proces-
derivative of the force, so that fourth-order Hermite sors. However, what is achieved is quite different. In the
scheme [] can be used. Here, again, the serialized case of SMT, the primary gain is in the latency toler-
pipeline similar to that of GRAPE- was used. The chip ance, since one can execute independent instructions
was fabricated using  μm design rule by LSI Logic. Total with different threads. In the case of hardwired pipeline
transistor count was about  K. processors, there is no need to reduce the latency. With
The completed GRAPE- system consisted of , VMP, the bandwidth to the external memory is reduced,
pipeline chips ( PCB boards each with  pipeline since the data of one particle which exerts force are
chips). It operated on  MHz clock, delivering the shared by multiple virtual pipelines, each of which cal-
speed of . Tflops. Completed in , GRAPE- was culates the force on its own particle. This sharing of the
GRAPE. Table  History of GRAPE project
GRAPE- (/–/):  Mflops, low accuracy
GRAPE- (/–/):  Mflops, high accuracy ( bit/ bit)
GRAPE-A (/–/):  Mflops, low accuracy
GRAPE- (/–/):  Gflops, high accuracy
GRAPE-A (/–/):  Mflops, high accuracy
HARP- (/–/):  Mflops, high accuracy, Hermite scheme
GRAPE-A (/–/):  Gflops/board; some  copies are used all over the world
GRAPE- (/–/):  Tflops, high accuracy; some  copies of small machines
MD-GRAPE (/–/):  Gflops/chip, high accuracy, programmable interaction
GRAPE- (/–/):  Gflops/chip, low accuracy
GRAPE- (/–/):  Tflops, high accuracy

[Figure: peak speed (Tflops) plotted against year, 1980–2010, for vector processors, MPPs, and GRAPEs.] GRAPE. Fig.  The evolution of GRAPE and general-purpose parallel computers. The peak speed is plotted against the year of delivery. Open circles, crosses, and stars denote GRAPEs, vector processors, and parallel processors, respectively

GRAPE. Fig.  The GRAPE- processor chip

This sharing of the particle can be extended to physical multiple pipelines, as far as the total number of pipelines is not too large. Thus, special-purpose computers based on GRAPE-like pipelines have a unique advantage in that their requirement for external memory bandwidth is much smaller than that of general-purpose computers with similar peak performance. In the case of GRAPE-, each of the six physical pipelines is implemented as eight virtual pipelines. Thus, one GRAPE- chip calculates forces on  particles in parallel. The required memory bandwidth was  MB/s, for a peak speed of  Gflops. A traditional vector processor with a peak speed of  Gflops would require a memory bandwidth of  GB/s. Thus, GRAPE- requires a memory bandwidth around / of that of a traditional vector processor.
GB/s. Thus, GRAPE- requires the memory bandwidth from image particles. The direct Ewald method is rather
around / of that of a traditional vector processor. well suited for the implementation in hardware. In ,
One processor board of GRAPE- housed  WINE- was developed. It is a pipeline to calculate
GRAPE- chips. Each GRAPE- chip has its own mem- the wave-space part of the direct Ewald method. The
ory to store particles which exert the force. Thus, differ- real-space part can be handled by GRAPE-A or MD-
ent processor chips on one processor board calculate the GRAPE hardware.
forces on the same  particles from different particles, In , a group led by Toshikazu Ebisuzaki in
and the partial results are summed up by an hardwired RIKEN started to develop MDM [], a massively par-
adder tree when the result is sent back to the host. allel machine for large-scale MD simulations. Their
Thus, the summation over  chips added only a small primary goal was the simulation of protein molecules.
startup overhead (less than  μs) per one force calcula- MDM consists of two special-purpose hardware,
tion, which typically requires several milliseconds. massively parallel version of MD-GRAPE (MDGRAPE-
The completed GRAPE- system consisted of  ) and that of WINE (WINE-). The MDGRAPE- part
processor boards, grouped into  clusters with  boards consisted of , custom chips with four pipelines, for
each. Within a cluster,  boards are organized in a  ×  the theoretical peak speed of  Tflops. The WINE-
matrix, with  host computers. They are organized so part consists of , custom pipeline chips, for the peak
that the effective communication speed is proportional speed of  Tflops.
to the number of host computers. In a simple configura- The MDM effort was followed up by the devel-
tion, the effective communication speed becomes inde- opment of MDGRAPE- [], led by Makoto Taiji
pendent of the number of host computers. The details of RIKEN. MDGEAPE- achieved the peak speed of
of the network used in GRAPE- are given in [].  Pflops in .

Machines for Molecular Dynamics Related Projects


Classical MD calculation is quite similar to astrophysi- The GRAPE project is not the first project to imple-
cal N-body simulations since, in both cases, we integrate ment the calculation of pairwise force in particle-based
the orbit of particles (atoms or stars) which interact with simulations in hardware.
other particles with simple pairwise force. In the case of Delft Molecular Dynamics Processor [] (DMDP)
Coulomb force, the force law itself is the same as that of is one of the earliest efforts. It was completed in early
the gravitational force, and the calculation of Coulomb s. For the calculation of interaction between par-
force can be accelerated by GRAPE hardware. ticles, it used the hardwired pipeline similar to that
However, in MD calculations, the calculation cost of of GRAPE systems. However, in DMDP, time integra-
van der Waals force is not negligible, though van der tion of orbits and other calculations are all done in the
Waals force decays much faster than the Coulomb force hardwired processors. Thus, in addition to the force
(r− compared to r− ). calculation pipeline, DMDP had pipelines to update
It is straightforward to design a pipelined processor position, select particles for interaction calculation, and
which can handle particle–particle force given by some calculate diagnostics such as correlation function. FAS-
arbitrary function of the distance between particles. In TRUN [] has the architecture similar to that of DMDP,
GRAPE-A and its successors, a combination of table but designed to handle more complex systems such as
lookup and polynomial approximation is used. protein molecule.
GRAPE-A and MD-GRAPE were developed in To some extent, this difference in the designs of
the University of Tokyo, following these lines of idea. GRAPE computers and that of machines for molecu-
GRAPE-A was built using commercial chips and MD- lar dynamics comes from the difference in the nature of
GRAPE used a custom-designed pipeline chip. the problem. In astronomy, the wide ranges in the num-
Another difference between astrophysical simula- ber density of particles and timescale make it necessary
tions and MD calculations is that in MD calculations, to use adaptive schemes such as treecode and indi-
usually the periodic boundary condition is applied. vidual timesteps. With these schemes, the calculation
Thus, we need some way to calculate Coulomb forces cost per timestep per particle is generally higher than
Anton [] is the latest effort to speed up the molecular dynamics simulation of proteins by specialized hardware. It is essentially a revival of the basic idea of DMDP, except that pipeline processors for operations other than the force calculation were replaced by programmable parallel processors. It achieved a speed almost two orders of magnitude faster than that of general-purpose parallel computers for the simulation of protein molecules in water.

LSI Economics and GRAPE
GRAPE has achieved cost performance much better than that of general-purpose computers. One reason for this success is simply that with the GRAPE architecture, one can use practically all transistors for arithmetic units, without being limited by the memory wall problem. Another reason is the fact that the arithmetic units can be optimized to their specific uses in the pipeline. For example, in the case of GRAPE-, the subtraction of two positions is performed in -bit fixed-point format, not in floating-point format. The final accumulation is also done in fixed point. In addition, most of the arithmetic operations to calculate the pairwise interactions are done in single precision. These optimizations made it possible to pack more than  arithmetic units into a single chip with less than  M transistors. The first microprocessor with a fully pipelined double-precision floating-point unit, the Intel , required . M transistors for two (actually one and a half) operations. Thus, the number of transistors per arithmetic unit of GRAPE is smaller by more than a factor of . When compared with more recent processors, the difference becomes even larger. The Fermi processor from NVIDIA integrates  arithmetic units (adder and multiplier) with  G transistors. Thus, it is five times less efficient than the Intel , and nearly  times less efficient than GRAPE-. The difference in power efficiency is even larger, because the requirement for memory bandwidth is lower for GRAPE computers. As a result, the performance per watt of the GRAPE- chip, fabricated with a  nm design rule, is comparable to that of GPGPU chips fabricated with a  nm design rule. Thus, as silicon technology advances, the relative advantage of special-purpose architectures such as GRAPE becomes bigger.
However, there is another economic factor. As silicon semiconductor technology advances, the initial cost to design and fabricate a custom chip increases. In , the initial cost for a custom chip was around  K USD. By , it had become higher than  M USD. By , the initial cost of a  nm chip is around  M USD. Roughly speaking, the initial cost has been increasing as n^. , where n is the number of transistors one can fit into a chip.
The total budgets for the GRAPE- and GRAPE- projects were  and  M USD, respectively. Thus, a similar budget had become insufficient by the early s. The whole point of a special-purpose computer is to be able to outperform "expensive" supercomputers, with a price of – M USD. Even if a special-purpose computer is –, times faster, it is not practical to spend the cost of a supercomputer on a special-purpose computer which can solve only a narrow range of problems.
There are several possible solutions. One is to reduce the initial cost by using FPGA (Field-Programmable Gate Array) chips. An FPGA chip consists of a number of "programmable" logic blocks (LBs) and also "programmable" interconnections. An LB is essentially a small lookup table with multiple inputs, augmented with one flip-flop and sometimes a full adder or more additional circuits. The lookup table can express any combinatorial logic for the input data, and with the flip-flop, it can be part of a sequential logic. The interconnection network is used to make larger and more complex logic by connecting LBs. The design of recent FPGA chips has become much more complex, with large functional units like memory blocks and multiplier (typically  ×  bit) blocks.
Because of the need for programmability, the size of the circuit that can fit into an FPGA chip is much smaller than that for a custom LSI, and the speed of the circuit is also slower. Roughly speaking, the price of an FPGA chip per logic gate is around  times higher than that of a custom chip with the same design rule. If the relative advantage of a specialized architecture is much larger than this factor of , its implementation based on FPGA chips can outperform general-purpose computers.
In reality, there are quite a number of projects that use FPGAs for scientific computing, but most of them have turned out to be not competitive with general-purpose computers.
The primary reason for this result is the relative cost of FPGAs discussed above. Since the logic gates of FPGAs are much more expensive than those of general-purpose computers, the design of a special-purpose computer with FPGAs must be very efficient in gate usage. FPGA-based systems which use standard double- or single-precision arithmetic are generally not competitive with general-purpose computers. In order to be competitive, it is necessary to use a much shorter word length. The GRAPE architecture with reduced accuracy is thus an ideal target for the FPGA-based approach. Several successful approaches have been reported [, ].

GRAPE-DR
Another solution for the problem of the high initial cost is to widen the application range in some way to justify the high cost. The GRAPE-DR project [] followed that approach.
With GRAPE-DR, the hardwired pipeline processor of previous GRAPE systems was replaced by a collection of simple SIMD programmable processors. The internal network and external memory interface were designed so that it could emulate a GRAPE processor efficiently and could be used for several other important applications, including the multiplication of dense matrices.
GRAPE-DR is an acronym of "Greatly Reduced Array of Processor Elements with Data Reduction." The last part, "Data Reduction," means that it has an on-chip tree network which can do various reduction operations such as summation, max/min, and logical and/or.
The GRAPE-DR project was started in FY  and finished in FY . The GRAPE-DR processor chip consists of  simple processors, which can operate at a clock cycle of  MHz, for  Gflops of single-precision peak performance ( Gflops double precision). It was fabricated with a TSMC  nm process and the size is around  mm². The peak power consumption is around  W. The GRAPE-DR processor board houses four GRAPE-DR chips, each with its own local DRAM chips. It communicates with the host computer through a Gen -lane PCI-Express interface.
To some extent, the difference between GRAPE and GRAPE-DR is similar to that between traditional GPUs and GPGPUs. In both cases, hardwired pipelines are replaced by simple programmable processors. The main differences between GRAPE-DR and GPGPUs are that (a) the processor element of GRAPE-DR is much simpler, (b) the external memory bandwidth of GRAPE-DR is much smaller, and (c) GRAPE-DR is designed to achieve near-peak performance in real scientific applications such as gravitational N-body simulation and molecular dynamics simulation, and also dense matrix multiplication. These differences made GRAPE-DR significantly more efficient in both transistor usage and power usage. The GRAPE-DR chip, which was fabricated with a  nm design rule and has a  mm² area, integrates  processing elements. The NVIDIA Fermi chip, which is fabricated with a  nm design rule and has a >  mm² area, integrates the same  processing elements. Thus, there is about a factor of  difference in transistor efficiency. This difference resulted in more than a factor of  difference in power efficiency.
Whether or not an approach like GRAPE-DR will be competitive with other approaches, in particular GPGPUs, is at the time of writing rather unclear. The reason is simply that an advantage of a factor of  is not quite enough, because of the difference in other factors, among which the most important is the development cycle. New GPUs are announced roughly every year, while it is somewhat unlikely that one develops special-purpose computers every year, even if there is sufficient budget. In  years, general-purpose computers become ten times faster, and GPGPUs will also become faster by a similar factor. Thus, a factor of  advantage will disappear while the machine is being developed. On the other hand, the transistor efficiency of general-purpose computers, and that of GPUs, has been decreasing for the last  years and probably will continue to do so for the next  years or so. GRAPE-DR can retain its efficiency when it is implemented with more advanced semiconductor technology, since, as in the case of GRAPE, one can use the increased number of transistors to increase the number of processor elements. Thus, it might remain competitive.

Future Directions
Future of Special-Purpose Processors
In hindsight, the s were a very good period for the development of special-purpose architectures such as GRAPE, for two reasons. First, semiconductor technology reached the point where many floating-point arithmetic units can be integrated into a chip. Second, the initial design cost of a chip was still within the reach of fairly small research projects in basic science.
By now, semiconductor technology has reached the point that one could integrate thousands of arithmetic units into a chip. On the other hand, the initial design cost of a chip has become too high.
The use of FPGAs and the GRAPE-DR approach are two examples of ways to tackle the problem of increasing initial cost. However, unless one can keep increasing the budget, the GRAPE-DR approach is not viable, simply because it still means an exponential increase in the initial, and therefore total, cost of the project.
On the other hand, such an increase in the budget might not be impossible, since the field of computational science as a whole is becoming more and more important. Even though a supercomputer is expensive, it is still much less expensive compared to, for example, particle accelerators or space telescopes. Of course, computer simulation cannot replace real experiments or observations, but computer simulations have become essential in many fields of science and technology.
In addition, there are several technologies available in between FPGAs and custom chips. One is what is called "structured ASIC." It requires customization of typically just one metal layer, resulting in a large reduction in the initial cost. The number of gates one can fit into a given silicon area falls between those of FPGAs and custom chips. Another possibility is just to use technology one or two generations older.

Application Area of Special-Purpose Computers
The primary application area of GRAPE and GRAPE-DR has been particle-based simulation, in particular simulation that requires the evaluation of long-range interactions. It is suited to special-purpose computers because such simulations are compute-intensive. In other words, the necessary bandwidth to the external memory is relatively small. Grid-based simulations based on schemes like finite-difference or finite-element methods are less compute-intensive and thus not suited to special-purpose computers.
However, the efficiency of large-scale parallel computers based on general-purpose microprocessors for these grid-based simulations has been decreasing rather quickly. There are two reasons for this decrease. One is the lack of memory bandwidth. Currently, the memory bandwidth of microprocessors normalized by the calculation speed is around . bytes/flops, which is not enough for most grid-based simulations. Even so, this ratio will become smaller and smaller in the future. The other reason is the latency of the communication between processors.
One possible solution for these two problems is to integrate the main memory, processor cores, and communication interface into a single chip. This integration gives practically unlimited bandwidth to the memory, and the communication latency is reduced by one or two orders of magnitude.
An obvious disadvantage of this approach is that the total amount of memory would be severely limited. However, in many applications of grid-based calculation, very long time integrations of relatively small systems are necessary. Many such applications require memory much less than one TB, which can be achieved by using several thousand custom processors each with around  GB of embedded DRAM.

Bibliography
. Aarseth SJ () Dynamical evolution of clusters of galaxies, I. MN :
. Bakker AF, Gilmer GH, Grabow MH, Thompson K () A special purpose computer for molecular dynamics calculations. J Comput Phys :
. Barnes J, Hut P () A hierarchical O(N log N) force calculation algorithm. Nature :
. Fine R, Dimmler G, Levinthal C () FASTRUN: a special purpose, hardwired computer for molecular simulation. Proteins Struct Funct Genet :
. Greengard L, Rokhlin V () A fast algorithm for particle simulations. J Comput Phys :
. Hamada T, Fukushige T, Kawai A, Makino J () PROGRAPE-: a programmable, multi-purpose computer for many-body simulations. PASJ :–
. Ito T, Makino J, Ebisuzaki T, Sugimoto D () A special-purpose N-body machine GRAPE-. Comput Phys Commun :
. Kawai A, Fukushige T () $/GFLOP astrophysical N-body simulation with a reconfigurable add-in card and a hierarchical tree algorithm. In: Proceedings of SC, ACM (Online)
. Kawai A, Fukushige T, Makino J, Taiji M () GRAPE-: a special-purpose computer for N-body simulations. PASJ :
. Makino J, Aarseth SJ () On a Hermite integrator with Ahmad–Cohen scheme for gravitational many-body problems. PASJ :
. Makino J, Fukushige T, Koga M, Namura K () GRAPE-: massively-parallel special-purpose computer for astrophysical particle simulations. PASJ :
. Makino J, Hiraki K, Inaba M () GRAPE-DR: -Pflops massively-parallel computer with -core, -Gflops processor chips for scientific computing. In: Proceedings of SC, ACM (Online)
. Makino J, Ito T, Ebisuzaki T () Error analysis of the GRAPE- special-purpose N-body machine. PASJ :
. Makino J, Taiji M () Scientific simulations with special-purpose computers – The GRAPE systems. Wiley, Chichester
. Makino J, Taiji M, Ebisuzaki T, Sugimoto D () GRAPE-: a massively parallel special-purpose computer for collisional N-body simulations. ApJ :
. Narumi T, Susukita R, Ebisuzaki T, McNiven G, Elmegreen B () Mol Simul :
. Shaw DE, Denero MM, Dror RO, Kuskin JS, Larson RH, Salmon JK, Young C, Batson B, Bowers KJ, Chao JC, Eastwood MP, Gagliardo J, Grossman JP, Ho CR, Ierardi DJ, Kolossváry I, Klepeis JL, Layman T, McLeavey C, Moraes MA, Mueller R, Priest EC, Shan Y, Spengler J, Theobald M, Towles B, Wang SC () Anton: a special-purpose machine for molecular dynamics simulation. In: Proceedings of the th annual international symposium on computer architecture (ISCA '), ACM, San Diego, pp –
. Taiji M, Narumi T, Ohno Y, Futatsugi N, Suenaga A, Takada N, Konagaya A () Protein explorer: a petaflops special-purpose computer system for molecular dynamics simulations. In: The SC Proceedings, IEEE, Los Alamitos, CD-ROM

Graph Algorithms
David A. Bader, Guojing Cong
Georgia Institute of Technology, Atlanta, GA, USA
IBM, Yorktown Heights, NY, USA

Discussion
Parallel Graph Algorithms
Relationships in real-world situations can often be represented as graphs. Efficient parallel processing of graph problems has been a focus of many algorithm researchers. A rich collection of parallel graph algorithms has been developed for various problems on different models. The majority of them are based on the parallel random access machine (PRAM). PRAM is a shared-memory model where data stored in the global memory can be accessed by any processor. PRAM is synchronous, and in each unit of time, each processor either executes one instruction or stays idle.

Techniques
Graph problems are diverse. Given a graph G = (V, E), where ∣V∣ = n and ∣E∣ = m, several techniques are frequently used in designing parallel graph algorithms. The basic techniques are described as follows (detailed descriptions can be found in []).

Prefix Sum
Given a sequence of n elements s₁, s₂, ⋯, sₙ with a binary associative operator denoted by ⊕, the prefix sums of the sequence are the partial sums defined by

Sᵢ = s₁ ⊕ s₂ ⊕ ⋯ ⊕ sᵢ,  1 ≤ i ≤ n

Using the balanced tree technique, the prefix sums of n elements can be computed in O(log n) time with O(n/log n) processors. Fast prefix sum is of fundamental importance to the design of parallel graph algorithms as it is frequently used in algorithms for more complex problems.
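A minimal sequential C sketch of the balanced-tree pattern is given below, assuming n is a power of two and using addition as the ⊕ operator; each inner loop corresponds to one round whose iterations are independent and would be executed in parallel on a PRAM.

#include <stdio.h>

/* In-place inclusive prefix sums over n = 2^k elements using the
   balanced-tree (up-sweep / down-sweep) pattern. */
void prefix_sums(long s[], int n)
{
    /* Up-sweep: build partial sums at the internal nodes of the tree. */
    for (int d = 1; d < n; d *= 2)
        for (int i = 2 * d - 1; i < n; i += 2 * d)
            s[i] += s[i - d];

    /* Down-sweep: propagate the partial sums to the remaining positions. */
    for (int d = n / 2; d >= 2; d /= 2)
        for (int i = d - 1; i + d / 2 < n; i += d)
            s[i + d / 2] += s[i];
}

int main(void)
{
    long s[8] = {3, 1, 4, 1, 5, 9, 2, 6};
    prefix_sums(s, 8);
    for (int i = 0; i < 8; i++) printf("%ld ", s[i]);  /* prints 3 4 8 9 14 23 25 31 */
    printf("\n");
    return 0;
}

Each of the two sweeps performs O(log n) rounds, which is where the O(log n) parallel time of the balanced-tree technique comes from.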
Pointer Jumping
Pointer jumping, also sometimes called path doubling, is useful for handling computation on rooted forests. For a rooted forest, there is a parent function P defined on the set of vertices. P(r) is set to r when r does not have a parent. When finding the root of each tree, the pointer jumping technique updates the parent of each node by that node's grandparent, that is, it sets P(r) = P(P(r)). The algorithm runs in O(log n) time with O(n) processors. Pointer jumping can also be used to compute the distance from each node to its root. Pointer jumping is used in several connectivity and tree algorithms (e.g., see [, ]).
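A minimal C sketch of pointer jumping on a parent array is shown below; the termination test and the sequential loop over vertices stand in for the synchronous parallel steps of the PRAM formulation.

#include <stdio.h>

/* Pointer jumping on a rooted forest stored as a parent array P[0..n-1],
   where a root r satisfies P[r] == r.  After the loop, P[v] is the root of
   the tree containing v.  On a PRAM the inner loop is one parallel step,
   and O(log n) steps suffice because distances to the root double each time. */
void pointer_jump(int P[], int n)
{
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int v = 0; v < n; v++) {       /* conceptually: for all v in parallel */
            if (P[v] != P[P[v]]) {
                P[v] = P[P[v]];             /* jump to the grandparent */
                changed = 1;
            }
        }
    }
}

int main(void)
{
    /* A path 0 <- 1 <- 2 <- 3 <- 4 plus an isolated root 5. */
    int P[6] = {0, 0, 1, 2, 3, 5};
    pointer_jump(P, 6);
    for (int v = 0; v < 6; v++) printf("root(%d) = %d\n", v, P[v]);
    return 0;
}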

Divide and Conquer
The divide-and-conquer strategy is recognized as a fundamental technique in algorithm design (not limited to parallel graph algorithms). The frequently used quicksort algorithm is based on divide and conquer. The strategy partitions the input into partitions of roughly equal size and recursively works on each partition concurrently. There is usually a final step to combine the solutions of the subproblems into a solution for the original problem.
Pipelining
Like divide and conquer, the use of pipelining is not limited to parallel graph algorithm design. It is of critical importance in computer hardware and software design. Pipelining breaks up a task into a sequence of smaller tasks (subtasks). Once a subtask is complete and moves to the next stage, the ones following it can be processed in parallel. Insertion and deletion with – trees demonstrate the pipelining technique [].

Deterministic Coin Tossing
Deterministic coin tossing was proposed by Cole and Vishkin [] to break the symmetry of a directed cycle without using randomization. Consider the problem of finding a three-coloring of graph G. A k-coloring of G is a mapping c : V → {0, 1, ..., k − 1} such that c(i) ≠ c(j) if <i, j> ∈ E. Initially, the cycle has a trivial coloring with n = ∣V∣ colors. The apparent symmetry in the problem (i.e., the vertices cannot be easily distinguished) presents the major difficulty in reducing the number of colors needed for the coloring. Deterministic coin tossing uses the binary representation of an integer i = iₖ ⋯ i₁ i₀ to break the symmetry. For example, suppose t is the least significant bit position in which c(i) and c(S(i)) differ (S(i) is the successor of i in the cycle); then the new coloring for i can be set as 2t + c(i)ₜ, where c(i)ₜ denotes the value of bit t of c(i).
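A minimal C sketch of one recoloring step is given below; the successor array S, the plain integer colors, and the 5-cycle used in main are illustrative assumptions.

#include <stdio.h>

/* One deterministic coin tossing (Cole-Vishkin style) recoloring step on a
   directed cycle.  c[i] is the current color of vertex i and S[i] its
   successor.  The new color is 2*t + b, where t is the least significant bit
   position in which c[i] and c[S[i]] differ and b is that bit of c[i].
   Because neighboring colors differ, the new coloring is again proper. */
void recolor_step(const int c[], const int S[], int newc[], int n)
{
    for (int i = 0; i < n; i++) {            /* conceptually: for all i in parallel */
        int diff = c[i] ^ c[S[i]];
        int t = 0;
        while (((diff >> t) & 1) == 0) t++;  /* lowest differing bit position */
        int b = (c[i] >> t) & 1;
        newc[i] = 2 * t + b;
    }
}

int main(void)
{
    /* A 5-cycle 0 -> 1 -> 2 -> 3 -> 4 -> 0 with the trivial coloring c[i] = i. */
    int S[5] = {1, 2, 3, 4, 0};
    int c[5] = {0, 1, 2, 3, 4}, newc[5];
    recolor_step(c, S, newc, 5);
    for (int i = 0; i < 5; i++) printf("c(%d): %d -> %d\n", i, c[i], newc[i]);
    return 0;
}

Iterating this step shrinks the range of colors rapidly, which is how the number of colors is driven down toward a constant.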
Accelerating Cascades
Accelerating cascades was presented together with deterministic coin tossing in [] for the design of faster algorithms for list ranking and other problems. This technique combines two algorithms, one optimal and the other super fast, for the same problem to get an optimal and very fast algorithm. The general strategy is to start with the optimal algorithm to reduce the problem size and then apply the fast but nonoptimal algorithm.

Other Building Blocks for Parallel Graph Algorithms
List ranking determines the position, or rank, of the list items in a linked list. A generalized list-ranking problem is defined as follows. Let X be an array of n elements stored in arbitrary order. For each element i, let X(i).value be its value and X(i).next be the index of its successor. Then, for any binary associative operator ⊕, compute X(i).prefix such that X(head).prefix = X(head).value and X(i).prefix = X(i).value ⊕ X(predecessor).prefix, where head is the first element of the list, i is not equal to head, and predecessor is the node preceding i in the list. Pointer jumping can be applied to solve the list-ranking problem. An optimal list-ranking algorithm is given in [].
The Euler tour technique is another of the basic building blocks for designing parallel algorithms, especially for tree computations. For example, postorder/preorder numbering, computing the vertex level, computing the number of descendants, etc., can be done work-time optimally on EREW PRAM by applying the Euler tour technique. As suggested by its name, the power of the Euler tour technique comes from defining an Eulerian circuit on the tree. In Tarjan and Vishkin's biconnected components paper [] that originally introduced the Euler tour technique, the input to their algorithm is an edge list with the cross-pointers between twin edges <u, v> and <v, u> established. With these cross-pointers it is easy to derive an Eulerian circuit. The Eulerian circuit can be treated as a linked list, and by assigning different values to each edge in the list, list ranking can be used for many tree computations. For example, when rooting a tree, the value  is associated with each edge. After list ranking, simply inspecting the list ranks for <u, v> and <v, u> can set the correct parent relationship for u and v.
Tree contraction systematically shrinks a tree into a single vertex by successively shrinking parts of the tree. It can be used to solve the expression evaluation problem. It is also used in other algorithms, for example, computing the biconnected components [].

Classical Algorithms
The use of the basic techniques is demonstrated in several classical graph algorithms for spanning tree, minimum spanning tree, and biconnected components.
Various deterministic and randomized techniques have been given for solving the spanning tree problem (and the closely related connected components problem) on PRAM models. A brief survey of these algorithms can be found in []. The Shiloach–Vishkin (SV) algorithm is representative of several connectivity algorithms in that it adapts the widely used graft-and-shortcut approach. Through carefully designed grafting schemes, the algorithm achieves complexities of O(log n) time and O((m + n) log n) work under the arbitrary CRCW PRAM model.
The algorithm takes an edge list as input and starts with n isolated vertices and m processors. Each processor Pᵢ (1 ≤ i ≤ m) inspects edge eᵢ = <vᵢ₁, vᵢ₂> and tries to graft vertex vᵢ₁ to vᵢ₂ under the constraint that vᵢ₁ < vᵢ₂. Grafting creates k ≥ 1 connected components in the graph, and each of the k components is then shortcut to a single super-vertex. Grafting and shortcutting are iteratively applied to the reduced graphs G′ = (V′, E′) (where V′ is the set of super-vertices and E′ is the set of edges among super-vertices) until only one super-vertex is left.
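The following C sketch conveys the graft-and-shortcut structure in a simplified, sequential form: grafting hooks the larger of two current parents onto the smaller one along each edge, and shortcutting is the pointer jumping described earlier. It is an illustration of the pattern, not a reproduction of the exact CRCW PRAM algorithm.

#include <stdio.h>

/* Graft-and-shortcut style connected components.  D[v] is the current
   "parent" (super-vertex) of v; edges are (u[e], v[e]). */
void connected_components(int n, int m, const int u[], const int v[], int D[])
{
    for (int i = 0; i < n; i++) D[i] = i;      /* n isolated super-vertices */

    int changed = 1;
    while (changed) {
        changed = 0;
        /* Graft: along each edge, hook the larger parent onto the smaller one. */
        for (int e = 0; e < m; e++) {
            int ru = D[u[e]], rv = D[v[e]];
            if (ru < rv)      { D[rv] = ru; changed = 1; }
            else if (rv < ru) { D[ru] = rv; changed = 1; }
        }
        /* Shortcut: pointer jumping until every tree has depth one. */
        for (int i = 0; i < n; i++)
            while (D[i] != D[D[i]]) D[i] = D[D[i]];
    }
}

int main(void)
{
    /* Two components: {0,1,2,3} and {4,5}. */
    int u[4] = {0, 1, 2, 4}, v[4] = {1, 2, 3, 5};
    int D[6];
    connected_components(6, 4, u, v, D);
    for (int i = 0; i < 6; i++) printf("D[%d] = %d\n", i, D[i]);
    return 0;
}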
Minimum spanning tree (MST) is one of the most studied combinatorial problems with practical applications. While several theoretic results are known for solving MST in parallel, many are considered impractical because they are too complicated and have large constant factors hidden in the asymptotic complexity. See [] for a survey of MST algorithms. Many parallel algorithms are based on the Borůvka algorithm. Borůvka's algorithm is composed of Borůvka iterations that are used in many parallel MST algorithms. A Borůvka iteration is characterized by three steps: find-min, connected-components, and compact-graph. In find-min, for each vertex v the incident edge with the smallest weight is labeled to be in the MST; connected-components identifies the connected components of the induced graph with the labeled MST edges; compact-graph compacts each connected component into a single supervertex, removes self-loops and multiple edges, and relabels the vertices for consistency.
A connected graph is said to be separable if there Practical Implementation
exists a vertex v such that removal of v results in two Many real-world graphs are large and sparse. These
or more connected components of the graph. Given a instances are especially hard to process due to the
connected, undirected graph G, the biconnected com- characteristics of the workload. Although fast theo-
ponents problem finds the maximal-induced subgraphs retic algorithms exist in the literature, large and sparse
of G that are not separable. Tarjan [] presents an graph problems are still challenging to solve in practice.
optimal O(n + m) algorithm that finds the biconnected There remains a significant gap between algorithmic
components of a graph based on depth-first search model and architecture. The mismatch between mem-
(DFS). Eckstein [] gave the first parallel algorithm ory access pattern and cache organization is the most
that takes O(d log n) time with O((n + m)/d) proces- outstanding barrier to high-performance graph analysis
sors on CREW PRAM, where d is the diameter of the on current systems.
graph. Tarjan and Vishkin [] present an O(log n) time
algorithm on CRCW PRAM that uses O(n + m) proces- Graph Workload
sors. This algorithm utilizes many of the fundamental Compared with traditional scientific applications, graph
primitives including prefix sum, list ranking, sorting, analysis is more memory intensive. Graph algorithms
connectivity, spanning tree, and tree computations. put tremendous pressure on the memory subsystem
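The graft-and-shortcut pattern can be sketched compactly. The Python function below is an illustrative sequential rendering of the idea, not the Shiloach–Vishkin algorithm itself: a parent array stands in for the super-vertices, grafting hooks the larger of two roots onto the smaller (the ordering constraint above), and shortcutting collapses every tree to depth one. All names and the connectedness assumption are choices made here for illustration.

def graft_and_shortcut_cc(n, edges):
    """Connected components by repeated grafting and shortcutting.
    n vertices; edges is a list of (u, v) pairs (the edge-list input)."""
    parent = list(range(n))               # every vertex starts as its own super-vertex
    changed = True
    while changed:
        changed = False
        # Grafting: hook the larger root onto the smaller one.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:      # only roots are grafted, as in the PRAM algorithm
                    parent[hi] = lo
                    changed = True
        # Shortcutting: collapse each tree to a single super-vertex (depth one).
        for v in range(n):
            while parent[v] != parent[parent[v]]:
                parent[v] = parent[parent[v]]
    return parent                          # parent[v] is v's component representative

In the actual PRAM algorithm each edge and each vertex has its own processor, so the two inner loops become single parallel steps, and the carefully designed grafting schemes bound the number of iterations by O(log n).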
Minimum spanning tree (MST) is one of the most studied combinatorial problems with practical applications. While several theoretic results are known for solving MST in parallel, many are considered impractical because they are too complicated and have large constant factors hidden in the asymptotic complexity. See [] for a survey of MST algorithms. Many parallel algorithms are based on the Borůvka algorithm. Borůvka's algorithm comprises Borůvka iterations that are used in many parallel MST algorithms. A Borůvka iteration is characterized by three steps: find-min, connected-components, and compact-graph. In find-min, for each vertex v the incident edge with the smallest weight is labeled to be in the MST; connect-components identifies connected components of the induced graph with the labeled MST edges; compact-graph compacts each connected component into a single supervertex, removes self-loops and multiple edges, and relabels the vertices for consistency.

A connected graph is said to be separable if there exists a vertex v such that removal of v results in two or more connected components of the graph. Given a connected, undirected graph G, the biconnected components problem finds the maximal induced subgraphs of G that are not separable. Tarjan [] presents an optimal O(n + m) algorithm that finds the biconnected components of a graph based on depth-first search (DFS). Eckstein [] gave the first parallel algorithm, which takes O(d log n) time with O((n + m)/d) processors on CREW PRAM, where d is the diameter of the graph. Tarjan and Vishkin [] present an O(log n) time algorithm on CRCW PRAM that uses O(n + m) processors. This algorithm utilizes many of the fundamental primitives including prefix sum, list ranking, sorting, connectivity, spanning tree, and tree computations.

Communication Efficient Graph Algorithms
Communication-efficient parallel algorithms were proposed to address the "bottleneck of processor-to-processor communication" (e.g., see []). Goodrich [] presented a communication-efficient sorting algorithm on weak-CREW BSP that runs in O(log n/log(h + )) communication rounds (with at most h data transported by each processor in each round) and O((n log n)/p) local computation time, for h = Θ(n/p). Goodrich's sorting algorithm is frequently used in communication-efficient graph algorithms. Dehne et al. designed an efficient list-ranking algorithm for coarse-grained multicomputers (CGM) and BSP that takes O(log p) communication rounds with O(n/p) local computation. In the same study, a series of communication-efficient graph algorithms such as connected components, ear decomposition, and biconnected components are presented using the list-ranking algorithm as a building block. On the BSP model, Adler et al. [] presented a communication-optimal MST algorithm. The list-ranking algorithm and the MST algorithm take similar approaches to reduce the number of communication rounds. They both start by simulating several (e.g., O(log p) or O(log log p)) steps of the PRAM algorithms on the target model to reduce the input size so that it fits in the memory of a single node. A sequential algorithm is then invoked to process the reduced input of size O(n/p), and finally the result is broadcast to all processors for computing the final solution.

Practical Implementation
Many real-world graphs are large and sparse. These instances are especially hard to process due to the characteristics of the workload. Although fast theoretic algorithms exist in the literature, large and sparse graph problems are still challenging to solve in practice. There remains a significant gap between algorithmic model and architecture. The mismatch between memory access pattern and cache organization is the most outstanding barrier to high-performance graph analysis on current systems.

Graph Workload
Compared with traditional scientific applications, graph analysis is more memory intensive. Graph algorithms put tremendous pressure on the memory subsystem to deliver data to the processor. Table  shows the percentages of memory instructions executed in the SPLASH benchmarks [] and several graph algorithms. SPLASH represents typical scientific applications for shared-memory environments. On IBM Power, on average, about % more memory instructions are executed in the graph algorithms. For biconnected components, the number is over %.

For memory-intensive applications, the locality behavior is especially crucial to performance. Unfortunately, most graph algorithms exhibit erratic memory access patterns that result in poor performance, as shown in Table . Table  is the cycles per instruction (CPI) construction [] for three graph algorithms on IBM Power. The algorithms studied are betweenness centrality (BC), biconnected components (BiCC), and minimum spanning tree (MST). CPI construction attributes cycles, in percentages, to categories such as completion, instruction cache miss penalty, and stall. In Table , for all algorithms, a significant number of cycles is spent on pipeline stalls. About –% of the cycles are wasted on load-store unit stalls, and more than % of the cycles are spent on stalls due to data cache misses. Table  clearly shows that graph algorithms perform poorly on current cache-based architectures, and the culprit is the memory access pattern. The floating-point stalls (FPU STL) column is revealing about one other prominent feature of graph algorithms. FPU STL is the percentage of cycles wasted on floating-point unit stalls. In Table , BiCC and MST do not incur any FPU stalls. Unfortunately, this does not mean that the workload fully utilizes the FPU. Instead, as there are no floating-point operations in these algorithms, the elaborately designed floating-point units lie idle. CPI construction shows that graph workloads spend most of the time waiting for data to be delivered, and there are not many other operations to hide the long latency to main memory. In fact, of the three algorithms, only BC incurs execution of a few floating-point instructions.

Graph Algorithms. Table  Percentages of load-store instructions for the SPLASH benchmark and the graph problems. ST, MST, BiCC, and CC stand for spanning tree, minimum spanning tree, biconnected components, and connected components, respectively

                      SPLASH                                    Graph problems
Benchmark    Barnes   Cholesky   Ocean   Raytrace     ST      MST     BiCC    CC
Load         .%       .%         .%      .%           .%      .%      .%      .%
Store        .%       .%         .%      .%           %       .%      .%      .%
Load+store   .%       .%         .%      .%           .%      .%      .%      .%

Graph Algorithms. Table  CPI construction for three graph algorithms. Base cycles are for "useful" work. The "Stall" column shows the percentage of cycles on pipeline stalls, followed by stalls due to the load-store unit, data cache miss, floating-point unit, and fixed-point unit, respectively

Algorithm    Base    GCT    Stall    LSU STL    DCache STL    FPU STL    FXU STL
BC           .       .      .        .          .             .          .
BiCC         .       .      .        .          .             .          .
MST          .       .      .        .          .             .          .
Implementation on Shared-Memory Machines
It is relatively straightforward to map PRAM algorithms onto shared-memory machines such as symmetric multiprocessors (SMPs), multi-core, and many-core systems. While these systems are of shared-memory architecture, they are by no means the PRAM used in theoretical work – synchronization cannot be taken for granted, memory bandwidth is limited, and performance requires a high degree of locality. Practical design choices need to be made to achieve high performance on such systems.

Adapting to the Available Parallelism
Nick's Class (NC) is defined as the set of all problems that run in polylog-time with a polynomial number of processors. Whether a problem P is in NC is a fundamental question. The PRAM model assumes an unlimited number of processors and explores the maximum inherent parallelism of P. Acknowledging the practical restriction of limited parallelism provided by real computers, Kruskal et al. [] argued that non-polylogarithmic time algorithms (e.g., sublinear time algorithms) could be more suitable than polylog algorithms for implementation with practically large input size. EP (short for efficient parallel) algorithms, by the definition in [], is the class of algorithms that achieve a polynomial reduction in running time with a polylogarithmic inefficiency. With EP algorithms, the design focus is shifted from reducing the complexity factors to solving problems of realistic sizes efficiently with a limited number of processors.

Reducing Synchronization
When adapting a PRAM algorithm to shared-memory machines, a thorough understanding of the algorithm usually suffices to eliminate unnecessary synchronization such as barriers. Reducing synchronization thus is an implementation issue. Asynchronous algorithm design, however, is more aggressive in reducing synchronization. It sometimes allows nondeterministic intermediate results but deterministic solutions.

In a parallel environment, to ensure correct final results, oftentimes a total ordering on all the events is not necessary, and a partial ordering in general suffices. Relaxed constraints on ordering reduce the number of synchronization primitives in the algorithm.

Bader and Cong presented a largely asynchronous spanning tree algorithm in [] that employs a constant number of barriers. The spanning tree algorithm for shared-memory machines has two main steps: () stub spanning tree and () work-stealing graph traversal. In the first step, one processor generates a stub spanning tree, that is, a small portion of the spanning tree, by randomly walking the graph for O(p) steps. The vertices of the stub spanning tree are evenly distributed into each processor's queue, and each processor in the next step will traverse from the first element in its queue. After the traversals in step , the spanning subtrees are connected to each other by this stub spanning tree. In the graph traversal step, each processor traverses the graph (by coloring the nodes), similar to the sequential algorithm, in such a way that each processor finds a subgraph of the final spanning tree. Work-stealing is used to balance the load for graph traversal.

One problem related to synchronization is that portions of the graph could be traversed by multiple processors and end up in different subgraphs of the spanning tree. The immediate remedy is to synchronize using either locks or barriers. With locks, coloring the vertex becomes a critical section, and a processor can only enter the critical section when it gets the lock. Although the nondeterministic behavior is now prevented, this approach does not perform well on large graphs due to an excessive number of locking and unlocking operations.

In the proposed algorithm, no barriers are introduced in graph traversal. The algorithm runs correctly without barriers even when two or more processors color the same vertex. In this situation, each processor will color the vertex and set as its parent the vertex it has just colored. Only one processor succeeds at setting the vertex's parent to a final value. For example, using Fig. , processor P1 colored vertex u, and processor P2 colored vertex v, and at a certain time they both find w unvisited and are now in a race to color vertex w. It makes no difference which processor colored w last, because w's parent will be set to either u or v (and it is legal to set w's parent to either of them; this will not change the validity of the spanning tree, only its shape). Further, this event does not create cycles in the spanning tree under the sequential consistency model. Both P1 and P2 record that w is connected to each processor's own tree. When each of w's unvisited children is visited by various processors, its parent will be set to w, independent of w's parent.

[Figure] Graph Algorithms. Fig.  Two processors P1 and P2 see vertex w as unvisited, so each is in a race to color w and set w's parent pointer. The shaded area represents vertices colored by P1, the black area represents those marked by P2, and the white area contains unvisited vertices
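The two-step structure just described – a stub spanning tree followed by work-stealing traversal with a benign race on vertex coloring – can be simulated sequentially in a few lines. The sketch below is illustrative only: a round-robin loop stands in for the p processors, the random-walk length, queue layout, and the connected-graph assumption are choices made here, and the unsynchronized write discussed above happens at the marked line on a real SMP.

import random

def spanning_tree_smp(adj, root=0, p=4):
    """Sequential simulation of the two-step SMP spanning tree algorithm.
    adj is an adjacency list of a connected graph; round-robin processing of
    the p queues stands in for real parallel execution."""
    n = len(adj)
    parent = [-1] * n
    color = [False] * n
    # Step 1: one "processor" grows a stub spanning tree by a short random walk.
    stub, cur = [root], root
    color[root] = True
    for _ in range(p):
        nxt = random.choice(adj[cur]) if adj[cur] else cur
        if not color[nxt]:
            color[nxt] = True
            parent[nxt] = cur
            stub.append(nxt)
        cur = nxt
    queues = [[] for _ in range(p)]
    for k, v in enumerate(stub):
        queues[k % p].append(v)            # spread stub vertices over the queues
    # Step 2: graph traversal with work stealing.
    while any(queues):
        for i in range(p):
            if not queues[i]:              # work stealing: grab from a busy queue
                donors = [q for q in queues if q]
                if not donors:
                    break
                queues[i].append(donors[0].pop())
            v = queues[i].pop()
            for w in adj[v]:
                if not color[w]:           # on a real SMP this check-and-color is the benign race
                    color[w] = True
                    parent[w] = v
                    queues[i].append(w)
    return parent

Because every vertex is colored exactly once before being enqueued, each processor's subtree attaches to the stub tree, matching the behavior described above.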
Choosing Between Barriers and Locks
Locks and barriers are two major types of synchronization primitives. In practice, the choice between locks and barriers may not be very clear. Take the "graft and shortcut" spanning tree algorithm for example. For graph G = (V, E) represented as an edge list, the algorithm starts with n isolated vertices and m processors. For edge ei = ⟨u, v⟩, processor Pi ( ≤ i ≤ m) inspects u and v, and if v < u, it grafts vertex u to v and labels ei to be a spanning tree edge. The problem here is that for a certain vertex v, its multiple incident edges could cause grafting v to different neighbors, and the resulting tree may not be valid. To ensure that v is grafted to only one of the neighbors, locks can be used. Associated with each vertex v is a flag variable, protected by a lock, that shows whether v has been grafted. In order to graft v, a processor has to obtain the lock and check the flag; thus race conditions are prevented. A different solution uses barriers [] in a two-phase election. No checking is needed when a processor grafts a vertex, but after all processors are done (ensured with barriers), a check is performed to determine which one succeeds, and the corresponding edge is labeled as a tree edge. Whether to use a barrier or a lock depends on the algorithm design as well as on the barrier and lock implementations. Locking typically introduces large memory overhead. When contention among processors is intense, the performance degrades significantly.
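The two-phase election can be made concrete with a short sketch. The code below illustrates the barrier-based alternative described above and is not an implementation from the entry: the phase boundary plays the role of the barrier, and electing the proposer with the smallest index is an assumption made here to keep the outcome deterministic.

def graft_two_phase(n, edges):
    """One graft step of the edge-list algorithm with a two-phase election.
    edges[i] = (u, v) is inspected by processor P_i.
    Returns (tree_edges, parent): the elected tree-edge indices and the parent array."""
    proposals = {}                          # vertex -> list of (processor id, graft target)
    # Phase 1: every processor writes its proposal without any checking.
    for i, (u, v) in enumerate(edges):
        hi, lo = (u, v) if u > v else (v, u)
        if hi != lo:
            proposals.setdefault(hi, []).append((i, lo))
    # ---- barrier: all proposals are now visible ----
    # Phase 2: one winner per grafted vertex is determined, and only the
    # winning edge is labeled a tree edge.
    tree_edges, parent = [], list(range(n))
    for hi, cands in proposals.items():
        i, lo = min(cands)                  # deterministic election among the proposers
        parent[hi] = lo
        tree_edges.append(i)
    return tree_edges, parent

With per-vertex locks, the check and the graft would instead be fused into one critical section per proposal, which is exactly the locking overhead the entry warns about.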
Cache Friendly Design
The increasing speed difference between processor and main memory makes cache and memory access patterns important factors for performance. The fact that modern processors have multiple levels of memory hierarchy is generally not reflected by most of the parallel models. As a result, few parallel algorithm studies have touched on the cache performance issue. The SMP model proposed by Helman and JáJá is the first effort to model the impact of memory access and cache on an algorithm's performance []. The model forces an algorithm designer to reduce the number of noncontiguous memory accesses. However, it does not give hints for the design of cache-friendly parallel algorithms.

Chiang et al. [] presented a PRAM simulation technique for designing and analyzing efficient external-memory (sequential) algorithms for graph problems. This technique simulates the PRAM memory by keeping a task array of O(N) on disk. For each PRAM step, the simulation sorts a copy of the contents of the PRAM memory based on the indices of the processors for which they will be operands, and then scans this copy and performs the computation for each processor being simulated. The following can be easily shown:

Theorem. Let A be a PRAM algorithm that uses N processors and O(N) space and runs in time T. Then A can be simulated in O(T · sort(N)) I/Os [].

Here, sort(N) represents the optimal number of I/Os needed to sort N items striped across the disks, and scan(N) represents the number of I/Os needed to read N items striped across the disks. Specifically,

sort(x) = (x / (DB)) · log_{M/B} (x / B)
scan(x) = x / (DB)

where M = the number of items that can fit into main memory, B = the number of items per disk block, and D = the number of disks in the system.

A similar technique can be applied to the cache-friendly parallel implementation of PRAM algorithms for large inputs. I/O-efficient algorithms exhibit good spatial locality behavior that is critical to good cache performance. Instead of having one processor simulate the PRAM step, p ≪ n processors may perform the simulation concurrently. The simulated PRAM implementation is expected to incur few cache block transfers between different levels. For small input sizes, it would not be worthwhile to apply this technique, as most of the data structures can fit into cache. As the input size increases, the cost to access memory becomes more significant, and applying the technique becomes beneficial.

Algorithmic Optimizations
For most problems, parallel algorithms are inherently more complicated than their sequential counterparts, incurring large overheads with many algorithm steps. Instead of lowering the asymptotic complexities, in many cases, reducing the constant factors improves performance. Cong and Bader demonstrate the benefit of such optimizations with their biconnected components algorithm [].

The algorithm eliminates edges that are not essential in computing the biconnected components. For any input graph, edges are first eliminated before the computation of biconnected components is done, so that at most min(m, n) edges are considered. Although applying the filtering algorithm does not improve the asymptotic complexity, in practice, the performance of the biconnected components algorithm can be significantly improved.

An edge e is considered nonessential for biconnectivity if removing e does not change the biconnectivity of the component to which it belongs. Filtering out nonessential edges when computing biconnected components (these edges are put back in later) yields performance advantages. The Tarjan–Vishkin algorithm (TV) centers on finding the equivalence relation R′∗_c [, ]. Of the three conditions for R′_c, it is trivial to check condition , which is for a tree edge and a non-tree edge. Conditions  and , however, are for two tree edges, and checking involves the computation of high and low values. To compute high and low, every non-tree edge of the graph is inspected, which is very time consuming when the graph is not extremely sparse. The fewer edges the graph has, the faster the Low-high step. Also, when building the auxiliary graph, fewer edges in the original graph mean a smaller auxiliary graph and faster Label-edge and Connected-components steps.

Combining the filtering algorithm for eliminating nonessential edges and TV, the new biconnected components algorithm runs in max(O(d), O(log n)) time with O(n) processors on CRCW PRAM, where d is the diameter of the graph. Asymptotically, the new algorithm is not faster than TV. In practice, however, parallel speedups up to  with  processors are achieved on a Sun Enterprise  using the filtering technique.

Implementation on Multithreaded Architectures
Graph algorithms have been observed to run well on multithreaded architectures such as the Cray MTA- [] and its successor, the Cray XMT. The Cray MTA [] is a flat, shared-memory multiprocessor system. All memory is accessible and equidistant from all processors. There is no local memory and no data caches. Parallelism, and not caches, is used to tolerate memory and synchronization latencies.

An MTA processor consists of  hardware streams and one instruction pipeline. Each stream can have up to  outstanding memory operations. Threads from the same or different programs are mapped to the streams by the runtime system. A processor switches among its streams every cycle, executing instructions from non-blocked streams in a fair manner. As long as one stream has a ready instruction, the processor remains fully utilized.

Bader et al. compared the performance of the list-ranking algorithm on SMPs and the MTA []. For list ranking, they used two classes of list to test the algorithms: Ordered and Random. Ordered places each element in the array according to its rank; thus, node i is in the ith position of the array and its successor is the node at position (i + ). Random places successive elements randomly in the array. Since the MTA maps contiguous logical addresses to random physical addresses, the layout in physical memory for both classes is similar, and the performance on the MTA is independent of order. This is in sharp contrast to SMP machines, which rank Ordered lists much faster than Random lists. On the SMP, there is a factor of – difference in performance between the best case (an ordered list) and the worst case (a randomly ordered list). On the ordered lists, the MTA is an order of magnitude faster than the SMP, while on the random list, the MTA is approximately  times faster.
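The two input classes can be generated directly from their description. The helper below is a small illustrative sketch; the successor-array encoding with -1 marking the tail is an assumption carried over from the earlier list-ranking sketch.

import random

def make_list(n, kind="ordered"):
    """Build (head, succ) for a length-n list in the two layouts used in the
    list-ranking experiments: 'ordered' places node i at position i with
    successor i+1; 'random' scatters successive list elements over random
    array positions."""
    if kind == "ordered":
        succ = list(range(1, n)) + [-1]
        head = 0
    else:
        pos = list(range(n))
        random.shuffle(pos)                # pos[k] = array slot of the k-th list node
        succ = [-1] * n
        for k in range(n - 1):
            succ[pos[k]] = pos[k + 1]
        head = pos[0]
    return head, succ

On a cache-based SMP the Ordered layout walks memory sequentially while the Random layout touches an essentially arbitrary location per hop, which is the source of the large gap reported above; the MTA's randomized physical address mapping makes the two layouts behave alike.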
Implementation on Distributed Memory Machines
As data partitioning and explicit communication are required, implementing highly irregular algorithms is hard on distributed-memory machines. As a result, although many fast theoretic algorithms exist in the literature, few experimental results are known. As for performance, the adverse impact of irregular accesses is magnified in the distributed-memory environment when memory requests served by remote nodes experience long network latency. Two studies have demonstrated reasonable parallel performance with distributed-memory machines [, ]. Both studies implement parallel breadth-first search (BFS), one on BlueGene/L [] and the other on CELL/BE []. The CELL architecture resembles a distributed-memory setting, as explicit data transfer is necessary between the local storage on an SPE and the main memory. Neither study establishes strong evidence for fast execution of parallel graph algorithms on distributed-memory systems. The individual CPU in BlueGene and the PPE of CELL are weak compared with other Power processors or the SPE. It is hard to establish a meaningful baseline to compare the parallel performance against. Indeed, in both studies, either only wall clock times or speedups compared with other reference architectures are reported.

Partitioned global address space (PGAS) languages such as UPC and X [, ] have been proposed recently that present a shared-memory abstraction to the programmer for distributed-memory machines. They allow the programmer to control the data layout and work assignment for the processors. Mapping shared-memory graph algorithms onto distributed-memory machines is straightforward with PGAS languages.

Performance-wise, a straightforward PGAS implementation of irregular graph algorithms does not usually achieve high performance due to the aggregate startup cost of many small messages. Cong, Almasi, and Saraswat presented their study on optimizing the UPC implementation of graph algorithms in []. They improve both the communication efficiency and the cache performance of the algorithm by improving the locality behavior.

Some Experimental Results
Greiner [] implemented several connected components algorithms using NESL on the Cray Y-MP/C and TMC CM-. Hsu, Ramachandran, and Dean [] also implemented several parallel algorithms for connected components. They report that their parallel code runs  times slower on a MasPar MP- than Greiner's results on the Cray, but Hsu et al.'s implementation uses one-fourth of the total memory used by Greiner's approach. Krishnamurthy et al. [] implemented a connected components algorithm (based on Shiloach–Vishkin []) for distributed-memory machines. Their code achieved a speedup of  using a -processor TMC CM- on graphs with underlying D and D regular mesh topologies, but virtually no speedup on sparse random graphs. Goddard, Kumar, and Prins [] implemented a connected components algorithm for a mesh-connected SIMD parallel computer, the -processor MasPar MP-. They achieve a maximum parallel speedup of less than two on a random graph with , vertices and about one million edges. For a random graph with , vertices and fewer than a half-million edges, the parallel implementation was slower than the sequential code.

Chung and Condon [] implemented a parallel minimum spanning tree (MST) algorithm based on Borůvka's algorithm. On a -processor CM-, for geometric graphs with , vertices and average degree  and graphs with fewer vertices but higher average degree, their code achieved a parallel speedup of about , on  processors, over the sequential Borůvka's algorithm, which was – times slower than their sequential Kruskal algorithm.

Dehne and Götz [] studied practical parallel algorithms for MST using the BSP model. They implemented a dense Borůvka parallel algorithm, on a -processor Parsytec CC-, that works well for sufficiently dense input graphs. Using a fixed-sized input graph with , vertices and , edges, their code achieved a maximum speedup of . using  processors for a random dense graph. Their algorithm is not suitable for sparse graphs.

Woo and Sahni [] presented an experimental study of computing biconnected components on a hypercube. Their test cases are graphs that retain  and % of the edges of the complete graphs, and they achieved parallel efficiencies up to . for these dense inputs. The implementation uses an adjacency matrix as input representation, and the size of the input graphs is limited to less than K vertices.

Bader and Cong presented their studies [, , ] of the spanning tree, minimum spanning tree, and biconnected components algorithms on SMPs. They achieved reasonable parallel speedups on large, sparse inputs compared with the best sequential implementations.

Bibliography
Adler M, Dittrich W, Juurlink B, Kutyłowski M, Rieping I () Communication-optimal parallel minimum spanning tree algorithms (extended abstract). In: SPAA': proceedings of the tenth annual ACM symposium on parallel algorithms and architectures. ACM, New York, pp –
Allen F, Almasi G et al () Blue Gene: a vision for protein science using a petaflop supercomputer. IBM Syst J ():–
Bader DA, Cong G () A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). In: Proceedings of the th international parallel and distributed processing symposium (IPDPS ), Santa Fe
Bader DA, Cong G () Fast shared-memory algorithms for computing the minimum spanning forest of sparse graphs. In: Proceedings of the th international parallel and distributed processing symposium (IPDPS ), Santa Fe
Bader DA, Cong G, Feo J () On the architectural requirements for efficient execution of graph algorithms. In: Proceedings of the  international conference on parallel processing, Oslo, pp –
Charles P, Donawa C, Ebcioglu K, Grothoff C, Kielstra A, Praun CV, Saraswat V, Sarkar V () X: an object-oriented approach to non-uniform cluster computing. In: Proceedings of the  ACM SIGPLAN conference on object-oriented programming systems, languages and applications (OOPSLA), San Diego, pp –
Chiang Y-J, Goodrich MT, Grove EF, Tamassia R, Vengroff DE, Vitter JS () External-memory graph algorithms. In: Proceedings of the  symposium on discrete algorithms, San Francisco, pp –
Chung S, Condon A () Parallel implementation of Borůvka's minimum spanning tree algorithm. In: Proceedings of the tenth international parallel processing symposium (IPPS'), Honolulu, pp –
Cole R, Vishkin U () Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms. In: STOC': proceedings of the eighteenth annual ACM symposium on theory of computing. ACM, New York, pp –
Cole R, Vishkin U () Approximate parallel scheduling. Part II: applications to logarithmic-time optimal graph algorithms. Inf Comput :–
Cong G, Bader DA () An experimental study of parallel biconnected components algorithms on symmetric multiprocessors (SMPs). In: Proceedings of the th international parallel and distributed processing symposium (IPDPS ), Denver
Cong G, Almasi G, Saraswat V () Fast PGAS implementation of distributed graph algorithms. In: Proceedings of the  ACM/IEEE international conference for high performance computing, networking, storage and analysis (SC), SC'. IEEE Computer Society, Washington, DC, pp –
CPI analysis on Power. Online, . http://www.ibm.com/developerworks/linux/library/pacpipower/index.html
Cray, Inc. () The CRAY MTA- system. www.cray.com/products/programs/mta_/
Culler DE, Dusseau AC, Martin RP, Schauser KE () Fast parallel sorting under LogP: from theory to practice. In: Portability and performance for parallel processing. Wiley, New York, pp – (Chap )
Dehne F, Götz S () Practical parallel algorithms for minimum spanning trees. In: Workshop on advances in parallel and distributed systems, West Lafayette, pp –
Eckstein DM () BFS and biconnectivity. Technical Report –, Department of Computer Science, Iowa State University of Science and Technology, Ames
Goddard S, Kumar S, Prins JF () Connected components algorithms for mesh-connected parallel computers. In: Bhatt SN (ed) Parallel algorithms: third DIMACS implementation challenge, – October . DIMACS series in discrete mathematics and theoretical computer science, vol . American Mathematical Society, Providence, pp –
Goodrich MT () Communication-efficient parallel sorting. In: STOC': proceedings of the twenty-eighth annual ACM symposium on theory of computing. ACM, New York, pp –
Greiner J () A comparison of data-parallel algorithms for connected components. In: Proceedings of the sixth annual symposium on parallel algorithms and architectures (SPAA-), Cape May, pp –
Helman DR, JáJá J () Designing practical efficient algorithms for symmetric multiprocessors. In: Algorithm engineering and experimentation (ALENEX'). Lecture notes in computer science, vol . Springer, Baltimore, pp –
Hofstee HP () Power efficient processor architecture and the cell processor. In: International symposium on high-performance computer architecture, San Francisco, pp –
Hsu T-S, Ramachandran V, Dean N () Parallel implementation of algorithms for finding connected components in graphs. In: Bhatt SN (ed) Parallel algorithms: third DIMACS implementation challenge, – October . DIMACS series in discrete mathematics and theoretical computer science, vol . American Mathematical Society, Providence, pp –
JáJá J () An introduction to parallel algorithms. Addison-Wesley, New York
Krishnamurthy A, Lumetta SS, Culler DE, Yelick K () Connected components on distributed memory machines. In: Bhatt SN (ed) Parallel algorithms: third DIMACS implementation challenge, – October . DIMACS series in discrete mathematics and theoretical computer science, vol . American Mathematical Society, Providence, pp –
Kruskal CP, Rudolph L, Snir M () Efficient parallel algorithms for graph problems. Algorithmica ():–
Paul WJ, Vishkin U, Wagener H () Parallel dictionaries in – trees. In: Tenth colloquium on automata, languages and programming (ICALP), Barcelona. Lecture notes in computer science. Springer, Berlin, pp –
Scarpazza DP, Villa O, Petrini F () Efficient breadth-first search on the Cell/BE processor. IEEE Trans Parallel Distrib Syst ():–
Shiloach Y, Vishkin U () An O(log n) parallel connectivity algorithm. J Algorithms ():–
Tarjan RE () Depth-first search and linear graph algorithms. SIAM J Comput ():–
Tarjan RE, Vishkin U () An efficient parallel biconnectivity algorithm. SIAM J Comput ():–
Unified Parallel C. http://en.wikipedia.org/wiki/Unified_Parallel_C
Woo J, Sahni S () Load balancing on a hypercube. In: Proceedings of the fifth international parallel processing symposium, Anaheim. IEEE Computer Society, Los Alamitos, pp –
Woo SC, Ohara M, Torrie E, Singh JP, Gupta A () The SPLASH- programs: characterization and methodological
considerations. In: Proceedings of the nd annual international symposium on computer architecture, pp –
Yoo A, Chow E, Henderson K, McLendon W, Hendrickson B, Çatalyürek ÜV () A scalable distributed parallel breadth-first search algorithm on BlueGene/L. In: Proceedings of supercomputing (SC ), Seattle

Graph Analysis Software
SNAP (Small-World Network Analysis and Partitioning) Framework

Graph Partitioning

Bruce Hendrickson
Sandia National Laboratories, Albuquerque, NM, USA

Definition
Graph partitioning is a technique for dividing work amongst processors to make effective use of a parallel computer.

Discussion
When considering the data dependencies in a parallel application, it is very convenient to use concepts from graph theory. A graph consists of a set of entities called vertices, and a set of pairs of entities called edges. The entities of interest in parallel computing are small units of computation that will be performed on a single processor. They might be the work performed to update the state of a single atom in a molecular dynamics simulation, or the work required to compute the contribution of a single row of a matrix to a matrix-vector multiplication. Each such work unit will be a vertex in the graph which describes the computation. If two units have a data dependence between them (i.e., the output of one computation is required as input to the other), then there will be an edge in the graph that joins the two corresponding vertices.

For a computation to perform efficiently on a parallel machine, each of the P processors needs to have about the same amount of work to perform, and the amount of inter-processor communication must be small. These two conditions can be viewed in terms of the computational graph. The vertices of the graph (signifying units of work) need to be divided into P sets with about the same number of vertices in each. Additionally, the number of edges that connect vertices in two different sets needs to be kept small, since these will reflect the need for interprocessor communication. This problem is known as graph partitioning and is an important approach to the parallelization of many applications.

More generally, the vertices of the graph can have weights associated with them, reflecting different amounts of computation, and the edges can also have weights corresponding to different quantities of communication. The graph partitioning problem involves dividing the set of vertices into P sets with about the same amount of total vertex weight, while keeping small the total weight of edges that cross between partitions. This problem is known to be NP-hard, but a number of heuristics have been devised that have proven to be effective for many parallel computing applications. Several software tools have been developed for this problem, and they are an important piece of the parallel computing ecosystem. Important algorithms and tools are discussed below.
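The two objectives just stated – balancing the (possibly weighted) vertices across the P sets while keeping the weight of crossing edges small – can be checked for any candidate partition in a few lines. The sketch below only evaluates a given assignment; it is not a partitioning heuristic, and the adjacency-list representation, the unit default weights, and the function name are assumptions made here for illustration.

def partition_quality(adj, part, P, vwgt=None):
    """adj: adjacency list of the computational graph;
    part[v]: the processor (0..P-1) that vertex v is assigned to;
    vwgt[v]: optional vertex weight (defaults to 1, i.e., equal work units).
    Returns (edge_cut, imbalance), where imbalance is the heaviest part
    divided by the average part weight."""
    vwgt = vwgt or [1] * len(adj)
    loads = [0] * P
    for v, w in enumerate(vwgt):
        loads[part[v]] += w
    cut = 0
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            if u < v and part[u] != part[v]:   # count each undirected edge once
                cut += 1
    avg = sum(loads) / P
    return cut, max(loads) / avg

A partitioner tries to drive the edge cut down while keeping the imbalance ratio close to 1.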
Parallel Computing Applications of Graph Partitioning
Graph partitioning is a useful technique for parallelizing many scientific applications. It is appropriate when the calculation consists of a series of steps in which the computational structure and data dependencies do not vary much. Under such circumstances the expense of partitioning is rewarded by improved parallel performance for many computational steps.

The partitioning model is most applicable for bulk synchronous parallel applications in which each step consists of local computation followed by a global data exchange. Fortunately, many if not most scientific applications exhibit this basic structure. Particle simulations are one such important class of applications. The particles could be atoms in a material science or biological simulation, stars in a simulation of galaxy formation, or units of charge in an electromagnetic application.

But by far the most common uses of graph partitioning involve computational meshes for the solution of differential equations. Finite volume, finite difference, and finite element methods all involve the decomposition of a complex geometry into simple shapes that interact only with near neighbors. Various graphs can be constructed from the mesh, and a partition of the graph identifies subregions to be assigned to processors. The numerical methods associated with such approaches are often very amenable to the graph partitioning approach. These ideas have been used to solve problems from many areas of computational science including fluid flow, structural mechanics, electromagnetics, and many more.

Graph Partitioning Algorithms for Parallel Computing
A wide variety of graph partitioning algorithms have been proposed for parallel computing applications. Here we review some of the more important approaches.

Geometric partitioning algorithms are very fast techniques for partitioning sets of entities that have an underlying geometry. For parallel computing applications involving simulations of physical phenomena in two or three dimensions, the corresponding data structures typically have geometric coordinates associated with each entity. Examples include molecules in atomistic simulations, masses in gravitational models, or mesh points in a finite element method. Recursive coordinate partitioning is a method in which the elements are recursively divided by planar cuts that are orthogonal to one of the axes []. This has the advantage of producing geometrically simple subdomains – just rectangular parallelepipeds. Recursive inertial bisection also uses planar cuts, but instead of being orthogonal to an axis, they are orthogonal to the direction of greatest inertia []. Intuitively, this is a direction in which the point set is elongated, so cutting perpendicular to this direction is likely to produce a smaller cut. Yet another alternative is to cut with circles or spheres instead of planes []. Geometric methods tend to be very fast, but produce low-quality partitions. They can be improved via local refinement methods like the approach of Fiduccia–Mattheyses discussed below.
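A minimal recursive coordinate partitioning sketch follows. It only illustrates the idea described above: cutting along the axis of greatest extent and splitting at the median are choices made here (recursive inertial bisection would instead cut orthogonally to the direction of greatest inertia), and the number of parts is assumed to be a power of two.

def recursive_coordinate_bisection(points, parts):
    """points: list of (x, y) or (x, y, z) coordinates; parts: number of
    subdomains, assumed to be a power of two. Returns a part id per point."""
    ids = [0] * len(points)
    def split(indices, first_part, nparts):
        if nparts == 1 or len(indices) <= 1:
            for i in indices:
                ids[i] = first_part
            return
        dim = len(points[0])
        # Cut orthogonal to the axis along which the point set is most spread out.
        axis = max(range(dim),
                   key=lambda d: max(points[i][d] for i in indices)
                               - min(points[i][d] for i in indices))
        indices.sort(key=lambda i: points[i][axis])
        mid = len(indices) // 2            # planar cut through the median
        split(indices[:mid], first_part, nparts // 2)
        split(indices[mid:], first_part + nparts // 2, nparts - nparts // 2)
    split(list(range(len(points))), 0, parts)
    return ids

Each planar cut halves both the point set and the number of remaining parts, so the recursion produces nearly equal-sized, box-shaped subdomains.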
A quite different set of approaches uses eigenvectors of a matrix associated with the graph. The most popular method in this class uses the second-smallest eigenvector of the Laplacian matrix of the graph []. A justification for this approach is beyond the scope of this article, but spectral methods generally produce partitions of fairly high quality. In a global sense, they find attractive regions for cutting a graph, but they are often poor in the fine details. This can be rectified by the application of a local refinement method. The main drawback of spectral methods is their high computational cost.

Local refinement methods are epitomized by the approach proposed by Fiduccia and Mattheyses [] (FM). This method works by iteratively moving vertices between partitions in a manner that maximally reduces the size of the set of cut edges. Moves are considered even if they make the cut size larger, since they may enable subsequent moves that lead to even better partitions. Thus, this method has a limited ability to escape from local minima to search for even better solutions. The key advance underlying FM is the use of clever data structures that allow all the moves and their consequences to be explored and updated efficiently. The FM algorithm is quite fast, and consistently improves results generated by other approaches. But since it only explores sets of partitions that are not far from the initial one, it is generally limited to making small changes and will not find better partitions that are quite different.

The most widely used class of graph partitioning techniques are multilevel algorithms, as they provide a good balance between speed and quality. They were independently invented by several research groups more or less simultaneously [, , ]. Multilevel algorithms work by applying a local refinement method like FM at multiple scales. This largely overcomes the myopia that limits the effectiveness of local methods. This is accomplished by constructing a series of smaller and smaller graphs that roughly approximate the original graph. The most common way to do this is to merge small clusters of vertices within the original graph (e.g., combine two vertices sharing an edge into a single vertex). Once this series of graphs is constructed, the smallest graph is partitioned using any global method. Then the partition is refined locally and extended to the next larger graph. The refinement/extension process is repeated on larger and larger graphs until a partitioning of the original graph has been produced.

A number of general-purpose global optimization approaches have been proposed for graph partitioning, including simulated annealing, genetic algorithms, and tabu search. These methods can produce high-quality partitions but are usually very expensive and so are limited to niche applications within parallel computing.

Graph partitioning is often used as a preprocessing step to set up a parallel computation. The output of a graph partitioner determines which objects are assigned to which processors, and appropriate input files and data structures are prepared for a parallel run. However, there are several situations in which the partitioning must be done in parallel. For a very large problem, the memory of a serial machine may be insufficient to hold the graph that needs to be partitioned. Also, for some classes of applications the structure of the computation changes over time. Examples include adaptive mesh simulations, or particle methods in which the particles move significantly. For such problems the work load must be periodically redistributed across the processors, and a parallel partitioning tool is required. Simple geometric algorithms have the advantage of being easy to parallelize, but multilevel partitioners have also been parallelized to provide higher-quality solutions. Techniques for effectively parallelizing such methods are an ongoing area of research.

A variety of open source graph partitioning tools have been developed in serial or parallel, including Chaco, METIS, Jostle, and SCOTCH. Several of these are discussed in companion articles.

Limitations of Graph Partitioning
Although widely used to enable the parallelization of scientific applications, graph partitioning is an imperfect abstraction. For a parallel application to perform well, the work must be evenly distributed among processors and the cost of interprocessor communication must be minimized. Graph partitioning provides only a crude approximation for achieving these objectives.

In the graph partitioning model, each vertex is assigned a weight that is supposed to represent the time required to perform a piece of computation. On modern processors with complex memory hierarchies it is very difficult to accurately predict the runtime of a piece of code a priori. Cache performance can dominate runtime, and this is very hard to predict in advance. So the weights assigned to vertices in the graph partitioning model are just rough approximations.

An even more significant shortcoming of graph partitioning has to do with communication. For most applications, a vertex has data that needs to be known by all of its neighbors. If two of those neighbors are owned by the same processor, then that data need only be communicated once. In the graph partitioning model, two edges would be cut, and so the actual volume of communication would be over-counted. Several alternatives to standard graph partitioning have been proposed to address this problem. In one approach, the number of vertices with off-processor neighbors is counted instead of the number of edges cut. A more powerful and elegant alternative uses hypergraphs and is sketched below.

Yet another deficiency of graph partitioning is that it emphasizes the total volume of communication. In many practical situations, latency is the performance-limiting factor, so it is the number of messages that matters most, not the size of messages.

As discussed above, graph partitioning is most appropriate for bulk synchronous applications. If the calculation involves complex interleaving of computations with communication or partial synchronizations, then graph partitioning is less useful. An important application with this character is the factorization of a sparse matrix.

Finally, graph partitioning is only appropriate for applications in which the work and communication pattern are predictable and stable. This happens to be the case for many important scientific computing kernels, but there are other applications that do not fit this model.

Hypergraph Partitioning
A hypergraph is a generalization of a graph. Whereas a graph edge connects exactly two vertices, a hyperedge can connect any subset of vertices. This seemingly simple generalization leads to improved and more general partitioning models for parallel computing [].

Consider a graph in which vertices represent computation and edges represent data dependencies. For each vertex, replace all the edges connected to it with a single hyperedge that joins the vertex and all of its neighbors. When the vertices are partitioned, if a particular vertex is separated from any of its neighbors, the corresponding hyperedge will be cut. For the common situation in which the vertex needs to communicate the same information to all of its neighbors, this single hyperedge will reflect the amount of data that needs to be shared with another processor. Thus, the number of cut hyperedges (or more generally the total weight of cut hyperedges) correctly captures the total volume of communication induced by a partitioning. In this way, the hypergraph model resolves an important shortcoming of standard graph partitioning.
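The difference between the two metrics can be made concrete. The sketch below compares the edge cut with a hypergraph-based volume count, using one hyperedge per vertex (the vertex together with its neighbors) as described above; the connectivity count used here reduces to the number of cut hyperedges when every cut hyperedge spans exactly two processors, which is the common case the entry describes. The representation and names are assumptions made for illustration.

def communication_metrics(adj, part):
    """Edge cut versus hypergraph-based communication volume for a partition.
    One hyperedge ("net") per vertex: the vertex together with its neighbors."""
    edge_cut = sum(1 for u, nbrs in enumerate(adj)
                     for v in nbrs if u < v and part[u] != part[v])
    volume = 0
    for v, nbrs in enumerate(adj):
        parts_touched = {part[v]} | {part[u] for u in nbrs}
        volume += len(parts_touched) - 1   # copies of v's data sent to remote processors
    return edge_cut, volume

For a vertex whose three neighbors all live on one other processor, the edge cut charges three units to that vertex while the volume term charges one, which is exactly the over-count described above.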
Hypergraphs also address a second deficiency of the graph model. If the communication is not symmetric (e.g., vertex i needs to send data to j, but j does not need to send data to i), then the graph model has difficulty capturing the communication requirements. The hypergraph model does not have this problem. A hyperedge simply spans a vertex i and every other vertex that i needs to send data to. There is no implicit assumption of symmetry in the construction of the hypergraph model.

Graph and hypergraph partitioning models, algorithms, and software continue to be active areas of research in parallel computing. PaToH and hMETIS are widely used hypergraph partitioning tools for parallel computing.

Related Entries
Chaco
Load Balancing, Distributed Memory
Hypergraph Partitioning
METIS and ParMETIS

Bibliographic Notes and Further Reading
Graph partitioning is a well-studied problem in theoretical computer science and is known to be difficult to solve optimally. For parallel computing, the challenge is to find algorithms that are effective in practice. Algebraic methods like Laplacian partitioning [] are an important class of techniques, but can be expensive. Local refinement techniques like FM are also important [], but get caught in local optima. Multilevel methods seem to offer the best trade-off between cost and performance [, , ].

Hypergraph partitioning provides an important alternative to graph partitioning in many instances []. A survey of different partitioning models can be found in the paper by Hendrickson and Kolda [].

Several good codes for graph partitioning are available on the Internet, including Chaco, METIS, PaToH, and Scotch.

Acknowledgment
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the US Department of Energy under contract DE-AC-AL.

Bibliography
Berger MJ, Bokhari SH () A partitioning strategy for nonuniform problems on multiprocessors. IEEE Trans Comput C-():–
Bui T, Jones C () A heuristic for reducing fill in sparse matrix factorization. In: Proceedings of the th SIAM conference on parallel processing for scientific computing, SIAM, Portsmouth, Virginia, pp –
Çatalyürek U, Aykanat C () Decomposing irregularly sparse matrices for parallel matrix-vector multiplication. In: Lecture notes in computer science , Proceedings Irregular', Springer-Verlag, Heidelberg, pp –
Cong J, Smith ML () A parallel bottom-up clustering algorithm with application to circuit partitioning in VLSI design. In: Proceedings of the th annual ACM/IEEE international design automation conference, DAC', ACM, San Diego, CA, pp –
Fiduccia CM, Mattheyses RM () A linear time heuristic for improving network partitions. In: Proceedings of the th ACM/IEEE design automation conference, ACM/IEEE, Las Vegas, NV, June , pp –
Hendrickson B, Kolda T () Graph partitioning models for parallel computing. Parallel Comput :–
Hendrickson B, Leland R () A multilevel algorithm for partitioning graphs. In: Proceedings of Supercomputing ', ACM, New York, December . Previous version published as Sandia Technical Report SAND–, Albuquerque, NM
Miller GL, Teng SH, Vavasis SA () A unified geometric approach to graph separators. In: Proceedings of the nd symposium on foundations of computer science, IEEE, Pittsburgh, PA, October , pp –
Simon HD () Partitioning of unstructured problems for parallel processing. In: Proceedings of the conference on parallel methods on large scale structural analysis and physics applications. Pergamon Press, Elmsford, NY

Graph Partitioning Software
Chaco
METIS and ParMETIS
PaToH (Partitioning Tool for Hypergraphs)

Graphics Processing Unit
NVIDIA GPU
Green Flash: Climate Machine (LBNL)

John Shalf, David Donofrio, Chris Rowen, Leonid Oliker, Michael Wehner
Lawrence Berkeley National Laboratory, Berkeley, CA, USA
CEO, Tensilica, Santa Clara, CA, USA

Synonyms
LBNL climate computer; Manycore; Tensilica; View from Berkeley

Definition
Green Flash is a research project focused on an application-driven manycore chip design that leverages commodity-embedded circuit designs and hardware/software codesign processes to create a highly programmable and energy-efficient HPC design. The project demonstrates how a multidisciplinary hardware/software codesign process that facilitates close interactions between applications scientists, computer scientists, and hardware engineers can be used to develop a system tailored for the requirements of scientific computing. By leveraging the efficiency gained from an application-driven design philosophy, advanced processor synthesis tools from Tensilica, FPGA-accelerated architectural simulation from RAMP, and auto-tuning for rapid optimization of the software implementation, the project demonstrated how a hardware/software codesign process can achieve a × increase in energy efficiency over its contemporaries using cost-effective commodity-embedded building blocks. To demonstrate the application-driven design process, Green Flash was tailored for high-resolution global cloud resolving models, which are the leading justification for exascale computing systems. However, the approach can be generalized to a broader array of scientific applications. As such, Green Flash represents a vision of a new design process that could be used to develop effective exascale-class HPC systems.

Discussion
Introduction
The scientific community is facing one of its greatest challenges in the prediction of global climate change – a question whose answer has staggering economic, political, and sociological ramifications. The computational power required to inform such critical policy decisions requires a new breed of extreme-scale computers to accurately model the global climate. The "business as usual" approach of using commercial off-the-shelf (COTS) hardware to build ever-larger clusters is increasingly unsustainable beyond the petaflop scale due to the constraints of power and cooling. Some estimates indicate an exaflop-capable machine would consume close to  MW of power. Such unreasonable power costs drive the need for a radically new approach to HPC system design. Green Flash is a theoretical system designed with an application-driven hardware and software codesign for HPC systems that leverages the innovative and low-power architectures and design processes of the low-power/embedded computing industry. Green Flash is the result of Berkeley Lab's research into energy-efficient system design – many details that are common to all system design, such as power, cooling, mechanical design, etc., are not addressed in this research, as they are not unique to Green Flash and are challenges that would need to be overcome regardless of system architecture. The work presented here represents the energy efficiency gained through application-tailored architectures that leverage embedded processors to build energy-efficient manycore processors.

History
In , a group of University of California researchers, with backgrounds ranging from circuit design, computer architecture, CAD, embedded hardware/software, programming languages, compilers, and applied math, to HPC, met for a period of two years to consider how current constraints on device physics at the silicon level would affect CPU design, system architecture, and programming models for future systems. The results of the discussions are documented in the University of California Berkeley Technical Report entitled "The Landscape of Parallel Computing Research: A View from Berkeley." This report was the genesis of the UC Berkeley ParLab, which was funded by Intel and Microsoft, as well as of the Green Flash project. Whereas the ParLab carried the work of the View from Berkeley forward for desktop and handheld applications, the Green Flash project took the same principles and applied them to the design of energy-efficient scientific computing systems.
Hardware/software codesign, a methodology that allows both software optimization and semi-specialized processor design to be developed simultaneously, has long been a feature of power-sensitive embedded system designs, but thus far has seen very little application in the HPC space. However, as power has become the leading design constraint of future HPC systems, codesign and other application-driven design processes have received considerably more attention. Green Flash leverages tools that were developed by Tensilica for rapid synthesis of application-optimized CPU designs, and retargets them to designing processors that are optimized for scientific applications. The project also created novel inter-processor communication mechanisms to enable an easier-to-program environment than its GPU contemporaries – providing hardware support for more natural programming environments based on partitioned global address space programming models.

Approach
It is widely agreed that architectural specialization can significantly improve efficiency; however, creating full-custom designs of HPC systems has often proven impractical due to excessive design/verification costs and lead-times. The embedded processor market relies on architectural customization to meet the demanding cost and power efficiency requirements of its products with a short turn-around time. With time to market a key element in profitability, sophisticated toolchains have been developed to enable rapid and cost-effective turn-around of power-efficient semicustom design implementations appropriate to each specific processor design. Green Flash leverages these same toolchains to design power-efficient exascale systems, tailoring embedded chips to target scientific applications and providing a feedback path from the application programmer to the hardware design, enabling a tight hardware/software codesign loop that is unprecedented in the HPC industry. Auto-tuning technologies are used to automate the software tuning process and maintain portability across the differing architectures produced inside the codesign loop. Auto-tuners can automatically search over a broad parameter space of optimizations to improve the computational efficiency of application kernels and help produce a more balanced architecture.
particular, resolution constraints on models of atmospheric processes do not allow clouds to be resolved, forcing model developers to rely on sub-grid scale parameterizations based on statistical methods. However, simulations of the recent past produce cloud distributions that do not agree well with observations. These disagreements, traceable to the cumulus convection parameterizations, lead to other errors in patterns of the Earth's radiation and moisture budgets. Current global atmospheric models have resolutions of order  km, obviously many times larger than individual clouds. Development of models at the limit of the validity of cumulus parameterization (∼ km) is now underway by a few groups, although the necessary century-scale integrations are just barely feasible on the largest current computing platforms. It is expected that many issues will be rectified by this increase in horizontal fidelity but that the fundamental limitations of cumulus parameterization will remain. The solution to this problem is to directly simulate cloud processes rather than attempt to model them statistically. At horizontal grid spacings of order ∼ km, cloud systems can be individually resolved, providing this direct numerical simulation. However, the computational burden of fluid dynamics algorithms scales nonlinearly with the number of grid points due to time step limitations imposed by numerical stability requirements. Hence, the computational resources necessary to carry out century-scale simulations of the Earth's climate dwarf any traditional machine currently under development.

Climate Model Requirements
Extrapolation from measured computational requirements of existing atmospheric models allows estimates of what would be necessary at resolutions of order  km to support Global Cloud Resolving Models. To better make these estimates, the Green Flash project has partnered with Prof. David Randall's group at Colorado State University (CSU). In their approach, the globe is represented by a mesh based on an icosahedron as the starting point. By successively bisecting the sides of the triangles making up this object, a remarkably uniform mesh on the sphere can be generated. However, this is not the only way to discretize the globe at this resolution, and it will be important to have a variety of independent cloud system–resolving models if projections of the future are to have any credibility. For this reason it is important to emphasize that Green Flash will not be built to run only this particular discretization. Rather, this approach calls for optimizing a system for a class of scientific applications; therefore, Green Flash will be able to efficiently run most global climate models.
Extrapolation based on today's cluster-based, general-purpose HPC systems produces estimates that the sustained computational rate necessary to simulate the Earth's climate , times faster than it actually occurs is  Pflops. A tentative estimate from the CSU model is as much as  Pflops. This difference can be regarded as one measure of the considerable uncertainty in making these estimates. As the CSU model matures, there will be the opportunity to determine this rate much more accurately. Multiple realizations of individual simulations are necessary to address the statistical complexities of the climate system. Hence, an exaflop-scale machine would be necessary to carry out this kind of science. The exact peak flop rate required depends greatly on the efficiency with which the machine can be used.
These enormous sustained computational rates are not even imaginable if there is not enough parallelism in the climate problem. Fortunately, cloud system resolving models at the kilometer scale do offer plenty of opportunity to decompose the physical domain. Bisection of the triangles composing the icosahedron twelve successive times produces a global mesh with ,, vertices spaced between  and  km apart. A logically rectangular two-dimensional domain decomposition strategy can be applied horizontally to the icosahedral grid. Choosing square segments of the mesh containing  grid points each ( × ) results in ,, horizontal domains. The vertical dimension offers additional parallelism. Assuming that  layers could be decomposed into  separate vertical domains, the total number of physical sub-domains could be ,,.
Twenty-one-million-way parallelism may seem mind-boggling, but this particular strawman decomposition was devised with practical constraints on the performance of an individual core in an SMP in mind. Each of the -million cores in this system will be assigned a small group of sub-domains on which it will execute the full (physics, dynamics, etc.) climate model. Flops per watt is the key performance metric for designing the SMP for Green Flash, and the goal of × energy efficiency over existing machines will be achieved by
tailoring the architecture to the needs of the climate model. One example that drives the need for a high core count per socket is the model's communication pattern. While the model is nearest-neighbor dominated, the majority of the more latency-sensitive communication occurs between the vertical layers. By keeping this communication on-chip, the more latency-tolerant horizontal communication can be sent off-chip with less performance penalty. Looking at per-core features, by running with a lower ( MHz) clock speed relative to today's server-class processors, Green Flash gains a cubic improvement in power consumption. Removal of processor features that are nonoptimal for science, such as Translation Look-aside Buffers (TLBs) and Out-of-Order (OOO) processing, creates a smaller die, which reduces leakage to help further reduce power.

Hardware Design Flow
Verification costs are quickly becoming the dominant force in custom hardware solutions. Large arrays of simple processors again hold a significant advantage here, as the work to verify a simple processing element and then replicate it on die is significantly lower. The long lead times and high costs of doing custom designs have generally dissuaded the HPC community from custom solutions and pushed it toward clusters of COTS (commercial off-the-shelf) hardware. The traditional definition of COTS in the HPC space is typically at the board or socket level; Green Flash seeks to redefine this notion of COTS and asserts that a custom processor made up of pre-verified building blocks can still be considered COTS hardware. This fine-grained view permits Green Flash to benefit from both the architectural specialization afforded by these specialized processing elements and the shorter lead times and reduced verification costs that come with using a building-block approach.
The constraints of power have long directed the development of embedded architectures, and so it is advantageous to begin with an embedded core and leverage the sophisticated tool chains developed to minimize the time from architectural specification to ASIC. These toolchains start with a collection of pre-verified functional units and allow them to be combined in a myriad of ways, rapidly producing power-efficient semi-custom designs. For instance, starting with a base architecture, a designer may wish to add floating-point support to a processor, or perhaps add a larger cache or local store. These functional units can be added to a processor design as easily as clicking a checkbox or dropdown menu. The tool will then select the correct functional unit from its library and integrate it into the design – all without the designer needing to intervene. These tools eliminate large amounts of not only boilerplate, but also full custom logic that once needed to be written and re-written in order to change a processor's architecture. Of course the tools are not boundless and are subject to the same design limitations as any other physical design process – for instance, one cannot efficiently have hundreds of read ports from a single memory – but the amount of flexibility created through these tools vastly outweighs any inherent limitations.
The rapid generation of processor cores alone makes these tools very interesting; however, the overhead of generating a usable software stack for each processor would negate the time saved developing the hardware. While adding caches or changing bus widths has little effect on the ISA, and therefore a minimal software impact, adding a new functional unit such as floating-point or integer division has a large impact on the software flow. Building custom hardware creates significant work not only in the creation of a potentially complex software stack but also in a time-consuming verification process. As with the software stack, without a method to jump-start the verification process the tools would begin to lose their effectiveness. To address both of these critical issues, these tools generate optimizing compilers and test benches, as well as a functional simulator, in parallel with the RTL for the design. Having the processor constructed of pre-verified building blocks, combined with the automatic generation of test benches, greatly reduces the risk and time required for formal verification. To help maintain backward and general-purpose compatibility, the processor's ISA is restricted to one that is functionally complete and allows for the execution of general-purpose code.

A Science-Optimized Processor Design
The processor design for Green Flash is driven by efficiency, and the best way to reduce power consumption and increase efficiency is to reduce waste. With that in mind, the target architecture calls for a very simple, in-order core with no branch prediction. The heavy memory and communication requirements demanded by the
climate model have imparted the greatest influence on the design of the Green Flash core. Building on prior work from Williams et al., where it was shown that for memory-intensive applications cores with a local store, such as Cell, were able to utilize a higher percentage of the available DRAM bandwidth, the target processor architecture includes a local store. In the Green Flash SMP design there will be two on-chip networks, as illustrated in Fig. . As can be expected, the majority of communication that occurs between sub-domains within the climate model is nearest neighbor. Building on work from Balfour and Dally [], a packet-switched network with a concentrated torus topology was chosen for the SMP, as it has been shown to provide superior performance and energy efficiency for codes where the dominant communication pattern is nearest neighbor.

Green Flash: Climate Machine (LBNL). Fig.  A concentrated torus network fabric yields the highest performance and most power efficient design for scientific codes (node schematic: Xtensa cores with caches and local stores, arbiters, DRAM, and links to the global network)

To further optimize the Green Flash processor for science, the programming model is being treated as a first-class citizen when designing the architecture. Traditional cache-coherent models found in many modern SMPs do not allow fine-grained synchronization between cores. In fact, benchmarking the current climate model on present-day machines shows that greater than % of execution time is spent in communication. By creating an architecture where an individual core does not pay a huge overhead penalty for sending or receiving a relatively small message, the amount of time spent in communication can be greatly reduced. The processing cores used in the Green Flash SMP have powerful, flexible streaming interfaces. Each processor can have multiple, designer-defined ports with a simple FIFO-like interface, with each port capable of sending and receiving a packet of data on each clock. This low-overhead streaming interface will bypass the cache and connect to one of the torus networks on chip. This narrow network can be used for the exchange of addresses, while the wider torus network is used for the exchange of data. Following a Partitioned Global Address Space (PGAS) model, the address space for each processor's local store is mapped into the global address space, and data exchange is done as a DMA from local store to local store. This allows the communication between processors to map very well to the MPI send/receive model used by the climate model and many other scientific codes. The view to the programmer will be as though all processors are directly connected to their neighbors. To further simplify programming, a traditional cache hierarchy is also in place to allow codes to be slowly ported to the more efficient local-store based interprocessor network. In order to minimize power, the use of photonic interlinks for the inter-core network is being investigated as an efficient method of transferring long messages. In the case of Green Flash, the data network is one cache line in width and will consist of several phases per message.

Hardware/Software Codesign Strategy
Conventional approaches to hardware design generally have a long latency between hardware design and software development/optimization, so designers frequently rely on benchmark codes to find a power-efficient architecture. However, modern compilers fail to generate even close to optimal code for target machines. Therefore, a benchmark-based approach to hardware design does not exploit the full performance potential of the architecture design points under consideration, leading to possibly sub-optimal hardware solutions. The success of auto-tuners has shown that it is still possible to generate efficient code using domain knowledge. In combination with the ability to rapidly produce semi-custom hardware designs, a tight, effective hardware/software codesign loop can be created. The codesign approach, as shown in Fig. , incorporates extensive software tuning into the process of hardware design. Hardware design space exploration is routinely done to tailor the hardware design parameters to the target applications. The auto-tuned software tailors the application to the hardware design point under consideration by empirically searching over a range of software implementations to find the best mapping of the software to the micro-architecture. One of the hindrances to the practical relevance of codesign is the large hardware/software design space exploration. Conventional hardware design approaches use software simulation of hardware to perform hardware design space exploration. Because codesign involves searching over a much larger design space (there is now a need to explore the software design space at each hardware design point), codesign is impractical if software simulation of hardware is used.
Rather than be constrained by the limitations of a software simulation environment, it is possible instead to take advantage of the processor generation
toolchain's ability to create synthesizable RTL for any given processor. By loading this design onto an FPGA, a potential processor design can be emulated running × faster than a functional simulator. This speedup allows the benchmarking of true applications rather than being forced to rely on representative code snippets or statically defined benchmarks. Furthermore, this speed advantage does not come at the expense of accuracy; to the contrary, FPGA emulation is arguably much more accurate than a software simulation environment, as it truly represents the hardware design. This fast, accurate emulation environment provides the ability to run and benchmark the actual climate model as it is being developed and allows the codesign infrastructure to quickly search a large design space.

Green Flash: Climate Machine (LBNL). Fig.  The result of combining an existing auto-tuning framework (a) with a rapid hardware design cycle and FPGA emulation is (b) a proposed hardware/software codesign flow

The speed and accuracy advantages of using FPGAs have typically been dwarfed by the increased complexity of coding in Verilog or VHDL versus C++ or Python, as well as by limits on the size of designs that can be emulated due to FPGA area/LUT count. The practicality of using FPGAs for large system emulation has increased dramatically over the past decade. The ability to access relatively large dynamic memories, such as DDR, has always been a difficult challenge with FPGAs due to the tight timing requirements. FPGA vendors, such as Xilinx, have eased this difficulty by providing IP support through its Memory Interface Generator (MIG) tool and adding IO features to the Virtex- series. Freely available Verilog IP libraries – whether Xilinx CoreGen or the RAMP group's GateLib – allow for a modular, building-block approach to hardware design. Finally, while commercial microprocessors are experiencing a plateau in their clock rates and power consumption, FPGAs are not. FPGA LUT counts continue to increase, allowing the emulation of more complex designs, and FPGA clocks, while traditionally significantly slower than commercial microprocessor clock rates, have been growing steadily, closing the gap between emulated and production clock rates. In the case of Green Flash, the relatively low target clock frequency ( MHz) of the final ASIC is an additional motivation to target an FPGA emulation environment. The current emulated processor design runs at  MHz – a significant fraction of the target clock rate. This relatively high speed enables the efficient benchmarking of an entire application rather than a representative portion.
While the steady growth in LUT count on FPGAs has enabled the emulation of more complex designs, with a strawman architecture of  cores per socket it is necessary to emulate more than the two or four cores that will fit on a single FPGA. To scale beyond the cores that will fit on a single FPGA, a multi-FPGA system, such as the Berkeley Emulation Engine (BEE), can be used. The BEE board has four Virtex- FPGAs connected in a ring with a cross-over connection. Each FPGA has access to two channels of DDR memory, allowing  GB of memory per FPGA. The BEE will allow effective emulation of eight cores with the appropriate NoC infrastructure per board. To scale beyond eight cores, the BEE includes  Gb connections allowing the boards to be linked, and emulation of an entire socket becomes possible. There is significant precedent for emulation of massively multithreaded architectures across multiple FPGAs. One recent example was demonstrated by the Berkeley RAMP Blue project, where over , cores were emulated using a stack of  BEE boards.

Hardware Support for New Programming Models
Applications and algorithms will need to rely increasingly on fine-grained parallelism, strong scaling, and fault resilience to accommodate the massive growth of explicit on-chip parallelism and constrained bandwidth anticipated for future chip architectures. History shows that the application-driven approach offers the most productive strategy for evaluating and selecting among the myriad choices for refactoring algorithms for full scientific application codes as the industry moves through this transitional phase. Green Flash functions as a testbed to explore novel programming models together with hardware support to express fine-grained parallelism to achieve performance, productivity, and correctness for leading-edge application codes in the face of massive parallelism and increasingly hierarchical hardware. The goal of this development thrust is to create a new software model that can provide a stable platform for software development for the next decade and beyond for all scales of scientific computing.
The Green Flash design created direct hardware support for both the message passing interface (MPI) and partitioned global address space (PGAS) programming models to enable scaling of these familiar single program, multiple data (SPMD) programming styles to much larger-scale systems. The modest hardware support enables relatively well-known programming paradigms to utilize massive on-chip concurrency and to use hierarchical parallelism to enable the use of larger messages for interchip communication.
However, not all applications will be able to express parallelism through simple divide-and-conquer problem partitioning. So the message queues and software-managed memories that are used to implement PGAS are also being used to explore new asymmetric and asynchronous approaches to achieving strong-scaling performance improvements from explicit parallelism. Techniques that resemble classic static dataflow methods are garnering renewed interest because of their ability to flexibly schedule work and to accommodate state migration to correct load imbalances and failures. In the case of the climate code, dataflow techniques can be used to concurrently schedule the physics computations
with the dynamic core of the climate code, thereby doubling the effective concurrency without moving to a finer domain decomposition. This approach also benefits from the unique interprocessor communication interfaces developed for Green Flash.

Fault Resilience
A question that comes up when proposing a -million processor computing system is how to deal with fault resilience. While trying not to trivialize the issue, it should be noted that this is a problem for everyone designing large-scale machines. The proposed approach of using many simpler cores does not introduce any unique challenges that are different from the challenges faced when aggregating conventional server chips into large-scale systems, provided the total number of discrete chips in the system is not dramatically different. The following observations are made to qualify this point.
For a given silicon process (e.g., a  nm process and the same design rules):
1. Hard failure rates are primarily proportional to the number of sockets in a system (e.g., solder joint failures, weak wire bonds, and a variety of mechanical and electrical issues) and secondarily related to the total chip surface area (probability of defect vs. tolerance of the design rules to process variation). They are not proportional to the number of processor cores per se, given that cores come in all shapes and sizes.
2. Soft error rates caused by cosmic rays are roughly proportional to chip surface area when comparing circuits that employ the same process technology (e.g.,  nm).
3. Bit error rates for data transfer tend to increase proportionally with clock rate.
4. Thermal stress is also a source of hard errors.
For hard errors:
1. Spare cores can be designed into each ASIC to tolerate defects due to process variation. This approach is already used by the -core Cisco Metro chip, which incorporates  spare cores ( cores in total) to cover chip defects.
2. Each chip is expected to dissipate a relatively small – W (or that is the target), subjecting it to less mechanical/thermal stress.
3. It has been demonstrated that Green Flash can achieve more delivered performance out of fewer sockets, which reduces exposure to hard failures due to bad electrical connections or other mechanical/electrical defects.
4. Like BlueGene, memory and CPUs can be flow-soldered onto the board to reduce hard and soft failure rates for electrical connections, given that removable sockets are far more susceptible to both kinds of faults. So eliminating removable sockets can greatly reduce error rates.
For soft errors:
1. All of the basics for reliability and error recovery in the memory subsystem, including full ECC (error correcting code) protection for caches and memory interfaces, are included in the design.
2. Using many simpler cores allows fewer sockets to be used and less silicon surface area to achieve the same delivered performance. That is to say, Green Flash has less exposure to major sources of failure than a conventional high-frequency core design: fewer sockets and fewer random bit-flips due to mechanical noise and other stochastic error sources.
3. The core clock frequency of  MHz improves the signal-to-noise ratio for on-chip data transfers.
4. Incorporation of a Nonvolatile Random Access Memory (NVRAM) memory controller and channel on each System on Chip (SoC). Each node can copy the image of memory to the NVRAM periodically to do local checkpoints. If there is a soft error (e.g., an uncorrectable memory error), then the node can initiate a roll-back to the last available checkpoint. For hard failures (e.g., a node dies and cannot be revived), the checkpoint image will be copied to neighboring nodes on a periodic basis to facilitate localized state recovery. Both strategies enable much faster roll-back when errors are encountered than the conventional user-space checkpointing approach.
Therefore, the required fault resilience strategies will bear similarity to those of other systems that employ a similar number of sockets (∼,), which are not unprecedented. The BlueGene system at Lawrence Livermore National Laboratory contains a comparable number of sockets, and achieves a – day Mean Time Between Failures (MTBF), which is far longer than systems that contain a fraction of the number of processor cores. Therefore, careful application of well-known fault-resilience techniques together with a few
novel "extended fault resilience" mechanisms such as localized NVRAM checkpoints can achieve an acceptable MTBF for extreme-scale implementations of this approach to system design.

Conclusions
Green Flash proposes a radical approach of application-driven computing design to break through the slow pace of incremental changes, and to foster a sustainable hardware/software ecosystem with broad-based support across the IT industry. Green Flash has enabled the exploration of practical advanced programming models together with lightweight hardware support mechanisms that allow programmers to utilize massive on-chip concurrency, thereby creating the market demand for massively concurrent components that can also be the building block of midrange and extreme-scale computing systems. New programming models must be part of a new software development ecosystem that spans all scales of systems, from midrange to the extreme scale, to facilitate a viable migration path from development to large-scale production computing systems. The use of FPGA-based hardware emulation platforms, such as RAMP, to prototype and run hardware designs at near-realtime speeds before they are built allows testing of full-fledged application codes and advanced software development to commence many years before the final hardware platform is constructed. These tools have enabled a tightly coupled software/hardware codesign process that can be applied effectively to the complex HPC application space.
Rather than ask "what kind of scientific applications can run on our HPC cluster after it arrives," the question should be turned around to ask "what kind of system should be built to meet the needs of the most important science problems." This approach is able to realize its most substantial gains in energy efficiency by peeling back the complexity of the high-frequency microprocessor design point to reduce sources of waste (wasted opcodes, wasted bandwidth, waste caused by orienting architectures toward serial performance). BlueGene and SiCortex have demonstrated the advantages of using simpler low-power embedded processing elements to create energy-efficient computing platforms. However, the Green Flash codesign approach goes beyond the traditional embedded core design point of BlueGene and SiCortex by using explicit message queues and software-controlled memories to further optimize data movement, while still retaining a smaller conventional cache hierarchy only to support incremental porting to the more energy- and bandwidth-efficient design point. Furthermore, simple hardware support for lightweight on-chip interprocessor synchronization and communication makes it much simpler, more straightforward, and more efficient to program massive arrays of processors than more exotic programming models such as CUDA and Streaming.
Green Flash has been a valuable research vehicle to understand how the evolution of massively parallel chip architectures can be guided by close-coupled feedback with the design of the application, algorithms, and hardware together. Application-driven design ensures hardware design decisions do not evolve in reaction to hardware constraints, without regard to programmability and delivered application performance. The design study has been driven by a deep dive into the climate application space, but enables explorations that cut across all application areas and have ramifications for the next generation of fully general-purpose architectures. Ultimately, Green Flash should consist of an architecture that can maximally leverage reusable components from the mass market of the embedded space while improving the programmability for the many-core design point. The building blocks of a future HPC system must be the preferred solution in terms of performance and programmability for everything from the smallest high-performance energy-efficient embedded system, to midrange departmental systems, to the largest-scale systems.

Bibliography
1. Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA () The landscape of parallel computing research: a view from Berkeley. Technical report no. UCB/EECS--, EECS Department, University of California, Berkeley
2. Wehner M, Oliker L, Shalf J () Towards ultra-high resolution models of climate and weather. Int J High Perform Comput Appl :–
3. Shalf J () The new landscape of parallel computer architecture. J Phys: Conf Ser :
4. Donofrio D, Oliker L, Shalf J, Wehner MF, Rowen C, Krueger J, Kamil S, Mohiyuddin M () Energy-efficient computing for extreme-scale science. IEEE Computer, Los Alamitos
5. Wehner M, Oliker L, Shalf J () Low-power supercomputers. IEEE Spectrum ():–
6. Shalf J, Wehner M, Oliker L, Hules J () Green flash project: the challenge of energy-efficient HPC. SciDAC Review, Fall
7. Kamil SA, Shalf J, Oliker L, Skinner D () Understanding ultra-scale application communication requirements. In: IEEE international symposium on workload characterization (IISWC), Austin, – Oct  (LBNL-)
8. Kamil S, Chan Cy, Oliker L, Shalf J, Williams S () An auto-tuning framework for parallel multicore stencil computations. In: IPDPS , Atlanta
9. Hendry G, Kamil SA, Biberman A, Chan J, Lee BG, Mohiyuddin M, Jain A, Bergman K, Carloni LP, Kubiatowicz J, Oliker L, Shalf J () Analysis of photonic networks for a chip multiprocessor using scientific applications. In: NOCS , San Diego
10. Mohiyuddin M, Murphy M, Oliker L, Shalf J, Wawrzynek J, Williams S () A design methodology for domain-optimized power-efficient supercomputing. In: SC , Portland
11. Balfour J, Dally WJ () Design tradeoffs for tiled CMP on-chip networks. In: ICS ': Proceedings of the th annual international conference on supercomputing

Grid Partitioning
Domain Decomposition

Gridlock
Deadlocks

Group Communication
Collective Communication

Gustafson's Law
John L. Gustafson
Intel Labs, Santa Clara, CA, USA

Synonyms
Gustafson–Barsis Law; Scaled speedup; Weak scaling

Definition
Gustafson's Law says that if you apply P processors to a task that has serial fraction f, scaling the task to take the same amount of time as before, the speedup is

Speedup = f + P(1 − f) = P − f(P − 1).

It shows more generally that the serial fraction does not theoretically limit parallel speed enhancement, if the problem or workload scales in its parallel component. It models a different situation from that of Amdahl's Law, which predicts time reduction for a fixed problem size.

Discussion
Graphical Explanation
Figure  explains the formula in the Definition: The time the user is willing to wait to solve the workload is unity (lower bar). The part of the work that is observably serial, f, is unaffected by parallelization. The remaining fraction of the work, 1 − f, parallelizes perfectly, so that a serial processor would take P times longer to execute it. The ratio of the top bar to the bottom bar is thus f + P(1 − f). Some prefer to rearrange this algebraically as P − f(P − 1).

Gustafson's Law. Fig.  Graphical derivation of Gustafson's Law (the lower bar is the present execution time, of unit length, split into a serial fraction f and a parallel fraction 1 − f; the upper bar is the time required if only serial processing were available, of length f + P(1 − f), since the parallel fraction would have to be executed in P serial stages)

The diagram resembles the one used in the explanation of Amdahl's Law (see Amdahl's Law) except that Amdahl's Law fixes the problem size and answers the question of how parallel processing can reduce the execution time. Gustafson's Law fixes the run time and answers the question of how much longer the present workload would take in the absence of parallelism []. In both cases, f is the experimentally observable fraction of the current workload that is serial.
The similarity of the diagram to the one that explains Amdahl's Law has led some to "unify" the two laws by a change of variable. It is an easy algebraic exercise to set the upper bar to unit time and express the f of Gustafson's Law in terms of the variables of Amdahl's Law, but this misses the point that the two laws proceed from different premises. Every attempt at unification begins by applying the same premise, resulting in a circular argument that the two laws are the same.
The fundamental underlying observation of Gustafson's Law is that more powerful computer systems usually solve larger problems, not the same size problem in less time. Hence, a performance enhancement like parallel processing expands what a user can do with a computing system to match the time the user is willing to wait for the answer. While computing power has increased by many orders of magnitude over the last

half-century (see Moore's Law), the execution time for problems of interest has been constant, since that time is tied to human timescales.

History
In a  conference debate over the merits of parallel computing, IBM's Gene Amdahl argued that a considerable fraction of the work of computers was inherently serial, from both algorithmic and architectural sources. He estimated the serial fraction f at about .–.. He asserted that this would sharply limit the approach of parallel processing for reducing execution time []. Amdahl argued that even the use of two processors was less cost-effective than a serial processor. Furthermore, the use of a large number of processors would never reduce execution time by more than 1/f, which by his estimate was a factor of about –.
Despite many efforts to find a flaw in Amdahl's argument, "Amdahl's Law" held for over  years as justification for the continued use of serial computing hardware and serial programming models.

The Rise of Microprocessor-Based Systems
By the late s, microprocessors and dynamic random-access memory (DRAM) had dropped in price to the point where academic researchers could afford them as components in experimental parallel designs. Work in  by Charles Seitz at Caltech using a message-passing collection of  microprocessors [] showed excellent absolute performance in terms of floating-point operations per second, and seemed to defy Amdahl's pessimistic prediction. Seitz's success led John Gustafson at FPS to drive development of a massively parallel cluster product with backing from the Defense Advanced Research Projects Agency (DARPA). Although the largest configuration actually sold of that product (the FPS T Series) had only  processors, the architecture permitted scaling to , processors. The large number of processors led many to question: What about Amdahl's Law? Gustafson formulated a counterargument in April , which showed that performance is a function of both the problem size and the number of processors, and thus Amdahl's Law need not limit performance. That is, the serial fraction f is not a constant but actually decreases with increased problem size. With no experimental evidence to demonstrate the idea, the counterargument had little impact on the computing community.
An idea for a source of experimental evidence arose in the form of a challenge that Alan Karp had publicized the year before []. Karp had seen announcements of the ,-processor CM- from Thinking Machines and the ,-processor NCUBE from nCUBE, and believed Amdahl's Law made it unlikely that such massively parallel computers would achieve a large fraction of their rated performance. He published a skeptical challenge and a financial reward for anyone who could demonstrate a parallel speedup of over  times on
three real applications. Karp suggested computational fluid dynamics, structural analysis, and econometric modeling as the three application areas and gave some ground rules to ensure that entries focused on honest parallel speedup without tricks or workarounds. For example, one could not cripple the serial version to make it artificially  times slower than the parallel system. And the applications, like the three suggested, had to be ones that had interprocessor communication throughout their execution, as opposed to "embarrassingly parallel" problems that had communication only at the beginning and end of a run.
By , no one had met Karp's challenge, so Gordon Bell adopted the same set of rules and suggested applications as the basis for the Gordon Bell Award, softening the goal from  times to whatever the best speedup developers could demonstrate. Bell expected the initial entries to achieve about tenfold speedup [].
The purchase by Sandia National Laboratories of the first ,-processor NCUBE  system created the opportunity for Gustafson to demonstrate his argument on the experiment outlined by Karp and Bell, so he joined Sandia and worked with researchers Gary Montry and Robert Benner to demonstrate the practicality of high parallel speedup. Sandia had real applications in fluid dynamics and structural mechanics, but none in econometric modeling, so the three researchers substituted a wave propagation application. With a few weeks of tuning and optimization, all three applications were running at over -fold speedup with the fixed-size Amdahl restriction, and over ,-fold speedup with the scaled model proposed by Gustafson. Gustafson described his model to Sandia Director Edwin Barsis, who suggested explaining scaled speedup using a graph like that shown in Fig. .

Gustafson's Law. Fig.  Speedup possible with  processors, by Gustafson's Law and Amdahl's Law (speedup when execution time is fixed (Gustafson) and speedup when problem size is fixed (Amdahl), plotted against the observable parallel fraction of the existing workload)

Barsis also insisted that Gustafson publish this concept, and is probably the first person to refer to it as "Gustafson's Law." With the large experimental speedups combined with the alternative model, Communications of the ACM published the results in May  []. Since Gustafson credited Barsis with the idea of expressing the scaled speedup model as graphed in Fig. , some refer to Gustafson's Law as the Gustafson–Barsis Law. The three Sandia researchers published the detailed explanation of the application parallelizations in a Society of Industrial and Applied Mathematics (SIAM) journal [].

Parallel Computing Watershed
Sandia's announcement of ,-fold parallel speedups created a sensation that went well beyond the computing research community. Alan Karp announced that the Sandia results had met his Challenge, and Gordon Bell gave his first award to the three Sandia researchers. The results received publicity beyond that of the usual technical journals, appearing in TIME, Newsweek, and the US Congressional Record. Cray, IBM, Intel, and Digital Equipment began work in earnest developing commercial computers with massive amounts of parallelism for the first time.
The Sandia announcement also created considerable controversy in the computing community, partly because some journalists sensationalized it as a proof that Amdahl's Law was false or had been "broken." This was never the intent of Gustafson's observation. He maintained that Amdahl's Law was the correct answer but to the wrong question: "How much can parallel processing reduce the run time of a current workload?"

Observable Fraction and Scaling Models
As part of the controversy, many maintained that Amdahl's Law was still the appropriate model to use in all situations, or that Gustafson's Law was simply
a corollary to Amdahl's Law. For scaled speedup, the argument went that one simply works backward to determine what the f fraction in Amdahl's Law must have been to yield such performance. This is an example of circular reasoning, since the proof that Amdahl's Law applies begins by assuming it applies.
For many programs, it is possible to instrument and measure the fraction of time f spent in serial execution. One can place timers in the program around serial regions and obtain an estimate of f. This fraction then allows Amdahl's Law estimates of time reduction, or Gustafson's Law estimates of scaled speedup. Neither law takes into account communication costs or intermediate degrees of parallelism. (When communication costs are included in Gustafson's fixed-time model, the speedup is again limited as the number of processors grows, because communication costs rise to the point where there is no way to increase the size of the amount of work without increasing the execution time.)
A more common practice is to measure the parallel speedup as the number of processors is varied, and fit the resulting curve to derive f. This approach confuses serial fraction with communication overhead, load imbalance, changes in the relative use of the memory hierarchy, and so on. Some refer to the requirement to keep the problem size the same yet use more processors as "strong scaling." Still, a common phenomenon that results from "strong scaling" is that it is easier, not harder, to obtain high amounts of speedup. When spreading a problem across more and more processors, the memory per processor goes down to the point where the data fits entirely in cache, resulting in superlinear speedup []. Sometimes, the superlinear speedup effects and the communication overheads partially cancel out, so what appears to be a low value of f is actually the result of the combination of the two effects. In modern parallel systems, performance analysis with either Amdahl's Law or Gustafson's Law will usually be inaccurate, since communication costs and other parallel processing phenomena have large effects on the speedup.
In Fig. , Amdahl's Law governs the Fixed-Size Model line, Gustafson's Law governs the Fixed-Time Model line, and what some call the Sun–Ni Law governs the Memory Scaled Model []. The fixed-time model line is an irregular curve in general, because of the communication cost effects and because the percentage of the problem that is in each memory tier (mass storage, main RAM, levels of cache) changes with the use of more processors.

Gustafson's Law. Fig.  Different scaling types and communication costs (log of problem size versus log of number of processors, showing the fixed size model, fixed time model, and memory scaled model, bounded by insufficient main memory above and communication-bound behavior below)

Analogies
There are many aspects of technology where an enhancement for time reduction actually turns out to be an enhancement for what one can accomplish in the same time as before. Just as Amdahl's Law is an expression of the more general Law of Diminishing Returns, Gustafson's Law is an expression of the more general observation that technological advances are used to improve what humans accomplish in the length of time they are accustomed to waiting, not to shorten the waiting time.

Commuting Time
As civilization has moved from walking to horses to mechanical transportation, the average speed of getting to and from work every day has gone up dramatically. Yet, people take about half an hour to get to or from work as a tolerable fraction of the day, and this amount of time is probably similar to what it has been for centuries. Cities that have been around for hundreds or thousands of years show a concentric pattern that reflects the increasing distance people could commute for the amount of time they were able to tolerate.
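As a concrete illustration of the two formulas and the observable serial fraction f discussed above, the following minimal Python sketch computes both estimates from an instrumented value of f: the fixed-size (Amdahl) estimate of time reduction and the fixed-time (Gustafson) estimate of scaled speedup. The function names and the sample values of f and P are illustrative choices, not part of the original presentation.

```python
def amdahl_speedup(f, P):
    # Fixed-size model: the same workload is spread over P processors.
    # The serial fraction f is unchanged; only the parallel part shrinks.
    return 1.0 / (f + (1.0 - f) / P)

def gustafson_speedup(f, P):
    # Fixed-time model: the workload is scaled so the run time stays constant.
    # Speedup = f + P * (1 - f) = P - f * (P - 1).
    return f + P * (1.0 - f)

if __name__ == "__main__":
    P = 1024                      # illustrative processor count
    for f in (0.5, 0.1, 0.01):    # illustrative measured serial fractions
        print(f"f = {f:5.2f}   Amdahl = {amdahl_speedup(f, P):8.1f}   "
              f"Gustafson = {gustafson_speedup(f, P):9.1f}")
```

Even for a modest serial fraction such as f = 0.1, the fixed-size estimate saturates near 1/f ≈ 10, while the fixed-time estimate stays within about 10% of P; that contrast is exactly what the two laws formalize.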
Transportation provides many analogies for Gustafson's Law that expose the fallacy of fixing the size of a problem as the control variable in discussing large performance gains. A commercial jet might be able to travel  miles per hour, yet if one asks "How much will it reduce the time it takes me presently to walk to work and back?" the answer would be that it does not help at all. It would be easy to apply an Amdahl-type argument to the time to travel to an airport as the serial fraction, such that the speedup of using a jet only applies to the remaining fraction of the time and thus is not worth doing. However, this does not mean that commercial jets are useless for transportation. It means that faster devices are for larger jobs, which in this case means longer trips.
Here is another transportation example: If one takes a trip at  miles per hour and immediately turns around, how fast does one have to go to average  miles per hour? This is a trick question that many people incorrectly answer, " miles per hour." To average  miles per hour, one would have to travel back at infinite speed, that is, return instantly. Amdahl's Law applies to this fixed-distance trip. However, suppose the question were posed differently: "If one travels for an hour at  miles per hour, how fast does one have to travel in the next hour to average  miles per hour?" In that case, the intuitive answer of " miles per hour" is the correct one. Gustafson's Law applies to this fixed-time trip.

The US Census
In the early debates about scaled speedup, Heath and Worley [] provided an example of a fixed-sized problem that they said was not appropriate for Gustafson's Law and for which Amdahl's Law should be applied: the US Census. While counting the number of people in the USA would appear to be a fixed-sized problem, it is actually a perfect example of a fixed-time problem, since the Constitution mandates a complete headcount every  years. It was in the late nineteenth century, when Hollerith estimated that the population had grown to the point where existing approaches would take longer than  years, that he developed the card punch tabulation methods that made the process fast enough to fit the fixed-time budget.
With much faster computing methods now available, the Census process has grown to take into account many more details about people than the simple head count that the Constitution mandates. This illustrates a connection between Gustafson's Law and the jocular Parkinson's Law: "Work expands to fill the available time."

Printer Speed
In the s, when IBM and Xerox were developing the first laser printers that could print an entire page at a time, the goal was to create printers that could print several pages per second so that printer speed could match the performance improvements of computing speed. The computer printouts of that era were all of monospaced font with a small character set of uppercase letters and a few symbols. Although many laser printer designers struggled to produce such simple output with reduced time per page, the product category evolved to produce high-quality output for desktop publishing instead of using the improved technology for time reduction. People now wait about as long for a page of printout from a laser printer as they did for a page of printout from the line printers of the s, but the task has been scaled up to full-color, high-resolution printing encompassing graphics output and a huge collection of typeset fonts from alphabets in all the world's languages. This is an example of Gustafson's Law applied to printing technology.

Biological Brains
Kevin Howard, of Massively Parallel Technologies Inc., once observed that if Amdahl's Law governed the behavior of biological brains, then a human would have about the same intelligence as a starfish. The human brain has about  billion neurons operating in parallel, so for us to avoid passing the point of diminishing returns for all that parallelism, the Amdahl serial fraction f would have to be about −. The fallacy of this seeming paradox is in the underlying assumption that a human brain must do the same task a starfish brain does, but must reduce the execution time to nanoseconds. There is no such requirement, and a human brain accomplishes very little in a few nanoseconds no matter how many neurons it uses at once. Gustafson's Law says that on a time-averaged basis, the human brain will accomplish vastly more complex tasks than what a starfish can attempt, and thus avoids the absurd conclusion of the fixed-task model.
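The limiting effect of communication costs on the fixed-time model, mentioned parenthetically above, can also be sketched numerically. The cost model in the following Python sketch — a share of the fixed time budget that grows with the square root of the processor count — is an arbitrary illustrative assumption chosen only to make the saturation visible; it is not taken from this entry or from any particular machine, and the sample values of f and alpha are likewise hypothetical.

```python
import math

def scaled_speedup_with_comm(f, P, alpha):
    """Fixed-time (Gustafson) speedup with an assumed communication term.

    The run time is held at 1. A fraction f of the budget is serial, a
    fraction c = alpha * sqrt(P) is assumed to be consumed by
    communication, and the rest of the budget is filled with perfectly
    parallel work on each of the P processors. With alpha = 0 this
    reduces to the ideal scaled speedup f + P * (1 - f).
    """
    c = alpha * math.sqrt(P)
    parallel_share = max(0.0, 1.0 - f - c)
    return f + P * parallel_share

if __name__ == "__main__":
    f, alpha = 0.01, 0.001   # illustrative values only
    for P in (1024, 65536, 262144, 1048576):
        ideal = f + P * (1.0 - f)
        real = scaled_speedup_with_comm(f, P, alpha)
        print(f"P = {P:8d}   ideal = {ideal:10.0f}   with communication = {real:10.0f}")
```

With these assumed values the scaled speedup falls progressively further behind the ideal curve as the communication share grows, and collapses entirely once that share consumes the whole time budget — the behavior described above, where adding processors no longer lets the amount of work grow without increasing the execution time.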
Perspective
The concept of scaled speedup had a profound enabling effect on parallel computing, since it showed that simply asking a different question (and perhaps a more realistic one) renders the pessimistic predictions of Amdahl's Law moot. Gustafson's  announcement of ,-fold parallel speedup created a turning point in the attitude of computer manufacturers towards massively parallel computing, and now all major vendors provide platforms based on the approach. Most (if not all) of the computer systems in the TOP list of the world's fastest supercomputers are comprised of many thousands of processors, a degree of parallelism that computer builders regarded as sheer folly prior to the introduction of scaled speedup in .
A common assertion countering Gustafson's Law is that "Amdahl's Law still holds for scaled speedup; it's just that the serial fraction is a lot smaller than had been previously thought." However, this requires inferring the small serial fraction from the measured speedup. This is an example of circular reasoning, since it involves choosing a conclusion, then working backward to determine the data that make the conclusion valid. Gustafson's Law is a simple formula that predicts scaled performance from experimentally measurable properties of a workload.
Some have misinterpreted "scaled speedup" as simply increasing the amount of memory for variables, or increasing the fineness of a grid. It is more general than this. It applies to every way in which a calculation can be improved somehow (accuracy, reliability, robustness, etc.) with the addition of more processing power, and then asks how much longer the enhanced problem would have taken to run without the extra processing power.
Horst Simon, in his  keynote talk at the International Conference on Supercomputing, "Progress in Supercomputing: The Top Three Breakthroughs of the Last  Years and the Top Three Challenges for the Next  Years," declared the invention of Gustafson's scaled speedup model as the number one achievement in high-performance computing since .

Related Entries
Amdahl's Law
Distributed-Memory Multiprocessor
Metrics

Bibliographic Entries and Further Reading
Gustafson's  two-page paper in the Communications of the ACM [] outlines his basic idea of fixed-time performance measurement as an alternative to Amdahl's assumptions. It contains the rhetorical question, "How can this be, in light of Amdahl's Law?" that some misinterpreted as a serious plea for the resolution of a paradox. Readers may find a flurry of responses in Communications and elsewhere, as well as attempts to "unify" the two laws.
An objective analysis of Gustafson's Law and its relation to Amdahl's Law can be found in many modern textbooks on parallel computing such as [], [], or []. In much the way some physicists in the early twentieth century refused to accept the concepts of relativity and quantum mechanics, for reasons more intuition-based than scientific, there are computer scientists who refuse to accept the idea of scaled speedup and Gustafson's Law, and who insist that Amdahl's Law suffices for all situations.
Pat Worley analyzed the extent to which one can usefully scale up scientific simulations by increasing their resolution []. In related work, Xian-He Sun and Lionel Ni built a more complete mathematical framework for scaled speedup [] in which they promote the idea of memory-bounded scaling, even though execution time generally increases beyond human patience when the memory used by a problem scales as much as linearly with the number of processors. In a related vein, Vipin Kumar proposed "Isoefficiency," for which the memory increases as much as necessary to keep the efficiency of the processors at a constant level even when communication and other impediments to parallelism are taken into account.

Bibliography
1. Amdahl GM () Validity of the single-processor approach to achieving large scale computing capabilities. AFIPS Joint Spring Conference Proceedings  (Atlantic City, NJ, Apr –), pp –. AFIPS Press, Reston, VA. At http://www-inst.eecs.berkeley.edu/∼n/paper/Amdahl.pdf
2. Bell G (interviewed) () An interview with Gordon Bell. IEEE Software, vol , No.  (July ), pp –
3. Gustafson JL () Fixed time, tiered memory, and superlinear speedup. Distributed Memory Computing Conference, , Proceedings of the Fifth, vol  (April ), pp –. ISBN: ---
4. Gustafson JL, Montry GR, Benner RE () Development of parallel methods for a -processor hypercube. SIAM Journal on Scientific and Statistical Computing, vol , No.  (July ), pp –
5. Gustafson JL () Reevaluating Amdahl's Law. Communications of the ACM, vol , No.  (May ), pp –. DOI ./.
6. Heath M, Worley P () Once again, Amdahl's Law. Communications of the ACM, vol , No.  (February ), pp –
7. Hwang K, Briggs F () Computer Architecture and Parallel Processing. McGraw-Hill, Inc. ISBN: 
8. Karp A () http://www.netlib.org/benchmark/karp-challenge
9. Lewis TG, El-Rewini H () Introduction to Parallel Computing. Prentice Hall. ISBN: ---, pp –
10. Quinn M () Parallel Computing: Theory and Practice, Second edition. McGraw-Hill, Inc
11. Seitz CL () Experiments with VLSI ensemble machines. Journal of VLSI and Computer Systems, vol , No. , pp –
12. Sun X-H, Ni L () Scalable problems and memory-bounded speedup. Journal of Parallel and Distributed Computing, vol , No. , pp –
13. Worley PH () The effect of time constraints on scaled speedup. Report ORNL/TM, Oak Ridge National Laboratory

Gustafson–Barsis Law
Gustafson's Law
H

Half Vector Length
Metrics

Hang
Deadlocks

Harmful Shared-Memory Access
Race Conditions

Haskell
Glasgow Parallel Haskell (GpH)

Hazard (in Hardware)
Dependences

HDF

Quincey Koziol
The HDF Group, Champaign, IL, USA

Synonyms
Hierarchical data format

Definition
HDF [] is a data model, software library, and file format for storing and managing data.

Discussion

Introduction
The HDF technology suite is designed to organize, store, discover, access, analyze, share, and preserve diverse, complex data in continuously evolving heterogeneous computing and storage environments. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high-volume and complex data. The HDF library and file format are portable and extensible, allowing applications to evolve in their use of HDF. The HDF technology suite also includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF format.

Originally designed within the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign [], HDF is now primarily developed and maintained by The HDF Group [], a nonprofit organization dedicated to ensuring the sustainable development of HDF technologies and the ongoing accessibility of data stored in HDF files. HDF builds on lessons learned from other data storage libraries and file formats, such as the original HDF file format (now known as HDF4 []), netCDF [], TIFF [], and FITS [], while adding unique features and extending the boundaries of prior data storage models.

Data Model
HDF implements a simple but versatile data model, which has two primary components: groups and datasets. Group objects in an HDF file contain a collection of named links to other objects in an HDF file. Dataset objects in HDF files store arrays of arbitrary element types and are the main method for storing application data.

Groups, which are analogous to directories in a traditional file system, can contain an arbitrary number of uniquely named links. A link can connect a group to another object in the same HDF file; include a named
path to an object in the HDF file, which may not exist currently; or refer to an object in another HDF file. Unlike links in a traditional file system, HDF links can be used to create fully cyclic directed graph structures. Each group contains one or more B-tree data structures as indices to its collection of links, which are stored in a heap structure within the HDF file.

Dataset objects store application data in an HDF file as a multidimensional array of elements. Each dataset is primarily defined by the description of how many dimensions its array has and the size of those dimensions, called a "dataspace," and the description of the type of element to store at each location in the array, called a "datatype."

An HDF dataspace describes the number of dimensions for an array, as well as the current and maximum number of elements in each dimension. The maximum number of elements in an array dimension can be specified as "unlimited," allowing an array to be extended over time. An HDF dataspace can have multiple dimensions with unlimited maximum dimensions, allowing the array to be extended in any or all of those dimensions.

An HDF datatype describes the type of data to store in each element of an array and can be one of the following classes: integer, floating-point, string, bitfield, opaque, compound, reference, enum, variable-length sequence, and array. These classes generally correspond to the analogous computer science concepts, but the reference and variable-length sequence datatypes are unusual. Reference datatypes contain references or pointers to HDF objects, allowing HDF applications to create datasets that can act as lookup tables or indices. Variable-length sequence datatypes allow a dynamic number of elements of a base datatype to be stored as an element and are one mechanism for creating datasets that represent ragged arrays. All of the HDF datatypes can be combined in any arbitrary way, allowing for great flexibility in how an application stores its data.

The elements of an HDF dataset can be stored in different ways, allowing an application to choose between various I/O access performance trade-offs. Dataset elements can be stored as a single sequence in the HDF file, called "contiguous" storage, which allows for constant-time access to any element in the array and no storage overhead for locating the elements in the dataset. However, contiguous data storage does not allow a dataset to use a dataspace with unlimited dimensions or to compress the dataset elements.

The dataspace for a dataset can also be decomposed into fixed-size sub-arrays of elements, called "chunks," which are stored individually in the file. This "chunked" data storage requires an index for locating the chunks that store the data elements. Datasets that have a dataspace with unlimited dimensions must use chunked data storage for storing their elements.

Using chunked data storage allows an application that will be accessing sub-arrays of the dataset to tune the chunk size to its sub-array size, allowing for much faster access to those sub-arrays than would be possible with contiguous data storage. Additionally, the elements of datasets that use chunked data storage can be compressed or have other operations, like checksums, applied to them.

The advantages of chunked data storage are balanced by some limitations, however. Using an index for mapping dataset element coordinates to chunk locations in the file can slow down access to dataset elements if the application's I/O access pattern does not line up with the chunk's sub-array decomposition. Furthermore, there is additional storage overhead for storing an index for the dataset, along with extra I/O operations to access the index data structure.

Datasets with very small amounts of element data can store their elements as part of the dataset description in the file, avoiding any extra I/O accesses to retrieve the dataset elements, since the HDF library will read them when accessing the dataset description. This "compact" data storage must be very small (less than a few KB), and may not be used with a dataspace that has unlimited dimensions or when dataset elements are compressed.

Finally, a dataset can store its elements in a different, non-HDF, file. This "external" data storage method can be used to share dataset elements between an HDF application and a non-HDF application. As with contiguous data storage, external data storage does not allow a dataset to use a dataspace with unlimited dimensions or compress the dataset elements.

HDF also allows application-defined metadata to be stored with any object in an HDF file.
HDF H 

These "attributes" are designed to store information about the object they are attached to, such as input parameters to a simulation, the name of an instrument gathering data, etc. Attributes are similar to datasets in that they have a dataspace, a datatype, and element values. Attributes require a name that is unique among the attributes for an object, similar to link names within groups. Attributes are limited to dataspaces that do not use unlimited maximum dimensions and cannot have their data elements compressed, but they can use any datatype for their elements.

Examples of Using HDF
The following C code example shows how to use the HDF library to store a large array. In the example below, the data to be written to the file is a three-dimensional array of single-precision floating-point values, with an equal number of elements along each axis:

   float data[][][];
   hid_t file_id, dataspace_id, dataset_id;
   hsize_t dims[] = {, , };

   < ...acquire or assign data values... >

   file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
   dataspace_id = H5Screate_simple(3, dims, NULL);
   dataset_id = H5Dcreate(file_id, "/Float_data", H5T_NATIVE_FLOAT, dataspace_id,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

   H5Dwrite(dataset_id, H5T_NATIVE_FLOAT, dataspace_id, dataspace_id, H5P_DEFAULT, data);

   H5Dclose(dataset_id);
   H5Sclose(dataspace_id);
   H5Fclose(file_id);

In this example, the first lines declare the variables needed for the example, including the three-dimensional array of data to store, and the placeholder that follows represents the application's process of filling the data array with information. The next three calls create a new HDF file, a new dataspace describing a fixed-size three-dimensional array, and a new dataset using a single-precision floating-point datatype and the dataspace just created. The H5Dwrite call writes the entire array to the file in a single I/O operation, and the final calls close the objects created earlier. Several of the calls use H5P_DEFAULT as a parameter, which is a placeholder for an HDF property list object; property lists can control more complicated properties of objects or operations.

The next C code example creates an identically structured file, but adds the necessary calls to open the file with eight processes in parallel and to perform a collective write to the dataset created.

   float data[][][];
   hid_t file_id, file_dataspace_id, mem_dataspace_id, dataset_id, fa_plist_id, dx_plist_id;
   hsize_t file_dims[] = {, , };
   hsize_t mem_dims[] = {, , };

   < ...acquire or assign data values... >

   fa_plist_id = H5Pcreate(H5P_FILE_ACCESS);
   H5Pset_fapl_mpio(fa_plist_id, MPI_COMM_WORLD, MPI_INFO_NULL);
   file_id = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fa_plist_id);
   H5Pclose(fa_plist_id);

   file_dataspace_id = H5Screate_simple(3, file_dims, NULL);
   dataset_id = H5Dcreate(file_id, "/Float_data", H5T_NATIVE_FLOAT, file_dataspace_id,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

   mem_dataspace_id = H5Screate_simple(3, mem_dims, NULL);

   < ...select process's elements in file dataspace... >

   dx_plist_id = H5Pcreate(H5P_DATASET_XFER);
   H5Pset_dxpl_mpio(dx_plist_id, H5FD_MPIO_COLLECTIVE);

   H5Dwrite(dataset_id, H5T_NATIVE_FLOAT, mem_dataspace_id, file_dataspace_id, dx_plist_id, data);

   H5Pclose(dx_plist_id);

   H5Dclose(dataset_id);
   H5Sclose(mem_dataspace_id);
   H5Sclose(file_dataspace_id);
   H5Fclose(file_id);

In this updated example, the size of the data array has been changed to be only one eighth of the total array size, to allow each of the eight processes to write a portion of the total array in the file. The first new calls create a file access property list, change the file driver for opening the file to use MPI-I/O, and collectively open the file with all processes, using the file access property list. The dataset is then created in the file in the same way as in the previous example. A separate dataspace is created for each process's portion of the dataset in the file, and a section of code, omitted here due to space constraints, selects the part of the file's dataspace that each process will write to (a sketch of this step follows the example). The final new calls create a dataset transfer property list, set the I/O operation to be collective, and perform a collective write operation in which each process writes a different portion of the dataset in the file. Finally, the remaining calls release the resources used for the example.

This example shows some ways that property lists can be used to modify the operation of HDF API calls, as well as demonstrating a simple example of parallel I/O using HDF API calls.
Higher-Level Data Models Built on HDF
HDF provides a set of generic higher-level data models that describe how to store images and tables as datasets and describe the coordinates of dataspace elements. A scientific user community can also use HDF as the basis for exchanging data among its members by creating a standardized domain-specific data model that is relevant to its area of interest. Domain-specific data models specify the names of HDF groups, datasets, and attributes, the dataspace and datatype for the datasets and attributes, and so on. Frequently, a domain's user community also creates a "wrapper library" that calls HDF library routines while enforcing the domain's standardized data model.

Library Interface
Software applications create, modify, and delete HDF objects through an object-oriented library interface that manipulates the objects in HDF files. Applications can use the HDF library to operate directly on the base HDF objects or use a domain-specific wrapper library that operates at a higher level of abstraction. The core software library for accessing HDF files is written in C, but library interfaces for the HDF data model have been created for many programming languages, including Fortran, C++, Java, Python, Perl, Ada, and C#.

File Format
Objects in the HDF data model created by the library interface are stored in files whose structure is defined by the HDF file format. The HDF file format has many unique aspects, some of which are: a mechanism for storing non-HDF formatted data at the beginning of a file, a method of "micro-versioning" file data structures that makes incremental changes to the format possible, and data structures that enable constant-time lookup of data within the file in situations that previously required a logarithmic number of operations.

The HDF file format is designed to be flexible and extensible, allowing for evolution and expansion of the data model in an incremental and structured way. This allows new releases of the HDF software library to continue to access all previous versions of the HDF file format. This capability empowers application developers to create HDF files and access the data contained within them over very long periods of time.

Tools
HDF is distributed with command-line utilities that can inspect and operate on HDF files.
Operations provided by the command-line utilities include copying HDF objects from one file to another, compacting internally fragmented HDF files to reduce their size, and comparing two HDF files to determine differences in the objects contained within them. The latter differencing utility, called "h5diff," is also provided as a parallel computing application that uses the MPI programming interface to quickly compare two files using multiple processes.

Many other applications, both commercial and open source, can access data stored in HDF files. Some of these applications include MATLAB [], Mathematica [], HDFView [], VisIt [], and EnSight []. Some of these applications provide generic browsing and modification of HDF files, while others provide specialized visualization of domain-specific data models stored in HDF files.

Parallel File I/O
Applications that use the MPI programming interface [] can use the HDF library to access HDF files in parallel from multiple concurrently executing processes. Internally, the HDF library uses the MPI interface for coordinating access to the HDF file as well as for performing parallel operations on the file. Efficiently accessing an HDF file in parallel requires storing the file on a parallel file system designed for access through the MPI interface.

Two methods of accessing an HDF file in parallel are possible: "independent" and "collective." Independent parallel access to an HDF file is performed by a process in a parallel application without coordination with or cooperation from the other processes in the application. Collective parallel access to an HDF file is performed with all the processes in the parallel application cooperating and possibly communicating with each other.

The following discussion of HDF library capabilities describes the parallel I/O features in the release that was current when this entry was written. The parallel I/O features in the HDF library are continually improving and evolving to address the ongoing changes in the landscape of parallel computing. Unless otherwise stated, limitations in the capabilities of the HDF library are not inherent to the HDF data model or file format and may be addressed in future library releases.

The HDF library requires that operations that create, modify, or delete objects in an HDF file be performed collectively. However, operations that only open objects for reading can be performed independently. Requiring collective operations for modifying the file's structure is currently necessary so that all processes in the parallel application keep a consistent view of the file's contents.

Reading or writing the elements of a dataset can be performed either independently or collectively. Accessing the elements of a dataset with independent or collective operations involves different trade-offs that application developers must balance.

Using independent operations requires a parallel application to create the overall structure of the HDF file at the beginning of its execution, or possibly with a nonparallel application prior to the start of the parallel application. The parallel application can then update the elements of a dataset without requiring any synchronization or coordination between the processes in the application. However, the application must subdivide the portions of the file accessed from each process to avoid race conditions that would affect the contents of the file. Additionally, accessing a file independently may cause the underlying parallel file system to perform very poorly, because its lack of a global perspective on the overall access pattern prevents the file system from taking advantage of many caching and buffering opportunities available with collective operations.

Accessing HDF dataset elements collectively requires that all processes in the parallel application cooperate when performing a read or write operation. Collective operations use the MPI interface within the HDF library to describe the region(s) of the file to access from each process, and then use collective MPI I/O operations to access the dataset elements. Collectively accessing HDF dataset elements allows the MPI implementation to communicate between the processes in the application to determine the best method of accessing the parallel file system, which can greatly improve the performance of the I/O operation. However, the communication and synchronization overhead of collective access can also slow down an application's overall performance.

To get good performance when accessing an HDF dataset, it is important to choose a storage method that is compatible with the type of parallel access chosen. For example, compact data storage requires that all writes to dataset elements be performed collectively, while external data storage requires that all dataset element accesses be performed independently.
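To make the distinction concrete, the sketch below wraps a dataset write so the caller can choose either access mode through the dataset transfer property list. It is a minimal illustration of the HDF5 C API calls involved, with the function name and the use_collective flag being assumptions of the sketch, not tuning advice.

   #include "hdf5.h"

   /* Write a dataset with either collective or independent MPI-I/O,
    * selected by the caller through 'use_collective'. */
   herr_t write_with_mode(hid_t dataset_id, hid_t mem_space_id, hid_t file_space_id,
                          const float *data, int use_collective)
   {
       hid_t dxpl_id = H5Pcreate(H5P_DATASET_XFER);
       H5Pset_dxpl_mpio(dxpl_id,
                        use_collective ? H5FD_MPIO_COLLECTIVE : H5FD_MPIO_INDEPENDENT);

       herr_t status = H5Dwrite(dataset_id, H5T_NATIVE_FLOAT,
                                mem_space_id, file_space_id, dxpl_id, data);
       H5Pclose(dxpl_id);
       return status;
   }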
Collective and independent element access also involves the MPI and parallel file system layers, and those layers add their own complexities to the equation. HDF application developers must carefully balance the trade-offs of collective and independent operations to determine when and how to use them.

High-performance access to HDF files is a strongly desired feature for application developers, and the HDF library has been designed with the goal of providing performance that closely matches the performance possible when an application accesses unformatted data directly. Considerable effort has been devoted to enhancing the HDF library's parallel performance, and this effort continues as new parallel I/O developments unfold.

Significant Parallel Applications and Libraries That Use HDF
Many applications and software libraries that use HDF have become significant software assets, either commercially or as open source projects governed by a user community. HDF's stability, longevity, and breadth of features have attracted many developers to it for storing their data, for both sequential and parallel computing purposes.

The first software library to use HDF was developed collaboratively by developers at the US Department of Energy's (DOE) Lawrence Livermore, Los Alamos, and Sandia National Laboratories. This effort was called the "Sets and Fields" (SAF) library [] and was developed to give the parallel applications that dealt with large, complex finite element simulations on the highest performing computers of the time a way to efficiently store their data.

Many large scientific communities worldwide have adopted HDF for storing data with parallel applications. Some significant examples include the FLASH software for simulating astrophysical nuclear flashes from the University of Chicago [], the Chombo package for solving finite difference equations using adaptive mesh refinement from Lawrence Berkeley National Laboratory [], and the open source NeXus software and data format for interchanging data in the neutron, x-ray, and muon science communities [].

netCDF, PnetCDF, and HDF
Another significant software library that uses HDF is the netCDF library [], developed at the Unidata Program Center of the University Corporation for Atmospheric Research (UCAR). Originally designed for the climate modeling community, netCDF has since been embraced by many other scientific communities. netCDF adopted HDF as its principal storage method, as of version 4.0, in order to take advantage of several features in HDF that its previous file format did not provide, including data compression, a wider array of types for storing data elements, hierarchical grouping structures, and more flexible parallel operations.

Created prior to the development of netCDF-4, parallel-netCDF or "PnetCDF" [] was developed by Argonne National Laboratory. PnetCDF allows parallel applications to access netCDF format files with collective data operations. PnetCDF does not extend the netCDF data model or format beyond allowing larger objects to be stored in netCDF files. Files written by PnetCDF and by versions of the netCDF library prior to netCDF-4 do not use the HDF file format and instead store data in the "netCDF classic" format [].

Future Directions
Both the primary development team at The HDF Group and the user community that has formed around the HDF project are constantly improving it. HDF continues to be ported to new computer systems and architectures and has its performance improved and errors corrected over time. Additionally, the HDF data model is expanding to encompass new developments in the field of high-performance storage and computing.

Some improvements being designed or implemented as of this entry's writing include: increasing the efficiency of small file I/O operations through advanced caching mechanisms, finding ways to allow parallel applications to create HDF objects in a file with independent operations, and implementing new chunked data storage indexing methods to improve collective access performance.

Additionally, HDF continues to lead the scientific data storage field in its adoption of asynchronous file I/O for improved performance, journaled file updates for improved resiliency, and methods for improving concurrency by allowing different applications to read and write the same HDF file without using a locking mechanism.
Related Entries
File Systems
MPI (Message Passing Interface)
NetCDF I/O Library, Parallel

Bibliographic Notes and Further Reading
HDF has been under development since ; a short history of its development is recorded at The HDF Group web site [].

Datasets in HDF files are analogous to sections of trivial fiber bundles [], where the HDF dataspace corresponds to a fiber bundle's base space, the HDF datatype corresponds to the fiber, and the dataset, a variable whose value is the totality of the data elements stored, represents a section through the total space (which is the Cartesian product of the base space and the fiber).

HDF datatypes can be very complex; many more details are found in reference [].

The HDF file format is documented in [].

Many more applications and libraries use HDF than are discussed in this entry. A partial list can be found in [].

Bibliography
1. HDF, http://www.hdfgroup.org/HDF/
2. National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, http://www.ncsa.illinois.edu/
3. The HDF Group, http://www.hdfgroup.org/
4. HDF, http://www.hdfgroup.org/products/hdf/
5. netCDF, http://www.unidata.ucar.edu/software/netcdf/
6. TIFF, http://partners.adobe.com/public/developer/tiff/index.html
7. FITS, http://fits.gsfc.nasa.gov/
8. MATLAB, http://www.mathworks.com/products/matlab/
9. Mathematica, http://www.wolfram.com/products/mathematica/index.html
10. HDFView, http://www.hdfgroup.org/hdf-java-html/hdfview/
11. VisIt, https://wci.llnl.gov/codes/visit/
12. EnSight, http://www.ensight.com/
13. MPI, http://www.mpi-forum.org/
14. Miller M et al () Enabling interoperation of high performance, scientific computing applications: modeling scientific data with the sets & fields (SAF) modeling system. ICCS, May, part II, San Francisco. Lecture Notes in Computer Science. Springer, Heidelberg
15. FLASH, http://flash.uchicago.edu/web/
16. Chombo, https://seesar.lbl.gov/anag/chombo/index.html
17. NeXus, http://www.nexusformat.org/Main_Page
18. PnetCDF, http://trac.mcs.anl.gov/projects/parallel-netcdf
19. netCDF file formats, http://www.unidata.ucar.edu/software/netcdf/docs/netcdf/File-Format.html
20. A history of The HDF Group, http://www.hdfgroup.org/about/history.html
21. Fiber bundle definition, http://mathworld.wolfram.com/FiberBundle.html
22. HDF user guide, datatypes chapter, http://www.hdfgroup.org/HDF/doc/UG/UG_frameDatatypes.html
23. HDF file format specification, http://www.hdfgroup.org/HDF/doc/H.format.html
24. Summary of software using HDF, http://www.hdfgroup.org/products/hdf_tools/SWSummarybyName.htm

HEP, Denelcor
Denelcor HEP

Heterogeneous Element Processor
Denelcor HEP

Hierarchical Data Format
HDF

High Performance Fortran (HPF)
HPF (High Performance Fortran)

High-Level I/O Library
NetCDF I/O Library, Parallel

High-Performance I/O
I/O
Homology to Sequence Alignment, From

Wu-Chun Feng, Heshan Lin
Virginia Tech, Blacksburg, VA, USA
Wake Forest University, Winston-Salem, NC, USA

Discussion
Two sequences are considered to be homologous if they share a common ancestor. Sequences are either homologous or nonhomologous, but not in-between []. Determining whether two sequences are actually homologous can be a challenging task, as it requires inferences to be made between the sequences. Further complicating this task is the potential that the sequences may appear to be related via chance similarity rather than via common ancestry.

One approach toward determining homology entails the use of sequence-alignment algorithms that maximize the similarity between two sequences. For homology modeling, these alignments could be used to obtain the likely amino-acid correspondence between the sequences.

Introduction
Sequence alignment identifies similarities between a pair of biological sequences (i.e., pairwise sequence alignment) or across a set of multiple biological sequences (i.e., multiple sequence alignment). These alignments, in turn, enable the inference of functional, structural, and evolutionary relationships between sequences. For instance, sequence alignment helped biologists identify the similarities between the SARS virus and the more well-studied coronaviruses, thus enhancing the biologists' ability to combat the new virus.

Pairwise Sequence Alignment
There are two types of pairwise alignment: global alignment and local alignment. Global alignment seeks to align a pair of sequences entirely to each other, i.e., from one end to the other. As such, it is suitable for comparing sequences of roughly the same length, e.g., two closely homologous sequences. Local alignment seeks to identify significant matches between parts of the sequences. It is useful for analyzing partially related sequences, e.g., protein sequences that share a common domain.

Many approaches have been proposed for aligning a pair of sequences. Among them, dynamic programming is a common technique that can find optimal alignments between sequences. Dynamic programming can be used for both local alignment and global alignment, and the algorithms in the two cases are quite similar. The scoring model of an alignment algorithm is given by a substitution matrix and a gap-penalty function. A substitution matrix stores a matching score for every possible pair of letters. A matching score is typically measured by the frequency with which a pair of letters occurs in known homologous sequences according to a certain statistical model. Popular substitution matrices include PAM and BLOSUM, an example of which is shown in Fig. 1. A gap-penalty function defines how gaps in the alignments are weighed in alignment scores. For instance, with a linear gap-penalty function, the penalty score grows linearly with the length of a gap. With an affine gap-penalty function, the penalty factors are differentiated for the opening and the extension of a gap.

The following discussion focuses on the basic Smith–Waterman algorithm and its parallelization; Smith–Waterman is a popular local-alignment tool based on dynamic programming. (For more advanced techniques for computing pairwise sequence alignment, the reader should consult the "Bibliographic Notes and Further Reading" section at the end of this entry.)

Case Study: Smith–Waterman Algorithm
Given two sequences S_1 = a_1 a_2 ... a_m and S_2 = b_1 b_2 ... b_n, the Smith–Waterman algorithm uses an m by n scoring matrix H to calculate and track the alignments. A cell H_{i,j} stores the highest similarity score that can be achieved by any possible alignment ending at a_i and b_j. The Smith–Waterman algorithm has three phases: initialization, matrix filling, and traceback.

The initialization phase simply assigns a value of zero to each of the matrix cells in the first row and the first column. In the matrix-filling phase, the problem of aligning the two whole sequences is broken into smaller subproblems, i.e., aligning partial sequences. Accordingly, a cell H_{i,j} is updated based on the values of its preceding neighbors. For the sake of illustration, the rest of the discussion assumes a linear gap-penalty function
where the penalty of a gap is equal to a constant factor g (typically negative) times the length of the gap.

There are three possible alignment scenarios in which H_{i,j} is derived from its neighbors: (1) a_i and b_j are associated, (2) there is a gap in sequence S_1, and (3) there is a gap in sequence S_2. As such, the scoring matrix can be filled according to (1). The first three terms in (1) correspond to the three scenarios; the zero value ensures that there are no negative scores. S(a_i, b_j) is the matching score derived by looking up the substitution matrix.

[Figure 1 appears here.]
Homology to Sequence Alignment, From. Fig. 1 BLOSUM substitution matrix

   H_{i,j} = \max \begin{cases} H_{i-1,j-1} + S(a_i, b_j) \\ H_{i-1,j} + g \\ H_{i,j-1} + g \\ 0 \end{cases} \qquad (1)

When a cell is updated, the direction from which the maximum score is derived also needs to be stored (e.g., in a separate matrix). After the matrix is filled, a traceback process is used to recover the path of the best alignment. It starts from the cell with the highest score in the matrix and ends at a cell with a value of zero, following the direction information recorded earlier.

The majority of the execution time is spent in the matrix-filling phase of Smith–Waterman. Algorithm 1 shows a straightforward implementation of the matrix filling.

Algorithm 1 Matrix Filling in Smith–Waterman
   for i = 1 to m do
     for j = 1 to n do
       max = H(i-1, j-1) + S(a_i, b_j) > H(i-1, j) + g ? H(i-1, j-1) + S(a_i, b_j) : H(i-1, j) + g
       if H(i, j-1) + g > max then
         H(i, j) = H(i, j-1) + g
       else
         H(i, j) = max
       end if
     end for
   end for

In the inner loop of the algorithm, the cell calculated in one iteration depends on the value updated in the previous iteration, resulting in a "read-after-write" hazard (see [] for details), which can reduce the instruction-level parallelism that can be exploited by pipelining and, hence, adversely impact performance. In addition, this algorithm is difficult to parallelize directly because of the data dependency between iterations in the inner loop.

As depicted in Fig. 2, the calculation of a particular cell depends on its west, northwest, and north neighbors. However, the updates of individual cells along an anti-diagonal are independent.
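A direct C rendering of the matrix-filling recurrence above is sketched below; the zero initialization of the first row and column and the row-major layout follow the description in this entry, while the function name, the character encoding of the sequences, and the scoring callback are assumptions of the sketch.

   /* Fill an (m+1) x (n+1) Smith-Waterman scoring matrix H (row-major),
    * using a linear gap penalty g and a substitution score callback.
    * 'subst_score' is a hypothetical lookup into a substitution matrix. */
   void sw_fill(int *H, const char *a, int m, const char *b, int n,
                int g, int (*subst_score)(char, char))
   {
       for (int j = 0; j <= n; j++) H[j] = 0;            /* first row */
       for (int i = 0; i <= m; i++) H[i * (n + 1)] = 0;  /* first column */

       for (int i = 1; i <= m; i++) {
           for (int j = 1; j <= n; j++) {
               int diag = H[(i - 1) * (n + 1) + (j - 1)] + subst_score(a[i - 1], b[j - 1]);
               int up   = H[(i - 1) * (n + 1) + j] + g;
               int left = H[i * (n + 1) + (j - 1)] + g;

               int best = diag > up ? diag : up;
               if (left > best) best = left;
               if (best < 0)    best = 0;                /* local-alignment floor */

               H[i * (n + 1) + j] = best;
           }
       }
   }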
[Figures 2 and 3 appear here.]
Homology to Sequence Alignment, From. Fig. 2 Data dependency of matrix filling
Homology to Sequence Alignment, From. Fig. 3 Tiled implementation

This observation motivates a wavefront-filling algorithm [], in which the matrix cells on an anti-diagonal can be updated simultaneously. That is, because there is no dependency between two adjacent cells along an anti-diagonal, this algorithm greatly reduces read-after-write hazards and, in turn, increases the execution efficiency and the ease of parallelization. For example, in a shared-memory environment, individual threads can compute a subset of cells along an anti-diagonal. However, synchronization between threads must occur after computing each anti-diagonal.

Since a scoring matrix is typically stored in "row-major" order, as shown in Algorithm 1, the above wavefront algorithm may have a large memory footprint when computing an anti-diagonal, thus limiting the benefits of processor caches. One improvement entails partitioning the matrix into tiles and having each parallel processing unit fill a subset of the tiles, as shown in Fig. 3. By carefully choosing the tile size, the data processed by a thread can fit in the processor cache. Furthermore, when parallelized in distributed environments, the tiled approach can effectively reduce internode communication compared to the fine-grained wavefront approach, because only elements at the borders of individual tiles need to be exchanged between different compute nodes.

With the wavefront approach, the initial amount of parallelism in the algorithm is low. It gradually increases along each successive anti-diagonal until reaching the maximum parallelism along the longest anti-diagonal, and then monotonically decreases thereafter. A trade-off needs to be made in choosing the tile size. If the tile size is too large, there is not sufficient parallelism to exploit at the beginning of the wavefront computation, which results in idle resources on systems with a large number of processing units. On the other hand, too small a tile size will incur much more synchronization and communication overhead. Nonetheless, the wavefront approach may generate imbalanced workloads on different processors, especially at the beginning and end of the computation. It is worth noting that there is an alternative parallel algorithm that uses prefix sums to compute the scoring matrix row by row (or column by column) [], which can generate a uniform task distribution among all processors.

The above discussion assumes a simple linear gap-penalty function. In practice, the Smith–Waterman algorithm uses an affine gap-penalty scheme, which requires maintaining three scoring matrices in order to track gap opening and extension. Consequently, both the time and space usage increase by a factor of three in implementations using affine gap penalties.
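A minimal shared-memory sketch of the wavefront scheme described above is given below. It parallelizes each anti-diagonal with OpenMP and assumes the same row-major matrix and scoring callback as the previous sketch; the first row and column of H are assumed to be zeroed already, and the implicit barrier at the end of each parallel loop provides the per-anti-diagonal synchronization.

   #include <omp.h>

   /* Wavefront fill: cells on each anti-diagonal d = i + j are independent,
    * so they can be updated in parallel; a barrier separates anti-diagonals. */
   void sw_fill_wavefront(int *H, const char *a, int m, const char *b, int n,
                          int g, int (*subst_score)(char, char))
   {
       for (int d = 2; d <= m + n; d++) {
           int i_lo = (d - n > 1) ? d - n : 1;
           int i_hi = (d - 1 < m) ? d - 1 : m;

           #pragma omp parallel for schedule(static)
           for (int i = i_lo; i <= i_hi; i++) {
               int j = d - i;
               int diag = H[(i - 1) * (n + 1) + (j - 1)] + subst_score(a[i - 1], b[j - 1]);
               int up   = H[(i - 1) * (n + 1) + j] + g;
               int left = H[i * (n + 1) + (j - 1)] + g;
               int best = diag > up ? diag : up;
               if (left > best) best = left;
               if (best < 0)    best = 0;
               H[i * (n + 1) + j] = best;
           }
       }
   }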
Sequence Database Search
With the proliferation of public sequence data, sequence database search has become an important task in sequence analysis. For example, newly discovered sequences are typically searched against a database of sequences with known genes and known functions in order to help predict the functions of the newly discovered sequences. A sequence database-search tool compares a set of query sequences against all sequences in a database with a pairwise alignment algorithm and reports the matches that are statistically significant. Although dynamic-programming algorithms can be used for sequence database search, these algorithms are too computationally demanding to keep up with the database growth. Consequently, heuristic-based algorithms, such as BLAST [, ] and FASTA [], have been developed for rapidly identifying similarities in sequence databases.

BLAST is the most widely used sequence database-search tool. It reduces the complexity of alignment computation by filtering potential matches with common words, called k-mers. Specifically, there are four stages in comparing a query sequence and a database sequence:

● Stage 1: The query and the database sequences are parsed into words of length k (by default, k is 3 for protein sequences and 11 for DNA sequences). The algorithm then matches words between the query sequence and the database sequence and calculates an alignment score for each matched word, based on a substitution matrix (e.g., BLOSUM). Only matched words with alignment scores higher than a threshold are kept for the next stage.
● Stage 2: For each high-scoring matched word, ungapped alignment is performed by extending the matched word in both directions. An alignment score is calculated along the extension. The extension stops when the alignment score stops increasing and drops slightly below the maximum alignment score (controlled by another threshold).
● Stage 3: Ungapped alignments with scores larger than a given threshold obtained from Stage 2 are chosen as seed alignments. Gapped alignments are then performed on the seed alignments using a dynamic-programming algorithm, following both forward and backward directions.
● Stage 4: Traceback is performed to recover the paths of the gapped alignments.

BLAST calculates the significance of the resulting alignments using Karlin–Altschul statistics. The Karlin–Altschul theory uses a statistic called the e-value (E) to measure the likelihood that an alignment results from matches by chance (i.e., matches between random sequences) as opposed to a true homologous relationship. The e-value can be calculated according to (2):

   E = K m n \, e^{-\lambda S} \qquad (2)

where K and λ are the Karlin–Altschul parameters, m and n are the query length and the total length of all database sequences, and S is the alignment score. The e-value indicates how many alignments with a score higher than S can be found by chance in the search space, i.e., the product of the query sequence length and the database length. The lower the e-value, the more significant the alignment. The alignment results of a query sequence are sorted in order of e-value.

A sequence database-search job needs to compute M × N pairwise sequence alignments, where M and N are the numbers of query and database sequences, respectively. This computation can be parallelized with a coarse-grained approach, where the alignments of individual pairs of sequences are assigned to different processing units.

Early parallel sequence-search software adopted a query segmentation approach, where a sequence-search job is parallelized by having individual compute nodes concurrently search disjoint subsets of the query sequences against the whole sequence database. Since the searches of individual query sequences are independent, this embarrassingly parallel approach is easy to implement and scales well. However, the size of sequence databases is growing much faster than the memory size of a typical single computer. When the database cannot fit in memory, data will be frequently swapped in and out of memory when searching multiple queries, causing significant performance degradation because disk I/O is several orders of magnitude slower than memory access. Query segmentation can improve search throughput, but it does not reduce the response time taken to search a single query sequence.

Database segmentation is an alternative parallelization approach, where large databases are partitioned and cached in the aggregate memory of a group of compute nodes. By doing so, the repeated I/O overhead of searching large databases is avoided. Database segmentation also improves the search response time since a
compute node searches only a portion of the database. However, database segmentation introduces computational dependencies between individual nodes because the distributed results generated at different nodes need to be merged and sorted to produce the final output. The parallel overhead of merging and sorting increases as the system size grows.

With the astronomical growth of sequence databases, today's large-scale sequence-search jobs can be very resource demanding. For instance, a BLAST search in a metagenomics project can consume several million processor hours. Massively parallel sequence-search tools, such as mpiBLAST [, , ] and ScalaBLAST [], have been developed to accelerate large-scale sequence-search jobs on state-of-the-art supercomputers. These tools use a combination of query segmentation and database segmentation to offer the massive parallelism needed to scale on a large number of processors.

Case Study: mpiBLAST
mpiBLAST is an open-source parallelization of NCBI BLAST that has been designed for petascale deployment. Adopting a scalable, hierarchical design, mpiBLAST parallelizes a search job via a combination of query segmentation and database segmentation. As shown in Fig. 4, processors in the system are organized into equal-sized partitions, which are supervised by a dedicated supermaster process. The supermaster is responsible for assigning tasks to different partitions and handling inter-partition load balancing. Within each partition, there is one master process and many worker processes. The master is responsible for coordinating both computation and I/O scheduling in the partition. The master periodically fetches a subset of query sequences from the supermaster and assigns them to workers, and it coordinates the output processing of queries that have been processed in the partition. The sequence database is partitioned into fragments and replicated to workers in the system. This hierarchical design avoids creating scheduling bottlenecks in large systems by distributing the scheduling workload across multiple masters.

[Figure 4 appears here.]
Homology to Sequence Alignment, From. Fig. 4 mpiBLAST hierarchical design. Q_i and Q_j are query batches fetched from the supermaster to masters, and q_i and q_j are query sequences that are assigned by masters to their workers. In this example, the database is segmented into two fragments, f_1 and f_2, and replicated twice within each partition

Large-scale sequence searches can be highly data-intensive, and as such, the efficiency of data management is critical to program scalability. For the input data, having thousands of processors simultaneously load database fragments from shared storage may overwhelm the I/O subsystem. To address this, mpiBLAST designates a set of compute nodes as I/O proxies, which read database fragments from the file system in parallel and replicate them to other workers using the broadcasting mechanism in MPI [, ] libraries. In addition, mpiBLAST allows workers to cache
assigned database fragments in memory or local storage, and it uses a task-scheduling algorithm that takes data locality into account to minimize the repeated loading of database fragments.

With database segmentation, result alignments from different database fragments are usually interleaved in the global output because those alignments need to be sorted by e-value. Consequently, the output data generated at each worker is noncontiguous in the output file. Straightforward noncontiguous I/O with many seek-and-write operations is slow on most file systems. This type of I/O can be optimized with collective I/O [, , ], which is available in parallel I/O libraries such as ROMIO []. Collective I/O uses a two-phase process. In the first phase, the involved processes exchange data with each other to form large chunks of contiguous data, which are stored as memory buffers in the individual processes. In the second phase, the buffered data is written to the actual file system. Collective I/O improves I/O performance because contiguous data accesses are much more efficient than noncontiguous ones. Traditional collective I/O implementations require synchronization between all involved processes for each I/O operation. This synchronization overhead will adversely impact sequence-search performance when the computation is imbalanced across different processes. To address this issue, mpiBLAST introduces a parallel I/O technique called asynchronous, two-phase I/O (ATIO), which allows worker processes to rearrange I/O data without synchronizing with each other. Specifically, mpiBLAST appoints a worker as the write leader for each query sequence. The write leader aggregates output data from the other workers via nonblocking MPI communication and carries out the write operation to the file system. ATIO overlaps I/O reorganization and sequence-search computation, thus improving overall application performance. Figure 5 shows the difference between collective I/O and ATIO within the context of mpiBLAST.
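For reference, collective file output of the kind discussed above is typically issued through MPI-IO. The sketch below shows one way a group of processes might write per-rank result buffers collectively; the offset arithmetic and function name are simplifying assumptions of the sketch, not mpiBLAST's actual output algorithm.

   #include <mpi.h>

   /* Each rank writes 'len' bytes from 'buf' at a rank-specific offset using a
    * collective MPI-IO call (two-phase I/O happens inside the MPI library). */
   void write_results_collectively(const char *path, const char *buf, int len)
   {
       MPI_File fh;
       int rank;
       long long my_len = len, prefix = 0;

       MPI_Comm_rank(MPI_COMM_WORLD, &rank);

       /* Exclusive prefix sum of lengths gives each rank its starting offset. */
       MPI_Exscan(&my_len, &prefix, 1, MPI_LONG_LONG, MPI_SUM, MPI_COMM_WORLD);
       if (rank == 0)
           prefix = 0;   /* MPI_Exscan leaves rank 0's result undefined */

       MPI_File_open(MPI_COMM_WORLD, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                     MPI_INFO_NULL, &fh);
       MPI_File_write_at_all(fh, (MPI_Offset)prefix, buf, len, MPI_CHAR,
                             MPI_STATUS_IGNORE);
       MPI_File_close(&fh);
   }

Because MPI_File_write_at_all is collective, the MPI implementation can aggregate the per-rank regions into large contiguous requests, which is exactly the behavior the two-phase collective I/O discussion above relies on.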

[Figure 5 appears here.]
Homology to Sequence Alignment, From. Fig. 5 Collective I/O and ATIO. Collective I/O requires synchronization and introduces idle waiting in worker processes. ATIO uses a write leader to aggregate noncontiguous data from other workers in an asynchronous manner

Multiple Sequence Alignment
Multiple sequence alignment (MSA) identifies similarities among three or more sequences. It can be used to analyze a family of related sequences to reveal phylogenetic relationships. Other usages of multiple sequence alignment include detection of conserved biological features and genome sequencing. Multiple sequence alignment can also be global or local.

Like pairwise sequence alignment, multiple sequence alignment can be computed using dynamic
programming algorithms. Although these algorithms can find optimal alignments, the required resources grow exponentially as the number of sequences increases. If there are N sequences and the average sequence length is L, the time and space complexities are both O(L^N). Thus, it is computationally impractical to use dynamic programming to align a large number of sequences.

Many heuristic approaches have been proposed to reduce the computational complexity of multiple sequence alignment. Progressive alignment methods [, , , , ] use guided pairwise sequence alignment to rapidly construct an MSA for a large number of sequences. These methods first build a phylogenetic tree based on all-to-all pairwise sequence alignments using neighbor-joining [] or UPGMA [] techniques. Guided by the phylogenetic tree, the most similar sequences are aligned first, and the less similar sequences are then progressively added to the initial MSA. One problem with progressive methods is that errors occurring in the early aligning stages are propagated to the final alignment results. Iterative alignment methods [, ] address this problem by introducing "correction" mechanisms during the MSA construction process. These methods incrementally align sequences as progressive methods do, but continuously adjust the structure of the phylogenetic tree and previously computed alignments according to a certain objective function.

Case Study: ClustalW
ClustalW [] is a widely used MSA program based on progressive methods. The ClustalW algorithm includes three stages: (1) distance matrix computation, (2) guided tree construction, and (3) progressive alignment. Due to the popularity of ClustalW, its parallelization has been well studied on clusters [, ] and multiprocessor systems [, ].

In the first stage, ClustalW computes a distance matrix by performing all-to-all pairwise alignment over the input sequences. This requires a total of N(N−1)/2 comparisons for N sequences, since the alignment of a pair of sequences is symmetric. ClustalW allows users to choose between two alignment algorithms: a faster k-mer matching algorithm and a slower but more accurate dynamic programming algorithm. This stage of the algorithm is embarrassingly parallel. Alignments of individual pairs of sequences can be statically assigned, or dynamically assigned with a greedy algorithm, to different processing units. Additional parallelism can be exploited by parallelizing the alignment of a single pair of sequences, similar to the Smith–Waterman algorithm.

In the second stage, a guided tree is constructed using a neighbor-joining algorithm based on the alignment scores in the distance matrix. Initially, all sequences are leaf nodes of a star-like tree. The algorithm iteratively selects a pair of nodes and joins them into a new internal node until there are no nodes left to join, at which point a bifurcated tree is created. Algorithm 2 gives the pseudocode of this process. If there are a total of n available nodes at an iteration, there are n(n−1)/2 possible pairs of nodes. The pair of nodes that results in the smallest total branch length is selected to join. An example of the neighbor-joining process with four sequences is given in Fig. 6.

In Algorithm 2, the outermost loop cannot be parallelized because one iteration of neighbor joining depends on the previous one. However, within an iteration of neighbor joining, the calculation of the branch lengths of all pairs of nodes can be performed in parallel (a C sketch of this pair-selection loop is given after Algorithm 2). Since the number of available nodes is reduced after each joining operation, the level of parallelism decreases as the iteration advances in the outermost loop.

Algorithm 2 Construction of Guided Tree
   while there are nodes to join do
     Let n be the number of currently available nodes
     Let D be the current distance matrix
     Let Lmin be the smallest length of the tree branches
     for i = 1 to n do
       for j = 1 to i − 1 do
         L(i, j) = (n − 2) D(i, j) − ∑_{k=1..n} D(i, k) − ∑_{k=1..n} D(j, k)
         if L(i, j) < Lmin then
           Lmin = L(i, j)
         end if
       end for
     end for
     Combine the nodes i, j that result in Lmin into a new node
     Update the distance matrix with the new node
   end while
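The sketch below is a straightforward serial C version of the pair-selection step inside one joining iteration; the function name and the precomputed row sums are assumptions made for the illustration (precomputing the sums is a common optimization, not something prescribed by this entry). An implementation would parallelize the doubly nested loop, since the candidate pairs are independent of one another.

   /* Select the pair (bi, bj) minimizing the neighbor-joining criterion
    * L(i,j) = (n-2)*D[i][j] - rowsum[i] - rowsum[j].
    * D is the n x n distance matrix stored row-major; rowsum[i] = sum over k of D[i][k]. */
   void nj_select_pair(const double *D, const double *rowsum, int n,
                       int *bi, int *bj)
   {
       double lmin = 1e300;
       for (int i = 1; i < n; i++) {
           for (int j = 0; j < i; j++) {
               double L = (n - 2) * D[i * n + j] - rowsum[i] - rowsum[j];
               if (L < lmin) {
                   lmin = L;
                   *bi  = i;
                   *bj  = j;
               }
           }
       }
   }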
[Figure 6 appears here.]
Homology to Sequence Alignment, From. Fig. 6 Neighbor-joining example (panels: a Step 1, b Step 2, c Step 3, d Tree)

In the third stage, the sequences are progressively aligned according to the guided tree. For instance, in the tree shown in Fig. 6(d), sequences 1 and 2 are aligned first, followed by 3 and 4. Finally, the alignment results of <1−2> and <3−4> are aligned. Note that the alignment at a node can only be performed after both of its children are aligned, but the alignments at the same level of the tree can be performed simultaneously. The level of parallelism of the algorithm therefore depends heavily on the tree structure. In the best case, the guided tree is a balanced tree. At the beginning, all the leaves can be aligned in parallel. The number of concurrent alignments then decreases by half at each higher level toward the root of the tree.

Case Study: T-Coffee
T-Coffee [] is another popular progressive alignment algorithm. Compared to other typical progressive alignment tools, T-Coffee improves alignment accuracy by adopting a consistency-based scoring function, which uses similarity information among all input sequences to guide the alignment progress. T-Coffee consists of two main steps: library generation and progressive alignment. The first step constructs a library that contains a mixture of global and local alignments between every pair of input sequences. In the library, an alignment is represented as a list of pairwise constraints. Each constraint stores a pair of matched residues along with an alignment weight; it is essentially a three-tuple <S_x^i, S_y^j, w>, where S_x^i is the ith residue of sequence S_x and w is the weight. T-Coffee incorporates pairwise global alignments generated by ClustalW as well as the top ten nonintersecting local alignments reported by Lalign from the FASTA package []. T-Coffee can also take alignment information from other MSA software. The pairwise constraints for the same pair of matched residues from various sources (e.g., global and local alignments) are combined to remove duplication. Constructing the library requires N(N−1)/2 global and local alignments and thus is highly compute-intensive.

After all pairwise alignments are incorporated in the library, T-Coffee performs library extension, a procedure that incorporates transitive alignment information into the weighting of pairwise constraints. Basically, for a pair of sequences x and y, if there is a sequence z that aligns to both x and y, then the constraints between x and y are reweighted by combining the weights of the corresponding constraints between x and z as well as between y and z. Suppose the average sequence length is L; since for each pair of sequences the extension algorithm needs to scan the remaining N − 2 sequences, and there are at most L constraints between a pair of sequences, the worst-case computational complexity of library extension is O(N^3 L).

In the second step, T-Coffee performs progressive alignment guided by a phylogenetic tree built with the neighbor-joining method, similar to the ClustalW algorithm. However, T-Coffee uses the weights in the extended library to align residues when grouping sequences/alignments. Since those weights carry complete alignment information from all input sequences, the progressive alignment in T-Coffee can reduce errors caused by the greediness of classic progressive methods.

Parallel T-Coffee (PTC) is a parallel implementation of T-Coffee for cluster environments []. PTC adopts a master–worker architecture and uses MPI to communicate between different processes. As mentioned earlier, the library generation in T-Coffee needs to compute all-to-all global and local alignments. The computation tasks for these alignments are independent of each other. PTC uses guided self-scheduling (GSS) [] to distribute alignment tasks to different worker nodes. GSS first assigns a portion of the tasks to the workers. Each worker monitors its performance when processing the initial assignments; this performance information is
then sent to the master and used for dynamic scheduling of subsequent assignments.

After the pairwise alignments are finished, duplicated constraints generated on distributed workers need to be combined. PTC implements this with parallel sorting. Each constraint is assigned to a bucket resident at a worker. Each worker can then concurrently combine duplicated constraints within its own bucket. The constraints in the library are then transformed into a three-dimensional lookup table, with rows and columns indexed by sequences and residues, respectively. Each element in the lookup table stores all constraints for a residue of a sequence. The lookup table is accessed by all processors during the progressive alignment. PTC evenly distributes the lookup table by rows to all processors and allows table entries to be accessed by other processors through one-sided remote memory access. An efficient caching mechanism is also implemented to improve lookup performance.

During the progressive alignment, PTC schedules tree nodes to a processor according to their readiness; a tree node that has fewer unprocessed child nodes has a higher scheduling priority. For tree nodes that have all child nodes processed, PTC gives higher priority to the ones with shorter estimated execution time. As with ClustalW, the parallelism of progressive alignment in PTC can be limited if the guided tree is unbalanced. To address this issue, Orobitg et al. proposed a heuristic approach that can construct a more balanced guided tree by allowing a pair of nodes to be grouped if their similarity value is smaller than the average similarity value between all sequences in the distance matrix [].

Related Entries
Bioinformatics
Genome Assembly

Bibliographic Notes and Further Reading
As discussed in the case study of Smith–Waterman, typical sequence-alignment algorithms require O(mn) space and time, where m and n are the lengths of the compared sequences. Such a space requirement can [...] megabytes) on commodity machines. To address this issue, Myers and Miller introduced a space-efficient alignment algorithm [] adapted from the Hirschberg technique [], which was originally developed for finding the longest common subsequence between two strings. By recursively dividing the alignment problem into subproblems at a "midpoint" along the middle column of the scoring matrix, Myers and Miller's approach can find the optimal alignment within O(m + n) space but still O(mn) time. Huang showed that a straightforward parallelization of the Hirschberg algorithm would require more than linear aggregate space [], i.e., each processor needs to store more than O((m + n)/p) data, where p is the number of concurrent processors. In turn, Huang proposed an improved algorithm that recursively divides the alignment problem at the midpoint along the middle anti-diagonal of the scoring matrix. By doing so, Huang's algorithm requires only O(mn/p) space per processor but with an increased time complexity of O((m + n)^2/p). Aluru et al. presented an alternative space-efficient parallel algorithm [] that is more time efficient (O(mn/p)) but also consumes more space (O(m + n/p), m ≤ n) than Huang's approach. In their approach, an O(mn) algorithm is first used to partition the scoring matrix into p vertical slices, and the last column of each slice, as well as its intersection with the optimal alignment, is stored. Each processor then takes a slice and uses a Hirschberg-based algorithm to compute the optimal alignment within the slice. In a subsequent study, Aluru et al. proposed an improved parallel algorithm [] that requires O(mn/p) time and O((m + n)/p) space when p = O(n / log n) processors are used. In other words, such a parallel algorithm achieves "optimal" time and space complexities because it delivers a linear speedup with respect to the best known sequential algorithm.

The computational intensity of sequence-alignment algorithms has motivated studies in parallelizing these algorithms on accelerators. Various algorithms have been accelerated using the SIMD instruction extensions of commodity processors [, ], field-programmable gate arrays (FPGAs) [, ], the Cell Broadband Engine [, , ], and graphics processing units (GPUs)
be impractical for computing alignments of large [–, , , , ]. Developing and optimizing
sequences (e.g., those with a length of multiple applications on traditional accelerators is much more
Homology to Sequence Alignment, From H 

difficult than on CPUs, which may partially explain . Feng DF, Doolittle RF () Progressive sequence alignment
why accelerator-based solutions have not been widely as a prerequisite to correct phylogenetic trees. J Mol Evol
adopted even if these solutions have demonstrated very ():–
. Fitch W, Smith T () Optimal sequences alignments. Proc Natl
promising performance results. However, the contin-
Acad Sci :–
uing improvement of software environments on com- . Hennessy JL, Patterson DA () Computer architecture: a
modity GPUs has made them increasingly popular for quantitative approach, th edn. Morgan Kaufmann Publishers,
accelerating sequent alignments. To cope with the astro- San Francisco
nomical growth of sequence data, cloud-based solu- . Hirschberg DS () A linear space algorithm for com-
puting maximal common subsequences. Commun ACM :
tions [, ] have also been developed to enable users
–
to tackle large-scale problems with elastic compute . Huang X () A space-efficient parallel sequence comparison
resources from public clouds such as Amazon EC. algorithm for a message-passing multiprocessor. Int J Parallel
Program :–
. Li K () W-MPI: ClustalW analysis using distributed and
parallel computing. Bioinformatics ():–
Bibliography . Li I, Shum W, Truong K () -fold acceleration of the H
. Aji AM, Feng W, Blagojevic F, Nikolopoulos DS () Cell-SWat: Smith-Waterman algorithm using a field programmable gate
modeling and scheduling wavefront computations on the cell array (FPGA). BMC Bioinformatics ():
broadband engine. In: CF ’: Proceedings of the th conference . Lin H, Ma X, Chandramohan P, Geist A, Samatova N () Effi-
on computing frontiers. ACM, New York, pp – cient data access for parallel BLAST. In: Proceedings of the th
. Altschul S, Gish W, Miller W, Myers E, Lipman D () Basic IEEE international parallel and distributed processing sympo-
local alignment search tool. J Mol Biol ():– sium (IPDPS’). IEEE Computer Society, Los Alamitos
. Altschul S, Madden T, Schffer A, Zhang J, Zhang Z, Miller W, . Lin H, Ma X, Feng W, Samatova NF () Coordinating compu-
Lipman D () Gapped BLAST and PSI-BLAST: a new gen- tation and I/O in massively parallel sequence search. IEEE Trans
eration of protein database search programs. Nucleic Acids Res Parallel Distrib Syst :–
():– . Lipman D, Pearson W () Improved toolsW, HT for biological
. Aluru S, Futamura N, Mehrotra K () Parallel biological sequence comparison. Proc Natl Acad Sci ():–
sequence comparison using prefix computations. J Parallel Dis- . Liu W, Schmidt B, Voss B, Müller-Wittig W () GPU-
trib Comput :– ClustalW: using graphics hardware to accelerate multiple
. Chaichoompu K, Kittitornkun S, Tongsima S () MT- sequence alignment, Chapter  In: Robert Y, Parashar M,
ClustalW: multithreading multiple sequence alignment. In: Inter- Badrinath R, Prasanna VK (eds) High performance computing –
national parallel and distributed processing symposium. Rhodes HiPC . Lecture notes in computer science, vol . Springer,
Island, Greece, p  Berlin/Heidelberg, pp –
. Darling A, Carey L, Feng W () The design, implementation, . Liu Y, Maskell D, Schmidt B () CUDASW++: optimizing
and evaluation of mpiBLAST. In: Proceedings of the Cluster- Smith-Waterman sequence database searches for CUDA-enabled
World conference and Expo, in conjunction with the th inter- graphics processing units. BMC Res Notes ():
national conference on Linux clusters: The HPC revolution , . Liu Y, Schmidt B, Maskell DL () MSA-CUDA: multiple
San Jose sequence alignment on graphics processing units with CUDA. In:
. Di Tommaso P, Orobitg M, Guirado F, Cores F, Espinosa T, ASAP ’: Proceedings of the  th IEEE international con-
Notredame C () Cloud-Coffee: implementation of a parallel ference on application-specific systems, architectures and proces-
consistency-based multiple alignment algorithm in the T-Coffee sors, Washington, DC. IEEE Computer Society, Los Alamitos,
package and its benchmarking on the Amazon Elastic-Cloud. California, USA, pp –
Bioinformatics ():– . Lu W, Jackson J, Barga R () AzureBlast: a case study of cloud
. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S () Prob- computing for science applications. In: st workshop on scien-
Cons: probabilistic consistency-based multiple sequence align- tific cloud computing, co-located with ACM HPDC  (High
ment. Genome Res ():– performance distributed computing). Chicago, Illinois, USA
. Ebedes J, Datta A () Multiple sequence alignment in parallel . Mahram A, Herbordt MC () Fast and accurate NCBI
on a workstation cluster. Bioinformatics ():– BLASTP: acceleration with multiphase FPGA-based prefiltering.
. Edgar R () MUSCLE: a multiple sequence alignment method In: Proceedings of the th ACM international conference on
with reduced time and space complexity. BMC Bioinformatics supercomputing. Tsukuba, Ibaraki, Japan
(): . May J () Parallel I/O for high performance computing. Mor-
. Edmiston EE, Core NG, Saltz JH, Smith RM () Parallel gan Kaufmann Publishers, San Francisco
processing of biological sequence comparison algorithms. Int J . Message Passing Interface Forum () MPI: message-passing
Parallel Program :– interface standard
 H Horizon

. Message Passing Interface Forum () MPI- extensions to the . Vouzis PD, Sahinidis NV () GPU-BLAST: using graphics pro-
message-passing standard cessors to accelerate protein sequence alignment. Bioinformatics
. Mikhailov D, Cofer H, Gomperts R () Performance optimiza- ():–
tion of Clustal W: parallel Clustal W, HT Clustal, and MULTI- . Wallace IM, Orla O, Higgins DG () Evaluation of itera-
CLUSTAL. White Papers, Silicon Graphics, Mountain View tive alignment algorithms for multiple alignment. Bioinformatics
. Myers EW, Miller W () Optimal alignments in linear space. ():–
Comput Appl Biosci (CABIOS) ():– . Wozniak A () Using video-oriented instructions to speed up
. Notredame C () T-coffee: a novel method for fast and accu- sequence comparison. Comput Appl Biosci ():–
rate multiple sequence alignment. J Mol Biol ():– . Xiao S, Aji AM, Feng W () On the robust mapping
. Oehmen C, Nieplocha J () ScalaBLAST: a scalable imple- of dynamic programming onto a graphics processing unit.
mentation of BLAST for high-performance data-intensive bioin- In: ICPADS ’: proceedings of the  th international
formatics analysis. IEEE Trans Parallel Distrib Syst ():– conference on parallel and distributed systems, Washington,
. Orobitg M, Guirado F, Notredame C, Cores F () Exploiting DC. IEEE Computer Society, Los Alamitos, California, USA,
parallelism on progressive alignment methods. J Supercomput pp –
–. doi: ./s--- . Xiao S, Lin H, Feng W () Characterizing and optimizing pro-
. Pei J, Sadreyev R, Grishin NV () PCMA: fast and accu- tein sequence search on the GPU. In: Proceedings of the th
rate multiple sequence alignment based on profile consistency. IEEE international parallel and distributed processing sympo-
Bioinformatics ():– sium Anchorage, Alaska. IEEE Computer Society, Los Alamitos,
. Polychronopoulos CD, Kuck DJ () Guided self-scheduling: California, USA
a practical scheduling scheme for parallel supercomputers. IEEE . Zola J, Yang X, Rospondek A, Aluru S () Parallel-TCoffee: a
Trans Comput :– parallel multiple sequence aligner. In: ISCA international confer-
. Rajko S, Aluru S () Space and time optimal parallel sequence ence on parallel and distributed computing systems (ISCA PDCS
alignments. IEEE Trans Parallel Distrib Syst :– ), pp –
. Rognes T, Seeberg E () Six-fold speed-up of Smith-Waterman
sequence database searches using parallel processing on common
microprocessors. Bioinformatics ():–
. Sachdeva V, Kistler M, Speight E, Tzeng TK () Exploring the
viability of the cell broadband engine for bioinformatics applica-
tions. Parallel Comput ():– Horizon
. Saitou N, Nei M () The neighbor-joining method: a new
method for reconstructing phylogenetic trees. Mol Biol Evol Tera MTA
():–
. Sandes EFO, de Melo ACMA () CUDAlign: using GPU
to accelerate the comparison of megabase genomic sequences.
SIGPLAN Not ():–
. Sarje A, Aluru S () Parallel genomic alignments on the HPC Challenge Benchmark
cell broadband engine. IEEE Trans Parallel Distrib Syst ():
– Jack Dongarra, Piotr Luszczek
. Sneath PH, Sokal RR () Numerical taxonomy. Nature University of Tennessee, Knoxville, TN, USA
:–
. Thakur R, Choudhary A () An extended two-phase method
for accessing sections of out-of-core arrays. Sci Program ():
–
Definition
. Thakur R, Gropp W, Lusk W () Data sieving and collec-
HPC Challenge (HPCC) is a benchmark that mea-
tive I/O in ROMIO. In: Symposium on the frontiers of massively sures computer performance on various computational
parallel processing. Annapolis, Maryland, USA, p  kernels that span the memory access locality space.
. Thakur R, Gropp W, Lusk E () On implementing MPI-IO HPCC includes tests that are able to take advantage of
portably and with high performance. In: Proceedings of the sixth nearly all available floating point performance: High
workshop on I/O in parallel and distributed systems. Atlanta,
Performance LINPACK and matrix–matrix multiply
Georgia, USA
allow for data reuse that is only bound by the size of
. Thompson JD, Higgins DG, Gibson TJ () CLUSTAL W:
improving the sensitivity of progressive multiple sequence align- large register file and fast cache. The twofold bene-
ment through sequence weighting, position-specific gap penalties fit from these tests is the ability to answer the ques-
and weight matrix choice. Nucleic Acids Res ():– tion how well the hardware is able to work around the
HPC Challenge Benchmark H 

“memory wall” and how today’s machines compare to


PTRANS HPL
the systems of the past as they are cataloged by the
STREAM DGEMM
LINPACK Benchmark Report [] and TOP. HPCC
also includes other tests, STREAM, PTRANS, FFT, Ran-
domAccess – when they are combined together they CFD Radar cross-section

Spatial locality
span the memory access locality space. They are able
to reveal the growing inefficiencies of the memory sub-
Applications
system and how they are addressed in the new com-
puter infrastructures. HPCC also offers scientific rigor
to the benchmarking effort. The tests stress double pre-
TSP RSA DSP
cision floating point accuracy: the absolute prerequisite
in the scientific world. In addition, the tests include
careful verification of the outputs – undoubtedly an RandomAccess FFT
important fault-detection feature at extreme computing H
0 Temporal locality
scales.
HPC Challenge Benchmark. Fig.  The application areas
Discussion targeted by the HPCS Program are bound by the HPCC
The HPC Challenge benchmark suite was initially tests in the memory access locality space
developed for the DARPA HPCS program [] to pro-
vide a set of standardized hardware probes based
on commonly occurring computational software ker- systems that originated in the HPCS program (the third
nels. The HPCS program involves a fundamental column from the right). In other words, these are the
reassessment of how to define and measure perfor- projected target performance numbers that are to come
mance, programmability, portability, robustness and, out of the wining HPCS vendor designs. The last col-
ultimately, productivity across the entire high-end umn shows the relative improvement in performance
domain. Consequently, the HPCC suite aimed both that needs to be achieved in order to meet the
to give conceptual expression to the underlying com- goals.
putations used in this domain, and to be applica-
ble to a broad spectrum of computational science The TOP Influence
fields. Clearly, a number of compromises needed to The most commonly known ranking of supercomputer
be embodied in the current form of the suite, given installations around the world is the TOP list [].
such a broad scope of design requirements. HPCC was It uses the equally well-known LINPACK benchmark
designed to provide approximate bounds on computa- [] as a single figure of merit to rank  of the world’s
tions that can be characterized by either high or low most powerful supercomputers. The often-raised ques-
spatial and temporal locality (see Fig. , which gives tion about the relation between the TOP list and
the conceptual design space for the HPCC compo- HPCC can be addressed by recognizing the positive
nent tests). In addition, because the HPCC tests consist aspects of the former. In particular, the longevity of the
of simple mathematical operations, HPCC provides a TOP list gives an unprecedented view of the high-
unique opportunity to look at language and parallel end arena across the turbulent era of Moore’s law []
programming model issues. As such, the benchmark is rule and the emergence of today’s prevalent computing
designed to serve both the system user and designer paradigms. The predictive power of the TOP list is
communities. likely to have a lasting influence in the future, as it has
Figure  shows a generic memory subsystem in the had in the past. HPCC extends the TOP list’s con-
leftmost column and how each level of the hierarchy cept of exploiting a commonly used kernel and, in the
is tested by the HPCC software (the second column context of the HPCS goals, incorporates a larger, grow-
from the left), along with the design goals for the future ing suite of computational kernels. HPCC has already
 H HPC Challenge Benchmark

HPCS Program:

Registers Benchmarks Performance Target Required Improvement

Operands HPL 2 Pflop/s 800%

Cache

Lines STREAM 6 Pbyte/s 4000%

Local Memory
FFT
0.5 Pflop/s 20 000%
Messages RandomAccess
64000 GUPS 200 000%
b_eff
Remote Memory

Pages

Disk

HPC Challenge Benchmark. Fig.  HPCS program benchmarks and performance targets

HPC Challenge Benchmark. Table  All of the top  entries of the th TOP list that have results in the HPCC database

Rank Name Rmax HPL PTRANS STREAM FFT RandomAccess Lat. B/w
 BG/L . . .  ,  . . .
 BG W . . .  ,  . . .
 ASC purple . . .   . . .
 Columbia . . .   . . .
 Red storm . . .  ,  . . .

begun to serve as a valuable tool for performance anal- supercomputer makers – a sign that vendors were par-
ysis. Table  shows an example of how the data from the ticipating in the new benchmark initiative. At the same
HPCC database can augment the TOP results (for time, behind the scenes, the code was also being tried
the current version of the table please visit the HPCC out by government and private institutions for pro-
website). curement and marketing purposes. A  milestone
was the announcement of the HPCC Awards con-
test. The two complementary categories of the com-
Short History of the Benchmark petition emphasized performance and productivity –
The first reference implementation of the HPCC suite the same goals as the sponsoring HPCS program.
of codes was released to the public in . The The performance-emphasizing Class  award drew the
first optimized submission came in April  from attention of many of the biggest players in the super-
Cray, using the then-recent X installation at Oak computing industry, which resulted in populating the
Ridge National Lab. Ever since, Cray has champi- HPCC database with most of the top  entries of
oned the list of optimized HPCC submissions. By the the TOP list (some exceeding their performances
time of the first HPCC birds-of-a-feather session at reported on the TOP – a tribute to HPCC’s continu-
the Supercomputing conference in  in Pittsburgh, ous results update policy). The contestants competed to
the public database of results already featured major achieve the highest raw performance in one of the four
HPC Challenge Benchmark H 

HPL

A x = b Compute x from the system of linear equations


Ax = b.

DGEMM
C ¬a A B +b C Compute update to matrix C with a product of
matrices A and B.

STREAM

a ¬ b b +a c Perform simple operations on vectors a, b, and c.

PTRANS
A ¬ AT + B Compute update to matrix A with a sum of its
transpose and another matrix B. H
RandomAccess

T .. .. T
. . Perform integer update of random vector T
locations using pseudo-random sequence.

FFT
®
x X z Compute vector z to be the Fast Fourier
Transform (FFT) of vector x.
®

b.eff
® ®
¬ ¬
¯ ¯ Perform ping-pong and various communication
® ® ring exchanges.
¬ ¬

HPC Challenge Benchmark. Fig.  Detail description of the HPCC component tests (A, B, C – matrices, a, b, c, x, z –
vectors, α, β – scalars, T – array of -bit integers)

tests: HPL, STREAM, RANDA, and FFT. At the SC widely used. The financial incentives for entering turned
conference in Portland, Oregon, HPCC listed its first out to be all but needless, as the HPCC seemed to
Pflop/s machine – Cray XT  called Jaguar from Oak have gained enough recognition within the high-end
Ridge National Laboratory. The Class  award, by solely community to elicit entries even without the monetary
focusing on productivity, introduced a subjectivity assistance. (HPCwire provided both press coverage and
factor into the judging and also into the submission cri- cash rewards for the four winning contestants in Class
teria, regarding what was appropriate for the contest. As  and the single winner in Class .) At the HPCC’s
a result, a wide range of solutions were submitted, span- second birds-of-a-feather session during the SC con-
ning various programming languages (interpreted and ference in Seattle, the former class was dominated by
compiled) and paradigms (with explicit and implicit IBM’s BlueGene/L at Lawrence Livermore National Lab,
parallelism). The Class  contest featured openly avail- while the latter class was split among MTA pragma-
able as well as proprietary technologies, some of which decorated C and UPC codes from Cray and IBM,
were arguably confined to niche markets and some were respectively.
 H HPC Challenge Benchmark

The Benchmark Tests’ Details Benchmark Submission Procedures


Extensive discussion and various implementations of and Results
the HPCC tests are available elsewhere [, , ]. The reference implementation of the benchmark may
However, for the sake of completeness, this section pro- be obtained free of charge at the benchmark’s web site
vides the most important facts pertaining to the HPCC (http://icl.cs.utk.edu/hpcc/). The reference implemen-
tests’ definitions. tation should be used for the base run: it is written in a
All calculations use double precision floating-point portable subset of ANSI C [] using a hybrid program-
numbers as described by the IEEE  standard [], and ming model that mixes OpenMP [, ] threading with
no mixed precision calculations [] are allowed. All the MPI [–] messaging. The installation of the software
tests are designed so that they will run on an arbitrary requires creating a script file for Unix’s make() utility.
number of processors (usually denoted as p). Figure  The distribution archive comes with script files for many
shows a more detailed definition of each of the seven common computer architectures. Usually, a few changes
tests included in HPCC. In addition, it is possible to to any of these files will produce the script file for a given
run the tests in one of three testing scenarios to stress platform. The HPCC rules allow only standard system
various hardware components of the system. The sce- compilers and libraries to be used through their sup-
narios are shown in Fig. . In the “Single” scenario, ported and documented interface, and the build pro-
only one process is chosen to run the test. Accordingly, cedure should be described at submission time. This
the remaining processes remain idle and so does the ensures repeatability of the results and serves as an edu-
interconnect (shown with strike-out font in the Figure). cational tool for end users who wish to use a similar
In the “Embarrassingly Parallel” scenario, all process build process for their applications.
run the tests simultaneously but they do not commu- After a successful compilation, the benchmark
nicate with each other. And finally, in the “Global” sce- is ready to run. However, it is recommended that
nario, all components of the system work together on all changes be made to the benchmark’s input file that
tests. describes the sizes of data to use during the run.
The sizes should reflect the available memory on the
system and the number of processors available for
computations.
There must be one baseline run submitted for each
Single
computer system entered in the archive. An optimized
Pi
P1 .. . . ... .. PN run for each computer system may also be submitted.
The baseline run should use the reference implementa-
Interconnect tion of HPCC, and in a sense it represents the scenario
when an application requires use of legacy code – a
Embarrassingly Parallel code that cannot be changed. The optimized run allows
P1 Pi PN the submitter to perform more aggressive optimizations
... ...
and use system-specific programming techniques (lan-
guages, messaging libraries, etc.), but at the same time
Interconnect
still includes the verification process enjoyed by the
Global base run.
P1 Pi PN All of the submitted results are publicly available
... ... after they have been confirmed by email. In addition to
the various displays of results and exportable raw data,
Interconnect the HPCC website also offers a kiviat chart display to
visually compare systems using multiple performance
HPC Challenge Benchmark. Fig.  Testing scenarios of numbers at once. A sample chart that uses actual HPCC
the HPCC components results data is shown in Fig. .
HPC Challenge Benchmark H 

64 processors: AMD Opteron 2.2 GHz


G-HPL
1.0
RandomRing
G-PTRANS 0.8 Latency

0.6

0.4

0.2

G-RandomAccess RandomRing
Bandwidth

G-FFT S-DGEMM

S-STREAM Triad
Rapid Array
Quadrics
GigE

HPC Challenge Benchmark. Fig.  Sample kiviat diagram of results for three different interconnects that connect the
same processors

Related Entries . Dongarra J, Luszczek P () Introduction to the HPC challenge


Benchmarks benchmark suite. Technical Report UT-CS--, University of
Tennessee, Knoxville
LINPACK Benchmark
. Kepner J () HPC productivity: an overarching view. Int J
Livermore Loops
High Perform Comput Appl ():–
TOP . Kernighan BW, Ritchie DM () The C Programming Lan-
guage. Prentice-Hall, Upper Saddle River, New Jersey
. Luszczek P, Dongarra J () High performance development
Bibliography for high end computing with Python Language Wrapper (PLW).
. ANSI/IEEE Standard – () Standard for binary float- Int J High Perfoman Comput Appl. Accepted to Special Issue on
ing point arithmetic. Technical report, Institute of Electrical and High Productivity Languages and Models
Electronics Engineers,  . Langou J, Langou J, Luszczek P, Kurzak J, Buttari A, Dongarra J
. Chandra R, Dagum L, Kohr D, Maydan D, McDonald J, Menon R () Exploiting the performance of  bit floating point arith-
() Parallel programming in OpenMP. Morgan Kaufmann metic in obtaining  bit accuracy. In: Proceedings of SC,
Publishers, San Francisco,  Tampa, Florida, Nomveber – . See http://icl.cs.utk.edu/
. Dongarra JJ () Performance of various computers using stan- iter-ref
dard linear equations software. Computer Science Department. . Moore GE () Cramming more components onto integrated
Technical Report, University of Tennessee, Knoxville, TN, April circuits. Electronics ():–
. Up-to-date version available from http://www.netlib.org/ . Message Passing Interface Forum () MPI: A Message-Passing
benchmark/ Interface Standard. The International Journal of Supercomputer
. Dongarra JJ, Luszczek P, Petitet A () The LINPACK bench- Applications and High Performance Computing (/):–
mark: past, present, and future. In: Dou Y, Gruber R, Joller JM . Message Passing Interface Forum () MPI: A Message-Passing
() Concurrency and computation: practice and experience, Interface Standard (version .), . Available at: http://www.
vol , pp – mpi-forum.org/
 H HPF (High Performance Fortran)

. Message Passing Interface Forum () MPI-: Extensions to the Definition
Message-Passing Interface,  July . Available at http://www. The microarchitecture specified by Yale Patt, Wen-mei
mpi-forum.org/docs/mpi-.ps Hwu, Stephen Melvin, and Michael Shebanow in 
. Meuer HW, Strohmaier E, Dongarra JJ, Simon HD ()
for implementing high-performance microprocessors.
TOP Supercomputer Sites, th edn. November .
(The report can be downloaded from http://www.netlib.org/ It achieves high performance via aggressive branch
benchmark/top.html) prediction, speculative execution, wide issue, and out-
. Nadya Travinin and Jeremy Kepner () pMatlab parallel Mat- of-order execution, while retaining the ability to handle
lab library. International Journal of High Perfomance Computing precise exceptions via in-order retirement.
Applications ():–
. OpenMP: Simple, portable, scalable SMP programming. http://
www.openmp.org/ Discussion
The High Performance Substrate (HPS) was the name
given to the microarchitecture conceived by Professor
Yale Patt and his three PhD students, Wen-mei Hwu,
Michael Shebanow, and Stephen Melvin, at the Univer-
sity of California, Berkeley in  and first published
HPF (High Performance Fortran) in Micro  in October,  [, ]. Its goal was high-
performance processing of single-instruction streams
by combining aggressive branch prediction, specula-
High Performance Fortran (HPF) is an extension of tive execution, wide issue, dynamic scheduling (out-
Fortran  for parallel programming. In HPF pro- of-order execution), and retirement of instructions in
grams, parallelism is represented as data parallel oper- program order (i.e., in-order).
ations in a single thread of execution. HPF extensions Out-of-order execution had appeared in previous
included statements to specify data distribution, data machines from Control Data [] and IBM [], but
alignment, and processor topology, which were used for had pretty much been dismissed as a nonviable mech-
the translation of HPF codes onto an SPMD message- anism due to its lack of in-order retirement, which
passing form. prevented the processor from implementing precise
exceptions. The checkpoint retirement mechanisms of
Bibliography HPS removed that problem [, ]. It should also be
. Kennedy K, Koelbel C, Zima H () The rise and fall of noted that solutions to the precise exception problem
high performance Fortran: an historical object lesson. In: Pro- were also being developed simultaneously and indepen-
ceedings of the third ACM SIGPLAN conference on History dently by James Smith and Andrew Plezskun [].
of programming languages (HOPL III), ACM, New York, pp. HPS was first targeted for the VAX instruction set
-–-, doi:./., http://doi.acm.org/./
architecture and demonstrated that an HPS implemen-
.
. High Performance Fortran Forum Website. http://hpff.rice.edu
tation of the VAX could process instruction streams at a
rate of three cycles per instruction (CPI), as compared
to the VAX-/, which processed at the rate of .
CPI [].
Instructions are processed as follows: Using an
aggressive branch predictor and wide-issue fetch/de-
HPS Microarchitecture code mechanism, multiple instructions are fetched
each cycle, decoded into data flow graphs (one per
Yale N. Patt
The University of Texas at Austin, Austin, TX, USA
instruction), and merged into a global data flow graph
containing all instructions in process. Instructions are
scheduled for execution when their flow dependen-
cies (RAW hazards) have been satisfied and executed
Synonyms speculatively and out-of-order with respect to the pro-
The high performance substrate gram order of the program. Results produced by these
Hybrid Programming With SIMPLE H 

instructions are stored temporarily in a results buffer . Thornton JE () Design of a computer – the Control Data .
(aka re-order buffer) until they can be retired in-order. Scott, Foresman and Co. Glenview, IL
The essence of the paradigm is that the global data graph . Anderson DW, Sparacio FJ, Tomasulo RM () The IBM sys-
tem/ model : machine philosophy and instruction-handling.
consists of nodes corresponding to micro-operations
IBM J Res Development ():–
and edges corresponding to linkages between micro- . Hwu W, Patt Y () HPSm, a high performance restricted
ops that produce operands and micro-ops that source data flow architecuture having minimal functionality. In: Proceed-
them. The edges of the data flow graph produced as a ings, th annual international symposium on computer architec-
result of decode correspond to internal linkages within ture, Tokyo
. Hwu W, Patt Y () Checkpoint repair for high perfor-
an instruction. Edges created as a result of merging an
mance out-of-order execution machines. IEEE Trans Computers
individual instruction’s data flow graph into the global ():–
data flow graph correspond to linkages between live- . Smith JE, Pleszkun A () Implementing precise interrupts. In:
outs of one instruction and live-ins of a subsequent Proceedings, th annual international symposium on computer
instruction. A Register Alias Table was conceived to architecture, Boston, MA
. Hwu W, Melvin S, Shebanow M, Chen C, Wei J, Patt Y ()
maintain correct linkages between live-outs and live- H
An HPS implementation of VAX; initial design and analysis. In:
ins. A node, corresponding to a micro-op, is avail- Proceedings of the Hawaii international conference on systems
able for execution when all its flow dependencies are sciences, Honolulu, HI
resolved. . Colwell R () The pentium chronicles: the people, passion,
The HPS research group refers to the paradigm as and politics behind intel’s landmark chips. Wiley-IEEE Computer
Restricted Data Flow (RDF) since at no time does the Society Press, NJ, ISBN: ----

data flow graph for the entire program exist. Rather,


the size of the global data flow graph is increased every
cycle as a result of new instructions being decoded and
merged, and decreased every cycle as a result of old HT
instructions retiring. At every point in time, only those
instructions in the active window – the set of instruc- HyperTransport
tions that have been fetched but not yet retired – are
present in the data flow graph. The set of instructions
in the active window are often referred to as “in-flight”
instructions. The number of in-flight instructions is HT.
orders of magnitude smaller than the size of a data
HyperTransport
flow graph for the entire program. The result is data
flow processing of a program without incurring any of
the problems of classical data flow.
Since , the HPS microarchitecture has seen con-
tinual development and improvement by many research
Hybrid Programming With
groups at many universities and industrial labs. The SIMPLE
basic paradigm has been adopted for most cutting-edge
Guojing Cong , David A. Bader
high-performance microprocessors, starting with Intel 
IBM, Yorktown Heights, NY, USA
on its Pentium Pro microprocessor in the late s []. 
Georgia Institute of Technology, Atlanta, GA, USA

Bibliography Definition
. Patt YN, Hwu W, Shebanow M () HPS, a new microarchi- Most high performance computing systems are clus-
tecture: rationale and introduction. In: Proceedings of the th
ters of shared-memory nodes. Hybrid parallel program-
microprogramming workshop, Asilomar, CA
. Patt YN, Melvin S, Hwu W, Shebanow M () Critical issues
ing handles distributed-memory parallelization across
regarding HPS, a high performance microarchitecture. In: Pro- the nodes and shared-memory parallelization within a
ceedings of the th microprogramming workshop, Asilomar, CA node.
 H Hybrid Programming With SIMPLE

SIMPLE refers to the joining of the SMP and two cache (L), which can be tightly integrated into the
MPI-like message passing paradigms [] and the sim- memory system to provide fast memory accesses and
ple programming approach. It provides a methodology cache coherence. The shared memory programming
of programming cluster of SMP nodes. It advocates of each SMP node is based on threads which com-
a hybrid methodology which maps directly to under- municate via coordinated accesses to shared memory.
lying architectural aspects. SIMPLE combines shared SIMPLE provides several primitives that synchronize
memory programming on shared memory nodes with the threads at a barrier, enable one thread to broad-
message passing communication between these nodes. cast data to the other threads, or calculate reductions
SIMPLE provides () a complexity model and set of across the threads. In SIMPLE, only the CPUs from a
efficient communication primitives for SMP nodes and certain node have access to that node’s configuration.
clusters; () a programming methodology for clus- In this manner, there is no restriction that all nodes
ters of SMPs which is both efficient and portable; and must be identical, and certainly configuration can be
() high performance algorithms for sorting integers, constructed from SMP nodes of different sizes. Thus,
constraint-satisfied searching, and computing the two- the number of threads on a specific remote node is
dimensional FFT. not globally available. Because of this, SIMPLE sup-
ports only node-oriented communication, meaning
that communication is restricted such that, given any
The SIMPLE Computational Model
source node s and destination node d, with s ≠ d,
A simple paradigm is used in SIMPLE for designing effi-
only one thread on node s can send (receive) a message
cient and portable parallel algorithms. The architecture
to (from) node d at any given time.
consists of a collection of SMP nodes interconnected by
a communication network (as shown in Fig. ) that can
be modeled as a complete graph on which communica- Complexity Model
tion is subject to the restrictions imposed by the latency In the SIMPLE complexity model, each SMP is viewed
and the bandwidth properties of the network. Each SMP as a two-level hierarchy for which good performance
node contains several identical processors, each typi- requires both good load distribution and the mini-
cally with its own on-chip cache and a larger off-chip mization of secondary memory access. The cluster is
cache, which have uniform access to a shared memory viewed as a collection of powerful processors con-
and other resources such as the network interface. nected by a communication network. Maximizing per-
Parameter r is used to represent the number sym- formance on the cluster requires both efficient load bal-
metric processors per node (see Fig.  for a diagram ancing and regular, balanced communication. Hence,
of a typical node). Notice that each CPU typically has our performance model combines two separate but
its own on-chip cache (L) and a larger off-chip level complimentary models.

Interconnection Network

P0 P1 P2 P3 P4 P5 P6 P7 Pp-4 Pp-3 Pp-2 Pp-1

Hybrid Programming With SIMPLE. Fig.  Cluster of processing elements


Hybrid Programming With SIMPLE H 

CPU 0 CPU 2 CPU r-2


L1 L1 L1 Network
Interface

CPU 1 CPU 3 CPU r-1


L1 L1 L1

L2 L2 L2 L2 L2 L2

Bus or Switching Network

H
Shared Memory

Hybrid Programming With SIMPLE. Fig.  A typical symmetric multiprocessing (SMP) node used in a cluster. L is
on-chip level-one cache, and L is off-chip level-two cache

The SIMPLE model recognizes that efficient algo- of q nodes, a block permutation among the q nodes
rithm design requires the efficient decomposition of the takes (τ + mβ ) time, where m is the size of the largest
problem among the available processors, and so, unlike block. Using this cost model, the communication time
some other models for hierarchical memory, the cost Tcomm (n, p) of an algorithm can be viewed as a func-
of computation is included in the complexity. The cost tion of the input size n, the number of nodes p, and the
model also encourages the exploitations of temporal parameters τ and β. The overall complexity of algorithm
and spatial locality. Specifically, memory at each SMP is for the cluster T(n, p) is given by the sum of Tsmp and
seen as consisting of two levels: cache and main mem- Tcomm (n, p).
ory. A block of m contiguous words can be read from or
written to main memory in (є + mr α
) time, where є is the Communication Primitives
latency of the bus, r is the number of processors com- The communication primitives are grouped into three
peting for access to the bus, and α is the bandwidth. By modules: Internode Communication Library (ICL),
contrast, the transfer of m noncontiguous words would SMP Node, and SIMPLE. ICL communication prim-
require m(є + αr ) time. itives handle internode communication, SMP Node
A parallel algorithm is viewed as a sequence of primitives aid shared-memory node algorithms, and
local SMP computations interleaved with communi- SIMPLE primitives combine SMP Node with ICL on
cation steps, where computation and communication SMP clusters.
is allowed to overlap. Assuming no congestion, the The ICL communication library services
transfer of a block consisting of m words between two internode communication and can use any of the
nodes takes (τ + mβ ) time, where τ is the latency of vendor-supplied or freely available thread-safe imple-
the network, and β is the bandwidth per node. SIM- mentation of MPI. The ICL libraries are based upon a
PLE assumes that the bisection bandwidth is sufficiently reliable, application-layer send and receive prim-
high to support block permutation routings among the itive, as well as a send-and-receive primitive
p nodes at the rate of β . In particular, for any subset which handles the exchanging of messages between
 H Hybrid Programming With SIMPLE

sets of nodes where each participating node is the for ( ≤ j ≤ L − ), each of kj processors writes k unique
source and destination of one message. The library copies of the data for its k children. The time com-
also provides a barrier operation based upon the plexity of this SMP replication algorithm given an
send and receive which halts the execution at each m-word buffer is
node until all nodes check into the barrier, at which L−
kj m
time, the nodes may continue execution. In addition, Tsmp = ∑ k(є + )
ICL includes collective communication primitives, for j= α
example, scan, reduce, broadcast, allreduce, m
= k(L − )є + (r − )
alltoall, alltoallv, gather, and scatter. α
m
See []. ≤ k(logk (r(k − ) + ) − ) є + (r − )
α
m
≤ k(logk (r + /k)) є + (r − )
SMP Node α
The SMP Node Library contains important prim- = O((log r)є + mr
α
). ()
itives for an SMP node: barrier, replicate,
broadcast, scan, reduce, and allreduce, The best choice of k, ( ≤ k ≤ r − ), depends on
whereby on a single node, barrier synchronizes SMP size r and machine parameters є and α, but can be
the threads, broadcast ensures that each thread chosen to minimize Eq. .
has the most recent copy of a shared memory loca- An algorithm which barrier synchronizes r SMP
tion, scan (reduce) performs a prefix (reduction) processors can use a fan-in tree followed by a fan-
operation with a binary associative operator (e.g., out tree, with a unit message size m, taking twice the
addition, multiplication, maximum, minimum, bitwise- replication time, namely, O((log r)є + r/α).
AND, and bitwise-OR) with one datum per thread, and For certain SMP algorithms, it may not be necessary
allreduce replicates the result from reduce. to replicate data, but to share a read-only buffer for a
Each of these collective SMP primitives can be given step. A broadcast SMP primitive supplies each
implemented using a fan-out or fan-in tree constructed processor with the address of the shared buffer by repli-
as follows. A logical k-ary balanced tree is built for an cating the memory address in Tsmp = O((log r)є + r/α).
SMP node with r processors, which serves as both a fan- A reduce primitive, which performs a reduc-
out and fan-in communication pattern. In a k-ary tree, tion with a given binary associate operator, uses a
level  has one processor, level  has k processors, level  fan-in tree, combining partial sums during each step.
has k  processors, and so forth, with level j containing For initial data arrays of size m per processor, this takes
kj processors, for  ≤ j ≤ L − ), where there are L lev- O((log r)є + mr/α). The allreduce primitive per-
els in the tree. If logk r is not an integer, then the last forms a reduction followed by replicate so that
level (L − ) will hold less than kL− processors. Thus, each processor receives a copy of the result with a cost
the number of processors r is bounded by of O((log r)є + mr α
).
L− L−
Scans (also called prefix-sums) are defined as
∑ k < r ≤ ∑k .
j j
() follows. Each processor i, ( ≤ i ≤ r − ), initially holds
j= j= an element ai , and at the conclusion of this primitive,
Solving for the number of levels L, it is easy to see that holds the prefix-sum bi = a ∗ a ∗ . . . ∗ ai , where ∗ is
any binary associative operator. An SMP algorithm sim-
L = ⌈logk (r(k − ) + )⌉ ()
ilar to the PRAM algorithm (e.g., []) is employed which
where the ceiling function ⌈x⌉ returns the smallest inte- uses a binary tree for finding the prefix-sums. Given an
ger greater than or equal to x. array of elements A of size r = d where d is a nonnega-
An efficient algorithm for replicating a data tive integer, the output is array C such that C(, i) is the
buffer B such that each processor i, ( ≤ i ≤ r − ), ith prefix-sum, for ( ≤ i ≤ r − ).
receives a unique copy Bi of B makes use of the fan- In fact, arrays A, B, and C in the SMP prefix-sum
out tree, with the source processor situated as the root algorithm (Alg. ) can be the same array. The analysis
log r
node of the k-ary tree. During step j of the algorithm, is as follows. The first for loop takes ∑h= (є + rh mα ) ,
Hybrid Programming With SIMPLE H 

log r
and the second for loop takes ∑h= ( +

) (є + rh mα ) (see Fig. ). Because no kernel modification is required,
for a total complexity of Tsmp ≤ є log r +  mrα
= these libraries easily port to new platforms.
O((log r)є + mr
α
). As mentioned previously, the number of threads per
node can vary, along with machine size. Thus, each
Simple thread has a small set of context information which
Finally, the SIMPLE communication library, built on holds such parameters as the number of threads on
top of ICL and SMP Node, includes the prim- the given node, the number of nodes in the machine,
itives for the SIMPLE model: barrier, scan, the rank of that node in the machine, and the rank of the
reduce, broadcast, allreduce, alltoall, thread both () on the node and () across the machine.
alltoallv, gather, and scatter. These hierar- Table  describes these parameters in detail.
chical layers of our communication libraries are pic- Because the design of the communication libraries
tured in Fig. . is modular, it is easy to experiment with different
The SMP Node, ICL, and SIMPLE libraries are implementations. For example, the ICL module can
implemented at a high level, completely in user space make use of any of the freely available or vendor- H
supplied thread-safe implementations of MPI, or a
small communication kernel which provides the nec-
Algorithm  SMP scan algorithm for processor i, essary message passing primitives. Similarly, the SMP
( ≤ i ≤ r − ) and binary associative operator ∗ Node primitives can be replaced by vendor-supplied
SMP implementations.
set B(, i) = A(i).

for h =  to log r do
if  ≤ i ≤ r/h then The Alltoall Primitive
set B(h, i) = B(h − , i − ) ∗ B(h − , i). One of the most important collective communication
events is the alltoall (or transpose) primitive
for h = log r downto  do which transmits regular sized blocks of data between
if  ≤ i ≤ r/h then each pair of nodes. More formally, given a collection of




⎪ If i even, Set C(h, i) = C(h + , i/); p nodes each with an m element sending buffer, where



⎪ p divides m, the alltoall operation consists of each
⎨If i = , Set C(h, ) = B(h, );



node i sending its jth block of mp data elements to node j,




⎪ If i odd, Set C(h, i) = C(h + , (i − )/) ∗ B(h, i). where node j stores the data from i in the ith block of its
⎩ receiving buffer, for all ( ≤ i, j ≤ p − ). An efficient
message passing implementation of alltoall would
SIMPLE Communication Library be as follows. The notation “var i ” refers to memory loca-
Barrier, Scan, Reduce, Broadcast, Allreduce,
tion “var + ( mp ∗ i),” and src and dst point to the source
Alltoall, Alltoallv, Gather, Scatter and destination arrays, respectively.
To implement this algorithm (Alg. ), multiple
Internode Communication Library SMP Node Library
threads (r ≤ p) per node are used. The local mem-
Scan, Reduce, Broadcast, Allreduce,
Alltoall, Alltoallv, Gather, Scatter Broadcast ory copy trivially can be performed concurrently by
Scan one thread while the remaining threads handle the
Barrier Reduce internode communication as follows. The p −  itera-
Barrier tions of the loop are partitioned in a straightforward
Send Recv SendRecv Replicate
manner to the remaining threads. Each thread has the
information necessary to calculate its subset of loop
Hybrid Programming With SIMPLE. Fig.  Hierarchy of indices, and thus, this loop partitioning step requires
SMP, message passing, and SIMPLE communication no synchronization overheads. The complexity of this
libraries primitive is twice (є + mr αr ) for the local memory read
 H Hybrid Programming With SIMPLE

User Program

SIMPLE

Message Passing SMP Node Library


ICL (e.g. MPI) POSIX threads

User Libraries
User Space

Kernel Space
Kernel

Hybrid Programming With SIMPLE. Fig.  User code can access SIMPLE, SMP, message passing, and standard user
libraries. Note that SIMPLE operates completely in user space

Hybrid Programming With SIMPLE. Table  The local Computation Primitives


context parameters available to each SIMPLE thread SIMPLE computation primitives do not communicate
Parameter Description data but affect a thread’s execution through () loop
NODES = p Total number of nodes in the cluster parallelization, () restriction, or () shared memory
MYNODE My node rank, from  to NODES −  management. Basic support for data parallelism, that
THREADS = r Total number of threads on my node is, “parallel do” concurrent execution of loops across
MYTHREAD The rank of my thread on this node, from  processors on one or more nodes, is provided.
to THREADS − 
TID Total number of threads in the cluster
Data Parallel
ID My thread rank, with respect to the cluster,
The SIMPLE methodology contains several basic
from  to TID − 
“pardo” directives for executing loops concurrently
on one or more SMP nodes, provided that no depen-
dencies exist in the loop. Typically, this is useful when
Algorithm  SIMPLE Alltoall primitive an independent operation is to be applied to every
location in an array, for example, in the element-wise
copy the appropriate mp elements from srcMYNODE to
addition of two arrays. Pardo implicitly partitions the
dstMYNODE.
loop to the threads without the need for coordinating
overheads such as synchronization or communication
for i =  to NODES −  do
between processors. By default, pardo uses block par-
set k = MYNODE ⊕ i;
titioning of the loop assignment values to the threads,
send mp elements from srck to node k, and
which typically results in better cache utilization due to
receive mp elements from node k to dstk .
the array locations on left-hand side of the assignment
being owned by local caches more often than not. How-
ever, SIMPLE explicitly provides both block and cyclic
partitioning interfaces for the pardo directive.
Similar mechanisms exist for parallelizing loops
and writes, and (τ + m
) for internode communication,
β across nodes. The all_pardo_cyclic (i, a, b)
for a total cost of O(τ + m
β
+є+ m
α
). directive will cyclically assign each iteration of the loop
Hybrid Programming With SIMPLE H 

across the entire collection of processors. For example, Memory Management


i = a will be executed on the first processor of the first Finally, shared memory allocations are the third cat-
node, i = a +  on the second processor of the first node, egory of SIMPLE computation primitives. Two direc-
and so on, with i = a + r −  on the last processor of tives are used:
the first node. The iteration with i = a + r is exe-
. node_malloc for dynamically allocating a shared
cuted by the first processor on the second node. After
structure
r ⋅ p iterations are assigned, the next index will again
. node_free for releasing this memory back to the
be assigned to the first processor on the first node.
heap
A similar directive called all_pardo_block, which
accepts the same arguments, assigns the iterations in a The node_malloc primitive is called by all threads
block fashion to the processors; thus, the first b−arp
iter- on a given node, and takes as a parameter the num-
ations are assigned to the first processor, the next block ber of bytes to be allocated dynamically from the heap.
of iterations are assigned to the second processor, and The primitive returns to each thread a valid pointer to
so forth. With either of these SIMPLE directives, each the shared memory location. In addition, a thread may H
processor will execute at most ⌈ rpn ⌉ iterations for a loop allow others to access local data by broadcasting the cor-
of size n. responding memory address. When this shared mem-
ory is no longer required, the node_free primitive
releases it back to the heap.
Control
The second category of SIMPLE computation primi-
tives control which threads can participate in the con-
SIMPLE Algorithmic Design
text by using restrictions. Programming Model
Table  defines each control primitive and gives the The user writes an algorithm for an arbitrary cluster size
largest number of threads able to execute the portion of p and SMP size r (where each node can assign possibly
the algorithm restricted by this statement. For example, different values to r at runtime), using the parameters
if only one thread per node needs to execute a com- from Table . SIMPLE expects a standard main func-
mand, it can be preceded with the on_one_thread tion (called SIMPLE_main() ) that, to the user’s view, is
directive. Suppose data has been gathered to a single immediately up and running on each thread. Thus, the
node. Work on this data can be accomplished on that user does not need to make any special calls to initial-
node by preceding the statement with on_one_node. ize the libraries or communication channels. SIMPLE
The combination of these two primitives restricts exe- makes available the rank of each thread on its node or
cution to exactly one thread, and can be shortcut with across the cluster, and algorithms can use these ranks
the on_one directive. in a straightforward fashion to break symmetries and

Hybrid Programming With SIMPLE. Table  Subset of SIMPLE control primitives

Control Primitives
Max number of
participating MYNODE MYTHREAD
Primitive Definition threads restriction restriction
on_one_thread only one thread per node p 
on_one_node all threads on a single r 
node
on_one only one thread on a   
single node
on_thread(i) one thread (i) per node p i
on_node(j) all threads on node j r j
 H Hybrid Programming With SIMPLE

partition work. The only argument of SIMPLE_main() Computational Model” have runtime initializations
is “THREADED,” a macro pointing to a private data which take place as follows.
structure which holds local thread information. If the The runtime start-up routines for a SIMPLE algo-
user’s main function needs to call subroutines which rithm are performed in two steps. First, the ICL ini-
make use of the SIMPLE library, this information is eas- tialization expands computation across the nodes in
ily passed via another macro “TH” in the calling argu- a cluster by launching a master thread on each of
ments. After all threads exit from the main function, the the p nodes and establishing communication chan-
SIMPLE code performs a shut down process. nels between each pair of nodes. Second, each mas-
ter thread launches r user threads, where each node
Runtime Support is at least an r-way SMP. (A rule of thumb in prac-
When a SIMPLE algorithm first begins its execution, tice is to use r threads on an r + -way SMP node,
the SIMPLE runtime support has already initialized which allows operating system tasks to fully utilize at
parallel mechanisms such as barriers and established least one CPU.) It is assumed that the r CPUs con-
the network-based internode communication chan- currently execute the r threads. The thread flow of
nels which remain open for the life of the program. an example SIMPLE algorithm is shown in Fig. .
The various libraries described in Sect.“The SIMPLE As mentioned previously, our methodology supports

Node 0 Node 1 Node p-1


Master Thread p-1

Thread (p-1, r-1)


Master Thread 0

Master Thread 1

Thread (p-1, 0)

Thread (p-1, 1)
Thread (0, r-1)

Thread (1, r-1)


Thread (0, 0)

Thread (0, 1)

Thread (1, 0)

Thread (1, 1)

Communication Phase
(Collective)

Time
Node Barrier

Irregular Communication
using sends and receives

Hybrid Programming With SIMPLE. Fig.  Example of a SIMPLE algorithm flow of master and compute-based user
threads. Note that the only responsibility of each master thread is only to launch and later join user threads, but never to
participate in computation or communication
Hybrid Programming With SIMPLE H 

only node-oriented communication, that is, given any shared memories of a p-node cluster of r-way SMPs.
source node s and destination node d, with s ≠ d, only Radix Sort decomposes each key into groups of ρ-bit
one thread on node s can send (receive) a message to digits, for a suitably chosen ρ, and sorts the keys by
(from) node d during a communication step. Also note applying a counting sort routine on each of the ρ-bit
that the master thread does not participate in any com- digits beginning with the digit containing the least sig-
putation, but sits idle until the completion of the user nificant bit positions []. Let R =  ρ ≥ p. Assume
code, at which time it coordinates the joining of threads (w.l.o.g.) that the number of nodes is a power of two,
and exiting of processes. say p = k , and hence Rp is an integer =  ρ−k ≥ .
The programming model is simply implemented
using a portable thread package called POSIX threads SIMPLE Counting Sort Algorithm
(pthreads). Counting Sort algorithm sorts n integers in the range
[, R − ] by using R counters to accumulate the number
A Possible Approach of keys equal to the value i in bucket Bi , for  ≤ i ≤ R − ,
The latency for message passing is an order of mag-
nitude higher than accessing local memory. Thus, the
followed by determining the rank of each key. Once the H
rank of each key is known, each key can be moved into
most costly operation in a SIMPLE algorithm is intern-
its correct position using a permutation ( np -relation)
ode communication, and algorithmic design must
routing [, ], whereby no node is the source or des-
attempt to minimize the communication costs between
tination of more than np keys. Counting Sort is a stable
the nodes.
sorting routine, that is, if two keys are identical, their
Given an efficient message passing algorithm, an
relative order in the final sort remains the same as their
incremental process can be used to design an efficient
initial order.
SIMPLE algorithm. The computational work assigned
to each node is mapped into an efficient SMP algo-
rithm. For example, independent operations such as
those arising in functional parallelism (e.g., indepen- Algorithm  SIMPLE Counting Sort Algorithm
dent I/O and computational tasks, or the local memory Step (): For each node i, ( ≤ i ≤ p − ), count the
copy in the SIMPLE alltoall primitive presented in frequency of its np keys; that is, compute Hi[k] , the
the previous section) or loop parallelism typically can number of keys equal to k, for ( ≤ k ≤ R − ).
be threaded. For functional parallelism, this means that
each thread acts as a functional process for that task, and Step (): Apply the alltoall primitive to the H
for loop parallelism, each thread computes its portion arrays using the block size Rp . Hence, at the end of this
of the computation concurrently. Loop transformations step, each node will hold R
consecutive rows of H.
p
may be applied to reduce data dependencies between
the threads. Thread synchronization is a costly opera- Step (): Each node locally computes the prefix-sums
tion when implemented in software and, when possible, of its rows of the array H.
should be avoided.
Step (): Apply the (inverse) alltoall prim-
Example: Radix Sort itive to the R corresponding prefix-sums augmented
Consider the problem of sorting n integers spread by the total count for each bin. The block size of the
evenly across a cluster of p shared-memory r-way SMP alltoall primitive is  Rp .
nodes, where n ≥ p . Fast integer sorting is crucial
for solving problems in many domains, and as such, is Step (): On each node, compute the ranks of the np
used as a kernel in several parallel benchmarks such as local elements using the arrays generated in Steps ()
NAS (Note that the NAS IS benchmark requires that the and ().
integers be ranked and not necessarily placed in sorted
order.) [] and SPLASH []. Step (): Perform a personalized communication of
Consider the problem of sorting n integer keys in keys to rank location using an np -relation algorithm.
the range [, M − ] that are distributed equally in the
 H Hypercube

The pseudocode for our Counting Sort algorithm works on ρ-bit digits of the input keys, starting from
(Alg. ) uses six major steps and can be described as the least significant digit of ρ bits to the most signifi-
follows. cant digit. Radix Sort easily can be adapted for clusters
In Step (), the computation can be divided evenly of SMPs by using the SIMPLE Counting Sort routine.
among the threads. Thus, on each node, each of r Thus, the total complexity for Radix Sort, assuming that
threads (A) histograms r of the input concurrently, and n ≥ p⋅max(R, r⋅max(p, log r)) is
(B) merges these r histograms into a single array for
node i. For the prefix-sum calculations on each node
in Step (), since the rows are independent, each of O( bρ (τ + np ( β + r
α
+ єr ))). ()
r threads can compute the prefix-sum calculations for
R
rp
rows concurrently. Also, the computation of ranks
on each node in Step () can be handled by r threads, Bibliography
where each thread calculates rpn ranks of the node’s . Bader DA () On the design and analysis of practical parallel
algorithms for combinatorial problems with applications to image
local elements. Communication can also be improved
processing, PhD thesis, Department of Electrical Engineering,
by replacing the message passing alltoall primitive University of Maryland, College Park
used in Steps () and () with the appropriate SIMPLE . Bader DA, Helman DR, JáJá J () Practical parallel algo-
primitive. rithms for personalized communication and integer sorting. Tech-
The h-relation used in the final step of Counting Sort nical Report CS-TR- and UMIACS-TR--, UMIACS and
is a permutation routing since h = np , and was given in Electrical Engineering. University of Maryland, College Park
. Bader DA, Helman DR, JáJá J () Practical parallel algo-
the previous section.
rithms for personalized communication and integer sorting.
Histogramming in Step (A) costs O(є + np α ) to ACM J Exp Algorithmics ():–. www.jea.acm.org//
read the input and O(є + R αr ) for each processor BaderPersonalized/
to write its histogram into main memory. Merg- . Bailey D, Barszcz E, Barton J, Browning D, Carter R, Dagum L,
Fatoohi R, Fineberg S, Frederickson P, Lasinski T, Schreiber R,
ing in Step (B) uses an SMP reduce with cost
Simon H, Venkatakrishnan V, Weeratunga S () The NAS
O((log r)є + R αr ). SIMPLE alltoall in Step () parallel benchmarks. Technical Report RNR--, Numerical
and the inverse alltoall in Step () take Aerodynamic Simulation Facility. NASA Ames Research Center,
O(τ + Rβ + є + Rα ) time. Computing local prefix-sums in Moffett Field
. JáJá J () An Introduction to Parallel Algorithms. Addison-
Step () costs O(є +R r
p ). Ranking each element in
pr α Wesley Publishing Company, New York
Step () takes O(є + np α ) time. Finally, the SIMPLE . Knuth DE () The Art of Computer Programming: Sorting and
Searching, vol . Addison-Wesley Publishing Company, Reading
permutation with h = np costs O(τ + np ( β + α + єr )), . Message Passing Interface Forum. MPI () A message-passing
for np ≥ r ⋅ max(p, log r). Thus, the total com- interface standard. Technical report. University of Tennessee,
Knoxville. Version .
plexity for Counting Sort, assuming that n ≥ p ⋅
. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A () The
max(R, r⋅max(p, log r)) is SPLASH- programs: Characterization and methodological con-
siderations. In Proceeding of the nd Annual Int’l Symposium
O(τ + np ( β + r
α
+ єr )). () Computer Architecture, pp –

SIMPLE Radix Sort Algorithm


The message passing Radix Sort algorithm makes
several passes of the previous message passing Count-
ing Sort in order to completely sort integer keys.
Counting Sort can be used as the intermediate sort- Hypercube
ing routine because it provides a stable sort. Let the n
integer keys fall in the range [, M − ], and M = b . Hypercubes and Meshes
Then bρ passes of Counting Sort is needed; each pass Networks, Direct
Hypercubes and Meshes H 

(vertices) and interconnect wires (edges) that are con-


Hypercubes and Meshes necting the processors. Since high-speed data trans-
fers normally require an unidirectional point-to-point
Thomas M. Stricker connection, the resulting interconnection graphs are
Zürich, Switzerland directed. In addition to the abstract connectivity graph,
the optimal layout of processing elements and wires
in physical space must be studied. In reference to the
Synonyms branch of mathematics dealing with invariant geomet-
Distributed computer; Distributed memory computers; ric properties in relation to different spaces, this spec-
Generalized meshes and tori; Hypercube; Interconnection ification is called the topology of a parallel computer
network; k-ary n-cube; Mesh; Multicomputers; Multi- system.
processor networks In geometry, the classical term cube refers to the
regular convex hexahedron as one of the five platonic
solids in three-dimensional space. Cube alike struc- H
Definition tures, beneath and beyond the three dimensions of a
In the specific context of computer architecture, a hyper- physical cube can be listed as follows and are drawn in
cube refers to a parallel computer with a common Fig. .
regular interconnect topology that specifies the layout
● Zero dimensional: A dot in geometry or a non-
of processing elements and the wiring in between them.
connected uniprocessor in parallel computing.
The etymology of the term suggests that a hypercube
● One dimensional: A geometric line segment or a
is an unbounded, higher dimensional cube alike geo-
twin processor system, connected with two unidi-
metric structure, that is scaled beyond (greek “hyper”)
rectional wires.
the three dimensions of a platonic cube (greek “kubos”).
● Two dimensional: A geometric square or a four
In its broader meaning, the term is also commonly
processor array, connected with four unidirectional
used to denote a genre of supercomputer-prototypes
wires in the horizontal and four wires in the vertical
and supercomputer-products, that were designed and
direction.
built in the time period of –, including the Cos-
● Three dimensional: A geometric cube (hexahedron)
mic Cube, the Intel iPSC hypercube, the FPS T-Series,
or eight processors connected with  wires.
and a similar machine manufactured by nCUBE corpo-
● Four dimensional: A tesseract in geometry or sixteen
ration. A mesh-connected parallel computer is using a
processors, arranged as two cubes with the eight
regular interconnect topology with an array of multiple
corresponding vertices linked by two unidirectional
processing elements in one, two, or three dimensions.
wires each.
In a generalization from hypercubes and meshes to the
● n-dimensional: A hypercube as an abstract geomet-
broader class of k-ary n-cubes, the concept extends to
ric figure, that becomes increasingly difficult to draw
many more distributed-memory multi-computers with
on paper or the arrangement of n processor with
highly regular, direct networks.
each processor having n wires to its n immediate
neighbors in all n dimensions.
Discussion
Introduction and Technical Background History
In a distributed memory multicomputer with a direct net- With the evolution of supercomputers from a single vec-
work, the processing elements also serve as switching tor processor toward a collection of high performance
nodes in the network of wires connecting them. For microprocessors during the mid-s, a large amount
a mathematical abstraction, the arrangement of pro- of research work focused on the best possible topol-
cessors and wires in a parallel machine is commonly ogy for massively parallel machines. In that time period
expressed as a graph of nodes of processing elements the majority of parallel systems were built from a large
 H Hypercubes and Meshes

1 1 3 5 7
0
1 3
0 0 2 4 6

Dot Line Square 0 2


0-cube 1-cube 2-cube Cube
uniprocessor twin proc. quad proc. 3-cube

...
Hypercubes

Tesseract 64-node hypercube


binary 4-cube binary 6-cube

Hypercubes and Meshes. Fig.  Graphical rendering of dot ( dimensional), line (D), square (D), cube (D), and
hypercubes (>D)

number of identical processing elements, each com- of the effort are compiled into a comprehensive text
prising a standard microprocessor, a dedicated semi- book [].
conductor random access memory, a network interface During the golden age of the practical hypercubes
and – in some cases – even a dedicated local storage disk (roughly during the years of –), it was assumed
in every node. All processing elements also served as that the topology of high-dimensional binary hyper-
switching points in the interconnect fabric. This evolu- cubes would result in the best network for the direct
tion of technology in highly parallel computers resulted mapping of many well-known parallel algorithms. In a
in the class of distributed memory parallel computers or binary hypercube each dimension is populated with just
multi-computers. two processing nodes as shown in Fig. . The wiring of
The extensive research activity goes back to a spe- P ⋆ log  P unidirectional wires to connect P processors
cial focus on geometry in the theory of parallel algo- in hypercube topology seemed to be an optimal com-
rithms pre-dating the development of the first prac- promise between the minimal interconnect of just P
tical parallel systems. This led to a widespread belief, wires for P processors in a ring and the complete inter-
that the physical topology of parallel computing archi- connect with P⋆ (P-) wires in a fully connected graph
tectures had to reflect the communication pattern of (clique).
the most common parallel algorithms known at the In a hypercube or a mesh layout for a parallel
time. The obvious mathematical treatment of the ques- system, the processors can only communicate data
tion through graph theory resulted in countless the- directly to a subset of neighboring processors and there-
oretical papers that established many new topologies, fore require a mechanism for indirect communication
mappings, emulations, and entire equivalence classes through data forwarding at the intermediate process-
of topologies including meshes, hypercubes, and fat ing nodes. Indirect communication can take place along
trees. The suggestions for a physical layout to con- pre-configured virtual channels or through the use of
nect the nodes of a parallel multicomputer range from forwarded messages. Accordingly the practical hyper-
simple one-dimensional arrays of processors to some cube and mesh systems are primarily designed for mes-
fully interconnected graphs with direct wires from every sage passing as their programming model. However,
node to every other node in the system. Hierarchical most stream-based programming models can be sup-
rings and busses were also considered. The many results ported efficiently with channel virtualization. Shared
Hypercubes and Meshes H 

memory programming models are often supported on a hypercube communication pattern all communica-
top of a low level message passing software layer or by tion activities take place between nodes that are direct
hardware mechanisms like directory-based caches that neighbors in the layout of a hypercube. For the common
rely on messages for communication. number scheme of nodes this is between node pairs that
Binary hypercube parallel computers like the Cos- differ by exactly one bit in their number. Every commu-
mic Cube [], the iPSC [, ] the FPS T Series nication in a hypercube pattern therefore crosses just
[] or NCube [] were succeeded around  one wire in the hypercube machine.
by a series of distributed memory multi-computers
with a two- or three-dimensional mesh/torus topol-
ogy, such as the Ametek,Warp/iWarp, MassPar, J-
Machine, the Intel Paragon, Intel ASCI Red, Cray Algorithms with Pure Hypercube
TD, TE, Red Storm, and XT/Seastar. Scaling to Communication Patterns
larger machines, the early binary hypercubes were at Several well-known and important algorithms do
a severe disadvantage, because they required a high require a sequence of inter-node communication steps H
number of dimensions resulting in an unbounded that follow a pure hypercube communication pattern. In
number of ports for the switches at every node most of these algorithms, the communication happens
(due to the unbounded in/out degree of the hyper- in steps along increasing or decreasing dimensions, e.g.,
cube topology). Something all the hypercubes and the the first communication step between neighbors whose
more advanced mesh/torus architectures have in com- number differs in the least significant bit followed by
mon is that they rely heavily on the message rout- a next step until a last step with communication to
ing technologies developed for the early hypercubes. neighbors differing in the most significant bit. Most
They are classified as generalized hypercubes (k-ary parallel computers cannot use more than one or two
n-cubes) in the literature and therefore included in this links for communication at the same time due to band-
discussion. By the turn of the century the concept of width constraints within the processor node. Therefore,
hypercubes was definitely superseded by the network of a true pipeline with more than one or two overlapping
workstations (NOWs) and by the cluster of personal com- hypercube communication steps is rarely encountered
puters (COPs) that used a more or less irregular inter- in practical programs.
connect fabric built with dedicated network switches One of the earliest and best-known algorithms
that are readily available from the global working tech- with a pure hypercube communication pattern is the
nology of the Internet. bitonic sorting algorithm that goes back to a generalized
merge sort algorithm for parallel computers published
in  []. The communication graph in Fig.  shows all
Significance of Parallel Algorithms and the hypercube communication steps that are required to
Their Communication Patterns merge locally sorted subsequences into a globally sorted
A valid statement about the optimal interconnect layout sequence on  processors.
in a parallel system can only be made under the assump- The most significant and most popular algorithm
tion that there is a best suitable mapping between the M communicating between nodes in a pure hypercube
data elements of a large simulation problem and the P pattern is the classic calculation of a Fast Fourier Trans-
processing elements in a parallel computer. For simple, form (FFT) over an one-dimensional array, with its data
linear mappings (e.g., M/P elements on every proces- distributed among P processors in a simple block par-
sor in block or block/cyclic distribution) the commu- titioning. During the Fourier transformation of a large
nication patterns of operations between data elements distributed array a certain number of butterfly calcula-
translate into roughly the same communication pat- tions can be carried out in local memory up to the point
tern of messages between the processors in the parallel when the distance between the elements in the butter-
machine. For simplicity, we also assume M >> P and fly becomes large enough that processor boundaries are
that M and P are powers of two. In an algorithm with crossed and some inter-processor communication over
 H Hypercubes and Meshes

P0 P0
P1 P1
P2 P2
P3 P3
P4 P4
P5 P5
P6 P6
P7 P7
P8 P8
P9 P9
P10 P10
P11 P11
P12 P12
P13 P13
P14 P14
P15 P15
Step 1 Step 2,3 Step 4,5 and 6 Step 7,8,9 and 10

Hypercubes and Meshes. Fig.  Batcher bitonic parallel sorting algorithm. A series of hypercube communication steps
will merge  locally sorted subsequences into one globally sorted array. The arrows indicate the direction of the
merge, i.e., one processor gets the upper and the other processor gets the lower half of the merged sequence

the hypercube links is required. The distance of butter- of communication between the model nodes in at most
fly operations is always a power of two and therefore it three dimensions because the longer range forces can be
maps directly to a hypercube. summarized or omitted. Consequently the communica-
In the more advanced and multi-dimensional FFT tion in the parallel code of an equation solver, based on
library routines the communication operations between relaxation techniques is also limited to nearby processor
elements at the larger distances might eventually be in two or three dimensions.
reduced by dynamically rearranging the distribution of Many calculations in natural science do not require
array elements during the calculation. Such a redistri- communication in dimensions that are higher than
bution is typically performed by global data transpose three and there is no need to arrange processing ele-
operation that requires an all-to-all personalized com- ments of a parallel computer in hypercube topologies.
munication (AAPC), i.e., individual and distinct data Users in the field of scientific computing have recently
blocks are exchanged among all P processors. In gen- introduced some optimizations that require increased
eral, fully efficient and scalable AAPC does require a connectivity. Modern physical simulations are carried
scalable bandwidth interconnect, like the binary hyper- out on irregular meshes, that are modeling the physi-
cube. In practice a talented programmer can optimize cal space more accurately by adapting the density of
such collective communication primitives to run at meshing points to some physical key parameter like the
maximal link speed for sufficiently large mesh- and density of the material, the electric field, or a current
torus-connected multi-computers (i.e., Cray TD/TE flow. Irregular meshes grid points and sparse matri-
tori with up to  nodes) []. ces are more difficult to partition into regular parallel
machines. Advanced simulation methods also involve
Algorithms with Next Neighbor the summation of certain particle interactions and
Communication Pattern forces in Fourier and Laplace spaces to achieve a better
Most simulations in computational physics, chemistry, accuracy in fewer simulation steps (e.g., the particle-
and biology are calculating some physical interac- mesh Ewald summation in molecular dynamics). Both
tions (forces) between a large number of model nodes improvements result in denser communication patterns
(i.e., particles or grid points) in three-dimensional phys- that require higher connectivity. In the end it remains
ical space. Therefore, they only require a limited amount fairly hard to judge for the systems architect, whether
Hypercubes and Meshes H 

these codes alone can justify the more complex wiring common strategies for forwarding the messages in a
of a high-dimensional hypercube within the range of hypercube network are wormhole or cut-through rout-
the machine sizes, that are actually purchased by the ing. If not done carefully, blocking and delaying mes-
customers. sages along the path can lead to tricky deadlock situa-
tions, in particular when the messages are indefinitely
blocking each other within an interconnect fabric. Dur-
Meshes as a Generalization of Binary ing the era of the first hypercube multicomputers two
Hypercubes into k-ary n-Cubes key technologies for deadlock avoidance were described
In the original form of a binary hypercube, the wires [] and applied to the construction of prototypes and
connect only two processing nodes in every dimension. products:
In linear arrays and rings a larger number of processors
can be connected in every dimension. The combination ● Dimension order routing: In binary hypercubes
of multiple dimensions and multiple processors con- routes between two nodes must be chosen in a way
nected along a line in one dimension leads to a natural that the distance in the highest dimension is trav-
generalization of the hypercube concept that includes eled first, before any distance of a lesser dimension
H
the popular D, D, and D arrays of processing ele- is traveled. Therefore messages can only be stopped
ments. In computer architecture, the term “mesh” is due to a busy channel in a dimension lower than
used for a two- or higher-dimensional processor array. the one that they are traveling. The channels in the
A linear array of processing nodes can be wired into a lowest dimension always lead to a final destina-
ring by a wrap-around link. A two or higher dimen- tion. The messages occupying these channels can
sional mesh, that is equipped with wrap-around links make progress immediately and free the resource
is called a torus. for other messages blocked at the same and higher
The generalized form of hypercubes and meshes is dimensions. This is sufficient to prevent a routing
characterized by two parameters, the maximal number deadlock.
of elements found along one dimension, k, and the total ● Dateline-based torus routing: With the wrap-around
number of dimensions used, n. The binary hypercubes link of rings and tori, the case of messages block-
described in the introduction are classified as a -ary ing each other in a single dimension around the
n-cubes. A k × k, two-dimensional torus can be classi- back-loops has to be addressed. This is done by repli-
fied as a k-ary -cube. cating the physical wires into some parallel virtual
With the popularity of the hypercube topology in channels forming a higher and a lower virtual ring.
the beginning of distributed memory computing, many A dateline is established at a particular position in
key technologies for message forwarding and rout- the ring. The dateline can only be crossed if the mes-
ing were developed in the early hypercube machines. sage switches from the higher to the lower ring at
These ideas extend quite easily to the generalized that position. With this simple trick, all connections
k-ary n-cubes and were incorporated quickly into along the ring remain possible, but the messages in
meshes and tori. In hypercubes, meshes and tori alike, the ring can no longer block each other in a circular
the messages traveling between arbitrary processing deadlock.
nodes have to traverse multiple links to reach their
destination. They have to be forwarded in a number Both deadlock avoiding techniques are normally com-
of intermediate nodes, raising the important issues of bined to permit deadlock-free routing in the general-
message forwarding policies and deadlock avoidance. ized k-ary n-cube topologies. Extending the well-known
In a general setting it would be very hard to establish dimension-order and dateline routing strategies to
a global guarantee that every message can travel right slightly irregular network topologies is an additional
through without any delay along a given route. There- challenge. The two abstract rules for legal message
fore, some low level flow control mechanisms are used routes can be translated into a simple channel number-
to stop and hold the messages, if another message is ing scheme providing a total order of all channels within
already occupying a channel along the path. The most an interconnect, including all irregular nodes. A simple
 H Hypercubes and Meshes

Disk Disk
1.02.02 3.02.02 1.04.02 3.04.02

1.02.01
3.02.01 3.04.01 1.04.01
3.01.02
1.01.01 2.01.03 2.01.05 2.01.07 3.01.02
1.01.02 3.01.01 2.01.17 2.01.15 2.01.13 1.01.02
2.12.03 2.14.03 2.16.03 2.18.03
2.12.17
2.14.17 2.16.17 2.18.17

2.03.03 2.03.05 2.03.07


Terminal 2.03.17 2.03.15 2.03.13
I/O
2.12.15 2.12.05 2.14.05 2.16.05 2.18.05
2.14.15 2.16.15 2.18.15

Video 2.05.03 2.05.05 2.05.07


I/O
2.05.17 2.05.15 2.05.13
2.12.07 2.14.07 2.16.07 2.18.07
2.12.13
2.14.13 2.16.13 2.18.13

3.03.02 1.03.01 2.07.03 2.07.05 2.07.07 3.03.02


11.18
1.03.02 3.03.01 2.07.17 2.07.15 2.07.13 1.03.02
1.02.02 3.02.02 1.04.02 3.04.02

Hypercubes and Meshes. Fig.  Generalized hypercubes. Channel numbering based on the original hyperube routing
rules for the validation of a deadlock-free router in a slightly irregular two-dimensional torus with several IO nodes
(an iWarp system)

rule that channels must be followed in a strictly increas- halves. In meshes and tori the total bisection bandwidth
ing/decreasing order to form a legal route is sufficient does not scale up linearly with the number of nodes.
to prevent deadlock. With the technique, illustrated in The global operations, that are usually communication
Fig. , it becomes possible to code and validate the bound, become increasingly costly in larger machines.
router code for any k-ary -cube configuration that was The concept of the scalable bisection bandwidth was
offered as part of the Intel/CMU iWarp product line, thought to be a key advantage of the hypercube designs
including the service nodes that are arbitrarily inserted that is not present in meshes for a long time.
into the otherwise regular torus network []. In the theory of parallel algorithms the asymptotic
lower bounds on communication in binary hypercubes
Mapping Hypercube Algorithms into usually differ from the bounds holding for communica-
Meshes and Tori tion in an one-, two- or three-dimensional mesh/torus.
By design the binary hypercube topology provides a The lower bound on the number of steps (i.e., the time
fully scalable worst case bisection bandwidth. Regard- complexity) is determined by the amount of data that
less of how the machine is scaled or partitioned, the has to be moved between the processing elements and
algorithm can count on the same amount of bandwidth by the number of wires that are available for this task
per processor for the communication between the two (i.e., the bisection bandwidth). The upper bound is
Hypercubes and Meshes H 

usually determined by the sophistication of the algo- In this simple mapping some links have to be shared
rithm. One of the most studied algorithms is parallel or multiplexed by a factor of two and next-neighbor
sorting in a – comparison based model [, ]. Simi- communication is extended to a distance of up to three
lar considerations can be made for matrix transposes or hops. The mapping of larger virtual hypercube networks
redistributions of block/cyclic arrays distributed across comes at the cost of an increasing dilation (increase in
the nodes of a parallel machine. link distance) and congestion (a multiplexing factor of
In the practice of parallel computing, a mesh- each link) for higher dimensions. The congestion fac-
connected machine must be fairly large until an tor can be derived using the calculations of bisection
algorithmic limitation due to bisection bandwidth bandwidth for both topologies. A binary -cube can be
is encountered and a slowdown over a properly mapped into a D torus and the -cube into a D torus
hypercube connected machine can be observed []. by extending the linear scheme to multiple dimensions.
As an example, the bandwidth of an -node ring, In their VLSI implementation, a two- or three-
a -node two - dimensional torus or a -node dimensional mesh/torus-machine is much easier to lay
three-dimensional torus is sufficient to permit arbitrary out in an integrated circuit than an eight- or ten- H
array transpose operations without any slowdown over dimensional hypercube with its complicated wiring
a hypercube of the same size. The comparison assumes structure. Therefore its simple next-neighbor links can
that the processing nodes of both machines can trans- be built with faster wires and wider channels for mul-
fer data from and to the network at about the speed of tiple wires. The resulting advantage for the implemen-
a bidirectional link – as this is the case for most parallel tation is known to offset the congestion and dilation
machines. Therefore, a few simple mapping techniques factors in practice. It is therefore highly interesting to
are used in practice to implement higher dimensional study these technology trade-offs in practical machines.
binary hypercubes on k-ary meshes or tori. A three- For – sorting, the algorithmic complexity and the
dimensional binary hypercube can be mapped into an measured execution times on a virtual hypercube
eight node ring with a small performance penalty using mapped to mesh match the known complexities and
a Gray code mapping as illustrated in Fig. .

4
(1,1,0) (1,1,1)
6 7
5

(1,0,0)
4 5 7

(0,1,1) 6 ...
2 3

0 1 3
(0,0,0)
(0,0,1)
1

Hypercubes and Meshes. Fig.  Gray code mapping of a binary three-cube into an eight node ring. Virtual network
channels were used to implement virtual hypercube network in machines that are actually wired as meshes or tori
 H Hypercubes and Meshes

running times of the algorithm for direct execution on the luxury of a low latency, high performance intercon-
a mesh fairly closely [, ]. nect. Despite all trends to abstraction and de-emphasis
of network structure, the key figures of message forward-
Limitations of the Hypercube Machines ing latency and the bisection bandwidth remain the most
The ideal topology of a binary hypercube comes at the important measure of scalability in a parallel computer
high cost of significant disadvantages and restrictions architecture.
regarding their practical implementation in the VLSI
circuits of a parallel computer. Therefore, a number of Hypercube Machine Prototypes and
alternative options to build multicomputers were devel- Products
oped beyond hypercubes, meshes, and their superclass One of the earliest practical computer systems using
of the k-ary n-cubes. a hypercube topology is the Cosmic Cube prototype
Among the alternatives studied are: hierarchical designed by the Group of C. Seitz at the California Insti-
rings, that differ slightly from multi-dimensional tori, tute of Technology []. The system was developed in
cube-connected cycles, that successfully combine the the first half of the s using the VLSI integration
scalable bisection bandwidth of hypercubes with the technologies newly available at the time. The first ver-
bounded in/out degrees of lower - dimensional meshes sion of the system comprised  nodes with an Intel
and finally fat trees that are provably optimal in their lay- / microprocessor/floating point coprocessors
out of wiring for VLSI. Those interconnect structures combination running at  MHz and using  kB of
have in common, that they are still fairly regular and memory in each node. The nodes were connected with
can leverage of the same practical technical solutions point-to-point links running at  Mbit/s nominal speed.
for message forwarding, routing and link level flow con- The original system of  nodes was allegedly planned
trol, the way the traditional hypercubes do. None of as a  ×  ×  three-dimensional torus with bidirec-
these regular interconnects requires a spanning routing tional links – but this particular topology is fully equiv-
protocol or TCP/IP connections to function properly. alent to a binary six-dimensional hypercube under a
Towards the end of the golden age of hypercubes, gray code mapping and became therefore known as
the regular, direct networks were facing an increas- a first hypercube multicomputer, rather than as a first
ing competition by new types of networks built either three-dimensional torus. Subsequently a small number
from physical links with different strength (i.e., dif- of commercial machines were built in a collaboration
ferent link bandwidths implemented by a multiplica- with a division of AMETEK Corporation [].
tion of wires or in a different technology). In , the The Caltech prototypes lead to the development of
Thinking Machines Corporation announced the Con- iPSC at the Intel Supercomputer Systems Division, a
nection Machine CM using a fat tree topology and commercial product series based on hypercube topol-
this announcement contributed to end of the hyper- ogy [, ]. An example of the typical technical spec-
cube age. After the year , the physical network ifications is the iPSC/ system as released in : 
topologies became less and less important in super- nodes, Intel / processor/floating point copro-
computer architecture. The detailed structure of the cessor running at  MHz, – MB of main memory in
networks became hidden and de-emphasized due to each node, the nodes connected with links running 
many higher-level abstractions of parallelizing com- Mbit/s. iPSC systems were programmed using the NX
pilers and message passing libraries available to the message passing library, similar in the functionality, but
programmers. pre-dating the popular portable message passing sys-
The availability of high performance commodity tems like PVM or MPI. The development of the product
networking switches for use in the highly irregular line was strongly supported by (D)ARPA, the Advanced
topology of the global Internet accelerated this trend. Research Project Agency of the US Dept. of Defense.
In most newer designs, a fairly large multistage network During roughly the same time period a hypercube
of switches (e.g., a Clos network) has replaced the older machine was also offered by NCube Corporation, a
direct networks of hypercubes and meshes. At this time super-computing vendor that was fully dedicated to
only a small fraction of PC Clusters is still including the development of hypercube systems at the time. The
Hypercubes and Meshes H 

NCube , model, available to Nasa AMES in  Beowulf system on display at the Supercomputing trade
was a -node system with a  MHz full custom  show wired as a -node binary -cube using multi-
bit CPU/FPU with up to  MB main memory in every ple BaseT network interface cards and simple CAT
node and  Mbit/s interconnects. The largest configu- crossover cables.
ration commercially offered was a binary -cube with But as mentioned before, the designs of processors
, processing nodes [, ]. with their own communication hardware on board and
The FPS (Floating Point Systems) T Series Parallel direct networks were quickly replaced by commodity
Vector Supercomputer was also made of multiple pro- switches manufactured by a large number of network
cessing nodes interconnected as a binary n-cube. Each equipment vendors in the Internet world. So the topol-
node is implemented on a single printed circuit board, ogy of the parallel system suddenly became a secondary
contains  MB of data storage memory, and has a peak issue. A large variety of different network configurations
vector computational speed of  MFLOPS. Eight vector rapidly succeeded the regular Beowulf systems in the
nodes are grouped around a system support node and growing market for clusters of commodity PCs.
a disk storage system to form a module. The module is H
the basic building block for larger system configurations
Research Conferences on Hypercubes
in hypercube topology. The T series was named after
The popularity of hypercube architectures in the second
the tesseract, a geometric cube in four-dimensional
half of the s led to several instances of a com-
space [].
puter architecture conference dedicated solely to dis-
The Thinking Machine CM also included a hyper-
tributed memory multicomputers with hypercube and
cube communication facility for data exchanges between
mesh topologies. The locations and dates of the confer-
the nodes as well as a mesh (called NEWS grid) [].
ences are as remembered or collected from citations in
As an SIMD machine with a large number of bit slice
subsequent papers:
processors, its architecture differed significantly from
the classical hypercube multicomputer. In its successor, ● First Hypercube Conference in Knoxville, August
the Thinking Machine CM, the direct networks were –, , with proceedings published by SIAM
given up in favor of a fat tree using a variable number after the conference.
of (typically N-) of additional switches to form a net- ● Second Conference on Hypercube Multiproces-
work in fat tree topology. A similar trend was followed sors, Knoxville, TN, September –October , ,
in the machines of Meiko/Quadrics. The delay of the with proceedings published by SIAM after the
T transputer chips with processing and communi- conference.
cation capabilities on a single chip forced the designers ● Third Conference on Hypercube Concurrent Com-
of the CS system to build their communication system puters and Applications, Pasadena, CA, January
with a multi-stage fabric of switches instead of a hyper- –,  with proceedings published by the
cube interconnect. The multicomputer line by IBM, the ACM, New York.
IBM SP and its follow-on products also used a multi- ● Fourth Conference on Hypercube Concurrent Com-
stage switch fabric. Therefore, these architectures do puters and Applications, Monterrey, CA, March –,
no longer qualify as generalized hypercubes with direct , proceedings publisher unknown.
networks.
and later giving in to the trend that the architecture as
It is worth mentioning that the first Beowulf clus-
distributed memory computer is more important than
ters of commodity PCs were equipped with three to
the hypercube layout:
five network interfaces that could be wired directly in
hypercube topology. Routing of messages between the ● Fifth Distributed Memory Computing Conference,
links was achieved by system software in the ether- Charleston, SC, April –,  with proceedings
net drivers of the LINUX operating system. The com- published by IEEE Computer Society Press.
munication benchmark presented in the first beowulf ● Sixth Distributed Memory Computing Conference,
paper were measured on an -node system, wired as ˆ Portland, OR, April –May ,  with proceedings
cube []. The author also remembers encountering a published by IEEE Computer Society Press.
 H Hypercubes and Meshes

● First European Distributed Memory Computing, different US national labs evaluating and benchmarking
EDMCC, held in Rennes, France. these distributed memory multicomputers. The most
● Second European Distributed Memory Computing, interesting study dedicated entirely to binary hyper-
EDMCC, Munich, FRG, April –,  with cubes appeared in  in Parallel Computing []. The
Proceedings by Springer, Lecture Notes in Com- visit to the trade show of “Supercomputing” during
puter Science. the “hypercube” years was a memorable experience,
because countless new distributed memory system ven-
After these rather successful meetings of computer
dors surprised the audience with a new parallel com-
architects and application programmers, the conference
puter architecture every year. Most of them contained
series on hypercubes and distributed memory parallel
some important innovation and can still be admired in
systems lost its momentum and in , the specialized
the permanent collections of the computer museums
venues were unfortunately discontinued.
in Boston (MA), Mountain View (CA), and Paderborn
(Germany).
Related Entries
Beowulf clusters
Bitonic Sort Bibliography
Clusters . Leighton FT () Introduction to parallel algorithms and archi-
tectures: array, trees, hypercubes. Morgan Kaufmann Publishers,
Connection Machine
San Francisco p, ISBN:---
Cray TE . Stricker T () Supporting the hypercube programming model
Cray Vector Computers on meshes (a fast parallel sorter for iwarp). In: Proceedings of the
Cray XT and Seastar -D Torus Interconnect symposium for parallel algorithms and architectures, pp –,
Distributed-Memory Multiprocessor San Diego, June 
Fast Fourier Transform (FFT) . Batcher KE () Sorting networks and their applications. In:
Proceedings of the american federation of information processing
IBM RS/ SP
societies spring joint computer conference, vol . AFIPS Press,
Interconnection Networks Montvale, pp –
MasPar . Nassimi D, Sahni S () Bitonic sort on a mesh-connected
MPI (Message Passing Interface) parallel computer. In: IEEE TransComput ():–
MPP . Hinrichs S, Kosak C, O’Hallaron D, Stricker T, Take R ()
Optimal all-to-all personalized communication in meshes and
nCUBE
tori. In: Proceedings of the symposium of parallel algorithms and
Networks, Direct architectures, ACM SPAA’, Cape May, Jan 
Routing (including Deadlock Avoidance) . Stricker T () Message routing in irregular meshes and tori.
Sorting In: Proceedings of the th IEEE distributed memory computing
Warp and IWarp conference, DMCC, Portland, May 
. Seitz CL () The cosmic cube. Commun ACM ():–
. Close P () The iPSC/ node architecture. In: Proceedings of
Bibliographic Notes and Further the third conference on hypercube concurrent computers and
Reading applications, Pasadena, – Jan , pp –
A most detailed description of the algorithms optimally . Nugent SF () The iPSC/ direct-connect communications
technology. In: Proceedings of the third conference on hyper-
suited for hypercubes, the classes of hypercube equiva-
cube concurrent computers and applications, Pasadena, – Jan
lent networks, and the emulation of different topologies , pp –
with respect to algorithmic models is given in an  . Hawkinson S () The FPS T series supercomputer, system
page textbook by F.T. Leighton that appeared in  modelling and optimization. In: Lecture notes in control and
[]. A detailed description of the network topologies information sciences, vol /. Springer, Berlin/Heidelberg,
and the technical data of all practical hypercube pro- pp –
. The nCUBE Handbook, Beaverton OR,  and the nCUBE
totypes built and machines commercially sold between
 processor, user manual, Beaverton
 and  can be found through Google Scholar in . Dunigan TH () Performance of the Intel iPSC //
the numerous papers written by the architects working and Ncube / hypercubes, parallel computing . North
for the computer manufacturers or by researches at the Holland, Elsevier, pp –
Hypergraph Partitioning H 

. Tucker LW, Robertson GG () Architecture and applications Formal Definition of Hypergraph
of the connection machines. IEEE Comput ():– Partitioning
. Becker J, Sterling D, Savarese T, Dorband JE, Ranawake UA, A hypergraph H = (V, N ) is defined as a set of vertices
Packer CV () Beowulf: a parallel workstation for scientific
(cells) V and a set of nets (hyperedges) N among those
computation. In: Proceedings of ICPP workshop on challenges
for parallel processing, CRC Press, Oconomowc, August  vertices. Every net n ∈ N is a subset of vertices, that is,
. Seitz CL, Athas W, Flaig C, Martin A, Seieovic J, Steele CS, Su WK n ⊆ V. The vertices in a net n are called its pins. The size
() The architecture and programming of the ametek series of a net is equal to the number of its pins. The degree of
 multicomputer. In: Proceedings of the third conference on a vertex is equal to the number of nets it is connected to.
hypercube concurrent computers and applications, Pasadena, –
Graph is a special instance of hypergraph such that each
 Jan , pp –
net has exactly two pins. Vertices can be associated with
weights, denoted with w[⋅], and nets can be associated
with costs, denoted with c[⋅].
Π = {V , V , . . . , VK } is a K-way partition of H if the
following conditions hold: H
Hypergraph Partitioning ● Each part Vk is a nonempty subset of V, that is, Vk ⊆
V and Vk ≠ / for  ≤ k ≤ K
Ümit V. Çatalyürek , Bora Uçar , Cevdet Aykanat
 ● Parts are pairwise disjoint, that is, Vk ∩ V ℓ = / for all
The Ohio State University, Columbus, OH, USA

ENS Lyon, Lyon, France
≤k<ℓ≤K

Bilkent University, Ankara, Turkey ● Union of K parts is equal to V, i.e., ⋃Kk= Vk =V

In a partition Π of H, a net that has at least one pin


(vertex) in a part is said to connect that part. Connectiv-
ity λ n of a net n denotes the number of parts connected
Definition
by n. A net n is said to be cut (external) if it connects
Hypergraphs are generalization of graphs where each
more than one part (i.e., λn > ), and uncut (internal)
edge (hyperedge) can connect more than two vertices.
otherwise (i.e., λ n = ). A partition is said to be balanced
In simple terms, the hypergraph partitioning problem
if each part Vk satisfies the balance criterion:
can be defined as the task of dividing a hypergraph
into two or more roughly equal-sized parts such that a Wk ≤ Wavg ( + ε), for k = , , . . . , K. ()
cost function on the hyperedges connecting vertices in
different parts is minimized. In (), weight Wk of a part Vk is defined as the sum
of the weights of the vertices in that part (i.e., Wk =
∑v∈Vk w[v]), Wavg denotes the weight of each part
Discussion under the perfect load balance condition (i.e., Wavg =
(∑v∈V w[v])/K), and ε represents the predetermined
Introduction maximum imbalance ratio allowed.
During the last decade, hypergraph-based models The set of external nets of a partition Π is denoted
gained wide acceptance in the parallel computing com- as NE . There are various [] cutsize definitions for rep-
munity for modeling various problems. By providing resenting the cost χ(Π) of a partition Π. Two relevant
natural way to represent multiway interactions and definitions are:
unsymmetric dependencies, hypergraph can be used to
elegantly model complex computational structures in χ(Π) = ∑ c[n] ()
n∈NE
parallel computing. Here, some concrete applications
will be presented to show how hypergraph models can χ(Π) = ∑ c[n](λ n − ). ()
n∈NE
be used to cast a suitable scientific problem as an hyper-
graph partitioning problem. Some insights and general In (), the cutsize is equal to the sum of the costs of the
guidelines for using hypergraph partitioning methods cut nets. In (), each cut net n contributes c[n](λn − )
in some general classes of problems are also given. to the cutsize. The cutsize metrics given in () and ()
 H Hypergraph Partitioning

will be referred to here as cut-net and connectivity met- development of HP models and methods for efficient
rics, respectively. The hypergraph partitioning problem parallelization of SpMxV operations.
can be defined as the task of dividing a hypergraph into Before discussing the HP models and methods
two or more parts such that the cutsize is minimized, for parallelizing SpMxV operations, it is favorable to
while a given balance criterion () among part weights discuss parallel algorithms for SpMxV. Consider the
is maintained. matrix-vector multiply of the form y ← A x, where
A recent variant of the above problem is the multi- the nonzeros of the sparse matrix A as well as the
constraint hypergraph partitioning [, ] in which each entries of the input and output vectors x and y are par-
vertex has a vector of weights associated with it. The titioned arbitrarily among the processors. Let map(⋅)
partitioning objective is the same as above, and the par- denote the nonzero-to-processor and vector-entry-to-
titioning constraint is to satisfy a balancing constraint processor assignments induced by this partitioning. A
associated with each weight. Here, w[v, i] denotes the C parallel algorithm would execute the following steps at
weights of a vertex v for i = , . . . , C. Hence, the balance each processor Pk .
criterion () can be rewritten as
. Send the local input-vector entries xj , for all j with
Wk,i ≤ Wavg,i ( + ε) for k = , . . . , K and i = , . . . , C , map(xj ) = Pk , to those processors that have at least
() one nonzero in column j.
. Compute the scalar products aij xj for the local
where the ith weight Wk,i of a part Vk is defined
nonzeros, that is, the nonzeros for which map(aij ) =
as the sum of the ith weights of the vertices in that
Pk and accumulate the results yki for the same row
part (i.e., Wk,i = ∑v∈Vk w[v, i]), and Wavg,i is the
index i.
average part weight for the ith weight (i.e., Wavg,i =
. Send local nonzero partial results yki to the processor
(∑v∈V w[v, i])/K), and ε again represents the allowed
map(yi )≠Pk , for all nonzero yki .
imbalance ratio.
. Add the partial yiℓ results received to compute the
Another variant is the hypergraph partitioning with
final results yi = ∑ yiℓ for each i with map(yi )=Pk .
fixed vertices, in which some of the vertices are fixed in
some parts before partitioning. In other words, in this As seen in the algorithm, it is necessary to have
problem, a fixed-part function is provided as an input partitions on the matrix A and the input- and output-
to the problem. A vertex is said to be free if it is allowed vectors x and y of the matrix-vector multiply operation.
to be in any part in the final partition, and it is said to Finding a partition on the vectors x and y is referred
be fixed in part k if it is required to be in Vk in the final to as the vector partitioning operation, and it can be
partition Π. performed in three different ways: by decoding the par-
Yet another variant is multi-objective hypergraph tition given on A; in a post-processing step using the
partitioning in which there are several objectives to be partition on the matrix; or explicitly partitioning the
minimized [, ]. Specifically, a given net contributes vectors during partitioning the matrix. In any of these
different costs to different objectives. cases, the vector partitioning for matrix-vector oper-
ations is called symmetric if x and y have the same
Sparse Matrix Partitioning partition, and non-symmetric otherwise. A vector par-
One of the most elaborated applications of hyper- titioning is said to be consistent, if each vector entry is
graph partitioning (HP) method in the parallel scien- assigned to a processor that has at least one nonzero in
tific computing domain is the parallelization of sparse the respective row or column of the matrix. The con-
matrix-vector multiply (SpMxV) operation. Repeated sistency is easy to achieve for the nonsymmetric vector
matrix-vector and matrix-transpose-vector multiplies partitioning; xj can be assigned to any of the proces-
that involve the same large, sparse matrix are the ker- sors that has a nonzero in the column j, and yi can be
nel operations in various iterative algorithms involving assigned to any of the processors that has a nonzero in
sparse linear systems. Such iterative algorithms include the row i. If a symmetric vector partitioning is sought,
solvers for linear systems, eigenvalues, and linear pro- then special care must be taken to assign a pair of
grams. The pervasive use of such solvers motivates the matching input- and output-vector entries, e.g., xi and
Hypergraph Partitioning H 

yi, to a processor having nonzeros in both row and column i. In order to have such a processor for all vector entry pairs, the sparsity pattern of the matrix A can be modified to have a zero-free diagonal. In such cases, a consistent vector partition is guaranteed to exist, because the processors that own the diagonal entries can also own the corresponding input- and output-vector entries; xi and yi can be assigned to the processor that holds the diagonal entry aii.

In order to achieve efficient parallelism, the processors should have balanced computational loads, and the inter-processor communication cost should be minimized. To balance the computational load, it suffices to have an almost equal number of nonzeros per processor, so that each processor performs an almost equal number of scalar products, for example, aij xj, in any given parallel system. The communication cost, however, has many components (the total volume of messages, the total number of messages, the maximum volume/number of messages in a single processor, either in terms of sends or receives or both), each of which can be of utmost importance for a given matrix in a given parallel system. Although there are alternatives and more elaborate proposals, the most common communication cost metric addressed in hypergraph partitioning-based methods is the total volume of communication.

Loosely speaking, hypergraph partitioning-based methods for efficient parallelization of SpMxV model the data of the SpMxV (i.e., matrix and vector entries) with the vertices of a hypergraph. A partition on the vertices of the hypergraph is then interpreted in such a way that the data corresponding to a set of vertices in a part are assigned to a single processor. More accurately, there are two classes of hypergraph partitioning-based methods for parallelizing SpMxV. The methods in the first class build a hypergraph model representing the data and invoke a partitioning heuristic on the so-built hypergraph. The methods in this class can be said to be models rather than algorithms. There are currently three main hypergraph models for representing sparse matrices, and hence there are three methods in this first class. These three main models are described in the next section. The essential property of these models is that the cutsize () of any given partition is equal to the total communication volume to be incurred under a consistent vector partitioning when the matrix elements are distributed according to the vertex partition. The methods in the second class follow a mix-and-match approach and use the three main models, perhaps along with multi-constraint and fixed-vertex variations, in an algorithmic form. There are a number of methods in this second class, and one can develop many others according to application needs and matrix characteristics. Three common methods belonging to this class are described later, after the three main models. The main property of these algorithms is that the sum of the cutsizes of each application of hypergraph partitioning amounts to the total communication volume to be incurred under a consistent vector partitioning (currently these methods compute a vector partitioning after having found a matrix partitioning) when the matrix elements are distributed according to the vertex partitions found at the end.

Three Main Models for Matrix Partitioning
In the column-net hypergraph model [] used for 1D rowwise partitioning, an M × N matrix A with Z nonzeros is represented as a unit-cost hypergraph HR = (VR, NC) with ∣VR∣ = M vertices, ∣NC∣ = N nets, and Z pins. In HR, there exists one vertex vi ∈ VR for each row i of matrix A. The weight w[vi] of a vertex vi is equal to the number of nonzeros in row i. The name of the model comes from the fact that columns are represented as nets. That is, there exists one unit-cost net nj ∈ NC for each column j of matrix A. Net nj connects the vertices corresponding to the rows that have a nonzero in column j. That is, vi ∈ nj if and only if aij ≠ 0.

In the row-net hypergraph model [] used for 1D columnwise partitioning, an M × N matrix A with Z nonzeros is represented as a unit-cost hypergraph HC = (VC, NR) with ∣VC∣ = N vertices, ∣NR∣ = M nets, and Z pins. In HC, there exists one vertex vj ∈ VC for each column j of matrix A. The weight w[vj] of a vertex vj ∈ VC is equal to the number of nonzeros in column j. The name of the model comes from the fact that rows are represented as nets. That is, there exists one unit-cost net ni ∈ NR for each row i of matrix A. Net ni ⊆ VC connects the vertices corresponding to the columns that have a nonzero in row i. That is, vj ∈ ni if and only if aij ≠ 0.

In the column-row-net hypergraph model, otherwise known as the fine-grain model [], used for 2D
nonzero-based fine-grain partitioning, an M × N matrix A with Z nonzeros is represented as a unit-weight and unit-cost hypergraph HZ = (VZ, NRC) with ∣VZ∣ = Z vertices, ∣NRC∣ = M + N nets, and Z pins. In VZ, there exists one unit-weight vertex vij for each nonzero aij of matrix A. The name of the model comes from the fact that both rows and columns are represented as nets. That is, in NRC, there exist one unit-cost row-net ri for each row i of matrix A and one unit-cost column-net cj for each column j of matrix A. The row-net ri connects the vertices corresponding to the nonzeros in row i of matrix A, and the column-net cj connects the vertices corresponding to the nonzeros in column j of matrix A. That is, vij ∈ ri and vij ∈ cj if and only if aij ≠ 0. Note that each vertex vij is in exactly two nets.

Some Other Methods for Matrix Partitioning
The jagged-like partitioning method [] uses the row-net and column-net hypergraph models. It is an algorithm with two steps, in which each step models either the expand phase (the 1st line) or the fold phase (the 3rd line) of the parallel SpMxV algorithm given above. Therefore, there are two alternative schemes for this partitioning method. The one which models the expands in the first step and the folds in the second step is described below.

Given an M × N matrix A and the number K of processors organized as a P × Q mesh, the jagged-like partitioning model proceeds as shown in Fig. . The algorithm has two main steps. First, A is partitioned rowwise into P parts using the column-net hypergraph model HR (lines  and  of Fig. ). Consider a P-way partition ΠR of HR. From the partition ΠR, one obtains P submatrices Ap, for p = 1, . . . , P, each having a roughly equal number of nonzeros. For each p, the rows of the submatrix Ap correspond to the vertices in Rp (lines  and  of Fig. ). The submatrix Ap is assigned to the pth row of the processor mesh. Second, each submatrix Ap, for 1 ≤ p ≤ P, is independently partitioned columnwise into Q parts using the row-net hypergraph Hp (lines  and  of Fig. ). The nonzeros in the ith row of A are partitioned among the Q processors in a row of the processor mesh. In particular, if vi ∈ Rp at the end of line  of the algorithm, then the nonzeros in the ith row of A are partitioned among the processors in the pth row of the processor mesh. After partitioning the submatrix Ap columnwise, the map array contains the partition information for the nonzeros residing in Ap.

For each i, the volume of communication required to fold the vector entry yi is accurately represented as a part of "foldVolume" in the algorithm. For each j, the volume of communication regarding the vector entry xj

JAGGED-LIKE-PARTITIONING(A, K = P × Q, ε1, ε2)

Input: a matrix A, the number of processors K = P × Q, and the imbalance ratios ε1, ε2.
Output: map(aij) for all aij ≠ 0 and totalVolume.
1:  HR = (VR, NC) ← columnNet(A)
2:  ΠR = {R1, ..., RP} ← partition(HR, P, ε1)        ▹ rowwise partitioning of A
3:  expandVolume ← cutsize(ΠR)
4:  foldVolume ← 0
5:  for p = 1 to P do
6:      Rp = {ri : vi ∈ Rp}
7:      Ap ← A(Rp, :)                                ▹ submatrix indexed by rows Rp
8:      Hp = (Vp, Np) ← rowNet(Ap)
9:      ΠCp = {C1p, ..., CQp} ← partition(Hp, Q, ε2)  ▹ columnwise partitioning of Ap
10:     foldVolume ← foldVolume + cutsize(ΠCp)
11:     for all aij ≠ 0 of Ap do
12:         map(aij) = Pp,q ⇔ cj ∈ Cqp
13: return totalVolume ← expandVolume + foldVolume

Hypergraph Partitioning. Fig.  Jagged-like partitioning
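The columnNet(A) and rowNet(A) helpers used in the algorithm follow directly from the model definitions above. The following minimal Python sketch (illustrative names, not part of any particular partitioning tool) builds both models from a sparse matrix given as a list of (i, j) coordinates of its nonzeros:

def column_net_hypergraph(nonzeros, num_rows, num_cols):
    """Column-net model: one vertex per row, one net per column.

    The weight of vertex v_i counts the nonzeros in row i; net n_j
    contains the vertices (rows) with a nonzero in column j.
    """
    vertex_weight = [0] * num_rows
    nets = [set() for _ in range(num_cols)]
    for i, j in nonzeros:
        vertex_weight[i] += 1
        nets[j].add(i)          # v_i is a pin of n_j iff a_ij != 0
    return vertex_weight, nets

def row_net_hypergraph(nonzeros, num_rows, num_cols):
    """Row-net model: one vertex per column, one net per row."""
    vertex_weight = [0] * num_cols
    nets = [set() for _ in range(num_rows)]
    for i, j in nonzeros:
        vertex_weight[j] += 1
        nets[i].add(j)          # v_j is a pin of n_i iff a_ij != 0
    return vertex_weight, nets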


Hypergraph Partitioning. Fig.  First step of four-way jagged-like partitioning of a matrix: (a) two-way partitioning ΠR of the column-net hypergraph representation HR of A; (b) two-way rowwise partitioning of the matrix AΠ obtained by permuting A according to the partitioning induced by ΠR. The nonzeros in the same partition are shown with the same shape and color; the deviation of the minimum and maximum numbers of nonzeros of a part from the average is displayed as the interval imbal; nnz and vol denote, respectively, the number of nonzeros and the total communication volume (in the figure: nnz = 47, vol = 3, imbal = [–2.1%, 2.1%])
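A rowwise partition such as the one in the figure drives the row-parallel y ← Ax described earlier. The sketch below is a sequential emulation of that algorithm, assuming (as an illustration, not from the original entry) that map_row, map_x, and map_y give the owning processor of each row, input-vector entry, and output-vector entry; it also tallies the expand and fold traffic that the cutsize is meant to capture.

def row_parallel_spmxv(nonzeros, values, x, map_row, map_x, map_y):
    """Emulate the row-parallel y <- Ax and count communication.

    nonzeros: list of (i, j); values: dict (i, j) -> a_ij;
    map_row[i], map_x[j], map_y[i]: owning processors.
    """
    expand = set()          # (j, k): x_j must be sent to processor k
    partial = {}            # (i, k): partial result for y_i held by k
    for i, j in nonzeros:
        k = map_row[i]
        if map_x[j] != k:
            expand.add((j, k))                  # expand-phase traffic
        partial[(i, k)] = partial.get((i, k), 0.0) + values[(i, j)] * x[j]

    y = {i: 0.0 for i in map_y}
    fold = 0
    for (i, k), val in partial.items():
        if map_y[i] != k:
            fold += 1                           # fold-phase traffic
        y[i] += val                             # summation at the owner of y_i
    return y, len(expand), fold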

is accurately represented as a part of "expandVolume" in the algorithm.

Figure a illustrates the column-net representation of a sample matrix to be partitioned among the processors of a  ×  mesh. For simplicity of the presentation, the vertices and the nets of the hypergraphs are labeled with the letters "r" and "c" to denote the rows and columns of the matrix. The matrix is first partitioned rowwise into two parts, and each part is assigned to a row of the processor mesh, namely to processors {P1, P2} and {P3, P4}. The resulting permuted matrix is displayed in Fig. b. Figure a displays the two row-net hypergraphs corresponding to the submatrices Ap for p = 1, 2. Each hypergraph is partitioned independently; sample partitions of these hypergraphs are also presented in this figure. As seen in the final symmetric permutation in Fig. b, the nonzeros of columns  and  are assigned to different parts, resulting in P communicating with both P and P in the expand phase.

The checkerboard partitioning method [] is also a two-step method, in which each step models either the expand phase or the fold phase of the parallel SpMxV. Similar to jagged-like partitioning, there are two alternative schemes for this partitioning method. The one which models the expands in the first step and the folds in the second step is presented below.

Given an M × N matrix A and the number K of processors organized as a P × Q mesh, the checkerboard partitioning method proceeds as shown in Fig. . First, A is partitioned rowwise into P parts using the column-net model (lines  and  of Fig. ), producing ΠR = {R1, . . . , RP}. Note that this first step is exactly the same as that of the jagged-like partitioning. In the second step, the matrix A is partitioned columnwise into Q parts by using multi-constraint partitioning to obtain ΠC = {C1, . . . , CQ}. In comparison to the jagged-like method, in this second step the whole matrix A is partitioned (lines  and  of Fig. ), not the submatrices defined by ΠR. The rowwise and columnwise partitions ΠR and ΠC together define a 2D partition on the matrix A, where map(aij) = Pp,q ⇔ ri ∈ Rp and cj ∈ Cq.

In order to achieve load balance among processors, a multi-constraint partitioning formulation is used (line  of the algorithm). Each vertex vi of HC is assigned P weights: w[i, p], for p = 1, . . . , P. Here, w[i, p] is equal to the number of nonzeros of column ci in the rows Rp (line  of Fig. ). Consider a Q-way partitioning of HC with P constraints using the vertex weight definition
Hypergraph Partitioning. Fig.  Second step of four-way jagged-like partitioning: (a) row-net representations of the submatrices of A and their two-way partitionings (parts P1/P2 and P3/P4); (b) final permuted matrix. The nonzeros in the same partition are shown with the same shape and color; the deviation of the minimum and maximum numbers of nonzeros of a part from the average is displayed as the interval imbal; nnz and vol denote, respectively, the number of nonzeros and the total communication volume (in the figure: nnz = 47, vol = 8, imbal = [–6.4%, 2.1%])
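The two-step flow of the jagged-like method can be written compactly as follows. This is a sketch only, reusing the column_net_hypergraph and row_net_hypergraph helpers shown earlier and assuming a generic hypergraph partitioner partition(weights, nets, parts, eps) that returns a part id per vertex (for example, a thin wrapper around a tool such as PaToH); all names are illustrative.

def jagged_like(nonzeros, M, N, P, Q, eps1, eps2, partition):
    """Two-step jagged-like partitioning of the nonzeros onto a P x Q mesh."""
    # Step 1: rowwise split into P parts with the column-net model.
    w, nets = column_net_hypergraph(nonzeros, M, N)
    row_part = partition(w, nets, P, eps1)        # row i -> p

    mapping = {}
    for p in range(P):
        rows_p = {i for i in range(M) if row_part[i] == p}
        sub = [(i, j) for (i, j) in nonzeros if i in rows_p]
        # Step 2: independent columnwise split of each submatrix A_p
        # into Q parts with the row-net model (vertices are columns).
        wq, netsq = row_net_hypergraph(sub, M, N)
        col_part = partition(wq, netsq, Q, eps2)  # column j -> q
        for (i, j) in sub:
            mapping[(i, j)] = (p, col_part[j])    # nonzero -> processor P_{p,q}
    return mapping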

CHECKERBOARD-PARTITIONING(A, K = P × Q, ε1, ε2)

Input: a matrix A, the number of processors K = P × Q, and the imbalance ratios ε1, ε2.
Output: map(aij) for all aij ≠ 0 and totalVolume.
1:  HR = (VR, NC) ← columnNet(A)
2:  ΠR = {R1, ..., RP} ← partition(HR, P, ε1)     ▹ rowwise partitioning of A
3:  expandVolume ← cutsize(ΠR)
4:  HC = (VC, NR) ← rowNet(A)
5:  for j = 1 to ∣VC∣ do
6:      for p = 1 to P do
7:          w[j, p] = ∣nj ∩ Rp∣
8:  ΠC = {C1, ..., CQ} ← MCPartition(HC, Q, ε2)    ▹ columnwise partitioning of A
9:  foldVolume ← cutsize(ΠC)
10: for all aij ≠ 0 of A do
11:     map(aij) = Pp,q ⇔ ri ∈ Rp and cj ∈ Cq
12: totalVolume ← expandVolume + foldVolume

Hypergraph Partitioning. Fig.  Checkerboard partitioning

above. Maintaining the P balance constraints () corresponds to maintaining computational load balance on the processors of each row of the processor mesh.

Establishing the equivalence between the total communication volume and the sum of the cutsizes of the two partitions is fairly straightforward. The volume of communication for the fold operations corresponds exactly to cutsize(ΠC). The volume of communication for the expand operations corresponds exactly to cutsize(ΠR).
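The multi-constraint weights of line  of the checkerboard algorithm (w[j, p] = ∣nj ∩ Rp∣, the number of nonzeros of column j that fall into row stripe Rp) are easy to compute once the rowwise partition is known. A minimal sketch, with illustrative names and the same coordinate representation as before:

def checkerboard_weights(nonzeros, row_part, num_cols, P):
    """Multi-constraint vertex weights for the columnwise step.

    w[j][p] counts the nonzeros of column j lying in row stripe R_p,
    so balancing every constraint p balances each processor row.
    """
    w = [[0] * P for _ in range(num_cols)]
    for i, j in nonzeros:
        w[j][row_part[i]] += 1
    return w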
Hypergraph Partitioning. Fig.  Second step of four-way checkerboard partitioning: (a) two-way multi-constraint partitioning ΠC of the row-net hypergraph representation HC of A (parts C1 and C2); (b) final checkerboard partitioning of A induced by (ΠR, ΠC). The nonzeros in the same partition are shown with the same shape and color; the deviation of the minimum and maximum numbers of nonzeros of a part from the average is displayed as the interval imbal; nnz and vol denote, respectively, the number of nonzeros and the total communication volume (in the figure: W1(1) = 12, W2(1) = 12, W1(2) = 12, W2(2) = 11; nnz = 47, vol = 8, imbal = [–6.4%, 2.1%])
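Throughout this discussion, the cutsize that is equated with the total communication volume is the connectivity metric commonly used for SpMxV (each net contributes the number of parts it connects minus one). A short sketch of that computation, with unit net costs and the same plain data structures as in the earlier fragments:

def cutsize(nets, vertex_part):
    """Connectivity-1 cutsize: sum over nets of (#parts connected - 1).

    With unit net costs this is the quantity equated in the text with
    the total communication volume of a consistent vector partitioning.
    """
    total = 0
    for pins in nets:
        parts = {vertex_part[v] for v in pins}
        if parts:
            total += len(parts) - 1
    return total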

Figure b displays the  ×  checkerboard partition step corresponds to the total communication volume. It
induced by (ΠR , ΠC ). Here, ΠR is a rowwise two-way is possible to dynamically adjust the ε at each recursive
partition giving the same figure as shown in Fig. , and call by allowing larger imbalance ratio for the recursive
ΠC is a two-way multi-constraint partition ΠC of the call on the submatrix A or A .
row-net hypergraph model HC of A shown in Fig. a.
In Fig. a, w[, ]= and w[, ]= for internal column
c of row stripe R , whereas w[, ] =  and w[, ] =  Some Other Applications
for external column c . of Hypergraph Partitioning
Another common method of matrix partitioning As said before, the initial motivations for hypergraph
is the orthogonal recursive bisection (ORB) []. In models were accurate modeling of the nonzero struc-
this approach, the matrix is first partitioned rowwise ture of unsymmetric and rectangular sparse matrices
into two submatrices using the column-net hyper- to minimize the communication volume for iterative
graph model, and then each part is further partitioned solvers. There are other applications that can make use
columnwise into two parts using the row-net hyper- of hypergraph partitioning formulation. Here, a brief
graph model. The process is continued recursively until overview of general classes of applications is given along
the desired number of parts is obtained. The algorithm with the names of some specific problems. Further
is shown in Fig. . In this algorithm, dim represents application classes are given in bibliographic notes.
either rowwise or columnwise partitioning, where −dim Parallel reduction or aggregation operations form
switches the partitioning dimension. a significant class of such applications, including the
In the ORB method shown above, the step bisect MapReduce model. The reduction operation consists of
(A, dim, ε) corresponds to partitioning the given matrix computing M output elements from N input elements.
either along the rows or columns with, respectively, the An output element may depend on multiple input ele-
column-net or the row-net hypergraph models into two. ments, and an input element may contribute to multiple
The total sum of the cutsizes () of each each bisection output elements. Assume that the operation on which
ORB-PARTITIONING(A, dim, Kmin, Kmax, ε)

Input: a matrix A, the part numbers Kmin (at the initial call equal to 1) and Kmax (at the initial call equal to K, the desired number of parts), and the imbalance ratio ε.
Output: map(aij) for all aij ≠ 0.
1:  if Kmax − Kmin > 0 then
2:      mid ← (Kmax − Kmin + 1)/2
3:      Π = {A1, A2} ← bisect(A, dim, ε)     ▹ partition A along dim into two, producing two submatrices
4:      totalVolume ← totalVolume + cutsize(Π)
        ▹ recursively partition each submatrix along the orthogonal direction
5:      map1(A1) ← ORB-PARTITIONING(A1, −dim, Kmin, Kmin + mid − 1, ε)
6:      map2(A2) ← ORB-PARTITIONING(A2, −dim, Kmin + mid, Kmax, ε)
7:      map(A) ← map1(A1) ∪ map2(A2)
8:  else
9:      map(A) ← Kmin

Hypergraph Partitioning. Fig.  Orthogonal recursive bisection (ORB)
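The recursion above translates almost line for line into the following sketch, where bisect(nonzeros, dim, eps) is an assumed two-way hypergraph bisection along the given dimension (column-net for rows, row-net for columns) returning the two halves of the nonzero set; the names are illustrative.

def orb(nonzeros, kmin, kmax, dim, eps, bisect):
    """Orthogonal recursive bisection over the nonzeros of A."""
    if kmax - kmin > 0:
        mid = (kmax - kmin + 1) // 2
        part1, part2 = bisect(nonzeros, dim, eps)   # split along dim
        other = "cols" if dim == "rows" else "rows" # switch direction
        mapping = orb(part1, kmin, kmin + mid - 1, other, eps, bisect)
        mapping.update(orb(part2, kmin + mid, kmax, other, eps, bisect))
        return mapping
    return {ij: kmin for ij in nonzeros}            # leaf: assign part Kmin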

reduction is performed is commutative and associative. Then, the inherent computational structure can be represented with an M × N dependency matrix, where each row and column of the matrix represents an output element and an input element, respectively. For an input element xj and an output element yi, if yi depends on xj, then aij is set to 1 (otherwise zero). Using this representation, the problem of partitioning the workload for the reduction operation is equivalent to the problem of partitioning the dependency matrix for efficient SpMxV.

In some other reduction problems, the input and output elements may be preassigned to parts. The proposed hypergraph model can be accommodated to those problems by adding K part vertices and connecting those vertices to the nets which correspond to the preassigned input and output elements. Obviously, those part vertices must be fixed to the corresponding parts during the partitioning. Since the required property is already included in the existing hypergraph partitioners [, , ], this does not add extra complexity to the partitioning methods.

Iterative methods for solving linear systems usually employ preconditioning techniques. Roughly speaking, preconditioning techniques modify the given linear system to accelerate convergence. Applications of explicit preconditioners in the form of approximate inverses or factored approximate inverses are amenable to parallelization, because these techniques require SpMxV operations with the approximate inverse or factors of the approximate inverse at each step. In other words, preconditioned iterative methods perform SpMxV operations with both the coefficient and preconditioner matrices in a step. Therefore, parallelizing a full step of these methods requires the coefficient and preconditioner matrices to be well partitioned, so that, for example, processors' loads are balanced and communication costs are low in both multiply operations. To meet this requirement, the coefficient and preconditioner matrices should be partitioned simultaneously. One can accomplish such a simultaneous partitioning by building a single hypergraph and then partitioning that hypergraph. Roughly speaking, one follows a four-step approach: (i) build a hypergraph for each matrix; (ii) determine which vertices of the two hypergraphs need to be in the same part (according to the computations forming the iterative method); (iii) amalgamate those vertices coming from different hypergraphs; (iv) if the computations represented by the two hypergraphs of the first step are separated by synchronization points, then assign multiple weights to vertices (the weights of the vertices of the hypergraphs of the first step are kept); otherwise assign a single weight to vertices (the weights of the vertices of the hypergraphs of the first step are summed up for each amalgamation).

The computational structure of the preconditioned iterative methods is similar to that of a more general class of scientific computations including multiphase, multiphysics, and multi-mesh simulations.

In multiphase simulations, there are a number of computational phases separated by global synchronization points. The existence of the global synchronizations
necessitates each phase to be load balanced individually. The multi-constraint formulation of hypergraph partitioning can be used to achieve this goal.

In multi-physics simulations, a variety of materials and processes are analyzed using different physics procedures. In these types of simulations, the computational as well as the memory requirements are not uniform across the mesh. For scalability, processor loads should be balanced in terms of these two components. The multi-constraint partitioning framework also addresses these problems.

In multi-mesh simulations, a number of grids with different discretization schemes and with arbitrary overlaps are used. The existence of overlapping grid points necessitates a simultaneous partitioning of the grids. Such a simultaneous partitioning scheme should balance the computational loads of the processors and minimize the communication cost due to interactions within a grid as well as the interactions among different grids. With a particular transformation (the vertex amalgamation operation, also mentioned above), hypergraphs can be used to model the interactions between different grids. With the use of the multi-constraint formulation, the partitioning problem in multi-mesh computations can also be formulated as a hypergraph partitioning problem.

In obtaining partitions for two or more computation phases interleaved with synchronization points, the hypergraph models lead to the minimization of the overall sum of the total volume of communication in all phases (assuming that a single hypergraph is built as suggested in the previous paragraphs). In some sophisticated simulations, the magnitude of the interactions in one phase may be different than that of the interactions in another one. In such settings, minimizing the total volume of communication in each phase separately may be advantageous. This problem can be formulated as a multi-objective hypergraph partitioning problem on the so-built hypergraphs.

There are certain limitations in applying hypergraph partitioning to the multiphase, multiphysics, and multi-mesh-like computations. The dependencies must remain the same throughout the computations; otherwise the cutsize may not represent the communication volume requirements as precisely as before. The weights assigned to the vertices, for load balancing purposes, should be static and available prior to the partitioning; the hypergraph models cannot be used as naturally for applications whose computational requirements vary drastically in time. If, however, the computational requirements change gradually in time, then the models can be used to re-partition the load at certain time intervals (while also minimizing the redistribution or migration costs associated with the new partition).

Ordering methods are quite common techniques to permute matrices into special forms in order to reduce the memory and running time requirements, as well as to achieve increased parallelism in direct methods (such as LU and Cholesky decompositions) used for solving systems of linear equations. Nested dissection is a well-known ordering method that has been used quite efficiently and successfully. In the current state-of-the-art variations of the nested-dissection approach, a matrix is symmetrically permuted with a permutation matrix P into the doubly bordered block diagonal form

ADB = PAPT =
⎡ A11                  A1S ⎤
⎢      A22             A2S ⎥
⎢            ⋱         ⋮  ⎥
⎢                AKK   AKS ⎥
⎣ AS1  AS2   ⋯   ASK   ASS ⎦ ,

where the nonzeros are only in the marked blocks (the blocks on the diagonal and the row and column borders). The aim of such a permutation is to have a reduced number of rows/columns in the borders and to have equal-sized square blocks on the diagonal. One way to achieve such a permutation when A has symmetric pattern is as follows. Suppose a matrix B is given (if not, it is possible to find one) whose sparsity pattern of BT B equals that of A (here arithmetic cancellations are ignored). Then, one can permute B nonsymmetrically into the singly bordered form

BSB = QBPT =
⎡ B11                  B1S ⎤
⎢      B22             B2S ⎥
⎢            ⋱         ⋮  ⎥
⎣                BKK   BKS ⎦ ,
so that BSBT BSB = PAPT; that is, one can use the column permutation of B resulting in BSB to obtain a symmetric permutation for A which results in ADB. Clearly, the column dimension of Bkk will be the size of the square matrix Akk, and the number of rows and columns in the border of ADB will be equal to the number of columns in the column border of BSB. One can achieve such a permutation of B by partitioning the column-net model of B while reducing the cutsize according to the cut-net metric (), with unit net costs, to obtain the permutation P as follows. First, the permutation Q is defined in order to be able to define P. Permute all rows corresponding to the vertices in part k before those in a part ℓ, for 1 ≤ k < ℓ ≤ K. Then, permute all columns corresponding to the nets that are internal to a part k before those that are internal to a part ℓ, for 1 ≤ k < ℓ ≤ K, yielding the diagonal blocks, and then permute all columns corresponding to the cut nets to the end, yielding the column border (the order of the columns defining a diagonal block). Clearly, the correspondence between the size of the column border of BSB and that of the border of ADB is exact, and hence the cutsize according to the cut-net metric is an exact measure. The requirement to have almost equal-sized square blocks Akk is decoded as the requirement that each part should have an almost equal number of internal nets in the partition of the column-net model of B. Although such a requirement is neither the objective nor the constraint of the hypergraph partitioning problem, the common hypergraph-partitioning heuristics easily accommodate such requirements.

Related Entries
Data Distribution
Graph Algorithms
Graph Partitioning
Linear Algebra, Numerical
PaToH (Partitioning Tool for Hypergraphs)
Preconditioners for Sparse Iterative Methods
Sparse Direct Methods

Bibliographic Notes and Further Reading
The first use of the hypergraph partitioning methods for efficient parallel sparse matrix-vector multiply operations is seen in []. A more comprehensive study [] describes the use of the row-net and column-net hypergraph models in 1D sparse matrix partitioning. For different views and alternatives on vector partitioning, see [, , ].

A fair treatment of parallel sparse matrix-vector multiplication, analysis, and investigations on certain matrix types along with the use of hypergraph partitioning is given in [, Chapter ]. Further analysis of hypergraph partitioning on some model problems is given in [].

Hypergraph partitioning schemes for preconditioned iterative methods are given in [], where vertex amalgamation and multi-constraint weighting to represent different phases of computations are presented. Applications of such methodology to multiphase, multiphysics, and multi-mesh simulations are also discussed in the same paper.

Some different methods for sparse matrix partitioning using hypergraphs can be found in [], including the jagged-like and checkerboard partitioning methods, and in [], the orthogonal recursive bisection approach. A recipe for choosing a partitioning method for a given matrix is given in [].

The use of hypergraph models for permuting matrices into special forms such as singly bordered block-diagonal form can be found in []. This permutation can be leveraged to develop hypergraph partitioning-based symmetric [, ] and nonsymmetric [] nested-dissection orderings.

The standard hypergraph partitioning and the hypergraph partitioning with fixed vertices formulations are used, respectively, for static and dynamic load balancing of some scientific applications in [, ].

Some other applications of hypergraph partitioning are briefly summarized in []. These include image-space parallel direct volume rendering, parallel mixed integer linear programming, data declustering for multi-disk databases, scheduling file-sharing tasks in heterogeneous master-slave computing environments, work-stealing scheduling, road network clustering methods for efficient query processing, pattern-based data clustering, reducing software development and maintenance costs, processing spatial join operations, and improving locality in memory or cache performance.
Bibliography
1. Ababei C, Selvakkumaran N, Bazargan K, Karypis G () Multi-objective circuit partitioning for cutsize and path-based delay minimization. In: Proceedings of ICCAD , San Jose, CA, November 
2. Aykanat C, Cambazoglu BB, Uçar B () Multi-level direct k-way hypergraph partitioning with multiple constraints and fixed vertices. J Parallel Distr Comput ():–
3. Aykanat C, Pınar A, Çatalyürek UV () Permuting sparse rectangular matrices into block-diagonal form. SIAM J Sci Comput ():–
4. Bisseling RH () Parallel scientific computation: a structured approach using BSP and MPI. Oxford University Press, Oxford, UK
5. Bisseling RH, Meesen W () Communication balancing in parallel sparse matrix-vector multiplication. Electron Trans Numer Anal :–
6. Boman E, Devine K, Heaphy R, Hendrickson B, Leung V, Riesen LA, Vaughan C, Catalyurek U, Bozdag D, Mitchell W, Teresco J () Zoltan .: parallel partitioning, load balancing, and data-management services; user's guide. Sandia National Laboratories, Albuquerque, NM, . Technical Report SAND-W. http://www.cs.sandia.gov/Zoltan/ug_html/ug.html
7. Cambazoglu BB, Aykanat C () Hypergraph-partitioning-based remapping models for image-space-parallel direct volume rendering of unstructured grids. IEEE Trans Parallel Distr Syst ():–
8. Catalyurek U, Boman E, Devine K, Bozdag D, Heaphy R, Riesen L () A repartitioning hypergraph model for dynamic load balancing. J Parallel Distr Comput ():–
9. Çatalyürek UV () Hypergraph models for sparse matrix partitioning and reordering. Ph.D. thesis, Computer Engineering and Information Science, Bilkent University. Available at http://www.cs.bilkent.edu.tr/tech-reports//ABSTRACTS..html
10. Çatalyürek UV, Aykanat C () A hypergraph model for mapping repeated sparse matrix-vector product computations onto multicomputers. In: Proceedings of the International Conference on High Performance Computing (HiPC'), Goa, India, December 
11. Çatalyürek UV, Aykanat C () Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans Parallel Distr Syst ():–
12. Çatalyürek UV, Aykanat C () PaToH: a multilevel hypergraph partitioning tool, version .. Department of Computer Engineering, Bilkent University, Ankara,  Turkey. PaToH is available at http://bmi.osu.edu/~umit/software.htm
13. Çatalyürek UV, Aykanat C () A fine-grain hypergraph model for D decomposition of sparse matrices. In: Proceedings of the th International Parallel and Distributed Processing Symposium (IPDPS), San Francisco, CA, April 
14. Çatalyürek UV, Aykanat C () A hypergraph-partitioning approach for coarse-grain decomposition. In: ACM/IEEE SC, Denver, CO, November 
15. Çatalyürek UV, Aykanat C, Kayaaslan E () Hypergraph partitioning-based fill-reducing ordering. Technical Report OSUBMI-TR--n and BU-CE-, Department of Biomedical Informatics, The Ohio State University and Computer Engineering Department, Bilkent University (Submitted)
16. Çatalyürek UV, Aykanat C, Uçar B () On two-dimensional sparse matrix partitioning: models, methods, and a recipe. SIAM J Sci Comput ():–
17. Grigori L, Boman E, Donfack S, Davis T () Hypergraph unsymmetric nested dissection ordering for sparse LU factorization. Technical Report -J, Sandia National Labs, Submitted to SIAM J Sci Comp
18. Karypis G, Kumar V () Multilevel algorithms for multi-constraint hypergraph partitioning. Technical Report -, Department of Computer Science, University of Minnesota/Army HPC Research Center, Minneapolis, MN 
19. Karypis G, Kumar V, Aggarwal R, Shekhar S () hMeTiS: a hypergraph partitioning package, version ... Department of Computer Science, University of Minnesota/Army HPC Research Center, Minneapolis
20. Lengauer T () Combinatorial algorithms for integrated circuit layout. Wiley–Teubner, Chichester
21. Selvakkumaran N, Karypis G () Multi-objective hypergraph partitioning algorithms for cut and maximum subdomain degree minimization. In: Proceedings of ICCAD , San Jose, CA, November 
22. Uçar B, Aykanat C () Encapsulating multiple communication-cost metrics in partitioning sparse rectangular matrices for parallel matrix-vector multiplies. SIAM J Sci Comput ():–
23. Uçar B, Aykanat C () Partitioning sparse matrices for parallel preconditioned iterative methods. SIAM J Sci Comput ():–
24. Uçar B, Aykanat C () Revisiting hypergraph models for sparse matrix partitioning. SIAM Review ():–
25. Uçar B, Çatalyürek UV () On the scalability of hypergraph models for sparse matrix partitioning. In: Danelutto M, Bourgeois J, Gross T (eds) Proceedings of the th Euromicro Conference on Parallel, Distributed, and Network-based Processing, IEEE Computer Society, Conference Publishing Services, pp –
26. Uçar B, Çatalyürek UV, Aykanat C () A matrix partitioning interface to PaToH in MATLAB. Parallel Computing (–):–
27. Vastenhouw B, Bisseling RH () A two-dimensional data distribution method for parallel sparse matrix-vector multiplication. SIAM Review ():–

Hyperplane Partitioning

Tiling
HyperTransport

Federico Silla
Universidad Politécnica de Valencia, Valencia, Spain

Synonyms
HT; HT.

Definition
HyperTransport is a scalable packet-based, high-bandwidth, and low-latency point-to-point interconnect technology intended to interconnect processors and also link them to I/O peripheral devices. HyperTransport was initially devised as an efficient replacement for traditional system buses for on-board communications. Nevertheless, the latest extension to the standard, referred to as High Node Count HyperTransport, as well as the recent standardization of new cables and connectors, allows HyperTransport to efficiently extend its interconnection capabilities beyond a single motherboard and become a very efficient technology to interconnect processors and I/O devices in a cluster. HyperTransport is an open standard managed by the HyperTransport Consortium.

Discussion
A complete description of the HyperTransport technology should include both an introduction to the protocol used by communicating devices to exchange data and a description of the electrical interface that this technology makes use of in order to achieve its tremendous performance. However, describing the electrical interface seems to be less interesting than the protocol itself and, additionally, requires the reader to have a large electrical background. Therefore, the following description is focused on the protocol used by the HyperTransport technology, leaving aside the electrical details. On the other hand, the protocol used by HyperTransport is a quite complex piece of technology, thus requiring an extensive explanation, which may be out of the scope of this encyclopedia. For this reason, the following description just tries to be a brief introduction to HyperTransport. Finally, the reader should note that AMD uses an extended version of the HyperTransport protocol in order to provide cache coherency among the processors in a system. Such an extended protocol, usually referred to as coherent HyperTransport (cHT), is proprietary to AMD, and therefore it is not described in this document. Nevertheless, the main difference between both protocols is that the coherent one includes some additional types of packets.

HyperTransport Links
The HyperTransport technology is a point-to-point communication standard, meaning that each of the HyperTransport links in the system connects exactly two devices. Figure  shows a simplified diagram of a system that deploys HyperTransport in order to interconnect the devices it is composed of. As can be seen in that figure, the main processor is connected to a PCIe device by means of a HyperTransport link. That PCIe device is, in turn, connected to a Gigabit Ethernet device, which is, additionally, connected to a SATA device. This device connects to a USB device. All of these devices make up a HyperTransport daisy chain. Nevertheless, devices can implement multiple HyperTransport links in order to build larger HyperTransport fabrics.

Each of the links depicted in Fig.  consists of two unidirectional and independent sets of signals. Each of these sets includes its own CAD signal, as well as CTL and CLK signals.

HyperTransport. Fig.  System deploying HyperTransport (a CPU with its memory connected, through a chain of HyperTransport links, to PCIe, GB Eth, SATA, and USB devices)
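One way to picture the daisy chain of Fig.  is the small sketch below, which models the chain as an ordered list of devices and shows the hop-by-hop path a packet follows until the addressed device accepts it. The device names come from the figure; everything else is illustrative.

chain = ["CPU", "PCIe", "GB Eth", "SATA", "USB"]   # devices from Fig. 

def route(source, target):
    """Devices a packet visits when forwarded hop by hop along the chain."""
    i, j = chain.index(source), chain.index(target)
    if i <= j:
        return chain[i:j + 1]
    return list(reversed(chain[j:i + 1]))

print(route("CPU", "SATA"))   # ['CPU', 'PCIe', 'GB Eth', 'SATA']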
The CAD signal (named after Command, Address, and Data) carries control and data packets. The CTL signal (named after ConTroL) is intended to differentiate control from data packets in the CAD signal. Finally, the CLK signal (named after CLocK) carries the clock for the CAD and CTL signals. Figure  shows a diagram presenting all these signals.

The width of the CAD signal is variable from  to  bits. Actually, not all values are possible: only widths of -, -, -, -, or -bits are allowed. Nevertheless, note that the HyperTransport protocol remains the same independently of the exact link width. More precisely, the format of the packets exchanged among HyperTransport devices does not depend on the link width. However, a given packet will require more time to be transmitted on a -bit link than on a -bit one. The reason for having several widths for the CAD signal is to allow system developers to tune their system for a given performance/cost design point. Narrower links will be cheaper to implement, but they will also provide lower performance.

On the other hand, the width of the CAD signal in each of the unidirectional portions of the link may be different; that is, HyperTransport allows asymmetrical link widths. Therefore, a given device may implement a -bit wide link in one direction while deploying a -bit link in the other, for example. The usefulness of such asymmetry is based on the fact that, if such a device sends most of its data in one direction and receives a limited amount of data in the other direction, then the system designer can reduce manufacturing cost by providing a wide link in the direction that requires higher bandwidth and a narrow link for the opposite direction, which requires much less bandwidth.

In addition to the variable link width, HyperTransport also supports variable clock speeds, thus further increasing the possibilities the system designer has for tuning the bandwidth of the links. The clock speeds currently supported by the HyperTransport specification are  MHz,  MHz,  MHz,  MHz,  MHz,  MHz,  GHz, . GHz, . GHz,  GHz, . GHz, . GHz, . GHz,  GHz, and . GHz. Moreover, the clock frequency in the two directions of a link does not need to be the same, thus introducing an additional degree of asymmetry. On the other hand, the clock mechanism used in HyperTransport is referred to as double data rate (DDR), which means that both the rising and the falling edges of the clock signal are used to latch data, thus achieving an effective clock frequency that doubles the actual clock rate.

In summary, when variable link width is combined with variable clock frequency, HyperTransport links present extraordinary scalability in terms of performance. For example, when both directions of a link are -bit wide working at  MHz, the link bandwidth is  MB/s. This would be the lowest-performance/lowest-cost configuration. At the opposite end, when -bit wide links are used at . GHz, the overall link performance rises up to . GB/s. This implementation represents the highest-bandwidth configuration, also presenting the highest manufacturing cost. Such a link configuration could be interesting, for example, for extreme-performance systems, which usually require as much bandwidth as possible. On the other hand, slow devices not

HyperTransport. Fig.  Signals in a HyperTransport link (each of the two directions between device 1 and device 2 carries its own CAD, CTL, and CLK signals)
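The peak one-way bandwidth of a link follows directly from the width and clock discussion above: width in bits times clock rate, doubled by the DDR scheme, divided by eight to obtain bytes per second. A small sketch (the example width and clock are illustrative numbers chosen here, not figures from the text):

def link_bandwidth_bytes_per_s(cad_width_bits, clock_hz):
    """Peak one-way bandwidth of a HyperTransport link.

    Data is latched on both clock edges (DDR), so the effective
    transfer rate is twice the clock rate.
    """
    return cad_width_bits * clock_hz * 2 / 8

# Example configuration (illustrative): a 16-bit link at 800 MHz.
print(link_bandwidth_bytes_per_s(16, 800e6) / 1e9, "GB/s")   # 3.2 GB/s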


requiring high bandwidth could reduce cost by using narrow links. If these links are additionally clocked at low frequencies, then power consumption is further reduced.

In the same way that the CAD signal can be implemented with a variable width, the CTL and CLK signals can also be implemented with several widths. However, their width does not depend on the implementer's criterion for choosing a performance/cost design point, but on the width of the CAD signal. Additionally, as the CAD signal can present a different width in each direction of the link, the width of the CTL and CLK signals can also be different in each direction, depending on the corresponding CAD signal in that direction.

In the case of the CTL signal, there is an individual CTL bit for each set of , or fewer, CAD bits. Therefore, , , or  CTL bits can be found in a HyperTransport link. Moreover, the CTL bits are encoded in such a way that four CTL bits are transferred every  CAD bits. These four bits are intended to denote different flavors of the information being transmitted in the CAD signal. These different flavors may include, for example, that a command is being transmitted, that a CRC for a command without data is in the CAD signal, that the CAD signal is being used by a data packet, etc.

In the case of the CLK signal, a HyperTransport link has an individual CLK bit for every set of , or fewer, CAD bits. Thus, the number of CTL and CLK bits in a given link is the same. The reason for having a CLK bit for every  bits of the CAD signal is that it makes link implementation easier. Effectively, the HyperTransport clocking scheme requires that the skew between the clock and data signals be minimized in order to achieve high transmission rates. Therefore, having a CLK bit for every  CAD bits keeps the differences in trace lengths in the board layout much lower than having a single CLK bit for the entire set of CAD bits.

In addition to the signals mentioned above, all HyperTransport devices share one PWROK and one RESET# signal for initialization and reset purposes. Moreover, if devices require power management, they should additionally include the LDTSTOP# and LDTREQ# signals.

HyperTransport Packets
Having described the links that connect devices in a HyperTransport fabric, this section presents the packets that are forwarded along those links. Packets in HyperTransport are multiples of  bytes long and carry the command, address, and data associated with each transaction among devices. Packets can be classified into control and data packets.

Control packets are used to initiate and finalize transactions, as well as to manage several HyperTransport features, and consist of  or  bytes. Control packets can be classified into information, request, and response packets.

Information packets are used for several link management purposes, like link synchronization, error condition signaling, and updating flow control information. Information packets are always  bytes long and can only be exchanged among adjacent devices directly interconnected by a link.

On the other hand, request and response control packets are used to build HyperTransport transactions. Request packets, which are - or -bytes long, are used to initiate HyperTransport transactions. Response packets, which are always -bytes long, are used in the response phase of transactions to reply to a previous request. Table  shows the different request and response types of packets. As can be seen, there are two different types of sized writes: posted and non-posted. Although both types write data to the target device of the request packet, their semantics are different. Non-posted writes require a response packet to be sent back to the requesting device in order to confirm that the operation has completed. On the other hand, posted writes do not require such confirmation.

HyperTransport. Table  Types of request and response control packets

Packet type
  Request packet: Sized read; Sized write (non-posted); Sized write (posted); Atomic read-modify-write; Broadcast; Flush; Fence; Address extension; Source identifier extension
  Response packet: Read response; Target done
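The transaction walk-throughs in the next section spell out which of the request types in Table  complete with a response packet. The following sketch just records those outcomes in one place (a summary of the text, expressed as data; the strings are illustrative labels, not protocol field names):

# Which request types complete with a response, per the descriptions below.
EXPECTED_RESPONSE = {
    "sized read": "read response + data",
    "sized write (non-posted)": "target done",
    "sized write (posted)": None,          # no confirmation is sent back
    "atomic read-modify-write": "read response + data",
    "broadcast": None,
    "flush": "target done",
    "fence": None,
}

def completes_with_response(request_type):
    """True if the requestor must wait for a response packet."""
    return EXPECTED_RESPONSE.get(request_type) is not None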
All the packet types will be further described in the next section, except the extension packets. These packets are -bytes long extensions that prepend some of the other packets, when required. Their purpose is to allow -bit addressing instead of the -bit addressing used by default, in the case of the Address Extension, or -bit source identifiers in bus-device-function format, in the case of the Source Identifier Extension.

With respect to data packets, they carry the actual data transferred among devices. Data packets only include data, with no associated control field. Therefore, data packets immediately follow the associated control packet. For example, a read response control packet, which includes no data, will be followed by the associated data packet carrying the read data. In the same way, a write request control packet will be followed by the data to be written.

Data packet length depends on the command that generated that data packet. Nevertheless, the maximum length is  bytes.

HyperTransport Transactions
HyperTransport devices transfer data by means of transactions. For example, every time a device requests some data from another device, the former initiates a read transaction targeted to the latter. Write transactions happen in a similar way. For example, when a program writes some data to disk, a write transaction is initiated between the processor, which reads the data from main memory, and the SATA device depicted in Fig. . Other transactions include broadcasting a message or explicitly taking control of the order of pending transactions.

Every transaction has a request phase. Request control packets are used in this phase. The exact request control packet to be used depends on the particular transaction taking place. On the other hand, many transactions require a response stage. Response control packets are used in this case, for example, to return read data from the target of the transaction, or to confirm its completion.

There are six basic transaction types:
● Sized read transaction
● Sized write transaction
● Atomic read-modify-write transaction
● Broadcast transaction
● Flush transaction
● Fence transaction

Sized Read Transaction
Sized read transactions are used by devices when they request data located in the address space of another device. For example, when a device wants to read data from main memory, it starts a read transaction targeted to the main processor. Also, when the processor requires some data from the USB device in Fig. , it will issue a read transaction destined to that device.

Read transactions begin with a sized read request control packet being issued by the device requesting the data. Once this packet reaches the destination device, that device accesses the requested data and generates a response. This response will be composed of a read response control packet followed by the read data, included in a data packet. Once these two packets arrive at the device that initiated the process, the transaction is completed.

It is worth mentioning that during the time elapsed from when the requestor delivers the read request on the link until it receives the corresponding response, the HyperTransport chain is not idle. On the contrary, as HyperTransport is a split-transaction protocol, other transactions can be issued (and even finalized) before our requestor receives the required data.

Sized Write Transaction
Sized write transactions are similar to sized read ones, with the difference that the requestor device writes data to the target device instead of requesting data from it. A write transaction may happen when a device sends data to memory, or when the processor writes back data from memory to disk, for example.

There are two different sized write transactions: posted and non-posted. Posted write transactions start when the requestor sends to the target a posted write request control packet followed by a data packet containing the data to be written. In this case, because of the posted nature of the transaction, no response packet is sent back to the requestor. On the other hand, non-posted write transactions begin with a non-posted write control packet being issued by the requestor (followed by a data packet). In this case, when both packets reach the destination and the data are written, the target issues back a target-done response control packet to the requestor. When the requestor receives this response, the transaction is finished.
Atomic Read-Modify-Write Transaction
Atomic read-modify-write transactions are intended to atomically access a memory location and modify it. This means that no other device in the system may access the same memory location during the time required to read and modify it. This is useful to avoid race conditions among devices while performing the mutual exclusion required to access a critical section, for example.

Two different types of atomic operations are allowed:
● Fetch and Add
● Compare and Swap

The Fetch and Add operation is:

    Fetch_and_Add(Out, Addr, In) {
        Out = Mem[Addr];
        Mem[Addr] = Mem[Addr] + In;
    }

The Compare and Swap operation is:

    Compare_and_Swap(Out, Addr, Compare, In) {
        Out = Mem[Addr];
        if (Mem[Addr] == Compare)
            Mem[Addr] = In;
    }

The atomic transaction begins when the requestor issues an atomic read-modify-write request control packet on the link, followed by a data packet containing the argument of the atomic operation. Once both packets are received at the target and it performs the requested atomic operation, it will send back a read response control packet followed by a data packet containing the original data read from the memory location.

Broadcast Transaction
This transaction is used by the processor to communicate information to all HyperTransport devices in the system. The transaction is started with a broadcast request control packet, which can only be issued by the processor. All the other devices accept that packet and forward it to the rest of the devices in the system. Broadcast requests include halt, shutdown, and End-Of-Interrupt commands.

Flush Transaction
Posted writes do not generate any response once completed. Therefore, when a device issues one or more posted writes targeted to the main processor in order to write some data to main memory, the issuing device has no way to know that those writes have effectively completed their way to memory. Thus, if some of the posted writes have not been completely written to memory, that data will not be visible to other devices in the system. In this scenario, flush transactions allow the device that issued the posted writes to make sure that the data has reached main memory by flushing all pending posted writes to memory.

Flush transactions begin when a device issues a flush control packet targeted to the processor. Once this control packet reaches the destination, all pending transactions will be flushed to memory, and then the processor will generate a target-done response control packet back to the requestor. Once this packet is received, the transaction is completed.

Fence Transaction
The fence transaction is intended to provide a barrier among posted writes. When a processor (the only possible target of a fence transaction) receives a fence command, it will make sure that no posted write received after the fence command is written to memory before any of the posted writes received earlier than the fence command.

The fence transaction begins when a device issues a fence control packet. No response is generated for this transaction.

Virtual Channels and Flow Control in HyperTransport
All the different types of control and data packets are multiplexed on the same link and stored in input buffers when they reach the receiving end of the link. Then, they are either accepted by that device, in case they are targeted to it, or forwarded to the next link in the chain.

If packets are not carefully managed, a protocol deadlock may occur. For example, if many devices in the system issue a large number of non-posted requests, those requests may fill up all the available buffers and prevent responses from making forward progress back to the initial requestors. In this case, requestors would stop
forever because they are waiting for responses that will never arrive, because the interconnect is full of requests that prevent responses from advancing. Additionally, requests stored in the intermediate buffers will not be able to advance toward their destination because output buffers at the targets will be filled with responses that cannot enter the link, thus preventing targets from accepting new requests from the link, as they have no free space either to store them or to store the responses they would generate. As can be seen, in this situation no packet can advance because of the lack of available free buffers. The result is that the system freezes.

In order to avoid such deadlocks, HyperTransport splits traffic into virtual channels and stores different types of packets in buffers belonging to different virtual channels. Additionally, HyperTransport does not allow packets traveling in one virtual channel to move to another virtual channel. In this way, if non-posted requests use a virtual channel different from the one used by responses, the deadlock described above can be avoided.

HyperTransport defines a base set of virtual channels that must be supported by all HyperTransport devices. Moreover, some additional virtual channel sets are also defined, although support for them is optional. The base set includes three different virtual channels:

● The posted request virtual channel, which carries posted write transactions
● The non-posted request virtual channel, which includes reads, non-posted writes, and flush packets
● The response virtual channel, which is responsible for read responses and target-done control packets

In addition to separating traffic into the three virtual channels mentioned above, each device must implement separate control and data buffers for each of the virtual channels. Therefore, there are six types of buffers:

● Non-posted request control buffer
● Posted request control buffer
● Response control buffer
● Non-posted request data buffer
● Posted request data buffer
● Response data buffer

Figure  shows the basic buffer configuration for a HyperTransport link. The exact number of packets that can be stored in each buffer depends on the implementation. Nevertheless, request and response buffers must contain, at least, enough space to store the largest control packet of that type. Also, all data buffers can hold  bytes. Moreover, in order to improve performance, a HyperTransport device may have larger buffers, able to store multiple packets of each type.

The HyperTransport protocol states that a transmitter should not issue a packet that cannot be stored by the receiver. Thus, the transmitter must know how many buffers of each type the receiver has available. To achieve this, a credit-based scheme is used between transmitters and receivers. With such a scheme, the transmitter has a counter for each type of buffer implemented at the receiver. When the transmitter sends a packet, it decrements the associated counter. When one of the counters reaches zero, the transmitter stops sending packets of that type. On the other hand, when the receiver frees a buffer, it sends back a NOP control packet (No Operation Packet) to the transmitter in order to update it about space availability.
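The credit-based flow control just described can be illustrated with a small sketch. The buffer-type names, the counter structure, and the function names below are illustrative assumptions for this article, not part of the HyperTransport specification.

#include <stdbool.h>

/* Illustrative buffer classes: one control and one data buffer
 * per base virtual channel (posted, non-posted, response). */
enum buf_type {
    POSTED_CTRL, POSTED_DATA,
    NONPOSTED_CTRL, NONPOSTED_DATA,
    RESPONSE_CTRL, RESPONSE_DATA,
    NUM_BUF_TYPES
};

/* Transmitter-side credit counters, one per receiver buffer type. */
struct tx_credits {
    unsigned credits[NUM_BUF_TYPES];
};

/* A packet may only be sent while the receiver has advertised space. */
bool try_send(struct tx_credits *tx, enum buf_type t)
{
    if (tx->credits[t] == 0)
        return false;      /* counter at zero: stop sending this type */
    tx->credits[t]--;      /* consume one credit for the sent packet  */
    return true;
}

/* Called when a NOP control packet reports that the receiver
 * freed one buffer of the given type. */
void on_nop_credit(struct tx_credits *tx, enum buf_type t)
{
    tx->credits[t]++;      /* space is available again */
}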
HyperTransport. Fig.  Buffer configuration for a HyperTransport link (each transmitter/receiver pair of devices implements control and data buffers for the posted request, non-posted request, and response virtual channels)

Extending the HyperTransport Fabric
The system depicted in Fig.  consists of several HyperTransport devices interconnected in a daisy chain. However, more complex topologies can also be implemented by using HyperTransport bridges. Bridges in HyperTransport are devices having a primary link connecting toward the processor and one or more secondary links that allow extending the topology in the opposite direction. In this way, HyperTransport trees, like the one shown in Fig. , can be implemented.

In addition to the use of bridges, HyperTransport defines two more features able to expand a HyperTransport system. These features are the AC mode and the HTX connector. The AC mode allows devices to be connected over longer distances than allowed by regular links, which make use of the DC mode. The HTX connector, on the other hand, allows external expansion cards to be plugged into the HyperTransport link and be presented to the rest of the system as any other HyperTransport device.

Improving the Scalability of HyperTransport
As shown above, HyperTransport offers some degree of scalability that enables the implementation of efficient topologies. Nevertheless, as HyperTransport was initially designed to replace traditional system buses, its benefits are mainly confined to interconnects within a single motherboard. Therefore, when HyperTransport topologies are to be scaled to larger sizes, in order to interconnect the processors and I/O subsystems in several motherboards (e.g., an entire cluster), HyperTransport is not able to do so, because such large system sizes require routing capabilities that exceed current HyperTransport ones. More specifically, HyperTransport is not able to provide device addressability beyond  devices. Additionally, it does not support efficient routing in scalable network topologies. As a result, high-performance computing vendors have no choice but to complement HyperTransport with other interconnect technologies, like InfiniBand in the case of general-purpose clusters, or proprietary interconnects, as in the case of Cray's XT and XT supercomputers [].
HyperTransport. Fig.  HyperTransport tree topology (a CPU and its memory at the root, with HT bridges fanning out to additional HT devices)

In order to overcome the limitations of HyperTransport ., an extension to it named the High Node Count HyperTransport Specification was recently released. This extension supports very large system sizes, like the ones found in large data centers, while remaining fully compatible with the HyperTransport specification. Additionally, the new extension adds a few bytes to current HyperTransport packets, but only in the cases where strictly required, thus minimizing the protocol overhead.

Briefly, the new extension provides an improved addressing scheme and a new control packet that allow HyperTransport devices to address any other device in large clusters. Additionally, new HyperTransport connectors and cables have been recently standardized in order to efficiently allow the deployment of the High Node Count HyperTransport Specification.

Related Entries
Busses and Crossbars
Data Centers
Interconnection Networks
PCI-Express
PGAS (Partitioned Global Address Space) Languages

Bibliographic Notes and Further Reading
The complete description of HyperTransport . can be found in the HyperTransport I/O link specification . []. Additionally, readers are also encouraged to look up more information on HyperTransport in []; this book nicely describes the HyperTransport technology. A complete description of the High Node Count HyperTransport Specification can be found in []. Finally, many white papers and additional information are publicly available on the HyperTransport Consortium web site [].

Bibliography
. Cray Inc. () Cray XT specifications. Available online at http://www.cray.com
. Duato J, Silla F, Holden B, Miranda P, Underhill J, Cavalli M, Yalamanchili S, Brüning U () Extending HyperTransport protocol for improved scalability. In: Proceedings of the first international workshop on HyperTransport research and applications, Mannheim, Germany, pp –
. Holden B, Trodden J, Anderson D () HyperTransport . interconnect technology: a comprehensive guide to the st, nd, and rd generations. MindShare Inc, Colorado Springs, CO
. HyperTransport Technology Consortium web site. http://www.hypertransport.org. Accessed 
. HyperTransport Technology Consortium. HyperTransport I/O link specification revision .. Available online at http://www.hypertransport.org. Accessed 
I
IBM Blue Gene Supercomputer

Alan Gara, José E. Moreira
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

Synonyms
Blue Gene/L; Blue Gene/P; Blue Gene/Q

Definition
The IBM Blue Gene Supercomputer is a massively parallel system based on the PowerPC processor. A Blue Gene/L system at Lawrence Livermore National Laboratory held the number  position in the TOP list of the fastest computers in the world from November  to November .

Discussion
Introduction
The IBM Blue Gene Supercomputer is a massively parallel system based on the PowerPC processor. It derives its computing power from scalability and energy efficiency. Each computing node of Blue Gene is optimized to achieve a high computational rate per unit of power and to operate with other nodes in parallel. This approach results in a system that can scale to very large sizes and deliver substantial aggregate performance.

Most large parallel systems in the – time frame followed a model of using off-the-shelf processors (typically from Intel, AMD, or IBM) and interconnecting them with either an industry standard network (e.g., Ethernet or InfiniBand) or a proprietary network (e.g., as used by Cray or IBM). Blue Gene took a different approach by designing a dedicated system-on-a-chip (SoC) that included not only processors optimized for floating-point computing, but also the networking infrastructure to interconnect these building blocks into a large system. This customized approach led to the scalability and power efficiency characteristics that differentiated Blue Gene from other machines that existed at the time of its commercial introduction in .

As of , IBM has produced two commercial versions of Blue Gene, Blue Gene/L and Blue Gene/P, which were first delivered to customers in  and , respectively. A third version, Blue Gene/Q, was under development. Both delivered versions follow the same design principles, the same system architecture, and the same software architecture. They differ in the specifics of the basic SoC that serves as the building block for Blue Gene. The November  TOP list includes four Blue Gene/L systems and ten Blue Gene/P systems (and one prototype Blue Gene/Q system). This article covers mostly the common aspects of both versions of the Blue Gene supercomputer and discusses details specific to each version as appropriate.

System Architecture
A Blue Gene system consists of a compute section, a file server section, and a host section (Fig. ). The compute and I/O nodes in the compute section form the computational core of Blue Gene. User jobs run in the compute nodes, while the I/O nodes connect the compute section to the file servers and front-end nodes through an Ethernet network. The file server section consists of a set of file servers. The host section consists of a service node and one or more front-end nodes. The service node controls the compute section through an Ethernet control network. The front-end nodes provide job compilation, job submission, and job debugging services.

Compute Section
The compute section of Blue Gene is what is usually called a Blue Gene machine. It consists of a three-dimensional array of compute nodes interconnected in a toroidal topology along the x, y, and z axes.
IBM Blue Gene Supercomputer. Fig.  High-level view of a Blue Gene system (compute section with compute nodes and I/O nodes, file server section with the file servers, and host section with a service node and front-end nodes, connected through I/O and control Ethernet networks)

I/O nodes are distinct from compute nodes and are not part of the toroidal interconnect, but they also belong to the compute section. A collective network interconnects all I/O and compute nodes of a Blue Gene machine. Each I/O node communicates outside of the machine through an Ethernet link.

Compute and I/O nodes are built out of the same Blue Gene compute ASIC (application-specific integrated circuit) and memory (DRAM) chips. The difference is in the function they perform. Whereas compute nodes connect to each other for passing application data, I/O nodes form the interface between the compute nodes and the outside world by connecting to an Ethernet network. Reflecting the difference in function, the software stacks of the compute and I/O nodes are also different, as will be discussed below.

The particular characteristics of the Blue Gene compute ASIC for the two versions (Blue Gene/L and Blue Gene/P) are illustrated in Fig.  and summarized in Table . The Blue Gene/L compute ASIC contains two non-memory-coherent PowerPC  cores, each with private L data and instruction caches ( KiB each). Associated with each core is a small ( KiB) L cache that acts as a prefetch engine. Completing the on-chip memory hierarchy is  MiB of embedded DRAM (eDRAM) that is configured to operate as a shared L cache. Also on the ASIC are a memory controller (for external DRAM) and interfaces to the five networks used to interconnect Blue Gene/L compute and I/O nodes: the torus, collective, global barrier, Ethernet, and control (JTAG) networks.

The Blue Gene/P compute ASIC contains four memory-coherent PowerPC  cores, each with private L data and instruction caches ( KiB each). Associated with each core are both a small ( KiB) L cache that acts as a prefetch engine, as in Blue Gene/L, and a snoop filter to reduce coherence traffic into each core. Two banks of  MiB eDRAM are configured to operate as a shared L cache. Completing the ASIC are a dual memory controller (for external DRAM), interfaces to the same five networks as Blue Gene/L, and a DMA engine to transfer data directly from the memory of one node to another.

Both the PowerPC  and PowerPC  cores include a vector floating-point unit that extends the PowerPC instruction set architecture with instructions that assist in matrix and complex-arithmetic operations. The vector floating-point unit can perform two fused multiply-add operations per cycle, for a total of four floating-point operations per cycle per core.
IBM Blue Gene Supercomputer. Fig.  Blue Gene compute ASIC for Blue Gene/L (a) and Blue Gene/P (b) (PowerPC cores with double FPUs, private L caches, shared eDRAM L cache, DDR memory controllers, and interfaces to the torus, collective, global barrier, Ethernet, and JTAG control networks)
IBM Blue Gene Supercomputer. Table  Summary of differences between Blue Gene/L and Blue Gene/P nodes

Node characteristics | Blue Gene/L | Blue Gene/P
Processors | Two PowerPC  | Four PowerPC 
Processor frequency |  MHz |  MHz
Peak computing capacity |  Gflop/s ( flops/cycle/core) |  Gflop/s ( flops/cycle/core)
Main memory capacity |  MiB or  GiB |  GiB or  GiB
Main memory bandwidth |  GB/s ( byte @  MHz) |  GB/s ( x  byte @  MHz)
Torus link bandwidth ( links) |  MB/s per direction |  MB/s per direction
Collective link bandwidth ( links) |  MB/s per direction |  MB/s per direction

The interconnection networks are primarily used for the communication primitives of parallel high-performance computing applications. The main interconnection network is the torus network, which provides point-to-point and multicast communication across compute nodes. A collective network in a tree topology interconnects all compute and I/O nodes and supports efficient collective operations, such as reductions and broadcasts. Arithmetic and logical operations are implemented as part of the communication primitives to facilitate low-latency collective operations. A global barrier network supports fast synchronization and notification across compute and I/O nodes. The Ethernet network is used to connect the I/O nodes outside the machine, as previously discussed. Finally, the control network is used to control the hardware from the service node. Figure  illustrates the topology of the various networks.
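As an illustration of the toroidal topology just described, the following sketch computes the six torus neighbors of a node. The coordinate structure and function name are assumptions made for this illustration; they are not Blue Gene system software.

/* Coordinates of a node in an X x Y x Z torus. */
struct coord { int x, y, z; };

/* Wrap-around helper: moving off one end of a dimension
 * re-enters at the opposite end, closing the torus. */
static int wrap(int v, int dim) { return (v + dim) % dim; }

/* Fill out[0..5] with the +x, -x, +y, -y, +z, -z neighbors of c. */
void torus_neighbors(struct coord c, int X, int Y, int Z,
                     struct coord out[6])
{
    out[0] = (struct coord){ wrap(c.x + 1, X), c.y, c.z };
    out[1] = (struct coord){ wrap(c.x - 1, X), c.y, c.z };
    out[2] = (struct coord){ c.x, wrap(c.y + 1, Y), c.z };
    out[3] = (struct coord){ c.x, wrap(c.y - 1, Y), c.z };
    out[4] = (struct coord){ c.x, c.y, wrap(c.z + 1, Z) };
    out[5] = (struct coord){ c.x, c.y, wrap(c.z - 1, Z) };
}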
Compute and I/O nodes are grouped into units called midplanes. A midplane contains  node cards, and each node card contains  compute nodes and up to four (Blue Gene/L) or two (Blue Gene/P) I/O nodes. The  compute nodes in a midplane are arranged as an  x  x  three-dimensional mesh. Midplanes are typically configured with – (Blue Gene/P) or – (Blue Gene/L) I/O nodes. Each midplane also has  link chips used for inter-midplane connections to build larger systems. The link chips also implement the "closing" of the toroidal interconnect, by connecting the two ends of a dimension. Midplanes are arranged two to a rack, and racks are arranged in a two-dimensional layout of rows and columns. Because the midplane is the basic replication unit, the dimensions of the array of compute nodes must be a multiple of the midplane size in each dimension.

The configuration of the original Blue Gene/L system at Lawrence Livermore National Laboratory (LLNL) is shown in Fig. . That machine was later upgraded to  racks and is the largest Blue Gene system delivered to date. The highest performing Blue Gene system delivered to date is a -rack Blue Gene/P system at the Juelich Research Center. Several single-rack (, compute nodes) Blue Gene systems, as well as systems of intermediate size, were also delivered to various customers.

A given Blue Gene machine can be partitioned along midplane boundaries. A partition is formed by a rectangular arrangement of midplanes. Each partition can run only one job at any given time. During each job, all the compute nodes of a partition stay in the same execution mode for the duration of the job. These modes of execution are described in the section Overall Operating System Architecture.

File Server Section
The file server section of a Blue Gene system provides the storage for the file system that runs on the Blue Gene I/O nodes. Several parallel file systems have been ported to Blue Gene, including GPFS, PVFS, and Lustre. To feed data to a Blue Gene system, multiple servers are required to achieve the required bandwidth. The original Blue Gene/L system at LLNL, for example, uses  servers operating in parallel. Data is striped across those servers, and a multi-level switching Ethernet network is used to connect the I/O nodes to the servers. The servers themselves are standard rack-mounted machines, typically Intel, AMD, or POWER processor based.
IBM Blue Gene Supercomputer. Fig.  Blue Gene networks. Three-dimensional torus (a), collective network (b), and control network and Ethernet (c)

Host Section
The host section for a Blue Gene/L system consists of one service node and one or more front-end nodes. These nodes are standard POWER processor machines. For the LLNL machine, the service node is a -processor POWER machine, and each of the  front-end nodes is a PowerPC  blade.

The service node is responsible for controlling and monitoring the operation of the compute section. The services it implements include machine partitioning, partition boot, application launch, standard I/O routing, application signaling and termination, event monitoring (for events generated by the compute and I/O nodes), and environmental monitoring (for things like power supply voltages, fan speeds, and temperatures).

The front-end nodes are where users work. They provide access to compilers, debuggers, and job submission services. Console I/O from user applications is routed to the submitting front-end node.

Blue Gene System Software
The primary operating system for Blue Gene compute nodes is a lightweight operating system called the Compute Node Kernel (CNK). This simple kernel implements only a limited set of services, which are complemented by services provided by the I/O nodes.
IBM Blue Gene Supercomputer. Fig.  Packaging hierarchy for the original Blue Gene/L at Lawrence Livermore National Laboratory (chip: 2 processors, 5.6 GF/s, 4 MB; compute card: 2 chips, 1x2x1, 11.2 GF/s, 1.0 GB; node card: 32 chips, 4x4x2, 16 compute and 0–2 I/O cards, 180 GF/s, 16 GB; rack: 32 node cards, 5.6 TF/s, 512 GB; system: 64 racks, 64x32x32, 360 TF/s, 32 TB)

The I/O nodes, in turn, run a version of the Linux operating system. The I/O nodes act as gateways between the outside world and the compute nodes, complementing the services provided by the CNK with file and socket operations, debugging, and signaling.

This split of functions between I/O and compute nodes, with the I/O nodes dedicated to system services and the compute nodes dedicated to application execution, resulted in a simplified design for both components. It also enables Blue Gene scalability and robustness and achieves a deterministic execution environment.

Scientific middleware for Blue Gene includes a user-level library implementation of the MPI standard, optimized to take advantage of the Blue Gene networks, and various math libraries, also at the user level. Implementing all the message passing functions in user mode simplifies the supervisor (kernel) code of Blue Gene and results in better performance by reducing the number of kernel system calls that an application performs.

Overall Operating System Architecture
A key concept in the Blue Gene operating system solution is the organization of compute and I/O nodes into logical entities called processing sets, or psets. A pset consists of one I/O node and a collection of compute nodes. Every system partition, in turn, is organized as a collection of psets. All psets in a partition must have the same number of compute nodes, and the psets of a partition must cover all the I/O and compute nodes of the partition. The psets of a partition never overlap. The supported pset sizes are  (Blue Gene/L only), , , , and  compute nodes, plus the I/O node. The psets are a purely logical concept implemented by the Blue Gene system software stack. They are built to reflect the topological proximity between I/O and compute nodes, thus improving communication performance within a pset.

A Blue Gene job consists of a collection of N compute processes. Each process has its own private address space, and two processes of the same job communicate only through message passing.
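A minimal sketch of this message-passing-only model, using standard MPI calls, is shown below; the tag, value, and ranks involved are arbitrary illustrative choices.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Processes share no memory: rank 0 must explicitly send
     * data to rank 1, which explicitly receives it. */
    if (rank == 0 && nprocs > 1) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}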
The primary communication model for Blue Gene is MPI. The N compute processes of a Blue Gene job correspond to tasks with ranks  to N −  in the MPI_COMM_WORLD communicator. Compute processes run only on compute nodes; conversely, compute nodes run only compute processes.

In Blue Gene/L, the CNK implements two modes of operation for the compute nodes: coprocessor mode and virtual node mode. In coprocessor mode, the single process in the node has access to the entire node memory. One processor executes user code while the other performs communication functions. In virtual node mode, the node memory is split in half between the two processes running on the two processors. Each process performs both computation and communication functions.

In Blue Gene/P, the CNK provides three modes of operation for the compute nodes: SMP mode, dual mode, and quad mode. The SMP mode supports a single multithreaded (up to four threads) application process per compute node. That process has access to the entire node memory. Dual mode supports two multithreaded (up to two threads each) application processes per compute node. Quad mode supports four single-threaded application processes per compute node. In dual and quad modes, the application processes in a node split the memory of that node. Blue Gene/P supports at most one application thread per processor. It also supports an optional communication thread in each processor.

Each I/O node runs one image of the Linux operating system. It can offer the entire spectrum of services expected in a Linux box, such as multiprocessing, file systems, and a TCP/IP communication stack. These services are used to extend the capabilities of the compute node kernel, providing richer functionality to the compute processes. Due to the lack of cache coherency between the processors of a Blue Gene/L node, only one of the processors of each I/O node is used by Linux, while the other processor remains idle. Since the processors in a Blue Gene/P node are cache coherent, Blue Gene/P I/O nodes can run a true multiprocessor Linux instance.

The Compute Node Kernel
CNK is a lean operating system that performs a simple sequence of operations at job start time. This sequence of operations happens in every compute node of a partition, at the start of each job:

1. It creates the address space(s) for execution of the compute process(es) in a compute node.
2. It loads code and initialized data for the executable of that (those) process(es).
3. It transfers processor control to the loaded executable, changing from supervisor to user mode.

The address spaces of the processes are flat and fixed, with no paging. The entire mapping is designed to fit statically in the TLBs of the PowerPC processors. The loading of code and data occurs in push mode. The I/O node of a pset reads the executable from the file system and forwards it to all compute nodes in the pset. The CNK in a compute node receives that executable and stores the appropriate memory values in the address space(s) of the compute process(es).

Once the CNK transfers control to the user application, its primary mission is to "stay out of the way." In normal execution, processor control stays with the compute process until it requests an operating system service through a system call. Exceptions to this normal execution are caused by hardware interrupts: either timer alarms requested by the user code, communication interrupts caused by arriving packets, or an abnormal hardware event that requires attention by the compute node kernel.

When a compute process makes a system call, three things may happen:

1. "Simple" system calls that require little operating system functionality, such as getting the time or setting an alarm, are handled locally by the compute node kernel. Control is transferred back to the compute process at completion of the call.
2. "I/O" system calls that require infrastructure for file systems and the IP stack are shipped for execution to the I/O node associated with that compute node (i.e., the I/O node in the pset of the compute node). The compute node kernel waits for a reply from the I/O node and then returns control back to the compute process.
3. "Unsupported" system calls that require infrastructure not present in Blue Gene are returned right away with an error condition.
There are two main benefits from this simple approach for a compute node operating system: robustness and scalability. Robustness comes from the fact that the compute node kernel performs few services, which greatly simplifies its design, implementation, and test. Scalability comes from the lack of interference with running compute processes.

System Software for the I/O Node
The I/O node plays a dual role in Blue Gene. On one hand, it acts as an effective master of its corresponding pset. On the other hand, it services requests from compute nodes in that pset. Jobs are launched in a partition by contacting the corresponding I/O nodes. Each I/O node is then responsible for loading and starting the execution of the processes in each of the compute nodes of its pset. Once the compute processes start running, the I/O nodes wait for requests from those processes. Those requests are mainly I/O operations to be performed against the file systems mounted in the I/O node.

Blue Gene I/O nodes execute an embedded version of the Linux operating system. It is classified as an embedded version because it does not use any swap space, it has an in-memory root file system, it uses little memory, and it lacks the majority of daemons and services found in a server-grade configuration of Linux. It is, however, a complete port of the Linux kernel, and those services can be, and in various cases have been, turned on for specific purposes. The Linux in Blue Gene I/O nodes includes a full TCP/IP stack, supporting communications to the outside world through Ethernet. It also includes file system support. Various network file systems have been ported to the Blue Gene/L I/O node, including GPFS, Lustre, NFS, and PVFS.

Blue Gene/L I/O nodes never run application processes. That duty is reserved to the compute nodes. The main user-level process running on the Blue Gene/L I/O node is the control and I/O daemon (CIOD). CIOD is the process that links the compute processes of an application running on compute nodes to the outside world. To launch a user job in a partition, the service node contacts the CIOD of each I/O node of the partition and passes the parameters of the job (user ID, group ID, supplementary groups, executable name, starting working directory, command-line arguments, and environment variables). CIOD swaps itself to the user's identity, which includes the user ID, group ID, and supplementary groups. It then retrieves the executable from the file system and sends the code and initialized data through the collective network to each of the compute nodes in the pset. It also sends the command-line arguments and environment variables, together with a start signal.

Figure  illustrates how I/O system calls are handled in Blue Gene. When a compute process performs a system call requiring I/O (e.g., open, close, read, write), that call is trapped by the compute node kernel, which packages the parameters of the system call and sends that message to the CIOD in its corresponding I/O node. CIOD unpacks the message and then reissues the system call, this time under the Linux operating system of the I/O node. Once the system call completes, CIOD packages the result and sends it back to the originating compute node kernel, which, in turn, returns the result to the compute process.

There is a synergistic effect between simplification and separation of responsibilities. By offloading complex system operations to the I/O node, Blue Gene keeps the compute node operating system simple. Correspondingly, by keeping application processes separate from the I/O node activity, it avoids many security and safety issues regarding execution in the I/O nodes. In particular, there is never a need for the common scrubbing daemons typically used in Linux clusters to clean up after misbehaving jobs. Just as keeping system services in the I/O nodes prevents interference with compute processes, keeping those processes in compute nodes prevents interference with system services in the I/O node. This isolation is particularly helpful during performance debugging work. The overall simplification of the operating system has enabled the scalability, reproducibility (performance results for Blue Gene applications are very close across runs), and high performance of important Blue Gene/L applications.

System Software for the Service Node
The Blue Gene service node runs its control software, typically referred to as the Blue Gene control system. The control system is responsible for the operation and monitoring of all compute and I/O nodes. It is also responsible for other hardware components such as link chips, power supplies, and fans. Tight integration between the Blue Gene control system and the I/O and compute node operating systems is central to the Blue Gene software stack.
IBM Blue Gene Supercomputer. Fig.  Function shipping between CNK and CIOD (an application call such as fscanf enters libc on the compute node, the CNK ships the resulting read over the tree network to CIOD on the I/O node, and CIOD reissues it under Linux against a file server)
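The function-shipping path in the figure can be summarized with the following sketch. The message layout, transport primitives, and function names are hypothetical stand-ins for illustration only; they are not the actual CNK or CIOD interfaces.

#include <stdint.h>

/* Hypothetical message carrying a shipped I/O system call. */
struct io_request {
    int      syscall_no;             /* e.g., open/close/read/write  */
    uint64_t args[6];                /* packed system call arguments */
};

/* Assumed transport primitives over the collective (tree) network. */
void send_to_ciod(const struct io_request *req);
long wait_for_ciod_reply(void);

/* Compute-node side: trap the call, ship it, wait for the result. */
long cnk_ship_syscall(int syscall_no, const uint64_t args[6])
{
    struct io_request req = { .syscall_no = syscall_no };
    for (int i = 0; i < 6; i++)
        req.args[i] = args[i];

    send_to_ciod(&req);              /* packaged parameters go to CIOD */
    return wait_for_ciod_reply();    /* result returned to the caller  */
}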

It represents one more step in the specialization of services that characterizes that stack.

In Blue Gene, the control system is responsible for setting up system partitions and loading the initial code and state in the nodes of a partition. The Blue Gene compute and I/O nodes are completely stateless: no hard drives and no persistent memory. When a partition is created, the control system programs the hardware to isolate that partition from others in the system. It computes the network routing for the torus, collective, and global interrupt networks, thus simplifying the compute node kernel. It loads the operating system code for all compute and I/O nodes of a partition through the dedicated control network. It also loads an initial state in each node (called the personality of the node). The personality of a node contains information specific to the node.

Blue Gene and Its Impact on Science
Blue Gene was designed to deliver a level of performance for scientific computing that enables entirely new studies and new applications. One of the main application areas of the original LLNL system is materials science. Scientists at LLNL use a variety of models, including quantum molecular dynamics, classical molecular dynamics, and dislocation dynamics, to study materials at different levels of resolution. Typically, each model is applicable to a range of system sizes being studied. First-principles models can be used for small systems, while more phenomenological models have to be used for large systems. The original Blue Gene/L at LLNL was the first system that allowed crossing of those boundaries. Scientists can use first-principles models in systems large enough to validate the phenomenological models, which in turn can be used in even larger systems.

Applications of notable significance at LLNL include ddcMD, a classical molecular dynamics code that has been used to simulate systems with approximately half a billion atoms (and in the process win the  Gordon Bell award), and QBox, a quantum molecular dynamics code that won the  Gordon Bell award. Other success stories for Blue Gene in science include: (1) astrophysical simulations at Argonne National Laboratory (ANL) using the FLASH code; (2) global climate simulations at the National Center for Atmospheric Research (NCAR) using the HOMME code; (3) biomolecular simulations at the T.J. Watson Research Center using the Blue Matter code; and (4) quantum chromodynamics (QCD) at the IBM T.J. Watson Research Center, LLNL, San Diego Supercomputing Center, Juelich Research Center, Massachusetts Institute of Technology, Boston University, University of Edinburgh, and KEK (Japan) using a variety of codes.
One of the most innovative uses of Blue Gene is as the central processor for the large-scale LOFAR radio telescope in the Netherlands.

Related Entries
IBM Power Architecture
LINPACK Benchmark
MPI (Message Passing Interface)
TOP

Bibliographic Notes and Further Reading
Additional information about the system architecture of Blue Gene can be found in [, , , , ]. For details on the Blue Gene system software, the reader is referred to [, ]. Finally, examples of the impact of Blue Gene on science can be found in [–, , –].

Bibliography
. Almasi G, Chatterjee S, Gara A, Gunnels J, Gupta M, Henning A, Moreira JE, Walkup B () Unlocking the performance of the BlueGene/L supercomputer. In: Proceedings of IEEE/ACM SC, Pittsburgh, Nov 
. Almasi G, Bhanot G, Gara A, Gupta M, Sexton J, Walkup B, Bulatov VV, Cook AW, de Supinski BR, Glosli JN, Greenough JA, Gygi F, Kubota A, Louis S, Streitz FH, Williams PL, Yates RK, Archer C, Moreira J, Rendleman C () Scaling physics and material science applications on a massively parallel Blue Gene/L system. In: Proceedings of the th ACM international conference on supercomputing, Cambridge, MA, June , pp –
. Bulatov V, Cai W, Fier J, Hiratani M, Hommes G, Pierce T, Tang M, Rhee M, Yates RK, Arsenlis T () Scalable line dynamics in ParaDiS. In: Proceedings of IEEE/ACM SC, Pittsburgh, Nov 
. Fitch BG, Rayshubskiy A, Eleftheriou M, Ward TJC, Giampapa M, Pitman MC, Germain RS () Blue Matter: approaching the limits of concurrency for classical molecular dynamics. In: Proceedings of IEEE/ACM SC, Tampa, Nov 
. Fryxell B, Olson K, Ricker P, Timmes FX, Zingale M, Lamb DQ, MacNeice P, Rosner R, Truran JW, Tufo H () FLASH: an adaptive mesh hydrodynamics code for modeling astrophysical thermonuclear flashes. Astrophys J Suppl :
. Gara A, Blumrich MA, Chen D, Chiu GL-T, Coteus P, Giampapa ME, Haring RA, Heidelberger P, Hoenicke D, Kopcsay GV, Liebsch TA, Ohmacht M, Steinmacher-Burow BD, Takken T, Vranas P () Overview of the Blue Gene/L system architecture. IBM J Res Dev (/):–
. Gygi F, Yates RK, Lorenz J, Draeger EW, Franchetti F, Ueberhuber CW, de Supinski BR, Kral S, Gunnels JA, Sexton JC () Large-scale first-principles molecular dynamics simulations on the BlueGene/L platform using the Qbox code. In: Proceedings of IEEE/ACM SC, Seattle, Nov 
. IBM Blue Gene Team () Overview of the IBM Blue Gene/P project. IBM J Res Dev (/):–
. Moreira JE, Almasi G, Archer C, Bellofatto R, Bergner P, Brunheroto JR, Brutman M, Castaños JG, Crumley PG, Gupta M, Inglett T, Lieber D, Limpert D, McCarthy P, Megerian M, Mendell M, Mundy M, Reed D, Sahoo RK, Sanomiya A, Shok R, Smith B, Stewart GG () Blue Gene/L programming and operating environment. IBM J Res Dev (/):–
. Moreira JE et al () The Blue Gene/L supercomputer: a hardware and software story. Int J Parallel Program ():–
. Salapura V, Bickford R, Blumrich M, Bright AA, Chen D, Coteus P, Gara A, Giampapa M, Gschwind M, Gupta M, Hall S, Haring RA, Heidelberger P, Hoenicke D, Kopcsay GV, Ohmacht M, Rand RA, Takken T, Vranas P () Power and performance optimization at the system level. In: Proceedings of ACM Computing Frontiers, Ischia, May 
. van der Schaaf K () Blue Gene in the heart of a wide area sensor network. In: Proceedings of the QCDOC and Blue Gene: next generation of HPC architecture workshop, Edinburgh, Oct 
. Streitz FH, Glosli JN, Patel MV, Chan B, Yates RK, de Supinski BR, Sexton J, Gunnels JA () + TFlop solidification simulations on BlueGene/L. In: Proceedings of IEEE/ACM SC, Seattle, Nov 
. Vranas P, Bhanot G, Blumrich M, Chen D, Gara A, Heidelberger P, Salapura V, Sexton JC () The BlueGene/L supercomputer and quantum chromodynamics. In: Proceedings of IEEE/ACM SC, Tampa, Nov 
. Wait CD () IBM PowerPC  FPU with complex-arithmetic extensions. IBM J Res Dev (/):–

IBM Power
IBM Power Architecture

IBM Power Architecture

Tejas S. Karkhanis, José E. Moreira
IBM T.J. Watson Research Center, Yorktown Heights, NY, USA

Synonyms
IBM Power; IBM PowerPC

Definition
The IBM Power architecture is an instruction set architecture (ISA) implemented by a variety of processors from IBM and other vendors, including Power, IBM's
latest server processor. The IBM Power architecture is designed to exploit parallelism at the instruction, data, and thread level.

Discussion
Introduction
IBM's Power ISA™ is an instruction set architecture designed to expose and exploit parallelism in a wide range of applications, from embedded computing to high-end scientific computing to traditional transaction processing. Processors implementing the Power ISA have been used to create several notable parallel computing systems, including the IBM RS/ SP, the Blue Gene family of computers, the Deep Blue chess-playing machine, the PERCS system, the Sony PlayStation  game console, and the Watson system that competed in the popular television show Jeopardy!

Power ISA covers both -bit and -bit variants and, as of its latest version (. Revision B []), is organized in a set of four "books," as shown in Fig. . Books I and II are common to all implementations. Book I, Power ISA User Instruction Set Architecture, covers the base instruction set and related facilities available to the application programmer. Book II, Power ISA Virtual Environment Architecture, defines the storage (memory) model and related instructions and facilities available to the application programmer. In addition to the specifications of Books I and II, implementations of the Power ISA need to follow either Book III-S or Book III-E. Book III-S, Power ISA Operating Environment Architecture – Server Environment, defines the supervisor instructions and related facilities used for general purpose implementations. Book III-E, Power ISA Operating Environment Architecture – Embedded Environment, defines the supervisor instructions and related facilities used for embedded implementations. Finally, Book VLE, Power ISA Operating Environment Architecture – Variable Length Encoding Environment, defines alternative instruction encodings and definitions intended to increase instruction density for very low end implementations.

IBM Power Architecture. Fig.  Books of Power ISA version . (Book I: User Instruction Set Architecture; Book II: Virtual Environment Architecture; Book III: Operating Environment Architecture, with Book III-S for the server environment and Book III-E for the embedded environment; Book VLE: Variable Length Encoding)

Figure  shows the evolution of the main line of Power architecture server processors from IBM, just one of the many families of products based on Power ISA. The figure shows, for each generation of processors, its introduction date, the silicon technology used, and the main architectural innovations delivered in that generation. Power [] is IBM's latest-generation server processor and implements the Power ISA according to Books I, II, and III-S of []. Power is used in both the PERCS and Watson systems and in a variety of servers offered by IBM. The latest machine in the Blue Gene family, Blue Gene/Q, follows a Book III-E implementation.

Power ISA was designed to support high program execution performance and efficient utilization of hardware resources. To that end, Power ISA provides facilities for expressing instruction-level parallelism, data-level parallelism, and thread-level parallelism. Providing facilities for a variety of parallelism types gives the programmer flexibility in extracting the particular combination of parallelism that is optimal for his or her program.

Instruction-Level Parallelism
Instruction-level parallelism (ILP) is the simultaneous processing of several instructions by a processor. ILP is important for performance because it allows instructions to overlap, thus effectively hiding the execution latency of long-latency computational and memory access instructions. Achieving ILP has been so important in the processor industry that processor core designs have gone from simple multicycle designs to complex designs that implement superscalar pipelines and out-of-order execution [].
IBM Power Architecture. Fig.  Evolution of main line Power architecture server processors (POWER1, 1.0 μm, RISC; POWER2, 0.72 μm, wide ILP; POWER3, 0.22 μm, SMP, 64-bit; POWER4, 180 nm, dual core; POWER5, 130 nm, SMT; POWER6, 65 nm, ultra high frequency; POWER7, 45 nm, multi core; introduced between roughly 1990 and 2010)

Key aspects of Power ISA that facilitate ILP are independent instruction facilities, a reduced set of instructions, fixed-length instructions, and a large register set.

Independent Instruction Facilities
Conceptually, Power ISA views the underlying processor as composed of several engines or units, as illustrated in the floor plan for the Power processor core shown in Fig. . Book I of the Power ISA groups the instructions into "facilities," including (1) the branch facility, with instructions implemented by the instruction fetch unit (IFU); (2) the fixed-point facility, with instructions implemented by the fixed-point unit (FXU) and load-store unit (LSU); (3) the floating-point facility, with instructions implemented by the vector and scalar unit (VSU); (4) the decimal floating-point facility, with instructions implemented by the decimal floating-point unit (DFU); and (5) the vector facility, with instructions implemented by the vector and scalar unit (VSU). Also shown in Fig.  is the instruction-sequencing unit (ISU), which controls the execution of instructions, and a level- cache.

The origins of this conceptual decomposition are in the era of building processors out of multiple integrated circuits (chips). With a single processor spread across multiple chips, the communication between two integrated circuits took significantly more time relative to communication internal to a chip. Consequently, either the clock frequency would have to be reduced or the number of stages in the processor pipeline would have to be increased. Both approaches, reducing clock frequency and increasing the number of pipeline stages, can degrade performance. The decomposition into multiple units allowed a clear separation of work, and each unit could be implemented on a single chip for maximum performance.
IBM Power Architecture. Fig.  Power processor core floor plan, showing the main units (IFU, ISU, FXU, LSU, VSU, DFU, and a 256-Kbyte L2 cache)

Today, the conceptual decomposition provides two primary benefits. First, because of the conceptual decomposition, the interfaces between the engines are clearly defined. Clearly defined interfaces lead to a hardware design that is simpler to implement and to verify. Second, the conceptual decomposition addresses the inability to scale the frequency of long on-chip wires, which can be a performance limiter, just as it addressed wiring issues between two or more integrated circuits when the conceptual decomposition was introduced.

Reduced Set of Nondestructive Fixed-Length Instructions
Power ISA consists of a reduced set of fixed-length -bit instructions. A large fraction of the set of instructions are nondestructive. That is, the result register is explicitly identified, as opposed to implicitly being one of the source registers. A reduced set of instructions simplifies the design of the processor core and also the verification of corner cases in the hardware.

Ignoring the Book-VLE case, which is targeted to very low end systems, Power ISA instructions are all -bits in length, thus the beginning and end of every instruction is known before decode. The bits in an instruction word are numbered from  (most significant) to  (least significant), following the big-endian convention of the Power architecture. All Power ISA instructions have a major opcode that is located at instruction bits –. Some instructions also have a minor opcode, to differentiate among instructions with the same major opcode. The location and length of the minor opcode depend on the major opcode. Additionally, every instruction is word-aligned. Fixed length, word alignment, and a fixed opcode location make the instruction predecode, fetch, branch prediction, and decode logic simpler when compared to the decode logic of a variable-length ISA. Srinivasan et al. [] present a comprehensive study of the optimality of the pipeline length of Power processors from a power and performance perspective.

Instruction set architectures that employ destructive operations (i.e., one of the source registers is also the target) must temporarily save one of the source registers if the contents of that register are required later in the program. Temporarily saving and later restoring registers often leads to a store to and a load from, respectively, a memory location. Memory operations can take longer to complete than a computational operation. Nondestructive operations in Power ISA eliminate the need for extra instructions for saving and restoring one of the source registers, facilitating higher instruction-level parallelism.
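Because every instruction is a fixed-length word with the major opcode in a fixed position, reading the opcode field is a simple shift and mask. The sketch below assumes the usual Power ISA layout of a 32-bit instruction word with a 6-bit major opcode in bits 0–5; the helper names are illustrative.

#include <stdint.h>

/* Power ISA numbers bits 0 (most significant) through 31 (least
 * significant).  This helper extracts bits msb..lsb of a 32-bit
 * instruction word under that convention. */
static inline uint32_t insn_bits(uint32_t insn, unsigned msb, unsigned lsb)
{
    return (insn >> (31u - lsb)) & ((1u << (lsb - msb + 1u)) - 1u);
}

/* The major opcode sits in a fixed field at the top of every
 * instruction word, so a decoder can classify an instruction
 * before it looks at any other bits. */
static inline uint32_t major_opcode(uint32_t insn)
{
    return insn_bits(insn, 0, 5);
}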
Large Register Set
Power ISA originally specified  general-purpose (fixed-point, either - or -bit) and  floating-point (-bit) registers. An additional set of  vector (-bit) registers was added with the first set of vector instructions. The latest specification, Power ISA ., expands the number of vector registers to . A large number of registers means that more data, including function and subroutine parameters, can be kept in fast registers. This in turn avoids load/store operations to save and retrieve data to and from memory and supports concurrent execution of more instructions.

Load/Store Architecture
Power ISA specifies a load-store architecture consisting of two distinct types of instructions: (1) memory access instructions and (2) compute instructions. Memory access instructions load data from memory into computational registers and store data from the computational registers to memory. Compute instructions perform computations on the data residing in the computational registers. This arrangement decouples the responsibilities of the memory instructions and computational instructions, providing a powerful lever to hide memory access latency by overlapping the long latency of memory access instructions with compute instructions.

ILP in Power
Power is an out-of-order superscalar processor that can operate at frequencies exceeding  GHz. In a given clock cycle, a Power processor core can fetch up to eight instructions, decode and dispatch up to six instructions, issue and execute up to eight instructions, and commit up to six instructions. To ensure a high instruction throughput, Power can simultaneously maintain about  instructions in various stages of processing. To further extract independent instructions for parallelism, Power implements register renaming – each of the architected register files is mapped to a much larger set of physical registers. Execution of the instructions is carried out by a total of  execution units. Power implements the Power ISA in a way that extracts high levels of instruction-level parallelism while operating at a high clock frequency.

Data-Level Parallelism
Data-level parallelism (DLP) consists of simultaneously performing the same type of operation on different data values, using multiple functional units, with a single instruction. The most common approach to providing DLP in general purpose processors is the Single Instruction Multiple Data (SIMD) technique. SIMD (also called vector) instructions provide a concise and efficient way to express DLP. With SIMD instructions, fewer instructions are required to perform the same data computation, resulting in lower fetch, decode, and dispatch bandwidth, and consequently higher power efficiency.

Power ISA . contains two sets of SIMD instructions. The first one is the original set of instructions implemented by the vector facility since  and also known as AltiVec [] or Vector Media Extensions (VMX) instructions. The second is a new set of SIMD instructions called the Vector-Scalar Extension (VSX).

VMX Instructions
VMX instructions operate on -bit wide data, which can be vectors of byte (-bit), half-word (-bit), and word (-bit) elements. The word elements can be either integer or single-precision floating-point numbers. The VMX instructions follow the load/store model, with a -entry register set (each entry is -bit wide) that is separate from the original (scalar) fixed- and floating-point registers in Power ISA ..

VSX Instructions
VSX also operates on -bit wide data, which can be vectors of word (-bit) and double-word (-bit) elements. Most operations are on floating-point numbers (single and double precision), but VSX also includes integer conversion and logical operations. VSX instructions also follow the load/store model, with a -entry register set ( bits per entry) that overlaps the VMX and floating-point registers. VSX requires no operating-mode switches. Therefore, it is possible to interleave VSX instructions with floating-point and integer instructions.

Power Vector and Scalar Unit (VSU)
The vector and scalar unit of Power is responsible for the execution of the VMX and VSX SIMD instructions. The unit contains one vector pipeline and four double-precision floating-point pipelines. A VSX floating-point instruction uses two floating-point pipelines, and two VSX instructions can be issued every cycle to keep all floating-point pipelines busy. The four floating-point pipelines can each execute a double-precision fused multiply-add operation, leading to a performance of  flops/cycle for a Power core.
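As a sketch of how this data-level parallelism is expressed in source code, the following loop uses the VMX/AltiVec C intrinsics (available with a VMX-capable compiler, e.g., when building with -maltivec). The array layout and alignment assumptions are illustrative choices, not requirements taken from the text above.

#include <altivec.h>

/* y[i] = a[i] * x[i] + y[i], four single-precision elements per
 * VMX operation.  n is assumed to be a multiple of 4 and the
 * pointers are assumed to be 16-byte aligned, as vec_ld requires. */
void saxpy_vmx(int n, const float *a, const float *x, float *y)
{
    for (int i = 0; i < n; i += 4) {
        vector float va = vec_ld(0, a + i);   /* load 128 bits (4 floats) */
        vector float vx = vec_ld(0, x + i);
        vector float vy = vec_ld(0, y + i);
        vy = vec_madd(va, vx, vy);            /* fused multiply-add on 4 lanes */
        vec_st(vy, 0, y + i);                 /* store the 4 results */
    }
}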
Thread-Level Parallelism
Thread-level parallelism (TLP) is the simultaneous execution of multiple threads of instructions. Unlike ILP and DLP, which rely on extracting parallelism from within the same program thread, TLP relies on explicit parallelism from multiple concurrently running threads. The multiple threads can come from the decomposition of a single program or from multiple independent programs.

Thread-Level Parallelism Within a Processor Core
In the first systems that exploited thread-level parallelism, different threads executed on different processor cores and shared a memory system. Today, processor core designs have evolved such that multiple threads can run on a single processor core. This increases resource utilization and, consequently, the computational throughput of the core. Effectively, TLP within a core enables hiding the long-latency events of stalled threads with forward progress of active threads. Examples of Power ISA processors that support multithreading within a core include the Power [], Power [], and Power [] processors.

Memory Coherence Models
For programs where the concurrent threads share memory while working on a common task, the memory consistency model of the architecture plays a key role in the performance of TLP as a function of the number of threads. The memory consistency model specifies how memory references from different threads can be interleaved. Power ISA specifies a release consistency memory model. A release consistency model relaxes the ordering of memory references as seen by different threads. When a particular ordering of memory references among threads is necessary for the program, explicit synchronization operations must be used.
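The need for explicit synchronization under such a relaxed model can be sketched with a producer/consumer pair. The C11 atomics below stand in for whatever synchronization primitives a program actually uses on Power; the variable names and the value being published are illustrative.

#include <stdatomic.h>

int payload;                      /* ordinary shared data */
atomic_int ready;                 /* synchronization flag */

/* Producer thread: publish payload, then release the flag.  The
 * release store keeps the payload write from appearing, to other
 * threads, to be reordered after the flag update. */
void producer(void)
{
    payload = 42;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer thread: acquire the flag before reading the payload.
 * Without the acquire/release pair, a relaxed memory model such
 * as Power's allows the accesses to appear reordered. */
int consumer(void)
{
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                         /* spin until the data is published */
    return payload;
}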


TLP Support in the Power Processor
The structure of a Power processor chip is shown in Fig. . There are eight processor cores and three levels of cache in a single chip.

IBM Power Architecture. Fig.  Structure of a Power Processor Chip (eight cores, each running in ST, SMT2, or SMT4 mode, with 32-Kbyte L1 instruction and data caches per core, eight private 256-Kbyte L2 caches, and a 32-Mbyte shared L3 cache organized as eight local 4-Mbyte regions in front of memory)

Each processor core (which includes -Kbyte level  data and instruction caches) is paired with a -Kbyte level  (L) cache that is private to the core. There is also a -Mbyte level  (L) cache that is shared by all cores. The level  cache is organized as eight -Mbyte caches, each local to a core/L pair. Cast-outs from an L cache can only go to its local L cache, but from there data can be cast out across the eight local Ls.

Each core is capable of operating in three different threading modes: single-threaded (ST), dual-threaded (SMT), or quad-threaded (SMT). The cores can switch modes while executing, thus adapting to the needs of different applications. The ST mode delivers higher single-thread performance, since the resources of a core are dedicated to the execution of that single thread. The SMT mode partitions the core resources among four threads, resulting in higher total throughput at the cost of reduced performance for each thread. The SMT mode is an intermediate point.

A single Power processor chip supports up to  simultaneous threads of execution ( cores,  threads per core). Power scales to systems of  processor chips, or up to  threads of execution sharing a single memory image.

Summary
Since its inception in the Power processor in , the Power architecture has evolved to address the technology and applications issues of the time. The ability of the Power architecture to provide instruction-, data-, and thread-level parallelism has enabled a variety of parallel systems, including some notable supercomputers.

Power ISA allows exposing and extracting ILP primarily because of the RISC principles embodied in the ISA. The reduced set of fixed-length instructions enables a simple hardware implementation that can be efficiently pipelined, thus increasing concurrency. The large register set provides several optimization opportunities for the compiler as well as the hardware.

Power ISA provides facilities for data-level parallelism via SIMD instructions. VMX and VSX instructions increase the computational efficiency of the processor by performing the same operation on multiple data values. For some programs DLP can be extracted automatically by the compiler. For others, explicit SIMD programming is more appropriate.

Power ISA supports thread-level parallelism through a release consistency memory model. Because of the release consistency, Power ISA based systems permit aggressive software and hardware optimizations that would otherwise be restricted under a sequential consistency model.

The Power processor implements the latest version of Power ISA and exploits all forms of parallelism supported by the instruction set architecture: instruction-level parallelism, data-level parallelism, and thread-level parallelism.

Related Entries
IBM Blue Gene Supercomputer
Cell Broadband Engine Processor
IBM RS/ SP
PERCS System Architecture

Bibliographic Notes
Official information on the Power instruction set architecture is available on the Power.org website (www.power.org). In particular, the latest version of the Power ISA (. revision B), implemented by the Power processor, can be found in [].

An early history of the Power architecture is provided by Diefendorff []. Details of the Power architecture, including the instruction specification and programming environment, are given in several reference manuals [, –].

The evolution of IBM's RISC philosophy is explained in []. More detailed information on the microarchitecture of specific Power processors can be found for the Power [], Power [], Power [], and Power [] processors.

The AltiVec Programming Environment Manual [] and AltiVec Programming Interface Manual [] are two thorough references for effectively employing AltiVec. Gwennap [] and Diefendorff [] have a good survey of Power AltiVec.

Methods for extracting instruction-level parallelism for the Power architecture are described in []. One of the key impediments to data-level parallelism is unaligned memory accesses. To overcome these unaligned accesses, Eichenberger, Wu, and O'Brien [] present some data-level parallelism optimization techniques. Finally, Adve and Gharachorloo's tutorial on
IBM RS/ SP I 

shared memory consistency model [] is a great refer-


ence for further reading on thread-level parallelism.
IBM RS/ SP
José E. Moreira
IBM T.J. Watson Research Center, Yorktown Heights,
NY, USA
Bibliography
. Adve SV, Gharachorloo K () Shared memory consistency
models: a tutorial. IEEE Comput :– Synonyms
. Cocke J, Markstein V () The evolution of RISC technology at
IBM SP; IBM SP; IBM SP; IBM SP
IBM. IBM J Res Dev :–
. Diefendorff K () History of the PowerPC architecture. Com-
mun ACM ():– Definition
. Diefendorff F, Dubey P, Hochsprung R, Scales H () Altivec The IBM RS/ SP is a distributed memory, mes-
extension to PowerPC accelerates media processing. IEEE Micro
sage passing parallel system based on the IBM POWER
():–
. Eichenberger AE, Wu P, O’Brien K () Vectorization for SIMD
processors. Several generations of the system were
architectures with alignment constraints. Sigplan Not ():– developed by IBM and thousands of systems were
. Gwennap L () AltiVec vectorizes PowerPC. Microprocessor delivered to customers. The pinnacle of the IBM I
Report ():– RS/ SP was the ASCI White machine at Lawrence
. Hoxey S, Karim F, Hay B, Warren H (eds) () The PowerPC Livermore National Laboratory, which held the num-
compiler writer’s guide. Warthman Associates, Palo Alto
ber  spot in the TOP list from November  to
. IBM Power ISA Version . Revision B. http://www.power.
org/resources/downloads/PowerISA_V.B_V_PUBLIC.pdf November .
. Kalla R, Sinharoy B, Starke WJ, Floyd M () Power: IBM’s
next-generation server processor. IEEE Micro ():– Discussion
. Le HQ, Starke WJ, Fields JS, O’Connell FP, Nguyen DQ, Ronchetti
BJ, Sauer WM, Schwarz EM, Vaden MT () IBM Power Introduction
microarchitecture. IBM J Res Dev ():– The IBM RS/ SP (SP for short) is a general-purpose
. May C, Silha ED, Simpson R, Warren H (eds) () The PowerPC
parallel system. It was one of the first parallel systems
architecture: a specification for a new family of RISC processors.
Morgan Kaufmann, San Francisco
designed to address both technical computing applica-
. () Freescale Semiconductor. In: AltiVec technology program- tions (the usual domain of parallel supercomputers) and
ming environments manual. commercial applications (e.g., database servers, trans-
. () Freescale Semiconductor.In: AltiVec technology program- action processing, multimedia servers). The SP is a dis-
ming interface manual. tributed memory, message passing parallel system. It
. Sinharoy B, Kalla RN, Tendler JM, Eickemeyer RJ, Joyner
consists of a set of nodes, each running its own oper-
JB () Power system microarchitecture. IBM J Res Dev
(/):– ating system image, interconnected by a high-speed
. Smith JE, Sohi GS () The microarchitecture of superscalar network. In the TOP classification, it falls into the
processors. P IEEE :– cluster class of machines.
. Srinivasan V, Brooks D, Gschwind M, Bose P, Zyuban V, IBM delivered several generations of SP machines,
Strenski PN, Emma PG () Optimizing pipelines for power
all based on IBM POWER processors. The initial
and performance. In: Proceedings of the th annual ACM/IEEE
international symposium on microarchitecture, MICRO . IEEE
machines were simply called the IBM SP (or SP) and
Computer Society Press, Los Alamitos, pp – were based on the original POWER processors. Later,
. Tendler JM, Dodson JS, Fields JS, Le H, Sinharoy B () Power IBM delivered the SP machines based on POWER
system microarchitecture. IBM J Res Dev ():– and then finally a generation based on POWER pro-
cessors (which was unofficially called the SP by some).
IBM continued to deliver parallel systems based on
later generations of POWER processors (POWER and
IBM PowerPC beyond), but those were no longer considered IBM
RS/ SP systems. The November  TOP list
IBM Power Architecture shows  POWER Systems  machines (one based on
 I IBM RS/ SP

POWER processors and the rest based on POWER


processors), which can be considered direct follow-on Node
to the RS/ SP.
Notable IBM RS/ SP systems include the Node Node

Argonne National Laboratory SP (installed in )


[], the Cornell Theory Center IBM SP, the Lawrence
Livermore National Laboratory ASCI Blue Pacific, and
the Lawrence Livermore National Laboratory ASCI
White. This last system consisted of  nodes with  Node Switch Node
POWER processors each and held the number  spot
in the TOP list from November  to November
.
The RS/ SP was designed to serve a broad
range of applications, from both the technical and com- Node Node
mercial computing domains. The designers of the sys-
tem followed a set of principles [] that can be sum- Node

marized as follows: maximize the use of off-the-shelf


hardware and software components while develop- IBM RS/ SP. Fig.  High-level hardware architecture
ing some special-purpose components that maximize of an IBM RS/ SP
the value of the system. As a result, the RS/ SP
utilizes the same processors, operating systems, and
compilers as the contemporary IBM workstations. It processors, symmetric multiprocessing (SMP) nodes
also utilizes a special-purpose high-speed interconnect became available. Nodes could be configured to bet-
switch, a parallel operating environment and message ter serve specific purposes. For example, compute
passing libraries, and a parallel programming environ- nodes could be configured with more processors and
ment, including a High Performance Fortran (HPF) memory, whereas I/O nodes could be configured
compiler. with more I/O adapters. The RS/ SP architec-
To further enable the system for commercial appli- ture supports different kinds of nodes in the same
cations, IBM and other vendors developed parallel ver- system, and it was usual to have both compute-
sions of important commercial middleware such as optimized and I/O-optimized nodes in the same
DB and CICS/. With these parallel middleware, system.
customers were able to quickly port applications from The node architecture (illustrated in Fig. ) is essen-
the more conventional single system image commercial tially the same as contemporary standalone worksta-
servers to the IBM SP. tions and servers based on POWER processors. Pro-
cessor and memory modules are interconnected by a
Hardware Architecture system bus that supports memory coherence within the
The IBM RS/ SP consists of a cluster of nodes node. An I/O bus also hangs off this system bus. This
interconnected by a switch (Fig. ). The nodes (Fig. ) I/O bus supports off-the-shelf adapters found on stan-
are independent computers based on hardware (proces- dalone machines, such as Ethernet and Fibre Channel. It
sors, memory, disks, I/O adapters) developed for IBM also supports the switch adapters that connect the node
workstations and servers. Each node has its own private to the network.
memory and runs its own image of the AIX operating Whereas the SP nodes are built primarily out of off-
system. the-shelf hardware (except for the switch adapter), the
Through the evolution of the RS/ SP, dif- SP switch is a special-purpose design. At the time the
ferent nodes were used. The initial nodes for SP SP was conceived, standard interconnection networks
and SP models were single processor nodes. Later, (Ethernet, FDDI, ATM) delivered neither the band-
with the introduction of PowerPC and POWER width nor the latency that a large-scale general-purpose
IBM RS/ SP I 

Processor Processor Processor

System bus

I/O bus

Memory Memory Memory

Ethernet
adapter

adapter
adapter
Switch

FC
I
IBM RS/ SP. Fig.  Hardware architecture of an IBM RS/ SP node. Different kinds of nodes can be and have been
used in SP systems

parallel system like the SP required. Therefore, the 16⫻16 Switch board
designers decided that a special-purpose interconnect
was necessary [, ].
4⫻4 4⫻4
The IBM RS/ SP switch [] is an any-to-any

To switch boards in other frames


packet-switched multistage network. The bisection
To nodes in the same frame

bandwidth of the switch scales linearly with the size


(number of nodes) of the system. The available band- 4⫻4 4⫻4
width between any pair of communicating nodes
remains constant irrespective of where in the topol-
ogy the two nodes lie. These features supported both
system scalability and ease of use. The system could 4⫻4 4⫻4
be viewed as a flat collection of nodes, which could be
freely selected for parallel jobs irrespective to their loca-
tion. Selection could focus on other features, such as
4⫻4 4⫻4
processor speed and memory size. As a consequence,
the IBM RS/ SP did not suffer from the fragmenta-
tion problem that was observed in other parallel systems
of the time []. IBM RS/ SP. Fig.  A  ×  switch board is built by
The SP switch is built from a basic × bidirectional interconnecting eight  ×  bidirectional crossbar
crossbar switching elements, which are grouped eight switching elements. The switch board connects to nodes
to a board to form a  ×  switch board, as shown in on one side and other boards on the other side
Fig. . A switch board connects to nodes on one side
and to other switch boards on the other side. Systems Software Architecture
with up to  nodes can be assembled with just one Figure  illustrates the software stack of the RS/
layer of switch boards, whereas larger systems require SP. That software stack is built upon off-the-shelf
additional layers. UNIX components and specialized services for parallel
 I IBM RS/ SP

Parallel applications node to access any disk in the system as if it were locally
attached to that node. Another global service is the sys-
Application middleware tem data repository (SDR). The SDR contains system-
wide information about the nodes, switches, and jobs
System Job Parallel Compilers &
management management environment libraries currently in the system.
The job management system of the IBM SP supports
Global services
both interactive and batch jobs. Batch jobs are submit-
ted, scheduled, and controlled by LoadLeveler []. For
Availability services
interactive jobs, a user can login directly to any node in
High-performance the SP, since the nodes all run a full version of the AIX
Standard operating system (AIX)
services
operating system.
Hardware (Processors, memory, I/O devices, adapters) System management for the IBM SP is built upon
components used for management of RS/ AIX
IBM RS/ SP. Fig.  Software stack for RS/ SP workstations. It also includes extensions developed
specifically for the SP to facilitate performing standard
management functions across the many nodes of an SP.
processing. Each node runs a full AIX operating system The functions supported include system installation,
instance. That operating system is complemented at the system operation, user management, configuration
bottom layer of the software stack by high-performance management, file management, security management,
services that provide connectivity to the SP switch. job accounting, problem management, change manage-
The availability and global services layers implement ment, hardware monitoring and control, and print and
aspects of a single-system image. They are intended to mail services. The system management functions can
be the basis for parallel applications and application be performed via a control workstation that acts as the
middleware. The availability services of the IBM SP sup- system console.
port heartbeat, membership, notification, and recovery The compilers and run-time libraries for the IBM
coordination. Heartbeat services implement the mon- RS/ SP are based on the standard software stack
itoring of components to detect failures. Membership for IBM AIX augmented with certain features specific
services allow processors and processes to be identi- to the SP. Fortran, C and C ++ compilers, and run-time
fied as belonging to a group. Notification services allow libraries for POWER-based workstations and servers
members of a group to be notified when new members can be directly used in the SP, since the nodes of the lat-
are added or old members are removed from that group. ter are based on hardware developed for the former. The
Finally, recovery coordination services provide a mech- software stack for the SP also includes message-passing
anism for performing recovery procedures within the libraries that implement both IBM proprietary models,
group in response to changes in the membership. such as MPL, and standard models such as PVM and
The global services of the IBM SP provide global MPI [, ]. Also available for the IBM RS/ SP is
access to resources such as disks, files, and networks. an implementation of the High Performance Fortran
Global access to files is provided by networked file solu- (HPF) programming language [].
tions, either with a client-server model (e.g., NFS) or In addition to middleware like MPI libraries that
with a parallel file system model (e.g., GPFS). With cater to scientific applications, the RS/ SP software
this approach, processes in every node have access to stack also includes middleware targeted at enabling par-
the same file space. Global network access is imple- allel commercial applications. The main example is DB
mented through TCP/IP and UDP/IP protocols over Parallel Edition (PE) [], an implementation of the DB
the IBM SP switch. Gateway nodes, connected to both relational database product that runs in parallel across
the SP switch and an external Ethernet network, allow the nodes of the SP. DB PE is a shared-nothing par-
all nodes to access the Ethernet network. Global access allel database system, in which the data is partitioned
to disks is implemented by virtual shared disk (VSD) across the nodes. DB PE splits SQL queries into mul-
functionality. VSD allows a process running on any SP tiple operations that are then shipped to the nodes for
IBM RS/ SP I 

execution against their local data. A final stage com- each with  POWER processors. Each processor had
bines the results from each node into a single result. DB a clock speed of  MHz and a peak floating-point
PE enables database applications to use the parallelism performance of . Gflops. As indicated in the TOP
of the RS/ SP without changes to the application list, the machine had a peak performance (Rpeak) of
itself, since the exploitation of parallelism happens in , Gflops and a Linpack performance (Rmax) of
the database middleware layer. , Gflops. It was ranked # in the November ,
June  and November  lists.
Example Applications
Thousands of IBM RS/ SP systems were delivered
over the product lifetime, ranging in size from as few as
Related Entries
IBM Power Architecture
two nodes all the way up to  nodes and larger. The
LINPACK Benchmark
availability of a full workstation- and server-compatible
MPI (Message Passing Interface)
software stack on the SP nodes allowed it to run off-
TOP
the-shelf AIX applications with zero porting effort. The
availability of standard message passing libraries (such
as MPI), High Performance Fortran, and parallel com- Bibliographic Notes and Further I
mercial middleware such as DB PE also meant that Reading
existing parallel applications could be moved to the IBM For a thorough discussion of the system architecture
RS/ SP with relative ease. Furthermore, several of the RS/ SP, the reader is referred to []. Details of
applications were specifically developed or optimized the RS/ SP interconnection network can be found
for the SP. in []. An overview of the system software for the
At a relatively early point in the product lifetime RS/ SP is given in [] while details for the MPI envi-
(), the SP was already being used in many different ronment and the job scheduling facilities are described
areas, including computational chemistry, crash anal- in [] and [], respectively. The HPF compiler for the
ysis, electronic design analysis, seismic analysis, reser- RS/ SP is described in []. Commercial middle-
voir modeling, decision support, data analysis, on-line ware is covered in [] and user experience in a scien-
transaction processing, local area network consolida- tific computing environment is described in []. Finally,
tion, and as workgroup servers. In terms of economic additional information on the machine fragmentation
sectors, SP systems were being used in manufacturing, problem is available in [].
distribution, transportation, petroleum, communica-
tions, utilities, education, government, finance, insur-
ance, and travel [].
Bibliography
. Agerwala T, Martin JL, Mirza JH, Sadler DC, Dias DM, Snir M
The IBM RS/ SP played an important role in the
() SP system architecture. IBM Syst J ():–
Accelerated Strategic Computing Initiative by the US . Baru CK, Fecteau G, Goyal A, Hsiao H, Jhingran A, Padmanabhan
Department of Energy. That program was responsible S, Copeland GP, Wilson WG () DB Parallel Edition. IBM Syst
for several of the fastest computers in the world, includ- J ():–
ing two SPs: the ASCI Blue-Pacific and ASCI White . Dewey S, Banas J () LoadLeveler: a solution for job manage-
ment in the UNIX environment. AIXtra, May/June 
machines. ASCI Blue-Pacific was the largest SP in num-
. Feitelson DG, Jette MA () Improved utilization and respon-
ber of nodes. It consisted of , nodes, each with siveness with gang scheduling. In: Feitelson DG, Rudolph L (eds)
four PowerPC e processors. Each processor had a Proceedings of job scheduling strategies for parallel processing
clock speed of  MHz and a peak floating-point per- (JSSPP ’). Lecture notes in computer science, vol . Springer,
formance of  Mflops. As indicated in the TOP Berlin, pp –
list, the machine had a peak performance (Rpeak) of . Franke H, Wu CE, Riviere M, Pattnaik P, Snir M () MPI pro-
gramming environment for IBM SP/SP. In: Proceedings of the
. Gflops and a Linpack performance (Rmax) of
th international conference on distributed computing systems
 Gflops. It was ranked # in the November  and (ICDCS ’), Vancouver, May –June , , pp –
June  lists. ASCI White was the largest SP in num- . Gropp WD, Lusk E () Experiences with the IBM SP. IBM Syst
ber of processors (or cores). It consisted of  nodes, J ():–
 I IBM SP

. Gupta M, Midkiff S, Schonberg E, Seshadri V, Shields D, form of branch speculation, pipelined execution units,
Wang K-Y, Ching W-M, Ngo T () An HPF compiler for the and division by binomial series approximation.
IBM SP. In: Proceedings of the  ACM/IEEE conference on
supercomputing, San Diego, 
. Snir M, Hochschild P, Frye DD, Gildea KJ () The communica- Discussion
tion software and parallel environment of the IBM SP. IBM Syst J
():– Introduction
. Stunkel CB, Shea DG, Abali B, Atkins MG, Bender CA, Grice When IBM introduced System/ in , it included
DG, Hochschild P, Joseph DJ, Nathanson BJ, Swetz RA, Stucke RF, an announcement of a high-end computer referred to
Tsao M, Varker PR () The SP high-performance switch. IBM
as “Model .” This computer was realized as the Model
Syst J ():–
 with first installation in . While the Model  had
a core memory with  ns cycle time, there was a fast
memory version which used thin-film magnetic storage
with a cycle time of  ns. This version was labeled the
IBM SP
Model ; two Model s were produced. Orders for the
Model  were closed in , but a revised version with
IBM RS/ SP
improved technology and cache was introduced around
 as the Model . The Model  used the same
logic as the Model . In the mid s, there were some
IBM SP references to a Model  (with  ns memory cycle).
This version was never realized.
IBM RS/ SP The Model  series used a hybrid technology called
ASLT (advanced solid logic technology), and had a pro-
cessor cycle of  ns for Models  and ; the Model 
had a processor cycle of  ns.
IBM SP
The total production of all computers in the Model
IBM RS/ SP
 series was about two dozen.

Overview of the System and CPU


IBM SP Instruction Processing
The instruction set was exactly that of System/; how-
IBM RS/ SP ever, instruction execution of the decimal instruction
set was not supported in hardware in the Model . If
such an instruction was invoked, it would trap and be
interpreted by the processor [, ].
IBM System/ Model  The design was predicated on the requirements of
a long ( stage) pipeline. Instructions were prefetched
Michael Flynn
Stanford University, Stanford, CA, USA
into an  entry I buffer each entry having  bytes. The I
buffer enabled limited speculation on branch outcomes.
If the branch target was backward and in the buffer,
the branch was predicted to be successful otherwise the
Definition branch was predicted to be untaken and proceed in line.
The Model  was the highest performing computer A significant feature of the execution units was a
system introduced by IBM as part of its System/ mechanism called the common data bus, developed by
series in the s. It was distinguished by a number of R Tomasulo [] and better known as Tomasulo’s algo-
innovations such as out-of-order instruction execution, rithm. This is a data flow control algorithm, which
Tomasulo’s algorithm for data flow execution, a limited renames registers into reservation stations.
IBM System/ Model  I 

Each register in the central register set is extended required only four logic stages (two CSAs) to imple-
to include a tag that identifies the functional unit that ment. This allowed six signed multiples to be assimi-
produces a result to be placed in a particular register. lated every  ns. Thus, the assimilation process took
Similarly, each of the multiple functional units has one  ns to form the produce; another  ns was required
or more reservation stations. The reservation station, for startup.
however, can contain either a tag identifying another The divide process introduced the Goldschmidt
functional unit or register, or it can contain the variable algorithm based on a binomial expansion of the recip-
needed. Each reservation station effectively defines its rocal. The value of the reciprocal was then multiplied by
own functional unit; thus, two reservations for a floating the dividend to form the quotient. In binomial expan-
point multiplier are two functional unit tags: multiplier sion the divisor, d = −x; now the expansion of /(−x)
 and multiplier . If operands can go directly into the can be represented as ( − x)( + x )( + x )( + x ). . .
multiplier, then there is another tag: multiplier . Once Since d is binary normalized, x is less than or equal /.
a pair of operands has a designated functional unit tag, So each term doubles the number of zeros following the
that tag remains with that operand pair until comple-  and the resulting product quadratically converges to
tion of the operation. Any unit (or register) that depends the reciprocal. A initial table lookup reduces the num-
on that result has a copy of the functional unit tag and ber of multiplies, the Model  took  cycles for floating I
in gates the result that is broadcast on the common data point divide.
bus. In this dataflow approach, the results to a targeted
register may never actually go to that register; in fact, Memory
the computation based on the load of a particular regis- The Model  memory system used conventional (for
ter may be continually forwarded to various functional the day) magnetic core technology. It was interleaved
units, so that before the value is stored, a new value  ways and had a . microsecond cycle time with
based upon a new computational sequence (a new load a  cycle access time. A similar memory technology
instruction) is able to use the targeted register. supported the Model ’s cache based memory. The
The renaming of registers enabled out of order memory system buffered  outstanding requests [].
instruction execution, but in doing so it did not allow
precise interrupts for all cases of exceptions. Since one The Technology
instruction may have completed execution before a slow The machine itself was implemented with ECL (emit-
executing earlier issued one (such as divide) takes an ter coupled logic) as the basic circuit technology using
exception and causes an interrupt it is impossible to multi transistor chips mounted on a  ×  cm aluminum
reconstruct the machine state precisely. ceramic substrate. Passive devices were implemented
as thin-film printed components on the substrate. Two
Execution Functional Units substrates were stacked one on top of each other form-
The floating point units also provided significant inno- ing a module which formed a cube about  cm on a
vation. The floating point adder executed an instruction side. The module provided a circuit density of about
in two cycles but it accepted a new instruction each two to three circuits. Approximately  modules could
cycle. This functional unit pipelining was novel at this be mounted on a daughterboard and twenty daughter
time but now widely used []. boards could be plugged in to a motherboard ( ×
The floating point multiplier executed an instruc-  cm). Twenty motherboards formed a frame about
tion in three cycles. It used a Booth  encoding of the  ×  m ×  cm and four frames formed the basic CPU
multiplier, requiring the addition of  signed multiples for the system [–].
of the multiplicand (the mantissa had  bits). These The ECL circuit delay was about . ns but the tran-
were added in Sum + Carry form, six at a time using sit time between logic gates added another . ns. All
a tree of carry save adders (CSA). The partial product interconnections were made by terminated transmis-
was fed back into the tree for assimilation with the next sion line except for a small number of stubbed transmis-
six multiples. Using a newly developed Earle latch, this sion lines. A great deal of care was paid in developing
assimilation iteration eliminated latching overhead and the signal transmission system. For example: a dual
 I IEEE .

impedance system of  and  ohms was used and the . Flynn MJ () Very high speed computers. Proc IEEE :
width of the basic  ohm line was reduced to create a –
 ohm line in the vicinity of loads so that the effective . Flynn MJ () Computer engineering  years after the IBM
model . IEEE Comput ():–
impedance of the loaded  ohm line would appear as a
 ohm line.
The processor had in total about , gates. Since
each gate had an associated interconnection delay the IEEE .
transit time plus loading effects made the total delay per
gate approximately . ns. With  stages of logic as the Ethernet
definition for cycle time, this defined  ns as the basic
CPU cycle. The multiply-divide unit had a subcycle of
 ns for a partial product iteration. Illegal Memory Access
The processor used water-cooled heat exchangers
between motherboards for cooling. A motor generator Intel Parallel Inspector
set powered the system, isolating the system from short
power disruptions. The total power consumption was a
significant fraction of a megawatt.
Illiac IV
Bibliographic Notes and Further Yoichi Muraoka
Reading Waseda University, Tokyo, Japan
The basic source material for the Model  is the
IBM J of Research and Development cited below
[–]. The term “Tomasulo’s algorithm” and some sim-
History
The Illiac IV computer was the first practical large-
ilar designations are introduced in []. Thirty years
scale array computer, which can be classified as
after the Model  was initially dedicated there was
an SIMD (single-instruction-stream-multiple-data-
a retrospective paper on its accomplishments and
streams)-type computer. As the name suggests, the
problems [].
project was managed at the University of Illinois Dig-
ital Computer Laboratory under the contract from
Bibliography the Defense Advanced Research Project Agency (then
The following eight papers are all from the special issue of the called ARPA). The project started in , and the
IBM Journal of Research and Development devoted to The IBM machine was delivered to the NASA Ames Research
System/ Model ; vol , issue , January 
Center in . It took  years to run its first successful
. Flynn MJ, Low PR Some remarks on system development,
pp –
application and was made available via the ARPANET,
. Anderson DW, Sparacio FJ, Tomasulo RM Machine philosophy the predecessor of the Internet. The principal inves-
and instruction handling, pp – tigator was Professor Daniel Slotnick, who conceived
. Tomasulo RM An efficient algorithm for exploiting multiple the idea in the mid-s as the Solomon computer.
arithmetic units, pp – The machine was built to answer the large computa-
. Anderson SF, Goldschmidt RE, Earle JG, Powers DM, Floating-
point execution unit, pp –
tional requirements such as ballistic missile defense
. Boland LJ, Messina BU, Granito GD, Smith JW, Marcotte AU analysis, climate modeling, and so on. It was said then
Storage system, pp – that two Illiac-IV computers would suffice to cover all
. Langdon JL, Van Derveer EJ Design of a high-speed transistor for computational requirements in the planet.
the ASLT current switch, pp – The design of the computer and the majority of
. Sechler RF, Strube AR, Turnbull JR ASLT circuit design,
early software suits development were done by Illi-
pp –
. Lloyd RF ASLT: an extension of hybrid miniaturization tech-
nois researchers, including many ambitious graduate
niques, pp – students, while the computer was built by the Burroughs
Other papers mentioning the Model  or related computers Corporation.
Illiac IV I 

To be precise, the project was transferred from the The TRANQUIL language is an extension of the
University of Illinois to the Ames Research Center ALGOL. Its main feature is the capability to specify the
in  before its completion due to the heavy anti- parallel execution of a FOR loop. For example,
Vietnam war protest activity on the campus.
The hardware is now being displayed in the Com- FOR SIM I = ( . . . ) A(I) = B(I) + B(I)
puter History Museum at Silicon Valley. adds two  elements vectors in one step. Furthermore,
it allows a user to specify how to store data in the
Hardware memory. For example, suppose we have an array of 
An instruction is decoded by Control Unit (CU), and columns by  rows. If we store each entire column
the decoded signals are sent to  processors, called PE. in separate PE memory, then while operation on rows
PE is basically a combination of arithmetic-logic-unit, (e.g., addition of corresponding elements of two rows)
registers, and ,-words memory of -bit length (So can be done in parallel, operation on columns must be
in total,  MB). It operates at a  MHz clock. Each PE done sequentially. To avoid this situation, we may store
loads data to a register from its own memory. Thus, the data as
 PEs perform an identical operation over indepen-
dent data simultaneously. For example, two  elements I
vectors can be added by one operation. PE PE PE PE
The operation of each PE can be controlled by a a(,) a(,) a(,) a(,)
mode bit. If a control bit is set, then a PE participates a(,) a(,) a(,) a(,)
the operation, otherwise it does not. Thus, PE has some a(,) a(,) a(,) a(,)
degree of freedom. There is an instruction to set/reset
the mode bit.
A memory address to access the memory in each While the simple data mapping scheme is called straight
PE is also provided from CU. Included in each PE is storage, this scheme is called skewed storage. TRAN-
an index register, so one can modify and access to a QUIL provides further varieties of storage scheme for
different memory address in different PEs. an array to allow efficient parallel computation.
To exchange data between PEs, PEs are set in the While the TRANQUIL language aims at a high-level
shape of a list with wrapped around connection. All programming gear, the GLYPNIR language, so to speak,
PEs move data, for example, to their left neighbor PEs provides a low-level view of the computer. In GLYPNIR,
simultaneously. The mechanism is called the routing. one writes a program for a PE in ALGOL like state-
The megabyte memory is far less from ideal. So it ments. The parallel execution is implicit, i.e., basically
is backed up by a  million word head-per-track desk the identical program is executed in all PEs. The spe-
with an I/O rate of  MB/s. The average access time is cial features are added to take advantage of the mode-bit
 ms. control and the separate memory indexing in each PE.
To manage the operation of the computer and to For example, to add two vectors, we declare vari-
provide the I/O capability, a Burroughs B machine ables as
is used. The majority of operating system functions are PE REAL X,Y
run on this machine.
To further speed up the instruction execution, the by which two  elements vectors are created, and an
instruction decoding in CU is overlapped with the exe- element X(i) is stored in PEM of PEi. Thus, the state-
cution in PEs, i.e., the instruction decoding and the ment X ← X + Y will add two vectors in parallel.
instruction execution is pipelined. To control the execution in each PE, there is a con-
trol statement such as
Software IF < BE > THEN S
Besides an ordinary assembler, called ASK, originally,
two compilers were planned, the TRANQUIL and the where BE is a  elements Boolean values, each of which
GLYPNIR. corresponds the mode bit of PEs. Thus, the statement
 I Illiac IV

Instructions

CU

Routing Path
PE0 PE1 PE3 PE63

2048
Words
PEM

Organization of Illiac IV

Illiac IV. Fig.  Organization of Illiac IV

S is executed in PEs whose corresponding element of shape detection, and many others. Also, the Illiac IV
BE is true. was successfully used to analyze LANDSAT satellite
data, especially in clustering and classification.

Applications . Astronomy
The Illiac IV was the first practical parallel computer to Although very little research was done in this area
allow writing real parallel codes. three-dimensional galaxy simulations have been
Many original and great parallel application ideas run successfully on the Illiac IV. This is a typical
have been developed by the project which contributed n-body problem.
to solve real world problems. Among many influential
. Seismic
results, we just list a few of them below. Application
This is one of applications which were studied inten-
areas attacked by the project are:
sively to develop parallel algorithms. The applica-
. Computational fluid dynamics tion is characterized as a three-dimensional finite
NASA people developed an aerodynamic flow difference code with many irregular data structure.
simulation program on Illiac IV which helped
. Mathematical programs
them to replace their wind tunnel with a com-
Many basic numerical computations have been
puter simulation system. Other programs devel-
coded in parallel. They include dense and sparse
oped include the Navier–Stokes solver for two
matrix calculations, the single-value decomposi-
dimensional unsteady transonic flow, a viscous-
tion, and so on.
flow airfoil code, turbulence modeling for three-
dimensional incompressible flow, and many more. . Weather/climate simulation
A dynamic atmospheric model incorporating
. Image processing chemistry and heat exchange was build.
Image processing may be one of the application
areas that is most suited for parallel computa-
tion. Many innovatory parallel algorithms were Assessment
developed and implemented on the Illiac IV which The project took a decade of development and was
include mage line detection, image skeletonizing, also massively over budget. Costs escalated from the
ILUPACK I 

$ million estimated in  to $ million by . as linear systems arising from partial differential equa-
Only a quarter of the fully planned machine, i.e.,  tions (PDEs). ILUPACK supports single and double
PEs instead of  PEs, was ever built. Nevertheless, precision arithmetic for real and complex numbers.
the project developed a very unique and for the time, Among the structured matrix classes that are supported
a powerful computer. Also, many ideas in software are by individual drivers are symmetric and/or Hermitian
still vital and appreciated. matrices that may or may not be positive definite and
The project pushed research forward, leading the general square matrices. An interface to MATLAB (via
way for machines such as the Thinking Machines CM- MEX) is available. The main drivers can be called from
and CM-. It is also true that the software team of the C, C++, and FORTRAN.
project developed not only a whole new set of parallel
programming ideas, but also a set of experts in parallel
programming.
Discussion
Introduction
Bibliography Large sparse linear systems arise in many application
. Barnes GH et al () The Illiac IV computer. IEEE Trans Comput
C-():–. The paper summarizes hardware and software of
areas such as partial differential equations, quantum I
physics, or problems from circuit and device simula-
the Illiac IV
. Abel N et al. TRABQUIL, a language for an array processing tion. They all share the same central task that consists
computer. In: Proceedings AFIPS  SJCC, vol . AFIPS Press, of efficiently solving large sparse systems of equations.
Montvale NJ, pp –. The TRANQUIL language is introduced For a large class of application problems, sparse direct
. Kuck D, Sameh A () Parallel computation of eigenvalues of solvers have proven to be extremely efficient. However,
real matrices, In: Information Processing , vol II. North-Holland,
the enormous size of the underlying applications arising
Amsterdam, pp –. Describes parallel eigenvalue algorithm
. Hord RM () The Illiac IV. Computer Science Press, Rockville.
in -D PDEs or the large number of devices in integrated
A complete description of the project. Out of print now circuits currently requires fast and efficient iterative
solution techniques, and this need will be exacerbated
as the dimension of these systems increases. This in turn
demands for alternative approaches and, often, approx-
ILUPACK imate factorization techniques, combined with iterative
methods based on Krylov subspaces, reflecting as an
Matthias Bollhöfer , José I. Aliaga , Alberto F. attractive alternative for these kinds of application prob-
Martín , Enrique S. Quintana-Ortí lems. A comprehensive overview over iterative methods

Universitat Jaume I, Castellón, Spain can be found in [].

TU Braunschweig Institute of Computational The ILUPACK software is mainly built on incom-
Mathematics, Braunschweig, Germany plete factorization methods (ILUs) applied to the system
matrix in conjunction with Krylov subspace methods.
The ILUPACK hallmark is the so-called inverse-based
Definition approach. It was initially developed to connect the ILUs
ILUPACK is the abbreviation for Incomplete LU factor- and their approximate inverse factors. These relations
ization PACKage. It is a software library for the iterative are important since, in order to solve linear systems, the
solution of large sparse linear systems. It is written in inverse triangular factors resulting from the factoriza-
FORTRAN  and C and available at http://ilupack. tion are applied rather than the original incomplete fac-
tu-bs.de. The package implements a multilevel incom- tors themselves. Thus, information extracted from the
plete factorization approach (multilevel ILU) based on a inverse factors will in turn help to improve the robust-
special permutation strategy called “inverse-based piv- ness for the incomplete factorization process. While this
oting” combined with Krylov subspace iteration meth- idea has been successfully used to improve robustness,
ods. Its main use consists of application problems such its downside was initially that the norm of the inverse
 I ILUPACK

factors could become large such that small entries could The pivoting strategy computes a partial
hardly be dropped during Gaussian elimination. To factorization
overcome this shortcoming, a multilevel strategy was
developed to limit the growth of the inverse factors. ⎛B E ⎞
This has led to the inverse-based approach and hence PTÂP = ⎜



the incomplete factorization process that has eventually ⎝F C ⎠
been implemented in ILUPACK benefits from the infor-
⎛L ⎞ ⎛D U D U ⎞
mation of bounded inverse factors while being efficient B
⎟⎜ B B B E
⎟ + R,
=⎜
⎜ ⎟⎜ ⎟
at the same time [].
⎝LF I ⎠ ⎝  SC ⎠
A parallel version of ILUPACK on shared-memory
multiprocessors, including current multicore architec-
tures, is under development and expected to be released where R is the error matrix, which collects those
in the near future. The ongoing development of the par- entries of  that were dropped during the factoriza-
allel code is inspired by a nested dissection hierarchy of tion and “SC ” is the Schur complement consisting
the initial system that allows to map tasks concurrently of all rows and columns associated with the rejected
to independent threads within each level. pivots. By construction the inverse triangular factors
satisfy

The Multilevel Method −


−
To solve a linear system Ax = b, the multilevel approach



⎛L ⎞




⎛U U ⎞



⎜ B




⎜ B E




of ILUPACK performs the following steps:

⎜ ⎟


,

⎜ ⎟ ⪅ κ.




⎝ LF I ⎠


⎝  I ⎠


. The given system A is scaled by diagonal matrices




Dl and Dr and reordered by permutation matrices
. Steps  and  are successively applied to  = SC until
Πl , Π r as
SC is void or “sufficiently dense” to be efficiently fac-
A → Dl ADr → Π Tl Dl ADr Π r = Â. torized by a level  BLAS-based direct factorization
These operations can be considered as a prepro- kernel.
cessing prior to the numerical factorization. They When the multilevel method is applied over multi-
typically include scaling strategies to equilibrate the ple levels, a cascade of factors LB , DB , and UB , as well as
system, scaling and permuting based on maximum matrices E, F are usually obtained (cf. Fig. ). Solving
weight matchings [], and, finally, fill-reducing linear systems via a multilevel ILU requires a hierar-
orderings such as nested dissection [], (approxi- chy of forward and backward substitutions, interlaced
mate) minimum degree [, ], and some more. with reordering and scaling stages. ILUPACK employs
. An incomplete factorization A ≈ LDU is next com- Krylov subspace methods to incorporate the approxi-
puted for the system Â, where L, U T are unit lower mate factorization into an iterative method.
triangular factors and D is diagonal. Since the main The computed multilevel factorization is adapted to
objective of ILUPACK (inverse-based approach) is to the structure of the underlying system. For real sym-
limit the norm of the inverse triangular factors, L− metric positive definite (SPD) matrices, an incomplete
and U − , the approximate factorization process is Cholesky decomposition is used in conjunction with
interlaced with a pivoting strategy that cheaply esti- the conjugate gradient method. For the symmetric and
mates the norm of these inverse factors. The pivoting indefinite case, symmetric maximum weight matchings
process decides in each step to reject a factorization [] allow to build  ×  and  ×  pivots. In this case ILU-
step if an imposed bound κ is exceeded, or to accept PACK constructs a blocked symmetric multilevel ILU
a pivot and continue the approximate factorization and relies on the simplified QMR [] as iterative solver.
otherwise. The set of rejected rows and columns is The general (unsymmetric) case typically uses GMRES
permuted to the lower and right end of the matrix. [] as default driver. All drivers in ILUPACK support
This process is illustrated in Fig. . real/complex and single/double precision arithmetic.
ILUPACK I 

factorized pending

current rejected
accept

eTk L−1,U −1 ek ≤ k

approximate
continue
factorization factorization
reject

rejected compute
eTk L−1,U −1 ek > k pivots SC

I
current factorization step finalize level

ILUPACK. Fig.  ILUPACK pivoting strategy

0 linear system Ax = b. From this point of view,

500 L− AU − = D + L− RU −


ensures that the entries in the error matrix R are not
1,000
amplified by some large inverse factors L− and U − .
The second and more important aspect links
1,500
ILUPACK’s multilevel ILU to algebraic multilevel
−
2,000
methods. The inverse  can be approximately
written as
2,500 ⎛⎛(L D U )− ⎞
−
 ≈ P⎜ ⎜ B B B ⎟
⎜⎜ ⎟
3,000 ⎝⎝  ⎠

3,500 ⎛−U − U ⎞
E
+⎜
B ⎟ S− (−L L− I )) PT .
⎜ ⎟ C F B
0 500 1,000 1,500 2,000 2,500 3,000 3,500 ⎝ I ⎠

ILUPACK. Fig.  ILUPACK multilevel factorization In this expression, all terms on the right hand side are
constructed to be bounded except S− C . Since in gen-
−
Mathematical Background eral  has a relatively large norm, so does S− C . In
The motivation for inverse-based approach of ILU- principle this justifies that eigenvalues of small modu-
PACK can be explained in two ways. First, when the lus are revealed by the approximate Schur complement
approximate factorization SC (a precise explanation is given in []). SC serves
as some kind of coarse grid system in the sense of
A = LDU + R discretized partial differential equations. This observa-
tion goes hand in hand with the observation that the
is computed for some error matrix R, the inverse trian- inverse-based pivoting approach selects pivots similar
gular factors L− and U − have to be applied to solve a to coarsening strategies in algebraic multilevel methods,
 I ILUPACK

which is demonstrated for the following simple model As a rule of thumb, the more rows (and columns) of
problem: the coefficient matrix satisfy ∣aii ∣ ≫ ∑j/=i ∣aij ∣, the less
pivots tend to be rejected.
−− uxx (x, y)−uyy (x, y) = f (x, y) for all
(x, y) ∈ [, ] , u(x, y) =  on ∂[, ] . The Parallelization Approach
Parallelism in the computation of approximate factor-
This partial differential equation can be easily dis- izations can be exposed by means of graph-based sym-
cretized on a square grid Ω h = {(kh, lh) : k, l = metric reordering algorithms, such as graph coloring
, . . . , N +} , where h = N+

is the mesh size, using stan- or graph partitioning techniques. Among these classes
dard finite difference techniques. Algebraic approaches of algorithms, nested dissection orderings enhance par-
to multilevel methods roughly treat the system as if the allelism in the approximate factorization of A by par-
term −− uxx (x, y) is hardly present, i.e., the coars- titioning its associated adjacency graph G(A) into a
ening process treats it as if it were a sequence of one- hierarchy of vertex separators and independent sub-
dimensional differential equations in the y-direction. graphs. For example, in Fig. , G(A) is partitioned
Thus, semi-coarsening in the y-direction is the usual after two levels of recursion into four independent
approach to build a coarse grid. ILUPACK inverse- subgraphs, G(,) , G(,) , G(,) , and G(,) , first by
based pivoting algebraically picks and rejects the pivots separator S(,) , and then by separators S(,) and S(,) .
precisely in the same way as in semi-coarsening. In This hierarchy is constructed so that the size of vertex
Fig. , this is illustrated for a portion of the grid Ω h separators is minimized while simultaneously balanc-
using κ = . In the y-direction, pivots are rejected after ing the size of the independent subgraphs. Therefore,
– steps while in the x-direction all pivots are kept or relabeling the nodes of G(A) according to the levels
rejected (in blue and red, resp.). in the hierarchy leads to a reordered matrix, A →
The number of rejected pivots in ILUPACK strongly Φ T AΦ, with a structure amenable to efficient paral-
depends on the underlying application problem. lelization. In particular, the leading diagonal blocks of

80

75

70

65

60

55

50
30 35 40 45 50

ILUPACK. Fig.  ILUPACK pivoting for partial differential equations. Blue and red dots denote, respectively, accepted and
rejected pivots
ILUPACK I 

S(1,1)

G(2,1) G(2,2)

G(3,1) G(3,3) A → ΦTAΦ


S(2,1) S(2,2)
G(3,2) G(3,4)
(1,1)

A G(A) (2,1) (2,2)

G(3,1) G(3,2) G(3,3) G(3,4)


(3,1) (3,2) (3,3) (3,4)
Nested dissection hierarchy
Task dependency tree

ILUPACK. Fig.  Nested dissection reordering


I

ΦT AΦ associated with the independent subgraphs can because the updates from descendant nodes to an ances-
be first eliminated independently; after that, S(,) and tor node can also be performed locally/independently
S(,) can be eliminated in parallel, and finally sepa- by its descendants.
rator S(,) is processed. This type of parallelism can The parallel multilevel method considers the follow-
be expressed by a binary task dependency tree, where ing partition of the local submatrices into  ×  block
nodes represent concurrent tasks and arcs dependencies matrices
among them. State-of-the-art reordering software pack-
⎛A A ⎞
ages e.g., METIS (http://glaros.dtc.umn.edu/gkhome/ X V
Apar = ⎜ ⎟,
⎜ ⎟
views/metis) or SCOTCH (http://www.labri.fr/perso/
⎝AW AZ ⎠
pelegrin/scotch), provide fast and efficient multilevel
variants [] of nested dissection orderings.
where the partitioning lines separate the blocks to be
The dependencies in the task tree are resolved while
factorized by the task, i.e., AX , AW , and AV , and its con-
the computational data and results are generated and
tribution blocks, i.e., AZ . It then performs the following
passed from the leaves toward the root. The leaves
steps:
are responsible for approximating the leading diagonal
blocks of Φ T AΦ, while those blocks which will be later . Scaling and permutation matrices are only applied
factorized by their ancestors are updated. For example, to the blocks to be factorized,
in Fig. , colors are used to illustrate the correspondence
between the blocks of Φ T AΦ to be factorized by tasks ⎛ Π T ⎞ ⎛D ⎞
(,), (,), and (,). Besides, (,) only updates those Apar →⎜
l ⎟⎜ l ⎟
⎜ ⎟⎜ ⎟
blocks that will be later factorized by tasks (,) and ⎝  I⎠ ⎝  I⎠
(,). Taking this into consideration, Φ T AΦ is decom-
posed into the sum of several submatrices, one local ⎛ A A ⎞ ⎛D ⎞
X V
×⎜ ⎟⎜ r ⎟
block per each leaf of the tree, as shown in Fig. . Each ⎜ ⎟⎜ ⎟
⎝AW AZ ⎠ ⎝  I ⎠
local submatrix is composed of the blocks to be factor-
ized by the corresponding task, together with its local ⎛Π ⎞ ⎛Â Â ⎞
r
contributions to the blocks that are later factorized by its ×⎜ ⎟=⎜ X V
⎟ = Âpar .
⎜ ⎟ ⎜ ⎟
ancestors, hereafter referred as contribution blocks. This ⎝  I ⎠ ⎝ÂW ÂZ ⎠
strategy significantly increases the degree of parallelism,
 I ILUPACK

(1,1)

(2,1) (2,2)

(3,1) (3,2) (3,3) (3,4)

= + + +

ΦTAΦ

To be factorized by (3,2)

Local contributions from (3,2) to (2,1) contribution


blocks
Local contributions from (3,2) to (1,1)
Local submatrix

ILUPACK. Fig.  Matrix decomposition and local submatrix associated to a single node of the task tree

Factorized Pending accept (1,1)

Current Rejected (2,2)


(2,1)

eTk L−1,U −1 ek ≤ k (3,1) (3,2) (3,3) (3,4)

approx. continue
factor. reject factorization

eTk L−1,U −1 ek > k

compute
SC

current factorization step finalize local level

ILUPACK. Fig.  Local incomplete factorization computed by a single node of the task tree

. These blocks are next approximately factorized factorization of Âpar , rejected rows and columns are
using inverse-based pivoting, while ÂZ is only permuted to the lower and right end of the leading
updated. For this purpose, during the incomplete block ÂX . This is illustrated in Fig. .
ILUPACK I 

(3,1)

+ =
(1,1)

(2,1) (2,2)
Merge contributions (2,1)
(3,1) (3,2) (3,3) (3,4)

(3,2)

Apar Apar

ILUPACK. Fig.  Parent nodes construct their local submatrix from the data generated by their children

This step computes a partial factorization data from its children, it must first construct its
local submatrix Apar . To achieve this, it incorporates
⎛ BX EX EV ⎞
⎛P T ⎞ ⎛P ⎞ ⎜ ⎟ the pivots rejected from its children and accumu-
⎜ ⎟Âpar ⎜ ⎟=⎜ ⎟
⎜ ⎟ ⎜ ⎟ ⎜
⎜ FX CX CV ⎟
⎟ lates their contribution blocks, as shown in Fig. .
⎝ I⎠ ⎝ I⎠ ⎜ ⎟
⎝F W CW CZ ⎠ If the parent node is the root of the task dependency
tree, it applies the sequential multilevel algorithm to
⎛ LB X   ⎞ ⎛D B X U B X DB X UE X DB X UE V ⎞ the new constructed submatrix Apar . Otherwise, the
⎜ ⎟⎜ ⎟
⎜ ⎟⎜ ⎟
= ⎜ LF X
⎜ I ⎟ ⎜
⎟⎜  SC X SC V ⎟ ⎟
+ R, parallel multilevel method is restarted on this matrix
⎜ ⎟⎜ ⎟
⎝LF W  I⎠ ⎝  SC W SCZ ⎠ at step .

where the inverse triangular factors approximately To compare the cascade of approximations
satisfy computed by the sequential multilevel method and its
 −   −  parallel variant, it is helpful to consider the latter as


 

 

 


⎛ LB X
  ⎞ 
 ⎛UBX
 UE X UE V ⎞ 
 an algorithmic variant, which enforces a certain order


 ⎜ ⎟ 

 

 ⎜ ⎟ 




 ⎜ ⎟ 

 

 ⎜ ⎟ 

 of elimination. Thus, the parallel variant interlaces the

 ⎜ LF X
⎜ I ⎟
⎟ 
 , 
 ⎜ 
⎜ I  ⎟ ⎟ 
 ⪅ κ.


 ⎜ ⎟ 

 

 ⎜ ⎟ 




 

 

 

 algebraic levels generated by the pivoting strategy of



⎝LF W  I⎠ 





⎝   I ⎠ 


    ILUPACK with the nested dissection hierarchy levels
in order to expose a high degree of parallelism. The
. Steps  and  are successively applied to the Schur
leading diagonal blocks associated with the last nested
complement until SCX is void or “sufficiently small”,
dissection hierarchy level are first factorized using
i.e., Apar and its  ×  block partition are redefined
inverse-based pivoting, while those corresponding to
as
previous hierarchy levels are only updated, i.e., the
⎛ A A ⎞ ⎛S SC V ⎞
X V
⎟ := ⎜ CX nodes belonging to the separators are rejected by con-
Apar = ⎜⎜ ⎟ ⎜
⎟.
⎟ struction. The multilevel algorithm is restarted on the
⎝AW AZ ⎠ ⎝SCW SCZ ⎠
rejected nodes, and only when it has eliminated the
. The task completes its local computations and the bulk of the nodes of the last hierarchy level, it starts
result Apar is sent to the parent node in the task approximating the blocks belonging to previous hier-
dependency tree. When a parent node receives the archy levels. Figure  compares the distribution of
Figure  compares the distribution of accepted and rejected nodes (in blue and red, resp.) produced by the sequential and parallel inverse-based incomplete factorizations when they are applied to the Laplace PDE with discontinuous coefficients discretized with a standard  ×  finite-difference grid. In both cases, the grid was reordered using nested dissection. These diagrams confirm the strong similarity between the sequential inverse-based incomplete factorization and its parallel variant for this particular example. The experimentation with this approach in the SPD case reveals that this compromise has a negligible impact on the numerical properties of the inverse-based preconditioning approach, in the sense that the convergence rate of the preconditioned iterative method is largely independent of the number of processors involved in the parallel computation.

ILUPACK. Fig.  ILUPACK pivoting strategy (left) and its parallel variant (right). Blue and red dots denote, respectively, accepted and rejected nodes

ILUPACK. Fig.  Parallel speedup as a function of problem size (left) and ratio between the memory consumed by the parallel multilevel method and that of the sequential method (right) for a discretized -D elliptic PDE. Both quantities are plotted against log10(n/1709), with curves for p = 1, 2, 4, 8, and 16 processors, each for a perfect binary tree and for a tree with #leaves in [p, 2p]

Numerical Example
The example illustrates the parallel and memory scalability of the parallel multilevel method on an SGI Altix  CC-NUMA shared-memory multiprocessor with  Intel Itanium@. GHz processors
sharing  GBytes of RAM connected via an SGI NUMAlink network. The nine linear systems considered in this experiment are derived from the linear finite element discretization of the irregular -D elliptic PDE −div(A grad u) = f in a -D domain, where A(x, y, z) is chosen with positive random coefficients. The size of the systems ranges from n = ,  to , ,  equations/unknowns.

The parallel execution of the task tree on shared-memory multiprocessors is orchestrated by a runtime which dynamically maps tasks to threads (processors) in order to improve load balance during the computation of the multilevel preconditioner. Figure  displays two lines for each number of processors: dashed lines are obtained for a perfect binary tree with the same number of leaves as processors, while solid lines correspond to a binary tree with (potentially) more leaves than processors (up to 2p leaves). The higher speedups revealed by the solid lines demonstrate the benefit of dynamic scheduling on shared-memory multiprocessors.

As shown in Fig.  (left), the speedup always increases with the number of processors for a fixed problem size, and the parallel efficiency rapidly grows with problem size for a given number of processors. In addition, as shown in Fig.  (right), the relative memory overhead associated with parallelization decreases with the problem size for a fixed number of processors, and it is below . for the two largest linear systems; this is very moderate taking into consideration that the amount of available physical memory typically increases linearly with the number of processors. These observations confirm the excellent scalability of the parallelization approach up to  cores.

Future Research Directions
ILUPACK is currently parallelized for SPD matrices only. Further classes of matrices, such as symmetric indefinite matrices or unsymmetric but symmetrically structured matrices, are the subject of ongoing research. The parallelization of these cases shares many similarities with the SPD case. However, maximum weight matchings, or more generally, methods to rescale and reorder the system to improve the size of the diagonal entries (resp. diagonal blocks), are hard to parallelize []. Although the current implementation of ILUPACK is designed for shared-memory multiprocessors using OpenMP, the basic principle is currently being transferred to distributed-memory parallel architectures via MPI. Future research will be devoted to block-structured algorithms to improve cache performance. Block structures have been proven to be extremely efficient for direct methods, but their integration into incomplete factorizations remains a challenge; on the other hand, recent research indicates the high potential of block-structured algorithms also for ILUs [, ].

Related Entries
Graph Partitioning
Linear Algebra Software
Load Balancing, Distributed Memory
Nonuniform Memory Access (NUMA) Machines
Shared-Memory Multiprocessors

Bibliographic Notes and Further Reading
Around , a first package also called ILUPACK was developed by H.D. Simon [], which combined reorderings, incomplete factorizations, and iterative methods in one package. Nowadays, as the development of preconditioning methods has advanced significantly through novel approaches like improved reorderings, multilevel methods, maximum weight matchings, and inverse-based pivoting, incomplete factorization methods have changed completely and gained wide acceptance in the scientific community. Besides ILUPACK, there exist several software packages based on incomplete factorizations and on multilevel factorizations. For example, Y. Saad (http://www-users.cs.umn.edu/∼saad/software/) et al. developed the software packages SPARSKIT, ITSOL, and pARMS, which also inspired the development of ILUPACK. In particular, pARMS is a parallel code based on multilevel ILU interlaced with reorderings. J. Mayer (http://iamlasun.mathematik.uni-karlsruhe.de/∼ae/iluplusplus.html) developed ILU++, also for multilevel ILU. MRILU, by F.W. Wubs (http://www.math.rug.nl/∼wubs/mrilu/) et al., uses multilevel incomplete factorizations and has successfully been applied to problems arising from partial differential equations. There are many other scientific publications on multilevel ILU methods, most of them especially tailored for use in partial differential equations.
ILUPACK has been successfully applied to several large-scale application problems, in particular to the Anderson model of localization [] or the Helmholtz equation []. Its symmetric version is integrated into the software package JADAMILU (http://homepages.ulb.ac.be/∼jadamilu/), which is a Jacobi-Davidson-based eigenvalue solver for symmetric eigenvalue problems.

More details on the design aspects of the parallel multilevel preconditioner (including dynamic scheduling) and experimental data with a benchmark of irregular matrices from the UF sparse matrix collection (http://www.cise.ufl.edu/research/sparse/matrices/) can be found in [, ]. The computed parallel multilevel factorization is incorporated into a Krylov subspace solver, and the underlying parallel structure of the former, expressed by the task dependency tree, is exploited for the application of the preconditioner as well as for other major operations in iterative solvers []. The experience with these parallel techniques has revealed that the sequential computation of the nested dissection hierarchy (included in, e.g., METIS or SCOTCH) dominates the execution time of the whole solution process when the number of processors is large. Therefore, additional types of parallelism have to be exploited during this stage in order to develop scalable parallel solutions; in [], two parallel partitioning packages, ParMETIS (http://glaros.dtc.umn.edu/gkhome/metis/parmetis/overview) and PT-SCOTCH (http://www.labri.fr/perso/pelegrin/scotch), are evaluated for this purpose.

Bibliography
. Aliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES () Design, tuning and evaluation of parallel multilevel ILU preconditioners. In: Palma J, Amestoy P, Dayde M, Mattoso M, Lopes JC (eds) High performance computing for computational science – VECPAR , Toulouse, France. Number  in Lecture Notes in Computer Science. Springer, Berlin/Heidelberg, pp –
. Aliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES () Exploiting thread-level parallelism in the iterative solution of sparse linear systems. Technical report, Dpto. de Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (submitted for publication)
. Aliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES () Evaluation of parallel sparse matrix partitioning software for parallel multilevel ILU preconditioning on shared-memory multiprocessors. In: Chapman B et al (eds) Parallel computing: from multicores and GPUs to petascale. Advances in parallel computing, vol . IOS Press, Amsterdam, pp –
. Amestoy P, Davis TA, Duff IS () An approximate minimum degree ordering algorithm. SIAM J Matrix Anal Appl ():–
. Bollhöfer M, Grote MJ, Schenk O () Algebraic multilevel preconditioner for the Helmholtz equation in heterogeneous media. SIAM J Sci Comput ():–
. Bollhöfer M, Saad Y () Multilevel preconditioners constructed from inverse-based ILUs. SIAM J Sci Comput ():–
. Duff IS, Koster J () The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J Matrix Anal Appl ():–
. Duff IS, Pralet S () Strategies for scaling and pivoting for sparse symmetric indefinite problems. SIAM J Matrix Anal Appl ():–
. Duff IS, Uçar B (Aug ) Combinatorial problems in solving linear systems. Technical Report TR/PA//, CERFACS
. Freund R, Nachtigal N () Software for simplified Lanczos and QMR algorithms. Appl Numer Math ():–
. George A, Liu JW () The evolution of the minimum degree ordering algorithm. SIAM Rev ():–
. Gupta A, George T () Adaptive techniques for improving the performance of incomplete factorization preconditioning. SIAM J Sci Comput ():–
. Hénon P, Ramet P, Roman J () On finding approximate supernodes for an efficient block-ILU(k) factorization. Parallel Comput (–):–
. Karypis G, Kumar V () A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J Sci Comput ():–
. Saad Y () Iterative methods for sparse linear systems, nd edn. SIAM Publications, Philadelphia, PA
. Saad Y, Schultz M () GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J Sci Stat Comput :–
. Schenk O, Bollhöfer M, Römer RA () Awarded SIGEST paper: on large scale diagonalization techniques for the Anderson model of localization. SIAM Rev :–
. Simon HD (Jan ) User guide for ILUPACK: incomplete LU factorization and iterative methods. Technical Report ETA-LR-, Boeing Computer Services

Impass
Deadlocks

Implementations of Shared Memory in Software
Software Distributed Shared Memory
Index
All-to-All

InfiniBand
Dhabaleswar K. Panda, Sayantan Sur
The Ohio State University, Columbus, OH, USA

Synonyms
Interconnection network; Network architecture

Definition
The InfiniBand Architecture (IBA) describes a switched interconnect technology for inter-processor communication and I/O in a multiprocessor system. The architecture is independent of the host operating system and the processor platform. InfiniBand is based on a widely adopted open standard.

Discussion

Introduction
As commodity clusters started getting popular around the mid-nineties, the common interconnect used for such clusters was Fast Ethernet ( Mbps). These clusters were identified as Beowulf clusters []. Even though it was cost effective to design such clusters with Ethernet/Fast Ethernet, the communication performance was not very good because of the high overhead associated with the standard TCP/IP communication protocol stack. The high overhead arose because the TCP/IP protocol stack was completely executed on the host processor. The network interface cards (NICs, commonly known as network adapters) on those systems were also not intelligent. Thus, these adapters did not allow any overlap of computation with communication or I/O. This led to high latency, low bandwidth, and high CPU requirements for communication and I/O operations on those clusters.

The above limitations led to the goal of designing a new and converged interconnect with an open standard. Seven industry leaders (Compaq, Dell, IBM, Intel, Microsoft, HP, and Sun) formed a new InfiniBand Trade Association []. The charter of this association was “to design a scalable and high performance communication and I/O architecture by taking an integrated view of computing, networking and storage technologies.” Many other companies participated in this consortium later. The members of this consortium deliberated and defined the InfiniBand architecture specification. The first specification (Volume , Version .) was released to the public on October , . Since then the standard has been enhanced periodically. The latest version, .., was released in January .

The word “InfiniBand” is coined from the two words “Infinite Bandwidth.” The architecture is defined in such a manner that, as the speed of computing and networking technologies improves over time, the InfiniBand architecture should be able to deliver higher and higher bandwidth. At its inception in , InfiniBand delivered a link speed of . Gbps (a payload data rate of  Gbps due to the underlying / encoding and decoding). Now, in , it is able to deliver a link speed of  Gbps (a payload data rate of  Gbps).

Communication Model
The communication model comprises the interaction of multiple components in the interconnect system. The major components and their interaction are described below.

Topology and Network Components
At a high level, IBA serves as an interconnection of nodes, where each node can be a processor node, an I/O unit, a switch, or a router to another network, as illustrated in Fig. . Processor nodes or I/O units are typically referred to as “end nodes,” while switches and routers are referred to as “intermediate nodes” (or sometimes as “routing elements”). An IBA network is subdivided into subnets interconnected by routers. Overall, the IB fabric is comprised of four different components: (a) channel adapters, (b) links and repeaters, (c) switches, and (d) routers.
InfiniBand. Fig.  Typical InfiniBand Cluster (Courtesy InfiniBand Standard). HCA = InfiniBand channel adapter in a processor node; TCA = InfiniBand channel adapter in an I/O node

Channel adapters: A processor or I/O node connects to the fabric using channel adapters. These channel adapters consume and generate IB packets. Most current channel adapters are equipped with programmable direct memory access (DMA) engines with protection features implemented in hardware. Each adapter can have one or more physical ports connecting it to either a switch/router or another adapter. Each physical port itself internally maintains two or more virtual channels (or virtual lanes) with independent buffering for each of them. The channel adapters also provide a memory translation and protection (MTP) mechanism that translates virtual addresses to physical addresses and validates access rights. Each channel adapter has a globally unique identifier (GUID) assigned by the channel adapter vendor. Additionally, each port on the channel adapter has a unique port GUID assigned by the vendor as well.

Links and repeaters: Links interconnect channel adapters, switches, routers, and repeaters to form an IB network fabric. Different forms of IB links are currently available, including copper links, optical links, and printed circuit wiring on a backplane. Repeaters are transparent devices that extend the range of a link. Links and repeaters themselves are not accessible in the IB architecture. However, the status of a link (e.g., whether it is up or down) can be determined through the devices which the link connects.

Switches: IBA switches are the fundamental routing components for intra-subnet routing. Switches do not generate or consume data packets (they generate/consume only management packets). Every destination within the subnet is configured with one or more unique local identifiers (LIDs). Packets contain a destination address that specifies the LID of the destination (DLID). The switch is typically configured out-of-band with a forwarding table that allows it to route the packet to the appropriate output port. This out-of-band configuration of the switch is handled by a separate component within IBA called the subnet manager, as discussed later in this entry. It is to be noted that such LID-based forwarding allows the subnet manager to configure multiple routes between the same two destinations. This, in turn, allows the network to maximize availability by rerouting packets around failed links through reconfiguration of the forwarding tables.
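The forwarding table consulted for this LID-based switching can be pictured as a simple array indexed by the destination LID. The following fragment is a schematic model of that lookup only; the names, the drop marker, and the configuration helper are hypothetical, and real switches implement the table in hardware with contents written by the subnet manager.

/* Schematic model of LID-based forwarding in an IBA switch (not firmware). */
#include <stdint.h>
#include <stdio.h>

#define MAX_UNICAST_LID 0xBFFF   /* unicast LIDs occupy 0x0001-0xBFFF */
#define PORT_DROP       0xFF     /* marker for "no route configured"  */

typedef struct {
    uint8_t out_port[MAX_UNICAST_LID + 1];   /* DLID -> output port   */
} forwarding_table;

/* Conceptually performed by the subnet manager, out-of-band. */
void set_route(forwarding_table *ft, uint16_t dlid, uint8_t port)
{
    ft->out_port[dlid] = port;
}

/* Performed by the switch for every incoming packet. */
uint8_t forward(const forwarding_table *ft, uint16_t dlid)
{
    return (dlid >= 1 && dlid <= MAX_UNICAST_LID) ? ft->out_port[dlid]
                                                  : PORT_DROP;
}

int main(void)
{
    static forwarding_table ft;
    for (unsigned lid = 0; lid <= MAX_UNICAST_LID; lid++)
        ft.out_port[lid] = PORT_DROP;

    set_route(&ft, 0x0012, 3);   /* example route written by the SM */
    printf("DLID 0x0012 -> port %u\n", forward(&ft, 0x0012));
    return 0;
}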
Routers: IBA routers are the fundamental routing components for inter-subnet routing. Like switches, routers do not generate or consume data packets; they simply pass them along. Routers forward packets based on the packet’s global route header and replace the packet’s local route header as the packet passes from one subnet to another. The primary difference between a router and a switch is that routers are not completely transparent to the end nodes, since the source must specify the LID of the router and also provide the global identifier (GID) of the destination. IB routers use the IPv protocol to derive their forwarding tables.

Messaging
Communication operations in InfiniBand are initiated by the consumer by setting up a list of instructions that the hardware executes. In IB terminology, this facility is referred to as a work queue. In general, work queues are created in pairs, referred to as queue pairs (QPs), one for send operations and one for receive operations. The consumer process submits a work queue element (WQE) to be placed in the appropriate work queue. The channel adapter executes WQEs in the order in which they were placed. After executing a WQE, the channel adapter places a completion queue entry (CQE) in a completion queue (CQ); each CQE contains or points to all the information corresponding to the completed request. This is illustrated in Fig. .

InfiniBand. Fig.  Consumer Queuing Model (Courtesy InfiniBand Standard)

Send operations: There are three classes of send queue operations: (a) Send, (b) Remote Direct Memory Access (RDMA), and (c) Memory Binding. For a send operation, the WQE specifies a block of data in the consumer’s memory. The IB hardware uses this information to send the associated data to the destination. The send operation requires the consumer process to prepost a receive WQE, which the consumer network adapter uses to place the incoming data in the appropriate location.

For an RDMA operation, together with the information about the local buffer and the destination end point, the WQE also specifies the address of the remote consumer’s memory. Thus, RDMA operations do not need the consumer to specify the target memory location using a receive WQE. There are four types of RDMA operations: RDMA write, RDMA write with immediate data, RDMA read, and Atomic. In an RDMA write operation, the sender specifies the target location to which the data has to be written. In this case, the consumer process does not need to post any receive WQE at all. In an RDMA write with immediate data operation, the sender again specifies the target location to write data, but the receiver still needs to post a receive WQE. The receive WQE is marked complete when all the data is written. An RDMA read operation is similar to an RDMA write operation, except that instead of writing data to the target location, data is read from the target location. Finally, there are two types of atomic operations, fetch-and-add and compare-and-swap; if two atomic operation requests arrive for the same location simultaneously, the second operation is not started till the first one completes. However, such atomicity is not guaranteed when another device is operating on the same data. For example, if the CPU or another InfiniBand adapter modifies data while one InfiniBand adapter is performing an atomic operation on this data, the target location might get corrupted.
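The work-request model just described is exposed to software through verbs. The fragment below sketches how a consumer could post a signaled RDMA write WQE on a queue pair and then poll the completion queue for the resulting CQE, assuming the OpenFabrics verbs API mentioned later in this entry; the queue pair, completion queue, registered memory region, and the remote address/rkey are assumed to have been created and exchanged beforehand, and error handling is minimal.

/* Sketch: post an RDMA-write work request and wait for its completion
 * using the OpenFabrics verbs API. qp, cq, mr, local_buf, remote_addr
 * and remote_rkey are assumed to have been created/exchanged already. */
#include <string.h>
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

int rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *local_buf, size_t len,
                        uint64_t remote_addr, uint32_t remote_rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,   /* local data described by the WQE */
        .length = (uint32_t) len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof wr);
    wr.wr_id               = 1;                    /* echoed back in the CQE  */
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;    /* request a CQE           */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;          /* target location         */
    wr.wr.rdma.rkey        = remote_rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))           /* place WQE on send queue */
        return -1;

    struct ibv_wc wc;                              /* completion queue entry  */
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);               /* busy-poll for the CQE   */
    } while (n == 0);

    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}

Replacing the opcode with IBV_WR_SEND (and dropping the remote address and rkey) would turn the same request into a channel-semantics send that consumes a receive WQE at the target.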
Receive operations: Unlike the send operations, there is only one type of receive operation. For a send operation, the corresponding receive WQE specifies where to place the data that is received and, if requested, places the receive WQE in the completion queue. For an RDMA write with immediate data operation, the consumer process does not have to specify where to place the data (if specified, the hardware will ignore it). When the message arrives, the receive WQE is updated with the memory location specified by the sender, and the WQE is placed in the completion queue, if requested.

Overview of Features
IBA describes a multilayer network protocol stack. Each layer provides several features that are usable by applications: () link layer features, () network layer features, and () transport layer features.

Link Layer Features
CRC-based data integrity: IBA provides two forms of CRC-based data integrity to achieve both early error detection and end-to-end reliability: invariant CRC and variant CRC. Invariant CRC (ICRC) covers fields that do not change on each network hop. Variant CRC (VCRC), on the other hand, covers the entire packet, including the variant as well as the invariant fields.

Buffering and flow control: IBA provides an absolute credit-based flow control in which the receiver guarantees that it has enough space allotted to receive N blocks of data. The sender sends only N blocks of data before waiting for an acknowledgment from the receiver. The receiver occasionally updates the sender with an acknowledgment as and when the receive buffers get freed up. Note that this link-level flow control has no relation to the number of messages sent, but only to the total amount of data that has been sent.

Virtual lanes, service levels, and QoS: Virtual lanes (VLs) are a mechanism that allows the emulation of multiple virtual links within a single physical link, as illustrated in Fig. . Each port provides at least  and up to  virtual lanes (VL to VL). VL is reserved exclusively for subnet management traffic.

Congestion control: Together with flow control, IBA also defines a multistep congestion control mechanism for the network fabric. Specifically, congestion control, when used in conjunction with the flow-control mechanism, is intended to alleviate head-of-line blocking for non-congested flows in a congested environment. To achieve effective congestion control, IB switch ports need to first identify whether they are the root or the victim of a congestion. A switch port is a root of a congestion if it is sending data to a destination faster than the destination can receive it, thus using up all the flow-control credits available on the switch link. On the other hand, a port is a victim of a congestion if it is unable to send data on a link because another node is using up all of the available flow-control credits on the link. In order to identify whether a port is the root or the victim of a congestion, IBA specifies a simple approach. When a switch port notices congestion and has no flow-control credits left, it assumes that it is a victim of congestion. On the other hand, when a switch port notices congestion and still has flow-control credits left, it assumes that it is the root of a congestion. This approach is not perfect: it is possible that even a root of a congestion may not have flow-control credits remaining at some point of the communication (for example, if the receiver process is too slow in receiving data); in this case, even the root of the congestion would assume that it is a victim of the congestion. Thus, though not required by the IBA, in practice IB switches are configured to react with the congestion control protocol irrespective of whether they are the root or the victim of the congestion.

Static rate control: IBA defines a number of different link bit rates, such as x SDR (. Gbps), x SDR ( Gbps), x SDR ( Gbps), x DDR ( Gbps), x QDR ( Gbps), and so on. It is to be noted that these rates are signaling rates; with the b/b data encoding, the actual maximum data rates would be  Gbps (x SDR),  Gbps (x SDR),  Gbps (x SDR),  Gbps (x DDR), and  Gbps (x QDR), respectively. To simultaneously support multiple link speeds within a fabric, IBA defines a static rate control mechanism that prevents faster links from overrunning the capacity of ports with slower links. Using this static rate control mechanism, each destination has a timeout value based on the ratio of the destination and source link speeds.
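The root-versus-victim heuristic above amounts to a one-line decision per congested port. The following fragment merely restates it in code form for clarity; the names are hypothetical, and real switches implement the check, together with the rest of the congestion-control protocol, in hardware.

/* Restatement of the root/victim heuristic used by IB congestion control.
 * A congested port that still holds flow-control credits is presumed to be
 * the root of the congestion; one that has run out of credits is presumed
 * to be a victim. As noted in the text, the heuristic is not perfect. */
#include <stdbool.h>

enum congestion_role { CONG_NONE, CONG_ROOT, CONG_VICTIM };

enum congestion_role classify_port(bool congestion_detected,
                                   unsigned flow_control_credits_left)
{
    if (!congestion_detected)
        return CONG_NONE;
    return (flow_control_credits_left > 0) ? CONG_ROOT : CONG_VICTIM;
}

int main(void)
{
    /* a congested port that still has credits is classified as a root */
    return classify_port(true, 5) == CONG_ROOT ? 0 : 1;
}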
Network Layer Features
IB routing: IB network-layer routing is similar to link-layer switching, except for two primary differences. First, network-layer routing relies on a global routing header (GRH) that allows packets to be routed between subnets. Second, VL (the virtual lane reserved for management traffic) is not respected by IB routers, since management traffic stays within the subnet and is never routed across subnets.

Flow labels: Flow labels identify a single flow of packets. For example, these can identify packets belonging to a single connection that need to be delivered in order, thus allowing routers to optimize traffic routing when using multiple paths for communication. IB routers are allowed to change flow labels as needed, for example, to distinguish two different flows which have been given the same label. However, during such relabeling, routers ensure that all packets corresponding to the same flow will have the same label.

InfiniBand. Fig.  IB Transport Services (Courtesy InfiniBand Standard). For each attribute below, the values are listed in the column order Reliable Connection / Reliable Datagram / Unreliable Datagram / Unreliable Connection / Raw Datagram (both IPv6 & ethertype):
- Scalability (M processes on N processor nodes communicating with all processes on all nodes): M^2*N QPs required on each processor node, per CA / M QPs required on each processor node, per CA / M QPs required on each processor node, per CA / M^2*N QPs required on each processor node, per CA / 1 QP required on each end node, per CA
- Corrupt data detected: Yes, for all services
- Data delivery guarantee: data delivered exactly once (Reliable Connection, Reliable Datagram); no guarantees (Unreliable Datagram, Unreliable Connection, Raw Datagram)
- Data order guaranteed: Yes, per connection / Yes, packets from any one source QP are ordered to multiple destination QPs / No / Unordered and duplicate packets are detected / No
- Data loss detected: Yes / Yes / No / Yes / No
- Error recovery: Reliable Connection and Reliable Datagram are reliable: errors are detected at both the requestor and the responder, and the requestor can transparently recover from errors (retransmission, alternate path, etc.) without any involvement of the client application; QP processing is halted only if the destination is inoperable or all fabric paths between the channel adapters have failed. Unreliable Datagram is unreliable: packets with some types of errors may not be delivered, and neither source nor destination QPs are informed of dropped packets. Unreliable Connection is unreliable: packets with errors, including sequence errors, are detected and may be logged by the responder; the requestor is not informed. Raw Datagram is unreliable: packets with errors are not delivered, and the requestor and responder are not informed of dropped packets.
- RDMA and ATOMIC operations: Yes / Yes / No / Yes for RDMA WRITEs, No for RDMA READs and ATOMICs / No
- Bind memory window: Yes / Yes / No / Yes / No
- IBA unreliable multicast support: No / No / Yes / No / No
- Raw multicast: No / No / No / No / Yes
- Message size: 0 to 2^31 bytes; a smaller maximum size may be negotiated by Connection Management, and a message may consist of multiple packets / same as Reliable Connection / single-PMTU packet datagrams, 0 to 4,096 bytes / 0 to 2^31 bytes, with the same provisions as Reliable Connection / single-PMTU packet datagrams, 0 to 4,096 bytes
- Connection oriented?: Connected; the client connects the local QP to one and only one remote QP, and no other traffic flows over these QPs / Connectionless; appears connectionless to the client and uses one or more End-to-End contexts per CA to provide the reliability service / Connectionless; no prior connection is needed for communication / Connected; the client connects the local QP to one and only one remote QP, and no other traffic flows over these QPs / Connectionless; no prior connection is needed for communication
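The transport service in this comparison is fixed at queue-pair creation time. The sketch below, which assumes the OpenFabrics verbs API rather than anything prescribed by IBA itself, creates either a reliable connection (RC) or an unreliable datagram (UD) queue pair over an existing protection domain and completion queue; an RC QP of this kind is needed per communicating peer, whereas a single UD QP can reach all peers. The capacities shown are arbitrary example values.

/* Sketch: create an RC or UD queue pair with the OpenFabrics verbs API.
 * pd and cq are assumed to exist; capacities are arbitrary example values. */
#include <string.h>
#include <infiniband/verbs.h>

struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq, int reliable)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof attr);

    attr.qp_type          = reliable ? IBV_QPT_RC : IBV_QPT_UD;
    attr.send_cq          = cq;        /* CQEs for send WQEs land here    */
    attr.recv_cq          = cq;        /* CQEs for receive WQEs land here */
    attr.cap.max_send_wr  = 128;       /* outstanding send WQEs           */
    attr.cap.max_recv_wr  = 128;       /* outstanding receive WQEs        */
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    /* The QP starts in the RESET state; the communication-management
     * steps described later move it through INIT/RTR/RTS before use. */
    return ibv_create_qp(pd, &attr);
}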
Transport Layer Features
IB transport services: IBA defines four types of transport services (reliable connection, reliable datagram, unreliable connection, and unreliable datagram) and two types of raw communication services (raw IPv datagram and raw Ethertype datagram) to allow for encapsulation of non-IBA protocols. The transport service describes the degree of reliability and how the data is communicated.

As illustrated in Fig. , each transport service provides a trade-off in the amount of resources used and the capabilities provided. For example, while the reliable connection service provides the most features, it has to establish a QP for each peer process it communicates with, thus requiring a quadratically increasing number of QPs with the number of processes. Reliable and unreliable datagram services, on the other hand, utilize fewer resources since a single QP can be used to communicate with all peer processes.

Automatic path migration: The Automatic Path Migration (APM) feature of InfiniBand allows a connection end point to specify a fallback path, together with a primary path, for a data connection. All data is initially sent over the primary path. However, if the primary path starts throwing excessive errors or is heavily loaded, the hardware can automatically migrate the connection to the new path, after which all data is sent only over the new path.

Message-level flow control: Together with the link-level flow control, for reliable connections IBA also defines an end-to-end message-level flow-control mechanism. This message-level flow control does not deal with the number of bytes of data being communicated, but rather with the number of messages being communicated. Specifically, a sender is only allowed to send as many messages that use up receive WQEs as there are receive WQEs posted by the consumer. That is, if the consumer has posted  receive WQEs, after the sender sends out  send or RDMA write with immediate data messages, the next message is not communicated till the receiver posts another receive WQE.

Shared Receive Queue (SRQ): Introduced in the InfiniBand . specification, shared receive queues (SRQs) were added to help address scalability issues with InfiniBand memory usage. When using the RC transport of InfiniBand, one QP is required per communicating peer. Preposting receives on each QP, however, can have very high memory requirements for communication buffers. To give an example, consider a fully connected MPI job of K processes. Each process in the job will require K −  QPs, each with n buffers of size s posted to it. Given a conservative setting of n =  and s =  KB, over  MB of memory per process would be required simply for communication buffers that may not be used. Given that current InfiniBand clusters now reach K processes, the maximum memory usage would potentially be over  GB per process in that configuration.

Recognizing that such buffers could be pooled, SRQ support was added, so instead of connecting a QP to a dedicated RQ, buffers can be shared across QPs. In this method, a smaller pool can be allocated and then refilled as needed instead of pre-posting on each connection.

Extended Reliable Connection (XRC): eXtended Reliable Connection (XRC) provides the services of the RC transport, but defines a very different connection model and method for determining data placement on the receiver in channel semantics. This mode was mostly designed to address multi-core clusters and to lower the memory consumption of QPs. For one process to communicate with another over RC, each side of the communication must have a dedicated QP for the other; there is no distinction based on which node a peer process resides on. In XRC, a process no longer needs to have a QP to every process on a remote node. Instead, once one QP to a node has been set up, messages can be routed to the other processes by giving the address/number of a shared receive queue (SRQ). In this model, the number of QPs required is based on the number of nodes in the job rather than the number of processes.
Management and Services
Together with the regular communication capabilities, IBA also defines elaborate management semantics and several services. The basic management messaging is handled through special packets called management datagrams (MADs). The IBA management model defines several management classes, as described below.

Subnet Management: The subnet management class deals with discovering, initializing, and maintaining an IB subnet. It also deals with interfacing with diagnostic frameworks for handling subnet and protocol errors. For subnet management, each IB subnet has at least one subnet manager (SM). An SM can be a collection of multiple processing elements, though they appear as a single logical management entity. These processes can internally make management decisions, for example, to manage large networks in a distributed fashion. Each channel adapter, switch, and router also maintains a subnet management agent (SMA) that interacts with the SM and handles management of the specific device on which it resides, as directed by the SM. An SMA can be viewed as a worker process, though in practice it could be implemented as hardware logic.

Subnet Administration: Subnet administration (SA) deals with providing consumers with access to information related to the subnet through the subnet management interface (SMI). Most of this information is collected by the SM, and hence the SA works very closely with the SM. In fact, in most implementations, a single logical SM processing unit performs the tasks of both the SM and the SA. The SA typically provides information that cannot be locally computed to consumer processes (such as data paths, SL-to-VL mappings, and partitioning information). Further, it also deals with inter-SM management, such as handling standby SMs.

Communication Management: Communication management deals with the protocols and mechanisms used to establish, maintain, and release channels for the RC, UC, and RD transport services. At creation, QPs are not ready for communication. The CM infrastructure is responsible for preparing the QPs for communication by exchanging appropriate information.

Performance Management: The performance management class allows performance management entities to retrieve performance and error statistics from IB components. There are two classes of statistics: (a) mandatory statistics that have to be supported by all IB devices and (b) optional statistics that are specified to allow for future standardization. There are several mandatory statistics supported by current IB devices, including the amount of bytes/packets sent and received, the transmit queue depth at various intervals, the number of ticks during which the port had data to send but had no flow-control credits, etc.

Device Management: Device management is an optional class that mainly focuses on devices that do not directly connect to the IB fabric, such as I/O devices and I/O controllers. An I/O unit (IOU) containing one or more IOCs is attached to the fabric using a channel adapter. The channel adapter is responsible for receiving packets from the fabric and delivering them to the relevant devices and vice versa. The device management class does not deal with directly managing the end I/O devices, but rather focuses on the communication with the channel adapter. Any other communication between the adapter and the I/O devices is unspecified by the IB standard and depends on the device vendor.

InfiniBand Today
To achieve high performance, adapters are continually updated with the latest in speeds (SDR, DDR, and QDR) and I/O interface technology. Earlier InfiniBand adapters supported only the PCI-X I/O interface technology, but have since moved to PCI-Express (x and x) and HyperTransport. More recently, QDR cards have moved to the PCI-Express . standard to further increase I/O bandwidth. Although InfiniBand adapters have traditionally been add-on components in expansion slots, they have recently started moving onto the motherboard. Many boards, including some designed by Tyan and SuperMicro, include adapters directly on the motherboard.

In , Mellanox Technologies released the fourth generation of their adapter, called ConnectX. This was the first InfiniBand adapter to support QDR speed. This adapter, in addition to increasing speeds and lowering latency, included support for  Gigabit Ethernet. Depending on the cable connected to the adapter port, it performs either as an Ethernet adapter or as an InfiniBand adapter. The current version of the ConnectX adapter, ConnectX-, supports advanced offloading of arbitrary lists of send, receive, and wait tasks to the network interface. This enables network-offloaded collective operations, which are critical to scaling messaging libraries to very large systems.

The IBA specification does not specify the exact software interface through which the hardware is accessed. OpenFabrics is an industry consortium that has specified an interface through which IB hardware can be accessed. It also includes support for other RDMA-capable interconnects, such as iWARP.

A few current usage environments for IB are mentioned below:

Message passing interface: MVAPICH [] and MVAPICH [] are implementations of MPI over InfiniBand. They are based on MPICH [] and MPICH [], respectively. As of January , these stacks are used by more than , organizations in  countries around