
news

Technology | DOI:10.1145/3057740 Samuel Greengard

Making Chips Smarter


Advances in artificial intelligence and machine learning are
motivating researchers to design and build new chips to support
different computing models.

IT IS NO secret that artificial intelligence (AI) and machine learning have advanced radically over the last decade, yet somewhere between better algorithms and faster processors lies the increasingly important task of engineering systems for maximum performance, and of producing better results.

[Figure: The design of the NVIDIA NVLink Hybrid Cube Mesh, which connects eight graphics processing units, each with 15 billion transistors. Image by Andrij Borys Associates, based on a schematic from the NVIDIA blog.]

The problem for now, says Nidhi Chappell, director of machine learning in the Datacenter Group at Intel, is that AI experts spend far too much time preprocessing code and data, iterating on models and parameters, waiting for training to converge, and experimenting with deployment models. Each step along the way is too labor-intensive, too compute-intensive, or both.

The research and development community, spearheaded by companies such as Nvidia, Microsoft, Baidu, Google, Facebook, Amazon, and Intel, is now taking direct aim at the challenge. Teams are experimenting with, developing, and even implementing new chip designs, interconnects, and systems to boldly go where AI, deep learning, and machine learning have not gone before. Over the next few years, these developments could have a major impact, even a revolutionary effect, on an array of fields: automated driving, drug discovery, personalized medicine, intelligent assistants, robotics, big data analytics, computer security, and much more. They could deliver faster and better processing for important tasks related to speech, vision, and contextual searching.

Specialized chips can significantly increase performance for fixed-function workloads because they include everything needed specifically for the task at hand and nothing more. Yet the approach is not without its challenges. For one thing, there is no clear consensus on how best to use silicon to accelerate AI; most chip designs and systems are still in the early stages of research, development, or deployment. For another, there is no single design, approach, or method that works well for every situation or AI-based framework.

One thing is perfectly clear: AI and machine learning frameworks are advancing rapidly. Says Eric Chung, a researcher at Microsoft Research: "We're seeing an escalating, insatiable demand for this kind of technology."

Beyond the GPU

The quest for faster and better processing in AI is nothing new. In recent years, graphics processing units (GPUs) have become the technology of choice for supporting the neural networks that underpin AI, deep learning, and machine learning. The reason is simple, even if the underlying technology is complex: GPUs, originally invented to improve graphics processing on computers, execute specific tasks faster than conventional central processing units (CPUs). Yet a specialized design is not ideal for every application or situation. For instance, a search engine such as Bing or Google has very different requirements than the speech processing used on a smartphone, or the visual processing that takes place in an automated vehicle or in the cloud. To varying degrees, systems must support both training and delivering real-time information and controls.

In the quest to boost performance in these systems, designers and engineers are leaving no idea unexamined. All of the research, however, revolves around a key goal: specialized AI chips that will deliver better performance than either CPUs or GPUs.




"This will undoubtedly shift the AI compute [framework] moving forward," Chappell explains. In the real world, these boutique chips would greatly reduce training requirements in neural networks, in some cases from days or weeks to hours or minutes. This has the potential not only to improve performance, but also to slash costs for companies developing AI, deep learning, and machine learning systems. The result would be faster and better visual recognition in automated vehicles, or the ability to reprocess millions of scans for potentially missed markers in healthcare or pharma.

The focus on boutique chips and better AI computation is leading researchers down several avenues. These include improvements in GPUs, as well as work on other technologies such as field programmable gate arrays (FPGAs), Tensor Processing Units (TPUs), and other chip systems and architectures that match specific AI and machine learning requirements. These initiatives, says Bryan Catanzaro, vice president of Applied Deep Learning Research at Nvidia, point in the same general direction: "The objective is to build computation platforms that deliver the performance and energy efficiency needed to build AI with a level of accuracy that isn't possible today."

GPUs, for instance, already deliver superior processor-to-memory bandwidth, and they can be applied to many tasks and workloads in the AI arena, including visual and speech processing. The appeal of GPUs revolves around providing greater floating-point operations per second (FLOPS) using fewer watts of electricity, and around the ability to extend that energy advantage by supporting 16-bit floating point numbers, which are more power- and energy-efficient than single-precision (32-bit) or double-precision (64-bit) floating point numbers. What is more, GPUs are quite scalable: the Nvidia Tesla P100, which packs 15 billion transistors onto a single chip, delivers extremely high throughput on AI workloads associated with deep learning.
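
The storage side of that 16-bit advantage is easy to see in software. Here is a minimal sketch using NumPy; it illustrates only the memory arithmetic, not any particular GPU's hardware, and the 50-million-parameter figure is an assumed, illustrative model size:

```python
import numpy as np

# An assumed 50-million-parameter weight tensor, a plausible mid-size network.
params = 50_000_000

for dtype in (np.float16, np.float32, np.float64):
    megabytes = params * np.dtype(dtype).itemsize / 1e6
    print(f"{np.dtype(dtype).name}: {megabytes:.0f} MB")

# float16: 100 MB, float32: 200 MB, float64: 400 MB. Halving the bytes per
# value halves memory traffic, a large share of a chip's energy budget.
```
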
However, as Moore's Law reaches physical barriers, the technology must evolve further. For now, "There are a lot of ways to customize processor architectures for deep learning," Catanzaro says. Among these: improving efficiency on deep learning-specific workloads, and better integration between the throughput-oriented GPU and the latency-oriented CPU. For instance, Nvidia has introduced a specialized server called the DGX-1, which uses eight Tesla P100 processors to deliver 170 teraflops of compute for neural network training. The system also uses a fast interconnect between GPUs called NVLink, which the company claims allows up to 12 times faster data sharing than traditional PCIe interconnects. "There is still an opportunity for considerable innovation in this space," he says.
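
Those headline numbers are consistent with one another; a quick sanity check, assuming the 170-teraflop figure refers to aggregate half-precision throughput:

```python
# DGX-1: eight Tesla P100s, quoted at 170 teraflops for neural network training.
total_tflops = 170
num_gpus = 8
print(total_tflops / num_gpus)  # 21.25, close to the P100's ~21-TFLOPS FP16 peak
```
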
New Models Emerge

Other approaches are also ushering in significant gains. For example, Google's Tensor Processing Unit (TPU) is a custom application-specific integrated circuit (ASIC) designed specifically for AI applications such as speech processing and street-view mapping and navigation. It has been used in Google's datacenters for more than 18 months. A big benefit is that the chip is optimized for reduced computational precision. This translates into fewer transistors per operation and the ability to squeeze more operations per second into the chip, which results in better-optimized performance per watt and an ability to use more sophisticated and powerful machine learning models, while applying the results more quickly.
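
The trade of reduced precision for cheaper operations can be sketched generically. The snippet below shows plain linear 8-bit quantization in NumPy; it is an illustration of the principle only, not Google's actual TPU arithmetic:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)  # full-precision weights
x = rng.normal(size=256).astype(np.float32)         # one input vector

# Map weights and inputs onto signed 8-bit integers with per-tensor scales.
scale_w = np.abs(w).max() / 127.0
scale_x = np.abs(x).max() / 127.0
w_q = np.round(w / scale_w).astype(np.int8)
x_q = np.round(x / scale_x).astype(np.int8)

# Integer multiply-accumulate (32-bit accumulator), then rescale to floats.
y_q   = (w_q.astype(np.int32) @ x_q.astype(np.int32)) * (scale_w * scale_x)
y_ref = w @ x

rel_err = np.abs(y_q - y_ref).max() / np.abs(y_ref).max()
print(f"max relative error with 8-bit arithmetic: {rel_err:.2%}")
```

Each multiply now operates on 8-bit values instead of 32-bit ones, needing far fewer transistors, while the rescaled result stays close to the full-precision answer.
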
Another technology aimed at advancing AI and machine learning is Microsoft's Project Catapult, which uses field programmable gate arrays (FPGAs) to underpin the widely used Bing search engine, as well as the Azure cloud. This allows teams to implement algorithms directly in hardware, rather than in potentially less-efficient software. Chung says the FPGA's performance exceeds that of CPUs while retaining flexibility and allowing production systems to operate at hyperscale. He describes the technology as "programmable silicon."

"To be sure, energy-efficient FPGAs satisfy an important requirement when deploying accelerators at hyperscale in power-constrained datacenters. The system delivers a scalable, uniform pool of resources independent from CPUs. For instance, our cloud allows us to allocate few or many FPGAs as a single hardware service," he explains. This, ultimately, allows Microsoft to scale up models seamlessly to a large number of nodes. "The result is extremely high throughput with very low latency."

FPGAs are, in fact, highly flexible chips that achieve higher performance and better energy efficiency with reduced numerical precision. "Each computational operation gets more efficient on the FPGA with the fewer bits you use," Chung explains. The current generation of these Intel chips, known as Stratix V FPGAs, will evolve into more advanced versions, including the Arria 10 and Stratix 10, he notes; they will introduce additional speed and efficiencies. "With the technology, we can build custom pipelines that are tailored to specific algorithms and models," Chung says. In fact, Microsoft has reached a point where developers can deploy models rapidly, without deep technical expertise in the underlying machine learning framework. The appeal is the high level of flexibility. "It can be reprogrammed for different AI models and tasks," Chung notes. Indeed, the FPGAs can be reprogrammed on the fly to respond to advances in artificial intelligence or to different datacenter requirements; a process that previously could take two years or more can now take place in minutes.

Finally, Intel is introducing Nervana, a technology that aims to deliver "unprecedented compute density and high-bandwidth interconnect for seamless model parallelism," Chappell says. The technology will focus primarily on multipliers and local memory, and will skip elements such as caches that are required for graphics processing but not for deep learning.



It also features isolated pipelines for computation and data management, as well as High Bandwidth Memory (HBM) to accelerate data movement. Nervana, which Intel expects to introduce during the first half of this year, will "deliver sustained performance near theoretical maximum throughput," he adds. "It also includes 12 bidirectional high-bandwidth links, enabling multiple interconnected engines for seamless scalability, a key requirement for increased performance through scale."

Into the Future

An intriguing aspect of emerging chip designs for AI, deep learning, and machine learning is that low-precision designs increasingly prevail. In many cases, reduced-precision processors conform better to neuromorphic compute platforms and accelerate the deployment, and possibly the training, of deep learning algorithms. Simply put, they can produce similar results while consuming less power, in some cases by a factor of 100. While algorithms running on today's digital processors demand high numerical precision, the same algorithms can excel on low-precision chips in a neural net, because these systems adapt dynamically, examining data in a more relational and contextual way, and are less sensitive to rounding errors.
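
That tolerance for low precision is easy to demonstrate on a toy example. The sketch below is an illustration only, using an arbitrary random network rather than any vendor's hardware: it evaluates the same two-layer network in float16 and float32 and compares the predictions.

```python
import numpy as np

rng = np.random.default_rng(1)
x  = rng.normal(size=(100, 64))         # a batch of 100 inputs
w1 = rng.normal(size=(64, 32)) * 0.1    # weights of a random two-layer net
w2 = rng.normal(size=(32, 10)) * 0.1

def forward(x, w1, w2, dtype):
    h = np.maximum(x.astype(dtype) @ w1.astype(dtype), 0)  # ReLU hidden layer
    return h @ w2.astype(dtype)

hi = forward(x, w1, w2, np.float32)
lo = forward(x, w1, w2, np.float16)

# Despite float16 carrying far fewer mantissa bits, the predicted class
# (argmax over the 10 outputs) agrees for essentially every input.
agree = (hi.argmax(axis=1) == lo.argmax(axis=1)).mean()
print(f"prediction agreement: {agree:.0%}")
```
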
This makes the technology well suited to an array of machine learning tasks and technologies, including drones; automated vehicles; intelligent personal assistants such as Amazon's Alexa, Microsoft's Cortana, or Apple's Siri; photo and image recognition systems; and search engines, including general services like Bing and Google, but also those used by retailers, online travel agencies, and others. It also supports advanced functionality like real-time speech-to-text transcription and language translation.

In the end, says Gregory Diamos, a senior researcher at Baidu, specialized machine learning chips have the potential to change the stakes and usher in an era of even greater breakthroughs. "Machine learning has already made tremendous progress," he says. "Specialized chips and systems will continue to close the gap between computers and human performance."

Further Reading

Caulfield, A., Chung, E., Putnam, A., Angepat, H., Fowers, J., Haselman, M., Heil, S., Humphrey, M., Kaur, P., Kim, J.Y., Lo, D., Massengill, T., Ovtcharov, K., Papamichael, M., Woods, L., Lanka, S., Chiou, D., and Burger, D.
A Cloud-Scale Acceleration Architecture, Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture, Oct. 15, 2016, IEEE Computer Society. https://www.microsoft.com/en-us/research/publication/configurable-cloud-acceleration/

Samel, B., Mahajan, S., and Ingole, A.M.
GPU Computing and Its Applications, International Research Journal of Engineering and Technology (IRJET), Vol. 3, Issue 4, Apr. 2016. https://www.irjet.net/archives/V3/i4/IRJET-V3I4357.pdf

Shafiee, A., Nag, A., Muralimanohar, N., Balasubramonian, R., Strachan, J.P., Hu, M., Williams, S.R., and Srikumar, V.
ISAAC: A Convolutional Neural Network Accelerator with In-Situ Analog Arithmetic in Crossbars, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 14-26, 2016, ISSN 1063-6897. http://ieeexplore.ieee.org/document/7446049/citations

Shirahata, K., Tomita, Y., and Ike, A.
Memory Reduction Method for Deep Neural Network Training, 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), 2016, pp. 1-6. doi: 10.1109/MLSP.2016.7738869. http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7738869&isnumber=7738802

Samuel Greengard is an author and journalist based in West Linn, OR.

© 2017 ACM 0001-0782/17/05 $15.00

ACM Member News

FINDING THE INTERSECTION OF MATH AND LANGUAGE

When Bulgarian-born Dragomir Radev, professor of computer science at the University of Michigan, studied computer science as an undergraduate at the Technical University of Sofia, "I was interested in math and languages: French, Russian, and English. I was not sure how to combine those two interests," Radev explains. "When the first personal computers came around, I thought it would be a good way to combine my interests."

Radev completed his undergraduate degree at the University of Maine at Orono before going on to earn a Ph.D. in computer science at Columbia University in New York in 1999 (while serving as an adjunct assistant professor in the department of computer science). His focus was on natural language processing and computational linguistics, working on algorithms to teach human languages to computers.

Even before graduating, Radev was hired by IBM in 1998 to work on the team that built the first question/answer system at the company's Thomas J. Watson Research Center in Hawthorne, NY. "After a year-and-a-half at IBM, I started at the University of Michigan in January 2000," he adds, "and I have been there since."

Radev now is involved with building spoken dialog systems for student advising, and he serves on the executive committee of the Association for Computational Linguistics, an organization for those working on problems involving natural language and computation. He has also served as co-chair of the North American Computational Linguistics Olympiad (NACLO), in which thousands of high school students in the U.S. and Canada compete to solve problems in natural language processing and computational linguistics.

John Delaney

