
CS267 Assignment 0

Ekaterina Gonina
Bio:
I am a first-year PhD student in CS at UC Berkeley's ParLab. I am interested in parallel
graph algorithms and in optimizing them for different parallel hardware platforms. I am also
interested in developing parallel programming frameworks that make programming newly
emerging parallel hardware easier for regular programmers. As an undergrad at UIUC, I
worked with Professor L. V. Kale on parallelizing and optimizing minimum spanning tree
algorithms using MPI and Charm++ (www.charm.cs.uiuc.edu).
What I'm hoping to gain from this course is a systematic understanding of the current
state of the art in developing parallel applications, reinforced with hands-on examples.
I'm also hoping to build a solid foundation for my parallel programming Prelim exam.
Application:

Data-Parallel Large Vocabulary Continuous Speech Recognition on GPUs
Speech recognition is an important application that the Parallel Computing Lab at UC
Berkeley is exploring to mine for parallel patterns and, eventually, to create
frameworks and techniques for implementing inference engines efficiently on manycore
platforms. Speech recognition is a key technology enabling human-computer
interaction in many emerging applications; for example, keeping a diary of a conference
meeting or recording a one-on-one research meeting would be useful to both industry
and the research community. However, language vocabulary models are extremely large,
and their size grows with the recognition accuracy we want. Parallel processing is
therefore essential: we want to recognize human speech quickly, ideally in real time,
and that is only possible if we take full advantage of the parallelism of current
manycore platforms.
One effective approach to the large vocabulary continuous speech recognition
(LVCSR) problem is a Hidden Markov Model (HMM) with a beam-search
approximate inference algorithm. This system uses a recognition network that is
compiled offline from a variety of knowledge sources using powerful statistical learning
techniques. Spectral speech features are extracted by signal-processing the audio
input, and the inference engine then computes the most likely word sequence based on the
extracted speech features and the recognition network [1]. Since it is not feasible to
explore the whole recognition network to find the most likely word sequence given the
set of signals from speech, the authors use the beam search heuristic to reduce the problem
space to a feasible size.
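
To make the pruning step concrete, below is a minimal host-side sketch of beam
pruning, assuming the active states are stored as an array of (state id,
log-likelihood) pairs. The names and data layout are illustrative assumptions,
not the paper's actual code.

    #include <vector>
    #include <algorithm>

    // Illustrative beam pruning: drop any hypothesis whose log-likelihood
    // falls more than beamWidth below the best hypothesis in this frame.
    struct ActiveState {
        int   stateId;   // index of the state in the recognition network
        float logLike;   // accumulated log-likelihood of the best path here
    };

    std::vector<ActiveState> beamPrune(const std::vector<ActiveState>& states,
                                       float beamWidth) {
        float best = -1e30f;
        for (const ActiveState& s : states)      // find the best score
            best = std::max(best, s.logLike);

        std::vector<ActiveState> kept;
        for (const ActiveState& s : states)      // keep states inside the beam
            if (s.logLike >= best - beamWidth)
                kept.push_back(s);
        return kept;
    }

A wider beam keeps more hypotheses and improves accuracy at the cost of more
work per frame, which is exactly the accuracy/compute trade-off that motivates
the parallel implementation.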

Parallel Platform:
The application targets NVIDIA G8x-series GPUs, a parallel platform with a SIMD
architecture. Each GPU has an array of Streaming Multiprocessors (SMs), each of which
has 8 scalar processors. The authors used the CUDA programming environment to implement
the parallel version of the algorithm. The application is organized into a sequential
host program and one or more parallel kernels that are invoked from the sequential
program to run on the GPU. The basic unit of parallel execution is the CUDA thread;
a kernel executes a scalar sequential program across a set of threads. The programmer
organizes the threads into thread blocks, which are scheduled onto the SIMD lanes of
one multithreaded SM. Each SM has 16KB of on-chip memory with high bandwidth and low
latency; this memory is shared among the threads in a block. For more information about
CUDA and NVIDIA GPU programming, see [2].
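
As a generic illustration of this organization (not one of the paper's kernels),
the sketch below launches a trivial kernel from the sequential host program; each
thread block is scheduled onto one SM, and the __shared__ buffer lives in that
SM's 16KB on-chip memory.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Trivial kernel: each thread scales one element, staging it through
    // the block's shared (on-chip) memory to illustrate the memory hierarchy.
    __global__ void scaleKernel(float* data, float factor, int n) {
        __shared__ float tile[256];                  // one slot per thread
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? data[i] : 0.0f; // global -> shared
        __syncthreads();
        if (i < n)
            data[i] = tile[threadIdx.x] * factor;     // shared -> global
    }

    int main() {
        const int n = 1024;
        float* d = nullptr;
        cudaMalloc(&d, n * sizeof(float));            // allocate GPU memory
        cudaMemset(d, 0, n * sizeof(float));
        scaleKernel<<<n / 256, 256>>>(d, 2.0f, n);    // host invokes the kernel
        cudaDeviceSynchronize();                      // wait for the GPU
        cudaFree(d);
        printf("kernel finished\n");
        return 0;
    }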
Inference Engine Implementation:
The inference engine implements the beam search algorithm, iterating over the set of
active states to infer the set of next active states in the recognition network graph.
Each iteration begins with a set of active states representing the most likely word
sequences up to the current observation. The first step computes the observation
probabilities of all potential next states, the second step computes the next-state
likelihoods, and the third step selects the most likely next states to retain as the
new active set for the next iteration [1]. Figures in [1] illustrate the HMM graph,
a high-level overview of the iteration process, and a detailed breakdown of the
inference engine's iterations.
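
A sketch of this three-step loop is below. The kernel names and bodies are
placeholder stubs standing in for the paper's real kernels; the point is the
structure: one kernel per step, iterated once per audio frame, with the
intermediate buffers staying resident in GPU memory.

    #include <cuda_runtime.h>

    // Step 1 (stub): the real engine evaluates acoustic models here
    // to score each potential next state against the current frame.
    __global__ void observationProbs(float* obsProb, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) obsProb[i] = 0.0f;
    }

    // Step 2 (stub): combine transition scores with observation scores.
    __global__ void nextStateLikelihoods(float* like, const float* obsProb,
                                         int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) like[i] += obsProb[i];
    }

    // Step 3 (stub): mark states that fall outside the beam as inactive.
    __global__ void pruneToBeam(float* like, float threshold, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && like[i] < threshold) like[i] = -1e30f;
    }

    int main() {
        const int n = 4096, frames = 100;
        float *like, *obsProb;
        cudaMalloc(&like, n * sizeof(float));   // allocated once, stays on GPU
        cudaMalloc(&obsProb, n * sizeof(float));
        cudaMemset(like, 0, n * sizeof(float));

        for (int f = 0; f < frames; ++f) {      // one iteration per audio frame
            observationProbs<<<n / 256, 256>>>(obsProb, n);
            nextStateLikelihoods<<<n / 256, 256>>>(like, obsProb, n);
            pruneToBeam<<<n / 256, 256>>>(like, -200.0f, n);
        }                                       // no CPU<->GPU copies per frame
        cudaDeviceSynchronize();
        cudaFree(like);
        cudaFree(obsProb);
        return 0;
    }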

The major bottleneck in the algorithm is the transfer of data between the CPU and the
GPU; thus the key challenge in using the GPU effectively is to keep all computations and
intermediate results in GPU memory (the loop sketch above reflects this by allocating
buffers once and avoiding copies inside the per-frame loop).
The most significant parallelization potential in this algorithm is data-level
parallelism. In computing next-state likelihoods, for example, the authors parallelize
over the set of active end states and perform the computation for each state in
parallel; in computing observation probabilities, they exploit the embarrassingly
parallel structure of the problem: the probability for each state is computed
independently, in parallel.
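
As an illustration of that embarrassingly parallel structure, the kernel below
scores every state against the current feature vector with one thread per state.
It is a deliberate simplification (a single diagonal-covariance Gaussian per
state rather than the full mixture models, with an assumed parameter layout),
not the paper's implementation.

    #include <cuda_runtime.h>

    #define DIM 39  // feature dimension (a typical MFCC size; an assumption)

    // One thread per state: evaluate a diagonal-covariance Gaussian
    // log-likelihood of the current feature vector under that state's model.
    __global__ void gaussianLogLike(const float* mean,    // [nStates * DIM]
                                    const float* invVar,  // [nStates * DIM]
                                    const float* feat,    // [DIM], this frame
                                    float* logLike,       // [nStates], output
                                    int nStates) {
        int s = blockIdx.x * blockDim.x + threadIdx.x;
        if (s >= nStates) return;
        float acc = 0.0f;
        for (int d = 0; d < DIM; ++d) {
            float diff = feat[d] - mean[s * DIM + d];
            acc += diff * diff * invVar[s * DIM + d];   // Mahalanobis term
        }
        logLike[s] = -0.5f * acc;  // normalization constants omitted
    }

Because no state's score depends on any other state's, this step scales to as
many threads as there are states, which is consistent with it showing the
largest per-kernel speedup in the results below.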
Results:
The application performed very well on the GPU, achieving an overall 9x speedup over
its sequential equivalent, with a 19x speedup in the observation-probability kernel and
11.8x in updating the next-state likelihoods. The key result is that the parallel
version performs about 6.25x better than the sequential version [1]. The two versions
of the algorithm achieve the same accuracy. Per-kernel performance data gathered by
the authors is reported in [1].

Conclusion:
This project represents a significant step in parallelizing automatic speech recognition
on manycore platforms, specifically on NVIDIA GPUs using CUDA. It uses the data-parallel
model to achieve a 9x speedup over the sequential version of the algorithm, illustrating
that there is a large space to explore in improving the performance of speech
recognition applications on such architectures. The next step, which we are currently
working on, is implementing the inference engine using a different recognition network,
the Weighted Finite State Transducer (WFST), which is optimized for state-space size.
We hope to see further speedup and performance improvements from parallelizing speech
recognition using this model.
References:
1. Jike Chong, Youngmin Yi, Nadathur Rajagopalan Satish, and Kurt Keutzer. "Data
Parallel Large Vocabulary Continuous Speech Recognition on Graphics Processing Unit."
Poster, GSRC Annual Symposium, September 29, 2008.
2. NVIDIA CUDA Reference Manual, version 2.0.
http://developer.download.nvidia.com/compute/cuda/2_0/docs/CudaReferenceManual_2.0.pdf
