UNIT I
PARALLEL AND DISTRIBUTED
DATABASES
Inter and Intra Query Parallelism – Architecture – Query
Evaluation – Optimization – Distributed Architecture
– Storage – Catalog Management – Query Processing –
Transactions – Recovery – Large-scale Data Analytics
in the Internet Context – MapReduce Paradigm – Run-time
System for Supporting Scalable and Fault-tolerant
Execution – Paradigms: Pig Latin and Hive, and Parallel
Databases versus MapReduce.
Parallel Database Systems
• Homogeneous working
environment
• Multiple processors
1.3. Objectives
• The primary objective of parallel database processing is to gain
performance improvement
• Two main measures:
– Throughput: the number of tasks that can be completed within a given
time interval
– Response time: the amount of time it takes to complete a single task
from the time it is submitted
• Metrics:
– Speed up
– Scale up
D. Taniar, C.H.C. Leung, W. Rahayu, S. Goel: High-Performance Parallel Database Processing and Grid Databases, John Wiley & Sons, 2008
1.3. Objectives (cont’d)
• Scale up
– Handling of larger tasks by increasing the degree of parallelism
– The ability to process larger tasks in the same amount of time by providing more
resources.
• Linear scale up: the ability to maintain the same level of performance
when both the workload and the resources are proportionally added
• Transactional scale up
• Data scale up
1.3. Objectives (cont’d)
• Transaction scale up
– The increase in the rate at which the transactions are processed
– The size of the database may also increase proportionally to the transactions’
arrival rate
– N-times as many users are submitting N-times as many requests or transactions
against an N-times larger database
– Relevant to transaction processing systems where the transactions are small
updates
• Data scale up
– The increase in the size of the database, where the task is a large job whose
runtime depends on the size of the database (e.g., sorting)
– Typically found in online analytical processing (OLAP)
1.3. Objectives (cont’d)
• Parallel Obstacles
– Start-up and Consolidation costs,
– Interference and Communication, and
– Skew
1.3. Objectives (cont’d)
• Skew
– Unevenness of workload
– Load balancing is one of the critical factors to achieve linear speed up
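The cost of skew can be sketched in a few lines of Python (an illustrative example, not from the text, with invented partition sizes): under partitioned parallelism, elapsed time is governed by the slowest partition, so an uneven workload forfeits most of the speed-up.

```python
# Illustrative sketch: the elapsed time of a partitioned job equals the
# time taken by the *largest* partition, so skew wastes parallel resources.

def elapsed_time(partition_sizes, cost_per_record=1.0):
    """Parallel elapsed time = cost of the largest partition."""
    return max(partition_sizes) * cost_per_record

balanced = [250, 250, 250, 250]   # 1000 records, evenly spread over 4 processors
skewed   = [700, 100, 100, 100]   # 1000 records, one hot partition

print(elapsed_time(balanced))  # 250.0 -> near-linear 4x speed-up
print(elapsed_time(skewed))    # 700.0 -> barely 1.4x speed-up
```

Both runs process the same 1000 records on 4 processors, yet the skewed layout takes almost three times as long, which is why load balancing is critical for linear speed-up.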
1.4. Forms of Parallelism
• Forms of parallelism for database processing:
– Interquery parallelism
– Intraquery parallelism
– Interoperation parallelism
– Intraoperation parallelism
– Mixed parallelism
1.4. Forms of Parallelism (cont’d)
• Interquery Parallelism
– “Parallelism among queries”
– Different queries or transactions are executed in parallel with one another
– Main aim: scaling up transaction processing systems
1.4. Forms of Parallelism (cont’d)
• Intraquery Parallelism
– “Parallelism within a query”
– Execution of a single query in parallel on multiple processors and disks
– Main aim: speeding up long-running queries
D. Taniar, C.H.C. Leung, W. Rahayu, S. Goel: High-Performance Parallel Database Processing and Grid Databases, John Wiley & Sons, 2008
1.4. Forms of Parallelism (cont’d)
• Intraoperation Parallelism
– “Partitioned parallelism”
– Parallelism due to the data
being partitioned
– Since the number of records
in a table can be large, the
degree of parallelism is
potentially enormous
1.4. Forms of Parallelism (cont’d)
• Interoperation Parallelism
– Pipeline parallelism
– Independent parallelism
1.4. Forms of Parallelism (cont’d)
• Pipeline Parallelism
– Output records of one operation A are consumed by a second operation B, even
before the first operation has produced the entire set of records in its output
1.4. Forms of Parallelism (cont’d)
• Independent Parallelism
– Operations in a query that do
not depend on one another are
executed in parallel
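Independent parallelism can be sketched with a thread pool (a hedged illustration; the table names and the trivial stand-in scan are invented): two operations that do not depend on each other, such as the scans feeding the two inputs of a join, run concurrently.

```python
# Sketch of independent parallelism: two mutually independent scans are
# submitted to a pool and execute concurrently.
from concurrent.futures import ThreadPoolExecutor

def scan_table(rows):
    return [r for r in rows]  # stand-in for a real table scan

orders    = [1, 2, 3]
customers = [10, 20]

with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(scan_table, orders)     # independent operation 1
    f2 = pool.submit(scan_table, customers)  # independent operation 2
    print(f1.result(), f2.result())  # [1, 2, 3] [10, 20]
```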
1.4. Forms of Parallelism (cont’d)
• Mixed Parallelism
– In practice, a mixture of all available parallelism forms is used.
Parallel Databases – Why?
Parallel Databases – The Philosophy
Parallel Databases – The Implementation
Parallel Database Implementation – The Basic Techniques
Shared Nothing
Parallel Dataflow Approach To SQL Software
Parallel Databases – The Future
Research Problems
Parallel Query Optimization
Application Program Parallelism
Physical Database Design
On-line Data Reorganization and Utilities
Future Directions
Many commercial success stories.
But research issues still remain unresolved.
Some applications are not well supported by the
relational data model.
Object-oriented design?
Introduction
• Parallel machines are becoming quite
common and affordable
– Prices of microprocessors, memory and disks
have dropped sharply
• Databases are growing increasingly large
– large volumes of transaction data are collected
and stored for later analysis.
– multimedia objects like images are increasingly
stored in databases
Introduction
• Large-scale parallel database systems are increasingly used
1.5. Parallel Database Architectures
• Shared-memory architecture
• Shared-disk architecture
• Shared-nothing architecture
• Shared-something architecture
1.5. Parallel Database Architectures (cont’d)
• Shared-Memory and Shared-Disk Architectures
– Shared-Memory: all processors share a common main
memory and secondary memory
– Load balancing is relatively easy to achieve, but suffer
from memory and bus contention
– Shared-Disk: all processors, each of which has its own
local main memory, share the disks
1.5. Parallel Database Architectures (cont’d)
• Shared-Nothing Architecture
– Each processor has its own local main memory and
disks
– Load balancing becomes difficult
1.5. Parallel Database Architectures (cont’d)
• Shared-Something Architecture
– A mixture of shared-memory and shared-nothing architectures
– Each node is a shared-memory system, and the nodes are connected by an
interconnection network, as in a shared-nothing architecture
1.5. Parallel Database Architectures (cont’d)
• Interconnection Networks
– Bus, Mesh, Hypercube
Parallel System Performance Measures
• Speedup = small system elapsed time / large system elapsed time
• Scaleup = small system elapsed time on small problem / large system elapsed
time on large problem
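The two ratios above can be sketched directly in Python (the timing numbers are invented for illustration): speed-up compares the same problem on small vs. large systems, while scale-up compares a small problem on a small system with a proportionally larger problem on a proportionally larger system.

```python
# Sketch of the two performance ratios defined above.

def speedup(small_sys_time, large_sys_time):
    """Same problem: small-system time / large-system time."""
    return small_sys_time / large_sys_time

def scaleup(small_sys_small_problem_time, large_sys_large_problem_time):
    """Problem grows with the system: ratio of the two elapsed times."""
    return small_sys_small_problem_time / large_sys_large_problem_time

print(speedup(100.0, 25.0))   # 4.0 -> linear speed-up on 4x resources
print(scaleup(100.0, 100.0))  # 1.0 -> linear scale-up
```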
Data Partitioning: Range, Round-Robin, and Hash Partitioning
(figure: tuples distributed across disks under each scheme)
Round-Robin
Ideal for applications that wish to read the entire relation
sequentially for each query.
Not ideal for point and range queries, since each of the n disks
must be searched.
Hash
Ideal for point queries based on the partitioning attribute.
Ideal for sequential scans of the entire relation.
Not ideal for point queries on non-partitioning attributes.
Not ideal for range queries on the partitioning attribute.
Range
Ideal for point and range queries on the partitioning attribute.
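The three strategies can be sketched in Python (an illustrative example; the keys, boundaries, and the representation of disks as lists are invented), distributing tuples over n partitions:

```python
# Minimal sketches of round-robin, hash, and range partitioning.

def round_robin(tuples, n):
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)        # i-th tuple goes to disk i mod n
    return disks

def hash_partition(tuples, n, key=lambda t: t):
    disks = [[] for _ in range(n)]
    for t in tuples:
        disks[hash(key(t)) % n].append(t)  # same key value -> same disk
    return disks

def range_partition(tuples, boundaries, key=lambda t: t):
    # boundaries like [10, 20] give 3 partitions: <10, 10..19, >=20
    disks = [[] for _ in range(len(boundaries) + 1)]
    for t in tuples:
        i = sum(key(t) >= b for b in boundaries)
        disks[i].append(t)
    return disks

data = [3, 15, 27, 8, 22, 11]
print(range_partition(data, [10, 20]))  # [[3, 8], [15, 11], [27, 22]]
```

A point query on the partitioning attribute only needs to touch one disk under hash or range partitioning, while round-robin must search all of them, matching the suitability notes above.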
Handling Of Skew
The distribution of tuples when a relation
is partitioned (except for Round-Robin) may
be skewed, with a high percentage of tuples
placed in some partitions and fewer tuples in
other partitions.
Two kinds:
Data Skew (Attribute-value Skew)
Execution Skew (Partition Skew)
Parallelism With Relational Operators
Teradata
Tandem NonStop SQL
Gamma
The Super Database Computer
Bubba
nCUBE
Skew
• The distribution of tuples to disks may be
skewed
– Attribute-value skew.
• Some values appear in the partitioning attributes of
many tuples
– Partition skew.
• Too many tuples to some partitions and too few to
others
• Round-robin handles skew well
• Hash and range partitioning may result in skew
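Attribute-value skew under hash partitioning can be demonstrated in a few lines (the attribute values and their frequencies are invented): every tuple with the same value hashes to the same partition, so one hot value overloads a partition while round-robin stays balanced.

```python
# Sketch of attribute-value skew: a "hot" attribute value concentrates
# tuples in one hash partition; round-robin spreads tuples evenly.
from collections import Counter

values = ["US"] * 70 + ["FR"] * 15 + ["JP"] * 15   # 100 tuples, skewed attribute

def hash_counts(vals, n):
    c = Counter(hash(v) % n for v in vals)          # same value -> same partition
    return sorted(c.values(), reverse=True)

def round_robin_counts(vals, n):
    c = Counter(i % n for i in range(len(vals)))    # position decides partition
    return sorted(c.values(), reverse=True)

print(round_robin_counts(values, 4))  # [25, 25, 25, 25]
print(hash_counts(values, 4))         # one partition holds at least 70 tuples
```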
Yan Huang - CSCI5330 Database
04/25/2005
Implementation – Parallel Database
Typical Database Query Types
• Sequential scan
• Point query
• Range query
Point Query: difficult for Round-Robin, good for Hash on the key, good for
Range via the range vector
(figure: scanning 1 Terabyte at 10 MB/s)
Parallelism: divide a big problem into many smaller ones to be solved in
parallel.
Parallel DBMS: Intro
• Parallelism is natural to DBMS processing
– Pipelined parallelism: many machines each
doing one step in a multi-step process.
– Partitioned parallelism: many machines doing
the same thing to different pieces of data.
– Both are natural in DBMS!
(figure: any sequential program can feed another through a pipeline, or be
replicated over partitioned inputs, with outputs split N ways and inputs
merged M ways)
DBMS: The || Success Story
• For a long time, DBMSs were the most (only?!)
successful/commercial application of parallelism.
– Teradata, Tandem vs. Thinking Machines, KSR.
– Every major DBMS vendor has some || server.
– (Of course, we also have Web search engines now.)
• Reasons for success:
– Set-oriented processing (= partition ||-ism).
– Natural pipelining (relational operators/trees).
– Inexpensive hardware can do the trick!
– Users/app-programmers don’t need to think in ||.
Some || Terminology
• Speed-Up
– Adding more resources results in proportionally less running time for a
fixed amount of data.
– (figure: ideal speed-up; throughput in Xact/sec. grows linearly with the
degree of ||-ism)
• Scale-Up
– If resources are increased in proportion to an increase in data/problem
size, the overall time should remain constant.
– (figure: ideal scale-up; response time in sec./Xact stays flat as the
degree of ||-ism grows)
Architecture Issue: Shared What?
Shared Memory (SMP) – Shared Disk – Shared Nothing (network)
(figure: clients, processors, memory, and disks arranged differently under
each architecture; examples include Informix at 9 nodes and RedBrick at ?
nodes)
Different Types of DBMS ||-ism
• Intra-operator parallelism
– get all machines working together to compute a
given operation (scan, sort, join)
• Inter-operator parallelism
– each operator may run concurrently on a different
site (exploits pipelining)
• Inter-query parallelism
– different queries run on different sites
• We’ll focus mainly on intra-operator ||-ism
Automatic Data Partitioning
Partitioning a table:
Range Hash Round Robin
(figure: a table partitioned across five disks by Range (A...E, F...J, K...N,
O...S, T...Z), by a hash function, and by Round Robin)
• Basic idea (parallel sorting):
– Scan in parallel, range-partition as you go.
– As tuples arrive, perform “local” sorting.
– Resulting data is sorted and range-partitioned
(i.e., spread across system in known way).
– Problem: skew!
– Solution: “sample” the data at the outset to
determine good range partition points.
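The steps above can be sketched in Python (an illustrative toy; the sample size heuristic and list-based "sites" are invented): sample to pick partition points, range-partition as the data is scanned, then sort each partition "locally". Concatenating the partitions yields globally sorted, range-partitioned output.

```python
# Sketch of sample-based parallel sorting: sample -> pick range partition
# points -> range-partition on the fly -> local sort per partition.
import random

def parallel_sort(data, n_parts):
    sample = sorted(random.sample(data, min(len(data), 4 * n_parts)))
    step = max(1, len(sample) // n_parts)
    cuts = sample[step::step][:n_parts - 1]      # partition points from the sample
    parts = [[] for _ in range(len(cuts) + 1)]
    for x in data:                               # range-partition as we scan
        parts[sum(x >= c for c in cuts)].append(x)
    return [sorted(p) for p in parts]            # "local" sorts, one per site

data = [9, 1, 7, 3, 8, 2, 6, 4, 5, 0]
parts = parallel_sort(data, 3)
flat = [x for p in parts for x in p]
print(flat == sorted(data))  # True
```

Sampling addresses the skew problem noted above: cut points drawn from the data itself tend to give partitions of similar size.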
Parallel Aggregation
• For each aggregate function, need a
decomposition:
– count(S) = Σᵢ count(sᵢ), ditto for sum()
– avg(S) = (Σᵢ sum(sᵢ)) / (Σᵢ count(sᵢ))
– and so on...
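The decomposition above can be sketched in a few lines (the partition contents are invented): each site computes a partial (sum, count) over its own partition, and the coordinator combines the partials, so avg(S) never requires shipping the raw tuples.

```python
# Sketch of parallel aggregation via decomposable partials:
# avg(S) = (sum of partial sums) / (sum of partial counts).

partitions = [[4, 8], [6], [2, 10]]   # S spread over three sites

partials = [(sum(p), len(p)) for p in partitions]  # local (sum, count) per site
total_sum = sum(s for s, _ in partials)
total_cnt = sum(c for _, c in partials)

print(total_cnt)              # count(S) = 5
print(total_sum / total_cnt)  # avg(S)   = 6.0
```

Note that avg itself is not directly combinable (averaging the three local averages would be wrong); it must be decomposed into sum and count, which are.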
Observations
• It is relatively easy to build a fast parallel
query executor.
– S.M.O.P., well understood today.
• It is hard to write a robust and world-class
parallel query optimizer.
– There are many tricks.
– One quickly hits the complexity barrier.
– Many resources to consider simultaneously (CPU,
disk, memory, network).
Parallel Query Optimization