
AWS REDSHIFT

INTRODUCTION
Big Data Challenges

I/O: A major bottleneck in handling big data is disk access; DRAM can be around 400,000 times faster than hard disks.

Row vs. Column Storage

Column Storage

• Calculations are typically executed on a small number of columns.
• The table is searched based on the values of a few columns.
• The table has a large number of columns.
• The table has a large number of rows and columnar operations are required (aggregate, scan, and so on).
• Compression: the majority of the columns contain only a few distinct values (compared to the number of rows).
• Elimination of additional indexes: storing data in columns already works like having a built-in index for each column.
• Operations on single columns, such as searching or aggregations, can be implemented as loops over an array stored in contiguous memory locations.
• Operations on different columns can easily be processed in parallel.
• OLAP/DW: highly complex queries over all the data.

Row Store

• The application needs to process only one single record at a time (many selects and/or updates of single records).
• The application typically needs to access the complete record.
• The columns contain mainly distinct values, so the compression rate would be low.
• Neither aggregations nor fast searching are required.
• The table has a small number of rows (for example, configuration tables).
• A single-row query can retrieve the row in a single disk read, whereas a columnar database needs numerous disk operations to collect the data from multiple columns.
• OLTP

Row vs. columnar storage OLAP query performance can be compared using the Star Schema Benchmark (SSBM).

Earlier workarounds included key-value storage formats, indexing every column, or creating a view for every query.

MPP
• MPP databases use multi-core processors, multiple processors and servers, and storage appliances equipped for parallel processing.
• This combination enables reading many pieces of data across many processing units at the same time, for enhanced speed.
• Specialized hardware is tuned for CPU, storage, and network performance.
• This approach is necessary because processor frequencies are hitting the limits of the underlying technologies and are slow to increase.
• In an MPP architecture, all operations are executed with as much parallelism as possible. Both data loading and data querying are executed in parallel among the nodes, which allows the cluster to scale in a nearly linear way by adding new nodes.

REDSHIFT FEATURES
1. Column Storage.
2. Compression: column storage makes similar data sit together, which improves compression.
3. Zone Maps: keep track of the minimum and maximum value for each block. This allows skipping blocks that do not contain the searched data, avoiding unnecessary I/O (see the sketch after this list).
4. Direct Attached Storage: this optimizes I/O.
5. Large Data Block Size: 1 MB (typical database block sizes range from 2 KB to 32 KB).
6. Shared-nothing architecture: the storage is tied to the individual processor cores of the nodes. With Oracle, by contrast, shared storage (SAN or local disk) is attached to a pool of processors (a single machine or a cluster).
7. Data can be distributed between node slices, so each processor can process a slice of a table in parallel.
8. Support for Avro and JSON data.
9. Data load is optimized for S3 and DynamoDB by using the COPY command.
10. Result sets come out via JDBC/ODBC (don't do this for anything bigger than a few thousand rows) or are shipped out to S3.
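
A hedged illustration of zone maps (feature 3): with a hypothetical table sorted on a date column, a range predicate lets Redshift skip every 1 MB block whose stored min/max range falls outside the filter. All names below are illustrative.

    -- Hypothetical table; the sort key drives the per-block min/max zone maps.
    CREATE TABLE events (
        event_id   BIGINT,
        event_date DATE,
        payload    VARCHAR(256)
    )
    SORTKEY (event_date);

    -- Only blocks whose [min, max] event_date range overlaps the predicate
    -- are read from disk; all other blocks are skipped.
    SELECT COUNT(*)
    FROM events
    WHERE event_date BETWEEN '2016-01-01' AND '2016-01-31';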

WHEN NOT TO USE


1. Update-intensive use cases. Where Redshift performs less well is when the process uses certain kinds of ETL steps, particularly those that involve updating rows or single-row activities (a common workaround is sketched below).
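
A common workaround, sketched here with assumed table names rather than as a definitive recipe, is to batch changes into a staging table and apply them as a bulk delete-then-insert instead of row-by-row updates:

    -- Stage the changed rows, then apply them in two set-based operations.
    BEGIN;

    DELETE FROM users
    USING users_staging
    WHERE users.user_id = users_staging.user_id;

    INSERT INTO users
    SELECT * FROM users_staging;

    COMMIT;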

LIMITATIONS
• To implement a sequential number, you need to write your own custom code (see the sketch after this list).
• No stored procedures are supported.
• Procedural control-flow constructs such as IF, DO, and WHILE are not supported.
• Views do not pass through parameters, which is a potential performance problem.
• Uniqueness is not enforced. This means you'll have to be very diligent about data hygiene. If you're running distributed systems that write to Redshift, you'll probably have to use some caching system like Redis to check whether you've already written the data to the database.
• Very fast, but not fast enough for most web apps. You might need a caching layer in front.
• You can't alter the data type of a column in a table after it has been set.
• Your data must be flat, in a CSV/TSV/*SV format. No nested structures.
• Your data must be loaded from S3 or DynamoDB.
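
For the first limitation, a minimal sketch with illustrative table and column names. A window function assigns numbers at query time; IDENTITY columns also exist, but they do not guarantee gap-free, strictly sequential values, which is why custom numbering code is often needed.

    -- IDENTITY gives auto-incrementing (but not necessarily gap-free) values.
    CREATE TABLE signups (
        signup_id  BIGINT IDENTITY(1, 1),
        user_id    INTEGER,
        created_at TIMESTAMP
    );

    -- A window function produces a deterministic sequential number at read time.
    SELECT ROW_NUMBER() OVER (ORDER BY created_at) AS seq_num,
           user_id
    FROM signups;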

PARTITIONING
• By default the data distribution style is EVEN, that is, data is distributed between node slices in a round-robin fashion.
• A distribution key column can be defined to control how data is distributed (see the sketch below).
• Only ONE distribution key is supported; multi-column distribution keys are not supported.
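
A minimal sketch of the two distribution choices described above, with hypothetical table names:

    -- EVEN (the default): rows are spread round-robin across node slices.
    CREATE TABLE page_views (
        view_id   BIGINT,
        user_id   INTEGER,
        viewed_at TIMESTAMP
    )
    DISTSTYLE EVEN;

    -- KEY: rows with the same user_id are stored on the same slice,
    -- which co-locates them for joins on user_id.
    CREATE TABLE page_views_by_user (
        view_id   BIGINT,
        user_id   INTEGER,
        viewed_at TIMESTAMP
    )
    DISTSTYLE KEY
    DISTKEY (user_id);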

PERFORMANCE OPTIMIZATIONS
• SORTKEY: specifies one or more columns of the table by which the data is ordered on data load (it can be the same column as the distribution key). Redshift maintains the minimum and maximum values of the sort key in each database block and, at query time, uses this information to skip blocks that do not contain data of interest. Multi-column sort keys are supported.
• Data partitioning minimizes data transfer between nodes and data read from disk. For a FACT + DIMENSIONS data model (such as in the performance layer of Oracle's Reference Data Warehouse Architecture) it would be appropriate to distribute data on the dimension key of the largest dimension, on both the dimension and the fact tables; this reduces the amount of data being moved between slices to facilitate joins (see the sketch after this list).
• Adding primary and foreign keys to the tables tells the optimizer about the data relationships and thus improves the quality of the query plan.
• Include both the distribution keys and the sort keys in any query. The presence of these keys helps the optimizer access the tables in an efficient way.
• For the best data-load performance, insert rows in bulk and in sort key order. Amazon claims the best performance comes from using the rich COPY command to load from flat files, and second best from bulk-insert SQL commands such as CTAS and INSERT INTO T1 (SELECT * FROM T2).
• Use the smallest-width columns you can, as Redshift is more performant at scale when columns are optimally sized. This is different from Postgres, where unbounded VARCHAR columns are faster than fixed-length VARCHAR columns.
• VACUUM: used to clean up and reorganize tables.
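
A sketch of the FACT + DIMENSION advice and the preferred load paths; the table names, S3 bucket, and IAM role are all assumptions for illustration.

    -- Distribute both tables on the join key so matching rows are co-located,
    -- and sort the fact table so range scans can skip blocks.
    CREATE TABLE dim_customer (
        customer_id INTEGER,
        region      VARCHAR(32)
    )
    DISTKEY (customer_id)
    SORTKEY (customer_id);

    CREATE TABLE fact_sales (
        sale_id     BIGINT,
        customer_id INTEGER,
        sale_date   DATE,
        amount      DECIMAL(12, 2)
    )
    DISTKEY (customer_id)
    SORTKEY (sale_date);

    -- Fastest load path: COPY from flat files on S3, ideally pre-sorted
    -- by the sort key (sale_date).
    COPY fact_sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    DELIMITER '|'
    GZIP;

    -- Second best: bulk-insert SQL such as CTAS.
    CREATE TABLE fact_sales_2016 AS
    (SELECT * FROM fact_sales WHERE sale_date >= '2016-01-01');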

I/O REDUCTION
• Direct Attached Storage
• Large Data Block Sizes
• Columnar Storage
• Data Compression
• Zone Maps

COMPRESSION
The default compression is RAW (i.e. uncompressed).

Compression encodings may be:

• data-block based (DELTA, BYTE-DICTIONARY, RUN LENGTH, TEXT255, and TEXT32K)
• value based (LZO and the MOSTLY encodings)

This sounds daunting, but there are two ways to get compression suggestions from the database: running the ANALYZE COMPRESSION command on a loaded table, or letting the COPY command apply automatic compression on an initial load (both are sketched below).
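
Both suggestion paths, sketched using the hypothetical fact_sales table, bucket, and IAM role from earlier; encodings can also be declared explicitly with ENCODE.

    -- 1. Ask Redshift to recommend encodings for an already-loaded table.
    ANALYZE COMPRESSION fact_sales;

    -- 2. Let COPY pick encodings automatically on the first load into an
    --    empty table (COMPUPDATE is on by default for empty tables).
    COPY fact_sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    COMPUPDATE ON;

    -- Encodings can also be set explicitly per column.
    CREATE TABLE fact_sales_encoded (
        sale_id BIGINT      ENCODE delta,
        region  VARCHAR(32) ENCODE bytedict
    );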

Improving Compression Techniques


• Sorting rows can also improve compression. It is better to use low-cardinality columns as the first sort keys, e.g. sex (2 distinct values), then age (~100), then name (~10,000); see the sketch below.
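
A sketch of that recommendation with a hypothetical table: sorting on sex, then age, then name produces long runs of repeated leading values that encode compactly, e.g. with run-length encoding.

    CREATE TABLE people (
        sex  CHAR(1)      ENCODE runlength,  -- 2 distinct values: long runs
        age  SMALLINT,
        name VARCHAR(100)
    )
    COMPOUND SORTKEY (sex, age, name);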

WORKLOAD MANAGEMENT (WLM)


WLM enables users to flexibly manage priorities within workloads so that short, fast-running queries won't get stuck in queues behind long-running queries.

WLM assigns a query to a queue according to:

• the user's user group, or
• a query group listed in the queue configuration that matches a query-group label the user sets at runtime (see the sketch after this section).

Queries from different queues do not block each other.

By default, Amazon Redshift configures one queue with a concurrency level of five, which enables up to
five queries to run concurrently, plus one predefined Superuser queue, with a concurrency level of one.
You can define up to eight queues. Each queue can be configured with a maximum concurrency level
of 50. The maximum total concurrency level for all user-defined queues (not including the Superuser
queue) is 50.

You can also configure the amount of memory allocated to each queue, so that large queries run in
queues with more memory than other queues. You can also configure the WLM timeout property to
limit long-running queries.
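
A minimal sketch of routing by query group at runtime; 'reports' is assumed to be a query group listed in the cluster's WLM queue configuration.

    -- All queries in this session are routed to the queue whose
    -- configuration lists the 'reports' query group.
    SET query_group TO 'reports';

    SELECT COUNT(*) FROM fact_sales;

    -- Revert to default queue assignment.
    RESET query_group;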

SCALABILITY
Redshift is horizontally scalable: need it to go faster? Just add more nodes.

COMPONENTS

Overview

• Leader Node: receives queries from client applications, parses them, and develops the execution plans; it then coordinates the parallel execution across the compute nodes and aggregates the final results.
• Compute Nodes: execute the query plan segments and return intermediate results to the leader node. Each compute node is divided into slices, and each slice processes its portion of the data in parallel.

SECURITY
Encryption at rest: When you enable encryption for a cluster, the data blocks and
system metadata are encrypted for the cluster and its snapshots.

PostgreSQL
Amazon Redshift is based on PostgreSQL 8.0.2.
Only the 8.x version of the PostgreSQL query tool psql is supported.

Differences
• CREATE TABLE
  Amazon Redshift does not support tablespaces, table partitioning, inheritance, and certain constraints. The Amazon Redshift implementation of CREATE TABLE enables you to define the sort and distribution algorithms for tables to optimize parallel processing.
• ALTER TABLE
  ALTER COLUMN actions are not supported. ADD COLUMN supports adding only one column in each ALTER TABLE statement.
• COPY
  The Amazon Redshift COPY command is highly specialized to enable the loading of data from Amazon S3 buckets and Amazon DynamoDB tables and to facilitate automatic compression. See the Loading Data section and the COPY command reference for details.
• INSERT, UPDATE, and DELETE
  WITH is not supported.
• VACUUM
  The parameters for VACUUM are entirely different. For example, the default VACUUM operation in PostgreSQL simply reclaims space and makes it available for reuse, whereas the default VACUUM operation in Amazon Redshift is VACUUM FULL, which reclaims disk space and re-sorts all rows (see the sketch after this list).
• Trailing spaces in VARCHAR values are ignored when string values are compared. For more information, see Significance of Trailing Blanks.
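
Two of these differences, sketched with the illustrative fact_sales table:

    -- In Redshift, a bare VACUUM defaults to VACUUM FULL: it reclaims
    -- disk space AND re-sorts the rows.
    VACUUM fact_sales;

    -- Per the trailing-blanks rule above, this comparison evaluates to true.
    SELECT 'abc' = 'abc   ' AS strings_equal;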

Some PostgreSQL features that are suited to smaller-scale OLTP processing, such as secondary
indexes and efficient single-row data manipulation operations, have been omitted to improve
performance.

Wire protocol: PostgreSQL uses a message-based protocol for communication between frontends and backends (clients and servers). The protocol is supported over TCP/IP and also over Unix-domain sockets, on port 5432 by default (Amazon Redshift clusters listen on port 5439 by default). In order to serve multiple clients efficiently, the server launches a new "backend" process for each client.

DATA TYPES
Data Type         Aliases                              Description
SMALLINT          INT2                                 Signed two-byte integer
INTEGER           INT, INT4                            Signed four-byte integer
BIGINT            INT8                                 Signed eight-byte integer
DECIMAL           NUMERIC                              Exact numeric of selectable precision
REAL              FLOAT4                               Single-precision floating-point number
DOUBLE PRECISION  FLOAT8, FLOAT                        Double-precision floating-point number
BOOLEAN           BOOL                                 Logical Boolean (true/false)
CHAR              CHARACTER, NCHAR, BPCHAR             Fixed-length character string
VARCHAR           CHARACTER VARYING, NVARCHAR, TEXT    Variable-length character string with a user-defined limit
DATE                                                   Calendar date (year, month, day)
TIMESTAMP         TIMESTAMP WITHOUT TIME ZONE          Date and time (without time zone)
TIMESTAMPTZ       TIMESTAMP WITH TIME ZONE             Date and time (with time zone)

BACKUPS
Automatic snapshots eliminate the need for managing backups. Database maintenance strategies stay simple with the VACUUM and ANALYZE commands.
