
Overview of Cloudera Impala

Objectives

After completing this lesson, you should be able to:

• Describe the features of Cloudera Impala
• Explain how Impala works with Hive, HDFS, and HBase

Hadoop: Some Data Access/Processing Options

Component  Purpose
Hive       Puts a partial SQL interface in front of Hadoop. Includes a
           metadata repository called the Metastore.
Pig        A high-level data-flow scripting language (Pig Latin) for
           MapReduce programming; scripts are compiled into MapReduce jobs.
HBase      Provides a column-family-oriented data store on top of Hadoop.
Impala     A database-like SQL layer on top of Hadoop.

Cloudera Impala

• The Impala server is a distributed, massively parallel
processing (MPP) database engine.
• It consists of different daemon processes that run on
specific hosts within your CDH cluster.
• The core Impala component is a daemon process that runs
on each node of the cluster.
• SQL is the primary development language.
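
The MPP idea in the bullets above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Impala's implementation: the node names, the sample rows, and the functions `local_aggregate` and `coordinate` are all hypothetical. Each "daemon" aggregates only the slice of data on its own node, and a coordinator merges the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data layout: each "node" holds one slice of a sales table
# as (region, amount) rows. Names and values are illustrative only.
NODE_DATA = {
    "node1": [("east", 10), ("west", 5)],
    "node2": [("east", 7), ("north", 3)],
    "node3": [("west", 8), ("north", 2)],
}

def local_aggregate(rows):
    """Work done by one per-node daemon: aggregate only the local slice."""
    partial = {}
    for region, amount in rows:
        partial[region] = partial.get(region, 0) + amount
    return partial

def coordinate(node_data):
    """Coordinator: fan the work out to every node, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(local_aggregate, node_data.values()))
    merged = {}
    for partial in partials:
        for region, subtotal in partial.items():
            merged[region] = merged.get(region, 0) + subtotal
    return merged

totals = coordinate(NODE_DATA)
print(sorted(totals.items()))  # [('east', 17), ('north', 5), ('west', 13)]
```

The point of the sketch is the shape of the computation (local work in parallel, then a cheap merge), which is roughly what a `SELECT region, SUM(amount) ... GROUP BY region` would need in any MPP engine.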

Cloudera Impala: Key Features

• Open source and Apache-licensed
• MPP architecture
• Interactive analysis on data stored in HDFS and HBase
• Incorporates native Hadoop security
• Provides ANSI SQL support
• Shares workload management with Apache Hadoop
• Supports common Hadoop file formats

Cloudera Impala: Programming Interfaces

You can connect and submit requests to the Impala daemons
through:
• The impala-shell interactive command interpreter
• The Hue web-based user interface
• JDBC and ODBC
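
As a rough illustration of the first interface, the sketch below builds a one-shot impala-shell command line from Python. The helper name `build_impala_shell_command`, the host name, and the table name are hypothetical. The `-i` (daemon to connect to) and `-q` (query to run) options are standard impala-shell options; the port value 21000 is an assumption and should be checked against your cluster's configuration.

```python
import shlex

def build_impala_shell_command(host, query, port=21000):
    """Return the argv list for a one-shot impala-shell query.

    -i selects the Impala daemon to connect to; -q passes a single query.
    The default port here is an assumption; confirm it for your cluster.
    """
    return ["impala-shell", "-i", f"{host}:{port}", "-q", query]

# Hypothetical host and table names, for illustration only.
cmd = build_impala_shell_command(
    "impalad-host.example.com", "SELECT COUNT(*) FROM web_logs"
)
print(shlex.join(cmd))
```

On a machine where impala-shell is installed, the list could be passed to `subprocess.run(cmd)`; building it as a list rather than a single string avoids shell-quoting problems in the query text.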

How Impala Fits Into the Hadoop Ecosystem

Makes use of components within the Hadoop ecosystem:

• Provides a SQL layer on Hadoop
• May interchange data with other Hadoop components
• Can assist in ETL processes

How Impala Works

Impala does not use MapReduce; it runs each query with its own
daemon processes. It sits directly on top of the Hadoop
Distributed File System (HDFS), which it uses only to store the
data. For that reason it is often described simply as
“SQL on HDFS”.

Hive, by contrast, runs on top of Hadoop as a whole, which
includes both HDFS and MapReduce. Executing a Hive query
launches a series of MapReduce jobs that run until the
results are produced.

Because Impala does not have to translate a SQL query into
another processing framework such as map/shuffle/reduce, it
avoids the latencies that those operations impose. This makes
Impala much faster than Hive on performance benchmarks.
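
The difference can be illustrated with a toy model; it is not a benchmark and does not reproduce either engine. The sample rows and the functions `staged_query` and `pipelined_query` are hypothetical. The staged version finishes and materializes each stage before the next one starts (standing in for intermediate results written out between MapReduce jobs), while the pipelined version streams rows through all operators.

```python
# Toy rows: (page, status) pairs; values are made up for illustration.
ROWS = [("home", 200), ("cart", 500), ("home", 200), ("cart", 200), ("home", 404)]

def staged_query(rows):
    """MapReduce-style: every stage materializes its full output first."""
    materialized = 0
    ok_rows = [r for r in rows if r[1] == 200]   # stage 1: filter
    materialized += 1
    pages = [page for page, _ in ok_rows]        # stage 2: project
    materialized += 1
    counts = {}
    for page in pages:                           # stage 3: aggregate
        counts[page] = counts.get(page, 0) + 1
    return counts, materialized

def pipelined_query(rows):
    """Pipelined: rows stream through filter and projection one at a time."""
    ok_rows = (r for r in rows if r[1] == 200)   # lazy generator, no copy
    pages = (page for page, _ in ok_rows)        # lazy generator, no copy
    counts = {}
    for page in pages:
        counts[page] = counts.get(page, 0) + 1
    return counts, 0

print(staged_query(ROWS))     # ({'home': 2, 'cart': 1}, 2)
print(pipelined_query(ROWS))  # ({'home': 2, 'cart': 1}, 0)
```

Both functions return the same counts; only the number of intermediate results that must be fully built differs, which is the kind of overhead the paragraph above refers to.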
How Impala Works with Hive

• Uses existing Hive infrastructure
• Stores its table definitions in the Hive Metastore
• Accesses Hive tables
• Focuses on query performance
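
The shared-Metastore arrangement can be pictured with a small sketch. The classes `Metastore` and `Engine`, the table name, and the path are hypothetical stand-ins, not real Hive or Impala APIs. The point is only that neither engine keeps its own catalog, so a table defined once is visible to both.

```python
class Metastore:
    """Toy stand-in for the Hive Metastore: a shared catalog of table definitions."""

    def __init__(self):
        self._tables = {}

    def create_table(self, name, columns, location):
        self._tables[name] = {"columns": columns, "location": location}

    def get_table(self, name):
        return self._tables[name]


class Engine:
    """Hypothetical query engine that resolves tables through the shared catalog."""

    def __init__(self, label, metastore):
        self.label = label
        self.metastore = metastore

    def describe(self, table):
        return self.metastore.get_table(table)["columns"]


metastore = Metastore()
hive = Engine("hive", metastore)
impala = Engine("impala", metastore)

# Define the table once (hypothetical name, schema, and path).
metastore.create_table(
    "web_logs", [("page", "string"), ("status", "int")], "/data/web_logs"
)

# Both engines see the same definition because they share one catalog.
print(hive.describe("web_logs") == impala.describe("web_logs"))  # True
```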

How Impala Works with HDFS and HBase

• HDFS
– Impala’s primary storage mechanism
– Data stored as data files
• HBase
– An alternative to HDFS for storing Impala data
– Impala table definitions can be mapped to HBase tables

Summary of Cloudera Impala Benefits

• MPP performance (uses its own MPP query engine)
• Cost savings
• Analysis of raw and historical data
• Security
