Study Guide
CONTENTS
About MapR Study Guides
Rescheduling
MapR certification study guides are intended to help you prepare for certification by
providing additional study resources, sample questions, and details about how to
take the exam. The study guide by itself is not enough to prepare you for the exam.
You'll need training, practice, and experience. The study guide will point you in the
right direction and help you get ready.
If you use all the resources in this guide, and spend 6-12 months on your own using
the software, experimenting with the tools, and practicing the role you are certifying
for, you should be well prepared to attempt the exams.
The MapR Certified Data Analyst credential is designed for data analysts who use
Hive, Pig, and Drill to do data analysis. The certification tests one's ability to use
MapR tools to load and inspect data, manage Hive tables, run SQL queries, and
use Drill to do analysis.
Section 1 - What's on the Exam?
6.2 Execute a Drill query using clients such as SQLLine, JDBC apps, and REST APIs
6.4 Create a view and visualize queries with BI Tools such as Tableau
6.5 Demonstrate how to use Drill Explorer to explore unknown data and
determine its structure to perform queries
6.6 Discover the flexibility and extensibility of Drill and UDFs
Sample Questions
The following questions represent the kinds of questions you will see on the exam.
The answers are in the answer key that follows the sample questions, where the
correct option is marked with an asterisk.
2. Which commands would create a relation, XYZ, which contains the first 100 records
of relation ABC, in alphabetical order by customer last name?
Which Hive configuration will allow you to run the query without error?
a) hive.mapred.mode=strict
b) hive.mapred.mode=nonstrict
c) hive.exec.dynamic.partition=true
d) hive.exec.dynamic.partition=false
4. Suppose the Cost-Based Optimization (CBO) and Predicate Pushdown features are
enabled in Hive. Which statement is true about the following two queries?
Query 1:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id) WHERE day_id BETWEEN '2015-01-01'
AND '2015-03-31';
Query 2:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id AND sales.day_id BETWEEN '2015-01-01'
AND '2015-03-31');
5. You have a Hive table, accesslogs, that contains a column called "time", which
holds string values in this format: [15/Jul/2009:14:58:59 -0700].
You want to write a query to find the number of logs for each year. Using a series of
three UDFs, you project the "year" field and then group by that computed field.
SELECT year(from_unixtime(UNIX_TIMESTAMP(t1.time,
'[dd/MMM/yyyy:HH:mm:ss Z]'))) AS `year`, COUNT(*) FROM accesslogs AS
t1 GROUP BY `year`;
Assume that all three UDFs used in this query work as expected. What will be the
output if you run the above query?
a) Returns the year and count of records sorted by year
6. Which of the following will generate the list of 'Apache' projects from a CSV file?
This is an example of the file's contents:
Which of the following Apache Drill SQL statements returns keys & values for
the address field, as shown below:
Keys Values
city Austin
state TX
8. When you want to store the following JSON data set in HBase, what column names
will you use?
b) user
9. You query your employee data with the following SQL statement: SELECT
first_name, employee_ID FROM employees ORDER BY employee_ID;
The data changes when Bob leaves the company and is removed from the
database, and Jamie is hired as a new employee. Which of the following is a
possible output you will see when you look at the view with the updated data?
a) first_name employee_ID
Jan 1
Steve 3
Iman 4
Jamie 5
b) first_name employee_ID
Jan 1
Bob 2
Steve 3
Iman 4
c) first_name employee_ID
Jan 1
Bob 2
Steve 3
Iman 4
Jamie 5
10. You have a 20-node cluster that includes 10 nodes dedicated to data
analysis. These are all data nodes in the topology /data/analysis/. To best
take advantage of data locality with your queries, you will install Drill on:
a) every node in the /data/analysis/ topology, and all of the control nodes
Answer Key
2. Which commands would create a relation, XYZ, which contains the first 100 records
of relation ABC, in alphabetical order by customer last name?
Which Hive configuration will allow you to run the query without error?
a) hive.mapred.mode=strict
b) *hive.mapred.mode=nonstrict
c) hive.exec.dynamic.partition=true
d) hive.exec.dynamic.partition=false
4. Suppose the Cost-Based Optimization (CBO) and Predicate Pushdown features are
enabled in Hive. Which statement is true about the following two queries?
Query 1:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id) WHERE day_id BETWEEN '2015-01-01'
AND '2015-03-31';
Query 2:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id AND sales.day_id BETWEEN '2015-01-01'
AND '2015-03-31');
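The two queries above differ only in where the date filter sits. As a general SQL illustration (run in SQLite, not Hive, and not a statement of the exam answer), the sketch below uses Python's built-in sqlite3 module to show how a filter on the right-hand table can behave differently in the WHERE clause and in the ON clause of a LEFT JOIN. The sample rows, the table aliases, and the `amount` column are assumptions introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER);
    CREATE TABLE sales (employee_id INTEGER, day_id TEXT, amount INTEGER);
    INSERT INTO employees VALUES (1), (2), (3);
    -- Hypothetical data: employee 1 sold inside the date range,
    -- employee 2 sold outside it, employee 3 has no sales at all.
    INSERT INTO sales VALUES (1, '2015-02-10', 100), (2, '2015-06-01', 50);
""")

# Filter in WHERE: evaluated after the join, so joined rows whose
# day_id is NULL or out of range are discarded.
where_rows = conn.execute("""
    SELECT e.id, s.amount
    FROM employees e LEFT JOIN sales s ON (e.id = s.employee_id)
    WHERE s.day_id BETWEEN '2015-01-01' AND '2015-03-31'
    ORDER BY e.id
""").fetchall()

# Filter in ON: only restricts which sales rows can match; every
# employee is still returned, with NULL where nothing matched.
on_rows = conn.execute("""
    SELECT e.id, s.amount
    FROM employees e LEFT JOIN sales s
      ON (e.id = s.employee_id
          AND s.day_id BETWEEN '2015-01-01' AND '2015-03-31')
    ORDER BY e.id
""").fetchall()

print(where_rows)  # [(1, 100)]
print(on_rows)     # [(1, 100), (2, None), (3, None)]
```

With this data the two forms return different row sets, so it is worth checking which result a report actually needs before moving a predicate between the two clauses.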
5. You have a Hive table, accesslogs, that contains a column called "time", which
holds string values in this format: [15/Jul/2009:14:58:59 -0700].
You want to write a query to find the number of logs for each year. Using a series of
three UDFs, you project the "year" field and then group by that computed field.
SELECT year(from_unixtime(UNIX_TIMESTAMP(t1.time,
'[dd/MMM/yyyy:HH:mm:ss Z]'))) AS `year`, COUNT(*) FROM accesslogs AS
t1 GROUP BY `year`;
Assume that all three UDFs used in this query work as expected. What will be the
output if you run the above query?
a) Returns the year and count of records sorted by year
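To make the parse-then-group logic in this question concrete, here is a minimal Python sketch that extracts the year from timestamps in the same bracketed format and counts records per year. It runs outside Hive and does not address how Hive resolves the `year` alias in GROUP BY. The function name and sample values are assumptions for illustration, `%b` assumes English month abbreviations, and the year is read in the timestamp's own UTC offset rather than converted to a system time zone.

```python
from collections import Counter
from datetime import datetime

# strptime pattern for strings such as "[15/Jul/2009:14:58:59 -0700]".
LOG_TIME_FORMAT = "[%d/%b/%Y:%H:%M:%S %z]"


def logs_per_year(times):
    """Return a {year: count} dict, with keys in ascending year order."""
    years = (datetime.strptime(t, LOG_TIME_FORMAT).year for t in times)
    return dict(sorted(Counter(years).items()))


# Hypothetical sample values in the format given in the question.
sample = [
    "[15/Jul/2009:14:58:59 -0700]",
    "[01/Jan/2010:00:00:01 -0700]",
    "[31/Dec/2009:23:59:59 -0700]",
]
print(logs_per_year(sample))  # {2009: 2, 2010: 1}
```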
6. Which of the following will generate the list of 'Apache' projects from a CSV file?
This is an example of the file's contents:
Which of the following Apache Drill SQL statements returns keys & values for
the address field, as shown below:
Keys Values
city Austin
state TX
d) Select repeated_contains(p.personal[0].contact[0].address[0],
'city'), repeated_contains(p.personal[0].contact[0].address[0],
'country'), repeated_contains(p.personal[0].contact[0].address[0],
'state'), repeated_contains(p.personal[0].contact[0].address[0],
'street') from dfs.`/tmp/person.json` p;
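The expected output for this question has one row per key/value pair. In Drill, the documented KVGEN function turns a map into a list of key/value pairs, and FLATTEN expands a list into one row per element. The sketch below emulates that shape in plain Python. It is not Drill code, the helper names only borrow the Drill function names, and the sample address map is an assumption, not the contents of /tmp/person.json.

```python
def kvgen(mapping):
    """Emulate the shape KVGEN produces: a list of key/value pairs."""
    return [{"key": k, "value": v} for k, v in mapping.items()]


def flatten(pairs):
    """Emulate FLATTEN: yield one (key, value) row per list element."""
    for pair in pairs:
        yield pair["key"], pair["value"]


# Hypothetical address map, chosen to match the output in the question.
address = {"city": "Austin", "state": "TX"}

rows = list(flatten(kvgen(address)))
for key, value in rows:
    print(key, value)
```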
8. When you want to store the following JSON data set in HBase, what column names
will you use?
b) user
9. You query your employee data with the following SQL statement: SELECT
first_name, employee_ID FROM employees ORDER BY employee_ID;
The data changes when Bob leaves the company and is removed from the
database, and Jamie is hired as a new employee. Which of the following is a
possible output you will see when you look at the view with the updated data?
a) *first_name employee_ID
Jan 1
Steve 3
Iman 4
Jamie 5
b) first_name employee_ID
Jan 1
Bob 2
Steve 3
Iman 4
c) first_name employee_ID
Jan 1
Bob 2
Steve 3
Iman 4
Jamie 5
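A view stores a query rather than a copy of the data, so it reflects the current contents of the underlying table each time it is read. The sketch below reproduces the scenario with Python's built-in sqlite3 module. It runs in SQLite rather than Hive or Drill, and the view name `employee_view` is an assumption made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (first_name TEXT, employee_ID INTEGER);
    INSERT INTO employees VALUES
        ('Jan', 1), ('Bob', 2), ('Steve', 3), ('Iman', 4);
    -- Hypothetical view name; the view stores the query, not the rows.
    CREATE VIEW employee_view AS
        SELECT first_name, employee_ID FROM employees;
""")

QUERY = "SELECT first_name, employee_ID FROM employee_view ORDER BY employee_ID"

before = conn.execute(QUERY).fetchall()

# Bob leaves and Jamie is hired; only the base table is modified.
conn.execute("DELETE FROM employees WHERE first_name = 'Bob'")
conn.execute("INSERT INTO employees VALUES ('Jamie', 5)")

after = conn.execute(QUERY).fetchall()

print(before)  # [('Jan', 1), ('Bob', 2), ('Steve', 3), ('Iman', 4)]
print(after)   # [('Jan', 1), ('Steve', 3), ('Iman', 4), ('Jamie', 5)]
```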
10. You have a 20-node cluster that includes 10 nodes dedicated to data
analysis. These are all data nodes in the topology /data/analysis/. To best
take advantage of data locality with your queries, you will install Drill on:
a) every node in the /data/analysis/ topology, and all of the control nodes
Section 2 - Preparing for the Certification
MapR provides several ways to prepare for the certification including classroom
training, self-paced online training, videos, webinars, blogs, and ebooks.
MapR offers a number of training courses that will help you prepare. We
recommend taking the classroom training first, followed by self-paced online
training, and then several months of experimentation on your own learning the
tools in a real-world environment.
We also provide additional resources in this guide to support your learning. The
blogs, whiteboard walkthroughs, and ebooks are excellent supporting material
in your efforts to become a Data Analyst.
Instructor and Virtual Instructor-led Training
All courses include:
A certified MapR instructor who is an expert in the topic and in
classroom facilitation and course delivery techniques
Collaboration and assistance for all students on completion of
exercises
Lab exercises, a lab guide, a slide guide, and job aids as appropriate
Certification exam fee included (one exam attempt only, taken on the
student's own time, not in class)
Course: DA-4000 Data Analysis with Apache Drill
Duration: 2 days
Cost: $2400
Course Description:
You will write SQL queries on a variety of data types, including structured data
in a Hive table, semi-structured data in HBase or MapR-DB, and complex data file
types, such as Parquet and JSON. You will also learn the different services
involved at each step, and how Drill optimizes a query for distributed SQL
execution.
Required:
Access to, and the ability to use, a laptop with a terminal program installed
(such as Terminal on the Mac, or PuTTY and WinSCP on Windows)
Recommended:
Course: DA-4500 Data Analysis with Apache Pig and Apache Hive
Where: learn.mapr.com
Duration: 3 days
Cost: $1200
Course Description:
This course covers how to use Pig and Hive as part of a single data flow in a Hadoop
cluster. The course begins with manipulating semi-structured raw data files in Pig,
using the grunt shell and the Pig Latin programming language. Once the raw data has
been manipulated into structured tables, they will be exported from Pig and imported
into Hive. The structured data can be queried in Hive, and some basic data analysis can
be performed.
Required:
Familiarity with a command-line interface, such as a Unix shell
Familiarity with RDBMS database tools, such as SQL
Access to, and the ability to use, a laptop with an internet connection and a
terminal program installed (such as Terminal on the Mac, or PuTTY on Windows)
Recommended:
This introductory Apache Drill course, targeted at Data Analysts, Scientists and SQL
programmers, covers how to use Drill to explore known or unknown data without
writing code. You will write SQL queries on a variety of data types including structured
data in a Hive table, semi-structured data in HBase or MapR-DB, and complex data file
types, such as Parquet and JSON.
Course: DA 440 - Apache Hive Essentials
DA 440 is an introductory-level course designed for data analysts and developers. You
will learn how Apache Hive fits in the Hadoop ecosystem, how to create and load tables
in Hive, and how to query data using the Hive Query Language.
Videos
Documentation
The Drill documentation on the Apache.org site is thorough, and we recommend that you
spend some time learning Drill from these resources.
5. Drill in 10 minutes
https://drill.apache.org/docs/drill-in-10-minutes/
Section 3 - Taking the Exam
MapR Certification exams are delivered online using a service from Innovative Exams. A
human will proctor your exam. Your proctor will have access to your webcam and
desktop during your exam. Once you are logged in for your test session, and your
webcam and desktop are shared, your proctor will launch your exam.
This method allows you to take our exams anytime, and anywhere, but you will need a
quiet environment where you will remain uninterrupted for up to two hours. You will
also need a reliable Internet connection for the entire test session.
Reserve a Test Session
MapR exams are delivered on a platform called Innovative Exams. When you are ready
to schedule your exam, go back to your profile in learn.mapr.com, click on your exam,
and click the Continue to Innovative Exams link to proceed to scheduling. This will take
you to Examslocal.com.
7) Once confirmed, your reservation will appear in the My Exams tab of Innovative Exams.
Refunds
MapR Certification exams are non-refundable.
Test System Compatibility
We recommend that you check your system compatibility several days before your exam
to make sure you are ready to go. Go to
https://www.examslocal.com/ScheduleExam/Home/CompatibilityCheck
Day of the Exam
Make sure your Internet connection is strong and stable
Make sure you are in a quiet, well-lit room without distractions
Clear the room - you must be alone when taking your exam
No breaks are allowed during the exam; use the bathroom before you log in
Clear your desk of any materials, notebooks, and mobile devices
Silence your mobile and remove it from your desk
Configure your computer for a single display; multiple displays are not allowed
Close out of all other applications except for Chrome
We recommend that you sign in 30 minutes in advance of your testing time so that you
can communicate with your proctor, and get completely set up well in advance of your
test time.
You will be required to share your desktop and your webcam prior to the exam start.
YOUR EXAM SESSION WILL BE RECORDED. If the Proctor senses any misconduct,
your exam will be paused and you will be notified by the proctor of your misconduct.
If your misconduct is not corrected, the Proctor will shut down your exam, resulting in a
Fail.
Examples of misconduct and/or misuse of the exam include, but are not limited to, the
following:
After the Exam - Sharing Your Results
When you pass a MapR Certification exam, you will receive a confirmation email with
the details of your success. This will include the title of your certification and details on
how you can download your digital certificate, and share your certification on social
media.
Your certification will be updated in learn.mapr.com in your profile. From your profile
you can view your certificate and share it on LinkedIn.
Your certificate is available as a PDF. You can download and print your certificate from
your profile in learn.mapr.com.
Your credential contains a unique Certificate Number and a URL. You can share your
credential with anyone who needs to verify your certification.
If you happen to fail the exam, you will automatically qualify for a discounted exam
retake voucher. Retakes are $100 USD and can be purchased by contacting
certification@maprtech.com. MapR will verify your eligibility and supply you with a
special 1-time use discount code which you can apply at the time of purchase.
Exam Retakes
If you fail an exam, you are eligible to purchase and retake the exam in 14 days. Once
you have passed the exam, you may not take that version (e.g., v.4.0) of the exam again,
but you may take any newer version of the exam (e.g., v.4.1). A test result found to be in
violation of the retake policy will result in no credit awarded for the test taken. Violators
of these policies may be banned from participation in the MapR Certification Program.