MapR Certified Data Analyst (MCDA) Study Guide

Certification Study Guide
MapR Certified Data Analyst
Study Guide

CONTENTS
About MapR Study Guides
MapR Certified Data Analyst 1.8 (MCDA)
SECTION 1 - WHAT'S ON THE EXAM?
Exam Objectives
Sample Questions
Answer Key
SECTION 2 - PREPARING FOR THE CERTIFICATION
Instructor and Virtual Instructor-led Training
Course: DA 410 - Apache Drill Essentials
Course: DA 440 - Apache Hive Essentials
Course: DA 450 - Apache Pig Essentials
SECTION 3 - TAKING THE EXAM
Register for the Exam
Reserve a Test Session
Rescheduling
Test System Compatibility
Day of the Exam
After the Exam - Sharing Your Results
Exam Retakes

About MapR Study Guides
MapR certification study guides help you prepare for certification by
providing additional study resources, sample questions, and details about how to
take the exam. The study guide by itself is not enough to prepare you for the exam.
You'll need training, practice, and experience. The study guide will point you in the
right direction and help you get ready.
If you use all the resources in this guide and spend 6-12 months on your own using
the software, experimenting with the tools, and practicing the role you are certifying
for, you should be well prepared to attempt the exam.
MapR Certified Data Analyst 1.8 (MCDA)
The MapR Certified Data Analyst credential is designed for data analysts who use
Hive, Pig, and Drill to do data analysis. The certification tests one's ability to use
MapR tools to load and inspect data, manage Hive tables, run SQL queries, and
use Drill to do analysis.
Exam Cost: $250
Duration: 2 Hours
Number of Questions: Between 50 and 70; the exact number changes frequently
Section 1 - What's on the Exam?

The MapR Certified Data Analyst 1.8 exam comprises 7 exam topic sections and 26
objectives. There are 50-70 questions on the exam. MapR exams are updated frequently,
so the number of exam questions can change.

MapR tests new questions on the exam in an unscored manner. This means that you may
see test questions on the exam that are not used for scoring your exam. You will not know
which items are scored and which are unscored. Unscored items are being tested for
inclusion in future versions of the exam. They do not affect your results.

At the completion of the exam you will be given a Pass or Fail, your percentage achieved,
and your performance by topic section. Exam results may not be appealed. Even if you fail
by 1% it is considered a fail and we will not respond to requests to rescore the exam.
Exam Objectives
1. Extract, Transform, and Load Data with Apache Pig (12%)
1.1 Load data into relations
1.2 Debug Pig scripts
1.3 Perform simple manipulations
1.4 Save relations as files
2. Manipulate Data with Apache Pig (8%)
2.1 Subset relations, Combine relations, and use UDFs on relations
3. Create Tables and Load data in Apache Hive (12%)
3.1 Create Databases
3.2 Create simple, external, and partitioned tables
3.3 Alter and drop tables
4. Query data with Apache Hive (20%)
4.1 Query Tables
4.2 Manipulate tables with UDFs
4.3 Combine and store tables
5. SQL Queries with Drill (25%)
5.1 Create Tables & Views
5.2 Join structured and semi-structured content into a single query
5.3 SQL functions including window functions
5.4 Nested data functions including JSON data model
5.5 Data type conversions including CAST & CONVERT_FROM functions
5.6 Partition Pruning & CTAS Auto partitioning
6. Working with Self Describing Data (15%)
6.1 Define structured and unstructured data formats
6.2 Execute a Drill Query using clients such as SQLLine, JDBC Apps, & REST APIs
6.3 Storage plugins and workspaces in Drill
6.4 Create a view and visualize queries with BI Tools such as Tableau
6.5 Demonstrate how to use Drill Explorer to explore unknown data and
determine its structure to perform queries
6.6 Discover the flexibility and extensibility of Drill and UDFs
7. Advanced Topics (7%)
7.1 Optimize Drill Queries
7.2 Troubleshooting, profiles, logs, & Tuning
7.3 Security, authorization, authentication, and impersonation
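The topic weights in the objectives list can be turned into a rough per-topic question count. The following is a minimal Python sketch, not an official MapR calculation: the helper name `estimate_question_counts` is hypothetical, the weights are copied from the list above (as printed, they total 99%), and the estimate ignores the fact that some questions on the exam are unscored.

```python
# Topic weights (percent) copied from the exam objectives list.
TOPIC_WEIGHTS = {
    "Extract, Transform, and Load Data with Apache Pig": 12,
    "Manipulate Data with Apache Pig": 8,
    "Create Tables and Load Data in Apache Hive": 12,
    "Query Data with Apache Hive": 20,
    "SQL Queries with Drill": 25,
    "Working with Self Describing Data": 15,
    "Advanced Topics": 7,
}


def estimate_question_counts(total_questions):
    """Return an approximate question count per topic (hypothetical helper)."""
    return {
        topic: round(total_questions * percent / 100, 1)
        for topic, percent in TOPIC_WEIGHTS.items()
    }


if __name__ == "__main__":
    # The guide states that the exam has between 50 and 70 questions.
    for total in (50, 70):
        print(f"--- {total} questions ---")
        for topic, count in estimate_question_counts(total).items():
            print(f"{count:5.1f}  {topic}")
```

Under these assumptions, the Drill and Hive query topics together account for close to half of the exam, which may help when deciding where to spend study time.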
Sample Questions
The following questions represent the kinds of questions you will see on the exam.
The answers appear in the answer key that follows the sample questions, where an
asterisk marks the correct option.
1. Which of the following is true of FOREACH in Pig?
a) FOREACH statements must be paired with a GENERATE statement
b) UDFs cannot be called inside FOREACH statements
c) column names must be defined with AS inside FOREACH statements
d) aggregate functions cannot be called inside FOREACH statements
2. Which commands would create a relation, XYZ, which contains the first 100 records
of relation ABC, in alphabetical order by customer last name?
a) PQR = ORDER ABC BY last.name;
   XYZ = LIMIT PQR 100;
b) XYZ = LIMIT {ORDER ABC BY last.name} 100;
c) XYZ = FOREACH ABC
   GENERATE XYZ
   ORDER BY last.name
   LIMIT 100;
d) XYZ = ORDER {LIMIT ABC 100} BY last.name;
3. You query the "employees" table, which is already partitioned by (country, state),
and encounter the following error:
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
Query: SELECT e.name, e.salary FROM employees e LIMIT 5;
Which Hive configuration will allow you to run the query without error?
a) hive.mapred.mode=strict
b) hive.mapred.mode=nonstrict
c) hive.exec.dynamic.partition=true
d) hive.exec.dynamic.partition=false
4. Suppose Cost Based Optimization (CBO) and Predicate Pushdown features are
enabled in Hive. Which is true about the following two queries?
Query 1:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id) WHERE day_id BETWEEN '2015-01-01'
AND '2015-03-31';
Query 2:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id AND sales.day_id BETWEEN '2015-01-01'
AND '2015-03-31');
a) Both queries will run with equal efficiency
b) Query 1 will be faster
c) Query 2 will be faster
d) Both Query 1 and Query 2 will run faster by setting hive.optimize.ppd=false;
5. You have a Hive table accesslogs that contains a column called "time", which
contains string values in this format: [15/Jul/2009:14:58:59 -0700].
You want to write a query to find the number of logs for each year. Using a series of 3
UDFs, you project the "year" field and then group by that computed field.
SELECT year(from_unixtime(UNIX_TIMESTAMP(t1.time,
'[dd/MMM/yyyy:HH:mm:ss Z]'))) AS `year`, COUNT(*) FROM accesslogs AS
t1 GROUP BY `year`;
Assume that all three UDFs used in this query work as expected. What will be the
output if you run the above query?
a) Returns the year and count of records sorted by year
b) Returns the year and count of records in arbitrary order
c) Throws an exception saying that there are too many nested UDFs
d) Throws an exception saying Invalid table alias or column reference 'year'
6. Which of the following will generate the list of 'Apache' projects from a CSV file?
This is an example of the file's contents:
Apache Drill,Apache Software Foundation
Microsoft Windows,Microsoft Corp
Apache Pig,Apache Software Foundation
Apple MacOSX,Apple Inc
Oracle Java,Oracle Inc
Apache Hive,Apache Software Foundation
a) SELECT project from (SELECT STRPOS(columns[0], 'Apache') as
   asfFlag, columns[0] as project FROM dfs.`/tmp/project-names.csv` )
   as pList where asfFlag=0
b) SELECT project from (SELECT LTRIM(columns[0], 'Apache') as
   project FROM dfs.`/tmp/project-names.csv` );
c) SELECT project from (SELECT SUBSTR(columns[0],
   LENGTH('APACHE'), LENGTH(columns[0])) as project FROM
   dfs.`/tmp/project-names.csv` );
d) SELECT project from (SELECT SUBSTR(columns[0],
   LENGTH('APACHE'), LENGTH(columns[0]) - LENGTH('APACHE')) as
   project FROM dfs.`/tmp/project-names.csv` );
7. Use the following dataset to answer the question below
Table Person (dfs.`/tmp/Person.json`):
{
  "first_name": "Joe",
  "last_name": "Baker",
  "work": {
    "company": "Cogibox",
    "ssn": "528-77-1229",
    "salary": "$34434.54",
    "id": 2287,
    "jobtitle": "Research Nurse"
  },
  "finance": {
    "bitcoins": "1MKEjfqByXb2vJVypdWLZpFKGN9mBy3FSS",
    "expenses": 8110.38
  },
  "personal": [
    {
      "birthdate": "1/31/2015",
      "birthtime": "7:04 AM",
      "favcolors": ["yellow","red","blue"],
      "contact": [
        {
          "email": "bcrawford0@huffingtonpost.com",
          "website": "https://examiner.com",
          "phone": "51-(298)175-9031",
          "address": [
            {
              "city": "Austin",
              "country": "United States",
              "state": "TX",
              "street": "2 Maywood Parkway"
            }
          ]
        }
      ]
    }
  ]
}
Which of the following Apache Drill SQL statements returns keys & values for
the address field, as shown below:
Keys     Values
city     Austin
country  United States
state    TX
street   2 Maywood Parkway
a) Select t.f.`key` as `Keys`, t.f.`value` as `Values` from (Select
   flatten(kvgen(p.personal[0].contact[0].address[0])) f
   from dfs.`/tmp/person.json` p) t;
b) Select p.city, p.country, p.state, p.street from
   dfs.`/tmp/person.json` p;
c) Select repeated_contains(p.personal[0].contact[0].address[0], 'city'),
   repeated_contains(p.personal[0].contact[0].address[0], 'country'),
   repeated_contains(p.personal[0].contact[0].address[0], 'state'),
   repeated_contains(p.personal[0].contact[0].address[0], 'street')
   from dfs.`/tmp/person.json` p;
d) Select t.f.`key` as `Keys`, t.f.`value` as `Values` from (Select
   kvgen(flatten(p.personal[0].contact[0].address[0])) f
   from dfs.`/tmp/person.json` p) t;
e) Select t.`Keys`, t.`Values` from (Select
   p.personal[0].contact[0].address[0] from dfs.`/tmp/person.json` p) t;
8. When you want to store the following JSON data set in HBase, what column names
will you use?
{user: {user_ID: 1k43dg0, first_name: Doug, last_name: Nelson}}
a) user, user_ID, first_name, last_name
b) user
c) user_ID, first_name, last_name
d) user_ID: 1k43dg0, first_name: Doug, last_name: Nelson
9. You query your employee data with the following SQL statement: SELECT
first_name, employee_ID FROM employees ORDER BY employee_ID;
You save a view on the following output.
first_name  employee_ID
Jan         1
Bob         2
Steve       3
Iman        4
The data is altered when Bob leaves the company, and is removed from the
database, and Jamie is hired as a new employee. Which of the following is a
possible output you will see when you look at the view with the updated data?
a) first_name  employee_ID
   Jan         1
   Steve       3
   Iman        4
   Jamie       5
b) first_name  employee_ID
   Jan         1
   Bob         2
   Steve       3
   Iman        4
c) first_name  employee_ID
   Jan         1
   Bob         2
   Steve       3
   Iman        4
   Jamie       5
d) The view cannot be reloaded after the data has changed
10. You have a 20 node cluster that includes 10 nodes dedicated for data
analysis. These are all data nodes in the topology, /data/analysis/. To best
take advantage of data locality with your queries, you will install Drill on:
a) every node in the /data/analysis/ topology, and all of the control nodes
b) every node outside of the /data/analysis/ topology
c) every node in the /data/analysis/ topology
d) every node in the cluster
Answer Key
1. Which of the following is true of FOREACH in Pig?
a) * FOREACH statements must be paired with a GENERATE statement
b) UDFs cannot be called inside FOREACH statements
c) column names must be defined with AS inside FOREACH statements
d) aggregate functions cannot be called inside FOREACH statements
2. Which commands would create a relation, XYZ, which contains the first 100 records
of relation ABC, in alphabetical order by customer last name?
a) XYZ = LIMIT {ORDER ABC BY last.name} 100;
b) *PQR = ORDER ABC BY last.name;
   XYZ = LIMIT PQR 100;
c) XYZ = FOREACH ABC
   GENERATE XYZ
   ORDER BY last.name
   LIMIT 100;
d) XYZ = ORDER {LIMIT ABC 100} BY last.name;
3. You query the "employees" table, which is already partitioned by (country, state),
and encounter the following error:
FAILED: Error in semantic analysis: No partition predicate found for
Alias "e" Table "employees"
Query: SELECT e.name, e.salary FROM employees e LIMIT 5;
Which Hive configuration will allow you to run the query without error?
a) hive.mapred.mode=strict
b) *hive.mapred.mode=nonstrict
c) hive.exec.dynamic.partition=true
d) hive.exec.dynamic.partition=false
4. Suppose Cost Based Optimization (CBO) and Predicate Pushdown features are
enabled in Hive. Which is true about the following two queries?
Query 1:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id) WHERE day_id BETWEEN '2015-01-01'
AND '2015-03-31';
Query 2:
SELECT employees.id, b.sales FROM employees LEFT JOIN sales ON
(employees.id = sales.employee_id AND sales.day_id BETWEEN '2015-01-01'
AND '2015-03-31');
a) Both queries will run with equal efficiency
b) Query 1 will be faster
c) *Query 2 will be faster
d) Both Query 1 and Query 2 will run faster by setting hive.optimize.ppd=false;
5. You have a Hive table accesslogs that contains a column called "time", which
contains string values in this format: [15/Jul/2009:14:58:59 -0700].
You want to write a query to find the number of logs for each year. Using a series of 3
UDFs, you project the "year" field and then group by that computed field.
SELECT year(from_unixtime(UNIX_TIMESTAMP(t1.time,
'[dd/MMM/yyyy:HH:mm:ss Z]'))) AS `year`, COUNT(*) FROM accesslogs AS
t1 GROUP BY `year`;
Assume that all three UDFs used in this query work as expected. What will be the
output if you run the above query?
a) Returns the year and count of records sorted by year
b) Returns the year and count of records in arbitrary order
c) Throws an exception saying that there are too many nested UDFs
d) *Throws an exception saying Invalid table alias or column reference 'year'
6. Which of the following will generate the list of 'Apache' projects from a CSV file?
This is an example of the file's contents:
Apache Drill,Apache Software Foundation
Microsoft Windows,Microsoft Corp
Apache Pig,Apache Software Foundation
Apple MacOSX,Apple Inc
Oracle Java,Oracle Inc
Apache Hive,Apache Software Foundation
a) *SELECT project from (SELECT STRPOS(columns[0], 'Apache') as
   asfFlag, columns[0] as project FROM dfs.`/tmp/project-names.csv` )
   as pList where asfFlag=0
b) SELECT project from (SELECT LTRIM(columns[0], 'Apache') as
   project FROM dfs.`/tmp/project-names.csv` );
c) SELECT project from (SELECT SUBSTR(columns[0], LENGTH('APACHE'),
   LENGTH(columns[0])) as project FROM dfs.`/tmp/project-names.csv` );
d) SELECT project from (SELECT SUBSTR(columns[0], LENGTH('APACHE'),
   LENGTH(columns[0]) - LENGTH('APACHE')) as project FROM
   dfs.`/tmp/project-names.csv` );
7. Use the following dataset to answer the question below
Table Person (dfs.`/tmp/Person.json`):
{
  "first_name": "Joe",
  "last_name": "Baker",
  "work": {
    "company": "Cogibox",
    "ssn": "528-77-1229",
    "salary": "$34434.54",
    "id": 2287,
    "jobtitle": "Research Nurse"
  },
  "finance": {
    "bitcoins": "1MKEjfqByXb2vJVypdWLZpFKGN9mBy3FSS",
    "expenses": 8110.38
  },
  "personal": [
    {
      "birthdate": "1/31/2015",
      "birthtime": "7:04 AM",
      "favcolors": ["yellow","red","blue"],
      "contact": [
        {
          "email": "bcrawford0@huffingtonpost.com",
          "website": "https://examiner.com",
          "phone": "51-(298)175-9031",
          "address": [
            {
              "city": "Austin",
              "country": "United States",
              "state": "TX",
              "street": "2 Maywood Parkway"
            }
          ]
        }
      ]
    }
  ]
}
Which of the following Apache Drill SQL statements returns keys & values for
the address field, as shown below:
Keys     Values
city     Austin
country  United States
state    TX
street   2 Maywood Parkway
a) *Select t.f.`key` as `Keys`, t.f.`value` as `Values` from (Select
   flatten(kvgen(p.personal[0].contact[0].address[0])) f
   from dfs.`/tmp/person.json` p) t;
b) Select p.city, p.country, p.state, p.street from
   dfs.`/tmp/person.json` p;
c) Select repeated_contains(p.personal[0].contact[0].address[0], 'city'),
   repeated_contains(p.personal[0].contact[0].address[0], 'country'),
   repeated_contains(p.personal[0].contact[0].address[0], 'state'),
   repeated_contains(p.personal[0].contact[0].address[0], 'street')
   from dfs.`/tmp/person.json` p;
d) Select t.f.`key` as `Keys`, t.f.`value` as `Values` from (Select
   kvgen(flatten(p.personal[0].contact[0].address[0])) f
   from dfs.`/tmp/person.json` p) t;
e) Select t.`Keys`, t.`Values` from (Select
   p.personal[0].contact[0].address[0] from dfs.`/tmp/person.json` p) t;
8. When you want to store the following JSON data set in HBase, what column names
will you use?
{user: {user_ID: 1k43dg0, first_name: Doug, last_name: Nelson}}
a) user, user_ID, first_name, last_name
b) user
c) *user_ID, first_name, last_name
d) user_ID: 1k43dg0, first_name: Doug, last_name: Nelson
9. You query your employee data with the following SQL statement: SELECT
first_name, employee_ID FROM employees ORDER BY employee_ID;
You save a view on the following output.
first_name  employee_ID
Jan         1
Bob         2
Steve       3
Iman        4
The data is altered when Bob leaves the company, and is removed from the
database, and Jamie is hired as a new employee. Which of the following is a
possible output you will see when you look at the view with the updated data?
a) * first_name  employee_ID
     Jan         1
     Steve       3
     Iman        4
     Jamie       5
b) first_name  employee_ID
   Jan         1
   Bob         2
   Steve       3
   Iman        4
c) first_name  employee_ID
   Jan         1
   Bob         2
   Steve       3
   Iman        4
   Jamie       5
d) The view cannot be reloaded after the data has changed
10. You have a 20 node cluster that includes 10 nodes dedicated for data
analysis. These are all data nodes in the topology, /data/analysis/. To best
take advantage of data locality with your queries, you will install Drill on:
a) every node in the /data/analysis/ topology, and all of the control nodes
b) every node outside of the /data/analysis/ topology
c) *every node in the /data/analysis/ topology
d) every node in the cluster
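
Sample question 9 turns on the fact that a view stores a query, not a snapshot of its results. The sketch below illustrates that behavior using Python's built-in sqlite3 module as a stand-in. This is an assumption made so the example runs anywhere: it is not Drill or Hive, and the view name `employee_list` and the function name are hypothetical. The view is queried once before and once after the underlying table changes.

```python
import sqlite3


def view_rows_before_and_after_change():
    """Query a view before and after the base table changes (illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (first_name TEXT, employee_ID INTEGER)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?)",
        [("Jan", 1), ("Bob", 2), ("Steve", 3), ("Iman", 4)],
    )
    # The view saves the SELECT statement itself, not the rows it returned.
    conn.execute(
        "CREATE VIEW employee_list AS "
        "SELECT first_name, employee_ID FROM employees"
    )
    query = "SELECT first_name, employee_ID FROM employee_list ORDER BY employee_ID"
    before = conn.execute(query).fetchall()

    # Bob leaves the company and Jamie is hired.
    conn.execute("DELETE FROM employees WHERE first_name = 'Bob'")
    conn.execute("INSERT INTO employees VALUES ('Jamie', 5)")
    after = conn.execute(query).fetchall()

    conn.close()
    return before, after


if __name__ == "__main__":
    before, after = view_rows_before_and_after_change()
    print("before:", before)
    print("after: ", after)
```

In this stand-in, the second query against the view reflects the current table contents (Bob is gone and Jamie appears), which matches the option marked correct in the answer key.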
Section 2 - Preparing for the Certification
MapR provides several ways to prepare for the certification including classroom
training, self-paced online training, videos, webinars, blogs, and ebooks.
MapR offers a number of training courses that will help you prepare. We
recommend taking the classroom training first, followed by self-paced online
training, and then several months of experimentation on your own learning the
tools in a real-world environment.
We also provide additional resources in this guide to support your learning. The
blogs, whiteboard walkthroughs, and ebooks are excellent supporting material
in your efforts to become a Data Analyst.
Instructor and Virtual Instructor-led Training
All courses include:
o A certified MapR instructor who is an expert in the topic and in
  classroom facilitation and course delivery techniques
o Collaboration and assistance for all students on completion of
  exercises
o Lab exercises, a lab guide, a slide guide, and job aids as appropriate
o Certification exam fee (one exam attempt only, taken on the
  student's own time, not in class)
Course: DA-4000 Data Analysis with Apache Drill
Duration: 2 days
Cost: $2400
Course Description:
You will write SQL queries on a variety of data types, including structured data
in a Hive table, semi-structured data in HBase or MapR-DB, and complex data file
types, such as Parquet and JSON. You will also learn the different services
involved at each step, and how Drill optimizes a query for distributed SQL
execution.
Prerequisites for Success in this Course

Review the following prerequisites carefully and decide if you are ready to succeed in
this course. The instructor will move forward with lab exercises, assuming that you have
mastered the skills listed below.
Required:
o Basic Linux knowledge, including familiarity with basic command-line tools
  such as mv, cp, cd, ls, ssh, and scp
o Access to, and the ability to use, a laptop with a terminal program installed
  (such as Terminal on the Mac, or PuTTY and WinSCP on Windows)
o Beginner to intermediate fluency with SQL
Recommended:
o Completion of the on-demand course ESS 100 Introduction to Big Data
o Completion of the on-demand course ESS 101 Apache Hadoop Essentials
Optional: Basic Hadoop knowledge
Course: DA-4500 Data Analysis with Apache Pig and Apache Hive
Where: learn.mapr.com
Duration: 3 days
Cost: $1200
Course Description:
This course covers how to use Pig and Hive as part of a single data flow in a Hadoop
cluster. The course begins with manipulating semi-structured raw data files in Pig,
using the grunt shell and the Pig Latin programming language. Once the raw data has
been manipulated into structured tables, the tables are exported from Pig and imported
into Hive. The structured data can then be queried in Hive, and some basic data
analysis can be performed.
Prerequisites for Success in this Course

Review the following prerequisites carefully and decide if you are ready to succeed in
this course. The instructor will move forward with lab exercises, assuming that you have
mastered the skills listed below.
Required:
o Familiarity with a command-line interface, such as a Unix shell
o Familiarity with RDBMS database tools, such as SQL
o Access to, and the ability to use, a laptop with an internet connection and a
  terminal program installed (such as Terminal on the Mac, or PuTTY on Windows)
Recommended:
o Completion of the on-demand course ESS 100 Introduction to Big Data
o Familiarity with Hadoop
Course: DA 410 - Apache Drill Essentials
This introductory Apache Drill course, targeted at Data Analysts, Scientists and SQL
programmers, covers how to use Drill to explore known or unknown data without
writing code. You will write SQL queries on a variety of data types including structured
data in a Hive table, semi-structured data in HBase or MapR-DB, and complex data file
types, such as Parquet and JSON.
Course: DA 440 - Apache Hive Essentials
DA 440 is an introductory-level course designed for data analysts and developers. You
will learn how Apache Hive fits in the Hadoop ecosystem, how to create and load tables
in Hive, and how to query data using the Hive Query Language.
Course: DA 450 - Apache Pig Essentials
DA 450 - Apache Pig Essentials is an introductory-level course designed for data
analysts and developers. The course begins with a review of data pipeline tools, then
covers how to load and manipulate relations in Pig.
Videos
In addition to the classroom and self-paced training courses, we recommend
these videos, webinars, and tutorials:
1. Apache Drill SQL Query Optimization | Whiteboard Walkthrough
   https://youtu.be/u4z_SnBKU4c
2. Apache Drill SQL Queries on Parquet Data | Whiteboard Walkthrough
   https://youtu.be/lslA8kDr_jQ
3. Rethinking SQL for Big Data with Apache Drill
   https://youtu.be/KqEOH8_nw9Y
4. Apache Drill: The Rise of the Non-Relational Datastore | Whiteboard Walkthrough
   https://youtu.be/65c42i7Xg7Q
Documentation
The Drill documentation on the Apache.org site is very good, and we recommend you
spend some time learning Drill from these resources.
5. Drill in 10 minutes
   https://drill.apache.org/docs/drill-in-10-minutes/
6. Learn Drill with the MapR Sandbox
   Explore data using a Hadoop environment pre-configured with Drill
   https://drill.apache.org/docs/about-the-mapr-sandbox/
7. Analyzing Data Using Window Functions
   Learn how to use analytic/window functions
   https://drill.apache.org/docs/analyzing-data-using-window-functions
8. Using Drill to Analyze Amazon Spot Prices
   Use a Drill workshop on GitHub to create views of JSON and Parquet data
   https://github.com/vicenteg/spot-price-history/#drill-workshop---amazon-spot-prices
9. Query Data Introduction
   https://drill.apache.org/docs/query-data-introduction/
10. Analyzing Social Media
    Analyze Twitter data in its native JSON format using Drill.
Section 3 - Taking the Exam
MapR Certification exams are delivered online using a service from Innovative Exams. A
human will proctor your exam. Your proctor will have access to your webcam and
desktop during your exam. Once you are logged in for your test session, and your
webcam and desktop are shared, your proctor will launch your exam.
This method allows you to take our exams anytime, and anywhere, but you will need a
quiet environment where you will remain uninterrupted for up to two hours. You will
also need a reliable Internet connection for the entire test session.
There are five steps in taking your exam:

1) Register for the exam
2) Reserve a test session
3) Test your system compatibility
4) Take the exam
5) Get your results
Register for the Exam
MapR exams are available for purchase exclusively at learn.mapr.com. You have six
months to complete your certification after you purchase the exam. After six months have
expired, your exam registration will be canceled. There are no refunds for expired
certification purchases.
1) Sign in to your profile at learn.mapr.com
2) Find the exam in the learn.mapr.com catalog and click Purchase
3) If you have a voucher code, enter it in the Promotion Code field
4) Use a credit card to pay for the exam
   You may use a Visa, MasterCard, American Express, or Discover credit card. The
   charge will appear as MAPR TECHNOLOGIES on your credit card statement.
5) Look for a confirmation with your Order ID
Reserve a Test Session
MapR exams are delivered on a platform called Innovative Exams. When you are ready
to schedule your exam, go back to your profile in learn.mapr.com, click on your exam,
and click the Continue to Innovative Exams link to proceed to scheduling. This will take
you to Examslocal.com.
1) In learn.mapr.com, find your exam and click Continue to Innovative Exams
2) Single sign-on will bring you to Innovative Exams
3) Go to My Exams
4) Enter your exam title in the Search field
5) Choose an exam date
6) Choose a time slot at least 24 hours in advance
7) Once confirmed, your reservation will appear in the My Exams tab of Innovative Exams
8) Check your email for a reservation confirmation

Rescheduling
You may reschedule your exam without a cancellation penalty if you give at least
24 hours' notice. If you cancel or reschedule within 24 hours of the scheduled
appointment, you forfeit the entire cost of the exam and must pay for it again to
reschedule. The 24-hour rule exists because human proctors are scheduled to deliver
each test session.
To reschedule an exam, log in to www.examslocal.com, click My Exams, and select
the exam to reschedule. You will remain enrolled in your exam in MapR Academy.
Refunds
MapR Certification exams are non-refundable.
Test System Compatibility
We recommend that you check your system compatibility several days before your exam
to make sure you are ready to go. Go to
https://www.examslocal.com/ScheduleExam/Home/CompatibilityCheck
These are the system requirements:
1) Mac, Windows, Linux, or Chrome OS
2) Google Chrome or Chromium version 32 or above
3) Your browser must accept third-party cookies for the duration of the exam ONLY
4) Install the Innovative Exams Google Chrome extension
5) TCP ports 80 and 443
6) 1 GB RAM and a 2 GHz dual-core processor
7) Minimum 1280 x 800 resolution
8) Sufficient bandwidth to share your screen via the Internet
Day of the Exam
Make sure your Internet connection is strong and stable
Make sure you are in a quiet, well-lit room without distractions
Clear the room - you must be alone when taking your exam
No breaks are allowed during the exam; use the bathroom before you log in
Clear your desk of any materials, notebooks, and mobile devices
Silence your mobile and remove it from your desk
Configure your computer for a single display; multiple displays are not allowed
Close out of all other applications except for Chrome
We recommend that you sign in 30 minutes in advance of your testing time so that you
can communicate with your proctor, and get completely set up well in advance of your
test time.
You will be required to share your desktop and your webcam prior to the exam start.
YOUR EXAM SESSION WILL BE RECORDED. If the Proctor senses any misconduct,
your exam will be paused and you will be notified by the proctor of your misconduct.
If your misconduct is not corrected, the Proctor will shut down your exam, resulting in a
Fail.
Examples of misconduct and/or misuse of the exam include, but are not limited to, the
following:
o Impersonating another person
o Accepting assistance or providing assistance to another person
o Disclosure of exam content including, but not limited to, web postings, formal or
  informal test preparation or discussion groups, or reconstruction through
  memorization or any other method
o Possession of unauthorized items during the exam. This includes study materials,
  notes, computers, and mobile devices.
o Use of unauthorized materials (including brain-dump material and/or
  unauthorized publication of exam questions with or without answers)
o Making notes of any kind during the exam
o Removing or attempting to remove exam material (in any format)
o Modifying and/or altering the results and/or score report or any other exam
  record
MapR Certification exam policies can be viewed at:
https://www.mapr.com/mapr-certification-policies
After the Exam - Sharing Your Results
When you pass a MapR Certification exam, you will receive a confirmation email with
the details of your success. This will include the title of your certification and details on
how you can download your digital certificate, and share your certification on social
media.
Your certification will be updated in learn.mapr.com in your profile. From your profile
you can view your certificate and share it on LinkedIn.
Your certificate is available as a PDF. You can download and print your certificate from
your profile in learn.mapr.com.
Your credential contains a unique Certificate Number and a URL. You can share your
credential with anyone who needs to verify your certification.
If you happen to fail the exam, you will automatically qualify for a discounted exam
retake voucher. Retakes are $100 USD and can be purchased by contacting
certification@maprtech.com. MapR will verify your eligibility and supply you with a
special 1-time use discount code which you can apply at the time of purchase.
Exam Retakes
If you fail an exam, you are eligible to purchase and retake the exam after 14 days. Once
you have passed the exam, you may not take that version (e.g., v.4.0) of the exam again,
but you may take any newer version of the exam (e.g., v.4.1). A test result found to be in
violation of the retake policy will result in no credit awarded for the test taken. Violators
of these policies may be banned from participation in the MapR Certification Program.
About MapR Study Guides

MapR Certified Data Analyst 1.8 (MCDA)

Exam Cost: $250

Section 1 - What's on the Exam?

1. Extract, Transform, and Load Data with Apache Pig (12%)

1.1 Load data into relations

1.2 Debug Pig scripts

1.3 Perform simple manipulations

1.4 Save relations as files

2. Manipulate Data with Apache Pig (8%)

2.1 Subset relations, combine relations, and use UDFs on relations

3. Create Tables and Load data in Apache Hive (12%)

3.1 Create Databases

3.2 Create simple, external, and partitioned tables

3.3 Alter and drop tables

4. Query data with Apache Hive (20%)

4.1 Query Tables

4.2 Manipulate tables with UDFs

4.3 Combine and store tables

5. SQL Queries with Drill (25%)

5.1 Create Tables & Views

5.2 Join structured and semi-structured content into a single query

5.3 SQL functions including window functions

5.4 Nested data functions including JSON data model

5.5 Data type conversions including CAST & CONVERT_FROM functions

5.6 Partition Pruning & CTAS Auto partitioning

6. Working with Self Describing Data (15%)

6.1 Define structured and unstructured data formats

6.3 Storage plugins and workspaces in Drill

7. Advanced Topics (7%)

7.1 Optimize Drill Queries

7.2 Troubleshooting, profiles, logs, & Tuning

7.3 Security, authorization, authentication, and impersonation

1. Which of the following is true of FOREACH in Pig?

a) FOREACH statements must be paired with a GENERATE statement

b) UDFs cannot be called inside FOREACH statements

c) column names must be defined with AS inside FOREACH statements

d) aggregate functions cannot be called inside FOREACH statements

a) PQR = ORDER ABC BY last.name;

b) XYZ = LIMIT {ORDER ABC BY last.name} 100;

c) XYZ = FOREACH ABC

d) XYZ = ORDER {LIMIT ABC 100} BY last.name;

3. You query the "employees" table, which is already partitioned by (country, state).

Query: SELECT e.name, e.salary FROM employees e LIMIT 5;