
BEITC802 - Big Data Analysis

B.E. (I.T) - Even Semester 2019


Experiment List

Name: Smriti Sharma Roll no: 80

No.    Experiment Name                                                              Page No.   Date

A      Self-Study                                                                   1          08/01/2019
B      Case Study on Big Data – IEEE/White papers                                   6          15/01/2019
1      Installation of Eclipse & Hadoop, to do environment setup                    12         22/01/2019
2      MapReduce program – to find maximum temperature                              17         29/01/2019
3      MapReduce program – to calculate word count without combiner                 22         05/02/2019
4      MapReduce program – matrix multiplication                                    27         26/02/2019
5 & 6  R base (language) and RStudio installation & executing basic commands;       37         05/03/2019
       RStudio: working with extended commands, control structures, matrix
       multiplication and report generation
7 & 8  MongoDB & MongoChef (GUI, optional) – installation and practising commands   43         26/03/2019

Experiment A
Self Study
Objective: To introduce Big Data and the tools required to manage and analyse big data, like Hadoop, NoSQL, MapReduce and the R language.
Aim: To do a self-study of the basic concepts, applications and technology so as to get familiarized with the subject.

Procedure:
Groups were formed consisting of a mix of good and average students so that they could contribute to and share knowledge with their members, with the help of the books and URLs provided to them.

Topic: Exercise

Q1. Big Data

Ans. Big data is an evolving term that describes a large volume of structured, semi-
structured and unstructured data that has the potential to be mined for information and
used in machine learning projects and other advanced analytics applications. Big data can
be analyzed for insights that lead to better decisions and strategic business moves.

Q2. Big Data Characteristics

Ans.

1. Volume – The name 'Big Data' itself is related to a size which is enormous. The size of data plays a very crucial role in determining its value. Whether a particular data set can actually be considered Big Data or not also depends upon its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
2. Variety – The next aspect of Big Data is its variety.

Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analysing data.
3. Velocity – The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demand determines the real potential of the data.


Big Data Velocity deals with the speed at which data flows in from sources like business
processes, application logs, networks and social media sites, sensors, Mobile devices, etc.
The flow of data is massive and continuous.

Q3. Types of data used in Big Data

Ans. Big data can be found in the following forms:

1. Structured
2. Unstructured
3. Semi-structured
4. Hybrid

1. Structured
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data. Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and in deriving value from it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
Examples Of Structured Data

An 'Employee' table in a database is an example of Structured Data

Employee_ID Employee_Name Gender Department Salary_In_lacs


2365 Rajesh Kulkarni Male Finance 650000
3398 Pratibha Joshi Female Admin 650000
7465 Shushil Roy Male Admin 500000
7500 Shubhojit Das Male Finance 500000
7699 Priya Sane Female Finance 550000

2. Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available with them, but unfortunately they don't know how to derive value out of it, since this data is in its raw, unstructured form.


Examples Of Un-structured Data

Output returned by 'Google Search'

3. Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data

Personal data stored in an XML file −

<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>

4. Hybrid

Hybrid data contains a combination of structured, unstructured and semi-structured data.

Examples of Hybrid Data

E-commerce, weather reports, currency conversion, POS, POL, NFC, etc.

Q4. Difference between Data Analyst & Data Scientist

Ans.


Q5. Applications of Big Data. Explain any one in detail.

Ans.

1. Banking and Securities

2. Communications, Media and Entertainment

3. Healthcare Providers

4. Education

5. Manufacturing and Natural Resources

6. Government

7. Insurance

8. Retail and Wholesale trade

9. Transportation

10. Energy and Utilities

Healthcare Providers

Industry-Specific challenges

The healthcare sector has access to huge amounts of data but has been plagued by failures to utilize the data to curb rising healthcare costs and by inefficient systems that stifle faster and better healthcare benefits across the board.
This is mainly because electronic data is unavailable, inadequate, or unusable. Additionally, the healthcare databases that hold health-related information have made it difficult to link data that can reveal patterns useful in the medical field.
Other challenges related to big data include the exclusion of patients from the decision-making process and the use of data from different readily available sensors.

Applications of big data in the healthcare sector

Some hospitals, like Beth Israel, are using data collected from a cell phone app, from
millions of patients, to allow doctors to use evidence-based medicine as opposed to
administering several medical/lab tests to all patients who go to the hospital. A battery of
tests can be efficient but they can also be expensive and usually ineffective.
Free public health data and Google Maps have been used by the University of Florida to
create visual data that allows for faster identification and efficient analysis of healthcare
information, used in tracking the spread of chronic disease.


Obamacare has also utilized big data in a variety of ways.

Q6. To install tools like

● RStudio & R Language
https://www.r-bloggers.com/download-and-install-r-in-ubuntu/
● Hadoop & MapReduce
● one NoSQL software

Conclusion:

Thus, the study was done and a set of questions and answers was compiled that can help one clear up and understand the basic fundamentals and concepts of Big Data technology.

Reference:

1. http://moodle.dbit.in/course/view.php?id=428

2. https://www.simplilearn.com/big-data-applications-in-industries-article

3. https://www.dezyre.com/article/difference-between-data-analyst-and-data-scientist/332

4. https://searchdatamanagement.techtarget.com/definition/big-data


Experiment B - Case Study on Big Data
(IEEE paper / White paper)

Objective: To expose the students to the research & modern tools existing in the market
Aim: To do a case study on IEEE/white papers by forming a group and writing a summary

Procedure:
1. Groups of a maximum of 2-3 students were formed, combining bright and weak students.
2. To identify and decide on a paper/topic, understand the paper, explain it to each other within the team, and arrive at a common understanding.
3. To write down the summary or explanation of the paper.
4. Students will be graded based on their peer-to-peer viva.
Topics:
Big Data Applications
Big Data Applications in various domains
Big Data Analytic Algorithms
Big Data Analytics Tools & Techniques
NoSQL

Big Data Applications

The primary goal of Big Data applications is to help companies make more informed business decisions by analyzing large volumes of data. This data could include web server logs, Internet clickstream data, social media content and activity reports, text from customer emails, mobile phone call details and machine data captured by multiple sensors. Organisations from different domains are investing in Big Data applications to examine large data sets and uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information. This summary covers:

● Big Data Applications in Healthcare


● Big Data Applications in Manufacturing
● Big Data Applications in Media & Entertainment
● Big Data Applications in IoT
● Big Data Applications in Government

Let’s understand how Big Data applications are playing a major role in different domains.

Big Data Applications: Healthcare


The level of data generated within healthcare systems is not trivial. Traditionally, the healthcare industry lagged in using Big Data because of its limited ability to standardize and consolidate data.
But now Big Data analytics has improved healthcare by providing personalized medicine and prescriptive analytics. Researchers are mining the data to see which treatments are more effective for particular conditions, identify patterns related to drug side effects, and gain other important information that can help patients and reduce costs.
With the added adoption of mHealth, eHealth and wearable technologies, the volume of data is increasing at an exponential rate. This includes electronic health record data, imaging data, patient-generated data, sensor data, and other forms of data.
By mapping healthcare data with geographical data sets, it is possible to predict diseases that will escalate in specific areas. Based on these predictions, it is easier to strategize diagnostics and plan for stocking serums and vaccines.

Big Data Applications: Manufacturing

Predictive manufacturing provides near-zero downtime and transparency. It requires an enormous amount of data and advanced prediction tools to systematically turn that data into useful information.

Major benefits of using Big Data applications in the manufacturing industry are:


● Product quality and defects tracking


● Supply planning
● Manufacturing process defect tracking
● Output forecasting
● Increasing energy efficiency
● Testing and simulation of new manufacturing processes
● Support for mass-customization of manufacturing

Big Data Applications: Media & Entertainment

Various companies in the media and entertainment industry are facing new business models for the way they create, market and distribute their content. This is happening because consumers now search for and expect access to content anywhere, at any time, on any device.
Big Data provides actionable points of information about millions of individuals. Publishing environments are now tailoring advertisements and content to appeal to consumers. These insights are gathered through various data-mining activities. Big Data applications benefit the media and entertainment industry by:

● Predicting what the audience wants


● Scheduling optimization
● Increasing acquisition and retention
● Ad targeting
● Content monetization and new product development


Big Data Applications: Internet of Things (IoT)


Data extracted from IoT devices provides a mapping of device inter-connectivity. Such
mappings have been used by various companies and governments to increase efficiency.
IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data
is used in medical and manufacturing contexts.


Big Data Applications: Government


The use and adoption of Big Data within governmental processes allows efficiencies in terms of cost, productivity, and innovation. In government use cases, the same data sets are often applied across multiple applications, which requires multiple departments to work in collaboration.



Since the government acts in all the domains, it plays an important role in innovating Big Data applications in each and every domain. Some of the major areas are:

Cyber security & Intelligence


The federal government launched a cyber security research and development plan that relies
on the ability to analyze large data sets in order to improve the security of U.S. computer
networks.
The National Geospatial-Intelligence Agency is creating a “Map of the World” that can
gather and analyze data from a wide variety of sources such as satellite and social media
data. It contains a variety of data from classified, unclassified, and top-secret networks.

Crime Prediction and Prevention


Police departments can leverage advanced, real-time analytics to provide actionable
intelligence that can be used to understand criminal behaviour, identify crime/incident
patterns, and uncover location-based threats.

Pharmaceutical Drug Evaluation


According to a McKinsey report, Big Data technologies could reduce research and
development costs for pharmaceutical makers by $40 billion to $70 billion. The FDA and
NIH use Big Data technologies to access large amounts of data to evaluate drugs and
treatment.

Scientific Research
The National Science Foundation has initiated a long-term plan to:

● Implement new methods for deriving knowledge from data


● Develop new approaches to education
● Create a new infrastructure to “manage, curate, and serve data to communities”.

Weather Forecasting
The NOAA (National Oceanic and Atmospheric Administration) gathers data every minute
of every day from land, sea, and space-based sensors. Daily NOAA uses Big Data to
analyze and extract value from over 20 terabytes of data.

Tax Compliance
Big Data Applications can be used by tax organizations to analyze both unstructured and
structured data from a variety of sources in order to identify suspicious behavior and
multiple identities. This would help in tax fraud identification.

Traffic Optimization


Big Data helps in aggregating real-time traffic data gathered from road sensors, GPS
devices and video cameras. The potential traffic problems in dense areas can be prevented
by adjusting public transportation routes in real time.

These are some of the prominent examples of Big Data applications, but there are countless ways in which Big Data is revolutionizing each and every domain.

Reference:
1. http://moodle.dbit.in/course/view.php?id=428


Experiment 1

Installing Hadoop and Eclipse, and Doing the Environment Setup

Objective: To install & do the initial setup for implementing MapReduce programs

Aim: To install Eclipse & Hadoop, do the environment setup, and then implement a Mapper program to find the maximum temperature

Prerequisite: Knowledge of Eclipse, Java and Linux

Steps in installation:
a. Eclipse
1. Install Java
If you don't have Java installed on your system, click the link below to bring up Ubuntu Software Center and install OpenJDK Java 7:
http://ubuntuhandbook.org/index.php/2013/07/install-oracle-java-6-7-8-on-ubuntu-13-10/
2. Download Eclipse from
http://www.eclipse.org/downloads/packages/release/Kepler/SR2
3. Extract Eclipse to /opt/ for global use
Press Ctrl+Alt+T on keyboard to open the terminal. When it opens, run the command
below to extract Eclipse to /opt/:
cd /opt/ && sudo tar -zxvf ~/Downloads/eclipse-*.tar.gz

Or open the Nautilus file browser as root: press Alt+F2 and run gksudo nautilus. Once done, you should see the eclipse folder under the /opt/ directory.
4. Create a launcher shortcut for Eclipse
Press Ctrl+Alt+T, paste the command below into the terminal and hit Enter (install gksu from Software Center if the command does not work):
gksudo gedit /usr/share/applications/eclipse.desktop
The above command will create and open the launcher file for Eclipse with the gedit text editor. Paste the content below into the opened file and save it.
[Desktop Entry]
Name=Eclipse 4
Type=Application
Exec=/opt/eclipse/eclipse
Terminal=false
Icon=/opt/eclipse/icon.xpm
Comment=Integrated Development Environment
NoDisplay=false
Categories=Development;IDE;
Name[en]=Eclipse

5. Finally open Eclipse from Dashboard

b. Hadoop on single node


1. Adding a dedicated Hadoop user
k@laptop:~$ sudo addgroup hadoop
Adding group `hadoop' (GID 1002) ...
Done.
k@laptop:~$ sudo adduser --ingroup hadoop hduser
Adding user `hduser' ...
Adding new user `hduser' (1001) with group `hadoop' ...
Creating home directory `/home/hduser' ...
Copying files from `/etc/skel' ...
Enter new UNIX password:
Retype new UNIX password:
passwd: password updated successfully
Changing the user information for hduser
Enter the new value, or press ENTER for the default
Full Name []:
Room Number []:
Work Phone []:
Home Phone []:
Other []:
Is the information correct? [Y/n] Y

2. Installing SSH
ssh has two main components:
1.ssh : The command we use to connect to remote machines - the client.
2.sshd : The daemon that is running on the server and allows clients to connect to the
server.
The ssh is pre-enabled on Linux, but in order to start sshd daemon, we need to install ssh
first. Use this command to do that :
k@laptop:~$ sudo apt-get install ssh
This will install ssh on our machine. If we get something similar to the following, we can
think it is
setup properly:
k@laptop:~$ which ssh
/usr/bin/ssh
k@laptop:~$ which sshd
/usr/sbin/sshd

3. Create and Setup SSH Certificates


Hadoop requires SSH access to manage its nodes, i.e. remote machines plus our local machine. For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost. So, we need to have SSH up and running on our machine and configure it to allow SSH public key authentication.
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a
password. However, this requirement can be eliminated by creating and setting up SSH
certificates using the following commands. If asked for a filename just leave it blank and
press the enter key to continue.
k@laptop:~$ su hduser
Password:
k@laptop:~$ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
+--[ RSA 2048]----+
| .oo.o |
| . .o=. o |
|.+.o.|
|o=E|
|S+|
|.+|
|O+|
|Oo|
| o.. |
+-----------------+
hduser@laptop:/home/k$ cat $HOME/.ssh/id_rsa.pub >>$HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized keys so that
Hadoop can use ssh without prompting for a password. We can check if ssh works:
hduser@laptop:/home/k$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)

4. Install Hadoop
hduser@laptop:~$ wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
hduser@laptop:~$ tar xvzf hadoop-2.6.0.tar.gz
We want to move the Hadoop installation to the /usr/local/hadoop directory using the following command:
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
[sudo] password for hduser:
hduser is not in the sudoers file. This incident will be reported.


Note: for the error "hduser is not in the sudoers file. This incident will be reported."
This error can be resolved by logging in as a root user and then adding hduser to sudo:
hduser@laptop:~/hadoop-2.6.0$ su k
Password:
k@laptop:/home/hduser$ sudo adduser hduser sudo
[sudo] password for k:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
Now that hduser has root privilege, we can move the Hadoop installation to the /usr/local/hadoop directory without any problem:
k@laptop:/home/hduser$ sudo su hduser
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
hduser@laptop:~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop
5. Setup Configuration Files
The following files will have to be modified to complete the Hadoop setup:
1.~/.bashrc
2./usr/local/hadoop/etc/hadoop/hadoop-env.sh
3./usr/local/hadoop/etc/hadoop/core-site.xml
4./usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5./usr/local/hadoop/etc/hadoop/hdfs-site.xml

1. ~/.bashrc : Before editing the .bashrc file in our home directory, we need to find the
path where Java has been installed to set the JAVA_HOME environment variable using
the following command:
hduser@laptop:~$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java):
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
Now we can append the following to the end of ~/.bashrc:
hduser@laptop:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
hduser@laptop:~$ source ~/.bashrc
note that the JAVA_HOME should be set as the path just before the '.../bin/':


hduser@ubuntu-VirtualBox:~$ javac -version


javac 1.7.0_75
hduser@ubuntu-VirtualBox:~$ which javac
/usr/bin/javac
hduser@ubuntu-VirtualBox:~$ readlink -f /usr/bin/javac
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac

2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh: We need to set JAVA_HOME by modifying the hadoop-env.sh file.
hduser@laptop:~$ vi /usr/local/hadoop/etc/hadoop/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Adding the above statement to the hadoop-env.sh file ensures that the value of the JAVA_HOME variable will be available to Hadoop whenever it is started up.
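The remaining configuration files listed earlier (core-site.xml, mapred-site.xml and hdfs-site.xml) also need single-node entries. A minimal core-site.xml sketch for such a setup (assuming HDFS on hdfs://localhost:9000 and /app/hadoop/tmp as the temporary directory; adjust to your environment) is:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
    <description>Base for Hadoop's temporary directories (assumed location).</description>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
    <description>Default file system URI for the single-node cluster.</description>
  </property>
</configuration>

Along the same lines, hdfs-site.xml would typically set dfs.replication to 1 on a single node, and mapred-site.xml (copied from mapred-site.xml.template) would set mapreduce.framework.name to yarn.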
Conclusion: Initial setup needed to implement Map Reduce programs was done
successfully

References:
1. http://ubuntuhandbook.org/index.php/2013/07/install-oracle-java-6-7-8-on-ubuntu-13-10/
2. http://www.eclipse.org/downloads/packages/release/Kepler/SR2
3. http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php


Experiment 2
MapReduce Program – To Find Maximum Temperature
Objective: To know how to set up stream input and output generation and understand the flow of MapReduce data
Aim: To implement a MapReduce program for finding the maximum temperature from all the data collected from the weather forecast DB

Pre-requisite: Java I/O file system, stream commands

Procedure:

MapReduce is a programming model designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. The previous experiment gave an introduction to MapReduce; this one explains how to design a MapReduce program. The aim of the program is to find the maximum temperature recorded for each year of NCDC data.
The input for our program is a set of weather data files, one per year. This weather data is collected by the National Climatic Data Center (NCDC) from weather sensors all over the world. You can find weather data for each year from ftp://ftp.ncdc.noaa.gov/pub/data/noaa/. All files are zipped by year and weather station, and for each year there are multiple files for different weather stations.
Here is an example for 1990 (ftp://ftp.ncdc.noaa.gov/pub/data/noaa/1990/):
● 010080-99999-1990.gz
● 010100-99999-1990.gz
● 010150-99999-1990.gz
Consider a sample record: the first highlighted field (029070) is the USAF weather station identifier. The next one (19050101) represents the observation date. The third highlighted field (-0139) represents the air temperature in Celsius times ten, so the reading of -0139 equates to -13.9 degrees Celsius. The next highlighted item indicates a reading quality code.
Map Phase: The input for the Map phase is the set of weather data files. The input key-value types are LongWritable and Text, and the output key-value types are Text and IntWritable. Each map task extracts the temperature data from the given year's file. The output of the Map phase is a set of key-value pairs: the keys are the years and the values are the temperatures recorded for each year.
Reduce Phase: The Reduce phase takes all the values associated with a particular key, i.e. all the temperature values belonging to a particular year are fed to the same reducer. Each reducer then finds the highest recorded temperature for that year. The output key-value types of the Map phase are the same as the input key-value types of the Reduce phase (Text and IntWritable). The output key-value types of the Reduce phase are also Text and IntWritable.

In Hadoop, MapReduce is a computation that decomposes large manipulation jobs into individual tasks that can be executed in parallel across a cluster of servers. The results of the tasks can be joined together to compute the final results.
MapReduce consists of 2 steps:
● Map Function – It takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
Example – (Map function in Word Count)

Input (set of data):
Bus, Car, bus, car, train, car, bus, car, train, bus, TRAIN, BUS, buS, caR, CAR, car, BUS, TRAIN

Output (converted into another set of data as (Key, Value) pairs):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

● Reduce Function – Takes the output from Map as an input and combines those data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)

Input (set of tuples, the output of the Map function):
(Bus,1), (Car,1), (bus,1), (car,1), (train,1), (car,1), (bus,1), (car,1), (train,1), (bus,1), (TRAIN,1), (BUS,1), (buS,1), (caR,1), (CAR,1), (car,1), (BUS,1), (TRAIN,1)

Output (smaller set of tuples):
(BUS,7), (CAR,7), (TRAIN,4)

Workflow of MapReduce consists of 5 steps:

1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group the records in the Reduce phase, data with the same key should be on the same cluster.
4. Reduce – it is essentially a group-by-and-aggregate phase.
5. Combining – the last phase, where all the data (the individual result sets from each cluster) are combined together to form a result.
Algorithm:
map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each (year, temperature) in input_value:
        EmitIntermediate(year, temperature);

reduce(String output_key, Iterator intermediate_values):
    // output_key: year
    // intermediate_values: a list of temperatures
    int maxValue = Integer.MIN_VALUE;
    for each temperature in intermediate_values:
        maxValue = Math.max(maxValue, temperature);
    Emit(year, maxValue);


Program:
1)YearTempDriver.java
package test;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YearTempDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "JobName");
        job.setJarByClass(test.YearTempDriver.class);
        job.setMapperClass(test.YearTempMapper.class);

        // TODO: specify a reducer
        // job.setReducerClass(Reducer.class);
        job.setNumReduceTasks(0);

        // TODO: specify input and output DIRECTORIES (not files)
        FileInputFormat.setInputPaths(job, new Path("/home/dbit/Desktop/weather_data_input"));
        FileOutputFormat.setOutputPath(job, new Path("/home/dbit/Desktop/out"));

        if (!job.waitForCompletion(true))
            return;
    }
}

2)YearTempMapper.java
package test;

import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class YearTempMapper extends
        Mapper<LongWritable, Text, IntWritable, FloatWritable> {

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String line = ivalue.toString();
        String[] tokens = StringUtils.split(line, '|');
        if (tokens.length == 5) {
            int year = Integer.parseInt(tokens[1]);
            float temp = Float.parseFloat(tokens[4]);
            // write it to the output of the mapper
            context.write(new IntWritable(year), new FloatWritable(temp));
        }
    }
}
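The driver above runs a map-only job (setNumReduceTasks(0)), so the reducer described in the theory section is not part of the listing. A minimal sketch of such a reducer, assuming a class named YearTempReducer that would be registered with job.setReducerClass(...), is shown below; it simply keeps the maximum temperature seen for each year:

package test;

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer (not part of the original program): emits the maximum
// temperature observed for each year key produced by YearTempMapper.
public class YearTempReducer extends
        Reducer<IntWritable, FloatWritable, IntWritable, FloatWritable> {

    public void reduce(IntWritable year, Iterable<FloatWritable> temps, Context context)
            throws IOException, InterruptedException {
        float max = Float.NEGATIVE_INFINITY;
        for (FloatWritable temp : temps) {
            max = Math.max(max, temp.get());
        }
        context.write(year, new FloatWritable(max));
    }
}

With this reducer enabled, the output would contain one line per year instead of one line per input record.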

Input:

1) weather_data_set_1900
3300|1900|0101|0300|23
3300|1900|0101|0400|24
3300|1900|0301|0200|26
3300|1900|0305|0230|24
3300|1900|0312|0100|30
3300|1900|0412|0300|29
3301|1900|0312|0100|31
3301|1900|0412|0400|23

2) weather_data_set_1901
3300|1901|0101|0400|24
3300|1901|0101|0500|45
3300|1901|0301|0300|40
3300|1901|0312|0200|34
3300|1901|0412|0100|32
3301|1901|0312|0130|22
3301|1901|0412|01500|21
3302|1901|0102|0400|20
3302|1901|0103|0500|24
3302|1901|0203|0300|35
3302|1901|0312|0200|20
3302|1901|0412|0100|19
3302|1901|0102|0400|26
3302|1901|0104|0430

Output:

1) part-m-0000
1901 24.0
1901 45.0
1901 40.0
1901 34.0
1901 32.0
1901 22.0
1901 21.0
1901 20.0
1901 24.0
1901 35.0
1901 20.0
1901 19.0
1901 26.0

2) part-m-0001
1900 23.0
1900 24.0
1900 26.0
1900 24.0
1900 30.0
1900 29.0
1900 31.0
1900 23.0

References :
1. https://www.dezyre.com/hadoop-tutorial/hadoop-mapreduce-tutorial-
2. http://www.hindawi.com/journals/tswj/2014/646497/
3. http://www.oracle.com/technetwork/articles/java/architect-streams-pt2-2227132.html


Experiment 3
MapReduce: Word Count Program with/without Combiner

Objective: Understand the processing & execution pipeline & functions of MapReduce.
Pre-requisite: MapReduce
Aim: To implement a program to count the distinct words in the given input stream files.
Theory:
Map Phase: The input for the Map phase is a set of text files. The input key-value types are LongWritable and Text, and the output key-value types in the program below are Text and Text (the word and the file it occurs in). Each map task tokenizes its input split into words and emits one key-value pair per word.
Reduce Phase: The Reduce phase takes all the values associated with a particular key, i.e. all the values emitted for a particular word are fed to the same reducer, which then aggregates them. The output key-value types of the Map phase must match the input key-value types of the Reduce phase.
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group the records in the Reduce phase, data with the same key should be on the same cluster.
4. Reduce – it is essentially a group-by-and-aggregate phase.
5. Combining – the last phase, where all the data (the individual result sets from each cluster) are combined together to form a result.


Steps in the Program:

The functionality of the map method is as follows:
1. Create an IntWritable variable 'one' with value 1.
2. Convert the input line from Text type to a String.
3. Use a tokenizer to split the line into words.
4. Iterate through each word and form key-value pairs:
a. Assign each word from the tokenizer (of String type) to a Text 'word'.
b. Form a key-value pair for each word as <word, one> and push it to the output collector.

The functionality of the reduce method is as follows:
1. Initialize a variable 'sum' as 0.
2. Iterate through all the values with respect to a key and sum them up.
3. Push the key and the obtained sum as value to the output collector.

A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class. The main function of a Combiner is to summarize the map output records with the same key. The output (key-value collection) of the combiner is sent over the network to the actual Reducer task as input.
The Combiner class is used between the Map class and the Reduce class to reduce the volume of data transferred between Map and Reduce; usually the output of the map task is large, and so is the data transferred to the reduce task. For instance, if a single map task emits (car,1) three times, a combiner running on that map's output can emit (car,3) once, so far fewer records cross the network.


Algorithm:
class Mapper
method Map(docid id, doc d)
for all term t in doc d do
Emit(term t, count 1)

class Reducer
method Reduce(term t, counts [c1, c2,...])
sum = 0
for all count c in [c1, c2,...] do
sum = sum + c
Emit(term t, count sum)
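For comparison, a minimal Java reducer implementing this classic word-count algorithm (a sketch only; the program listed below instead maps each word to the files it occurs in) could look like:

package hadoop2;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical word-count reducer matching the pseudocode above; it is not
// part of the distinct-words-to-file program of this experiment.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();  // add up the 1s emitted for this word
        }
        context.write(word, new IntWritable(sum));
    }
}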

Program:
Mapper Program: WordToFileNameMapper.java
package hadoop2;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class WordToFileNameMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String fileName;
    private Text key = new Text();
    private Text value = new Text();

    @Override
    protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
            throws IOException, InterruptedException {
        super.setup(context);
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        value.set(fileName);
    }

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String line = ivalue.toString();
        String[] tokens = StringUtils.split(line, ' ');
        for (String token : tokens) {
            key.set(token);
            context.write(key, value);
        }
    }
}

Combiner program: WordToFileNameCombine.java

package hadoop2;
import java.io.IOException;
import java.util.HashMap;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordToFileNameCombine extends Reducer<Text, Text, Text, Text> {

    private Text value = new Text();

    public void reduce(Text _key, Iterable<Text> fileNames, Context context)
            throws IOException, InterruptedException {
        HashMap<String, Void> encounteredFileNames = new HashMap<>();
        // process values
        for (Text fileName : fileNames) {
            String fileNameStr = fileName.toString();
            if (!encounteredFileNames.containsKey(fileNameStr)) {
                encounteredFileNames.put(fileNameStr, null);
                value.set(fileNameStr);
                context.write(_key, value);
            }
        }
    }
}

Reducer Program: WordToFileNameListReducer.java

package hadoop2;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import org.apache.commons.collections.map.HashedMap;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordToFileNameListReducer extends Reducer<Text, Text, Text, Text> {

    private Text value = new Text();

    public void reduce(Text _key, Iterable<Text> fileNames, Context context)
            throws IOException, InterruptedException {
        ArrayList<String> uniqueFileNames = new ArrayList<>();
        HashMap<String, Void> encounteredFileNames = new HashMap<>();
        // process values
        for (Text fileName : fileNames) {
            String fileNameStr = fileName.toString();
            if (!encounteredFileNames.containsKey(fileNameStr)) {
                encounteredFileNames.put(fileNameStr, null);
                uniqueFileNames.add(fileNameStr);
            }
        }
        // the original listing was cut off here; joining with a space is assumed
        value.set(StringUtils.join(uniqueFileNames, ' '));
        context.write(_key, value);
    }
}

Driver Program: DistinctWordsToFileDriver.java

package hadoop2;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistinctWordsToFileDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "JobName");
        job.setJarByClass(hadoop2.DistinctWordsToFileDriver.class);
        job.setMapperClass(hadoop2.WordToFileNameMapper.class);
        job.setCombinerClass(hadoop2.WordToFileNameCombine.class);
        job.setReducerClass(hadoop2.WordToFileNameListReducer.class);

        // TODO: specify output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // TODO: specify input and output DIRECTORIES (not files)
        FileInputFormat.setInputPaths(job, new Path("/home/universe/Desktop/input"));
        FileOutputFormat.setOutputPath(job, new Path("/home/universe/Desktop/words"));

        if (!job.waitForCompletion(true))
            return;
    }
}

Input:
input1.txt  cat mat sat bat
input2.txt  lat bat cat lat cat
input3.txt  bat hat yat

Output:
part-m-00000        part-m-00001        part-m-00002
bat input1.txt      lat input2.txt      bat input3.txt
cat input1.txt      bat input2.txt      hat input3.txt
lat input1.txt      cat input2.txt      yat input3.txt
Conclusion:
Distinct word count program was implemented with the help of combiner and reducer

References:
1. http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
2. https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
3. http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/ditp/ditp_ch3.pdf


Experiment 4
Matrix Multiplication

Objective: To understand the flow, working and implementation of matrix multiplication in MapReduce.

Pre-requisite: Mathematical matrix multiplication, Java

Aim: To implement matrix multiplication using MapReduce, for a better understanding of the flow of data.

Procedure:
The input files describe the two matrices M and N, one cell per line, in the form matrix,row,column,value (e.g. m,0,1,2). The multiplication is carried out as two chained MapReduce jobs. In the first job, the mapper keys each cell of M by its column index j and each cell of N by its row index j; the reducer joins the M and N cells that share the same j and emits the partial product m(i,j) * n(j,k) under the key (i,k). In the second job, the reducer sums all partial products for each (i,k) key to produce the cells of the result matrix.


Program :
1)IntPair.java
package Matrix;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class IntPair implements WritableComparable<IntPair> {

private IntWritable i;
private IntWritable k;

public IntPair() {
i = new IntWritable();
k = new IntWritable(); }


public void set(int string, int string2) {
    this.i.set(string);
    this.k.set(string2); }

public int getI() { return i.get(); }

public int getK() { return k.get(); }

@Override
public String toString() {
return i.get() + "," + k.get(); }

@Override
public void readFields(DataInput input) throws IOException {
// TODO Auto-generated method stub
i.readFields(input);
k.readFields(input); }

@Override
public void write(DataOutput output) throws IOException {
// TODO Auto-generated method stub
i.write(output);
k.write(output);}

@Override
public int compareTo(IntPair second) {
// TODO Auto-generated method stub
int cmp = this.i.compareTo(second.i);
if(cmp != 0){
return cmp;
}return this.k.compareTo(second.k);
}

@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((i == null) ? 0 : i.hashCode());
result = prime * result + ((k == null) ? 0 : k.hashCode());
return result; }

@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())


return false;
IntPair other = (IntPair) obj;
if (i == null) {
if (other.i != null)
return false;
} else if (!i.equals(other.i))
return false;
if (k == null) {
if (other.k != null)
return false;
} else if (!k.equals(other.k))
return false;
return true; } }
2)MatrixMultiplyDriver.java
package Matrix;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MatrixMultiplyDriver {

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

Job job1 = Job.getInstance(conf, "STep1JobName");


job1.setJarByClass(Matrix.MatrixMultiplyDriver.class);
job1.setMapperClass(Matrix.STep1MatrixMapper.class);

// TODO: specify a reducer


job1.setReducerClass(Step1MatrixReducer.class);

// TODO: specify output types


job1.setMapOutputKeyClass(IntWritable.class);
job1.setMapOutputValueClass(Relation.class);
job1.setOutputKeyClass(IntPair.class);
job1.setOutputValueClass(IntWritable.class);

// TODO: specify input and output DIRECTORIES (not files)


FileInputFormat.setInputPaths(job1, new
Path("/home/universe/Desktop/2by2"));


FileOutputFormat.setOutputPath(job1,
new Path("/home/universe/Desktop/out1/step1"));

/**job1 ends **/

Job job2 = Job.getInstance(conf, "STep2JobName");


job2.setJarByClass(Matrix.MatrixMultiplyDriver.class);

job2.setMapperClass(Matrix.Step2MatrixMapper.class);
job2.setReducerClass(Step2MatrixReducer.class);

// TODO: specify output types


job2.setMapOutputKeyClass(IntPair.class);
job2.setMapOutputValueClass(IntWritable.class);
job2.setOutputKeyClass(IntPair.class);
job2.setOutputValueClass(IntWritable.class);

// TODO: specify input and output DIRECTORIES (not files)


FileInputFormat.setInputPaths(job2, new
Path("/home/universe/Desktop/out1/step1/part-r*"));
FileOutputFormat.setOutputPath(job2, new
Path("/home/universe/Desktop/out1/step2"));

/**end of job2**/

ControlledJob cj1 = new ControlledJob(conf);


cj1.setJob(job1);

ControlledJob cj2 = new ControlledJob(conf);


cj2.setJob(job2);

cj2.addDependingJob(cj1);

JobControl jobControl = new JobControl("MatrixMultiplicaiton");


jobControl.addJob(cj1);
jobControl.addJob(cj2);

Thread thread = new Thread(jobControl);


thread.setDaemon(true);
thread.start();

while(!jobControl.allFinished()){
Thread.sleep(4000); } }}

3)Relation.java
package Matrix;
import java.io.DataInput;
import java.io.DataOutput;


import java.io.IOException;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class Relation implements Writable {

private Text fromMorN;


private IntWritable iOrk;
private IntWritable mijorNjK;

public Relation() {
fromMorN = new Text();
iOrk = new IntWritable();
mijorNjK = new IntWritable(); }

void set(String sourceMatrix, int MatrixCoordinate, int MatrixCallValue) {


fromMorN.set(sourceMatrix);
iOrk.set(MatrixCoordinate);
mijorNjK.set(MatrixCallValue); }

public String getFromMatrix() {


return fromMorN.toString(); }

public int getiOrk() {


return iOrk.get(); }

public int getmijorNjK() {


return mijorNjK.get(); }

@Override
public void readFields(DataInput input) throws IOException {
fromMorN.readFields(input);
iOrk.readFields(input);
mijorNjK.readFields(input); }

@Override
public void write(DataOutput output) throws IOException {
// TODO Auto-generated method stub
fromMorN.write(output);
iOrk.write(output);
mijorNjK.write(output); }}


4)Step1MatrixReducer.java
package Matrix;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Step1MatrixReducer extends
        Reducer<IntWritable, Matrix.Relation, Matrix.IntPair, IntWritable> {

    private IntPair key = new IntPair();
    private IntWritable value = new IntWritable();

    public void reduce(IntWritable _key, Iterable<Relation> values, Context context)
            throws IOException, InterruptedException {
        // process values
        ArrayList<Relation> mRels = new ArrayList<>();
        ArrayList<Relation> nRels = new ArrayList<>();

        // separate the M and N relations
        for (Relation value : values) {
            // for every relation create a new object at the reducer side
            Relation temp = new Relation();
            temp.set(value.getFromMatrix(), value.getiOrk(), value.getmijorNjK());

            if (value.getFromMatrix().equals("m")) {
                mRels.add(temp);
            } else {
                nRels.add(temp);
            }
        }

        for (Iterator iterator = mRels.iterator(); iterator.hasNext();) {
            Relation mRelation = (Relation) iterator.next();
            for (Iterator iterator2 = nRels.iterator(); iterator2.hasNext();) {
                Relation nRelation = (Relation) iterator2.next();

                key.set(mRelation.getiOrk(), nRelation.getiOrk());
                value.set(mRelation.getmijorNjK() * nRelation.getmijorNjK());

                context.write(key, value);
            }
        }
    }
}

5)Step2MatrixReducer.java
package Matrix;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Step2MatrixReducer extends
        Reducer<IntPair, IntWritable, IntPair, IntWritable> {

    private IntWritable value = new IntWritable();

    public void reduce(IntPair _key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        value.set(sum);
        context.write(_key, value);
    }
}

6)Step1MatrixMapper.java
package Matrix;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class STep1MatrixMapper extends
        Mapper<LongWritable, Text, IntWritable, Matrix.Relation> {

    private IntWritable key = new IntWritable();
    private Relation value = new Relation();

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String line = ivalue.toString();
        String[] tokens = StringUtils.split(line, ',');

        if (tokens[0].equals("m")) {
            // M cells are keyed by their column index j
            key.set(Integer.parseInt(tokens[2]));
            value.set(tokens[0], Integer.parseInt(tokens[1]), Integer.parseInt(tokens[3]));
        } else {
            // N cells are keyed by their row index j
            key.set(Integer.parseInt(tokens[1]));
            value.set(tokens[0], Integer.parseInt(tokens[2]), Integer.parseInt(tokens[3]));
        }

        context.write(key, value);
    }
}

7)Step2MatrixMapper.java
package Matrix;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Step2MatrixMapper extends
        Mapper<LongWritable, Text, Matrix.IntPair, IntWritable> {

    public IntPair key = new IntPair();
    public IntWritable value = new IntWritable();

    public void map(LongWritable ikey, Text ivalue, Context context)
            throws IOException, InterruptedException {
        String line = ivalue.toString();
        String[] tokens = StringUtils.split(line, '\t');
        String[] subtokens = StringUtils.split(tokens[0], ',');

        key.set(Integer.parseInt(subtokens[0]), Integer.parseInt(subtokens[1]));
        value.set(Integer.parseInt(tokens[1]));

        context.write(key, value);
    }
}


Input:
m,0,0,1
m,0,1,2
m,1,0,2
m,1,1,1
n,0,0,1
n,0,0,5
n,1,0,2
n,1,1,6

Output:

Step1:
1,0 10
1,0 2
0,0 5
0,0 1
1,1 6
1,0 2
0,1 12
0,0 4

Step2:
0,0 10
0,1 12
1,0 14
1,1 6
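As a quick check, each Step2 cell is the sum of the Step1 partial products that share its (row, column) key:
0,0: 5 + 1 + 4 = 10
1,0: 10 + 2 + 2 = 14
0,1: 12
1,1: 6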

Applications of matrix multiplication:


1) Computer Graphics eg. Computer-generated image that has a reflection, or distortion
effects such as light passing through rippling water
2) Matrix arithmetic helps us calculate the electrical properties of a circuit, with voltage,
amperage, resistance, etc.
3) It supports graph theory.
4) Matrix mathematics simplifies linear algebra, at least in providing a more compact way
to deal with groups of equations in linear algebra.

Conclusion:
The matrix multiplication program was implemented and the flow of data through MapReduce was understood.

References:
1. http://cdac.in/index.aspx?id=ev_hpc_hadoop-map-reduce
2. https://www.cs.duke.edu/courses/fall12/cps216/Project/Project/projects/Matrix_Multiply/proj_report.pd


Experiment 5 & 6
RStudio Installation and Implementation

Objective: To install RStudio and execute basic R programming commands

Aim: Install RStudio and execute various R commands
Pre-requisite: Knowledge of Java and Linux
Installation steps:
Install R-Base
You can find R-Base in the Software Center; this would be the easy way to do it. However, the Software Center versions are often out of date, which can be a pain moving forward when your packages are based on the most current version of R-Base. The easy fix is to download and install R-Base directly from the CRAN servers.
1. Add the R repository
First, we have to add a line to our /etc/apt/sources.list file. This can be accomplished with the following. Note the "trusty" in the line, indicating Ubuntu 14.04. If you have a different version, just change that.
sudo echo "deb http://cran.rstudio.com/bin/linux/ubuntu trusty/" | sudo tee -a /etc/apt/sources.list

2. Add R to the Ubuntu keyring
First:
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9

Then:
gpg -a --export E084DAB9 | sudo apt-key add -

3. Install R-Base

Most Linux users should be familiar with the old…


sudo apt-get update
sudo apt-get install r-base r-base-dev

Installing RStudio
Download the appropriate version from https://www.rstudio.com/products/rstudio/download/
From here you can download the file and install the IDE through Ubuntu Software Center or Synaptic Package Manager.
If you prefer a command-line approach to install RStudio:
sudo apt-get install gdebi-core


wget https://download1.rstudio.org/rstudio-1.0.44-amd64.deb
sudo gdebi -n rstudio-1.0.44-amd64.deb
rm rstudio-1.0.44-amd64.deb

Theory:
RStudio is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).
Features include:
- Syntax highlighting, code completion, and smart indentation
- Execute R code directly from the source editor
- Quickly jump to function definitions
- Integrated R help and documentation
- Easily manage multiple working directories using projects
- Workspace browser and data viewer
- Interactive debugger to diagnose and fix errors quickly
- Extensive package development tools
- Authoring with Sweave and R Markdown
Programs:

As a calculator

To execute a script

Operations

Datatype example and difference between print() and cat()


Reading user input values

Concatenation

List creation

Vector creation and accessing vector elements

Assignment using leftward and rightward operators


Displaying and removing elements of a list

Use of arithmetic, relational and logical operators
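As an illustration of a few of the captioned commands, a minimal R sketch (with assumed example values, not the ones from the original RStudio session) is:

x <- 5 + 3                            # using R as a calculator
print(x)                              # print() shows the value with an index: [1] 8
cat("x =", x, "\n")                   # cat() concatenates its arguments and prints them plainly
v <- c(10, 20, 30)                    # vector creation
v[2]                                  # accessing a vector element (returns 20)
l <- list(name = "R", version = 3)    # list creation
l$name                                # displaying a list element
l$version <- NULL                     # removing a list element
y <- 4; 7 -> z                        # leftward and rightward assignment
x + y > z & TRUE                      # arithmetic, relational and logical operators together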

Conclusion:

Rstudio setup needed to implement R programs was done successfully and various R
programs were executed

References:
1. http://www.thertrader.com/2014/09/22/installing-rrstudio-on-ubuntu-14-04/
2. https://www.rstudio.com/products/rstudio/download/
3. https://www.datascienceriot.com/33/kris/


Experiment 7 & 8
MongoDB Installation and Exercises

Objective: To install MongoDB & MongoChef (optional) and execute basic MongoDB programming commands
Aim: To install MongoDB and practice all the commands needed for the forthcoming case study experiment

Pre-requisite: Knowledge of SQL/Oracle
Procedure: To install and test MongoDB, please visit the URLs
http://www.w3resource.com/mongodb/databases-documents-collections.php
https://www.tutorialspoint.com/mongodb/index.htm

Database
Database is a physical container for collections. Each database gets its own set of files on
the file system. A single MongoDB server typically has multiple databases.

Collection
Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A
collection exists within a single database. Collections do not enforce a schema.
Documents within a collection can have different fields. Typically, all documents in a
collection are of similar or related purpose.

Document
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic
schema means that documents in the same collection do not need to have the same set of
fields or structure, and common fields in a collection's documents may hold different
types of data.
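For example, two documents with different fields can live in the same collection (a hypothetical users collection; the field values are illustrative):

>db.users.insert({"name" : "Asha", "age" : 25})
>db.users.insert({"name" : "Ravi", "email" : "ravi@example.com", "skills" : ["java", "mongodb"]})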
The following table shows the relationship of RDBMS terminology with MongoDB.

RDBMS MongoDB
Database Database
Table Collection
Tuple/Row Document
column Field
Table Join Embedded Documents
Primary Key Primary Key (Default key _id provided by mongodb itself)


Database Server and Client


Mysqld/Oracle mongod
mysql/sqlplus mongo
Exercise:
If you want to create a database with the name <mydb>, then the use DATABASE statement would be as follows −
>use mydb
switched to db mydb

To check your currently selected database, use the command db


>db
mydb

If you want to check your databases list, use the command show dbs.
>show dbs
local 0.78125GB
test 0.23012GB

Your created database (mydb) is not present in list. To display database, you need to
insert at least one document into it.
>db.movie.insert({"name":"tutorials point"})
>show dbs
local 0.78125GB
mydb 0.23012GB
test 0.23012GB

db.dropDatabase()

This will delete the selected database. If you have not selected any database, then it will
delete default 'test' database.
Example
First, check the list of available databases by using the command, show dbs.
>show dbs
local 0.78125GB
mydb 0.23012GB
test 0.23012GB
>

If you want to delete the database <mydb>, then the dropDatabase() command would be as follows −


>use mydb
switched to db mydb
>db.dropDatabase()
>{ "dropped" : "mydb", "ok" : 1 }
>

Now check list of databases.

>show dbs
local 0.78125GB
test 0.23012GB
>

The createCollection() Method

MongoDB's db.createCollection(name, options) is used to create a collection.
Syntax
The basic syntax of the createCollection() command is as follows −
db.createCollection(name, options)
In the command, name is the name of the collection to be created and options is a document used to specify the configuration of the collection.

Parameter   Type       Description
name        String     Name of the collection to be created
options     Document   (Optional) Specify options about memory size and indexing

The options parameter is optional, so you need to specify only the name of the collection. Following is the list of options you can use −
Field Type Description
(Optional) If true, enables a capped collection. Capped collection is a fixed
size collection that automatically overwrites its oldest entries when it
capped Boolean reaches its maximum size. If you specify true, you need to specify size
parameter also.
(Optional) If true, automatically create index on _id field.s Default value is
autoIndexId Boolean false.
(Optional) Specifies a maximum size in bytes for a capped collection. If
capped is true, then you need to specify this field also.
size Number
(Optional) Specifies the maximum number of documents allowed in the
capped collection.
max Number
While inserting the document, MongoDB first checks size field of capped collection, then
it checks max field.
Examples
Basic syntax of createCollection()method without options is as follows −


>use test
switched to db test
>db.createCollection("mycollection")
{ "ok" : 1 }
>

You can check the created collection by using the command show collections.
>show collections
mycollection
system.indexes

The following example shows the syntax of the createCollection() method with a few important options −
>db.createCollection("mycol", { capped : true, autoIndexId : true, size :
6142800, max : 10000 } )
{ "ok" : 1 }
>
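
As a quick check (a small sketch, assuming the mycol collection created above), the capped setting can be verified afterwards:
>db.mycol.isCapped()
true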

In MongoDB, you don't need to create a collection explicitly. MongoDB creates the collection automatically when you insert a document.
>db.tutorialspoint.insert({"name" : "tutorialspoint"})
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>

The drop() Method

MongoDB's db.collection.drop() is used to drop a collection from the database.
Syntax
Basic syntax of the drop() command is as follows −
db.COLLECTION_NAME.drop()
Example
First, check the available collections in your database mydb.
>use mydb
switched to db mydb
>show collections
mycol
mycollection
system.indexes
tutorialspoint
>

Now drop the collection with the name mycollection.


>db.mycollection.drop()
true
>

Again, check the list of collections in the database.

>show collections
mycol
system.indexes
tutorialspoint
>

MongoDB supports many datatypes. Some of them are listed below (a small example document using several of these types follows the list) −

● String − This is the most commonly used datatype to store data. Strings in MongoDB must be valid UTF-8.
● Integer − This type is used to store a numerical value. Integers can be 32 bit or 64 bit depending upon your server.
● Boolean − This type is used to store a boolean (true/false) value.
● Double − This type is used to store floating point values.
● Min/Max keys − This type is used to compare a value against the lowest and highest BSON elements.
● Arrays − This type is used to store arrays, lists, or multiple values in one key.
● Timestamp − This type can be handy for recording when a document has been modified or added.
● Object − This datatype is used for embedded documents.
● Null − This type is used to store a null value.
● Symbol − This datatype is used identically to a string; however, it is generally reserved for languages that use a specific symbol type.
● Date − This datatype is used to store the current date or time in UNIX time format. You can specify your own date and time by creating an object of Date and passing day, month, year into it.
● Object ID − This datatype is used to store the document's ID.
● Binary data − This datatype is used to store binary data.
● Code − This datatype is used to store JavaScript code in the document.
● Regular expression − This datatype is used to store regular expressions.
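
As a small sketch (the samples collection and its field names are made up for illustration), a single document can combine several of these datatypes:

>db.samples.insert({
   title: "Datatype demo",       // String
   views: NumberInt(25),         // Integer (32 bit)
   rating: 4.5,                  // Double
   active: true,                 // Boolean
   tags: ["demo", "types"],      // Array
   addedOn: new Date(),          // Date
   note: null                    // Null
})
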
The insert() Method
To insert data into a MongoDB collection, you need to use MongoDB's insert() or save() method.
Syntax
The basic syntax of the insert() command is as follows −
>db.COLLECTION_NAME.insert(document)

Example
>db.mycol.insert({
_id: ObjectId(7df78ad8902c),
title: 'MongoDB Overview',
description: 'MongoDB is no sql database',
by: 'tutorials point',
url: 'http://www.tutorialspoint.com',
tags: ['mongodb', 'database', 'NoSQL'],
likes: 100
})

Here mycol is our collection name, created above. If the collection doesn't exist in the database, then MongoDB will create the collection and then insert the document into it.
In the inserted document, if we don't specify the _id parameter, then MongoDB assigns a unique ObjectId to the document.
_id is a 12-byte hexadecimal number, unique for every document in a collection. The 12 bytes are divided as follows −
_id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes process id,
3 bytes incrementer)
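
Since the first 4 bytes are a timestamp, the creation time can be read back from an ObjectId in the mongo shell. A small sketch (the hex value is purely illustrative):
>ObjectId("507f191e810c19729de860ea").getTimestamp()
ISODate("2012-10-17T20:46:22Z")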

To insert multiple documents in a single query, you can pass an array of documents to the insert() command.
Example
>db.post.insert([
   {
      title: 'MongoDB Overview',
      description: 'MongoDB is no sql database',
      by: 'tutorials point',
      url: 'http://www.tutorialspoint.com',
      tags: ['mongodb', 'database', 'NoSQL'],
      likes: 100
   },
   {
      title: 'NoSQL Database',
      description: "NoSQL database doesn't have tables",
      by: 'tutorials point',
      url: 'http://www.tutorialspoint.com',
      tags: ['mongodb', 'database', 'NoSQL'],
      likes: 20,
      comments: [
         {
            user: 'user1',
            message: 'My first comment',
            dateCreated: new Date(2013,11,10,2,35),
            like: 0
         }
      ]
   }
])

The find() Method

To query data from a MongoDB collection, you need to use MongoDB's find() method.
Syntax
The basic syntax of the find() method is as follows −
>db.COLLECTION_NAME.find()
The find() method displays all the documents in a non-structured way.
The pretty() Method
To display the results in a formatted way, you can use the pretty() method.
Syntax
>db.mycol.find().pretty()

Example
>db.mycol.find().pretty()
{
   "_id": ObjectId(7df78ad8902c),
   "title": "MongoDB Overview",
   "description": "MongoDB is no sql database",
   "by": "tutorials point",
   "url": "http://www.tutorialspoint.com",
   "tags": ["mongodb", "database", "NoSQL"],
   "likes": 100
}

>
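
Beyond returning everything, find() also accepts a query document that plays the role of the SQL WHERE clause. A small sketch against the mycol collection used above ($gt is MongoDB's greater-than operator; the exact documents returned depend on what was inserted):
>db.mycol.find({"by":"tutorials point"})      // equality condition
>db.mycol.find({"likes":{$gt:50}})            // documents whose likes value is greater than 50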

The update() Method

The update() method updates the values in an existing document.
Syntax
The basic syntax of the update() method is as follows −
>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)

Example
Consider the mycol collection has the following data.

{ "_id" : ObjectId(5983548781331adf45ec5), "title":"MongoDB Overview"}


{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}

The following example will set the new title 'New MongoDB Tutorial' for the documents whose title is 'MongoDB Overview'.
>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB
Tutorial'}})
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>

By default, MongoDB will update only a single document. To update multiple documents, you need to set the parameter 'multi' to true.
>db.mycol.update({'title':'MongoDB Overview'},
{$set:{'title':'New MongoDB Tutorial'}},{multi:true})

The save() Method

The save() method replaces the existing document with the new document passed in the save() method.
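
A minimal sketch of save() in the classic mongo shell (the ObjectId value is illustrative): if a document with the given _id already exists it is replaced, otherwise a new document is inserted.
>db.mycol.save({
   _id: ObjectId("507f191e810c19729de860ea"),
   title: "Updated Topic",
   by: "tutorials point"
})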
