Experiment A: Self-Study
Date: 08/01/2019
Name: Smriti Sharma, Roll No: 80
Objective: To introduce Big Data and the tools required to manage and analyse it, such as Hadoop, NoSQL, MapReduce, and the R language.
Aim: To do a self-study of the basic concepts, applications, and technologies of Big Data so that students become familiar with the subject.
Procedure:
Groups were formed consisting of a mix of good and average students so that they could contribute and share knowledge with their members, with the help of the books and URLs provided to them.
Topic: Exercise
Ans. Big data is an evolving term that describes a large volume of structured, semi-
structured and unstructured data that has the potential to be mined for information and
used in machine learning projects and other advanced analytics applications. Big data can
be analyzed for insights that lead to better decisions and strategic business moves.
Ans.
1. Volume – The name 'Big Data' itself refers to a size which is enormous. The size of data plays a very crucial role in determining its value. Whether a particular data set can actually be considered Big Data or not also depends on its volume. Hence, 'Volume' is one characteristic which needs to be considered while dealing with 'Big Data'.
2. Variety – The next aspect of 'Big Data' is its variety.
Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only sources of data considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also being considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining and analysing data.
3. Velocity – The term 'velocity' refers to the speed of data generation. How fast the data is generated and processed to meet demands determines the real potential in the data.
Big Data velocity deals with the speed at which data flows in from sources like business processes, application logs, networks and social media sites, sensors, mobile devices, etc. The flow of data is massive and continuous.
Types of Big Data:
1. Structured
2. Unstructured
3. Semi-structured
4. Hybrid
1. Structured
Any data that can be stored, accessed and processed in the form of a fixed format is termed 'structured' data. Over a period of time, computer science talent has achieved greater success in developing techniques for working with such data (where the format is well known in advance) and also in deriving value out of it. However, nowadays we are foreseeing issues when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Examples Of Structured Data
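As an illustration (with hypothetical values, not the original example), an 'Employee' table in a relational database is a typical instance of structured data, since every record follows the same fixed set of columns:

Employee_ID | Employee_Name | Department | Salary
1001        | A. Sharma     | Finance    | 650000
1002        | B. Verma      | Admin      | 500000

Because the format is known in advance, such data can be stored in an RDBMS and queried directly with SQL.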
2. Unstructured
Any data with an unknown form or structure is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges in terms of processing it to derive value. A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc. Nowadays organizations have a wealth of data available with them, but unfortunately they don't know how to derive value out of it, since this data is in its raw or unstructured format.
3. Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as structured in form, but it is actually not defined with, e.g., a table definition in a relational DBMS. An example of semi-structured data is data represented in an XML file.
Examples Of Semi-structured Data
<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
<rec><name>Subrato Roy</name><sex>Male</sex><age>26</age></rec>
<rec><name>Jeremiah J.</name><sex>Male</sex><age>35</age></rec>
4. Hybrid
Ans.
Ans.
3. Healthcare Providers
4. Education
6. Government
7. Insurance
9. Transportation
Healthcare Providers
Industry-Specific challenges
The healthcare sector has access to huge amounts of data but has been plagued by failures in utilizing the data to curb the rising cost of healthcare and by inefficient systems that stifle faster and better healthcare benefits across the board.
This is mainly due to the fact that electronic data is unavailable, inadequate, or unusable.
Additionally, the healthcare databases that hold health-related information have made it
difficult to link data that can show patterns useful in the medical field.
Other challenges related to big data include: the exclusion of patients from the decision
making process, and the use of data from different readily available sensors.
Some hospitals, like Beth Israel, are using data collected from a cell phone app, from
millions of patients, to allow doctors to use evidence-based medicine as opposed to
administering several medical/lab tests to all patients who go to the hospital. A battery of
tests can be efficient but they can also be expensive and usually ineffective.
Free public health data and Google Maps have been used by the University of Florida to
create visual data that allows for faster identification and efficient analysis of healthcare
information, used in tracking the spread of chronic disease.
Conclusion:
Thus, the study was done to come up with a set of questions and answers that can help one understand the basic fundamentals and concepts of Big Data technology.
References:
1. http://moodle.dbit.in/course/view.php?id=428
2. https://www.simplilearn.com/big-data-applications-in-industries-article
3. https://www.dezyre.com/article/difference-between-data-analyst-and-data-scientist/332
4. https://searchdatamanagement.techtarget.com/definition/big-data
Objective: To expose the students to the research and modern tools existing in the market.
Aim: To do a case study on SOA-based IEEE / white papers by forming a group and writing a summary.
Procedure:
1. Groups of a maximum of 2-3 students were formed, with a combination of bright and weak students.
2. To identify and decide on a paper/topic, understand the paper, explain it to each other in the team, and arrive at a common understanding.
3. To write down the summary or explanation of the paper.
4. Students will be graded based on their peer-to-peer viva.
Topics:
Big Data Applications
Big Data Applications on various domains
Big Data Analytic Algorithms
Big Data Analytics Tools & Techniques
NoSQL
Let’s understand how Big Data applications are playing a major role in different domains.
The level of data generated within healthcare systems is not trivial. Traditionally, the health
care industry lagged in using Big Data, because of limited ability to standardize and
consolidate data.
But now Big Data analytics has improved healthcare by providing personalized medicine and prescriptive analytics. Researchers are mining the data to see which treatments are more effective for particular conditions, identify patterns related to drug side effects, and gain other important information that can help patients and reduce costs.
With the added adoption of mHealth, eHealth and wearable technologies the volume of data
is increasing at an exponential rate. This includes electronic health record data, imaging
data, patient generated data, sensor data, and other forms of data.
By mapping healthcare data with geographical data sets, it's possible to predict diseases that will escalate in specific areas. Based on these predictions, it's easier to strategize diagnostics and plan for stocking serums and vaccines.
Such solutions are often applied across multiple applications, and this requires multiple departments to work in collaboration.
Scientific Research
The National Science Foundation has initiated a long-term plan to:
Weather Forecasting
The NOAA (National Oceanic and Atmospheric Administration) gathers data every minute of every day from land-, sea-, and space-based sensors. Daily, NOAA uses Big Data to analyze and extract value from over 20 terabytes of data.
Tax Compliance
Big Data Applications can be used by tax organizations to analyze both unstructured and
structured data from a variety of sources in order to identify suspicious behavior and
multiple identities. This would help in tax fraud identification.
Traffic Optimization
Big Data helps in aggregating real-time traffic data gathered from road sensors, GPS
devices and video cameras. The potential traffic problems in dense areas can be prevented
by adjusting public transportation routes in real time.
These are just some of the prominent examples of Big Data applications; there are countless ways in which Big Data is revolutionizing each and every domain.
References:
1. http://moodle.dbit.in/course/view.php?id=428
Experiment 1
Aim: To install Eclipse and Hadoop, set up the environment, and then run a Mapper program to find the maximum temperature.
NoDisplay=false
Categories=Development;IDE;
Name[en]=Eclipse
2. Installing SSH
ssh has two main components:
1. ssh: the command we use to connect to remote machines - the client.
2. sshd: the daemon that is running on the server and allows clients to connect to the server.
The ssh client is pre-enabled on Linux, but in order to start the sshd daemon, we need to install ssh first. Use this command to do that:
k@laptop:~$ sudo apt-get install ssh
This will install ssh on our machine. If we get something similar to the following, we can assume it is set up properly:
k@laptop:~$ which ssh
/usr/bin/ssh
k@laptop:~$ which sshd
/usr/sbin/sshd
access to localhost. So, we need to have SSH up and running on our machine and configure it to allow SSH public-key authentication.
Hadoop uses SSH (to access its nodes) which would normally require the user to enter a
password. However, this requirement can be eliminated by creating and setting up SSH
certificates using the following commands. If asked for a filename just leave it blank and
press the enter key to continue.
k@laptop:~$ su hduser
Password:
hduser@laptop:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hduser/.ssh/id_rsa):
Created directory '/home/hduser/.ssh'.
Your identification has been saved in /home/hduser/.ssh/id_rsa.
Your public key has been saved in /home/hduser/.ssh/id_rsa.pub.
The key fingerprint is:
50:6b:f3:fc:0f:32:bf:30:79:c2:41:71:26:cc:7d:e3 hduser@laptop
The key's randomart image is:
(RSA 2048 randomart image not reproduced)
hduser@laptop:/home/k$ cat $HOME/.ssh/id_rsa.pub >>$HOME/.ssh/authorized_keys
The second command adds the newly created key to the list of authorized keys so that
Hadoop can use ssh without prompting for a password. We can check if ssh works:
hduser@laptop:/home/k$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
ECDSA key fingerprint is e1:8b:a0:a5:75:ef:f4:b4:5e:a9:ed:be:64:be:5c:2f.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
Welcome to Ubuntu 14.04.1 LTS (GNU/Linux 3.13.0-40-generic x86_64)
Note: for error "hduser is not in the sudoers file. This incident will be reported."
This error can be resolved by logging in as a user with root (sudo) privileges and then adding hduser to the sudo group:
hduser@laptop:~/hadoop-2.6.0$ su k
Password:
k@laptop:/home/hduser$ sudo adduser hduser sudo
[sudo] password for k:
Adding user `hduser' to group `sudo' ...
Adding user hduser to group sudo
Done.
Now that hduser has root privilege, we can move the Hadoop installation to the /usr/local/hadoop directory without any problem:
k@laptop:/home/hduser$ sudo su hduser
hduser@laptop:~/hadoop-2.6.0$ sudo mv * /usr/local/hadoop
hduser@laptop:~/hadoop-2.6.0$ sudo chown -R hduser:hadoop /usr/local/hadoop
5. Setup Configuration Files
The following files will have to be modified to complete the Hadoop setup:
1. ~/.bashrc
2. /usr/local/hadoop/etc/hadoop/hadoop-env.sh
3. /usr/local/hadoop/etc/hadoop/core-site.xml
4. /usr/local/hadoop/etc/hadoop/mapred-site.xml.template
5. /usr/local/hadoop/etc/hadoop/hdfs-site.xml
1. ~/.bashrc : Before editing the .bashrc file in our home directory, we need to find the
path where Java has been installed to set the JAVA_HOME environment variable using
the following command:
hduser@laptop:~$ update-alternatives --config java
There is only one alternative in link group java (providing /usr/bin/java):
/usr/lib/jvm/java-7-openjdk-amd64/jre/bin/java
Nothing to configure.
Now we can append the following to the end of ~/.bashrc :
hduser@laptop:~$ vi ~/.bashrc
#HADOOP VARIABLES START
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_INSTALL/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_INSTALL/lib"
#HADOOP VARIABLES END
hduser@laptop:~$ source ~/.bashrc
Note that JAVA_HOME should be set to the path just before '.../bin/':
References:
1. http://ubuntuhandbook.org/index.php/2013/07/install-oracle-java-6-7-8-on-ubuntu-13-10/
2. http://www.eclipse.org/downloads/packages/release/Kepler/SR2
3. http://www.bogotobogo.com/Hadoop/BigData_hadoop_Install_on_ubuntu_single_node_cluster.php
Experiment 2
MapReduce Program – To find maximum temperature
Objective: To know how to set up stream input and output generation and understand the flow of MapReduce data.
Aim: To implement a MapReduce program for finding the maximum temperature from all the data collected from a weather-forecast database.
all the temperature values belonging to a particular year are fed to the same reducer. Then each reducer finds the highest recorded temperature for each year. The types of the output key-value pairs of the Map phase are the same as the types of the input key-value pairs of the Reduce phase (Text and IntWritable). The types of the output key-value pairs of the Reduce phase are also Text and IntWritable.
● Reduce Function – Takes the output from Map as an input and combines those
data tuples into a smaller set of tuples.
Example – (Reduce function in Word Count)
Input tuples: (Bus,1), (Car,1), (bus,1), (car,1), (train,1), ...
Converted into a smaller set of output tuples: (BUS,7), (CAR,7), (TRAIN,4)
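As an illustration of the Map and Reduce functions described above (a sketch only, not the YearTempDriver/YearTempMapper listing that follows), assuming each input record uses the pipe-delimited station|year|date|time|temperature layout shown in the Input section, with FloatWritable chosen to match the decimal values in the output:

import java.io.IOException;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class MaxTempMapper extends Mapper<LongWritable, Text, Text, FloatWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed record layout: station|year|date|time|temperature
        String[] fields = value.toString().split("\\|");
        if (fields.length == 5) {                        // skip malformed/truncated records
            String year = fields[1];
            float temperature = Float.parseFloat(fields[4]);
            context.write(new Text(year), new FloatWritable(temperature));
        }
    }
}

class MaxTempReducer extends Reducer<Text, FloatWritable, Text, FloatWritable> {
    @Override
    protected void reduce(Text year, Iterable<FloatWritable> temps, Context context)
            throws IOException, InterruptedException {
        // All temperatures of one year arrive at the same reducer; keep the maximum
        float max = Float.NEGATIVE_INFINITY;
        for (FloatWritable t : temps) {
            max = Math.max(max, t.get());
        }
        context.write(year, new FloatWritable(max));     // one (year, max) pair per year
    }
}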
Program:
1)YearTempDriver.java
package test;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
if (!job.waitForCompletion(true))
    return;
  }
}
2)YearTempMapper.java
package test;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
Input:

1)weather_data_set_1900
3300|1900|0101|0300|23
3300|1900|0101|0400|24
3300|1900|0301|0200|26
3300|1900|0305|0230|24
3300|1900|0312|0100|30
3300|1900|0412|0300|29
3301|1900|0312|0100|31
3301|1900|0412|0400|23

2)weather_data_set_1901
3300|1901|0101|0400|24
3300|1901|0101|0500|45
3300|1901|0301|0300|40
3300|1901|0312|0200|34
3300|1901|0412|0100|32
3301|1901|0312|0130|22
3301|1901|0412|01500|21
3302|1901|0102|0400|20
3302|1901|0103|0500|24
3302|1901|0203|0300|35
3302|1901|0312|0200|20
3302|1901|0412|0100|19
3302|1901|0102|0400|26
3302|1901|0104|0430

Output:

1)part-m-0000
1901 24.0
1901 45.0
1901 40.0
1901 34.0
1901 32.0
1901 22.0
1901 21.0
1901 20.0
1901 24.0
1901 35.0
1901 20.0
1901 19.0
1901 26.0

2)part-m-0001
1900 23.0
1900 24.0
1900 26.0
1900 24.0
1900 30.0
1900 29.0
1900 31.0
1900 23.0
References :
1. https://www.dezyre.com/hadoop-tutorial/hadoop-mapreduce-tutorial-
2. http://www.hindawi.com/journals/tswj/2014/646497/
3. http://www.oracle.com/technetwork/articles/java/architect-streams-pt2-2227132.html
Experiment 3
MapReduce: Word Count Program with/without combiner
Objective: To understand the processing and execution pipeline and the functions of MapReduce.
Pre-requisite: MapReduce
Aim: To implement a program to count distinct words from the given input stream file.
Theory:
Map Phase: The input for the Map phase is a set of weather data files, as shown in the snapshot. The types of the input key-value pairs are LongWritable and Text, and the types of the output key-value pairs are Text and IntWritable. Each Map task extracts the temperature data from the given year file. The output of the Map phase is a set of key-value pairs: the keys are the years and the values are the temperatures of each year.
Reduce Phase: The Reduce phase takes all the values associated with a particular key, that is, all the temperature values belonging to a particular year are fed to the same reducer. Then each reducer finds the highest recorded temperature for each year. The types of the output key-value pairs of the Map phase are the same as the types of the input key-value pairs of the Reduce phase (Text and IntWritable). The types of the output key-value pairs of the Reduce phase are also Text and IntWritable.
1. Splitting – The splitting parameter can be anything, e.g. splitting by space, comma, semicolon, or even by a new line ('\n').
2. Mapping – as explained above.
3. Intermediate splitting – the entire process runs in parallel on different clusters. In order to group the data in the Reduce phase, records with the same KEY should be on the same cluster.
4. Reduce – it is nothing but mostly a group-by phase.
5. Combining – the last phase, where all the data (the individual result sets from each cluster) is combined together to form a result.
Algorithm:
class Mapper
method Map(docid id, doc d)
for all term t in doc d do
Emit(term t, count 1)
class Reducer
method Reduce(term t, counts [c1, c2,...])
sum = 0
for all count c in [c1, c2,...] do
sum = sum + c
Emit(term t, count sum)
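A minimal Hadoop implementation of this pseudocode could look as follows (an illustrative sketch, separate from the WordToFileNameMapper/WordToFileNameCombine program listed under Program below):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit (term, 1) for every term in the input line, as in method Map above
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text term, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts of one term, as in method Reduce above
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(term, new IntWritable(sum));
    }
}

Because the summation is associative, the same reducer class can also be registered as a combiner with job.setCombinerClass(WordCountReducer.class), which is one way of running the job 'with combiner' as mentioned in the experiment title.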
Program:
Mapper Program: WordToFileNameMapper.java
package hadoop2;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
public class WordToFileNameMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String fileName;
    private Text key = new Text();
    private Text value = new Text();

    @Override
    protected void setup(org.apache.hadoop.mapreduce.Mapper.Context context)
            throws IOException, InterruptedException {
        super.setup(context);
import org.apache.hadoop.mapreduce.Reducer;
public class WordToFileNameCombine extends Reducer<Text, Text, Text, Text> {
    private Text value = new Text();

    public void reduce(Text _key, Iterable<Text> fileNames, Context context)
            throws IOException, InterruptedException {
        HashMap<String, Void> encounteredFileNames = new HashMap<>();
        // process values: emit each file name only once per word
        for (Text fileName : fileNames) {
            String fileNameStr = fileName.toString();
            if (!encounteredFileNames.containsKey(fileNameStr)) {
                encounteredFileNames.put(fileNameStr, null);
                value.set(fileNameStr);
                context.write(_key, value);
            }
        }
    }
}
Output:

part-m-00000:
bat input1.txt
cat input1.txt
lat input1.txt

part-m-00001:

part-m-00002:
bat input3.txt
hat input3.txt
yat input3.txt
Conclusion:
The distinct word count program was implemented with the help of a combiner and a reducer.
References:
1. http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
2. https://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
3. http://www.dcs.bbk.ac.uk/~dell/teaching/cc/book/ditp/ditp_ch3.pdf
Experiment 4
Matrix Multiplication
Procedure:
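The listing below follows the standard two-job MapReduce formulation of matrix multiplication P = M x N. The sketch of the flow given here is inferred from the program and the sample output, so treat the details as an interpretation rather than the authoritative procedure:

    P_{ik} = \sum_{j} m_{ij} \cdot n_{jk}

1. Job 1 (Step1MatrixMapper / Step1MatrixReducer): every input record m,i,j,v is emitted with key j, and every record n,j,k,v is likewise emitted with key j; the reducer pairs each m-entry with each n-entry sharing that j and writes a partial product keyed by the output cell, i.e. ((i,k), m_ij * n_jk).
2. Job 2 (Step2MatrixMapper / Step2MatrixReducer): groups the partial products by (i,k) and sums them to obtain the final value P_ik of each output cell.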
Program :
1)IntPair.java
package Matrix;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

// Composite key holding the (i, k) indices of an output matrix cell
public class IntPair implements WritableComparable<IntPair> {
    private IntWritable i;
    private IntWritable k;

    public IntPair() {
        i = new IntWritable();
        k = new IntWritable();
    }
@Override
public String toString() {
return i.get() + "," + k.get(); }
@Override
public void readFields(DataInput input) throws IOException {
// TODO Auto-generated method stub
i.readFields(input);
k.readFields(input); }
@Override
public void write(DataOutput output) throws IOException {
// TODO Auto-generated method stub
i.write(output);
k.write(output);}
@Override
public int compareTo(IntPair second) {
// TODO Auto-generated method stub
int cmp = this.i.compareTo(second.i);
if(cmp != 0){
return cmp;
}return this.k.compareTo(second.k);
}
@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + ((i == null) ? 0 : i.hashCode());
result = prime * result + ((k == null) ? 0 : k.hashCode());
return result; }
@Override
public boolean equals(Object obj) {
if (this == obj)
return true;
if (obj == null)
return false;
if (getClass() != obj.getClass())
return false;
IntPair other = (IntPair) obj;
if (i == null) {
if (other.i != null)
return false;
} else if (!i.equals(other.i))
return false;
if (k == null) {
if (other.k != null)
return false;
} else if (!k.equals(other.k))
return false;
return true; } }
2)MatrixMultiplyDriver.java
package Matrix;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
FileOutputFormat.setOutputPath(job1,
        new Path("/home/universe/Desktop/out1/step1"));
job2.setMapperClass(Matrix.Step2MatrixMapper.class);
job2.setReducerClass(Step2MatrixReducer.class);
/** end of job2 **/
cj2.addDependingJob(cj1);
while (!jobControl.allFinished()) {
    Thread.sleep(4000);
}
}
}
3)Relation.java
package Matrix;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

// Value type carrying either an M-entry (i, m_ij) or an N-entry (k, n_jk)
public class Relation implements Writable {
    private Text fromMorN;
    private IntWritable iOrk;
    private IntWritable mijorNjK;

    public Relation() {
        fromMorN = new Text();
        iOrk = new IntWritable();
        mijorNjK = new IntWritable();
    }
@Override
public void readFields(DataInput input) throws IOException {
fromMorN.readFields(input);
iOrk.readFields(input);
mijorNjK.readFields(input); }
@Override
public void write(DataOutput output) throws IOException {
// TODO Auto-generated method stub
fromMorN.write(output);
iOrk.write(output);
mijorNjK.write(output); }}
4)Step1MatrixReducer.java
package Matrix;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
if (value.getFromMatrix().equals("m")) {
mRels.add(temp);
} else {
nRels.add(temp);
} }
key.set(mRelation.getiOrk(), nRelation.getiOrk());
value.set(mRelation.getmijorNjK() * nRelation.getmijorNjK());
context.write(key, value); } }
}}
5)Step2MatrixReducer.java
package Matrix;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
int sum=0;
for(IntWritable value :values){
sum += value.get();
}
value.set(sum);
context.write(_key,value); } }
6)Step1MatrixMapper.java
package Matrix;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
if (tokens[0].equals("m")) {
key.set(Integer.parseInt(tokens[2]));
value.set(tokens[0], Integer.parseInt(tokens[1]),
Integer.parseInt(tokens[3]));
}
else {
key.set(Integer.parseInt(tokens[1]));
value.set(tokens[0], Integer.parseInt(tokens[2]),
Integer.parseInt(tokens[3]));
}
context.write(key, value); }}
7)Step2MatrixMapper.java
package Matrix;
import java.io.IOException;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
key.set(Integer.parseInt(subtokens[0]),Integer.parseInt(subtokens[1]));
value.set(Integer.parseInt(tokens[1]));
context.write(key, value); }}
Input:
m,0,0,1
m,0,1,2
m,1,0,2
m,1,1,1
n,0,0,1
n,0,0,5
n,1,0,2
n,1,1,6
Output:

Step1:
1,0 10
1,0 2
0,0 5
0,0 1
1,1 6
1,0 2
0,1 12
0,0 4

Step2:
0,0 10
0,1 12
1,0 14
1,1 6
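As a quick check of Step2 against the input above: for cell (0,0) the partial products are 1x1, 1x5 (the input contains two n,0,0 records, so both contribute) and 2x2, giving 1 + 5 + 4 = 10, which matches the first Step2 line; similarly cell (1,0) is 2x1 + 2x5 + 1x2 = 14.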
Conclusion:
The matrix multiplication program was implemented, and the flow of data through MapReduce was understood.
References:
1. http://cdac.in/index.aspx?id=ev_hpc_hadoop-map-reduce
2. https://www.cs.duke.edu/courses/fall12/cps216/Project/Project/projects/Matrix_Multiply/proj_report.pd
Experiment 5 & 6
Rstudio installation and implementation
2. Add R to the Ubuntu keyring
First:
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
Then:
gpg -a --export E084DAB9 | sudo apt-key add -
3. Install R-Base
Installing R-Studio
Download the appropriate version from
https://www.rstudio.com/products/rstudio/download/
From here you can download your files and install the IDE through Ubuntu Software
Center or Synaptic Package Manager.
If you prefer a command line approach to install Rstudio:
sudo apt-get install gdebi-core
wget https://download1.rstudio.org/rstudio-1.0.44-amd64.deb
sudo gdebi -n rstudio-1.0.44-amd64.deb
rm rstudio-1.0.44-amd64.deb
Theory:
RStudio is an integrated development environment (IDE) for R. It includes a console and a syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management. RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).
Features include:
- Syntax highlighting, code completion, and smart indentation
- Execute R code directly from the source editor
- Quickly jump to function definitions
- Integrated R help and documentation
- Easily manage multiple working directories using projects
- Workspace browser and data viewer
- Interactive debugger to diagnose and fix errors quickly
- Extensive package development tools
- Authoring with Sweave and R Markdown
Programs: The following R programs were executed (screenshots not reproduced here):
- As a calculator
- To execute a script
- Operations
- Concatenation
- List creation
Conclusion:
The RStudio setup needed to implement R programs was done successfully, and various R programs were executed.
References:
1. http://www.thertrader.com/2014/09/22/installing-rrstudio-on-ubuntu-14-04/
2. https://www.rstudio.com/products/rstudio/download/
3. https://www.datascienceriot.com/33/kris/
Experiment 7 & 8
MongoDB installation and Exercises
Pre-requisite: Knowledge of SQL/Oracle
Procedure: To install and test MongoDB, please visit the URLs:
http://www.w3resource.com/mongodb/databases-documents-collections.php
https://www.tutorialspoint.com/mongodb/index.htm
Database
Database is a physical container for collections. Each database gets its own set of files on
the file system. A single MongoDB server typically has multiple databases.
Collection
Collection is a group of MongoDB documents. It is the equivalent of an RDBMS table. A
collection exists within a single database. Collections do not enforce a schema.
Documents within a collection can have different fields. Typically, all documents in a
collection are of similar or related purpose.
Document
A document is a set of key-value pairs. Documents have dynamic schema. Dynamic
schema means that documents in the same collection do not need to have the same set of
fields or structure, and common fields in a collection's documents may hold different
types of data.
The following table shows the relationship of RDBMS terminology with MongoDB.
RDBMS MongoDB
Database Database
Table Collection
Tuple/Row Document
column Field
Table Join Embedded Documents
Primary Key Primary Key (Default key _id provided by mongodb itself)
If you want to check your databases list, use the command show dbs.
>show dbs
local 0.78125GB
test 0.23012GB
Your created database (mydb) is not present in the list. To display a database, you need to insert at least one document into it.
>db.movie.insert({"name":"tutorials point"})
>show dbs
local 0.78125GB
mydb 0.23012GB
test 0.23012GB
db.dropDatabase()
This will delete the selected database. If you have not selected any database, it will delete the default 'test' database.
Example
First, check the list of available databases by using the command, show dbs.
>show dbs
local 0.78125GB
mydb 0.23012GB
test 0.23012GB
>
>use mydb
switched to db mydb
>db.dropDatabase()
>{ "dropped" : "mydb", "ok" : 1 }
>
>show dbs
local 0.78125GB
test 0.23012GB
>
>use test
switched to db test
>db.createCollection("mycollection")
{ "ok" : 1 }
>
You can check the created collection by using the command show collections.
>show collections
mycollection
system.indexes
tutorialspoint
>
>show collections
mycol
system.indexes
tutorialspoint
>
● Code − This datatype is used to store JavaScript code in the document.
● Regular expression − This datatype is used to store regular expressions.
The insert() Method
To insert data into a MongoDB collection, you need to use MongoDB's insert() or save() method.
Syntax
The basic syntax of the insert() command is as follows −
>db.COLLECTION_NAME.insert(document)
Example
>db.mycol.insert({
_id: ObjectId(7df78ad8902c),
title: 'MongoDB Overview',
description: 'MongoDB is no sql database',
by: 'tutorials point',
url: 'http://www.tutorialspoint.com',
tags: ['mongodb', 'database', 'NoSQL'],
likes: 100
})
Here mycol is our collection name, as created earlier. If the collection doesn't exist in the database, then MongoDB will create this collection and then insert a document into it.
In the inserted document, if we don't specify the _id parameter, then MongoDB assigns a unique ObjectId to this document.
_id is a 12-byte hexadecimal number, unique for every document in a collection. The 12 bytes are divided as follows −
_id: ObjectId(4 bytes timestamp, 3 bytes machine id, 2 bytes process id, 3 bytes incrementer)
To insert multiple documents in a single query, you can pass an array of documents to the insert() command.
Example
>db.post.insert([
{
title: 'MongoDB Overview',
description: 'MongoDB is no sql database',
by: 'tutorials point',
url: 'http://www.tutorialspoint.com',
Example
>db.mycol.find().pretty()
{
"_id": ObjectId(7df78ad8902c),
"title": "MongoDB Overview",
"description": "MongoDB is no sql database",
"by": "tutorials point",
"url": "http://www.tutorialspoint.com",
"tags": ["mongodb", "database", "NoSQL"],
"likes": "100"
}
>
Syntax
The basic syntax of the update() method is as follows −
>db.COLLECTION_NAME.update(SELECTION_CRITERIA, UPDATED_DATA)
Example
Consider that the mycol collection has the following data.
The following example will set the new title 'New MongoDB Tutorial' for the documents whose title is 'MongoDB Overview'.
>db.mycol.update({'title':'MongoDB Overview'},{$set:{'title':'New MongoDB
Tutorial'}})
>db.mycol.find()
{ "_id" : ObjectId(5983548781331adf45ec5), "title":"New MongoDB Tutorial"}
{ "_id" : ObjectId(5983548781331adf45ec6), "title":"NoSQL Overview"}
{ "_id" : ObjectId(5983548781331adf45ec7), "title":"Tutorials Point Overview"}
>