
CHAPTER 1

INPLANT TRAINING

1.1 BRIEF ON INPLANT TRAINING ATTENDED


Day 1- Introduction to Big Data
The session began with a detailed introduction to what Big Data is, along with real-life examples and a deeper understanding of the data processing world. A conceptual picture of the need for data analytics and data processing techniques in a technologically evolving world was built up in the minds of the audience. A detailed and interactive presentation on Data Science and the revolution it is creating in modern technology was demonstrated and discussed. A thorough understanding of how the business growth of companies relates to Data Science and Analytics was elucidated. Problems related to Big Data were discussed with examples, such as: companies and organizations are growing at a very fast pace, and this growth rapidly increases the amount of data produced. Storing this data is becoming a challenge for everyone. Options like data lakes and warehouses are used to collect and store massive quantities of unstructured data in its native format. The problem, however, is that when a data lake or warehouse tries to combine inconsistent data from disparate sources, it encounters errors. Inconsistent data, duplicates, logic conflicts, and missing data all result in data quality challenges. Data analysis is very important to make the huge amount of data being produced useful. Therefore, there is a huge need for Big Data analysts and Data Scientists. The shortage of quality data scientists has made it a job in great demand. It is important for a data scientist to have varied skills, as the job is multidisciplinary. This is another challenge faced by companies: the number of data scientists available is very small in comparison to the amount of data being produced.
Day 2- Real-life problem and solution implementation
A real-life problem was given, and all possible solutions for every step of that problem were discussed until a final conclusion was drawn. In this way, a new problem-solving approach through Hadoop was introduced. The first problem is storage: suppose the size of the dataset is 1 TB while the workstation supports only around 100 GB of data. The second is time: a hard disk takes about 1 second to transfer 122 MB. Let us assume the data access rate of a NAS server is the same as that of a hard disk; reading 1 TB will then take about 2.5 hours. An already optimized Java program to compute over the 1 TB of data will take 60 minutes, so, considering factors like network bandwidth, the total time will exceed well over 3 hours. What is the alternative approach? The solution is to divide the 1 TB of data into 100 blocks of equal size and have 100 different computational nodes work on one block each. This way, the data access time decreases to 1.5 minutes (150 mins/100) and the computation time decreases to 0.6 minutes (60 mins/100), provided all the nodes work in parallel.
The problems with this approach were also discussed. Hardware failure will lead to data loss, so to prevent that, each block has to be copied onto more than one node. Also, if 100 nodes access the shared storage at the same time to read data in parallel, this will choke the network and decrease performance. To prevent this, each block has to be stored locally on each node's hard disk (storage closer to computation). The challenges faced while achieving this are, on the storage side: how does node 1 know that node 3 also has block 1? Who decides that block 7, for instance, should be stored on nodes 1, 2 and 3? Who breaks the 1 TB of data into 100 blocks? On the computation side: node 1 can compute the data stored in block 1, and similarly node 2 can compute the data in block 2; but block 2 may also be stored on nodes 82 and 1, for example. We have to consolidate the results from each of the nodes and combine them. Who is going to coordinate all of that? The solution is distributed computing.
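As a rough cross-check of these figures, the small Java sketch below reproduces the arithmetic used in the session; the transfer rate, compute time and node count are the assumed values quoted above, not measured numbers.
public class TransferTimeEstimate {
    public static void main(String[] args) {
        double totalMb = 1_000_000;   // ~1 TB expressed in MB
        double mbPerSecond = 122;     // assumed HDD/NAS transfer rate from the session
        double computeMinutes = 60;   // assumed single-machine compute time
        int nodes = 100;              // number of equal-sized blocks / computational nodes

        double serialReadMinutes = totalMb / mbPerSecond / 60;   // ~137 min, quoted above as ~2.5 hours (150 min)
        double parallelReadMinutes = serialReadMinutes / nodes;  // ~1.4 min, quoted above as ~1.5 min
        double parallelComputeMinutes = computeMinutes / nodes;  // 0.6 min

        System.out.printf("Serial read: %.0f min%n", serialReadMinutes);
        System.out.printf("Parallel read per node: %.1f min%n", parallelReadMinutes);
        System.out.printf("Parallel compute per node: %.1f min%n", parallelComputeMinutes);
    }
}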
Day 3- Working with HDFS
HDFS is the Hadoop Distributed File System. In this session, a detailed introduction to distributed programming was given and the data analysis tool Hadoop was discussed thoroughly. Various questions, such as how data is stored in HDFS, what the use of HDFS in Hadoop is, and what the difference between the NameNode and the DataNode is, were discussed and answered.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks. HDFS is a scalable, fault-tolerant, distributed storage system that works closely with a wide variety of concurrent data access applications, coordinated by YARN. The NameNode is the centrepiece of HDFS. Also known as the Master, it stores the metadata of HDFS – the directory tree of all files in the file system – and tracks the files across the cluster. It does not store the actual data or the dataset; the data itself is stored in the DataNodes. The NameNode knows the list of blocks and their locations for any given file in HDFS. The DataNode is responsible for storing the actual data in HDFS and is also known as the Slave. The NameNode and the DataNodes are in constant communication. When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for. When a DataNode goes down, it does not affect the availability of data or the cluster; the NameNode will arrange replication for the blocks managed by the DataNode that is no longer available. A DataNode is usually configured with a lot of hard disk space, because the actual data is stored there. Numerous DataNodes are arranged in network formations called racks, and collections of racks are called clusters. Companies working on Big Data problems, like Facebook and Yahoo, have huge infrastructures filled with such racks and clusters. Each DataNode contains blocks, and these blocks hold the actual data from the files.
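As a small illustration of how a client application talks to HDFS, the sketch below reads a file using the Java FileSystem API; the NameNode address and file path are hypothetical. The NameNode only resolves the path into block locations, and the block contents are then streamed from the DataNodes.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster; fs.defaultFS normally points at the NameNode.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");   // hypothetical NameNode address
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/training/sample.txt");   // hypothetical HDFS path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);   // print each line of the file
            }
        }
        fs.close();
    }
}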
Day 4- Map Reduce
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm
contains two important tasks, namely Map and Reduce. Map takes a
set of data and converts it into another set of data, where individual
elements are broken down into tuples (key/value pairs). Secondly,
reduce task, which takes the output from a map as an input and
combines those data tuples into a smaller set of tuples. As the sequence
of the name MapReduce implies, the reduce task is always performed
after the map job.
The major advantage of MapReduce is that it is easy to scale data
processing over multiple computing nodes. Under the MapReduce
model, the data processing primitives are called mappers and reducers.
Decomposing a data processing application into mappers and reducers
is sometimes nontrivial. But, once we write an application in the
MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely
a configuration change. This simple scalability is what has attracted
many programmers to use the MapReduce model.
The Algorithm
Generally, the MapReduce paradigm is based on sending the computation to where the data resides.
A MapReduce program executes in three stages, namely the map stage, the shuffle stage, and the reduce stage.
Map stage: The map or mapper’s job is to process the input data.
Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS). The input file is passed to the
mapper function line by line. The mapper processes the data and
creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes
from the mapper. After processing, it produces a new set of output,
which will be stored in the HDFS.
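To make the map and reduce stages concrete, here is a sketch of the classic word-count job in Java, closely following the canonical Hadoop example; the input and output HDFS paths are taken from the command line.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: each input line is broken into (word, 1) key/value pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce stage: after the shuffle, all counts for the same word arrive together
    // and are summed into a single (word, total) pair.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}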
Day 5- Important Commands
All Hadoop commands are invoked by the
$HADOOP_HOME/bin/hadoop command. Running the Hadoop script
without any arguments prints the description for all commands.
Usage: hadoop [--config confdir] COMMAND
The following are the options available and their description.
namenode -format- Formats the DFS filesystem.
secondarynamenode- Runs the DFS secondary namenode.
namenode- Runs the DFS namenode.
datanode- Runs a DFS datanode.
dfsadmin- Runs a DFS admin client.
mradmin- Runs a Map-Reduce admin client.
fsck- Runs a DFS filesystem checking utility.
fs- Runs a generic filesystem user client.
balancer- Runs a cluster balancing utility.
oiv- Applies the offline fsimage viewer to an fsimage.
fetchdt- Fetches a delegation token from the NameNode.
jobtracker - Runs the MapReduce job Tracker node.
pipes- Runs a Pipes job.
tasktracker- Runs a MapReduce task Tracker node.
historyserver- Runs job history servers as a standalone daemon.
job- Manipulates the MapReduce jobs.
queue- Gets information regarding JobQueues.
version - Prints the version.
jar <jar> - Runs a jar file.
distcp <srcurl> <desturl> - Copies file or directories recursively.
distcp2 <srcurl> <desturl> - DistCp version 2.
archive -archiveName NAME -p <parent path> <src>* <dest> - Creates a Hadoop archive.
classpath- Prints the class path needed to get the Hadoop jar and the
required libraries.
daemonlog- Gets/sets the log level for each daemon.
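For example, a few illustrative invocations (the paths and jar name here are placeholders, not actual cluster locations):
hadoop fs -ls /user/training
hadoop fsck /user/training -files -blocks
hadoop jar wordcount.jar WordCount /user/training/input /user/training/output
hadoop version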
Day 6- Apache Pig
This session was meant for all those professionals working on Hadoop who would like to perform MapReduce operations without having to write complex code in Java. We were given a good understanding of the basics of Hadoop and HDFS commands. Programmers who are not very good at Java normally used to struggle with Hadoop, especially while performing any MapReduce tasks. Apache Pig is a boon for all such programmers. Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Apache Pig. Pig minimizes the learning curve, minimizes time and effort, handles optimizations to an extent and handles errors to an extent.
To make it easy for non-programmers to write data analysis programs, Pig provides a high-level language known as Pig Latin. This language provides various operators with which programmers can develop their own functions for reading, writing, and processing data. To analyze
data using Apache Pig, programmers need to write scripts using Pig
Latin language. All these scripts are internally converted to Map and
Reduce tasks. Apache Pig has a component known as Pig Engine that
accepts the Pig Latin scripts as input and converts those scripts into
MapReduce jobs. Apache Pig is generally used by data scientists for
performing tasks involving ad-hoc processing and quick prototyping.
Apache Pig is used to process huge data sources such as web logs, to perform data processing for search platforms, and to process time-sensitive data loads. To perform a particular task, programmers need to write a Pig script in the Pig Latin language and execute it using one of the execution mechanisms (Grunt Shell, UDFs, Embedded). After execution, these scripts go
through a series of transformations applied by the Pig Framework, to
produce the desired output. Internally, Apache Pig converts these
scripts into a series of MapReduce jobs, and thus, it makes the
programmer’s job easy.
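As a small illustration of the "Embedded" execution mechanism mentioned above, the sketch below submits a word-count data flow written in Pig Latin from a Java program using the PigServer API; the HDFS paths are hypothetical.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigWordCount {
    public static void main(String[] args) throws Exception {
        // Run Pig in MapReduce mode; the Pig Engine turns the Pig Latin
        // statements below into one or more MapReduce jobs.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Input and output paths are illustrative HDFS locations.
        pig.registerQuery("lines = LOAD '/user/training/logs.txt' AS (line:chararray);");
        pig.registerQuery("words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;");
        pig.registerQuery("grouped = GROUP words BY word;");
        pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(words);");

        // store() triggers execution of the whole data flow.
        pig.store("counts", "/user/training/wordcount_out");
        pig.shutdown();
    }
}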
Day 7- Apache Hive
Before proceeding with this session, we were given a basic knowledge of Core Java, SQL database concepts, the Hadoop file system, and some Linux operating system flavors. This session was prepared for professionals aspiring to make a career in Big Data Analytics using the Hadoop framework. ETL developers and professionals who are into analytics in general may also use this session to good effect.
Hive is a data warehouse infrastructure tool to process structured data
in Hadoop. It resides on top of Hadoop to summarize Big Data, and
makes querying and analyzing easy. Initially, Hive was developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It
is used by different companies. For example, Amazon uses it in
Amazon Elastic MapReduce.
The following components depict the architecture of Hive:
User Interface- Hive is a data warehouse infrastructure software that
can create interaction between user and HDFS. The user interfaces that
Hive supports are Hive Web UI, Hive command line, and Hive HD
Insight (In Windows server).
Meta Store- Hive chooses respective database servers to store the
schema or Metadata of tables, databases, columns in a table, their data
types, and HDFS mapping.
HiveQL Process Engine- HiveQL is similar to SQL and is used for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine- The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results equivalent to MapReduce results. It uses the flavor of MapReduce.
HDFS or HBASE- The Hadoop Distributed File System or HBase is the data storage technique used to store data in the file system.
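As an illustration, the sketch below issues a HiveQL query from Java over JDBC, assuming a HiveServer2 endpoint; the host, database, table name and credentials are hypothetical. The query itself is ordinary SQL-like HiveQL, and the Hive execution engine turns it into MapReduce work over data stored in HDFS.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (HiveServer2).
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // Hostname, port, database, user and table below are illustrative only.
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://hiveserver:10000/default", "training", "");
             Statement stmt = con.createStatement()) {

            // HiveQL looks like SQL; behind the scenes it is executed as MapReduce.
            ResultSet rs = stmt.executeQuery(
                "SELECT category, COUNT(*) AS cnt FROM products GROUP BY category");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}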
1.2 ABSTRACT ON INPLANT TRAINING ATTENDED
The term ‘Big Data’ describes innovative techniques and technologies
to capture, store, distribute, manage and analyse petabyte- or larger-
sized datasets with high velocity and varied structures. Big data can be structured, unstructured or semi-structured, which makes conventional data management methods incapable of handling it. Data is generated from many different sources and can arrive in the system at various rates. In order to process these large amounts of data in an inexpensive and efficient way, parallelism is used. Big Data is data whose scale, diversity, and complexity require new architectures, techniques,
algorithms, and analytics to manage it and extract value and hidden
knowledge from it. Hadoop is the core platform for structuring Big
Data, and solves the problem of making it useful for analytics
purposes. Hadoop is an open source software project that enables the
distributed processing of large data sets across clusters of commodity
servers. It is designed to scale up from a single server to thousands of
machines, with a very high degree of fault tolerance.
Big data is a term that refers to data sets or combinations of data sets
whose size (volume), complexity (variability), and rate of growth
(velocity) make them difficult to be captured, managed, processed or
analyzed by conventional technologies and tools, such as relational
databases and desktop statistics or visualization packages, within the
time necessary to make them useful. While the size used to determine whether a particular data set is considered big data is not firmly defined and continues to change over time, most analysts and practitioners currently refer to data sets from 30-50 terabytes (10^12 bytes, or 1000 gigabytes, per terabyte) to multiple petabytes (10^15 bytes, or 1000 terabytes, per petabyte) as big data. When humans consume information, a great deal of heterogeneity is comfortably tolerated. In fact, the nuance and richness of natural language can provide valuable depth. However, machine analysis algorithms expect homogeneous data, and cannot understand nuance. In consequence, data must be carefully structured as a first step in (or prior to) data analysis.
1.3 CERTIFICATION BY COMPANY
1.4 DECLARATION BY STUDENT

I, Aindrila Samanta, a bonafide student of B.Tech in SRM University,


Ramapuram, would like to declare that the project entitled “Big Data and
Hadoop” submitted for the "Industrial Training" is my original work
and the project has not formed the basis for the award of any degree,
associateship, fellowship or any other similar titles.

Date: Signature:
1.5 INTRODUCTION TO PROJECT
The term Big Data refers to all the data that is being generated across
the globe at an unprecedented rate. This data could be either structured
or unstructured. Today’s business enterprises owe a huge part of their
success to an economy that is firmly knowledge-oriented. Data drives
the modern organizations of the world and hence making sense of this
data and unravelling the various patterns and revealing unseen
connections within the vast sea of data becomes critical and a hugely
rewarding endeavour indeed. There is a need to convert Big Data into
Business Intelligence that enterprises can readily deploy. Better data
leads to better decision making and an improved way to strategize for
organizations regardless of their size, geography, market share,
customer segmentation and such other categorizations. Hadoop is the
platform of choice for working with extremely large volumes of data.
The most successful enterprises of tomorrow will be the ones that can
make sense of all that data at extremely high volumes and speeds in
order to capture newer markets and customer base.
Big Data has certain characteristics and hence is defined using 4Vs
namely:
Volume: the amount of data that businesses can collect is really
enormous and hence the volume of the data becomes a critical factor in
Big Data analytics.
Velocity: the rate at which new data is being generated, thanks to our dependence on the internet, sensors and machine-to-machine data; it is also important to parse Big Data in a timely manner.
Variety: the data that is generated is completely heterogeneous in the sense that it could be in various formats like video, text, database, numeric, sensor data and so on, and hence understanding the type of Big Data is a key factor to unlocking its value.
Veracity: the trustworthiness and quality of the generated data can vary greatly, and this uncertainty has to be accounted for before the data can be relied upon for decision making.
1.6 TOOLS AND TECHNOLOGY USED
Technology selection is just part of the process when implementing big
data projects. Experienced users say it's crucial to evaluate the potential
business value that big data software can offer and to keep long-term
objectives in mind as one moves forward.
Hadoop is a popular tool for organizing the racks and racks of servers,
and NoSQL databases are popular tools for storing data on these racks.
These mechanisms can be much more powerful than the old single
machine, but they are far from being as polished as the old database
servers. Although SQL may be complicated, writing the JOIN query for
the SQL databases was often much simpler than gathering information
from dozens of machines and compiling it into one coherent answer.
Hadoop jobs are written in Java, and that requires another level of
sophistication. The tools for tackling big data are just beginning to
package this distributed computing power in a way that's a bit easier to
use. Many of the big data tools are also working with NoSQL data stores.
These are more flexible than traditional relational databases, but the
flexibility isn't as much of a departure from the past as Hadoop. NoSQL
queries can be simpler because the database design discourages the
complicated tabular structure that drives the complexity of working with
SQL. The main worry is that software needs to anticipate the possibility
that not every row will have some data for every column.
The Jaspersoft package is one of the open source leaders for producing
reports from database columns. The software is well-polished and
already installed in many businesses turning SQL tables into PDFs that
everyone can scrutinize at meetings. Pentaho is another software
platform that began as a report-generating engine; it is, like Jaspersoft,
branching into big data by making it easier to absorb information from
the new sources. You can hook up Pentaho's tool to many of the most
popular NoSQL databases such as MongoDB and Cassandra. Once the
databases are connected, you can drag and drop the columns into views
and reports as if the information came from SQL databases.
1.7 RESULTS AND DISCUSSIONS
The problem of handling a vast quantity of data that the system is unable
to process is not a brand-new research issue; in fact, it appeared in
several early approaches [2, 21, 72], e.g., marketing analysis, network
flow monitor, gene expression analysis, weather forecast, and even
astronomy analysis. This problem still exists in big data analytics today;
thus, pre-processing is an important task to make the computer,
platform, and analysis algorithm able to handle the input data. The
traditional data pre-processing methods [73] (e.g., compression,
sampling, feature selection, and so on) are expected to be able to operate
effectively in the big data age. However, a portion of the studies still
focus on how to reduce the complexity of the input data because even
the most advanced computer technology cannot efficiently process the
whole input data using a single machine in most cases. Using domain knowledge to design the pre-processing operator is a possible solution for big data. Big data may be created by handheld devices, social networks, the internet of things, multimedia, and many other new applications that all have the characteristics of volume, velocity,
and variety. As a result, the whole data analytics has to be re-examined
from the following perspectives:
From the volume perspective, the deluge of input data is the very first
thing that we need to face because it may paralyze the data analytics.
Different from traditional data analytics, for the wireless sensor network
data analysis, Baraniuk pointed out that the bottleneck of big data analytics will shift from the sensors to the processing, communication and storage of the sensing data. This is because sensors can gather much more data, but uploading such large data to upper-layer systems may create bottlenecks everywhere. In addition, from the velocity perspective, real-time or streaming data bring up the problem of a large quantity of data arriving at the analytics system within a short duration, while the device and system may not be able to handle these input data.
This situation is similar to that of the network flow analysis for which
we typically cannot mirror and analyse everything we can gather.

1.8 CONCLUSION AND FUTURE SCOPE

As the size of data increases, the demand for Hadoop technology will
rise. There will be a need for more Hadoop developers to deal with big data challenges.

IT professionals with Hadoop skills will benefit from increased salary packages and accelerated career growth.

Shown below are different profiles of Hadoop professionals according to their expertise and experience in Hadoop technology.
Hadoop Developer- A Hadoop developer must have proficiency in the Java programming language, a database query language like HQL, and scripting languages, as these are needed to develop applications related to Hadoop technology.
Hadoop Architect- The overall development and deployment process of
Hadoop Applications is managed by Hadoop Architects. They plan and
design the Big Data system architecture and serve as the head of the project.
Hadoop Tester- A Hadoop tester is responsible for the testing of any
Hadoop application, which includes fixing bugs and testing whether the application is effective or needs some improvements.
Hadoop Administrator- The responsibility of a Hadoop Administrator is
to install and monitor Hadoop clusters. It involves the use of cluster monitoring tools like Ganglia and Nagios, and adding or removing nodes as required.
Data Scientist- The role of Data Scientist is to employ big data tools and
several advanced statistical techniques in order to solve business related
problems. Being the most responsible job profile, the future growth of the organization mostly depends on Data Scientists.
CHAPTER 2

ANALYSIS

SURVEY ON COMPANIES

HECKYL TECHNOLOGIES

 Headquarters: Mumbai, Maharashtra, India


 Year of incorporation: 2010
 Annual turnover: Rs. 1.4 crore (2013)
 Number of Employees: 51-200
 Current Project:
o FIND – ANALYTICAL PLATFORM
FIND is a B2B analytical platform for investors, traders and researchers. It uses a combination of human intervention, computerised language processing and a simple, intuitive display to give users a sense of whether the day's news on a particular stock is positive, negative or neutral, without having to read each news feed individually.
Features:
 Aggregation of data from thousands of sources to cover all asset classes:
equities, FX, commodities, etc.
 Deep-dive news and sentiment analytics
 Real-time actionable intelligence

Why it’s great:

A scalable product with a global appeal that performs the real-time analytics of
structured and unstructured data.
 Skill Demands:
 Web Engineer
Expertise in areas
C#
ASP.NET MVC
WCF
LINQ
HTML5
jQuery Scripting

 Contact Details:
HECKYL TECHNOLOGIES (INDIA)
Unit No. 1002, B - Wing
Supreme Business Park, Hiranandani Gardens
Powai, Mumbai - 400 076
PHONE: +91 2242153561
SIGMOID

 Headquarters: Bengaluru, Karnataka, India


 Year of incorporation: 2013
 Categories: Data Visualization, Big Data, Database
 Number of Employees: 1 to 50
 Current Project:
o SigView- An interactive analytics tool
Unify Data Sources
Join internal data from streaming sources like clickstream, web or transaction
data to external data sources like SEM reports and customer data, offline data
sources like CRM, and other historical data sources to create a single source of
truth for your business. One can seamlessly integrate billions of events from
multiple sources.
Run interactive analytics at scale on a unified data set. SigView is a fully managed
SaaS platform, designed for scale and speed of today’s event data. One can
interactively analyse hundreds of billions of events in less than a few seconds.
You do not have to worry when your data explodes as your business reaches
new heights.
Why it’s great:
 Query Performance
 Live Data Refresh
 No Pre-Aggregation
 Data Optimization
 Interactive Interface
 Fully Managed
 Skill Demands:
o Big Data Architect

Preferred Qualifications:

 Engineering Bachelors/Masters in Computer Science/IT.


 Top Tier Colleges (IIT, NIT, IIIT, etc) will be preferred.

 Contact Details:

SIGMOID, INDIA OFFICE

Address: # 7, 2ND FLOOR, LIGHTSPRO BUILDING, GULMOHAR ENCLAVE


ROAD, KUNDANAHALLI GATE, BENGALURU, KARNATAKA 560037

Contact: 080 4218 7033


FLUTURA

 Headquarters: Bengaluru, Karnataka, India


 Year of incorporation: 2012
 Categories: Machine to Machine, Internet of Things, Big Data Analytics, Decision
sciences
 Annual turnover: US$7.5 million
 Current Project:
o Cerebra- Data science platform
It is the only platform which aligns gracefully with an engineer's mental model. It allows OEMs to scale value-added offerings across varied equipment classes.
Out of the box equipment sub systems and fault modes enabling faster time to
market. Advanced diagnostics algorithms for equipment health episode
detection. Asset centric grey box models - engineering + statistics + heuristics
based machine learning. Action oriented real-time nanoapps machine tweets.
Pre-built machine diagnostics tests. Automated predictor ranking. Integration
with internal systems and workload-based stores. Intelligence at the edge.
Why it’s great:
 Persona specific user experience
 State of the art and secure lambda architecture powering Cerebra
 Finely balanced machine intelligence at Edge and Cloud
 Superparser noise vs. signals
 Ability to machine learn new signals from billions of machine events
 Ability to have asset and process context
 Ability to triangulate signals across fragmented data pools - Historians,
SCADA, PLC, Maintenance systems, Ambient conditions
 Skill Demands:
o Decision Scientist

Preferred Qualifications:

 Engineering Bachelors/Masters in Computer Science/IT.


 Must be well-equipped in:
 Machine Learning
 Pyspark
 R
 Python
 Spark
 Scala
 Django
 AngularJS

 Contact Details:

ADDRESS: Flutura Business Solutions Private Limited

#693 ' Geethanjali ', 1st Floor,

15th Cross, J.P Nagar 2nd Phase,

Bangalore - 560078 Karnataka, India

PHONE: +91 8026581334


FRACTAL ANALYTICS

 Headquarters: Mumbai, Maharashtra, India


 Year of incorporation: 2000
 Categories: Machine Learning, Artificial Intelligence
 Number of Employees: 1001 to 5000
 Annual turnover: ₹1 to ₹5 billion (INR)
 Current Project:
 CONCORDIA- DATA HARMONIZATION TOOL
o Concordia helps enterprises get strong returns on their data investments
and deliver exceptional business results. The solution enables
enterprises to:
o Harmonize data at the lowest level of granularity (e.g., stock keeping
units)
o Centralize harmonization logic, rules and data
o Establish a single global hierarchy to map products and other
dimensions across data sources
o Standardize definitions and varying metrics across data sources
o Deliver timely data that’s accurate, consistent and there when it’s needed
o Accelerate set-up times to reduce the time to data-driven decisions
o Streamline sales and operations planning processes by having a “single
source of the truth”
 Other Products:
o Customer Genomics
o Business 360 steering wheel
 Skill Demands:
o Data Scientist
Must Have:
o 4+ years experience in AI and Machine Learning
o Masters/PhD in Computer Science, Math, Engineering,
Statistics, Economics or other quantitative fields from
top tier institute
o Programming expertise in at least two of Python/Scala/R
o Preferred experience with large data sets and distributed
computing (Spark/Hadoop)
o Experience with Deep Learning and AI packages (e.g.,
Theano/TensorFlow/DeepLearning4j)
o Knowledge of traditional Machine Learning algorithms
like Random Forests, GBM etc.
o Fluency with SQL databases
o Proven experience in leading data driven projects from
definition to execution
o Strong problem-solving skills

 Contact Details:

ADDRESS: Mumbai

Level 7, Silver Metropolis, Western Express Highway, Goregaon (E),

Mumbai 400 063

PHONE: +91 22 40675800


CRAYON DATA

 Headquarters: Chennai, Tamil Nadu, India


 Year of incorporation: 2000
 Categories: Big Data, Analytics, Business Intelligence, Simpler Choices, Technology,
Recommender Engine, Choice Engine
 Number of Employees: 51-200
 Current Project:

MAYA- ANALYTICS TOOL


With Maya, enterprises can deliver ultra-personalized choices to their
customers. Maya stores, analyses and maps the tastes of millions of
customers. Across multiple lifestyle categories.
This provides enterprises with data available today only to Internet giants
like Google and Amazon. Using Crayon’s TasteGraph™, Maya enriches
enterprise internal data and generates personalized choices. In real time. At
scale. To be served on any digital channel.
YODA- SALES ASSISTANT TOOL
Yoda is an indispensable personal sales assistant. Yoda prioritises prospects
and personalises sales messaging for enterprises looking to target SMEs.
Enterprises can identify the right prospects to engage with, and get real-time, relevant information to drive targeted conversations.
 Skill Demands:
o Data Scientist
Responsibilities:
1. Hands on experience in Java or Scala.
2. Strong understanding and hands on experience of Supervised
and Unsupervised Machine learning techniques (like
classification, clustering, regression, etc.)
3. Experience with prototyping of these ML techniques using R
or Python
4. Experience with productising the same using Spark ML,
Apache Mahout, Weka, etc.
5. Strong understanding on various types of recommender
systems like Collaborative Filtering, Content based filtering,
Association rule mining, etc.
6. Working knowledge of Big Data Tech stack — Hadoop, Spark
and NoSQL databases like Couchbase, HBase, Solr, etc.

 Contact Details:

ADDRESS: #33-B, 3rd Floor, Software Block,

Elnet Software City, Old Mahabalipuram Road,

Taramani, Chennai, Tamil Nadu 600113

PHONE: +91 44 66992020


GERMIN8

 Headquarters: Mumbai, Maharashtra, India


 Year of incorporation: 2007
 Categories: Big Data, Analytics
 Number of Employees: 51-200
 Current Project:

GERMIN8 SOCIAL LISTENING- DIGITAL MARKETING TOOL

Germin8 Social Listening helps companies make sense of the conversations


about their brand, competitors, products and campaigns on social media.
Conversations gathered from various social media and user generated
content sites are analysed using industry specific algorithms for topic and
sentiment and presented on helpful dashboards for deriving insights.

TROOYA- SOCIAL CUSTOMER SERVICE TOOL

Trooya is a cloud-based social media contact center that enables you to


respond to customers, resolve their issues and earn their goodwill. Some of
the key intelligence built into the tool revolves around smart workflows that
enable teams to collaboratively attend to customers, prioritize important
customers and automate the allocation of customer service agents to
customer interactions. The product comes with three pricing plans:
Beginner, Professional and Enterprise, based on the volume of usage and
support levels.
 Skill Demands:
o Solution Architect
Must Haves:
o Java,
o JavaScript,
o HTML,
o NoSQL,
o Apache Solr,
o AngularJS, Bash Scripting, Apache Cassandra, data structures and
algorithms, cloud computing,
o Data Streaming concepts

 Contact Details:

ADDRESS: 501/502 Eco House, Vishveshwar Colony, Off Aarey Road, Goregaon
(E), Mumbai 400 063, India

PHONE: +91-22-4006-4550
AUREUS ANALYTICS

 Headquarters: Mumbai, Maharashtra, India


 Year of incorporation: 2013
 Categories: Insurance & Banking. Big Data, Predictive Analytics & Machine Learning
 Number of Employees: 11-50
 Current Project:

CRUX- ANALYTICS TOOL


CRUX is a completely customizable predictive analytics platform that can
meet your organization's unique analytics requirements. Using proprietary
algorithms, CRUX develops dynamic intelligence modules for each
customer and their household. Insights at a household level can help
improve the overall customer contactability and serviceability thereby
improving the possibility of higher customer retention, more accurate cross
sell and better claims and fraud prediction.

PULSE- CUSTOMER FEEDBACK INSIGHT TOOL


Net Promoter Score in itself is a great tool to measure customer loyalty. But
it doesn’t help the business user know precisely what the key areas of improvement are. PULSE solves this problem by deriving hidden insights from
the customer survey data and interaction data to make the NPS more
actionable.
 Skill Demands:
o Analytics Engineer
 Key skills:
o PHP
o Java
o Product Development
o Cost Benefit Analysis
o Java Developer
o Hadoop developer

 Contact Details:
Development Center
706, Powai Plaza,
Hiranandani Gardens,
Powai, Mumbai 400076
METAOME

 Headquarters: Bangalore, Karnataka, India


 Year of incorporation: 2007
 Categories: Biotechnology
 Number of Employees: 11-50
 Current Project:
DISTILBIO ENTERPRISE- DATA ANALYTICS TOOL
DistilBio Enterprise is a seamless solution for enterprise-level data
integration, search and discovery. DistilBio Enterprise brings together
disparate data such as data from various instruments, documents, laboratory
data management systems, private databases and also public databases.
DistilBio Enterprise enables the user to seamlessly search across the data
and identify hidden relationships within the data. DistilBio enterprise
comes with graphical query builder and a search platform that is highly
customizable to the needs of the organization.
DISTILBIO ONLINE- WEB APPLICATION
DistilBio Online is a free web-based graph search and discovery application
for the Life Sciences and Drug Discovery. DistilBio enables users to
discover hidden relationships that span across many data types and data
sources such as Uniprot, Reactome, OMIM, InterPro, DrugBank, CCLE,
PharmGKB, HomoMINT and several others, revealing new insights into
Biology. DistilBio has a visual query builder that enables users to
intuitively build complex queries without the need for any informatics
support. DistilBio's graph-based search technology enables users to look
for relationships rather than just keywords in the data.
 Skill Demands:
o Analytics Engineer
 Key skills:
o PHP
o Java
o Product Development
o Cost Benefit Analysis
o Java Developer
o Hadoop developer

 Contact Details:
ADDRESS: 147, 5th Cross,
8th Main, 2nd Block Jayanagar,
Bangalore – 560011
PHONE: +91-80-2656-5696
FRROLE

 Headquarters: Bengaluru, Karnataka, India


 Year of incorporation: 2012
 Categories: Social Media, Analytics, Big Data
 Number of Employees: 15
 Current Project:

SCOUT- INTELLIGENCE TOOL


Frrole Scout is an enterprise-ready Social Intelligence tool that marries
state-of-the-art algorithms with a simple UI. It incorporates traditional
Social Listening, Audience Intelligence and an industry-first ability to
conduct on-demand Market Research.
DEEPSENSE- INSIGHTS TOOL
A state-of-the-art offering that makes insights about any user available to
products that need to be consumer aware. It is powered by the Frrole AI platform, which reverse-constructs a profile for each consumer based on his/her public social activity, including predictive attributes about his/her behavior, needs, demographics and possible choices.

OTHER PRODUCTS
o FRROLE AI PLATFORM
The platform combines the ability to deeply understand social data with the capability to build thousands of data models based on traditional web data sources like Wikipedia, Freebase, Geo Maps etc. The result, therefore, is a rich, contextual insight every time and not just a piece of annotation or metadata.
 Skill Demands:
o Principal Architect
 Key skills:
o Knowledge of Machine Learning
o Big data technologies
o Algorithms and building large-scale frameworks
o Java
o Cassandra
o Software Architecture
o Spring, Hibernate
o Maven

 Contact Details:

ADDRESS: 1145, 3rd Floor, 22nd Cross,


HSR Club Road, Sector II, HSR Layout
Bangalore 560102
PROMPTCLOUD

 Headquarters: Bengaluru, Karnataka, India


 Year of incorporation: 2009
 Categories: Analytics, Big Data
 Number of Employees: 11-50
 Current Project:

MASS SCALE CRAWLS


Mass-scale crawls are your data partner when you wish to analyse content from a large variety of sources without much attention to record-level details.
SITE SPECIFIC CRAWLS
A classic web crawling model where it takes the list of sites that you’d like crawled and does vertical-specific crawls. It is particularly suitable when your target dataset is scattered across multiple sources on the web and each site is different.
TWITTER CRAWLS
One provides it with a list of keywords that gets fed into its crawler. The
crawler then continuously looks for tweets matching your list of keywords as they get published. All these tweets are later converted into
a structured format with other associated information. You can now query
the datasets that have been captured through our Search API using various
criteria. Or you can just download the files (with added normalization if
you like) from our data API.
 Skill Demands:
o Software Engineer
Required skills:
o sound knowledge of Algorithms and OOP
concepts
o proficiency with Linux/Unix (required)
o knowledge of any one of the scripting languages
– Ruby/Perl/Python
o graduated from a tier-1 college (IITs, NITs, IIITs,
BITs) or you’re dead smart to blow us away with
your tech skills
o 1 to 3 years of industry experience in a tech role
o prior experience with a startup or Big Data
technologies is a plus
o prior exposure to web technologies, Rails,
Django is a plus
o energy and passion for working in a startup
o sense of ownership and attention to details

 Contact Details:
Phone:
+91 8041216038
Address:
Second floor 118/1, 80 ft Road, Indira Nagar,
Bangalore – 560075,INDIA
CHAPTER 3

PLATFORMS/ LANGUAGE/ FRAMEWORK/ TOOLS

3.1 HISTORY

The genesis of Hadoop was the "Google File System" paper that was published
in October 2003. This paper spawned another one from Google – "MapReduce:
Simplified Data Processing on Large Clusters". Development started on the
Apache Nutch project, but was moved to the new Hadoop subproject in January
2006.[14] Doug Cutting, who was working at Yahoo! at the time, named it after
his son's toy elephant. The initial code that was factored out of Nutch consisted
of about 5,000 lines of code for HDFS and about 6,000 lines of code for
MapReduce.

The first committer added to the Hadoop project was Owen O’Malley (in March
2006);[16] Hadoop 0.1.0 was released in April 2006.[17] It continues to evolve
through the many contributions that are being made to the project.

Apache Hadoop is an open-source software framework used for distributed


storage and processing of big data sets using the MapReduce programming
model. It consists of computer clusters built from commodity hardware. All the
modules in Hadoop are designed with a fundamental assumption that hardware
failures are common occurrences and should be automatically handled by the
framework.

The core of Apache Hadoop consists of a storage part, known as Hadoop


Distributed File System (HDFS), and a processing part which is a MapReduce
programming model. Hadoop splits files into large blocks and distributes them
across nodes in a cluster. It then transfers packaged code into nodes to process
the data in parallel. This approach takes advantage of data locality, where nodes
manipulate the data they have access to. This allows the dataset to be processed
faster and more efficiently than it would be in a more conventional supercomputer
architecture that relies on a parallel file system where computation and data are
distributed via high-speed networking.

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common – contains libraries and utilities needed by other Hadoop


modules;

Hadoop Distributed File System (HDFS) – a distributed file-system that stores


data on commodity machines, providing very high aggregate bandwidth across
the cluster;

Hadoop YARN – a platform responsible for managing computing resources in


clusters and using them for scheduling users' applications; and

Hadoop MapReduce – an implementation of the MapReduce programming model


for large-scale data processing.

The term Hadoop has come to refer not just to the aforementioned base modules
and sub-modules, but also to the ecosystem, or collection of additional software
packages that can be installed on top of or alongside Hadoop, such as Apache
Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache
ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and
Apache Storm.

Apache Hadoop's MapReduce and HDFS components were inspired by Google


papers on their MapReduce and Google File System.

The Hadoop framework itself is mostly written in the Java programming


language, with some native code in C and command line utilities written as shell
scripts. Though MapReduce Java code is common, any programming language
can be used with "Hadoop Streaming" to implement the "map" and "reduce" parts
of the user's program.[11] Other projects in the Hadoop ecosystem expose richer
user interfaces.
3.2 FEATURES

Apache Hadoop is the most popular and powerful big data tool. Hadoop provides the world's most reliable storage layer (HDFS), a batch processing engine (MapReduce) and a resource management layer (YARN).

Open-source – Apache Hadoop is an open-source project, which means its code can be modified according to business requirements.

Distributed Processing – As data is stored in a distributed manner in HDFS across


the cluster, data is processed in parallel on a cluster of nodes.

Fault Tolerance – By default, 3 replicas of each block are stored across the cluster in Hadoop, and this can also be changed as per the requirement. So if any node goes down, data on that node can be recovered from other nodes easily. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault-tolerant.
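For example, the cluster-wide default replication factor is controlled by the dfs.replication property; an illustrative hdfs-site.xml entry might look like this (3 is the usual default and can be changed as required):
<configuration>
  <property>
    <!-- number of copies kept for each HDFS block -->
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>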

Reliability – Due to the replication of data in the cluster, data is reliably stored on the cluster of machines despite machine failures. Even if your machine goes down, your data will still be stored reliably.

High Availability – Data is highly available and accessible despite hardware failure, due to multiple copies of the data. If a machine or some hardware crashes, data can be accessed from another path.

Scalability – Hadoop is highly scalable in that new hardware can be easily added to the nodes. It also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.

Economic – Apache Hadoop is not very expensive, as it runs on a cluster of commodity hardware. We do not need any specialized machine for it. Hadoop also provides huge cost savings, as it is very easy to add more nodes on the fly. So if the requirement increases, you can increase the number of nodes without any downtime and without requiring much pre-planning.

Easy to use – There is no need for the client to deal with distributed computing; the framework takes care of all of it, so it is easy to use.

Data Locality – Hadoop works on the data locality principle, which states: move the computation to the data instead of the data to the computation. When a client submits a MapReduce algorithm, the algorithm is moved to the data in the cluster rather than the data being brought to the location where the algorithm was submitted and then processed.

Hadoop is written with large clusters of computers in mind and is built around
the following assumptions:

 Hardware may fail (as commodity hardware can be used).


 Processing will be run in batches. Thus there is an emphasis on high
throughput as opposed to low latency.
 Applications that run on HDFS have large data sets. A typical file in HDFS
is gigabytes to terabytes in size.
 Applications need a write-once-read-many access model.
 Moving Computation is Cheaper than Moving Data.
Below are the design principles on which Hadoop works:
o System shall manage and heal itself
o Automatically and transparently route around failure (Fault
Tolerant)
o Speculatively execute redundant tasks if certain nodes are detected
to be slow
o Performance shall scale linearly
o Proportional change in capacity with resource change (Scalability)
o Computation should move to data
3.3 CONCEPTS

The Hadoop Distributed File System (HDFS) is a distributed file system


designed to run on commodity hardware. It has many similarities with
existing distributed file systems.

However, the differences from other distributed file systems are


significant. HDFS is highly fault-tolerant and is designed to be deployed
on low-cost hardware. HDFS provides high throughput access to
application data and is suitable for applications that have large data sets.

HDFS relaxes a few POSIX requirements to enable streaming access to file


system data.

HDFS was originally built as infrastructure for the Apache Nutch web
search engine project.

HDFS is now an Apache Hadoop subproject. The project URL is http://hadoop.apache.org/hdfs/.

 Assumptions and Goals


o Hardware Failure

Hardware failure is the norm rather than the exception. An HDFS


instance may consist of hundreds or thousands of server machines,
each storing part of the file system’s data.

The fact that there are a huge number of components and that each
component has a nontrivial probability of failure means that some
component of HDFS is always non-functional.

Therefore, detection of faults and quick, automatic recovery from


them is a core architectural goal of HDFS.
o Streaming Data Access

Applications that run on HDFS need streaming access to their data


sets. They are not general purpose applications that typically run on
general purpose file systems. HDFS is designed more for batch
processing rather than interactive use by users. The emphasis is on
high throughput of data access rather than low latency of data access.
POSIX imposes many hard requirements that are not needed for
applications that are targeted for HDFS. POSIX semantics in a few
key areas has been traded to increase data throughput rates.

o Large Data Sets

Applications that run on HDFS have large data sets. A typical file in
HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support
large files. It should provide high aggregate data bandwidth and scale
to hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.

o Simple Coherency Model

HDFS applications need a write-once-read-many access model for files.


A file once created, written, and closed need not be changed. This
assumption simplifies data coherency issues. A MapReduce application or a web crawler application fits perfectly with this model. There is a plan to support appending-writes to files in the future.

o “Moving Computation is Cheaper than Moving Data”

A computation requested by an application is much more efficient if it


is executed near the data it operates on. This is especially true when the
size of the data set is huge. This minimizes network congestion and
increases the overall throughput of the system. The assumption is that
it is often better to migrate the computation closer to where the data is
located rather than moving the data to where the application is running.
HDFS provides interfaces for applications to move themselves closer to
where the data is located.

o Portability Across Heterogeneous Hardware and Software Platforms

HDFS has been designed to be easily portable from one platform to


another. This facilitates widespread adoption of HDFS as a platform of
choice for a large set of applications.

 NameNode and DataNodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single


NameNode, a master server that manages the file system namespace and
regulates access to files by clients. In addition, there are a number of
DataNodes, usually one per node in the cluster, which manage storage
attached to the nodes that they run on. HDFS exposes a file system
namespace and allows user data to be stored in files. Internally, a file is
split into one or more blocks and these blocks are stored in a set of
DataNodes. The NameNode executes file system namespace operations
like opening, closing, and renaming files and directories. It also determines
the mapping of blocks to DataNodes. The DataNodes are responsible for
serving read and write requests from the file system’s clients. The
DataNodes also perform block creation, deletion, and replication upon
instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on


commodity machines. These machines typically run a GNU/Linux
operating system (OS). HDFS is built using the Java language; any
machine that supports Java can run the NameNode or the DataNode
software. Usage of the highly portable Java language means that HDFS can
be deployed on a wide range of machines. A typical deployment has a
dedicated machine that runs only the NameNode software. Each of the
other machines in the cluster runs one instance of the DataNode software.
The architecture does not preclude running multiple DataNodes on the
same machine but in a real deployment that is rarely the case. The existence
of a single NameNode in a cluster greatly simplifies the architecture of the
system. The NameNode is the arbitrator and repository for all HDFS
metadata. The system is designed in such a way that user data never flows
through the NameNode.

 The File System Namespace

HDFS supports a traditional hierarchical file organization. A user or an


application can create directories and store files inside these directories.
The file system namespace hierarchy is similar to most other existing file
systems; one can create and remove files, move a file from one directory
to another, or rename a file. HDFS does not yet implement user quotas.
HDFS does not support hard links or soft links. However, the HDFS
architecture does not preclude implementing these features.

The NameNode maintains the file system namespace. Any change to the
file system namespace or its properties is recorded by the NameNode. An
application can specify the number of replicas of a file that should be
maintained by HDFS. The number of copies of a file is called the
replication factor of that file. This information is stored by the NameNode.

 Data Replication
HDFS is designed to reliably store very large files across machines in a
large cluster. It stores each file as a sequence of blocks; all blocks in a file
except the last block are the same size. The blocks of a file are replicated
for fault tolerance. The block size and replication factor are configurable
per file. An application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can be changed
later. Files in HDFS are write-once and have strictly one writer at any time.
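For example, the replication factor of an existing file can be changed from the command line with the setrep option of the fs shell; the path and value below are illustrative, and -w makes the command wait until the target replication is reached:
hadoop fs -setrep -w 2 /user/training/sample.txt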

The NameNode makes all decisions regarding replication of blocks. It


periodically receives a Heartbeat and a Blockreport from each of the
DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode
is functioning properly. A Blockreport contains a list of all blocks on a
DataNode.

o Replica Placement: The First Baby Steps

The placement of replicas is critical to HDFS reliability and performance.


Optimizing replica placement distinguishes HDFS from most other
distributed file systems. This is a feature that needs lots of tuning and
experience. The purpose of a rack-aware replica placement policy is to
improve data reliability, availability, and network bandwidth utilization.
The current implementation for the replica placement policy is a first effort
in this direction. The short term goals of implementing this policy are to
validate it on production systems, learn more about its behaviour, and build
a foundation to test and research more sophisticated policies.

Large HDFS instances run on a cluster of computers that commonly spread


across many racks. Communication between two nodes in different racks
has to go through switches. In most cases, network bandwidth between
machines in the same rack is greater than network bandwidth between
machines in different racks.
The NameNode determines the rack id each DataNode belongs to via the
process outlined in Hadoop Rack Awareness. A simple but non-optimal
policy is to place replicas on unique racks. This prevents losing data when
an entire rack fails and allows use of bandwidth from multiple racks when
reading data. This policy evenly distributes replicas in the cluster which
makes it easy to balance load on component failure. However, this policy
increases the cost of writes because a write needs to transfer blocks to
multiple racks. For the common case, when the replication factor is three,
HDFS’s placement policy is to put one replica on one node in the local
rack, another on a node in a different (remote) rack, and the last on a
different node in the same remote rack. This policy cuts the interrack write
traffic which generally improves write performance. The chance of rack
failure is far less than that of node failure; this policy does not impact data
reliability and availability guarantees. However, it does reduce the
aggregate network bandwidth used when reading data since a block is
placed in only two unique racks rather than three. With this policy, the
replicas of a file do not evenly distribute across the racks. One third of
replicas are on one node, two thirds of replicas are on one rack, and the
other third are evenly distributed across the remaining racks. This policy
improves write performance without compromising data reliability or read
performance.

The current, default replica placement policy described here is a work in


progress.

o Replica Selection

To minimize global bandwidth consumption and read latency, HDFS tries


to satisfy a read request from a replica that is closest to the reader. If there
exists a replica on the same rack as the reader node, then that replica is
preferred to satisfy the read request. If an HDFS cluster spans multiple
data centers, then a replica that is resident in the local data center is
preferred over any remote replica.

o Safemode

On startup, the NameNode enters a special state called Safemode.


Replication of data blocks does not occur when the NameNode is in the
Safemode state. The NameNode receives Heartbeat and Blockreport
messages from the DataNodes. A Blockreport contains the list of data
blocks that a DataNode is hosting. Each block has a specified minimum
number of replicas. A block is considered safely replicated when the
minimum number of replicas of that data block has checked in with the
NameNode. After a configurable percentage of safely replicated data
blocks checks in with the NameNode (plus an additional 30 seconds), the
NameNode exits the Safemode state. It then determines the list of data
blocks (if any) that still have fewer than the specified number of replicas.
The NameNode then replicates these blocks to other DataNodes.
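
A minimal sketch of the Safemode exit condition described above is given below. The method and argument names simply mirror the prose, and the 30-second extension is the value quoted above; the real NameNode reads the threshold from configuration (in recent Hadoop releases a property along the lines of dfs.namenode.safemode.threshold-pct), so treat the names here as assumptions.

// Minimal sketch of the Safemode exit condition; not the NameNode's actual code.
public class SafemodeCheckSketch {
    static boolean canLeaveSafemode(long safelyReplicatedBlocks,
                                    long totalBlocks,
                                    double thresholdPct,
                                    long millisSinceThresholdReached) {
        if (totalBlocks == 0) {
            return true; // empty namespace: nothing to wait for
        }
        double safeFraction = (double) safelyReplicatedBlocks / totalBlocks;
        // Exit only after the configured fraction of blocks has checked in,
        // plus the additional ~30 second extension mentioned above.
        return safeFraction >= thresholdPct && millisSinceThresholdReached >= 30_000;
    }
}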

 The Persistence of File System Metadata

The HDFS namespace is stored by the NameNode. The NameNode uses a transaction log called the EditLog to persistently record every change that
occurs to file system metadata.

For example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog indicating this. Similarly, changing the replication factor of a file causes a new record to be inserted into the EditLog. The NameNode uses
a file in its local host OS file system to store the EditLog. The entire file
system namespace, including the mapping of blocks to files and file system
properties, is stored in a file called the FsImage. The FsImage is stored as
a file in the NameNode’s local file system too.
The NameNode keeps an image of the entire file system namespace and
file Blockmap in memory. This key metadata item is designed to be
compact, such that a NameNode with 4 GB of RAM is plenty to support a
huge number of files and directories. When the NameNode starts up, it
reads the FsImage and EditLog from disk, applies all the transactions from
the EditLog to the in-memory representation of the FsImage, and flushes
out this new version into a new FsImage on disk. It can then truncate the
old EditLog because its transactions have been applied to the persistent
FsImage. This process is called a checkpoint.
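
The checkpoint sequence can be summarised with the short sketch below. Namespace, EditLogEntry and the apply/save methods are hypothetical placeholders used only to show the order of operations; they are not the NameNode's real classes.

import java.util.List;

// Conceptual sketch of the checkpoint sequence described above.
public class CheckpointSketch {
    interface Namespace {
        void apply(EditLogEntry entry);   // replay one logged transaction
        void saveTo(String fsImagePath);  // flush the in-memory image to disk
    }
    interface EditLogEntry { }

    static void checkpoint(Namespace inMemoryImage,
                           List<EditLogEntry> editLog,
                           String newFsImagePath) {
        // 1. Apply every logged transaction to the in-memory FsImage.
        for (EditLogEntry entry : editLog) {
            inMemoryImage.apply(entry);
        }
        // 2. Persist the merged state as a new FsImage.
        inMemoryImage.saveTo(newFsImagePath);
        // 3. The old EditLog can now be truncated, since its transactions
        //    are reflected in the persistent FsImage.
        editLog.clear();
    }
}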

In the current implementation, a checkpoint only occurs when the NameNode starts up. Work is in progress to support periodic checkpointing
in the near future. The DataNode stores HDFS data in files in its local file
system. The DataNode has no knowledge about HDFS files. It stores each
block of HDFS data in a separate file in its local file system. The DataNode
does not create all files in the same directory. Instead, it uses a heuristic to
determine the optimal number of files per directory and creates
subdirectories appropriately. It is not optimal to create all local files in the
same directory because the local file system might not be able to efficiently
support a huge number of files in a single directory. When a DataNode
starts up, it scans through its local file system, generates a list of all HDFS
data blocks that correspond to each of these local files and sends this report
to the NameNode: this is the Blockreport.

 The Communication Protocols

All HDFS communication protocols are layered on top of the TCP/IP protocol. A client establishes a connection to a configurable TCP port on
the NameNode machine. It talks the ClientProtocol with the NameNode.
The DataNodes talk to the NameNode using the DataNode Protocol. A
Remote Procedure Call (RPC) abstraction wraps both the Client Protocol
and the DataNode Protocol. By design, the NameNode never initiates any
RPCs. Instead, it only responds to RPC requests issued by DataNodes or
clients.

 Robustness

The primary objective of HDFS is to store data reliably even in the presence of failures. The three common types of failures are NameNode
failures, DataNode failures and network partitions.

 Data Disk Failure, Heartbeats and Re-Replication

Each DataNode sends a Heartbeat message to the NameNode periodically. A network partition can cause a subset of DataNodes to lose connectivity
with the NameNode. The NameNode detects this condition by the absence
of a Heartbeat message. The NameNode marks DataNodes without recent
Heartbeats as dead and does not forward any new IO requests to them. Any
data that was registered to a dead DataNode is not available to HDFS
anymore. DataNode death may cause the replication factor of some blocks
to fall below their specified value. The NameNode constantly tracks which
blocks need to be replicated and initiates replication whenever necessary.
The necessity for re-replication may arise due to many reasons: a DataNode
may become unavailable, a replica may become corrupted, a hard disk on
a DataNode may fail, or the replication factor of a file may be increased.
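
The heartbeat-based failure detection described above can be sketched as a simple map from DataNode id to the time of its last heartbeat. The 10-minute staleness window used here is an illustrative assumption (the actual timeout is configurable), and the class is a conceptual sketch rather than NameNode code.

import java.util.HashMap;
import java.util.Map;

// Minimal sketch of heartbeat-based failure detection.
public class HeartbeatMonitorSketch {
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();
    private static final long STALE_AFTER_MILLIS = 10 * 60 * 1000L; // assumed window

    // Called whenever a DataNode's heartbeat arrives.
    void recordHeartbeat(String dataNodeId, long nowMillis) {
        lastHeartbeatMillis.put(dataNodeId, nowMillis);
    }

    // A DataNode with no recent heartbeat is marked dead and receives no new IO.
    boolean isDead(String dataNodeId, long nowMillis) {
        Long last = lastHeartbeatMillis.get(dataNodeId);
        return last == null || nowMillis - last > STALE_AFTER_MILLIS;
    }
}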

 Cluster Rebalancing

The HDFS architecture is compatible with data rebalancing schemes. A scheme might automatically move data from one DataNode to another if
the free space on a DataNode falls below a certain threshold. In the event
of a sudden high demand for a particular file, a scheme might dynamically
create additional replicas and rebalance other data in the cluster.
These types of data rebalancing schemes are not yet implemented.

o Data Integrity

It is possible that a block of data fetched from a DataNode arrives corrupted. This corruption can occur because of faults in a storage device,
network faults, or buggy software. The HDFS client software implements
checksum checking on the contents of HDFS files. When a client creates
an HDFS file, it computes a checksum of each block of the file and stores
these checksums in a separate hidden file in the same HDFS namespace.
When a client retrieves file contents it verifies that the data it received from
each DataNode matches the checksum stored in the associated checksum
file. If not, then the client can opt to retrieve that block from another
DataNode that has a replica of that block.
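
A bare-bones illustration of the verify-on-read idea is shown below using a CRC32 checksum per block. Real HDFS checksumming is finer grained (it checksums fixed-size chunks of a block rather than the whole block), so this is only a conceptual sketch.

import java.util.zip.CRC32;

// Conceptual sketch of client-side checksum verification as described above.
public class BlockChecksumSketch {

    static long checksumOf(byte[] blockData) {
        CRC32 crc = new CRC32();
        crc.update(blockData, 0, blockData.length);
        return crc.getValue();
    }

    // Returns true if the bytes received from a DataNode match the checksum
    // recorded when the block was written; otherwise the client would try
    // another replica.
    static boolean verify(byte[] receivedBlock, long storedChecksum) {
        return checksumOf(receivedBlock) == storedChecksum;
    }
}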

o Metadata Disk Failure

The FsImage and the EditLog are central data structures of HDFS. A
corruption of these files can cause the HDFS instance to be non-functional.
For this reason, the NameNode can be configured to support maintaining
multiple copies of the FsImage and EditLog. Any update to either the
FsImage or EditLog causes each of the FsImages and EditLogs to get
updated synchronously. This synchronous updating of multiple copies of
the FsImage and EditLog may degrade the rate of namespace transactions
per second that a NameNode can support. However, this degradation is
acceptable because even though HDFS applications are very data intensive
in nature, they are not metadata intensive. When a NameNode restarts, it
selects the latest consistent FsImage and EditLog to use.

The NameNode machine is a single point of failure for an HDFS cluster. If the NameNode machine fails, manual intervention is necessary.
Currently, automatic restart and failover of the NameNode software to
another machine is not supported.

o Snapshots

Snapshots support storing a copy of data at a particular instant of time. One usage of the snapshot feature may be to roll back a corrupted HDFS
instance to a previously known good point in time. HDFS does not
currently support snapshots but will in a future release.

 Data Organization
o Data Blocks

HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These
applications write their data only once but they read it one or more times
and require these reads to be satisfied at streaming speeds. HDFS supports
write-once-read-many semantics on files. A typical block size used by
HDFS is 64 MB. Thus, an HDFS file is chopped up into 64 MB chunks,
and if possible, each chunk will reside on a different DataNode.
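
To make the 64 MB figure concrete, the snippet below computes how many blocks a file occupies: a 1 GB file maps to 16 blocks and a 1 TB file to 16,384 blocks, with the last block of a file possibly smaller than the block size. (Newer Hadoop releases default to a 128 MB block size; the class name here is illustrative.)

// Quick block-count illustration for the 64 MB block size mentioned above.
public class BlockCountSketch {
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;                                     // 64 MB
        System.out.println(blockCount(1024L * 1024 * 1024, blockSize));         // 16
        System.out.println(blockCount(1024L * 1024 * 1024 * 1024, blockSize));  // 16384
    }
}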

o Staging

A client request to create a file does not reach the NameNode immediately.
In fact, initially the HDFS client caches the file data into a temporary local
file. Application writes are transparently redirected to this temporary local
file. When the local file accumulates data worth over one HDFS block size,
the client contacts the NameNode. The NameNode inserts the file name
into the file system hierarchy and allocates a data block for it. The
NameNode responds to the client request with the identity of the DataNode
and the destination data block. Then the client flushes the block of data
from the local temporary file to the specified DataNode. When a file is
closed, the remaining un-flushed data in the temporary local file is
transferred to the DataNode. The client then tells the NameNode that the
file is closed. At this point, the NameNode commits the file creation
operation into a persistent store. If the NameNode dies before the file is
closed, the file is lost.
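
From the application's point of view this staging is transparent: a client simply opens an output stream and writes. The sketch below uses the standard Hadoop FileSystem API (Configuration, FileSystem.get, create); the path /user/demo/example.txt is a hypothetical example, and cluster settings are assumed to come from the usual core-site.xml/hdfs-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

// Hedged example: create and write an HDFS file through the Java API.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up the cluster's site files
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/example.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(file)) {
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // closing the stream triggers the final flush and NameNode commit described above
        fs.close();
    }
}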

The above approach has been adopted after careful consideration of target
applications that run on HDFS. These applications need streaming writes
to files. If a client writes to a remote file directly without any client side
buffering, the network speed and the congestion in the network impacts
throughput considerably. This approach is not without precedent. Earlier
distributed file systems, e.g. AFS, have used client side caching to improve
performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.

 Replication Pipelining

When a client is writing data to an HDFS file, its data is first written to a
local file as explained in the previous section. Suppose the HDFS file has
a replication factor of three.

When the local file accumulates a full block of user data, the client retrieves a list of DataNodes from the NameNode. This list contains the DataNodes that will
host a replica of that block. The client then flushes the data block to the
first DataNode. The first DataNode starts receiving the data in small
portions (4 KB), writes each portion to its local repository and transfers
that portion to the second DataNode in the list. The second DataNode, in
turn starts receiving each portion of the data block, writes that portion to
its repository and then flushes that portion to the third DataNode. Finally,
the third DataNode writes the data to its local repository. Thus, a DataNode
can be receiving data from the previous one in the pipeline and at the same
time forwarding data to the next one in the pipeline. Thus, the data is
pipelined from one DataNode to the next.
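
The per-node behaviour in this pipeline can be sketched as "read a small portion, write it locally, forward it downstream". The streams below are hypothetical stand-ins for the DataNode's local storage and its connection to the next node; the 4 KB portion size follows the description above.

import java.io.InputStream;
import java.io.OutputStream;

// Conceptual sketch of pipelined forwarding between DataNodes.
public class PipelineForwardSketch {
    static void relay(InputStream fromPrevious,
                      OutputStream localRepository,
                      OutputStream toNext) throws Exception {
        byte[] portion = new byte[4 * 1024];
        int n;
        while ((n = fromPrevious.read(portion)) != -1) {
            localRepository.write(portion, 0, n); // persist the portion locally
            if (toNext != null) {
                toNext.write(portion, 0, n);      // forward it downstream at the same time
            }
        }
    }
}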

 Accessibility

HDFS can be accessed from applications in many different ways. Natively, HDFS provides a Java API for applications to use. A C language wrapper
for this Java API is also available. In addition, an HTTP browser can also
be used to browse the files of an HDFS instance. Work is in progress to
expose HDFS through the WebDAV protocol.
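
As a small example of the native Java API mentioned above, the following sketch opens a file and streams its contents to standard output using FileSystem.open and IOUtils.copyBytes. The path is hypothetical, and configuration is assumed to be picked up from the cluster's site files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Hedged example: read an HDFS file through the Java API.
public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/example.txt"); // hypothetical path
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false); // stream contents to stdout
        }
        fs.close();
    }
}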

 FS Shell

HDFS allows user data to be organized in the form of files and directories.
It provides a command line interface called FS shell that lets a user interact
with the data in HDFS. The syntax of this command set is similar to other
shells (e.g. bash, csh) that users are already familiar with. Here are some
sample action/command pairs:

Action: Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir

Action: Remove a directory named /foodir
Command: bin/hadoop dfs -rmr /foodir

Action: View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt

FS shell is targeted for applications that need a scripting language to interact with the stored data.

 DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster.
These are commands that are used only by an HDFS administrator. Here
are some sample action/command pairs:

Action: Put the cluster in Safemode
Command: bin/hadoop dfsadmin -safemode enter

Action: Generate a list of DataNodes
Command: bin/hadoop dfsadmin -report

Action: Recommission or decommission DataNode(s)
Command: bin/hadoop dfsadmin -refreshNodes

 Browser Interface

A typical HDFS install configures a web server to expose the HDFS namespace through a configurable TCP port. This allows a user to navigate
the HDFS namespace and view the contents of its files using a web
browser.

o File Deletes and Undeletes

When a file is deleted by a user or an application, it is not immediately removed from HDFS.

Instead, HDFS first renames it to a file in the /trash directory. The file can
be restored quickly as long as it remains in /trash. A file remains in /trash
for a configurable amount of time. After the expiry of its life in /trash, the
NameNode deletes the file from the HDFS namespace.

The deletion of a file causes the blocks associated with the file to be freed.
Note that there could be an appreciable time delay between the time a file
is deleted by a user and the time of the corresponding increase in free space
in HDFS.

A user can Undelete a file after deleting it as long as it remains in the /trash
directory.

If a user wants to undelete a file that he/she has deleted, he/she can navigate
the /trash directory and retrieve the file. The /trash directory contains only
the latest copy of the file that was deleted. The /trash directory is just like
any other directory with one special feature: HDFS applies specified
policies to automatically delete files from this directory. The current
default policy is to delete files from /trash that are more than 6 hours old.
In the future, this policy will be configurable through a well-defined
interface.

o Decrease Replication Factor

When the replication factor of a file is reduced, the NameNode selects excess replicas that can be deleted. The next Heartbeat transfers this information to the DataNode. The DataNode then removes the corresponding blocks and the corresponding free space appears in the cluster. Once again, there might be a time delay
between the completion of the setReplication API call and the appearance
of free space in the cluster.
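
Changing the replication factor from a client is a single call on the FileSystem API, as in the hedged sketch below; the path and the target factor of 2 are illustrative. The call only records the new target, and, as explained above, the excess replicas disappear lazily.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged example: lower the replication factor of an existing HDFS file.
public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/example.txt"); // hypothetical path
        // Excess replicas are removed lazily by the NameNode/DataNodes.
        boolean accepted = fs.setReplication(file, (short) 2);
        System.out.println("replication change accepted: " + accepted);
        fs.close();
    }
}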
3.4 APPLICATIONS
1. Banking and Securities
Industry-Specific big data challenges
A study of 16 projects in 10 top investment and retail banks shows that the challenges in this industry include securities fraud early warning, tick analytics, card fraud detection, archival of audit trails, enterprise credit risk reporting, trade visibility, customer data transformation, social analytics for trading, IT operations analytics, and IT policy compliance analytics, among others.
Applications of big data in the banking and securities industry
The Securities Exchange Commission (SEC) is using big data to monitor
financial market activity. They are currently using network analytics and
natural language processors to catch illegal trading activity in the financial
markets.
Retail traders, big banks, hedge funds and other so-called ‘big boys’ in the financial markets use big data for trade analytics in high-frequency trading, pre-trade decision-support analytics, sentiment measurement, predictive analytics and more.
This industry also heavily relies on big data for risk analytics, including anti-money laundering, demand enterprise risk management, "Know Your Customer", and fraud mitigation.
Big Data providers specific to this industry include: 1010data, Panopticon
Software, Streambase Systems, Nice Actimize and Quartet FS
2. Communications, Media and Entertainment
Industry-Specific big data challenges
Since consumers expect rich media on demand in different formats and on a variety of devices, some big data challenges in the communications, media and entertainment industry include:
- Collecting, analyzing, and utilizing consumer insights
- Leveraging mobile and social media content
- Understanding patterns of real-time media content usage
Applications of big data in the Communications, media and entertainment
industry
Organizations in this industry simultaneously analyze customer data along
with behavioral data to create detailed customer profiles that can be used
to:
- Create content for different target audiences
- Recommend content on demand
- Measure content performance
A case in point is the Wimbledon Championships (YouTube Video) that
leverages big data to deliver detailed sentiment analysis on the tennis
matches to TV, mobile, and web users in real-time.
Spotify, an on-demand music service, uses Hadoop big data analytics, to
collect data from its millions of users worldwide and then uses the analyzed
data to give informed music recommendations to individual users.
Amazon Prime, which is driven to provide a great customer experience by
offering, video, music and Kindle books in a one-stop shop also heavily
utilizes big data.
Big Data Providers in this industry include: Infochimps, Splunk, Pervasive Software, and Visible Measures
3. Healthcare Providers
The healthcare sector has access to huge amounts of data but has been
plagued by failures in utilizing the data to curb the cost of rising healthcare
and by inefficient systems that stifle faster and better healthcare benefits
across the board.
This is mainly due to the fact that electronic data is unavailable, inadequate,
or unusable. Additionally, the healthcare databases that hold health-related
information have made it difficult to link data that can show patterns useful
in the medical field.
Other challenges related to big data include: the exclusion of patients from
the decision making process, and the use of data from different readily
available sensors.
Applications of big data in the healthcare sector
Some hospitals, like Beth Israel, are using data collected from a cell phone
app, from millions of patients, to allow doctors to use evidence-based
medicine as opposed to administering several medical/lab tests to all
patients who go to the hospital. A battery of tests can be thorough, but it can also be expensive and often ineffective.
Free public health data and Google Maps have been used by the University
of Florida to create visual data that allows for faster identification and
efficient analysis of healthcare information, used in tracking the spread of
chronic disease.
Obamacare has also utilized big data in a variety of ways.
Big Data Providers in this industry include: Recombinant Data, Humedica,
Explorys and Cerner
4. Education
Industry-Specific big data challenges
From a technical point of view, a major challenge in the education industry
is to incorporate big data from different sources and vendors and to utilize
it on platforms that were not designed for the varying data.
From a practical point of view, staff and institutions have to learn the new
data management and analysis tools.
On the technical side, there are challenges to integrate data from different
sources, on different platforms and from different vendors that were not
designed to work with one another.
Applications of big data in Education
Big data is used quite significantly in higher education. For example, the University of Tasmania, an Australian university with over 26,000 students, has deployed a Learning and Management System that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, and the overall progress of a student over time.
In another use case of big data in education, it is used to measure teachers' effectiveness to ensure a good experience for both students and teachers. A teacher's performance can be fine-tuned and measured against student numbers, subject matter, student demographics, student aspirations, behavioural classification and several other variables.
On a governmental level, the Office of Educational Technology in the U.S. Department of Education is using big data to develop analytics that help course-correct students who are going astray while taking online big data courses. Click patterns are also being used to detect boredom.

3.5 SCOPE

The demand for analytics skills is going up steadily, but there is a huge deficit on the supply side. This is happening globally and is not restricted to any particular geography. In spite of Big Data Analytics being a ‘hot’ job, a large number of jobs across the globe remain unfilled due to a shortage of the required skills. A McKinsey Global Institute study states that the US will
face a shortage of about 190,000 data scientists and 1.5 million managers
and analysts who can understand and make decisions using Big Data by
2018.
India currently has the highest concentration of analytics professionals globally. In spite of this, the scarcity of data analytics talent is particularly acute, and demand for talent is expected to rise further as more global organizations outsource their analytics work.

According to Srikanth Velamakanni, co-founder and CEO of Fractal Analytics, there are two types of talent deficits: Data Scientists, who can perform analytics, and Analytics Consultants, who can understand and use data. The talent supply for these job titles, especially Data Scientists, is extremely scarce, and the demand is huge.

Strong demand for Data Analytics skills is boosting the wages for qualified
professionals and making Big Data pay big bucks for the right skill. This
phenomenon is being seen globally, with countries like Australia and the U.K. witnessing this ‘Moolah Marathon’.

According to the 2015 Skills and Salary Survey Report published by the
Institute of Analytics Professionals of Australia (IAPA), the annual median
salary for data analysts is $130,000, up four per cent from last year.
Continuing the trend set in 2013 and 2014, the median respondent earns
184% of the Australian full-time median salary. The rising demand for
analytics professionals is also reflected in IAPA’s membership, which has
grown to more than 5000 members in Australia since its formation in 2006.

Randstad states that the annual pay hikes for analytics professionals in India are on average 50% higher than those for other IT professionals. According to The Indian Analytics Industry Salary Trend Report by Great Lakes Institute of Management, average salaries for analytics professionals in India were up by 21% in 2015 compared to 2014. The report also states
that 14% of all analytics professionals get a salary of more than Rs. 15 lakh
per annum.
A look at the salary trend for Big Data Analytics in the UK also indicates strong positive growth. A quick search on Itjobswatch.co.uk
shows a median salary of £62,500 in early 2016 for Big Data Analytics
jobs, as compared to £55,000 in the same period in 2015. Also, a year-on-
year median salary change of +13.63% is observed.
CHAPTER 4

CONCLUSION

Today, huge repositories of structured, semi-structured and unstructured data collected across various digital platforms, social media and blogs or generated
through simulation and modelling are at our disposal. These mass repositories
are beyond the abilities of traditional database methods to analyse and
understand effectively. Commoditization of High Performance Computing and
mass storage in conjunction with cloud computing, open source software and
platform interoperability made it possible to deploy data analytics techniques in
order to cope with data volume, velocity and variety and to provide the insight
needed to really benefit from this data deluge. The value of data at our
fingertips is largely underestimated and unexploited today and in almost every
sector, including science, health, e-commerce, government, energy,
environment, and manufacturing, many applications need to be developed in
order to deliver the promise of Big Data. Our lives will consequently be
changing rapidly and a whole new way of science and business will be added to
existing ones. Correlations and predictions will pave their way into data analysis
next to causation, modelling and theories.

The biggest challenge does not seem to be the technology itself (as this is evolving much more rapidly than humans) but rather how to make
sure we have enough skills to make effective use of the technology at our
disposal and make sense out of the data collected. And before we get to that
stage, we need to resolve many legal issues around intellectual property rights,
data privacy and integrity, cyber security, exploitation liability and Big Data
code of conduct. Like in many other technological areas, customs and ethics
around Big Data possibilities and excesses take time to develop. Promises of
Big Data include innovation, growth and long term sustainability. Threats
include breach of privacy, property rights, data integrity or personal freedom.
So, provided Big Data is exploited in an open and transparent manner, delivery on its promise is not far off.
