1.5 INTRODUCTION TO PROJECT
The term Big Data refers to all the data that is being generated across
the globe at an unprecedented rate. This data could be either structured
or unstructured. Today’s business enterprises owe a huge part of their
success to an economy that is firmly knowledge-oriented. Data drives the modern organizations of the world, so making sense of this data, uncovering its patterns and revealing the unseen connections within this vast sea of information is a critical and hugely rewarding endeavour. There is a need to convert Big Data into
Business Intelligence that enterprises can readily deploy. Better data
leads to better decision making and an improved way to strategize for
organizations regardless of their size, geography, market share,
customer segmentation and such other categorizations. Hadoop is the
platform of choice for working with extremely large volumes of data.
The most successful enterprises of tomorrow will be the ones that can
make sense of all that data at extremely high volumes and speeds in
order to capture newer markets and a wider customer base.
Big Data has certain characteristics and hence is often defined using the 4Vs, namely:
Volume: the amount of data that businesses can collect is really
enormous and hence the volume of the data becomes a critical factor in
Big Data analytics.
Velocity: the rate at which new data is being generated, thanks to our dependence on the internet, sensors and machine-to-machine data, is also important, since Big Data must be parsed in a timely manner.
Variety: the data that is generated is completely heterogeneous in the sense that it could be in various formats like video, text, database, numeric, sensor data and so on, and hence understanding the type of Big Data is a key factor in unlocking its value.
Veracity: the trustworthiness and quality of the data; since big data is collected from many different sources, its accuracy and consistency must be assessed before it can be relied upon for analysis.
1.6 TOOLS AND TECHNOLOGY USED
Technology selection is just part of the process when implementing big
data projects. Experienced users say it's crucial to evaluate the potential
business value that big data software can offer and to keep long-term
objectives in mind as one moves forward.
Hadoop is a popular tool for organizing the racks and racks of servers,
and NoSQL databases are popular tools for storing data on these racks.
These mechanisms can be much more powerful than the old single machine, but they are far from being as polished as the old database servers. Although SQL may be complicated, writing a JOIN query against a SQL database was often much simpler than gathering information from dozens of machines and compiling it into one coherent answer.
Hadoop jobs are written in Java, and that requires another level of
sophistication. The tools for tackling big data are just beginning to
package this distributed computing power in a way that's a bit easier to
use. Many of the big data tools are also working with NoSQL data stores.
These are more flexible than traditional relational databases, but the
flexibility isn't as much of a departure from the past as Hadoop. NoSQL
queries can be simpler because the database design discourages the
complicated tabular structure that drives the complexity of working with
SQL. The main worry is that software needs to anticipate the possibility
that not every row will have some data for every column.
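To give a concrete sense of the extra level of sophistication that writing Hadoop jobs in Java involves, the following is a minimal, illustrative sketch of the mapper and reducer of the classic word-count job written against the Hadoop MapReduce Java API. The class names are our own and the sketch is not taken from any particular product discussed here; in practice each class would live in its own source file.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in an input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Even this simplest of jobs requires two cooperating classes and familiarity with Hadoop's Writable types, which is considerably more ceremony than a single SQL query.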
The Jaspersoft package is one of the open source leaders for producing
reports from database columns. The software is well polished and already installed in many businesses, turning SQL tables into PDFs that everyone can scrutinize at meetings. Pentaho is another software platform that began as a report-generating engine; like Jaspersoft, it is branching into big data by making it easier to absorb information from the new sources.
the new sources. You can hook up Pentaho's tool to many of the most
popular NoSQL databases such as MongoDB and Cassandra. Once the
databases are connected, you can drag and drop the columns into views
and reports as if the information came from SQL databases.
1.7 RESULTS AND DISCUSSIONS
The problem of handling a vast quantity of data that the system is unable to process is not a brand-new research issue; in fact, it appeared in several early studies [2, 21, 72], e.g., marketing analysis, network flow monitoring, gene expression analysis, weather forecasting, and even astronomy. This problem still exists in big data analytics today; thus, pre-processing is an important task in making the computer, platform, and analysis algorithm able to handle the input data. The
traditional data pre-processing methods [73] (e.g., compression,
sampling, feature selection, and so on) are expected to be able to operate
effectively in the big data age. However, a portion of the studies still
focus on how to reduce the complexity of the input data because even
the most advanced computer technology cannot efficiently process the
whole input data using a single machine in most cases. Using domain knowledge to design the pre-processing operator is a possible solution for big data. Big data may be created by handheld devices, social networks, the internet of things, multimedia, and many other new applications that all have the characteristics of volume, velocity, and variety. As a result, the whole of data analytics has to be re-examined from the following perspectives:
From the volume perspective, the deluge of input data is the very first
thing that we need to face because it may paralyze the data analytics.
In contrast to traditional data analytics, Baraniuk pointed out that for wireless sensor network data analysis the bottleneck of big data analytics will shift from the sensors to the processing, communication, and storage of the sensed data. This is because sensors can now gather far more data, but uploading such large volumes of data to the upper-layer systems may create bottlenecks everywhere. In addition, from the velocity perspective, real-time or streaming data brings the problem of a large quantity of data arriving at the analytics system within a short duration, which the device and system may not be able to handle. This situation is similar to network flow analysis, where we typically cannot mirror and analyse everything we are able to gather.
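One standard pre-processing response to such overwhelming streams, offered here only as an illustrative sketch rather than as the method of any study cited above, is to keep a fixed-size uniform random sample of the stream. The following Java class implements classic reservoir sampling; the class name and sample size are assumptions of ours.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Reservoir sampling: keeps a uniform random sample of at most k items
// from a stream whose total length is not known in advance.
public class ReservoirSampler<T> {
    private final int k;                 // maximum sample size
    private final List<T> reservoir;
    private final Random random = new Random();
    private long seen = 0;               // number of items observed so far

    public ReservoirSampler(int k) {
        this.k = k;
        this.reservoir = new ArrayList<>(k);
    }

    public void offer(T item) {
        seen++;
        if (reservoir.size() < k) {
            reservoir.add(item);         // fill the reservoir first
        } else {
            // Replace a random slot with probability k / seen.
            long j = (long) (random.nextDouble() * seen);
            if (j < k) {
                reservoir.set((int) j, item);
            }
        }
    }

    public List<T> sample() {
        return new ArrayList<>(reservoir);
    }
}

Because the reservoir never grows beyond k items, downstream analysis can run on modest hardware no matter how fast or for how long the stream keeps arriving.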
As the size of data increases, the demand for Hadoop technology will
rise. There will be a need for more Hadoop developers to deal with the big data challenges.
ANALYSIS
SURVEY ON COMPANIES
HECKYL TECHNOLOGIES
A scalable product with a global appeal that performs the real-time analytics of
structured and unstructured data.
Skill Demands:
Web Engineer
Expertise in areas
C#
ASP.NET MVC
WCF
LINQ
HTML5
jQuery Scripting
Contact Details:
HECKYL TECHNOLOGIES (INDIA)
Unit No. 1002, B - Wing
Supreme Business Park, Hiranandani Gardens
Powai, Mumbai - 400 076
PHONE: +91 2242153561
SIGMOID
Contact Details:
ADDRESS: 501/502 Eco House, Vishveshwar Colony, Off Aarey Road, Goregaon
(E), Mumbai 400 063, India
PHONE: +91-22-4006-4550
AUREUS ANALYTICS
Contact Details:
Development Center
706, Powai Plaza,
Hiranandani Gardens,
Powai, Mumbai 400076
METAOME
Contact Details:
ADDRESS: 147, 5th Cross,
8th Main, 2nd Block Jayanagar,
Bangalore – 560011
PHONE: +91-80-2656-5696
FRROLE
OTHER PRODUCTS
o FRROLE AI PLATFORM
The platform combines the ability to deeply understand social data with the capability to build thousands of data models based on traditional web data sources like Wikipedia, Freebase, Geo Maps, etc. The result is therefore a rich, contextual insight every time, and not just a piece of annotation or metadata.
Skill Demands:
o Principal Architect
Key skills:
o Knowledge of Machine Learning
o Big data technologies
o Algorithms for building large-scale frameworks
o Java
o Cassandra
o Software Architecture
o Spring, Hibernate
o Maven
Contact Details:
Phone:
+91 8041216038
Address:
Second floor 118/1, 80 ft Road, Indira Nagar,
Bangalore – 560075, INDIA
CHAPTER 3
3.1 HISTORY
The genesis of Hadoop was the "Google File System" paper that was published
in October 2003. This paper spawned another one from Google – "MapReduce:
Simplified Data Processing on Large Clusters". Development started on the
Apache Nutch project, but was moved to the new Hadoop subproject in January
2006.[14] Doug Cutting, who was working at Yahoo! at the time, named it after
his son's toy elephant. The initial code that was factored out of Nutch consisted
of about 5,000 lines of code for HDFS and about 6,000 lines of code for
MapReduce.
The first committer to add to the Hadoop project was Owen O’Malley (in March
2006);[16] Hadoop 0.1.0 was released in April 2006.[17] It continues to evolve
through the many contributions that are being made to the project.
The term Hadoop has come to refer not just to the aforementioned base modules
and sub-modules, but also to the ecosystem, or collection of additional software
packages that can be installed on top of or alongside Hadoop, such as Apache
Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache
ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and
Apache Storm.
Apache Hadoop is the most popular and powerful big data tool. Hadoop provides the world's most reliable storage layer – HDFS, a batch processing engine – MapReduce, and a resource management layer – YARN.
Open-source – Apache Hadoop is an open source project. This means its code can be modified according to business requirements.
Fault Tolerance – By default, 3 replicas of each block are stored across the cluster in Hadoop, and this can also be changed as per requirement. So if any node goes down, data on that node can be recovered from other nodes easily. Failures of nodes or tasks are recovered automatically by the framework. This is how Hadoop is fault tolerant.
Reliability – Due to replication of data in the cluster, data is reliably stored on the cluster of machines despite machine failures. If a machine goes down, your data will still be stored reliably.
Scalability – Hadoop is highly scalable in that new hardware can be easily added to the nodes. It also provides horizontal scalability, which means new nodes can be added on the fly without any downtime.
Easy to use – There is no need for the client to deal with distributed computing; the framework takes care of all of it. So it is easy to use.
Data Locality – Hadoop works on the data locality principle, which states that computation should be moved to the data instead of data to the computation. When a client submits a MapReduce algorithm, the algorithm is moved to the data in the cluster rather than bringing the data to the location where the algorithm was submitted and then processing it.
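As an illustrative sketch of what this looks like from the client side, the driver below configures and submits a job and then simply waits; the framework ships the packaged code to the nodes that hold the input blocks. It assumes the hypothetical WordCountMapper and WordCountReducer classes sketched in section 1.6, and the input and output paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // The framework distributes this jar to the DataNodes that hold the
        // input blocks, so the computation moves to the data (data locality).
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations in HDFS, passed on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}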
HDFS was originally built as infrastructure for the Apache Nutch web search engine project. Hadoop is written with large clusters of computers in mind and is built around the following assumptions:
The fact that there are a huge number of components and that each
component has a nontrivial probability of failure means that some
component of HDFS is always non-functional.
Applications that run on HDFS have large data sets. A typical file in
HDFS is gigabytes to terabytes in size. Thus, HDFS is tuned to support
large files. It should provide high aggregate data bandwidth and scale
to hundreds of nodes in a single cluster. It should support tens of
millions of files in a single instance.
The NameNode maintains the file system namespace. Any change to the
file system namespace or its properties is recorded by the NameNode. An
application can specify the number of replicas of a file that should be
maintained by HDFS. The number of copies of a file is called the
replication factor of that file. This information is stored by the NameNode.
Data Replication
HDFS is designed to reliably store very large files across machines in a
large cluster. It stores each file as a sequence of blocks; all blocks in a file
except the last block are the same size. The blocks of a file are replicated
for fault tolerance. The block size and replication factor are configurable
per file. An application can specify the number of replicas of a file. The
replication factor can be specified at file creation time and can be changed
later. Files in HDFS are write-once and have strictly one writer at any time.
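As a small illustrative sketch of this (the path, buffer size and replication values below are assumptions of ours, not taken from the text), an application can pass a replication factor when a file is created and change it later through the Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/data.txt");   // illustrative path

        // Specify a replication factor of 2 at file creation time.
        FSDataOutputStream out = fs.create(file, true, 4096, (short) 2,
                fs.getDefaultBlockSize(file));
        out.writeBytes("example record\n");
        out.close();

        // Change the replication factor of the same file later on.
        fs.setReplication(file, (short) 3);

        fs.close();
    }
}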
o Replica Selection
o Safemode
The NameNode uses a transaction log called the EditLog to persistently record every change to the file system metadata; for example, creating a new file in HDFS causes the NameNode to insert a record into the EditLog.
Robustness
Cluster Rebalancing
o Data Integrity
The FsImage and the EditLog are central data structures of HDFS. A
corruption of these files can cause the HDFS instance to be non-functional.
For this reason, the NameNode can be configured to support maintaining
multiple copies of the FsImage and EditLog. Any update to either the
FsImage or EditLog causes each of the FsImages and EditLogs to get
updated synchronously. This synchronous updating of multiple copies of
the FsImage and EditLog may degrade the rate of namespace transactions
per second that a NameNode can support. However, this degradation is
acceptable because even though HDFS applications are very data intensive
in nature, they are not metadata intensive. When a NameNode restarts, it
selects the latest consistent FsImage and EditLog to use.
o Snapshots
Data Organization
o Data Blocks
o Staging
A client request to create a file does not reach the NameNode immediately.
In fact, initially the HDFS client caches the file data into a temporary local
file. Application writes are transparently redirected to this temporary local
file. When the local file accumulates data worth over one HDFS block size,
the client contacts the NameNode. The NameNode inserts the file name
into the file system hierarchy and allocates a data block for it. The
NameNode responds to the client request with the identity of the DataNode
and the destination data block. Then the client flushes the block of data
from the local temporary file to the specified DataNode. When a file is
closed, the remaining un-flushed data in the temporary local file is
transferred to the DataNode. The client then tells the NameNode that the
file is closed. At this point, the NameNode commits the file creation
operation into a persistent store. If the NameNode dies before the file is
closed, the file is lost.
The above approach has been adopted after careful consideration of target
applications that run on HDFS. These applications need streaming writes
to files. If a client writes to a remote file directly without any client side
buffering, the network speed and the congestion in the network impacts
throughput considerably. This approach is not without precedent. Earlier
distributed file systems, e.g. AFS, have used client side caching to improve
performance. A POSIX requirement has been relaxed to achieve higher performance of data uploads.
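From the application's point of view all of this staging is transparent. The hedged sketch below (path and contents are illustrative) shows that a client simply opens a stream, writes, and closes it; the local buffering, the contact with the NameNode once a block's worth of data accumulates, and the final commit on close all happen inside the HDFS client library.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStagingExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/events.log");  // illustrative path

        // The application sees an ordinary output stream; the HDFS client
        // buffers the writes locally and flushes them block by block.
        FSDataOutputStream out = fs.create(file);
        for (int i = 0; i < 100000; i++) {
            out.writeBytes("event " + i + "\n");
        }

        // Closing the stream transfers any remaining buffered data to the
        // DataNodes, after which the NameNode commits the file creation.
        out.close();
        fs.close();
    }
}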
Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a
local file as explained in the previous section. Suppose the HDFS file has
a replication factor of three.
When the local file accumulates a full block of user data, the client retrieves
a list of
DataNodes from the NameNode. This list contains the DataNodes that will
host a replica of that block. The client then flushes the data block to the
first DataNode. The first DataNode starts receiving the data in small
portions (4 KB), writes each portion to its local repository and transfers
that portion to the second DataNode in the list. The second DataNode, in
turn starts receiving each portion of the data block, writes that portion to
its repository and then flushes that portion to the third DataNode. Finally,
the third DataNode writes the data to its local repository. Thus, a DataNode
can be receiving data from the previous one in the pipeline and at the same
time forwarding data to the next one in the pipeline. Thus, the data is
pipelined from one DataNode to the next.
Accessibility
FS Shell
HDFS allows user data to be organized in the form of files and directories.
It provides a command line interface called FS shell that lets a user interact
with the data in HDFS. The syntax of this command set is similar to other
shells (e.g. bash, csh) that users are already familiar with. Here are some
sample action/command pairs:
Action: Create a directory named /foodir
Command: bin/hadoop dfs -mkdir /foodir
Action: View the contents of a file named /foodir/myfile.txt
Command: bin/hadoop dfs -cat /foodir/myfile.txt
DFSAdmin
The DFSAdmin command set is used for administering an HDFS cluster.
These are commands that are used only by an HDFS administrator. Here
are some sample action/command pairs:
Action: Put the cluster in Safemode
Command: bin/hadoop dfsadmin -safemode enter
Action: Generate a list of DataNodes
Command: bin/hadoop dfsadmin -report
Browser Interface
When a file is deleted by a user or an application, it is not immediately removed from HDFS. Instead, HDFS first renames it to a file in the /trash directory. The file can be restored quickly as long as it remains in /trash. A file remains in /trash for a configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace.
The deletion of a file causes the blocks associated with the file to be freed.
Note that there could be an appreciable time delay between the time a file
is deleted by a user and the time of the corresponding increase in free space
in HDFS.
A user can undelete a file after deleting it as long as it remains in the /trash directory: he or she can navigate to the /trash directory and retrieve the file. The /trash directory contains only the latest copy of the file that was deleted. The /trash directory is just like any other directory, with one special feature: HDFS applies specified policies to automatically delete files from this directory. The current default policy is to delete files from /trash that are more than 6 hours old. In the future, this policy will be configurable through a well-defined interface.
3.4 SCOPE
The demand for Analytics skills is going up steadily, but there is a huge deficit on the supply side. This is happening globally and is not restricted to any particular geography. In spite of Big Data Analytics being a ‘Hot’ job, there are still a large number of unfilled jobs across the globe due to a shortage of the required skills. A McKinsey Global Institute study states that the US will
face a shortage of about 190,000 data scientists and 1.5 million managers
and analysts who can understand and make decisions using Big Data by
2018.
India currently has the highest concentration of analytics professionals globally. In spite of this, the scarcity of data analytics talent is particularly acute, and demand for talent is expected to remain on the higher side as more global organizations outsource their work.
Strong demand for Data Analytics skills is boosting wages for qualified professionals and making Big Data pay big bucks for the right skills. This phenomenon is being seen globally, with countries like Australia and the U.K. witnessing this ‘Moolah Marathon’.
According to the 2015 Skills and Salary Survey Report published by the
Institute of Analytics Professionals of Australia (IAPA), the annual median
salary for data analysts is $130,000, up four per cent from last year.
Continuing the trend set in 2013 and 2014, the median respondent earns
184% of the Australian full-time median salary. The rising demand for
analytics professionals is also reflected in IAPA’s membership, which has
grown to more than 5000 members in Australia since its formation in 2006.
Randstad states that the annual pay hikes for Analytics professionals in India are on average 50% higher than those of other IT professionals. According to The Indian Analytics Industry Salary Trend Report by Great Lakes Institute of Management, the average salaries for analytics professionals in India were up by 21% in 2015 as compared to 2014. The report also states
that 14% of all analytics professionals get a salary of more than Rs. 15 lakh
per annum.
A look at the salary trend for Big Data Analytics in the UK also indicates
a positive and exponential growth. A quick search on Itjobswatch.co.uk
shows a median salary of £62,500 in early 2016 for Big Data Analytics
jobs, as compared to £55,000 in the same period in 2015. Also, a year-on-
year median salary change of +13.63% is observed.
CHAPTER 4
CONCLUSION
The biggest challenge does not seem to be the technology itself – as this
is evolving much more rapidly than humans – but rather how to make
sure we have enough skills to make effective use of the technology at our
disposal and make sense out of the data collected. And before we get to that
stage, we need to resolve many legal issues around intellectual property rights,
data privacy and integrity, cyber security, exploitation liability and Big Data
code of conduct. Like in many other technological areas, customs and ethics
around Big Data possibilities and excesses take time to develop. Promises of
Big Data include innovation, growth and long term sustainability. Threats
include breach of privacy, property rights, data integrity or personal freedom.
So, provided Big Data is exploited in an open and transparent manner, delivering on the promise of Big Data is not far off.