Source: http://www.cision.com/us/2012/10/big-data-and-big-analytics/
DATA VS BIG DATA
Big data is just data with:
More volume
Faster data generation (velocity)
Multiple data formats (variety)
[1] http://e27.co/worlds-data-volume-to-grow-40-per-year-50-times-by-2020-aureus-20150115-2/
CHALLENGES
More data = more storage space
More storage = more money to spend (an RDBMS server needs very costly storage)
Storage Processing
HADOOP ECOSYSTEM
HDFS
HDFS DATA BLOCK
Each file is stored on HDFS as one or more blocks. The default size of each
block is 128 MB
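The block arithmetic can be illustrated with a small sketch. It assumes the 128 MB default stated above and that the last block only takes up the space that remains; the function name split_into_blocks is hypothetical and is not part of any Hadoop API.

```python
BLOCK_SIZE_MB = 128  # default HDFS block size stated above


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The final block holds only what is left of the file.
        sizes.append(remainder)
    return sizes


blocks = split_into_blocks(300)
print(len(blocks), blocks)  # 3 [128, 128, 44]
```

A 300 MB file therefore needs two full blocks plus one partial block.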
YARN
The YARN framework manages cluster resources and schedules jobs, which lets
different tools such as Spark, Hive, and Pig run on Hadoop.
MAP REDUCE APPROACH
Processes data in parallel using a distributed algorithm on a cluster
The Map procedure filters and sorts data locally
The Reduce procedure performs a summary operation (count, sum,
average, etc.)
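The map and reduce steps can be sketched as a word count. This is a minimal single-process illustration, not Hadoop code: the function names map_phase, shuffle, and reduce_phase are hypothetical, and on a real cluster the framework would run the map and reduce calls on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: summarize all values for one key (here, a sum)."""
    return key, sum(values)


def word_count(lines):
    # Each line could be mapped on a different node; here it runs sequentially.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big analytics"])
print(counts)  # {'big': 2, 'data': 1, 'analytics': 1}
```

Swapping sum for another aggregate in reduce_phase gives the other summary operations mentioned above.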
HADOOP vs UNSTRUCTURED DATA
Hadoop has HDFS (Hadoop Distributed File System)
It is just a file system, so all you need to do is drop the file there
Schema on read concept
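The schema-on-read idea can be sketched as follows: the raw text is stored untouched, and column names and types are applied only when the data is read. The sample content, the SCHEMA list, and the function read_with_schema are all hypothetical, chosen only for illustration.

```python
import csv
import io

# Hypothetical raw file content, stored as-is with no schema declared up front.
RAW = "1,alice,34\n2,bob,notanumber\n3,carol,29\n"

# The schema is supplied only at read time: a column name and a parser per column.
SCHEMA = [("id", int), ("name", str), ("age", int)]


def read_with_schema(raw_text, schema):
    """Parse raw comma-separated text into dicts using the given schema."""
    rows = []
    for fields in csv.reader(io.StringIO(raw_text)):
        row = {}
        for (name, parse), value in zip(schema, fields):
            try:
                row[name] = parse(value)
            except ValueError:
                # The value does not fit the schema; this surfaces on read, not on write.
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
print(rows[1])  # {'id': 2, 'name': 'bob', 'age': None}
```

The malformed row was accepted when the file was written and only shows up as a missing value when a schema is applied.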
[Diagrams: Hadoop approach and RDBMS approach]
3. Configure Flume
The /usr/lib/flume-ng/conf/flume.conf file should define all the agents (flume,
memory, and hdfs) as below
5. Install Hive
7. Start the Hive shell using the hive command and register the hive-serdes-1.0-SNAPSHOT.jar file downloaded earlier.
To find which users have the most followers, the query below helps.
teddy777 27906
s_m_angelique 7678
NatureGeosci 7150
GlobalSupplyCG 6755
HadoopNews 3904
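The same kind of ranking can be sketched outside Hive. This is a Python stand-in for illustration, not the Hive query from the original walkthrough. It assumes each tweet is one JSON line with user.screen_name and user.followers_count fields; the sample records and the function top_users_by_followers are hypothetical.

```python
import json

# Hypothetical sample tweets, one JSON document per line.
SAMPLE_TWEETS = [
    '{"user": {"screen_name": "user_a", "followers_count": 120}}',
    '{"user": {"screen_name": "user_b", "followers_count": 300}}',
    '{"user": {"screen_name": "user_a", "followers_count": 150}}',
    '{"text": "a record without user details"}',
]


def top_users_by_followers(json_lines, limit=5):
    """Return (screen_name, followers) pairs, highest follower count first."""
    best = {}
    for line in json_lines:
        user = json.loads(line).get("user") or {}
        name = user.get("screen_name")
        count = user.get("followers_count")
        if name is None or count is None:
            continue  # skip records without user details
        # A user can appear in many tweets; keep the largest count seen.
        best[name] = max(count, best.get(name, 0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:limit]


ranking = top_users_by_followers(SAMPLE_TWEETS)
for name, followers in ranking:
    print(name, followers)
```

The printed lines have the same shape as the result listing above: a screen name followed by a follower count.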
WORLD TRENDS
2018 HYPE CYCLE
Big data related topics at the top of the hype curve:
• Advanced analytics
• IoT
• Machine Learning
WHERE TO START
LET'S GET OUR HANDS DIRTY
SKILLS NEEDED
DOMAIN KNOWLEDGE
SKILLS NEEDED
Business Acumen
In terms of data science, being able to discern which problems are
important to solve for the business is critical, in addition to identifying
new ways the business should be leveraging its data.
Python, Scala, and SQL
SQL skills are a must! Python and Scala have also become common languages
for data processing, along with Java, Perl, or C/C++.
Hadoop Platform
Hadoop experience is heavily preferred in many cases. Having experience with Hive or Pig is
also a strong selling point. Familiarity with cloud tools such as Amazon S3
can also be beneficial.
SAS or R or other predictive analytics tools
In-depth knowledge of at least one of these analytical tools is needed; for data
science, R is generally preferred. Along with this, statistical knowledge is also
important.
Intellectual curiosity
Curiosity to dig deeper into data and solve a problem by finding its
root cause
Communication & Presentation
Companies searching for a strong data scientist are looking for
someone who can clearly and fluently translate their technical findings
to a non-technical team. A data scientist must enable the business to
make decisions by arming it with quantified insights.