
Social Data Analysis Using

Big Data And Hadoop


THE WORLD OF DATA

Source: http://www.cision.com/us/2012/10/big-data-and-big-analytics/
DATA VS BIG DATA
Big data is just data with:
 More volume
 Faster data generation (velocity)
 Multiple data formats (variety)

World's data volume is projected to grow 40% per year, and 50-fold by 2020 [1]

Data comes from a wide range of human & machine activity

[1] http://e27.co/worlds-data-volume-to-grow-40-per-year-50-times-by-2020-aureus-20150115-2/
CHALLENGES
 More data = more storage space
 More storage = more money to spend (RDBMS servers need very costly storage)
 Data arrives faster
  We must speed up data processing or we'll build up a backlog
 Need to handle various data structures
  How do we put JSON data into a standard RDBMS?
  Hey, we also have XML from other sources
  Another system gives us compressed data in gzip format
 Agile business requirements
  In the initial discussion they only needed 10 fields; now they ask for 25. Can we do that? We only put those 10 in our database
  Our standard ETL process can't handle this
STORAGE COST
In terms of storage cost, Hadoop is cheaper than a standard RDBMS.
Hadoop provides highly scalable storage and processing at a fraction of
the EDW (enterprise data warehouse) cost.
HADOOP
HADOOP is a framework that allows us to store and process large
datasets in a parallel and distributed fashion

[Diagram: the two halves of Hadoop, storage and processing]
HADOOP ECOSYSTEM
HDFS
HDFS DATA BLOCK
Each file is stored on HDFS as blocks. The default size of each block is
128 MB
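As a quick illustration of block-based storage, a sketch of the block count for a given file size (128 MB is the HDFS default; real clusters can configure it via `dfs.blocksize`):

```python
# Sketch: how many 128 MB blocks HDFS needs for a file of a given size.
# 128 MB is the default block size; clusters may configure a different value.
import math

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB in bytes

def num_blocks(file_size_bytes: int) -> int:
    """A file occupies ceil(size / block_size) blocks; the last block may be partial."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

print(num_blocks(1024 * 1024 * 1024))  # 1 GB file -> 8 blocks
print(num_blocks(200 * 1024 * 1024))   # 200 MB file -> 2 blocks
```

Note that a partially filled last block only consumes its actual size on disk, not the full 128 MB.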
YARN
The YARN framework handles cluster resource management and job
scheduling, which is what lets different tools such as Spark, Hive, and Pig
run on Hadoop.
MAP REDUCE APPROACH
 Processes data in parallel using a distributed algorithm on a cluster
 The Map procedure performs filtering and sorting of the data locally
 The Reduce procedure performs a summary operation (count, sum,
average, etc.)
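The division of labor above can be sketched in plain Python. This is a local, single-process illustration of the model, not the Hadoop API: map emits key/value pairs, a shuffle groups them by key, and reduce summarizes each group.

```python
# Minimal word-count sketch of the MapReduce model (single process, illustrative only).
from collections import defaultdict

def map_phase(line):
    # Map: emit (word, 1) for every word in the input line.
    for word in line.lower().split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reduce: summarize all values emitted for one key (here, a sum).
    return (word, sum(counts))

lines = ["big data and hadoop", "big data analytics"]

# Shuffle/sort: group all emitted values by key, as Hadoop does between the phases.
grouped = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        grouped[word].append(count)

result = dict(reduce_phase(w, c) for w, c in grouped.items())
print(result)  # {'big': 2, 'data': 2, 'and': 1, 'hadoop': 1, 'analytics': 1}
```

On a real cluster each map runs next to its data block and only the grouped pairs travel over the network to the reducers.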
HADOOP vs UNSTRUCTURED
DATA
 Hadoop has HDFS (Hadoop Distributed File System)
 It is just a file system, so all you need to do is drop your files there 
 Schema-on-read concept
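Schema-on-read means the raw file lands unchanged and structure is imposed only when querying. A small Python sketch of the idea (field names and records are invented for illustration):

```python
# Schema-on-read sketch: raw JSON lines are stored as-is; we decide which
# fields to extract only at query time. All records here are fabricated.
import json

# Raw lines dropped into storage unchanged -- no schema enforced at load time.
raw_lines = [
    '{"user": "alice", "text": "loving hadoop", "retweets": 3}',
    '{"user": "bob", "text": "big data!", "retweets": 1, "lang": "en"}',  # extra field is fine
]

def read_with_schema(lines, fields):
    # The "schema" is applied while reading: pick the fields we care about now.
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

rows = list(read_with_schema(raw_lines, ["user", "retweets"]))
print(rows)  # [('alice', 3), ('bob', 1)]
```

If the business later asks for more fields, we simply read them out of the same raw files; nothing had to be re-loaded.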

APPROACH

[Diagram: schema-on-read vs schema-on-write. In Hadoop, the user loads the source data first and applies the schema at read time; in an RDBMS, the database table metadata (schema) is applied while loading, before the application (BI tools) reads the data.]
HIVE
 The Apache Hive™ data warehouse software facilitates querying and
managing large datasets residing in distributed storage
 With Hive you can define a schema for the data in HDFS
 Hive provides many libraries that enable you to read various data types
such as XML, JSON, or even compressed formats
 You can create your own data parser in Java
 Hive supports an SQL dialect for reading your data
 Hive converts your SQL into Java MapReduce code and runs it on the
cluster
ANALYTICS
ANALYTICS IS IN YOUR BLOOD
 Do you realize that you do analytics every day?
 I need to get to campus faster!
 Hmm... looking at the sky today, I think it'll rain
 Based on my midterm and assignment scores, I need to get at least 80
on my final exam to pass this course
 I stalked her social media. I think she is single because most of her
posts are only about food :p
PREDICTIVE ANALYTICS
There are 2 types of predictive analytics:
◦ Supervised
Supervised analytics is when we know the ground truth about the past
Example:
We have historical weather data: the temperature, humidity, cloud density, and
weather type (rain, cloudy, or sunny). We can then predict today's weather
from today's temperature, humidity, and cloud density
Machine learning methods used: regression, decision trees, SVM, ANN, etc.
◦ Unsupervised
Unsupervised analytics is when we don't know the ground truth about the past.
The result is a set of segments that we need to interpret
Example:
We want to segment students based on their historical exam
scores, attendance, and lateness history
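The supervised weather example above can be sketched with a toy nearest-neighbour classifier. All numbers and labels below are invented for illustration; real work would use a library such as scikit-learn and far more data.

```python
# Toy supervised prediction: classify today's weather from labelled historical
# observations using 1-nearest-neighbour. All values are fabricated examples.
import math

# Historical data: (temperature C, humidity %, cloud density 0-10) -> weather label
history = [
    ((31, 60, 2), "sunny"),
    ((27, 75, 6), "cloudy"),
    ((24, 90, 9), "rain"),
    ((25, 88, 8), "rain"),
    ((30, 65, 3), "sunny"),
]

def predict(today):
    # Return the label of the closest historical observation (Euclidean distance).
    _, label = min(history, key=lambda item: math.dist(item[0], today))
    return label

print(predict((24, 92, 9)))  # humid and overcast -> rain
print(predict((32, 58, 1)))  # hot and clear -> sunny
```

An unsupervised method would instead be given the observations without the weather labels and would return groups of similar days for a human to interpret.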
ANALYTICS TOOLS
Microsoft Excel. A very powerful tool for statistical data manipulation, pivoting, and even
simple prediction
SQL is just the language. Is your data lying in a database? SQL will help you filter, aggregate, and
extract it
RapidMiner provides a built-in RDBMS connector, parsers for common data formats (CSV, XML), data
manipulation, and many machine learning algorithms. We can also create our own libraries. The latest
version of RapidMiner can connect to Hadoop and do more complex analysis such as text mining.
A free version is available (community edition)
KNIME. Known as a powerful tool for predictive analytics. Its overall functionality is similar to
RapidMiner's. The latest version of KNIME can connect to Hadoop and do more complex analysis
such as text mining. A free version is available
Tableau is one of the best-known tools for building visualizations on top of data. Tableau is also
powerful for creating interactive dashboards. A free version is available with some limitations
QlikView. Similar to Tableau, QlikView is designed to let data analysts develop a
dashboard or a simple visualization on top of the data. A free version is available
Twitter Sentiment
Analysis Using
Hadoop
Ecosystem
Problem Statement
 Social media has gained immense popularity, and Twitter is an
effective tool for a company to get people excited about its
products. Twitter makes it easy to engage users and communicate
directly with them, and in turn users can provide word-of-mouth
marketing for companies by discussing the products.

 As an analyst, you have been tasked with finding the popularity of big
data technologies and the organizations related to them.
Solution
STEPS
1. Create an application at https://dev.twitter.com/apps/ and then
generate the corresponding keys.

2. Install Flume:
$ sudo yum install flume-ng

3. Configure Flume:
/usr/lib/flume-ng/conf/flume.conf should define all the components (the
Twitter source, memory channel, and HDFS sink) as below.

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sources.Twitter.consumerKey = q4QxzGfR2syqqgPbMt7FA
TwitterAgent.sources.Twitter.consumerSecret = iZW3eeNS8jY8rXtJAhULnNEQRvF5gezfmEeaeIFIw
TwitterAgent.sources.Twitter.accessToken = 24530524-Lv9OH4Cg58pN2a4yO4hFGTr1CDfSvE986v4qY0h4
TwitterAgent.sources.Twitter.accessTokenSecret = WQggKkDWIJ5pyR46dclpmtCX8zpU8o1wUccRweu2d4
TwitterAgent.sources.Twitter.keywords = hadoop, big data

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 1000
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
TwitterAgent.sinks.HDFS.hdfs.rollCount = 1000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 1000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
4. Start Flume using the command below:

> flume-ng agent --conf /usr/lib/flume-ng/conf/ -f /usr/lib/flume-ng/conf/flume.conf -Dflume.root.logger=DEBUG,console -n TwitterAgent

5. Install Hive.

6. Download "hive-serdes-1.0-SNAPSHOT.jar" to the lib directory in Hive. Twitter
returns tweets in JSON format, and this library will help Hive understand the
JSON format.

7. Start the Hive shell using the hive command and register the hive-serdes-1.0-
SNAPSHOT.jar file downloaded earlier:

hive> ADD JAR /usr/lib/hive/lib/hive-serdes-1.0-SNAPSHOT.jar;


8. Now, create the tweets table in Hive:

> CREATE TABLE tweets (
    id BIGINT,
    created_at STRING,
    source STRING,
    favorited BOOLEAN,
    retweet_count INT,
    retweeted_status STRUCT<text:STRING, retweet_count:INT, user:STRUCT>,
    entities STRUCT<urls:ARRAY<STRUCT>, user_mentions:ARRAY<STRUCT>, hashtags:ARRAY<STRUCT>>,
    text STRING,
    user STRUCT<screen_name:STRING, name:STRING, friends_count:INT,
        followers_count:INT, statuses_count:INT, verified:BOOLEAN,
        utc_offset:INT, time_zone:STRING>,
    in_reply_to_screen_name STRING
  )
  ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/tweets';
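Conceptually, the JSON SerDe deserializes each raw tweet line into the named, typed columns of the table, including nested struct fields such as user.screen_name. A Python sketch of that mapping (the tweet record below is fabricated and far smaller than real Twitter JSON):

```python
# Sketch of what a JSON SerDe does: turn one raw JSON line into typed columns.
# The tweet record is a fabricated, minimal example for illustration only.
import json

raw_tweet = json.dumps({
    "id": 1,
    "text": "learning hadoop",
    "user": {"screen_name": "alice", "followers_count": 42},
})

# Deserialize the line, then project it onto the columns a query asks for,
# reaching into the nested "user" struct the same way user.screen_name does.
record = json.loads(raw_tweet)
row = (
    record["id"],
    record["text"],
    record["user"]["screen_name"],
    record["user"]["followers_count"],
)
print(row)  # (1, 'learning hadoop', 'alice', 42)
```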
9. Now that we have the data in HDFS and the table created in Hive, let's run
some queries in Hive.

To find which user has the most followers, the query below helps:

> SELECT user.screen_name, user.followers_count c FROM tweets ORDER BY c DESC;

teddy777 27906
s_m_angelique 7678
NatureGeosci 7150
GlobalSupplyCG 6755
HadoopNews 3904
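The deck's title promises sentiment analysis, and the natural next step after queries like the one above is to score the tweet text itself. A minimal word-list scorer sketches the idea (the lexicon and example sentences are invented; real sentiment analysis would use a trained model or a curated lexicon):

```python
# Minimal word-list sentiment sketch. The tiny lexicon and the example
# sentences are invented for illustration only.
POSITIVE = {"love", "great", "awesome", "excited"}
NEGATIVE = {"hate", "slow", "broken", "bad"}

def sentiment(text):
    # Count positive vs negative words and return an overall label.
    words = set(text.lower().split())
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love hadoop, great ecosystem"))   # positive
print(sentiment("the cluster is slow and broken"))   # negative
```

In the Hive pipeline above, the same idea could run as a query against the text column, for example via a UDF.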
WORLD TRENDS
2018 HYPE CYCLE
Big-data-related topics at the top of the hype curve:
• Advanced analytics
• IoT
• Machine learning
WHERE TO START
LET'S GET OUR HANDS DIRTY
SKILLS NEEDED

DOMAIN KNOWLEDGE
SKILLS NEEDED
 Business Acumen
In data science, being able to discern which problems are important
for the business to solve is critical, in addition to identifying
new ways the business should be leveraging its data.
 Python, Scala, and SQL
SQL skills are a must! Python and Scala have also become common languages
for data processing, along with Java, Perl, and C/C++.
 Hadoop Platform
It is heavily preferred in many cases. Having experience with Hive or Pig is
also a strong selling point. Familiarity with cloud tools such as Amazon S3
can also be beneficial.
 SAS, R, or other predictive analytics tools
In-depth knowledge of at least one of these analytical tools; for data
science, R is generally preferred. Along with this, statistical knowledge is
also important.
SKILLS NEEDED
 Intellectual curiosity
Curiosity to dig deeper into the data and to solve a problem by finding
its root cause.
 Communication & Presentation
Companies searching for a strong data scientist are looking for
someone who can clearly and fluently translate their technical findings
to a non-technical team. A data scientist must enable the business to
make decisions by arming it with quantified insights.

Summarized from http://www.kdnuggets.com/2014/11/9-must-have-skills-data-scientist.html


[BIG] DATA SOURCES
 Social media platforms. Most social media platforms provide an API to
fetch data from them. Twitter and Facebook are the most common
examples
 KDNuggets (http://www.kdnuggets.com/datasets/index.html)
 Kaggle (https://www.kaggle.com/)
 Portal Data Indonesia (http://data.go.id/)
 Your WhatsApp group conversation
ONLINE TUTORIAL
 Coursera (https://www.coursera.org/)
 DataQuest (https://www.dataquest.io/)
 Udacity (https://www.udacity.com/)
 TutorialsPoint (http://www.tutorialspoint.com/)
 Youtube, RapidMiner Channel
(https://www.youtube.com/user/RapidIVideos)
 Youtube KNIME TV (https://www.youtube.com/user/KNIMETV)
 Cloudera Quickstart VM (http://www.cloudera.com/content/www/en-
us/documentation/enterprise/latest/topics/cloudera_quickstart_vm.html)
 Hortonworks Sandbox VM
(http://hortonworks.com/products/hortonworks-sandbox/)
 Apache Spark Page (https://spark.apache.org/examples.html)
PREPARE
YOURSELF
TO SURF THE DATA ERA!
