Volume: 2 Issue: 11
ISSN: 2321-8169
3692 - 3696
Dhara Kalola
Abstract— In this Internet era, websites are a useful source of information. Because of the growing popularity of the World Wide Web, a website receives thousands to millions of requests per day, so the log files of such websites grow in size day by day. These log files are a useful source of information for identifying user behavior. This paper is an attempt to analyze weblogs using the Hadoop Map-Reduce algorithm. Hadoop is an open-source framework that provides parallel storage and processing of large datasets. This paper makes use of this feature of Hadoop to analyze a large, semi-structured dataset of website logs. The performance of the algorithm is compared on pseudo-distributed and fully distributed Hadoop clusters.

Keywords- Hadoop; Map-Reduce; Weblog Analysis
I. INTRODUCTION

II. HADOOP OVERVIEW
manage the storage attached to the nodes that they run on. When a file is uploaded to HDFS at the Master, it is divided into blocks and distributed among all the DataNodes running on slave machines in the same cluster. The NameNode is responsible for splitting files, replicating blocks for fault tolerance, and maintaining an up-to-date report of each block for integrity. DataNodes send periodic heartbeat messages with their block reports to the NameNode. The Secondary NameNode periodically merges the file-system change logs and performs housekeeping for the NameNode, which allows the NameNode to start faster the next time [2][9].
C. Hadoop Configuration
Hadoop runs in one of three modes [3]:
Standalone: all Hadoop functionality runs in a single Java process.
Pseudo-distributed: all Hadoop daemons run on a single machine, each in its own Java process.
Fully distributed: the Hadoop daemons run across multiple machines in a cluster.
B. Map-Reduce
Hadoop MapReduce is a software framework for writing applications that process large datasets in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. Two Hadoop daemons are associated with MapReduce: the JobTracker and the TaskTracker. There is only one JobTracker, running on the Master. It divides the job into a number of tasks and assigns the tasks to the TaskTrackers running on the slaves in the same cluster. Each TaskTracker is responsible for executing its assigned task and sending the output back to the JobTracker. The JobTracker then combines the results and writes them to HDFS. Thus, a MapReduce job runs on a Hadoop cluster in a completely parallel manner.

The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job. The Map task divides the workload into small units and assigns each to a Mapper, which processes one block of data. The output of each Mapper is a sorted list of <key, value> pairs, which is then passed to the Reducer.
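The Map, sort/shuffle, and Reduce stages described above can be sketched as a minimal local simulation (in the style of Hadoop Streaming, where the mapper and reducer would be separate scripts reading standard input; the function names and log fields here are illustrative, not the paper's code):

```python
# Minimal simulation of the Map -> sort/shuffle -> Reduce flow.
from itertools import groupby
from operator import itemgetter

def mapper(records):
    """Emit a (browser, 1) pair for every log record (fields are illustrative)."""
    for record in records:
        yield (record["browser"], 1)

def reducer(pairs):
    """Sum the counts for each key; pairs must arrive sorted by key."""
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

logs = [
    {"city": "Mumbai", "browser": "Mozilla"},
    {"city": "Mumbai", "browser": "IE"},
    {"city": "Mumbai", "browser": "Safari"},
    {"city": "Mumbai", "browser": "Mozilla"},
]

# The framework sorts mapper output by key before the reduce phase begins.
shuffled = sorted(mapper(logs), key=itemgetter(0))
print(dict(reducer(shuffled)))  # {'IE': 1, 'Mozilla': 2, 'Safari': 1}
```

On a real cluster, the sort and the grouping by key are performed by the framework between the map and reduce phases, not by user code.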
III. EXPERIMENTAL SETUP
website accessed by the user. It provides a raw feed of data created whenever anyone clicks on a URL. The endpoint responds to HTTP requests for any URL and returns a stream of JSON (JavaScript Object Notation) entries, one per line, each representing a real-time click. The files hold a 12.5 GB dataset collected from various periods of the years 2011, 2012, and 2013 [5]. A sample log file is shown in Figure 5.
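Since each line of such a feed is an independent JSON entry, it can be parsed line by line. A minimal sketch (the sample line is fabricated for illustration, and the short field names such as "a" for the user-agent string and "cy" for the city follow the 1.usa.gov feed's conventions, which are assumptions here):

```python
import json

# One fabricated entry in the style of the 1.usa.gov click feed:
# "a" = user agent, "c" = country code, "cy" = city, "u" = clicked URL.
line = '{"a": "Mozilla/5.0", "c": "IN", "cy": "Mumbai", "u": "http://example.gov/"}'

def parse_click(line):
    """Return (city, user_agent) from one JSON log line, or None if malformed."""
    try:
        entry = json.loads(line)
        return entry.get("cy"), entry.get("a")
    except ValueError:  # skip truncated or corrupt lines in the raw feed
        return None

print(parse_click(line))  # ('Mumbai', 'Mozilla/5.0')
```

Skipping malformed lines rather than failing is important for a raw real-time feed, where truncated entries are common.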
(Figure: MapReduce example. The Map phase emits (city, browser) pairs such as (Mumbai, (Mozilla)), (Mumbai, (IE)), and (Mumbai, (Safari)) as (browser, 1) records; Reduce(Key, Value) then aggregates the counts per browser: Mozilla, 2; IE, 1; Safari, 1.)

IV. ANALYSIS
The function reads the files from the input directory in HDFS and applies Map-Reduce chaining according to the filters specified by the user. A sample execution is shown in Figure 6. After execution, the result is stored in a file in the output directory in HDFS, again in the form of <Key, Value> pairs. The function reads this file and displays the result in graphical form, as shown in Figure 7: the number of hits from various types of browsers in the country and city specified by the user.
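Reading the job's output back is straightforward, since Hadoop writes reducer output as tab-separated "key&lt;TAB&gt;value" lines (e.g. in a part-r-00000 file). A minimal sketch of turning such a file into counts that a charting step could consume (the sample lines here mirror the browser-count example, not the paper's actual output):

```python
# Parse Hadoop reducer output ("key<TAB>value" per line) into a dict.
from io import StringIO

# Stand-in for an output file such as part-r-00000 in the HDFS output directory.
part_file = StringIO("Mozilla\t2\nIE\t1\nSafari\t1\n")

def read_counts(lines):
    """Map each output key to its integer count."""
    counts = {}
    for line in lines:
        key, value = line.rstrip("\n").split("\t")
        counts[key] = int(value)
    return counts

hits = read_counts(part_file)
print(hits)  # {'Mozilla': 2, 'IE': 1, 'Safari': 1}
```

The resulting dictionary can be fed directly to any plotting library to produce the kind of per-browser bar chart described above.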
V. PERFORMANCE ANALYSIS

(Figure: "No of Hits" values 58, 18, 10, 1, 1, and 1, plotted against "Time"; the remaining chart labels are not recoverable.)
VI. CONCLUSION AND FUTURE WORK
fails, the Backup Node can take its place, and another node from the cluster will be chosen as the Backup Node.
REFERENCES
[1] http://www.rackspace.com/knowledge_center/article/reading-apache-web-logs
[2]
[3] http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
[4] Welcome to Apache Hadoop, website: http://hadoop.apache.org/
[5] http://www.usa.gov/About/developer-resources/1usagov.shtml
[6]
[7]
[8] http://ebiquity.umbc.edu/Tutorials/Hadoop/00%20%20Intro.html
[9] http://www.fromdev.com/2010/12/interview-questions-hadoop-mapreduce.html
[10] http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
[11] Understanding Big Data: Analytics for Enterprise Class Hadoop