
Objectives :-

The objective of this project is to solve the following problem statements from the given data:
Find out the frequency of books published each year. (Hint: Use the BX-Books.csv file for this)
Find out in which year the maximum number of books were published.
Find out how many books were published based on ranking in the year 2002. (Hint: Use BX-Books.csv and BX-Book-Ratings.csv)
Prerequisites:-
CDH3 installed on VMware Player
The datasets are copied to the local download directory through FTP
Hive configurations done, as we are using Hive for this solution
Solution -
1st Problem - To find the frequency of books published each year, we first have to
clean up the input file and process it to set the delimiter (,). We are going to use the
Linux bash sed command
$ sed 's/&amp;/\&/g' BX-Books.csv > BooksCorrected1.csv
$ sed '1d' BooksCorrected1.csv > BooksCorrected2.csv
$ sed 's/";"/","/g' BooksCorrected2.csv > BooksCorrected.csv
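The first command replaces '&amp;' with '&' globally and stores the result in
BooksCorrected1.csv. The ampersand in the replacement must be escaped as \&, because a
bare & in a sed replacement stands for the matched text.
The second command removes the first row of the file, which holds the headers, and stores
the result in BooksCorrected2.csv.
The third command replaces ; with , between the quoted fields in order to set the
delimiter, and stores the result in BooksCorrected.csv, which is the file we are going to use.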
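The effect of the three steps can be checked on a tiny file before running them on the full dataset. The sketch below assumes a made-up two-line sample in the same layout (semicolon-separated, every field double-quoted, header on the first line); the file names and the sample row are hypothetical.

```shell
# Hypothetical sample in the same layout as BX-Books.csv.
workdir=$(mktemp -d)
cat > "$workdir/sample.csv" <<'EOF'
"ISBN";"Book-Title";"Year-Of-Publication"
"0000000001";"Tom &amp; Jerry";"2002"
EOF

# Step 1: turn the HTML entity into a plain ampersand (note the escaped \&).
sed 's/&amp;/\&/g' "$workdir/sample.csv" > "$workdir/step1.csv"
# Step 2: drop the header row.
sed '1d' "$workdir/step1.csv" > "$workdir/step2.csv"
# Step 3: switch the separator between quoted fields from ; to ,
sed 's/";"/","/g' "$workdir/step2.csv" > "$workdir/cleaned.csv"

cat "$workdir/cleaned.csv"
# prints: "0000000001","Tom & Jerry","2002"
```

The double quotes around each value survive the clean-up, which is why they also show up in the Hive output later on.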
Note:- we could use the -i option of the sed command to apply the changes directly to
the file, but I preferred not to, so that the original files are kept safe as a backup if
anything goes wrong.
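For comparison, a minimal sketch of the in-place variant, assuming a sed that accepts a backup suffix attached to -i (GNU sed and BSD sed both do). The demo file is hypothetical; the suffix makes sed keep the untouched original next to the edited file.

```shell
workdir=$(mktemp -d)
printf '%s\n' '"a";"b"' > "$workdir/demo.csv"

# -i.bak edits demo.csv in place and saves the original as demo.csv.bak
sed -i.bak 's/";"/","/g' "$workdir/demo.csv"

cat "$workdir/demo.csv"       # "a","b"
cat "$workdir/demo.csv.bak"   # "a";"b"
```

Writing to separate output files, as done in this solution, gives the same safety and also keeps each intermediate step available for inspection.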
Now our data is ready and clean for input, so let's put it in HDFS
$ hadoop fs -mkdir /user/input
$ hadoop fs -put /home/cloudera/BooksCorrected.csv /user/input
Now create a table with the attributes
$ sudo hive
hive> CREATE TABLE IF NOT EXISTS DataSet
(ISBN STRING,
BookTitle STRING,
BookAuthor STRING,
YearOfPublication STRING,
Publisher STRING,
ImageURLS STRING,
ImageURLM STRING,
ImageURLL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Our table is now ready, so to feed in the data use the following command
hive> LOAD DATA INPATH '/user/input/BooksCorrected.csv' OVERWRITE INTO TABLE DataSet;
Then, to display the required result, we use a select command grouped by year of
publication. Use the following command
hive> select yearofpublication, count(booktitle) from dataset group by yearofpublication;
Output:
"1376"  1
"1378"  1
"1806"  1
"1897"  1
"1900"  3
"1901"  7
"1902"  2
"1904"  1
"1906"  1
"1908"  1
"1909"  2
"1910"  1
"1911"  19
"1914"  1
"1917"  1
"1919"  1
"1920"  33
"1921"  2
"1922"  2
... (further values continue below)
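The grouping can be cross-checked outside Hive with awk on the cleaned file. This is only a sketch on a hypothetical three-row sample, not part of the original solution. It splits on the three-character sequence "," rather than on a bare comma, so a comma inside a title does not shift the columns.

```shell
# Hypothetical rows in the cleaned layout (8 quoted, comma-separated fields).
workdir=$(mktemp -d)
cat > "$workdir/books.csv" <<'EOF'
"0000000001","Title A","Author A","2002","Pub A","s","m","l"
"0000000002","Title B, Part 2","Author B","2002","Pub B","s","m","l"
"0000000003","Title C","Author C","1999","Pub C","s","m","l"
EOF

# Field 4 is the year; count rows per year, then sort for a stable listing.
year_counts=$(awk -F'","' '{ count[$4]++ } END { for (y in count) print y, count[y] }' \
  "$workdir/books.csv" | sort)
echo "$year_counts"
# prints:
# 1999 1
# 2002 2
```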
------------------------------------------------------------
2nd Problem - To find out in which year the maximum number of books were
published, we use the same table and sort it in descending order of the number of books
published. ORDER BY is used here because SORT BY only orders rows within each reducer,
whereas ORDER BY guarantees a total order over the whole result.
hive> select yearofpublication, count(booktitle) as count
from dataset
group by yearofpublication
order by count desc;
Output:
"2002"  17626
"1999"  17430
"2001"  17354
"2000"  17232
"1998"  15765
"1997"  14889
"2003"  14355
"1996"  14030
"1995"  13545
"1994"  11796
"1993"  10602
"1992"  9906
"1991"  9389
"1990"  8661
"1989"  7937
"1988"  7493
"1987"  6529
"1986"  5841
"2004"  5839
"1985"  5343
"1984"  4986
So the maximum number of books was published in the year 2002, i.e. 17626.
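Picking the top year from a list of year and count pairs can also be sketched in the shell. The three pairs below are taken from the output above; the variable names are just for illustration.

```shell
# Year/count pairs copied from the Hive output, in arbitrary order.
year_counts=$(printf '%s\n' '1999 17430' '2002 17626' '2001 17354')

# Sort numerically on the second column, largest first, and keep the first row.
top_year=$(printf '%s\n' "$year_counts" | sort -k2,2nr | head -n 1)
echo "$top_year"
# prints: 2002 17626
```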
------------------------------------------------------------
3rd Problem - To find out how many books were published based on ranking
in the year 2002, we have to use the file BX-Book-Ratings.csv
and import it into a table to perform Hive queries.
Let's clean up the file first with the following commands
$ sed '1d' BX-Book-Ratings.csv > BooksRating2.csv
$ sed 's/";"/","/g' BooksRating2.csv > BookRatingsCorrected.csv
Now put the file into HDFS
$ hadoop fs -put /home/cloudera/BookRatingsCorrected.csv /user/input
Now create the table with the attributes
$ sudo hive
hive> CREATE TABLE IF NOT EXISTS DATARATING
(USERID STRING,
ISBN STRING,
BOOKRATING STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
Now load the data into the table
hive> LOAD DATA INPATH '/user/input/BookRatingsCorrected.csv' OVERWRITE INTO TABLE DATARATING;
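Perform the Hive query. The year is matched with LIKE '%2002%' because the loaded
values still carry their surrounding double quotes.
hive> select a.BOOKRATING, COUNT(b.BookTitle) from datarating a JOIN
dataset b ON (a.ISBN = b.ISBN)
WHERE b.YearOfPublication like '%2002%'
GROUP BY a.BookRating;
Output: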
"0"   53813
"1"   143
"10"  6273
"2"   260
"3"   534
"4"   861
"5"   3612
"6"   3186
"7"   6670
"8"   9881
"9"   6564
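The join and grouping can be mimicked on small files with awk, which is handy for checking the logic before running it on the cluster. This is a rough sketch on hypothetical sample rows, assuming each ISBN appears once in the books file; it is not a replacement for the Hive query.

```shell
workdir=$(mktemp -d)
cat > "$workdir/books.csv" <<'EOF'
"0000000001","Title A","Author A","2002","Pub A","s","m","l"
"0000000002","Title B","Author B","1999","Pub B","s","m","l"
EOF
cat > "$workdir/ratings.csv" <<'EOF'
"u1","0000000001","8"
"u2","0000000001","8"
"u3","0000000001","0"
"u4","0000000002","8"
EOF

rating_counts=$(awk -F'","' '
  # First file (books): remember every ISBN published in 2002.
  NR == FNR { isbn = $1; gsub(/"/, "", isbn); if ($4 == "2002") in2002[isbn] = 1; next }
  # Second file (ratings): count ratings whose ISBN was remembered.
  { rating = $3; gsub(/"/, "", rating); if ($2 in in2002) count[rating]++ }
  END { for (r in count) print r, count[r] }
' "$workdir/books.csv" "$workdir/ratings.csv" | sort -n)
echo "$rating_counts"
# prints:
# 0 1
# 8 2
```

The rating given to the 1999 book is ignored, so only the three ratings for the 2002 title are counted.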
------------------------------------------------------------
Conclusion- In this way, I have implemented the solution for the given
problem. We could write a MapReduce program for each problem, but Hive in the
Hadoop ecosystem lets us do the same in much less time.
