The objective of this project is to solve the following problem statements from the given data:
Find out the frequency of books published each year. (Hint: Use the Books.csv file for this)
Find out in which year the maximum number of books were published.
Find out how many books were published based on ranking in the year 2002. (Hint: Use Book.csv and Book-Ratings.csv)
Prerequisites:-
CDH3 installed on VMware Player
The datasets are copied to the local download directory through FTP
Hive configurations done, as we are using Hive for this solution
Solution -
1st Problem - To find the frequency of books published each year, we first have to clean up the input files and process them to set the field delimiter (|). We are going to use the Linux bash sed command:
$ sed 's/&amp;/\&/g' BX-Books.csv > BooksCorrected1.csv
$ sed '1d' BooksCorrected1.csv > BooksCorrected2.csv
$ sed 's/";"/"|"/g' BooksCorrected2.csv > BooksCorrected.csv
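The first command replaces '&amp;' with '&' globally and stores the result in BooksCorrected1.csv. Note that a bare & in a sed replacement stands for the whole matched text, so the replacement has to be written as \&.
The second command removes the first row of the file, which holds the headers, and stores the result in BooksCorrected2.csv.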
The third command replaces the ; between fields with | in order to set the delimiter, and stores the result in BooksCorrected.csv, which is the file we are going to use.
Note:- We could use the -i option of the sed command to apply the changes directly to the file, but I preferred not to, so that the original files stay safe as a backup if anything goes wrong.
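The three sed passes can also be chained into a single pipeline, which avoids the intermediate files while still leaving the original untouched. The following is a minimal sketch on a tiny sample; the file names, the sample rows, and the | field delimiter are assumptions made for illustration, not part of the original dataset.

```shell
# Build a tiny sample in the semicolon-separated, quoted layout (hypothetical rows).
cat > sample-books.csv <<'EOF'
"ISBN";"Book-Title";"Book-Author";"Year-Of-Publication";"Publisher"
"0000000001";"Tom &amp; Jerry";"A. Author";"2002";"Some House"
"0000000002";"Plain Title";"B. Author";"1999";"Other House"
EOF

# Same three steps, chained:
#   1. turn the HTML entity &amp; back into a plain & (escaped as \& in the replacement)
#   2. drop the header row
#   3. replace the ";" between quoted fields with "|" (delimiter is an assumption)
sed 's/&amp;/\&/g' sample-books.csv | sed '1d' | sed 's/";"/"|"/g' > sample-corrected.csv

cat sample-corrected.csv
```

The result should contain the two data rows only, with fields still quoted and separated by |.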
Now our data is clean and ready for input, so let's put it into HDFS
$ hadoop fs -mkdir /user/input
$ hadoop fs -put /home/cloudera/BooksCorrected.csv /user/input
Now create a table with the attributes
$ sudo hive
hive> CREATE TABLE IF NOT EXISTS DataSet
(ISBN STRING,
BookTitle STRING,
BookAuthor STRING,
YearOfPublication STRING,
Publisher STRING,
ImageURLS STRING,
ImageURLM STRING,
ImageURLL STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Our table is now ready, so to feed in the data use the following command
hive> LOAD DATA INPATH '/user/input/BooksCorrected.csv' OVERWRITE INTO TABLE DataSet;
Then, to display the required result, we use a select command grouped by year of publication. Use the following command
hive> select yearofpublication, count(booktitle) from dataset group by yearofpublication;
Output:
"1376" 1
"1378" 1
"1806" 1
"1897" 1
"1900" 3
"1901" 7
"1902" 2
"1904" 1
"1906" 1
"1908" 1
"1909" 2
"1910" 1
"1911" 19
"1914" 1
"1917" 1
"1919" 1
"1920" 33
"1921" 2
"1922" 2
... (the values continue further below)
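Before loading the file into HDFS, the same per-year count can be cross-checked locally on the cleaned file. The sketch below uses awk as a stand-in for the GROUP BY and COUNT; the sample rows, the file names, the five-column layout, and the | delimiter are assumptions for illustration.

```shell
# Hypothetical sample in the cleaned layout (quoted fields, | assumed as delimiter).
cat > freq-sample.csv <<'EOF'
"0000000001"|"Title A"|"Author A"|"2002"|"House"
"0000000002"|"Title B"|"Author B"|"1999"|"House"
"0000000003"|"Title C"|"Author C"|"2002"|"House"
EOF

# Count rows per value of field 4 (year of publication), similar to GROUP BY + COUNT.
# The quotes around the year are kept, just as they are in the Hive output.
awk -F'|' '{ count[$4]++ } END { for (y in count) print y, count[y] }' freq-sample.csv \
  | sort > freq-out.txt

cat freq-out.txt
```

On this sample the year "2002" is counted twice and "1999" once.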
----------------------------------------------------------------------------------------------------
2nd Problem - To find out in which year the maximum number of books were published, we use the same table and sort it in descending order by the number of books published.
hive> select yearofpublication, count(booktitle) as count
from dataset
group by yearofpublication
sort by count desc;
Output:
"2002" 17626
"1999" 17430
"2001" 17354
"2000" 17232
"1998" 15765
"1997" 14889
"2003" 14355
"1996" 14030
"1995" 13545
"1994" 11796
"1993" 10602
"1992" 9906
"1991" 9389
"1990" 8661
"1989" 7937
"1988" 7493
"1987" 6529
"1986" 5841
"2004" 5839
"1985" 5343
"1984" 4986
So the maximum number of books was published in the year 2002, i.e. 17626.
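One point worth keeping in mind: in Hive, SORT BY orders rows within each reducer, so a single globally sorted list is only guaranteed when one reducer is used, whereas ORDER BY gives a total order. As a quick way to pick out the top year from an exported result, the sketch below sorts three of the rows shown above numerically; the file names are assumptions for illustration.

```shell
# Three rows taken from the output above, deliberately out of order.
cat > year-counts.txt <<'EOF'
"1999" 17430
"2002" 17626
"2001" 17354
EOF

# Sort numerically on the second column, largest first, and keep only the top row.
sort -k2,2nr year-counts.txt | head -n 1 > top-year.txt

cat top-year.txt
```

The single remaining row is the year with the highest count.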
----------------------------------------------------------------------------------------------------
3rd Problem - To find out how many books were published based on ranking in the year 2002, we have to use the file BX-Book-Ratings.csv and import it into a table to perform Hive queries.
Let's clean up the file first with the following commands
$ sed '1d' BX-Book-Ratings.csv > BooksRating2.csv
$ sed 's/";"/"|"/g' BooksRating2.csv > BookRatingsCorrected.csv
Now put the file into HDFS
$ hadoop fs -put /home/cloudera/BookRatingsCorrected.csv /user/input
Now create the table with the attributes
$ sudo hive
hive> CREATE TABLE IF NOT EXISTS DATARATING
(USERID STRING,
ISBN STRING,
BOOKRATING STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
Now load the data into the table
hive> LOAD DATA INPATH '/user/input/BookRatingsCorrected.csv' OVERWRITE INTO TABLE DATARATING;
Perform the Hive query
hive> select a.BOOKRATING, COUNT(b.BookTitle) from datarating a JOIN
dataset b ON (a.ISBN = b.ISBN)
WHERE b.YearOfPublication like '%2002%'
GROUP BY a.BookRating;
Output:
"0" 53813
"1" 143
"10" 6273
"2" 260
"3" 534
"4" 861
"5" 3612
"6" 3186
"7" 6670
"8" 9881
"9" 6564
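The join logic can be sketched locally as well: remember the ISBNs of books published in 2002, then count the rating rows that refer to them, grouped by rating value. This is only an illustration in awk, not the Hive execution itself; the sample rows, the file names, the column positions, and the | delimiter are assumptions. Since the year field keeps its quotes, the comparison is made against "2002" including the quotes, which is also why the Hive query matches with like '%2002%'.

```shell
# Hypothetical cleaned book rows: ISBN | title | author | year | publisher.
cat > j-books.csv <<'EOF'
"0000000001"|"Title A"|"Author A"|"2002"|"House"
"0000000002"|"Title B"|"Author B"|"1999"|"House"
"0000000003"|"Title C"|"Author C"|"2002"|"House"
EOF

# Hypothetical cleaned rating rows: user id | ISBN | rating.
cat > j-ratings.csv <<'EOF'
"u1"|"0000000001"|"8"
"u2"|"0000000001"|"0"
"u3"|"0000000002"|"8"
"u4"|"0000000003"|"8"
EOF

awk -F'|' '
  # First file (books): remember every ISBN whose year field is "2002".
  NR == FNR { if ($4 == "\"2002\"") in2002[$1] = 1; next }
  # Second file (ratings): count rating values for the remembered ISBNs only.
  ($2 in in2002) { count[$3]++ }
  END { for (r in count) print r, count[r] }
' j-books.csv j-ratings.csv | sort > j-out.txt

cat j-out.txt
```

On this sample the rating "8" is counted twice and "0" once; the rating for the 1999 book is ignored.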
----------------------------------------------------------------------------------------------------
Conclusion- In this way, I have implemented the solution for the given problem. We could write a MapReduce program for each problem, but Hive in the Hadoop ecosystem lets us do that in far less time.