
IDL - International Digital Library Of

Technology & Research


Volume 1, Issue 6, June 2017 Available at: www.dbpublications.org

International e-Journal For Technology And Research-2017

HADOOP based Recommendation Algorithm for Micro-video URL
Revathy Ramakrishnan
Department of CSE, CMRIT, VTU
Bengaluru, India
revathyr1@gmail.com

Abstract: In recent years, social media applications have pervaded daily life, and Social Networking Sites (SNSs) depend on their users for content generation. Because each SNS produces content in isolation, a significant share of the content that matches a user's interests remains undiscovered. This led to features such as like, share and hashtag functions that deliver content from one platform to another. These features allow users to interact with multiple SNSs, but users still receive content from each SNS separately. Although Open Identity allows single sign-in on multiple platforms, targeting multiple platforms at once remained an open problem. A Unified Access Model is proposed for internet-based content modeling, where the content for the users could be images, videos or text. Short videos, termed micro-videos, are popular with both viewers and producers. This work provides a recommendation algorithm for micro-video URLs which, unlike traditional recommendation algorithms such as content-based recommendation, uses a big data parallel computing framework. High performance computing is achieved by using the slope one algorithm with MapReduce on Hadoop. Hence, the proposed recommendation system for micro-video URLs can achieve high performance parallel computing and can be used by both producers and viewers.

Keywords: Networking Sites; Hadoop; MapReduce; parallel computing; Slope one; micro-video

1. INTRODUCTION

With the increase in the amount of data generated by social networks, Internet searches, etc., there was a need to revolutionize how data is handled. "Big Data" describes a universe of very large datasets. Although Big Data refers to the volume of data, it also signifies the capabilities involved in processing such data. Typically, a wide range of media and e-commerce firms, such as news websites, video providers and social networking websites, provide data (hereafter referred to as "content") on the Internet, and their primary goal is to generate revenue. Content providers not only try to maximize their revenue through advertisements and subscriptions but also try to reduce the cost of content distribution. Hence the providers distribute their content across several geographical locations, and use special analytical services (e.g., Google Analytics) to understand and improve user experience.

Social media applications that deliver content are completely dependent on their users, which pushes them to deliver the best possible quality at minimum cost. At the same time, content providers now have the ability to collect, store and analyze behavioral patterns from users. Users are proactively engaged in integrating content information with their social information, giving rise to social networking sites.

Social networking sites such as Facebook, Twitter, etc., depend completely on individual users for content generation. Each of these social networking sites is Single-Platform based. As the number of social networking sites increases, the Single-Platform approach has the limitation that significant user interests are left behind.

IDL - International Digital Library 1|P a g e Copyright@IDL-2017




Moreover, users not only consume data but also engage actively with the content, and thus pose new challenges to Big Data repositories. In order to enhance user experience in the Big Data paradigm, it is essential to put a Big Data User Centric Model in the foreground. Data can be characterized using the properties of Big Data that are generally referred to as the Five Vs:

Volume: Data sources such as sensors, social media and online transactions produce a huge volume of data that demands huge storage and heavy process management.

Velocity: Within a short span of time, an enormous amount of data can be accumulated by data sources, which in turn requires a short processing time for the accumulated data.

Variety: Multiple types of data such as videos, images, text, audio, etc., both structured and unstructured, bring challenges for data integration, storage and processing.

Veracity: Data quality depends on the source of the data. For example, data from a controlled source such as a registered user has more fidelity than data from an uncontrolled source such as a blog post.

Value: The usefulness of data to an enterprise is a key factor, and it depends heavily on veracity and processing time. Data with high veracity that can be analysed in a shorter time is of more value to an enterprise.

Fig. 1 Overview of Content Delivery to Users

Existing System

Indeed, social media applications emerged as Single-Platform applications, which limits how users access content. Although efforts were made to propagate content across platforms through OpenID, users still had to spend more time and effort to follow all social media applications with the same dedication.

Limitations of Existing System

Even though several attempts were made to facilitate interest-based content access, such as "like" on Facebook and "hash tags" on Twitter, searching for interest-based content across multiple platforms was not available. Although the "share" feature of Single-Platform social networking sites allowed user content to propagate across multiple platforms, each Single-Platform site remained isolated, and users still received content from each one individually.

Fig. 2 Overview of content share in Existing System

Problem Statement

Content in social networking platforms is widespread, and these Single-Platform sites account for a significant amount of the data shared in our Internet lives. Since a single platform restricts itself from content discovery through other platforms, a large proportion of user experience and user interest is lost within that single platform. We



propose an access model for user-interest and user-experience based content modelling using the Big Data paradigm.

Proposed System

For large-scale data storage and processing, we propose to use the Hadoop framework, which can process large amounts of structured and unstructured data. Moreover, Hadoop implements the MapReduce processing technique, where the input dataset is split into several independent segments that can be processed in parallel. The outputs of the map tasks on these independent segments are then sorted and passed as input to the reduce tasks.

The proposed work consists of

Design of cross-platform layers
Feedback based user-interest algorithm

Advantages:

Content similarity is measured
High speed processing due to Hadoop (MapReduce) framework

Social community standards such as OpenID [6] enable access to the user's social profile and connection to content based sites. This brings content based sites (e.g., YouTube, Google Videos, etc.) much closer to social networking sites, which raises new challenges in data management and content discovery across multiple platforms. While some platforms provide access to a large amount of content, they lack support for content discovery based on other users' experiences.

2. LITERATURE SURVEY

[1] Y. Beyer, G. S. Enli, A. J. Maas, and E. Ytreberg, Small talk makes a big difference: recent developments in interactive, sms-based television

This paper provides information on the different social media platforms that have emerged recently. These platforms are focused towards users through add-ons such as mobile texting in Facebook, Instagram, Twitter, Google+ and WhatsApp. A key problem with the add-ons of these applications is that there is no user-assessment experience. This paper mainly focuses on Twitter in the context of journalism. In more detail, it deals with the structural analysis of Twitter use during the first season of a talk show called Hubinette, which was aired on public service television in Sweden in 2011. Current state-of-the-art methods for data collection and analysis were applied to this dataset, and the paper shows that Twitter is used in ways that are not traditional in terms of journalist-reader relationships.

[2] N. B. Ellison et al., Social network sites: Definition, history, and scholarship

This paper points out that social media platforms consist of social networking sites (SNSs) which rely heavily on their users for the creation of content, in contrast with professionally produced content. If users do not participate in these social networking sites, the SNSs cannot succeed. This has attracted research attention, and institutes and industry currently focus on this issue. The paper describes and defines the different features of SNSs in a constructive way. It also gives a perspective on the history of these sites, noting the key changes and highlighting the most important developments. The features and definitions of SNSs given there form the basis for the research in the current paper.

[3] K. P. R. Lee, J. Brenner. (2012, Sep 13) Photos and videos as social currency online

Many people today use the Internet to find images and videos, and social media is central to this activity. In this paper, Lee shows that more than forty-one percent of the US population discover or transfer photos and videos on the Internet. Therefore, this is a clear sign that internet based



content discovery is at centre stage for content generation and redistribution. It also underlines that social media is heavily used, which generates a huge amount of Internet traffic. The content streams are fragmented, which limits the internet-based content that is relevant to the users. Lee considered individual interest as something that is continuously stimulated by relevant content discovery. Single-platform SNSs varied in terms of technology and scope, ranging across user demographics, geographical attributes or pre-existing relationships. SNSs focusing on specific interests such as travel, religion, sports, music, photo sharing, video sharing, country-related news, politics, philosophy, etc. were mainstream in the early days. Scope was slowly increased by removing age restrictions, opening in previously unreachable countries and so on. These extensions aimed to overcome limited content access, platform interoperability issues, lack of relevant content segmentation across multiple platforms, etc. Therefore, internet based content access moved from a specific platform to a more general one. Examples of features modeled in this way include the like feature on Facebook, hashtags on Twitter, filters on Instagram, etc. These attempts lead to the conclusion that internet-based content is searched more effectively within a single platform than across multiple platforms. Users tend to look for one platform where everything can be found, and do not consider interaction with other users or content through different platforms.

[4] D. Recordon and D. Reed, Openid 2.0: a platform for user-centric identity management

This paper explains that social media cross-platform applications are designed to account for the content access limitations of a single platform. The share feature, which is an internet-based content redistribution mechanism, makes it easy to access content from multiple platforms through a single platform. Content variety and content flexibility were therefore ensured with the emergence of such social networking sites, especially with share functions that let the user access multiple platforms by creating a single user ID. Here, the interests of the individual are stored based on the searches and shares done by the user, and corresponding content from multiple platforms is shown to the user.

[5] E. P. Bucy, Interactivity in society: Locating an elusive concept

This paper additionally deals with cross-platform applications for users. It elaborates on the fact that following multiple platforms takes more time, more effort and much more cognitive capacity. With the same dedication, one can gain far more knowledge, or save time, when the information from these multiple platforms is presented in a single platform. The trend towards share functionality has a downside: it enables one-to-many content distribution by the user, but the user is still limited to receiving content from each separate platform individually. This is clearly explained by Bucy.

[6] OpenID Foundation. (2013, Sep 13)

The OpenID Foundation website describes the use of an open identity to access content by allowing users to sign in to many websites using a single identity. OpenID is limited to a targeted platform rather than multiple platforms. It is nevertheless a creative approach to accessing multiple platforms with a unique ID, which allows information to be obtained from all these sites. Content aggregation platforms provide users with access to a large amount of content. What is still lacking is the use of the user's history to produce results: these platforms do not yet support interaction and content discovery through user experiences. Our unified access model for interest based content modeling addresses this gap.

3. SYSTEM ARCHITECTURE

The system architecture provides the overall structure of the system together with its conceptual integrity. The structural properties depict the components of the system and their interconnectivity through interfaces. With proper



specifications of the structural properties, an architectural design can be realized.

The system consists of the following modules:

User Management: The user registers with the system. This module is used to manage content filtering and to browse and view content. It also implements the blacklisting of content that does not match the user's interest.

Information Extraction: This module extracts content from social media such as YouTube, Facebook and Twitter.

Interest Mining: Based on the user's browsing behaviour, this module learns the user's interests and constructs user profiles, grouping users with similar interests.

Content Matching: This module matches content to user interest based on metadata matching and collaborative recommendation, and provides content recommendations to the user.

The entire system runs on a Hadoop cluster.

Fig. 3 System Architecture

A. Class Design

Framework classes are drawn using the Unified Modelling Language (UML), which shows the logical connectivity among classes as a chart. The class diagram also provides information on individual class attributes. Fig. 4 shows the class diagram.

Fig. 4 Class Diagram

B. Use Case Diagram

The behaviour of the classes is visualized in the form of a graph. This gives information on the usefulness of the system with respect to its objectives (referred to as use cases) and the dependencies between use cases. Fig. 5 shows the use case diagram.

Fig. 5 Use Case Diagram

C. Sequence Diagram

The message sequence for the forms is shown through a sequence diagram using the Unified Modelling Language (UML). Fig. 6 shows the sequence diagram.

D. Data flow Diagram



A data-flow diagram (DFD) is a graphical representation of the "flow" of data through an information system. DFDs can also be used to visualize data processing (structured design). On a DFD, data items flow from an external data source or an internal data store to an internal data store or an external data sink, via an internal process. Fig. 7 and Fig. 8 show the dataflow diagrams for Level 0 and Level 1.

Fig. 6 Sequence Diagram

Fig. 7 Level 0 Dataflow Diagram

Fig. 8 Level 1 Dataflow Diagram

4. IMPLEMENTATION

Step 1: Each content item extracted from a social networking site is expressed in the form of a feature vector

$D = (W_1, W_2, \ldots, W_N)$

where $W_1, W_2, \ldots, W_N$ are the weights for the feature items. Feature items are taken from the metadata of the content. The framework continuously tracks user activity on the social networking site, keeping a record of the content browsed or accessed by the user. This content is extracted from the social networking site in the form of a feature vector D. D contains a list of weights, which are the individual features of the content item. Each feature represents the particular content, and the feature items are taken from the metadata of the content.

In short, every time the user clicks on a link that points to content, this is recorded in the metadata. This continues for a limited time (the time limit of a micro-video). The metadata collected during this time interval is saved in the common user account. The implemented framework reads this metadata, which contains information on the links clicked or seen: each line of the metadata contains one link clicked or seen during the time period. This information about content is stored as a feature (W). A list of such information during the micro-video time period is



stored as a feature vector D, which is a vector of such features.

Step 2: Store the feature vector whenever a user browses a particular content item, and store it as an interest feature vector. Generally, when the framework accesses the metadata over a period of time, it contains the same information for a particular time gap. Since only new content is of interest, the framework extracts only the relevant and changed metadata. In this way, we can determine which metadata are useful for analysis, and such content is stored as interest feature vectors. In the end, only the interest feature content is processed, along with the time periods. The frequency of the interest feature points and the time period of access give a lot of information about what the user is looking for, and ads can be generated based on user experience.

Step 3: Remove outliers and cluster the items in the interest feature vector. Not all content stored in the metadata is relevant, and some of it should not be considered. Such content, or weights, is identified by having the lowest frequency or the shortest associated time interval. These weights are ignored.

Step 4: For any new content extracted from the social site, compare the similarity of the feature vector of the content to the cluster centers of the interest feature vectors created; if the similarity value is less than the threshold, the content is recommended to the user. Any new content found in the metadata is stored as a new weight. This new weight is compared with the weights contained in the cluster and, based on the frequency, a similarity value and a threshold, a decision is made on whether the content is recommended to the user. This in turn results in a decision on whether to show an ad.

Step 5: Group users whose interest feature vectors have similar cluster centers, and whenever a user in a group watches a content item, recommend the same content to the other users in the group if they have not watched it. This step determines how the weights and similarity measures are used for a micro-video (a cluster of weights) in predicting the content the user wants to see. Groups with similar cluster centers of interest feature points are grouped together. When the user watches the recommended content, the similarity measure is increased, and when the user does not watch it, it is decreased. The similarity measure therefore changes dynamically according to user activity, which is the aim of the framework: to consider user activity and decide on ads.

5. OUTCOMES

Tables I, II and III give a summary of the Unit Testing, Integration Testing and Validation Testing, respectively, that were performed for the implementation. User interests are monitored and logged to profile.txt and transaction.txt in the Cross-Platform module, as shown in Fig. 9. These files are later moved to a Linux system, where we run the MapReduce technique using Hadoop and use the slope-one algorithm to generate the recommendation result file. This file is then used as input to the Cross-Platform module, where the recommendation results are shown to the user, as shown in Fig. 10.

Table I
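The paper names the slope-one algorithm and MapReduce but does not give the formulation used. The following is a minimal sketch, assuming the common weighted Slope One variant, written as plain Python functions that mimic a map phase and a reduce phase; it does not use any Hadoop API. The user names, URL identifiers and ratings are hypothetical illustration data.

```python
from collections import defaultdict
from itertools import permutations


def map_user(ratings):
    """Map phase: for one user, emit ((url_a, url_b), rating_a - rating_b)."""
    for url_a, url_b in permutations(ratings, 2):
        yield (url_a, url_b), ratings[url_a] - ratings[url_b]


def reduce_diffs(pairs):
    """Shuffle and reduce: group by URL pair, return {pair: (mean_diff, count)}."""
    grouped = defaultdict(list)
    for key, diff in pairs:
        grouped[key].append(diff)
    return {key: (sum(d) / len(d), len(d)) for key, d in grouped.items()}


def predict(user_ratings, target, deviations):
    """Weighted Slope One prediction of the user's rating for `target`."""
    numerator = 0.0
    denominator = 0
    for url, rating in user_ratings.items():
        entry = deviations.get((target, url))
        if entry is None:
            continue  # no user has rated both URLs
        mean_diff, count = entry
        numerator += (mean_diff + rating) * count
        denominator += count
    return numerator / denominator if denominator else None


# Hypothetical ratings of micro-video URLs (higher means more interest).
users = {
    "alice": {"url_a": 5, "url_b": 3, "url_c": 2},
    "bob": {"url_a": 3, "url_b": 4},
    "carol": {"url_b": 2, "url_c": 5},
}

# Each user's record is independent, so the map calls could run in parallel.
emitted = [pair for ratings in users.values() for pair in map_user(ratings)]
deviations = reduce_diffs(emitted)

print(predict(users["carol"], "url_a", deviations))
```

Because the map function only needs one user's ratings at a time, the input can be split by user, which is the property that makes this computation a natural fit for a MapReduce job.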



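Steps 1, 3 and 4 of the implementation section can be illustrated with a small sketch. The paper does not name the similarity measure or the outlier rule, so the code below makes two assumptions: it uses cosine distance (so that a value below the threshold means the content is close to an interest cluster, matching the wording of Step 4), and it treats vectors with a low access count as outliers. All function names, vectors and thresholds are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length weight lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an all-zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def drop_outliers(vectors, access_counts, min_count):
    """Step 3 (assumed rule): keep vectors accessed at least min_count times."""
    return [v for v, n in zip(vectors, access_counts) if n >= min_count]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def should_recommend(content_vector, cluster_centers, threshold):
    """Step 4: recommend when the nearest cluster center is within threshold."""
    nearest = min(cosine_distance(content_vector, c) for c in cluster_centers)
    return nearest < threshold


# Hypothetical interest vectors D = (W1, W2, W3) with their access counts.
interest_vectors = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.0, 1.0, 1.0]]
access_counts = [5, 4, 1]

kept = drop_outliers(interest_vectors, access_counts, min_count=2)
centers = [centroid(kept)]  # a single cluster, for brevity

print(should_recommend([0.9, 0.1, 0.0], centers, threshold=0.2))
print(should_recommend([0.0, 0.0, 1.0], centers, threshold=0.2))
```

With these sample values the first new item lies close to the interest cluster and is recommended, while the second shares no features with it and is not.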
Table II

Table III

Fig. 9 Recommended URL is empty during first login

Fig. 10 Multiplatform Recommended Results based on user interests on subsequent login

CONCLUSION

The paper proposes a method that emphasizes the interconnection of services across multiple social media platforms. Content sharing and content access across multiple platforms through a Big Data user-centric approach extend past implementations. User interests and individual experiences are treated as data, which is processed using Hadoop with MapReduce techniques. User-friendly GUIs were implemented using Java Swing to ease the use of the recommendation system, which delivers interest-based content to the users.

REFERENCES

[1] Y. Beyer, G. S. Enli, A. J. Maas, and E. Ytreberg, Small talk makes a big difference: recent developments in interactive, sms-based television, Television & New Media, vol. 8, no. 3, pp. 213-234, 2007

[2] N. B. Ellison et al., Social network sites: Definition, history, and scholarship, Journal of Computer-Mediated Communication, vol. 13, no. 1, pp. 210-230, 2007

[3] K. P. R. Lee, J. Brenner. (2012, Sep 13) Photos and videos as social currency. http://pewinternet.org/Reports/2012/Online-Pictures/MainFindings.aspx?view=all

[4] D. Recordon and D. Reed, Openid 2.0: a platform for user-centric identity management, in Proceedings of the second ACM workshop on Digital identity management, ser. DIM 06. New York, NY, USA: ACM, 2006, pp. 11-16. http://doi.acm.org/10.1145/1179529.1179532

[5] E. P. Bucy, Interactivity in society: Locating an elusive concept, The Information Society, vol. 20, no. 5, pp. 373-383, 2004

[6] OpenID Foundation. (2013, Sep 13) Openid foundation website. http://openid.net/

[7] H. M. Inc. (2013) Social media management. https://hootsuite.com/

[8] Yoono. (2013) Your social networks united. http://www.yoono.com/
