
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2018.2828081, IEEE Access.

Relevant Feedback Based Accurate and Intelligent Retrieval on Capturing User Intention for Personalized Websites

Yayuan Tang 1,2, Hao Wang 3, Kehua Guo 2,4, Yizhe Xiao 2, Tao Chi 5

1 School of Electronics and Information Engineering, Hunan University of Science and Engineering, Yongzhou, 425199, China
2 School of Information Science and Engineering, Central South University, Changsha, 410083, China
3 Faculty of Engineering and Natural Sciences, Norwegian University of Science and Technology, 1517, N-6025, Aalesund, Norway
4 Key Laboratory of Information Processing and Intelligent Control of Fujian, Minjiang University, Fuzhou, 350108, China
5 Key Laboratory of Fisheries Information, Ministry of Agriculture, Shanghai Ocean University, Shanghai, 200120, China
*Correspondence: Kehua Guo (guokehua@csu.edu.cn)

Abstract—With the rapid growth of networking, cyber-physical-social systems (CPSSs) provide vast amounts of information. Faced with the huge and complex data produced by networking, obtaining valuable information that meets precise search needs while capturing user intention has become a major challenge, especially in personalized websites. General search engines face difficulties in addressing the challenges brought by this exploding amount of information. In this paper, we use real-time location and relevant feedback technology to design and implement an efficient, configurable, and intelligent retrieval framework for personalized websites in CPSSs. To improve the retrieval results, this paper also proposes a strategy of implicit relevant feedback based on click-through data analysis, which can obtain the relationship between the user query conditions and the retrieval results. Finally, this paper designs a personalized PageRank algorithm including modified parameters to improve the ranking quality of the retrieval results using the relevant feedback from other users in the interest group. Experiments illustrate that the proposed accurate and intelligent retrieval framework improves the user experience.

Index Terms—Intelligent retrieval, real-time location, personalized websites, keywords extraction, implicit feedback.

I. INTRODUCTION

Cyber-physical-social systems (CPSSs), which integrate the cyber world, physical world, and social world, can provide high-quality personalized services for humans [1,2]. Such systems take data from the environment, integrate the data and extract valid information. CPSSs take human society from philosophical abstraction into concrete daily applications. Human flesh search, new media, Wiki and crowd-sourcing mechanisms rapidly enhance human living space and are both data-driven and virtualized. Knowledge can be transmitted and accessed at almost the speed of light through cyberspace via social networks. The emergence of CPSSs [3] has resulted in the explosive growth of networking information; however, it also brings several important challenges.

To address the explosive growth of Internet information, the development of effective retrieval techniques is urgent. Various retrieval approaches have been extensively employed to retrieve massive amounts of Internet information. For example, people can conveniently crawl information from the Internet through search engines such as Google and Baidu. However, the retrieval results often contain substantial amounts of unnecessary information, and some required results can be hidden far back in the result pages; thus, users have to spend a lot of time finding the relevant results. It remains difficult to retrieve more accurate and more specific information that satisfies the query intentions of users. Thus, as a special field, information retrieval in personalized websites aims to better account for an individual's requirements than do general search engines [4,5].

A number of approaches to retrieving information in personalized websites have been presented. The dominant approach primarily focuses on keywords-based techniques. However, a search engine only retrieves information based on the keywords provided by the user and is not in a position to capture the user's search intention. Different users have different search needs on account of their different ages, interests and occupations. For example, the keyword 'orange' could mean a type of fruit or a color. If the search engine processes the keywords in the same way and returns the same retrieval results to different users, it still cannot change the fact that search engines lack the ability to satisfy the personalized demands of users.

Many approaches have been taken to capture the intentions of users to solve the above problems. The most commonly used approach is to employ keywords-based searching methods to find relevant webpages and provide appropriate ranking strategies [6]. In addition, with the increase in the demands on user satisfaction, vertical search engines have provided certain valuable information and related services for a particular field, a particular person and a particular demand (e.g., travel searches and educational resource searches) [7]. However, detailed and accurate information is still not able to be obtained by vertical or general search engines.


For instance, if we are at Central South University and need to search today's news, find the location of the fifth canteen, and so on, a general search engine will not provide satisfactory results.

Moreover, improving rankings has not been effectively addressed. Personalized search has become a research direction for numerous researchers. User-behavior-based techniques have boosted ranking performance [8]. For example, click models have been well studied for personalized search [9], where clicks with a reasonable dwell time on a particular document suggest that a user favors this result [10,11], whereas it might be non-relevant for other users. In this paper, we provide personalized ranking based on user behavior to meet users' real-time information needs. During information retrieval, users usually expect to obtain unknown information. Analyzing the user's historical search data represents a large proportion of research on personalized retrieval but is still unable to solve the existing problems.

In this paper, we investigate the above-mentioned problems. The main contributions of this paper are summarized as follows:

1) We propose an accurate and intelligent retrieval framework with real-time location and relevant feedback technology for personalized websites.

2) We predict user retrieval intentions by analyzing the user's real-time location to determine a personalized search range. To improve the retrieval results, we also propose a strategy of implicit relevant feedback based on click-through data analysis, which can obtain the relationship between the user query conditions and the retrieval results.

3) We design a personalized PageRank algorithm including modified parameters to improve the ranking quality of the search results using the relevant feedback from other users in the interest group. The solution ensures that different users obtain different results that are closer to each user's requirements, even with the same keywords.

The remainder of this paper is organized as follows. Section 2 provides a brief review of related work on similar problems. In Section 3, the model of the search engine is introduced, and the optimized retrieval strategy is described as the solution to the problem. In Section 4, the proposed framework is implemented, and some experiments are performed, analyzed and compared with other methods through simulation. Finally, the paper is concluded in Section 5, and future work is outlined.

II. RELATED WORK

Researchers have made great efforts to improve the efficiency of information retrieval. The most common approach is based on keywords. Substantial current research work only considers single keywords without fully expressing the intentions of users. In expanded research, others use related multi-keywords queries, which make the query results more consistent with the user's requirements [6]. Traditional keywords extraction algorithms come in four types: the LCS algorithm [12], N-Gram algorithm [13], IKAnalyzer algorithm [14] and Nakatsu algorithm. Furthermore, a few algorithms provide typical ranking algorithms for matching results in the search procedure. However, the aforementioned retrieval approaches suffer from several drawbacks: they produce many unrelated results, which leads to a massive waste of computational resources.

To solve the above problem, a new generation of search engines has become a research hot spot at home and abroad. Reference [15] implemented a freely customizable professional spider with storage management. Reinforcement learning was introduced into the web spider's learning process, and the hidden structure information obtained by training on the link text guided the spider to perform its work [16]. Reference [17] proposed a search strategy based on a context diagram, which was used to construct a typical context diagram to estimate the distance from the target webpage. Reference [18] proposed a topic-sensitive PageRank algorithm to avoid the theme drift problem of the original algorithm. Although these research works have reduced the amount of noise in the results, many of them still cannot effectively capture the intentions of users.

To address this issue, current solutions are applied to personalized search. Reference [19] designed a personalized information acquisition system in which the web crawler achieved a high acquisition accuracy. Reference [20] presented a character-sensitive PageRank calculation formula. Reference [21] proposed a personalization approach based on query clustering. Reference [22] proposed a new web search personalization approach that captured the user's interests and preferences in the form of concepts by mining search results and their click-throughs. Reference [23] proposed a personalized mobile search engine (PMSE) that captured the users' preferences in the form of concepts by mining their click-through data. Reference [24] presented an algorithm for the personalization of web searches, called a Decision Making Algorithm, to classify the content in the user history. Reference [25] discussed an ontology-based web information system (SEWISE) to support web information description and retrieval.

There have also been studies in which the click-through data of relevant feedback are introduced into retrieval systems. Reference [26] analyzed user click behavior when browsing retrieval results. Reference [27] compared two user click-through models, the click chain model (CCM) and the dependent click model (DCM), and the experiment confirmed that the CCM performed better. Reference [28] introduced user click-through data as measured parameters to improve the data quality of training algorithms. Reference [29] introduced user click-through data into image retrieval to retrieve more accurate results. Reference [30] adopted vocabulary extracted from the retrieval results to improve the accuracy. In addition, substantial research has been performed on search result reordering using user click-through data [31,32], which has confirmed the effectiveness of click-through data [33].

Generally, although most approaches to personalization have achieved good performance, some difficulties, such as real-time performance and user experience, prevent them from being widely applied.


Our method provides an accurate and intelligent retrieval framework with real-time location and relevant feedback technology for personalized websites. Our framework captures these aspects, and the experimental results demonstrate the method's effectiveness.

III. INTELLIGENT RETRIEVAL FOR PERSONALIZED WEBSITES

In this section, we present the framework of intelligent retrieval and demonstrate how to use real-time location information to assist in retrieval for personalized websites. There are four main parts to our proposed intelligent retrieval algorithm: (a) real-time location and web configuration, (b) keywords extraction, (c) relevant feedback and (d) personalized ranking. In our proposed method, we assume that the server has already collected some website framework information in a database to guarantee that the server can return appropriate results faster.

A. Framework Overview

An overview of the intelligent retrieval framework is shown in Fig. 1. The work process mainly consists of four steps: (a) real-time location and configuration of the personalized website, (b) retrieval, (c) performance optimization and (d) re-retrieval and a list of the final returned results.

In the first step, the user's real-time location information (the latitude and longitude of the user) is obtained from the Location API and uploaded to the server. The user retrieves the current location area after setting the retrieval range, thereby acquiring the name and address information of the nearby buildings. Then, the information features are processed into a long text using the proposed keywords extraction algorithm to extract the keywords. The current personalized website is then determined.

Fig. 1: An overview of the intelligent retrieval framework

In the second step, the server captures the information of the initial websites in the web configuration file and calculates the initial PageRank value of all the webpages. The user can modify the local website list in the client and upload it to the server. The client groups the users according to the keywords entered the first time. After the server receives the keywords request, the information retrieval is performed on the corresponding personalized website, and the web log also records the information. The retrieval results are then returned to the client.

In the third step, the results returned from the server are not directly displayed to the users. The client uses the introduced click-through data acquisition strategy (feedback strategy) to record user behavior and upload it to the server. The server then analyzes the user's click-through data features and updates the PageRank value through the proposed personalized ranking method to make subsequent retrievals more relevant to users' requirements.
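To make the three-step interaction more concrete, the hypothetical Java types below sketch one way the client-server roles described above could be expressed. All names, fields and method signatures here are our own illustrative assumptions; they are not taken from the paper or its implementation.

    // A minimal sketch (assumed names, not the paper's actual API) of the
    // client-server interaction described in the framework overview.
    import java.util.List;

    /** Uploaded by the client in step 1: raw GPS reading plus the chosen search range. */
    class LocationReport {
        double latitude;
        double longitude;
        int searchRangeMeters;        // retrieval range set by the user
    }

    /** One entry of the ranked result list returned in step 2. */
    class RetrievalResult {
        String url;
        String title;
        double personalizedPageRank;  // ranking score used for ordering
    }

    /** One click observation collected by the client in step 3. */
    class ClickRecord {
        String query;
        String clickedUrl;
        int positionInResultList;     // 1-based display position
        long dwellTimeMillis;
    }

    /** Server-side operations implied by the three-step work process. */
    interface IntelligentRetrievalServer {
        /** Step 1: resolve the personalized website for the reported location. */
        String configurePersonalizedWebsite(LocationReport report);

        /** Step 2: retrieve and rank results for a keywords query inside that website. */
        List<RetrievalResult> search(String groupId, String website, String keywords);

        /** Step 3: feed click-through data back so PageRank values can be updated. */
        void uploadClickThrough(String groupId, List<ClickRecord> clicks);
    }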
B. Real-time Location and Web Configuration

Users have different requirements for the same query in different scenarios. Consequently, we consider their location to obtain personalized query results. Real-time location and web configuration are two important issues facing the intelligent retrieval framework. The detailed process of location and web configuration is illustrated in Fig. 2.

Fig. 2: Process of location and web configuration, covering the location phase, search phase and website configuration phase (Location Info: ID, IP, Title, Address; Website Info: ID, IP, Title)

Generally, a mobile user obtains location information via GPS, including the current latitude and longitude. There are two main jobs in extracting real-time location information. First, we retrieve the information of the current surrounding location through the Baidu map API and acquire detailed information of the surrounding buildings, such as id, name, and location, by setting the coordinates (the location information and retrieval range as the parameters). Second, we extract the "title" and "address" information of the located surrounding buildings and combine the information into a long text. Thus, we can obtain the website name of the current user's personalized search range.
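As an illustration of the second job, the following sketch combines the "title" and "address" fields of nearby buildings into a single long text and hands it to the keywords extraction step. The Poi class, the example place names and the commented-out extraction call are hypothetical stand-ins for components the paper does not specify in detail.

    // Sketch of turning nearby-building information into the long text used
    // for keywords extraction. All names here are illustrative assumptions.
    import java.util.Arrays;
    import java.util.List;

    class Poi {
        final String title;    // "title" field of a surrounding building
        final String address;  // "address" field of a surrounding building
        Poi(String title, String address) { this.title = title; this.address = address; }
    }

    public class LocationTextBuilder {

        /** Concatenate the title and address of every nearby building into one long text. */
        static String buildLongText(List<Poi> nearbyBuildings) {
            StringBuilder longText = new StringBuilder();
            for (Poi poi : nearbyBuildings) {
                longText.append(poi.title).append(' ').append(poi.address).append(' ');
            }
            return longText.toString().trim();
        }

        public static void main(String[] args) {
            // Hypothetical surroundings for a user on a university campus.
            List<Poi> pois = Arrays.asList(
                    new Poi("Central South University Fifth Canteen", "Lushan South Road, Changsha"),
                    new Poi("Central South University Library", "Lushan South Road, Changsha"));

            String longText = buildLongText(pois);
            // The long text is then passed to the keywords extraction algorithm of
            // Section III-C to decide the personalized website, e.g.:
            // List<String> keywords = KeywordsExtractor.extractKeywords(longText);
            System.out.println(longText);
        }
    }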
The frameworks of personalized websites vary, and thus, we design web configuration files to achieve faster retrieval. The user can modify the file according to their own needs for personalized retrieval. The web configuration includes three phases. First, the user sets the search range according to the user's own needs, and the corresponding location information is saved to the Location Info. Second, an accurate retrieval result is obtained after the location information and the request are extracted as keywords; then, the request is sent to the server. Finally, the personalized website is configured automatically in the server according to the Location Info.
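The paper does not publish the concrete format of the web configuration file, so the sketch below only illustrates the kind of record such a file might carry for one personalized website; every field name is an assumption of ours.

    // Hypothetical record for one entry of the web configuration file.
    // The exact schema is not given in the paper; this is illustrative only.
    public class WebsiteConfigEntry {
        String id;                 // identifier of the personalized website
        String siteName;           // e.g. "Central South University"
        String baseUrl;            // entry point used by the retriever
        String locationTitle;      // "title" taken from the Location Info
        String locationAddress;    // "address" taken from the Location Info
        int searchRangeMeters;     // retrieval range chosen by the user
        double initialPageRank;    // seed value computed in step 2 of Fig. 1
    }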


C. Keywords Extraction

This paper proposes an optimized keywords extraction algorithm based on a statistical model, which calculates the frequency of words appearing at important locations of the text such as the "first line", the "first" position and the "tail". The text is broken into clauses, and public substrings are extracted to calculate the frequency of each word by comparing two clauses at a time when extracting the keywords. Specifically, the text is first broken into clauses, which are paired with one another by permutation and combination. Then, we use the optimized public substring extraction algorithm to process the clause set. Finally, we extract the keywords of the text according to the weights of the candidate keywords, which depend on the word frequency and word length. The public substring extraction algorithm is shown in Algorithm 1.

Algorithm 1: Public substring extraction algorithm
input : str1[], str2[]
output: pstr[]
int rowLength;
index[rowLength][str1.length()];
int row = 0;
for i = 0 to str2.length do
    for j = 0 to str1.length do
        if str2.getChar(i) == str1.getChar(j) then
            if index[row][j] == -1 then
                index[row][j] = i;
            else if index[row][j] > -1 then
                row++;
                index[row][j] = i;
            end
        end
    end
end

The optimized keywords extraction algorithm presents improvements in terms of its space and time complexities. In terms of space complexity, it adjusts the construction method of the matrix and changes the string traversal method of the traditional LCS algorithm. For instance, if the two string lengths are p and q, the space complexity of the traditional LCS algorithm is O(pq). Because we adopt the occurrence count m of the most frequent character as the height of the matrix, the space complexity of the proposed algorithm is improved to O(pm), which is obviously less than O(pq). In addition, this construction mode of the matrix does not affect the extraction of the public substrings and allows the frequency of each public substring to be recorded more easily. In terms of time complexity, the time complexity of the string traversal is still O(pq), but the total running time of the optimized algorithm is reduced to a certain degree because the height of the matrix is reduced.
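For readers who prefer running code, the sketch below extracts common ("public") substrings of two clauses together with occurrence counts, which is the role Algorithm 1 plays in the keywords extraction pipeline. It deliberately uses a standard rolling dynamic-programming row instead of the occupancy matrix described above, so it illustrates the task rather than reproducing the authors' algorithm; the minLen threshold is our own assumption.

    import java.util.HashMap;
    import java.util.Map;

    /**
     * Sketch of public (common) substring extraction between two clauses.
     * Candidate keywords are common substrings of at least minLen characters,
     * later weighted by frequency and length as described in Section III-C.
     * Illustrative implementation, not the paper's Algorithm 1.
     */
    public class PublicSubstringExtractor {

        /** Maps each common substring of length >= minLen to the number of aligned
         *  positions at which it occurs as a common suffix in both clauses. */
        static Map<String, Integer> extract(String str1, String str2, int minLen) {
            Map<String, Integer> candidates = new HashMap<>();
            int p = str1.length(), q = str2.length();
            // prev[j] = length of the common suffix ending at str1[j-1] / str2[i-1];
            // a single rolling row keeps the extra space at O(p), in the spirit of
            // the reduced-height matrix of Algorithm 1.
            int[] prev = new int[p + 1];
            for (int i = 1; i <= q; i++) {
                int[] cur = new int[p + 1];
                for (int j = 1; j <= p; j++) {
                    if (str2.charAt(i - 1) == str1.charAt(j - 1)) {
                        cur[j] = prev[j - 1] + 1;
                        if (cur[j] >= minLen) {
                            String sub = str1.substring(j - cur[j], j);
                            candidates.merge(sub, 1, Integer::sum);
                        }
                    }
                }
                prev = cur;
            }
            return candidates;
        }

        public static void main(String[] args) {
            // Two clauses of a text; overlapping phrases become keyword candidates.
            Map<String, Integer> result =
                    extract("intelligent retrieval framework", "retrieval framework for websites", 5);
            result.forEach((s, n) -> System.out.println(s + " -> " + n));
        }
    }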
D. Relevant Feedback

A practical search engine, especially on the Internet, should provide a convenient user experience. If we want to improve the retrieval performance, the extra overhead of user feedback should be minimized.

We randomly generate user click-through data for the top 10 links. The distribution of user click-throughs is shown in Fig. 3. Obviously, when users retrieve information from search engines, usually only the top few retrieval results attract the attention of the user. If the retrieval results that the user needs are further back, the user cannot obtain useful information from this retrieval. In addition, each user has different concerns about the target link even if they input the same keywords as the retrieval condition. Therefore, we can obtain a certain value from the information in the user's click-through data. This paper presents a strategy to obtain users' click-through data via implicit feedback, which can improve the performance of search engines and the user's satisfaction.

Fig. 3: Distribution information of user click-throughs

Definition III.1. Click set (CS): Given a query keywords with its accessible links, the CS satisfies

CS = (ID, Q, R, C)

where
ID is the number of the user's interest group and is used to distinguish users in different groups;
Q is a query keywords, which shows the query conditions of the retrieval;
R denotes the collection of all links returned from the search engine, in which the order of the links in the set is the display order on the webpage; and
C denotes the collection of all links clicked by the user.

Definition III.2. Feedback set (FS): The FS is used to indicate the relevant feedback information obtained from the click data analysis, and the FS satisfies

FS = (ID, map)

where
map is a relational table that stores the relative degrees of correlation between two webpages.
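The two definitions map naturally onto small data holders. The Java sketch below is one possible, assumed representation of CS and FS; in particular, modelling map as a nested table from a query to ordered link pairs is our own choice, since the paper only calls it a relational table.

    // Illustrative data structures for Definition III.1 (CS) and III.2 (FS).
    // Field layout and types are assumptions, not the paper's implementation.
    import java.util.List;
    import java.util.Map;

    /** CS = (ID, Q, R, C): one query of one interest group plus its clicks. */
    class ClickSet {
        String groupId;        // ID: the user's interest group
        String query;          // Q: query keywords
        List<String> returned; // R: returned links in display order
        List<String> clicked;  // C: links the user actually clicked
    }

    /** FS = (ID, map): relevance relations derived from the click analysis. */
    class FeedbackSet {
        String groupId;                          // ID: the interest group
        // map: for each query, pairs (li, lj) meaning "li is more relevant than lj",
        // together with how often that relation was observed.
        Map<String, Map<LinkPair, Integer>> map;
    }

    /** Ordered pair (li, lj) stored in the map table. */
    record LinkPair(String moreRelevant, String lessRelevant) {}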


We analyze user behavior and propose optimized strategies to obtain the relevant feedback. In our strategies, the relevant degree of a link is low when the link is near the front and unclicked. If a link is not clicked while the previous link is clicked by the user, the relevant degree of the link is low. If a link is clicked by the user, the relevant degree of the link is higher than that of the previous unclicked links. The relevant degree between a link and the query keywords is higher the more often the link is clicked. The relevant feedback algorithm is shown in Algorithm 2.

Algorithm 2: Relevant feedback algorithm
input : CS(ID, Q, R, C)
output: FS(ID, map)
for i = 1 to n do
    for j = 1 to n do
        if 1 <= j < i <= n then
            if link(i) ∈ C && link(j) ∉ C then
                (li, lj) is stored in map;
            end
        else if 1 <= i <= n-1 then
            if link(i) ∈ C && link(i+1) ∉ C then
                (li, li+1) is stored in map;
            end
        else if i = n then
            if link(i) ∈ C then
                for j = 1 to n do
                    (li, lj) is stored in map;
                end
            end
        else if link(i) ∈ C && link(j) ∈ C then
            if num(link(i)) > num(link(j)) then
                (li, lj) is stored in map;
            end
        end
    end
end

Here, n is the number of links in the returned set R. link(i) represents the ith link in the linked collection returned from the search engine. The relationship (li, lj) indicates that the relevant degree of link i is higher than that of link j for the keywords used in this query. num(link(i)) represents the number of clicks of link i.
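A compact way to exercise the feedback rules is sketched below: given the display order R and the click counts of one query, it emits (li, lj) pairs when a clicked link is displayed below unclicked links and when a clicked link has more clicks than another clicked link. It is a simplified reading of the rules above rather than a line-by-line port of Algorithm 2, and the click counts are assumed to come from the client's click log.

    // Sketch of deriving relevance-feedback pairs (li, lj), meaning
    // "li is more relevant than lj for this query", from one click set.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class RelevantFeedback {

        /** returned: links in display order (R); clicks: click counts per link (derived from C). */
        static List<String[]> buildMap(List<String> returned, Map<String, Integer> clicks) {
            List<String[]> map = new ArrayList<>();
            int n = returned.size();
            for (int i = 0; i < n; i++) {
                String li = returned.get(i);
                int ci = clicks.getOrDefault(li, 0);
                if (ci == 0) continue;                   // only clicked links gain relevance
                for (int j = 0; j < n; j++) {
                    if (j == i) continue;
                    String lj = returned.get(j);
                    int cj = clicks.getOrDefault(lj, 0);
                    if (cj == 0 && j < i) {
                        map.add(new String[]{li, lj});   // clicked link beats unclicked links shown above it
                    } else if (cj > 0 && ci > cj) {
                        map.add(new String[]{li, lj});   // among clicked links, more clicks means more relevant
                    }
                }
            }
            return map;
        }

        public static void main(String[] args) {
            List<String> returned = List.of("l1", "l2", "l3", "l4", "l5");
            Map<String, Integer> clicks = Map.of("l5", 1);   // only the fifth link was clicked
            // Prints (l5, l1) ... (l5, l4): the clicked link outranks the unclicked links above it.
            buildMap(returned, clicks).forEach(p -> System.out.println("(" + p[0] + ", " + p[1] + ")"));
        }
    }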

E. Personalized Ranking Method

The traditional PageRank algorithm is implemented based on the linkage relations between webpages, but it ignores the importance of the webpages for different users. This paper presents a method whereby the relevant feedback information obtained from the click-through data is introduced into the PageRank algorithm. According to the proposed relevant feedback information extraction strategy, we obtain the map table for the relationship of relevant degrees between links. However, the personalized PageRank value is influenced not only by the relationships among links but also by the user click behavior. Thus, we regularly update the map table to more accurately reflect the current retrieval intention of the same group of users.

The improvement over the traditional PageRank algorithm consists in adding a vector q, which represents the modification of the PageRank value using the relevant feedback information obtained from the click-through data. During the traversal of the map table, if the relevancy of link l_i for the same keywords is greater than that of link l_j while the webpage weight of link l_i is less than that of link l_j, we modify the weight stored in the database by the vector q. The calculation is as follows:

q[l_i] = \frac{\sum_{(l_i,l_j)} \bigl(Rank(l_i) - Rank(l_j)\bigr)}{2\, N(l_i,l_j)}    (1)

q[l_j] = -q[l_i]    (2)

Rank(l_i) represents the current weight of the link l_i in the database, and N(l_i,l_j) represents the number of relationships in the relevancy table. The click status of a single user cannot represent other users; thus, we need to analyze and merge the click-through data of different users, which gradually refines the vector q. Formula (3) represents the accumulation process of the modified vector q:

q_{old}[l_i] = k_1\, q_{old}[l_i] + k_2\, q_{new}[l_j]    (3)

q_{old}[l_i] represents the original value of the modified vector for link l_i. q_{new}[l_j] indicates the modified value calculated from the relevancy of the newly acquired click-through data. Introducing the modified vector q into the traditional PageRank equation, the following formula (4) is obtained:

\forall l_i:\; Rank^{n+1}(l_i) = \sum_{l_j \in B_{l_i}} \frac{Rank^{n}(l_j)}{N_{l_j}} + q[l_i]    (4)

B_{l_i} represents the collection of all links pointing to l_i, and N_{l_j} represents the total number of outgoing links of webpage l_j. For formula (4), a variable d is added to control the proportion between the modified vector q and the traditional PageRank value. The calculation is as follows:

\forall l_i:\; Rank^{n+1}(l_i) = d \sum_{l_j \in B_{l_i}} \frac{Rank^{n}(l_j)}{N_{l_j}} + (1-d)\, q[l_i]    (5)

Formula (4) and formula (5) add the modified vector q to the traditional PageRank. The corresponding formula including the webpage access probability C is as follows:

\forall l_i:\; Rank^{n+1}(l_i) = d\left[(1-C) + C \sum_{l_j \in B_{l_i}} \frac{Rank^{n}(l_j)}{N_{l_j}}\right] + (1-d)\, q[l_i]    (6)

The relevant feedback information provided by different users is different, and the value of the modified vector q is different; therefore, the calculated personalized PageRank values also show significant differences.


Therefore, even if users of different groups use the same retrieval keywords, the retrieved results will be reordered based on the value of the personalized PageRank. The calculation process of this personalized PageRank algorithm is shown in Algorithm 3.

Algorithm 3: Personalized PageRank algorithm
input : the link relations; the relevant feedback information
output: personalized PageRank value
while the PageRank value has not converged do
    calculate the PageRank value of each webpage;
    calculate the value of the relevant feedback vector q according to formulas (1), (2), (3);
    calculate the personalized PageRank value according to formula (6);
end
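The loop of Algorithm 3 can be written out as plain Java. The sketch below iterates formula (6) with a feedback vector q supplied by the click analysis; the damping-style constants d and C, the convergence threshold and the graph representation are illustrative assumptions, and the q accumulation of formula (3) is reduced to a single pre-computed array for brevity.

    // Sketch of the personalized PageRank iteration of formula (6) / Algorithm 3.
    // Graph layout, constants and stopping rule are illustrative assumptions.
    import java.util.Arrays;

    public class PersonalizedPageRank {

        /**
         * inLinks[i]   = indices of pages linking to page i (the set B_li);
         * outDegree[j] = number of outgoing links of page j (N_lj);
         * q[i]         = modified vector from the relevant feedback (formulas (1)-(3)).
         */
        static double[] rank(int[][] inLinks, int[] outDegree, double[] q,
                             double d, double c, double epsilon) {
            int n = inLinks.length;
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);                  // uniform initial PageRank
            double delta;
            do {
                double[] next = new double[n];
                delta = 0.0;
                for (int i = 0; i < n; i++) {
                    double sum = 0.0;
                    for (int j : inLinks[i]) {
                        sum += rank[j] / outDegree[j];   // Rank^n(lj) / N_lj
                    }
                    // formula (6): d * [(1 - C) + C * sum] + (1 - d) * q[li]
                    next[i] = d * ((1.0 - c) + c * sum) + (1.0 - d) * q[i];
                    delta = Math.max(delta, Math.abs(next[i] - rank[i]));
                }
                rank = next;
            } while (delta > epsilon);                   // stop once the values converge
            return rank;
        }

        public static void main(String[] args) {
            // Tiny 3-page example: 0 <-> 1, 2 -> 0. Page 1 received positive feedback.
            int[][] inLinks = {{1, 2}, {0}, {}};
            int[] outDegree = {1, 1, 1};
            double[] q = {0.0, 0.02, -0.02};
            System.out.println(Arrays.toString(rank(inLinks, outDegree, q, 0.85, 0.8, 1e-6)));
        }
    }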




The proposed personalized PageRank value merges the webpage weight and a user behavior influence factor. For the result ranking, we additionally use the webpage relevant degree to make the results more accurate.

IV. EXPERIMENTAL RESULTS

A. Experimental Setup

It is important to establish the experimental database. We perform the experiments in a real Web environment. Table 1 shows the experimental environment. The intelligent retrieval framework provides a convenient operating interface, which is similar to that of a commercial search engine. Users can type keywords into the interface and submit the information to the server.

Table 1: Test environment

Item                        Name/Version
Server OS                   Mac OS 10.11.5
Server Memory               8 GB
Server Version              Tomcat 7.0
Server Database Version     MySQL 5.7.13
Mobile Terminal Type        Huawei Mate9
Mobile Terminal OS          Android 7.0
Mobile Terminal Database    SQLite 3.8.8

B. Keywords Extraction Efficiency

The first experiment compares the extraction accuracy and time cost of the proposed keywords extraction algorithm and four traditional public substring extraction algorithms. The experimental dataset includes 1000 experimental sets, comprising the abstracts and keywords extracted from papers on the Baidu Academic and CNKI platforms.

1) Extraction Accuracy: The text length strongly influences the keywords extraction accuracy, and thus, the experiment tests a single text dozens of times. The accuracy of the keywords extraction algorithm is estimated by the similarity between the results of the algorithms and the real data. Therefore, the keywords extraction accuracy can be defined as follows:

accuracy = \frac{1}{n} \sum_{i=1}^{n} Sim(i)    (7)

where n is the number of keywords extraction results and Sim(i) represents the similarity of result i, with 0 ≤ Sim(i) ≤ 1.

Fig. 4: Relation between keywords extraction accuracy and text length (comparing the proposed, LCS, N-Gram, IKAnalyzer and Nakatsu algorithms)

From Fig. 4, the accuracy of the keywords extraction under the proposed algorithm is higher than that of the traditional public substring extraction algorithms. The proposed keywords extraction algorithm does not limit the length of the word or phrase and considers the number of words and phrases in the text. Thus, when extracting keywords, the "off" phenomenon, whereby the extracted keywords form a meaningless phrase, is avoided. In addition, because the proposed algorithm incorporates word frequency, the extracted keywords better match the theme of the text.

2) Time Cost: The main influences on the algorithm execution time are the length and quantity of the text. The algorithm execution time increases with increasing length and quantity of text, as shown in Figs. 5 and 6.

From Fig. 5, we draw the following conclusions: (1) The proposed algorithm and the LCS algorithm are similar in time complexity because both algorithms need to extract public substrings from the clauses of the text; however, the proposed algorithm achieves a relatively low space complexity in the presence of high frequencies, and thus, the time consumed by scanning the array is relatively short. The time complexity of the LCS algorithm is still higher than that of the proposed algorithm. (2) The keywords extraction time of the Nakatsu algorithm is the highest because of the complex steps needed for the high-frequency words. (3) The keywords extraction procedure of the N-Gram algorithm is relatively simple; only one word segmentation process is performed, and therefore, its time complexity is the lowest. (4) The IKAnalyzer algorithm has the same matching process as the N-Gram algorithm; therefore, its time complexity is higher than that of the N-Gram algorithm.


Fig. 5: Relation between keywords extraction time and text length

Fig. 6 shows the relation between the keywords extraction time and the number of texts. Because the text length also affects the execution time of the algorithm, we choose test texts of approximately 500 characters. When the amount of text increases, the keywords extraction time also increases. For the same number and length of texts, the execution time of the proposed algorithm is in the middle.

Fig. 6: Relation between keywords extraction time and text amount

C. Relevant Feedback Extraction Strategy Efficiency

The second experiment tests the relevancy between the webpage results and the query keywords according to the relevant feedback extraction strategy based on the click-through data analysis. In the simulation experiment, we set 20 static links in the Android client with 10 links per page. We first set the initial 2-tuple (ID, map) in the database. Then, we click on two links, 1 and 5, of the 20 links in sequence and click once per link. The partial data in the map table are shown in Table 2.

Table 2: The results of partial data in map

no.   results in map
1     (l5, l2)
2     (l5, l3)
3     (l5, l4)

From Table 2, links 2, 3 and 4, which are displayed before link 5, were not clicked; thus, their relevant degree with respect to the query keywords is lower than that of link 5.

Table 3: The results of partial data in map

no.   results in map
1     (l1, l2)
2     (l6, l7)
3     (l8, l9)
4     (l11, l12)
5     (l15, l16)
6     (l15, l1)
7     (l15, l6)
8     (l15, l8)
9     (l15, l11)

We then click on the five links 1, 6, 8, 11 and 15 successively after truncating the map table, and each link is clicked once. The partial data in the map table are shown in Table 3. We find from Table 3 that the relevancy of a link that is not clicked and appears behind a clicked link is low. In addition, the relevancy of link 15 is higher than that of all the other clicked links because it is the last link clicked. The results are in agreement with the proposed strategy.

We click link 1 five times, link 3 two times and link 9 seven times after clearing the map table again. The partial data in the map table are shown in Table 4. Table 4 illustrates that the more a link is clicked, the higher the relevancy of the link. The results are in agreement with the proposed strategy.

Table 4: The results of partial data in map

no.   results in map
1     (l1, l3)
2     (l9, l1)
3     (l9, l3)

D. Personalized Ranking Performance

The third experiment tests the performance of the proposed personalized ranking method. We first write 10 simple HTML files as test webpages, and these webpages have the same web structure, that is, the same initial PageRank value.


For the same keywords, each of the 10 webpages is given a different similarity value. We click on the 10 webpages many times, where the number of clicks differs, recording and analyzing the PageRank values and rankings of the 10 webpages.

The PageRank values of the 10 webpages are first stored. At this point, the PageRank value is only related to the structure of the webpage because there is no user click operation yet. A total of 100 clicks are then distributed over the 10 webpages. The current PageRank values are shown in Table 5.

Table 5: The status of the personalized PageRank value

webpage no.   clicks   initial PageRank value   personalized PageRank value
1             14       2.37E-2                  2.96E-2
2             18       2.37E-2                  3.12E-2
3             9        2.37E-2                  2.76E-2
4             21       2.37E-2                  3.24E-2
5             2        2.37E-2                  2.30E-2
6             3        2.37E-2                  2.31E-2
7             6        2.37E-2                  2.39E-2
8             26       2.37E-2                  3.38E-2
9             0        2.37E-2                  2.17E-2
10            1        2.37E-2                  2.17E-2

From Table 5, a higher number of clicks results in a greater increase in the personalized PageRank value. Meanwhile, the personalized PageRank values of webpages with few clicks barely fluctuate, and the personalized PageRank value of a webpage that is not clicked declines correspondingly.

We increase the number of clicks to 300 for the 10 pages and record the change in the personalized PageRank value after each click. In Fig. 7, we randomly extract the personalized PageRank values of 3 pages. From Fig. 7, when a webpage is clicked, its personalized PageRank value correspondingly increases; otherwise, it is reduced or remains essentially unchanged. Even if the webpages at the rear of the retrieval result list are not clicked, their PageRank values will not change; however, if some links before the clicked webpage are not clicked, their PageRank values will be reduced because those webpages are not of interest to the user.

Fig. 7: PageRank value of webpages with different numbers of clicks

Fig. 8 represents the ranking of the 10 pages in the result list when the user performs 0, 100 and 300 clicks. When the user has not clicked yet, these webpages are sorted by page similarity, creating a straight line. However, when the user performs multiple clicks, the ranking changes due to the influence of the personalized PageRank value. The webpages with more clicks are ranked in higher positions. When the number of clicks increases further, the line becomes basically stable, indicating that the ranking result is in line with the retrieval intentions of the majority of users.

Fig. 8: PageRank value of webpages in different clicks (webpage ranking vs. webpage no. for 0, 100 and 300 clicks)

V. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a new approach for an intelligent retrieval framework with real-time location in CPSSs to resolve the ambiguities faced by general search engines. We first present an intelligent retrieval model for a single field with real-time location. Second, to improve the retrieval results, the paper proposes a strategy for implicit relevance feedback based on click-through data analysis, which obtains the relationship between the user query conditions and the retrieval results. Finally, the paper designs a personalized PageRank algorithm including modified parameters to improve the ranking quality of the retrieval results using the relevant feedback from other users in the interest group.

We have performed several experiments to evaluate the performance of the proposed framework. Comparisons drawn from the experiments demonstrate that the proposed framework obtains remarkable retrieval performance with minimal effort and provides a superior user experience.

The proposed framework provides an efficient, intelligent, real-time location-oriented personalized retrieval approach in CPSSs. Although we have demonstrated the efficiency and effectiveness of the proposed framework, in the future, we will concentrate on thoroughly investigating several improvements to the compatibility and usability of the proposed framework.

ACKNOWLEDGMENT


This work is supported by the Hunan Science and Technology Plan (2012RS4054), the Natural Science Foundation of China (61672535 and 61472005), the Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education Innovation Fund (JYB201502), the Key Laboratory of Information Processing and Intelligent Control of Fujian Innovation Fund (MJUKF201735), the Natural Science Foundation of Hunan Province (2018JJ3203), the Graduate Student Innovation Project of Hunan (CX2016B049), the Research Foundation of the Education Bureau of Hunan Province (17C0679), the Hunan University of Science and Engineering Research Project (17XKY071) and the key discipline of computer application and technology of Hunan University of Science and Engineering. The authors declare that they have no conflicts of interest.

REFERENCES

[1] Wang, X., Yang, L. T., Xie, X., Jin, J., & Deen, M. J. (2017). A cloud-edge computing framework for cyber-physical-social services. IEEE Communications Magazine, 55(11), 80-85.
[2] Wang, X., Yang, L. T., Feng, J., Chen, X., & Deen, M. J. (2016). A tensor-based big service framework for enhanced living environments. IEEE Cloud Computing, 3(6), 36-43.
[3] Zeng, J., Yang, L. T., Lin, M., Ning, H., & Ma, J. (2016). A survey: cyber-physical-social systems and their system-level design methodology. Future Generation Computer Systems.
[4] Liu, X., & Turtle, H. (2013). Real-time user interest modeling for real-time ranking. Journal of the American Society for Information Science and Technology, 64(8), 1557-1576.
[5] Liu, J., & Belkin, N. J. (2015). Personalizing information retrieval for multi-session tasks: Examining the roles of task stage, task type, and topic knowledge on the interpretation of dwell time as an indicator of document usefulness. Journal of the American Society for Information Science and Technology, 66(1), 58-81.
[6] Li, R., Xu, Z., Kang, W., Yow, K. C., & Xu, C. Z. (2014). Efficient multi-keywords ranked query over encrypted data in cloud computing. Future Generation Computer Systems, 30(1), 179-190.
[7] Wu, Y., Shou, L., Hu, T., & Chen, G. (2008). Query triggered crawling strategy: Build a time sensitive vertical search engine. International Conference on Cyberworlds (pp. 422-427). IEEE.
[8] Agichtein, E., Brill, E., & Dumais, S. (2006). Improving web search ranking by incorporating user behavior information. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 19-26). New York: ACM.
[9] Chuklin, A., Markov, I., & de Rijke, M. (2015). Click Models for Web Search. Synthesis Lectures on Information Concepts, Retrieval, and Services. Morgan & Claypool Publishers.
[10] Xu, S., Jiang, H., & Lau, F. C. M. (2011). Mining user dwell time for personalized web search reranking. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (pp. 2367-2372). Palo Alto, CA: AAAI Press.
[11] Yi, X., Hong, L., Zhong, E., Liu, N. N., & Rajan, S. (2014). Beyond clicks: Dwell time for personalization. In Proceedings of the Eighth ACM Conference on Recommender Systems (pp. 113-120). New York: ACM.
[12] Babenko, M. A., & Starikovskaya, T. A. (2011). Computing the longest common substring with one mismatch. Problems of Information Transmission, 47(1), 28-33.
[13] Choong, C. Y., Mikami, Y., & Nagano, R. L. (2011). Language identification of web pages based on improved N-gram algorithm. International Journal of Computer Science Issues, 8(3).
[14] Gao, Y., Song, F., Xie, X., Sun, Q., & Wu, X. (2015). Study of text classification algorithm based on domain knowledge. International Conference on Cyberspace Technology.
[15] Chakrabarti, S., Berg, M. V. D., & Dom, B. (1999). Distributed hypertext resource discovery through examples. VLDB (pp. 375-386).
[16] Rennie, J., & McCallum, A. (1999). Using reinforcement learning to spider the web efficiently. Sixteenth International Conference on Machine Learning (pp. 335-343). Morgan Kaufmann Publishers Inc.
[17] Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., & Gori, M. (2000). Focused crawling using context graphs. International Conference on Very Large Data Bases (pp. 527-534). Morgan Kaufmann Publishers Inc.
[18] Haveliwala, T. H. (2003). Topic-sensitive PageRank: a context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge & Data Engineering, 15(4), 784-796.
[19] Jing, Y., & Baluja, S. (2008). VisualRank: applying PageRank to large-scale image search. IEEE Transactions on Pattern Analysis & Machine Intelligence, 30(11), 1877.
[20] Berkhin, P. (2005). A survey on PageRank computing. Internet Mathematics, 2(1), 73-120.
[21] Leung, W. T., Ng, W., & Lee, D. L. (2008). Personalized concept-based clustering of search engine queries. IEEE Transactions on Knowledge & Data Engineering, 20(11), 1505-1518.
[22] Leung, W. T., Lee, D. L., & Lee, W. C. (2010). Personalized web search with location preferences. International Conference on Data Engineering (Vol. 41, pp. 701-712). IEEE.
[23] Leung, W. T., Lee, D. L., & Lee, W. C. (2013). PMSE: a personalized mobile search engine. IEEE Transactions on Knowledge & Data Engineering, 25(4), 820-834.
[24] Divya, R., & Robin, C. R. R. (2014). Onto-search: An ontology based personalized mobile search engine. International Conference on Green Computing Communication and Electrical Engineering (pp. 1-4). IEEE.
[25] Gardarin, G., Kou, H., Zeitouni, K., Meng, X., & Wang, H. (2008). SEWISE: An ontology-based web information search engine. Natural Language Processing and Information Systems, International Conference on Applications of Natural Language to Information Systems (pp. 106-119).
[26] Söderlind, P. (1998). Nominal interest rates as indicators of inflation expectations. Scandinavian Journal of Economics, 100(2), 457-472.
[27] Guo, F., Liu, C., Kannan, A., Minka, T., Taylor, M., Wang, Y. M., et al. (2009). Click chain model in web search. International Conference on World Wide Web (pp. 11-20). ACM.
[28] Wendt, C., & Lewis, W. (2009). Improving the quality of a customized SMT system using shared training data.
[29] Zhang, Y., Yang, X., & Mei, T. (2014). Image search reranking with query-dependent click-based relevance feedback. IEEE Transactions on Image Processing, 23(10), 4448.
[30] Cui, H., Wen, J. R., Nie, J. Y., & Ma, W. Y. (2002). Probabilistic query expansion using query logs. 325-332.
[31] Smyth, B., Balfe, E., Freyne, J., Briggs, P., Coyle, M., & Boydell, O. (2004). Exploiting query repetition and regularity in an adaptive community-based web search engine. User Modeling and User-Adapted Interaction, 14(5), 383-423.
[32] Burke, R., & Ramezani, M. (2011). Matching recommendation technologies and domains. Recommender Systems Handbook, 367-386.
[33] Abdullah, N. Y., Husin, H. S., Ramadhani, H., & Nadarajan, S. V. (2012). Pre-processing of query logs in web usage mining. 11(1), 82-86.

