
Search Engine Ranking Based On Distribution of Keywords


Abstract

To increase traffic to websites, SEO (Search Engine Optimization) offers many options [1], but the process is costly and time consuming. This paper describes work on an initial model for handling some of the SEO factors that improve the distribution of keywords. Keyword distribution affects the preliminary rank assigned when a search engine crawler visits a website for the first time. The proposed model shows evidence of better utilization of these factors while avoiding classification as search engine spam. The model also presents users with candidate words and their values, based on key weights, for composing a new title, keyword list, or description, in order to increase the relatedness between the page content and the HTML meta tags and Title tag. The suggestions are based on lexical and semantic analysis methods.
Keywords: Keyword Distribution, Meta Tags, Ranking, Search Engine Optimization, Spam Recognition, Title and Keyword Generation
Introduction
Ranking is the most important element in web search engines. Searching for specific terms through a search engine therefore requires proper ranking to obtain good results. Proper ranking is also important in online advertisement. In general, there are two types of online advertising associated with internet search engines: paid placement and Search Engine Optimization (SEO). SEO is the process of improving the volume and quality of traffic to a web site from search engines via natural search results. Achieving a high rank in search engines depends on more than 200 parameters [2]. Site owners or expert users are able to customize and improve the rank if they manage all of these parameters and use them in the proper position and condition.

With respect to the parameters mentioned above, there is an obvious and logical relation between the Title tag, the Keywords and Description meta tags (together, TKD), and the web site content, which is called the body. As the distribution rate of TKD terms in the body gets higher, the ranking improves, because this relation and distribution contribute to a higher position on the search engine result page (SERP). On the other hand, search engines may also penalize pages, or exclude them from the index, if they detect search engine "spamming", for instance when one word is repeated hundreds of times on a page to inflate its frequency and propel the page higher in the listing. Search engines watch for common spamming methods in a variety of ways, including complaints from users. For these reasons, proper distribution of keywords is a crucial issue for a web page.
A considerable amount of research has been done in the SEO field. Designers and site owners now understand what they want, and that is a good rank; many SEO specialists have therefore designed and developed models to obtain satisfying results. To find the term frequency in a document, J. Ramos used TF-IDF (Term Frequency - Inverse Document Frequency) to determine word relevance in document queries [3]. A complete and meticulous work on key phrase extraction in HTML pages was performed by Humphreys, who introduced a novel key phrase extraction heuristic for web pages that requires no training [4] and is instead based on the assumption that most well-written web pages "suggest" key phrases through their internal structure. It is very fast and flexible, and its results compare favorably with the state of the art in key phrase extraction.

Some analysis of factors used in search engine ranking was carried out in "An Analysis of Factors Used in Search Engine Ranking", but its factors centred on word-length features, such as the number of bytes of the original document and the average term length, and did not involve the major factors that users can manipulate; these factors are therefore not very practical [5]. A model for generating keywords for search engine advertisements based on semantic similarity between terms has also been proposed [6]. To find and test factors affecting ranking in a specific search engine, Google, an analysis was provided through search engine optimization data [2]. Sh. Thurow collected and analyzed most of the factors that affect search engine ranking and worked on a valuable topic, the "dos and don'ts" of SEO, which seemed to be necessary [7].
R. Bhowmik reported an excellent experience extracting keywords from the abstracts and titles of academic papers, which is useful for documents with a small amount of text [8], although the main purpose differs from SEO. Recently, another system for automatic key phrase extraction, named KP-Miner, was developed; it works on two languages, English and Arabic [9]. This system is similar to Humphreys' solution [4] and does not need to be trained on a particular document set to achieve its task. In order to improve web advertisement, Xin et al. worked on factors that help managers make informed advertising decisions [12].

In this context, the proposed method introduces a model covering both semantic and lexical processing in different ways; it is also a form of keyword extraction for TKD enrichment.

The issue of proper distribution and customization of keywords in TKD
One part of a search engine, called the spider (crawler), collects information from websites, such as the content, meta tags, and title, and sends it to the search engine database for calculating the ranking of each website. Spiders may visit a site at any time, day or night, and the "return time" after which they come back to check the site depends on factors such as ranking, update period, and number of visitors. The first visit of a spider is very important for a website, because after it the spider determines when, and after what period, it will return. Therefore, a website owner should optimize the TKD before uploading the web pages for the first time, in order to make a positive impact on the spider's decision: that is, to shorten the return period and to obtain the highest possible rank from the first visit.

A corresponding distribution of keywords and phrases in the body, as well as in the TKD, is necessary for achieving a better rank. Meanwhile, changing a page's TKD will not necessarily help its ranking position if the page content has nothing to do with these parts: keywords need to be reflected in the page content too. The problem, therefore, is an inappropriate distribution of TKD terms in the body, and vice versa.

Proposed Model
Here, three of the most important factors were selected for this research [7, 10]: the "Title" tag, the "Keywords" meta tag, and the "Description" meta tag, which together are called TKD. One of the goals is to find a proper distribution of TKD terms in the content, which is called the body. This research supposes that the user is a semi-expert web designer or developer; it is therefore assumed that the description tag is meaningful and related to the body content. Although the processing results show which part needs to be changed or modified, the proposed model prefers that at least a minimal description and keyword list be provided.

The objective of this research is to develop a model for optimizing TKD via the body words in order to improve the preliminary ranking. This objective consists of four smaller goals: first, reducing the spider return time; second, obtaining the highest possible rank; third, checking TKD standards; and last, recognizing spam pages.

Methodology
The proposed model's methodology contains six steps, each of which yields part of the result for users. After an HTML file is imported as input, data pre-processing begins. Next, the character analysis section analyzes the words and characters; when this analysis is finished, the keyword analysis and generation step extracts candidate words from the body and from a semantic dictionary. The same extraction is performed for the Title tag. Finally, using the formula introduced below, the model presents suggestions to the user for further actions.

Data Pre-processing
Extracting the text from the HTML format is the first move in this section. After that, stop words and special or non-standard characters are recognized and removed. The next step is tokenizing and counting the words and sentences. At the end of this section, the word frequency calculation is done, as sketched below.
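As a rough illustration of this stage in Python, the sketch below extracts text, removes stop words, and counts words and sentences. The stop-word list is truncated for brevity and the tag stripping is deliberately naive, so this is an illustration of the described steps rather than the authors' implementation.

```python
import re
from collections import Counter

# Truncated stop-word list; a real run would load a full English list.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on"}

def preprocess(html: str):
    """Extract body text from HTML, remove stop words, count words/sentences."""
    # Drop script/style blocks, then strip the remaining tags (naive approach).
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", text)

    # Sentence segmentation follows the model's assumption of correct '.' use.
    sentences = [s.strip() for s in text.split(".") if s.strip()]

    # Tokenize, lowercase, and discard stop words / non-standard characters.
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    tokens = [t for t in tokens if t not in STOP_WORDS]

    return Counter(tokens), sentences  # word frequencies and sentence list
```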

Character analysis
This phase has three parts. First, the TKD is extracted. Second, the extracted parts are measured in characters and compared with the standards of the major search engines (Google, Yahoo and MSN). Third, stop words are removed from the extracted TKD.
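The length check can then be a simple comparison against per-engine limits. The numbers below are approximate display limits commonly cited for the period, included here purely as assumptions rather than figures taken from the paper.

```python
# Assumed character limits per engine; verify against current guidelines.
TKD_LIMITS = {
    "Google": {"title": 70, "description": 160, "keywords": 200},
    "Yahoo":  {"title": 72, "description": 161, "keywords": 200},
    "MSN":    {"title": 65, "description": 150, "keywords": 200},
}

def check_tkd_lengths(title: str, keywords: str, description: str) -> None:
    """Report each TKD part's length against every engine's assumed limit."""
    parts = {"title": title, "keywords": keywords, "description": description}
    for engine, limits in TKD_LIMITS.items():
        for part, text in parts.items():
            verdict = "OK" if len(text) <= limits[part] else "TOO LONG"
            print(f"{engine:7s} {part:12s} {len(text):4d}/{limits[part]:3d} {verdict}")
```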

Keywords analysis and generation

First, the model recognizes the keywords and description in the HTML source and extracts them. Then it extracts the words from the body and the description tag that are valuable enough to be used as keywords. The words extracted from the description are added to the keywords automatically, because it is assumed that users who entered a description were experts and wrote something related to the document. A rating value is determined against a threshold that can be set by the user in the proposed model. Next, the model creates a list of words from which the user can choose some to add to the keywords. This part enriches the keywords for the title analysis. Here the model may encounter words that have a proper distribution in the body but that the user has not used in the title or meta tags; the proposed model suggests these new keywords to the user for placement in the proper locations.

On the other hand, to improve the results, the model finds synonyms of each word in the current Title tag via the WordNet repository [11]. The model uses dictionary-based semantics, i.e., a semantic model, to find similar words for each keyword. The WordNet repository, developed at the Cognitive Science Laboratory of Princeton University, was exploited as the dictionary. Users can check the synonyms and add them if they are related to the content.
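The dictionary-based synonym lookup can be approximated with NLTK's WordNet interface; the following is one possible sketch, not the authors' code.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def synonyms(word: str) -> set[str]:
    """Return WordNet lemma names across all senses of the given word."""
    names = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            names.add(lemma.name().replace("_", " ").lower())
    names.discard(word.lower())  # a word is not its own synonym candidate
    return names

# Example: synonyms("random") yields candidates the user can vet against the content.
```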

Title analysis and generation


After customizing the keywords, the most important part of the research, and also the most significant factor in SEO, the Title tag, is processed. First, the Title tag is recognized and extracted. Then, to find the weights of the title words, the model calculates the real values of the words using a formula based on probability. In effect, the proposed model computes the weight of each title word in the content in order to compare it with the other words; this comparison helps to normalize the values of the words in the title. The formula has three components: the word ratio (1), the word-contained-sentences ratio (2), and the average presence of the word within the sentences that contain it (3). The final formula multiplies these three factors.

The formula is given in detail as follows. The word ratio is

    W_1 = TF / TC                                                  (1)

where TF is the term frequency of the word and TC is the total number of content words.

The word-contained-sentences ratio is

    W_2 = ( sum_{i=1..n} x_i ) / n,   x_i = 1 if the title word w occurs in line L_i, else 0   (2)

where w is a title word, L_i is the i-th content line, and n is the number of lines.

The average presence of the word within the sentences that contain it is

    W_3 = ( sum_{i : x_i = 1} c_i / k_i ) / ( sum_{i=1..n} x_i )   (3)

where k_i is the total number of words of line L_i and c_i is the number of occurrences of the word among the words J of line L_i.

The body weight (W_B) is then defined as:

    W_B = W_1 * W_2 * W_3                                          (4)
On the other hand, the synonyms of the title words are extracted from WordNet, and the aforementioned formula is calculated for them as well. In this way the keyword weight is obtained, and the improvement of the title and keywords becomes more accurate.
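A direct transcription of equations (1)-(4) might look like the sketch below, where the content is given as a list of sentences; the tokenization convention is a simplifying assumption.

```python
import re

def body_weight(word: str, lines: list[str]) -> float:
    """Compute W_B = W1 * W2 * W3 for one title word (equations (1)-(4))."""
    tokenized = [re.findall(r"[A-Za-z]+", ln.lower()) for ln in lines]
    word = word.lower()

    total_words = sum(len(toks) for toks in tokenized)
    tf = sum(toks.count(word) for toks in tokenized)
    if total_words == 0 or tf == 0:
        return 0.0

    w1 = tf / total_words                      # (1) word ratio
    containing = [toks for toks in tokenized if word in toks]
    w2 = len(containing) / len(tokenized)      # (2) word-contained-sentences ratio
    w3 = sum(toks.count(word) / len(toks)      # (3) average presence within the
             for toks in containing) / len(containing)  # containing sentences
    return w1 * w2 * w3                        # (4) body weight
```

The same function can be applied unchanged to each WordNet synonym to obtain the keyword weights described above.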

Spam recognition
On the other hand, one of the main reasons a webpage cannot obtain a suitable rank is that it has been recognized as a spam page. When a keyword is repeated more often than usual (relative to the number of words in the document), search engines mark the page as spam, in the sense that it is trying to gain a higher rank in the listings illegitimately. Therefore, our model finds such mistakes in a page by calculating the repetition of keywords and alerts the user to resolve the issue.

Spam recognition in this model is based on a threshold and a percentage: users can adjust the number of repetitions, or set a percentage, in the developed application and find any suspicious word.
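The repetition check itself is straightforward; in this sketch the 3% default share is an arbitrary illustration of the user-set threshold.

```python
from collections import Counter

def suspicious_words(word_counts: Counter, max_share: float = 0.03) -> dict:
    """Flag words whose share of all body words exceeds the threshold."""
    total = sum(word_counts.values())
    return {word: count for word, count in word_counts.items()
            if total and count / total > max_share}

# Example: suspicious_words(preprocess(html)[0], max_share=0.05)
```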

Methodology of the Model


This model uses a one-group pre-test/post-test design for the experimental procedure, meaning that the results are shown as a comparison between before and after using "Prime Solution". Twenty pages were randomly selected from the internet, although any HTML file could serve as a sample and be imported into the model.

To make the results tangible, statistics based on the initiated probability formula were computed before treatment and after it, and compared in bar charts and line charts.

Results and discussions


Four HTML files are presented as samples:

Table 1: HTML file samples

Doc. No. | Doc. Name    | Total body words | Total sentences
1        | Random.htm   | 2023             | 140
2        | Werewolf.htm | 3482             | 276
3        | Chemical.htm | 1127             | 75
4        | Sample.htm   | 60               | 7
The sentence recognition method uses the period character "." to recognize sentences; that is, the model assumes the web designer or website owner has used correct punctuation. The total word count is calculated after removing the stop words, because when stop words are not removed, the numerical results become very small and less meaningful. The table below shows whether a word is good for the title or should be changed or reinforced in the body. The model also determines the value of each word and whether it should be added to the Keywords meta tag or removed. These suggestions help users manage the title and meta tags, since every position there is valuable.
On the other hand, the comparison between word values is based on the largest value among the title words. The model rates the words with a nominal variable as Good (>=75%), Partially Good (>=50% and <75%), Fair (>=25% and <50%), or Bad (<25%), as implemented in the sketch below. This way of comparison helps to choose a normalized title and keywords that depend on the document itself, because the proposed model suggests words according to the comparison of each word's value within the document.
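Concretely, the nominal rating can be computed against the largest title-word value, as in this small sketch.

```python
def rate(value: float, best: float) -> str:
    """Map a word's weight to a nominal label relative to the best title word."""
    if best <= 0:
        return "Bad"
    share = value / best
    if share >= 0.75:
        return "Good"
    if share >= 0.50:
        return "Partially Good"
    if share >= 0.25:
        return "Fair"
    return "Bad"

# Example: rate(0.89, 2.55) -> 'Fair', matching "randomness" in Table 2.
```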
The results in Table 2 show how the model can improve a webpage's Title or Keywords meta tag.
Table 2: Proposed model results and suggestions

No. | Word        | Doc. No. | Freq. | Lines Freq. | Body weight | Key weight (%) | Title suggestion | Keyword suggestion
1   | Random      | 1        | 87    | 74          | 2.55        | 3.85           | Good             | Good
2   | randomness  | 1        | 61    | 38          | 0.89        | 3.85           | Fair             | Good
3   | chance      | 1        | 7     | 7           | 0.02        | 7.69           | Bad              | Good
4   | wall        | 1        | 0     | 0           | 0.00        | 0.00           | Bad              | Bad
10  | Werewolf    | 2        | 77    | 43          | 0.36        | 13.33          | Good             | Good
11  | Human       | 2        | 31    | 27          | 0.092       | 13.33          | Fair             | Good
12  | werewolves  | 2        | 48    | 35          | 0.18        | 6.66           | Partially Good   | Good
13  | lycanthropy | 2        | 18    | 12          | 0.023       | 6.66           | Bad              | Good
14  | Chemical    | 3        | 81    | 31          | 3.30        | 9.09           | Good             | Good
15  | Substance   | 3        | 30    | 19          | 0.72        | 0.00           | Bad              | Bad
16  | Chemistry   | 3        | 15    | 10          | 0.18        | 9.09           | Bad              | Good
17  | Elements    | 3        | 19    | 13          | 0.31        | 9.09           | Bad              | Good
5   | Computer    | 4        | 0     | 0           | 0.00        | 3.84           | Bad              | Good
6   | Science     | 4        | 0     | 0           | 0.00        | 3.84           | Bad              | Good
7   | Directed    | 4        | 1     | 1           | 0.26        | 3.84           | Good             | Good
8   | Reading     | 4        | 1     | 1           | 0.26        | 3.84           | Good             | Good
9   | Malaysia    | 4        | 0     | 0           | 0.00        | 0.00           | Bad              | Bad
The calculation method for the "key weight" is bidirectional: the model calculates the percentage of words in the Keywords meta tag that are either the same as, or synonyms of, the title words, and vice versa. Here, too, the WordNet repository is used for finding the synonyms, as in the sketch below.
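Under these assumptions, the bidirectional calculation could be sketched as the average of two directional overlap percentages, reusing the synonyms() helper sketched earlier; the exact way the two directions are combined is an assumption here.

```python
def key_weight(title_words: list[str], keyword_words: list[str]) -> float:
    """Average the two directional overlap percentages (exact or synonym match)."""
    def overlap(source: list[str], target: list[str]) -> float:
        target_set = set(target)
        # A source word counts if it appears in the target or shares a synonym.
        hits = sum(1 for w in source
                   if w in target_set or synonyms(w) & target_set)
        return 100.0 * hits / len(source) if source else 0.0

    return (overlap(keyword_words, title_words) +
            overlap(title_words, keyword_words)) / 2
```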
To check the standards, our application measures the length of each TKD part separately and compares it with the standards of three famous search engines: Google, Yahoo and MSN. This comparison is shown to the user as numbers and bar graphs.
The percentage of the total title weight relative to the body is also calculated before and after using the words suggested by the proposed model. This helps users choose the best combination of words to build an improved title. The results in Table 3 show that the proposed model improves the total title weight when the suggested keywords are used.
Table 3: Comparison of total title weight between the original and improved titles

Doc. No. | Doc. Name    | Original total title weight | Improved total title weight
1        | Random.htm   | 0.5189                      | 0.8596
2        | Werewolf.htm | 0.0939                      | 0.2228
3        | Chemical.htm | 0.8163                      | 1.4802
4        | Sample.htm   | 0.1131                      | 2.3491

Conclusion
The results were obtained by finding suitable words in the body and customizing the TKD with their synonyms. By also observing the standard lengths, the primary parameters that are directly involved in the SERP are improved; moreover, as these factors improve, the "spider return time" decreases. To expand the model, work is in progress on more than three factors, and on using a neural-network solution instead of the dictionary-based solution for semantic web purposes.
