1.1 INTRODUCTION
The question-answering site Stack Overflow allows users to assign tags to questions in order to
make them easier for other people to find. Furthermore, experts on a certain topic can subscribe
to tags to receive digests of new questions for which they might have an answer. It is therefore in
the interest of both the original poster and the people interested in the answer that a question
gets assigned appropriate tags.
Stack Overflow is the largest, most trusted online community for developers to learn, share their
programming knowledge, and build their careers. It is something that almost every programmer
uses in one way or another. Each month, over 50 million developers come to Stack Overflow to
learn, share their knowledge, and build their careers. It features questions and answers on a wide
range of topics in computer programming.
The website serves as a platform for users to ask and answer questions, and, through membership
and active participation, to vote questions and answers up or down and edit questions and
answers in a fashion like a wiki or Digg. As of April 2014, Stack Overflow has over 4,000,000
registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the type of
tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript,
C#, PHP, Android, jQuery, Python and HTML.
Stack Overflow allows users to manually assign between one and five tags to a posting. Users
are encouraged to use existing tags, which are suggested as they type the first letter(s) of a tag, but
they are also allowed to create new ones, so the set of possible tags is unbounded. While manual
tagging generally works well for experienced users, it can be challenging for inexperienced users
to find appropriate tags for their question, and by letting users add new tags it is likely that
different users use different orthographic versions of tags that mean the same thing, such as
“php5” and “php-5”.
For these reasons it is desirable to have a system that can either automatically tag questions or
suggest relevant tags to a user based on the question content. In this project we develop a
predictor that assigns tags based on the content of a question. More formally, given a
question q containing a title consisting of n words a1, ..., an and a body consisting of m words
b1, ..., bm, we want to assign 1 ≤ k ≤ 5 tags t1, ..., tk from a limited list of tags T.
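The task can be sketched as a small function. This is a hypothetical keyword-overlap scorer for illustration only; the function name, arguments and scoring rule are assumptions, not the trained model described later.

```python
def predict_tags(title_words, body_words, tag_list, k_max=5):
    """Return between 1 and k_max tags from tag_list, ranked by how often
    each tag appears among the question's title and body words."""
    words = title_words + body_words
    counts = {t: words.count(t) for t in tag_list}
    ranked = sorted(tag_list, key=lambda t: counts[t], reverse=True)
    chosen = [t for t in ranked if counts[t] > 0][:k_max]
    return chosen or ranked[:1]  # always assign at least one tag

tags = predict_tags(["python", "list"],
                    ["how", "to", "sort", "a", "list", "in", "python"],
                    ["python", "java", "list", "sorting"])
```

The real predictor replaces the overlap score with a learned model, but the interface is the same: question words in, one to five tags out.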
CHAPTER 1- INTRODUCTION
1.2 EXISTING SYSTEM
There are many methods that can predict the labels for a given question title and description
within a certain time limit.
The existing system operates under a time constraint, so the labels must be predicted within that
limit; as a result, its precision and recall values are poor, and the functioning of Stack Overflow
is affected.
1.2.1 LIMITATIONS
• The accuracy of the predicted labels is low because of poor precision and recall values.
Due to this, the entire working of Stack Overflow is affected.
• Users will not be able to see personalized questions, which reduces the chance of those
questions being answered.
• Predicting accurate tags becomes difficult, and it takes more time to predict the tags.
1.3 PROPOSED SYSTEM
In our proposed model we concentrate mainly on the accuracy of the model rather than on strict
latency constraints. We use logistic regression with a one-vs-rest classifier to predict the labels
for a given title and description.
The time constraint is eliminated in this model, better precision and recall values can be
obtained, and the correct tags or labels can be predicted without difficulty.
1.3.1 ADVANTAGES
• Due to better precision and recall values, the accuracy of predicting relevant tags increases,
so questions posted by users can be answered in less time.
• The time needed to find the tags is comparatively low compared with other models.
• Accurate tags are predicted, as expected by the user.
CHAPTER 2- LITERATURE REVIEW
TITLE: Predicting Tags for Stack Overflow Questions Using Different Classifiers
DESCRIPTION:
The adequacy of any online education forum depends on the user’s experience, based on the
users’ interests and demands. So, it is a fundamental requirement to design a system that takes
users’ interests into account when putting content online. Many online websites such as Quora,
GeeksforGeeks, and Stack Exchange hold large-scale data in the form of users’ questions and
answers. The large-scale datasets available on these websites can be mined and pre-processed
using text classification and used to understand users’ queries regarding a topic. The information
provided should be relevant to the user’s interest. We propose a system that takes a significant
amount of data from a website and uses different approaches to predict the tags for Stack
Overflow posts, achieving better accuracy for the 1000 most frequent tags.
JOURNAL: ELSEVIER, 2009.
DESCRIPTION:
The Stack Overflow (SO) forum is a widely used platform for people to interact on topics related
to computer programming languages. With more than three lakh users and ten lakh questions,
Stack Overflow is emerging as the biggest Q&A forum for programmers. The questions on Stack
Overflow cover a wide range of topics and are categorized using appropriate tags. Currently the
tags are entered manually by users depending on their judgment. Since there is a huge number
of tags, it is often a cumbersome process to search for the correct tags. It may be useful to have
an auto-tagging system that suggests tags to users depending on the content of the question. This
paper presents a hybrid auto-tagging system for SO. The auto-tagging system includes (a) a
programming-language detection system and (b) an SVM-based question classification system.
This system suggests tags once a user enters a question.
JOURNAL: ELSEVIER, 2016
DESCRIPTION:
Community Question Answering (CQA) websites are growing in popularity as a way of
providing and searching for information. CQA sites attract users because they provide a direct
and rapid way to find the desired information. As recognizing good questions can improve CQA
services and the user’s experience, the current study focuses on question quality. Specifically, we
predict question quality and investigate the features that influence it. The influence of the
question tags, the length of the question title and body, the presence of a code snippet, the user’s
reputation and the terms used to formulate the question are tested.
CHAPTER 3- PROBLEM STATEMENT AND REQUIREMENT
SPECIFICATIONS
3.1 PROBLEM STATEMENT
Suggest tags based on the content of a question posted on Stack Overflow.
3.2 APPLICATIONS
3.3 LIMITATIONS
The requirements gathering process takes as its input the goals identified in the high-level
requirements section of the project plan. Each goal will be refined into a set of one or more
requirements. These requirements define the major functions of the intended application, define
operational data areas and reference data areas, and define the initial data entities. Major
functions include critical processes to be managed, as well as mission critical inputs, outputs and
reports.
• The system should have the capability to pre-process the text so that only important and
relevant inputs are given to the model.
• The system should provide text-parser functions that can take the whole text and
separate it into sentences, paragraphs and words.
• Performance
• Scalability
• Supportability
• Compatibility
3.5 SOFTWARE REQUIREMENTS
The software requirements report defines the particulars of the framework. A key benefit of
developing a software requirements specification is that it streamlines the development process.
A developer working from the software requirements specification ideally has all their questions
about the application answered and can start to develop. The software requirements are very
valuable in assessing the project cost, the complexity involved in the project, arranging the
required tools for development, etc. Below are the software requirements of the current project,
where the various requirements such as the operating system, tools and software packages
required for implementing this project are mentioned.
3.6 HARDWARE REQUIREMENTS
The hardware requirements describe the resources needed for implementation of the project.
Many hardware resources, such as processors and storage devices, are required for implementing
a project. This gives the software developers a brief idea of whether the present system can
support such requirements or whether more hardware resources must be deployed to support
the project.
The hardware requirements for this project work are mentioned below:
• RAM : 16GB
CHAPTER 4- METHODOLOGY
4.1 BLOCK DIAGRAM
• The Stack Overflow dataset consists of over 6M data points, with id, title, body and tags
as attributes. This dataset is loaded into a database and cleaned by removing any duplicates
present in the dataset.
• The entire dataset is divided randomly into train and test datasets. If the time stamp for
each row in the dataset is given, we can divide them based on the time stamp. Then this
dataset is cleaned, and preprocessing is done.
• Next, the labels in the dataset are analyzed so that we can reduce the size of the dataset and
train the model in a shorter duration of time. We consider the top 15 or 20 most occurring
tags to reduce the number of models to build.
• Then we convert the multilabel classification into binary or multiclass classification and
apply linear algorithms such as logistic regression with a OneVsRest classifier and
linear SVM. We avoid algorithms like random forest because of their poor
performance with high-dimensional data.
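The last step above, one binary classifier per tag, can be sketched in miniature. The toy `KeywordLearner` base model below is an assumption for illustration only; the project itself uses scikit-learn's logistic regression as the base model.

```python
class KeywordLearner:
    """Toy binary learner: remember which words co-occur with the positive class."""
    def fit(self, docs, y):
        self.pos_words = set()
        for doc, label in zip(docs, y):
            if label == 1:
                self.pos_words.update(doc.split())
        return self

    def predict(self, docs):
        return [1 if set(d.split()) & self.pos_words else 0 for d in docs]


class OneVsRest:
    """Train one binary classifier per label column of the multilabel matrix Y."""
    def __init__(self, base_learner):
        self.base_learner = base_learner

    def fit(self, docs, Y):
        n_labels = len(Y[0])
        self.models = [self.base_learner().fit(docs, [row[j] for row in Y])
                       for j in range(n_labels)]
        return self

    def predict(self, docs):
        cols = [m.predict(docs) for m in self.models]
        return [list(row) for row in zip(*cols)]


docs = ["sort a list in python", "null pointer exception in java"]
Y = [[1, 0], [0, 1]]  # tag columns: [python, java]
clf = OneVsRest(KeywordLearner).fit(docs, Y)
pred = clf.predict(["python list question"])
```

Swapping `KeywordLearner` for a logistic regression gives the project's actual setup: k tags means k independent binary models.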
4.2 DATAFLOW DIAGRAM
[Dataflow diagram: Test Data → Model → Prediction]
• The Stack Overflow dataset is loaded into an SQLite database and cleaned by removing the
duplicate rows. The dataset consists of four attributes: id, title, body and tags. We analyze
the tags to find the number of unique tags and the number of times each tag has appeared.
• Cleaning and preprocessing of questions are done by reducing the dataset to 1M data
points, separating code snippets from the body, removing special characters from the
question title and description, removing stop words, removing HTML tags, converting all
characters into lowercase and using the Snowball stemmer to stem the words.
• We convert the tags of the multilabel problem into binary or multiclass classification. Then
we split the new dataset into test and train datasets. We then featurize the data and apply
logistic regression with a OneVsRest classifier to get the accuracy of the tags
predicted by the created model.
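The cleaning steps in the second bullet can be condensed into a standard-library sketch. The tiny stopword set and the crude suffix-stripping stand-in for the Snowball stemmer are simplifying assumptions; the project uses NLTK's stopword list and stemmer.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "to", "how", "in", "of"}  # illustrative subset


def strip_code(body):
    """Separate <code> snippets from the rest of the body."""
    code = re.findall(r"<code>(.*?)</code>", body, flags=re.DOTALL)
    text = re.sub(r"<code>.*?</code>", " ", body, flags=re.DOTALL)
    return text, code


def preprocess(title, body):
    text, code = strip_code(body)
    text = re.sub(r"<.*?>", " ", text)               # remove remaining HTML tags
    text = title + " " + text
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()  # special chars -> space, lowercase
    words = [w for w in text.split() if w not in STOP_WORDS]
    stemmed = [w[:-3] if w.endswith("ing") else w for w in words]  # crude stemming stand-in
    return " ".join(stemmed), code


text_out, code = preprocess("How to sort?", "<p>Sorting a list</p><code>x.sort()</code>")
```

The returned code snippets are kept separately, mirroring the pipeline's decision to treat code and prose differently.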
4.3.1 CLASS DIAGRAM
A class diagram is an illustration of the relationships and source-code dependencies among
classes in the Unified Modeling Language (UML). In this context, a class defines the methods
and variables in an object, which is a specific entity in a program or the unit of code representing
that entity. Class diagrams are useful in all forms of object-oriented programming (OOP).
In our class diagram, there are five classes: StackoverflowDB, SQLiteDB, Scikit-learn,
Preprocess and Model. There is a dependency relationship between SQLiteDB and Model, and
a relationship from the StackoverflowDB, Scikit-learn and Preprocess classes to the SQLiteDB
class, where each class has its own attributes and operations.
4.3.2 USE CASE DIAGRAM
A use case diagram is a graphic depiction of the interactions among the elements of a system.
A use case is a methodology used in system analysis to identify, clarify and organize system
requirements. The relationships between and among the actors and the use cases are described.
In our use case diagram, the actors are: User, NLTK, Final Model and Scikit-learn Pkg. These
actors interact with the different use cases within the system.
4.3.3 SEQUENCE DIAGRAM
Sequence diagrams are interaction diagrams that detail how operations are carried out. They
capture the interaction between objects in the context of collaboration. Sequence diagrams are
time-focused: they show the order of the interaction visually by using the vertical axis of the
diagram to represent time, i.e. what messages are sent and when.
In our sequence diagram, the user supplies the dataset, which is loaded and cleaned; the dataset
is then preprocessed using NLTK. This preprocessed dataset is then reduced according to the
specifications given by the user using the Scikit-learn package, in which we convert the tags of
the multilabel problem to binary or multiclass classification. We featurize the data and train the
model by applying logistic regression, which builds a model to predict the output.
4.3.4 STATE CHART DIAGRAM
A state diagram is a diagram used to describe the behavior of a system considering all the possible
states of an object when an event occurs. This behavior is represented and analyzed in a series of
events that occur in one or more possible states. Each diagram represents objects and tracks the
various states of these objects throughout the system.
The state chart diagram shows how the transition takes place from one state to another in our
diagram. There are different states in our state chart diagram, such as test and train dataset,
remove duplicates, analyze tags, preprocessed data, reduced labels, model and output. When an
activity or event such as a request occurs, it results in a state transition to the next state.
4.3.5 ACTIVITY DIAGRAM
The activity diagram is an important UML diagram for describing the dynamic aspects of the
system. An activity diagram is basically a flowchart representing the flow from one activity to
another. An activity can be described as an operation of the system. The control flow is
drawn from one operation to another.
CHAPTER 5- IMPLEMENTATION
5.1 ALGORITHM
There are three main techniques for handling a multi-label classification problem:
1. Problem Transformation
2. Adapted Algorithm
3. Ensemble Approaches
Problem Transformation (Binary Relevance):
Binary relevance is the simplest technique; it basically treats each label as a separate single-class
classification problem. For example, consider a dataset where X is the independent feature and
the Y’s are the target variables. In binary relevance, this problem is broken into 4 different
single-class classification problems, as shown in the figure below.
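The decomposition can be shown concretely: a 4-label target matrix becomes 4 binary target vectors, one per label, and each vector then trains its own single-label classifier.

```python
# Binary relevance in miniature: split a multilabel target matrix into one
# binary target vector per label (here 4 labels -> 4 single-label problems).

Y = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]


def binary_relevance_targets(Y):
    """Return one binary target vector per label column of Y."""
    return [[row[j] for row in Y] for j in range(len(Y[0]))]


problems = binary_relevance_targets(Y)
# problems[0] is the target vector for label 1, problems[1] for label 2, and so on.
```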
Adapted Algorithm:
An adapted algorithm, as the name suggests, adapts the algorithm to directly perform multi-label
classification, rather than transforming the problem into different subsets of problems.
For example, the multi-label version of kNN is MLkNN. So, let us quickly implement this on
our randomly generated dataset.
Due to a memory error, we instead use logistic regression with a OneVsRest classifier.
5.2 MODULES
In this module we load the data into an SQLite database and clean the data by removing the
duplicates from the Stack Overflow dataset. We analyse the tags to reduce the size of the dataset.
5.3.1 NLTK
Natural Language Toolkit (NLTK) is a library in Python which provides a base for building
programs and classifying data. NLTK is a leading platform for building Python programs
to work with human language data. It provides easy-to-use interfaces to over 50 corpora and
lexical resources such as WordNet, along with a suite of text-processing libraries for
classification, tokenization, stemming, tagging, parsing and semantic reasoning, and wrappers
for industrial-strength NLP libraries. This toolkit plays a key role in transforming the text of the
questions into a format from which features can be extracted.
NLTK provides various functions used in pre-processing the data so that the available data
becomes fit for mining and extracting features. NLTK supports various methods for simplifying
tasks before machine learning algorithms are applied.
5.3.2 NUMPY
NumPy is the fundamental package for scientific computing with Python. It contains, among
other things:
• a powerful N-dimensional array object
• sophisticated (broadcasting) functions
• tools for integrating C/C++ and Fortran code
• useful linear algebra, Fourier transform, and random number capabilities
Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data types can be defined. This allows NumPy to seamlessly
and speedily integrate with a wide variety of databases.
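A minimal illustration of the N-dimensional array object and vectorized operations (a generic example, not project code):

```python
import numpy as np

# Build a small 2-D array and exercise vectorized operations.
a = np.arange(6).reshape(2, 3)   # [[0, 1, 2], [3, 4, 5]]
col_sums = a.sum(axis=0)         # sum down each column
doubled = a * 2                  # elementwise broadcasting, no explicit loop
```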
Installing NumPy:
Most major projects upload official packages to the Python Package index. They can be installed
on most operating systems using Python’s standard pip package manager.
Note that you need to have Python and pip already installed on your system.
5.3.3 PANDAS
Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with structured (tabular, multidimensional, potentially heterogeneous) and time
series data both easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, real world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis / manipulation tool available
in any language.
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column
labels
• Any other form of observational / statistical data sets. The data actually need not be
labelled at all to be placed into a pandas data structure.
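A small example of a pandas structure shaped like the project's records (the column names here are illustrative):

```python
import pandas as pd

# A tiny frame shaped like the project's (id, title, tags) records.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "title": ["sort a list", "read a file", "sort a dict"],
    "tags": ["python sorting", "python io", "python sorting"],
})
tag_counts = df["tags"].value_counts()  # how often each tag string occurs
```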
Installing Pandas:
Like NumPy, pandas can be installed from the Python Package Index using pip.
5.3.4 JUPYTER NOTEBOOK
The Jupyter Notebook is an interactive computing environment that enables users to author
notebook documents that include: live code, interactive widgets, plots, narrative text, equations,
images and video.
These documents provide a complete and self-contained record of a computation that can be
converted to various formats and shared with others using email, Dropbox, version control
systems (like git/GitHub) or nbviewer.jupyter.org.
if not os.path.isfile('train.db'):
    start = datetime.now()
    disk_engine = create_engine('sqlite:///train.db')
    chunksize = 180000
    j = 0
    index_start = 1
    # read the raw csv in chunks and append each chunk to the SQLite table
    for df in pd.read_csv('Train.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
        df.index += index_start
        j += 1
        print('{} rows'.format(j * chunksize))
        df.to_sql('data', disk_engine, if_exists='append')
        index_start = df.index[-1] + 1
if os.path.isfile('train.db'):
    start = datetime.now()
    con = sqlite3.connect('train.db')
    num_rows = pd.read_sql_query("SELECT count(*) FROM data", con)
    print(num_rows['count(*)'].values[0])
    con.close()
else:
    print("Please download the train.db file from drive or run the above cell to generate the train.db file")
6034196
if os.path.isfile('train.db'):
    start = datetime.now()
    con = sqlite3.connect('train.db')
    con.close()
else:
    print("Please download the train.db file from drive or run the first cell to generate the train.db file")
df_no_dup.head()
df_no_dup.cnt_dup.value_counts()
1 2656284
2 1272336
3 277575
4 90
5 25
6 5
start = datetime.now()
df_no_dup.head()
df_no_dup.tag_count.value_counts()
3 1206157
2 1111706
4 814996
1 568298
5 505158
if not os.path.isfile('train_no_dup.db'):
    disk_dup = create_engine("sqlite:///train_no_dup.db")
    no_dup.to_sql('no_dup_train', disk_dup)
# This method seems more appropriate to work with this much data.
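The duplicate-removal idea can be sketched with an in-memory SQLite database; a GROUP BY collapses exact duplicates and counts how often each occurred. The table and column names below are illustrative, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE data (title TEXT, body TEXT, tags TEXT)")
rows = [("q1", "b1", "python"), ("q1", "b1", "python"), ("q2", "b2", "java")]
cur.executemany("INSERT INTO data VALUES (?,?,?)", rows)

# GROUP BY keeps one row per unique (title, body, tags) and records the duplicate count.
cur.execute("""SELECT title, body, tags, COUNT(*) AS cnt_dup
               FROM data GROUP BY title, body, tags""")
no_dup = cur.fetchall()
conn.close()
```

Pushing the dedup into SQL avoids loading all 6M rows into memory at once.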
tag_counts = tag_df_sorted['Counts'].values
plt.plot(tag_counts)
plt.grid()
plt.xlabel("Tag number")
plt.show()
plt.plot(tag_counts[0:10000])
plt.grid()
plt.xlabel("Tag number")
plt.show()
print(len(tag_counts[0:10000:25]), tag_counts[0:10000:25])
plt.plot(tag_counts[0:1000])
plt.grid()
plt.xlabel("Tag number")
plt.show()
print(len(tag_counts[0:1000:5]), tag_counts[0:1000:5])
plt.plot(tag_counts[0:500])
plt.grid()
plt.xlabel("Tag number")
plt.show()
print(len(tag_counts[0:500:5]), tag_counts[0:500:5])
plt.plot(tag_counts[0:100], c='b')
plt.grid()
plt.xlabel("Tag number")
plt.legend()
plt.show()
print(len(tag_counts[0:100:5]), tag_counts[0:100:5])
20 [331505 221533 122769 95160 62023 44829 37170 31897 26925 24537
22429 21820 20957 19758 18905 17728 15533 15097 14884 13703]
lst_tags_gt_10k = tag_df[tag_df.Counts>10000].Tags
lst_tags_gt_100k = tag_df[tag_df.Counts>100000].Tags
Observations:
1. There are a total of 153 tags which are used more than 10000 times.
2. 14 tags are used more than 100000 times.
3. The most frequent tag (i.e. c#) is used 331505 times.
4. Since some tags occur much more frequently than others, the micro-averaged F1-score is
the appropriate metric for this problem.
tag_quest_count = tag_dtm.sum(axis=1).tolist()
# Converting a list of lists into a single list: [[3], [4], [2], [2], [3]] becomes [3, 4, 2, 2, 3]
print(tag_quest_count[:5])
[3, 4, 2, 2, 3]
sns.countplot(tag_quest_count, palette='gist_rainbow')
plt.xlabel("Number of Tags")
plt.ylabel("Number of questions")
plt.show()
Observations:
start = datetime.now()
tup = dict(result.items())
wordcloud = WordCloud(
    width=1600,
    height=800,
).generate_from_frequencies(tup)
fig = plt.figure(figsize=(30, 20))
plt.imshow(wordcloud)
plt.axis('off')
plt.tight_layout(pad=0)
fig.savefig("tag.png")
plt.show()
Observations:
A look at the word cloud shows that "c#", "java", "php", "asp.net", "javascript", "c++" are some
of the most frequent tags.
i=np.arange(30)
tag_df_sorted.head(30).plot(kind='bar')
plt.xticks(i, tag_df_sorted['Tags'])
plt.xlabel('Tags')
plt.ylabel('Counts')
plt.show()
Observations:
Preprocessing
def striphtml(data):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(data))
    return cleantext

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by db_file
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return None
def create_table(conn, create_table_sql):
    """ create a table from the create_table_sql statement
    :param conn: Connection object
    :param create_table_sql: a CREATE TABLE statement
    :return:
    """
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)

def checkTableExists(dbcon):
    cursr = dbcon.cursor()
    str = "select name from sqlite_master where type='table'"
    table_names = cursr.execute(str)
    tables = table_names.fetchall()
    print(tables[0][0])
    return(len(tables))
def create_database_table(database, query):
    conn = create_connection(database)
    if conn is not None:
        create_table(conn, query)
        checkTableExists(conn)
    else:
        print("Error! cannot create the database connection.")
    conn.close()

create_database_table("Processed.db", sql_create_table)
QuestionsProcessed
start = datetime.now()
read_db = 'train_no_dup.db'
write_db = 'Processed.db'
if os.path.isfile(read_db):
    conn_r = create_connection(read_db)
    reader = conn_r.cursor()
if os.path.isfile(write_db):
    conn_w = create_connection(write_db)
    tables = checkTableExists(conn_w)
    writer = conn_w.cursor()
    if tables != 0:
        writer.execute("DELETE FROM QuestionsProcessed WHERE 1")  # clear any rows left from a previous run
QuestionsProcessed
We create a new database to store the sampled and pre-processed questions.
start = datetime.now()
preprocessed_data_list = []
questions_with_code = 0
len_pre = 0
len_post = 0
questions_proccesed = 0
# iterate over the (title, question, tags) rows fetched by the reader cursor
for title, question, tags in reader:
    is_code = 0
    if '<code>' in question:
        questions_with_code += 1
        is_code = 1
    x = len(question) + len(title)
    len_pre += x
    # separate the code snippets from the rest of the body
    code = str(re.findall(r'<code>(.*?)</code>', question, flags=re.DOTALL))
    question = re.sub(r'<code>(.*?)</code>', '', question, flags=re.DOTALL)
    question = striphtml(question.encode('utf-8'))
    title = title.encode('utf-8')
    question = str(title) + " " + str(question)
    question = re.sub(r'[^A-Za-z]+', ' ', question)
    words = word_tokenize(str(question.lower()))
    # Removing all single-letter words and stopwords from the question, except for the letter 'c'
    question = ' '.join(str(stemmer.stem(j)) for j in words if j not in stop_words and (len(j) != 1 or j == 'c'))
    len_post += len(question)
    tup = (question, code, tags, x, len(question), is_code)
    questions_proccesed += 1
    writer.execute("insert into QuestionsProcessed(question,code,tags,words_pre,words_post,is_code) values (?,?,?,?,?,?)", tup)
    if (questions_proccesed % 100000 == 0):
        print("number of questions completed =", questions_proccesed)
no_dup_avg_len_pre = (len_pre * 1.0) / questions_proccesed
no_dup_avg_len_post = (len_post * 1.0) / questions_proccesed
# don't forget to close the connections, or else you will end up with locks
conn_r.commit()
conn_w.commit()
conn_r.close()
conn_w.close()

if os.path.isfile(write_db):
    conn_r = create_connection(write_db)
    reader = conn_r.cursor()
    reader.execute("SELECT question FROM QuestionsProcessed LIMIT 10")
    print('='*100)
    row = reader.fetchone()
    print(row)
preprocessed_data.head()
number of dimensions : 2
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True)  # binary bag of tags
multilabel_y = vectorizer.fit_transform(preprocessed_data['tags'])
We will sample the number of tags instead of considering all of them (due to limitation of
computing power).
def tags_to_choose(n):
    t = multilabel_y.sum(axis=0).tolist()[0]
    sorted_tags_i = sorted(range(len(t)), key=lambda i: t[i], reverse=True)
    multilabel_yn = multilabel_y[:, sorted_tags_i[:n]]
    return multilabel_yn
multilabel_yx = tags_to_choose(5500)
print("number of tags in the sample:", multilabel_yx.shape[1], "out of", multilabel_y.shape[1])
total_size=preprocessed_data.shape[0]
train_size=int(0.80*total_size)
x_train=preprocessed_data.head(train_size)
x_test=preprocessed_data.tail(total_size - train_size)
y_train = multilabel_yx[0:train_size,:]
y_test = multilabel_yx[train_size:total_size,:]
Featurizing data
start = datetime.now()
vectorizer = TfidfVectorizer(ngram_range=(1, 3), max_features=200000, tokenizer=lambda x: x.split())  # TF-IDF over word 1-3 grams
x_train_multilabel = vectorizer.fit_transform(x_train['question'])
x_test_multilabel = vectorizer.transform(x_test['question'])
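Featurization commonly turns each question into a TF-IDF vector: each word is weighted by its frequency in the question and by how rare it is across the corpus. A tiny standard-library sketch of the idea (the project's scikit-learn vectorizer and its exact weighting differ):

```python
import math

def tfidf(corpus):
    """Toy TF-IDF: term frequency times a simple log inverse-document-frequency."""
    vocab = sorted({w for doc in corpus for w in doc.split()})
    n = len(corpus)
    df = {w: sum(w in doc.split() for doc in corpus) for w in vocab}
    idf = {w: math.log(n / df[w]) + 1 for w in vocab}  # simplified idf (an assumption)
    vectors = []
    for doc in corpus:
        words = doc.split()
        vectors.append([words.count(w) / len(words) * idf[w] for w in vocab])
    return vocab, vectors

vocab, vectors = tfidf(["sort list python", "read file python"])
```

Words that appear in every question (here "python") get the lowest idf weight, so the distinctive words dominate each vector.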
# classifier = LabelPowerset(GaussianNB())
"""
classifier = MLkNN(k=21)
# train
classifier.fit(x_train_multilabel, y_train)
# predict
predictions = classifier.predict(x_test_multilabel)
print(accuracy_score(y_test, predictions))
print(metrics.hamming_loss(y_test, predictions))
"""
# this will take a long time; instead of re-running it, download the lr_with_equal_weight.pkl file and use it to predict
classifier = OneVsRestClassifier(LogisticRegression(penalty='l1'), n_jobs=-1)
classifier.fit(x_train_multilabel, y_train)
predictions = classifier.predict(x_test_multilabel)
print("accuracy :",metrics.accuracy_score(y_test,predictions))
joblib.dump(classifier, 'lr_with_equal_weight.pkl')
Modeling with fewer data points (0.5M), more weight given to the title, and 500 tags only.
create_database_table("Titlemoreweight.db", sql_create_table)
read_db = 'train_no_dup.db'
write_db = 'Titlemoreweight.db'
train_datasize = 400000
if os.path.isfile(read_db):
    conn_r = create_connection(read_db)
    reader = conn_r.cursor()
if os.path.isfile(write_db):
    conn_w = create_connection(write_db)
    tables = checkTableExists(conn_w)
    writer = conn_w.cursor()
    if tables != 0:
        writer.execute("DELETE FROM QuestionsProcessed WHERE 1")  # clear any rows left from a previous run
Pre-processing of questions
start = datetime.now()
preprocessed_data_list = []
questions_with_code = 0
len_pre = 0
len_post = 0
questions_proccesed = 0
# loop over the fetched rows exactly as in the earlier pre-processing cell
for title, question, tags in reader:
    is_code = 0
    if '<code>' in question:
        questions_with_code += 1
conn_r = create_connection(write_db)
conn_r.commit()
conn_r.close()
preprocessed_data.head()
number of dimensions : 2
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True)
multilabel_y = vectorizer.fit_transform(preprocessed_data['tags'])
questions_explained = []
total_tags=multilabel_y.shape[1]
total_qs=preprocessed_data.shape[0]
for i in range(500, total_tags, 100):
    questions_explained.append(np.round(((total_qs - questions_explained_fn(i)) / total_qs) * 100, 3))
fig, ax = plt.subplots()
ax.plot(questions_explained)
xlabel = list(500+np.array(range(-50,450,50))*50)
ax.set_xticklabels(xlabel)
plt.xlabel("Number of tags")
plt.grid()
plt.show()
# you can choose any number of tags based on your computing power; the minimum is 500 (it covers 90% of the tags)
multilabel_yx = tags_to_choose(500)
x_train=preprocessed_data.head(train_datasize)
x_test=preprocessed_data.tail(preprocessed_data.shape[0] - 400000)
y_train = multilabel_yx[0:train_datasize,:]
y_test = multilabel_yx[train_datasize:preprocessed_data.shape[0],:]
start = datetime.now()
x_train_multilabel = vectorizer.fit_transform(x_train['question'])
x_test_multilabel = vectorizer.transform(x_test['question'])
start = datetime.now()
classifier = OneVsRestClassifier(LogisticRegression(penalty='l1'), n_jobs=-1)  # same one-vs-rest setup as before
classifier.fit(x_train_multilabel, y_train)
joblib.dump(classifier, 'lr_with_more_title_weight.pkl')
['lr_with_more_title_weight.pkl']
start = datetime.now()
classifier_2.fit(x_train_multilabel, y_train)
predictions_2 = classifier_2.predict(x_test_multilabel)
CHAPTER 6- RESULT ANALYSIS
6.1 OUTPUT
Body:
#include<iostream>
#include<stdlib.h>

int main()
{ cout << "Hello Dheeraj" << endl; }
PERFORMANCE METRIC
We use the F1 score; in the multi-class and multi-label case, this is the weighted average of the
F1 score of each class.
Micro F1 score: calculates metrics globally by counting the total true positives, false negatives
and false positives. This is a better metric when we have class imbalance.
Macro F1 score: calculates metrics for each label and finds their unweighted mean. This does
not take label imbalance into account.
Hamming loss: the fraction of labels that are incorrectly predicted.
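These three metrics can be computed by hand for binary label matrices; the sketch below follows the definitions above (the project itself uses sklearn.metrics):

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0


def micro_f1(y_true, y_pred):
    """Pool tp/fp/fn over every (question, label) cell, then take F1 once."""
    pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return f1(tp, fp, fn)


def macro_f1(y_true, y_pred):
    """Take F1 per label column, then average without weighting."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(rt[j] == 1 and rp[j] == 1 for rt, rp in zip(y_true, y_pred))
        fp = sum(rt[j] == 0 and rp[j] == 1 for rt, rp in zip(y_true, y_pred))
        fn = sum(rt[j] == 1 and rp[j] == 0 for rt, rp in zip(y_true, y_pred))
        scores.append(f1(tp, fp, fn))
    return sum(scores) / n_labels


def hamming_loss(y_true, y_pred):
    """Fraction of individual label cells predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    return wrong / total


y_true = [[1, 0], [1, 1]]
y_pred = [[1, 0], [0, 1]]
```

On this toy data, one of the four label cells is wrong, so the Hamming loss is 0.25, while micro F1 pools the counts across both labels before scoring.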
write_db = 'Processed.db'
if os.path.isfile(write_db):
    conn_r = create_connection(write_db)
    if conn_r is not None:
        conn_r.commit()
        conn_r.close()
preprocessed_data.head()
For 1M data points and 5500 tags, applying logistic regression with a OneVsRest classifier
gives the following results:
accuracy: 0.081965
macro f1 score: 0.0963020140154
micro f1 score: 0.374270748817
hamming loss: 0.00041225090909090907
Precision recall report:
precision recall f1-score support
Up to 5499
avg / total 0.53 0.26 0.33 530065
CHAPTER 7 – TESTING
Software testing is the evaluation of the software against requirements gathered from users and
system specifications. Testing is conducted at the phase level in the software development life
cycle or at the module level in program code. Software testing comprises validation and
verification.
7.1 VALIDATION
Validation is the process of examining whether the software satisfies the user requirements. It
is carried out at the end of the SDLC. If the software matches the requirements for which it was
made, it is validated.
7.2 VERIFICATION
Verification is the process of confirming that the software meets the business requirements and
is developed adhering to the proper specifications and methodologies.
1. Manual: this testing is performed without the help of automated testing tools. The
software tester prepares test cases for different sections and levels of the code, executes
the tests and reports the results to the manager. Manual testing is time- and resource-
consuming; the tester needs to confirm whether the right test cases are used. A major
portion of testing involves manual testing.
2. Automated: this testing procedure is done with the aid of automated testing tools. The
limitations of manual testing can be overcome using automated test tools.
TEST CASE 1:
1M data points and 5500 tags
This will take about 6-7 hours to run on a computer with 32 GB RAM and an i7 processor.
accuracy: 0.081965
macro f1 score: 0.0963020140154
micro f1 score: 0.374270748817
hamming loss: 0.00041225090909090907
Precision recall report:
precision recall f1-score support
8 0.70 0.40 0.50 6032
9 0.78 0.43 0.55 6020
10 0.86 0.62 0.72 5707
11 0.52 0.17 0.25 5723
12 0.55 0.10 0.16 5521
13 0.59 0.25 0.35 4722
14 0.61 0.22 0.32 4468
15 0.79 0.52 0.63 4536
Up to 5499
avg / total 0.53 0.26 0.33 530065
TEST CASE 2:
0.5M data points and 550 tags with more weight to title
Accuracy: 0.23623
Hamming loss 0.00278088
Micro-average quality numbers
Precision: 0.7216, Recall: 0.3256, F1-measure: 0.4488
Macro-average quality numbers
Precision: 0.5473, Recall: 0.2572, F1-measure: 0.3339
precision recall f1-score support
CHAPTER 8- CONCLUSION AND FUTURE WORK
CONCLUSION
In this work, an efficient machine learning model is built for predicting the labels of a Stack
Overflow question given its title and description. We used logistic regression with a one-vs-rest
classifier to predict the labels for a given title and description.
The Stack Overflow dataset is preprocessed in such a way that we can get multiple labels which
are relevant to the given title and description. If any one of the predicted labels is not related to
the title and description, the precision and recall values decrease, which indirectly affects user
satisfaction.
FUTURE WORK
This model is trained using logistic regression with a OneVsRest classifier. The dataset contains
the attributes id, title, body and tags. Future work can sort the tuples or rows in the dataset using
the time stamps of the questions, as newer questions contain newer tags. It can also determine
how effective the parts-of-speech breakdown is versus using the number of words in the body
and title. As we saw, most of the key features were related to post-body composition. Reducing
these to a single value may reveal just how useful the breakdown was in determining the post
status. These steps would help in predicting the tags accurately with high precision and recall
values.
REFERENCES
1. AlKofahi, J.M., Tamrawi, A., Nguyen, T.T., Nguyen, H.A., Nguyen, T.N.: Fuzzy set approach
for automatic tagging in evolving software. In: International Conference on Software
Maintenance, pp. 1–10. IEEE (2010)
2. Begel, A., DeLine, R., Zimmermann, T.: Social media for software engineering. In:
Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, pp.
33–38. ACM (2010)
3. Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word
representation. Conference on Empirical Methods in Natural Language Processing.
4. Klassen, M. and Paturi, N. (2010). Web document classification by keywords using random
forests. In Networked Digital Technologies, volume 88 of Communications in Computer and
Information Science, pages 256-261. Springer Berlin Heidelberg.
5. Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval.
Cambridge University Press, New York, NY, USA.
6. McCallum, A. K. (1999). Multi-label text classification with a mixture model trained by EM.
In AAAI 99 Workshop on Text Learning.
7. Loper, E. and Bird, S. (2002). NLTK: The Natural Language Toolkit. In Proceedings of the
ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language
Processing and Computational Linguistics, ETMTNLP ’02, pages 63-70, Stroudsburg, PA,
USA. Association for Computational Linguistics.