
CHAPTER 1- INTRODUCTION

1.1 INTRODUCTION

The question-answering site Stack Overflow allows users to assign tags to questions in order to
make them easier for other people to find. Furthermore, experts on a certain topic can subscribe to
tags to receive digests of new questions they might be able to answer. It is therefore in the interest
of both the original poster and of people interested in the topic that a question is assigned
appropriate tags.

Stack Overflow is the largest, most trusted online community for developers to learn, share their
programming knowledge, and build their careers. It is something that virtually every programmer
uses in one way or another. Each month, over 50 million developers come to Stack Overflow to
learn, share their knowledge, and build their careers. It features questions and answers on a wide
range of topics in computer programming.

The website serves as a platform for users to ask and answer questions, and, through membership
and active participation, to vote questions and answers up or down and edit questions and
answers in a fashion like a wiki or Digg. As of April 2014, Stack Overflow has over 4,000,000
registered users, and it exceeded 10,000,000 questions in late August 2015. Based on the type of
tags assigned to questions, the top eight most discussed topics on the site are: Java, JavaScript,
C#, PHP, Android, jQuery, Python and HTML.

Stack Overflow allows users to manually assign between one and five tags to a posting. Users
are encouraged to use existing tags, which are suggested as they type the first letter(s) of a tag, but
they are also allowed to create new ones, so the set of possible tags is unbounded. While manual
tagging generally works well for experienced users, it can be challenging for inexperienced users
to find appropriate tags for their question, and by letting users add new tags it is likely that
different users will use different orthographic versions of tags that mean the same thing, such as
“php-5” and “php5”.

For these reasons it is desirable to have a system that can either automatically tag questions or
suggest relevant tags to a user based on the question content. In this project we develop a
predictor that can assign tags based on the content of a question. More formally, given a
question q with a title consisting of n words a1, ..., an and a body consisting of m words
b1, ..., bm, we want to assign 1 ≤ k ≤ 5 tags t1, ..., tk from a limited list of tags T.

1.2 EXISTING SYSTEM

There are many methods that can predict the labels for a given question title and description
within a certain time limit.

The existing system operates under a time constraint so that labels are predicted within a fixed
time limit; as a result, its precision and recall values are poor, which affects the functioning of
Stack Overflow.

1.2.1 LIMITATIONS

• The accuracy of the predicted labels is low because of poor precision and recall values,
which affects the entire working of Stack Overflow.
• Users will not see personalized questions, which reduces the chance of those questions
being answered.
• Predicting accurate tags becomes difficult, and prediction takes more time.
• It has low precision and recall values.

1.2.2 REAL WORLD / BUSINESS OBJECTIVES AND CONSTRAINTS

• Predict as many tags as possible with high precision and recall.
• Incorrect tags could impact customer experience on Stack Overflow.
• No strict latency constraints.

1.3 PROPOSED SYSTEM

In our proposed model we concentrate mainly on the accuracy of the model rather than on strict
latency constraints. We use logistic regression with a OneVsRest classifier to predict accurate
labels for a given title and description.

The time constraint is eliminated in this model, accurate precision and recall values can be
obtained, and the correct tags or labels can be predicted without difficulty.

1.3.1 ADVANTAGES

• Due to better precision and recall values, the accuracy of predicting relevant tags increases,
so questions posted by users can be answered in less time.
• It gives better precision and recall values.
• The time needed to find the tags is comparatively low when compared with other models.
• Accurate tags with exact predictions are obtained.
• Tags are predicted as expected by the user.

CHAPTER 2- LITERATURE REVIEW

TITLE: Predicting Tags for Stack Overflow Questions Using Different Classifiers

JOURNAL: IEEE Xplore, 2018.

AUTHORS: Taniya Saini, Sachin Tripathi

DESCRIPTION:
The adequacy of any online education forum depends on the user's experience, based on users'
interests and demands. It is therefore a fundamental requirement to design a system that takes
users' interests into account when putting content online. Many online websites such as Quora,
GeeksforGeeks, and Stack Exchange hold large-scale data in terms of users' questions and
answers. The large-scale datasets available on these websites can be mined and pre-processed
using text classification and used to understand users' queries regarding a topic. The information
provided should be relevant to the user's interest. We propose a system that takes a significant
amount of data from a website and uses it in different approaches to predict tags for Stack
Overflow posts, achieving better accuracy for the 1000 most frequent tags.

TITLE: Predicting Questions' Scores on Stack Overflow

JOURNAL: Elsevier, 2009.

AUTHORS: Haifa Alharthi, Djedjiga Outioua, Olga Baysal

DESCRIPTION:
Developer support forums are becoming more popular than ever. Crowdsourced knowledge is
an essential resource for many developers, yet it can raise concerns about the quality of the shared
content. Most existing research efforts address the quality of answers posted by Q&A community
members. In this paper, we explore the quality of questions and propose a method of predicting
the score of questions on Stack Overflow based on sixteen factors related to questions' format,
content and the interactions that occur in the post. We performed an extensive investigation to
understand the relationship between these factors and the scores of questions.


TITLE: A Hybrid Auto-tagging System for Stack Overflow Forum Questions

JOURNAL: ResearchGate, 2016.

AUTHORS: Marjia Sultana, Afrin Haider and Mohammad Shorif Uddin

DESCRIPTION:
The Stack Overflow (SO) forum is a widely used platform for people to interact on topics related
to computer programming languages. With more than three lakh users and ten lakh questions,
Stack Overflow is emerging as the biggest QA forum for programmers. The questions on Stack
Overflow cover a wide range of topics and are categorized using appropriate tags. Currently the
tags are entered manually by users depending on their judgment. Since there is a huge number of
tags, searching for the correct tags is often a cumbersome process. It may be useful to have an
auto-tagging system that suggests tags to users depending on the content of the question. In this
paper we present a hybrid auto-tagging system for SO. The system includes a) a programming
language detection system and b) an SVM-based question classification system. It suggests tags
once a user enters a question.

TITLE: Predicting the Quality of Questions on Stack Overflow

JOURNAL: Elsevier, 2016.

AUTHORS: Soodeh Nikan, Femida Gwadry-Sridhar, and Michael Bauer

DESCRIPTION:
Community Question Answering (CQA) websites have a growing popularity as a way of
providing and searching for information. CQA sites attract users because they provide a direct and
rapid way to find the desired information. As recognizing good questions can improve CQA
services and the user's experience, the current study focuses on question quality. Specifically, we
predict question quality and investigate the features that influence it. The influence of the
question tags, the length of the question title and body, the presence of a code snippet, the user's
reputation and the terms used to formulate the question are tested.

CHAPTER 3- PROBLEM STATEMENT AND REQUIREMENT
SPECIFICATIONS
3.1 PROBLEM STATEMENT

Suggest appropriate tags based on the content of a question posted on Stack Overflow.

3.2 APPLICATIONS

• Stack Overflow is used in learning through online courses.
• Maintenance of performance becomes easy.
• Stack Overflow can also be used as a connecting tool to communicate with expert people.
• Best results can be obtained by predicting tags for the questions given by the user.

3.3 LIMITATIONS

• Only experienced persons can give the answers.
• Beginners find it difficult to understand.
• Experienced people are only allowed to give answers based on their rating.

3.4 REQUIREMENT SPECIFICATIONS

The requirements gathering process takes as its input the goals identified in the high-level
requirements section of the project plan. Each goal will be refined into a set of one or more
requirements. These requirements define the major functions of the intended application, define
operational data areas and reference data areas, and define the initial data entities. Major
functions include critical processes to be managed, as well as mission critical inputs, outputs and
reports.

3.4.1 FUNCTIONAL REQUIREMENTS

• The system should have the capability to pre-process the text so that only important and
relevant inputs are given to the model.
• The system should provide text parser functions that can take the whole text and separate
it into sentences, paragraphs and words.

3.4.2 NON-FUNCTIONAL REQUIREMENTS

• Performance
• Scalability
• Supportability
• Compatibility

3.5 SOFTWARE REQUIREMENTS

The software requirements report defines the particulars of the framework. A key benefit of
developing a software requirements specification is in streamlining the development process. A
developer working from the software requirements specification ideally has all their questions
about the application answered and can start to develop. The software requirements are very
valuable in assessing the project cost, the complexity involved in the project, arranging the
required tools for development, etc. Below are the software requirements of the current project,
where requirements such as the operating system, tools and software packages required for
implementing this project are mentioned.

The software requirements for this project are as follows:

• Operating System : Linux/Windows 7/8/10

• Languages : Python 3.6

• Tool : Anaconda Distribution for Python

3.6 HARDWARE REQUIREMENTS

The hardware requirements describe the resources needed for implementation of the project.
Many hardware resources, such as processors and storage devices, are required for implementing
a project. They give the software developers a brief idea of whether the present system can
support such requirements or whether more hardware resources must be deployed to support the
project.

The hardware requirements for this project work are mentioned below:

• Processor : Intel Core i7 processor

• RAM : 16GB

• Hard Disk Space : 15GB

CHAPTER 4- METHODOLOGY
4.1 BLOCK DIAGRAM

Fig 4.1 Proposed System Architecture

• The Stack Overflow dataset consists of over 6M data points with id, title, body and tags
as attributes. This dataset is loaded into a database and cleaned by removing any duplicates
present in the dataset.
• The entire dataset is divided into test and train datasets randomly. If a time stamp were
given for each row in the dataset, we could divide them based on the time stamp. The
dataset is then cleaned and preprocessing is done.
• Next, the labels in the dataset are analyzed so that we can reduce the size of the dataset and
train the model in a shorter duration of time. We consider the top 15 or 20 most occurring
tags to reduce the number of models to build.
• Then we convert the multilabel classification into binary or multiclass classification and
apply linear algorithms such as logistic regression with a OneVsRest classifier and linear
SVM. We cannot use other algorithms such as random forest because of their poor
performance on high-dimensional data.

4.2 DATAFLOW DIAGRAM

[Dataflow: Data set → Data loading and cleaning → Analysis of tags → Clean and preprocess →
Converting tags → New data (train/test split) → Featurizing data → Model → Prediction]

Fig 4.2 Dataflow Diagram

• The Stack Overflow dataset is loaded into an SQLite database and cleaned by removing the
duplicate rows. The dataset consists of four attributes: id, title, body and tags. We analyze
the tags to find the number of unique tags and the number of times each tag appears.
• Cleaning and preprocessing of questions is done by reducing the dataset to 1M data points,
separating code snippets from the body, removing special characters from the question
title and description, removing stop words, removing HTML tags, converting all
characters into lowercase and using the Snowball stemmer to stem the words.
• We convert the tags of the multilabel problem into binary or multiclass classification. Then
we split the new dataset into test and train datasets. Finally we featurize the data and apply
logistic regression with a OneVsRest classifier to get the accuracy of the tags predicted by
the created model.


4.3 UML DIAGRAMS

4.3.1 CLASS DIAGRAM

A class diagram is an illustration of the relationships and source code dependencies among
classes in the Unified Modeling Language (UML). In this context, a class defines the methods
and variables in an object, which is a specific entity in a program or the unit of code representing
that entity. Class diagrams are useful in all forms of object-oriented programming (OOP).

Fig 4.3.1 Class Diagram

In our class diagram, there are five classes: StackoverflowDB, SQLiteDB, Scikit-learn,
Preprocess and Model. There is a dependency relationship between SQLiteDB and Model, and
there are relationships from the StackoverflowDB, Scikit-learn and Preprocess classes to the
SQLiteDB class, where each class has its own attributes and operations.

4.3.2 USE CASE DIAGRAM

Fig 4.3.2 Use Case Diagram

A use case diagram is a graphic depiction of the interactions among the elements of a system.
A use case is a methodology used in system analysis to identify, clarify, and organize system
requirements. The relationships between and among the actors and the use cases are described.

In our Use case diagram, the actors are: User, NLTK, Final Model and Scikit-learn Pkg. These
actors interact with the different use cases within the system.

4.3.3 SEQUENCE DIAGRAM

Fig 4.3.3 Sequence Diagram

Sequence diagrams are interaction diagrams that detail how operations are carried out. They
capture the interaction between objects in the context of a collaboration. Sequence diagrams are
time-focused: they show the order of the interactions visually by using the vertical axis of the
diagram to represent time, indicating what messages are sent and when.

In our sequence diagram, the user supplies a dataset, which is loaded and cleaned; the dataset is
then preprocessed with NLTK. This preprocessed dataset is then reduced according to the
specifications given by the user using the Scikit-learn package, in which we convert the tags of
the multilabel problem to binary or multiclass classification. We featurize the data and train the
model by applying logistic regression, which builds a model to predict the output.

4.3.4 STATE CHART DIAGRAM

Fig 4.3.4 State Chart Diagram

A state diagram is a diagram used to describe the behavior of a system considering all the possible
states of an object when an event occurs. This behavior is represented and analyzed in a series of
events that occur in one or more possible states. Each diagram represents objects and tracks the
various states of these objects throughout the system.

The state chart diagram shows how transitions take place from one state to another. There are
different states in our state chart diagram, such as test and train dataset, remove duplicates,
analyze tags, preprocessed data, reduced labels, model and output. When an activity or event such
as a request occurs, it results in a state transition to the next state.

4.3.5 ACTIVITY DIAGRAM

Fig 4.3.5 Activity Diagram

An activity diagram is an important diagram in UML for describing the dynamic aspects of a
system. An activity diagram is basically a flowchart representing the flow from one activity to
another, where an activity can be described as an operation of the system. The control flow is
drawn from one operation to another.

CHAPTER 5- IMPLEMENTATION

5.1 ALGORITHM

Here we transform our multi-label problem into one or more single-label problems.

Multi-label classification can be carried out in three different ways as:

1. Problem Transformation
2. Adapted Algorithm
3. Ensemble approaches

Binary Relevance (Problem Transformation):

This is the simplest technique; it basically treats each label as a separate single-class
classification problem.

For example, consider a data set where X is the independent feature and Y1, ..., Y4 are the target
variables. In binary relevance, this problem is broken into 4 different single-class classification
problems, one per target variable, as sketched below.
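As a concrete illustration (a minimal sketch with a made-up toy dataset, not the project's actual
data), binary relevance simply fits one independent classifier per target column:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.5], [3.2], [4.8], [5.1]])   # single independent feature x
Y = np.array([[1, 0, 1, 0],                          # four binary targets y1..y4 per row
              [0, 1, 0, 1],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])

# Binary relevance: fit one independent single-label classifier per target column.
models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
predictions = np.column_stack([m.predict(X) for m in models])
print(predictions)   # one 0/1 column per label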

Adapted Algorithm:

An adapted algorithm, as the name suggests, adapts the algorithm to directly perform multi-label
classification, rather than transforming the problem into different subsets of problems.


For example, the multi-label version of kNN is MLkNN. We attempted to implement this on our
data set.

Due to a memory error (see the commented-out MLkNN cell in Section 5.4), we go for logistic
regression with a OneVsRest classifier instead.

5.2 MODULES

5.2.1 DATA CLEANING AND ANALYSIS OF TAGS

In this module we load the data into SQLite database and clean the data by removing the
duplicates from the stack overflow dataset. We analyse the tags to reduce the size of dataset.


5.2.2 CLEANING AND PREPROCESSING OF QUESTIONS

The following are the pre-processing steps:

1. Sample 1M data points
2. Separate out code-snippets from Body
3. Remove special characters from question title and description (not in code)
4. Remove stop words (except 'C')
5. Remove HTML tags
6. Convert all the characters into small letters
7. Use the Snowball stemmer to stem the words


5.2.3 CONVERTING TAGS AND FEATURIZING TAGS
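The screenshots for this module are not reproduced in this extract; the following sketch condenses
the corresponding code from Section 5.4, where the space-separated tag strings are converted into
a binary indicator matrix and the question text into TF-IDF features (the example strings here are
made up):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tags = ["c# .net", "java android", "python pandas"]              # space-separated tag strings
questions = ["cast object c sharp", "parse json java", "merg datafram panda"]

tag_vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary='true')
multilabel_y = tag_vectorizer.fit_transform(tags)                # one column per unique tag

q_vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), ngram_range=(1, 3))
x_multilabel = q_vectorizer.fit_transform(questions)             # sparse TF-IDF features
print(multilabel_y.shape, x_multilabel.shape)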


5.2.4 APPLYING LOGISTIC REGRESSION WITH ONEVSREST CLASSIFIER
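Again the screenshots are not reproduced here; condensed from the code in Section 5.4, the
classifier is set up as follows (x_train_multilabel and y_train are the TF-IDF feature matrix and
binary tag matrix built there):

from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import SGDClassifier

# One binary logistic-regression-style classifier per tag (log loss + L1 penalty),
# trained in parallel across all cores.
classifier = OneVsRestClassifier(SGDClassifier(loss='log', alpha=0.00001, penalty='l1'), n_jobs=-1)
classifier.fit(x_train_multilabel, y_train)        # TF-IDF features, binary tag matrix
predictions = classifier.predict(x_test_multilabel)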


5.3 SOFTWARE ENVIRONMENT AND LIBRARIES

5.3.1 NLTK

Natural Language Toolkit (NLTK) is a library in Python which provides a base for building
programs and classifying data. NLTK is a leading platform for building Python programs
to work with human language data. It provides easy-to-use interfaces to over 50 corpora and
lexical resources such as WordNet, along with a suite of text processing libraries for
classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers
for industrial-strength NLP libraries. This toolkit plays a key role in transforming the text data
in the questions into a format from which features can be extracted.

NLTK provides various functions which are used in pre-processing data so that the available data
becomes fit for mining and extracting features. NLTK supports various methods for simplifying
such tasks before machine learning algorithms are applied.
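A small example of the NLTK operations this project relies on (tokenization, stopword removal
and Snowball stemming); it assumes the relevant NLTK corpora ('punkt', 'stopwords') have been
downloaded:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

text = "Implementing boundary value analysis of software testing"
words = word_tokenize(text.lower())
stemmed = [stemmer.stem(w) for w in words if w not in stop_words]
print(stemmed)   # e.g. ['implement', 'boundari', 'valu', 'analysi', 'softwar', 'test']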

5.3.2 NUMPY

NumPy is the fundamental package for scientific computing with Python. It contains among other
things:

• a powerful N-dimensional array object

• sophisticated (broadcasting) functions

• tools for integrating C/C++ and Fortran code

• useful linear algebra, Fourier transform, and random number capabilities

Besides its obvious scientific uses, NumPy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined. This allows NumPy to seamlessly
and speedily integrate with a wide variety of databases.
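For instance, a small array of per-question tag counts (made-up values) can be summed and
scaled with broadcasting:

import numpy as np

counts = np.array([[3, 4, 2],
                   [2, 3, 5]])         # e.g. two rows of per-tag counts
print(counts.sum(axis=1))              # row sums -> [9 10]
print(counts * 2)                      # scalar broadcasting over the whole array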

Installing NumPy:

Most major projects upload official packages to the Python Package Index. They can be installed
on most operating systems using Python's standard pip package manager.

Note that you need to have Python and pip already installed on your system.

We can install packages via commands such as:

>> python -m pip install --user numpy scipy matplotlib ipython pandas

5.3.3 PANDAS

Pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with structured (tabular, multidimensional, potentially heterogeneous) and time
series data both easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, real world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open source data analysis / manipulation tool available
in any language.

Pandas is well suited for many kinds of data:

• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet.

• Ordered and unordered (not necessarily fixed-frequency) time series data.

• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels.

• Any other form of observational / statistical data sets. The data actually need not be
labelled at all to be placed into a pandas data structure.
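For instance, a toy DataFrame of questions (hypothetical rows, not from the actual dataset)
supports label-based filtering:

import pandas as pd

df = pd.DataFrame({'Id': [1, 2],
                   'Title': ['How to parse JSON?', 'Segfault in C++'],
                   'Tags': ['python json', 'c++']})
print(df[df['Tags'].str.contains('json')])   # keep only rows whose Tags mention 'json'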

Installing Pandas:

Pandas can be installed using the following command:

>> python3 -m pip install --upgrade pandas

5.3.4 JUPYTER NOTEBOOK

The Jupyter Notebook is an interactive computing environment that enables users to author
notebook documents that include live code, interactive widgets, plots, narrative text, equations,
images and video.

These documents provide a complete and self-contained record of a computation that can be
converted to various formats and shared with others using email, Dropbox, version control
systems (like git/GitHub) or nbviewer.jupyter.org.

Internally, notebook documents are JSON (https://en.wikipedia.org/wiki/JSON) data with
binary values base64 (http://en.wikipedia.org/wiki/Base64) encoded. This allows them to
be read and manipulated programmatically by any programming language. Because JSON is a
text format, notebook documents are version-control friendly.

Notebooks can be exported to different static formats including HTML, reStructuredText,
LaTeX, PDF, and slide shows (reveal.js) using Jupyter's nbconvert utility.


5.4 SAMPLE CODE

Data Loading and Cleaning

Using Pandas with SQLite to Load the data

if not os.path.isfile('train.db'):
    start = datetime.now()
    disk_engine = create_engine('sqlite:///train.db')
    chunksize = 180000
    j = 0
    index_start = 1
    # load the CSV into SQLite in chunks to keep memory usage bounded
    for df in pd.read_csv('Train.csv', names=['Id', 'Title', 'Body', 'Tags'],
                          chunksize=chunksize, iterator=True, encoding='utf-8'):
        df.index += index_start
        j += 1
        print('{} rows'.format(j*chunksize))
        df.to_sql('data', disk_engine, if_exists='append')
        index_start = df.index[-1] + 1
    print("Time taken to run this cell :", datetime.now() - start)

Counting the number of rows

if os.path.isfile('train.db'):
    start = datetime.now()
    con = sqlite3.connect('train.db')
    num_rows = pd.read_sql_query("""SELECT count(*) FROM data""", con)
    print("Number of rows in the database :", "\n", num_rows['count(*)'].values[0])
    # Always remember to close the database
    con.close()
    print("Time taken to count the number of rows :", datetime.now() - start)
else:
    print("Please download the train.db file from drive or run the above cell to generate the train.db file")

Number of rows in the database :

6034196

Time taken to count the number of rows : 0:01:15.750352

Checking for duplicates

if os.path.isfile('train.db'):
    start = datetime.now()
    con = sqlite3.connect('train.db')
    df_no_dup = pd.read_sql_query('SELECT Title, Body, Tags, COUNT(*) as cnt_dup FROM data GROUP BY Title, Body, Tags', con)
    con.close()
    print("Time taken to run this cell :", datetime.now() - start)
else:
    print("Please download the train.db file from drive or run the first cell to generate the train.db file")

Time taken to run this cell : 0:04:33.560122

df_no_dup.head()

# we can observe that there are duplicates

print("number of duplicate questions :", num_rows['count(*)'].values[0] - df_no_dup.shape[0],
      "(", (1 - ((df_no_dup.shape[0])/(num_rows['count(*)'].values[0])))*100, "% )")

number of duplicate questions: 1827881 ( 30.2920389063 % )

# number of times each question appeared in our database

df_no_dup.cnt_dup.value_counts()

1 2656284

2 1272336

3 277575

4 90

5 25

6 5

Name: cnt_dup, dtype: int64

start = datetime.now()

df_no_dup["tag_count"] = df_no_dup["Tags"].apply(lambda text: len(text.split(" ")))

# adding a new feature number of tags per question

print("Time taken to run this cell :", datetime.now() - start)


df_no_dup.head()

Time taken to run this cell: 0:00:03.169523

# distribution of number of tags per question

df_no_dup.tag_count.value_counts()

3 1206157

2 1111706

4 814996

1 568298

5 505158

Name: tag_count, dtype: int64

#Creating a new database with no duplicates
if not os.path.isfile('train_no_dup.db'):
    disk_dup = create_engine("sqlite:///train_no_dup.db")
    no_dup = pd.DataFrame(df_no_dup, columns=['Title', 'Body', 'Tags'])
    no_dup.to_sql('no_dup_train', disk_dup)

#This method seems more appropriate to work with this much data.

#creating the connection with the database file.

tag_df_sorted = tag_df.sort_values(['Counts'], ascending=False)

tag_counts = tag_df_sorted['Counts'].values

plt.plot(tag_counts)

plt.title("Distribution of number of times tag appeared questions")

plt.grid()

plt.xlabel("Tag number")

plt.ylabel("Number of times tag appeared")

plt.show()

plt.plot(tag_counts[0:10000])


plt.title('first 10k tags: Distribution of number of times tag appeared questions')

plt.grid()

plt.xlabel("Tag number")

plt.ylabel("Number of times tag appeared")

plt.show()

print(len(tag_counts[0:10000:25]), tag_counts[0:10000:25])

plt.plot(tag_counts[0:1000])

plt.title('first 1k tags: Distribution of number of times tag appeared questions')

plt.grid()

plt.xlabel("Tag number")

plt.ylabel("Number of times tag appeared")

plt.show()


print(len(tag_counts[0:1000:5]), tag_counts[0:1000:5])

plt.plot(tag_counts[0:500])

plt.title('first 500 tags: Distribution of number of times tag appeared questions')

plt.grid()

plt.xlabel("Tag number")

plt.ylabel("Number of times tag appeared")

plt.show()

print(len(tag_counts[0:500:5]), tag_counts[0:500:5])

plt.plot(tag_counts[0:100], c='b')


plt.scatter(x=list(range(0,100,5)), y=tag_counts[0:100:5], c='orange',
            label="quantiles with 0.05 intervals")

# quantiles with 0.25 difference
plt.scatter(x=list(range(0,100,25)), y=tag_counts[0:100:25], c='m',
            label="quantiles with 0.25 intervals")

for x, y in zip(list(range(0,100,25)), tag_counts[0:100:25]):
    plt.annotate(s="({} , {})".format(x, y), xy=(x, y), xytext=(x-0.05, y+500))

plt.title('first 100 tags: Distribution of number of times tag appeared questions')

plt.grid()

plt.xlabel("Tag number")

plt.ylabel("Number of times tag appeared")

plt.legend()

plt.show()

print(len(tag_counts[0:100:5]), tag_counts[0:100:5])

20 [331505 221533 122769 95160 62023 44829 37170 31897 26925 24537
22429 21820 20957 19758 18905 17728 15533 15097 14884 13703]

# Store tags greater than 10K in one list

lst_tags_gt_10k = tag_df[tag_df.Counts>10000].Tags


#Print the length of the list

print ('{} Tags are used more than 10000 times'.format(len(lst_tags_gt_10k)))

# Store tags greater than 100K in one list

lst_tags_gt_100k = tag_df[tag_df.Counts>100000].Tags

#Print the length of the list.

print ('{} Tags are used more than 100000 times'.format(len(lst_tags_gt_100k)))

153 Tags are used more than 10000 times

14 Tags are used more than 100000 times

Observations:

1. There are a total of 153 tags which are used more than 10000 times.
2. 14 tags are used more than 100000 times.
3. The most frequent tag (i.e. c#) is used 331505 times.
4. Since some tags occur much more frequently than others, the micro-averaged F1-score is
the appropriate metric for this problem.

Tags Per Question

#Storing the count of tags in each question in the list 'tag_quest_count'
tag_quest_count = tag_dtm.sum(axis=1).tolist()

#Converting a list of lists into a single list: [[3], [4], [2], [2], [3]] becomes [3, 4, 2, 2, 3]
tag_quest_count = [int(j) for i in tag_quest_count for j in i]

print('We have total {} datapoints.'.format(len(tag_quest_count)))
print(tag_quest_count[:5])

We have total 4206314 datapoints.

[3, 4, 2, 2, 3]

print( "Maximum number of tags per question: %d"%max(tag_quest_count))

print( "Minimum number of tags per question: %d"%min(tag_quest_count))


print( "Avg. number of tags per question: %f"%


((sum(tag_quest_count)*1.0)/len(tag_quest_count)))

Maximum number of tags per question: 5

Minimum number of tags per question: 1

Avg. number of tags per question: 2.899440

sns.countplot(tag_quest_count, palette='gist_rainbow')

plt.title("Number of tags in the questions ")

plt.xlabel("Number of Tags")

plt.ylabel("Number of questions")

plt.show()

Observations:

1. Maximum number of tags per question: 5
2. Minimum number of tags per question: 1
3. Avg. number of tags per question: 2.899
4. Most questions have 2 or 3 tags.

Most Frequent Tags

# Plotting word cloud

start = datetime.now()


# Build a tag -> frequency dict from the 'result' counts for the word cloud
tup = dict(result.items())

#Initializing WordCloud using frequencies of tags.

wordcloud = WordCloud( background_color='black',

width=1600,

height=800,

).generate_from_frequencies(tup)

fig = plt.figure(figsize=(30,20))

plt.imshow(wordcloud)

plt.axis('off')

plt.tight_layout(pad=0)

fig.savefig("tag.png")

plt.show()

print("Time taken to run this cell :", datetime.now() - start)


Time taken to run this cell: 0:00:05.470788

Observations:
A look at the word cloud shows that "c#", "java", "php", "asp.net", "javascript", "c++" are some
of the most frequent tags.

The top 20 tags

i=np.arange(30)

tag_df_sorted.head(30).plot(kind='bar')

plt.title('Frequency of top 20 tags')

plt.xticks(i, tag_df_sorted['Tags'])

plt.xlabel('Tags')

plt.ylabel('Counts')

plt.show()

Observations:

1. The majority of the most frequent tags are programming languages.
2. C# is the most frequent tag overall.
3. Android, iOS, Linux and Windows are among the most frequent operating systems.


Cleaning and preprocessing of Questions

Preprocessing

1. Sample 1M data points
2. Separate out code-snippets from Body
3. Remove special characters from question title and description (not in code)
4. Remove stop words (except 'C')
5. Remove HTML tags
6. Convert all the characters into small letters
7. Use the Snowball stemmer to stem the words

def striphtml(data):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(data))
    return cleantext

stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by db_file
    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except Error as e:
        print(e)
    return None


def create_table(conn, create_table_sql):
    """ create a table from the create_table_sql statement
    :param conn: Connection object
    :param create_table_sql: a CREATE TABLE statement
    :return:
    """
    try:
        c = conn.cursor()
        c.execute(create_table_sql)
    except Error as e:
        print(e)

def checkTableExists(dbcon):
    cursr = dbcon.cursor()
    query = "select name from sqlite_master where type='table'"   # renamed from 'str' to avoid shadowing the builtin
    table_names = cursr.execute(query)
    print("Tables in the database:")
    tables = table_names.fetchall()
    print(tables[0][0])
    return len(tables)

def create_database_table(database, query):
    conn = create_connection(database)
    if conn is not None:
        create_table(conn, query)
        checkTableExists(conn)
        conn.close()
    else:
        print("Error! cannot create the database connection.")

sql_create_table = """CREATE TABLE IF NOT EXISTS QuestionsProcessed (question text NOT NULL, code text, tags text, words_pre integer, words_post integer, is_code integer);"""

create_database_table("Processed.db", sql_create_table)

Tables in the database:

QuestionsProcessed

start = datetime.now()
read_db = 'train_no_dup.db'
write_db = 'Processed.db'

if os.path.isfile(read_db):
    conn_r = create_connection(read_db)
    if conn_r is not None:
        reader = conn_r.cursor()
        reader.execute("SELECT Title, Body, Tags From no_dup_train ORDER BY RANDOM() LIMIT 1000000;")

if os.path.isfile(write_db):
    conn_w = create_connection(write_db)
    if conn_w is not None:
        tables = checkTableExists(conn_w)
        writer = conn_w.cursor()
        if tables != 0:
            writer.execute("DELETE FROM QuestionsProcessed WHERE 1")
            print("Cleared All the rows")

print("Time taken to run this cell :", datetime.now() - start)

Tables in the database:

QuestionsProcessed

Cleared All the rows

Time taken to run this cell : 0:06:32.806567

We create a new database to store the sampled and pre-processed questions.

start = datetime.now()
preprocessed_data_list = []
reader.fetchone()
questions_with_code = 0
len_pre = 0
len_post = 0
questions_proccesed = 0

for row in reader:
    is_code = 0
    title, question, tags = row[0], row[1], row[2]
    if '<code>' in question:
        questions_with_code += 1
        is_code = 1
    x = len(question) + len(title)
    len_pre += x
    code = str(re.findall(r'<code>(.*?)</code>', question, flags=re.DOTALL))
    question = re.sub('<code>(.*?)</code>', '', question, flags=re.MULTILINE|re.DOTALL)
    question = striphtml(question.encode('utf-8'))
    title = title.encode('utf-8')
    question = str(title) + " " + str(question)
    question = re.sub(r'[^A-Za-z]+', ' ', question)
    words = word_tokenize(str(question.lower()))
    # remove all single-letter words and stopwords from the question, except for the letter 'c'
    question = ' '.join(str(stemmer.stem(j)) for j in words if j not in stop_words and (len(j) != 1 or j == 'c'))
    len_post += len(question)
    tup = (question, code, tags, x, len(question), is_code)
    questions_proccesed += 1
    writer.execute("insert into QuestionsProcessed(question,code,tags,words_pre,words_post,is_code) values (?,?,?,?,?,?)", tup)
    if (questions_proccesed % 100000 == 0):
        print("number of questions completed=", questions_proccesed)

no_dup_avg_len_pre = (len_pre*1.0)/questions_proccesed
no_dup_avg_len_post = (len_post*1.0)/questions_proccesed

print("Avg. length of questions(Title+Body) before processing: %d" % no_dup_avg_len_pre)
print("Avg. length of questions(Title+Body) after processing: %d" % no_dup_avg_len_post)
print("Percent of questions containing code: %d" % ((questions_with_code*100.0)/questions_proccesed))
print("Time taken to run this cell :", datetime.now() - start)

number of questions completed= 100000

number of questions completed= 200000

number of questions completed= 300000

number of questions completed= 400000


number of questions completed= 500000

number of questions completed= 600000

number of questions completed= 700000

number of questions completed= 800000

number of questions completed= 900000

Avg. length of questions(Title+Body) before processing: 1169

Avg. length of questions(Title+Body) after processing: 327

Percent of questions containing code: 57

Time taken to run this cell : 0:47:05.946582

# don't forget to close the connections, or else you will end up with locks
conn_r.commit()
conn_w.commit()
conn_r.close()
conn_w.close()

if os.path.isfile(write_db):
    conn_r = create_connection(write_db)
    if conn_r is not None:
        reader = conn_r.cursor()
        reader.execute("SELECT question From QuestionsProcessed LIMIT 10")
        print("Questions after preprocessing")
        print('='*100)
        reader.fetchone()
        for row in reader:
            print(row)


preprocessed_data.head()

print("number of data points in sample :", preprocessed_data.shape[0])

print("number of dimensions :", preprocessed_data.shape[1])

number of data points in sample : 999999

number of dimensions : 2

Machine Learning Models

Converting tags for multilabel problems

# binary='true' will give a binary vectorizer

vectorizer = CountVectorizer(tokenizer = lambda x: x.split(), binary='true')

multilabel_y = vectorizer.fit_transform(preprocessed_data['tags'])

We sample the number of tags instead of considering all of them (due to limited computing
power).

def tags_to_choose(n):
    t = multilabel_y.sum(axis=0).tolist()[0]
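    # The remainder of this function, and the helper questions_explained_fn used
    # below, are not captured in this extract. A minimal sketch of the likely
    # logic (assuming multilabel_y is the sparse binary question-tag matrix):
    sorted_tags_i = sorted(range(len(t)), key=lambda i: t[i], reverse=True)   # most frequent tags first
    return multilabel_y[:, sorted_tags_i[:n]]                                 # keep the n most frequent columns

def questions_explained_fn(n):
    # (sketch) questions left with zero tags after keeping only the top-n tags
    return np.count_nonzero(tags_to_choose(n).sum(axis=1) == 0)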

print("with ",5500,"tags we are covering ",questions_explained[50],"% of questions")

multilabel_yx = tags_to_choose(5500)

print("number of questions that are not covered :", questions_explained_fn(5500),"out of ",


total_qs)

number of questions that are not covered : 9599 out of 999999

multilabel_y.shape[1])

print("number of tags taken :",


multilabel_yx.shape[1],"(",(multilabel_yx.shape[1]/multilabel_y.shape[1])*100,"%)")

Number of tags in sample : 35422

number of tags taken : 5500 ( 15.527073570097679 %)

We consider the top 15% of tags, which cover 99% of the questions.

Split the data into test and train (80:20)

total_size=preprocessed_data.shape[0]

train_size=int(0.80*total_size)

x_train=preprocessed_data.head(train_size)

x_test=preprocessed_data.tail(total_size - train_size)


y_train = multilabel_yx[0:train_size,:]

y_test = multilabel_yx[train_size:total_size,:]

print("Number of data points in train data :", y_train.shape)

print("Number of data points in test data :", y_test.shape)

Number of data points in train data : (799999, 5500)

Number of data points in test data : (200000, 5500)

Featurizing data

start = datetime.now()
vectorizer = TfidfVectorizer(min_df=0.00009, max_features=200000, smooth_idf=True, norm="l2",
                             tokenizer=lambda x: x.split(), sublinear_tf=False, ngram_range=(1,3))
x_train_multilabel = vectorizer.fit_transform(x_train['question'])
x_test_multilabel = vectorizer.transform(x_test['question'])
print("Time taken to run this cell :", datetime.now() - start)

Time taken to run this cell : 0:09:50.460431

print("Dimensions of train data X:",x_train_multilabel.shape, "Y :",y_train.shape)

print("Dimensions of test data X:",x_test_multilabel.shape,"Y:",y_test.shape)

Diamensions of train data X: (799999, 88244) Y : (799999, 5500)

Diamensions of test data X: (200000, 88244) Y: (200000, 5500)

# classifier = LabelPowerset(GaussianNB())

"""
from skmultilearn.adapt import MLkNN
classifier = MLkNN(k=21)

# train
classifier.fit(x_train_multilabel, y_train)

# predict
predictions = classifier.predict(x_test_multilabel)

print(accuracy_score(y_test, predictions))
print(metrics.f1_score(y_test, predictions, average='macro'))
print(metrics.f1_score(y_test, predictions, average='micro'))
print(metrics.hamming_loss(y_test, predictions))
"""

# we get a memory error because the multilearn package
# tries to convert the data into a dense matrix:
# ---------------------------------------------------------------------------
# MemoryError                     Traceback (most recent call last)
# <ipython-input-170-f0e7c7f3e0be> in <module>()
# ----> classifier.fit(x_train_multilabel, y_train)

Applying Logistic Regression with OneVsRest Classifier

# this will take a long time; rather than re-running it, you can download the
# lr_with_equal_weight.pkl file and use it to predict.
# This takes about 6-7 hours to run.
classifier = OneVsRestClassifier(SGDClassifier(loss='log', alpha=0.00001, penalty='l1'), n_jobs=-1)
classifier.fit(x_train_multilabel, y_train)
predictions = classifier.predict(x_test_multilabel)

print("accuracy :", metrics.accuracy_score(y_test, predictions))
print("macro f1 score :", metrics.f1_score(y_test, predictions, average='macro'))
print("micro f1 score :", metrics.f1_score(y_test, predictions, average='micro'))
print("hamming loss :", metrics.hamming_loss(y_test, predictions))
print("Precision recall report :\n", metrics.classification_report(y_test, predictions))

from sklearn.externals import joblib
joblib.dump(classifier, 'lr_with_equal_weight.pkl')

Modeling with fewer data points (0.5M), more weight given to the title, and 500 tags only.

sql_create_table = """CREATE TABLE IF NOT EXISTS QuestionsProcessed (question text NOT NULL, code text, tags text, words_pre integer, words_post integer, is_code integer);"""

create_database_table("Titlemoreweight.db", sql_create_table)

read_db = 'train_no_dup.db'
write_db = 'Titlemoreweight.db'
train_datasize = 400000

if os.path.isfile(read_db):
    conn_r = create_connection(read_db)
    if conn_r is not None:
        reader = conn_r.cursor()
        # for selecting the first 0.5M rows
        reader.execute("SELECT Title, Body, Tags From no_dup_train LIMIT 500001;")
        # for selecting random points
        # reader.execute("SELECT Title, Body, Tags From no_dup_train ORDER BY RANDOM() LIMIT 500001;")

if os.path.isfile(write_db):
    conn_w = create_connection(write_db)
    if conn_w is not None:
        tables = checkTableExists(conn_w)
        writer = conn_w.cursor()
        if tables != 0:
            writer.execute("DELETE FROM QuestionsProcessed WHERE 1")
            print("Cleared All the rows")

Pre-processing of questions

1. Separate code from Body
2. Remove special characters from question title and description (not in code)
3. Give more weightage to the title: add the title three times to the question (see the sketch
below)
4. Remove stop words (except 'C')
5. Remove HTML tags
6. Convert all the characters into small letters
7. Use the Snowball stemmer to stem the words
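Step 3 is not visible in the truncated loop below; a sketch of how the extra title weight is likely
applied inside the loop (an assumption based on the weighting described above): repeating the
title three times before the body gives title words three times the weight of body words in the
TF-IDF features.

    # (sketch, inside the preprocessing loop; assumes 'title' and 'question' as in the 1M-point loop)
    question = str(title) + " " + str(title) + " " + str(title) + " " + str(question)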

start = datetime.now()
preprocessed_data_list = []
reader.fetchone()
questions_with_code = 0
len_pre = 0
len_post = 0
questions_proccesed = 0

for row in reader:
    is_code = 0
    title, question, tags = row[0], row[1], str(row[2])
    if '<code>' in question:
        questions_with_code += 1


conn_r = create_connection(write_db)
if conn_r is not None:
    preprocessed_data = pd.read_sql_query("""SELECT question, Tags FROM QuestionsProcessed""", conn_r)
conn_r.commit()
conn_r.close()

preprocessed_data.head()

print("number of data points in sample :", preprocessed_data.shape[0])
print("number of dimensions :", preprocessed_data.shape[1])

number of data points in sample : 500000

number of dimensions : 2

Converting string Tags to multilabel output variables

vectorizer = CountVectorizer(tokenizer = lambda x: x.split(), binary='true')

multilabel_y = vectorizer.fit_transform(preprocessed_data['tags'])

Selecting 500 Tags

questions_explained = []
total_tags = multilabel_y.shape[1]
total_qs = preprocessed_data.shape[0]

for i in range(500, total_tags, 100):
    questions_explained.append(np.round(((total_qs - questions_explained_fn(i))/total_qs)*100, 3))

fig, ax = plt.subplots()
ax.plot(questions_explained)
xlabel = list(500 + np.array(range(-50, 450, 50))*50)
ax.set_xticklabels(xlabel)
plt.xlabel("Number of tags")
plt.ylabel("Percentage of questions covered (partially)")
plt.grid()
plt.show()

# you can choose any number of tags based on your computing power; the minimum is 500
# (it covers 90% of the questions)
print("with ", 5500, "tags we are covering ", questions_explained[50], "% of questions")
print("with ", 500, "tags we are covering ", questions_explained[0], "% of questions")

# we will be taking 500 tags
multilabel_yx = tags_to_choose(500)

print("number of questions that are not covered :", questions_explained_fn(500), "out of ", total_qs)

number of questions that are not covered : 45221 out of 500000

x_train=preprocessed_data.head(train_datasize)

x_test=preprocessed_data.tail(preprocessed_data.shape[0] - 400000)

y_train = multilabel_yx[0:train_datasize,:]

y_test = multilabel_yx[train_datasize:preprocessed_data.shape[0],:]

print("Number of data points in train data :", y_train.shape)

print("Number of data points in test data :", y_test.shape)

Number of data points in train data : (400000, 500)

Number of data points in test data : (100000, 500)

Featurizing data with TfIdf vectorizer

start = datetime.now()
vectorizer = TfidfVectorizer(min_df=0.00009, max_features=200000, smooth_idf=True, norm="l2",
                             tokenizer=lambda x: x.split(), sublinear_tf=False, ngram_range=(1,3))
x_train_multilabel = vectorizer.fit_transform(x_train['question'])
x_test_multilabel = vectorizer.transform(x_test['question'])
print("Time taken to run this cell :", datetime.now() - start)

Time taken to run this cell : 0:03:52.522389

print("Dimensions of train data X:",x_train_multilabel.shape, "Y :",y_train.shape)

print("Dimensions of test data X:",x_test_multilabel.shape,"Y:",y_test.shape)

Diamensions of train data X: (400000, 94927) Y : (400000, 500)

Diamensions of test data X: (100000, 94927) Y: (100000, 500)


Applying Logistic Regression with OneVsRest Classifier

start = datetime.now()
classifier = OneVsRestClassifier(SGDClassifier(loss='log', alpha=0.00001, penalty='l1'), n_jobs=-1)
classifier.fit(x_train_multilabel, y_train)
predictions = classifier.predict(x_test_multilabel)

print("Accuracy :", metrics.accuracy_score(y_test, predictions))
print("Hamming loss ", metrics.hamming_loss(y_test, predictions))

precision = precision_score(y_test, predictions, average='micro')
recall = recall_score(y_test, predictions, average='micro')
f1 = f1_score(y_test, predictions, average='micro')
print("Micro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

precision = precision_score(y_test, predictions, average='macro')
recall = recall_score(y_test, predictions, average='macro')
f1 = f1_score(y_test, predictions, average='macro')
print("Macro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

print(metrics.classification_report(y_test, predictions))
print("Time taken to run this cell :", datetime.now() - start)

joblib.dump(classifier, 'lr_with_more_title_weight.pkl')

['lr_with_more_title_weight.pkl']

start = datetime.now()
classifier_2 = OneVsRestClassifier(LogisticRegression(penalty='l1'), n_jobs=-1)
classifier_2.fit(x_train_multilabel, y_train)
predictions_2 = classifier_2.predict(x_test_multilabel)

print("Accuracy :", metrics.accuracy_score(y_test, predictions_2))
print("Hamming loss ", metrics.hamming_loss(y_test, predictions_2))

precision = precision_score(y_test, predictions_2, average='micro')
recall = recall_score(y_test, predictions_2, average='micro')
f1 = f1_score(y_test, predictions_2, average='micro')
print("Micro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

precision = precision_score(y_test, predictions_2, average='macro')
recall = recall_score(y_test, predictions_2, average='macro')
f1 = f1_score(y_test, predictions_2, average='macro')
print("Macro-average quality numbers")
print("Precision: {:.4f}, Recall: {:.4f}, F1-measure: {:.4f}".format(precision, recall, f1))

print(metrics.classification_report(y_test, predictions_2))
print("Time taken to run this cell :", datetime.now() - start)

CHAPTER 6- RESULT ANALYSIS
6.1 OUTPUT

SAMPLE DATAPOINT IN DATASET

Title: Implementing Boundary Value Analysis of Software Testing in a C++ program?

Body:

#include<iostream>\n

#include<stdlib.h>\n\n

using namespace std;\n\n

int main()\n

{ cout<<"Hello Dheeraj"<<endl; }

Tags: 'c++ c'

PERFORMANCE METRIC

Micro-Averaged F1-Score (Mean F Score): The F1 score can be interpreted as a weighted
average of the precision and recall, where an F1 score reaches its best value at 1 and worst
score at 0. The relative contributions of precision and recall to the F1 score are equal. The
formula for the F1 score is:

F1 = 2 · (precision · recall) / (precision + recall)

In the multi-class and multi-label case, this is the weighted average of the F1 score of each
class.

'Micro f1 score': Calculate metrics globally by counting the total true positives, false negatives
and false positives. This is a better metric when we have class imbalance.

'Macro f1 score': Calculate metrics for each label and find their unweighted mean. This does
not take label imbalance into account.

Hamming loss: The Hamming loss is the fraction of labels that are incorrectly predicted.
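As a small sketch of how these metrics behave, computed with scikit-learn on toy multilabel
predictions (rows = questions, columns = tags; the arrays here are made up):

import numpy as np
from sklearn.metrics import f1_score, hamming_loss

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 0], [1, 0, 0]])   # one tag missed on the last question

print("micro f1 :", f1_score(y_true, y_pred, average='micro'))   # global TP/FP/FN counts
print("macro f1 :", f1_score(y_true, y_pred, average='macro'))   # unweighted per-tag mean
print("hamming loss :", hamming_loss(y_true, y_pred))            # fraction of wrong labels (1/9)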

Dataset after preprocessing:

#Taking 1 Million entries to a dataframe.
write_db = 'Processed.db'
if os.path.isfile(write_db):
    conn_r = create_connection(write_db)
    if conn_r is not None:
        preprocessed_data = pd.read_sql_query("""SELECT question, Tags FROM QuestionsProcessed""", conn_r)
conn_r.commit()
conn_r.close()

preprocessed_data.head()

Fig 6.1.1 Preprocessed Data

For 1M data points and 5500 tags, applying logistic regression with a OneVsRest classifier
gives the following results:
accuracy: 0.081965
macro f1 score: 0.0963020140154
micro f1 score: 0.374270748817
hamming loss: 0.00041225090909090907
Precision recall report:
precision recall f1-score support

0 0.62 0.23 0.33 15760


1 0.79 0.43 0.56 14039
2 0.82 0.55 0.66 13446
3 0.76 0.42 0.54 12730
4 0.94 0.76 0.84 11229
5 0.85 0.64 0.73 10561
6 0.87 0.50 0.63 5086
7 0.87 0.54 0.67 4533
8 0.60 0.13 0.22 3000
9 0.81 0.53 0.64 2765
10 0.59 0.17 0.26 3051
11 0.70 0.33 0.45 3009
12 0.64 0.24 0.35 2630
13 0.71 0.23 0.35 1426
14 0.90 0.53 0.67 2548
15 0.66 0.18 0.28 2371

Up to 5499
avg / total 0.53 0.26 0.33 530065

CHAPTER 7 – TESTING
Software testing is the evaluation of the software against requirements gathered from users and
system specifications. Testing is conducted at the phase level in the software development life
cycle or at the module level in program code. Software testing comprises validation and
verification.

7.1 SOFTWARE VALIDATION

Validation is the process of examining whether the software satisfies the user requirements. It
is carried out at the end of the SDLC. If the software matches the requirements for which it was
made, it is validated.

7.2 VERIFICATION

Verification is the process of confirming if the software is meeting the business requirements and
is developed adhering to the proper specifications and methodologies.

7.3 MANUAL VS AUTOMATED TESTING

Testing can either be done manually or using an automated testing tool:

1. Manual – This testing is performed without the help of automated testing tools. The
software tester prepares test cases for different sections and levels of the code, executes
the tests and reports the results to the manager. Manual testing is time- and resource-
consuming, and the tester needs to confirm whether the right test cases are used. A major
portion of testing involves manual testing.
2. Automated – This testing is a procedure done with the aid of automated testing tools.
The limitations of manual testing can be overcome using automated test tools.

7.4 TEST CASES

TEST CASE 1:

1M data points and 5500 tags. This takes about 6-7 hours to run on a computer with 32GB RAM
and an i7 processor.
accuracy: 0.081965
macro f1 score: 0.0963020140154
micro f1 score: 0.374270748817
hamming loss: 0.00041225090909090907
Precision recall report:
precision recall f1-score support

0 0.62 0.23 0.33 15760


1 0.79 0.43 0.56 14039
2 0.82 0.55 0.66 13446
3 0.76 0.42 0.54 12730
4 0.94 0.76 0.84 11229
5 0.85 0.64 0.73 10561
6 0.70 0.30 0.42 6958
7 0.87 0.61 0.72 6309

8 0.70 0.40 0.50 6032
9 0.78 0.43 0.55 6020
10 0.86 0.62 0.72 5707
11 0.52 0.17 0.25 5723
12 0.55 0.10 0.16 5521
13 0.59 0.25 0.35 4722
14 0.61 0.22 0.32 4468
15 0.79 0.52 0.63 4536
Up to 5499
avg / total 0.53 0.26 0.33 530065

TEST CASE 2:

0.5M data points and 500 tags, with more weight given to the title
Accuracy: 0.23623
Hamming loss 0.00278088
Micro-average quality numbers
Precision: 0.7216, Recall: 0.3256, F1-measure: 0.4488
Macro-average quality numbers
Precision: 0.5473, Recall: 0.2572, F1-measure: 0.3339
precision recall f1-score support

0 0.94 0.64 0.76 5519


1 0.69 0.26 0.38 8190
2 0.81 0.37 0.51 6529
3 0.81 0.43 0.56 3231
4 0.81 0.40 0.54 6430
5 0.82 0.33 0.47 2879
6 0.87 0.50 0.63 5086
7 0.87 0.54 0.67 4533
8 0.60 0.13 0.22 3000
9 0.81 0.53 0.64 2765
10 0.59 0.17 0.26 3051
11 0.70 0.33 0.45 3009
12 0.64 0.24 0.35 2630
13 0.71 0.23 0.35 1426
14 0.90 0.53 0.67 2548
15 0.66 0.18 0.28 2371
Up to 499
avg / total 0.67 0.33 0.43 173812

CHAPTER 8- CONCLUSION AND FUTURE WORK
CONCLUSION

In this work, an efficient machine learning model is built for predicting the tags of a Stack
Overflow question given its title and description. We used logistic regression with a OneVsRest
classifier to predict accurate labels for a given title and description.

The Stack Overflow dataset is preprocessed in such a way that we can obtain multiple labels
relevant to the given title and description. If any one of the predicted labels is not related to the
title and description, the precision and recall values decrease, which indirectly affects user
satisfaction.

FUTURE WORK

This model is trained using logistic regression with a OneVsRest classifier. The data set contains
the attributes id, title, body and tags. Future work could sort the rows in the dataset by the time
stamp of the questions, since newer questions contain newer tags. It could also determine how
effective the parts-of-speech breakdown is versus simply using the number of words in the body
and title: as we saw, most of the key features were related to post body composition, and
collapsing these back to a single value may reveal how useful the breakdown was in determining
the post status. These steps would help in predicting tags accurately with high precision and
recall values.

REFERENCES
1. Al-Kofahi, J.M., Tamrawi, A., Nguyen, T.T., Nguyen, H.A., Nguyen, T.N.: Fuzzy set
approach for automatic tagging in evolving software. In: International Conference on
Software Maintenance, pp. 1–10. IEEE (2010)

2. Begel, A., DeLine, R., Zimmermann, T.: Social media for software engineering. In:
Proceedings of the FSE/SDP Workshop on Future of Software Engineering Research, pp.
33–38. ACM (2010)

3. Pennington, J., Socher, R., and Manning, C. D. (2014). GloVe: Global vectors for word
representation. In: Conference on Empirical Methods in Natural Language Processing.

4. Klassen, M. and Paturi, N. (2010). Web document classification by keywords using random
forests. In: Networked Digital Technologies, volume 88 of Communications in Computer and
Information Science, pages 256–261. Springer Berlin Heidelberg.

5. Manning, C. D., Raghavan, P., and Schutze, H. (2008). Introduction to Information Retrieval.
Cambridge University Press, New York, NY, USA.

6. McCallum, A. K. (1999). Multi-label text classification with a mixture model trained by EM.
In: AAAI'99 Workshop on Text Learning.

7. Loper, E. and Bird, S. (2002). NLTK: The Natural Language Toolkit. In: Proceedings of the
ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language
Processing and Computational Linguistics, ETMTNLP '02, pages 63–70, Stroudsburg, PA,
USA. Association for Computational Linguistics.

