
SOEN 6481 Software Systems Requirements Specification (Winter 2015/16)

Worksheet #6

Automatic Traceability Link Recovery based on the Vector Space Model (VSM) and Cosine Similarity

You just started a new job where you inherited a few hundred thousand lines of code, together with
thousands of pages of (possibly) relevant documents, including requirements, domain information, design
descriptions, and user guides. Now it's your job to figure out how these artifacts are connected, so that you
can understand the system and make changes while keeping all artifacts consistent.
Fortunately, you just learned how to automatically create traceability links between software artifacts using
the vector space model (VSM).

The Input Artifacts. You start with two requirements documents (r1, r2) and one source code file (s1):
r1 = The server has a database.
r2 = The client has encryption and the server has encryption.
s1 = // Server encryption.
Your goal here is to automatically create vertical traceability links: from source code to requirements.

Step 1: Tokenize each file and remove stopwords. The first step is to break up the artifacts into
individual tokens (tokenization), separating tokens by whitespace, ignoring punctuation marks, and converting
all tokens to lower case. Then, remove all stopwords (this includes words like "the", "a", "is", "be", "has",
"and", "or", ...).

The resulting lists of tokens for the artifacts are:

r1 =

r2 =

s1 =
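
If you want to check your token lists mechanically, here is a minimal Python sketch of Step 1 (the
stopword list contains just the words named above; a real pipeline would use a much larger list, and
possibly stemming):

    import re

    # Minimal stopword list for this exercise; real tools use larger lists.
    STOPWORDS = {"the", "a", "is", "be", "has", "and", "or"}

    def tokenize(text):
        # Lowercase, keep alphabetic tokens only (drops punctuation),
        # then remove stopwords.
        return [t for t in re.findall(r"[a-z]+", text.lower())
                if t not in STOPWORDS]

    r1 = tokenize("The server has a database.")
    r2 = tokenize("The client has encryption and the server has encryption.")
    s1 = tokenize("// Server encryption.")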

Step 2: Compute the document vectors and normalize them. Here, we want to find out which requirements
documents should be linked from the source code file. For this, you first need to compute the vectors
for the query (the source code file) and the two requirements documents.

Fill in the empty values in the table below, using these definitions:

tf: term frequency
df: document frequency
N: number of documents

$\mathit{idf} = \log_{10} \frac{N}{\mathit{df}}$

$\mathit{tf.idf} = \mathit{tf} \cdot \mathit{idf}$ (i.e., no log weighting for tf)

Assume N = 10,000,000. Each $q_i$ is the normalized tf.idf weight for a query word, and the $d_i$ are
the normalized tf weights in the document.¹ To normalize a vector, you have to (1) compute its length
$\|\vec{v}\| = \sqrt{x_1^2 + \dots + x_n^2}$, then (2) divide each element by the length: $x_i / \|\vec{v}\|$.
Here, you end up with 4-dimensional vectors:

             |             query (s1)             |   r1    |   r2
 token       | tf |   df    | idf | tf.idf |  qi  | tf | di | tf | di
-------------+----+---------+-----+--------+------+----+----+----+----
 server      |    |  50,000 |     |        |      |    |    |    |
 database    |    |  10,000 |     |        |      |    |    |    |
 client      |    | 100,000 |     |        |      |    |    |    |
 encryption  |    |  10,000 |     |        |      |    |    |    |
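
As a worked example for one cell: idf(server) = log10(10,000,000 / 50,000) = log10(200) ≈ 2.30.
Continuing the sketch from Step 1 (the variable and function names here are my own, not part of the
worksheet), the normalized vectors can be computed as follows:

    import math

    N = 10_000_000  # number of documents, as given above
    df = {"server": 50_000, "database": 10_000,
          "client": 100_000, "encryption": 10_000}
    vocab = ["server", "database", "client", "encryption"]  # the 4 dimensions

    def normalize(v):
        # Divide each component by the vector's Euclidean length.
        length = math.sqrt(sum(x * x for x in v))
        return [x / length for x in v] if length else v

    def query_vector(tokens):
        # tf.idf weights for the query, then length-normalized (the q_i).
        return normalize([tokens.count(t) * math.log10(N / df[t])
                          for t in vocab])

    def doc_vector(tokens):
        # Plain tf weights for a document, then length-normalized (the d_i).
        return normalize([tokens.count(t) for t in vocab])

    q  = query_vector(s1)
    d1 = doc_vector(r1)
    d2 = doc_vector(r2)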

Step 3: Compute the similarity between the query vector and the other artifact vectors. Now compute
the cosine similarity between the vector for the query (source code $s_1$) and each of the requirements
documents ($r_1$, $r_2$). Since the vectors are already normalized, this is simply their dot product:
$\cos(\vec{q}, \vec{d}\,) = \vec{q} \cdot \vec{d} = \sum_i q_i d_i$:

sim($\vec{s}_1$, $\vec{r}_1$) = cos($\vec{s}_1$, $\vec{r}_1$) =

sim($\vec{s}_1$, $\vec{r}_2$) = cos($\vec{s}_1$, $\vec{r}_2$) =
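
Continuing the sketch: because the vectors are already unit length, the cosine is just the dot product.

    def cosine(u, v):
        # Dot product; equals cosine similarity for unit-length vectors.
        return sum(a * b for a, b in zip(u, v))

    sim_r1 = cosine(q, d1)
    sim_r2 = cosine(q, d2)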

Step 4: Filter links by similarity. Now we have to filter the results. Here, we apply filtering by similarity:
only artifacts with a cosine similarity above 80% are linked. Show the resulting traceability link(s):

links =
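
In code, the filtering step might look like this (reusing sim_r1 and sim_r2 from the Step 3 sketch):

    THRESHOLD = 0.8  # "above 80%"

    candidates = {"r1": sim_r1, "r2": sim_r2}
    links = {doc for doc, sim in candidates.items() if sim > THRESHOLD}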

So now you know which of the requirements documents to consult for more information on the given
source code file, and which to update when it changes!
¹ That means you only have to do tf.idf weighting for the query, not for the documents. This is done
only to reduce the hand calculations in this exercise; in a real implementation, you do tf.idf weighting
for all terms.
