1. Abstract:
This report surveys the current use of data mining and machine learning in the
field of software engineering. It draws on several papers published in ACM and
IEEE journals and conferences.

2. Introduction:
Data mining (DM) and machine learning (ML) have been used extensively in software
engineering. They appear in every phase, whether requirement gathering, design,
development, testing, or maintenance. Many applications come up when DM and ML
are discussed, including natural language processing, syntactic pattern
recognition, search engines, speech and handwriting recognition, object
recognition in computer vision, and game playing.

3. Discussion:
Let us discuss some of the roles played by data mining and machine learning in
software engineering, treating DM and ML in turn.

3.1 Data Mining in Software Engineering:


Chadd C. Williams and Jeffrey K. Hollingsworth [1] described a method to use the
source code change history of a software project to drive and help to refine the search for
bugs. Based on the data retrieved from the source code repository, they implemented a
static source code checker that searches for a commonly fixed bug and uses information
automatically mined from the source code repository to refine its results. By applying the
tool, they identified a total of 178 warnings that were likely bugs in the Apache Web
server source code and a total of 546 warnings that were likely bugs in Wine, an
open-source implementation of the Windows API. The results showed that this
technique is more effective than the same static analysis run without historical
data from the source code repository.
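The core idea in [1] can be sketched simply: warnings whose pattern has often been fixed in past commits are more likely to be real bugs, so rank them first. The sketch below is illustrative, not the authors' tool; the warning patterns and fix counts are invented.

```python
# Hypothetical sketch: re-rank static-analysis warnings using repository
# history. fix_history would be mined from the project's commit log.

def rank_warnings(warnings, fix_history):
    """Sort warnings so patterns fixed most often in the past come first.

    warnings    -- list of (file, line, pattern) tuples from a static checker
    fix_history -- dict mapping pattern -> times that pattern was fixed
                   in past commits
    """
    return sorted(warnings, key=lambda w: fix_history.get(w[2], 0), reverse=True)

warnings = [
    ("http.c", 120, "unchecked-return"),
    ("util.c", 45, "null-deref"),
    ("conf.c", 9, "leak"),
]
fix_history = {"null-deref": 14, "unchecked-return": 3}

ranked = rank_warnings(warnings, fix_history)
print(ranked[0][2])  # the pattern with the richest fix history comes first
```

The historical data acts purely as a prioritizer here; the static checker still produces the candidate set.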
Qinbao Song, Martin Shepperd, Michelle Cartwright, and Carolyn Mair [2] presented
association rule mining based methods to predict defect associations and defect
correction effort. The idea was to discover software defect associations from historical
software engineering data sets, and to help determine whether a defect is likely to
be accompanied by other defects. They used data from more than 200 projects spanning more than 15
years to examine their method. The results show that, for defect association prediction,
the accuracy is very high and the false-negative rate is very low. Likewise, for the defect
correction effort prediction, the accuracy of both defect isolation effort prediction
and defect correction effort prediction is also high. They compared the defect
correction effort prediction method with other methods and found that accuracy
improved by at least 23 percent. These results suggest that association rule mining may
be an attractive technique to the software engineering community due to its relative
simplicity, transparency, and seeming effectiveness in constructing prediction systems.
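The kind of rule mined in [2] can be illustrated with the textbook support/confidence computation: from historical sets of co-occurring defects, estimate the confidence that one defect type accompanies another. The defect labels and threshold below are invented for illustration.

```python
# Minimal association-rule sketch over historical defect sets.
from collections import Counter
from itertools import combinations

def mine_rules(defect_sets, min_conf=0.5):
    """Return rules (a, b) -> confidence, where confidence = P(b present | a present)."""
    item_count = Counter()
    pair_count = Counter()
    for s in defect_sets:
        for d in s:
            item_count[d] += 1
        for a, b in combinations(sorted(s), 2):
            pair_count[(a, b)] += 1
            pair_count[(b, a)] += 1
    return {(a, b): n / item_count[a]
            for (a, b), n in pair_count.items()
            if n / item_count[a] >= min_conf}

history = [
    {"interface", "logic"},
    {"interface", "logic", "data"},
    {"interface"},
    {"data"},
]
rules = mine_rules(history)
print(rules[("logic", "interface")])  # logic defects always co-occur with interface ones here
```

A rule like `logic -> interface` with high confidence tells testers to check interface code whenever a logic defect is reported.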
Amir Michail [3] proposed a way of using data mining to discover library reuse patterns
in user-selected applications. He considered the problem of discovering association rules
that identify library components that are often reused in combination by application
components. By querying and/or browsing such association rules, a developer can
discover patterns for reusing library components. This approach is illustrated using the
tool, CodeWeb, by demonstrating characteristic ways in which applications reuse
components in the ET++ application framework.
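The reuse patterns CodeWeb surfaces [3] boil down to frequent co-occurrence: which library components do application components tend to reuse together? A toy sketch, with class names invented rather than taken from ET++:

```python
# Count how often pairs of library classes are reused together by
# application classes, keeping pairs above a support threshold.
from collections import Counter
from itertools import combinations

def reuse_pairs(app_components, min_support=2):
    """app_components maps application class -> set of library classes it reuses."""
    pairs = Counter()
    for used in app_components.values():
        pairs.update(combinations(sorted(used), 2))
    return {p: n for p, n in pairs.items() if n >= min_support}

apps = {
    "Editor":  {"Window", "Menu", "Document"},
    "Browser": {"Window", "Menu"},
    "Viewer":  {"Window", "Document"},
}
print(reuse_pairs(apps))
```

Browsing such pairs shows a developer the characteristic combinations in which a framework's components are reused.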
Defect reports are generated from various testing and development activities in software
engineering. Sometimes two reports are submitted that describe the same problem,
leading to duplicate reports. These reports are mostly written in structured natural
language, which makes it hard to compare two reports for similarity with formal
methods. To identify duplicates, Per Runeson, Magnus Alexandersson, and Oskar
Nyholm [4] investigated Natural Language Processing (NLP) techniques to support
the identification. They developed a prototype tool and evaluated it in a case
study analyzing defect reports at Sony Ericsson Mobile Communications. Evaluating
the identification capabilities on a large defect management system, they
concluded that about 40% of the marked duplicates could be found.
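The comparison at the heart of such duplicate detection can be sketched with token-level cosine similarity. Real pipelines add stemming, stop-word removal, and tf-idf weighting; this is only the bare idea, and the report texts are invented.

```python
# Cosine similarity between bag-of-words vectors of two defect reports.
import math
from collections import Counter

def cosine(a, b):
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

r1 = "phone crashes when opening the calendar"
r2 = "crash when opening calendar application"
r3 = "battery drains too fast"
print(round(cosine(r1, r2), 2), round(cosine(r1, r3), 2))
```

Reports scoring above a tuned threshold would be flagged as duplicate candidates for a human to confirm.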
Automatic identification of software faults has enormous practical significance. This
requires characterizing program execution behavior and the use of appropriate data
mining techniques on the chosen representation. R. P. Jagadeesh Chandra Bose and S. H.
Srinivasan [5] used the sequence of system calls to characterize program execution. The
results show that kernel techniques are as accurate as the best available results but are
faster by orders of magnitude.
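The representation used in [5] can be illustrated as follows: summarize an execution by its system-call n-grams, then compare two runs with a spectrum-kernel-style dot product over those counts. The call traces below are invented.

```python
# Characterize executions by system-call n-grams and compare them
# with a simple spectrum-kernel dot product.
from collections import Counter

def ngrams(trace, n=2):
    return Counter(tuple(trace[i:i + n]) for i in range(len(trace) - n + 1))

def spectrum_kernel(t1, t2, n=2):
    """Dot product of the n-gram count vectors of two call traces."""
    g1, g2 = ngrams(t1, n), ngrams(t2, n)
    return sum(g1[g] * g2[g] for g in g1)

normal = ["open", "read", "read", "close"]
faulty = ["open", "read", "write", "close"]
print(spectrum_kernel(normal, normal), spectrum_kernel(normal, faulty))
```

A kernel classifier (e.g. an SVM) built on such a similarity can separate normal from faulty executions without hand-crafted features.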
Patrick Francis, David Leon, Melinda Minch and Andy Podgurski [6] presented two new
tree-based techniques for refining an initial classification of failures. One of these
techniques is based on the use of dendrograms, which are rooted trees used to represent
the results of hierarchical cluster analysis. The second technique employs a classification
tree constructed to recognize failed executions. With both techniques, the tree
representation is used to guide the refinement process.
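A dendrogram is the tree a hierarchical cluster analysis produces, so the grouping step behind [6] can be sketched as agglomerative clustering of failure profiles. The single-linkage scheme and the coverage-style profiles below are illustrative, not the authors' exact setup.

```python
# Single-linkage agglomerative clustering of failing executions,
# each represented by a small numeric execution profile.
def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def agglomerate(profiles, k):
    """Repeatedly merge the two closest clusters until k clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# profiles of four failing runs; two pairs plausibly share a cause
profiles = [(1, 0, 0), (1, 1, 0), (0, 0, 5), (0, 1, 5)]
print(agglomerate(profiles, 2))
```

Cutting the merge tree at different heights corresponds to refining or coarsening the failure classification, which is the refinement the paper guides.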
Statistical debugging uses dynamic instrumentation and machine learning to identify
predicates on program state that are strongly predictive of program failure. Liblit
et al. [7] enriched the predicate vocabulary by adding complex Boolean formulae
derived from these simple predicates. They presented qualitative and quantitative
evidence that complex predicates are practical, precise, and informative.
Furthermore, they demonstrated that their approach is robust in the face of the
incomplete data produced by the sparse random sampling that typifies
post-deployment statistical debugging.
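The scoring idea can be illustrated with a simplified version of the failure-correlation measure used in statistical debugging: for each predicate, what fraction of the runs where it holds end in failure? The runs and predicate names below are invented; the point is that a conjunction of two weak predictors can be a much sharper one.

```python
# Score predicates (simple and compound) by how predictive they are of failure.
def failure_score(runs, pred):
    """Fraction of runs where pred is true that end in failure."""
    true_runs = [r for r in runs if pred(r)]
    if not true_runs:
        return 0.0
    return sum(r["failed"] for r in true_runs) / len(true_runs)

runs = [
    {"x_neg": True,  "lock_held": True,  "failed": True},
    {"x_neg": True,  "lock_held": False, "failed": False},
    {"x_neg": False, "lock_held": True,  "failed": False},
    {"x_neg": False, "lock_held": False, "failed": False},
]
simple = failure_score(runs, lambda r: r["x_neg"])
compound = failure_score(runs, lambda r: r["x_neg"] and r["lock_held"])
print(simple, compound)  # the conjunction is the sharper failure predictor
```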
In software systems, different software applications often interact with each other through
specific interfaces by exchanging data in string format. Sometimes these interfaces are
complex and distributed. Tao Xie and Evan Martin [8] proposed an approach to
understanding software application interfaces through string analysis. The approach
first performs a static analysis of the source code to identify interaction points,
and then leverages existing string analysis tools to collect all possible string
data that can be sent through these interaction points. They organized the
collected string data by grouping similar items together. Their preliminary results
show that the approach can help developers understand the characteristics of
interactions between database applications and databases.
Identifiers represent an important source of information for programmers understanding
and maintaining a system. Self-documenting identifiers reduce the time and effort
necessary to obtain the level of understanding appropriate for the task at hand.
Antoniol, Gueheneuc, Merlo, and Tonella [9] characterized the evolution of program
identifiers in terms of stability metrics and occurrences of renaming, and assessed
whether an evolution process similar to the one occurring for the program structure
exists for identifiers. They argue that this different evolution results from
several factors, including the lack of advanced tool support for lexicon
construction, documentation, and evolution.
In industry, software specifications are often lacking, incomplete, and outdated.
Missing and incomplete specifications cause various software engineering problems;
studies have shown that program comprehension takes up to 45% of software
development costs. David Lo and Siau-Cheng Khoo [10] described novel data mining
techniques to mine, or reverse engineer, these specifications from the pool of
software engineering data, employing novel pattern mining and rule mining
techniques.
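One family of mined specifications is illustrated below: "response" rules of the form "whenever event a occurs, event b eventually follows", accepted only when they hold in every trace. This is a simplified stand-in for the pattern and rule mining in [10], with invented event names.

```python
# Mine response rules (a -> eventually b) that hold across all execution traces.
def holds(trace, a, b):
    """True if every occurrence of a in trace is eventually followed by b."""
    for i, e in enumerate(trace):
        if e == a and b not in trace[i + 1:]:
            return False
    return True

def mine_response_rules(traces):
    events = {e for t in traces for e in t}
    rules = set()
    for a in events:
        for b in events - {a}:
            if any(a in t for t in traces) and all(holds(t, a, b) for t in traces):
                rules.add((a, b))
    return rules

traces = [
    ["open", "read", "close"],
    ["open", "write", "close"],
    ["read"],
]
rules = mine_response_rules(traces)
print(("open", "close") in rules, ("open", "read") in rules)
```

Real miners also tolerate a support/confidence threshold rather than requiring rules to hold universally, since traces are noisy.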

3.2 Machine Learning in Software Engineering:


One of the generic phases of software engineering is requirements analysis. R.
Ankori [11] presented a new method for automatically eliciting functional requirements from
the stakeholders using agile processes. The presented system was a machine learning
system for the automation of some aspects of the software requirements phase in the
software engineering process. The learning system was based on Tecuci’s multistrategy
task-adaptive learning by justification trees algorithm, known as Disciple-MTL, and
supports a few of the practices that Extreme Programming (XP) requires. The aim of the
algorithm was to collect information from the various stakeholders and integrate a variety
of learning methods in the knowledge acquisition process, while involving certain and
plausible reasoning. The result of the manipulation was a list of requirements essential to
a software system.
The importance of software testing to quality assurance cannot be overemphasized. The
estimation of a module's fault-proneness is important for minimizing cost and improving
the effectiveness of the software testing process. Iker Gondra [12] proposed the
use of machine learning for this purpose. Specifically, given historical data on
software metric values and the number of reported errors, an Artificial Neural
Network (ANN) is trained. Then, to determine the importance of each software metric
in predicting fault-proneness, a sensitivity analysis is performed on the trained ANN.
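The sensitivity-analysis step can be sketched independently of the network itself: perturb each input metric and measure how much the prediction moves. The "trained model" below is a toy linear stand-in for an ANN, and the metric names are invented.

```python
# Rank input metrics by the sensitivity of a trained model's output to them.
def predict(metrics):
    # toy stand-in for a trained ANN: fault-proneness driven mostly by complexity
    return 0.8 * metrics["cyclomatic"] + 0.1 * metrics["loc"] + 0.0 * metrics["comments"]

def sensitivity(model, point, delta=1.0):
    """Absolute output change when each metric is bumped by delta."""
    base = model(point)
    return {m: abs(model(dict(point, **{m: point[m] + delta})) - base)
            for m in point}

point = {"cyclomatic": 10.0, "loc": 200.0, "comments": 30.0}
scores = sensitivity(predict, point)
print(max(scores, key=scores.get))
```

With a real ANN the same probe reveals which metrics the learned model actually relies on, which is what makes the analysis useful for test prioritization.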
Using a specific machine learning technique, Lionel C. Briand, Yvan Labiche, and
Xuetao Liu [13] proposed a way to identify suspicious statements during debugging. The
technique is based on principles similar to Tarantula but addresses its main flaw: its
difficulty to deal with the presence of multiple faults as it assumes that failing test cases
execute the same fault(s). The improvement they present in this paper results from the
use of C4.5 decision trees to identify various failure conditions based on information
regarding the test cases' inputs and outputs. They also showed that failure
conditions as modeled by a C4.5 decision tree accurately predict failures and can
therefore be used to help debugging.
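The C4.5 ingredient used in [13] can be illustrated by its core decision: pick the test-case attribute whose split yields the highest information gain with respect to pass/fail. The full C4.5 recursion and its gain-ratio refinement are omitted, and the test-case attributes are invented.

```python
# Choose the best splitting attribute by information gain over pass/fail labels.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    gain = entropy(labels)
    for value in set(r[attr] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

rows = [
    {"input_size": "big", "mode": "batch"},
    {"input_size": "big", "mode": "online"},
    {"input_size": "small", "mode": "batch"},
    {"input_size": "small", "mode": "online"},
]
labels = ["fail", "fail", "pass", "pass"]  # failures driven by input_size
best = max(("input_size", "mode"), key=lambda a: info_gain(rows, labels, a))
print(best)
```

Recursing on each split yields the tree whose failing leaves model distinct failure conditions, which is what lets the technique cope with multiple faults.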
Machine learning techniques have long been used for various purposes in software
engineering. L. C. Briand [14] provided a brief overview of the state of the art and
reported on a number of novel applications he was involved with in the area of software testing.
Reflecting on this personal experience, he drew lessons learned and argued that more
research should be performed in that direction as machine learning has the potential to
significantly help in addressing some of the long-standing software testing problems.
Software engineering research and practice have thus far been conducted primarily
in a value-neutral setting, where every artifact in software development, such as
a requirement, use case, test case, or defect, is treated as equally important
during the development process. Machine learning has been playing an increasingly important role
in helping develop and maintain large and complex software systems. Du Zhang [15]
advocated a shift to applying machine learning methods to value-based software
engineering. He proposed a framework for value-based software test data generation. The
proposed framework incorporates some general principles in value-based software testing
and can help improve return on investment.
The software development process imposes major impacts on the quality of software at
every development stage; therefore, a common goal of each software development phase
concerns how to improve software quality. Fei Xing, Ping Guo and Michael R. Lyu [16]
proposed a novel technique to predict software quality by adopting Support Vector
Machine (SVM) in the classification of software modules based on complexity metrics.
Because only limited software complexity information is available early in the
software life cycle, ordinary software quality models generally cannot make good
predictions. Experimental results on a Medical Imaging System metrics data set
show that their SVM prediction model achieves better software quality prediction
than some commonly used software quality prediction models.
In the context of open source development or software evolution, developers often face
test suites which have been developed with no apparent rationale and which may need to
be augmented or refined to ensure sufficient dependability, or even reduced to meet tight
deadlines. Briand, Labiche, and Bawar [17] referred to this process as the
re-engineering of test suites. It is important to provide both methodological and
tool support to help engineers understand the limitations of test suites and their
possible redundancies, so as to refine them in a cost-effective manner. To address
this problem in the case of black-box testing, they proposed a methodology based
on machine learning that showed promising results in a case study.
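One redundancy notion relevant to test-suite re-engineering can be sketched with a classic greedy set-cover reduction: a test is redundant if everything it exercises is exercised by tests already kept. This is a standard minimization heuristic, not the authors' exact black-box method, and the coverage data is invented.

```python
# Greedy test-suite reduction: keep tests until all covered items are covered.
def reduce_suite(coverage):
    """coverage maps test name -> set of covered items (e.g. input categories)."""
    needed = set().union(*coverage.values())
    kept, covered = [], set()
    # consider tests from broadest to narrowest coverage
    for test in sorted(coverage, key=lambda t: len(coverage[t]), reverse=True):
        if coverage[test] - covered:      # contributes something new
            kept.append(test)
            covered |= coverage[test]
        if covered == needed:
            break
    return sorted(kept)

coverage = {
    "t1": {"a", "b"},
    "t2": {"b"},        # redundant: t1 already covers b
    "t3": {"c"},
}
print(reduce_suite(coverage))
```

In a black-box setting the "items" would be input/output categories learned from the specification rather than code coverage.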

4. Conclusion:
We have seen how ML has been used in prediction and estimation, property and model
discovery, transformation, generation and synthesis, reuse, requirements
acquisition, and the management of development knowledge. We have also seen how
data mining plays an important role in defect detection and correction,
discovering reuse patterns, finding bugs, and classifying software failures.

5. References:
[1] Chadd C. Williams and Jeffrey K. Hollingsworth, "Automatic Mining of Source
Code Repositories to Improve Bug Finding Techniques", IEEE Transactions on
Software Engineering, vol. 31, no. 6, June 2005, pp. 466-480.
[2] Qinbao Song, Martin Shepperd, Michelle Cartwright, and Carolyn Mair, "Software
Defect Association Mining and Defect Correction Effort Prediction", IEEE
Transactions on Software Engineering, vol. 32, no. 2, Feb. 2006, pp. 69-82.
[3] Amir Michail, "Data Mining Library Reuse Patterns in User-Selected
Applications", Proceedings of the 14th IEEE International Conference on Automated
Software Engineering (ASE 1999), Oct. 1999, pp. 24-33.
[4] Per Runeson, Magnus Alexandersson, and Oskar Nyholm, "Detection of Duplicate
Defect Reports Using Natural Language Processing", Proceedings of the 29th
International Conference on Software Engineering (ICSE 2007), 20-26 May 2007,
pp. 499-510.
[5] R. P. Jagadeesh Chandra Bose and S. H. Srinivasan, "Data Mining Approaches to
Software Fault Diagnosis", Proceedings of the 15th International Workshop on
Research Issues in Data Engineering: Stream Data Mining and Applications
(RIDE-SDMA 2005), 3-4 April 2005, pp. 45-52.
[6] Patrick Francis, David Leon, Melinda Minch, and Andy Podgurski, "Tree-Based
Methods for Classifying Software Failures", Proceedings of the 15th International
Symposium on Software Reliability Engineering (ISSRE 2004), 2-5 Nov. 2004,
pp. 451-462.
[7] Ben Liblit, Jake Rosin, Ting Chen, and Piramanayagam Arumuga Nainar,
"Statistical Debugging Using Compound Boolean Predicates", Proceedings of the 2007
International Symposium on Software Testing and Analysis (ISSTA 2007), pp. 5-15.
[8] Tao Xie and Evan Martin, "Understanding Software Application Interfaces via
String Analysis", Proceedings of the 28th International Conference on Software
Engineering (ICSE 2006), 2006, pp. 901-904.
[9] G. Antoniol, Y.-G. Gueheneuc, E. Merlo, and P. Tonella, "Mining the Lexicon
Used by Programmers during Software Evolution", Proceedings of the IEEE
International Conference on Software Maintenance (ICSM 2007), 2-5 Oct. 2007,
pp. 14-23.
[10] David Lo and Siau-Cheng Khoo, "Mining Patterns and Rules for Software
Specification Discovery", Proceedings of the VLDB Endowment, vol. 1, no. 2,
Aug. 2008, pp. 1609-1616.
[11] R. Ankori, "Automatic Requirements Elicitation in Agile Processes",
Proceedings of the IEEE International Conference on Software - Science,
Technology and Engineering, 22-23 Feb. 2005, pp. 101-109.
[12] Iker Gondra, "Applying Machine Learning to Software Fault-Proneness
Prediction", Journal of Systems and Software, vol. 81, no. 2, Feb. 2008,
pp. 186-195.
[13] Lionel C. Briand, Yvan Labiche, and Xuetao Liu, "Using Machine Learning to
Support Debugging with Tarantula", Proceedings of the 18th IEEE International
Symposium on Software Reliability Engineering (ISSRE 2007), 5-9 Nov. 2007,
pp. 137-146.
[14] L. C. Briand, "Novel Applications of Machine Learning in Software Testing",
Proceedings of the Eighth International Conference on Quality Software
(QSIC 2008), 12-13 Aug. 2008, pp. 3-10.
[15] Du Zhang, "Machine Learning in Value-Based Software Test Data Generation",
Proceedings of the 18th IEEE International Conference on Tools with Artificial
Intelligence (ICTAI 2006), Nov. 2006, pp. 732-736.
[16] Fei Xing, Ping Guo, and Michael R. Lyu, "A Novel Method for Early Software
Quality Prediction Based on Support Vector Machine", Proceedings of the 16th IEEE
International Symposium on Software Reliability Engineering (ISSRE 2005),
Nov. 2005, 10 pp.
[17] L. C. Briand, Y. Labiche, and Z. Bawar, "Using Machine Learning to Refine
Black-Box Test Specifications and Test Suites", Proceedings of the Eighth
International Conference on Quality Software (QSIC 2008), 12-13 Aug. 2008,
pp. 135-144.
