AUREA: Automated User REview Analyzer

Adelina Ciurumelea, Sebastiano Panichella, Harald C. Gall
University of Zurich, Department of Informatics, Switzerland
ciurumelea@ifi.uzh.ch, panichella@ifi.uzh.ch, gall@ifi.uzh.ch

ABSTRACT
We present a novel tool, AUREA, that automatically classifies mobile app reviews and filters and facilitates their analysis using fine-grained, mobile-specific categories. We aim to help developers analyse the direct and valuable feedback that users provide through their reviews, in order to better plan maintenance and evolution activities for their apps. Reviews are often difficult to analyse because of their unstructured textual nature and their frequency; moreover, only a third of them are actually informative. We believe that by using our tool, developers can reduce the amount of time required to analyse and understand the issues users encounter and to plan appropriate change tasks. We have thoroughly evaluated the tool and report high precision and recall for the classification task. Additionally, we asked 3 external inspectors to qualitatively evaluate AUREA and can report encouraging feedback that indicates its usefulness. A video describing the main features of the tool can be found at: https://youtu.be/V62ngWVvFpc.

KEYWORDS
Mobile Applications, User Reviews, Text Classification

ACM Reference Format:
Adelina Ciurumelea, Sebastiano Panichella, and Harald C. Gall. 2018. AUREA: Automated User REview Analyzer. In Proceedings of International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 4 pages. https://doi.org/

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
ICSE '18, May-June 2018, Gothenburg, Sweden
© 2018 Association for Computing Machinery.

1 INTRODUCTION
The popularity of smartphones and the use and development of mobile applications have increased remarkably over the past years. In this context, Google Play and the Apple Store host more than 2 billion apps each [15] and enable the download of millions of apps every day [9, 14]. In addition to facilitating their distribution, mobile marketplaces allow users to easily rate and write reviews for their apps. These represent a rich source of information for developers, containing direct and valuable feedback from their users. Nevertheless, they are difficult to analyse manually for several reasons: (i) they consist of unstructured text with a low descriptive quality, as they were written by users without any technical knowledge, (ii) only a third of them are actually informative [2, 9] and popular apps can receive up to several thousands of reviews per day [10].

Researchers have studied the characteristics of user comments and observed that users often report bugs and feature requests in their reviews [10], describe their experience with specific features [6], ask for feature enhancements [7] and include comparisons with other apps. AR-Miner was one of the first tools proposed in the literature to automatically classify reviews as either informative or non-informative [2]. Other researchers have proposed several approaches for automatically selecting relevant reviews for maintenance activities [4-7, 16]. Most approaches classify reviews according to a restricted set of classes (e.g. feature request, bug etc.) and then cluster them based on textual similarity; this results in unstructured groups of reviews that have to be manually analysed and understood by developers in order to extract meaningful change tasks.

In this paper, we propose a tool named AUREA (Automated User REview Analyzer) to classify reviews according to fine-grained topics addressed by users in their reviews, which are relevant for developers while planning maintenance and evolution tasks for their applications. The tool is based on the approach detailed in [3], uses Machine Learning to classify reviews according to mobile-specific categories, and provides a user-friendly interface that allows developers to import, analyse and filter the reviews of their applications. The tool is available online at https://github.com/adelinac/urr and can be executed in the browser after following the installation instructions.

The main contributions of our work are: (i) a dataset of 6107 reviews manually labeled according to the categories defined in Table 1; (ii) the implementation of the AUREA tool that automatically classifies, filters and facilitates the analysis of reviews according to the defined categories; (iii) a thorough quantitative and qualitative evaluation of our tool, with all the supporting files available in the online repository.

2 APPROACH OVERVIEW
AUREA uses pre-trained Machine Learning models to classify app reviews according to mobile-specific and actionable topics, as specified in the taxonomy in Table 1; additionally, it facilitates their filtering and analysis through an intuitive interface. This is different from previous work that uses a limited set of categories to classify reviews, for example as informative and non-informative [2], or as bug, feature request and other [16]. More details about the taxonomy definition are in our previous paper [3].

In developing the tool we followed the steps outlined below:
• one author of the paper manually labeled a set of 6107 reviews from 37 open source apps available on Google Play according to the categories in Table 1, to build the training dataset;
• we experimented with different parameters of the Gradient Boosted Trees algorithm [13] (number of estimators, learning rate) and preprocessing steps (with/without stop words) and selected the configuration that gave the best results on our labeled dataset;
• given the parameters obtained previously, we trained a separate classifier for each category of our taxonomy and saved it for later use;
• as a last step, we developed a web application that can be run locally in the browser and uses the stored classifiers to facilitate the analysis of app reviews.

Table 1: Taxonomy of Review Categories

Review Category   Description
Device            mentions a specific mobile phone device (e.g. Galaxy 6).
Android Version   references the OS version (e.g. Marshmallow).
Hardware          talks about a specific hardware component.
App Usability     talks about ease or difficulty in using a feature.
UI                mentions a UI element (e.g. button, menu item).
Performance       talks about the performance of the app (slow, fast).
Battery           references related to the battery (e.g. drains battery).
Memory            mentions issues related to the memory (e.g. out of memory).
Licensing         references the licensing model of the app (e.g. free, pro version).
Price             talks about money aspects (e.g. donated $5).
Security          talks about security or the lack of it.
Privacy           issues related to permissions and user data.
Complaint         the user reports or complains about an issue with the app.

AUREA implements the previously described steps and consists of two main components: the Classification and the Web App.

The Classification component is responsible for the following tasks: preprocessing, feature extraction, classifier training and classification of new reviews. The preprocessing consists of removing punctuation and stop words according to the English stop words list and applying the Snowball Stemmer [11] to reduce words to their root form. As features, we extracted the tf-idf scores [1] of the unigrams, bigrams and trigrams [8] of the preprocessed review text. To train the classifiers we used the Gradient Boosted Trees model implemented by the scikit-learn library [13]. This model makes predictions by combining the decisions of a sequence of weak learners, usually decision trees. During training, at each iteration a new tree is built to maximise the classification correctness on the training data. While building the tree models, the feature that best partitions the data is selected at each node. After adding a tree, the model evaluates the accuracy on the training set and increases the weight of each misclassified training example. Therefore, in the next iteration the new tree will try harder to correctly classify those examples.

We would like to note that we perform multi-label classification on complete reviews; therefore, for a specific review, AUREA will return a list of matching categories. This is different from previous work that often performs single-label classification on review sentences. Classifying single sentences has the drawback that it takes the sentence out of context and makes the reported issues harder to understand. Additionally, reviews are often grammatically incorrect, therefore sentence splitting is likely to be error prone, and it is still possible that users will report several issues in a single sentence.

The Web App component provides the graphical user interface for app developers to easily analyse the reviews of their apps according to our taxonomy. It was implemented using the Flask web framework written in Python [12]. We have developed the AUREA tool to run locally in the web browser, but it can be adapted and made accessible online. The Web App communicates with the Classification component to preprocess, extract the features and classify new reviews, and then displays the results to the user.

3 AUREA IN ACTION
AUREA provides an intuitive and user-friendly web interface that allows developers to easily upload, analyse and filter their reviews based on the categories defined in Table 1. In order to run it, a user first has to clone the public GitHub repository (https://github.com/adelinac/urr), follow the instructions included in the README.txt and then access the app using a browser. The first page, shown in Figure 1, contains a short explanation, the buttons for Uploading review files, Analysing and Classifying them, and a legend with the definition and icons for each category from the taxonomy. Please note that mobile developers can easily download the reviews of their apps from Google Play or the Apple Store in a csv format, as the one expected by our tool. If there is any difference in the format, a helpful error message is shown.

Figure 1: AUREA Home Page

AUREA was developed expecting the following usage scenario:
• The developer first uploads a csv file containing reviews with the Upload functionality. This will first check that the file has the required format and fields and then upload it to an internal directory.
• Next, the file is available in the drop-down menu for Select reviews files; after selecting it, the user can click on the Analysis button to see which categories are most often associated with Complaints in reviews, as in Figure 2. For each category, AUREA will show the percentage and absolute number of reviews that are not classified as Complaints (in green) and the ones that are (in red). In this way the developer can quickly notice which categories are problematic for users of the app; e.g. in Figure 2 it is easy to see that reviews mentioning the Android version also report issues with the app.
• Subsequently, the user can analyse the reviews belonging to the problematic categories by using the Classify functionality. This will open a new page, as in Figure 3, where one can easily browse all the reviews of the app and check which categories they belong to. To analyse only a subset of them, the developer can use the filtering functionality, implemented with the Select Categories drop-down menu, to restrict the displayed reviews to specific categories. For example, the developer can look only at the Android Version and Complaint categories, to understand what compatibility issues users encounter with the different operating system versions. Another use case is for the developer to filter the reviews based on each of the categories defined in the taxonomy; in this way AUREA will only display the reviews focusing on the specific topic, therefore reducing the manual analysis required from the developer to understand the general opinion of the users regarding that topic.

Here we would like to note that, as Chen et al. [2] observed, only a third of reviews are informative. Although our taxonomy is different from the one defined in [2], we also noticed that each time we filter reviews based on one of our categories, we obtain a much smaller subset, thus considerably decreasing the number of reviews that have to be read and analysed.

Figure 2: Analysis Results

Figure 3: Classification Results

4 EVALUATION
We have thoroughly evaluated our tool both from a quantitative and a qualitative perspective, performing a new evaluation, different from the one presented in our original paper [3]. For the first study we report the precision, recall and F1 scores obtained for the classification task. For the second one, we asked 3 external evaluators to analyse the reviews of 3 open source apps with Excel and AUREA and then fill out a short survey comparing the two tools.

4.1 Quantitative Evaluation
To train and evaluate the ML models used by our tool, one of the authors manually labeled a set of 6107 reviews from 37 mobile apps available on Google Play according to the taxonomy in Table 1. To enable multi-label classification, we trained and used a separate classifier per category, which we evaluated using 10-fold cross-validation; we report the results in Table 2. Although we are using a large dataset, our categories are fine grained and we obtain an imbalanced set for most of them (a small percentage of reviews belongs to a category as compared to the ones that do not). Therefore, to accurately evaluate our classifiers we report the precision, recall and F1 scores only for the positive class (e.g. the review should be classified as Device); in this way we obtain more conservative values. Here we would like to note that the scores for the negative class are usually above 90%, hence leading to higher average scores than the ones reported in Table 2.

The scores for AUREA are very positive, with precision above 76% for all categories and recall above 70% for most categories. We have analysed the predictions of the lower performing classifiers to understand the reasons and how we could improve them. The category with the lowest recall is Memory, likely because it is the category with the fewest reviews in our dataset. One way to improve the scores is to extend the dataset to include more memory related reviews. Another issue is that the words typically used to talk about the memory occupied by the app (e.g. memory, space) also have other meanings, as in the following review sentence: "It's difficult to use vi or sed to edit scripts with a touch phone keyboard that hogs half the screen space.". Another category with a lower recall is UI; this includes a wide range of reviews that address the interface of the app: users might discuss a specific button or menu item, but also the overall interface. AUREA misclassifies such reviews because users often refer to an interface element using the name of the feature it implements, which is difficult for a classifier to recognize. Additionally, users often refer to an element of the UI using the word option, but this also has a different meaning (e.g. "No cloud option? It's good apk, but there is no difference between native phone clipboard and this one...") making it more difficult for the classifier to recognize the reviews that are UI related. Another category with a recall of under 70% is Complaint; this is a very general category that includes bug reports but also any issues that users are unhappy about regarding the app. We noticed that the classifier has problems with negations, when the user claims that the app has no bugs or does not need any fixes (e.g. "...It works consistently. No bugs. Good for ..."), or when classifying contradictory reviews (e.g. "I hate it so much but I can't stop playing it!!"). A way to handle this problem is to add a sentiment analysis feature during training, which we are considering for future work.
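As a concrete illustration of the per-category setup evaluated above and of the pipeline described in Section 2, the following sketch shows one plausible way to wire it up with scikit-learn. The toy reviews, label vectors and the n_estimators value are invented for the example and are not the tool's actual data or configuration; the real tool additionally applies the Snowball stemmer before vectorization.

```python
# Sketch: tf-idf features + one Gradient Boosted Trees classifier per
# taxonomy category (multi-label classification of whole reviews).
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy reviews; the real tool trains on 6107 labeled reviews.
reviews = [
    "app drains battery on my galaxy s6",
    "great app, very easy to use",
    "crashes after the marshmallow update",
    "the new menu button is confusing",
    "runs out of memory when loading photos",
    "love it, works perfectly",
]
# One binary label vector per category (1 = category applies).
labels = {
    "Battery":   [1, 0, 0, 0, 0, 0],
    "Complaint": [1, 0, 1, 1, 1, 0],
    "Memory":    [0, 0, 0, 0, 1, 0],
}

# English stop-word removal and unigram/bigram/trigram tf-idf scores.
vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 3))
X = vectorizer.fit_transform(reviews)

# Multi-label classification via a separate binary classifier per category.
classifiers = {
    cat: GradientBoostingClassifier(n_estimators=50).fit(X, y)
    for cat, y in labels.items()
}

def classify(review):
    """Return the list of taxonomy categories matching a review."""
    x = vectorizer.transform([review])
    return [cat for cat, clf in classifiers.items() if clf.predict(x)[0] == 1]
```

Keeping one binary classifier per category, rather than one multi-class model, is what lets a single review receive several labels at once.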
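The positive-class-only scoring used in the evaluation above can be reproduced with scikit-learn's metrics. The two label vectors below are invented illustrative values for a single imbalanced category, not numbers from our evaluation:

```python
# Precision/recall/F1 computed only for the positive class of one
# category, the conservative choice for imbalanced label distributions.
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # gold labels (e.g. "Device")
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # classifier predictions

# average="binary" scores only the class given by pos_label, ignoring
# the (usually much easier) negative class.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", pos_label=1)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
# prints: precision=0.75 recall=0.75 f1=0.75
```

Averaging over both classes would mix in the near-perfect negative-class scores and inflate the reported numbers, which is why only the positive class is scored.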