
AUREA: Automated User REview Analyzer

Adelina Ciurumelea, Sebastiano Panichella, Harald C. Gall
University of Zurich, Department of Informatics, Switzerland
ciurumelea@ifi.uzh.ch, panichella@ifi.uzh.ch, gall@ifi.uzh.ch
ABSTRACT

We present a novel tool, AUREA, that automatically classifies mobile app reviews and filters and facilitates their analysis using fine-grained, mobile-specific categories. We aim to help developers analyse the direct and valuable feedback that users provide through their reviews, in order to better plan maintenance and evolution activities for their apps. Reviews are often difficult to analyse because of their unstructured textual nature and their frequency; moreover, only a third of them are actually informative. We believe that by using our tool, developers can reduce the amount of time required to analyse and understand the issues users encounter and to plan appropriate change tasks. We have thoroughly evaluated the tool and report high precision and recall for the classification task. Additionally, we asked 3 external inspectors to qualitatively evaluate AUREA and can report encouraging feedback that indicates its usefulness. A video describing the main features of the tool can be found at: https://youtu.be/V62ngWVvFpc.

KEYWORDS

Mobile Applications, User Reviews, Text Classification

ACM Reference Format:
Adelina Ciurumelea, Sebastiano Panichella, and Harald C. Gall. 2018. AUREA: Automated User REview Analyzer. In Proceedings of International Conference on Software Engineering (ICSE '18). ACM, New York, NY, USA, 4 pages. https://doi.org/

1 INTRODUCTION

The popularity of smartphones and the use and development of mobile applications have increased remarkably over the past years. In this context, Google Play and the Apple Store host more than 2 billion apps each [15] and enable the download of millions of apps every day [9, 14]. In addition to facilitating their distribution, mobile marketplaces allow users to easily rate and write reviews for their apps. These represent a rich source of information for developers, containing direct and valuable feedback from their users. Nevertheless, they are difficult to analyse manually for several reasons: (i) they consist of unstructured text with a low descriptive quality, as they are written by users without any technical knowledge, and (ii) only a third of them are actually informative [2, 9], while popular apps can receive up to several thousands of reviews per day [10].

Researchers have studied the characteristics of user comments and observed that users often report bugs and feature requests in their reviews [10], describe their experience with specific features [6], ask for feature enhancements [7] and include comparisons with other apps. AR-Miner was one of the first tools proposed in the literature to automatically classify reviews as either informative or non-informative [2]. Other researchers have proposed several approaches for automatically selecting relevant reviews for maintenance activities [4-7, 16]. Most approaches classify reviews according to a restricted set of classes (e.g. feature request, bug, etc.) and then cluster them based on textual similarity; this results in unstructured groups of reviews that have to be manually analysed and understood by developers in order to extract meaningful change tasks.

In this paper, we propose a tool named AUREA (Automated User REview Analyzer) that classifies reviews according to fine-grained topics addressed by users in their reviews, which are relevant for developers while planning maintenance and evolution tasks for their applications. The tool is based on the approach detailed in [3]: it uses Machine Learning to classify reviews according to mobile-specific categories and provides a user-friendly interface that allows developers to import, analyse and filter the reviews of their applications. The tool is available online at https://github.com/adelinac/urr and can be executed in the browser after following the installation instructions.

The main contributions of our work are: (i) a dataset of 6107 reviews manually labeled according to the categories defined in Table 1; (ii) the implementation of the AUREA tool that automatically classifies, filters and facilitates the analysis of reviews according to the defined categories; (iii) a thorough quantitative and qualitative evaluation of our tool, with all the supporting files available in the online repository.

2 APPROACH OVERVIEW

AUREA uses pre-trained Machine Learning models to classify app reviews according to mobile-specific and actionable topics, as specified in the taxonomy in Table 1; additionally, it facilitates their filtering and analysis through an intuitive interface. This is different from previous work that uses a limited set of categories to classify reviews, for example as informative and non-informative [2], or as bug, feature request and other [16]. More details about the taxonomy definition are in our previous paper [3].

In developing the tool we followed the steps outlined below:

• one author of the paper manually labeled a set of 6107 reviews from 37 open source apps available on Google Play according to the categories in Table 1, to build the training dataset;

• we experimented with different parameters of the Gradient Boosted Trees algorithm [13] (number of estimators, learning rate) and with preprocessing steps (with/without stop words), and selected the configuration that gave the best results on our labeled dataset (see the sketch after this list);
• given the parameters obtained previously, we trained a separate classifier for each category from our taxonomy and saved it for later use;
• as a last step, we developed a web application that can be run locally in the browser and uses the stored classifiers to facilitate the analysis of app reviews.
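For illustration, the following is a minimal scikit-learn sketch of such a parameter search; the grid values, file name and column names are illustrative assumptions rather than the configuration actually used.

# Sketch only: grid search over Gradient Boosted Trees parameters and the stop-word option.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical input: 'text' holds the review, 'Device' a 0/1 label for one category.
reviews = pd.read_csv("labeled_reviews.csv")

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),        # uni-, bi- and trigram tf-idf features
    ("clf", GradientBoostingClassifier(random_state=42)),  # Gradient Boosted Trees [13]
])

param_grid = {
    "tfidf__stop_words": [None, "english"],  # with/without stop-word removal
    "clf__n_estimators": [100, 200, 500],    # illustrative values only
    "clf__learning_rate": [0.05, 0.1, 0.2],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=10, n_jobs=-1)
search.fit(reviews["text"], reviews["Device"])
print(search.best_params_, search.best_score_)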
AUREA implements the previously described steps and consists of two main components: the Classification component and the Web App.

The Classification component is responsible for the following tasks: preprocessing, feature extraction, classifier training and the classification of new reviews. The preprocessing consists of removing punctuation and stop words according to the English stop-word list and applying the Snowball Stemmer [11] to reduce words to their root form. As features, we extracted the tf-idf scores [1] of the unigrams, bigrams and trigrams [8] of the preprocessed review text. To train the classifiers we used the Gradient Boosted Trees model implemented by the scikit-learn library [13]. This model makes predictions by combining the decisions of a sequence of weak learners, usually decision trees. During training, at each iteration a new tree is built to maximise the classification correctness on the training data. While building the tree models, the feature that best partitions the data is selected at each node. After adding a tree, the model evaluates the accuracy on the training set and increases the weight of each misclassified training example; therefore, in the next iteration the new tree will try harder to classify those correctly.

We would like to note that we perform multi-label classification on complete reviews; therefore, for a specific review, AUREA will return a list of matching categories (see the sketch below). This is different from previous work that often performs single-label classification on review sentences. Classifying single sentences has the drawback that it takes the sentence out of context and makes the reported issues harder to understand. Additionally, reviews are often grammatically incorrect, therefore sentence splitting is likely to be error prone, and it is still possible that users will report several issues in a single sentence.
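The following sketch illustrates this per-category setup under simplifying assumptions (the preprocessing helper, file paths and exact pipeline settings are placeholders, not AUREA's exact implementation): one binary pipeline is trained and persisted per taxonomy category, and a review is tagged with every category whose classifier predicts the positive class.

# Sketch only: one binary classifier per category, combined into a multi-label tagger.
import re
import joblib
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

CATEGORIES = ["Device", "Android Version", "Hardware", "App Usability", "UI",
              "Performance", "Battery", "Memory", "Licensing", "Price",
              "Security", "Privacy", "Complaint"]
stemmer = SnowballStemmer("english")

def preprocess(text):
    """Remove punctuation and English stop words, then stem each word."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in ENGLISH_STOP_WORDS)

def train_all(reviews, labels):
    """Train and persist one binary pipeline per category; labels[cat] is a 0/1 vector."""
    for cat in CATEGORIES:
        pipe = Pipeline([
            ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
            ("clf", GradientBoostingClassifier()),
        ])
        pipe.fit([preprocess(r) for r in reviews], labels[cat])
        joblib.dump(pipe, f"models/{cat}.joblib")  # saved for later use by the Web App

def classify_review(review):
    """Return the list of categories whose stored classifier fires for this review."""
    text = preprocess(review)
    matches = []
    for cat in CATEGORIES:
        pipe = joblib.load(f"models/{cat}.joblib")
        if pipe.predict([text])[0] == 1:
            matches.append(cat)
    return matches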
The Web App component provides the graphical user interface that allows app developers to easily analyse the reviews of their apps according to our taxonomy. It was implemented using the Flask web framework, written in Python [12]. We have developed the AUREA tool to run locally in the web browser, but it can be adapted and made accessible online. The Web App communicates with the Classification component to preprocess, extract the features and classify new reviews, and then displays the results to the user.
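The following Flask sketch is purely illustrative of how the Web App could hand an uploaded CSV to the Classification component; the route names, the required CSV columns and the classify_review helper from the previous sketch are assumptions, not the tool's actual interface.

# Hypothetical sketch of the Web App side (Flask), not AUREA's actual routes.
import pandas as pd
from flask import Flask, request, render_template, jsonify

app = Flask(__name__)
REQUIRED_COLUMNS = {"reviewText", "rating", "date"}  # assumed CSV fields

@app.route("/upload", methods=["POST"])
def upload():
    """Check that the uploaded CSV has the expected fields, then store it internally."""
    df = pd.read_csv(request.files["reviews"])
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        # a helpful error message is shown when the format differs
        return jsonify(error=f"Missing columns: {sorted(missing)}"), 400
    df.to_csv("uploads/reviews.csv", index=False)
    return jsonify(status="uploaded", reviews=len(df))

@app.route("/classify")
def classify():
    """Run the stored per-category classifiers and render the classification page."""
    df = pd.read_csv("uploads/reviews.csv")
    df["categories"] = df["reviewText"].apply(classify_review)  # from the sketch above
    return render_template("classification.html", reviews=df.to_dict("records"))

if __name__ == "__main__":
    app.run(debug=True)  # runs locally in the browser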
Table 1: Taxonomy of Review Categories

Review Category   Description
Device            mentions a specific mobile phone device (i.e. Galaxy 6).
Android Version   references the OS version (i.e. Marshmallow).
Hardware          talks about a specific hardware component.
App Usability     talks about ease or difficulty in using a feature.
UI                mentions a UI element (i.e. button, menu item).
Performance       talks about the performance of the app (slow, fast).
Battery           references related to the battery (i.e. drains battery).
Memory            mentions issues related to the memory (i.e. out of memory).
Licensing         references the licensing model of the app (i.e. free, pro version).
Price             talks about money aspects (i.e. donated 5$).
Security          talks about security or the lack of it.
Privacy           issues related to permissions and user data.
Complaint         the user reports or complains about an issue with the app.

3 AUREA IN ACTION

AUREA provides an intuitive and user-friendly web interface that allows developers to easily upload, analyse and filter their reviews based on the categories defined in Table 1. In order to run it, a user first has to clone the public GitHub repository (https://github.com/adelinac/urr), follow the instructions included in the README.txt and then access the app using a browser. The first page, shown in Figure 1, contains a short explanation, the buttons for uploading review files, analysing and classifying them, and a legend with the definition and icon for each category from the taxonomy. Please note that mobile developers can easily download the reviews of their apps from Google Play or the Apple Store in a csv format, like the one expected by our tool. If there is any difference in the format, a helpful error message is shown.

Figure 1: AUREA Home Page

AUREA was developed expecting the following usage scenario:

• The developer first uploads a csv file containing reviews with the Upload functionality. This will first check that the file has the required format and fields and then upload it to an internal directory.
• Next, the file is available in the drop-down menu for Select reviews files; after selecting it, the user can click on the Analysis button to see which categories are most often associated with Complaints in reviews, as in Figure 2. For each category, AUREA will show the percentage and absolute number of reviews that are not classified as Complaints (in green) and the ones that are (in red). In this way the developer can quickly notice which categories are problematic for users of the app; e.g. in Figure 2 it is easy to see that reviews mentioning the Android version also report issues with the app.

• Subsequently, the user can analyse the reviews belonging to the problematic categories by using the Classify functionality. This will open a new page, as in Figure 3, where one can easily browse all the reviews of the app and check to which categories they belong. To analyse only a subset of them, the developer can use the filtering functionality, implemented with the Select Categories drop-down menu, to restrict the displayed reviews to specific categories. For example, the developer can look only at the Android Version and Complaint categories to understand what compatibility issues users encounter with the different operating system versions. Another use case is for the developer to filter the reviews based on each of the categories defined in the taxonomy; in this way AUREA will only display the reviews focusing on the specific topic, therefore reducing the manual analysis required from the developer to understand the general opinion of the users regarding that topic.

Here we would like to note that, as Chen et al. [2] observed, only a third of reviews are informative. Although our taxonomy is different from the one defined in [2], we also noticed that each time we filter reviews based on one of our categories, we obtain a much smaller subset, thus considerably decreasing the number of reviews that have to be read and analysed.
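As an illustration of this filtering step, the sketch below keeps only the reviews tagged with every selected category and reports how much the subset shrinks; the boolean per-category columns are an assumed data layout, not necessarily AUREA's internals.

# Illustrative only: filter classified reviews to a set of categories, as in the Classify view.
import pandas as pd

def filter_reviews(df, selected):
    """Keep only reviews tagged with every selected category.
    df is assumed to have one boolean column per taxonomy category."""
    mask = pd.Series(True, index=df.index)
    for cat in selected:
        mask &= df[cat]
    return df[mask]

# Example: reviews mentioning the Android version that are also complaints.
# subset = filter_reviews(classified_reviews, ["Android Version", "Complaint"])
# print(f"{len(subset)} of {len(classified_reviews)} reviews remain after filtering")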
Figure 2: Analysis Results

Figure 3: Classification Results

4 EVALUATION

We have thoroughly evaluated our tool demo both from a quantitative and a qualitative perspective, performing a new evaluation, different from the one presented in our original paper [3]. For the first study, we report the precision, recall and F1 scores obtained for the classification task. For the second one, we asked 3 external evaluators to analyse the reviews of 3 open source apps with Excel and with AUREA and then fill out a short survey comparing the two tools.

4.1 Quantitative Evaluation

To train and evaluate the ML models used by our tool, one of the authors manually labeled a set of 6107 reviews from 37 mobile apps available on Google Play according to the taxonomy in Table 1. To enable multi-label classification, we trained and used a separate classifier per category, which we evaluated using 10-fold cross-validation; we report the results in Table 2. Although we are using a large dataset, our categories are fine grained and we obtain an imbalanced set for most of them (a small percentage of reviews belongs to a category compared to the ones that do not). Therefore, to accurately evaluate our classifiers, we report the precision, recall and F1 scores only for the positive class (e.g. the review should be classified as Device); in this way we obtain more conservative values. Here we would like to note that the scores for the negative class are usually above 90%, hence leading to higher average scores than the ones reported in Table 2.
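A minimal sketch of this evaluation protocol, assuming the per-category pipelines and 0/1 label vectors from the earlier sketches (the actual cross-validation scripts are not shown here):

# Sketch: 10-fold cross-validation per category, scoring only the positive class.
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_recall_fscore_support

def evaluate_category(pipeline, texts, y):
    """y is the 0/1 label vector for one category; metrics are for the positive class only."""
    y_pred = cross_val_predict(pipeline, texts, y, cv=10)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y, y_pred, average="binary", pos_label=1)
    return precision, recall, f1

# Example: p, r, f1 = evaluate_category(device_pipeline, preprocessed_texts, labels["Device"])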
The scores for AUREA are very positive, with precision above 76% for all categories and recall above 70% for most categories. We have analysed the predictions of the lower-performing classifiers to try to understand the reasons and how we could improve them. The category with the lowest recall is Memory, likely because it is the category with the fewest reviews in our dataset. One way to improve the scores is to extend the dataset to include more memory-related reviews. Another issue is that the words typically used to talk about the memory occupied by the app (e.g. memory, space) also have other meanings, as in the following review sentence: "It's difficult to use vi or sed to edit scripts with a touch phone keyboard that hogs half the screen space.". Another category with a lower recall is UI; it includes a wide range of reviews that address the interface of the app, where users might discuss a specific button or menu item, but also the overall interface. AUREA misclassifies such reviews because users often refer to an interface element using the name of the feature it implements, which is difficult for a classifier to recognize. Additionally, users often refer to an element of the UI using the word option, but this also has a different meaning (e.g. "No cloud option? It's good apk, but there is no difference between native phone clipboard and this one..."), making it more difficult for the classifier to recognize the reviews that are UI related. Another category with a recall of under 70% is Complaint; this is a very general category that includes bug reports but also any issue that users are unhappy about regarding the app. We noticed that the classifier has problems with negations, when the user claims that the app has no bugs or does not need any fixes (e.g. "...It works consistently. No bugs. Good for ..."), or when classifying contradictory reviews (e.g. "I hate it so much but I can't stop playing it!!"). A way to handle this problem is to add a sentiment analysis feature during training, which we are considering for future work.
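Since sentiment analysis is mentioned only as future work, the following is a purely hypothetical sketch of how a sentiment score could be appended to the tf-idf features; the VADER analyzer is one possible choice and not something AUREA currently uses.

# Hypothetical future-work sketch: combine tf-idf features with a sentiment score.
# Requires: nltk.download('vader_lexicon')
import numpy as np
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import GradientBoostingClassifier

class SentimentFeature(BaseEstimator, TransformerMixin):
    """Adds one column with the VADER compound polarity of each review."""
    def __init__(self):
        self.analyzer = SentimentIntensityAnalyzer()
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return np.array([[self.analyzer.polarity_scores(t)["compound"]] for t in X])

pipeline = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3), stop_words="english")),
        ("sentiment", SentimentFeature()),
    ])),
    ("clf", GradientBoostingClassifier()),
])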

Table 2: Quantitative Evaluation Results

Review Category    Precision   Recall   F1 Score
Device             90%         85%      88%
Android Version    87%         77%      82%
Hardware           78%         70%      73%
App Usability      77%         74%      75%
UI                 88%         62%      73%
Performance        78%         70%      74%
Battery            76%         94%      84%
Memory             77%         56%      65%
Licensing          90%         88%      89%
Price              86%         92%      89%
Security           83%         73%      77%
Privacy            90%         88%      89%
Complaint          87%         63%      73%

Table 3: Qualitative Evaluation Results

App                 Difficulty with Excel                       Difficulty with AUREA
AcDisplay           Difficult, difficult, slightly difficult    Slightly difficult, slightly difficult, not difficult
Pixel Dungeon       Difficult, difficult, very difficult        Not difficult, not difficult, slightly difficult
Terminal Emulator   Very difficult, difficult, not difficult    Not difficult, not difficult, not difficult

4.2 Qualitative Evaluation

For the qualitative study, we recruited 3 participants with industry experience: a software engineer with more than 5 years of experience, a PhD student with 3 years of experience as a requirements engineer and a master student with 3 years of experience in mobile applications development.

The participants analysed the reviews of 3 apps that had a larger and diverse set of reviews in our dataset (over 500 reviews for each app). We asked them to answer specific questions for each: (i) AcDisplay: "What is the general opinion of the users regarding the security of the application?", "What is the battery consumption of the app that users report?"; (ii) Pixel Dungeon: "What is the users' opinion regarding the price of the app?"; (iii) Terminal Emulator for Android: "Please find a bug of the app that is reported by at least 3 users and describe it.". The evaluators first answered the questions by analysing the review information in the Excel file and then with AUREA. They were required to record the necessary time and to rate the difficulty of the task with both tools using a Likert scale from very difficult to not difficult; the results are included in Table 3. We can easily notice that the evaluators' perception of the difficulty of each task is much lower with AUREA. The evaluators needed 60% more time on average to solve the tasks with Excel than with our tool. As a last step, the evaluators were asked to choose one of the 3 apps and perform a general exploratory analysis of its reviews using AUREA, thus providing feedback and suggestions for improvement and rating its helpfulness.

Feedback from survey participants. In general, the feedback received was quite positive; the evaluators mentioned that AUREA "is a great tool for getting a higher level view of what people are saying about the app.", but also that it would be useful to add a keyword search functionality: "Would be nice to have more "drill-down" functionality like searching for keywords and then showing comments containing similar keywords to the one searched for.". Overall, they rated the tool compared to Excel for analysing reviews as being: Extremely helpful, Very helpful and Slightly helpful.

5 CONCLUSIONS

We present a novel tool, AUREA, that is able to classify and filter mobile app reviews based on fine-grained, mobile-specific categories. Through our work we want to help developers analyse the reviews of their apps in less time, better comprehend what issues users are reporting and plan their change tasks accordingly. The evaluation showed that our tool obtains high precision and recall in classifying reviews. Our study participants gave positive feedback on AUREA and found it helpful in analysing reviews. In the future we would like to extend our dataset to contain reviews from other application stores and to include a keyword search functionality, as suggested by one of the evaluators.

ACKNOWLEDGMENTS

We acknowledge the Swiss National Science Foundation's support for the project SURF-MobileAppsData (SNF Project No. 200021-166275).

REFERENCES

[1] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[2] N. Chen, J. Lin, S. C. H. Hoi, X. Xiao, and B. Zhang. AR-Miner: Mining informative reviews for developers from mobile app marketplace. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 767–778, New York, NY, USA, 2014. ACM.
[3] A. Ciurumelea, A. Schaufelbühl, S. Panichella, and H. C. Gall. Analyzing reviews and code of mobile apps for better release planning. In 2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 91–102, Feb 2017.
[4] A. Di Sorbo, S. Panichella, C. V. Alexandru, J. Shimagaki, C. A. Visaggio, G. Canfora, and H. C. Gall. What would users change in my app? Summarizing app reviews for recommending software changes. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, pages 499–510, New York, NY, USA, 2016. ACM.
[5] E. Guzman, M. El-Haliby, and B. Bruegge. Ensemble methods for app review classification: An approach for software evolution (N). In 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 771–776, Nov 2015.
[6] E. Guzman and W. Maalej. How do users like this feature? A fine grained sentiment analysis of app reviews. In 2014 IEEE 22nd International Requirements Engineering Conference (RE), pages 153–162, Aug 2014.
[7] C. Iacob and R. Harrison. Retrieving and analyzing mobile apps feature requests from online reviews. In Proceedings of the 10th Working Conference on Mining Software Repositories, MSR '13, pages 41–44, Piscataway, NJ, USA, 2013. IEEE Press.
[8] D. Jurafsky and J. H. Martin. Speech and Language Processing. Prentice Hall, 2008.
[9] W. Martin, F. Sarro, Y. Jia, Y. Zhang, and M. Harman. A survey of app store analysis for software engineering. IEEE Transactions on Software Engineering, PP(99):1–1, 2017.
[10] D. Pagano and W. Maalej. User feedback in the AppStore: An empirical study. In 2013 21st IEEE International Requirements Engineering Conference (RE), pages 125–134, July 2013.
[11] M. Porter. Snowball: A language for stemming algorithms. http://snowball.tartarus.org/texts/introduction.html, 2001. [Online; accessed 14-June-2017].
[12] A. Ronacher. Flask. http://flask.pocoo.org/. [Online; accessed 14-June-2017].
[13] scikit-learn developers. Gradient Boosting Classifier. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html. [Online; accessed 14-June-2017].
[14] Statista. Cumulative number of apps downloaded from the Google Play. Technical report, 2016. https://www.statista.com/statistics/281106/number-of-android-app-downloads-from-google-play/.
[15] Statista. Number of apps available in leading app stores. Technical report, 2017. https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/.
[16] L. Villarroel, G. Bavota, B. Russo, R. Oliveto, and M. Di Penta. Release planning of mobile apps based on user reviews. In Proceedings of the 38th International Conference on Software Engineering, ICSE '16, pages 14–24, New York, NY, USA, 2016. ACM.
