
2018 ACM/IEEE 40th International Conference on Software Engineering: Companion Proceedings

Poster: A Recommender System for Developer Onboarding


Chao Liu, Dan Yang, Xiaohong Zhang, Haibo Hu
School of Software Engineering
Chongqing University, China
{liu.chao, dyang, xhongz, hbhu}@cqu.edu.cn

Jed Barson, Baishakhi Ray
Department of Computer Science
University of Virginia, USA
{jb3bt, rayb}@virginia.edu

ABSTRACT

Successfully onboarding open source projects in GitHub is difficult for developers, because it is time-consuming to search for an expected project with a few query words among numerous repositories, and developers suffer from various social and technical barriers in joined projects. Frequently failed onboarding postpones developers' development schedules and the evolutionary progress of open source projects. To mitigate developers' costly onboarding efforts, we propose a ranking model, NNLRank (Neural Network for List-wise Ranking), to recommend projects to which developers are likely to contribute many commits. Based on 9 measured project features, NNLRank learns a ranking function (represented by a neural network, optimized by a list-wise ranking loss function) to score a list of candidate projects, and the top-n scored candidates are recommended to a target developer. We evaluate NNLRank on 2,044 successful onboarding decisions from GitHub developers, comparing it with a related model, LP (Link Prediction), and 3 other typical ranking models. Results show that NNLRank provides developers with effective recommendations, substantially outperforming the baselines.

CCS CONCEPTS

• Software and its engineering → Open source model;

KEYWORDS

Developer Onboarding, Recommender System, Learning to Rank

ACM Reference Format:
Chao Liu, Dan Yang, Xiaohong Zhang, Haibo Hu, Jed Barson, and Baishakhi Ray. 2018. Poster: A Recommender System for Developer Onboarding. In Proceedings of 40th International Conference on Software Engineering Companion (ICSE '18 Companion). ACM, New York, NY, USA, 2 pages. https://doi.org/10.1145/3183440.3194989

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
ICSE '18 Companion, May 27-June 3, 2018, Gothenburg, Sweden
© 2018 Copyright held by the owner/author(s).
ACM ISBN 978-1-4503-5663-3/18/05.
https://doi.org/10.1145/3183440.3194989

1 INTRODUCTION

Open source software ecosystems like GitHub attract numerous developers to pursue their interests and goals [1] in software development. However, developers often face difficulties in successfully onboarding open source projects, namely contributing many commits to joined projects, due to various technical and social barriers [1]. Frequently failed onboarding is harmful not only to developers but also to the open source projects. It increases the effort new developers spend on searching for a suitable project and learning it [6]. It also postpones the development schedule of a project's existing developers and hurts the evolutionary progress of open source projects [7]. As a result, the delayed development takes extra time and effort to make up. In this work, we propose a novel recommender system to help developers onboard new GitHub projects.

To capture developers' onboarding pattern, we develop 9 project features that may affect the onboarding process. We consider an onboarding successful when a developer makes at least 6 commits (the median number of commits per project in our studied data) to the project. Next, we propose a learning-to-rank model called NNLRank (Neural Network for List-wise Ranking) to recommend projects to which developers are likely to contribute many commits. NNLRank learns a ranking function (represented by a neural network [4] that is optimized by a list-wise ranking loss function [8]) to score a list of candidate projects, and the top-n scored candidates are recommended to a developer. NNLRank is verified by investigating 2,044 successful onboarding decisions from GitHub developers. We compare NNLRank with the LP model [6] and 3 other typical learning-to-rank models: SVMRank [5], BPNet [4], and SVM (Support Vector Machine). Evaluating the models with the commonly used performance metric MRR (Mean Reciprocal Rank) under 5-fold cross-validation, we show that NNLRank achieves the best performance (mean MRR = 0.466), outperforming the best baseline, SVMRank, by 29.81%.

2 DATA COLLECTION

We briefly describe the sampled datasets and measured features:

Datasets. To simulate the practical usage of NNLRank, we collect developers' onboarding data from GHTorrent [2], a mirror of GitHub data in structured form. We downloaded the database dump dated 11/01/2016, sampled 6,343 active projects [1] from the latest 5 years, and sampled 17,219 active developers [1]. We consider joining a project with at least 6 commits (the median of the sampled projects) as a successful onboarding. Finally, we collected 2,044 successful onboarding decisions, relating to 1,070 active projects and 1,672 active developers. Each decision involves 257-2794 candidate projects, a total of about 2.9 million different instances.

Dependent Variable. The outcome of the model is a ranked list of candidate projects recommended to a developer at onboarding time [1]. The proposed model aims to give a higher score to the actually onboarded project so that it appears higher in the ranking.
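The recommendation step itself is a plain sort by predicted score. A minimal sketch, where the scoring function and feature values are stand-ins rather than the trained NNLRank network:

```python
import numpy as np

def recommend(score_fn, candidates, top_n=10):
    """Score every candidate project's feature vector and return the
    candidate indices ranked by score, highest first, cut to top_n."""
    scores = np.array([score_fn(x) for x in candidates])
    return np.argsort(-scores)[:top_n]

# Hypothetical usage: a dummy linear scorer over the 9 project features.
rng = np.random.default_rng(0)
w = rng.normal(size=9)
candidates = rng.random((257, 9))   # one onboarding decision's candidate list
top5 = recommend(lambda x: float(w @ x), candidates, top_n=5)
```

Any model that maps a 9-feature vector to a score plugs into this interface; the paper's contribution is in learning a score function under which the onboarded project rises to the top.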


Independent Variables. To capture developers' onboarding pattern, we represent each candidate project by 9 features: 1) the number of the developer's joined projects collaborated with the owner of the target project, where each collaborated project is weighted by the reciprocal of its member count [1]; 2) the number of commits in the target project [3]; 3) the number of the developer's joined projects whose programming language matches the target project [1]; 4) the duration from the developer's onboarding time to the time of creation, 5) first commit, 6) latest commit, 7) first membership, and 8) latest membership in the target project [3]; and 9) the number of members who are company colleagues of the developer.
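As noted later under Validation Data, these 9 features are min-max normalized within each candidate list before training. A minimal sketch of that preprocessing, assuming one list's features are stacked into a NumPy matrix:

```python
import numpy as np

def minmax_per_list(X):
    """Min-max normalize each of the 9 feature columns within a single
    candidate list, mapping every feature into [0, 1] to suppress the
    effect of outliers. Constant columns are left at 0."""
    lo = X.min(axis=0)
    span = X.max(axis=0) - lo
    span[span == 0] = 1.0            # avoid division by zero
    return (X - lo) / span

# Hypothetical candidate list: 257 projects x 9 raw feature values.
X = np.random.default_rng(1).random((257, 9)) * 100.0
Xn = minmax_per_list(X)
```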
3 METHODOLOGY

NNLRank is a neural-network-based model optimized by a list-wise ranking loss function. Simply sorting the predicted scores of the candidate projects yields a recommendation for developer onboarding.
Neural Network. We build a 4-layer neural network [4] (i.e., a ranking function f) with 2 hidden layers, where each hidden layer contains 5 nodes. The input and output layers have 9 nodes (the number of features) and 1 node, respectively. Given a list of candidate projects X = {x_1, x_2, ..., x_n} ∈ R^{9×n} as input, the network outputs the scores of the candidate projects f(X) = {f(x_1), f(x_2), ..., f(x_n)} in the last layer. In the m-th layer, the input is multiplied by a weight matrix W^(m), added to a bias vector b^(m), and passed through an activation function (we use the arctan function).
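The 9-5-5-1 forward pass can be sketched as follows. The linear output node and the initialization scale are our assumptions; the paper specifies only the arctan activation, random weights, and zero biases:

```python
import numpy as np

def init_params(sizes=(9, 5, 5, 1), seed=0):
    """Random weights, zero biases for a 4-layer network (9-5-5-1)."""
    rng = np.random.default_rng(seed)
    W = [rng.normal(scale=0.1, size=(m, n)) for n, m in zip(sizes, sizes[1:])]
    b = [np.zeros((m, 1)) for m in sizes[1:]]
    return W, b

def forward(X, W, b):
    """Score a candidate list X of shape (9, n): arctan activation on the
    hidden layers, linear output node. Returns an array of n scores."""
    a = X
    for Wm, bm in zip(W[:-1], b[:-1]):
        a = np.arctan(Wm @ a + bm)
    return (W[-1] @ a + b[-1]).ravel()

W, b = init_params()
X = np.random.default_rng(1).random((9, 257))   # 257 candidates
scores = forward(X, W, b)
```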
Network Optimization. To optimize the built network, the weights are randomly initialized and the biases are set to zero at first. We then iteratively input lists of candidate projects to the network, and update the m-th layer of the network based on a loss function proposed by [8]. We simplify the loss function as:

arg min_{W,b} Σ_{i=2}^{n} (i − 1) f(x_i) + (1/2) Σ_{m=1}^{M} ( ‖W^(m)‖_F^2 + ‖b^(m)‖_2^2 )

The optimization stops when the loss function converges to a small value (we set 0.01), or when all the input lists have been used [8].
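A sketch of evaluating the simplified loss above. We assume each list is ordered with the truly onboarded project first, which is what lets the (i − 1)-weighted terms push the scores of the remaining candidates down; `reg=0.5` reflects the 1/2 factor on the L2 penalty:

```python
import numpy as np

def simplified_loss(scores, W, b, reg=0.5):
    """List-wise loss for one candidate list. `scores` holds
    f(x_1)..f(x_n) with the onboarded project assumed at index 0; each
    later candidate i contributes (i-1)*f(x_i) (1-based i), plus an L2
    penalty over all weight matrices and bias vectors."""
    n = len(scores)
    rank_term = sum(i * scores[i] for i in range(1, n))   # (i-1)f(x_i)
    l2 = sum(np.sum(Wm ** 2) for Wm in W) + sum(np.sum(bm ** 2) for bm in b)
    return rank_term + reg * l2

# Hypothetical tiny example: 3 candidates, one 1x1 "layer".
loss = simplified_loss(np.array([1.0, 0.5, 0.2]),
                       [np.ones((1, 1))], [np.zeros((1, 1))])
```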
4 EARLY RESULT

We conduct an early experiment on our proposed model:

Evaluation Criteria. We assess our model with the widely used performance metric MRR (Mean Reciprocal Rank). We also use an auxiliary measure, MS (Match Score). MS calculates the percentage of onboarded projects ranked among the top-k (k = 1, 5, 10) recommendations over all lists in the testing data. A larger MS indicates that more developers can find their expected projects by reviewing at most the top-k recommended projects.

Validation Data. To suppress the effect of outliers, we normalize the 9 features within each list of candidate projects (2,044 lists in total) by min-max normalization. We then perform a 5-fold cross-validation.

Results and Discussion. Table 1 presents the mean MRR and MS of the 6 models. Results show that NNLRank achieves the best mean MRR (0.466), outperforming the best baseline, SVMRank, by 29.81%. The main advantage of NNLRank over the other baselines results from its better MS at top-1 and top-5: 35.28% and 57% of developers can find their expected projects to onboard by reviewing at most 1 and 5 recommended projects, respectively.

Table 1: Prediction accuracy (MRR and MS at top-1/5/10) comparison of NNLRank and 5 baseline models.

Model     MRR    MS1     MS5     MS10
NNLRank   0.466  35.28%  57.00%  62.52%
SVMRank   0.359  19.46%  53.94%  65.57%
BPNet     0.022  00.69%  02.59%  03.52%
SVM       0.007  00.15%  00.44%  00.73%
Random    0.018  00.24%  01.86%  02.45%
LP        0.004  00.00%  00.00%  00.00%
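The two metrics can be sketched as follows, assuming we know the 1-based rank of the truly onboarded project in each test list:

```python
def mrr_and_match_score(ranks, ks=(1, 5, 10)):
    """`ranks` holds the 1-based rank of the onboarded project in each
    test list. Returns the mean reciprocal rank and, for each k, the
    fraction of lists whose onboarded project appears in the top-k."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    ms = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, ms

# Hypothetical ranks for 4 test lists.
mrr, ms = mrr_and_match_score([1, 3, 12, 2])
```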

5 FUTURE WORK

In our future work, we plan to explore more questions:

How can the proposed model work for developers with different prior experience? Some features are extracted from developers' prior experience, such as social ties; we thus plan to investigate the recommendation quality for developers with different levels of project experience in GitHub.

What other measurable features can help capture developers' onboarding pattern? Other features, like project domain, may also affect developers' onboarding, but they are not easy to measure. We will investigate more measurable features.

What are the most important features that affect developers' onboarding? Among the measured features, some may affect the model more than others; we will explore the relative importance of these features by sensitivity analysis.

ACKNOWLEDGMENTS

The work described in this paper was partially supported by the Fundamental Research Funds for the Central Universities of China (Grant No. 106112017CDJXSYY002), the National Natural Science Foundation of China (Grant No. 61772093), the Chongqing Research Program of Basic Science & Frontier Technology (Grant No. cstc2017jcyjB0305), and the Scientific and Technological Research Program of Chongqing Municipal Education Commission (Grant No. KJ1501504).

REFERENCES

[1] Casey Casalnuovo, Bogdan Vasilescu, Premkumar Devanbu, and Vladimir Filkov. 2015. Developer onboarding in GitHub: the role of prior social links and language experience. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 817–828.
[2] Georgios Gousios and Diomidis Spinellis. 2012. GHTorrent: GitHub's data from a firehose. In Proceedings of the 9th IEEE Working Conference on Mining Software Repositories. IEEE Press, 12–21.
[3] Jungpil Hahn, Jae Yun Moon, and Chen Zhang. 2008. Emergence of new project teams from open source software developer networks: Impact of prior collaboration ties. Information Systems Research 19, 3 (2008), 369–391.
[4] Robert Hecht-Nielsen et al. 1988. Theory of the backpropagation neural network. Neural Networks 1, Supplement-1 (1988), 445–448.
[5] Thorsten Joachims. 2006. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 217–226.
[6] Tadej Matek and Svit Timej Zebec. 2016. GitHub open source project recommendation system. arXiv preprint arXiv:1602.02594 (2016).
[7] Igor Steinmacher, Marco Aurelio Graciotto Silva, Marco Aurelio Gerosa, and David F. Redmiles. 2015. A systematic literature review on the barriers faced by newcomers to open source software projects. Information and Software Technology 59 (2015), 67–85.
[8] Fen Xia, Tie-Yan Liu, Jue Wang, Wensheng Zhang, and Hang Li. 2008. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th International Conference on Machine Learning. ACM, 1192–1199.

