Volume 10, Issue 3, March 2019, pp. 603–613, Article ID: IJMET_10_03_062
Available online at http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=3
ISSN Print: 0976-6340 and ISSN Online: 0976-6359
Sani M. Isa
Computer Science Department, Bina Nusantara University,
Jakarta, Indonesia 11480
Abstract
Classification is a method that groups data into related categories according to their similarities. High-dimensional data sometimes makes the classification process suboptimal, because such data contains large amounts of irrelevant or meaningless values. In this paper, we classify profit agents from PT. XYZ and identify the features that have the greatest impact on agent profit. Feature selection is one of the methods that can optimize a dataset for the classification process. We apply a graph-based feature selection method, which identifies the most important nodes in relation to their neighboring nodes. Eigenvector centrality estimates the importance of a feature relative to its neighbors; ranking features by eigenvector centrality yields the candidate features used by the classification method and identifies the best features for classifying the agent data. Support Vector Machines (SVM) are used to test whether feature selection with eigenvector centrality further improves classification accuracy.
Keywords: Classification, Support Vector Machines, Feature Selection, Eigenvector Centrality, Graph-based.
Cite this Article: Zidni Nurrobi Agam and Sani M. Isa, Profit Agent Classification Using Feature Selection Eigenvector Centrality, International Journal of Mechanical Engineering and Technology (IJMET), 10(3), 2019, pp. 603–613.
http://www.iaeme.com/ijmet/issues.asp?JType=IJMET&VType=10&IType=3
1. INTRODUCTION
In this era data is a very important commodity, used in almost all existing technologies, and researchers examine ever more data in order to find hidden patterns that can be used as information. But as the amount of data increases, so does the amount of irrelevant and redundant data, which reduces the quality of a dataset.
Feature selection is a method that selects a subset of variables from the input which can efficiently describe the input data while reducing the effects of noise or irrelevant variables, and still provide good prediction results [1]. Feature selection usually operates by ranking, subset selection, or both [2][3] to extract the most relevant or most important variables from a dataset. If n is the total number of features, the goal of feature selection is to select an optimal subset of l features, with l < n. Processing data with feature selection improves overall prediction, because the dataset is optimized by removing uninformative features.
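The subset-selection step described above (keep only the l top-ranked of n features) can be sketched as follows. This is an illustration, not code from the paper; the scores are hypothetical stand-ins for any ranking criterion.

```python
import numpy as np

def select_top_features(X, scores, l):
    """Keep the l highest-scoring of the n original features.

    X      : (samples, n) data matrix
    scores : length-n relevance score per feature (any ranking criterion)
    l      : number of features to keep, l < n
    """
    ranked = np.argsort(scores)[::-1]   # feature indices, best first
    keep = ranked[:l]
    return X[:, keep], keep

# toy data: 5 samples, 4 features; hypothetical relevance scores
X = np.arange(20, dtype=float).reshape(5, 4)
scores = np.array([0.1, 0.9, 0.4, 0.7])
X_sel, idx = select_top_features(X, scores, 2)
print(list(idx))   # [1, 3] — the two highest-scoring features
```

Any feature selection method discussed below (chi-square, ECFS) only changes how `scores` is computed; the reduction step itself stays the same.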
We applied graph-based feature selection to optimize classification; this method ranks features by eigenvector centrality (ECFS). In graph theory, eigenvector centrality measures which nodes have a major impact on the other nodes in the network. All nodes in the network are assigned relative scores based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than equal connections to low-scoring nodes [4]. A high eigenvector score means that a node is connected to many nodes that themselves have high scores. The relationships between features (nodes) are therefore measured by weighting the connections between nodes. The problem of feature subset selection refers to the task of identifying and selecting a useful subset of attributes to represent patterns, out of a larger set of often mutually redundant, possibly irrelevant attributes with different associated measurement costs and/or risks [5]. We use ECFS to find the most influential features for predicting agent profit.
Several studies have investigated eigenvector centrality. Nicholas J. Bryan and Ge Wang [6] studied how songs with many features form an influence network, used eigenvector centrality to rank influence between music genres, and showed that the resulting network can describe patterns of musical influence in sample-based music suitable for musicological analysis. In 2016, Giorgio Roffo and Simone Melzi [7] studied feature ranking via eigenvector centrality: they identify the most important attributes in an arbitrary set of cues by mapping the problem onto a graph in which the features are the nodes, and then assessing the importance of the nodes through an indicator of centrality. To build the graph, Roffo and Melzi weight the distances between nodes using the Fisher criterion.
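A much-simplified sketch of this graph-based ranking follows. Assumptions to note: two classes, and edge weights taken simply as products of per-feature Fisher scores, which makes the adjacency matrix rank-one (so the centrality ranking here reduces to the Fisher ranking); the actual ECFS of Roffo and Melzi combines several indicators when building the graph, so its eigenvector genuinely re-ranks the features.

```python
import numpy as np

def fisher_score(X, y):
    """Two-class Fisher criterion per feature: (mu0 - mu1)^2 / (var0 + var1)."""
    X0, X1 = X[y == 0], X[y == 1]
    return (X0.mean(0) - X1.mean(0)) ** 2 / (X0.var(0) + X1.var(0) + 1e-12)

def ecfs_rank(X, y):
    """Rank features by eigenvector centrality of a weighted feature graph."""
    f = fisher_score(X, y)
    A = np.outer(f, f)            # simplified edge weights (rank-one graph)
    v = np.ones(len(f))
    for _ in range(100):          # power iteration -> principal eigenvector
        v = A @ v
        v /= np.linalg.norm(v)
    return np.argsort(v)[::-1]    # most central feature first

# toy data: feature 0 separates the classes well, feature 1 weakly, feature 2 not at all
X = np.array([[1.0, 0.0, 4.8], [2.0, 0.5, 5.0], [3.0, 1.0, 5.2],
              [7.0, 1.5, 5.0], [8.0, 2.0, 5.2], [9.0, 2.5, 5.4]])
y = np.array([0, 0, 0, 1, 1, 1])
print(list(ecfs_rank(X, y)))   # [0, 1, 2]
```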
The goal of this paper is to apply chi-square and ECFS feature selection and compare the two methods on different datasets. Both feature selection methods are tested on the HCC and agent profit datasets; the resulting models are validated with K-fold cross-validation and then evaluated with a confusion matrix to measure misclassification. Based on the ECFS results, we determine which attributes of the agent profit dataset have a major impact on the other attributes.
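The confusion-matrix evaluation mentioned above reduces to counting correct predictions on the diagonal. A minimal sketch; the counts in the matrix below are illustrative only, not results from the paper.

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Accuracy = correctly classified (diagonal) / all predictions."""
    cm = np.asarray(cm, dtype=float)
    return np.trace(cm) / cm.sum()

# hypothetical 2x2 confusion matrix (rows: actual class, cols: predicted class)
cm = [[430, 11],
      [84, 475]]
print(accuracy_from_confusion(cm))   # 0.905
```

Misclassification rate is simply `1 - accuracy`; off-diagonal cells show which class the classifier confuses.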
2. RESEARCH METHOD
Feature selection and ranking based on a graph network [8] is discussed in this section. To build the graph, we first have to define how to design and construct it.
For a node v, the centrality score is defined as

x_v = (1/λ) Σ_{t ∈ M(v)} x_t = (1/λ) Σ_{t ∈ G} a_{v,t} x_t    (3)

where M(v) is the set of neighbors of v, a_{v,t} is the (v, t) entry of the adjacency matrix A, and λ is a constant. With a small rearrangement this can be rewritten in vector notation as the eigenvector equation:

A x = λ x    (4)
However, as we count longer and longer paths, this measure of accessibility converges to an index known as the eigenvector centrality measure (EC). An example node set and its adjacency matrix [9] are shown in Figure 2 and Table 1.
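A small numeric illustration of eq. (4): power iteration on a toy adjacency matrix (a made-up 4-node graph, not the paper's Figure 2 example) converges to the principal eigenvector, whose entries are the eigenvector centralities.

```python
import numpy as np

# toy undirected graph: edges 0-1, 0-2, 1-2, 2-3
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

# power iteration: repeatedly applying A converges to its principal
# eigenvector, i.e. the eigenvector centrality of eq. (4)
x = np.ones(4)
for _ in range(200):
    x = A @ x
    x /= np.linalg.norm(x)

print(int(np.argmax(x)))   # 2 — the node whose neighbors are themselves well connected
```

Node 2 wins not just because it has the highest degree, but because its score is fed by the mutually connected nodes 0 and 1, which is exactly what distinguishes eigenvector centrality from plain degree counting.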
3.1. Dataset
The PT. XYZ dataset was chosen to analyze eigenvector centrality feature selection under the following scenario. First, we describe the dataset used for feature selection, including how many features are used for prediction. The agent data are categorical data collected from transaction data, with the following attributes used for analysis:
1. Type Application
Description: the device the agent uses for transactions.
Categorical: 1. EDC 2. Android
2. Age
Description: age of the PT. XYZ agent.
Categorical:
1. <= 23 years
2. > 23 years and <= 29 years
3. > 29 years
3. City
Description: the city where the agent lives.
Categorical: every city is converted to a number.
4. Balance Agent
Description: the agent's wallet balance at PT. XYZ.
Categorical: 1. <= Rp.500.000 2. > Rp.500.000 and <= Rp.2.000.000 3. > Rp.2.000.000
5. Transaction
Description: the agent's transactions per day.
Categorical: number of transactions per agent per day.
6. Joined
Description: how long the agent has been with PT. XYZ.
Categorical: number of days since the agent joined.
7. Gender
Description: agent gender.
Categorical: 1 = L (male), 0 = P (female)
8. PulsaPrabayar, PLNPrabayar, TVBerbayar, PDAM, PLNPasca, Telpon, Speedy, BPJS, Cashin, Asuransi, Gopay, & TiketKereta
Description: each transaction type in the PT. XYZ application.
Categorical: daily count of each transaction type per agent.
Number of data profit: 558
Number of profit: 441
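A minimal sketch of how raw agent fields could be binned into the categorical codes listed above. The cut-offs come from the attribute list; the function and field names are illustrative, not from the paper's pipeline.

```python
def encode_agent(age, balance, gender):
    """Bin raw agent fields into the categorical codes of the attribute list.

    age     : agent age in years
    balance : wallet balance in rupiah
    gender  : "L" (male) or "P" (female)
    """
    age_cat = 1 if age <= 23 else (2 if age <= 29 else 3)
    if balance <= 500_000:
        bal_cat = 1
    elif balance <= 2_000_000:
        bal_cat = 2
    else:
        bal_cat = 3
    gender_cat = 1 if gender == "L" else 0   # 1 = L, 0 = P
    return age_cat, bal_cat, gender_cat

print(encode_agent(25, 750_000, "L"))   # (2, 2, 1)
```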
Table 2 Datasets used in the comparison of feature selection. For each dataset, the following details are reported: the number of samples, the number of variables, the number of classes, and the classification accuracy [14].
No Dataset Samples Variables Classes
1 Agent Profit 1000 19 2
2 HCC 165 49 2
Table 4 Performance analysis of the Agent Profit dataset with Chi-Square and ECFS

Iteration  ECFS     Chi-square
19         84.5838  84.7838
18         84.9838  84.8828
17         85.0848  85.7838
16         85.3859  86.3889
15         85.0848  86.7889
14         85.3859  86.8889
13         87.7889  87.5899
12         90.3838  87.5909
11         90.4848  87.8899
10         83.7869  88.6899
9          67.6869  89.1899
Figure 6 Performance Analysis dataset Agent Profit with Chi-Square and ECFS
The maximum accuracy produced by ECFS on the agent profit dataset is 90.48%, better than chi-square, which produces a maximum accuracy of 89.18%; ECFS reaches its maximum at iteration 11 (features), while chi-square peaks at iteration 9. The overall performance of the two methods, on both the HCC and agent profit datasets, indicates that ECFS is more robust than chi-square. Chi-square performs better when there are more than 20 attributes, reaching its maximum accuracy at iterations 32 and 33; but once fewer than 20 attributes remain on the HCC dataset, ECFS reaches its maximum accuracy at iteration 19. When the number of attributes becomes small, such as 9 attributes, the accuracy of chi-square decreases significantly on both the HCC and agent profit datasets. ECFS is more robust as attributes are removed: even when its accuracy decreases, it does not decrease significantly. ECFS shows its largest increase in performance when attributes are removed [15] on the agent profit dataset, since many attributes there have strong relationships with the others. ECFS ranks every attribute according to how well it discriminates between the two classes, and removing the attributes that have no major impact on the other attributes improves the results.
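For reference, the chi-square statistic used to rank a categorical attribute against the profit class can be sketched as below. The contingency table is hypothetical, not data from the paper.

```python
import numpy as np

def chi_square_score(table):
    """Chi-square statistic for one categorical feature vs. the class label.

    table[i, j] = number of agents with feature value i and class j
    """
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()      # counts expected under independence
    return ((table - expected) ** 2 / expected).sum()

# hypothetical table: Type Application (EDC/Android) vs profit class
table = [[30, 70],
         [60, 40]]
print(round(chi_square_score(table), 2))   # 18.18
```

A larger statistic means the attribute's distribution differs more across the two classes, so chi-square feature selection keeps the attributes with the highest scores.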
REFERENCES
[1] I. Guyon and A. Elisseeff, "An Introduction to Variable and Feature Selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, 2003.
[2] H. Liu and H. Motoda, Eds., Computational Methods of Feature Selection. Chapman & Hall/CRC, 2007.
[3] P. S. Bradley and O. L. Mangasarian, "Feature Selection via Concave Minimization and Support Vector Machines," in Proc. 15th Int. Conf. Machine Learning (ICML), 1998.
[4] L. Solá, M. Romance, R. Criado, J. Flores, A. García del Amo, and S. Boccaletti, "Eigenvector centrality of nodes in multiplex networks," Chaos, vol. 23, no. 3, pp. 1–11, 2013.
[5] J. Yang and V. Honavar, "Feature Subset Selection Using a Genetic Algorithm," IEEE Intell. Syst., vol. 13, pp. 44–49, 1998.
[6] N. Bryan and G. Wang, "Musical Influence Network Analysis and Rank of Sample-Based Music," in Proc. 12th Int. Soc. Music Inf. Retr. Conf. (ISMIR), pp. 329–334, 2011.
[7] G. Roffo and S. Melzi, “Ranking to learn: Feature ranking and selection via eigenvector
centrality,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect.
Notes Bioinformatics), vol. 10312 LNCS, pp. 19–35, 2017.
[8] F. R. Pitts, "A graph theoretic approach to historical geography," pp. 15–20.
[9] F. Harary, S. Review, and N. Jul, “No Title,” vol. 4, no. 3, pp. 202–210, 2008.
[10] B. Schölkopf and K.-R. Müller, "Fisher Discriminant Analysis with Kernels," pp. 41–48, 1999.
[11] Y. Bazi and F. Melgani, "Toward an Optimal SVM Classification System for Hyperspectral Remote Sensing Images," 2006.
[12] S. Thaseen and C. A. Kumar, "Intrusion Detection Model Using fusion of Chi-square feature selection and multi class," J. King Saud Univ. - Comput. Inf. Sci., 2015.
[13] I. Sumaiya Thaseen and C. Aswani Kumar, “Intrusion detection model using fusion of
chi-square feature selection and multi class SVM,” J. King Saud Univ. - Comput. Inf. Sci.,
vol. 29, no. 4, pp. 462–472, 2017.
[14] D. Ballabio, F. Grisoni, and R. Todeschini, “Multivariate comparison of classification
performance measures,” Chemom. Intell. Lab. Syst., vol. 174, no. March, pp. 33–44,
2018.
[15] E. M. Hand and R. Chellappa, “Attributes for Improved Attributes: A Multi-Task
Network for Attribute Classification,” pp. 4068–4074, 2016.