Sunteți pe pagina 1din 4

A Data Mining Algorithm In Distance Learning

Dai Shangping
1
, Zhang Ping
2
1
Department of Computer Science, Hua Zhong Normal University, Wu Han, China
spdai@mail.ccnu.edu.cn
2
Foreign Languages College, Zhongnan University of Economics and Law, Wuhan,China
zpingcd@mail.ccnu.edu.cn
Abstract
Currently there is an increasing interest in data
mining and educational systems, making educational
data mining as a new growing research community.
One of the challenges in developing data mining
systems is to integrate and coordinate existing data
mining applications in a seamless manner so that cost-
effective systems can be developed without the need of
costly proprietary products. The popularity of distance
education has grown rapidly over the last decade in
higher education, yet many fundamental teaching
learning issues are still in debate. This paper presents
an approach for classifying students in order to predict
their final grade based on features extracted from
logged data in an education web-based system. In this
paper we take advantage of the genetic algorithm (GA)
designed specifically for discovering association rules.
We propose a novel spatial mining algorithm, called
ARMNGA(Association Rules Mining in Novel Genetic
Algorithm), Compared to the algorithm in
Reference[2] , the ARMNGA algorithm avoids
generating impossible candidates, and therefore is
more efficient in terms of the execution time.
Keywords: data mining; genetic algorithm, association
rules, distance learning
1. Introduction
Some of the commonly used data mining algorithms
fall under the following categories: decision trees and
rules, nonlinear regression and classification methods,
example-based methods, probabilistic graphical
dependency models, and relational learning models.
Over the years genetic algorithms have been
successfully applied in learning tasks in different
domains, like chemical process control, financial
classification, manufacturing scheduling, among others.
Many leading educational institutions are working to
establish an online teaching and learning presence.
Several systems with different capabilities and
approaches have been developed to deliver online
education in an academic setting. [1]
Because of the advances in information technology,
vast number of educational resources has accumulated
on the Internet. Therefore, how to mine interesting
resources from databases has attracted more and more
attention in recent years [2]. Many data mining methods
have been proposed such as association rule mining,
sequential pattern mining, calling path pattern mining,
text mining, temporal data mining, spatial data mining,
etc.
Association rule mining, as originally proposed in
with its apriori algorithm (ARMA), has developed into
an active research area. Many additional algorithms
have been proposed for association rule mining. Also,
the concept of association rule has been extended in
many different ways, such as generalized association
rules, association rules with item constraints, sequence
rules etc. Apart from the earlier analysis of market
basket data, these algorithms have been widely used in
many other practical applications such as customer
profiling, analysis of products and so on[3].
Genetic Algorithm (GA) is one self-adaptive
optimization searching algorithm. GA obtains the best
solution, or the most satisfactory solution through
generations of chromosomes constant evolution
includes the reproduce, crossover and mutation etc.
operation, until a certain termination condition is
coincident [4].
Association rules mining Algorithm Based on a
novel Genetic Algorithm (ARMNGA) is an optimal
algorithm combing GA with ARMA.
The contributions of this paper are:
We take advantage of the genetic algorithm (GA)
designed specifically for discovering association rules.
We propose a novel spatial mining algorithm, called
ARMNGA, Compared to the algorithm in Ref [2], and
the ARMNGA algorithm avoids generating impossible
candidates, and therefore is more efficient in terms of
the execution time.
2. Association Rules
Definition 1 confidence
Set up } , , {
2 1 m
i i i I = for items of collection, for item
in
) 1 ( m j i
j
s s
, for lasting ) 1 ( m j s s

_____________________________________
978 -1-4244-1651-6/08/$25.00 2008 IEEE
item, it is a trade collection,
here T is the trade.
) , {
1 N
T T D =
) 1 ( N i I T
i
s s _
Rule q r is probability that concentrates
on including in the trade.
q r
The association rule here is an implication of the
form q r where X is the conjunction of conditions,
and Y is the type of classification. The rule q r has
to satisfy specified minimum support and minimum
confidence measure [5].
The support of Rule q r is the measure of
frequency both r and q in D
D r r S = ) (
(1)
The confidence measure of Rule q r is for the
premise that includes r in the bargain descend, in the
meantime includes q
) ( ) ( ) ( r S rq S q r C = (2)
Definition 2 Weighting support
Designated ones project to collect I = {i
1
, i2, i
m
},
each project i
j
is composed with the value w
j
of right
(0wj 1, 1j m). If the rule is r q, the weighting
support is
_
e
=
r i
j w
r S w
k
r S ) (
1
) ( (3)
And, the K is the size of the Set rq of the project.
When the right value Wj is the same as i
j
, we
calculating the weighting including rule to have the
same support.
3. Genetic Algorithm (GA)
Genetic Algorithm (GA) is a self-adaptive
optimization searching algorithm. GA obtains the best
solution, or the most satisfactory solution through
generations of chromosomes constant evolution
includes reproduction, crossover and mutation etc.
Here is the general description of this problem:
) ( ) ( ) ( ) ( r A c r C b r S a r F + + = (3)
a,b,c is constants ,a 0,b0, c0,S(r) is the
support ,and C(r) is the confidence ,A(r) is the mantle.
4. A Data Mining Algorithm
4.1 Encoding
This paper employs natural numbers to encode the
variable A
ij
. That is, the number of the lines of every
range in the matrix A in which the element 1 exists is
regarded as a gene. The genes are independent of each
other. They are marked by A
1
, A
2
,,A
j
,,A
n,
in which
j
A ] , 1 [ ], , 1 [ n j m e e and A
n
may be a repeatedly
equal natural number.
When the distributive method at random is employed
to produce the initial population comprised of certain
individuals, the population must be in a certain scale in
order to achieve the optimal solution on the whole. The
best way is the generated M individuals randomly that
the length is nthen the chromosome bunch encoded
by the natural number is calculated as the initial
population.
4.2 The Fitness
Formula (3) is properly transformed into:
min min
) ( ) (
) (
C
r C
W
S
r S
W r F
c s
+ = (4)
Here, W
c
+W
s
=1, W
c
0, Ws 0, S
min,
is minimum
support, and C
min
is minimum confidence.
4.3 Reproduction Operator
Reproduction is the transmission of personal
information from the father generation to the son
generation. Each individual in each generation
determines the probability that it can reproduce the next
generation according to how big or small the fitness
value is. Through reproducing, the number of excellent
individuals in the population increases constantly, and
the whole process of evolution head for the optimal
direction. We are adopting roulette selection strategy;
each individual reproduction probability is proportion
to fitness value.
1) Compute the reproduction probability of all the
individuals
M
i = 1
( i )
P ( i ) =
( i )
f
f
_
(5)
2) Generate a number r randomly, r=random [0, 1]
3) If P(0)+ P(1)++ P(i-1)<r< P(0)+ P(1)++
P(i),the individual i is selected into the next generation.
4.4 Crossover Operator
Crossover is the substitution between two
individuals of the father generation that is to generating
new individual .The crossover probability Pc directly
influences the convergence of the algorithm. The larger
Pc is the most likely is the genetic mode of the optimal
individual to be destroyed .However, the over-small of

Pc can slow down the research process [6] .Here is the


definition of the crossover operator:
Computing crossover probability Pc

>

=
f f p
f f
f f
f f p p
p
Pc
c
c c
c
%
1
max
2 1
1
) )( (
(6)
Here, =0.9, =0.7, is the maximum
fitness value of the population,
1 c
p
2 c
p
max
f
f is the average
fitness value of the population.
4.5 Mutation Operator
Here is the definition of the mutations operator,
computing the mutation probability
m
P

>

=
f f p
f f
f f
f f p p
p
Pm
m
m m
m
%
1
max
2 1
1
) )( (
(7)
Here, =0.1, =0.001, is the maximum
fitness value of the population,
1 m
p
2 m
p
max
f
f is the average
fitness value of the population.
4.6 Termination Condition
When the matching condition is not coincident, the
process will naturally stop.
5. Experiments
To check the research capability of the operator and
its operational efficiency, such a simulation result is
given compared with the GA in [2] ,The platform of the
simulation experiment is a Dell power Edge2600 server
(double Intel Xeon 1.8GHz CPU,1G memory , RedHat
Linux 9.0).
We first compare the performance of our proposed
method with the algorithm in Ref [2].
Fig.1 shows the runtime vs. the minimum support for
both algorithms, where the minimum support varies
from 0.25% to 2% for the synthetic dataset. Our
proposed algorithm runs 25 times faster than the
Apriori algorithm, because a large number of
candidates can be pruned by using the ARMNGA
pruning strategy. Therefore, as the minimum support
threshold decreases, the runtime of the Apriori
algorithm increases dramatically since it generates too
many candidates when the minimum support is small.
Fig 1. Runtime vs. minimum support.
Fig.2 shows the runtime vs. the average size of
transactions for both algorithms, where the average size
of transactions varies from 4 to 14 for the synthetic
dataset. As the average size of transactions increases,
the runtime of the algorithm in Ref [2] increases
dramatically, however, compared to the algorithm in
Ref [2], the runtime of our proposed algorithm
increases slowly. The reason for increase in the
runtimes of both algorithms is that the number of
frequent resources increases as the lengths of
transactions are increased. Therefore, finding
candidates present in a trans- action takes a longer time.
Our proposed algorithm is more scalable than the
algorithm in Ref [2] because a large number of
candidates can be pruned by using the ARMNGA
pruning strategy.
Fig2. Runtime vs. average size of transactions.
From fig 1 and 2, we can educe that ARMNGA has a
higher convergence speed and more reasonable
selective scheme which guarantees the non-reduction
performance of the optimal solution. Therefore, it is
better than GA and ARMA through the theoretic
analysis and the experimental results.
6. CONCLUSIONS
Educational data mining is a young research area and
it is necessary more specialized and oriented work

educational domain in order to obtain a similar


application success level to other areas, such as medical
data mining, mining e-commerce data, etc. In this paper,
We propose an association rules mining based on a
novel genetic algorithm, designed specifically for
discovering association rules. We compare the results
of the ARMNGA with the results of Ref [2], and, it is
better than GA and ARM through the theoretic analysis
and the experimental results. We believe that some
future mining tools more easy to use by educators.
Acknowledgement
This work is partially supported by The research of
foreign Chinese long-distance visualized teaching
model The National Society Science Foundation of
P.R. China, Under Grant No. 07BYY033 .We thanks
Professor Zheng shijue of the Department of Computer
Science of Central China Normal University
Technology for his supports of various aspects and
thanks also go to all members of our research group.
References
[1] Behrouz Minaei-Bidgoli , William F. Punch III, Using
Genetic Algorithms for Data Mining Optimization in an
Educational Web-based System, Lecture Notes in
computer Science,pp2724 -2735,2003
[2] G. Chen, Q. Wei, Fuzzy association rules and the
extended mining algorithms, Information Sciences 147
(2002) 201228.
[3] K. Koperski, J. Han, Discovery of spatial association rules
in geographic information databases, in: Proc. of
International Symposium on Advance in Spatial
Databases, SSD, LNCS, vol. 951, Springer Verlag, 1995,
pp. 4766.
[4] Shijue Zheng,Wanneng Shu,Li Gao,Task Scheduling
Using Parallel Genetic Simulated Annealing
Algorithm ,2006 IEEE International Conference on
Service Operations and Logistics, and Informatics
Proceedings June 21-23, 2006, Shanghai, pp46-50
[5] P.Y. Hsu, Y.L. Chen, C.C. Ling, Algorithms for mining
association rules in bag databases, Information Sciences
166 (2004) 3147.
[6] Gao Li,Li Dan,Dai Shangping, A mining Algorithm of
constraint based association rules, journal of Henan
University Vol.33(2003) pp.55-58
[7] Wu zhaohui , Association rule mining based on simulated
annealing genetic algorithm, Comput Applications
Vol.25(2005) pp.1009-1011

S-ar putea să vă placă și