Documente Academic
Documente Profesional
Documente Cultură
286
and relationships which represent appropriate 2.1 Association rules
utilization of services. To date, most data analysis Association rules over basket data type
has focused on the area of detection and transactions is a problem defined in lAIS931. The
prevention of fraud and inappropriate practice. mining of association rules relates to finding intra-
Inappropriate practice deals with issues such as transaction patterns and can be defined as follows:
requesting or providing services which are given a database of transactions, where each
unreasonable, unnecessary or excessive (e.g., transaction represents a set of items (e.g. services
indiscriminate ordering of cholesterol test). rendered), generate all associations such that the
Typically, this type of analysis relies on human presence of some specific item(s) x in a transaction
experts, but these experts are both expensive and implies the presence of other item(s) y. The
scarce. HIC has supported the process with a
association rule x => y will hold given a suppovt
neural network, though this approach can only be
and a confidence greater than a user-specified
used on a subset of the database.
minimum support (s,i, > and minimum confidence
In - addition, HIC has developed and (cmin), wherein
implemented a pattern-recognition technique for
classifying the degree of appropriate practice of support is the number (or fraction) of the
service providers. By screening with this transactions that contain a given item set.
technology the claiming patterns of all confidence measures the frequency that items
practitioners, a score or class is assigned to each
practitioner; those practitioners who have a high in a multi-item set are found together.
likelihood of an inappropriate practice pattern are The database could be a data file, a relational
identified. However, this technique allows only table, or the result of a relational expression
four different classes. lAS941. The strengths of this technique are the
With the motivation of evaluating and capability to handle large databases in an efficient
determining the feasibility of using newer manner, while its execution time scales almost
technology that could be applied to large linearly with the size of the data. These
databases, we addressed the effectiveness of two characteristics have proven true during the course
data mining techniques for analyzing and of this study.
retrieving unknown behavior patterns from
gigabytes of data collected in this health insurance 2.2 Neural segmentation
information system. Specifically, an episode Neural segmentation is a pattern detection
(claims) database for pathology services and a algorithm, in which the base technology is a self-
general practitioners database were used; organizing feature map [Koh89]. Self-organizing
associufion rules were applied to the episode feature maps, otherwise known as topological
database; neural segmentation was applied to the feature maps, loosely preserve the topology of the
overlaying of both databases. multi-dimensional space in the two-dimensional
The results of this study have demonstrated map. That is, similar prototypes are near each
the potential value of data mining in health other on the map.
insurance information systems,, by detecting A self-organizing feature map consists of a
patterns in the ordering of pathology services, and two-dimensional array of units; each unit is
by classifying the general practitioners into groups connected to n input nodes, and contains a n-
reflecting the nature and style of their practices. dimensional vector Wii wherein (i,j) identifies the
The approach used led to results which could not unit at location Ci,jJ of the array. Each neuron
have been obtained using conventional techniques. computes the Euclidean distance between the
input vector x and the stored weight vector Wii.
2 Data mining techniqk applied The neuron with the minimum distance is declared
the “winner” and the input vector is assigned to
There are several primary data mining techniques this neuron. In addition, each of the weight vectors
including classification, regression, clustering, is modified as follows:
summarization and dependency modeling, among new weight vector =Wij+LR*NF*(X-Wij)
others lFPS961. In this paper, we report on the
applicability of association rules, which is a more wherein
sophisticated method of summarization, and neural l IX is the learning rate, a linearly decreasing
segmentation as a particular implementation of scalar which changes after each epoch;
clustering. These techniques, whose basic
characteristics are summarized next, were selected l NF is the neighborhood function, a Gaussian
based on the description of the problem and on distribution function in map-space, centered
our interest on experimenting with two distinct on the winning neuron;
methodologies. . epoch is the presentation of all input vectors to
287
we lab-id benefit cost-serl costser2 .... service 1 service 2 .... provid. sex
provi sex totserv O/female pat total patients nursing vis home visits totbenefit WstlseN ..
the system once, that is, one iteration.over the identified by physician number;
data. l An aggregate database from several other
Key to the success of this analysis is the databases, including claims by service
representation of behavibral data. Since self- containing more than 18,000,OOOrecords in one
organizing feature maps are a form of clustering, year. This database is used in other business
care must be taken into properly balancing the decision applications.
inputs. That is, each vector element intrinsically
represents one equally weighted dimension of the Episode database
subject’s behavior. Balancing the inputs implies
that each equally-important aspect of the problem The episode database for pathology services is
has the same number of vectbr elements. This is stored in relational form. This database contains
accomplished through an exte&ive study of the 6,800,OOOrecords and 120 attributes (3.5 GB). The
problem, with attention ,, to the type of records, which were sorted by provider
characteristics that are important to the solution, identification, have the layout depicted in Figure
and to a careful review of the variance of each of 1.
the inputs.
The output of the algorithm is a two- General practitioners database
dimensional array of segments, each one described The general practitioners database contains 105
by both its behavioral profile and potentially an attributes and consist of approximately 17,000
overlay of further descriptive data. records, which correspond to active general
practitioners during one year period. By doing the
3 Description of experiments and results aggregation, measures are obtained which assess
the cost, usage, and quality of the services (e,g., big
In this section, we describe the experiments excisions, cancer removal, pathology benefit per
conducted using association rules and neural patient); additional descriptive elements include
segmentation. We also describe the database data such as age or sex of the physician. This
structure used and the corresponding results. database was sorted by provider identification,
The experiments were performed using non- and the record layout is shown in Figure 2.
synthetic de-identified data obtained from the
health insurance provider, which included the 3.2 Experiment 1: Association rules
pathology episode and the general practitioners The association rules algorithm was applied to the
databases. These experiments were performed on episode database for pathology services; each
an IBM System 9121 model 480 with 3380 DASD patient visit is mapped to a record in the database.
storage subsystem, using a single CPU (actual Therefore, a database tuple corresponds to a
capacity of 19 MIPS). The data was stored in unique identifier and one or more medical tests (or
relational form within DB2/MVS. services) taken at a given instance in time, with a
maximum of 20 tests per episode. There are three
3.1 Data description major steps involved when applying association
The data used in these experiments is organized as rules:
two databases: l Data preprocess
l A transactional database corresponding to l Application of association rules
historical events for a set of individuals, l Postprocess or analysis of results
288
Batch Pipe Works
289
test 1 test 2 test 3 ... testj .. . test m
VI 1 1 0 0 0
v2 0 0.03 0 0 0
__, v3 1 0.02 1 0 1
.vi.. 0.01
... .0.. ‘1’ ...
0.01 0
n = 10,409 vn-l
. .. .o. . .o,
.. o 0.3
.. . .;.
representation of the frequency of each test The first experiment was set to identify those
ordered by the GP during a one year period. The pathology services which appeared in various
number of tests performed by each physician was combinations with S,in > 1% (or 68,000
scaled from 0 to 1 with respect to the total number transactions). A minimum confidence C,in = 50%
of tests. The final format of the input vectors is was set to summarize the data into 24 production
depicted in Figure 5. rules. The major results are:
The system was trained iteratively by 1. The most frequently medical test (test-A)
presenting the vectors Vi (with 1 <= i <= 10,409) for occurs in 28.15% of all episodes. An additional
25 epochs or iterations. The parameters used were 20.15% occurred in association with a collec-
l learning rate (LR) with an initial value of 0.8 tion fee. This test was present in 48.30% of all
and a final value of 0.05; the learning rate was episodes for pathology services.
decreased linearly after each epoch; 2. The most commonly ordered combination of
l neighborhood function (NF) with the width of tests occurred together in 10.9% of episodes.
the Gaussian distribution function varied from The rules obtained show that there was a
the square root of the number of nodes to 0.1. 62.2% chance that if test-A was ordered then
test-B would also be ordered.
3.4 Results The second experiment was set to identify
those services which appeared in various
Experiment 1 combinations with s,in > 0.5% (or 32,000
Association rules were obtained using C,in = 50% transactions). Cmin = 50% was set to reduce the data
(minimum confidence), and three different values into 64 production rules. A greater amount of
for shin = l%, 0.5%, and 0.25% (minimum support) knowledge through the behavior patterns was
specified by the user. Since there is a total of 6.8 gained by setting s,in > 0.5% rather than s,in > 1%.
million of episodes for pathology services (or Lowering %nin to 0.25% (or 16,000
transactions in the database), s,in = 1% represents transactions), with retaining cmin = 50% produced
68,000 transactions. The number of association 135 rules and information which was of much
rules obtained in each of the experiments is shown greater value, as demonstrated by the following
in Table 1. Simple as well as complex rules were two examples:
determined by the algorithm; an example of such 1. If test-B was claimed with test-C, there was a
rules is as follows: 92.8% chance that a test-A would also be
claimed. This finding may be indicative of dif-
‘If Iron Studies and Thyroid Function Tests ferent ordering habits for similar clinical situa-
occur together then there is an 87% chance #Full
Blood Examination occurring as well. This rule tions.
wasfound in 0.55% qf transactions.’ 2. If test-D was ordered with test-E, it was
revealed with a 60% chance that test-A would
also be ordered, in 0.25% of the cases. This rule
Table 1: Test using episode database raises questions on whether this testing is of a
Smin= 1% Smin= 0.5% Smin~0.25% screening nature, or even if this is the most
Cmin = 50% 24 rules 64 rules 135 rules proper medical monitoring regime.
290
The various examples above show commonly Execution time
ordered tests in association in a relatively large
number of episodes. The appropriateness of Table 2 shows the relative execution tim@&blained
ordering combinations such as the ones described when applying association rules to the episode
here needs to be considered by the relevant database.
medical learned bodies. It would appear from the Table 2: E.xecution tinie
analysis carried out during these experiments that
a large proportion of pathology tests ordered are of Smin CPU time Elapsed time
a non- specific screening nature rather than a
planned approach.
Association rules have given a new
I 1.00 %
0.25 %
0.50
22.23 (min)
22.53 (min)
22.56
35.6 (min)
35.6 (min)
35.8
291
Table 3: Test usage heterogeneous sector further into four subnodes
Use Average number of tests ordered reveals quite distinct subgroups.
by each member of the segment Applying neural segmentation to the general
practitioners database successfully achieved the
Relative use (x) “Use” divided by the average classification of general practitioners into groups
use for all GP’s of various sizes reflecting the nature and style of
Market Fraction of a specific test ordered their practices. Meaningful categories have
by a segment. Indicates the num- identified GE’s with different medical
ber of times the average test use characteristics (e.g., obstetricians, after hours
a given segment displays. It clinics, environmental medicine, geriatric
shows the segment’s relative practices, etc.)
interest
Penetration (pen) Fraction of the population in this 4 Concluding remarks
segment ordering at least one We have addressed the effectiveness of two data
specific test mining techniques when applied to large
databases in a health insurance information
deviation, minimum, and maximum for each system. Through this study, we have shown that
data mining algorithms can be used successfully
numeric quantity in the GP database, by segment
ID. This included 105 different attributes such as on large, real customer data (not synthetic data),
the number of house calls, or nursing home visits, with reasonable execution time. Moreover, we have
also shown that these algorithms can result in
the age of the.physician, and the fraction of female
quant$able ben$its for the interested organization,
patients seen.
helping identify. specific actions to be taken. In
A summary description of the 16 nodes found particular, among the results obtained we can
is shown in Appendix A; the information derived mention the following:
for each segment by overlaying the attributes
contained in the general practitioners database l The study provided the health insurance
was added. Therefore, each segment represents the organization with new and unexpected
practice type assigned. The average neural relationships among the combination of
network score3 was also included in the segment services provided by physicians in the area of
description. Comments on each segment reflect the pathology. An overpayment of over $550,000
domain expert evaluation and interpretation of the in one year was uncovered; detection of this
data mining output produced. problem, which had existed for over five years,
From the grid depicted in Appendix A, it is had eluded conventional monitoring
clear that one large segment was created; this techniques.
segment has a very low average of test performed l The study provided a classification of general
per GP so it was qualified as a “low usage” practitioners into groups of various sizes
segment. By overlaying the GP database, it was
reflecting the nature and style of their
identified that the segment contained most of the
non-vocationally registered GPs. practices. Categorization of general practice
into its various subgroups had proven elusive
Neural segmentation enables the creation of as
many nodes and subnodes as desired. Further in the past. The new subgroups will allow
segmentation was applied to the data constituting greatly improved monitoring of practice
Segment 2 and Segment 4 to enable adequate patterns. Given that the business goal of health
analysis because those nodes were not densely insurance corporations is modifying the
clustered; four subnodes were produced for each behavior of practitioners towards best practice,
one of these groups. In one of the nodes with an neural segmentation provides the means of
average neural network score of 1.8, the creation of understanding and monitoring the behavior of
four subnodes produced groups with scores the various subgroups in an application such
between 1.4 and 2 which were a variation of the as general practice.
same practice style. In another larger and
apparently less homogenous group, the creation of From this experience, we have empirically
four subnodes produced some distinctly different verified that large databases are indeed an
groups of practice style. Breaking down this fairly important asset to organizations, which can use
such databases to extract meaningful information
3 The class or score was obtained from the classification of practice
with reasonable effort.
pattern using the neural network developed at HIC and trained by
their domain experts.
292
References
[AIS93] R. Agrawal, T. Imielinski, and A. Swami.
Mining association rules between sets of
items in large databases. In Proceedings@the
ACM SIGMOD Co@rence on Management qf
Data, Washington, D.C., May 1993.
[AS941 R. Agrawal and R. Srikant. Fast algorithms
for mining association rules. In Proceedingsof
the 20th international Covtferenceon Very Large
Data Bases,pages 487-499, Santiago, Chile,
September 1994.
[BAT95]IBM BatchPipes/MVS, BatchPipesWorks
Reference Manual, September 1995.
[FPS96] U. Fayyad, G. Piatetsky-Shapiro, I! Smyth.
From data mining to knowledge discovery:
an overview. In Advances in Knowledge
Discovery and Data Mining, Chapter 1. AAAI
Press, 1996.
[Koh89]T. Kohonen. “Self-Organization and
Associative Memory,” Springer-Verlag, New
York, 1989.
[MPM96] C. Matheus, G. Piatetsky-Shapiro, D.
McNeill. Selecting and reporting what is
interesting. In Advances in Knowledge
Discovery and Data Mining. AAAI Press, 1996.
[RM95] M. Rothman, E. Murphy: Data mining: A
practical approach for database marketing.
In IBM White Paper,Dallas, Texas, April 1995.
293
Appendix A - Segments
294