
Mathematical Geology, Vol. 22, No. 6, 1990

Three Nonparametric Techniques for the Optimum Discretization of Quantitative Geological Variables 1
Guocheng Pan 2 and DeVerle P. Harris 3

Three nonparametric techniques for the optimum discretization of quantitative geological features
are proposed and demonstrated. The three methods are: isolated weight, entropy information, and
rank correlation. Optimum discretization plays important roles in solutions to the following geoscience problems: (1) signal/noise separation and delineation of meaningful anomalies and other
geofields related to mineral targets; (2) selection of those geological variables that explain variations in mineral resources; (3) determination of the best subintervals of values for a variable with
respect to mineralization; (4) enhancement of certain complex and concealed information of a
geofeature about its correlation with magnitude of mineralization; and (5) unification of diverse
geodata so that these data can be integrated and analyzed.
KEY WORDS: optimum discretization, isolated weight, entropy information, rank correlation, Nb-Ta deposit, Au-Ag deposit, exploration target.

INTRODUCTION
In mineral exploration, geologists often consider many quantitative geological
measurements as discrete geological phenomena. For instance, the size of ore
deposits may be expressed in terms of qualitative categories, such as large,
medium, and small deposits. This transformation takes place in the mind of a
geologist and may be considered a type of subjective discretization. Although
such a transformation is rough and imprecise, it is useful in establishing the
idea of discretization. This paper describes nonparametric statistical techniques
for optimum discretization of a quantitative measurement.
Some major geoscience problems requiring discretization include anomaly
and background separation, delineation of mineral targets, selection of geological variables, and enhancement of geoinformation. In addition, discretization
is also useful for unifying diverse geodata for implementation of some statistical
techniques for mineral resource estimation [e.g., characteristic analysis (Botbol, 1971), pattern recognition (Agterberg, 1989; Bonham-Carter et al., 1988)].

1 Manuscript received 31 July 1989; accepted 9 January 1990.
2 Mineral Resources Estimation and Mineral Economics, Department of Mining and Geological
Engineering, University of Arizona, Tucson, Arizona 85721.
3 Director of Mineral Economics, Department of Mining and Geological Engineering, University
of Arizona, Tucson, Arizona 85721.
0882-8121/90/0800-0699$06.00/1 © 1990 International Association for Mathematical Geology

The common task of discretization problems is to find one or more critical
threshold values. Several often-used methods include: (1) the frequency methods [e.g., upper (lower) anomalous limit defined as $\mu + 3\sigma$ ($\mu - 3\sigma$), where
$\mu$ is the expectation of the feature and $\sigma$ the standard deviation]; (2) second
vertical derivative; and (3) trend analysis for residuals.
Although there does not exist a universal rule for discretization, a useful
basic principle is to discretize the quantitative measure so as to enhance as much
as possible that information of the measure that describes some other a priori
selected feature. This external feature is identified from the objective of the
analysis (e.g., estimation of resources or exploration targets of a particular
metal). Conventional approaches to discretization are not optimum in terms of
this principle, as they define the critical cutoff values only by some feature of
the measurement itself, without considering external information related to the
major objective. For instance, the upper (lower) anomalous limit method determines the threshold values simply by considering the statistical characteristics of density distribution of the feature of interest, and the second vertical
derivative method (Botbol et al., 1979) defines the critical cutoff values by the
inflection points of a curve characterizing the data measured in a profile or map.
Both of these ignore other important related information.
Three new nonparametric statistical techniques for the optimum discretization, which avoid major limitations of the "self-definition" problem associated with the traditional methods, are proposed in this paper: isolated weight
method, entropy information method, and rank correlation method. Each of
these methods is based upon associations between the variable to be discretized
and a selected feature characterizing the major objective.
ISOLATED WEIGHT METHOD
This approach is especially designed for finding a single critical threshold
value for a quantitative measurement. In other words, the goal of this method
is to transform a quantitative geodescriptor into a binary variable. Such analysis
is useful for anomalous background separation. This simple transformation also
is necessary for some statistical approaches which require binary input variables
[e.g., the characteristic analysis (Botbol, 1971; McCammon et al., 1983; Pan
and Wang, 1987; Pan and Harris, 1989a), the related information (Pan, 1985),
quantification theories (Dong et al., 1979), etc.].
Definition of Isolated Weight

Let $x$ denote the quantitative geological observations made on a sample of
size $n$: $x = (x_1, x_2, \ldots, x_n)$. In order to transform the measurement optimally,
select another feature, referred to as the objective variable and denoted by
$y = (y_1, y_2, \ldots, y_n)$. Variable $y$ reflects some feature of the major objective of the
investigation. For instance, if the objective is the delineation of exploration
targets for a particular type of mineral deposit, then $y$ may be the size of ore
deposit, the density of mineralization, the number of mineral deposits, etc.
Denote $y^* = (y_{i_1}, y_{i_2}, \ldots, y_{i_n})$ as the ranked array of $y$ (referred to as the
standard array), where $y_{i_1} \le y_{i_2} \le \cdots \le y_{i_n}$. Corresponding to this ranking,
the observations on $x$ are also ranked in the same way such that
$x^* = (x_{i_1}, x_{i_2}, \ldots, x_{i_n})$. To simplify notation without loss of generality, let $i_j = j$,
$j = 1, 2, \ldots, n$. Thus, $x^* = (x_1, x_2, \ldots, x_n)$. For any given threshold value, $x_0$,
variable $x$ can be transformed into a binary array denoted by $e(x_0) = (e_1, e_2, \ldots, e_n)$,
where $e_i$ is 1 (if $x_i \ge x_0$) or 0 (if $x_i < x_0$). Denote $E = \{k \mid k = 1, 2, \ldots, n\}$
as a finite integer set and introduce the following set definitions:

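$$E^{(1)} = \{\, i \mid e_i = 1,\ i \in E \,\}, \qquad E^{(0)} = \{\, i \mid e_i = 0,\ i \in E \,\}$$

Let the numbers of elements contained in sets $E^{(1)}$ and $E^{(0)}$ be $n_1$ and $n_0$,
respectively. Then,

$$n_1 + n_0 = n, \qquad E^{(1)} \cup E^{(0)} = E, \qquad E^{(1)} \cap E^{(0)} = \emptyset$$

where $\emptyset$ is an empty set.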


Definition 1. The number of elements contained in the section between
any two elements $e_i$ and $e_k$ in array $e$ ($i \in E^{(1)}$ and $k \in E^{(0)}$) is called the distance
between the two elements. The sum of the distances between all pairs of such
elements is called the array distance [$d(e)$] of set $E$:

$$d(e) = \sum_{p=1}^{n-2} \sum_{q=p+1}^{n} w(e_p, e_q)\,(q - p - 1) \tag{1}$$

where $w(e_p, e_q) = \max(e_p, e_q) - \min(e_p, e_q)$.


Definition 2. Given that numbers $n_1$ and $n_0$ are fixed, the array $e$ is called
the best descriptive array (DA) of $y^*$ if $e$ has the maximum array distance among
all of the possible configurations; the array $e$ is called the invalid DA of $y^*$ if
it has the minimum array distance among all possible configurations; and all
others are called the usual DAs of $y^*$.
Denote $l_1$ and $l_0$ as the maximum run lengths of adjacent identical elements
1 and 0 in the array $e$, respectively. Then, the following
results are proven (see Appendix A).
Theorem 1. Given that numbers $n_1$ and $n_0$ are fixed: (i) array $e$ is the best
DA of $y^*$ if and only if $l_1 = n_1$ and $l_0 = n_0$; and (ii) array $e$ is the invalid DA
of $y^*$ if and only if $l_1 = 1$ and $l_0 = 1$, when $|n_1 - n_0| \le 1$.
Definition 3. The following regularized array distance is called the isolated weight (IW) between the two responses (1 and 0) in array $e$:

$$d_e(0, 1) = \frac{1}{\Delta d(n_1, n_0)} \left[ d(e) - d_{\min}(e) \right] \tag{2}$$

where $\Delta d(n_1, n_0) = d_{\max}(e) - d_{\min}(e)$; $d_{\max}(e)$ and $d_{\min}(e)$ are the maximum
and minimum array distances in $e$ corresponding to the best and invalid DAs of
$y^*$, respectively.
On the basis of definitions 1, 2, and 3 and Theorem 1, the following results
are obvious.

Theorem 2.
(1) $0 \le d_e(0, 1) \le 1$
(2) $d_e(0, 1) = 0$ if and only if $e$ is the invalid DA of $y^*$
(3) $d_e(0, 1) = 1$ if and only if $e$ is the best DA of $y^*$
(4) $0 < d_e(0, 1) < 1$ if and only if $e$ is a usual DA of $y^*$

Given that numbers $n_1$ and $n_0$ are fixed, it is readily shown that

$$d_{\max}(e) = \max[d(e)] = \sum_{i=1}^{n_1} \sum_{t=n_1-i}^{n-i-1} t = \frac{n_0 n_1 (n - 2)}{2}$$

Computation of quantity $d_{\min}(e)$, however, is not so easy, as it is not only a
function of numbers $n_1$ and $n_0$, but it also is related to the configurations of the
array. Fortunately, the following is found to be an excellent approximation to
its real value:

$$d_{\min}(e) = \varphi(d_{\max}(e)) \approx \gamma\, d_{\max}(e)$$

where $\gamma = 0.4\,[1 + \mu / (2 + (n_0 - n_1)^2)]$, $\mu = 1$ if $n > 7$ and $\mu = 0$ if $n \le 7$.
Hence, the isolated weight defined in (2) may be approximated by

$$d_e(0, 1) \approx \frac{1}{(1 - \gamma)\, d_{\max}(e)} \left[ \sum_{i=1}^{n-2} \sum_{j=i+1}^{n} w(e_i, e_j)(j - i - 1) - \gamma\, d_{\max}(e) \right] = \frac{2}{n_0 n_1 (n - 2)} \sum_{i=1}^{n-2} \sum_{j=i+1}^{n} \frac{w(e_i, e_j)(j - i - 1)}{1 - \gamma} - \frac{\gamma}{1 - \gamma} \tag{3}$$


Procedure of Discretization

The basic rule of the isolated weight (IW) method for discretization is to
find a critical threshold value such that the IW defined in Eq. (3) is maximized.
Obviously, maximization of the IW is equivalent to maximizing the conformity
of the discretized variable and the selected objective feature. The term "conformity" may be illustrated by the following example. Suppose that the size of
sample ($n$) is 10 and number $n_1 = 5$ ($n_0 = 10 - 5 = 5$). Suppose that the
observations $y$ are ranked to $y^*$. If variable $x$ is discretized into
$e = (1, 1, 1, 1, 1, 0, 0, 0, 0, 0)$ or $e = (0, 0, 0, 0, 0, 1, 1, 1, 1, 1)$, meaning that the
maximum run lengths equal the numbers of corresponding responses (i.e.,
$l_1 = n_1$ and $l_0 = n_0$), the discretized array of $x^*$ is the best DA of $y^*$, suggesting that the discretized measurement is most conformable to $y^*$. However,
another extreme case is that the discretized array takes the following configuration: $e = (1, 0, 1, 0, 1, 0, 1, 0, 1, 0)$, which is the invalid DA of $y^*$, meaning
that the discretized array is least conformable to the objective variable.
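
The example above can be checked numerically. The following Python sketch evaluates the array distance of Eq. (1) directly and compares it with the closed form for the maximum distance; the function names are illustrative and do not come from the paper.

```python
def array_distance(e):
    """Array distance d(e) of Eq. (1) for a binary array e.

    Every pair of unlike elements (one 1 and one 0) contributes the
    number of elements lying strictly between them, q - p - 1.
    """
    n = len(e)
    return sum(abs(e[p] - e[q]) * (q - p - 1)
               for p in range(n) for q in range(p + 1, n))


def max_array_distance(n1, n0):
    """Closed form n0 * n1 * (n - 2) / 2 reached by the best DA."""
    return n0 * n1 * (n1 + n0 - 2) / 2


best = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
invalid = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
print(array_distance(best), max_array_distance(5, 5))
print(array_distance(invalid))
```

For these two arrays the distances are 100 and 60, so the alternating array keeps 60% of the maximum distance in this particular case.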
There exists more than one way to determine the critical threshold value
that optimizes the discretization of a quantitative variable with respect to a selected objective feature. The search procedure presented below, however, is
feasible and effective.
Without loss of generality, assume that $y$ is already standardized [i.e.,
$y = (y_1, y_2, \ldots, y_n)$, where $y_1 \le y_2 \le \cdots \le y_n$]. If this is not the case,
variable $y$ would be ranked in this way. Then, $x$ is optimally discretized through
the following steps.
Rearrange the elements of $x$ such that $x^* = (x_{i_1}, x_{i_2}, \ldots, x_{i_n})$, where
$x_{i_1} \le x_{i_2} \le \cdots \le x_{i_n}$ and $1 \le i_j \le n$ for all $j = 1, 2, \ldots, n$. Clearly, the critical
threshold must be contained in the interval $[x_{i_1}, x_{i_n}]$. More specifically, the
critical threshold must be in one of the $n - 1$ subintervals $[x_{i_j}, x_{i_{j+1}}]$. The $n - 1$ midvalues of these subintervals are candidates for the critical threshold value.
Accordingly, consider the critical threshold to be one of the following values:

$$x_j^{(m)} = \frac{x_{i_j} + x_{i_{j+1}}}{2}, \qquad j = 1, 2, \ldots, n - 1$$

For any given possible cutoff value $x_j^{(m)}$ ($1 \le j \le n - 1$), $x^*$ is discretized
to a binary DA $e_j = (e_1^{(j)}, e_2^{(j)}, \ldots, e_n^{(j)})$, where $e_i^{(j)} = 1$ when $x_i \ge x_j^{(m)}$ or
$e_i^{(j)} = 0$ when $x_i < x_j^{(m)}$ for $i = 1, 2, \ldots, n$. Repeating the similar discretizations
for all other midvalues, we would obtain $n - 1$ binary DAs, $e_1, e_2, \ldots, e_{n-1}$.
For each $e_k$, its IW [$d_{e_k}(0, 1)$] is computed using Eq. (3). The same
computations repeated for all DAs produce the $n - 1$ IWs: $d_{e_1}(0, 1), d_{e_2}(0, 1), \ldots, d_{e_{n-1}}(0, 1)$.
In the last step, determine the critical threshold that maximizes the isolated
weight, i.e.,

$$d_{e_k^*}(0, 1) = \max_{1 \le j \le n-1} \{ d_{e_j}(0, 1) \}$$

where $e_k^*$ is one of the $e_j$'s. The optimum critical point is, therefore, $x^{(m)} = x_k^{(m)}$
and the optimally discretized binary array for variable $x$ is
$e^* = e_k = (e_1^{(k)}, e_2^{(k)}, \ldots, e_n^{(k)})$.
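
The search procedure can be sketched in Python as follows. This is a minimal illustration rather than the authors' program: the function names are hypothetical, the minimum distance is taken from the approximation used in Eq. (3) (so the computed weight may fall slightly outside the interval from 0 to 1), and ties between candidate thresholds are resolved by keeping the first maximum, which is an assumption the text does not address.

```python
def array_distance(e):
    """Array distance d(e) of Eq. (1)."""
    n = len(e)
    return sum(abs(e[p] - e[q]) * (q - p - 1)
               for p in range(n) for q in range(p + 1, n))


def isolated_weight(e):
    """Approximate isolated weight of Eq. (3); None if undefined."""
    n, n1 = len(e), sum(e)
    n0 = n - n1
    if n < 3 or n1 == 0 or n0 == 0:
        return None  # no unlike pairs, or d_max would be zero
    d_max = n0 * n1 * (n - 2) / 2
    mu = 1 if n > 7 else 0
    gamma = 0.4 * (1 + mu / (2 + (n0 - n1) ** 2))
    return (array_distance(e) - gamma * d_max) / ((1 - gamma) * d_max)


def best_threshold(x, y):
    """Return (threshold, isolated weight, binary array) maximizing the IW."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    x_by_y = [x[i] for i in order]          # x arranged in the rank order of y
    xs = sorted(x_by_y)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]  # n - 1 midvalues
    best = None
    for t in candidates:
        e = [1 if v >= t else 0 for v in x_by_y]
        iw = isolated_weight(e)
        if iw is not None and (best is None or iw > best[1]):
            best = (t, iw, e)
    return best


# Hypothetical data: x is low for the four smallest y and high for the rest.
x = [3, 1, 4, 2, 10, 12, 9, 11]
y = [1, 2, 3, 4, 5, 6, 7, 8]
threshold, weight, binary = best_threshold(x, y)
print(threshold, binary)
```

With this made-up data set the only candidate that yields a best descriptive array is the midvalue between 4 and 9, so the search returns 6.5.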

It is worthwhile to point out that when variables $x$ and $y$ are highly correlated, the optimum critical point tends to be close to the mid-value of $x$. In
the extreme case, as one of the referees of this paper suggested, if the correlation
between $x$ and $y$ is one or negative one, then the optimum critical value
determined by the above procedure will be the median point of $x$.
Case Study

In the estimation of Nb-Ta resources of pegmatitic deposits, the IW method
was applied to 8 quantitative measurements among 40 geological, geochemical,
and mineralogical variables (the other 32 are binary variables) collected
on 21 ore deposits hosted in granite pegmatites in southern China (Pan,
1985). Since the main interest of this study was the estimation of potential
resources of the pegmatitic Nb-Ta deposits, it is desirable that the information
of these measurements about variation of the Nb-Ta resources is enhanced as
much as possible to improve resource estimation. Furthermore, the statistical
approach employed in this estimation is characteristic analysis, which requires
binary or ternary input variables. Accordingly, the IW method was employed
in this study for both unification of the data set (i.e., transformation of all quantitative measurements into binary variables) and the enhancement of information about resource variation. Clearly, it is appropriate to choose the quantity
of total metal of Nb and Ta as the objective variable ($y$). Using the procedure
described above, 8 quantitative measurements were optimally discretized into
binary variables, and the results are presented in Table 1. From this table, two
major characteristics merit comment.
First, the larger the isolated weight, the more information of the variable
about the variation of the Nb and Ta metal is enhanced, and, therefore, the
more important this variable will be in the estimation of the resources. For
instance, variables $x_8$ and $x_1$ are most important, as they are associated with the
largest isolated weights.
Second, according to the relative magnitude of the IW, variables having
IW values significantly lower than the average might be excluded from further
consideration, as their presence may introduce noise or mask the most critical
information carried by other variables. For instance, measurement $x_3$ may be
considered for elimination.

Table 1. Discretized Results by the Isolated Weight Method

Variable                                   Threshold   Value(a)   d*(0, 1)
No. of pegmatitic veins (x1)               4.0         (>) 1      0.82
Maximum length (m) of veins (x2)           210.0       (>) 1      0.64
No. of structural zones (x3)               4.0         (>) 1      0.46
Average value (%) of Nb2O5 (x4)            0.019       (>) 1      0.71
Average value (%) of Ta2O5 (x5)            0.015       (>) 1      0.65
Ratio of Ta2O5/Nb2O5 (x6)                  0.8         (>) 1      0.62
Ratio of Sn/Y (x7)                         16.01       (>) 1      0.73
Ratio of Sn/La (x8)                        15.05       (>) 1      0.91

(a) The column contains the value assignment: 1 if x > threshold and 0 otherwise.


ENTROPY INFORMATION METHOD


The isolated weight method described above provides a means of optimum
discretization to a binary variable through finding only a single threshold value.
In many cases, however, multiple critical threshold values may be appropriate.
The entropy information method presented in this section provides a means of
discretizing a quantitative measurement into a qualitative one with any limited
number of discrete categories.

Definition of Entropy Information


Suppose that the measurement $x$ being discretized is observed on a sample
of size $n$: $x \in [x_0, x^0]$ ($x^0 > x_0$). A qualitative variable $y$ is selected as the
objective feature, which is observed on the same sample: $y \in \{y_1, y_2, \ldots, y_s\}$,
where $y_j$ ($j = 1, 2, \ldots, s$) are single numbers. Let $p(y_j)$ ($j = 1, 2, \ldots, s$)
denote the occurrence probability for value $y_j$. Then, the entropy, $H(y)$,
of $y$ and the conditional entropy, $H_x(y)$, of $y$ on $x$ are defined as follows:

$$H(y) = -\sum_{j=1}^{s} p(y_j) \ln p(y_j) \tag{4}$$

$$H_x(y) = -\sum_{j=1}^{s} p(y_j \mid x) \ln p(y_j \mid x) \tag{5}$$

In Eqs. (4) and (5), when $p(y_j) = 0$, we assume that $p(y_j) \ln p(y_j) = 0$. For
continuous variables, Eqs. (4) and (5) are given, respectively, by

$$H(y) = -\int f_y(y) \ln f_y(y)\, dy$$

and

$$H_x(y) = -\int f(y \mid x) \ln f(y \mid x)\, dy$$

where $f_y(y)$ is the marginal probability density of $y$ and $f(y \mid x)$ is the conditional density of $y$ on $x$.


On the basis of these definitions, the relative entropy information of $x$ on
$y$ is introduced as follows (Pan and Xia, 1988).
Definition 4. Define the relative entropy information on $y$ contained in $x$
as:

$$\rho(x \to y) = \frac{\Delta I(x \to y)}{H(y)} \tag{6}$$

where $\Delta I(x \to y) = H(y) - H_x(y)$, referred to as the absolute entropy information.
Equation (6) measures how much of the uncertainty about $y$ is relatively
reduced when $x$ is realized. Clearly, a large $\rho$ indicates a strong dependency
of $y$ on $x$, and vice versa. Because of uncertainty about variable $x$, the relative
entropy information $\rho(x \to y)$ is also uncertain. Therefore, its mean value is
computed:

$$\bar{\rho}(x \to y) = \int \rho(x \to y)\, g(x)\, dx \tag{7}$$

where $g(x)$ is the probability density of $x$.



Now consider a discretization of $x$ into $m$ ($\ge 2$) subintervals, each of which
is associated with a discrete probabilistic value $g(x_i)$, where $x_i$ is the median of
interval $i$ ($i = 1, 2, \ldots, m$) and $\sum_{i=1}^{m} g(x_i) = 1$. Thus, Eq. (7) can be written
in the discrete form:

$$\bar{\rho}(x \to y) = 1 + \frac{\sum_{i=1}^{m} \sum_{j=1}^{s} p(y_j \mid x_i)\, g(x_i) \ln p(y_j \mid x_i)}{H(y)} \tag{8}$$

Using the Bayesian formula, we have

$$p(y_j \mid x_i) = \frac{p(x_i \mid y_j)\, p(y_j)}{g(x_i)}$$

such that

$$\bar{\rho}(x \to y) = 1 - \frac{\sum_{i=1}^{m} \sum_{j=1}^{s} p(x_i \mid y_j)\, p(y_j) \ln \left( \dfrac{p(x_i \mid y_j)\, p(y_j)}{g(x_i)} \right)}{\sum_{j=1}^{s} p(y_j) \ln p(y_j)} \tag{9}$$

The following properties of the entropy information are proven (see Appendix B).

Theorem 3.
$0 \le \bar{\rho}(x \to y) \le 1$
$\bar{\rho}(x \to y) = 0$ if and only if $y$ is statistically independent of $x$
$\bar{\rho}(x \to y) = 1$ if and only if $s \le m$ and $y$ is a deterministic function of $x$,
i.e., $y = \delta(x)$, a.s.
$\bar{\rho}(y \to x) = 1$ if and only if $m \le s$ and $x$ is a deterministic function of $y$,
i.e., $x = \psi(y)$, a.s.
$\bar{\rho}(x \to y) = \bar{\rho}(y \to x) = 1$ if and only if $m = s$ and there exists a
deterministic function $\gamma$, such that $y = \gamma(x)$ and $x = \gamma^{-1}(y)$, a.s.


Estimation of the Relative Entropy Information


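Estimation of the mean of the relative entropy information may be achieved
by using either of two approaches: the maximum likelihood method and the
Bayesian robust method. Denote $n_{ij}$ as the number of the observations for which
variable $x$ takes values within the subinterval $i$ and variable $y$ takes the value $y_j$
($i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, s$). Then, the maximum likelihood
estimates of the relevant probabilities are

$$\hat{p}(x_i \mid y_j) = \frac{n_{ij}}{n_{\cdot j}}, \qquad \hat{p}(y_j) = \frac{n_{\cdot j}}{n}, \qquad \hat{g}(x_i) = \frac{n_{i \cdot}}{n} \tag{10}$$

where $n_{\cdot j} = \sum_{i=1}^{m} n_{ij}$, $n_{i \cdot} = \sum_{j=1}^{s} n_{ij}$, and $n = \sum_{i=1}^{m} \sum_{j=1}^{s} n_{ij}$.
Accordingly, the mean of the relative entropy information defined in (9)
is estimated by the maximum likelihood method as follows:

$$\hat{\bar{\rho}}(x \to y) = 1 - \frac{\sum_{i=1}^{m} \sum_{j=1}^{s} n_{ij} \ln (n_{ij} / n_{i \cdot})}{\sum_{j=1}^{s} n_{\cdot j} \ln (n_{\cdot j} / n)} \tag{11}$$

For the purpose of robustness, it is necessary to use the Bayesian estimators for probabilities $p(y_j)$, $g(x_i)$, and $p(x_i \mid y_j)$:

$$\hat{p}(y_j) = \frac{n_{\cdot j} + 1}{n + s}, \qquad \hat{g}(x_i) = \frac{n_{i \cdot} + 1}{n + m}, \qquad \hat{p}(x_i \mid y_j) = \frac{n_{ij} + 1}{n_{\cdot j} + m} \tag{12}$$

Given these estimators, the Bayesian estimate of the mean of the relative entropy information in (9) is given by:

$$\hat{\bar{\rho}}(x \to y) = 1 - \frac{\sum_{i=1}^{m} \sum_{j=1}^{s} \dfrac{(n_{ij} + 1)(n_{\cdot j} + 1)}{n_{\cdot j} + m} \ln \left[ \dfrac{(n_{ij} + 1)(n_{\cdot j} + 1)(n + m)}{(n_{\cdot j} + m)(n + s)(n_{i \cdot} + 1)} \right]}{\sum_{j=1}^{s} (n_{\cdot j} + 1) \ln \left( \dfrac{n_{\cdot j} + 1}{n + s} \right)} \tag{13}$$
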
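
As an illustration of the two estimators, the sketch below evaluates Eq. (9) from a contingency table of counts $n_{ij}$, once with the maximum likelihood probabilities of Eq. (10) and once with the Bayesian probabilities of Eq. (12). The function names and the example tables are hypothetical, and the code assumes that at least two categories of $y$ are observed, since otherwise $H(y) = 0$ and the ratio is undefined.

```python
import math


def mean_relative_entropy(p_x_given_y, p_y, g_x):
    """Eq. (9) evaluated from probability estimates.

    p_x_given_y[i][j] estimates p(x_i | y_j). Zero-probability terms
    are skipped, following the convention 0 * ln 0 = 0.
    """
    num = 0.0
    for i, g in enumerate(g_x):
        for j, py in enumerate(p_y):
            joint = p_x_given_y[i][j] * py
            if joint > 0:
                num += joint * math.log(joint / g)
    den = sum(py * math.log(py) for py in p_y if py > 0)
    return 1.0 - num / den


def margins(table):
    """Row sums n_i., column sums n_.j and total n of table[i][j] = n_ij."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols, sum(rows)


def ml_estimate(table):
    """Maximum likelihood estimate, Eqs. (10) and (11)."""
    rows, cols, n = margins(table)
    cond = [[nij / cols[j] if cols[j] else 0.0 for j, nij in enumerate(r)]
            for r in table]
    return mean_relative_entropy(cond, [c / n for c in cols],
                                 [r / n for r in rows])


def bayes_estimate(table):
    """Bayesian estimate, Eqs. (12) and (13)."""
    rows, cols, n = margins(table)
    m, s = len(rows), len(cols)
    cond = [[(nij + 1) / (cols[j] + m) for j, nij in enumerate(r)]
            for r in table]
    return mean_relative_entropy(cond, [(c + 1) / (n + s) for c in cols],
                                 [(r + 1) / (n + m) for r in rows])


dependent = [[5, 0], [0, 5]]     # y is fully determined by the class of x
independent = [[2, 2], [2, 2]]   # the class of x carries no information on y
print(ml_estimate(dependent), ml_estimate(independent))
print(bayes_estimate(dependent))
```

The maximum likelihood estimate reproduces the two limiting cases of Theorem 3 for these tables, while the Bayesian estimate of the fully dependent table is pulled well below 1 because of the added pseudo-counts.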

Implementation of Discretization
The basic objective of this approach is to choose a scheme for discretizing
the interval $x_0 \le x \le x^0$ into $m$ ($m > 1$) subintervals. This is equivalent to
determining $m - 1$ threshold values. One criterion for such a performance is
to select a set of ($m - 1$) critical cutoff values within the range $[x_0, x^0]$ such
that the estimated relative entropy information of $x$ on the objective variable $y$
is maximized. Such a scheme will be considered as the best discretization of $x$
into $m$ subintervals with respect to $y$.
Conceptually, this optimum discretization may be cast as a nonlinear programming problem. The most convenient and practical method, however, is
still a trial-and-error search algorithm, particularly when the number of threshold values being determined is not large. At first glance, a thorough search
appears to be hopeless, as the possible schemes of discretization for a quantitative variable are infinite. Fortunately, this is not true. Denote the $m - 1$
threshold values by $\mathbf{x}_0 = (x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(m-1)})$, where
$x_0^{(1)} < x_0^{(2)} < \cdots < x_0^{(m-1)}$. Clearly, all of the threshold values must be selected on the interval
$[x_0, x^0]$. More precisely, each of these threshold values must be determined within
one of the $n - 1$ intervals $[x_i, x_{i+1}]$ for $i = 1, 2, \ldots, n - 1$ (here, $x$ is
assumed to be ordered). Adopting an approach similar to that used in the isolated weight method, we consider the midvalues of these intervals,
$x^{(1)}, x^{(2)}, \ldots, x^{(n-1)}$, as the possible candidates for the $m - 1$ optimum threshold values.
With this specification, a thorough search algorithm is feasible and effective.
In the majority of the practical cases, the binary or ternary discretization
of a quantitative measurement is satisfactory. For the ternary transformation, a
search procedure may be developed on the basis of the procedure suggested
above. Here, a detailed search algorithm for the binary transformation only is
presented:
(1) Select a qualitative feature ($y$) of the most interest as the objective
variable, which takes $s$ possible values, $y_1 < y_2 < \cdots < y_s$.
(2) Determine the minimum and maximum values ($x_0$ and $x^0$) of $n$ observations on variable $x$ to be discretized. Compute the difference $\Delta = x^0 - x_0$ and
step length $\Delta x = \Delta / N$, where $N$ is the total number of discretizing schemes.
(3) For any given scheme $k$, $x$ is discretized into a binary array by using the
quantity $x^{(k)} = x_0 + k \Delta x$ as the threshold value. On the basis of this discretization, construct a two-dimensional contingency table [$3 \times (s + 1)$], containing frequencies $n_{ij}$ ($i = 1, 2$, and $j = 1, 2, \ldots, s$).
(4) Based upon the data in the contingency table, estimate the relevant probabilities in (10) for the maximum likelihood method, or in (12) for the Bayesian
robust method. Then, using Eq. (11) or (13), compute the estimate of the average relative entropy information for the $k$th scheme, $\bar{\rho}_k(x \to y)$.
(5) Repeating the steps above $N$ times, we obtain $N$ estimated means of the
relative entropy information, $\bar{\rho}_1(x \to y), \bar{\rho}_2(x \to y), \ldots, \bar{\rho}_N(x \to y)$. Then,
determine the largest mean: $\bar{\rho}_l(x \to y) = \max_k \{ \bar{\rho}_k(x \to y) \}$ ($1 \le l \le N$).
Subsequently, the optimum threshold value is $x^{(l)} = x_0 + l \Delta x$.
It should be noted that the foregoing discussion requires that the objective
variable $y$ be qualitative. In order to satisfy this requirement in the cases where
only quantitative objective variables are available, the selected measurement
($y$) must be discretized into the qualitative form prior to the use of this algorithm. The discretization of $y$ can be done simply through subjective methods.
For example, the range of $y$ is divided into $s$ equal subintervals, each of which
is represented by its mid-value. This simple treatment is reasonable, because
variable $y$ is only used as a reference for the discretization of other measurements.
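
The binary search described above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code: the names are hypothetical, the maximum likelihood estimate of Eq. (11) is used, the upper category is taken as "greater than the threshold", ties are resolved by keeping the first maximum, and the objective variable is assumed to take at least two categories.

```python
import math


def rho_ml(table):
    """Maximum likelihood estimate of Eq. (11); table[i][j] = n_ij."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    num = sum(nij * math.log(nij / rows[i])
              for i, r in enumerate(table) for nij in r if nij > 0)
    den = sum(c * math.log(c / n) for c in cols if c > 0)
    return 1.0 - num / den


def search_binary_threshold(x, y_cat, n_schemes=100):
    """Scan x0 + k * dx for k = 1..N and return (threshold, estimate)."""
    x_min, x_max = min(x), max(x)
    dx = (x_max - x_min) / n_schemes
    column = {c: j for j, c in enumerate(sorted(set(y_cat)))}
    best = None
    for k in range(1, n_schemes + 1):
        t = x_min + k * dx
        table = [[0] * len(column) for _ in range(2)]
        for xv, yv in zip(x, y_cat):
            table[1 if xv > t else 0][column[yv]] += 1
        rho = rho_ml(table)
        if best is None or rho > best[1]:
            best = (t, rho)
    return best


# Hypothetical sample: low x values belong to category 0, high ones to 1.
x = [1, 2, 3, 4, 10, 11, 12, 13]
y_cat = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, rho = search_binary_threshold(x, y_cat, n_schemes=12)
print(threshold, rho)
```

Any threshold between 4 and 10 separates the two categories perfectly here, so the scan stops improving at the first such value.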

Value Assignment to the Discretized Variable


After the best scheme of discretization for measurement $x$ is found, an
appropriate assignment of values to the discretized variable should be established. This step is useful for many geological studies. If the aim of the discretization is to unify geological data of diverse types, each of the discretized
subintervals for a quantitative measurement must be represented by a value from
the unifying scheme.
A useful principle for this assignment is to reveal as much of the information of the discretized variables about the selected objective feature as is
possible. For binary discretization, we usually assign 1 to the category greater
than the critical threshold value and 0 to the other category. For the ternary
transformation, the three categories are usually represented by values 1, 0, and
$-1$. For ternary or higher orders of discretization, the following principle is
suggested for value assignments. Suppose that $y$ has the same number of qualitative categories as the discretized variable $x$, i.e., $s = m$. Then, the value
assignment should be made in such a way that the following quantity is maximized:

$$r_{xy} = 1 - \frac{1}{n} \sum_{i=1}^{n} (e_i - y_i)^2$$

where $e_i$ and $y_i$ are discrete numbers, such as 1, 0, $-1$, etc.

Case Study
The entropy information method described above is applied to the problem
of delineating anomalous targets for the epithermal gold-silver deposits in the
Walker Lake quadrangle, which comprises the area between 38° and 39° North
latitude and 118° and 120° West longitude and includes parts of the states of
California and Nevada. The geochemical data used in this study were collected
from stream sediment samples. Among 30 elements analyzed, 14 elements,
including Au, Ag, Cu, Pb, Zn, Fe, Ca, Sb, Zr, V, Bi, Mo, Be, and B, were
employed in this analysis.
These elements were synthesized into a single measurement, the geochemical score, by using the model referred to as the weighted and targeted
multivariate criterion (Harris and Pan, 1987, 1989a). These scores were then
filtered for noise. The filtered scores are contoured and shown in Fig. 1.
Consider the objective of delineating anomalous targets for the exploration
of epithermal gold-silver deposits. This objective can be perceived as a problem
in optimum discretization of the synthesized geochemical measurement into a
binary variable representing anomaly and background. In order to enhance the
information of the scores about the gold-silver deposits, the sum of gold and
silver concentrations was selected as the objective variable. The entropy information method was then applied to the geochemical scores and the optimum
threshold value was found to be about 580 (Fig. 2). The value of 1 was assigned
to the scores greater than 580 and 0 to the scores less than or equal to 580.
Finally, exploration targets for the epithermal gold-silver deposits were delineated as those areas, such as Windmill, across the entire Walker Lake quadrangle that are represented by a value of 1 and do not have known deposits (Fig.
3).

Fig. 2. Optimal discretization for the filtered geochemical scores using the entropy information method (axis label: weighted scores).


Figure 3 shows that most of the known epithermal gold-silver deposits or
mining districts are well delineated. The delineated areas with no occurrences of deposits may be potential targets for epithermal gold-silver deposits. Each target shown in Fig. 3 has two boundaries. The inner one was
derived from the entropy information method; the outer one was determined
using the arithmetic mean of the scores. Examination of these boundaries on
known deposits suggests that the analysis based upon the entropy information
method produces a more precise delineation of mineral target boundaries than
does that based upon the mean of scores.
RANK CORRELATION METHOD

In this section, an alternative approach, referred to as the rank correlation
method, is proposed for the discretization of a quantitative measurement into a
qualitative variable with any limited number of categories. This model is based
upon the concept of rank correlation and employs probability densities of the
measurement being discretized.
Definition of Partial Rank Correlation

Let us suppose $x$ and $y$ are the quantitative and objective variables, respectively, that have been observed on a sample of size $n$. Rearrange the elements in $y$ and obtain
$y_0 = (y_{i_1}, y_{i_2}, \ldots, y_{i_n})$, where $y_{i_1} \le y_{i_2} \le \cdots \le y_{i_n}$.
Divide the range $D = [y_{i_1}, y_{i_n}]$ into $t$ mutually exclusive subintervals denoted
by $D_1, D_2, \ldots, D_t$ ($t < n$). Let $y^*$ be an array containing midvalues of these
subintervals: $y^* = (y_1^*, y_2^*, \ldots, y_t^*)$, which is referred to as the standard array
of $y$. Clearly, the rank of $y_k^*$ is $k$ for $k = 1, 2, \ldots, t$. Select the minimum and
maximum values, $x_0$ and $x^0$, for measurement $x$ and then divide the region
$E = [x_0, x^0]$ into $s$ mutually exclusive subregions denoted by $E_1, E_2, \ldots, E_s$ ($s < n$).
For any given subinterval $E_i$, probabilities $P_{ij} = P(x \in E_i, y \in D_j)$
($j = 1, 2, \ldots, t$) may be computed:

$$P_{ij} = \int_{D_j} \int_{E_i} f(x, y)\, dx\, dy, \qquad j = 1, 2, \ldots, t \tag{14}$$

where $f(x, y)$ is the joint probability density function of $x$ and $y$. For a particular
$i$ ($1 \le i \le s$), we have a sequence $\{P_{i1}, P_{i2}, \ldots, P_{it}\}$ from which we obtain
a set of ranks $\{r_{i1}, r_{i2}, \ldots, r_{it}\}$.
Definition 5. The following is called the partial rank correlation coefficient between the $i$th subregion $E_i$ of measurement $x$ and the standard array $y^*$:

$$\rho_{xy}^{(i)} = 1 - \frac{6}{t(t^2 - 1)} \sum_{k=1}^{t} (r_{ik} - k)^2, \qquad i = 1, 2, \ldots, s \tag{15}$$

where $k$ is the rank of $y_k^*$.


In general, the partial rank correlation coefficient (PRCC) is compatible
with the usual correlation coefficient. That is, a large value of the PRCC indicates a strong association between the two sequences, whereas a small value
implies a weak association. Positive values represent direct associations, while
negative ones represent inverse associations. A PRCC near zero indicates little
association. In polar cases, the PRCC has the following properties.
Theorem 4. The PRCC between the two rank sequences, $\{r_{i1}, \ldots, r_{it}\}$
and $\{1, \ldots, t\}$, satisfies the following properties:
$-1 \le \rho_{xy}^{(i)} \le 1$
$\rho_{xy}^{(i)} = 1$ if and only if the two sequences match perfectly in rank
$\rho_{xy}^{(i)} = -1$ if and only if the two sequences are opposite in rank
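
The coefficient and its two polar cases can be illustrated with a short Python sketch. The function names are hypothetical, the coefficient is computed in the Spearman form with the factor 6 (the form for which the bounds of Theorem 4 hold), and ties among probabilities are ranked in order of appearance, which is an assumption since the text does not say how ties are treated.

```python
def ranks(values):
    """Ranks 1..t of the values, smallest first (ties by position)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0] * len(values)
    for rank, k in enumerate(order, start=1):
        result[k] = rank
    return result


def prcc(p_row):
    """Partial rank correlation between the probabilities P_i1..P_it of one
    subregion and the standard array, whose ranks are 1..t (requires t >= 2)."""
    t = len(p_row)
    r = ranks(p_row)
    squared = sum((r[k] - (k + 1)) ** 2 for k in range(t))
    return 1 - 6 * squared / (t * (t * t - 1))


increasing = [0.1, 0.2, 0.3, 0.4]   # probability grows with the rank of y
decreasing = [0.4, 0.3, 0.2, 0.1]   # probability falls with the rank of y
print(prcc(increasing), prcc(decreasing))
```

A subregion whose joint probabilities increase with the rank of $y$ gives a coefficient of 1, and the reversed sequence gives $-1$.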
Using Eq. (15), we can compute $\rho_{xy}^{(i)}$ for each of the states $E_i$ ($i = 1, 2, \ldots, s$),
obtaining $s$ PRCCs. In order to establish a criterion for optimizing the discretization, we define the following quantity referred to as the partial rank correlation difference:

$$\Delta \rho^2(E) = \frac{1}{s(s - 1)} \sum_{k=1}^{s-1} \sum_{l=k+1}^{s} w_{kl} (\rho_k - \rho_l)^2 \tag{16}$$

where $\rho_k = \rho_{xy}^{(k)}$, $w_{kl} = \min(p_{0k}, p_{0l}) / \max(p_{0k}, p_{0l})$, $p_{0k} = \sum_{j=1}^{t} P_{kj}$, and $E$ represents
any discretization scheme for $x$.
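
A minimal Python sketch of Eq. (16) is given below. It takes the PRCCs and the marginal probabilities of the subregions as inputs; the function name and the example values are hypothetical, and pairs in which both subregions are empty are simply skipped, which is an assumption not covered by the text.

```python
def rank_correlation_difference(rho, p0):
    """Partial rank correlation difference of Eq. (16).

    rho[k] is the PRCC of subregion k and p0[k] its marginal probability.
    Each pair of subregions is weighted by min(p0) / max(p0), so pairs of
    very unequal size contribute less.
    """
    s = len(rho)
    total = 0.0
    for k in range(s - 1):
        for l in range(k + 1, s):
            largest = max(p0[k], p0[l])
            if largest == 0:
                continue  # both subregions empty: no contribution (assumption)
            weight = min(p0[k], p0[l]) / largest
            total += weight * (rho[k] - rho[l]) ** 2
    return total / (s * (s - 1))


# Two equally likely subregions with opposite rank associations.
print(rank_correlation_difference([-1.0, 1.0], [0.5, 0.5]))
# Two subregions with identical PRCCs show no contrast.
print(rank_correlation_difference([1.0, 1.0], [0.3, 0.7]))
```

In a search over candidate schemes, this quantity would be evaluated for each scheme and the scheme with the largest value retained.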
Procedure for Determining Threshold Values

The basic strategy for finding the optimum discretizing scheme for the
entropy information method discussed in the last section is also appropriate for
the rank correlation method. The criterion employed here for determining critical threshold values for a quantitative measurement x is to maximize the quantity (16). Given ApZ(E),find E* such that
7r(E*) = max { A p 2 ( E ) } .

(17)

where E* is called the optimum discretization for x with respect to y*. Clearly,
for a division of s subintervals, s - 1 optimum threshold values are sought.
The rationale for criterion (17) is intuitive because it is based upon the partial rank correlation, the PRCC in (15), which describes the rank consistency between two sequences: the joint occurrence probabilities defined in (14) and the standard array y*. Larger $\rho_{xy}^{(i)}$ values indicate that the subregion $E_i$ of measurement x more likely corresponds with the larger values of the objective variable y, and vice versa. Therefore, maximization of the partial rank correlation difference in (17) enhances the contrasts between the PRCCs as much as possible. One consequence of this is the maximum separation of two groups of subregions. One of the groups contains those subregions having the most positive


PRCCs with y*, meaning that these regions most likely co-occur with the largest values of y, whereas the other includes those subregions having the most negative PRCCs, suggesting that they are most likely associated with the smallest values of y. For example, let s = 2, meaning that x is transformed into a binary variable. Denote the two subregions by 0 and 1, respectively, and assume that $\rho_{xy}^{(1)} > 0$ and $\rho_{xy}^{(2)} < 0$. Then, criterion (17) would lead to an optimum binary discretization of x in which the value "1" represents information of x about the largest values of y, while "0" represents information of x about the smallest values of y. If y is the size of ore deposits, then observations of "1" on the discretized variable indicate possible occurrence of large deposits, while "0" indicates small or no deposits.
Value Assignment to the Discretized Variables

A general rule for value assignment does not exist for this method. Principles for value assignment are suggested below for binary and ternary variables only. These principles are useful when the discretization is motivated by the objective of mineral resources estimation.
In the binary case, two PRCCs are computed for the two discretized subregions. Assign 1 and 0 to the two subregions according to the following rules.
* If the signs of $\rho_{xy}^{(1)}$ and $\rho_{xy}^{(2)}$ are opposite and $|\rho_{xy}^{(1)} - \rho_{xy}^{(2)}| > c$, where c is a positive number, then the subregion corresponding to the positive $\rho$ is assigned a value of 1; the other is given 0.
* If $\rho_{xy}^{(1)}$ and $\rho_{xy}^{(2)}$ have the same sign and $|\rho_{xy}^{(1)} - \rho_{xy}^{(2)}| > c$, then the subregion with the larger absolute value is assigned a value of 1, and the other, 0.
* When $|\rho_{xy}^{(1)} - \rho_{xy}^{(2)}| \le c$, this variable may be deleted, if it is believed to be not strongly correlated with the objective variable y.
In the ternary case, three PRCCs are computed for the three discretized subregions. Assign a value of 1, 0, or -1 to each of the three subregions according to the following rules.
* If different signs exist among the three PRCCs, and $\max_{i \ne j} |\rho_{xy}^{(i)} - \rho_{xy}^{(j)}| > c$, then the subregion with the largest positive PRCC is given a value of 1, the subregion with the smallest PRCC a value of -1, and the other a value of 0.
* When the PRCCs have the same sign, but $|\rho_{xy}^{(i)} - \rho_{xy}^{(j)}| > c$ for all $i \ne j$, then the subregion with the largest PRCC is given a value of 1 if the PRCCs are positive, and a value of -1 if the PRCCs are negative; the other subregions are given a value of 0.
* Except for the two cases above, when the variable x is believed to be weakly correlated with the variation of the objective variable y, it may be deleted.
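
The assignment rules above translate almost directly into code. The sketch below follows the wording of the rules literally, returns `None` when a rule suggests that the variable may be deleted, and treats a PRCC of exactly zero as having no sign. The function names, the example PRCCs, and the threshold values are illustrative assumptions; the paper does not state a value for c.

```python
def assign_binary(rho1, rho2, c):
    """Return the values for the two subregions, or None if x may be dropped."""
    if abs(rho1 - rho2) <= c:
        return None
    if rho1 * rho2 < 0:
        # Opposite signs: the subregion with the positive PRCC gets 1.
        return (1, 0) if rho1 > 0 else (0, 1)
    # Same sign: the subregion with the larger absolute PRCC gets 1.
    return (1, 0) if abs(rho1) > abs(rho2) else (0, 1)


def assign_ternary(rhos, c):
    """Return the values for the three subregions, or None if x may be dropped."""
    diffs = [abs(rhos[i] - rhos[j]) for i in range(3) for j in range(i + 1, 3)]
    values = [0, 0, 0]
    mixed_signs = any(r > 0 for r in rhos) and any(r < 0 for r in rhos)
    if mixed_signs:
        if max(diffs) <= c:
            return None
        values[rhos.index(max(rhos))] = 1    # largest positive PRCC
        values[rhos.index(min(rhos))] = -1   # smallest PRCC
        return tuple(values)
    if min(diffs) <= c:                      # rule requires every pair to differ by more than c
        return None
    top = rhos.index(max(rhos))              # "largest PRCC", read literally
    values[top] = 1 if rhos[top] > 0 else -1
    return tuple(values)


print(assign_binary(0.6, -0.4, 0.5))            # (1, 0)
print(assign_ternary([-0.1, 0.7, -0.8], 0.3))   # (0, 1, -1)
print(assign_ternary([0.2, 0.3, 0.9], 0.25))    # None
```

In practice the choice of c, and the final decision to drop a variable, remain matters of judgment, as the rules themselves indicate.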



Case Study

The rank correlation method described above was applied to a set of data collected in the Walker Lake 1° × 2° quadrangle. The data set consists of 9 integrated geofeatures, which are briefly described as follows (Harris and Pan, 1987, 1988, 1989a, b; Pan and Harris, 1989b): x1, filtered geochemical scores that were derived from synthesis of the 14 elements sampled from drainage basins; x2, high pass structural fields that were obtained by synthesis of the 10 structural descriptors related to faults; x3, band pass gravity fields that were derived from coherency analysis between high pass isostatic gravity fields and filtered geochemical fields; x4, band pass magnetic fields that were derived from coherency analysis between high pass magnetic fields and filtered geochemical fields; x5, ratio of rock density to susceptibility contrast estimated by a Poisson moving window, based upon high pass gravity and magnetic fields; x6, correlation between high pass gravity and magnetic fields estimated by a Poisson moving window; x7, area (in km²) of host rocks for epithermal gold-silver deposits that outcrop within a cell; x8, area of Tertiary intrusives that outcrop within a cell; and x9, area of hydrothermal alterations found within a cell.
Each of these geofeatures is valued on a 55 × 55 inter-grid matrix across the Walker Lake region. In order to apply the discretization approach, a region located chiefly in the Aurora 15' quadrangle and containing 324 sample locations was selected as a control region. Using the number of epithermal gold-silver mineral occurrences as the objective variable, these quantitative measures were discretized optimally into ternary variables by the rank correlation method. The basic results of this transformation are shown in Table 2, where $c_1^*$ and $c_2^*$ are the two optimum threshold values. The best subintervals are recognized in terms of their correlations with mineral occurrences. For example, geochemical scores (x1) greater than 558.5 (value of 1) are most favorable for mineral occurrence, while scores less than the same cutoff offer little evidence for the occurrence of mineral deposits. Another interesting feature is that the mid-subintervals of some variables are most valuable for indicating existence of mineralization, e.g., interval [0.146, 1.569] mgal of band pass gravity (x3) and interval [-66.85, -39.86] gamma of band pass magnetics (x4). The third feature is that the discretization reveals operational directions of the variables vis-a-vis objective variables. For example, geochemical scores, host rocks, hydrothermal alterations, etc., are positively associated with the number of mineral occurrences.

Table 2. Discretized Results by the Rank Correlation Method

Item         x1        x2        x3        x4        x5        x6        x7        x8        x9
min.         0.031     -17.35    -7.918    -188.4    -29.61    -0.864    0.000     0.000     0.000
max.         3705.5    16.93     6.312     216.6     19.16     0.845     6.330     0.880     0.840
ρ(1)         -0.179    0.821     -0.143    -0.750    -0.964    -0.857    -0.821    -1.000    -0.750
ρ(2)         -0.17     -0.857    0.714     0.750     0.964     0.643     0.679     0.179     0.607
ρ(3)         0.679     0.714     -0.821    0.250     0.321     0.893     0.857     0.679     0.750
Δρ*          0.477     0.440     0.382     0.280     0.803     0.664     0.577     0.703     0.599
c1*          26.38     -2.494    0.146     -66.86    1.279     -0.010    0.211     0.059     0.084
c2*          558.5     4.365     1.569     -39.86    4.530     0.389     1.477     0.293     0.336
< c1*        0         1         0         -1        -1        -1        -1        -1        -1
[c1*, c2*]   0         -1        1         1         1         1         1         0         1
> c2*        1         1         -1        0         0         1         1         1         1
SUMMARY
The three techniques for the optimum discretization proposed and demonstrated in this paper are potentially useful for many geological problems.
Possible applications include the following:
1. Defining the optimum boundaries of geologic objects, geofields, and
various anomalous targets. These geological boundaries are important
in mineral exploration and mineral resource estimation, as mineral endowment units of various scales and kinds are closely related to these
geologic boundaries.
2. Revealing those subregions of a geologic variable carrying the most information about the variations of mineral resources, even though the variable as a whole may not correlate significantly with mineral resource descriptors.
3. Refining and selecting important and useful geological variables based
upon the maximum correlations between the geologic measurements
and some objective variable.
4. Unifying diverse geodata through transformation of quantitative geologic variables into binary or ternary data, as required by methods (e.g., characteristic analysis) that take binary or ternary input variables.
5. Recognizing the operational directions of variables in relation to variations of the objective variable, that is, detecting whether a geological measurement is a positive or negative factor in terms of its influence on the objective variable.
APPENDIX A
Proof for Theorem 1

i. Proof for the first part by induction.

When n = 3, for any given $n_1$ ($\le 3$), it is readily proven that the first part of the theorem is correct. Suppose that the first part of the theorem is correct for any integer $n_1 = k_1$ ($\le k$) when n = k. We now prove that the conclusion is also correct when n = k + 1.
Let $e_k$ be an array containing k elements:

$$e_k = (1, 1, \ldots, 1, 0, 0, \ldots, 0) = (e_{k_1}, e_{k_2})$$

where $k_1$ is the number of ones and $k_2$ is the number of zeroes, and

$$e_{k_1} = (1, 1, \ldots, 1), \qquad e_{k_2} = (0, 0, \ldots, 0)$$

Denote its array distance by $d(e_k)$. Now append an additional element $e_{k+1} = 0$ to the end of the array $e_k$, preserving the conditions $l_1 = n_1$ and $l_0 = n_0$. Then, a new array containing k + 1 elements is obtained:

$$e_{k+1} = (e_k, \hat{0}) = (e_{k_1}, e_{k_2}, \hat{0})$$

where the capped element is the inserted one. The array distance of the new array is computed:

$$d(e_{k+1}) = d(e_k) + k_2 + (k_2 + 1) + \cdots + (k_1 + k_2 - 1)$$

A different array is formed if the new element $e_{k+1}$ is inserted in the ith position ($1 < i < k_1$), forming a new array in which the maximum running lengths are not equal to their corresponding total numbers (i.e., $l_1 \ne n_1$ and $l_0 \ne n_0$):

$$e'_{k+1} = (1, 1, \ldots, 1, \hat{0}, 1, 1, \ldots, 1, 0, 0, \ldots, 0) = (e'_{k_1+1}, e_{k_2}) = (e'_k, 0)$$

where

$$e'_{k_1+1} = (1, 1, \ldots, 1, \hat{0}, 1, 1, \ldots, 1), \qquad e_{k_2} = (0, 0, \ldots, 0)$$

and $e'_k$ contains the first $k = k_1 + k_2$ elements:

$$e'_k = (1, 1, \ldots, 1, \hat{0}, 1, 1, \ldots, 1, 0, \ldots, 0)$$

Accordingly, we have

$$\begin{aligned}
d(e'_{k+1}) &= d(e'_k) + (k_2 - 1) + \cdots + (k_1 + k_2 - i - 1) + (k_1 + k_2 - i + 1) + \cdots + (k_1 + k_2 - 1) \\
&= d(e'_k) + k_2 + (k_2 + 1) + \cdots + (k_1 + k_2 - 1) - (k_1 - i + 1) \\
&\le d(e_k) + k_2 + (k_2 + 1) + \cdots + (k_1 + k_2 - 1) - (k_1 - i + 1) \\
&= d(e_{k+1}) - (k_1 - i + 1) \le d(e_{k+1})
\end{aligned}$$

For the case that $e_{k+1} = 1$, a similar result can also be obtained. Therefore, by the principle of induction, the proof for the first part of Theorem 1 is completed.


ii. Proof for the second part by induction:

Suppose that n is an even number and $n_1 = n_0 = n/2$; other cases can be shown similarly. When n = 4, the second part of the theorem is easily proven. Suppose that the conclusion is also true when n = k (k > 4). We now prove that the conclusion is still true when n = k + 1. Let $e_k$ be the following array:

$$e_k = (0, 1, 0, 1, \ldots, 0, 1)$$

The distance associated with this array is denoted by $d(e_k)$. Append an additional element, $e_{k+1} = 0$, to the end of the array $e_k$ and form a new array:

$$e_{k+1} = (0, 1, 0, 1, \ldots, 0, 1, 0) = (e_k, 0) = (e_{k-1}, 1, 0)$$

where $e_{k-1} = (0, 1, 0, 1, \ldots, 0)$, containing the first k - 1 elements. Compute the array distance:

$$d(e_{k+1}) = d(e_{k-1}) + 2[2 + 4 + \cdots + (k - 2)]$$

However, if the additional element $e_{k+1}$ is inserted into the array $e_k$ at the ith position ($1 < i \le k$), it destroys the original configuration of the array (i.e., $l_1 \ne 1$ and $l_0 \ne 1$). In order to compute the new array distance, this array is reversed (note that such a modification does not alter the array distance). Then, the new array becomes:

$$e'_{k+1} = (1, 0, 1, 0, \ldots, 1, 0, 0, 1, 0, \ldots, 1, 0) = (e'_{k-1}, 1, 0)$$

where $e'_{k-1}$ contains the first k - 1 elements.


Compute the array distance of $e'_{k+1}$:

$$\begin{aligned}
d(e'_{k+1}) &= d(e'_{k-1}) + [2 + 4 + \cdots + (k - i - 2) + (k - i + 1) + (k - i + 3) + \cdots + (k - 1)] \\
&\quad + [2 + 4 + \cdots + (k - i - 2) + (k - i - 1) + (k - i + 1) + \cdots + (k - 3)] \\
&= d(e'_{k-1}) + 2[2 + 4 + \cdots + (k - 2)]
\end{aligned}$$

Finally, we have

$$d(e'_{k+1}) - d(e_{k+1}) = d(e'_{k-1}) - d(e_{k-1}) \ge 0$$

which leads to the conclusion:

$$d(e_{k+1}) \le d(e'_{k+1})$$

A similar proof can be obtained when $e_{k+1} = 1$. Therefore, by the principle of induction, the proof is completed.
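
The two parts of the proof can be checked numerically for small arrays. The sketch below rests on an assumption: the array distance is taken to be the sum, over all pairs of unlike elements, of the number of elements lying between them. That reading is inferred from the increments used in the proof (for example, appending a 0 to a block of $k_1$ ones followed by $k_2$ zeroes adds $k_2 + \cdots + (k_1 + k_2 - 1)$); the definition itself is given earlier in the paper. The function names are illustrative.

```python
from itertools import combinations


def array_distance(e):
    """Assumed array distance: for every pair of unlike elements, count the elements between them."""
    return sum(b - a - 1 for a, b in combinations(range(len(e)), 2) if e[a] != e[b])


def distance_extremes(n, n1):
    """Brute-force the minimum and maximum distance over all 0/1 arrays with n1 ones."""
    values = []
    for ones in combinations(range(n), n1):
        e = [1 if i in ones else 0 for i in range(n)]
        values.append(array_distance(e))
    return min(values), max(values)


clustered = [1, 1, 1, 0, 0, 0]
alternating = [0, 1, 0, 1, 0, 1]
print(array_distance(clustered), array_distance(alternating))  # 18 10
print(distance_extremes(6, 3))                                 # (10, 18)
```

For n = 6 with three ones, the fully clustered array attains the maximum and the alternating array attains the minimum, in line with the two inequalities derived above. The minimum is not unique in this small case.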



APPENDIX B
Proof for Theorem 3

The proof is given only for the cases where both x and y are discrete. Equation (8) can be written as $\tau(x \to y) = 1 - v_x(y)$, where

$$v_x(y) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{s} p(y_j \mid x_i)\, g(x_i) \ln p(y_j \mid x_i)}{\sum_{j=1}^{s} p(y_j) \ln p(y_j)}$$

Clearly, $v_x(y) \ge 0$. Furthermore,

$$p(y_j) \ge p(y_j \mid x_i) \ge p(y_j \mid x_i)\, g(x_i) = p(x_i, y_j)$$

which implies that $v_x(y) \le 1$. Thus, $0 \le v_x(y) \le 1$. Therefore, $0 \le \tau(x \to y) \le 1$, which is the first part of the theorem.
Note that $p(y_j \mid x_i) = p(y_j)$ for all i's and j's if and only if x and y are statistically independent. Thus, under this necessary and sufficient condition, the numerator of $v_x(y)$ becomes

$$\sum_{i=1}^{m} \sum_{j=1}^{s} p(y_j \mid x_i)\, g(x_i) \ln p(y_j \mid x_i) = \sum_{j=1}^{s} p(y_j) \ln p(y_j) = H(y)$$

Thus, $v_x(y) = 1$, meaning that $\tau(x \to y) = 0$, which completes the proof for the second part of the theorem.
We know that $\tau(x \to y) = 1$ if and only if $v_x(y) = 0$, which suggests that for any given i (i = 1, 2, ..., m), there exists a unique j*, such that

$$p(y_j \mid x_i) = \begin{cases} 1, & j = j^* \\ 0, & j \ne j^* \end{cases} \qquad j = 1, 2, \ldots, s$$

This implies that there exists a function $\varphi$, such that $y_{j^*} = \varphi(x_i)$ with probability 1. Because i is arbitrary, we have

$$y = \varphi(x), \quad \text{a.s.}, \quad s \le m$$

which completes the proof for the third part of the theorem. Similarly, we can also prove the fourth part of the theorem, i.e.,

$$x = \psi(y), \quad \text{a.s.}, \quad m \le s$$

When $\tau(x \to y) = \tau(y \to x) = 1$, we have

$$y = \varphi(x), \quad x = \psi(y), \quad \text{a.s.}, \quad s = m$$

Next, prove that $\varphi$ and $\psi$ are mutually reversible from one to the other. Suppose that there exist j* and i*, such that $y_{j^*} = \varphi(x_{i^*})$, i.e., $p(y_{j^*} \mid x_{i^*}) = 1$. Then, we have

$$p(y_{j^*})\, p(x_{i^*} \mid y_{j^*}) = p(x_{i^*}, y_{j^*}) = p(x_{i^*})\, p(y_{j^*} \mid x_{i^*}) = p(x_{i^*}) > 0$$

Thus, we must have $p(x_{i^*} \mid y_{j^*}) > 0$. Furthermore, $\tau(y \to x) = 1 - v_y(x) = 1$ if and only if $v_y(x) = 0$. Hence, we obtain

$$x_{i^*} = \psi(y_{j^*})$$

Denote $\varphi$ by $\gamma$. Then, we have $y_{j^*} = \gamma(x_{i^*})$ and $x_{i^*} = \gamma^{-1}(y_{j^*})$. Because i* and j* are arbitrary, the proof for the last part of the theorem is completed.
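
The limiting cases treated in this proof are easy to reproduce numerically. The sketch below evaluates 1 - v_x(y) from a joint probability table, using the ratio form of v_x(y) given in this appendix and the convention 0 ln 0 = 0. The function name `association_x_to_y` and the example tables are illustrative assumptions.

```python
import math


def association_x_to_y(joint):
    """Return 1 - v_x(y) for a table with joint[i][j] = p(x_i, y_j)."""
    g = [sum(row) for row in joint]              # marginal probabilities of x
    p_y = [sum(col) for col in zip(*joint)]      # marginal probabilities of y
    numerator = 0.0
    for i, row in enumerate(joint):
        for p_ij in row:
            if p_ij > 0:                         # skip empty cells: 0 ln 0 = 0
                cond = p_ij / g[i]               # p(y_j | x_i)
                numerator += cond * g[i] * math.log(cond)
    denominator = sum(p * math.log(p) for p in p_y if p > 0)
    if denominator == 0:
        raise ValueError("y takes a single value, so the measure is undefined")
    return 1.0 - numerator / denominator


independent = [[0.25, 0.25], [0.25, 0.25]]
functional = [[0.3, 0.0], [0.0, 0.4], [0.3, 0.0]]    # y is a function of x (s = 2, m = 3)
transposed = [list(col) for col in zip(*functional)]  # roles of x and y swapped

print(association_x_to_y(independent))  # independence gives 0
print(association_x_to_y(functional))   # y = phi(x) gives 1
print(association_x_to_y(transposed))   # x is not a function of y, so below 1
```

The last two calls illustrate the asymmetry of the measure: with s < m, y can be determined by x without x being determined by y, which is why both directions must equal 1 before the mapping is invertible.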

ACKNOWLEDGMENTS
Grateful acknowledgement is made of the support of the data provided by various U.S. Geological Survey offices and personnel. Special thanks are given for the guidance provided by Dr. David Menzie. I also wish to thank the referees for their valuable comments and suggestions. Finally, we are appreciative of the assistance of Alice Yelverton and Yinghong Miao in preparation of tables and figures.

REFERENCES
Agterberg, F. P., 1989, Systematic approach to dealing with uncertainty of geoscience information
in mineral exploration, in Weiss, A. (Ed.), 21st Application of Computers and Operations
Research in the Mineral Industry, p. 165-178.
Bonham-Carter, G. F., Agterberg, F. P., and Wright, D. F., 1988, Integration of geological datasets for gold exploration in Nova Scotia: Photogram. Engin. Remote Sensing, v. 54, p. 1585-1592.
Botbol, J. M., 1971, An application of characteristic analysis to mineral exploration: Proc. 9th Int.
Sym. on Techniques for Decision-Making in the Mineral Industry, Special v. 12, p. 92-99.
Botbol, J. M., Sinding-Larsen, R., and McCammon, R. B., 1978, A regionalized multivariate
approach to target selection in geochemical exploration: Econ. Geol., v. 73, p. 534-546.
Dong, W. Q., Zhou, G. Y., and Xia, L. X., 1979, Theory of quantifications and their applications
(in Chinese): Jilin People's Publisher, Changchun, 197 p.
Harris, D. P., and Pan, G. C., 1987, An investigation of quantification methods and multivariate
relations designed explicitly to support the estimation of mineral resources--intrinsic samples:
Report on Research Sponsored by U.S. Geological Survey Grant No. 14-08-0001-G1399, 200
p.
Harris, D. P., and Pan, G. C., 1988, Intrinsic sample methodology, in Gaal, G., and Merriam, D.
F. (Eds.), Computer Applications in Resource Exploration: Prediction and Assessment for
Petroleum, Metals and Nonmetals: v. 6, Computers and Geology Series, Pergamon Press,
New York.
Harris, D. P., and Pan, G. C., 1989a, Updated concepts of intrinsic samples and methodology for simultaneous estimation of discovered resources and endowments: Report on Research Sponsored by U.S. Geological Survey, in preparation.
Harris, D. P., and Pan, G. C., 1989b, Information fields and exploration targets with a demonstration on the Walker Lake quadrangle of Nevada and California: Math. Geol., submitted.
McCammon, R. B., Botbol, J. M., Sinding-Larsen, R., and Bowen, R. W., 1983, Characteristic
analysis--1981: Final program and a possible discovery: Math. Geol., v. 15, p. 59-83.
Pan, G. C., 1985, Quantitative mineral resource assessment on the pegmatitic Nb-Ta mineral
deposits in Fujian Province of China--Method Investigation (in Chinese): M.S. thesis,
Changchun College of Geology, 141 p.
Pan, G. C., and Harris, D. P., 1989a, Decomposed and weighted characteristic analysis in the
quantitative evaluation of mineral resources with a case study on the pegmatitic Nb-Ta deposits in China: Math. Geol., submitted.
Pan, G. C., and Harris, D. P., 1989b, Quantitative analysis of anomalous sources and geochemical
signatures in the Walker Lake quadrangle of Nevada and California: J. Geochem. Explor. (in
press).
Pan, G. C., and Wang, Y., 1987, Weighted characteristic analysis and its applications in the
assessment of the pegmatitic Nb-Ta mineral resources in Fujian Province: Geol. Prospect.,
v. 23, p. 34-42.
Pan, G. C., and Xia, L., 1988, Methods for quantification of association between variables by
means of information theory: Math. Stat. Applied Prob., v. 3, p. 7-20.
