
# XRDP PATTERN MATCHING: PROBABILITY BASED VERSUS IMAGE COMPARISON

Thomas Degen, PANalytical B.V., Lelyweg 1, Almelo, The Netherlands

Introduction:

Nowadays cluster analysis is an indispensable tool for high-throughput data analysis. Almost all cluster analysis approaches (agglomerative, divisive, k-means, fuzzy, DBSCAN and so on) require a matrix that can be calculated by comparing all involved patterns with each other. However, if the variation between the patterns is not properly extracted into such a correlation matrix, then subsequent methodology will not be able to reveal it, because some of the essential information has already been lost in this first comparison step.

One advantage of using a correlation matrix is the severe data reduction that takes place. Techniques like MDS (Metric Multi-dimensional Scaling), PCA (Principal Component Analysis) or Sammon's non-linear mapping thus become much easier and quicker to use for visualizing the systematic and non-systematic data variations.
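The data reduction can be made concrete: once an n-by-n dissimilarity matrix exists, metric MDS needs only that matrix to produce a low-dimensional map. Below is a minimal, illustrative sketch of classical (Torgerson) MDS in NumPy; it is our own generic example and is not tied to any particular software package.

```python
import numpy as np

def classical_mds(d, n_components=2):
    """Classical (Torgerson) metric MDS: embed points so that their
    Euclidean distances approximate the dissimilarity matrix d."""
    n = d.shape[0]
    # Double-center the squared dissimilarities: B = -1/2 * J D^2 J
    j = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j @ (d ** 2) @ j
    # Eigen-decompose B and keep the largest non-negative eigenvalues
    evals, evecs = np.linalg.eigh(b)
    order = np.argsort(evals)[::-1][:n_components]
    lam = np.clip(evals[order], 0.0, None)
    return evecs[:, order] * np.sqrt(lam)

# Usage: four points on a line, pairwise distances |i - j|
d = np.abs(np.subtract.outer(np.arange(4.0), np.arange(4.0)))
coords = classical_mds(d, n_components=1)
# The recovered 1-D coordinates reproduce the spacing up to sign and shift
```

For truly Euclidean distances this embedding is exact; for a general correlation-derived dissimilarity matrix it gives the best low-rank approximation in the least-squares sense.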

The basic question for a correlation matrix is: how does one properly calculate the (dis)similarity of two X-ray powder diffraction patterns?

Here we compare a probability-based approach (Ref. 7), where the probability is constructed on a point-by-point basis from the signal-to-noise ratio of the supplied profile data, against another approach that does not take into account the counting statistics of the XRDP raw data. We simply perform the PCA (Ref. 5, 6) and/or the cluster analysis directly on a matrix X of the raw data, where the individual scans form the rows of the matrix and the individual data points form the columns. This of course requires that all scans are either measured on the same measurement grid (corresponding data points are measured at the same positions) or that the scans are first interpolated onto such a common grid.
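The common-grid prerequisite can be sketched in a few lines. The following example (synthetic scans and all names are our own illustration, not data from the study) resamples scans measured on different grids onto one common grid by linear interpolation and stacks them into the data matrix X.

```python
import numpy as np

def build_data_matrix(scans, common_grid):
    """Resample scans onto a common 2-theta grid and stack them into X
    (rows = scans, columns = data points), as required for direct PCA.

    scans: list of (two_theta, counts) array pairs on arbitrary grids."""
    rows = [np.interp(common_grid, two_theta, counts)
            for two_theta, counts in scans]
    return np.vstack(rows)

# Usage with two synthetic scans on slightly different measurement grids
grid_a = np.linspace(20.0, 120.0, 2001)
grid_b = np.linspace(20.0, 120.0, 1801)
scan_a = (grid_a, np.exp(-((grid_a - 45.0) ** 2) / 0.05))
scan_b = (grid_b, np.exp(-((grid_b - 45.0) ** 2) / 0.05))
common = np.linspace(20.0, 120.0, 1001)
X = build_data_matrix([scan_a, scan_b], common)   # shape (2, 1001)
```

Linear interpolation is only the simplest choice; any resampling scheme works as long as corresponding columns of X refer to the same 2-theta position.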

Given is the matrix X with n scans and m observations. Each observation represents one coordinate axis, so each scan can be plotted as a point in m-dimensional space. The entire dataset is then a swarm of points in this space. In this point swarm the first principal component PC1 is the line that gives the best approximation to the data, i.e. represents the maximum variance within the data. PC2 is orthogonal to PC1 and again has maximum variance in its direction. Further components are generated accordingly. Thus the PCs are constructed in order of declining importance. The coordinate of a scan, when projected onto an axis given by a principal component, is called its score.

In matrix notation the PCA approximates the data matrix X (n scans and m observations) by two smaller matrices, the score matrix U (n scans by d principal components) and the loading matrix V, whose transpose V^T has d principal components and m observations/variables:

X ≈ U * V^T

The entries of V can be understood as the weights for each original variable when calculating the principal components.
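This decomposition can be sketched compactly via the singular value decomposition, which is one standard way to obtain scores and loadings; the text above does not specify the numerical method actually used, so treat this as a generic illustration.

```python
import numpy as np

def pca_scores_loadings(x, d):
    """Approximate X (n scans x m observations) by scores U (n x d) and
    loadings V (m x d), so that the centered data ~ scores @ loadings.T."""
    xc = x - x.mean(axis=0)                # center each observation/column
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    scores = u[:, :d] * s[:d]              # coordinates of each scan
    loadings = vt[:d].T                    # weights for each original variable
    explained = s[:d] ** 2 / np.sum(s ** 2) * 100.0  # % variation per PC
    return scores, loadings, explained

# Usage: 6 synthetic scans whose variation lies along a single direction
rng = np.random.default_rng(0)
base = rng.normal(size=50)
x = np.outer(np.linspace(-1.0, 1.0, 6), base)    # rank-1 variation
scores, loadings, explained = pca_scores_loadings(x, d=2)
# PC1 captures essentially all of the variation in this toy data
```

The `explained` values correspond to the percentages read off the Eigenvalues plots discussed below.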

A closer inspection (Figure 3b) of the plot gives a faint idea of the possible presence of more groups. But these groups are much more difficult to spot than in Figure 2a. Furthermore, the Eigenvalues plot (Figure 3c) shows that the total variation in the data is almost completely collected in the first principal component (x-direction), which already accounts for 98% of the total variation. This means that PC2 and PC3 (y- and z-direction) are more sensitive to noise and therefore much less reliable.

References:
1) Spearman, C. (1904), Am. J. Psychol. 15, 72-101.
2) Conover, W. J. (1998), Practical Nonparametric Statistics, 3rd ed., John Wiley & Sons, New York.
3) Gilmore et al. (2004), J. Appl. Cryst. 37, 231-242.
4) Pearson, K. (1896), Mathematical Contributions to the Theory of Evolution. III. Regression, Heredity and Panmixia, Philosophical Transactions of the Royal Society of London 187, 253-318.
5) Malinowski, E. H. and Howery, D. G. (1980), Factor Analysis in Chemistry, John Wiley & Sons, New York.
6) Jolliffe, I. T. (1986), Principal Component Analysis, Springer-Verlag, New York.
7) Unpublished proprietary algorithm.

Other prominent full-pattern approaches to generate a correlation matrix (disregarding the signal-to-noise ratio) are, for example, the non-parametric Spearman rank-order test (Ref. 1, 2, 3) or Pearson's r (Ref. 3, 4), a parametric linear correlation coefficient.
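For concreteness, both coefficients can be computed for all scan pairs at once; Spearman's rank-order coefficient is simply Pearson's r applied to rank-transformed intensities. The sketch below assumes no tied intensity values (ties would need averaged ranks) and is a generic illustration, not the cited implementations.

```python
import numpy as np

def correlation_matrices(x):
    """Full-pattern similarity matrices for the scans in the rows of x:
    Pearson's r on the raw intensities, and Spearman's rho as Pearson's r
    on the rank-transformed intensities (valid when there are no ties)."""
    pearson = np.corrcoef(x)
    # Double argsort turns each row into its ranks (0 .. m-1)
    ranks = np.argsort(np.argsort(x, axis=1), axis=1).astype(float)
    spearman = np.corrcoef(ranks)
    return pearson, spearman

# Usage: three scans; the third is a monotone but non-linear
# distortion of the first (its intensities squared)
x = np.array([[1.0, 2.0, 3.0, 4.0],
              [4.0, 3.0, 2.0, 1.0],
              [1.0, 4.0, 9.0, 16.0]])
pearson, spearman = correlation_matrices(x)
# Spearman rates scans 0 and 2 as perfectly similar (rho = 1);
# Pearson does not (r < 1), because the distortion is non-linear
```

This difference is exactly why the rank-order test is called non-parametric: it responds to the ordering of the intensities rather than to their linear relationship.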
The sample material used in this example is an industrial, proprietary pre-product. Seven samples were taken from four batches; each sample was prepared and measured eight times, resulting in a grand total of 56 scans.
A first rough, visual inspection of the surface plot of all 56 scans (Figure 1) leads to the impression that two different groups of scans are present.

Figure 2a. 3D PCA score plot generated from the correlation matrix calculated by a probability-based comparison algorithm. It clearly shows the presence of four to five different groups of patterns.

Figure 3a. Front view of the 3D PCA score plot based on the full matrix of observations, without probabilities.

[Figure 1 plot area: surface plot with Position [2Theta] from 20 to 120 on the x-axis, Scan number on the y-axis, and an intensity colour scale in Ln(Counts) from 4.658 to 9.156.]
Summary:

This example gives a clear indication of the benefits of using counting statistics for the extraction of pattern variation into a correlation matrix, as utilized in our software package X'Pert HighScore (Plus). The outlined method proves to be an effective noise-cancelling approach and therefore has an obvious advantage over directly using the full matrix of observations for PCA and/or cluster analysis, which in fact resembles a pure image-comparison approach.

Figure 3b. Side view of the 3D PCA score plot based on the full matrix of observations. It gives some vague impression that three or four groups of patterns are present.

Figure 1. Surface plot of all 56 scans; a rough visual inspection reveals two pattern groups.

[Figures 2b/3c plot area: Eigenvalues charts with 'Percentage Variation Accounted' on the y-axis and 'Component Number' on the x-axis.]

However, cluster analysis taking into account the signal-to-noise ratio of the XRDP raw data (as used in our search-match-identify algorithm) immediately reveals the presence of at least four or five different groups. Figure 2a presents the PCA score plot calculated from such a correlation matrix. Figure 2b shows the Eigenvalues plot, which indicates that PC3 in the z-direction is much less important, because it accounts for only about 2% of the variation in the data, whereas PC1 in the x-direction accounts for 51% and PC2 in the y-direction still accounts for 38%.
The four clusters (the fourth cluster being green and brown together) nicely correspond to the four different batches under investigation. Further manipulation of the cut-off line in the dendrogram allows each of the seven samples to be clearly identified, proving the non-homogeneity of the sampling.
The other approach, using only the matrix of raw observations/data points, creates a different picture. The 3D PCA score plot (Figure 3a) based on the full matrix of observations shows the separation into two clearly distinguishable groups, but this is in fact no more help than just looking at the surface plot (Figure 1).
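The probability-based comparison itself (Ref. 7) is proprietary and unpublished, so the following is only a plausible sketch of the general idea, with hypothetical names and synthetic data: for Poisson-distributed counts a and b at the same 2-theta position, the difference a - b has variance approximately a + b, so a point-by-point z-statistic discounts exactly those differences that counting noise alone can explain.

```python
import numpy as np

def noise_aware_dissimilarity(a, b):
    """Mean squared z-statistic between two scans of raw counts.
    For identical underlying patterns this tends to ~1 (pure counting
    noise); genuinely different patterns give much larger values."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    var = a + b                              # Poisson variance of a - b
    z = (a - b) / np.sqrt(np.where(var > 0, var, 1.0))
    return float(np.mean(z ** 2))

# Usage: two noisy measurements of the same pattern stay near the
# pure-noise expectation, while a different pattern does not
rng = np.random.default_rng(1)
pos = np.linspace(-4.0, 4.0, 500)
truth = 100.0 + 900.0 * np.exp(-(pos - 1.5) ** 2)   # off-center peak
same_1, same_2 = rng.poisson(truth), rng.poisson(truth)
other = rng.poisson(truth[::-1])                    # mirrored peak
d_same = noise_aware_dissimilarity(same_1, same_2)
d_diff = noise_aware_dissimilarity(same_1, other)
```

The key property, shared with the approach described above, is that the dissimilarity is calibrated by the counting statistics of the raw data rather than by the absolute intensity differences, which is what makes it an effective noise-cancelling comparison.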

Figure 2b. The Eigenvalues plot shows that PC1 contains 51% of the variation, PC2 contains another 38% and PC3 only contains 2% of the total variation in the data; 8% of the variation is not accounted for in PC1-PC3.

Figure 3c. The Eigenvalues plot shows that PC1 already contains 98% of the variation; PC2 only adds 1%.