
Research Article

Received: 13 September 2012, Revised: 30 November 2012, Accepted: 5 January 2013, Published online in Wiley Online Library: 5 February 2013

(wileyonlinelibrary.com) DOI: 10.1002/cem.2489

Application of maximum likelihood multivariate curve resolution to noisy data sets


Mahsa Dadashi(a,b), Hamid Abdollahi(b) and Roma Tauler(a)*

In this work, two different maximum likelihood approaches for multivariate curve resolution, based on maximum likelihood principal component analysis (MLPCA) and on weighted alternating least squares (WALS), are compared with the standard multivariate curve resolution alternating least squares (MCR-ALS) method. To illustrate this comparison, three different experimental data sets are used: the first is an environmental aerosol source apportionment data set, the second is a time-course DNA microarray data set, and the third is an ultrafast absorption spectroscopy data set. The error structures of the first two data sets were heteroscedastic and uncorrelated; the difference between them was the existence of missing values in the second case. In the third data set, on ultrafast spectroscopy, error correlation between the values at different wavelengths is present. The results confirmed that the resolved component profiles obtained by MLPCA-MCR-ALS are practically identical to those obtained by MCR-WALS and that they can differ from those resolved by ordinary MCR-ALS, especially in the case of high noise. It is shown that methods that incorporate uncertainty estimations (such as MLPCA-ALS and MCR-WALS) can provide more reliable results and better estimated parameters than unweighted approaches (such as MCR-ALS) when high amounts of noise are present. The possible advantage of using MLPCA-MCR-ALS over MCR-WALS is that the former does not require changing the traditional MCR-ALS algorithm, because MLPCA is only used as a preliminary data pretreatment before MCR analysis. Copyright 2013 John Wiley & Sons, Ltd.

Keywords: noisy data; error structure; multivariate curve resolution alternating least squares; weighted alternating least squares; maximum likelihood principal component analysis

1. INTRODUCTION
Multivariate curve resolution (MCR) methods have emerged as powerful chemometric tools to investigate multivariate data sets [1–3]. In most MCR methods, residual measurement errors are assumed to exhibit a uniform measurement variance. In recent years, chemometric methods that incorporate information about measurement errors have received more interest. The goal of these methods is to find solutions and parameters less affected by error propagation. Different algorithms have been proposed for this purpose, including weighted least squares [4], weighted principal component analysis (PCA) [5], positive matrix factorization (PMF) [6], maximum likelihood PCA (MLPCA) [7], maximum likelihood principal component regression [8], maximum likelihood parallel factor analysis [9], and MCR weighted alternating least squares (MCR-WALS) [10]. MCR-WALS [11–13] is the general extension of MCR-ALS [14–16] to non-homoscedastic error cases. The MCR-ALS algorithm assumes an independent and identically distributed (i.i.d.) error structure for the measured data. The abbreviation i.i.d. is very common in statistics; here it refers to measurement errors that are mutually independent and follow the same (normal) distribution with uniform variance. This assumption is approximately valid for most measurements obtained from spectroscopic and chromatographic methods, where measurement errors are homoscedastic and uncorrelated; in these cases, the MCR-ALS solutions and parameters are sufficiently precise. However, in cases where uncertainties are large, such as for environmental [11] or DNA microarray data sets [10], the i.i.d. assumption is usually not fulfilled, and special attention should be paid to error propagation and perturbation effects.
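To make the role of the error structure concrete, the following minimal sketch contrasts ordinary and weighted least squares on a one-parameter model y = b·x. It is an illustration only and is not taken from the cited works: the function name `fit_slope`, the data values, and the standard deviations are all assumptions chosen for the example. Each residual is weighted by the inverse of its error variance, so a reading with a large uncertainty has little influence on the weighted estimate.

```python
def fit_slope(x, y, sigma=None):
    """Least-squares slope b of the model y = b * x.

    With sigma=None every point receives the same weight (the i.i.d.
    assumption); otherwise point i is weighted by 1 / sigma[i]**2.
    """
    if sigma is None:
        sigma = [1.0] * len(x)
    weights = [1.0 / s ** 2 for s in sigma]
    numerator = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    denominator = sum(w * xi * xi for w, xi in zip(weights, x))
    return numerator / denominator


# Illustrative data: the underlying slope is 2, and the last reading is
# both far off and known to be very uncertain (hypothetical values).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 12.0]
sigma = [0.1, 0.1, 0.1, 4.0]

b_unweighted = fit_slope(x, y)         # treats all errors as i.i.d.
b_weighted = fit_slope(x, y, sigma)    # uses the known uncertainties
```

In this toy case the unweighted estimate is pulled toward the noisy reading, whereas the weighted estimate stays close to 2. When all standard deviations are equal, the two estimates coincide, which is why the unweighted approach is adequate for homoscedastic data.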

A frequently used method for the analysis of experimental multivariate data matrices is PCA [17], where i.i.d. error structures are also assumed for the measured data. The extension of the PCA algorithm to non-homoscedastic error structures is MLPCA. In this method, the error structure of the experimental measurements has to be known in advance. The aim of this work is to investigate and compare the results obtained either by the direct application of MCR-WALS or by a two-step procedure in which MLPCA is applied first, followed by ordinary MCR-ALS. Additionally, the results obtained by these two strategies are compared with those obtained when only MCR-ALS is applied in the traditional way. This work continues a recent study [18] in which this comparison was systematically presented for simulated data having different noise structures, from homoscedastic to heteroscedastic and correlated noise of different intensities and structures. In the present paper, the comparison is extended to experimental data covering different typical situations encountered in the investigation of analytical data. Such a comparison is pertinent because it will facilitate extending the use of MCR methods to noisy data. Three experimental data sets will be used for this purpose: an environmental aerosol source apportionment data set, a DNA time series microarray data set, and finally an ultrafast absorption spectroscopy data set. In these three data sets, different types of non-homoscedastic error structures are present in the measured data, and special attention has to be paid to them for optimal parameter estimation. The results obtained in the analysis of these examples are expected to be general and extendable to other data sets not having homoscedastic error structures.

* Correspondence to: Roma Tauler, IDAEA-CSIC, Jordi Girona 18–24, Barcelona 08034, Spain. E-mail: roma.tauler@idaea.csic.es
a M. Dadashi, R. Tauler: IDAEA-CSIC, Jordi Girona 18–24, Barcelona 08034, Spain
b M. Dadashi, H. Abdollahi: Faculty of Chemistry, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran

J. Chemometrics 2013; 27: 34–41

2. EXPERIMENTAL

Three experimental data sets were used to investigate the application of maximum likelihood MCR methods: (i) an environmental aerosol source apportionment data set; (ii) a time-course DNA microarray data set; and (iii) an ultrafast absorption spectroscopy data set. These three examples have already been described elsewhere [11,12,19], and only a brief description of each is presented here.

2.1. Environmental aerosol source apportionment data set

The first data set is an environmental data set for aerosol pollution studies. In this data set, different samples of particulate matter with grain size PM10 were collected at different geographical sites in highly industrial areas. Information and other details on the study area, source apportionment analysis, and PM10 sampling can be found elsewhere [20]. This data set contained 34 variables (corresponding to different species concentrations in microgram per cubic meter and nanogram per cubic meter) and 87 valid PM10 samples (Figure 1). The analysis of chemical components was performed according to the methodology described in reference [21]. In a previous study [11,20], this data set was analyzed by different chemometric methods (MCR-ALS, MCR-WALS, and PMF). That study showed that MCR-WALS and PMF produce essentially the same results, whereas discrepancies were obtained with MCR-ALS. Data weighting by means of uncertainty estimates was found to be essential to obtain accurate maximum likelihood estimations of the component profiles. Both methods, MCR-WALS and PMF, require data uncertainty estimations as input. These uncertainties were estimated by the procedure discussed in Section 3.1.1.

2.2. Time-course DNA microarray data set

DNA microarray technologies allow for the simultaneous measurement of gene expression values for thousands of genes in a sample [22]. The gene expression value measures the hybridization between fluorescent-labeled samples and the genes attached to a solid surface. In this work, a data set obtained from Brauer et al. [23] and available at the Stanford University DNA microarray repository [24] has been used. This data set is known in the literature as the diauxic shift data set, and it represents the gene expression level in yeast during the diauxic shift in a glucose-limited culture. Under glucose depletion conditions, the metabolism shifts rapidly to an oxidative stage. RNA samples were obtained about every 15 min and measured by DNA microarray technology. The finally analyzed data set contained 12 measurements taken between 7.25 and 10 h at intervals of 0.25 h. The gene expression values of 2284 genes at each time were analyzed, so the size of the finally analyzed data set was (2284, 12). This data set was already studied by Jaumot et al. using the MCR-ALS and MCR-WALS algorithms [12]. In that study, a better and easier interpretation of gene and time evolution profiles was achieved when MCR-WALS was applied compared with MCR-ALS.

2.3. Ultrafast absorption spectroscopy data set

Time-resolved pump-probe absorption spectroscopy [25] has been used extensively in photochemistry to study the electronic excited states of molecules. In this technique, one laser is used to perturb the molecules and another laser to characterize them. The example used in this case is the study of the photophysics of benzophenone by ultraviolet–visible pump-probe absorption spectroscopy [26]. More details about the electronic transitions in this molecule and about the instrumental setup can be found in references [19,26]. A 5 × 10^-4 M solution of benzophenone in acetonitrile was used, and 31 ultraviolet–visible spectra were recorded between 290 and 570 nm at intervals of 7.5 nm. The whole data set is a matrix of size (31, 50). One example is shown in Figure 2. In previous studies, hard–soft MCR was used to understand the kinetics of this system [26], and its error structure was investigated in the time and spectral modes [19]. The errors of this system have been shown to be independent in the time direction and correlated in the spectral direction.
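The distinction between independent and correlated errors can be illustrated with a toy example. The sketch below uses synthetic numbers, not the benzophenone measurements, and `row_error_covariance` is a hypothetical helper name. It computes the sample covariance of replicate measurements of one row (one spectrum recorded several times), which is the estimator given as Equation (1) in Section 3.1.3. Nonzero off-diagonal elements indicate errors that are correlated between channels; zero off-diagonal elements indicate uncorrelated errors.

```python
def row_error_covariance(replicates):
    """Sample error covariance (J x J) of K replicate row vectors of length J."""
    k = len(replicates)
    j = len(replicates[0])
    # Average vector over the K replicates.
    mean = [sum(rep[c] for rep in replicates) / k for c in range(j)]
    cov = [[0.0] * j for _ in range(j)]
    for rep in replicates:
        dev = [rep[c] - mean[c] for c in range(j)]
        # Accumulate the outer product of the deviation vector with itself.
        for a in range(j):
            for b in range(j):
                cov[a][b] += dev[a] * dev[b] / (k - 1)
    return cov


# Four synthetic replicates of a three-channel "spectrum" (assumed values).
# Channels 0 and 1 fluctuate together; channel 2 fluctuates independently.
replicates = [
    [1.1, 2.1, 5.2],
    [0.9, 1.9, 5.2],
    [1.1, 2.1, 4.8],
    [0.9, 1.9, 4.8],
]
cov = row_error_covariance(replicates)
```

Here `cov[0][1]` is clearly positive (correlated channels), while `cov[0][2]` is zero to within rounding (independent channels). The diagonal elements are the error variances of the individual channels.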

Figure 1. First data example. Environmental aerosol source apportionment data set. This data set contained 34 variables (corresponding to different species concentrations in microgram per cubic meter and nanogram per cubic meter) and 87 valid PM10 samples (see experimental details in reference [20]).

Figure 2. Third data example. Ultrafast absorption spectroscopy experiment; this figure shows one slice of the 30 replications in this experiment (see experimental details in reference [26]).

3. METHODS

3.1. Estimation of the measurement error covariance structures

3.1.1. Environmental data set error covariance matrix

A preliminary data pretreatment was performed for the concentration values below the detection limit, which were replaced by half of their detection limit [27]. Direct experimental determinations of sample-specific or variable-specific uncertainties were not available, and therefore, the error structure of this system was estimated from previous analytical knowledge of the system. Errors were considered to be proportional to the measured concentrations plus a constant term related to their limit of detection (LOD). If the value of a particular variable was equal to or above its analytical detection limit, its uncertainty was taken as 10% of the value plus one third of the detection limit (i.e., when $x_{ij} \ge \mathrm{LOD}$, $s_{ij} = 0.1\,x_{ij} + \mathrm{LOD}/3$), where $x_{ij}$ is the element in row i and column j of data set X. When the value was below its analytical detection limit, the uncertainty was taken as 20% of the value plus one third of the detection limit (i.e., when $x_{ij} < \mathrm{LOD}$, $s_{ij} = 0.2\,x_{ij} + \mathrm{LOD}/3$). This estimation of uncertainty gives a heteroscedastic uncorrelated error covariance structure.

3.1.2. Time-course microarray error covariance matrix

In this data set, all i = 1, ..., I genes of a single sample were arranged in a long vector, and all measurements for j = 1, ..., J different samples were arranged in a data matrix of size (J, I). For this type of data set, a logarithmic pretreatment is usually performed, as is often done in genomic studies. In this study, however, no logarithmic pretreatment was applied because nonnegativity constraints are used in MCR-ALS. For a more detailed discussion, see reference [10]. Our previous studies have shown that these data sets can have high proportional errors, around 20% or 30% of the measured signal. Uncertainties were finally considered to be 25% of the measured data values. Moreover, the presence of missing values should be taken into account. Because each sample has at least one measurement missing, this data set cannot be analyzed by eliminating rows or columns. MLPCA easily handles this situation [28]. Missing values were arbitrarily substituted by values of one, and their uncertainties were set to a very large value, equal to 100. In this way, missing values had a minimum effect on the final MCR results.

3.1.3. Ultrafast absorption spectroscopy error covariance matrix

To obtain information about the error covariance structure of this system, one experiment was replicated 30 times. For K replications of this experiment, the error covariance matrix for every row can be estimated by Equation (1):

$$\Sigma_i = \frac{1}{K-1}\sum_{k=1}^{K}\left(x_k-\bar{x}\right)^{T}\left(x_k-\bar{x}\right) \qquad (1)$$

where $x_k$ is the kth replication of one row and $\bar{x}$ is the average vector of the K replications. $\Sigma_i$ is the error covariance matrix of row i, of size (J, J), which contains the measurement error variances in its diagonal elements and the error covariances between measurements in the same row in its off-diagonal elements. The error covariance matrix for every column j of X can be calculated in a similar way using the replicates of each column. Details about how to estimate error covariance matrices in this data set were discussed in a previous work [19].

3.2. Data analysis methods

Three different variants of the MCR method based on ALS (MCR-ALS) [29–31] are compared in this work. The three methods decompose the original data matrix X into the product of the two factor matrices G and $F^T$ using a bilinear model, $X = GF^T + E$, where E is the residual matrix. In the ordinary MCR-ALS method, the objective function to be minimized is given by Equation (2):

$$Q^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{i,j}-\hat{x}_{i,j}\right)^2 \qquad (2)$$

where $x_{i,j}$ and $\hat{x}_{i,j}$ are, respectively, the experimental and calculated values. In MCR-WALS, this general minimization function is written as

$$Q^2 = \mathrm{vec}\left(X^T-\hat{X}^T\right)^{T}\,\Omega^{-1}\,\mathrm{vec}\left(X^T-\hat{X}^T\right) \qquad (3)$$

where now X and $\hat{X}$ are, respectively, the experimental and calculated data matrices containing the $x_{i,j}$ and $\hat{x}_{i,j}$ values, and $\Omega$ is the full augmented error covariance matrix shown in Equation (4):

$$\Omega = \begin{bmatrix} \Sigma_1 & & & \\ & \Sigma_2 & & \\ & & \ddots & \\ & & & \Sigma_I \end{bmatrix} \qquad (4)$$

where $\Sigma_i$ are the error covariance matrices for each row i. When there is no correlation between column and row errors, the off-diagonal blocks of $\Omega$ are zero. In contrast, when they are correlated, the off-diagonal blocks of $\Omega$ are not zero. Row error covariance matrices are described by $\Sigma_i$ (within-row error correlations) and column error covariance matrices by $\Psi_j$ (within-column error correlations, see the succeeding paragraphs). In some particular cases (as in the ultrafast absorption spectroscopy data set), there is correlation between column and row errors, and it is necessary to consider the full augmented error covariance matrix (Equation (4)). For a thorough discussion of the different possible error cases, see reference [32]. The first step in the MCR-ALS algorithm is the projection of the data onto the subspace defined by its principal components [17]. This first step can be written as


$$\hat{X}_{N,\mathrm{PCA}} = X V_N V_N^{T} \qquad (5)$$

In this equation, $V_N$ is the loadings matrix for N components, and $\hat{X}_{N,\mathrm{PCA}}$ is the projection of the original data set onto the loadings subspace. In PCA, measurement errors are not considered explicitly during the analysis because they are assumed to be homoscedastic (randomly distributed with equal variances). In contrast to conventional PCA, MLPCA [7] incorporates known error information into the bilinear decomposition process, and it can deal with different types of error structures. Data with highly heteroscedastic and correlated errors should, therefore, preferably be analyzed by MLPCA instead of ordinary PCA. However, in MLPCA, the structure of the error covariance matrix needs to be known accurately in advance. Several procedures and guidelines for the use of MLPCA have been given [33]. MLPCA seeks to minimize the maximum likelihood objective function (Equation (3)). In this case, X and $\hat{X}$ are the original and MLPCA projected data matrices, respectively. The algorithm used by MLPCA is based on the idea that maximum likelihood projections should be the same in both row and column subspaces. After the initial data projection (before ALS optimization), either by PCA or by MLPCA, the G and $F^T$ factor matrices are estimated iteratively using an ALS algorithm, giving, respectively, MCR-ALS (PCA projection first) or MLPCA-MCR-ALS (MLPCA projection first). The equations for the unconstrained least squares solution for G when $F^T$ is assumed to be known are

$$\min_{\hat{G}} \left\lVert \hat{X}_N - \hat{G}\hat{F}^{T} \right\rVert \qquad (6)$$

$$\hat{G} = \hat{X}_N \hat{F}\left(\hat{F}^{T}\hat{F}\right)^{-1} = \hat{X}_N\left(\hat{F}^{T}\right)^{+} \qquad (7)$$

where $\hat{X}_N$ is either the PCA or the MLPCA projected matrix for N components in the MCR-ALS and MLPCA-MCR-ALS algorithms, respectively. The equations for $F^T$ when G is assumed to be known are

$$\min_{\hat{F}^{T}} \left\lVert \hat{X}_N - \hat{G}\hat{F}^{T} \right\rVert \qquad (8)$$

$$\hat{F}^{T} = \left(\hat{G}^{T}\hat{G}\right)^{-1}\hat{G}^{T}\hat{X}_N = \hat{G}^{+}\hat{X}_N \qquad (9)$$

Equations (7) and (9) find the unconstrained least squares solutions. However, to obtain meaningful MCR solutions and avoid rotation ambiguities, different constraints such as nonnegativity, unimodality, closure, selectivity, local rank, or other constraints [30,31,34] should be applied. In this work, only nonnegativity constraints were applied, using nonnegative least squares algorithms [35]. MCR-WALS finds the G and $F^T$ factor matrices in a similar way to MCR-ALS [32], using Equations (7) and (9), but at each iteration of the optimization they are estimated using the row and column maximum likelihood projections of the X matrix ($\hat{X}_i$ and $\hat{X}_j$) onto the subspaces defined by the current ALS estimations of the F and G matrices, as in the MLPCA algorithm (i.e., the $\hat{V}_N$ and $\hat{U}_N$ factor matrices are substituted by the current ALS estimates of the $F^T$ and G factor matrices for the same number of components N). For more details, see reference [32].

To evaluate how well the methods actually fit the data, several parameters and equations have been proposed [36]. Lack of fit and explained variance ($R^2$) are two parameters that can be used:

$$\text{Lack of fit}\ (\%) = 100\sqrt{\frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{i,j}-\hat{x}_{i,j}\right)^2}{\sum_{i=1}^{I}\sum_{j=1}^{J}x_{i,j}^2}} \qquad (10)$$

$$R^2\ (\%) = 100\left(1-\frac{\sum_{i=1}^{I}\sum_{j=1}^{J}\left(x_{i,j}-\hat{x}_{i,j}\right)^2}{\sum_{i=1}^{I}\sum_{j=1}^{J}x_{i,j}^2}\right) \qquad (11)$$

Both parameters measure essentially the same thing, but the first one (lack of fit) is more sensitive to fitting differences when these are small and give very similar explained variances. To measure the similarity between two individual profiles (x and y vectors), their correlation and the angle between them (the arccosine of their correlation coefficient) can be obtained by

$$r = \frac{x^{T}y}{\lVert x\rVert\,\lVert y\rVert} \qquad (12)$$

$$\theta = \cos^{-1}(r) \qquad (13)$$

In Equations (12) and (13), r is the correlation coefficient between two profiles and $\theta$ is the angle between them. To measure the similarity between subspaces defined by different sets of profiles (X and Y matrices), a matrix correlation coefficient can also be used [37], which takes values between zero and one. A value close to one indicates a strong linear relationship (correlation) between the subspaces defined by the two matrices. A value of zero means that there is no linear relationship between the two subspaces. In general, for two subspaces defined by matrices X (I, J) and Y (I, J), the matrix correlation is defined by

$$r(X,Y) = \frac{\mathrm{tr}\left(X^{T}Y\right)}{\sqrt{\mathrm{tr}\left(X^{T}X\right)\,\mathrm{tr}\left(Y^{T}Y\right)}} \qquad (14)$$

This formula is based on the inner product of the two matrices, where tr indicates the trace of a matrix (the sum of its diagonal elements). Another parameter that can be used to assess the similarity between the subspaces generated by two sets of vectors is the angle between the two subspaces defined by the columns of the X and Y matrices. This angle is similar to that used for the comparison of x and y column vectors (once they are normalized to unit length). For this calculation, the columns of X are first projected onto the subspace defined by the columns of the Y matrix, and then the differences between them are calculated. By using these differences, the angle between the X and Y column subspaces can be obtained. Therefore, if X has, for instance, N columns, then N angles are obtained. The lowest angle is then considered the


angle between the two subspaces. A small angle between the two subspaces indicates that they are very similar. For more details about these calculations, see references [38,39].

4. RESULTS AND DISCUSSION

The three methods previously described (MCR-ALS, MLPCA-MCR-ALS, and MCR-WALS) were applied to the analysis of the three experimental data sets. In all cases, nonnegativity constraints were applied to the factor matrices G and $F^T$. Lack of fit and $R^2$ values were calculated according to Equations (10) and (11). Angles between resolved profiles were calculated to show their similarity. In all cases, initial estimations were obtained from the purest samples or variables of the analyzed data matrices [40].

4.1. Analysis of the environmental source apportionment data set

Table I gives the results of applying the different algorithms to this data set. According to a previous work on the same data set, the selected number of components or sources was six. Explained variances after application of MLPCA-MCR-ALS and of MCR-WALS to this data set were similar, 86.3% and 87.6%, respectively. These two values were rather different from the result obtained by MCR-ALS without considering the error structure, which for the same number of components explained as much as 99.3% of the experimental data variance. This comparison clearly shows the tendency of MCR-ALS to overfit and to incorporate noise in the solutions. The same tendency can be seen in the values of lack of fit; these values were 37.0% and 35.5% for MLPCA-MCR-ALS and MCR-WALS, whereas for MCR-ALS, it was only 8.3%. This characteristic of MCR-ALS has already been reported in previous works [12,36].

Subspace congruences show the differences between the subspaces spanned by the MCR-WALS solutions and those obtained by MLPCA-MCR-ALS or MCR-ALS; in other words, MCR-WALS was taken as the reference algorithm, and the results obtained with the other algorithms were compared with it. Congruence values between the score and loading matrices obtained by MCR-WALS and MLPCA-MCR-ALS were 13.9° and 10.0°, respectively. In contrast, the same values for the comparison between the MCR-WALS and MCR-ALS matrices were 82.5° and 79.5°, which are much higher, meaning that the MCR-ALS and MCR-WALS subspaces are significantly different. This provides more evidence that the results obtained by MLPCA-MCR-ALS and MCR-WALS are rather similar and that the small observed difference between the results of the two methods can be attributed to the algorithms themselves, to possible inaccuracies in the estimation of the error covariance matrix, and to the possible existence of small rotational ambiguities under the applied constraints.

4.2. Analysis of time series microarray data set

The first step in analyzing this data set was the estimation of the number of components (N). Finding the number of components for a microarray data set is more challenging than for ordinary spectroscopic data sets because of the higher noise levels present in gene expression data sets. In a previous work [12], the application of the MCR-ALS and MCR-WALS algorithms was repeated for a different number of components, ranging from three to seven. The comparison of the explained variances showed that adding more than three components did not improve the explained variances or the MCR-ALS resolved profiles. When fewer than three components were considered, the shape of the sample profiles between 8 and 9 h was unreliable and more difficult to interpret biologically. In that previous work [12], four components were preferred for a better interpretability of the results. In this work, three components were preferred because by using only three components, similarity values between the two

Table I. Results for environmental aerosol source apportionment data set with heteroscedastic error

Method      lof (%)a   R2 (%)b
MCR-WALS    35.1       87.6
MLPCA-ALS   37.0       86.3
MCR-ALS      8.3       99.3
MLPCA       37.0       86.3
PCA          8.0       99.4

Recovery angle(c) and subspace congruence(d), in degrees, relative to MCR-WALS

Method      g1    f1    g2    f2    g3    f3    g4    f4    g5    f5    g6    f6    G     FT
MLPCA-ALS    5.2   1.8   4.7   9.7   5.1   4.1   7.3   2.6   5.8   2.9   8.6   5.2  13.9  10.0
MCR-ALS     52.4  58.1  14.5  14.5  36.7  33.7  23.9   9.1  71.0  64.8  53.2  37.9  82.5  79.5

MCR, multivariate curve resolution; WALS, weighted alternating least squares; ML, maximum likelihood; PCA, principal component analysis.
a Lack of fit (lof) values (see Equation (10)).
b Explained variance values (see Equation (11)).
c g1, g2, g3, g4, g5, and g6 are the distribution (column or scores) profiles of the six components in factor matrix G; f1, f2, f3, f4, f5, and f6 are the composition (row or loadings) profiles of the six components in factor matrix FT; numerical values give the angles between MCR-WALS resolved G and FT profiles and MCR-ALS or MLPCA-ALS resolved G and FT profiles (see Equations (12) and (13)).
d Subspace congruence gives the angle between MCR-WALS resolved G and FT profile subspaces and MCR-ALS or MLPCA-ALS resolved G and FT profile subspaces (see Section 3).


methods, MLPCA-MCR-ALS and MCR-WALS, were better compared and discussed. When considering the recovery angles in Table II, the maximum difference between the results of MCR-WALS and MLPCA-MCR-ALS is for the third component (g3, recovery angle of 2.8°). If the number of components was taken to be four, the recovery angle for the fourth component t4 was much worse, equal to 24.0°. Because the aim of this study was to compare the profiles obtained from MCR-ALS, MCR-WALS, and MLPCA-MCR-ALS rather than to interpret them, which has already been done in previous studies [12,23,32], an analysis with three components was finally preferred. Table II again reveals the similarity between the results obtained by MLPCA-MCR-ALS and MCR-WALS for this experimental data set. This case has been assumed to contain proportional heteroscedastic errors [41], like the previous case, with the difference that missing values were also present in this data set and were handled using an appropriate weighting scheme (see Section 3). Lack of fit values obtained by MLPCA-MCR-ALS and MCR-WALS were 27.9% and 28.0%, respectively, whereas the value for MCR-ALS was lower, equal to 22.7%. Again, this comparison confirms the tendency of MCR-ALS (and also of PCA in Table II) to overfit the data compared with MLPCA-MCR-ALS and MCR-WALS. The same is observed in Table II for the $R^2$ values (MLPCA-MCR-ALS and MCR-WALS values were about 92% in comparison with 94% for MCR-ALS). The angles reporting the subspace congruences between MLPCA-MCR-ALS and MCR-WALS were only 2.5° and 1.6° for the G and $F^T$ profiles, respectively, whereas they were 18.3° and 23.7°, respectively, in the comparison with the MCR-ALS profiles. Figure 3 shows the comparison between MCR-WALS and MLPCA-MCR-ALS, confirming their similarity.

Figure 3. Second data example. Comparison of the resolved temporal evolution of the gene expression profiles (three components) for the diauxic shift data set obtained by multivariate curve resolution (MCR)-weighted alternating least squares (ALS) (solid line), maximum likelihood principal component analysis-MCR-ALS (dashed dotted line), and MCR-ALS (dashed line).

4.3. Analysis of ultrafast absorption spectroscopy data set

In this third data set, the spectroscopic data [19] had a correlated error structure in the variables direction and a heteroscedastic and independent error structure in the time direction. This error structure was taken into account in the analyses by MLPCA and MCR-WALS. However, in this case, the noise level was rather low (about 1% of the measured signal), as usually happens in spectroscopic measurements. This low level of noise could be reduced significantly by replication of measurements and averaging. To examine more precisely the effect of noise on the results obtained by the different methods, a single data set was analyzed considering the corresponding estimated error covariance matrix. If the average of 30 replications was taken

Table II. Results of the diauxic shift data set

Method      lof (%)a   R2 (%)b
MCR-WALS    27.9       92.2
MLPCA-ALS   28.0       92.1
MCR-ALS     22.7       94.8
MLPCA       28.0       92.1
PCA         22.4       95.0

Recovery angle(c) and subspace congruence(d), in degrees, relative to MCR-WALS

Method      g1    t1    g2    t2    g3    t3    G     TT
MLPCA-ALS    0.7   0.9   0.8   1.2   2.8   1.1   2.5   1.6
MCR-ALS      4.8  11.3  10.4   6.3  17.4  14.9  18.3  23.7

MCR, multivariate curve resolution; WALS, weighted alternating least squares; ML, maximum likelihood; PCA, principal component analysis.
a Lack of fit (lof) values (see Equation (10)).
b Explained variance values (see Equation (11)).
c g1, g2, and g3 are the distribution (column or scores) profiles of the three components in factor matrix G; t1, t2, and t3 are the composition (row or loadings) profiles of the three components in factor matrix TT; numerical values give the angles between MCR-WALS resolved G and TT profiles and MCR-ALS or MLPCA-ALS resolved G and TT profiles (see Equations (12) and (13)).
d Subspace congruence gives the angle between MCR-WALS resolved G and TT profile subspaces and MCR-ALS or MLPCA-ALS resolved G and TT profile subspaces (see Section 3).


Table III. Results of the ultrafast absorption spectroscopy for one data set

Method      lof (%)a   R2 (%)b
MCR-WALS    3.9        99.85
MLPCA-ALS   3.9        99.85
MCR-ALS     2.2        99.95
MLPCA       3.9        99.85
PCA         2.2        99.95

Recovery angle(c) and subspace congruence(d), in degrees, relative to MCR-WALS

Method      c1    s1    c2    s2    c3    s3    C     ST
MLPCA-ALS    3.9   1.8   3.5   1.1   7.4   8.1   0.85  0.39
MCR-ALS      5.4   4.6  11.7   2.4   6.7   8.1  12.2   5.6

MCR, multivariate curve resolution; WALS, weighted alternating least squares; ML, maximum likelihood; PCA, principal component analysis.
a Lack of fit (lof) values (see Equation (10)).
b Explained variance values (see Equation (11)).
c c1, c2, and c3 are the distribution (column or scores) profiles of the three components in factor matrix C; s1, s2, and s3 are the composition (row or loadings) profiles of the three components in factor matrix ST; numerical values give the angles between MCR-WALS resolved C and ST profiles and MCR-ALS or MLPCA-ALS resolved C and ST profiles (see Equations (12) and (13)).
d Subspace congruence gives the angle between MCR-WALS resolved C and ST profile subspaces and MCR-ALS or MLPCA-ALS resolved C and ST profile subspaces (see Section 3).

as data set, the results obtained with the maximum likelihood curve resolution approaches (MLPCA-MCR-ALS and MCR-WALS) and with the ordinary MCR-ALS approach would have been practically identical. This result shows the importance and effects of replication and averaging in the experimental work. These results have not been reported here for brevity. In Table III, results for the analysis of one single spectroscopic experiment (without replication) are given. Similar lack of fit (around 3.9%) and R2 (99.85%) values were obtained for MLPCA-MCR-ALS and MCR-WALS. The corresponding values for MCR-ALS were 2.2% and 99.95%, respectively. These results confirmed those obtained in the previous comparisons, although now the differences among the three methods were the smallest. The same was observed when subspace congruencies were compared, with the same interpretation as before.
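The lack of fit and explained variance values quoted above can be related with a short sketch. Equations (10) and (11) are not reproduced in this part of the text, so the definitions below are assumptions based on the forms commonly used with MCR-ALS: lof = 100·sqrt(Σe²/Σd²) and R2 = 100·(1 − Σe²/Σd²), where e are the residuals and d the data values. The function names are hypothetical. Under these assumed definitions, R2 follows directly from lof, and the pairs reported in Table III (3.9% with 99.85%, 2.2% with 99.95%) are mutually consistent.

```python
import numpy as np

def lack_of_fit(D, D_hat):
    """Assumed lof definition (percent): 100 * sqrt(sum(E^2) / sum(D^2))."""
    D = np.asarray(D, dtype=float)
    E = D - np.asarray(D_hat, dtype=float)
    return 100.0 * np.sqrt(np.sum(E ** 2) / np.sum(D ** 2))

def explained_variance(D, D_hat):
    """Assumed R2 definition (percent): 100 * (1 - sum(E^2) / sum(D^2))."""
    D = np.asarray(D, dtype=float)
    E = D - np.asarray(D_hat, dtype=float)
    return 100.0 * (1.0 - np.sum(E ** 2) / np.sum(D ** 2))

def r2_from_lof(lof_percent):
    """Under the two definitions above, R2 = 100 * (1 - (lof / 100)^2)."""
    return 100.0 * (1.0 - (lof_percent / 100.0) ** 2)

# Consistency check against the lof values reported in Table III.
for lof in (3.9, 2.2):
    print(f"lof = {lof:.1f}%  ->  R2 = {r2_from_lof(lof):.2f}%")
```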

5. CONCLUSIONS
Results obtained in the analysis of three experimental data sets confirmed the conclusions reached in our previous work on simulated data sets with different noise structures and complexities [18]. MLPCA-MCR-ALS and MCR-WALS differ in that MCR-WALS uses chemical constraints (nonnegativity and others) and noise information simultaneously during the ALS optimization, whereas MLPCA-MCR-ALS uses them sequentially: first the noise information, as a data pretreatment, and then, separately, the chemical constraints during the ALS optimization. It has been shown that MLPCA can be used as a preliminary projection step before ordinary MCR-ALS algorithms, with results equivalent to those obtained by MCR-WALS. This preliminary projection of the data matrix onto the MLPCA subspace has the advantage over MCR-WALS of an easier and more general implementation and application to currently developed MCR methods, without the need to change algorithms. Moreover, it provides the concurrent visualization of MLPCA results (instead of PCA results) as a preliminary data exploratory tool. However, whenever an accurate estimation of the error structure is not possible, the ordinary MCR-ALS algorithm is still useful, and it gives good estimations unless noise levels are high (e.g., more than 10%). The extent of distortion of the resolved profiles of the factor matrices depends not only on the noise intensity but also on its complexity.

Acknowledgements
Research project grant number CTQ 2009-11572 from the Ministerio de Ciencia y Innovación, Spain, is acknowledged. The authors also acknowledge the Institute for Advanced Studies in Basic Sciences for financial support (grant number G2011IASBS117) and thank Anna de Juan and Joaquim Jaumot, who provided the data sets.

REFERENCES
1. Lawton WH, Sylvestre EA, Maggio MS. Self modeling nonlinear regression. Technometrics 1972; 14(3): 513–532.
2. de Juan A, Tauler R. Chemometrics applied to unravel multicomponent processes and mixtures: revisiting latest trends in multivariate resolution. Anal. Chim. Acta 2003; 500(1–2): 195–210.
3. Hamilton JC, Gemperline PJ. Mixture analysis using factor analysis. II: self-modeling curve resolution. J. Chemom. 1990; 4(1): 1–13.
4. Kiers H. Weighted least squares fitting using ordinary least squares algorithms. Psychometrika 1997; 62(2): 251–266.
5. Simeon V, Pavković D. Weighted analysis of principal components: two approximations to statistical weights. J. Chemom. 1992; 6(5): 257–266.
6. Paatero P, Tapper U. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 1994; 5(2): 111–126.
7. Wentzell PD, Andrews DT, Hamilton DC, Faber K, Kowalski BR. Maximum likelihood principal component analysis. J. Chemom. 1997; 11(4): 339–366.
8. Wentzell PD, Andrews DT, Kowalski BR. Maximum likelihood multivariate calibration. Anal. Chem. 1997; 69(13): 2299–2311.
9. Vega-Montoto L, Wentzell PD. Maximum likelihood parallel factor analysis (MLPARAFAC). J. Chemom. 2003; 17(4): 237–253.
10. Wentzell P, Karakach T, Roy S, Martinez MJ, Allen C, Werner-Washburne M. Multivariate curve resolution of time course microarray data. BMC Bioinformatics 2006; 7(1): 343.
11. Tauler R, Viana M, Querol X, Alastuey A, Flight RM, Wentzell PD, Hopke PK. Comparison of the results obtained by four receptor modelling methods in aerosol source apportionment studies. Atmos. Environ. 2009; 43(26): 3989–3997.
12. Jaumot J, Piña B, Tauler R. Application of multivariate curve resolution to the analysis of yeast genome-wide screens. Chemom. Intell. Lab. Syst. 2010; 104(1): 53–64.
13. Stanimirova I, Tauler R, Walczak B. A comparison of positive matrix factorization and the weighted multivariate curve resolution method. Application to environmental data. Environ. Sci. Technol. 2011; 45(23): 10102–10110.
14. Goicoechea HC, Olivieri AC, Tauler R. Application of the correlation constrained multivariate curve resolution alternating least-squares method for analyte quantitation in the presence of unexpected interferences using first-order instrumental data. Analyst 2010; 135(3): 636–642.
15. Parastar H, Radović JR, Jalali-Heravi M, Diez S, Bayona JM, Tauler R. Resolution and quantification of complex mixtures of polycyclic aromatic hydrocarbons in heavy fuel oil sample by means of GC × GC-TOFMS combined to multivariate curve resolution. Anal. Chem. 2011; 83(24): 9289–9297.
16. Terrado M, Barcelo D, Tauler R. Quality assessment of the multivariate curve resolution alternating least squares method for the investigation of environmental pollution patterns in surface water. Environ. Sci. Technol. 2009; 43(14): 5321–5326.
17. Wold S, Esbensen K, Geladi P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987; 2(1–3): 37–52.
18. Dadashi M, Abdollahi H, Tauler R. Maximum likelihood principal component analysis as initial projection step in multivariate curve resolution analysis of noisy data. Chemom. Intell. Lab. Syst. 2012; 118(0): 33–40.
19. Blanchet L, Réhault J, Ruckebusch C, Huvenne JP, Tauler R, de Juan A. Chemometrics description of measurement error structure: study of an ultrafast absorption spectroscopy experiment. Anal. Chim. Acta 2009; 642(1–2): 19–26.
20. Viana M, Querol X, Alastuey A, Gil JI, Menéndez M. Identification of PM sources by principal component analysis (PCA) coupled with wind direction data. Chemosphere 2006; 65(11): 2411–2418.
21. Querol X, Alastuey A, Rodriguez S, Plana F, Ruiz CR, Cots N, Massagué G, Puig O. PM10 and PM2.5 source apportionment in the Barcelona Metropolitan area, Catalonia, Spain. Atmos. Environ. 2001; 35(36): 6407–6419.
22. Causton HC, Quackenbush J, Brazma A. Microarray Gene Expression Data Analysis: A Beginner's Guide. Blackwell Pub, Wiley-Blackwell, 2003.
23. Brauer MJ, Saldanha AJ, Dolinski K, Botstein D. Homeostatic adjustment and metabolic remodeling in glucose-limited yeast cultures. Mol. Biol. Cell 2005; 16(5): 2503–2517.
24. web, http://smd.stanford.edu/cgi-bin/tools/display/listMicroArrayData.pl?tableName=publication.
25. Reid GD, Wynne K. In Encyclopedia of Analytical Chemistry, Meyers RA (ed.). John Wiley & Sons Ltd.: Chichester, 2000.
26. Aloise S, Ruckebusch C, Blanchet L, Rehault J, Buntinx G, Huvenne J-P. The benzophenone S1(n,π*) → T1(n,π*) states intersystem crossing reinvestigated by ultrafast absorption spectroscopy and multivariate curve resolution. J. Phys. Chem. A 2007; 112(2): 224–231.
27. Farnham IM, Singh AK, Stetzenbach KJ, Johannesson KH. Treatment of nondetects in multivariate analysis of groundwater geochemistry data. Chemom. Intell. Lab. Syst. 2002; 60(1–2): 265–281.
28. Andrews DT, Wentzell PD. Applications of maximum likelihood principal component analysis: incomplete data sets and calibration transfer. Anal. Chim. Acta 1997; 350(3): 341–352.
29. Tauler R. Multivariate curve resolution applied to second order data. Chemom. Intell. Lab. Syst. 1995; 30(1): 133–146.
30. Tauler R, Smilde A, Kowalski B. Selectivity, local rank, three-way data analysis and ambiguity in multivariate curve resolution. J. Chemom. 1995; 9(1): 31–58.
31. Jaumot J, Gargallo R, de Juan A, Tauler R. A graphical user-friendly interface for MCR-ALS: a new tool for multivariate curve resolution in MATLAB. Chemom. Intell. Lab. Syst. 2005; 76(1): 101–110.
32. Wentzell PD. 2.25 Other topics in soft-modeling: maximum likelihood-based soft-modeling methods. In Comprehensive Chemometrics, Brown SD, Tauler R, Walczak B (Editors-in-Chief). Elsevier: Oxford, 2009; 507–558.
33. Wentzell PD, Lohnes MT. Maximum likelihood principal component analysis with correlated measurement errors: theoretical and practical considerations. Chemom. Intell. Lab. Syst. 1999; 45(1–2): 65–85.
34. de Juan A, Vander Heyden Y, Tauler R, Massart DL. Assessment of new constraints applied to the alternating least squares method. Anal. Chim. Acta 1997; 346(3): 307–318.
35. Bro R, De Jong S. A fast non-negativity-constrained least squares algorithm. J. Chemom. 1997; 11(5): 393–401.
36. Tauler R. Application of non-linear optimization methods to the estimation of multivariate curve resolution solutions and of their feasible band boundaries in the investigation of two chemical and environmental simulated data sets. Anal. Chim. Acta 2007; 595(1–2): 289–298.
37. Smilde AK, Kiers HAL, Bijlsma S, Rubingh CM, van Erk MJ. Matrix correlations for high-dimensional data: the modified RV-coefficient. Bioinformatics 2009; 25(3): 401–405.
38. Björck Å, Golub GH. Numerical methods for computing angles between linear subspaces. Math. Comput. 1973; 27(123): 579–594.
39. Wedin P. On angles between subspaces of a finite dimensional inner product space Matrix Pencils. In Lecture Notes in Mathematics, Volume 973, Kågström B, Ruhe A (eds.). Springer: Berlin/Heidelberg, 1983; 263–285.
40. Windig W, Guilment J. Interactive self-modeling mixture analysis. Anal. Chem. 1991; 63(14): 1425–1432.
41. Karakach T, Flight R, Wentzell P. Bootstrap method for the estimation of measurement uncertainty in spotted dual-color DNA microarrays. Anal. Bioanal. Chem. 2007; 389(7–8): 2125–2141.
