
Data Analysis

Principal component analysis, factor analysis, unsupervised recognition
methods, supervised recognition methods

Antoneac Raluca

FABBV, series A, group 1548

0. Introduction

This paper analyses a data set using four techniques: principal component
analysis, factor analysis, supervised recognition methods and unsupervised
recognition methods.

We consider the financial data of 10 companies operating in the dairy
industry, each headquartered in the Romanian city listed next to its name:

Company                             Turnover  Inventories  Employees   Profit    Debts   Equity
Albalact SA Alba Iulia               1104600       836938        650     4754  2635432    50325
Bucovina SA Suceava                    27179         9493        345   -16142    43625   -15942
Danone P.D.P.A. SRL Arad              100258        13523        555   -14039    31965   -13839
Delaco Distribution SRL Brasov         73980            0        470   -49755   241021   -68662
Delta Lact SA Tulcea                   92873         8400        280    21435   344540    29838
Diana Impex SRL Ramnicu Valcea         23890            0        155    28306     3780    28546
Dorna Branzeturi SA Vatra Dornei     1049060         2135        600    18400    27617    66455
Ecolact SRL Milisauti                 402250        90116         50   -95967   331125  -107910
Ladorna Cheese SA Dorna Arini         236703        78535        395     8309   130609     8549
Napolact SA Cluj-Napoca              1484710        95827        638     6463   798486    74443

The SAS output keeps the original variable names: Cifra_afaceri (turnover),
Stocuri (inventories), Salariati (employees), Profit, Datorii (debts) and
Capitaluri (equity).

We therefore have 10 observations on 6 numerical characteristics; the company
name serves only as an identifier.

1. Principal component analysis

With this multidimensional analysis technique we aim to reduce the
dimensionality of the initial space and obtain a small number of components,
on condition that the loss of information is minimal.

Variables    6

Simple Statistics

       Cifra_afaceri        Stocuri      Salariati         Profit        Datorii     Capitaluri
Mean     459550.3000    113496.7000   473.80000000    -8823.60000   -458820.0000     5180.30000
StD      412580.8317    257222.2919   513.38378699    38147.35007    801743.0382    58504.74549

Correlation Matrix

               Cifra_afaceri   Stocuri  Salariati    Profit   Datorii  Capitaluri
Cifra_afaceri         1.0000    0.9226     0.3539    -.0163    -.9656      0.2294
Stocuri               0.9226    1.0000     0.2872    0.0735    -.9727      0.2538
Salariati             0.3539    0.2872     1.0000    -.3351    -.3343      -.3326
Profit                -.0163    0.0735     -.3351    1.0000    -.0812      0.9032
Datorii               -.9656    -.9727     -.3343    -.0812    1.0000      -.3037
Capitaluri            0.2294    0.2538     -.3326    0.9032    -.3037      1.0000

Eigenvalues of the Correlation Matrix

     Eigenvalue   Difference   Proportion   Cumulative
1    3.13616478   1.04063527       0.5227       0.5227
2    2.09552951   1.47947300       0.3493       0.8719
3    0.61605651                    0.1027       0.9746

Eigenvectors

                   PRIN1      PRIN2      PRIN3
Cifra_afaceri   0.543839   -.096331   -.152962
Stocuri         0.545107   -.045929   -.191679
Salariati       0.203635   -.439491   0.868629
Profit          0.092791   0.640442   0.370721
Datorii         -.559311   0.040610   0.112733
Capitaluri      0.210217   0.619389   0.187625

According to the table "Eigenvalues of the Correlation Matrix", reducing the
number of variables from 6 to 3 retains a cumulative 97.46% of the
information, so the model is simplified with minimal loss. The correlation
matrix, in turn, shows the strength of the relationships between the
variables, which helps in organising the principal components.
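The retention percentages follow directly from the eigenvalues: for a correlation matrix the eigenvalues sum to the number of variables, so each proportion is the eigenvalue divided by 6. A minimal Python sketch of that arithmetic, using the three eigenvalues reported by SAS (the function name `explained_variance` is ours, introduced only for illustration):

```python
# Eigenvalues of the correlation matrix, copied from the SAS output (PRIN1..PRIN3).
eigenvalues = [3.13616478, 2.09552951, 0.61605651]
N_VARIABLES = 6  # trace of a 6 x 6 correlation matrix


def explained_variance(values, total):
    """Return (proportion, cumulative proportion) for each eigenvalue."""
    result = []
    running = 0.0
    for value in values:
        running += value
        result.append((value / total, running / total))
    return result


table = explained_variance(eigenvalues, N_VARIABLES)
for rank, (share, cumulative) in enumerate(table, start=1):
    print(f"PRIN{rank}: proportion={share:.4f} cumulative={cumulative:.4f}")
```

The printed values agree with the Proportion and Cumulative columns of the SAS table, ending at 0.9746 for three components.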

[Figure: Principal Components Plot]

The plots above, whose axes are the principal components themselves,
together with the resulting scores, help us classify and rank the 10
companies, a task that was impractical at the outset because of the
relatively large number of variables.

2. Factor analysis
Factor analysis is a multivariate technique that aims to explain the
correlations observed among a set of variables, called indicators or tests,
through a smaller number of ordered and uncorrelated factors, called common
factors. The uncorrelatedness of the factors means that they are defined and
determined under the restriction that they carry no redundant information.
Similarly, the ordering of the factors means that they are ranked in
decreasing order of the variance of each factor.

Prior Communality Estimates: ONE

Eigenvalues of the Correlation Matrix: Total = 6  Average = 1

     Eigenvalue   Difference   Proportion   Cumulative
1    3.13616478   1.04063527       0.5227       0.5227
2    2.09552951   1.47947300       0.3493       0.8719
3    0.61605651   0.51140448       0.1027       0.9746
4    0.10465203   0.06706181       0.0174       0.9921
5    0.03759023   0.02758328       0.0063       0.9983
6    0.01000694                    0.0017       1.0000

6 factors will be retained by the NFACTOR criterion.

Factor Pattern

                Factor1   Factor2   Factor3   Factor4   Factor5   Factor6
Cifra_afaceri   0.96310  -0.13945  -0.12006   0.15392   0.11927   0.02584
Stocuri         0.96534  -0.06649  -0.15045  -0.18828  -0.05666   0.04899
Salariati       0.36062  -0.63621   0.68178   0.00956  -0.01562   0.00592
Profit          0.16433   0.92710   0.29098  -0.13972   0.09569  -0.01169
Datorii        -0.99050   0.05879   0.08848   0.00690   0.03415   0.08011
Capitaluri      0.37228   0.89662   0.14727   0.16078  -0.09792   0.01869

Variance Explained by Each Factor

  Factor1    Factor2    Factor3    Factor4    Factor5    Factor6
3.1361648  2.0955295  0.6160565  0.1046520  0.0375902  0.0100069
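With prior communalities set to ONE, the unrotated factors appear to be the principal components of section 1 rescaled: each loading equals the eigenvector coefficient multiplied by the square root of the eigenvalue, and the squared loadings in a column add up to the variance explained by that factor. The Python sketch below checks this for the first three factors, using the eigenvectors and eigenvalues printed by SAS; the helper name `loadings` is hypothetical and not part of SAS.

```python
import math

# First three eigenvalues and eigenvectors, copied from the PCA output.
eigenvalues = [3.13616478, 2.09552951, 0.61605651]
eigenvectors = {  # variable -> (PRIN1, PRIN2, PRIN3)
    "Cifra_afaceri": (0.543839, -0.096331, -0.152962),
    "Stocuri": (0.545107, -0.045929, -0.191679),
    "Salariati": (0.203635, -0.439491, 0.868629),
    "Profit": (0.092791, 0.640442, 0.370721),
    "Datorii": (-0.559311, 0.040610, 0.112733),
    "Capitaluri": (0.210217, 0.619389, 0.187625),
}


def loadings(vectors, values):
    """Scale every eigenvector coefficient by the square root of its eigenvalue."""
    roots = [math.sqrt(value) for value in values]
    return {
        name: tuple(coef * root for coef, root in zip(coords, roots))
        for name, coords in vectors.items()
    }


pattern = loadings(eigenvectors, eigenvalues)

# Variance explained by Factor1 = sum of squared loadings in the first column.
variance_factor1 = sum(row[0] ** 2 for row in pattern.values())

for name, row in pattern.items():
    print(name, " ".join(f"{value:8.5f}" for value in row))
print(f"Variance explained by Factor1: {variance_factor1:.5f}")
```

Up to rounding in the printed eigenvectors, the result reproduces the first three columns of the Factor Pattern table and the value 3.1361648 under "Variance Explained by Each Factor".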

Final Communality Estimates: Total = 6.000000

Cifra_afaceri    Stocuri  Salariati     Profit    Datorii  Capitaluri
    1.0000000  1.0000000  1.0000000  1.0000000  1.0000000   1.0000000

The FACTOR Procedure
Rotation Method: Varimax

Orthogonal Transformation Matrix

           1         2         3         4         5         6
1    0.95472   0.21122   0.20687   0.01486   0.02959  -0.00214
2   -0.10405   0.89466  -0.43424  -0.00864   0.01095  -0.00213
3   -0.27575   0.39357   0.87620  -0.00632  -0.03542  -0.00690
4   -0.03679   0.00337   0.01791   0.74732   0.65449  -0.10712
5    0.01627   0.00668  -0.02248   0.66372  -0.74287   0.08245
6   -0.00541   0.00493   0.00942   0.02554   0.13242   0.99079

Rotated Factor Pattern

                Factor1   Factor2   Factor3   Factor4   Factor5   Factor6
Cifra_afaceri   0.96324   0.03286   0.15492   0.21113   0.04678   0.01801
Stocuri         0.97577   0.08443   0.09512  -0.16119  -0.04148   0.06315
Salariati       0.22185  -0.22473   0.94882   0.00348  -0.00180  -0.00057
Profit         -0.01305   0.97877  -0.11840  -0.04861  -0.15937   0.00694
Datorii        -0.97629  -0.12115  -0.15279   0.01408  -0.04204   0.08284
Capitaluri      0.21391   0.93874  -0.17805   0.05250   0.19606  -0.01050

Variance Explained by Each Factor

  Factor1    Factor2    Factor3    Factor4    Factor5    Factor6
2.9282610  1.9126242  1.0023770  0.0758865  0.0695181  0.0113332

Final Communality Estimates: Total = 6.000000

Cifra_afaceri    Stocuri  Salariati     Profit    Datorii  Capitaluri
    1.0000000  1.0000000  1.0000000  1.0000000  1.0000000   1.0000000

Scoring Coefficients Estimated by Regression

Squared Multiple Correlations of the Variables with Each Factor

  Factor1    Factor2    Factor3    Factor4    Factor5    Factor6
1.0000000  1.0000000  1.0000000  1.0000000  1.0000000  1.0000000

Standardized Scoring Coefficients

                   Factor1      Factor2      Factor3      Factor4      Factor5      Factor6
Cifra_afaceri   0.33740391   -0.0325011   -0.0989654   3.27747099   -1.0373652   2.66301198
Stocuri         0.37968557   -0.0514929   -0.0887268   -2.2134357   0.60792081   4.92056038
Salariati       -0.1771205   0.18867285   1.14184906   -0.1950483   0.40768542   0.53465795
Profit           -0.029399   0.59951765   0.14042683   0.65598641   -2.9299366   -0.8091586
Datorii         -0.3750092    0.0606464   0.10454546   0.85075078   0.41433907   7.99935485
Capitaluri      -0.1061042   0.49893428   0.15187313   -0.5364956   3.18758573   1.46822247

3. Cluster analysis
Cluster analysis is a classification technique in which objects are assigned
to clusters, or groups, progressively and without knowing the number of
classes in advance, subject to two fundamental criteria:
- the objects classified in each class should be as similar as possible with
respect to the chosen characteristics;
- the objects classified in one class should differ as much as possible from
the objects classified in any of the other classes.
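Ward's method, used in the next step, turns these two criteria into a single number: at each stage it merges the pair of clusters whose union increases the within-cluster sum of squares the least. The Python sketch below illustrates that merge cost on made-up two-dimensional points; the data and the function names (`centroid`, `within_ss`, `ward_cost`, `closest_pair`) are assumptions for illustration only and do not come from the SAS run.

```python
def centroid(points):
    """Mean of a list of equally long coordinate lists."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def within_ss(points):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase in within-cluster sum of squares caused by merging a and b."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def closest_pair(clusters):
    """Return (cost, i, j) for the cheapest merge among all cluster pairs."""
    return min(
        (ward_cost(clusters[i], clusters[j]), i, j)
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )


# Hypothetical toy data: two nearby points and one distant point.
a = [[0.0, 0.0], [2.0, 0.0]]
b = [[10.0, 0.0]]
cost = ward_cost(a, b)
direct = within_ss(a + b) - within_ss(a) - within_ss(b)
best = closest_pair([[[0.0, 0.0]], [[1.0, 0.0]], [[10.0, 0.0]]])
print(cost, direct, best)
```

The closed-form cost and the direct difference of sums of squares coincide, which is why the merge can be chosen from centroids and cluster sizes alone.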

Results produced by Ward's method:

Cluster Analysis Results

The CLUSTER Procedure
Ward's Minimum Variance Cluster Analysis

Eigenvalues of the Covariance Matrix

     Eigenvalue   Difference   Proportion   Cumulative
1    8.00706E11   7.93213E11       0.9826       0.9826
2    7493611884   2997348049       0.0092       0.9918
3    4496263835   2449581912       0.0055       0.9973
4    2046681923   1931729146       0.0025       0.9999
5     114952777    114952719       0.0001       1.0000
6    57.9409324                    0.0000       1.0000

Root-Mean-Square Total-Sample Standard Deviation = 368523.6
Root-Mean-Square Distance Between Observations = 1276603

Cluster History

NCL  Clusters Joined  FREQ   SPRSQ   RSQ  ERSQ   CCC   PSF  PST2  Tie
  9  OB2    OB3          2  0.0000  1.00     .     .  4098     .
  8  OB6    OB7          2  0.0002  1.00     .     .  1335     .
  7  OB4    OB5          2  0.0018  .998     .     .   244     .
  6  CL9    CL8          4  0.0027  .995     .     .   167  25.4
  5  CL7    OB9          3  0.0031  .992     .     .   157   1.7
  4  CL5    OB8          4  0.0051  .987     .     .   152   2.1
  3  CL6    CL4          8  0.0294  .958     .     .  79.0  13.6
  2  CL3    OB10         9  0.0641  .893  .801  1.82  67.1  10.6
  1  OB1    CL2         10  0.8935  .000  .000  0.00     .  67.1
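The SPRSQ and RSQ columns of the Cluster History are linked: SPRSQ is the share of total variance lost at each merge, and RSQ is what remains, i.e. one minus the SPRSQ accumulated so far. A short Python check on the values printed by SAS (the function name `r_squared_path` is ours; agreement is only up to the rounding of the printed table):

```python
# (number of clusters, SPRSQ, RSQ) copied from the Cluster History table.
history = [
    (9, 0.0000, 1.000),
    (8, 0.0002, 1.000),
    (7, 0.0018, 0.998),
    (6, 0.0027, 0.995),
    (5, 0.0031, 0.992),
    (4, 0.0051, 0.987),
    (3, 0.0294, 0.958),
    (2, 0.0641, 0.893),
    (1, 0.8935, 0.000),
]


def r_squared_path(sprsq_values):
    """R-squared after each merge: 1 minus the cumulative SPRSQ."""
    path = []
    lost = 0.0
    for value in sprsq_values:
        lost += value
        path.append(1.0 - lost)
    return path


computed = r_squared_path([row[1] for row in history])
for (ncl, sprsq, reported), value in zip(history, computed):
    print(f"NCL={ncl}: SPRSQ={sprsq:.4f} computed={value:.4f} reported={reported:.3f}")
```

The last merge, which joins OB1 to all the other companies, alone accounts for about 89% of the total variance, so the first observation stands far apart from the rest.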

[Figures: Cluster Analysis Plots; Cluster Analysis Tree Chart]

The TREE Procedure


Cluster tree data for SASUSER.IMPW7714

The method starts from 10 clusters, one for each observed object. We want to
reduce their number, therefore:

4. Clustering with the k-means algorithm

Results produced by the k-means method (parameters used: 1 iteration, at
most 4 clusters):

The FASTCLUS Procedure
Replace=FULL Radius=0 Maxclusters=4 Maxiter=1

Initial Seeds

Cluster  Cifra_afaceri      Stocuri  Salariati      Profit       Datorii   Capitaluri
1          1104600.000   836938.000     17.000    4754.000  -2635432.000    50325.000
2           484710.000    95827.000      4.000    6463.000   -798486.000    74443.000
3            10258.000    13523.000      1.000  -14039.000    -31965.000   -13839.000
4           402250.000    90116.000      4.000  -95967.000   -331125.000  -107910.000

Criterion Based on Final Seeds = 39752.9

Cluster Summary

Cluster  Frequency  RMS Std    Maximum Distance  Radius    Nearest  Distance Between
                    Deviation  from Seed         Exceeded  Cluster  Cluster Centroids
                               to Observation
1                1          .                 0                  2            2075684
2                1          .                 0                  4             541313
3                5    47687.8            162613                  4             339707
4                3    57906.9            133725                  3             339707

Statistics for Variables

Variable        Total STD  Within STD  R-Square  RSQ/(1-RSQ)
Cifra_afaceri      317844       88020  0.948874    18.559474
Stocuri            257222       39267  0.984464    63.364985
Salariati        10.38375    10.17513  0.359852     0.562138
Profit              38147       37724  0.348038     0.533831
Datorii            801743       51330  0.997267   364.949947
Capitaluri          58505       49555  0.521708     1.090772
OVER-ALL           368524       51321  0.987071    76.345545

Pseudo F Statistic = 152.69

Approximate Expected Over-All R-Squared = .

Cubic Clustering Criterion = .
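The over-all figures in this table can be cross-checked by hand. Assuming the usual definitions (R-squared as one minus the ratio of within-cluster to total sum of squares, and the pseudo F statistic as the ratio of between-cluster to within-cluster mean squares), the Python sketch below recovers the SAS values from n = 10 observations and k = 4 clusters. The function names are ours, and the formulas are stated as an assumption that the printed numbers happen to be consistent with.

```python
N_OBS = 10      # companies
N_CLUSTERS = 4  # Maxclusters=4


def overall_r_squared(total_std, within_std, n_obs, n_clusters):
    """R-squared from the pooled total and within-cluster standard deviations."""
    total_ss = (n_obs - 1) * total_std ** 2
    within_ss = (n_obs - n_clusters) * within_std ** 2
    return 1.0 - within_ss / total_ss


def pseudo_f(r_squared, n_obs, n_clusters):
    """Pseudo F: between-cluster mean square over within-cluster mean square."""
    between = r_squared / (n_clusters - 1)
    within = (1.0 - r_squared) / (n_obs - n_clusters)
    return between / within


# OVER-ALL row of the SAS table: Total STD, Within STD, R-Square.
r2_from_std = overall_r_squared(368524, 51321, N_OBS, N_CLUSTERS)
r2_reported = 0.987071
ratio = r2_reported / (1.0 - r2_reported)
f_stat = pseudo_f(r2_reported, N_OBS, N_CLUSTERS)

print(f"R-squared from STDs: {r2_from_std:.6f}")
print(f"RSQ/(1-RSQ):         {ratio:.4f}")
print(f"Pseudo F:            {f_stat:.2f}")
```

Within rounding, the three results match the OVER-ALL R-Square (0.987071), the ratio 76.345545 and the Pseudo F Statistic of 152.69.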

Cluster Means

Cluster  Cifra_afaceri      Stocuri  Salariati      Profit       Datorii   Capitaluri
1          1104600.000   836938.000     17.000    4754.000  -2635432.000    50325.000
2           484710.000    95827.000      4.000    6463.000   -798486.000    74443.000
3           109624.600    20737.200      2.600    4966.800    -47519.200    14753.800
4           315688.667    32838.667     14.000  -41429.000   -305562.000   -48911.333

Cluster Standard Deviations

Cluster  Cifra_afaceri      Stocuri  Salariati       Profit      Datorii   Capitaluri
1                    .            .          .            .            .            .
2                    .            .          .            .            .            .
3          93017.61892  32768.64244    2.30217  19641.41855  48657.54817  34137.65793
4          77058.09239  49781.11776   17.32051  59142.19397  56295.16913  70966.15039

In line with the input parameters, the method yields 4 clusters; the
assignment of objects to clusters is summarized in the tables above.
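To make the mechanics of a single k-means pass concrete, here is a minimal Python sketch: each point is assigned to its nearest seed, the seeds are replaced by the cluster means, and the points are assigned once more. It is a plain textbook k-means on made-up two-dimensional data, not a reimplementation of PROC FASTCLUS (whose seed selection and replacement rules are not reproduced), and all names in it are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, seeds):
    """Index of the nearest seed for every point."""
    return [
        min(range(len(seeds)), key=lambda k: squared_distance(p, seeds[k]))
        for p in points
    ]


def update(points, labels, seeds):
    """Replace each seed by the mean of its members (unchanged if it has none)."""
    new_seeds = []
    for k, seed in enumerate(seeds):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_seeds.append([sum(c) / len(members) for c in zip(*members)])
        else:
            new_seeds.append(list(seed))
    return new_seeds


def kmeans(points, seeds, max_iter=1):
    """Run max_iter assign/update passes, then a final assignment."""
    for _ in range(max_iter):
        labels = assign(points, seeds)
        seeds = update(points, labels, seeds)
    return assign(points, seeds), seeds


# Hypothetical toy data with two obvious groups and two starting seeds.
points = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5], [1.0, 0.5]]
labels, centers = kmeans(points, seeds=[[1.0, 1.0], [9.0, 9.5]], max_iter=1)
print(labels)
print(centers)
```

With unstandardized financial data such as ours, the distances are dominated by the variables with the largest scale, which is consistent with the very high R-Square values reported above for Datorii, Stocuri and Cifra_afaceri.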


Appendix:

PCA code
%LET _CLIENTTASKLABEL=%NRBQUOTE(Import Data);
%LET _EGTASKLABEL=%NRBQUOTE(Import Data);

/* Drop any earlier copy of the imported data set. */
PROC SQL;
    DROP VIEW SASUSER.IMPW5778;
    DROP TABLE SASUSER.IMPW5778;

%MACRO _EG_ImportData;
    %LET IsMVS=FALSE;
    %IF %sysfunc(getoption(filesystem))=MVS %THEN
    %DO;
        %LET IsMVS=TRUE;
    %END;

    /* Read the '*'-delimited export, skipping the header row. */
    DATA SASUSER.IMPW5778;
        INFILE
            "C:\DOCUME~1\Raluca\LOCALS~1\Temp\EGImport\Local\Date AD9a53d7ad-ed46-42df-9f64-f434c12d8f11.asc"
            DELIMITER='*'
            MISSOVER
            DSD
            %IF &IsMVS=FALSE %THEN %DO;
                LRECL=32767
            %END;
            FIRSTOBS=2
        ;
        LENGTH
            Nume_firma $ 25
            Cifra_afaceri 8
            Stocuri 8
            Salariati 8
            Profit 8
            Datorii 8
            Capitaluri 8
        ;
        INPUT
            Nume_firma $
            Cifra_afaceri
            Stocuri
            Salariati
            Profit
            Datorii
            Capitaluri
        ;
        LABEL
            Nume_firma = "Nume_firma"
            Cifra_afaceri = "Cifra_afaceri"
            Stocuri = "Stocuri"
            Salariati = "Salariati"
            Profit = "Profit"
            Datorii = "Datorii"
            Capitaluri = "Capitaluri"
        ;
    RUN;
%MEND;
%_EG_ImportData
RUN; QUIT;
TITLE; FOOTNOTE;
RUN;
%LET _CLIENTTASKLABEL=;
%LET _EGTASKLABEL=;

Factor analysis code
%LET _CLIENTTASKLABEL=%NRBQUOTE(Import Data);
%LET _EGTASKLABEL=%NRBQUOTE(Import Data);

/* Drop any earlier copy of the imported data set. */
PROC SQL;
    DROP VIEW SASUSER.IMPW5480;
    DROP TABLE SASUSER.IMPW5480;

%MACRO _EG_ImportData;
    %LET IsMVS=FALSE;
    %IF %sysfunc(getoption(filesystem))=MVS %THEN
    %DO;
        %LET IsMVS=TRUE;
    %END;

    /* Read the '*'-delimited export, skipping the header row. */
    DATA SASUSER.IMPW5480;
        INFILE
            "C:\DOCUME~1\Raluca\LOCALS~1\Temp\EGImport\Local\Date AD962ff262-2f1e-4e8d-a9d1-06944aa6848f.asc"
            DELIMITER='*'
            MISSOVER
            DSD
            %IF &IsMVS=FALSE %THEN %DO;
                LRECL=32767
            %END;
            FIRSTOBS=2
        ;
        LENGTH
            Nume_firma $ 25
            Cifra_afaceri 8
            Stocuri 8
            Salariati 8
            Profit 8
            Datorii 8
            Capitaluri 8
        ;
        INPUT
            Nume_firma $
            Cifra_afaceri
            Stocuri
            Salariati
            Profit
            Datorii
            Capitaluri
        ;
        LABEL
            Nume_firma = "Nume_firma"
            Cifra_afaceri = "Cifra_afaceri"
            Stocuri = "Stocuri"
            Salariati = "Salariati"
            Profit = "Profit"
            Datorii = "Datorii"
            Capitaluri = "Capitaluri"
        ;
    RUN;
%MEND;
%_EG_ImportData
RUN; QUIT;
TITLE; FOOTNOTE;
RUN;
%LET _CLIENTTASKLABEL=;
%LET _EGTASKLABEL=;

Cluster code (k-means)
%LET _CLIENTTASKLABEL=%NRBQUOTE(Import Data);
%LET _EGTASKLABEL=%NRBQUOTE(Import Data);

/* Drop any earlier copy of the imported data set. */
PROC SQL;
    DROP VIEW SASUSER.IMPW3597;
    DROP TABLE SASUSER.IMPW3597;

%MACRO _EG_ImportData;
    %LET IsMVS=FALSE;
    %IF %sysfunc(getoption(filesystem))=MVS %THEN
    %DO;
        %LET IsMVS=TRUE;
    %END;

    /* Read the '*'-delimited export, skipping the header row. */
    DATA SASUSER.IMPW3597;
        INFILE
            "C:\DOCUME~1\Raluca\LOCALS~1\Temp\EGImport\Local\Date AD271e8db8-009a-4bd5-b048-90da00c42c4f.asc"
            DELIMITER='*'
            MISSOVER
            DSD
            %IF &IsMVS=FALSE %THEN %DO;
                LRECL=32767
            %END;
            FIRSTOBS=2
        ;
        LENGTH
            Nume_firma $ 25
            Cifra_afaceri 8
            Stocuri 8
            Salariati 8
            Profit 8
            Datorii 8
            Capitaluri 8
            Termen_inregistrare $ 21
        ;
        INPUT
            Nume_firma $
            Cifra_afaceri
            Stocuri
            Salariati
            Profit
            Datorii
            Capitaluri
        ;
        LABEL
            Nume_firma = "Nume_firma"
            Cifra_afaceri = "Cifra_afaceri"
            Stocuri = "Stocuri"
            Salariati = "Salariati"
            Profit = "Profit"
            Datorii = "Datorii"
            Capitaluri = "Capitaluri"
        ;
    RUN;
%MEND;
%_EG_ImportData
RUN; QUIT;
TITLE; FOOTNOTE;
RUN;
%LET _CLIENTTASKLABEL=;
%LET _EGTASKLABEL=;

Cluster code (Ward)
%LET _CLIENTTASKLABEL=%NRBQUOTE(Import Data);
%LET _EGTASKLABEL=%NRBQUOTE(Import Data);

/* Drop any earlier copy of the imported data set. */
PROC SQL;
    DROP VIEW SASUSER.IMPW3597;
    DROP TABLE SASUSER.IMPW3597;

%MACRO _EG_ImportData;
    %LET IsMVS=FALSE;
    %IF %sysfunc(getoption(filesystem))=MVS %THEN
    %DO;
        %LET IsMVS=TRUE;
    %END;

    /* Read the '*'-delimited export, skipping the header row. */
    DATA SASUSER.IMPW3597;
        INFILE
            "C:\DOCUME~1\Raluca\LOCALS~1\Temp\EGImport\Local\Date AD271e8db8-009a-4bd5-b048-90da00c42c4f.asc"
            DELIMITER='*'
            MISSOVER
            DSD
            %IF &IsMVS=FALSE %THEN %DO;
                LRECL=32767
            %END;
            FIRSTOBS=2
        ;
        LENGTH
            Nume_firma $ 25
            Cifra_afaceri 8
            Stocuri 8
            Salariati 8
            Profit 8
            Datorii 8
            Capitaluri 8
        ;
        INPUT
            Nume_firma $
            Cifra_afaceri
            Stocuri
            Salariati
            Profit
            Datorii
            Capitaluri
        ;
        LABEL
            Nume_firma = "Nume_firma"
            Cifra_afaceri = "Cifra_afaceri"
            Stocuri = "Stocuri"
            Salariati = "Salariati"
            Profit = "Profit"
            Datorii = "Datorii"
            Capitaluri = "Capitaluri"
        ;
    RUN;
%MEND;
%_EG_ImportData
RUN; QUIT;
TITLE; FOOTNOTE;
RUN;
%LET _CLIENTTASKLABEL=;
%LET _EGTASKLABEL=;