Statistical Inference (Inferenta Statistica)
... for opinion questions than for factual ones. The strongest effect occurs
in studies that target projections into the future: intentions, wishes, and so on.
From the standpoint of error control, the American market research
literature also classifies errors into two broad groups:
1. Errors that can be foreseen: these are controllable and are caused by
the statistical measurement of continuous data and by the rounding carried
out to obtain discrete results, in keeping with the nature of the statistical
characteristic. They are therefore random (sampling) errors and computation
errors, and both types can be estimated and their effects controlled. The
mathematical operation of rounding the recorded values introduces errors
that are amplified if rounding continues during the analysis stage.
Consequently, data are rounded for the following reasons:
When the observed characteristic is continuous, rounding is sometimes
needed in order to express the magnitude of the value (usually only two
decimals are kept);
For discrete characteristics, rounding serves to preserve their integer
nature.
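The claim that rounding errors grow when rounding is repeated in the analysis stage can be illustrated with a minimal sketch. The helper names `round_half_up` and `double_rounding_gap`, the half-up rounding rule, and the sample value are assumptions made for illustration; they do not come from the text.

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value, places):
    """Round a Decimal to `places` decimals, rounding halves upward."""
    quantum = Decimal(1).scaleb(-places)  # places=2 gives Decimal('0.01')
    return value.quantize(quantum, rounding=ROUND_HALF_UP)

def double_rounding_gap(value, first_places=2, second_places=1):
    """Compare rounding once with rounding at recording and again at analysis."""
    once = round_half_up(value, second_places)
    twice = round_half_up(round_half_up(value, first_places), second_places)
    return once, twice

# A value recorded with two decimals and rounded again during analysis
# ends up further from the original than a value rounded only once.
once, twice = double_rounding_gap(Decimal("2.446"))
print(once, twice)
```

Working with `Decimal` rather than binary floats keeps the illustration free of representation effects, so the gap shown is due to the repeated rounding alone.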
2. Errors that cannot be foreseen: these are uncontrollable and are due
to incomplete or incorrect recordings, or to an ambiguous definition of the
characteristics or statistical units under study.
Error control aims to detect observation errors and to ensure the
authenticity of the statistical data. It covers control of the volume of
recorded data as well as arithmetic and logical checks.
Some of the errors are due to the interview operator.
The most important sources of response errors attributable to operators
are:
a) the characteristics of the operators, for example a level of training
that is too low or too high, which makes them prone to systematic
mistakes, or leads them to influence the respondent's answer through the
opinions they express;
b) the expectations of the operators, which lead them to suggest certain
answers to the subjects;
$$P(t_1 \le z \le t_2) = \Phi(t_2) - \Phi(t_1) \qquad (7.1)$$
where $\Phi(t)$ denotes the probability function.
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad (7.2)$$
$$|a - \bar{x}| < t(P, k)\,\frac{S}{\sqrt{n}} \qquad (7.3)$$
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (7.4)$$
$|a - \bar{x}|$ ...
With confidence, given $P$ ... of the sample (7.5):
$$n \;\ldots\; t^2(P)\,\ldots, \qquad (7.5)$$
$$t = \frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{1/n_1 + 1/n_2}},$$
where:
$$S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}}. \qquad (7.6)$$
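The two-sample statistic in (7.6) can be sketched in a few lines. This is a minimal illustration assuming two independent samples with a common variance; the function name `pooled_t_statistic` and the sample values are hypothetical.

```python
from math import sqrt

def pooled_t_statistic(sample1, sample2):
    """Two-sample t statistic with a pooled standard deviation, as in (7.6)."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Sums of squared deviations, equal to (n1 - 1) * S1^2 and (n2 - 1) * S2^2.
    ss1 = sum((x - mean1) ** 2 for x in sample1)
    ss2 = sum((x - mean2) ** 2 for x in sample2)
    pooled_sd = sqrt((ss1 + ss2) / ((n1 - 1) + (n2 - 1)))
    return (mean1 - mean2) / (pooled_sd * sqrt(1 / n1 + 1 / n2))

print(pooled_t_statistic([1, 2, 3], [4, 5, 6]))
```

The resulting value is compared with the tabulated Student value for (n1 - 1) + (n2 - 1) degrees of freedom.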
$$\sigma^2 \approx s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - a)^2, \qquad (7.7)$$
$$\sigma^2 \approx s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2. \qquad (7.8)$$
$$\frac{1}{m}\sum_{i=1}^{m} S_i^2 \qquad (7.9)$$
... a simple random sample. If operator $i$ produces a net distortion $b_i$,
we define the total of the distortions produced by the operators as (7.10):
$$\sigma_b^2 = \frac{1}{R-1}\sum_{i=1}^{R}(b_i - \bar{b})^2, \qquad (7.10)$$
$$V(\bar{x}) = \sigma_x^2 / n \qquad (7.11)$$
$$V(\bar{x}) = \sigma_x^2\,\frac{1}{n} + \sigma_b^2\,\frac{1}{r} \qquad (7.12)$$
(7.1)
(7.2)
$$E(\hat{\theta} - \theta)^2 = \operatorname{var}\hat{\theta} + b^2$$
... minimum variation, and therefore minimum variance.
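The decomposition of the mean squared error of an estimator into its variance plus the square of its bias b can be checked numerically. The sketch below assumes a small, made-up list of estimates obtained from repeated samples; the function name `mse_decomposition` is hypothetical.

```python
def mse_decomposition(estimates, true_value):
    """Return (mse, variance, bias) for a list of repeated estimates."""
    m = len(estimates)
    mean_estimate = sum(estimates) / m
    # Variance of the estimates around their own mean.
    variance = sum((e - mean_estimate) ** 2 for e in estimates) / m
    # Bias: distance between the mean estimate and the true value.
    bias = mean_estimate - true_value
    # Mean squared error around the true value.
    mse = sum((e - true_value) ** 2 for e in estimates) / m
    return mse, variance, bias

mse, variance, bias = mse_decomposition([9, 11, 13], true_value=10)
print(mse, variance + bias ** 2)
```

Both printed numbers agree, which is why an unbiased estimator (b = 0) with minimum variance also has minimum mean squared error.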
(7.3)
(7.4)
x = E ( X ) = x e x dx =
0
(7.5)
$$L(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i) \qquad (7.6)$$
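$$\operatorname{var}\hat{\theta} \ge \frac{1}{n\,E\left[\left(\dfrac{\partial \log f(x)}{\partial \theta}\right)^2\right]} \qquad (7.7)$$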
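$$L(x_1, \ldots, x_n; \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left[-\frac{\sum_i x_i^2}{2\sigma^2}\right],$$
$$\log L = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{\sum_i x_i^2}{2\sigma^2} \qquad (7.8)$$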
... proportional to $n$: $\operatorname{var}\hat{\theta} = \sigma_{\hat{\theta}}^2 = \ldots / n$, since the second derivative of the
logarithm of the function $L$ is a linear function.
The properties of the maximum likelihood estimator are:
a) it converges in probability to the true value of $\theta$ as $n \to \infty$,
and is therefore consistent;
b) it is asymptotically unbiased;
c) it has minimum variance compared with unbiased estimators of finite
variance, so it can be called an efficient estimator;
d) the maximum likelihood estimator of a function of $\theta$ is that
function of the maximum likelihood estimator of $\theta$.
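For example, if the maximum likelihood estimator of $\theta$ (or $\sigma^2$) is
$$\hat{\sigma}^2 = \sum_i x_i^2 / n, \qquad (7.9)$$
then the standard deviation is the square root of expression (7.9), which gives (7.10):
$$\hat{\sigma} = \sqrt{\hat{\sigma}^2} = \sqrt{\sum_i x_i^2 / n} \qquad (7.10)$$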
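The example above can be checked numerically. The sketch assumes observations from a normal distribution with zero mean, which is the case in which the maximum likelihood estimator of the variance is the mean of the squared observations; the function names and the data are hypothetical.

```python
from math import log, pi, sqrt

def log_likelihood(xs, sigma2):
    """Log-likelihood of a zero-mean normal sample for a given variance."""
    n = len(xs)
    return -0.5 * n * log(2 * pi * sigma2) - sum(x * x for x in xs) / (2 * sigma2)

def mle_sigma2(xs):
    """Maximum likelihood estimate of the variance: mean of squared values."""
    return sum(x * x for x in xs) / len(xs)

data = [1, -2, 3, -1, 2]
sigma2_hat = mle_sigma2(data)
sigma_hat = sqrt(sigma2_hat)  # property d): the estimate of sigma is the root
print(sigma2_hat, sigma_hat)
# The log-likelihood is lower on either side of the estimate.
print(log_likelihood(data, sigma2_hat) > log_likelihood(data, 3.0))
print(log_likelihood(data, sigma2_hat) > log_likelihood(data, 4.6))
```

Taking the square root of the variance estimate, rather than maximizing again over the standard deviation, is exactly the invariance property listed under d).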
$$L(x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i) \qquad (7.11)$$
$$n B_{pq} \;\ldots\; \frac{\partial^2 \log f(x)}{\partial \theta_p\,\partial \theta_q}, \qquad p, q = 1, \ldots, n \qquad (7.12)$$
$$f(x \mid \theta_1, \theta_2) = (2\pi\theta_2)^{-1/2}\exp\left[-(x - \theta_1)^2 / 2\theta_2\right] \qquad (7.13)$$
$$\hat{\theta}_1 = \frac{\sum x_i}{n} = \bar{x}; \qquad \hat{\theta}_2 = \frac{\sum (x_i - \bar{x})^2}{n} = s^2 \qquad (7.14)$$
$$\frac{\theta_2}{n} \qquad (7.15)$$
$$\frac{2\theta_2^2}{n}$$
$$L(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta). \qquad (7.16)$$
$$h(x_1, \ldots, x_n, \theta) = g(\theta)\prod_{i=1}^{n} f(x_i \mid \theta) \qquad (7.17)$$
(7.18)
$$\int p(\theta / x)\,d\theta = 1 \qquad (7.19)$$
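The step from the joint function h, the product of the prior g and the likelihood, to a posterior that integrates to 1 can be sketched on a discrete grid of parameter values. The Bernoulli likelihood, the uniform prior, the grid, and the function names are all assumptions chosen for illustration.

```python
def bernoulli_likelihood(theta, data):
    """Product of f(x_i | theta) for 0/1 observations."""
    result = 1.0
    for x in data:
        result *= theta if x == 1 else (1 - theta)
    return result

def grid_posterior(thetas, prior, data):
    """Posterior over a grid: prior times likelihood, normalized to sum to 1."""
    joint = [g * bernoulli_likelihood(t, data) for t, g in zip(thetas, prior)]
    total = sum(joint)  # discrete analogue of integrating h over theta
    return [j / total for j in joint]

thetas = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]
posterior = grid_posterior(thetas, prior, data=[1, 1, 0])
print(posterior, sum(posterior))
```

Dividing by the total is what enforces the normalization condition (7.19); on a continuous parameter the sum becomes an integral.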
(7.20)
... of the mean with a ...
$$4n\,[f(x_{0.5})] \qquad (7.22)$$
(7.23)
(7.24)
(7.25)
For both types of characteristic, alternative (binary) and non-alternative,
the absolute level within the total population can be estimated as the
product of the limits of the confidence interval and the size of the whole
population, that is:
$$N(\bar{x}_s \pm \Delta_{\bar{x}}) \quad \text{and} \quad N(w \pm \Delta_w) \qquad (7.26)$$
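Scaling a confidence interval up to a population total, as in (7.26), amounts to multiplying both limits by the population size N. The sketch below uses made-up figures, and the function name `total_interval` is hypothetical.

```python
def total_interval(population_size, estimate, margin):
    """Limits for a population total: N * (estimate - margin), N * (estimate + margin)."""
    lower = population_size * (estimate - margin)
    upper = population_size * (estimate + margin)
    return lower, upper

# Non-alternative characteristic: sample mean 12.5 with margin of error 0.5.
print(total_interval(1000, 12.5, 0.5))
# Alternative characteristic: sample proportion 0.20 with margin of error 0.03.
print(total_interval(1000, 0.20, 0.03))
```

For a mean the result is an interval for the total volume of the characteristic; for a proportion it is an interval for the number of units that possess the attribute.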
$$\bar{y}(x) = M(y/x) = \sum_{j \ge 1} y_j\, p(y_j / x),$$
where $x$ is one of the values of the sample units $x_1, x_2, \ldots$, and
$M(y/x)$ is the conditional expectation of the random variable $Y$ for a
fixed $x$. Varying $x$ as a parameter, we obtain in the plane of the
variables $(x, y)$ the locus of the centers of the conditional
distributions, called the regression curve of $Y$ on $X$. If the roles of
the variables are exchanged, we obtain the regression curve of $X$ on $Y$.
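As a measure of the deviation of the random variable from the center
$\bar{y}(x)$, the conditional variance is adopted:
$$\sigma_{Y/x}^2 = D\left(\frac{Y}{x}\right) = \sum_{j \ge 1} \left(y_j - \bar{y}(x)\right)^2 p\left(\frac{y_j}{x}\right),$$
$$\bar{\sigma}_{Y/x}^2 = M[Y - \bar{y}(x)]^2 = \sum_{i \ge 1} p(x_i)\,\sigma_{Y/x_i}^2.$$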
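The conditional mean that traces the regression curve, and the conditional variance around it, can be computed directly from a discrete joint distribution. The joint probabilities and the function name below are invented for illustration.

```python
def conditional_mean_and_variance(joint, x):
    """Mean and variance of Y given X = x for a joint table {(x_i, y_j): p}."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    # Conditional probabilities p(y_j / x) = p(x, y_j) / p(x).
    conditional = {yj: p / p_x for (xi, yj), p in joint.items() if xi == x}
    mean = sum(y * p for y, p in conditional.items())
    variance = sum((y - mean) ** 2 * p for y, p in conditional.items())
    return mean, variance

joint = {(0, 1): 0.2, (0, 3): 0.2, (1, 2): 0.3, (1, 6): 0.3}
for x in (0, 1):
    print(x, conditional_mean_and_variance(joint, x))
```

Plotting the conditional means against x gives the regression curve of Y on X; weighting the conditional variances by p(x) gives the average conditional variance.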
The simplest regression function is the linear one, $\bar{y}(x) = \alpha + \beta x$,
whose coefficients $\alpha$ and $\beta$ are computed by the method of least
squares, starting from the condition that the error function be minimized:
$$f(\alpha, \beta) = M(Y - \alpha - \beta x)^2 = \sum_{i \ge 1}\sum_{j \ge 1} p(x_i, y_j)\,(y_j - \bar{y}(x_i))^2.$$
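$$\bar{y}(x) = a_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x - a_x),$$
$$X = a_x + \rho\,\frac{\sigma_x}{\sigma_y}\,(Y - a_y).$$
$$\rho_{y/x} = \rho\,\frac{\sigma_y}{\sigma_x}, \qquad \rho_{x/Y} = \rho\,\frac{\sigma_x}{\sigma_y}$$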
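For the linear case, the least squares line of Y on X passes through the two means and has a slope equal to the correlation coefficient times the ratio of the standard deviations. A minimal sketch, with hypothetical data and function name:

```python
from math import sqrt

def regression_y_on_x(xs, ys):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(xs)
    a_x = sum(xs) / n
    a_y = sum(ys) / n
    sigma_x = sqrt(sum((x - a_x) ** 2 for x in xs) / n)
    sigma_y = sqrt(sum((y - a_y) ** 2 for y in ys) / n)
    covariance = sum((x - a_x) * (y - a_y) for x, y in zip(xs, ys)) / n
    rho = covariance / (sigma_x * sigma_y)
    slope = rho * sigma_y / sigma_x
    # The line passes through the point of means (a_x, a_y).
    return a_y - slope * a_x, slope

print(regression_y_on_x([1, 2, 3, 4], [3, 5, 7, 9]))
```

Swapping the arguments gives the line of X on Y, whose slope is the correlation coefficient times the inverse ratio of standard deviations.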
$$\bar{y}(x) = a_0 + \sum_{i=1}^{n} a_i x_i.$$
$$f(a_0, a_1, \ldots, a_n) = M\left[Y - a_0 - \sum_{i=1}^{n} a_i x_i\right]^2.$$
$$\bar{y}(x) = \sum_{i=1}^{n} a_i x_i + \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} x_i x_j.$$
$$y_1 = a + (b_2 x_2 + \ldots + b_n x_n) + \varepsilon,$$
which represents a hyperplane in n-dimensional space. Each coefficient
$b_i$ shows by how much $y_1$ changes when the variable $x_i$ changes by one
unit. The regression coefficients $b$ cannot be compared directly, because
the variables $x_i$ may be of different kinds and expressed in different
units of measurement.
To compare the coefficients of the regression function, that is, to compare
the factors by their importance in influencing the variable $x_1$,
standardized regression coefficients, the beta coefficients $\beta$, are
computed. Their value shows the relationship between the independent
variable and the fitted values $x_i$. For this reason, every regression
equation must also be accompanied by the indicator that expresses the mean
error of approximation of the dependent variable.
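A standardized (beta) coefficient is commonly obtained by rescaling the raw coefficient with the ratio of the standard deviations of the predictor and of the dependent variable. The sketch below uses this standard definition, which the text does not spell out; the data and function names are hypothetical.

```python
from math import sqrt

def std_dev(values):
    """Sample standard deviation with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))

def standardized_coefficient(b, xs, ys):
    """Beta coefficient: raw slope b rescaled by std(x) / std(y)."""
    return b * std_dev(xs) / std_dev(ys)

ys = [3, 5, 7, 9]
xs_metres = [1, 2, 3, 4]            # raw slope is 2 per metre
xs_centimetres = [100, 200, 300, 400]  # raw slope is 0.02 per centimetre
print(standardized_coefficient(2, xs_metres, ys))
print(standardized_coefficient(0.02, xs_centimetres, ys))
```

The raw slopes differ by a factor of 100 because of the unit of measurement, while the standardized coefficient is the same in both cases, which is what makes beta coefficients comparable across predictors.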
A drawback of regression is that it does not take into account the
relationships among the variables that are treated as independent.
The values of the regression coefficients are based on the partial
correlation coefficients, which are symmetric indices, that is, they do not
assume that one variable is dependent and the other independent.
The statistical relationships involved in regression models underlie the
complex models, which introduce relations of simultaneous influence of
several variables on the outcome variable, or seek ways of defining new
influence factors that can be interpreted as synthetic factors (treated as
final variables) or latent factors (those considered to act as intermediate
factors). One such procedure is path analysis. This method uses a set of r
factors, expressed quantitatively, among which dependence relations are
established. A variable $x_i$ may be treated as dependent in one equation
and independent in another, while circular causality is avoided. The path
coefficients indicate the direct influence of the independent variable on
the dependent one, without showing the influence transmitted through other
variables.
Multiple correlation establishes the strength of the relationship between a
random variable and a group of random variables $x_1, x_2, \ldots, x_n$,
through the multiple correlation coefficient:
R =
D
D
$$x_i = \sum_{r=1}^{k} l_{ir} f_r + \varepsilon_i; \qquad (i = 1, 2, \ldots, n)$$
where:
$f_r$ is the simple factor $r$;
$k$ is the number of factors, which is to be determined;
$\varepsilon_i$ are the residual elements, which represent the sources of
deviation that act only on the variable $i$.
The random variables $\varepsilon_i$ are assumed to be independent both of
one another and of the $k$ variables $f_r$.
The coefficient $l_{ir}$ is usually called the loading of factor $r$.
The variances of the random variables $\varepsilon_i$ are denoted $\sigma_i^2$.
All means are assumed to be equal to 0.
Determining the values of the parameters $l_{ir}$, as well as $\sigma_i^2$,
forms the basis of factor analysis.
In practice, the following problem is of interest, for example: for the
observed sample variables $x_1, x_2, \ldots, x_n$, estimate the values of the
hypothetical factors $f_1, f_2, \ldots, f_k$ and express these factors as linear
functions of the variables $x_1, x_2, \ldots, x_n$. In this case the usual method cannot be applied ...
... n goes to infinity, then N(x1, x2, ..., xn, A) / n represents a probability
that tends to 1.
Although the Law of Large Numbers guarantees that the researcher will
approach the correct answer after a sufficient number of experiments, it
does not specify how close he or she is to the correct answer after a given
number of experiments or recordings. Under certain conditions, statistical
methods may be used to estimate the error that remains after an experiment
has been repeated a given number of times.
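Both halves of this statement can be illustrated with a small simulation: the relative frequency of an event settles near its probability as the number of trials grows, and a normal approximation gives a rough bound on the remaining error. The function names, the fixed seed, and the 1.96 multiplier (an approximate 95% bound) are assumptions made for the sketch.

```python
import random
from math import sqrt

def relative_frequency(n_trials, p_event=0.5, seed=0):
    """Share of trials in which an event of probability p_event occurs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if rng.random() < p_event)
    return hits / n_trials

def approx_error_bound(n_trials, p_event=0.5, z=1.96):
    """Normal-approximation bound on |frequency - p| after n_trials trials."""
    return z * sqrt(p_event * (1 - p_event) / n_trials)

for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n), approx_error_bound(n))
```

The bound shrinks in proportion to the square root of the number of trials, so a tenfold gain in precision costs a hundredfold increase in recordings.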
The totality of the variables through which a given market phenomenon is
studied constitutes the attribute space (property space) of the
phenomenon's characteristics. Operations carried out on the attribute space
aim at a more detailed clarification of the relation between the variables
and the theoretical concepts. Reducing the attribute space, by combining
categories and eliminating some subdivisions, makes it possible to build
models of market phenomena.
The operation of substruction works in the opposite direction: starting
from the model, one elaborates the attribute space implied by that model.
The attribute space is used to compare the operational schemes employed in
research and to find a possible common ground among these empirical
research schemes.
If the events are A1, A2, ..., An, with occurrence probabilities (occurrence
frequencies) given by the vector P(Ai), i = 1, ..., n, and the events are
independent, the probability that all the events occur is
P(A1) · P(A2) · ... · P(An). Independence analysis is applied to statistical
investigations in which the data are affected by errors arising from the
repetition of the same elementary operations, each recording being made
independently of the others.
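The product rule for independent events can be sketched as follows. The example, the chance that a series of independent recordings are all free of error, and the function name are assumptions made for illustration.

```python
from math import prod

def probability_all_occur(probabilities):
    """Probability that all events occur, assuming they are independent."""
    return prod(probabilities)

# Three independent recordings, each correct with probability 0.5.
print(probability_all_occur([0.5, 0.5, 0.5]))
# Twenty independent recordings, each correct with probability 0.99.
print(probability_all_occur([0.99] * 20))
```

Even a small error rate per recording compounds quickly when the same elementary operation is repeated many times, which is why the independence assumption matters in error analysis.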
Research errors
A research error is the deviation (gap) between the values obtained by
processing the primary data and the results that would have been obtained
if a complete enumeration had been carried out. From the standpoint of
error control, the market research literature divides errors into two
large groups:
Errors that can be foreseen: these are controllable and are caused by the
statistical measurement of continuous data and by the rounding performed to
obtain discrete results in keeping with the nature of the statistical
characteristic. They are therefore random (sampling) errors and computation
errors; both types can be estimated and their effects controlled. The
mathematical operation of rounding the recorded values introduces errors,
which are amplified if rounding continues in the analysis stage.
As a result, data are rounded for the following reasons: if the observed
characteristic is continuous, rounding is sometimes necessary to express
the magnitude of the value (usually only two decimals are kept); for
discrete characteristics, rounding serves to preserve their integer nature.
Errors that cannot be foreseen: these are uncontrollable and are due to
incomplete or incorrect recording, or to an ambiguous definition of the
characteristics or statistical units under study.
Error control aims to detect observation errors and to ensure the
authenticity of the statistical data. It covers control of the volume of
recorded data as well as arithmetic and logical checks.
Some of the errors are due to the interviewer. The most important sources
of response errors attributable to interviewers are:
The operator's characteristics, for example a level of training that is too
low or too high, which leads to systematic mistakes or causes the operator
to influence the respondents' answers through his or her personal opinions.
The operator's expectations, which lead him or her to suggest certain
answers to the subjects.
Operator fraud, which occurs in very few cases and can be detected through
pilot research or reinterviewing.
Other important sources of error that can be avoided are: the length of the
questionnaire, which may tire both the operator and the interviewed
subjects; a large number of open questions, which leads to difficulties in
the post-coding operation; the content of the questions, especially
personal ones, which may lead to response errors; the wording of the
questions, especially the use of ambiguous words with several meanings; the
place and time of the interview; and, last but not least, the degree of
interest or incentive of the interviewed person.
Statistical Estimation
Point Estimation
We can distinguish two types of estimation of population parameters from
sample statistics: point estimation and interval estimation. This section
presents some definitions associated with estimation. Samples may be used
to estimate population parameters such as μ and σ, which represent the
population mean and standard deviation, or other population characteristics
such as the median or other quantiles. An estimate may take the form of a
single number, called a point estimate, or an interval of values, called an
interval estimate.
A point estimate is a single number used as an estimate of a population
parameter or population characteristic. Usually a point estimate is derived
from a random sample drawn from the population of interest.
An interval estimate is an interval that provides a lower and an upper
bound for a population parameter whose value is unknown. The interval has
an associated degree of confidence that it contains the population
parameter. Such interval estimates are also called confidence intervals and
are calculated from random samples.
Some of the errors are due to the operator.
The most important sources are:
a) The characteristics of the operators, such as a level of education that
is too low or too high, which may lead them to make systematic mistakes or
to influence the answers of the persons interviewed.
b) The expectations of the operators, which lead them to suggest particular
answers to the subjects.
c) Operator fraud, which is rare and may be detected by means of pilot
surveys.
Other important and avoidable sources of error are: the length of the
questionnaire, which may tire both the subjects and the operators; a
predominance of open questions, which causes difficulties in post-coding;
the content of the questions, especially personal ones, which may cause
response errors; the wording of the questions, especially the use of
ambiguous words with several meanings; the place and time of the interview;
and, last but not least, the subject's level of interest or incentive.
$$P(t_1 \le z \le t_2) = \Phi(t_2) - \Phi(t_1) \qquad (7.1)$$
computed according to formula (7.1),
where $\Phi(t)$ represents the probability function.
The probability that the random error falls outside the interval with
limits $\pm t$ ($t > 0$) is computed from the formula
$P(|z| > t) = 1 - [\Phi(t) - \Phi(-t)]$. As an example, the probability that the random ...
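$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (7.2)$$
$$|a - \bar{x}| < t(P, k)\,\frac{S}{\sqrt{n}} \qquad (7.3)$$
$$S = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \qquad (7.4)$$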
The values of the function t(P, k) are given in a table constructed on the
basis of the Student distribution.
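The interval implied by (7.2) to (7.4) can be sketched as follows. The tabulated value t(P, k) is passed in as a parameter rather than computed; the function names and the data are hypothetical, and 2.776 is the two-sided Student value for P = 0.95 with k = 4 degrees of freedom.

```python
from math import sqrt

def sample_mean_and_sd(xs):
    """Sample mean (7.2) and standard deviation with n - 1 (7.4)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return mean, sd

def mean_confidence_interval(xs, t_value):
    """Interval mean +/- t(P, k) * S / sqrt(n), following (7.3)."""
    mean, sd = sample_mean_and_sd(xs)
    half_width = t_value * sd / sqrt(len(xs))
    return mean - half_width, mean + half_width

data = [10, 12, 14, 16, 18]
print(sample_mean_and_sd(data))
print(mean_confidence_interval(data, t_value=2.776))
```

Passing the table value explicitly keeps the sketch free of external libraries; a statistics package could supply it instead.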
$|a - \bar{x}|$ ...
$$n \;\ldots\; t^2(P)\,\ldots, \qquad (7.5)$$
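$$t = \frac{\bar{x}_1 - \bar{x}_2}{S\sqrt{1/n_1 + 1/n_2}} \qquad (7.6)$$
Where:
$$S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{(n_1 - 1) + (n_2 - 1)}}$$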
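$$\sigma^2 \approx s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - a)^2, \qquad (7.7)$$
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - a)^2 \qquad (7.8)$$
$$\sigma^2 \approx S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \qquad (7.9),$$
$$\frac{1}{m}\sum_{i=1}^{m} S_i^2, \qquad (7.10)$$
where
m: the number of series of measurements.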
Both the response variance and the sampling variance can be estimated from
the sample results. If the individual response errors are independent, they
cancel out on average and are reflected in the formulas for estimating the
sampling variance. In the case of response variance due to recording
errors, if each operator introduces a systematic distortion, the overall
variance increases even when these distortions offset one another.
This gives rise to a component of the response variance that makes it
necessary to modify the formulas for computing the sampling variance.
Thus, if r operators are drawn at random from the total of R interview
operators, they will interview n persons, who constitute a simple random
sample. If operator i produces a net distortion $b_i$, we define the total
distortion produced by the operators as (7.11):
$$\sigma_b^2 = \frac{1}{R-1}\sum_{i=1}^{R}(b_i - \bar{b})^2, \qquad (7.11),$$
$$V(\bar{x}) = \sigma_x^2 / n \qquad (7.12)$$
To this must be added the response variance that results from the errors of
the operators, so that formula (4.2.12) becomes (4.2.13):
$$V(\bar{x}) = \sigma_x^2\,\frac{1}{n} + \sigma_b^2\,\frac{1}{r} \qquad (4.2.13)$$
from which it follows that the variance of the sample mean is composed of
two parts, with the following meanings:
$\sigma_x^2 / n$ represents the sampling variance,
and
$\sigma_b^2 \cdot \frac{1}{r}$ represents the variance due to the operators.
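The two components can be put side by side in a short numerical sketch. It assumes the reading of the formula in which the variance of the mean is the sampling part divided by n plus the between-operator part divided by r; the function names and the figures are hypothetical.

```python
def between_operator_variance(distortions):
    """Variance of the net distortions b_i across all R operators."""
    big_r = len(distortions)
    mean_b = sum(distortions) / big_r
    return sum((b - mean_b) ** 2 for b in distortions) / (big_r - 1)

def variance_of_mean(sigma2_x, n, sigma2_b, r):
    """Sampling part sigma2_x / n plus operator part sigma2_b / r (assumed form)."""
    return sigma2_x / n + sigma2_b / r

sigma2_b = between_operator_variance([1, 2, 3])
print(sigma2_b)
# 400 respondents interviewed by 10 operators.
print(variance_of_mean(sigma2_x=100, n=400, sigma2_b=sigma2_b, r=10))
# Same respondents, but spread across 40 operators.
print(variance_of_mean(sigma2_x=100, n=400, sigma2_b=sigma2_b, r=40))
```

Under this reading, increasing the number of respondents alone leaves the operator component untouched; only using more operators reduces it.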