Microeconomic Theory: Lecture 1
Prices, Markets and Efficiency

Parikshit Ghosh
Delhi School of Economics
Summer Semester, 2014

Outline: Introduction · Demand and Supply · Equilibrium · Efficient Allocations · Market Failures

Schelling's Seating Puzzle

- In a crowded auditorium, why are the first few rows empty?

Many Theories

- Hypothesis 1: Everyone prefers to sit as far back as possible.
  - Blocking off the end rows will make everyone unhappy.
- Hypothesis 2: Everyone wants to sit as much in front as possible, as long as they are behind others.
  - Blocking off the end rows will make everyone happier.
- Hypothesis 3: Everyone wants to sit near others. Provided they don't stand out, they prefer being closer to the stage.
  - Blocking off the end rows will make everyone happier. Things could have turned out better.
- Hypothesis 4: They were afraid to sit in front in school and are unthinkingly carrying the habit.

Questions for a Social Scientist

- What do people want (motives/preferences)?
- How do motives affect individual behaviour (choice)?
- How does the interaction of individual choices affect group behaviour (aggregation)?
- How do we decide which theory is true (empirical testing)?
- Is the outcome good (welfare evaluation)?
- How do we decide what is good (ethics)?
- Can the outcome be improved by some suitable intervention (policy)?

Consumers: Quasilinear Utility

- n consumers, i = 1, 2, ..., n.
- Utility functions: v_i(q_i, m_i) = u_i(q_i) + m_i, where
  - q_i = quantity of the good consumed
  - m_i = money spent on all other goods
- u_i(q_i) is the utility of consuming q_i units of the good, expressed in money equivalent.
- Diminishing marginal utility: u_i'(q_i) > 0, u_i''(q_i) < 0.
- Inada conditions: u_i'(0) = ∞, u_i'(∞) = 0.
- Consumer's budget constraint: p q_i + m_i = y_i, where
  - y_i = consumer i's income
  - p = price of the good

Utility Maximization

- Maximize utility within the budget:

    max_{q_i, m_i} u_i(q_i) + m_i   subject to   p q_i + m_i ≤ y_i

- Substitution gives an unconstrained problem:

    max_{q_i} u_i(q_i) + y_i - p q_i

- First-order necessary condition (FOC):

    u_i'(q_i) = p
    (marginal utility = price)

- The second-order sufficient condition (u_i''(q_i) < 0) holds due to diminishing marginal utility.
- An interior solution is guaranteed by the Inada conditions.

Individual Demand Functions

- The FOC of the consumer's problem gives consumer i's demand function q_i(p) as an implicit function. E.g., if u_i(q_i) = α_i log q_i, then q_i(p) = α_i / p.
- The FOC can be written as an identity:

    u_i'(q_i(p)) ≡ p

- Equating derivatives of both sides:

    u_i''(·) q_i'(p) = 1  ⟹  q_i'(p) = 1 / u_i''(·) < 0

- The implicit function theorem gives the Law of Demand.
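A minimal sympy sketch of this step, using the log example above (the functional form is from the slide; the code itself is illustrative only and assumes sympy is available):

    # Derive the demand function from the FOC u'(q) = p for u(q) = alpha*log(q).
    import sympy as sp

    q, p, alpha = sp.symbols('q p alpha', positive=True)
    u = alpha * sp.log(q)                  # utility from the good
    foc = sp.Eq(sp.diff(u, q), p)          # FOC: marginal utility = price
    demand = sp.solve(foc, q)[0]           # q(p) = alpha/p
    print(demand, sp.diff(demand, p))      # slope -alpha/p**2 < 0: Law of Demand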

Consumer's Surplus

- How much does the consumer gain from having the opportunity to buy any amount of the good he wants at p?
- Utility from optimally purchasing the good:

    u_i(q_i(p)) + y_i - p q_i(p)

- Utility from not buying the good at all: y_i. The difference is:

    u_i(q_i(p)) - p q_i(p) = ∫_0^{q_i(p)} [u_i'(q_i) - p] dq_i

Consumer's Surplus in Pictures

- CS is the area under the demand curve and above the price line.

Market Demand Function

- Market demand is the sum of individual demands:

    Q(p) = ∑_{i=1}^n q_i(p)

- If each individual demand function is downward sloping, so is the market demand function:

    Q'(p) = ∑_{i=1}^n q_i'(p) < 0

- Price elasticity of demand: how responsive is demand to a price change?

    ε = (dQ/dp) · (p/Q)

- Petrol is likely to have low price elasticity; apples high. Why?
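A small numeric sketch of the aggregation and the elasticity formula (the α values are made up for illustration):

    # Aggregate the log-example demands q_i(p) = alpha_i/p and compute the
    # price elasticity of market demand numerically.
    import numpy as np

    alphas = np.array([2.0, 3.0, 5.0])          # hypothetical taste parameters
    Q = lambda p: alphas.sum() / p              # market demand
    p = 4.0
    dQdp = (Q(p + 1e-6) - Q(p - 1e-6)) / 2e-6   # numerical derivative
    print(dQdp * p / Q(p))                      # elasticity = -1 for this form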

Firms and Profit Maximization: The Cost Function

- m firms, j = 1, 2, ..., m.
- Sunk cost S_j (cannot be recovered even if the firm shuts down).
- Fixed cost F_j (avoidable, but independent of quantity).
- Variable cost φ_j(x_j) (increases as production increases).
- Typical assumption: φ_j'(x_j) > 0, φ_j''(x_j) > 0 (increasing marginal cost).
- Total cost function: c_j(x_j) = F_j + φ_j(x_j).
- Average cost function: a_j(x_j) = c_j(x_j)/x_j = F_j/x_j + φ_j(x_j)/x_j.
- The average cost function is U-shaped.

Profit Maximization

- Assumption: firms do not face budget constraints as long as they are profitable.
- The firm's problem:

    max_{x_j} π_j(x_j) = p x_j - c_j(x_j)

- First-order (necessary) condition:

    c_j'(x_j) = p
    (marginal cost = price)

- The second-order (sufficient) condition is satisfied if φ_j''(x_j) > 0.

Supply Function

- The FOC gives the firm's supply function x_j(p) in implicit form. Written as an identity:

    c_j'(x_j(p)) ≡ p

- Equating the derivatives:

    c_j''(·) x_j'(p) = 1  ⟹  x_j'(p) = 1 / c_j''(·) > 0

- The supply function of the firm has a positive slope.
- The market supply function is the sum of the firms' supply functions:

    X(p) = ∑_{j=1}^m x_j(p)
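The same derivation can be sketched in sympy, assuming a quadratic cost (the numbers are hypothetical):

    # Derive a firm's supply function from the FOC c'(x) = p.
    import sympy as sp

    x, p = sp.symbols('x p', positive=True)
    c = 1 + x**2                          # total cost: F = 1, phi(x) = x**2
    foc = sp.Eq(sp.diff(c, x), p)         # FOC: marginal cost = price
    supply = sp.solve(foc, x)[0]          # x(p) = p/2
    print(supply, sp.diff(supply, p))     # slope 1/2 = 1/c''(.) > 0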
- Price elasticity of supply is defined the same way:

    η = (dX/dp) · (p/X)

- Producer's surplus is nothing but the firm's profit:

    p x_j(p) - c(x_j(p)) = ∫_0^{x_j(p)} [p - c'(x_j)] dx_j

- It is the area between the price line and the marginal cost curve.
- Social surplus is producers' plus consumers' surplus.

- In pictures: producer's surplus (profit) is the area above the supply curve and below the price line.


Market Clearance

Equilibrium: Definition

- A market equilibrium is a set of prices and quantities, (p*, q*, x*), such that the market clears:

    q_i* = q_i(p*);  x_j* = x_j(p*)

    ∑_{i=1}^n q_i* = ∑_{j=1}^m x_j*,  or  Q(p*) = X(p*)

- Existence: a market equilibrium exists if Q(0) > X(0).
- Uniqueness: given the slopes of the demand and supply curves, there is no more than one equilibrium.
- How does the equilibrium price emerge? An imaginary auctioneer.
- Stability: depends on the price adjustment process.
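A numeric sketch of market clearing, assuming scipy is available and using the earlier functional forms with made-up parameters:

    # Solve Q(p) = X(p) for the equilibrium price.
    from scipy.optimize import brentq

    A, m = 10.0, 4.0
    Q = lambda p: A / p                   # market demand (log-utility consumers)
    X = lambda p: m * p / 2.0             # market supply (quadratic-cost firms)
    p_star = brentq(lambda p: Q(p) - X(p), 1e-6, 100.0)
    print(p_star, Q(p_star))              # equilibrium price and quantity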

Comparative Statics

- What happens when there is a demand or supply shock?
- Introduce shift parameters into the demand and supply functions:

    Q(p; θ) = X(p; β)

- θ could represent consumer tastes; β, input prices.
- When tastes change, differentiate the equilibrium condition with respect to θ:

    Q_θ(p; θ) + Q_p(p; θ) · dp/dθ = X_p(p; β) · dp/dθ

  or,

    dp/dθ = Q_θ / (X_p - Q_p)

- The denominator is positive. If Q_θ > 0, then dp/dθ > 0: if consumer tastes shift in favour of the good, the market price will increase.

- What happens to equilibrium quantity? Using the chain rule:

    dX/dθ = X_p · dp/dθ = Q_θ X_p / (X_p - Q_p)

  which is positive if Q_θ > 0.
- Similarly, for a shift in costs, β:

    Q_p · dp/dβ = X_p · dp/dβ + X_β

  or,

    dp/dβ = X_β / (Q_p - X_p)

- If X_β > 0 (the shift raises supply at every price), this is negative in sign; with rising input costs, supply falls (X_β < 0) and the equilibrium price rises.
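The same comparative statics can be checked symbolically; a sketch with assumed functional forms:

    # Demand Q = theta/p, supply X = p/2; the equilibrium price rises with theta.
    import sympy as sp

    p, theta = sp.symbols('p theta', positive=True)
    p_star = sp.solve(sp.Eq(theta / p, p / 2), p)[0]   # p* = sqrt(2*theta)
    print(p_star, sp.diff(p_star, theta))              # dp*/dtheta > 0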

Blood Ivory

[Figure]


Seized Stockpiles

[Figure]

What to do with Stockpiles? Option 1

[Figure]


What to do with Stockpiles? Option 2

[Figure]

Ivory Trade

- The African elephant population fell from 1.3 million in 1979 to 600 thousand in 1989.
- The annual ivory trade was worth $1 billion in the 1980s, 80% of it coming from poached elephants.
- In 1989, the international ivory trade was banned by CITES (Convention on International Trade in Endangered Species).
- In 1997, CITES allowed the sale of 49 tons of ivory from Zimbabwe, Namibia and Botswana to Japan.
- In 2010, CITES turned down Zambia and Tanzania's petition to off-load their stockpile of 110 tons worth $20 million.
- In 1986, Kenya, and in 2010, the Philippines publicly destroyed their stockpiles.

Poster Against Selling Stockpiles

[Figure]

Ivory Trade

Arguments for allowing the sale of stockpiles:
- Sunk cost: the elephants are already dead.
- Governments need the money for development.
- Sales proceeds can fund conservation efforts.
- Investment in local communities can reduce incentives for poaching.

Arguments against allowing the sale of stockpiles:
- It is inherently immoral.
- It may boost demand and open up new trading.
- It sends the wrong message?

The Economic Argument for Selling

[Figures: the economic argument for selling, developed over three diagrams]

Tax Incidence: Social Security

- Government pension scheme (USA) started under the New Deal.
- Pay-as-you-go system: retirees are funded by current workers.
- Redistributive and insurance component: benefits do not rise in proportion to contributions.
- Payroll tax of 6.20% on wages up to a ceiling ($113,700 in 2013). Employers have to pay a matching 6.20%.
- As part of the stimulus programme, the Obama administration cut the workers' contribution to 4.2% from 2011.
- Argument for shifting the tax to employers: reducing inequality.
- Argument for shifting the tax to workers: creating more jobs.

The Incidence of Taxation

- Assume linear demand and supply:

    Q_d = a - b P_b
    Q_s = c + d P_s

- P_b = net price for buyers, P_s = net price for sellers.
- Suppose the government imposes a tax t per unit, with buyers paying fraction α and sellers paying fraction 1 - α.
- Let P be the market price. Then:

    P_b = P + αt
    P_s = P - (1 - α)t

- How does the choice of α affect the welfare of buyers and sellers?
- How does the overall size of the tax (t) affect welfare?
- How is the economic burden of the tax (as opposed to the legal burden captured by α) distributed across buyers and sellers?


- In equilibrium, Q_d = Q_s, i.e.

    a - b(P + αt) = c + d[P - (1 - α)t]

- Solve for the equilibrium price and quantity:

    P* = (a - c)/(b + d) + [(1 - α)d - αb] t/(b + d)
    Q* = (ad + bc)/(b + d) - bd t/(b + d)

- The net price for buyers and sellers:

    P_b = P* + αt = (a - c)/(b + d) + d t/(b + d)
    P_s = P* - (1 - α)t = (a - c)/(b + d) - b t/(b + d)
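A sympy sketch verifying the independence result that follows: the equilibrium net prices do not depend on the legal split α (all symbols kept general; assumes sympy is available):

    # Linear demand Q = a - b*P_b, supply Q = c + d*P_s, per-unit tax t.
    import sympy as sp

    P, a, b, c, d, t, alpha = sp.symbols('P a b c d t alpha', positive=True)
    Pb = P + alpha * t                    # buyers pay fraction alpha of the tax
    Ps = P - (1 - alpha) * t              # sellers pay the rest
    P_star = sp.solve(sp.Eq(a - b*Pb, c + d*Ps), P)[0]
    print(sp.simplify(sp.diff(Pb.subs(P, P_star), alpha)))  # 0
    print(sp.simplify(sp.diff(Ps.subs(P, P_star), alpha)))  # 0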

- The net price paid by each side is independent of α!
- Sellers pass on a part of the tax on them to buyers, and vice versa, because the market price adjusts.
- The economic burden depends on underlying economic fundamentals, not on the legal burden.
- For every Re 1 of tax, the burden on buyers and sellers is d/(b + d) and b/(b + d) respectively.
- It depends on relative slopes (elasticities).


More General Conditions

- Assume general demand and supply functions: Q_d(P), Q_s(P).
- Assume a per-unit tax t on sellers. Let P(t) be the equilibrium price.
- By definition,

    Q_d(P(t)) ≡ Q_s(P(t) - t)

- Equating derivatives:

    Q_d'(·) P'(t) = Q_s'(·) [P'(t) - 1]

- Rearranging terms:

    P'(t) = Q_s' / (Q_s' - Q_d')

- Multiplying the numerator and denominator by P(t)/Q(t):

    P'(t) = η / (η + e)

  where e = elasticity of demand (in absolute value) and η = elasticity of supply.
- The tax burden on buyers and sellers:

    P_b'(t) = P'(t) = η / (η + e)
    -P_s'(t) = 1 - P'(t) = e / (η + e)

- The side of the market with relatively lower elasticity bears a relatively higher tax burden.


Markets and Efficiency

Feasible Allocations

- An allocation is a vector of consumptions, productions and transfers:

    z = (q, x, t, s)

- An allocation is feasible if it meets the resource constraints of the economy:

    ∑_{i=1}^n q_i ≤ ∑_{j=1}^m x_j
    ∑_{i=1}^n t_i ≥ ∑_{j=1}^m s_j

Social Welfare Function

- Imagine a social planner who cares about everyone:

    W(v, π) ≡ W(v_1, v_2, ..., v_n; π_1, π_2, ..., π_m)

- A special case: utilitarianism (the sum of happiness):

    W(v, π) = ∑_{i=1}^n v_i + ∑_{j=1}^m π_j

- What allocation will a utilitarian social planner choose?
- How does this planner's allocation compare against the market allocation?


Utilitarian Allocation

- The planner solves:

    max_{q,x,t,s} ∑_{i=1}^n [u_i(q_i) - t_i] + ∑_{j=1}^m [s_j - c_j(x_j)]

  subject to the feasibility constraints:

    ∑_{i=1}^n q_i ≤ ∑_{j=1}^m x_j
    ∑_{i=1}^n t_i ≥ ∑_{j=1}^m s_j

- The problem simplifies to:

    max_{q,x} ∑_{i=1}^n u_i(q_i) - ∑_{j=1}^m c_j(x_j)   subject to   ∑_{i=1}^n q_i = ∑_{j=1}^m x_j

- This amounts to:

    max_{q,x} L(q, x) = ∑_{i=1}^n u_i(q_i) - ∑_{j=1}^m c_j(x_j) + λ[∑_{j=1}^m x_j - ∑_{i=1}^n q_i]

- The FOC is the same as that for the market allocation:

    u_i'(q_i) = λ = c_j'(x_j) for all i, j
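A tiny sympy sketch comparing the planner's FOC with the market outcome, for assumed one-consumer, one-firm functional forms:

    # With u(q) = log q and c(x) = q**2/2, planner and market pick the same q.
    import sympy as sp

    q, p = sp.symbols('q p', positive=True)
    u, c = sp.log(q), q**2 / 2
    planner_q = sp.solve(sp.Eq(sp.diff(u, q), sp.diff(c, q)), q)[0]  # u' = c'
    demand = sp.solve(sp.Eq(sp.diff(u, q), p), q)[0]                 # u'(q) = p
    supply = sp.solve(sp.Eq(sp.diff(c, q), p), q)[0]                 # c'(x) = p
    p_star = sp.solve(sp.Eq(demand, supply), p)[0]
    print(planner_q, demand.subs(p, p_star))                         # both give 1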


Pareto Optimality: Definition

- Agents' utilities under allocation z:

    v_i(z) = u_i(q_i) - t_i
    π_j(z) = s_j - c_j(x_j)

- Set of all feasible allocations = F.
- An allocation z (weakly) Pareto dominates another allocation z′ if

    v_i(z) ≥ v_i(z′) for all i
    π_j(z) ≥ π_j(z′) for all j

  with the inequality strict for some i or j.
- An allocation z ∈ F is Pareto optimal if no other feasible allocation Pareto dominates it.

Pareto Optimal Allocations

- The planner maximizes some welfare function W(v, π).
- The optimization problem:

    max W(v(z), π(z)) + λ_1 [∑_{j=1}^m x_j - ∑_{i=1}^n q_i] + λ_2 [∑_{i=1}^n t_i - ∑_{j=1}^m s_j]

- The FOCs are:

    W_{v_i} u_i'(q_i) = λ_1 = W_{π_j} c_j'(x_j)
    W_{v_i} = λ_2 = W_{π_j}

- Combining, we get the same condition as market equilibrium:

    u_i'(q_i) = c_j'(x_j)


The Invisible Hand: Adam Smith

"It is not from the benevolence of the butcher, the brewer or the baker that we expect our dinner, but from their regard to their own self-interest... [Every individual] intends only his own security, only his own gain. And he is in this led by an invisible hand to promote an end which was no part of his intention. By pursuing his own interest, he frequently promotes that of society more effectually than when he really intends to promote it."
(Adam Smith, The Wealth of Nations)

- The first fundamental theorem of welfare economics: every market allocation is Pareto efficient.

- In a barter economy, market exchange exhausts all mutual gains from trade.
- In a production economy, if consumers highly value a good, the high market price incentivizes firms to produce more of it.
- If producing a good is very costly, the high market price incentivizes consumers to reduce its usage.
- The price system is like a thermostat that strikes the right balance between utility and cost.
- A planner can in principle replicate market outcomes, but often lacks information on consumer tastes and firm costs.
- Warning: don't ascribe too much power to the invisible hand!


Market Interventions

Deadweight Loss of Taxation

[Figures: the deadweight loss of taxation, developed over a sequence of diagrams]

- To raise Rs 100 in taxes, the cost imposed on consumers and producers is more than Rs 100.
- The excess cost is the deadweight loss: it arises because quantity choices are distorted.
- The tax revenue itself is not a loss; it can be returned.
- This does not mean there should be no taxes. Taxes can:
  - fund public goods
  - correct externalities
  - correct the income distribution
- While doing cost-benefit analysis, the cost should include not only the taxes but also the deadweight loss.
- Subsidies, being negative taxes, have the opposite distortion: they lead to over-consumption.
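A back-of-the-envelope sketch of the deadweight-loss triangle for linear curves (the demand and supply intercepts are made up):

    # Demand P = 10 - Q, supply P = Q; the DWL of a per-unit tax t is the
    # triangle between the curves over the lost quantity.
    def dwl(t):
        q0 = 5.0                 # no-tax equilibrium: 10 - Q = Q
        q1 = (10.0 - t) / 2.0    # with tax: 10 - Q = Q + t
        return 0.5 * t * (q0 - q1)

    for t in (1.0, 2.0, 4.0):
        print(t, dwl(t))         # equals t**2/4: DWL rises with the square of t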


Price Ceilings and Floors

[Figures]

Quotas

[Figure]

When Are Markets Inefficient?

Caveat 1: Externalities

"The reason that the invisible hand often seems invisible is that it is often not there. Whenever there are 'externalities', where the actions of an individual have impacts on others for which they do not pay, or for which they are not compensated, markets will not work well... Markets, by themselves, produce too much pollution. Markets, by themselves, also produce too little basic research... The real debate today is about finding the right balance between the market and government... Both are needed. They can each complement each other."
(Joseph Stiglitz)


Caveat 2: Public Goods

- (Pure) public goods, unlike private goods, are:
  - non-rival: one person's consumption doesn't reduce another's (e.g., parks, knowledge, music).
  - non-excludable: once the good is produced, everyone has access to it (clean air, national defence).
- Voluntary contribution to public goods is subject to a free-rider problem.
- Government provision has its own problems (corruption, waste).
- A good may be non-rival in consumption but only up to a point (congestion effects).
- Club goods: non-rival but excludable.

Caveat 3: Imperfect Competition

"People of the same trade seldom meet together, even for merriment and diversion, but the conversation ends in a conspiracy against the public, or in some contrivance to raise prices."
(Adam Smith)

- The first welfare theorem only obtains under a perfectly competitive market structure where all agents are price takers.
- A monopolist restricts output below what is socially optimal (creates scarcity) to maximize profits.
- Even with limited competition (oligopoly), efficiency is usually not reached.


Caveat 4: Distributive Justice

- Efficiency and equity are distinct concepts.
- Typically, the set of Pareto efficient allocations is very large.
- Pareto efficiency is a weak criterion for judging welfare.
- Some allocations may be efficient but highly unequal or unjust.
- Markets have no inherent tendency to bring about equity.
- Reasons for intervention (efficiency/equity) should be spelt out.

Microeconomic Theory: Lecture 2
Choice Theory and Consumer Demand

Parikshit Ghosh
Delhi School of Economics
Summer Semester, 2014

Outline: The Axiomatic Approach · Demand Functions · Applications


Definitions and Axioms

Binary Relations

- Examples: "taller than", "friend of", "loves", "hates", etc.
- Abstract formulation: a binary relation R defined on a set of objects X may connect any two elements of the set by the statement xRy and/or the statement yRx.
- R may or may not have certain abstract properties, e.g.
  - Commutativity: for all x, y, xRy ⟹ yRx. Satisfied by "classmate of" but not "son of".
  - Reflexivity: for all x, xRx. Satisfied by "at least as rich as" but not "richer than".
  - Transitivity: for all x, y, z, xRy and yRz ⟹ xRz. Satisfied by "taller than" but not "friend of".
- Based on observation, we can often make general assumptions about a binary relation we are interested in studying.

The Preference Relation

- The preference relation is a particular binary relation.
- There are n goods, labeled i = 1, 2, ..., n.
- x_i = quantity of good i.
- A consumption bundle/vector x = (x_1, x_2, ..., x_n) ∈ R^n_+.
- Let ≿ denote "at least as good as" or "weakly preferred to".
- x¹ ≿ x² means that, to the agent, the consumption bundle x¹ is at least as good as the consumption bundle x².
- ≿ is a binary relation which describes the consumer's subjective preferences.

Other (Derived) Binary Relations

- The strict preference relation ≻ can be defined as:

    x¹ ≻ x² if x¹ ≿ x² but not x² ≿ x¹

- The indifference relation ∼ can be defined as:

    x¹ ∼ x² if x¹ ≿ x² and x² ≿ x¹

- Some properties of ≿ (e.g. transitivity) may imply similar properties for ≻ and ∼.

The Axioms

- Axiom 1 (Completeness): For all x¹, x² ∈ R^n_+, either x¹ ≿ x² or x² ≿ x¹ (or both).
  - The decision maker knows her mind.
  - Rules out dithering, confusion, inconsistency.
- Axiom 2 (Transitivity): For all x¹, x², x³ ∈ R^n_+, if x¹ ≿ x² and x² ≿ x³, then x¹ ≿ x³.
  - There are no preference loops or cycles. There is a quasi-ordering over the available alternatives.
  - Without some kind of ordering, it would be difficult to choose the best alternative.

The Axioms (contd.)

- Axiom 3 (Continuity): For any sequence (x^m, y^m)_{m=1}^∞ such that x^m ≿ y^m for all m, lim_{m→∞} x^m = x and lim_{m→∞} y^m = y, it must be that x ≿ y.
  - Equivalent definition: for all x ∈ R^n_+, the contour sets ≿(x) and ≾(x) are closed sets.
  - Bundles which are close in quantities are close in preference.
- Axiom 4 (Strict Monotonicity): For all x¹, x² ∈ R^n_+, x¹ ≥ x² implies x¹ ≿ x², and x¹ ≫ x² implies x¹ ≻ x².
  - The more, the merrier.
  - Bads (e.g. pollution) can simply be defined as negative goods.

Preference Representation

The Preference Representation Theorem

Theorem. If ≿ satisfies Axioms 1-4, then there exists a continuous, increasing function u: R^n_+ → R which represents ≿, i.e. for all x¹, x² ∈ R^n_+, x¹ ≿ x² ⟺ u(x¹) ≥ u(x²).

- The function u(·) may be called a utility function, but it is really an artificial construct that represents preferences in a mathematically tractable way.
- In cardinal choice theory, the utility function is a primitive.
- In ordinal choice theory, the preference ordering is the primitive and the utility function is a derived object.

Proof in Two Dimensions

- Step 1: For any x, there is a unique symmetric bundle (z, z) such that x ∼ (z, z).
- Step 2: u(x) = z represents ≿.
- Let Z⁺ = {z | (z, z) ≿ x} and Z⁻ = {z | x ≿ (z, z)}.
- These must be of the form Z⁺ = [z̄, ∞) and Z⁻ = [0, z̲].
- Continuity ensures the sets are closed; monotonicity ensures there are no holes.
- It remains to show that z̄ = z̲.

Proof (contd.)

- Case 1: the sets are disjoint.
  - Suppose z̲ < z̄.
  - Then for any z with z̲ < z < z̄, completeness is violated.
- Case 2: the sets overlap.
  - Suppose z̲ > z̄.
  - Then for any z with z̄ < z < z̲, (z, z) ∼ x.
  - Strict monotonicity is violated.
- The construction represents preferences:
  - Suppose x¹ ≿ x². Let (z_1, z_1) ∼ x¹ and (z_2, z_2) ∼ x².
  - Then (z_1, z_1) ≿ (z_2, z_2) (transitivity) ⟹ z_1 ≥ z_2 (strict monotonicity).
  - Given the construction, u(x¹) ≥ u(x²).

Invariance to Monotone Transformation

Theorem. If u(·) represents ≿, and f: R → R is a strictly increasing function, then v(x) = f(u(x)) also represents ≿.

- There is no unique function that represents preferences, but an entire class of functions.
- Example: suppose preferences are captured by the Cobb-Douglas utility function:

    u(x) = x_1^α x_2^β

- The same preferences can also be described by:

    v(x) = log u(x) = α log x_1 + β log x_2
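A one-screen sympy check that a monotone transformation leaves the indifference map (the MRS) unchanged (illustrative sketch, assuming sympy is available):

    # Cobb-Douglas u = x1**a * x2**b and its log transform have the same MRS.
    import sympy as sp

    x1, x2, a, b = sp.symbols('x1 x2 a b', positive=True)
    u = x1**a * x2**b
    v = sp.log(u)                                       # monotone transform
    mrs = lambda f: sp.simplify(sp.diff(f, x1) / sp.diff(f, x2))
    print(mrs(u), mrs(v))                               # both equal a*x2/(b*x1)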

Preference for Diversity

- Axiom 5 (Convexity): If x¹ ∼ x², then λx¹ + (1 - λ)x² ≿ x¹, x² for all λ ∈ [0, 1].
- Axiom 5A (Strict Convexity): If x¹ ∼ x², then λx¹ + (1 - λ)x² ≻ x¹, x² for all λ ∈ (0, 1).

Definition. A function f(x) is (strictly) quasiconcave if, for every x¹, x²:

    f(λx¹ + (1 - λ)x²) ≥ (>) min{f(x¹), f(x²)}

Theorem. u(·) is (strictly) quasiconcave if and only if ≿ is (strictly) convex.

Indifference Curves

- The indifference curve through x⁰ is the set of all bundles just as good as x⁰:

    I(x⁰) = {x | x ∼ x⁰} = {x | u(x) = u(x⁰)}

- It is also the boundary of the upper and lower contour sets, ≿(x⁰) and ≾(x⁰).
- Deriving the slope of the indifference curve (the marginal rate of substitution, MRS) in two dimensions:

    u(x_1, x_2) = ū  ⟹  (∂u/∂x_1)dx_1 + (∂u/∂x_2)dx_2 = 0

    dx_2/dx_1 = -(∂u/∂x_1)/(∂u/∂x_2) = -u_1/u_2 < 0

The Indifference Map

[Figures: indifference maps in (x_1, x_2) space]

Properties of Indifference Curves

- Curves, not bands (strict monotonicity).
- No jumps (continuity).
- Downward sloping (strict monotonicity).
- Convex to the origin (convexity).
- Higher indifference curves represent more preferred bundles (strict monotonicity).


Optimization

The Consumer's Problem

- The budget set B is the set of bundles the consumer can afford. Assuming linear prices p = (p_1, p_2, ..., p_n) and income y:

    B = {x | ∑_{i=1}^n p_i x_i ≤ y} = {x | p·x ≤ y}

- The budget line is the boundary of the budget set.
- The consumer's problem: choose x* ∈ B such that x* ≿ x for all x ∈ B.
- This can be obtained by solving:

    max_x u(x) subject to y - p·x ≥ 0, x_i ≥ 0

Simplifying the Problem

- Suppose x* ∈ argmax_{x∈S} f(x). If x* ∈ S′ ⊆ S, then x* ∈ argmax_{x∈S′} f(x).
- We can solve a problem by ignoring some constraints and later checking that the solution satisfies these constraints.
- If we know (by inspection) that the solution to a problem will satisfy certain constraints, we can try to solve it by adding these constraints to the problem.
- Strict monotonicity of preferences implies no money will be left unspent, i.e. y - p·x = 0.
- Solve the simpler problem:

    max_x u(x) subject to y - p·x = 0

- If the solution satisfies x_i ≥ 0, then it is the true solution.

Lagrange's Method

- Let x* be the (interior) solution to:

    max_x f(x) subject to g_j(x) = 0, j = 1, 2, ..., m

- Then there is a λ* = (λ_1*, λ_2*, ..., λ_m*) such that (x*, λ*) is a critical point (zero derivatives) of:

    L(x, λ) ≡ f(x) + ∑_{j=1}^m λ_j g_j(x)

- We can find the solution to a constrained optimization problem (harder) by solving an unconstrained problem (easier).

Application to the Consumer's Problem

- The consumer solves (assuming an interior solution):

    max_x u(x) subject to y - p·x = 0

- The Lagrangian is:

    L(x, λ) ≡ u(x) + λ[y - ∑_{i=1}^n p_i x_i]

- First-order necessary conditions:

    ∂L/∂x_i = ∂u/∂x_i - λp_i = 0
    ∂L/∂λ = y - ∑_{i=1}^n p_i x_i = 0

Simplifying and Solving

- It is useful to eliminate the artificial variable λ.
- Dividing the i-th first-order condition by the j-th:

    (∂u/∂x_i) / (∂u/∂x_j) = p_i / p_j
    (|MRS_ij| = price ratio)

- In two dimensions, this means that at the optimum, the slope of the indifference curve equals the slope of the budget line.
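A sympy sketch of the tangency condition in action, for an assumed Cobb-Douglas utility:

    # Solve MRS = price ratio plus the budget line for the Cobb-Douglas demands.
    import sympy as sp

    x1, x2, p1, p2, y, a = sp.symbols('x1 x2 p1 p2 y a', positive=True)
    u = x1**a * x2**(1 - a)
    mrs = sp.diff(u, x1) / sp.diff(u, x2)              # |MRS| = u1/u2
    sol = sp.solve([sp.Eq(mrs, p1/p2), sp.Eq(p1*x1 + p2*x2, y)],
                   [x1, x2], dict=True)[0]
    print(sp.simplify(sol[x1]), sp.simplify(sol[x2]))  # a*y/p1, (1-a)*y/p2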

Consumer's Optimum in Pictures

[Figures: tangency of the indifference curve and the budget line in (x_1, x_2) space, over four diagrams]

Optimization: Read the Fine Print!

- Sometimes, the first-order conditions describe a minimum rather than a maximum.
- Need to check second-order conditions to make sure.
- It may only be a local maximum, not a global maximum.
- If there is a unique local maximum, it must be a global maximum.
- Sometimes, the true maximum is at the boundary of the feasible set (a corner solution) rather than in the interior.
- The Kuhn-Tucker conditions generalize to both interior and corner solutions.


Second-Order Sufficient Conditions

- Consider the problem with a single equality constraint:

    max_x f(x) subject to g(x) = 0

- Suppose x* satisfies the first-order necessary conditions derived by the Lagrange method.
- The bordered Hessian matrix is defined as:

    H̄ = | 0    g_1   g_2   ...  g_n  |
        | g_1  L_11  L_12  ...  L_1n |
        | g_2  L_21  L_22  ...  L_2n |
        | ...                        |
        | g_n  L_n1  L_n2  ...  L_nn |

- x* is a local maximum of the constrained problem if the principal minors of H̄ alternate in sign, starting with positive.

Uniqueness and Global Maximum

- For the consumer's problem, the bordered Hessian is:

    H̄ = | 0    p_1   p_2   ...  p_n  |
        | p_1  u_11  u_12  ...  u_1n |
        | p_2  u_21  u_22  ...  u_2n |
        | ...                        |
        | p_n  u_n1  u_n2  ...  u_nn |

- Suppose x* ≥ 0 solves the f.o.c. obtained by the Lagrange method. If u(·) is quasiconcave, then x* is a constrained maximum.
- If u(·) is strictly quasiconcave, the solution is unique.


Constrained Optimization

- The problem: max f(x; a) subject to x ∈ S(a).
- x is a vector of endogenous variables (choices); a is a vector of exogenous variables (parameters).
- f(x; a) is the objective function. S(a) is the feasible set (it may be described by equalities or inequalities).
- The choice function gives the optimal values of the choices, as a function of the parameters:

    x(a) = argmax_{x∈S(a)} f(x; a)

- The value function gives the optimized value of the objective function, as a function of the parameters:

    v(a) = max_{x∈S(a)} f(x; a) ≡ f(x(a); a)

The Implicit Function Theorem

- Consider a system of n continuously differentiable equations in n variables, x, and m parameters, a: f^i(x; a) = 0, i = 1, 2, ..., n.
- The Jacobian matrix J is the matrix of partial derivatives of the system of equations:

    J = | ∂f¹/∂x_1  ∂f¹/∂x_2  ...  ∂f¹/∂x_n |
        | ∂f²/∂x_1  ∂f²/∂x_2  ...  ∂f²/∂x_n |
        | ...                               |
        | ∂fⁿ/∂x_1  ∂fⁿ/∂x_2  ...  ∂fⁿ/∂x_n |

- If |J| ≠ 0, there exist explicit solutions described by continuously differentiable functions: x_i = g_i(a), i = 1, 2, ..., n.


The Implicit Function Theorem (contd.)

- The response of the endogenous variables x to changes in some parameter a_k can be characterized without explicitly solving the system of equations.
- Differentiating the identities f^i(x(a); a) ≡ 0, we get:

    J · Dx(a_k) = -Df(a_k)

  where

    Dx(a_k)ᵗ = (dx_1/da_k, dx_2/da_k, ..., dx_n/da_k)
    Df(a_k)ᵗ = (∂f¹/∂a_k, ∂f²/∂a_k, ..., ∂fⁿ/∂a_k)

- Applying Cramer's rule, we get:

    dx_i/da_k = |J_i| / |J|

  where J_i is the matrix obtained by replacing the i-th column of J with -Df(a_k).

The Envelope Theorem

- Consider the value function:

    v(a) = max_x f(x; a) subject to g_j(x; a) = 0, j = 1, 2, ..., m

- The Lagrangian is:

    L(x, λ; a) ≡ f(x; a) + ∑_{j=1}^m λ_j g_j(x; a)

- Suppose all functions are continuously differentiable. Then:

    ∂v(a)/∂a_k = ∂L(x, λ; a)/∂a_k   (evaluated at the optimum)

- Intuition: a change in a parameter affects the objective function (a) directly and (b) indirectly, via induced changes in choices. The indirect effects can be ignored, due to the f.o.c.


Illustration: Single Variable Unconstrained Optimum

- Consider the simple problem: max_x f(x; a).
- Let v(a) be the value function and x(a) the choice function.
- First-order condition, written as an identity:

    f_x(x(a); a) ≡ 0

- Equating derivatives of both sides (implicit function theorem):

    f_xx x′(a) + f_xa = 0  ⟹  x′(a) = -f_xa / f_xx

- Since f_xx < 0 by the s.o.c., the sign depends on f_xa.
- Value function as an identity: v(a) ≡ f(x(a); a).
- Equating derivatives of both sides (envelope theorem):

    v′(a) = f_x x′(a) + f_a = f_a   (since f_x = 0)
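The illustration is easy to reproduce in sympy; a sketch with an assumed objective function:

    # max_x f(x; a) = -(x - a)**2 + a: the envelope theorem says v'(a) = f_a.
    import sympy as sp

    x, a = sp.symbols('x a', real=True)
    f = -(x - a)**2 + a
    x_a = sp.solve(sp.diff(f, x), x)[0]                 # choice function x(a) = a
    v = f.subs(x, x_a)                                  # value function v(a) = a
    print(sp.diff(v, a), sp.diff(f, a).subs(x, x_a))    # both equal 1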

Demand Functions

- The Marshallian demand function is the choice function of the consumer's problem:

    x(p, y) = argmax_x u(x) subject to y - p·x ≥ 0, x_i ≥ 0

- The indirect utility function is the value function of the consumer's problem:

    v(p, y) = u(x(p, y))

- Interesting comparative statics questions:
  - How is the demand for a good (x_i) affected by changes in (i) its own price (p_i), (ii) the price of another good (p_j), (iii) income?
  - What is the effect on consumer welfare (better off or worse off? by how much?) of changes in prices or incomes?

Choice and Demand


The Axiomatic Approach

Demand Functions

Applications

Functional Properties

Properties of the Indirect Utility Function


I

Continuous (objective function and budget set are


continuous).

Homogeneous of degree 0 (budget set remains unchanged).

Strictly increasing in y (budget set exapands).

Decreasing in pi (budget set contracts).

Quasiconvex in (p, y ). (due to quasiconcavity of u (.))

Roys Identity (assuming dierentiability): Marshallian


demand function can be derived from indirect utility function
xi (p, y ) =

Parikshit Ghosh
Choice and Demand

v (p,y )
p i
v (p,y )
y
Delhi School of Economics

Proof of Roy's Identity

- The Lagrangian function (assuming an interior solution):

    L(x, λ) = u(x) + λ(y - p·x)

- Using the envelope theorem:

    ∂v(p, y)/∂p_i = ∂L/∂p_i = -λx_i
    ∂v(p, y)/∂y = ∂L/∂y = λ

- Divide to get the result.
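A quick sympy verification of Roy's identity in the Cobb-Douglas case (the demand forms are assumed, not derived here):

    # x1 = a*y/p1 and the implied indirect utility satisfy Roy's identity.
    import sympy as sp

    p1, p2, y, a = sp.symbols('p1 p2 y a', positive=True)
    x1, x2 = a*y/p1, (1 - a)*y/p2                  # Marshallian demands
    v = x1**a * x2**(1 - a)                        # v(p, y) = u(x(p, y))
    roy = -sp.diff(v, p1) / sp.diff(v, y)          # Roy's identity
    print(sp.simplify(roy - x1))                   # 0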


Duality Theory

- Consider the mirror image (dual) problem:

    min_x p·x subject to u(x) ≥ u, x_i ≥ 0

- Achieve a target level of utility at the lowest cost, rather than achieve the highest level of utility for a given budget.
- The Hicksian demand function x^h(p, u) is the choice function of this problem.
- The expenditure function e(p, u) is the value function.

Theorem. Suppose f(x) and g(x) are increasing functions. Then f* = max_x f(x) subject to g(x) ≤ ḡ if and only if ḡ = min_x g(x) subject to f(x) ≥ f*.

Some Duality Based Relations

- Suppose ū is the maximized value of utility at price vector p and income y.
- Duality says that y is the minimum amount of money needed to achieve utility ū at prices p.
- Since utility maximization and expenditure minimization are dual problems, their choice and value functions must be related:

    x_i(p, y) = x_i^h(p, v(p, y))
    x_i^h(p, u) = x_i(p, e(p, u))
    e(p, v(p, y)) = y
    v(p, e(p, u)) = u


Properties of the Expenditure Function

- e(p, u(0)) = 0.
- Continuous (the objective function and the feasible set are continuous).
- For all p ≫ 0, strictly increasing in u and unbounded above.
- Increasing in p_i (cost increases for every choice).
- Homogeneous of degree 1 in p (the optimal choice is unchanged).
- Concave in p.
- Shephard's Lemma (assuming differentiability): Hicksian demand functions can be derived from the expenditure function:

    x_i^h(p, u) = ∂e(p, u)/∂p_i

Proof: Concavity and Shephard's Lemma

- Suppose x¹ minimizes expenditure at p¹, and x² at p².
- Let x̄ minimize expenditure at p̄ = λp¹ + (1 - λ)p². By definition:

    p¹x¹ ≤ p¹x̄
    p²x² ≤ p²x̄

- Combining the two inequalities:

    λp¹x¹ + (1 - λ)p²x² ≤ [λp¹ + (1 - λ)p²]·x̄ = p̄·x̄

  or,

    λe(p¹, u) + (1 - λ)e(p², u) ≤ e(λp¹ + (1 - λ)p², u)

- Shephard's lemma is obtained by applying the envelope theorem.


The Slutsky Equation

Theorem. Suppose p ≫ 0 and y > 0, and u = v(p, y). Then:

    ∂x_i(p, y)/∂p_j = ∂x_i^h(p, u)/∂p_j - x_j(p, y) ∂x_i(p, y)/∂y
                      (substitution effect)  (income effect)

- Substitution effect: the change in consumption that would arise if the consumer were compensated to preserve real income.
- Income effect: the further change in consumption which is due to the drop in real income.

Proof of the Slutsky Equation

- By duality (note: an identity):

    x_i^h(p, u) ≡ x_i(p, e(p, u))

- Differentiating w.r.t. p_j:

    ∂x_i^h(p, u)/∂p_j = ∂x_i(p, e(p, u))/∂p_j + [∂x_i(p, e(p, u))/∂y] · [∂e(p, u)/∂p_j]

- From Shephard's Lemma:

    ∂e(p, u)/∂p_j = x_j^h(p, u) = x_j^h(p, v(p, y)) = x_j(p, y)

- Using the above:

    ∂x_i^h(p, u)/∂p_j = ∂x_i(p, y)/∂p_j + x_j(p, y) ∂x_i(p, y)/∂y
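A sympy check of the Slutsky decomposition in the Cobb-Douglas case (the indirect utility and expenditure forms are standard but assumed here rather than derived):

    import sympy as sp

    p1, p2, y, u, a = sp.symbols('p1 p2 y u a', positive=True)
    x1 = a*y/p1                                    # Marshallian demand
    v = y * (a/p1)**a * ((1 - a)/p2)**(1 - a)      # indirect utility
    e = u * (p1/a)**a * (p2/(1 - a))**(1 - a)      # expenditure function
    x1h = sp.diff(e, p1)                           # Hicksian demand (Shephard)
    subst = sp.diff(x1h, p1).subs(u, v)            # substitution effect
    income = -x1 * sp.diff(x1, y)                  # income effect
    print(sp.simplify(sp.diff(x1, p1) - (subst + income)))   # 0: Slutsky holds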


Testable Implications: Properties of Marshallian Demand

- Budget balancedness: p·x(p, y) = y (due to strict monotonicity).
- Homogeneity of degree 0: x(λp, λy) = x(p, y) (the budget set does not change).
- The matrix H of Hicksian price responses is symmetric and negative semi-definite, where:

    H = | ∂x_1^h/∂p_1  ∂x_1^h/∂p_2  ...  ∂x_1^h/∂p_n |
        | ∂x_2^h/∂p_1  ∂x_2^h/∂p_2  ...  ∂x_2^h/∂p_n |
        | ...                                        |
        | ∂x_n^h/∂p_1  ∂x_n^h/∂p_2  ...  ∂x_n^h/∂p_n |
      = [∂²e(p, u)/∂p_i ∂p_j]

- Each ∂x_i^h/∂p_j is observable thanks to the Slutsky equation.

The Law of Demand: A Critical Look

- Are demand curves necessarily downward sloping? Slutsky tells us:

    ∂x_i(p, y)/∂p_i = ∂x_i^h(p, u)/∂p_i - x_i(p, y) ∂x_i(p, y)/∂y,  where  ∂x_i^h/∂p_i = ∂²e(p, u)/∂p_i² < 0

- For a normal good (∂x_i/∂y > 0), the law of demand holds (∂x_i/∂p_i < 0).
- For an inferior good (∂x_i/∂y < 0), it may or may not hold.
- Giffen goods are those which have positively sloped demand curves (∂x_i/∂p_i > 0).
- A Giffen good must be (a) inferior and (b) an important item of consumption (x_i large).


Charity: Cash vs. Kind

Are In-Kind Donations Inefficient?

- Many kinds of altruistic transfers are in-kind or targeted subsidies:
  - Employer matching grants to pension funds
  - Government subsidized health care
  - Tied aid by the World Bank
  - Book grants (as opposed to cash stipends) for students
  - Birthday or Diwali gifts
- The donor can make the recipient equally well off at lower cost if he gives assistance in cash rather than a targeted subsidy.
- Rough idea: each Rupee of cash grant will be more valuable to the recipient, since he can allocate it to suit his taste.

The Economics of Seinfeld

[Figure]


Distant Uncles vs Close Friends

- Gifts are not merely transfers of resources; they may also be signals of intimacy.
- A good test of intimacy is whether the donor has paid attention to the recipient's interests and preferences.
- Giving the wrong gift is failing the test.
- Giving a cash gift is refusing to take the test.
- As social beings, we must take the test!

Anomalies

Framing Effect

- Kahneman and Tversky (1981): suppose 600 people will be subjected to a medical treatment against some deadly disease.
- Decision problem 1: which do you prefer?
  - Treatment A: 200 people will be saved.
  - Treatment B: everyone is saved (prob 1/3) or no one is saved (prob 2/3).
- Decision problem 2: which do you prefer?
  - Treatment C: 400 people will die.
  - Treatment D: everyone dies (prob 2/3) or no one dies (prob 1/3).
- In surveys, most people say: A ≻ B (72%), D ≻ C (78%).


Sunk Cost Fallacy

- Experiment conducted by Richard Thaler.
- Patrons at Pizza Hut were offered a deal: a $3 entry fee, then eat as much pizza as you like.
- The entry fee was returned to half the subjects (randomly chosen), who could still eat as much pizza as they wished.
- Those who got back the money ate significantly less.
- However, the extra or marginal cost of pizza is the same for both groups.
- Once inside, the entry fee is a sunk cost: a cost that cannot be recovered no matter what you do.

Non-Consequentialism: Cake Division

- From Sen's article "Rational Fools".
- Laurel and Hardy have 2 cakes: big and small.
- Laurel asks Hardy to divide. Hardy takes the big one himself.
- Laurel: "If I were doing it, I'd take the small one."
- Hardy: "That's what you've got. What's the problem?"
- Hardy's preference does not depend on the consequence (who gets what) alone.

Choice and Demand


The Axiomatic Approach

Demand Functions

Applications

Anomalies

Other Regarding Preferences: Generosity


I

Sahlgrenska University Hospital, Gothenberg, Sweden.

262 subjects (undergraduates) divided into 3 groups and


asked if they will donate blood:
I
I
I

Treatment 1: no rewards oered.


Treatment 2: compensation of SEK 50 (US $7) for donation.
Treatment 3: SEK 50 to be donated to charity.

Personal payment of SEK 50 can always be donated to charity!

Subjects drawn from 3 disciplines: (i) medicine (ii) economics


and commercial law (iii) education.

Those who donated blood in the previous 5 years excluded.

Parikshit Ghosh
Choice and Demand

Delhi School of Economics

The Swedish Experiment: Results

[Figure]

Is Learning Economics Socially Harmful?

[Figure]

Other Regarding Preferences: Envy

[Figure]


The Ultimatum Game

- The proposer must divide some money between himself and the receiver.
- The receiver can either accept the proposed split or reject it.
- If the receiver rejects, both players get 0.
- Money-minded rationalists: split = (99%, 1%).
- Experimental results: median offers are 40%+.
- High rejection rates for offers less than 30%.

Time Inconsistent Preferences

- Odysseus and the sirens.
- The smoker's dilemma: wants to quit but cannot.
- Procrastination: more than just laziness.
- The agent seemingly has multiple selves with conflicting preferences.
- Prediction: how will such an agent behave?
- Ethics: which of several conflicting preferences should others respect?
- Welfare: how to evaluate such an agent's welfare?
- Paternalism and welfarism become less distinct concepts.

Choice and Demand


The Axiomatic Approach

Demand Functions

Applications

Anomalies

Two Choice Problems


I

Problem 1: Which do you prefer?


I
I

(A) Rs 1 lakh now


(B) Rs 1 lakh + Rs 100 next week

Problem 2: Which do you prefer?


I
I

(C) Rs 1 lakh one year from now


(D) Rs 1 lakh + Rs 100 a year and one week from now

Most people answer: A

Suppose you choose D over C . But a year later, you will want
to reverse your choice!

This pattern found in humans, rats and pigeons (Ainslie


(1974)).

Parikshit Ghosh
Choice and Demand

B and D

C.

Delhi School of Economics

The Cake Eating Problem with Geometric Discounting

- A consumer has a cake of size 1 which can be consumed over dates t = 0, 1, 2, ...
- The cake neither grows nor shrinks over time (an exhaustible resource, like petroleum).
- The consumer's utility at date t is:

    U_t = u(c_t) + δu(c_{t+1}) + δ²u(c_{t+2}) + ...

- u(·) is instantaneous utility (strictly concave); δ ∈ (0, 1) is the discount factor.
- At date 0, the consumer's problem is to choose a sequence of consumptions {c_t}_{t=0}^∞ to solve:

    max_{{c_t}} ∑_{t=0}^∞ δᵗ u(c_t)   subject to   ∑_{t=0}^∞ c_t = 1

Choice and Demand


The Axiomatic Approach

Demand Functions

Applications

Anomalies

Time Consistency of the Optimal Path


I
I
I

Let fct gt=0 be the optimal consumption path at date 0.


If the consumer gets the chance to revise her own plan at date
t, will she do so (i.e. is the consumer dynamically consistent)?
Suppose at some date t, the amount of cake left is c. At any
bt < t, the consumersoptimal plan for t onwards is:

max

f c g =t = t
I

b
t

u (c ) subject to

c = c

=t

The Lagrangian is

L(c, ) =
Parikshit Ghosh
Choice and Demand

=t

b
t

"

u (c ) + c

=t

#
Delhi School of Economics

- First-order condition:

    δ^{τ-t̂} u′(c_τ) = λ

- Eliminating λ:

    u′(c_τ) / u′(c_{τ+1}) = δ
    (intertemporal MRS = discount factor)

- Note that this is independent of t̂, the date at which the plan is being made.
- The consumer will not want to change her plans later.


Logarithmic Utility

- Suppose u(c) = log c.
- From the first-order condition:

    c_{t+1} = δc_t  ⟹  c_t = δᵗ c_0

- Using the budget constraint:

    c_0 + δc_0 + δ²c_0 + ... = 1  ⟹  c_0 = 1 - δ

    c_t = (1 - δ)δᵗ

- In every period, consume a 1 - δ fraction of the remaining cake, and save a δ fraction.
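A five-line simulation of this consumption rule (δ = 0.9 is an arbitrary choice):

    # The path c_t = (1-delta)*delta**t eats a 1-delta share of what is left.
    delta, cake = 0.9, 1.0
    for t in range(5):
        c = (1 - delta) * delta**t                   # optimal consumption at t
        print(t, round(c, 4), round(c / cake, 4))    # share of remainder: 0.1
        cake -= c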

Quasi-Hyperbolic Discounting and Cake Eating

- Suppose:

    U_t = u(c_t) + β ∑_{τ=t+1}^∞ δ^{τ-t} u(c_τ)

- The Lagrangian for the date 0 problem is:

    L(c, λ) = u(c_0) + β ∑_{τ=1}^∞ δ^τ u(c_τ) + λ[1 - ∑_{τ=0}^∞ c_τ]

- First-order conditions:

    u′(c_0) = λ
    βδᵗ u′(c_t) = λ for all t > 0


The Axiomatic Approach

Demand Functions

Applications

Anomalies

Time Inconsistency of the Optimal Path


- Eliminating λ:

  MRS_{0,1} = u'(c_0) / u'(c_1) = βδ
  MRS_{t,t+1} = u'(c_t) / u'(c_{t+1}) = δ   for all t > 0

- However, when date t arrives, the consumer will want to change the plan and reallocate consumption such that

  MRS_{t,t+1} = βδ

- Realizing that she may change her own optimal plan later, the self-aware consumer will adjust her plan at date 0 itself.
- Alternatively, the consumer may try to commit and restrict her own future options (e.g. Christmas savings accounts).

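The inconsistency can be made concrete with log utility, where the first-order conditions give closed forms (c_t = βδ^t c_0 for t ≥ 1). A small sketch with assumed parameter values β = 0.7, δ = 0.9:

    beta, delta = 0.7, 0.9   # assumed preference parameters

    # fraction of the remaining cake eaten "today" by any current self:
    share = (1 - delta) / (1 - delta + beta * delta)

    c0 = share                       # date-0 consumption under the date-0 plan
    c1_planned = beta * delta * c0   # what the date-0 self plans for date 1
    c1_revised = share * (1 - c0)    # what the date-1 self actually chooses

    print(c1_planned, c1_revised)    # revised > planned whenever beta < 1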
Microeconomic Theory: Lecture 3
Production, Costs and the Firm
Parikshit Ghosh
Delhi School of Economics
Summer Semester, 2014

The Firm
- Often a very large organization with thousands of workers.
- Starting assumption: the objective is to maximize profits.
- Obvious exceptions: public sector organizations, non-profits, vanity projects (sports teams).
- Inside the firm: a command economy. Outside the firm: a market economy. What determines the boundary (Coase)?
- The joint stock company: separation of ownership and management/labour. This gives rise to agency problems: do managers have the incentive to maximize profits?
- Dynamic and strategic issues: there may be trade-offs between profit maximization in the short run and the long run.

Technology

The Production Function

- The firm produces one output (y) using n inputs x = (x_1, x_2, ..., x_n).
- The input-output relationship is captured in the production function: y = f(x), where f(.) is continuous, strictly increasing and (strictly) quasiconcave, with f(0) = 0.
- An isoquant is the set of input vectors that produce the same output:

  Q(y) = {x ≥ 0 | f(x) = y}

- Monotonicity and quasiconcavity of f(.) imply that isoquants are convex to the origin and that higher isoquants represent higher output.

The Production Function: Characteristics

- Returns to scale (for any λ > 1):
  - Constant Returns to Scale (CRS) if f(λx) = λf(x).
  - Decreasing Returns to Scale (DRS) if f(λx) < λf(x).
  - Increasing Returns to Scale (IRS) if f(λx) > λf(x).
- The production function is homogeneous of degree k if

  f(λx) = λ^k f(x)   for any x

- There is CRS, DRS, IRS as k =, <, > 1.

Optimization

Profit Maximization

- A perfectly competitive market is a market with a large number of buyers and sellers. Each agent takes the prices as given, and assumes he will be able to buy/sell any quantity he wants at these prices.
- The firm faces an output price p and a vector of input prices, w = (w_1, w_2, ..., w_n).
- The profit maximization problem:

  max_{y,x} py − w·x   subject to   y ≤ f(x)

- This becomes an unconstrained problem after incorporating the (binding) constraint into the objective function:

  max_x pf(x) − w·x

Two-Step Solutions: the Cost Function


- Break up the problem into two parts.
- First, find the least costly way of producing any output level y:

  c(w, y) = min_x w·x   subject to   f(x) = y

- Using this information, find the most profitable output level:

  max_y py − c(w, y)

- The cost minimization problem is the dual of the consumer's expenditure minimization problem.
- The cost function is the analogue of the expenditure function.
- The conditional input demand functions, x(w, y), are the analogues of Hicksian demand functions.

Cost Functions of Homogeneous Production Functions


Theorem
Suppose f(x) is homogeneous of degree k. Then the cost and conditional input demand functions are multiplicatively separable in y and w, and are given by

  c(w, y) = c(w, 1)·y^(1/k)
  x(w, y) = x(w, 1)·y^(1/k)

- The cost function is linear/convex/concave if returns to scale are constant/decreasing/increasing.
- Marginal cost is constant/increasing/decreasing if the cost function is linear/convex/concave.

Proof of the Theorem


- The cost function can be rewritten as:

  c(w, y) = min_x w·x                     subject to f(x) = y
          = y^(1/k) min_x w·(y^(-1/k) x)  subject to f(y^(-1/k) x) = 1
          = y^(1/k) min_z w·z             subject to f(z) = 1      (putting z = y^(-1/k) x)
          = c(w, 1)·y^(1/k)

- The proof for the conditional input demand functions is similar.
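The theorem is easy to verify numerically for a Cobb-Douglas technology (the functional form and prices below are assumptions chosen for illustration):

    import numpy as np
    from scipy.optimize import minimize

    a, b = 0.3, 0.5                      # assumed exponents; degree k = a + b
    w = np.array([2.0, 3.0])             # assumed input prices

    def cost(y):
        # minimize w.x subject to f(x) = y
        cons = {"type": "eq", "fun": lambda x: x[0]**a * x[1]**b - y}
        res = minimize(lambda x: w @ x, x0=[1.0, 1.0], constraints=[cons],
                       bounds=[(1e-6, None)] * 2)
        return res.fun

    k = a + b
    for y in [0.5, 1.0, 2.0, 4.0]:
        # direct minimization vs. c(w,1) * y**(1/k): the two columns agree
        print(y, round(cost(y), 4), round(cost(1.0) * y**(1 / k), 4))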

Returns to Scale and Competition


- First-order condition:

  p = ∂c(w, y)/∂y

- Second-order condition:

  ∂²c(w, y)/∂y² ≥ 0

- For IRS technology, the second-order condition cannot be satisfied anywhere! Optimal y is either 0 or ∞. This is not compatible with perfect competition; IRS typically leads to natural monopolies.
- For CRS technology, the optimum is 0, [0, ∞], or ∞, when p <, =, > c(w, 1). Optimum output can be indeterminate.

Returns to Scale and Competition


- The profit function of the firm is the value function of the profit-max problem:

  π(p, w) = max_x pf(x) − w·x

- First-order condition:

  p ∂f(x*)/∂x_i = w_i
  (marginal revenue product = price of input)

- Second-order condition: the Hessian matrix of f(.) must be negative semi-definite (i.e. locally concave) at x*.
- The choice functions x(p, w) are the (unconditional) input demand functions; y(p, w) = f(x(p, w)) is the supply function.

Properties of the Profit Function

- Increasing in p (higher profit for every input choice).
- Decreasing in w_i (lower profit for every input choice).
- Homogeneous of degree 1 in (p, w): when all input and output prices are scaled up by the same factor, profits scale by that factor.
- Convex in (p, w).
- Hotelling's Lemma (using the envelope theorem):

  ∂π(p, w)/∂p = y(p, w)
  ∂π(p, w)/∂w_i = −x_i(p, w)
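Hotelling's Lemma can be checked numerically. A sketch for an assumed decreasing-returns Cobb-Douglas technology, comparing a finite-difference derivative of the profit function with optimal output:

    import numpy as np
    from scipy.optimize import minimize

    a, b = 0.3, 0.4    # assumed exponents (a + b < 1, so the optimum is interior)

    def profit(p, w):
        f = lambda x: x[0]**a * x[1]**b
        res = minimize(lambda x: -(p * f(x) - w @ x), x0=[1.0, 1.0],
                       bounds=[(1e-6, None)] * 2)
        return -res.fun, res.x

    p, w, h = 2.0, np.array([1.0, 1.5]), 1e-4
    pi0, x_star = profit(p, w)

    # d(pi)/dp should equal optimal output y(p, w):
    dpi_dp = (profit(p + h, w)[0] - profit(p - h, w)[0]) / (2 * h)
    print(dpi_dp, x_star[0]**a * x_star[1]**b)   # approximately equal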

Convexity of the Profit Function: Proof

- Suppose optimal input-output choices are
  - (y¹, x¹) at prices (p¹, w¹),
  - (y², x²) at prices (p², w²),
  - (ȳ, x̄) at prices (p̄, w̄), where (p̄, w̄) = λ(p¹, w¹) + (1 − λ)(p², w²).
- By definition of profit maximization:

  p¹y¹ − w¹·x¹ ≥ p¹ȳ − w¹·x̄
  p²y² − w²·x² ≥ p²ȳ − w²·x̄

- Taking weighted averages:

  λπ(p¹, w¹) + (1 − λ)π(p², w²) ≥ p̄ȳ − w̄·x̄ = π(p̄, w̄)

Implications of Convex Profit Function

- The Hessian matrix of π(p, w) is symmetric and positive semi-definite:

      [ ∂²π/∂p²      ∂²π/∂p∂w_1    ...  ∂²π/∂p∂w_n   ]   [  ∂y/∂p     ∂y/∂w_1    ...   ∂y/∂w_n   ]
  H = [ ∂²π/∂w_1∂p   ∂²π/∂w_1²     ...  ∂²π/∂w_1∂w_n ] = [ -∂x_1/∂p  -∂x_1/∂w_1  ...  -∂x_1/∂w_n ]
      [ ...          ...           ...  ...          ]   [ ...        ...        ...   ...       ]
      [ ∂²π/∂w_n∂p   ∂²π/∂w_n∂w_1  ...  ∂²π/∂w_n²    ]   [ -∂x_n/∂p  -∂x_n/∂w_1  ...  -∂x_n/∂w_n ]

- All principal minors are non-negative, so in particular the diagonal elements are non-negative.
- The supply function y(p, w) is increasing in the output price: ∂y(p, w)/∂p ≥ 0.
- The input demand functions x(p, w) are decreasing in own price: ∂x_i(p, w)/∂w_i ≤ 0.

Microeconomic Theory: Lecture 4
Monopoly
Parikshit Ghosh
Delhi School of Economics
Summer Semester, 2014

Non-Discriminating Monopolist

The Monopolist's Problem


- The monopolist realizes his quantity choice affects the market price through the demand function.
- The monopolist's problem can be described either as one of choosing the optimal price or the optimal quantity.
- Inverse demand function p(q) and cost function c(q).
- Maximizing profits:

  max_q p(q)q − c(q)

- First order condition:

  p(q_m) + q_m p'(q_m) = c'(q_m)
  (marginal revenue = marginal cost)

Non-Discriminating Monopolist

Monopolist's Optimum: Another Look

- The FOC can be rewritten as

  p_m (1 + (q_m/p_m)(dp/dq)) = c'(q_m)
  or, p_m (1 − 1/e(q_m)) = c'(q_m)

- e(q) is the elasticity of demand at quantity q. The monopolist never operates on the inelastic part of the demand curve.
- Yet another way to write it:

  (p_m − c'(q_m)) / p_m = 1 / e(q_m)

- The mark-up is equal to the inverse of the demand elasticity.

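For a concrete case, take linear inverse demand p(q) = a − bq and constant marginal cost c (assumed functional forms); the sketch below solves the monopolist's problem numerically and confirms the inverse-elasticity mark-up rule:

    from scipy.optimize import minimize_scalar

    a, b, c = 10.0, 1.0, 2.0           # assumed demand and cost parameters

    res = minimize_scalar(lambda q: -((a - b*q)*q - c*q),
                          bounds=(0, a/b), method="bounded")
    q_m = res.x
    p_m = a - b*q_m

    print(q_m, (a - c)/(2*b))          # numerical vs. closed-form quantity
    e = p_m / (b * q_m)                # demand elasticity at q_m (linear demand)
    print((p_m - c)/p_m, 1/e)          # mark-up = 1/elasticity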
Non-Discriminating Monopolist

Comparison With Competitive Markets

- Competitive output level q_c satisfies:

  p(q_c) = c'(q_c)

- The monopolist produces less than the competitive market.
- Suppose q_m ≥ q_c. Then p(q_m) ≤ p(q_c) and c'(q_m) ≥ c'(q_c).
- Subtracting the first-order conditions yields a contradiction:

  [p(q_m) − p(q_c)] + [q_m p'(q_m)] = [c'(q_m) − c'(q_c)]
       ≤ 0               < 0                ≥ 0

- Unlike a competitive firm, when a monopolist raises output, he earns lower revenue on previous units.

Non-Discriminating Monopolist

Monopoly in Pictures

[Figure slides: graphics not preserved in this transcript.]

Types of Price Discrimination

Price Discrimination: First Degree

- Each consumer buys 0 or 1 unit and is willing to pay up to v.
- There is a continuum of consumers whose v follows a distribution with c.d.f. F(v).
- The monopolist can charge each consumer his personal v.
- He must choose a cutoff v̄ above which to sell:

  max_{v̄} ∫_{v̄}^∞ v f(v) dv − c(1 − F(v̄))

- The first-order condition (using the Leibniz rule) implies absence of inefficiency:

  −v̄ f(v̄) + c'(1 − F(v̄)) f(v̄) = 0
  ⇒ v̄ = c'(1 − F(v̄))
  (price to the marginal customer = marginal cost)

Types of Price Discrimination

Price Discrimination: Second and Third Degree

- Second degree price discrimination arises when the monopolist can charge different prices for different quantities.
- E.g., bulk discounts, multi-packs, frequent-flyer miles, buy-one-get-50%-off on the next purchase, etc.
- A way to extract consumer surplus from a single consumer.
- Third degree price discrimination arises when observably different groups are charged different prices.
- E.g., student/senior citizen discounts, country-specific prices.
- A cruder form of first degree price discrimination, using group identity as a predictor of individual traits.
- A common instrument of price discrimination: screening.

Second Degree Price Discrimination

[Figure slides: graphics not preserved in this transcript.]

Two Part Tariff

[Figure slides: graphics not preserved in this transcript.]

- Suppose a single consumer with income y has quasi-linear utility: u(q, m) = φ(q) + m.
- The monopolist can charge an entry fee (f) and a price (p) per unit of consumption.
- Consumer's optimum quantity choice (if she subscribes):

  max_q φ(q) + y − f − pq  ⇒  φ'(q(p)) = p

- The consumer subscribes if (participation constraint):

  φ(q(p)) + y − f − pq(p) ≥ y

Types of Price Discrimination

Two Part Tariff

- The monopolist's problem:

  max_{f,p} pq(p) + f − c(q(p))
  subject to φ(q(p)) + y − f − pq(p) ≥ y

- The participation constraint must be binding at the optimum (otherwise simply increase f). Substituting f = φ(q(p)) − pq(p):

  max_p φ(q(p)) − c(q(p))  ⇒  φ'(q̂) = c'(q̂)

- If the market were competitive (price taking behaviour):

  φ'(q*) = c'(q*)

- Again, two-part tariffs remove the monopolistic distortion.
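A small numerical illustration, with assumed functional forms φ(q) = 2√q and c(q) = cq: the profit-maximizing per-unit price under a two-part tariff comes out equal to marginal cost.

    import numpy as np
    from scipy.optimize import minimize_scalar

    c = 0.5                               # assumed marginal cost

    def monopoly_profit(p):
        q = 1.0 / p**2                    # demand from phi'(q) = 1/sqrt(q) = p
        f = 2*np.sqrt(q) - p*q            # binding participation constraint
        return p*q + f - c*q

    res = minimize_scalar(lambda p: -monopoly_profit(p),
                          bounds=(0.1, 2.0), method="bounded")
    print(res.x, c)                       # optimal per-unit price equals c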

Monopolistic Screening

Examples (figure slides; graphics not preserved in this transcript):
- Airfares: more than 2 weeks vs. less than 2 weeks in advance
- Hardcover vs. paperback: prices, dates and sales
- Cash back coupons
- Rationing

Price Discrimination by Screening


- Two types of consumers:
  - high value (value = v_H, proportion = μ),
  - low value (value = v_L, proportion = 1 − μ).
- A consumer's valuation is private information.
- Cost of production is 0.
- Uniform pricing strategy:
  - charge p = v_H if μv_H ≥ v_L,
  - charge p = v_L if μv_H < v_L.
- Profit = max{μv_H, v_L}. Assume μv_H < v_L.

Price Discrimination by Screening


- The monopolist can impose a burden B on consumers (delay, coupons, uncertainty, etc.).
- The cost of the burden is c_H for high value types and c_L for low value types (c_H > c_L).
- The burden reduces the willingness to pay of all customers. Its direct impact on profits is negative. Why will the monopolist hurt his own interests?
- By imposing the burden, the monopolist can gather valuable market information which allows him to price discriminate. This indirect benefit may compensate for the direct loss of lower reservation prices.
- The screening technique is useful in other contexts (employers seeking dedicated workers, governments targeting anti-poverty programmes at the poor).

Price Discrimination by Screening


- The monopolist offers a menu (p_H, 0) and (p_L, B), satisfying:
- Self Selection Constraints: the H-type chooses (p_H, 0), the L-type chooses (p_L, B):

  p_H ≤ p_L + c_H   (IC-H)
  p_H ≥ p_L + c_L   (IC-L)

- Participation Constraints: both types want to buy:

  v_H − p_H ≥ 0        (PC-H)
  v_L − p_L − c_L ≥ 0  (PC-L)

- The monopolist solves: max_{p_H, p_L} μp_H + (1 − μ)p_L subject to these constraints.

Price Discrimination by Screening


- Assumption A: Relative to the L-type, the H-type gains more from the good than she suffers from the burden:

  v_H − v_L ≥ c_H − c_L,   or equivalently   v_H − c_H ≥ v_L − c_L

- Step 1: PC-H is implied by the other constraints and can be dropped:

  v_H − p_H ≥ v_H − p_L − c_H   (using IC-H)
  = (v_L − p_L − c_L) + (v_H − v_L) − (c_H − c_L) ≥ 0   (combining PC-L and Assumption A)

Price Discrimination by Screening

- Step 2: PC-L must be binding at the optimum.
  - If not, increase both p_H and p_L by ε > 0.
  - The ICs continue to hold, and profits have increased.
- Step 3: IC-H must be binding at the optimum.
  - Otherwise, increase p_H by ε > 0.
  - The other remaining constraints continue to be satisfied.
- Step 4: Binding IC-H implies IC-L, so IC-L can be dropped.

Price Discrimination by Screening


- Optimal solution:

  p_L = v_L − c_L
  p_H = p_L + c_H = v_L + c_H − c_L

- Assume v_H > v_L + c_H − c_L.
- Profit from price discrimination:

  μ(v_L + c_H − c_L) + (1 − μ)(v_L − c_L) = v_L − c_L + μc_H

- Price discrimination is better than uniform pricing if μc_H − c_L > 0.
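The solution can be verified for assumed parameter values satisfying μv_H < v_L and μc_H − c_L > 0:

    mu, vH, vL, cH, cL = 0.5, 10.0, 6.0, 3.0, 1.0   # assumed parameters

    pL = vL - cL
    pH = pL + cH

    # Verify the constraints at the proposed optimum:
    assert pH <= pL + cH and pH >= pL + cL        # IC-H, IC-L
    assert vH - pH >= 0 and vL - pL - cL >= 0     # PC-H, PC-L

    profit_screen = mu*pH + (1 - mu)*pL
    profit_uniform = max(mu*vH, vL)
    print(profit_screen, profit_uniform)          # 6.5 vs 6.0: screening wins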

Horizontal vs. Vertical Mergers


- Downstream product X (furniture), upstream product Y (wood). Fixed coefficient technology (1:1).
- Each is supplied by a separate monopolist, with marginal costs c_x and c_y.
- Inverse demand function for the final good: p = p(x). Let R(x) = p(x)x be the revenue function, with R'(x) downward sloping.
- Let the price charged by the upstream monopolist be q.
- Downstream problem:

  max_x R(x) − (c_x + q)x

- First order condition:

  R'(x) = c_x + q

Upstream Problem
- The upstream monopolist takes into account the downstream demand function q(y) = R'(y) − c_x:

  max_y yq(y) − c_y y = max_y y[R'(y) − c_x] − c_y y

- First order condition (replacing y by x, since one unit of input makes one unit of output):

  R'(x*) + x*R''(x*) = c_x + c_y

- Suppose the two monopolies merged. Then the firm will solve:

  max_x R(x) − (c_x + c_y)x

- The first order condition shows that a vertical merger increases efficiency:

  R'(x̂) = c_x + c_y  ⇒  x̂ > x*
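With linear inverse demand p(x) = a − bx (an assumed functional form), both first-order conditions have closed forms, and the output comparison can be computed directly:

    a, b, cx, cy = 10.0, 1.0, 1.0, 1.0   # assumed parameters; R'(x) = a - 2*b*x

    # Separate monopolists: R'(x) + x*R''(x) = cx + cy  =>  a - 4*b*x = cx + cy
    x_separate = (a - cx - cy) / (4*b)

    # Merged firm: R'(x) = cx + cy  =>  a - 2*b*x = cx + cy
    x_merged = (a - cx - cy) / (2*b)

    print(x_separate, x_merged)   # 2.0 < 4.0: the merged firm produces more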
Centre and State Taxes


- Let q = a − bp be the demand function. Supply is horizontal at some price p̂ (net of taxes).
- The Centre first chooses a tax t_c. Then the State chooses its own tax t_s.
- Both governments aim to maximize tax revenue.
- Centre and State do not coordinate when choosing their tax policies. They do not maximize total government revenue.
- This can be inefficient: cumulative tax rates are too high.

Tax Harmonization

State's Problem

- The State solves:

  max_{t_s} t_s [a − b(p̂ + t_c + t_s)]

- The State's FOC defines its reaction function t_s(t_c):

  a − b(p̂ + t_c) = 2b t_s

- Differentiating w.r.t. t_c:

  t_s'(t_c) = −1/2

- If Central taxes are higher, the State will lower its own taxes to some degree, but not completely.

Tax Harmonization

Centre's Problem

- The Centre chooses t_c to solve:

  max_{t_c} t_c [a − b(p̂ + t_c + t_s(t_c))]

- Replacing the State's reaction function:

  t_c = argmax_{t_c} (1/2) t_c (a − bp̂ − bt_c) = (a − bp̂) / 2b

- Plugging back:

  t_s = (a − bp̂) / 4b

Tax Harmonization

- Consider a harmonized single tax rate chosen to maximize total tax collection for Centre and State. It solves

  max_t t [a − b(p̂ + t)]

- The FOC gives the optimal tax rate

  t* = (a − bp̂) / 2b

- Since t_c = t*, we have t_s + t_c > t*.
- A harmonized single tax will
  - decrease the tax burden and increase consumer surplus, and
  - increase tax collection, which can be suitably shared.
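A numerical illustration with assumed parameters:

    a, b, p_hat = 10.0, 1.0, 1.0        # demand q = a - b*p, flat supply at p_hat

    tc = (a - b*p_hat) / (2*b)          # Centre's tax (first mover)
    ts = (a - b*p_hat) / (4*b)          # State's best response
    t_star = (a - b*p_hat) / (2*b)      # harmonized revenue-maximizing tax

    def revenue(t):
        return t * (a - b*(p_hat + t))

    print(tc + ts, t_star)                      # 6.75 vs 4.5: combined rate too high
    print(revenue(tc + ts), revenue(t_star))    # harmonization raises total revenue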

Microeconomic Theory I:
Choice Under Uncertainty
Parikshit Ghosh
Delhi School of Economics
September 8, 2014


Definitions and Axioms

Lotteries

- Set of outcomes: {a_1, a_2, ..., a_n}.
- A gamble/lottery is a probability distribution over outcomes:

  g = (p_1 ∘ a_1, p_2 ∘ a_2, ..., p_n ∘ a_n)

- p_i is the probability of outcome i.
- Sure outcomes: (0 ∘ a_1, ..., 1 ∘ a_i, ..., 0 ∘ a_n) = a_i.
- Compound lotteries are probability distributions over lotteries:

  (q_1 ∘ g_1, q_2 ∘ g_2, ..., q_m ∘ g_m)

- G (G_S) is the set of all (simple) lotteries.
- ≽ is a preference relation defined over G.

Definitions and Axioms

The von Neumann-Morgenstern Axioms

- Axiom 1 (Completeness): For all g, g' ∈ G, either g ≽ g' or g' ≽ g (or both).
- Axiom 2 (Transitivity): For all g, g', g'' ∈ G, if g ≽ g' and g' ≽ g'', then g ≽ g''.
- Axiom 3 (Continuity): For any g ∈ G, there exists α ∈ [0, 1] such that

  g ∼ (α ∘ a_1, (1 − α) ∘ a_n)

Definitions and Axioms

The von Neumann-Morgenstern Axioms

- Axiom 4 (Monotonicity): For any α, β ∈ [0, 1],

  (α ∘ a_1, (1 − α) ∘ a_n) ≽ (β ∘ a_1, (1 − β) ∘ a_n)   iff   α ≥ β

- Axiom 5 (Substitution/Independence): If g = (p_1 ∘ g_1, ..., p_k ∘ g_k), h = (p_1 ∘ h_1, ..., p_k ∘ h_k) and g_i ∼ h_i for all i = 1, 2, ..., k, then g ∼ h.
- Axiom 6 (Reduction to simple lotteries): For any g ∈ G, if g_S ∈ G_S is the simple lottery induced by g, then g ∼ g_S.

Representation Theorems

The Expected Utility Theorem

Theorem
Suppose ≽ satisfies Axioms 1 through 6. Then there exists a function u : G → ℝ such that u(.)
(i) represents ≽, i.e. g ≽ g' ⇔ u(g) ≥ u(g')
(ii) has the expected utility property, i.e. u(g) = Σ_{i=1}^n p_i u(a_i)

- The probabilities p_i are assumed to be objective (e.g. playing roulette), not subjectively assessed (e.g. stock prices).
- Savage extended the theory to subjective probabilities.
- The value of a lottery is linear in the probabilities of outcomes.


Representation Theorems

Proof: Representation

- Proof by construction: define u(g) ∈ [0, 1] such that

  g ∼ (u(g) ∘ a_1, (1 − u(g)) ∘ a_n)   (continuity)

- Representation: g ≽ g'

  ⇔ (u(g) ∘ a_1, (1 − u(g)) ∘ a_n) ≽ (u(g') ∘ a_1, (1 − u(g')) ∘ a_n)   (transitivity)
  ⇔ u(g) ≥ u(g')   (monotonicity)

Representation Theorems

Proof: Expected Utility Property

- Expected utility property: let q_i denote the lottery for which

  a_i ∼ (u(a_i) ∘ a_1, (1 − u(a_i)) ∘ a_n) = q_i

- Then

  g ∼ (p_1 ∘ q_1, p_2 ∘ q_2, ..., p_n ∘ q_n)   (substitution)
    ∼ ((Σ_{i=1}^n p_i u(a_i)) ∘ a_1, (1 − Σ_{i=1}^n p_i u(a_i)) ∘ a_n)   (axiom 6)

- By monotonicity,

  u(g) = Σ_{i=1}^n p_i u(a_i)

Representation Theorems

Invariance to Positive Affine Transformations

Theorem
Suppose the VNM function u(.) represents ≽ over G. Then the VNM function v(.) represents ≽ if and only if there exist real numbers α and β > 0 such that

  v(g) = α + βu(g)   for all g ∈ G

- As in choice under certainty, there is no unique function that represents preferences.
- But the representation is more restrictive: only positive affine transformations preserve preference.

Representation Theorems

Proof of "Only If" Part

- Sufficiency is trivial; we prove necessity.
- Let

  a_i ∼ (α_i ∘ a_1, (1 − α_i) ∘ a_n)   (continuity)

- Since both u(.) and v(.) represent ≽ and are VNM (expected utility) functions:

  u(a_i) = α_i u(a_1) + (1 − α_i) u(a_n)
  v(a_i) = α_i v(a_1) + (1 − α_i) v(a_n)

- Solving for α_i:

  α_i = [u(a_i) − u(a_n)] / [u(a_1) − u(a_n)] = [v(a_i) − v(a_n)] / [v(a_1) − v(a_n)]

Representation Theorems

Proof (contd.)

- Solving for v(a_i):

  v(a_i) = [u(a_1)v(a_n) − u(a_n)v(a_1)] / [u(a_1) − u(a_n)]
           + {[v(a_1) − v(a_n)] / [u(a_1) − u(a_n)]} · u(a_i)
         = α + β·u(a_i)

- There are two degrees of freedom in choosing the utility function.

Anomalies

The Allais Paradox

- Decision problem 1: which do you prefer?
  Lottery A: 1 crore (1)
  Lottery B: 5 crore (.1), 1 crore (.89), 0 (.01)
- Decision problem 2: which do you prefer?
  Lottery C: 1 crore (.11), 0 (.89)
  Lottery D: 5 crore (.1), 0 (.9)
- In surveys, most people say: A ≻ B, D ≻ C

Anomalies

What is Wrong?

- Suppose u(.) represents these preferences.
- A ≻ B implies

  u(1) > .1u(5) + .89u(1) + .01u(0)
  or  .1u(5) − .11u(1) + .01u(0) < 0

- D ≻ C implies

  .1u(5) + .9u(0) > .11u(1) + .89u(0)
  or  .1u(5) − .11u(1) + .01u(0) > 0

- These preferences cannot be represented by a VNM function, since that leads to a contradiction.

Anomalies

The Ellsberg Paradox

- An urn contains 300 balls, out of which 100 are known to be red, and the remaining 200 are known to be either blue or green.
- Decision problem 1: which do you prefer?
  Lottery A: Rs. 100 if Red
  Lottery B: Rs. 100 if Blue
- Decision problem 2: which do you prefer?
  Lottery C: Rs. 100 if Not Red
  Lottery D: Rs. 100 if Not Blue
- In surveys, most people say: A ≻ B, C ≻ D

Anomalies

What is Wrong?

- Suppose u(.) represents these preferences, and suppose the decision maker conjectures Pr[blue] = p.
- A ≻ B implies

  p < 1/3

- C ≻ D implies

  2/3 > 1 − p  ⇒  p > 1/3

- These preferences cannot be represented by any expected utility function (ambiguity aversion).

Anomalies

Non-Consequentialism: Machina's Mom

- A mother has two children but only one (indivisible) toy.
- Outcomes: b (boy gets it), g (girl gets it).
- Preference: b ∼ g, but (0.5 ∘ b, 0.5 ∘ g) ≻ b, g.
- This violates the monotonicity axiom.
- Why does Machina's mom strictly prefer tossing a coin?
- To guarantee equal opportunity, since she cannot ensure equal outcomes.


Perceptual Biases

Bayes' Rule

- Suppose 1% of the population is infected with the swine flu virus.
- Suppose there is a test of 90% accuracy (10% chance of a false positive or false negative).
- A patient tests positive. What is the probability he is actually infected?
- Bayes' Rule says

  Pr(infected | positive)
  = Pr(inf) Pr(positive | inf) / [Pr(inf) Pr(positive | inf) + Pr(uninf) Pr(positive | uninf)]
  = (.01)(.9) / [(.01)(.9) + (.99)(.1)] = 1/12

- The small prior nullifies the effect of the large test accuracy.
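The 1/12 posterior can also be confirmed by simulation (an illustrative sketch):

    import random

    random.seed(0)
    positives = infected_positives = 0

    for _ in range(1_000_000):
        infected = random.random() < 0.01
        # the test reports the truth with probability 0.9
        positive = (random.random() < 0.9) == infected
        if positive:
            positives += 1
            infected_positives += infected

    print(infected_positives / positives, 1/12)   # both approximately 0.0833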

Perceptual Biases

Framing Effect

- Kahneman and Tversky (1981): suppose 600 people will be subjected to a medical treatment against some deadly disease.
- Decision problem 1: which do you prefer?
  Treatment A: 200 people will be saved
  Treatment B: everyone saved (prob 1/3) or no one saved (prob 2/3)
- Decision problem 2: which do you prefer?
  Treatment C: 400 people will die
  Treatment D: everyone dies (prob 2/3) or no one dies (prob 1/3)
- In surveys, most people say: A ≻ B (72%), D ≻ C (78%)


Attitudes Towards Risk

Monetary Payoffs

- Let a_i = w_i (some amount of wealth).
- Expected value of a lottery: E(g) = Σ_{i=1}^n p_i w_i.
- Expected utility of a lottery: u(g) = Σ_{i=1}^n p_i u(w_i).
- Definition: u(.) exhibits
  - risk neutrality if u(g) = u(E(g)) for all g ∈ G,
  - risk aversion if u(g) < u(E(g)) for all g ∈ G,
  - risk loving if u(g) > u(E(g)) for all g ∈ G.
- Certainty equivalent: C(g) is such that u(g) = u(C(g)).
- Risk premium: R(g) = E(g) − C(g).
- Risk neutrality/aversion/loving ⇒ R(g) =, >, < 0.

Attitudes Towards Risk

Optimum Purchase of Insurance

- An agent with wealth w faces a loss L with probability p.
- She has a concave (risk averse) utility function u(w).
- She can insure her wealth at a premium of γ per rupee insured.
- The agent's problem is to insure an amount x ≤ L to solve:

  max_x p u(w − L + x − γx) + (1 − p) u(w − γx)

- First order condition:

  p(1 − γ) u'(w − L + x − γx) = (1 − p)γ u'(w − γx)

- x < (=) L if γ > (=) p: with an actuarially fair premium, the agent insures fully.
- Zero profit condition for insurance companies:

  γ(1 − p) − p(1 − γ) = 0  ⇒  γ = p
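A numerical sketch with an assumed log utility function, showing full insurance at the actuarially fair premium and partial insurance at a loaded one:

    import numpy as np
    from scipy.optimize import minimize_scalar

    w, L, p = 100.0, 50.0, 0.2          # assumed wealth, loss and loss probability

    def optimal_coverage(gamma):
        eu = lambda x: -(p*np.log(w - L + x - gamma*x) + (1-p)*np.log(w - gamma*x))
        return minimize_scalar(eu, bounds=(0.0, L), method="bounded").x

    print(optimal_coverage(p))          # gamma = p: full insurance, x = L
    print(optimal_coverage(1.5*p))      # loaded premium: partial insurance, x < L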


Attitudes Towards Risk

Degree of Risk Aversion

- The Arrow-Pratt measure of absolute risk aversion:

  r(w) = −u''(w) / u'(w)

- Interpretation: a more risk averse agent will accept a strictly smaller set of lotteries.
- Consider lotteries of the form (p ∘ x_1, (1 − p) ∘ x_2). Let x_2(x_1) be the boundary of the acceptable set.
- By definition:

  p u(w + x_1) + (1 − p) u(w + x_2(x_1)) ≡ u(w)

- Differentiating with respect to x_1 at (0, 0):

  p u'(w) + (1 − p) u'(w) x_2'(0) = 0  ⇒  x_2'(0) = −p / (1 − p)

Attitudes Towards Risk

Degree of Risk Aversion

- The more curved the boundary is at (0, 0), the smaller is the acceptance set.
- Differentiating a second time at (0, 0):

  p u''(w) + (1 − p) u''(w) [x_2'(0)]² + (1 − p) u'(w) x_2''(0) = 0

- Since x_2'(0) = −p / (1 − p):

  x_2''(0) = −[p / (1 − p)²] · u''(w)/u'(w) = [p / (1 − p)²] · r(w)

- Agents with larger r(w) have smaller acceptance sets.

Lectures on Optimization
A. Banerji
September 2, 2014

Chapter 1
Introduction

1.1 Some Examples

We briefly introduce our framework for optimization, and then discuss some preliminary concepts and results that we'll need to analyze specific problems.

Our optimization examples can all be couched in the following general framework: Suppose V is a vector space and S ⊆ V. Suppose F : V → ℝ. We wish to find x* ∈ S s.t. F(x*) ≥ F(x), ∀x ∈ S, or x_* ∈ S s.t. F(x_*) ≤ F(x), ∀x ∈ S. x*, x_* are respectively called a maximum and a minimum of F on S.

In different applications, V can be finite- or infinite-dimensional. The latter need more sophisticated optimization tools such as optimal control; we will keep that sort of stuff in abeyance for now. In our applications, F will be continuous, and pretty much always differentiable; often twice continuously differentiable. S will be specified most often using constraints.

Example 1 Let U : ℝ^k_+ → ℝ be a utility function, and p_1, ..., p_k, I be positive prices and wealth. Maximize U s.t. x_i ≥ 0, i = 1, ..., k, and Σ_{i=1}^k p_i x_i ≡ p·x ≤ I.

Here, the objective function is U, and

S = {x ∈ ℝ^k : x_i ≥ 0, i = 1, ..., k, and 0 ≤ p·x ≤ I}

Example 2 Expenditure minimization. Same setting as above. Minimize p·x s.t. x_i ≥ 0, i = 1, ..., k and U(x) ≥ Ū, where Ū is a non-negative real number.

Here the objective function F : ℝ^k → ℝ is F(x) = p·x and

S = {x ∈ ℝ^k : x_i ≥ 0, i = 1, ..., k, and U(x) ≥ Ū}

Example 3 Profit Maximization. Given positive output prices p_1, ..., p_s and input prices w_1, ..., w_k, and a production function f : ℝ^k_+ → ℝ^s (transforming k inputs into s products),

Maximize Σ_{j=1}^s p_j f_j(x) − Σ_{i=1}^k w_i x_i, s.t. x_i ≥ 0, i = 1, ..., k. Here f_j(x) is the output of product j as a function of a vector x of the k inputs.

Here, the objective function is profits π : ℝ^k_+ → ℝ defined by π(x) = Σ_{j=1}^s p_j f_j(x) − Σ_{i=1}^k w_i x_i, and

S = {x ∈ ℝ^k : x_i ≥ 0, i = 1, ..., k}

Example 4 Intertemporal utility maximization. A worker with a known life span T, earning a constant wage w, and receiving interest at rate r on accumulated savings, or paying the same rate on accumulated debts, wishes to decide an optimal consumption path c(t), t ∈ [0, T]. Let accumulated assets/debts at time t be denoted by k(t). His instantaneous utility from consumption is u(c(t)), u' > 0, u'' < 0, and future consumption is discounted at rate ρ. So the problem is to choose a function c(t), t ∈ [0, T] to

Maximize F(c) = ∫_0^T e^{−ρt} u(c(t)) dt s.t.
(i) c(t) ≥ 0, t ∈ [0, T]
(ii) k'(t) = w + rk(t) − c(t)
(iii) k(0) = k(T) = 0

(iii) assumes that the individual has no inheritance and wishes to leave no bequest. Here, the objective function F is a function of an infinite dimensional vector, the consumption function c. The constraint set S admits only those functions c(t) that satisfy conditions (i) to (iii).

Example 5 Game in Strategic Form. G = ⟨N, (S_i), (u_i)⟩, where N = {1, ..., n} is the set of players, and for each player i, S_i is her set of strategies, and u_i : ×_{i=1}^n S_i → ℝ is her payoff function.

In a game, i's payoff can depend on the choices/strategies (s_1, ..., s_n) of everyone.

A Nash Equilibrium (s_1*, ..., s_n*) is a strategy profile such that for each player i, s_i* solves the following maximization problem:

Maximize u_i(s_1*, ..., s_{i−1}*, s_i, s_{i+1}*, ..., s_n*) s.t. s_i ∈ S_i.

1.1.1 Optimization Problems in Parametric Form

Parameters are held constant in the optimization exercise. For instance, in Example 1 (Utility Maximization), (p, I) ∈ ℝ^{k+1}_{++} is the parameter vector held constant. The budget set in fact depends on this parameter, and we may write S(p, I) for the budget set to show this dependence. The maximum value that the utility function takes on this set (if the maximum exists), i.e.

V(p, I) = max {U(x) | x ∈ S(p, I)}

therefore typically depends on the parameter (p, I), and we denote this dependence of the maximum by the value function V(p, I). In consumer theory, we call this the indirect utility function. This is a function because to each point (p, I) in the admissible parameter space, V(p, I) assigns a single value, equal to the maximum of U(x) over all x ∈ S(p, I).

Note that for a given (p, I), the set of bundles that maximize utility may not be a singleton: we denote this relationship by x(p, I), the set of all bundles that maximize utility given (p, I). If the optimal bundle is unique for every (p, I), then x(p, I) is a function (the Walrasian or Marshallian demand function), and therefore V(p, I) = U(x(p, I)).

In other problems, not just the feasible set S but also the objective function depends on the parameter. For instance, in Example 2 (Expenditure Minimization), the parameter is (p, Ū), and the objective function p·x depends on the parameter via the price vector p, while the constraint U(x) ≥ Ū depends on it via the utility level Ū.

In general, we may write down the optimization problem in parametric form as follows: A parameter θ is part of some admissible set of parameters Θ, where Θ is a subset of some vector space W (finite or infinite-dimensional; e.g. in Example 1, (p, I) ∈ ℝ^k_{++} × ℝ_+, and this latter set is thus Θ). The feasible set is S(θ), which depends on θ and is a subset of some (other) vector space V (e.g. in Example 1, S(p, I) ⊆ ℝ^k_+). The objective function F maps from some subset of V × W to the real line: we write F(x, θ). The problem is to maximize or minimize F(x, θ) s.t. x ∈ S(θ).

Please read the other examples in Sundaram. We give one final one here.
Example 6 Identifying Pareto Optima

There are 2 individuals, with utility functions u_1(x_1), u_2(x_2) respectively, that map from ℝ^n_+, the n-good space, to ℝ. There is an endowment ω ∈ ℝ^n_+ to be allocated between them. An allocation (x_1, x_2) (vectors of the goods given to the two agents) is feasible if x_1, x_2 ≥ 0 and x_1 + x_2 ≤ ω. We will call

F(ω) = {(x_1, x_2) | x_1, x_2 ≥ 0, x_1 + x_2 ≤ ω}

the feasible set or set of feasible allocations.

An allocation (y_1, y_2) Pareto dominates (x_1, x_2) if u_i(y_i) ≥ u_i(x_i), i = 1, 2, with strict inequality for some i. An allocation (x_1, x_2) is Pareto optimal if there is no feasible allocation that Pareto dominates it.

Let a ∈ (0, 1) and consider the social welfare function U(x_1, x_2, a) ≡ au_1(x_1) + (1 − a)u_2(x_2). Then if (z_1, z_2) is any allocation that solves

Maximize U(x_1, x_2, a) s.t. (x_1, x_2) ∈ F(ω)

it is a Pareto optimal allocation. For, if (z_1, z_2) is in this set of solutions but is not Pareto optimal, then there is a feasible allocation (y_1, y_2) s.t. u_1(y_1) ≥ u_1(z_1), u_2(y_2) ≥ u_2(z_2), with strict inequality for at least one of these. Multiplying the first inequality by a, the second by (1 − a) and adding, we get U(y_1, y_2, a) > U(z_1, z_2, a), contradicting that (z_1, z_2) is a maximizer.

Under certain assumptions, a converse holds, i.e. every Pareto optimal allocation maximizes some social welfare function of the kind above (i.e., it is a solution for some choice of a ∈ [0, 1]).

1.2 Some Concepts and Results

We will now discuss some concepts that we will need, such as the compactness of the set S above, and the continuity and differentiability of the objective function F. We will work in normed linear spaces. In the absence of any other specification, the space we will be in is ℝ^n with the Euclidean norm ||x|| = (Σ_{i=1}^n x_i²)^{1/2}. (There's a bunch of other norms that would work equally well. Recall that a norm in ℝ^n is defined to be a function assigning to each vector x a non-negative real number ||x||, s.t. (i) for all x, ||x|| ≥ 0, with equality iff x = 0 (0 being the zero vector); (ii) if c ∈ ℝ, ||cx|| = |c| ||x||; (iii) ||x + y|| ≤ ||x|| + ||y||. The last requirement, the triangle inequality, follows for the Euclidean norm from the Cauchy-Schwarz inequality.)

One example in the previous section used another normed linear space, namely the space of bounded continuous functions defined on an interval of real numbers, with the sup norm. But in further work in this part of the course, we will stick to using finite dimensional spaces. Some of the concepts below apply to both finite and infinite dimensional spaces, so we will sometimes call the underlying space V. But mostly, it will help to think of V as simply ℝ^n, and to visualize stuff in ℝ².

We will measure the distance between vectors using ||x − y|| = (Σ_{i=1}^n (x_i − y_i)²)^{1/2}. This is our intuitive notion of distance using Pythagoras' theorem. Furthermore, it satisfies the three properties of a metric, viz., (i) ||x − y|| ≥ 0, with equality iff x = y; (ii) ||x − y|| = ||y − x||; (iii) ||x − z|| ≤ ||x − y|| + ||y − z||. Note that property (iii) for the metric follows from the triangle inequality for the norm, since ||x − z|| = ||(x − y) + (y − z)|| ≤ ||x − y|| + ||y − z||.
Open and Closed Sets

Let ε > 0 and x ∈ V. The open ball centered at x with radius ε is defined as

B(x, ε) = {y : ||x − y|| < ε}

We see that if V = ℝ, B(x, ε) is the open interval (x − ε, x + ε). If V = ℝ², it is an open disk centered at x. The boundary of the disk is traced out by Pythagoras' theorem.

Exercise 1 Show that ||x − y|| defined by max{|x_1 − y_1|, ..., |x_n − y_n|}, for all x, y ∈ ℝ^n, is a metric (i.e. satisfies the three requirements of a metric). In the space ℝ², sketch B(0, 1), the open ball centered at 0, the origin, of radius 1, in this metric.

Let S ⊆ V. x is an interior point of S if B(x, ε) ⊆ S for some ε > 0. S is an open set if all points of S are interior points. On the other hand, S is a closed set iff S^c is an open set.

Example. Open in ℝ vs. open in ℝ².

There is an alternative, equivalent, convenient way to define closed sets. x is an adherent point of S, or adheres to S, if every B(x, ε) contains a point belonging to S. Note that this does not necessarily mean that x is in S. (However, if x ∈ S then x adheres to S, of course.)

Example. Singleton and finite sets are closed; countable sets need not be closed.

Lemma 1 A set S is closed iff it contains all its adherent points.

Proof. Suppose S is closed, so S^c is open. Let x adhere to S. We want to show that x ∈ S. Suppose not. Then x ∈ S^c, and since S^c is open, x is an interior point of S^c. So there is some ε > 0 s.t. B(x, ε) ⊆ S^c; this ball does not have any points from S. So x cannot be an adherent point of S. Contradiction.

Conversely, suppose S contains all its adherent points. To show S is closed, we show S^c is open, i.e. that all the points in S^c are interior points. Let x ∈ S^c. Since x does not adhere to S, it must be the case that for some ε > 0, B(x, ε) ⊆ S^c.

More examples of closed (and open) sets.
Now we will relate closedness to convergence of sequences. Recall that, formally, a sequence in V is a function x : ℕ → V. But instead of writing {x(1), x(2), ...} as the members of the sequence, we write either {x_1, x_2, ...} or {x^1, x^2, ...}.

Definition 1 Convergence: A sequence (x^k)_{k=1}^∞ of points in V converges to x if for every ε > 0 there exists a positive integer N s.t. k ≥ N implies ||x^k − x|| < ε.

Note that this is the same as saying that for every open ball B(x, ε), we can find N s.t. all points x^k following x^N lie in B(x, ε). This implies that when x^k converges to x (notation: x^k → x), all but a finite number of points in (x^k) lie arbitrarily close to x.

Examples. x^k = 1/k, k = 1, 2, ... is a sequence of real numbers converging to zero. x^k = (1/k, 1/k), k = 1, 2, ... is a sequence of vectors in ℝ² converging to the origin. More generally, a sequence converges in ℝ^n if and only if all the coordinate sequences converge, as can be visualized in the example here using hypotenuses and legs of triangles.

Theorem 2 (x^k) → x in ℝ^n iff for every i ∈ {1, ..., n}, the coordinate sequence (x_i^k) → x_i.

Proof. Since

(x_i^k − x_i)² ≤ Σ_{j=1}^n (x_j^k − x_j)²,

taking square roots implies |x_i^k − x_i| ≤ ||x^k − x||, so for every k ≥ N s.t. ||x^k − x|| < ε, we have |x_i^k − x_i| < ε.

Conversely, if all the coordinate sequences converge to the coordinates of the point x, then there exists a positive integer N s.t. k ≥ N implies |x_i^k − x_i| < ε/√n for every coordinate i. Squaring, adding across all i and taking square roots, we have ||x^k − x|| < ε.

Several convergence results that appear to be true are in fact so. For instance, (x^k) → x, (y^k) → y implies (x^k + y^k) → (x + y). Indeed, for any ε > 0 there exists N s.t. k ≥ N implies ||x^k − x|| < ε/2 and ||y^k − y|| < ε/2. So ||(x^k + y^k) − (x + y)|| = ||(x^k − x) + (y^k − y)|| ≤ ||x^k − x|| + ||y^k − y|| (by the triangle inequality), and this is less than ε/2 + ε/2 = ε.

Exercise 3 Let (a^k) and (b^k) be sequences of real numbers that converge to a and b respectively. Then the product sequence (a^k b^k) converges to the product ab.

Closed sets can be characterized in terms of convergent sequences as follows.

Lemma 2 A set S is closed if and only if for every sequence (x^k) lying in S, x^k → x implies x ∈ S.

Proof. Suppose S is closed. Take any sequence (x^k) in S that converges to a point x. Then every open ball B(x, ε) contains a member x^k of the sequence, so x adheres to S. Since S is closed, it must contain this adherent point x.

Conversely, suppose the set S has the property that whenever (x^k) ⊆ S converges to x, x ∈ S. Take a point y that adheres to S. Take the successively smaller open balls B(y, 1/k), k = 1, 2, 3, .... We can find, in each such open ball, a point y^k from the set S (since y adheres to S). These points need not all be distinct, but since the open balls have radii converging to 0, y^k → y. Thus, by the convergence property of S, y ∈ S. So, any adherent point y of S actually belongs to S.
Related Results

1. If (a^k) is a sequence of real numbers all greater than or equal to 0, and a^k → a, then a ≥ 0. The reason is that for all k, a^k ∈ [0, ∞), which is a closed set and hence must contain the limit a.

2. Sup and Inf. Let S ⊆ ℝ. u is an upper bound of S if u ≥ a for every a ∈ S. s is the supremum or least upper bound of S (called sup S) if s is an upper bound of S, and s ≤ u for every upper bound u of S.

We say that a set S of real numbers is bounded above if there exists an upper bound, i.e. a real number M s.t. a ≤ M, ∀a ∈ S. The most important property of a supremum, which we'll by and large take here as given, is the following:

Completeness Property of Real Numbers: Every set S of real numbers that is bounded above has a supremum.

For a short discussion of this property, see the Appendix. Note that sup S may or may not belong to S.

Examples. S = (0, 1), D = [0, 1], K = the set of all numbers in the sequence 1 − 1/2^n, n = 1, 2, 3, .... The supremum of all these sets is 1, and it does not belong to S or to K.

When sup S belongs to S, it is called the maximum of S, for obvious reasons. Another important property of suprema is the following.

Lemma 3 For every ε > 0, there exists a number a ∈ S s.t. a > sup S − ε.

Note that this means that sup S is an adherent point of S.

Proof. Suppose that for some ε > 0, there is no number a ∈ S s.t. a > sup S − ε. Then every a ∈ S must satisfy a ≤ sup S − ε. But then sup S − ε is an upper bound of S that is less than sup S. This implies that sup S is not in fact the supremum of S. Contradiction.

Lemma 4 If a set S of real numbers is bounded above and closed, then it has a maximum.

Proof. Since S is bounded above, it has a supremum, sup S. sup S is an adherent point of S (by the above lemma). S is closed, so it contains all its adherent points, including sup S. Hence sup S is the maximum of S.

Corresponding to the notion of supremum or least upper bound of a set S of real numbers is the notion of infimum or greatest lower bound of S. A number l is a lower bound of S if l ≤ a, ∀a ∈ S. The infimum of S is a number s s.t. s is a lower bound of S, and s ≥ l for all lower bounds l of S. We call the infimum of S inf S.

Let −S be the set of numbers of the form −a, for all a ∈ S.

Fact. sup S = −inf(−S). So, sup and inf are intimately related.

By the completeness property of real numbers, if S ⊆ ℝ is bounded below (i.e., there exists m s.t. m ≤ a, ∀a ∈ S), it has an infimum. If S is closed and bounded below, it has a minimum.

A set S ⊆ ℝ is said to be bounded if it is bounded above and bounded below. We can extend the lemma above along obvious lines as follows:

Theorem 4 If S ⊆ ℝ is closed and bounded, then it has a maximum and a minimum.

For a more general normed linear space V, we define boundedness as follows. A set S ⊆ V is bounded if there exists an open ball B(0, M) s.t. S ⊆ B(0, M).
Compact Sets

Suppose (x_n) is a sequence in V. (Note the change in notation, from superscript to subscript. This is just by the way; most places use this subscript notation, but Rangarajan Sundaram at times uses the superscript notation in order to leave subscripts to denote coordinates of a vector.)

Let m(k) be an increasing function from the natural numbers to the natural numbers. So, l > n implies m(l) > m(n). A subsequence (x_{m(k)}) of (x_n) is an infinite sequence whose k-th member is the m(k)-th member of the original sequence.

Give an example. The idea is that to get a subsequence from (x_n), you strike out some members, keeping the remaining members' positions the same.

Fact. If a sequence (x_n) converges to x, then all its subsequences converge to x.

Proof. Take an arbitrary ε > 0. There exists N s.t. n ≥ N implies ||x_n − x|| < ε. This implies, for any subsequence (x_{m(k)}), that k ≥ N implies ||x_{m(k)} − x|| < ε, since m(k) ≥ k.

However, if a sequence does not converge anywhere, it can still have (lots of) subsequences that converge. For example, let (x_n) ≡ ((−1)^n), n = 1, 2, .... Then (x_n) does not converge; but the subsequences (y_m) = −1, −1, −1, ... and (z_m) = 1, 1, 1, ... both converge, to different limits. (Such points are called limit points of the mother sequence (x_n).)
Compact sets have a property related to this fact.

Definition 2 A set S ⊆ V is compact if every sequence (x_n) in S has a subsequence that converges to a point in S.

Theorem 5 Suppose S ⊆ ℝ^n. Then S is compact if and only if it is closed and bounded.

Proof (Sketch). Suppose S is closed and bounded. We can show it's compact using a pigeonhole-like argument; let's sketch it here. Since S is bounded, we can cover it with a closed rectangle R_0 = I_1 × ... × I_n, where I_i, i = 1, ..., n are closed intervals. Take a sequence (x_n) in S. Divide the rectangle in two: I_1^1 × I_2 × ... × I_n and I_1^2 × I_2 × ... × I_n, where I_1^1 ∪ I_1^2 = I_1 is the union of 2 intervals. Then there is an infinity of members of (x_n) in at least one of these smaller rectangles; call this R_1. Divide R_1 into 2 smaller rectangles, say by dividing I_2 into 2 smaller intervals; we'll find an infinity of members of (x_n) in at least one of these rectangles, call it R_2. This process goes on ad infinitum, and we find an infinity of members of (x_n) in the rectangles R_0 ⊇ R_1 ⊇ R_2 ⊇ .... By the Cantor Intersection Theorem, ∩_{i=0}^∞ R_i is a single point; call this point x. Now we can choose points y_i ∈ R_i, i = 1, 2, ... s.t. each y_i is some member of (x_n); because the R_i's collapse to x, it is easy to show that (y_m) is a subsequence that converges to x. Moreover, the y_i's lie in S, and S is closed; so x ∈ S.

Conversely, suppose S is compact.

(i) Then it is bounded. For suppose not. Then we can construct a sequence (x_n) in S s.t. for every n = 1, 2, ..., ||x_n|| > n. But then no subsequence of (x_n) can converge to a point in S. Indeed, take any point x ∈ S and any subsequence (x_{m(n)}) of (x_n). Then

||x_{m(n)}|| = ||x_{m(n)} − x + x|| ≤ ||x_{m(n)} − x|| + ||x||

(the inequality is the triangle inequality). So,

||x_{m(n)} − x|| ≥ ||x_{m(n)}|| − ||x|| ≥ n − ||x||

and the RHS becomes larger with n. So (x_{m(n)}) does not converge to x.

(ii) S is also closed. Take any sequence (x_n) in S that converges to x. Then all subsequences of (x_n) converge to x, and since S is compact, (x_n) has a subsequence converging to a point in S. So this limit point is x, and x ∈ S. So S is closed.
Continuity of Functions

Definition 3 A function F : ℝ^n → ℝ^m is continuous at x ∈ ℝ^n if for every sequence (x^k) that converges to x in ℝ^n, the image sequence (F(x^k)) converges to F(x) in ℝ^m.

Example of a point of discontinuity. Example of a continuous function on a discrete space.

F is continuous on S ⊆ ℝ^n if it is continuous at every point x ∈ S.

Examples. The real-valued function F(x) = x is continuous using this definition, almost trivially, since (x^k) and x are identical to (F(x^k)) and F(x) respectively.

F(x) = x² is continuous. We want to show that if (x^k) converges to x, then (F(x^k)) = (x_k²) converges to F(x) = x². This follows from the exercise above on limits: x^k → x, x^k → x implies x^k · x^k → x · x = x².

By extension, polynomials are continuous functions.

We may talk a little about the coordinate functions of F : ℝ^n → ℝ^m: F = (F_1(x_1, ..., x_n), ..., F_m(x_1, ..., x_n)).

Example: F(x_1, x_2) = (x_1 + x_2, x_1² + x_2²). This is continuous because (i) F_1 and F_2 are continuous; e.g. let x^k → x. Then the coordinates x_1^k → x_1 and x_2^k → x_2, so F_1(x^k) = x_1^k + x_2^k → x_1 + x_2 = F_1(x). (ii) Since the coordinate sequences F_1(x^k) → F_1(x) and F_2(x^k) → F_2(x), F(x^k) ≡ (F_1(x^k), F_2(x^k)) → F(x) = (F_1(x), F_2(x)).
There is an equivalent, (ε, δ) definition of continuity.

Definition 4 A function F : ℝ^n → ℝ^m is continuous at x ∈ ℝ^n if for every ε > 0, there exists δ > 0 s.t. if for any y ∈ ℝ^n we have ||x − y|| < δ, then ||F(x) − F(y)|| < ε.

So if there is a hurdle of size ε around F(x), then, if point y is close enough to x, F(y) cannot overcome the hurdle.

Theorem 6 The two definitions above are equivalent.

Proof. Suppose there exists an ε > 0 s.t. for every δ > 0, there exists a y with ||x − y|| < δ and ||F(x) − F(y)|| ≥ ε. Then for this particular ε, we can choose a sequence of δ_k = 1/k and x^k with ||x − x^k|| < 1/k. So (x^k) → x but (F(x^k)) does not converge to F(x), staying always outside the ε-band of F(x).

Conversely, suppose there exists a sequence (x^k) that converges to x, but (F(x^k)) does not converge to F(x). So there exists ε > 0 s.t. for every positive integer N, there exists k ≥ N for which ||F(x^k) − F(x)|| ≥ ε. Then, for this specific ε, there does not exist any δ > 0 s.t. for all y with ||x − y|| < δ we have ||F(x) − F(y)|| < ε; for we can find, for any such δ, one of the x^k s.t. ||x^k − x|| < δ, and yet ||F(x^k) − F(x)|| ≥ ε.

Here is an immediate upshot of the latter definition. Suppose F : ℝ → ℝ is continuous at x. If F(x) > 0, then there is an open interval (x − δ, x + δ) s.t. if y is in this interval, then F(y) > 0. The idea is that we can take ε = F(x)/2, say, and use the (ε, δ) definition. A similar statement holds if F(x) < 0.

We use this fact now in the following result.
Theorem 7 Intermediate Value Theorem. Suppose F : ℝ → ℝ is continuous on an interval [a, b] and F(a) and F(b) are of opposite signs. Then there exists c ∈ (a, b) s.t. F(c) = 0.

Proof. Suppose WLOG that F(a) > 0, F(b) < 0 (for the other case just consider the function −F). Then the set

S = {x ∈ [a, b] | F(x) ≥ 0}

is bounded above. Indeed, b is an upper bound of S since F(b) < 0. By the completeness property of real numbers, S has a supremum, sup S = c, say.

It can't be that F(c) > 0, for then by continuity there is an h ∈ S, h > c, s.t. F(h) > 0, so c is not an upper bound of S. It can't be that F(c) < 0. For, if c is an upper bound of S with F(c) < 0, then we have, for every x ∈ [a, b] with F(x) ≥ 0, x ≤ c. However, by continuity, there is an interval (c − δ, c] s.t. every y in this interval satisfies F(y) < 0. But then every x ∈ S must be to the left of this interval. But then again, c is not the least upper bound of S.

So, it must be that F(c) = 0.

As an application, you may want to prove the following corollary, a simple fixed point theorem.

Exercise 8 Suppose f : [a, b] → [a, b] is a continuous function. Then there exists x* ∈ [a, b] s.t. x* = f(x*).
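The proof of the Intermediate Value Theorem is non-constructive, but the standard bisection method turns it into an algorithm. A small sketch (the example function is an assumption chosen for illustration):

    def bisect(F, a, b, tol=1e-10):
        """Find c in (a, b) with F(c) = 0, given F(a), F(b) of opposite signs."""
        assert F(a) * F(b) < 0
        while b - a > tol:
            c = (a + b) / 2
            if F(a) * F(c) <= 0:
                b = c          # the sign change is in [a, c]
            else:
                a = c          # the sign change is in [c, b]
        return (a + b) / 2

    # Example: a root of F(x) = x**3 - 2 on [0, 2], i.e. x = 2**(1/3).
    print(bisect(lambda x: x**3 - 2, 0.0, 2.0))   # ~1.259921

The same routine proves Exercise 8 in practice: apply it to F(x) = f(x) − x, which changes sign on [a, b] whenever f maps [a, b] into itself.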

Chapter 2
Existence of Optima

2.1

Weierstrass Theorem

This theorem of Weierstrass gives a sufficient condition for a maximum and


minimum to exist, for an optimization problem.
Theorem 9 (Weierstrass). Let S <n be compact and let F : S < be
continuous. Then F has a maximum and minimum on S; i.e., there exist
z1 , z2 S s.t. f (z2 ) f (x) f (z1 ), x S.
The idea is that continuity of F preserves compactness; i.e. since S is
compact and F is continuous, the image set F (S) is compact. That holds
irrespective of the space F (S) is in; but since F is real-valued, F (S) is a
compact set of real numbers, and therefore must have a max and a min, by
a result in Chapter 1.
Proof.
Let (yk ) be a sequence in F (S). So, for every k, there is an xk S, s.t.yk =
F (xk ). Since (xk ), k = 1, 2, ... is a sequence in the compact set S, it has a
subsequence (xm(k) ) that converges to a point x in S. Since F is continuous,
the image sequence (f (xn(k) )) converges to f (x), which is obviously in F (S).
So we've found a convergent subsequence (y_m(k)) = (F(x_m(k))) of (yk); hence F(S) is compact. This means the set F(S) of real numbers is closed and bounded; so, it has at least one maximum and at least one minimum.

Example 7 Let p1 = p2 = 1, I = 10. Maximize U(x1, x2) = x1x2 s.t. the budget constraint. Here the budget set is compact, since the prices are positive. We can see that the image of the budget set S under the function U (or the range of U) is U(S) = [0, 25]. This is compact, and so U attains a max (25) and a min (0) on S.
The fact that U(S) is in fact an interval has to do with another property of continuity of the objective: such functions preserve connectedness in addition to preserving compactness of the set S, and here the budget set is a connected set.
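A crude numerical illustration of Example 7 (a sketch, not part of the notes): sample the compact budget set on a grid and confirm that U attains values spanning [0, 25].

    import numpy as np

    p1, p2, I = 1.0, 1.0, 10.0
    best, worst = -np.inf, np.inf
    for x1 in np.linspace(0.0, I / p1, 401):
        for x2 in np.linspace(0.0, (I - p1 * x1) / p2, 401):
            u = x1 * x2
            best, worst = max(best, u), min(worst, u)
    print(worst, best)  # approximately 0.0 and 25.0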
Do applications of Weierstrass theorem to utility maximization and
cost minimization.


Chapter 3
Unconstrained Optima

3.1 Preliminaries
A function f : R → R is defined to be differentiable at x if there exists a ∈ R s.t.
lim_{y→x} [ (f(y) − f(x))/(y − x) − a ] = 0    (1)
By limit equal to 0 as y → x, we require that the limit be 0 w.r.t. all sequences (yn) s.t. yn → x. a turns out to be the unique number equal to the slope of the tangent to the graph of f at the point x. We denote a by the notation f′(x). We can rewrite Equation (1) as follows:

lim

yx

f (y) f (x) a(y x)


yx


=0

(2)
Note that this means the numerator tends to zero faster than does the
denominator.
We can use this way of defining differentiability for more general functions.

Definition 5 Let f : R^n → R^m. f is differentiable at x if there is an m × n matrix A s.t.
lim_{y→x} ||f(y) − f(x) − A(y − x)|| / ||y − x|| = 0
In the one variable case, the existence of a gives the existence of a tangent; in the more general case, the existence of the matrix A gives the existence of tangents to the graphs of the m component functions f = (f1, ..., fm), each of those functions being from R^n → R. In other words, this definition has to do with the best affine approximation to f at the point x. To see this in a way equivalent to the above definition, put h = y − x in the above definition, so y = x + h. Then in the 1-variable case, from the numerator, f(x + h) is approximated by the affine function f(x) + ah = f(x) + f′(x)h. In the general case, f(x + h) is approximated by the affine function f(x) + Ah.
It can be shown that (w.r.t. the standard bases in R^n and R^m) the matrix A equals Df(x), the m × n matrix of partial derivatives of f evaluated at the point x. To see this, take the slightly less general case of a function f : R^n → R. If f is differentiable at x, there exists a 1 × n matrix A = (a11, ..., a1n) satisfying the definition above, i.e.
lim_{h→θ} ||f(x + h) − f(x) − Ah|| / ||h|| = 0
In particular, the above must hold if we choose h = (0, ..., 0, t, 0, ..., 0) with hj = t ≠ 0. That is,
lim_{t→0} ||f(x1, ..., xj + t, ..., xn) − f(x1, ..., xj, ..., xn) − a1j t|| / |t| = 0
But from the limit on the LHS, we know that a1j must equal the partial derivative ∂f(x)/∂xj.
We refer to Df(x) as the derivative of f at x; Df itself is a function of x, defined on R^n.


Df(x) =
[ ∂f1(x)/∂x1 ... ∂f1(x)/∂xn ]
[     ...            ...    ]
[ ∂fm(x)/∂x1 ... ∂fm(x)/∂xn ]
Here,
∂fi(x)/∂xj = lim_{t→0} [ fi(x1, ..., xj + t, ..., xn) − fi(x1, ..., xj, ..., xn) ] / t
We can also represent the partial derivative in different notation. Let ej = (0, ..., 0, 1, 0, ..., 0) be the unit vector in R^n along the j-th axis. Then,
∂fi(x)/∂xj = lim_{t→0} [ fi(x + t ej) − fi(x) ] / t
That is, the partial of fi w.r.t. xj, evaluated at the point x, is essentially looking at a function of one variable: we take the surface (graph) of the function fi and slice it parallel to the j-th axis, s.t. the point x is contained in this slice/plane; we get a function pasted on this plane, and its derivative is the relevant partial derivative.
To be more precise about this one-variable function pasted on the slice/plane, note that the single variable t ∈ R is first mapped to a vector x + t ej ∈ R^n, and then that vector is mapped to a real number fi(x + t ej). So, let φ : R → R^n be defined by φ(t) = x + t ej, for all t ∈ R. Then the one-variable function we're looking for is g : R → R defined by g(t) = fi(φ(t)), for all t ∈ R; it's the composition of fi and φ.
In addition to slicing the surfaces of functions that map from R^n to R in the directions of the axes, we can slice them in any direction and get a function pasted on the slicing plane. This is the notion of a directional derivative.
Recall that if x ∈ R^n and h ∈ R^n, then the set of all points that can be written as x + th, for some t ∈ R, comprises the line through x in the direction of h.
See figure (drawn in class).
Definition 6 The directional derivative of a function f : R^n → R at x ∈ R^n, in the direction h ∈ R^n, denoted Df(x; h), is
lim_{t→0+} [ f(x + th) − f(x) ] / t
If t → 0+ is replaced by t → 0, we get the 2-sided directional derivative.
A function is differentiable on a set S if it is differentiable at all points in S. f is continuously differentiable if it is differentiable and Df is continuous.
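These limit definitions translate directly into numerical approximations. The sketch below (the helper names are illustrative, not from the notes) approximates a partial derivative via f(x + t ej) and a one-sided directional derivative Df(x; h) using a small t:

    import numpy as np

    def partial(f, x, j, t=1e-6):
        """Forward-difference approximation of the j-th partial of f at x."""
        ej = np.zeros_like(x)
        ej[j] = 1.0
        return (f(x + t * ej) - f(x)) / t

    def directional(f, x, h, t=1e-6):
        """One-sided directional derivative of f at x in the direction h."""
        return (f(x + t * h) - f(x)) / t

    f = lambda x: x[0] ** 2 * x[1]                   # f(x1, x2) = x1^2 * x2
    x = np.array([1.0, 2.0])
    print(partial(f, x, 0))                          # ~2*x1*x2 = 4
    print(directional(f, x, np.array([1.0, 1.0])))   # ~Df(x).h = 4 + 1 = 5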

3.2 Interior Optima
Definition 7 Let f : R^n → R. A point z is a local maximum (resp. local minimum) of f on a set S ⊆ R^n if f(z) ≥ (≤) f(x), for all x ∈ B(z, ε) ∩ S, for some ε > 0.
B(z, ε) is intersected with S since B(z, ε) may not, by itself, lie entirely in S. However, if z is in the interior of S, we can discard the intersection: z is said to be an interior local maximum, or minimum, of f on S if f(z) ≥ (≤) f(x), ∀x ∈ B(z, ε), for some ε > 0.
We now give a necessary condition for a point to be an interior local max
or min; namely, that its derivative should be zero. For if not, then we can
increase or decrease the function value by moving away slightly from the
point.
First Order Necessary Condition.
Theorem 10 Let f : R^n → R, S ⊆ R^n, and let x* be a local max or local min of f on S, lying in the interior of S. If f is differentiable at x*, then Df(x*) = θ.
Here, θ = (0, ..., 0) is the origin, and Df(x*) = (∂f(x*)/∂x1, ..., ∂f(x*)/∂xn).
Proof. Let x* be an interior local max (the min proof is done along similar lines).
Step 1: Suppose n = 1. Take any sequences (yk), yk < x*, yk → x*, and (zk), zk > x*, zk → x*. Since x* is a local max, we have, for k ≥ K and K large enough,
[f(zk) − f(x*)] / (zk − x*) ≤ 0 ≤ [f(yk) − f(x*)] / (yk − x*)
Taking limits preserves these inequalities, since (−∞, 0] and [0, ∞) are closed sets and the ratio sequences lie in these closed sets. So,
f′(x*) ≤ 0 ≤ f′(x*)
so f′(x*) = 0.
Step 2. Suppose n > 1. Take any j-th axis direction, and let g : R → R be defined by g(t) = f(x* + t ej). Note that g(0) = f(x*). Now, since x* is a local max of f, f(x*) ≥ f(x* + t ej) for |t| smaller than some cutoff value: i.e., g(0) ≥ g(t) for |t| smaller than this cutoff value, i.e., 0 is a local interior maximum of g (since t < 0 and t > 0 are both allowed). g is differentiable at 0, since g(t) = f(φ(t)) with φ(t) = x* + t ej, f is differentiable at x*, and φ is differentiable at t = 0 (here Dφ(t) = ej, ∀t). So, by Step 1, g′(0) = 0, and by the Chain Rule,
0 = g′(0) = Df(φ(0)) Dφ(0) = Df(x*) ej = ∂f(x*)/∂xj.
Note that this is necessary but not sufficient for a local max or min, e.g.
f (x) = x3 has a vanishing first derivative at x = 0, which is not a local
optimum.
Second Order Conditions


Definition. x is a strict local maximum of f on S if f(x) > f(y) for all y ∈ B(x, ε) ∩ S, y ≠ x, for some ε > 0.
We will represent the Hessian or second derivative (matrix) of f by D²f.
Theorem 11 Suppose f : R^n → R is C² on S ⊆ R^n, and x is an interior point of S.
1. (necessary) If f has a local max (resp. local min) at x, then D²f(x) is n.s.d. (resp. p.s.d.).
2. (sufficient) If Df(x) = θ and D²f(x) is n.d. (resp. p.d.) at x, then x is a strict local max (resp. min) of f on S.
The results in the above theorem follow from taking a Taylor series approximation of order 2 around the local max or local min. For example,
f(x) = f(x*) + Df(x*)(x − x*) + (1/2)(x − x*)ᵀ D²f(x*) (x − x*) + R2(x − x*)
where the remainder R2(·) vanishes faster than ||x − x*||². If x* is an interior local max or min, then Df(x*) = θ (a vector of zeros), so the quadratic form in the second-order term will share the sign of f(x) − f(x*).
Examples to illustrate: (i) the SONC are not sufficient: f(x) = x³. (ii) Semi-definiteness cannot be replaced by definiteness: f(x) = x⁴. (iii) These are conditions for local, not global, optima: f(x) = 2x³ − 3x². (iv) Strategy for using the conditions to identify global optima: f(x) = 4x³ − 5x² + 2x on S = [0, 1].
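Example (iv) illustrates the standard strategy: collect the interior critical points, add the boundary points of S, and compare function values. A short numerical sketch of that strategy (illustrative, not from the notes):

    import numpy as np

    # f(x) = 4x^3 - 5x^2 + 2x on S = [0, 1]; f'(x) = 12x^2 - 10x + 2.
    f = lambda x: 4 * x**3 - 5 * x**2 + 2 * x
    critical = np.roots([12, -10, 2])                     # roots of f'(x) = 0
    candidates = [c.real for c in critical if 0 <= c.real <= 1] + [0.0, 1.0]
    for x in sorted(candidates):
        print(f"x = {x:.4f}, f(x) = {f(x):.4f}")
    # Global max at x = 1 (f = 1), global min at x = 0 (f = 0);
    # the interior critical points x = 1/3, 1/2 are only local optima.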


Chapter 4
Optimization with Equality Constraints

4.1 Introduction
We are given an objective function f : R^n → R to maximize or minimize, subject to k constraints. That is, there are k functions g1 : R^n → R, g2 : R^n → R, ..., gk : R^n → R, and we wish to
Maximize f(x) over all x ∈ R^n such that g1(x) = 0, ..., gk(x) = 0.
More compactly, collect the constraint functions (looking at them as component functions) into one function g : R^n → R^k, where g(x) = (g1(x), ..., gk(x)). Then what we want is to
Maximize f(x) over all x ∈ R^n such that g(x)_{1×k} = θ_{1×k}.
The Theorem of Lagrange provides necessary conditions for a local optimum x*. By local we mean that the value f(x*) is a max or min compared to the values f(x) for all x contained in some open set U containing x* such that x satisfies the k constraints. Thus the problem it considers is to provide necessary conditions for a Max or a Min of
f(x) over all x ∈ S, where S = U ∩ {x ∈ R^n | g(x) = θ}, for some open set U.
The following example illustrates the principle of no arbitrage underlying a maximum. A more general illustration, with more than one constraint, requires a little bit of the machinery of linear inequalities, which we'll not cover. The idea here is that the Lagrange multiplier captures how the constraint is distributed across the variables.
Example 1. Suppose x* solves Max U(x1, x2) s.t. I − p1x1 − p2x2 = 0, and suppose x* >> θ.
Then reallocating a small amount of income from one good to the other does not increase utility. Say income dI > 0 is shifted from good 2 to good 1, so dx1 = (dI/p1) > 0 and dx2 = −(dI/p2) < 0. Note that this reallocation satisfies the budget constraint, since
p1(x1 + dx1) + p2(x2 + dx2) = I
The change in utility is dU = U1 dx1 + U2 dx2 = [(U1/p1) − (U2/p2)]dI ≤ 0, since the change in utility cannot be positive at a maximum. Therefore,
(U1/p1) − (U2/p2) ≤ 0    (1)
Similarly, dI > 0 shifted from good 1 to good 2 does not increase utility, so that
[−(U1/p1) + (U2/p2)]dI ≤ 0, or
−(U1/p1) + (U2/p2) ≤ 0    (2)

Eqs. (1) and (2) imply (U1(x*)/p1) = (U2(x*)/p2) = λ, say.    (3)
That is, the marginal utility of the last bit of income, however spent, equals (U1(x*)/p1) = (U2(x*)/p2) at the optimum. Also, (3) implies U1(x*) = λp1, U2(x*) = λp2. Along with p1x1 + p2x2 = I, these are the FONC of the Lagrangean problem
Max L(x, λ) = U(x1, x2) + λ[I − p1x1 − p2x2]
More generally, suppose F : R^n → R and G : R^n → R, and suppose x* solves Max F(x) s.t. c − G(x) = 0. (This part is skippable.)
Contemplate a change dx in x that respects the constraint G(x) = c. That is,
dG = G1 dx1 + G2 dx2 = 0. Therefore,
G1 dx1 = −G2 dx2 = dc, say. So dx1 = (dc/G1), dx2 = −(dc/G2). If dc > 0, then our change dx implies dx1 > 0, dx2 < 0. F does not increase at the maximum x*. So
dF = F1 dx1 + F2 dx2 ≤ 0, or [(F1/G1) − (F2/G2)]dc ≤ 0. The reverse inequality, ≥ 0, can be shown similarly.
Therefore, (F1(x*)/G1(x*)) = (F2(x*)/G2(x*)) = λ    (4)

Caveat: We have assumed that G1 (x ) and G2 (x ) are not both zero at


x . This is called the constraint qualification.
Again, note that (4) can be got as the FONC of the problem
Max L(x, λ) = F(x) + λ[c − G(x)].
On λ.
Let's go back to the utility example. At the optimum (x*, λ*), suppose you increase income by ΔI. Buying more x1 implies utility increases by (U1(x*)/p1)ΔI, approximately. Buying more x2 implies utility increases by (U2(x*)/p2)ΔI. At the optimum, (U1(x*)/p1) = (U2(x*)/p2) = λ*. So in either case, utility increases by λ*ΔI. So λ* gives the change in the objective (here, the objective is utility), at an optimum, that results from relaxing the constraint a little bit.
The interpretation is the same in the more general case: if G(x) = c, and c is increased by Δc, suppose x1 alone is then increased. So ΔG = G1 dx1 = Δc, or dx1 = (Δc/G1). So at x*, F increases by dF = F1 dx1 = (F1(x*)/G1(x*))Δc = λΔc. If instead x2 is changed, F increases by dF = F2 dx2 = (F2(x*)/G2(x*))Δc = λΔc.
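As a numerical sanity check of this interpretation of λ (a sketch using the Cobb-Douglas case solved later in these notes, where xi* = I/2pi and hence the maximized utility is V(I) = I²/4p1p2):

    p1, p2, I, dI = 1.0, 2.0, 10.0, 1e-5

    def V(income):
        """Maximized utility for U = x1*x2, using x_i* = income/(2*p_i)."""
        return (income / (2 * p1)) * (income / (2 * p2))

    lam = I / (2 * p1 * p2)                 # the multiplier at the optimum
    print((V(I + dI) - V(I)) / dI, lam)     # both ~2.5: lambda = dV/dI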

4.2 The Theorem of Lagrange
The set up is the following. f : R^n → R is the objective function; gi : R^n → R, i = 1, ..., k, k < n, are the constraint functions. Let g : R^n → R^k be the function given by g(x) = (g1(x), ..., gk(x)), and let Df(x) = (∂f(x)/∂x1, ..., ∂f(x)/∂xn).

Dg(x) =
[ Dg1(x) ]   [ ∂g1(x)/∂x1 ... ∂g1(x)/∂xn ]
[   ...  ] = [     ...           ...     ]
[ Dgk(x) ]   [ ∂gk(x)/∂x1 ... ∂gk(x)/∂xn ]
So Dg(x) is a k × n matrix.
The theorem below provides a necessary condition for a local max or local min. Note that x* is a local max (resp. min) of f on the constraint set {x ∈ R^n | gi(x) = 0, i = 1, ..., k} if f(x*) ≥ f(x) (resp. ≤ f(x)) for all x ∈ U, for some open set U containing x*, s.t. gi(x) = 0, i = 1, ..., k. Thus x* is a Max on the set S = U ∩ {x ∈ R^n | gi(x) = 0, i = 1, ..., k}.

Theorem 12 (Theorem of Lagrange). Let f : R^n → R and gi : R^n → R, i = 1, ..., k, k < n, be C¹ functions. Suppose x* is a Max or a Min of f on the set S = U ∩ {x ∈ R^n | gi(x) = 0, i = 1, ..., k}, for some open set U ⊆ R^n. Then there exist real numbers μ, λ1, ..., λk, not all zero, such that
μ Df(x*) + Σ_{i=1}^k λi Dgi(x*) = θ_{1×n}.
Moreover, if rank(Dg(x*)) = k, then we may put μ = 1.
Notes:
(1) k < n so that the constraint set is not trivial. Just as a linear constraint a·x = 0 (or a·x = c) marks out a set of points that is an (n−1)-dimensional subspace (the 1 × n matrix a has rank 1 and nullity n−1), so each of the constraints gi(x) = 0, i = 1, ..., k marks out an (n−1)-dimensional manifold. The constraint set is their intersection and hence an (n−k)-dimensional space. For this to be a non-empty set with more than one point, we need k < n.
(2) The condition rank(Dg(x*)) = k is called the constraint qualification. The first part of the theorem says that at a local Max or Min, under the assumption of continuous differentiability of f and the gi, i = 1, ..., k, the vectors Df(x*), Dg1(x*), ..., Dgk(x*) are linearly dependent. The constraint qualification (CQ) basically assumes that the vectors Dg1(x*), ..., Dgk(x*) are linearly independent. In that case,
μ Df(x*) + Σ_{i=1}^k λi Dgi(x*) = θ implies that μ cannot equal 0, for then Σ_{i=1}^k λi Dgi(x*) = θ, which along with linear independence implies λi = 0, i = 1, ..., k. This cannot be. So if the CQ holds, then μ ≠ 0, and we can divide through by μ.
(3) In most applications the CQ holds. We usually check first whether it holds, and then proceed. Suppose it does hold. Note that
Df(x*) + Σ_{i=1}^k λi Dgi(x*) = θ subsumes the following n equations (with μ = 1):
(∂f(x*)/∂xj) + Σ_{i=1}^k λi (∂gi(x*)/∂xj) = 0, j = 1, ..., n
Note also that this leads to the usual procedure for finding an equality-constrained Max or Min, by setting up a Lagrangean function
L(x, λ) = f(x) + Σ_{i=1}^k λi gi(x), and solving the FONC
(∂L(x, λ)/∂xj) = (∂f(x)/∂xj) + Σ_{i=1}^k λi (∂gi(x)/∂xj) = 0, j = 1, ..., n
(∂L(x, λ)/∂λi) = gi(x) = 0, i = 1, ..., k
which is (n + k) equations in the (n + k) variables x1, ..., xn, λ1, ..., λk.
Why does the above procedure usually work to isolate global optima?
The FONC that come out of the Lagrangean function are, as seen in the Theorem of Lagrange, necessary conditions for local optima. However, when we do equality-constrained optimization, (i) usually a global max (or min) x* is known to exist, and (ii) for most problems the CQ is met at all x ∈ S, and therefore it is met at the optimum as well. (Note that otherwise, not knowing the optimum when we start out on a problem, it is not possible to check whether the CQ holds at that point!)
When (i) and (ii) are met, the solutions to the FONC of the Lagrangean function will include all local optima, and hence will include the global optimum that we want. By comparing the values f(x) for all x that solve the FONC, we get the point at which f(x) is a max or a min. With this method, we don't need second order conditions at all, if we just want to find a global max or min.
Pathologies
The above procedure may not always work.
Pathology 1. A global optimum may not exist. Then none of the critical points (solutions to the FONC of the Lagrangean function) is a global optimum. Critical points may then be only local optima, or they may not even be local optima. Indeed, the Theorem of Lagrange gives a necessary condition; so there could be points x0 that meet the condition and yet are not even a local max or min.
Example. Max f(x, y) = x³ + y³ s.t. g(x, y) = x − y = 0. Here the contour set Cg(0) is the 45-degree line in the x-y plane. By taking larger and larger positive values of x and y on this contour set, we get higher and higher f(x, y). So f does not have a global max on the constraint set. But suppose we mechanically crank out the Lagrangean FONCs, as follows:
Max x³ + y³ + λ(x − y)
FONC: 3x² + λ = 0
3y² − λ = 0
x − y = 0. So x = y = λ = 0 is a solution. But (x*, y*) = (0, 0) is neither a local max nor a local min. Indeed, f(0, 0) = 0, whereas for (x, y) = (ε, ε), ε > 0, f(ε, ε) = 2ε³ > 0, and for (x, y) = (ε, ε), ε < 0, f(ε, ε) = 2ε³ < 0.
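Mechanically cranking out these FONCs can be automated; a small sympy sketch (illustrative, not part of the notes) confirms that the only critical point is the origin, which we just argued is not an optimum:

    import sympy as sp

    # FONCs for Max x^3 + y^3 s.t. x - y = 0 (Pathology 1: no global max exists).
    x, y, lam = sp.symbols('x y lam', real=True)
    L = x**3 + y**3 + lam * (x - y)
    foc = [sp.diff(L, v) for v in (x, y, lam)]
    print(sp.solve(foc, [x, y, lam], dict=True))
    # -> [{x: 0, y: 0, lam: 0}]; but (0, 0) is neither a local max nor a min.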
Pathology 2. The CQ is violated at the optimum.
In this case, the FONCs need not be satisfied at the global optimum.
Example. Max f(x, y) = −y s.t. g(x, y) = y³ − x² = 0.
Let us first find the solution using native intelligence. Then we'll show that the CQ fails at the optimum, and that the usual Lagrangean method is a disaster. Finally, we'll show that the general form of the equation in the Theorem of Lagrange, which does NOT assume that the CQ holds at the optimum, works.
The constraint is y³ = x², and since x² is nonnegative, so must y³ be. Therefore, y ≥ 0. Maximizing −y s.t. y ≥ 0 implies y = 0 at the max. So y³ = x² = 0, so x = 0. So f attains its global max at (x, y) = (0, 0).
Dg(x, y) = (−2x, 3y²) = (0, 0) at (x, y) = (0, 0). So rank(Dg(x, y)) = 0 < k = 1 at the optimum; the CQ fails at this point. Using the Lagrangean method, we get the following FONC:
(∂f/∂x) + λ(∂g/∂x) = 0, that is, −2λx = 0    (1)
(∂f/∂y) + λ(∂g/∂y) = 0, that is, −1 + 3λy² = 0    (2)
(∂L/∂λ) = 0, that is, −x² + y³ = 0    (3)
Eq.(1) implies either λ = 0 or x = 0. x = 0 implies, from Eq.(3), that y = 0, but then (2) becomes −1 = 0, which is not possible. Similarly, λ = 0 again violates (2).
But the general form of the condition in the Theorem of Lagrange does not rely on the CQ, and works. In this problem, the only equation out of the above three that changes is Eq.(2), as we see below:
μ Df(x, y) + λ Dg(x, y) = (0, 0), and −x² + y³ = 0, with Df(x, y) = (0, −1), Dg(x, y) = (−2x, 3y²), yield
μ(∂f/∂x) + λ(∂g/∂x) = 0, that is, −2λx = 0    (1)
μ(∂f/∂y) + λ(∂g/∂y) = 0, that is, −μ + 3λy² = 0    (2)
(∂L/∂λ) = 0, that is, −x² + y³ = 0    (3)
Now, Eq.(1) implies λ = 0 or x = 0. If λ = 0, then Eq.(2) implies μ = 0. But μ = λ = 0 is ruled out by the Theorem of Lagrange. Therefore, here λ ≠ 0. Hence x = 0. From Eq.(3), we then have y = 0, and so from Eq.(2), μ = 0. So we get x = y = 0 as a solution.
Second-Order Conditions
These conditions are characterized by definiteness or semi-definiteness of the Hessian of the Lagrangean function, which is the appropriate function to look at in this constrained optimization problem. Also, we don't have to check the appropriate inequality for the quadratic form at all x: only those x are relevant that satisfy the constraints. Second-order conditions in general say something about the curvature of the objective function around the local max or min, i.e., how the graph curves as we move from x* to a nearby x. In constrained optimization, we cannot move from x* to any arbitrary nearby x; the move must be to an x which satisfies the constraints. That is, such a move must leave all gi(x) at 0. In other words, dgi(x) = Dgi(x)·dx = 0, where dx is a vector x′ − x that must be orthogonal to Dgi(x). Thus it suffices to evaluate the appropriate quadratic form at all vectors x that are orthogonal to all the gradients of the constraint functions.
Notice that if we parameterize the curve describing a constraint by setting gi(x(t)) = 0, then by the Chain Rule, we have
Dgi(x(t)) Dx(t) = 0.
Dx(t) is in the direction of the tangent to the curve x(t), so the equation above implies that Dgi(x(t)) is orthogonal to it. (Seen as a vector rather than a matrix, we write this as the gradient ∇gi(x(t)).) (As an application, notice how this geometry implies the first order condition MRSxy = px/py in a two-good utility maximization in which both goods are consumed at the U-max.)
In the second-order conditions, we check the definiteness or semi-definiteness of the second derivative or Hessian D²L(x*, λ*) w.r.t. all vectors x that are orthogonal to the gradient of each constraint. This approximates vectors close to x* that satisfy each gi(x) = 0.
Since L(x, λ) = f(x) + Σ_{i=1}^k λi gi(x),
D²L(x, λ)_{n×n} = D²f(x)_{n×n} + Σ_{i=1}^k λi D²gi(x)_{n×n},
where
D²f(x) =
[ f11(x) ... f1n(x) ]
[   ...       ...   ]
[ fn1(x) ... fnn(x) ]
and D²gi(x) =
[ gi11(x) ... gi1n(x) ]
[   ...        ...    ]
[ gin1(x) ... ginn(x) ]
So
D²L(x, λ)_{n×n} =
[ f11(x) + Σ_i λi gi11(x) ... f1n(x) + Σ_i λi gi1n(x) ]
[           ...                        ...            ]
[ fn1(x) + Σ_i λi gin1(x) ... fnn(x) + Σ_i λi ginn(x) ]
is the second derivative of L w.r.t. the x variables. Note that D²L(x, λ) is symmetric, so we may work with its quadratic form.

At a given x* ∈ R^n,
Dg(x*)_{k×n} =
[ Dg1(x*) ]
[   ...   ]
[ Dgk(x*) ]
So the set of all vectors x that are orthogonal to all the gradient vectors of the constraint functions at x* is the Null Space of Dg(x*), N(Dg(x*)) = {x ∈ R^n | Dg(x*) x = θ_{k×1}}.
Theorem 13 Suppose there exists (x*_{n×1}, λ*_{k×1}) such that rank(Dg(x*)) = k and Df(x*) + Σ_{i=1}^k λi* Dgi(x*) = θ.
(i) (a necessary condition) If f has a local max (resp. local min) on S at the point x*, then xᵀ D²L(x*, λ*) x ≤ 0 (resp. ≥ 0) for all x ∈ N(Dg(x*)).
(ii) (a sufficient condition) If xᵀ D²L(x*, λ*) x < 0 (resp. > 0) for all x ∈ N(Dg(x*)), x ≠ θ, then x* is a strict local max (resp. strict local min) of f on S.
Checking these directly involves checking inequalities for every vector x in the Null Space of Dg(x*), which is an (n − k)-dimensional subspace. Alternatively, we could check the signs of determinants instead, and the relevant tests are given by the theorem below, which states tests equivalent to those of the above theorem. These are the Bordered Hessian conditions. This stuff is tedious indeed, and it would be a hard taskmaster who would ask anyone to waste hard disk space by memorizing these.
BH(L*) =
[ 0_{k×k}            Dg(x*)_{k×n}      ]
[ [Dg(x*)]ᵀ_{n×k}    D²L(x*, λ*)_{n×n} ]
an (n+k) × (n+k) matrix.
BH(L*; k + n − r) is the matrix obtained by deleting the last r rows and columns of BH(L*).
BH_π(L*) will denote a variant in which the permutation π has been applied to (i) both the rows and columns of D²L(x*, λ*) and (ii) only the columns of Dg(x*) and only the rows of [Dg(x*)]ᵀ, which is the transpose of Dg(x*).
Theorem 14 (1a) xᵀ D²L(x*, λ*) x ≤ 0 for all x ∈ N(Dg(x*)) iff for all permutations π of {1, ..., n} we have:
(−1)^{n−r} det(BH_π(L*; n + k − r)) ≥ 0, r = 0, 1, ..., n − k − 1.
(1b) xᵀ D²L(x*, λ*) x ≥ 0 for all x ∈ N(Dg(x*)) iff for all permutations π of {1, ..., n} we have:
(−1)^k det(BH_π(L*; k + n − r)) ≥ 0, r = 0, 1, ..., n − k − 1.
(2a) xᵀ D²L(x*, λ*) x < 0 for all nonzero x ∈ N(Dg(x*)) iff (−1)^{n−r} det(BH(L*; n + k − r)) > 0, r = 0, 1, ..., n − k − 1.
(2b) xᵀ D²L(x*, λ*) x > 0 for all nonzero x ∈ N(Dg(x*)) iff (−1)^k det(BH(L*; n + k − r)) > 0, r = 0, 1, ..., n − k − 1.
Note. (1) For the cases of negative definiteness or semi-definiteness subject to constraints, the determinant of the bordered Hessian with the last r rows and columns deleted must be of the same sign as (−1)^{n−r}. The sign of (−1)^{n−r} switches with each successive increase in r from r = 0 to r = n − k − 1, so the corresponding bordered Hessians switch signs. In the usual textbook case of 2 variables and one constraint (n = 2, k = 1, so n − k − 1 = 0), we just need to check the sign for r = 0, that is, the sign of the determinant of the big bordered Hessian. You should be clear about what this sign should be if it is to be a sufficient condition for a strict local max or min. For the necessary condition, we need to check signs (≥ 0 or ≤ 0) for one permuted matrix as well, in this case. What is this permuted matrix?
(2) As in the unconstrained case, the sufficiency conditions do not require checking weak inequalities for permuted matrices.
(3) In the p.s.d. and p.d. cases, the signs of the relevant minors must be all positive if the number k of constraints is even, and all negative if k is odd.
(4) If we know that a global max or min exists, where the CQ is satisfied, and we get a unique solution x* ∈ R^n that solves the FONC, then we may use a second order condition to check whether it is a max or a min. However, weak inequalities demonstrating n.s.d. or p.s.d. (subject to constraints) of D²(L*) do not imply a max or min; these are only necessary conditions. Strict inequalities are useful: they imply a (strict) max or min. If, however, a global max or min exists, the CQ is satisfied everywhere, and there is more than one solution of the FONC, then the one giving the highest value of f(x) is the max. In this case, we don't need second order conditions to conclude that it is the global max.
4.3 Two Examples
Example 1.
A consumer with income I > 0 faces prices p1 > 0, p2 > 0, and wishes to maximize U(x1, x2) = x1x2. So the problem is: Max x1x2 s.t. x1 ≥ 0, x2 ≥ 0, and p1x1 + p2x2 ≤ I.
To be able to use the Theorem of Lagrange, we need equality constraints.
Now, it is easy to see that if (x1*, x2*) solves the above problem, then (i) (x1*, x2*) > (0, 0): if xi = 0 for some i, then utility equals zero, and clearly we can do better by allocating some income to the purchase of each good; and (ii) the budget constraint binds at (x1*, x2*): for if p1x1* + p2x2* < I, then we can allocate some of the remaining income to both goods, and increase utility further.
We conclude from this that a solution (x1*, x2*) will also be a solution to the problem
Max x1x2 s.t. x1 > 0, x2 > 0, and p1x1 + p2x2 = I.
That is, Maximize U(x1, x2) = x1x2 over the set S = R²₊₊ ∩ {(x1, x2) | I − p1x1 − p2x2 = 0}. Since the budget set in this problem is compact and the utility function is continuous, U attains a maximum on the budget set (by Weierstrass' Theorem). Moreover, we argued above that at such a maximum x*, xi* > 0, i = 1, 2, and the budget constraint binds. So, x* ∈ S.
Furthermore, Dg(x) = (−p1, −p2), so rank(Dg(x)) = 1 at all points in the budget set. So the CQ is met. Therefore, the global max will be among the critical points of L(x1, x2, λ) = x1x2 + λ(I − p1x1 − p2x2).
FONC:
(∂L/∂x1) = x2 − λp1 = 0    (1)
(∂L/∂x2) = x1 − λp2 = 0    (2)
(∂L/∂λ) = I − p1x1 − p2x2 = 0    (3)
λ ≠ 0 (otherwise (1) and (2) imply that x1 = x2 = 0, which violates (3)).

Therefore, from (1) and (2), = (x2 /p1 ) = (x1 /p2 ), or p1 x1 = p2 x2 . So (3)
implies I 2p1 x1 = 0, or p1 x1 = (I/2), which is the standard Cobb-Douglas
utility result that the budget share of a good is proportional to the exponent
w.r.t. it in the utility function. So we get
xi = (I/2pi ), i = 1, 2, and = (I/2p1 p2 ).
We argued that the global max would be one of the critical points of
L(x, ) in this example; (note, however, that the global min (which occurs
at (x1 , x2 ) = (0, 0) is not a critical point). Since we have only it one critical
point, it follows that this must be the global max! (We know that x1 = x2 = 0
is the global min, and not the point that we have located).
Take a time-out to consider the alternative to checking the constraint qualification. We can start by setting up a Lagrangean
L(x, μ, λ) = μU(x) + λ(I − p·x)
and cranking out the FOCs. The fact that the CQ holds shows up as μ ≠ 0 in the FOC. (Convince yourself that in any problem, the CQ failing at a point manifests itself as μ = 0 in the FOC.)
The FOCs are now:
(∂L/∂x1) = μx2 − λp1 = 0    (1)
(∂L/∂x2) = μx1 − λp2 = 0    (2)
(∂L/∂λ) = I − p1x1 − p2x2 = 0    (3)
If μ = 0, then (1) and (2) imply λ = 0. All multipliers equal to zero is not a permitted solution in the Theorem of Lagrange. So μ ≠ 0; we can normalize (divide through by μ) and set it equal to 1, and proceed as before.
If there were lots of solutions to the FOCs, the sufficient SOCs may help in narrowing down the set of points one needs to evaluate for the global max. Let's check the SOCs for the above example, although this is not necessary.
Dg(x*) = (−p1, −p2)
D²L(x*, λ*) = D²U(x*) + λ* D²g(x*)
= [ U11(x*)  U12(x*) ] + λ* [ g11(x*)  g12(x*) ]
  [ U21(x*)  U22(x*) ]      [ g21(x*)  g22(x*) ]
= [ 0  1 ] + λ* [ 0  0 ] = [ 0  1 ]
  [ 1  0 ]      [ 0  0 ]   [ 1  0 ]
Now evaluate the quadratic form zᵀ D²L(x*, λ*) z = 2z1z2 at any (z1, z2) that is orthogonal to Dg(x*) = (−p1, −p2). So, −p1z1 − p2z2 = 0, or z1 = −(p2/p1)z2. For such (z1, z2), zᵀ D²L(x*, λ*) z = −(2p2/p1)z2² < 0, so D²L(x*, λ*) is negative definite relative to vectors orthogonal to the gradient of the constraint, and x* is therefore a strict local max.
You've probably seen the computation below. I provide it here anyway, even though it is unnecessary, as we've done the second-order exercise above using the quadratic form.

BH(L*) = [ 0           Dg(x*)       ] = [  0   −p1  −p2 ]
         [ [Dg(x*)]ᵀ   D²L(x*, λ*)  ]   [ −p1   0    1  ]
                                        [ −p2   1    0  ]
det(BH(L*)) = 2p1p2 > 0. This is the sign of (−1)^n = (−1)². Therefore, there is a strict local max at x*.
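For a quick numerical confirmation of this determinant at arbitrary positive prices (a sketch; p1 = 2, p2 = 3 are chosen arbitrarily):

    import numpy as np

    p1, p2 = 2.0, 3.0
    BH = np.array([[0.0, -p1, -p2],
                   [-p1, 0.0, 1.0],
                   [-p2, 1.0, 0.0]])
    print(np.linalg.det(BH), 2 * p1 * p2)  # both 12.0: sign (-1)^2 > 0, strict local max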


Example 2.
Find the global maxima and minima of f(x, y) = x² − y² on the unit circle in R², i.e., on the set {(x, y) ∈ R² | g(x, y) ≡ 1 − x² − y² = 0}.
Constrained maxima and minima exist, by Weierstrass' theorem, as f is continuous and the unit circle is closed and bounded. Bounded, as it is entirely contained in, say, B(0, 2). Closed as well: visually, we can see that the constraint set contains all its adherent points. More formally, suppose (xk, yk), k = 1, 2, ..., is a sequence of points on the unit circle converging to the limit (x, y). Since g is continuous and (xk, yk) → (x, y), we have g(xk, yk) → g(x, y). Since g(xk, yk) = 0 ∀k, the limit is 0, i.e., g(x, y) = 0, so (x, y) is on the unit circle, and the unit circle is closed.
Constraint Qualification: Dg(x, y) = (−2x, −2y). The rank of this row matrix is zero only at (x, y) = (0, 0). But the origin does not satisfy the constraint. Everywhere on the constraint set, at least one of x or y is not zero, and the rank of Dg(x, y) is 1.
So, the max and min will be solutions to the FOCs of the usual Lagrangean
L(x, y, λ) = x² − y² + λ(1 − x² − y²)
FOC:
2x − 2λx = 0    (1)
−2y − 2λy = 0    (2)
x² + y² = 1    (3)
(1) and (2) imply 2x(1 − λ) = 0 and −2y(1 + λ) = 0 respectively. Suppose λ ≠ 1 and λ ≠ −1. Then (x, y) = (0, 0), violating (3). If λ = 1, then y = 0, so x² = 1, and so on. So the four solutions (x, y, λ) to the FOCs form the solution set {(1, 0, 1), (−1, 0, 1), (0, 1, −1), (0, −1, −1)}. Evaluating the function values at these points, we have that f has a constrained max at (1, 0) and (−1, 0) and a constrained min at (0, 1) and (0, −1).
Although unnecessary, let's practice second-order conditions for this example. Df(x, y) = (2x, −2y), Dg(x, y) = (−2x, −2y).
D²f(x, y) = [ 2  0 ]
            [ 0 −2 ]
D²g(x, y) = [ −2  0 ]
            [  0 −2 ]
With λ* = 1, for instance, D²L(x*, y*, λ*) evaluates to
D²f(x*, y*) + λ* D²g(x*, y*) = [ 0  0 ]
                               [ 0 −4 ]
(x, y) orthogonal to Dg(x*, y*) = (−2, 0) (at (x*, y*) = (1, 0)) implies that (x, y) must satisfy
(−2, 0)·(x, y) = 0, or −2x + 0y = 0. So x = 0, and y is free to take any value. So consider the quadratic form with (x, y) = (0, y):
(0  y) [ 0  0 ] (0)  = −4y² < 0 for all (0, y) ≠ (0, 0).
       [ 0 −4 ] (y)
So negative definiteness holds, and we are at a strict local max.
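The whole example can also be checked mechanically; a sympy sketch (illustrative) that solves the FOCs and evaluates f at each critical point:

    import sympy as sp

    # Optimize x^2 - y^2 on the unit circle.
    x, y, lam = sp.symbols('x y lam', real=True)
    L = x**2 - y**2 + lam * (1 - x**2 - y**2)
    sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
    for s in sols:
        print(s, s[x]**2 - s[y]**2)
    # max value 1 at (1, 0) and (-1, 0); min value -1 at (0, 1) and (0, -1)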

Some Derivatives
(1). Let I : R^n → R^n be defined by I(x) = x, ∀x ∈ R^n. In component function notation, we have I(x) = (I1(x), ..., In(x)) = (x1, ..., xn). So DIi(x) = ei, i.e., the (row) vector with 1 in the i-th place and zeros elsewhere. So DI(x) = I_{n×n}, the identity matrix.
By similar work, we can show that if f (x) = Ax, where A is an m n matrix, then Df (x) = A. Indeed, the jth component function fj (x) = aj1 x1 +
. . . + ajn xn , so its matrix of partial derivatives with respect to x1 , . . . , xn is
Dfj (x) = (aj1 . . . ajn ).
(2). Let f : R^n → R^m and g : R^n → R^m. By way of convention, consider f(x) and g(x) to be column vectors, and consider the function h : R^n → R defined by h(x) = f(x)ᵀ g(x). Then,
Dh(x) = g(x)ᵀ Df(x) + f(x)ᵀ Dg(x)
Indeed,
D( f(x)ᵀ g(x) ) = D( Σ_{i=1}^m fi(x)gi(x) ) = Σ_i D( fi(x)gi(x) ) = Σ_i [ gi(x)Dfi(x) + fi(x)Dgi(x) ]
The second equality above holds because the differential operator D is linear; the third applies the product rule to each term.
Note that
Σ_i gi(x)Dfi(x) = (g1(x), ..., gm(x)) [ Df1(x) ]
                                      [  ...   ]
                                      [ Dfm(x) ]
= g(x)ᵀ Df(x), and so on.
We take a step back and derive this in a more expanded fashion. Since h(x) = Σ_{i=1}^m fi(x)gi(x), its partial derivative with respect to xj is:
∂h(x)/∂xj = Σ_{i=1}^m [ gi(x)(∂fi(x)/∂xj) + fi(x)(∂gi(x)/∂xj) ] = g(x)ᵀ Df(x)[·, j] + f(x)ᵀ Dg(x)[·, j]
where for any matrix A, we write A[·, j] to represent its j-th column. So Dh(x) = (∂h(x)/∂x1, ..., ∂h(x)/∂xn) equals
( g(x)ᵀDf(x)[·, 1] + f(x)ᵀDg(x)[·, 1], ..., g(x)ᵀDf(x)[·, n] + f(x)ᵀDg(x)[·, n] ) = g(x)ᵀ Df(x) + f(x)ᵀ Dg(x)
As an application, let h(x) = xᵀx (so f = g = I). Then Dh(x) = xᵀDI(x) + xᵀDI(x) = xᵀI + xᵀI = 2xᵀ.
On the Chain Rule
We saw an example (in the proof of the first order condition in unconstrained optimization) of the Chain Rule at work; you've seen this before. Namely, if h : R → R^n and f : R^n → R are differentiable at the relevant points, then the composition g(t) = f(h(t)) is differentiable at t and
g′(t) = Df(h(t)) Dh(t) = Σ_{j=1}^n (∂f(h(t))/∂xj) hj′(t)
You may have encountered this before in the notation f(h1(t), ..., hn(t)), with some use of total differentiation or something. Similarly, suppose h : R^p → R^n and f : R^n → R^m are differentiable at the relevant points; then the composition g(x) = f(h(x)), g : R^p → R^m, is differentiable at x, and
Dg(x) = Df(h(x)) Dh(x).
Here, on the RHS an m × n matrix multiplies an n × p matrix, resulting in the m × p matrix on the LHS.
The intuition for the Chain Rule is perhaps this. Let z = h(x). If x
changes by dx, the first-order change in z is dz = Dh(x)dx. The first-order
change in f (z) is then Df (z)dz. Substituting for dz, the first-order change
in f (h(x)) equals [Df (h(x))Dh(x)] dx.
In the formula, things are actually quite similar to the familiar case.
The (i, j)th element of the matrix Dg(x) is gi (x)/xj , where gi is the ith
component function of g and xj is the j th variable. Since this is equal to the
dot product of the ith row of Df (h(x)) and the j th column of Dh(x), we have


∂gi(x)/∂xj = Σ_{k=1}^n (∂fi(h(x))/∂hk)(∂hk(x)/∂xj)
On the Implicit Function Theorem


Theorem 15 Suppose F : R^{n+m} → R^m is C¹, and suppose F(x*, y*) = 0 for some y* ∈ R^m and some x* ∈ R^n. Suppose also that DFy(x*, y*) has rank m. Then there are open sets U containing x* and V containing y*, and a C¹ function f : U → V, s.t.
F(x, f(x)) = 0 ∀x ∈ U.
Moreover,
Df(x*) = −[DFy(x*, y*)]⁻¹ DFx(x*, y*)
Note that we could alternatively look at the equation F(x, y) = c, for some given c ∈ R^m, without changing anything. The proof of this theorem starts going deep, so it will not be part of this course. The proof for the n = m = 1 case, however, is provided at the end of this chapter. But notice that applying the Chain Rule to differentiate
F(x*, f(x*)) = 0
yields
DFx(x*, y*) + DFy(x*, y*) Df(x*) = 0    (*)
whence the expression for Df(x*).


More tediously, in terms of compositions: if h(x) = (x, f(x)), then
Dh(x) = [ I     ]
        [ Df(x) ]
whereas DF(·) = ( DFx(·) | DFy(·) ), so matrix multiplication using partitions yields Eq.(*).
Interpretations
(1) Indifference Curve.
By way of motivation, think of a utility function F defined over 2 goods, evaluated at some utility value c. So F(x, y) = c. Let (x*, y*) be a solution of this equation, i.e., F(x*, y*) = c. Under the assumptions of the Implicit Function Theorem on F, there exists a function f s.t. if x is a point close to x*, then F(x, f(x)) = c. That is, as we vary x close to x*, there exists a unique y s.t. F(x, y) = c. Because of the uniqueness, we have y = f(x), i.e., a functional relationship. Draw an indifference curve corresponding to F(x, y) = c to see this visually. Moreover, the theorem asserts that the derivative of the implicit function is
f′(x*) = −Fx/Fy
where Fx, Fy are the partial derivatives of F, evaluated at (x*, y*). The marginal rate of substitution between the two goods (LHS) equals the ratio of the marginal utilities (RHS). In fact, when we say "under some assumptions on F," one of the assumptions is that Fy evaluated at (x*, y*) is not zero.
The mnemonic for getting the derivative: from F(x, y) = c, we totally differentiate to get Fx dx + Fy dy = 0, and rearrange to get dy/dx = −Fx/Fy.
(2). Comparative Statics.
We then move to the vector case by analogy. Suppose
F(x, y) = c
where x is an n-vector, y an m-vector, and c a given m-vector. Let (x*, y*) solve F(x*, y*) = c. Think of x as exogenous, so this is a set of m equations in the m endogenous variables y = (y1, ..., ym). You can stack these equations vertically; for laziness, I write them now as
F1(x1, ..., xn, y1, ..., ym) = c1, ..., Fm(x1, ..., xn, y1, ..., ym) = cm
So the vector function F has m component functions F1, ..., Fm.
Now I totally differentiate:
DFx dx + DFy dy = 0
as before; only now dx = (dx1, ..., dxn), dy = (dy1, ..., dym), DFx is the m × n matrix of partial derivatives whose i-th row is (∂Fi/∂x1, ..., ∂Fi/∂xn), and DFy is the m × m matrix whose i-th row is (∂Fi/∂y1, ..., ∂Fi/∂ym).
So DFy dy = −DFx dx. From here, we can work out the effect of changing any xj on the endogenous variables y1, ..., ym. Suppose we set all dxi = 0 for i ≠ j. Then DFx dx becomes dxj times the j-th column of DFx. We divide both sides by dxj, getting
DFy (∂y1/∂xj, ..., ∂ym/∂xj)ᵀ = −(∂F1/∂xj, ..., ∂Fm/∂xj)ᵀ
(the superscript T is for transpose, to get column vectors). So,
(∂y1/∂xj, ..., ∂ym/∂xj)ᵀ = −DFy⁻¹ (∂F1/∂xj, ..., ∂Fm/∂xj)ᵀ
Where in the scalar case we divided by the number Fy, here we multiply by the inverse of the matrix DFy. Now, if we stack horizontally the partial derivatives of y1, ..., ym w.r.t. x1, ..., xn on the LHS, we have Df(x); and on the RHS, the appropriate columns give DFx, so we have −(DFy(x, y))⁻¹ DFx, which is just the implicit function derivative formula.
(3) Application: Cournot Duopoly.
Firms 1 and 2 have constant unit costs c1 and c2, and face the twice continuously differentiable inverse demand function P(Q), where Q = q1 + q2 is industry output. So profits are given by
π1 = P(q1 + q2)q1 − c1q1
and
π2 = P(q1 + q2)q2 − c2q2
If profits are concave in own output, then the first-order conditions below characterize a Cournot-Nash equilibrium (q1*, q2*):
∂π1/∂q1 = P′(q1 + q2)q1 + P(q1 + q2) − c1 = 0
∂π2/∂q2 = P′(q1 + q2)q2 + P(q1 + q2) − c2 = 0
The concavity of profit w.r.t. own output follows from the condition: for all q1, q2,
∂²πi/∂qi² = P″(q1 + q2)qi + 2P′(q1 + q2) ≤ 0, i = 1, 2
The two first-order conditions can be written as the vector equation
F(q1, q2, c1, c2) = θ.
We want to know: how do the Cournot outputs change as a result of a change in unit costs? If c1 decreases, for instance, does q1 increase and q2 decrease? The implicit function theorem says that if DF_{q1,q2}(q1*, q2*, c1, c2) is of full rank (rank = 2), then, locally around this solution, q* = (q1*, q2*) is an implicit function of c = (c1, c2), with F(f(c), c) = θ. And
Df(c) = −[DFq(q*, c)]⁻¹ DFc(q*, c)
Note that
DFq(·) = [ ∂F1/∂q1  ∂F1/∂q2 ]
         [ ∂F2/∂q1  ∂F2/∂q2 ]
For brevity, let P′ and P″ be the first and second derivatives of P(·) evaluated at the equilibrium. Then
DFq(·) = [ P″q1 + 2P′   P″q1 + P′  ]
         [ P″q2 + P′    P″q2 + 2P′ ]
The determinant of this matrix works out to be
(P′)² + P′(P″(q1 + q2) + 2P′) > 0, since P′ < 0 and the concavity-in-own-output condition is assumed to be met. So the implicit function theorem can be applied.
Notice also that
DFc(·) = [ −1   0 ]
         [  0  −1 ]
Thus we can work out Df(c), the changes in equilibrium outputs as a result of changes in unit costs. It would be a good exercise for you to work these out, and sign them.
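As a worked sketch of that exercise under an assumed linear inverse demand P(Q) = a − bQ (so P′ = −b and P″ = 0; the numbers are illustrative):

    import numpy as np

    a, b = 10.0, 1.0
    DFq = np.array([[-2 * b, -b],      # rows: dF1/dq1, dF1/dq2
                    [-b, -2 * b]])     #       dF2/dq1, dF2/dq2
    DFc = np.array([[-1.0, 0.0],
                    [0.0, -1.0]])
    Df = -np.linalg.inv(DFq) @ DFc
    print(Df)
    # [[-2/(3b), 1/(3b)], [1/(3b), -2/(3b)]]:
    # a fall in c1 raises q1* and lowers q2*, as intuition suggests.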
Proof of the Theorem of Lagrange
Before the formal proof, note that we'll use the tangency of the contour sets of the objective and the constraint, which in other words uses the implicit function theorem. For example, consider maximizing F(x1, x2) s.t. G(x1, x2) = 0. If G1 ≠ 0 (this is the constraint qualification in this case), we have, at a tangency point of contour sets, G1 f′(x2) + G2 = 0 (where x1 = f(x2) is the implicit function that keeps the points (x1, x2) on the constraint); so f′(x2) = −G2/G1.
On the other hand, if we vary x2 and adjust x1 to stay on the constraint, the function value F(x1, x2) = F(f(x2), x2) does not increase; therefore, locally around the optimum, F1 f′(x2) + F2 = 0. Substituting, −F1(G2/G1) + F2 = 0. If we now put
−F1/G1 = λ,
we have both F1 + λG1 = 0 by definition, and F2 + λG2 = 0: the two FONC.
The Proof:
Without loss of generality, let the leading principal k × k submatrix of Dg(x*) have full rank. We write x = (w, z), with w being the first k coordinates of x and z being the last (n − k) coordinates. So showing the existence of λ (a 1 × k vector) that solves
Df(x*) + λ Dg(x*) = θ
is the same as showing that the two equations below hold for this λ; the equations are of dimension 1 × k and 1 × (n − k) respectively:
Dfw(w*, z*) + λ Dgw(w*, z*) = θ    (*)
Dfz(w*, z*) + λ Dgz(w*, z*) = θ    (**)
Since Dgw(w*, z*) is square and of full rank, Eq.(*) yields
λ = −Dfw(w*, z*)[Dgw(w*, z*)]⁻¹    (***)
We show this λ solves (**) as well. This needs two steps.
First, g(h(z), z) = θ for some implicit function h, so
Dh(z*) = −[Dgw(w*, z*)]⁻¹ Dgz(w*, z*)


Second, define F (z) = f (h(z), z). Since theres a constrained optimum
at (h(z ), z ), varying z while keeping w = h(z) will not increase the value
of F (z ). So
DF (z ) = Dfw (w , z )Dh(z ) + Dfz (w , z ) =
Substituting for Dh(z ),
Dfw (.)[Dgw (.)]1 Dgz (.) + Dfz (.) =
That is,
Dgz (.) + Dfz (.) =

Simple Implicit Function Theorem with Proof
Theorem 16 Suppose F : R² → R, F(a, b) = 0, F is C¹, and DF2(a, b) ≠ 0. Then there exists an open interval U containing a, an open interval V containing b, and a C¹ function f : U → V s.t. F(x, f(x)) = 0 for all x ∈ U.
Proof. We'll avoid the proof of continuous differentiability. Suppose WLOG DF2(a, b) < 0. Since DF2(·) is continuous, there exists h > 0 s.t. DF2(a, y) < 0 for all y ∈ (b − h, b + h). So, F(a, b − h) > 0 > F(a, b + h).
Now, since F is continuous, there exists an interval I containing a s.t. F(x, b − h) > 0 ∀x ∈ I, and an interval I′ containing a s.t. F(x, b + h) < 0 ∀x ∈ I′.
So, for all x ∈ U = I ∩ I′, we have
F(x, b − h) > 0 > F(x, b + h)
Therefore, by the Intermediate Value Theorem, there exists a y s.t. F(x, y) = 0. This y is unique because DF2(·) < 0 in this interval. So, we can pull out a unique function f(x) s.t. F(x, f(x)) = 0 ∀x ∈ U.
Digression about (discussion of?) Envelope Theorems
Suppose we have an objective function f : R^{n+1} → R that is a function of the vector variable x ∈ R^n, and also of a parameter a ∈ R that is held constant when maximizing f on some feasible set S ⊆ R^n. Suppose that for every admissible value of a, there is a unique interior maximizer, so we can say that x*(a) is the function that represents this relationship between parameter and maximizer. Suppose f is smooth and x*(a) is differentiable. Let V(a) be the value function for this problem, which gives the maximum value that f(x, a) attains when the parameter is at the level a. That is, V(a) ≡ f(x*(a), a).
We wish to know how V(a) changes with a change in a. As a changes, x*(a) changes as well, but this change has no first-order effect on V(a): the first-order change in V(a) is solely through the direct effect of a on f. This is the implication of the envelope theorem.
Indeed, using the Chain Rule on V(a) ≡ f(x*(a), a), we have V′(a) = Dfx(x*, a)Dx*(a) + ∂f(x*, a)/∂a. But because x* is an interior Max, Dfx(x*, a) = θ_{1×n}. So, V′(a) = ∂f/∂a.
Now suppose we want to maximize an objective function f(x), which does not depend on a, subject to a constraint g(x, a) ≡ a − G(x) = 0 that does depend on a. Under nice conditions, at the Max,
Df(x*) + λ Dgx(x*, a) = θ    (i)
Also note that if a changes, the value of g(x*(a), a) must continue to be zero, so
Dgx(x*(a), a) Dx*(a) + ∂g/∂a = 0    (ii)
Now, V(a) ≡ f(x*(a)), so V′(a) = Df(x*) Dx*(a). Using (i) to substitute for Df(x*), this equals −λ Dgx(x*, a) Dx*(a), which equals, using (ii), λ ∂g/∂a = λ. So here V′(a) = λ: the value of the multiplier at the optimum is the rate of change of the objective with respect to the parameter a being relaxed.
Suppose now that we have an objective function f(x, a) to maximize subject to g(x, a) = 0. Along similar lines, we can show that V′(a) = ∂f/∂a + λ ∂g/∂a, i.e., the direct effect of a on the Lagrangian function. As an exercise, please derive Roy's Identity using the indirect utility function V(p, I).
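A sympy sketch of that exercise for the Cobb-Douglas case, where the indirect utility is V(p1, p2, I) = I²/4p1p2 (from Example 1 of this chapter); Roy's Identity, x1* = −(∂V/∂p1)/(∂V/∂I), should recover the Marshallian demand x1* = I/2p1:

    import sympy as sp

    p1, p2, I = sp.symbols('p1 p2 I', positive=True)
    V = I**2 / (4 * p1 * p2)
    x1_roy = -sp.diff(V, p1) / sp.diff(V, I)
    print(sp.simplify(x1_roy))   # I/(2*p1), the demand found earlier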


Chapter 5
Optimization with Inequality Constraints

5.1 Introduction
The problem is to find the Maximum or the Minimum of f : R^n → R on the set {x ∈ R^n | gi(x) ≥ 0, i = 1, ..., k}, where gi : R^n → R are the k constraint functions. At the optimum, the constraints are now allowed to be binding (or tight, or effective), i.e., gi(x) = 0, as before, or slack (or non-binding), i.e., gi(x) > 0.
Example: Max U(x1, x2) s.t. x1 ≥ 0, x2 ≥ 0, I − p1x1 − p2x2 ≥ 0. If we do not know whether xi = 0 for some i at the utility maximum, or whether xi > 0, then clearly we cannot use the Theorem of Lagrange. Similarly, if there is a bliss point, then we do not know in advance whether the budget constraint is binding or slack at the budget-constrained optimum. Again, we cannot then use the Theorem of Lagrange, to use which we need to be assured that the constraint is binding.
Note the general nature of a constraint of the form gi(x) ≥ 0. If we have a constraint h(x) ≤ 0, this is equivalent to g(x) ≡ −h(x) ≥ 0. And something like h(x) ≤ c is equivalent to g(x) ≡ c − h(x) ≥ 0.

We use Kuhn-Tucker theory to address optimization problems with inequality constraints. The main result is a first order necessary condition that is somewhat different from that of the Theorem of Lagrange; one main difference is that the conditions gi(x) = 0, i = 1, ..., k in the Theorem of Lagrange are replaced by the conditions λi gi(x) = 0, i = 1, ..., k in Kuhn-Tucker theory.
In order to motivate this difference, let us discuss a simple setting. Consider an objective function f : R² → R. We want to maximize f(x) = f(x1, x2) over all x ∈ R² that satisfy G(x) ≤ a, where G : R² → R. We will alternatively write g(x) ≡ a − G(x) ≥ 0. For this example, let us assume that G(x) is strictly increasing. We can view a as the total resource available, such as the total income available for spending on goods. Draw a picture.
A maximum x* can occur either in the interior (i.e., G(x*) < a or g(x*) > 0), or at the boundary (G(x*) = a or g(x*) = 0). If it happens in the interior, it implies Df(x*) = θ. If it happens on the boundary, it must be that reducing the parameter value a does not increase the maximized value of f: whatever vector x you would choose as maximizer after the reduction of a was available before, at the higher value of a, and was not chosen as the maximizer. Consider then setting up the Lagrangian

L(x, λ) = f(x) + λ g(x)
and consider the first order condition
Df(x*) + λ Dg(x*) = θ.
When would this first-order condition make sense?
(i) First, for this to coincide with Df(x*) = θ when x* is in the interior, we must have that g(x*) > 0 (or G(x*) < a) implies λ = 0.
(ii) Now let V(a) be the value function for this problem: the maximum value of f(x) when the parameter in the constraint equals a. Consider the interpretation that λ = V′(a), the change in the value of the objective as we change a. If the maximizer x* is on the boundary (G(x*) = a) and we were to reduce a, the maximum value of f would get reduced (or not increase); so, λ ≥ 0.
(iii) Finally, suppose that the maximizer x*, along with λ, solves the first-order condition Df(x*) + λ Dg(x*) = θ, and suppose Dg(x*) ≠ θ. Suppose λ > 0. Then from the first-order condition, we conclude Df(x*) ≠ θ. But that means x* is not in the interior of the feasible set; it's on the boundary, so g(x*) = 0. Alternatively, interpreting λ > 0 as the decrease in the maximized value of f(x) if a is decreased a little, it must mean g(x*) = 0 (or G(x*) = a). For if x* were not on the boundary, decreasing a slightly would not affect this maximum value.
So in addition to Df(x*) + λ Dg(x*) = θ, it's implied that λ ≥ 0, g(x*) ≥ 0, and λ g(x*) = 0. Note that this implies that if g(x*) > 0, then λ = 0, and if λ > 0, then g(x*) = 0.

5.2 Kuhn-Tucker Theory
Recall that the problem is to find the Maximum or the Minimum of f : R^n → R on the set {x ∈ R^n | gi(x) ≥ 0, i = 1, ..., k}, where gi : R^n → R are the k constraint functions. The main theorem deals with local maxima and minima, though.
Suppose that l of the k constraints bind at the optimum x*. Denote the corresponding constraint functions as (gi)_{i∈E}, where E is the set of indexes of the binding constraints. Let g_E : R^n → R^l be the function whose l components are the constraint functions of the binding constraints; that is, g_E(x) = (gi(x))_{i∈E}.
Dg_E(x) = [ Dg_{i1}(x) ]
          [    ...     ]
          [ Dg_{il}(x) ]
where i1, ..., il are the indexes of the binding constraints. So Dg_E(x) is an l × n matrix.


We now state FONC for the problem. The Theorem below is a consolidation of the Fritz-John and the Kuhn-Tucker Theorems.
Theorem 17 (The Kuhn-Tucker (KT) Theorem). Let f : R^n → R and gi : R^n → R, i = 1, ..., k, be C¹ functions. Suppose x* is a Maximum of f on the set S = U ∩ {x ∈ R^n | gi(x) ≥ 0, i = 1, ..., k}, for some open set U ⊆ R^n. Then there exist real numbers μ, λ1, ..., λk, not all zero, such that
μ Df(x*) + Σ_{i=1}^k λi Dgi(x*) = θ_{1×n}.
Moreover, if gi(x*) > 0 for some i, then λi = 0.
If, in addition, rank(Dg_E(x*)) = l, then we may take μ to be equal to 1. Furthermore, λi ≥ 0, i = 1, ..., k, and λi > 0 for some i implies gi(x*) = 0.
Suppose the constraint qualification, rank(Dg_E(x*)) = l, is met at the optimum. Then the KT equations are the following (n + k) conditions in the n + k variables x1, ..., xn, λ1, ..., λk:
λi gi(x*) = 0, λi ≥ 0, gi(x*) ≥ 0, i = 1, ..., k, with complementary slackness    (1)
Df(x*) + Σ_{i=1}^k λi Dgi(x*) = θ    (2)
If x* is a local minimum of f on S, then −f attains a local maximum value at x*. Thus for minimization, while Eq.(1) stays the same, Eq.(2) changes to
−Df(x*) + Σ_{i=1}^k λi Dgi(x*) = θ    (2′)
Equations (1) and (2) (or (2′)) are known as the Kuhn-Tucker conditions.
Note finally that the conditions of the Kuhn-Tucker Theorem are not sufficient conditions for local optima; there may be points that satisfy Equations (1) and (2) (or (2′)) without being local optima. For example, you may check that for the problem
Max f(x) = x³ s.t. g(x) = x ≥ 0, the values x = λ = 0 satisfy the KT FONC (1) and (2) for a local maximum but do not yield a maximum.

5.3 Using the Kuhn-Tucker Theorem
We want to maximize f(x) over the set {x ∈ R^n | g(x) ≥ θ_{1×k}}, where g(x) = (g1(x), ..., gk(x)).
Set up L(x, λ) = f(x) + Σ_{i=1}^k λi gi(x).
(If we want to minimize f(x), set up
L(x, λ) = −f(x) + Σ_{i=1}^k λi gi(x).)
To ensure that the KT FONC will hold at the global max, verify that (1) a global max exists and (2) the constraint qualification is met at the maximum.
The second check is not possible to do if we don't know where the maximum is. What we do instead is to check whether the CQ holds everywhere in the domain, and if not, we note the points where it fails. The CQ in the theorem depends on which constraints are binding at the maximum. Again, since we don't know the maximum, we don't know which constraints bind at it.
With k constraint functions, there are 2^k profiles of binding and non-binding constraints possible, each of these profiles implying a different CQ. We either check all of them, or we rule out some profiles using clever arguments.
If both checks are fine, then we find all solutions (x⁰, λ⁰) to the set of equations:
λi(∂L(x, λ)/∂λi) = 0, λi ≥ 0, (∂L(x, λ)/∂λi) ≥ 0, i = 1, ..., k, with CS.
(∂L(x, λ)/∂xj) = 0, j = 1, ..., n.
From the set of all solutions (x⁰, λ⁰), pick the solution (x*, λ*) for which f(x*) is maximum. Note that this method does not require checking for concavity of objective functions and constraints, and does not require checking any second order condition.
The method may fail if a global max does not exist or if the CQ fails at the maximum. The example Max f(x) = x³ s.t. g(x) = x ≥ 0 is one where no global max exists, and we saw earlier that the method fails.
An example in which the CQ fails: Max f(x) = 2x³ − 3x² s.t. g(x) = (3 − x)³ ≥ 0.
Suppose the constraint does not bind at the maximum; then we don't have to check a CQ. But suppose it does. That is, suppose the optimum occurs at x = 3. Dg(x) = −3(3 − x)² = 0 at x = 3. The CQ fails here. You could check that the KT FONC will not isolate the maximum. In fact, in this baby example, it is easy to see that x = 3 is the max, as (3 − x)³ ≥ 0 iff (3 − x) ≥ 0, so we may work with the latter constraint function, with which the CQ does not fail. It is a good exercise to visualize f(x) and see that x = 3 is the maximum, rather than merely cranking out the algebra now.
Alternatively, we may use the more general FONCs stated in the theorem:
μ Df(x) + λ Dg(x) = 0, with μ, λ not both zero.
μ(6x² − 6x) + λ(−3(3 − x)²) = 0, and    (1)
(3 − x)³ ≥ 0, with strict inequality implying λ = 0.    (2)
If (3 − x)³ > 0, then λ = 0, which from Eq.(1) implies either μ = 0, which violates the FONC, or x = 0 or x = 1, where f(0) = 0 and f(1) = −1.
On the other hand, if (3 − x)³ = 0, that is, x = 3, then Eq.(1) implies μ = 0, so it must be that λ > 0. At x = 3, f(x) = 27. So x = 3 is the maximum.
Two Simple Utility Maximization Problems
Example 1. This is a real baby example meant purely for illustration. No one expects you to use the heavy Kuhn-Tucker machinery for such simple problems. In this example, one expects instead that you would use reasoning about the marginal-utility-per-rupee ratios (U1/p1), (U2/p2) to solve the problem.
Max U(x1, x2) = x1 + x2, over the set {x = (x1, x2) ∈ R² | x1 ≥ 0, x2 ≥ 0, I − p1x1 − p2x2 ≥ 0}, where I > 0, p1 > 0 and p2 > 0 are given.
So there are 3 inequality constraints:
g1(x1, x2) = x1 ≥ 0, g2(x1, x2) = x2 ≥ 0, and
g3(x1, x2) = I − p1x1 − p2x2 ≥ 0
At the maximum x*, any combination of these three could bind, so there are 8 possibilities. However, since U is strictly increasing, the budget constraint binds at the maximum (g3(x*) = 0). Moreover, g1(x*) = g2(x*) = 0 is not possible, since consuming 0 of both goods gives utility equal to 0, which is clearly not a maximum.
So we have to check just three possibilities out of the eight:
Case (1) g1(x*) > 0, g2(x*) > 0, g3(x*) = 0
Case (2) g1(x*) = 0, g2(x*) > 0, g3(x*) = 0
Case (3) g1(x*) > 0, g2(x*) = 0, g3(x*) = 0
Before using the KT conditions, we verify (i) that a global max exists (here, because the utility function is continuous and the budget set is compact), and (ii) that the CQ holds at all 3 relevant combinations of binding constraints described above.
Indeed, for Case (1), Dg_E(x) = Dg3(x) = (−p1, −p2), so rank[Dg_E(x)] = 1, so the CQ holds.




For Case (2), Dg_E(x) = [ Dg1(x) ] = [  1    0  ]
                        [ Dg3(x) ]   [ −p1  −p2 ]
so rank[Dg_E(x)] = 2.
For Case (3), Dg_E(x) = [ Dg2(x) ] = [  0    1  ]
                        [ Dg3(x) ]   [ −p1  −p2 ]
so rank[Dg_E(x)] = 2.
Thus for the maximum x*, there exists a λ* such that (x*, λ*) will be a solution to the KT FONCs. Of course, there could be other (x, λ)'s that are solutions as well, but a simple comparison of U(x) for all candidate solutions will isolate the Maximum for us.
L(x, λ) = x1 + x2 + λ1x1 + λ2x2 + λ3(I − p1x1 − p2x2)
The KT conditions are:
λ1(∂L/∂λ1) = λ1x1 = 0, λ1 ≥ 0, x1 ≥ 0, with CS    (1)
λ2(∂L/∂λ2) = λ2x2 = 0, λ2 ≥ 0, x2 ≥ 0, with CS    (2)
λ3(∂L/∂λ3) = λ3(I − p1x1 − p2x2) = 0, λ3 ≥ 0, I − p1x1 − p2x2 ≥ 0, with CS    (3)
(∂L/∂x1) = 1 + λ1 − λ3p1 = 0    (4)
(∂L/∂x2) = 1 + λ2 − λ3p2 = 0    (5)
Since we don't know which of the three cases selects the constraints that bind at the maximum, we must try all three.
Case (1). Since x1 > 0, x2 > 0, (1) and (2) imply λ1 = λ2 = 0. Plugging these into Eqs.(4) and (5), we have 1 = λ3p1 = λ3p2. This implies λ3 > 0. (Also note that this is consistent with the fact that since utility is strictly increasing, relaxing the budget constraint will increase utility; so the marginal utility of income, λ3, is positive.) Thus λ3p1 = λ3p2 implies p1 = p2.
So if at a local max both x1 and x2 are strictly positive, then it must be that their prices are equal. All (x1, x2) that solve Eq.(3) are solutions. The utility in any such case equals
x1 + (I − p1x1)/p2 = I/p, where p = p1 = p2. Note that in this case, (U1/p1) = (U2/p2) = 1/p.
Case (2). x1 = 0 implies, from Eq.(3), that x2 = I/p2. Since this is greater than 0, Eq.(2) implies λ2 = 0. Hence from Eq.(5), λ3p2 = 1.
Since λ1 ≥ 0, Eqs.(4) and (5) imply λ3p1 = 1 + λ1 ≥ 1 = λ3p2. Moreover, since λ3 > 0, this implies p1 ≥ p2.
That is, if at the maximum x1 = 0, x2 > 0, then it must be that p1 ≥ p2. Note that in this case, (U2/p2) = (1/p2) ≥ (U1/p1) = (1/p1).
For completeness' sake, Eq.(5) implies λ3 = 1/p2. So from Eq.(4), λ1 = (p1/p2) − 1. So the unique critical point of L(x, λ) is
(x*, λ*) = (x1, x2, λ1, λ2, λ3) = (0, I/p2, (p1/p2) − 1, 0, 1/p2).
Case (3). This case is symmetric, and we get that x2 = 0, x1 > 0 occurs only if p1 ≤ p2. We have
(x*, λ*) = (I/p1, 0, 0, (p2/p1) − 1, 1/p1).
We see that which of the cases applies depends upon the price ratio p1/p2. If p1 = p2, then all three cases are relevant, and all (x1, x2) ∈ R^2_+ such that the budget constraint binds are utility maxima. But if p1 > p2, then only Case (2) applies, because if Case (1) had applied we would have had p1 = p2, and if Case (3) had applied that would have implied p1 ≤ p2. The solution to the KT conditions in that case is the utility maximum. Similarly, if p1 < p2, only Case (3) applies.
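A quick numerical sanity check, sketched in Python (ours, not part of the original notes): for perfect-substitutes utility the KT cases reduce to buying only the cheaper good, so we can compare the case-by-case demands against a brute-force search over the budget line. The function names are our own.

    import numpy as np

    def demand_perfect_substitutes(p1, p2, I):
        """Utility U = x1 + x2: spend all income on the cheaper good.
        With p1 == p2 any point on the budget line is optimal (Case 1)."""
        if p1 < p2:
            return (I / p1, 0.0)                 # Case (3)
        elif p1 > p2:
            return (0.0, I / p2)                 # Case (2)
        return (I / (2 * p1), I / (2 * p1))      # one of the many Case (1) optima

    def brute_force(p1, p2, I, n=10001):
        # search over the budget line x2 = (I - p1*x1)/p2
        x1 = np.linspace(0, I / p1, n)
        x2 = (I - p1 * x1) / p2
        i = np.argmax(x1 + x2)
        return (x1[i], x2[i])

    print(demand_perfect_substitutes(1.0, 2.0, 10.0))   # (10.0, 0.0)
    print(brute_force(1.0, 2.0, 10.0))                  # approx (10.0, 0.0)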
Example 2. Max U(x1, x2) = x1/(1 + x1) + x2/(1 + x2), s.t. x1 ≥ 0, x2 ≥ 0, p1x1 + p2x2 ≤ I.
Check that the indifference curves are downward sloping, convex, and that they cut the axes (show all this). This last property is due to the additive form of the
utility function, and may result in 0 consumption of one of the goods at the utility maximum.
Exactly as in Example 1, we are assured that a global max exists, that the CQ is met at the optimum, and that there are only 3 relevant cases of binding constraints to check.
The Kuhn-Tucker conditions are:
λ1(∂L/∂λ1) = λ1x1 = 0, λ1 ≥ 0, x1 ≥ 0, with CS   (1)
λ2(∂L/∂λ2) = λ2x2 = 0, λ2 ≥ 0, x2 ≥ 0, with CS   (2)
λ3(∂L/∂λ3) = λ3(I − p1x1 − p2x2) = 0, λ3 ≥ 0, I − p1x1 − p2x2 ≥ 0, with CS   (3)
(∂L/∂x1) = 1/(1 + x1)² + λ1 − λ3p1 = 0   (4)
(∂L/∂x2) = 1/(1 + x2)² + λ2 − λ3p2 = 0   (5)
Case (1). x1 > 0, x2 > 0 implies λ1 = λ2 = 0. Eq. (4) implies λ3 > 0, so that Eqs. (4) and (5) give (1 + x2)/(1 + x1) = (p1/p2)^(1/2).
Using Eq. (3), which gives x2 = (I − p1x1)/p2, in the expression above, we get
(p2 + I − p1x1)/(p2(1 + x1)) = (p1/p2)^(1/2), so simple computations yield
x1* = (I + p2 − (p1p2)^(1/2))/(p1 + (p1p2)^(1/2)),
x2* = (I + p1 − (p1p2)^(1/2))/(p2 + (p1p2)^(1/2)),
λ3 = 1/(p1(1 + x1*)²).
x1* > 0 requires I > (p1p2)^(1/2) − p2, and x2* > 0 requires I > (p1p2)^(1/2) − p1. If either of these fails, then we are not in the regime of Case (1).
Case (2). x1 = 0 with Eq. (3) implies x2 = I/p2. Since this is positive, λ2 = 0, so Eq. (5) implies λ3 = 1/((1 + I/p2)²p2) = p2/(p2 + I)².
λ1 = λ3p1 − 1 (from x1 = 0 and Eq. (4)), so λ1 = p1p2/(p2 + I)² − 1. For this to be ≥ 0, it is required that p1p2/(p2 + I)² ≥ 1, that is, I ≤ (p1p2)^(1/2) − p2.
Utility equals x2/(1 + x2) = I/(p2 + I).
(x1, x2, λ1, λ2, λ3) = (0, I/p2, p1p2/(p2 + I)² − 1, 0, p2/(p2 + I)²).
Case (3). By symmetry, the solution is
(x1, x2, λ1, λ2, λ3) = (I/p1, 0, 0, p1p2/(p1 + I)² − 1, p1/(p1 + I)²)
and for this case to hold it is necessary that p1p2/(p1 + I)² ≥ 1, or I ≤ (p1p2)^(1/2) − p1.
To summarize: suppose p1 = p2 = p; then (p1p2)^(1/2) − p1 = (p1p2)^(1/2) − p2 = 0. So since I > 0, we are in the regime of Case (1), and x1* = x2* = I/2p at the maximum.
Suppose on the other hand that p1 < p2 (the contrary case can be worked out similarly); then p2 > (p1p2)^(1/2) > p1, so that (p1p2)^(1/2) − p1 > 0 > (p1p2)^(1/2) − p2. Thus either I > (p1p2)^(1/2) − p1, in which case we use Case (1), or I ≤ (p1p2)^(1/2) − p1, in which case we use Case (3). Case (2), in which a positive amount of good 2 and zero of good 1 is consumed at the maximum, does not apply.
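A short Python sketch (ours, not from the notes) that implements these case conditions and checks the interior solution against the budget constraint and the FOC ratio (1 + x2)/(1 + x1) = (p1/p2)^(1/2):

    from math import sqrt

    def demand(p1, p2, I):
        """Demands for U = x1/(1+x1) + x2/(1+x2), from the KT cases above."""
        s = sqrt(p1 * p2)
        if I <= s - p1:            # Case (3): consume only good 1
            return (I / p1, 0.0)
        if I <= s - p2:            # Case (2): consume only good 2
            return (0.0, I / p2)
        # Case (1): interior solution
        x1 = (I + p2 - s) / (p1 + s)
        x2 = (I + p1 - s) / (p2 + s)
        return (x1, x2)

    x1, x2 = demand(1.0, 4.0, 10.0)
    print(x1, x2)                            # 4.0 1.5
    print(1.0 * x1 + 4.0 * x2)               # budget exhausted: 10.0
    print((1 + x2) / (1 + x1), sqrt(1 / 4))  # FOC ratio: 0.5 0.5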
5.4 Miscellaneous
(1) For problems where some constraints are of the form gi(x) = 0 and others of the form gj(x) ≥ 0, only the latter give rise to Kuhn-Tucker-like complementary slackness conditions (λj ≥ 0, gj(x) ≥ 0, λjgj(x) = 0).
(2) If the objective to be maximized, f, and the constraints gi, i = 1, ..., k (where constraints are of the form gi(x) ≥ 0) are all concave functions, and if Slater's constraint qualification holds (i.e., there exists some x ∈ R^n s.t. gi(x) > 0, i = 1, ..., k), then the Kuhn-Tucker conditions become both necessary and sufficient for a global max.
(3) Suppose f and all the gi's are quasiconcave. Then the Kuhn-Tucker conditions are almost sufficient for a global max: an x* and λ* that satisfy the Kuhn-Tucker conditions indicate that x* is a global max provided that, in addition to the above, either Df(x*) ≠ 0 or f is concave.
Appendix
Completeness Property of Real Numbers
Course 003: Basic Econometrics
Rohini Somanathan - Part 1
Sunil Kanwar - Part II
Delhi School of Economics, 2014-2015
Outline of Part 1
Main text: Morris H. DeGroot and Mark J. Schervish, Probability and Statistics, fourth edition.
1. Probability Theory: Chapters 1-6
   Probability basics: the definition of probability, combinatorial methods, independent events, conditional probability.
   Random variables: distribution functions, marginal and conditional distributions, distributions of functions of random variables, moments of a random variable, properties of expectations.
   Some special distributions, laws of large numbers, central limit theorems.
2. Statistical Inference: Chapters 7-10
   Estimation: definition of an estimator, maximum likelihood estimation, sufficient statistics, sampling distributions of estimators.
   Hypotheses Testing: simple and composite hypotheses, tests for differences in means, test size and power, uniformly most powerful tests.
   Nonparametric Methods
Administrative Information
Internal Assessment: 25% for Part 1
1. Midterm: 20%
2. Lab assignments, tutorial attendance and class participation: 5%
Problem Sets: Do as many problems from the book as you can. All odd-numbered exercises have solutions, so focus on these.
Tutorials: Check the notice board in front of the lecture theatre for lists.
Punctuality is critical: coming in late disturbs the rest of the class and me.
Why is this course useful?

We (as economists, citizens, consumers, exam-takers) are often faced with situations in which we have to make decisions in the face of uncertainty. This may be caused by:
randomness in the world (a farmer making planting decisions does not know how much it will rain during the season; we do not know how many days we'll be sick next year, or what the chances are of an economic crisis or recovery)
incomplete information about the realized state of the world (Is a politician's promise sincere? Is a firm telling us the truth about a product? Has our opponent been dealt a better hand of cards? Is a prisoner guilty or innocent?)
By putting structure on this uncertainty, we can arrive at
decision rules: firms choose techniques, doctors choose drug regimes, electors choose politicians; these rules have to tell us how best to incorporate new information.
estimates: of empirical relationships (wages and education, drugs and health, ...)
tests: how likely is it that population parameters take particular values, based on the estimates we've obtained?
Probability theory puts structure on uncertain events and allows us to derive systematic decision rules. The field of statistics shows us how we can collect and use data to estimate empirical models and test hypotheses about the population based on our estimates.
A motivating example: gender ratios

We are interested in whether the gender ratio in a population reflects discrimination, either before or after birth.
Suppose it is equally likely for a child of either sex to be conceived.
We visit a small village with 10 children under the age of 1. If each birth is independent, we would get considerable variation in the sex ratio in the absence of discrimination.
[Figure: Binomial(10, .5) probabilities of k girls among the 10 children: P(0) = .001, P(1) = .010, P(2) = .044, P(3) = .117, P(4) = .205, P(5) = .246, ... (display binomial(10, k, .5))]
When should we conclude that there is gender bias? Can we get an estimate of this bias?
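A minimal Python check of these numbers (our sketch, not part of the slides), using the binomial p.m.f. P(k) = C(10, k)(1/2)^10:

    from math import comb

    n, p = 10, 0.5
    pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}
    for k in range(6):
        print(k, round(pmf[k], 3))            # .001 .010 .044 .117 .205 .246
    # e.g. the chance of seeing 2 or fewer girls with no discrimination:
    print(sum(pmf[k] for k in range(3)))      # about 0.055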
Origins of probability theory

A probability is a number attached to some event which expresses the likelihood of the event occurring.
A theory of probability was first exposited by European mathematicians in the 16th C studying gambling problems.
How are probabilities assigned to events?
By thinking about all possible outcomes. If there are n of these, all equally likely, we can attach the number 1/n to each of them. If an event contains k of these outcomes, we attach the probability k/n to the event. This is the classical interpretation of probability.
Alternatively, imagine the event as a possible outcome of an experiment. Its probability is the fraction of times it occurs when the experiment is repeated a large number of times. This is the frequency interpretation of probability.
In many cases events cannot be thought of in terms of repeated experiments or equally likely outcomes. We could base likelihoods in this case on what we believe about the world: subjective probabilities. The subjective probability of an event A is a real number in the interval [0, 1] which reflects a subjective belief in the validity or occurrence of event A. Different people might attach different probabilities to the same events. Examples?
We formalize this subjective interpretation by imposing certain consistency conditions on combinations of events.
Definitions
An experiment is any process whose outcome is not known in advance with certainty. These
outcomes may be random or non-random, but we should be able to specify all of them and
attach probabilities to them.
Experiment                    Event
10 coin tosses                4 heads
select 10 LS MPs              one is female
go to your bus-stop at 8      bus arrives within 5 min.
A sample space is the collection of all possible outcomes of an experiment.


An event is a subset of possible outcomes in the space S.
The complement of an event A is the event that contains all outcomes in the sample space that do not belong to A. We denote this event by Ac.
The subsets A1, A2, A3, ... of sample space S are called mutually disjoint sets if no two of these sets have an element in common. The corresponding events A1, A2, A3, ... are said to be mutually exclusive events.
If A1, A2, A3, ... are mutually exclusive events such that S = A1 ∪ A2 ∪ A3 ∪ ..., these are called exhaustive events.
Example: 3 tosses of a coin

The experiment has 2³ possible outcomes and we can define the sample space S = {s1, ..., s8} where
s1 = HHH, s2 = HHT, s3 = HTH, s4 = HTT, s5 = THH, s6 = THT, s7 = TTH, s8 = TTT
Any subset of this sample space is an event.
If we have a fair coin, each of the listed outcomes is equally likely and we attach probability 1/8 to each of them.
Let us define the event A as 'at least one head'. Then A = {s1, ..., s7}, Ac = {s8}. A and Ac are exhaustive events.
The events 'exactly one head' and 'exactly two heads' are mutually exclusive events.
Notice that there are lots of different ways in which we can define a sample space; the most useful way to do so depends on the event we are interested in (# heads; or, picking from a deck of cards, we may be interested in the suit, the number, or both).
The definition of probability

Definition: Let S be a collection of all events in S. A probability distribution is a function P : S → [0, 1] which satisfies the following axioms:
1. The probability of every event must be non-negative:
P(A) ≥ 0 for all events A ∈ S
2. If an event is certain to occur, its probability is 1:
P(S) = 1
3. For any sequence of disjoint events A1, A2, ...:
P(∪_{i=1}^∞ Ai) = Σ_{i=1}^∞ P(Ai)

Note:
We will typically use P(A) or Pr(A) instead of P(A).
For finite sample spaces, S is straightforward to define. For any S which is a subset of the real line (and therefore infinite), let S be the set of all intervals in S.
Probability measures... some useful results

We can use our three axioms to derive some useful results:
Result 1: For each A ∈ S, P(A) = 1 − P(Ac)
Proof: A ∪ Ac = S. By our second axiom, P(S) = 1, and by axiom 3, P(A ∪ Ac) = P(A) + P(Ac).
Result 2: P(∅) = 0
Proof: Let A = ∅, so Ac = S. Since P(S) = 1, P(∅) = 0 using the first result above.
Result 3: If A1 and A2 are subsets of S such that A1 ⊆ A2, then P(A1) ≤ P(A2)
Proof: Let's write A2 as A2 = A1 ∪ (A1c ∩ A2). Since these are disjoint, we can use axiom 3 to get P(A2) = P(A1) + P(A1c ∩ A2). The second term on the RHS is non-negative (by axiom 1), so P(A2) ≥ P(A1).
Result 4: For each A ∈ S, 0 ≤ P(A) ≤ 1
Proof: Since ∅ ⊆ A ⊆ S, we can directly apply the previous result to obtain P(∅) ≤ P(A) ≤ P(S), or 0 ≤ P(A) ≤ 1.
Some useful results..

Result 5: If A1 and A2 are subsets of S then P(A1 ∪ A2) = P(A1) + P(A2) − P(A1 ∩ A2)
Proof: As before, the trick is to write A1 ∪ A2 as a union of disjoint sets and then add the probabilities associated with them. Drawing a Venn diagram helps to do this.
A1 ∪ A2 = (A1 ∩ A2c) ∪ (A1 ∩ A2) ∪ (A2 ∩ A1c)   (1)
But A1 = (A1 ∩ A2c) ∪ (A1 ∩ A2) and A2 = (A2 ∩ A1c) ∪ (A1 ∩ A2), so
P(A1) + P(A2) = P(A1 ∩ A2c) + P(A1 ∩ A2) + P(A2 ∩ A1c) + P(A1 ∩ A2)
Subtracting P(A1 ∩ A2) gives us the probability of the union in (1).
Examples using the probability axioms

1. Consider two events A and B such that Pr(A) = 1/3 and Pr(B) = 1/2. Determine the value of P(B ∩ Ac) for each of the following conditions: (a) A and B are disjoint; (b) A ⊆ B; (c) Pr(A ∩ B) = 1/8.
2. Consider two events A and B, where P(A) = .4 and P(B) = .7. Determine the minimum and maximum values of Pr(A ∩ B) and the conditions under which they are obtained.
3. A point (x, y) is to be selected from the square S containing all points (x, y) such that 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. Suppose that the probability that the point will belong to any specified subset of S is equal to the area of that subset. Find the following probabilities:
(a) (x − 1/2)² + (y − 1/2)² ≥ 1/4
(b) 1/2 < x + y < 3/2
(c) y < 1 − x²
(d) x = y
Answers: (1) 1/2, 1/6, 3/8; (2) .1, .4; (3) 1 − π/4, 3/4, 2/3, 0.
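For the geometric probabilities in question 3, a Monte Carlo check is easy to write; this Python sketch (ours) estimates each area by sampling uniform points in the unit square:

    import random

    N = 1_000_000
    hits = {"a": 0, "b": 0, "c": 0}
    for _ in range(N):
        x, y = random.random(), random.random()
        if (x - 0.5)**2 + (y - 0.5)**2 >= 0.25:
            hits["a"] += 1
        if 0.5 < x + y < 1.5:
            hits["b"] += 1
        if y < 1 - x**2:
            hits["c"] += 1
    for k, v in hits.items():
        print(k, v / N)       # about 0.215 (= 1 - pi/4), 0.75, 0.667
    # (d) x = y is a line with zero area, so its probability is 0.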
Finite sample spaces

If a sample space S contains a finite number of points s1, ..., sn, we can specify a probability distribution on S by assigning a probability pi to each point si ∈ S. These probabilities must satisfy two conditions:
1. pi ≥ 0 for i = 1, 2, ..., n and
2. Σ_{i=1}^n pi = 1
The probability of any event A can now be found as the sum of pi for all outcomes si that belong to A.
A sample space containing n outcomes is called a simple sample space if the probability assigned to each of the outcomes s1, ..., sn is 1/n. Probability measures are easy to define in such spaces: if the event A contains exactly m outcomes, then P(A) = m/n.
Notice that for the same experiment, we can define the sample space in multiple ways depending on the events of interest. For example, suppose we're interested in obtaining a given number of heads in the tossing of 3 coins; our sample space can either comprise all the 8 possible outcomes (a simple space) or just four outcomes (0, 1, 2 and 3 heads).
We can arrive at the total number of elements in a sample space by listing all possible outcomes. A simple sample space for a coin-tossing experiment with 3 fair coins would have eight possible outcomes, a roll of two dice would have 36, etc. We then just calculate the number of elements contained in our event A and divide this by the total number of outcomes to get our probability (P(2 heads) = 3/8 and P(sum of 7) = 1/6).
Listing outcomes can take a long time, and we can use a number of counting methods to make things easier and avoid mistakes.
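Listing outcomes is exactly what a computer is good at; this Python sketch (ours) enumerates the two simple sample spaces just mentioned:

    from itertools import product
    from fractions import Fraction

    coins = list(product("HT", repeat=3))          # 8 equally likely outcomes
    p_two_heads = Fraction(sum(s.count("H") == 2 for s in coins), len(coins))
    print(p_two_heads)                             # 3/8

    dice = list(product(range(1, 7), repeat=2))    # 36 equally likely outcomes
    p_sum_seven = Fraction(sum(a + b == 7 for a, b in dice), len(dice))
    print(p_sum_seven)                             # 1/6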
Counting methods..the multiplication rule

Sometimes it is useful to think of an experiment as being performed in multiple stages (tossing coins, picking cards, questions on an exam, going from one city to another via a third).
If the first stage has m possible outcomes and the second n outcomes, then we can define a simple sample space with exactly mn outcomes. Each element in this space will be a pair (xi, yj).
Example: the experiment of tossing 5 fair coins will have 32 elements in the simple sample space; the probability of five heads is 1/32 and of one head is 5/32.
Permutations
Suppose we are sampling k objects from a total of n distinct objects without replacement. We are interested in the total number of different arrangements of these objects we can obtain.
We first pick one object; this can happen in n different ways. Since we are now left with n − 1 objects, the second one can be picked in (n − 1) different ways, and so on.
The total number of permutations of n objects taken k at a time is given by
Pn,k = n(n − 1) ... (n − k + 1)
and Pn,n = n!
Pn,k can alternatively be written as:
Pn,k = n(n − 1) ... (n − k + 1) = n(n − 1) ... (n − k + 1) (n − k)!/(n − k)! = n!/(n − k)!
In the case with replacement, we can apply the multiplication rule derived above. In this case there are n outcomes possible for each of the k selections, so the number of elements in S is n^k.
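In Python (our sketch), these counts are one-liners; math.perm computes n!/(n − k)! directly:

    from math import perm, factorial

    n, k = 10, 3
    print(perm(n, k))                           # 720 arrangements without replacement
    print(factorial(n) // factorial(n - k))     # same thing, from the formula
    print(n ** k)                               # 1000 ordered draws with replacement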
The birthday problem

You go to watch an India-Australia cricket match with a friend.
He would like to bet Rs. 100 that among the group of 23 players on the field (2 teams plus a referee) at least two people share a birthday.
Should you take the bet?
What is the probability that out of a group of k, at least two share a birthday?
The total number of possible birthday assignments is 365^k.
The number of different ways in which all of them have different birthdays is 365!/(365 − k)! (because the second person has only 364 days to choose from, etc.). The required probability is therefore
p = 1 − 365!/((365 − k)! 365^k)
It turns out that for k = 23 this number is .507, so you should take the bet (if you are not risk-averse).
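A direct computation in Python (our sketch), building the product (365/365)(364/365)... term by term to avoid huge factorials:

    def p_shared_birthday(k):
        """Probability that at least two of k people share a birthday."""
        p_all_distinct = 1.0
        for i in range(k):
            p_all_distinct *= (365 - i) / 365
        return 1 - p_all_distinct

    print(round(p_shared_birthday(23), 3))   # 0.507
    print(round(p_shared_birthday(50), 3))   # 0.97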
Combinatorial methods..the binomial coefficient

How many different subsets of k elements can be chosen from a set of n distinct elements? We are not interested in the order in which the k elements are arranged.
Each such subset is called a combination, denoted by Cn,k.
We derived above the number of permutations of n elements, taken k at a time. We can think of these permutations as being derived by the following process:
First pick a set or combination of k elements.
Since there are k! permutations of these elements, this particular combination will give rise to k! permutations.
This is true of each such combination; therefore the number of permutations is given by Pn,k = k!Cn,k, or
Cn,k = Pn,k/k! = n!/(k!(n − k)!)
This is also denoted by (n choose k) and called the binomial coefficient.
The multinomial coefficient

Suppose we have elements of multiple types (jobs, modes of transport, methods of water filtration, ...) and want to find the number of ways that n distinct elements (individuals, trips, ...) can be divided into k groups such that, for j = 1, 2, ..., k, the jth group contains exactly nj elements.
The n1 elements for the first group can be chosen in (n choose n1) ways; the second group is chosen out of the remaining (n − n1) elements, and this can be done in (n − n1 choose n2) ways... The total number of ways of dividing the n elements into k groups is therefore
(n choose n1)(n − n1 choose n2)(n − n1 − n2 choose n3) ... (n − n1 − ... − n_{k−1} choose nk)
This can be simplified to n!/(n1!n2!...nk!)
This expression is known as the multinomial coefficient.
Examples:
A student organization of 1000 people is picking 4 office-bearers and 8 members for its managing council. The total number of ways of picking these groups is given by 1000!/(4!8!988!).
105 students have to be organized into 4 tutorial groups, 3 with 25 students each and one with the remaining 30 students. How many ways can students be assigned to groups?
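A Python sketch (ours) for both examples; note that if the three groups of 25 are interchangeable rather than labeled, the second count should additionally be divided by 3!:

    from math import factorial

    def multinomial(n, sizes):
        """n!/(n1! n2! ... nk!), where the sizes must sum to n."""
        assert sum(sizes) == n
        out = factorial(n)
        for s in sizes:
            out //= factorial(s)
        return out

    print(multinomial(1000, [4, 8, 988]))        # office-bearers and council
    print(multinomial(105, [25, 25, 25, 30]))    # labeled tutorial groups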
Unions of finite numbers of events

We can extend our formula on the probability of a union of events to the case where the number of events is greater than 2 but finite:
For any three events A1, A2, and A3,
P(A1 ∪ A2 ∪ A3) = P(A1) + P(A2) + P(A3) − [P(A1 ∩ A2) + P(A1 ∩ A3) + P(A2 ∩ A3)] + P(A1 ∩ A2 ∩ A3)
The easiest way to see this is to draw a Venn diagram and express the desired set in terms of 7 disjoint sets, p1, ..., p7.
For a finite number of events, we have:
P(∪_{i=1}^n Ai) = Σ_{i=1}^n P(Ai) − Σ_{i<j} P(Ai ∩ Aj) + Σ_{i<j<k} P(Ai ∩ Aj ∩ Ak) − ... + (−1)^{n+1} P(A1 ∩ A2 ∩ ... ∩ An)
Independent Events
Definition: Let A and B be two events in a sample space S. Then A and B are independent iff P(A ∩ B) = P(A)P(B). If A and B are not independent, A and B are said to be dependent.
Events may be independent because they are physically unrelated: tossing a coin and rolling a die, two different people falling sick with some non-infectious disease, etc.
This need not be the case however; it may just be that one event provides no relevant information on the likelihood of occurrence of the other.
Example:
The event A is getting an even number on a roll of a die.
The event B is getting one of the first four numbers.
The intersection of these two events is the event of rolling the number 2 or 4, which we know has probability 1/3.
Are A and B independent? Yes, because P(A)P(B) = (1/2)(2/3) = 1/3.
This is because the occurrence of A does not affect the likelihood that B will occur, or vice-versa. Why?
If A and B are independent, then A and Bc are also independent, as are Ac and Bc. (We require P(A ∩ Bc) = P(A)P(Bc). But A = (A ∩ B) ∪ (A ∩ Bc), so with A and B independent, P(A ∩ Bc) = P(A) − P(A)P(B) = P(A)[1 − P(B)] = P(A)P(Bc). Starting now with the independent pair Bc and A, we can use the same argument to show Bc and Ac are independent.)
Independent Events..examples and special cases

1. A company has 100 employees, 40 men and 60 women. There are 6 male executives. How many female executives should there be for gender and rank to be independent?
Solution: If gender and rank are independent, then P(M ∩ E) = P(M)P(E). We can solve for P(E) as P(E) = P(M ∩ E)/P(M) = .06/.4 = .15. So there must be 9 female executives.
2. The experiment involves flipping two coins. A is the event that the coins match and B is the event that the first coin is heads. Are these events independent?
Solution: In this case P(B) = P(A) = 1/2 ({H,H} or {T,T}) and P(A ∩ B) = 1/4, so yes, the events are independent.
3. Suppose A and B are disjoint sets in S. Does it tell us anything about the independence of events A and B?
4. Remember that disjointness is a property of sets, whereas independence is a property of the associated probability measure; the dependence of events will depend on the probability measure that is being used.
Independence with 3 or more events

Definition: Let A1, A2, A3, ..., An be events in the sample space S. Then A1, A2, A3, ..., An are mutually independent iff P(A1 ∩ A2 ∩ ... ∩ Ak) = P(A1)P(A2) ... P(Ak) for every collection of k of these events, where 2 ≤ k ≤ n. These events are pairwise independent iff P(Ai ∩ Aj) = P(Ai)P(Aj) for all i ≠ j.
Clearly mutual independence implies pairwise independence, but not vice-versa.
Examples:
One ticket is chosen at random from a box containing 4 lottery tickets with numbers 112, 121, 211, 222.
The event Ai is that a 1 occurs in the ith place of the chosen number.
P(Ai) = 1/2 for i = 1, 2, 3, and P(A1 ∩ A2) = P({112}) = 1/4; similarly for A1 ∩ A3 and A2 ∩ A3. These 3 events are therefore pairwise independent.
Are they mutually independent? Clearly not: P(A1 ∩ A2 ∩ A3) = 0 ≠ P(A1)P(A2)P(A3).
Toss two dice, white and black. The sample space consists of all ordered pairs (i, j), i, j = 1, 2, ..., 6. Define the following events:
A1: first die ∈ {1, 2, 3}, P(A1) = 1/2
A2: first die ∈ {3, 4, 5}, P(A2) = 1/2
A3: the sum of the faces equals 9, P(A3) = 1/9
In this case, P(A1 ∩ A2 ∩ A3) = P({(3, 6)}) = 1/36 = (1/2)(1/2)(1/9) = P(A1)P(A2)P(A3), but P(A1 ∩ A3) = P({(3, 6)}) = 1/36 ≠ P(A1)P(A3) = 1/18, so the events are not mutually independent, nor pairwise independent.
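Both examples can be checked by brute-force enumeration; a Python sketch (ours):

    from fractions import Fraction
    from itertools import product

    def prob(outcomes, event):
        return Fraction(sum(event(o) for o in outcomes), len(outcomes))

    # Lottery tickets: pairwise but not mutually independent
    tickets = ["112", "121", "211", "222"]
    A = [lambda t, i=i: t[i] == "1" for i in range(3)]
    print(prob(tickets, lambda t: A[0](t) and A[1](t)))       # 1/4 = (1/2)(1/2)
    print(prob(tickets, lambda t: all(a(t) for a in A)))      # 0, not 1/8

    # Two dice: the triple product works but a pair fails
    dice = list(product(range(1, 7), repeat=2))
    A1 = lambda d: d[0] in (1, 2, 3)
    A3 = lambda d: sum(d) == 9
    print(prob(dice, lambda d: A1(d) and A3(d)))              # 1/36
    print(prob(dice, A1) * prob(dice, A3))                    # 1/18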
Conditional probability
When we conduct an experiment, we are absolutely sure that the event S will occur.
Suppose now we have some additional information about the outcome, say that it is an element of B ⊆ S.
What effect does this have on the probabilities of events in S? How exactly can we use such additional information to compute conditional probabilities?
Example: The experiment involves tossing two fair coins in succession. What is the probability of two tails? Suppose you know the first one is a head? What if it is a tail?
We denote the conditional probability of event A, given B, by P(A|B).
B is now the conditional sample space, and since B is certain to occur, P(B|B) = 1.
Event A will now occur iff A ∩ B occurs.
Definition: Let A and B be two events in a sample space S. If P(B) ≠ 0, then the conditional probability of event A given event B is given by
P(A|B) = P(A ∩ B)/P(B)
Notice that P(.|B) is now a probability set function (probability measure) defined for subsets of B.
For independent events A and B, the conditional and unconditional probabilities are equal:
P(A|B) = P(A)P(B)/P(B) = P(A)
Conditional probability...the multiplication rule

The above definition of conditional probability can be manipulated to arrive at a set of rules that are useful for computing conditional probabilities in particular types of problems.
We defined the conditional probability of event A given event B as P(A|B) = P(A ∩ B)/P(B).
Multiplying both sides by P(B), we have the multiplication rule for probabilities:
P(A ∩ B) = P(A|B)P(B)
This is especially useful in cases where an experiment can be interpreted as being conducted in two stages. In such cases, P(A|B) and P(B) can often be very easily assigned.
Examples:
Two cards are drawn successively, without replacement, from an ordinary deck of playing cards. What is the probability of drawing two aces?
Here the event B is that the first card drawn is an ace and the event A is that the second card is an ace. P(B) is clearly 4/52 = 1/13 and P(A|B) = 3/51 = 1/17. The required probability P(A ∩ B) is therefore (1/13)(1/17) = 1/221.
There are two types of candidates, competent and incompetent (C and I). The share of I-type candidates seeking admission is 0.3. All candidates are interviewed by a committee and the committee rejects incompetent candidates with probability 0.9. What is the probability that an incompetent candidate is admitted?
Here we're interested in P(A ∩ I) where P(I) = .3 and P(A|I) = .1, so the required probability is .03.
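Exact arithmetic with Python's Fraction type (our sketch) confirms the two-aces calculation, and a quick simulation agrees:

    from fractions import Fraction
    import random

    print(Fraction(4, 52) * Fraction(3, 51))      # 1/221

    deck = ["A"] * 4 + ["x"] * 48
    N, hits = 200_000, 0
    for _ in range(N):
        draw = random.sample(deck, 2)             # two cards without replacement
        hits += draw[0] == "A" and draw[1] == "A"
    print(hits / N, 1 / 221)                      # both about 0.0045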
The law of total probability

Let S denote the sample space of an experiment and consider k events A1, A2, ..., Ak in S such that A1, A2, ..., Ak are disjoint and ∪_{i=1}^k Ai = S. These events are said to form a partition of the sample space S.
If B is any other event, then the events A1 ∩ B, A2 ∩ B, ..., Ak ∩ B form a partition of B:
B = (A1 ∩ B) ∪ (A2 ∩ B) ∪ ... ∪ (Ak ∩ B) and, since these are disjoint events, P(B) = Σ_{i=1}^k P(Ai ∩ B).
If P(Ai) > 0 for all i, then using the multiplication rule derived above, this can be written as:
P(B) = Σ_{i=1}^k P(Ai)P(B|Ai)
This is known as the law of total probability.


Example: You're playing a game in which your score is equally likely to take any integer value between 1 and 50. If your score the first time you play is equal to X, and you play until you score Y ≥ X, what is the probability that Y = 50?
Solution: For each value x, P(X = x) = 1/50. We can compute the conditional probability of Y = 50 for each of these values: given X = x, the final score Y is equally likely to be any of the 51 − x values x, x + 1, ..., 50, so P(Y = 50|X = x) = 1/(51 − x). The event Ax is that X = x and the event B is getting a 50 to end the game. The probability of getting x in the first round and 50 to end the game is given by the product P(B|Ax)P(Ax). The required probability is the sum of these products over all possible values of x:
P(Y = 50) = Σ_{x=1}^{50} (1/(51 − x))(1/50) = (1/50)(1 + 1/2 + 1/3 + ... + 1/50) ≈ .09
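The sum is quick to verify in Python (our sketch):

    from fractions import Fraction

    p = sum(Fraction(1, 51 - x) * Fraction(1, 50) for x in range(1, 51))
    print(float(p))    # about 0.09 (= H_50 / 50, with H_50 the 50th harmonic number)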
Bayes' Theorem
Bayes' Theorem (or Bayes' Rule): Let the events A1, A2, ..., Ak form a partition of S such that P(Aj) > 0 for all j = 1, 2, ..., k, and let B be any event such that P(B) > 0. Then for i = 1, ..., k,
P(Ai|B) = P(B|Ai)P(Ai) / Σ_{j=1}^k P(Aj)P(B|Aj)
Proof:
By the definition of conditional probability,
P(Ai|B) = P(Ai ∩ B)/P(B)
The denominators in these expressions are the same by the law of total probability and the numerators are the same using the multiplication rule.
In the case where the partition of S consists of only two events,
P(A|B) = P(B|A)P(A) / (P(B|A)P(A) + P(B|Ac)P(Ac))
Bayes Rule...remarks
Bayes' rule provides us with a method of updating the probabilities of events in the partition based on the new information provided by the occurrence of the event B.
Since P(Aj) is the probability of event Aj prior to the occurrence of event B, it is referred to as the prior probability of event Aj.
P(Aj|B) is the updated probability of the same event after the occurrence of B and is called the posterior probability of event Aj.
Bayes' rule is very commonly used in game-theoretic models. For example, in political economy models a Bayes-Nash equilibrium is a standard equilibrium concept: players (say voters) start with beliefs about politicians and update these beliefs when politicians take actions. Beliefs are constrained to be updated based on Bayes' conditional probability formula.
In Bayesian estimation, prior distributions on population parameters are updated given information contained in a sample. This is in contrast to more standard procedures where only the sample information is used. The sample would now lead to different estimates, depending on the prior distribution of the parameter that is used.
A word about Bayes: he was a non-conformist clergyman (1702-1761), with no formal mathematics degree. He studied logic and theology at the University of Edinburgh.
Bayes Rule ...examples

C1, C2 and C3 are plants producing 10, 50 and 40 per cent of a company's output. The percentage of defective pieces produced by each of these is 1, 3 and 4 respectively. Given that a randomly selected piece is defective (event D), what is the probability that it is from the first plant?
P(C1|D) = P(D|C1)P(C1)/P(D) = (.01)(.1)/((.01)(.1) + (.03)(.5) + (.04)(.4)) = 1/32 ≈ .03
How do the prior and posterior probabilities of the event C1 compare? What does this tell you about the difference between the priors and posteriors for the other events?
Suppose that there is a new blood test to detect a virus. Only 1 in every thousand people in the population has the virus. The test is 98 per cent effective in detecting the disease in people who have it and gives a false positive for one per cent of disease-free persons tested. What is the probability that the person actually has the disease given a positive test result?
P(Disease|Positive) = P(Positive|Disease)P(Disease)/P(Positive) = (.98)(.001)/((.98)(.001) + (.01)(.999)) ≈ .089
So in spite of the test being very effective in catching the disease, we have a large number of false positives.
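Both posteriors in one small Python function (our sketch):

    def posterior(priors, likelihoods):
        """Bayes' rule: posterior over a partition, given P(B|Ai) for each cell."""
        joint = [p * l for p, l in zip(priors, likelihoods)]
        total = sum(joint)                 # law of total probability: P(B)
        return [j / total for j in joint]

    print(posterior([.1, .5, .4], [.01, .03, .04]))   # plants: [0.03125, ...]
    print(posterior([.001, .999], [.98, .01]))        # disease: [0.0893..., ...]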
Bayes Rule ... priors, posteriors and politics

To understand the relationship between prior and posterior probabilities a little better, consider the following example:
A politician, on entering parliament, has a reasonably good reputation. A citizen attaches a prior probability of 3/4 to his being honest (undertaking policies to maximize social welfare, rather than his bank balance).
At the end of his tenure, the citizen finds a very large number of potholes on roads in the politician's constituency. While these do not leave the citizen with a favorable impression of the incumbent, it is possible that the unusually heavy rainfall over these years was responsible.
Elections are coming up. How does the citizen use this information on road conditions to update his assessment of the moral standing of the politician? Let us compute the posterior probability of the politician's being honest, given the event that the roads are in bad condition.
Suppose that the probability of bad roads is 1/3 if the politician is honest and 2/3 if he/she is dishonest.
The posterior probability of the politician being honest is now given by
P(honest|bad roads) = P(bad roads|honest)P(honest)/P(bad roads) = (1/3)(3/4) / ((1/3)(3/4) + (2/3)(1/4)) = 3/5
What would the posterior be if the prior is equal to 1? What if the prior is zero? What if the probability of bad roads was equal to 1/2 for both types of politicians? When are differences between priors and posteriors going to be large?
The Monty Hall problem

A game show host leads the contestant to a wall with three closed doors.
Behind one of these is a fancy car, behind the other two a consolation prize (a bag of sweets).
The contestant must first choose a door without any prior knowledge of what is behind each door.
The host then opens one of the remaining doors, revealing a bag of sweets.
The contestant is given an opportunity to switch doors and wins whatever is behind the door that is finally chosen by him.
Does he raise his chances of winning the car by switching?
Suppose that the contestant chooses door 1 and the host opens door 3. Denote by A1, A2 and A3 the events that the car is behind doors 1, 2 and 3 respectively. Let B be the event that the host opens door 3.
We'd like to compare P(A1|B) and P(A2|B).
By Bayes' rule, the denominator of both these expressions is P(B); we therefore need to compare P(B|A1)P(A1) and P(B|A2)P(A2).
The first expression is (1/2)(1/3), the second is 1 · (1/3) (because if the car is behind door 2 then door 3 will certainly be opened, so P(B|A2) = 1).
The contestant can therefore double his probability of being correct by switching. The posterior probability of A2 is 2/3 while that of A1 remains 1/3.
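A simulation makes the 1/3 vs 2/3 split easy to believe; a Python sketch (ours):

    import random

    N, stay_wins, switch_wins = 100_000, 0, 0
    for _ in range(N):
        car = random.randint(1, 3)
        pick = 1                                   # contestant picks door 1
        # host opens a sweets door other than the contestant's pick
        opened = random.choice([d for d in (2, 3) if d != car])
        switched = ({1, 2, 3} - {pick, opened}).pop()
        stay_wins += (pick == car)
        switch_wins += (switched == car)
    print(stay_wins / N, switch_wins / N)          # about 0.33 and 0.67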
Bayes Rule: The Sally Clark case

Sally Clark was a British solicitor who became the victim of one of the great miscarriages of justice in modern British legal history.
Her first son died within a few weeks of his birth in 1996 and her second one died similarly in 1998, after which she was arrested and tried for their murder.
A well-known paediatrician, Professor Sir Roy Meadow, testified that the chance of two children from an affluent family suffering sudden infant death syndrome was 1 in 73 million, which was arrived at by squaring 1 in 8,500, the likelihood of a cot death in similar circumstances.
Clark was convicted in November 1999. In 2001 the Royal Statistical Society issued a public statement expressing its concern at the misuse of statistics in the courts and arguing that there was no statistical basis for Meadow's claim.
In January 2003, she was released from prison, having served more than three years of her sentence, after it emerged that the prosecutor's pathologist had failed to disclose microbiological reports that suggested one of her sons had died of natural causes.
RSS statement excerpts: "In the recent highly-publicised case of R v. Sally Clark, a medical expert witness drew on published studies to obtain a figure for the frequency of sudden infant death syndrome (SIDS, or cot death) in families having some of the characteristics of the defendant's family. He went on to square this figure to obtain a value of 1 in 73 million for the frequency of two cases of SIDS in such a family. ... This approach is, in general, statistically invalid. It would only be valid if SIDS cases arose independently within families; ... there are very strong a priori reasons for supposing that the assumption will be false. There may well be unknown genetic or environmental factors that predispose families to SIDS, so that a second case within the family becomes much more likely. The true frequency of families with two cases of SIDS may be very much less incriminating than the figure presented to the jury at trial."
Topic 2: Random Variables and Probability Distributions
Sample spaces and random variables

The outcomes of some experiments inherently take the form of real numbers:
crop yields with the application of a new type of fertiliser
students' scores on an exam
miles per litre of an automobile
Other experiments have a sample space that is not inherently a subset of Euclidean space:
outcomes from a series of coin tosses
the character of a politician
the modes of transport taken by a city's population
the degree of satisfaction respondents report for a service provider (patients in a hospital may be asked whether they are very satisfied, satisfied or dissatisfied with the quality of treatment; our sample space would consist of arrays of the form (VS, S, S, DS, ...))
the caste composition of elected politicians
the gender composition of children attending school
A random variable is a function that assigns a real number to each possible outcome s ∈ S.
Random variables
Definition: Let (S, S, P) be a probability space. If X : S → R is a real-valued function having as its domain the elements of S, then X is called a random variable.
A random variable is therefore a real-valued function defined on the space S. Typically x is used to denote its image value, i.e. x = X(s).
If the outcomes of an experiment are inherently real numbers, they are directly interpretable as values of a random variable, and we can think of X as the identity function, so X(s) = s.
We choose random variables based on what we are interested in getting out of the experiment. For example, we may be interested in the number of students passing an exam, and not the identities of those who pass. A random variable would assign each element in the sample space a number corresponding to the number of passes associated with that outcome.
We therefore begin with a probability space (S, S, P) and arrive at an induced probability space (R(X), B, PX).
How exactly do we arrive at the function PX(.)? As long as every set A ⊆ R(X) is associated with an event in our original sample space S, PX(A) is just the probability assigned to that event by P.
Random variables..examples
1. Tossing a coin ten times.
The sample space consists of the 2^10 possible sequences of heads and tails.
There are many different random variables that could be associated with this experiment: X1 could be the number of heads, X2 the longest run of heads divided by the longest run of tails, X3 the number of times we get two heads immediately before a tail, etc.
For s = HTTHHHHTTH, what are the values of these random variables?
2. Choosing a point in a rectangle within a plane.
An experiment involves choosing a point s = (x, y) at random from the rectangle S = {(x, y) : 0 ≤ x ≤ 2, 0 ≤ y ≤ 1/2}.
The random variable X could be the x-coordinate of the point, and an event is X taking values in [1, 2].
Another random variable Z could be the distance of the point from the origin, Z(s) = (x² + y²)^(1/2).
3. Heights, weights, distances, temperature, scores, incomes... In these cases, we can have X(s) = s since these are already expressed as real numbers.
Induced probability spaces..examples

Let's look at some examples of how we arrive at our probability measure PX(A).
A coin is tossed once and we're interested in the number of heads, X. The probability assigned to the set A = {1} in our new space is just the probability associated with one head in our original space. So Pr(X = x) = 1/2, x ∈ {0, 1}.
With two tosses, the probability attached to the set A = {1} is the sum of the probabilities associated with the disjoint sets {H, T} and {T, H} whose union forms this event. In this case
Pr(X = x) = (2 choose x)(1/2)², x ∈ {0, 1, 2}
Now consider a sequence of flips of an unbiased coin, and our random variable X is the number of flips needed for the first head. We now have
Pr(X = x) = f(x) = (1/2)^(x−1)(1/2) = (1/2)^x, x = 1, 2, 3, ...
Is this a valid probability measure?
How is the nature of the sample space in the first two coin-flipping examples different from the third?
In all these cases we have a discrete random variable.
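For the last example, a Python check (ours) that the probabilities sum to 1 and that simulated flip counts match f(x) = (1/2)^x:

    import random

    print(sum(0.5**x for x in range(1, 200)))     # approaches 1: valid p.m.f.

    def flips_until_head():
        x = 1
        while random.random() >= 0.5:             # tails with probability 1/2
            x += 1
        return x

    N = 100_000
    draws = [flips_until_head() for _ in range(N)]
    for x in (1, 2, 3):
        print(x, draws.count(x) / N, 0.5**x)      # empirical vs theoretical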
The distribution function

Once we've assigned real numbers to all the subsets of our sample space S that are of interest, we can restrict our attention to the probabilities associated with the occurrence of sets of real numbers.
Consider the set A = (−∞, x].
Now P(A) = Pr(X ≤ x).
F(x) is used to denote the probability Pr(X ≤ x) and is called the distribution function of X.
Definition: The distribution function F of a random variable X is a function defined for each real number x as follows:
F(x) = P(X ≤ x) for −∞ < x < ∞
If there are a finite number of elements w in A that receive positive probability, this probability can be computed as
F(x) = Σ_{w≤x} f(w)
In this case, the distribution function will be a step function, jumping at all points x in R(X) which are assigned positive probability.
Consider the experiment of tossing two fair coins. Describe the probability space induced by the random variable X, the number of heads, and derive the distribution function of X.
Discrete distributions
Definition: A random variable X has a discrete distribution if X can take only a finite number k of different values x1, x2, ..., xk or an infinite sequence of different values x1, x2, ...
The function f(x) = P(X = x) is the probability function of X. We define it to be f(x) for all values x in our sample space R(X) and zero elsewhere.
If X has a discrete distribution, the probability of any subset A of the real line is given by
P(X ∈ A) = Σ_{xi∈A} f(xi)
Examples:
1. The discrete uniform distribution: picking one of the first k positive integers at random:
f(x) = 1/k for x = 1, 2, ..., k, and 0 otherwise
2. The binomial distribution: the probability of x successes in n trials:
f(x) = (n choose x) p^x q^(n−x) for x = 0, 1, 2, ..., n, and 0 otherwise
Derive the distribution functions for each of these.
Continuous distributions
The sample space associated with our random variable often has an infinite number of points.
Example: A point is randomly selected inside a circle of unit radius with origin (0, 0), where the probability assigned to being in a set A ⊆ S is P(A) = (area of A)/π, and X is the distance of the selected point from the origin. In this case F(x) = Pr(X ≤ x) = (area of circle with radius x)/π = x², so the distribution function of X is given by
F(x) = 0 for x < 0, x² for 0 ≤ x < 1, and 1 for x ≥ 1
Definition: A random variable X has a continuous distribution if there exists a nonnegative function f defined on the real line, such that for any interval A,
P(X ∈ A) = ∫_A f(x)dx
The function f is called the probability density function or p.d.f. of X and must satisfy the conditions below:
1. f(x) ≥ 0
2. ∫_{−∞}^{∞} f(x)dx = 1
What is f(x) for the above example? How can you use this to compute P(1/4 < X ≤ 1/2)? How would you use F(x) instead?
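Both routes give the same answer; in Python (our sketch), with f(x) = 2x and F(x) = x² on [0, 1]:

    from fractions import Fraction

    F = lambda x: x * x                            # distribution function on [0, 1]
    print(F(Fraction(1, 2)) - F(Fraction(1, 4)))   # 3/16, using F directly

    # the same probability from the density f(x) = 2x, via a Riemann sum
    n, a, b = 100_000, 0.25, 0.5
    h = (b - a) / n
    print(sum(2 * (a + (i + 0.5) * h) * h for i in range(n)))  # about 0.1875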
Continuous distributions..examples
1. The uniform distribution on an interval: Suppose a and b are two real numbers with a < b. A point x is selected from the interval S = {x : a ≤ x ≤ b} and the probability that it belongs to any subinterval of S is proportional to the length of that subinterval. It follows that the p.d.f. must be constant on S and zero outside it:
f(x) = 1/(b − a) for a ≤ x ≤ b, and 0 otherwise
Notice that the value of the p.d.f. is the reciprocal of the length of the interval, these values can be greater than one, and the assignment of probabilities does not depend on whether the distribution is defined over the closed interval [a, b] or the open interval (a, b).
2. Unbounded random variables: It is sometimes convenient to define a p.d.f. over unbounded sets, because such functions may be easier to work with and may approximate the actual distribution of a random variable quite well. An example is:
f(x) = 0 for x ≤ 0, and 1/(1 + x)² for x > 0
3. Unbounded densities: The following function is unbounded around zero but still represents a valid density:
f(x) = (2/3)x^(−1/3) for 0 < x < 1, and 0 otherwise
Mixed distributions
Often the process of collecting or recording data leads to censoring, and instead of obtaining a sample from a continuous distribution, we obtain one from a mixed distribution.
Examples:
The weight of an object is a continuous random variable, but our weighing scale only records weights up to a certain level.
Households with very high incomes often underreport their income; for incomes above a certain level (say $250,000), surveys often club all households together. This variable is therefore top-censored.
In each of these examples, we can derive the probability distribution for the new random variable, given the distribution for the continuous variable. In the example we've just considered:
f(x) = 0 for x ≤ 0, and 1/(1 + x)² for x > 0
Suppose we record X = 3 for all values of X ≥ 3. The probability function for our new random variable Y is given by the same p.d.f. for values less than 3 and by P(Y = 3) = ∫_3^∞ (1 + x)^(−2) dx = 1/4 at Y = 3.
Some variables, such as the number of hours worked per week, have a mixed distribution in the population, with mass points at 0 and 40.
Properties of the distribution function

Recall that the distribution function or cumulative distribution function (c.d.f.) for a random variable X is defined as
F(x) = P(X ≤ x) for −∞ < x < ∞.
It follows that for any random variable (discrete, continuous or mixed), the domain of F is the real line and the values of F(x) must lie in [0, 1]. We can also establish that all distribution functions have the following three properties:
1. F(x) is a nondecreasing function of x, i.e. if x1 < x2 then F(x1) ≤ F(x2).
(The occurrence of the event {X ≤ x1} implies the occurrence of {X ≤ x2}, so P(X ≤ x1) ≤ P(X ≤ x2).)
2. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1
(In the limit, {x : x ≤ −∞} is the null set and {x : x ≤ ∞} is the entire sample space.)
3. F(x) is right-continuous, i.e. F(x) = F(x+) at every point x, where F(x+) is the right-hand limit of F(x).
(For discrete random variables, there will be a jump at values that are taken with positive probability.)


RESULT 1: For any given value of x, P(X > x) = 1 F(x)
RESULT 2: For any values x1 and x2 where x1 < x2 , P(x1 < X x2 ) = F(x2 ) F(x1 )
Proof: Let A be the event X x1 and B be the event X x2 . B can be written as the union of two events
B = (A B) (Ac B). Since A B, P(A B) = P(A). The event were interested in is Ac B whose probability
is given by P(B) P(A) or P(x1 < X x2 ) = P(X x2 ) P(X x1 ). Now apply the definition of a d.f.

RESULT 3: For any given value x


P(X < x) = F(x )
RESULT 4: For any given value x
P(X = x) = F(x+ ) F(x )
The distribution function of a continuous random variable will be continuous and since
Rx
F(x) =
f(t)dt,

F0 (x) = f(x)
For discrete and mixed discrete-continous random variables F(x) will exhibit a countable number
of discontinuities at jump points reflecting the assignment of positive probabilities to a countable
number of events.
Examples of distribution functions

Consider the experiment of rolling a die or tossing a fair coin, with X in the first case being the number of dots and in the second case the number of heads. Graph the distribution function of X in each of these cases.
What about the experiment of picking a point in the unit interval [0, 1] with X as the distance from the origin?
What type of probability function corresponds to the following distribution function?
[Fig. 3.6 from DeGroot and Schervish: an example of a c.d.f. F(x), rising from 0 to 1 with jumps at x1 and x3, values z0, z1, z2, z3 at intermediate points, and F(x) = 1 for x ≥ x4.]

The quantile function

The distribution function of X gives us the probability that X ≤ x for all real numbers x.
Suppose we are given a probability p and want to know the value of x corresponding to this value of the distribution function.
If F is a one-to-one function, then it has an inverse, and the value we are looking for is given by F^(−1)(p).
Examples: median income would be found by F^(−1)(1/2), where F is the distribution function of income.
Definition: When the distribution function of a random variable X is continuous and one-to-one over the whole set of possible values of X, we call the function F^(−1) the quantile function of X. The value of F^(−1)(p) is called the pth quantile of X, or the 100p-th percentile of X, for each 0 < p < 1.
Example: If X has a uniform distribution over the interval [a, b], F(x) = (x − a)/(b − a) over this interval, 0 for x ≤ a and 1 for x > b. Given a value p, we simply solve F(x) = p for the pth quantile: x = pb + (1 − p)a. Compute this for p = .5, .25, .9, ...
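In Python (our sketch), the uniform quantile function is immediate:

    def uniform_quantile(p, a, b):
        """Inverse of F(x) = (x - a)/(b - a): the pth quantile of U[a, b]."""
        return p * b + (1 - p) * a

    for p in (.5, .25, .9):
        print(p, uniform_quantile(p, 0, 10))   # 5.0, 2.5, 9.0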
Examples: computing quantiles, etc.

1. The p.d.f. of a random variable is given by:
f(x) = x/8 for 0 ≤ x ≤ 4, and 0 otherwise
Find the value of t such that
(a) P(X ≤ t) = 1/4
(b) P(X ≥ t) = 1/2
2. The p.d.f. of a random variable is given by:
f(x) = cx² for 1 ≤ x ≤ 2, and 0 otherwise
Find the value of the constant c and Pr(X > 3/2).
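One way to check your answers, with symbolic integration in Python (our sketch, using sympy):

    import sympy as sp

    x, t, c = sp.symbols("x t c", positive=True)

    # Problem 1: F(t) = t^2/16 on [0, 4]
    F = sp.integrate(x / 8, (x, 0, t))
    print(sp.solve(sp.Eq(F, sp.Rational(1, 4)), t))        # [2]
    print(sp.solve(sp.Eq(1 - F, sp.Rational(1, 2)), t))    # [2*sqrt(2)]

    # Problem 2: normalize the density, then compute the tail probability
    c_val = sp.solve(sp.Eq(sp.integrate(c * x**2, (x, 1, 2)), 1), c)[0]
    print(c_val)                                           # 3/7
    print(sp.integrate(c_val * x**2, (x, sp.Rational(3, 2), 2)))  # 37/56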
Bivariate distributions
Social scientists are typically interested in the manner in which multiple attributes of people and the societies they live in are related. The object of interest is a multivariate probability distribution (examples: education and earnings, days ill per month and age, sex-ratios and areas under rice cultivation).
This involves dealing with the joint distribution of two or more random variables. Bivariate distributions attach probabilities to events that are defined by values taken by two random variables (say X and Y).
Values taken by these random variables are now ordered pairs (xi, yi), and an event A is a set of such values.
If both X and Y are discrete random variables, the probability function is
f(x, y) = P(X = x and Y = y), and P((X, Y) ∈ A) = Σ_{(xi,yi)∈A} f(xi, yi)
Representing a discrete bivariate distribution

If both X and Y are discrete, this function takes only a finite number of values.
If there are only a small number of these values, they can be usefully presented in a table.
The table below could represent the probabilities of receiving different levels of education. X is the highest level of education and Y is gender:

education            male   female
none                 .05    .2
primary              .25    .1
middle               .15    .04
high                 .1     .03
senior secondary     .03    .02
graduate and above   .02    .01

What are some features of a table like this one? In particular, how would we obtain probabilities associated with the following events:
receiving no education
becoming a female graduate
completing primary school
What else do you learn from the table about the population of interest?
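Reading the table in code (our sketch) makes the marginal sums and event probabilities mechanical:

    joint = {  # (education, gender) -> probability
        ("none", "m"): .05,       ("none", "f"): .20,
        ("primary", "m"): .25,    ("primary", "f"): .10,
        ("middle", "m"): .15,     ("middle", "f"): .04,
        ("high", "m"): .10,       ("high", "f"): .03,
        ("senior sec", "m"): .03, ("senior sec", "f"): .02,
        ("graduate", "m"): .02,   ("graduate", "f"): .01,
    }
    print(sum(joint.values()))                                  # 1.0 (up to rounding)
    print(joint[("none", "m")] + joint[("none", "f")])          # P(no education) = .25
    print(joint[("graduate", "f")])                             # P(female graduate) = .01
    print(sum(p for (e, g), p in joint.items() if g == "f"))    # marginal P(female) = .4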
Continuous bivariate distributions

We can extend our definition of a continuous univariate distribution to the bivariate case:
Definition: Two random variables X and Y have a continuous joint distribution if there exists a nonnegative function f defined over the xy-plane such that for any subset A of the plane
P[(X, Y) ∈ A] = ∫∫_A f(x, y)dxdy
f is now called the joint probability density function and must satisfy
1. f(x, y) ≥ 0 for −∞ < x < ∞ and −∞ < y < ∞
2. ∫∫ f(x, y)dxdy = 1
Example 1: Given the following joint density function on X and Y, we'll calculate P(X ≥ Y):
f(x, y) = cx²y for x² ≤ y ≤ 1, and 0 otherwise
First find c to make this a valid joint density (notice the limits of integration here: y runs from x² to 1 and x from −1 to 1); it will turn out to be 21/4. Then integrate the density over y ∈ (x², x) and x ∈ (0, 1). Now using this density, P(X ≥ Y) = 3/20.
Example 2: A point (X, Y) is selected at random from inside the circle x² + y² ≤ 9. Determine the joint density function f(x, y).
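Verifying Example 1 symbolically in Python (our sketch, using sympy):

    import sympy as sp

    x, y = sp.symbols("x y")

    # normalizing constant: total mass over x^2 <= y <= 1 must equal 1
    mass = sp.integrate(x**2 * y, (y, x**2, 1), (x, -1, 1))
    c = 1 / mass
    print(c)                                                  # 21/4

    # P(X >= Y): the region x^2 <= y <= x is nonempty only for 0 <= x <= 1
    p = sp.integrate(c * x**2 * y, (y, x**2, x), (x, 0, 1))
    print(p)                                                  # 3/20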
Bivariate distribution functions

Definition: The joint distribution function of two random variables X and Y is defined as the function F such that for all values of x and y (−∞ < x < ∞ and −∞ < y < ∞)
F(x, y) = P(X ≤ x and Y ≤ y)
The probability that (X, Y) will lie in a specified rectangle in the xy-plane is given by
Pr(a < X ≤ b and c < Y ≤ d) = F(b, d) − F(a, d) − F(b, c) + F(a, c)
Note: The distinction between weak and strict inequalities is important when points on the boundary of the rectangle occur with positive probability.
The distribution functions of X and Y can be derived as:
Pr(X ≤ x) = F1(x) = lim_{y→∞} F(x, y) and Pr(Y ≤ y) = F2(y) = lim_{x→∞} F(x, y)
If F(x, y) is continuously differentiable in both its arguments, the joint density is derived as:
f(x, y) = ∂²F(x, y)/∂x∂y
and given the density, we can integrate w.r.t. x and y over the appropriate limits to get the distribution function.
Example: Suppose that, for x and y ∈ [0, 2], we have F(x, y) = (1/16)xy(x + y); derive the distribution functions of X and Y and their joint density. Notice the (x, y) range over which F(x, y) is strictly increasing.
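For this example, sympy (our sketch) does the differentiation and the boundary-evaluation step:

    import sympy as sp

    x, y = sp.symbols("x y")
    F = x * y * (x + y) / 16            # joint c.d.f. on [0, 2] x [0, 2]

    f = sp.diff(F, x, y)                # joint density
    print(sp.simplify(f))               # (x + y)/8

    # marginal c.d.f.s: on a bounded support, evaluate at the upper boundary
    F1 = F.subs(y, 2)
    F2 = F.subs(x, 2)
    print(sp.expand(F1), sp.expand(F2)) # x**2/8 + x/4, y**2/8 + y/4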
Marginal distributions
A distribution of X derived from the joint distribution of X and Y is known as the marginal distribution of X. For a discrete random variable:
f1(x) = P(X = x) = Σ_y P(X = x and Y = y) = Σ_y f(x, y)
and analogously
f2(y) = P(Y = y) = Σ_x P(X = x and Y = y) = Σ_x f(x, y)
For a continuous joint density f(x, y), the marginal densities for X and Y are given by:
f1(x) = ∫_{−∞}^{∞} f(x, y)dy and f2(y) = ∫_{−∞}^{∞} f(x, y)dx
Go back to our table representing the joint distribution of gender and education and find the marginal distribution of education.
Can one construct the joint distribution from one of the marginal distributions?


Independent random variables


Definition: The two random variables X and Y are independent if, for any two sets A and B of
real numbers,
P(X ∈ A and Y ∈ B) = P(X ∈ A)P(Y ∈ B)
In other words, if A is an event whose occurrence depends only on values taken by X and B's
occurrence depends only on values taken by Y, then the random variables X and Y are
independent only if the events A and B are independent, for all such events A and B.
The condition for independence can alternatively be stated in terms of the joint and
marginal distribution functions of X and Y by letting the sets A and B be the intervals
(−∞, x] and (−∞, y] respectively:
F(x, y) = F1(x)F2(y)
For discrete distributions, we simply define the sets A and B as the points x and y and
require f(x, y) = f1(x)f2(y).
In terms of the density functions, we say that X and Y are independent if it is possible to
choose functions f1 and f2 such that the following factorization holds for all
−∞ < x < ∞ and −∞ < y < ∞:
f(x, y) = f1(x)f2(y)


Independent random variables: examples

There are two independent measurements X and Y of rainfall at a certain location, each with
marginal density
g(x) = 2x for 0 ≤ x ≤ 1, and 0 otherwise
Find the probability that X + Y ≤ 1.
The joint density 4xy is obtained by multiplying the marginal densities, because these
variables are independent. The required probability of 1/6 is then obtained by integrating
this density over y ∈ (0, 1 − x) and x ∈ (0, 1).
How might we use a table of probabilities to determine whether two random variables are
independent?
Given the following density, can we tell whether the variables X and Y are independent?
f(x, y) = ke^{−(x+2y)} for x ≥ 0 and y ≥ 0, and 0 otherwise
Notice that we can factorize the joint density as the product of k1·e^{−x} and k2·e^{−2y},
where k1·k2 = k. To obtain the marginal densities of X and Y, we multiply these functions by
appropriate constants which make them integrate to unity. This gives us
f1(x) = e^{−x} for x ≥ 0 and f2(y) = 2e^{−2y} for y ≥ 0
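
A Monte Carlo check of the 1/6 answer (a numpy sketch, illustrative; X and Y are drawn by
inverse-CDF sampling, since F(x) = x² on (0, 1) gives X = √U):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1_000_000
    x = np.sqrt(rng.uniform(size=n))     # density 2x on (0, 1)
    y = np.sqrt(rng.uniform(size=n))     # an independent copy
    print(np.mean(x + y <= 1))           # approximately 1/6 = 0.1667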


Dependent random variables: examples

Given the following densities, let's see why the variables X and Y are dependent:
1. f(x, y) = x + y for 0 < x < 1 and 0 < y < 1, and 0 otherwise
Notice that we cannot factorize this joint density as the product of a non-negative function
of x and another non-negative function of y. Computing the marginals gives us
f1(x) = x + 1/2 for 0 < x < 1    and    f2(y) = y + 1/2 for 0 < y < 1
so the product of the marginals is not equal to the joint density.
2. Suppose we have
f(x, y) = kx²y² for x² + y² ≤ 1, and 0 otherwise
In this case the possible values X can take depend on Y and therefore, even though the joint
density can be factorized, the same factorization cannot work for all values of (x, y).
More generally, whenever the space of positive probability density of X and Y is bounded by a
curve, rather than a rectangle, the two random variables are dependent.


Dependent random variables: a result

Whenever the space of positive probability density of X and Y is bounded by a curve, rather
than a rectangle, the two random variables are dependent. If, on the other hand, the support of
f(x, y) is a rectangle and the joint density is of the form f(x, y) = kg(x)h(y), then X and Y
are independent.
Proof: For the latter part of the result, suppose the support of f(x, y) is given by the
rectangle a ≤ x ≤ b and c ≤ y ≤ d, where a < b and c < d. The joint density f(x, y) can be
written as k1·g(x) · k2·h(y), where k1 = 1/∫_a^b g(x)dx and k2 = 1/∫_c^d h(y)dy. The marginal
densities are f1(x) = k1·g(x) ∫_c^d k2·h(y)dy = k1·g(x) and
f2(y) = k2·h(y) ∫_a^b k1·g(x)dx = k2·h(y), whose product gives us the joint density.
Now to show that if the support is not a rectangle, the variables are dependent: Start with a
point (x, y) outside the domain where f(x, y) > 0. If X and Y are independent, we have
f(x, y) = f1(x)f2(y), so one of f1(x) and f2(y) must be zero. Now as we move due south and
enter the set where f(x, y) > 0, our value of x has not changed, so it could not be that f1(x)
was zero at the original point. Similarly, if we move west, y is unchanged, so it could not be
that f2(y) was zero at the original point. So we have a contradiction.


Conditional distributions
Definition: Consider two discrete random variables X and Y with a joint probability function
f(x, y) and marginal probability functions f1(x) and f2(y). After the value Y = y has been
observed, we can write the probability that X = x using our definition of conditional
probability:
P(X = x | Y = y) = P(X = x and Y = y)/P(Y = y) = f(x, y)/f2(y)
g1(x|y) = f(x, y)/f2(y) is called the conditional probability function of X given that Y = y.
Notice that:
1. for each fixed value of y, g1(x|y) is a probability function over all possible values of X,
because it is non-negative and
Σ_x g1(x|y) = (1/f2(y)) Σ_x f(x, y) = f2(y)/f2(y) = 1
2. conditional probabilities are proportional to joint probabilities, because they just divide
these by a constant.
We cannot use the definition of conditional probability to derive the conditional density for
continuous random variables, because the probability that Y takes any particular value y is
zero. We simply define the conditional probability density function of X given Y = y as
g1(x|y) = f(x, y)/f2(y) for −∞ < x < ∞ and −∞ < y < ∞


Conditional versus joint densities


The numerator in g1(x|y) = f(x, y)/f2(y) is a section of the surface representing the joint
density, and the denominator is the constant by which we need to divide the numerator to get a
valid density (one which integrates to unity).


Deriving conditional distributions... the discrete case


For the education-gender example, we can find the distribution of educational achievement
conditional on being male, the distribution of gender conditional on completing college, or any
other conditional distribution we are interested in:

education             male    female    f(education | gender = male)
none                  .05     .20       .08
primary               .25     .10       .42
middle                .15     .04       .25
high                  .10     .03       .17
senior secondary      .03     .02       .05
graduate and above    .02     .01       .03

f(gender | graduate and above):    male .67    female .33

(Each conditional value divides the joint probability by the relevant marginal: P(male) = .60
and P(graduate and above) = .03.)


Deriving conditional distributions... the continuous case


For the continuous joint distribution we've looked at before,
f(x, y) = (21/4)x²y for x² ≤ y ≤ 1, and 0 otherwise
the marginal density of X is given by
f1(x) = ∫_{x²}^{1} (21/4)x²y dy = (21/8)x²(1 − x⁴) for −1 ≤ x ≤ 1
and the conditional density g2(y|x) = f(x, y)/f1(x) is:
g2(y|x) = 2y/(1 − x⁴) for x² ≤ y ≤ 1, and 0 otherwise
If X = 1/2, we can compute P(Y ≥ 1/4 | X = 1/2) = 1 and
P(Y ≥ 3/4 | X = 1/2) = ∫_{3/4}^{1} g2(y | 1/2) dy = 7/15
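
A symbolic check of this conditional probability (sympy, illustrative):

    import sympy as sp

    x, y = sp.symbols('x y')
    f1 = sp.Rational(21, 8)*x**2*(1 - x**4)                # marginal of X
    g2 = sp.simplify(sp.Rational(21, 4)*x**2*y / f1)       # conditional: 2y/(1 - x^4)
    prob = sp.integrate(g2.subs(x, sp.Rational(1, 2)), (y, sp.Rational(3, 4), 1))
    print(prob)                                             # 7/15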


Construction of the joint distribution


We can use conditional and marginal distributions to arrive at a joint distribution:
f(x, y) = g1(x|y)f2(y) = g2(y|x)f1(x)    (1)
Notice that the conditional distribution is not defined for a value y0 at which f2(y0) = 0, but
this is irrelevant because at any such value f(x, y0) = 0.
Example: X is first chosen from a uniform distribution on (0, 1) and then Y is chosen from a
uniform distribution on (x, 1). The marginal distribution of X is straightforward:
f1(x) = 1 for 0 < x < 1, and 0 otherwise
Given a value of X = x, the conditional density is
g2(y|x) = 1/(1 − x) for x < y < 1, and 0 otherwise
Using (1), the joint density is
f(x, y) = 1/(1 − x) for 0 < x < y < 1, and 0 otherwise
and the marginal density for Y can now be derived as:
f2(y) = ∫_0^y f(x, y)dx = ∫_0^y 1/(1 − x) dx = −log(1 − y) for 0 < y < 1
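
A simulation sketch of this construction (numpy, illustrative): draw X uniform on (0, 1),
then Y uniform on (x, 1), and compare an empirical density estimate for Y with −log(1 − y):

    import numpy as np

    rng = np.random.default_rng(1)
    n = 500_000
    x = rng.uniform(0, 1, size=n)
    y = rng.uniform(x, 1)                 # Y | X = x is uniform on (x, 1)
    for point in (0.2, 0.5, 0.8):
        # fraction of draws in a small window around the point, divided by its width
        est = np.mean(np.abs(y - point) < 0.005) / 0.01
        print(point, est, -np.log(1 - point))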


Multivariate distributions
Our definitions of joint, conditional and marginal distributions can be easily extended to an
arbitrary finite number of random variables. Such a distribution is now called a multivariate
distribution.
The joint distribution function is defined as the function F whose value at any point
(x1, x2, . . . , xn) ∈ Rⁿ is given by:
F(x1, . . . , xn) = P(X1 ≤ x1, X2 ≤ x2, . . . , Xn ≤ xn)
For a discrete joint distribution, the probability function at any point (x1, . . . , xn) ∈ Rⁿ
is given by:
f(x1, . . . , xn) = P(X1 = x1, X2 = x2, . . . , Xn = xn)    (2)
and the random variables X1, . . . , Xn have a continuous joint distribution if there is a
nonnegative function f defined on Rⁿ such that for any subset A ⊂ Rⁿ,
P[(X1, . . . , Xn) ∈ A] = ∫···∫_A f(x1, . . . , xn) dx1 . . . dxn    (3)
The marginal distribution of any single random variable Xi can now be derived by integrating
over the other variables, e.g.
f1(x1) = ∫···∫ f(x1, . . . , xn) dx2 . . . dxn    (4)
and the conditional probability density function of X1 given values of the other variables is:
g1(x1 | x2, . . . , xn) = f(x1, . . . , xn)/f0(x2, . . . , xn)    (5)
where f0 is the marginal joint density of (X2, . . . , Xn).


Independence for the multivariate case


Independence: The n random variables X1, . . . , Xn are independent if, for any n sets
A1, A2, . . . , An of real numbers,
P(X1 ∈ A1, X2 ∈ A2, . . . , Xn ∈ An) = P(X1 ∈ A1)P(X2 ∈ A2) . . . P(Xn ∈ An)
If the joint distribution function of X1, . . . , Xn is given by F and the marginal d.f. of Xi
by Fi, it follows that X1, . . . , Xn will be independent if and only if, for all points
(x1, . . . , xn) ∈ Rⁿ,
F(x1, . . . , xn) = F1(x1)F2(x2) . . . Fn(xn)
and, if these random variables have a continuous joint distribution with joint density
f(x1, . . . , xn):
f(x1, . . . , xn) = f1(x1)f2(x2) . . . fn(xn)
In the case of a discrete joint distribution, the above equality holds for the probability
function f.
Random samples: The n random variables X1, . . . , Xn form a random sample if these
variables are independent and the marginal p.f. or p.d.f. of each of them is the same
function f. It follows that for all points (x1, . . . , xn), their joint p.f. or p.d.f. is
g(x1, . . . , xn) = f(x1) . . . f(xn)
The variables that form a random sample are said to be independent and identically
distributed (i.i.d.) and n is the sample size.


Multivariate distributions: example
Suppose we start with the following density function for a variable X1:
f1(x) = e^{−x} for x > 0, and 0 otherwise
and are told that for any given value of X1 = x1, two other random variables X2 and X3 are
independently and identically distributed with the following conditional p.d.f.:
g(t|x1) = x1·e^{−x1·t} for t > 0, and 0 otherwise
The conditional p.d.f. of (X2, X3) given X1 = x1 is then g23(x2, x3|x1) = x1²e^{−x1(x2+x3)} for
non-negative values of x2, x3 (and zero otherwise), and the joint p.d.f. of the three random
variables is given by:
f(x1, x2, x3) = f1(x1)g23(x2, x3|x1) = x1²e^{−x1(1+x2+x3)}
for non-negative values of each of these variables. We can now obtain the marginal joint
p.d.f. of X2 and X3 by integrating over X1.


Distributions of functions of random variables


We'd like to derive the distribution of Y = X², knowing that X has a uniform distribution
on (−1, 1):
the density f(x) of X over this interval is 1/2
we know further that Y takes values in [0, 1)
the distribution function of Y is therefore given by
G(y) = P(Y ≤ y) = P(X² ≤ y) = P(−√y ≤ X ≤ √y) = ∫_{−√y}^{√y} (1/2) dx = √y
The density is obtained by differentiating this:
g(y) = 1/(2√y) for 0 < y < 1, and 0 otherwise


The Probability Integral Transformation


RESULT: Let X be a continuous random variable with the distribution function F and let
Y = F(X). Then Y must be uniformly distributed on [0, 1]. The transformation from X to Y is
called the probability integral transformation.
We know that the distribution function must take values between 0 and 1. If we pick any of
these values, y, the yth quantile of the distribution of X will be given by some number x, and
P(Y ≤ y) = P(X ≤ x) = F(x) = y
which is the distribution function of a uniform random variable on [0, 1].
This result helps us generate random numbers from various distributions, because it allows
us to transform a sample from a uniform distribution into a sample from some other
distribution, provided we can find F⁻¹.
Example: Suppose we want a sample from an exponential distribution. The density is e^{−x}
defined over all x > 0 and the distribution function is 1 − e^{−x}. If we pick from a uniform
between 0 and 1, and get (say) .3, we can invert the distribution function to get
x = log(10/7) ≈ .36 as an observation of an exponential random variable.
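
A minimal sketch of this idea for the exponential case (numpy, illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    u = rng.uniform(size=100_000)
    x = -np.log(1 - u)            # F^{-1}(u) for F(x) = 1 - e^{-x}
    print(x.mean(), x.var())      # both approximately 1, as for a unit exponential
    print(-np.log(0.7))           # the single draw in the text: u = .3 gives x = .36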


Random number generators


Historically, tables of random digits were used to generate a sample from a uniform
distribution. For example, consider the following series of digits:
553617280595580771997955130480651347088612
If we want 10 numbers between 1 and 9, we start at a random digit in the table and pick
the next 10 digits. What about numbers between 1 and 100?
Today, we would never do this, but would use a statistical package instead. In Stata, for
example:
runiform() returns uniformly distributed random variates on the interval [0,1).
Many packages also allow us to draw directly from the distribution we are interested in:
rnormal(m, s) returns normal(m, s) random variates, where m is the mean and s is the
standard deviation.


Topic 3: The Expectation and other Moments of a Random Variable



Expectation of a discrete random variable


Definition: The expected value of a discrete random variable X, when it exists, is defined by
E(X) = Σ_{x∈R(X)} x f(x)
The expectation is simply a weighted average of possible outcomes, with the weights being
assigned by f(x).
In general E(X) need not belong to R(X). Consider the experiment of rolling a die where the
random variable is the number of dots on the die. The probability function is given by
f(x) = (1/6)I_{{1,2,...,6}}(x) and the expectation is
Σ_{x=1}^{6} (x/6) = 3.5
If X can take only a finite number of different values, this expectation always exists.
If there is an infinite sequence of possible values of X, then the expectation exists if and
only if Σ_{x∈R(X)} |x| f(x) < ∞ (the series defining the expectation is absolutely convergent).
We can think of the expectation as a point of balance: if there were various weights placed
on a weightless rod, where should a fulcrum be placed so that the distribution of weights
balances?
The expectation of X is also called the expected value or the mean of X.


Expectation of a continuous random variable


Definition: The expected value of a continuous random variable X is defined by
E(X) = ∫ x f(x) dx, which exists iff ∫ |x| f(x) dx < ∞
Suppose we have a distribution f which is symmetric with respect to a given point x0 on the
x-axis, so that f(x0 + δ) = f(x0 − δ) for all δ. If the expectation exists, it will be equal
to x0.
The expectation will always exist if the set of values taken by X is bounded.
When does it not exist? We need sufficiently small weights attached to large values of X
when the set of possible values is not bounded. The tails of a distribution may fall off fast
enough for the area under it to integrate to 1, but the function xf(x) may not have this
property if the tails are thick.
The Cauchy distribution, f(x) = 1/(π(1 + x²)), is symmetric (as are the normal and the
Student's t distributions), but the expectation of the Cauchy distribution does not exist.

[Figure: the p.d.f. of the Cauchy distribution.]

Expectation of functions of a random variable

We may be interested in the expectation of a function Y = g(X) of a random variable X.
Examples:
Agricultural yields may be given by the random variable X; revenue, for any given value
x, is given by the function p(x)x.
Our random variable might be food availability on a farm; child health would be a
function of such availability.
Scores on an aptitude test may be the random variable and performance in a course
could be a function of these.
Suppose that the density function h(y) of Y were available to us. We could directly compute
the expectation as E(Y) = ∫ y h(y) dy (if continuous). But we don't need this:
RESULT: Let X be a random variable having density function f(x). Then the expectation of
Y = g(X) (in the discrete and continuous case respectively) is given by:
E[g(X)] = Σ_{x∈R(X)} g(x)f(x)    and    E[g(X)] = ∫ g(x)f(x) dx


Expectation of functions: examples

Let g(x) = √x and f(x) = 2x for 0 < x < 1 (0 otherwise). Then
E(√X) = ∫_0^1 x^{1/2}(2x) dx = 4/5
A point (X, Y) is chosen at random from the unit square 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. The joint
density over all points (x, y) in the square is 1, and
E(X² + Y²) = ∫_0^1 ∫_0^1 (x² + y²) dx dy = 2/3
(X1, X2) forms a random sample of size 2 from a uniform distribution on (0, 1) and
Y = min(X1, X2). We'll show that
E(Y) = 2 ∫_0^1 ∫_0^{x2} x1 dx1 dx2 = ∫_0^1 x2² dx2 = 1/3
Suppose we are interested in the expectation of a random variable Y = g(X), defined over a
set Ω. This would be given by ∫_Ω y f(x)dx. If Ω1 and Ω2 form a partition of Ω, we can write
this integral as
∫_Ω y f(x)dx = ∫_{Ω1} y f(x)dx + ∫_{Ω2} y f(x)dx
In this case, we either have X1 < X2 or X1 ≥ X2, and the first piece (the integral over the
triangle above the 45-degree line, where min(X1, X2) = X1) gives us
∫_0^1 ∫_0^{x2} x1 dx1 dx2 = ∫_0^1 (x2²/2) dx2 = 1/6
We double this to account for the symmetric case where X1 ≥ X2.
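
A one-line Monte Carlo check of E(Y) = 1/3 (numpy, illustrative):

    import numpy as np

    rng = np.random.default_rng(3)
    x1, x2 = rng.uniform(size=(2, 1_000_000))
    print(np.minimum(x1, x2).mean())     # approximately 1/3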


Expectation properties
RESULT 1: If Y = aX + b, then E(Y) = aE(X) + b.
Proof (for a continuous random variable X):
E(aX + b) = ∫ (ax + b)f(x)dx = a∫ xf(x)dx + b∫ f(x)dx = aE(X) + b
Example: If E(X) = 5 then E(3X − 5) = 10.
RESULT 2: The expectation of a sum is the sum of the expectations:
E[Σ_{i=1}^k ui(X)] = Σ_{i=1}^k E[ui(X)]
Proof: E[Σ_i ui(X)] = ∫ (Σ_i ui(x))f(x)dx = Σ_i ∫ ui(x)f(x)dx = Σ_i E[ui(X)]
RESULT 3: For independent random variables, the expectation of a product is the product of
the expectations: If X1, . . . , Xn are n independent random variables such that each
expectation E(Xi) exists, then E(∏_{i=1}^n Xi) = ∏_{i=1}^n E(Xi).
Proof (for continuous random variables): Since the random variables are independent, their
joint density is the product of the marginals, i.e. f(x1, . . . , xn) = ∏_{i=1}^n fi(xi), and
E(∏_i Xi) = ∫···∫ (∏_i xi)f(x1, . . . , xn)dx1 . . . dxn = ∫···∫ ∏_i [xi·fi(xi)]dx1 . . . dxn
          = ∏_i ∫ xi·fi(xi)dxi = ∏_i E(Xi)
(Notice that this third property applies only to independent random variables, whereas the
second property holds for dependent variables as well.)


Expectation properties: examples

Expected number of successes: n balls are selected from a box containing a fraction p of red
balls. The random variable Xi takes the value 1 if the ith ball picked is red and zero
otherwise. We're interested in the expected value of the number of red balls picked.
This is simply X = X1 + X2 + ··· + Xn. The expectation of X (using our theorem) is equal to
E(X1) + E(X2) + ··· + E(Xn), where E(Xi) = p·1 + (1 − p)·0 = p. We therefore have E(X) = np.
Expected number of matches: If n letters are randomly placed in n envelopes, how many
matches would we expect? Let Xi = 1 if the ith letter is placed in the correct envelope, and
zero otherwise.
P(Xi = 1) = 1/n and P(Xi = 0) = 1 − 1/n
It is therefore the case that E(Xi) = 1/n for each i, and E(X) = 1/n + ··· + 1/n = 1.
Suppose the random variables X1, . . . , Xn form a random sample of size n from a given
continuous distribution on the real line for which the p.d.f. is f. Find the expectation of
the number of observations in the sample that fall in a specified interval [a, b]. This is
just like the first problem, except the probability of success is now ∫_a^b f(x)dx, so the
answer is
n ∫_a^b f(x)dx
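
A simulation sketch of the matching problem (numpy, illustrative; n = 10 is an arbitrary
choice, since the answer is 1 for any n):

    import numpy as np

    rng = np.random.default_rng(4)
    n, trials = 10, 200_000
    total = 0
    for _ in range(trials):
        perm = rng.permutation(n)                  # letter i lands in envelope perm[i]
        total += np.sum(perm == np.arange(n))      # letters in the correct envelope
    print(total / trials)                           # approximately 1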


More examples

The density function for X is given by f(x) = 2(1 − x)I_{(0,1)}(x):
E(X) = ∫_0^1 xf(x)dx = 2∫_0^1 (x − x²)dx = 2(1/2 − 1/3) = 1/3 and
E(X²) = 2∫_0^1 (x² − x³)dx = 2(1/3 − 1/4) = 1/6. We can use these to compute
E(6X + 3X²) = 6(1/3) + 3(1/6) = 5/2. We could have also computed this directly using the
formula for the expectation of a function r(X).
A horizontal line segment of length 5 is divided at a randomly selected point and X is the
length of the left-hand part. Let us find the expectation of the product of the lengths.
We are picking a point from a uniform distribution on [0, 5], so the density is
f(x) = (1/5)I_{(0,5)}(x). E(X) = 5/2 and E(5 − X) = 5/2 (why?). The expected value of the
product of the lengths is given by
E[X(5 − X)] = (1/5)∫_0^5 x(5 − x)dx = 25/6 ≠ (5/2)² = E(X)E(5 − X)
A bowl contains 5 chips, 3 marked $1 and 2 marked $4. A player draws 2 chips at random
and is paid the sum of the values of the chips. If it costs $4.75 to play, is his expected
gain positive?
Let the random variable X be the number of $1 chips. The probability function is
f(x) = C(3, x)C(2, 2 − x)/C(5, 2) for x = 0, 1, 2 (a hypergeometric distribution). Compute
f(0) = 1/10, f(1) = 6/10 and f(2) = 3/10. In this case the payoff is
u(x) = x + 4(2 − x) = 8 − 3x, so E[u(X)] = (1/10)·8 + (6/10)·5 + (3/10)·2 = 4.4.
Alternatively, compute E(X) = 0·f(0) + 1·f(1) + 2·f(2) = 12/10 and find the desired
expectation as 8 − 3E(X). Since 4.4 < 4.75, the expected gain is negative.


Variance of a random variable


Definition: If X is a random variable with E(X) = μ, the variance of X is defined as follows:
Var(X) = E[(X − μ)²]
Since (X − μ)² ≥ 0, as long as μ exists, the variance must be non-negative, if it exists.
The expectation E[(X − μ)²] will always exist if the values of X are bounded, but need not
exist in general.
A small value of the variance indicates a distribution that is concentrated around μ.
The variance is denoted by σ² and its non-negative square root is called the standard
deviation and is denoted by σ.


Variance properties
1. Var(X) = 0 if and only if there exists a constant c such that P(X = c) = 1.
2. For any constants a and b, Var(aX + b) = a²Var(X). It follows that Var(−X) = Var(X).
Proof: Var(aX + b) = E[(aX + b − aμ − b)²] = E[(a(X − μ))²] = a²E[(X − μ)²] = a²Var(X)
3. Var(X) = E(X²) − [E(X)]²
Proof: expand the expression E[(X − μ)²] and take expectations term by term.
4. If X1, . . . , Xn are independent random variables, then
Var(X1 + ··· + Xn) = Var(X1) + ··· + Var(Xn).
Proof: For n = 2, E(X1 + X2) = μ1 + μ2 and therefore
Var(X1 + X2) = E[(X1 + X2 − μ1 − μ2)²] = E[(X1 − μ1)² + (X2 − μ2)² + 2(X1 − μ1)(X2 − μ2)]
Taking expectations term by term, we get
Var(X1 + X2) = Var(X1) + Var(X2) + 2E[(X1 − μ1)(X2 − μ2)]
But since X1 and X2 are independent,
E[(X1 − μ1)(X2 − μ2)] = E(X1 − μ1)E(X2 − μ2) = (μ1 − μ1)(μ2 − μ2) = 0
It therefore follows that Var(X1 + X2) = Var(X1) + Var(X2).
Using an induction argument, this can be established for any n.


Moments of a random variable


Moments of a random variable are special types of expectations that capture characteristics of
the distribution that we may be interested in (its shape and position). Moments are defined
either around the origin or around the mean.
Definition: Let X be a random variable with density function f(x). Then the kth moment of X is
the expectation E(X^k). This moment is denoted by μ'_k and is said to exist if and only if
E(|X|^k) < ∞.
μ'_0 = 1
μ'_1 is called the mean of X and is denoted by μ
If a random variable is bounded, all moments exist, and if the kth moment exists, all
lower-order moments exist.
Definition: Let X be a random variable for which E(X) = μ. Then for any positive integer k,
the expectation E[(X − μ)^k] is called the kth central moment of X and denoted by μ_k.
μ_1 is clearly zero.
The variance is the second central moment of X.
If the distribution of X is symmetric with respect to its mean μ, and the central moment μ_k
exists for a given odd integer k, then it must be zero because the positive and negative
terms of the corresponding expectation will cancel one another.


Moment generating functions


Given a random variable X, consider for each real number t the following function, known as
the moment generating function (MGF) of X:
ψ(t) = E(e^{tX})
If X is bounded, the above expectation exists for all values of t; if not, it may exist only
for some values of t.
ψ(t) is always defined at t = 0, and ψ(0) = E(1) = 1.
If the MGF exists for all values of t in an interval around t = 0, then the derivative of ψ(t)
exists at t = 0 and
ψ'(0) = [d/dt E(e^{tX})]_{t=0} = E[(d/dt)e^{tX}]_{t=0} = E[X·e^{tX}]_{t=0} = E(X)
The derivative of the MGF at t = 0 is the mean of X.
More generally, the kth derivative evaluated at t = 0 gives us the kth moment of X.
Proof (sketch): The function e^x can be expressed as the sum of the series 1 + x + x²/2! + ...,
so e^{tx} can be expressed as 1 + tx + t²x²/2! + ... and the expectation (in the discrete
case) is
E(e^{tX}) = Σ_x (1 + tx + t²x²/2! + ...)f(x)
If we differentiate this w.r.t. t and then set t = 0, we're left with Σ_x xf(x), which is
defined as the expectation of X. Similarly, if we differentiate twice, we're left with
Σ_x x²f(x), which is the second moment. For continuous distributions, we replace the sum with
an integral.


Moment generating functions: an example

Suppose a random variable X has the density function f(x) = e^{−x}I_{(0,∞)}(x). We can use its
MGF to compute the mean and the variance of X as follows:
ψ(t) = ∫_0^∞ e^{tx}e^{−x}dx = ∫_0^∞ e^{x(t−1)}dx = [e^{x(t−1)}/(t − 1)]_0^∞
     = 0 − 1/(t − 1) = 1/(1 − t) for t < 1
Taking the derivative of this function with respect to t, we get ψ'(t) = 1/(1 − t)², and
differentiating again, we get ψ''(t) = 2/(1 − t)³.
Evaluating the first derivative at t = 0, we get μ = 1/(1 − 0)² = 1.
The variance σ² = μ'_2 − μ² = 2(1 − 0)⁻³ − 1 = 1.
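
The same computation done symbolically (sympy, illustrative):

    import sympy as sp

    t = sp.symbols('t')
    psi = 1/(1 - t)                          # MGF of f(x) = e^{-x} on (0, infinity)
    mean = sp.diff(psi, t).subs(t, 0)        # 1
    second = sp.diff(psi, t, 2).subs(t, 0)   # 2
    print(mean, second - mean**2)            # mean 1, variance 1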


Properties of moment generating functions


RESULT 1: Let X be a random variable for which the MGF is ψ1 and consider the random
variable Y = aX + b, where a and b are given constants. Let the MGF of Y be denoted by
ψ2. Then for any value of t such that ψ1(at) exists,
ψ2(t) = e^{bt}ψ1(at)
RESULT 2: Suppose that X1, . . . , Xn are n independent random variables and that ψi is the
MGF of Xi. Let Y = X1 + ··· + Xn and let the MGF of Y be given by ψ. Then for any value of t
such that ψi(t) exists for all i = 1, 2, . . . , n,
ψ(t) = ∏_{i=1}^n ψi(t)
RESULT 3: If the MGFs of two random variables X1 and X2 are identical for all values of t
in an interval around the point t = 0, then the probability distributions of X1 and X2 must
be identical.
Examples:
If f(x) = e^{−x}I_{(0,∞)}(x) as in the above example, the MGF of the random variable
Y = X − 1 is e^{−t}/(1 − t) for t < 1 (using the first result above, setting a = 1 and
b = −1), and if Y = 3 − 2X, the MGF of Y is given by e^{3t}/(1 + 2t) for t > −1/2.


An Illustration: the binomial distribution


Suppose that there is a probability p of a girl child being born, and this probability does
not vary by the birth order of the child.
A family has n children. The random variable Xi = 1 if the ith child is a girl and 0
otherwise. The total number of girls in the family, given by the random variable
X = X1 + ··· + Xn, follows a binomial distribution with parameters n and p.
We know from the properties of the variance that Var(X) = Σ_{i=1}^n Var(Xi).
E(Xi) = 1·p + 0·(1 − p) = p and E(Xi²) = 1²·p + 0²·(1 − p) = p, so Var(Xi) = p − p² and
Var(X) = np(1 − p).
We can get the same expression using the MGF for the binomial:
The MGF for each of the Xi variables is given by
e^t·P(Xi = 1) + e^0·P(Xi = 0) = pe^t + q, where q = 1 − p
Using the multiplicative property of MGFs for sums of independent random variables, we get
the MGF for X as
ψ(t) = (pe^t + q)ⁿ
For two independent binomial random variables with parameters (n1, p) and (n2, p), the MGF
of their sum is given by the product of the MGFs, (pe^t + q)^{n1+n2}.


The median of a distribution


The mean gives us the centre of gravity of a distribution and is one way of summarizing it. A
disadvantage in some contexts is that it is influenced by every observation.
An alternative measure of the centre of a distribution is the median:
Definition: For any random variable X, a median of the distribution of X is defined as a
point m such that P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2.
RESULT: Let m be a median of the distribution of X and let d be any other number. Then
E(|X − m|) ≤ E(|X − d|)
Every distribution has at least one median and may have multiple medians, as seen in the
following examples:
1. P(X = 1) = .1, P(X = 2) = .2, P(X = 3) = .3, P(X = 4) = .4
2. P(X = 1) = .1, P(X = 2) = .4, P(X = 3) = .3, P(X = 4) = .2
3. f(x) = 4x³ for 0 < x < 1, and 0 otherwise
4. f(x) = 1/2 for 0 ≤ x ≤ 1, f(x) = 1 for 2.5 ≤ x ≤ 3, and 0 otherwise


Covariance and correlation


Definition: Let X and Y be random variables with E(X) = μ_X and E(Y) = μ_Y and variances
Var(X) = σ²_X and Var(Y) = σ²_Y.
The covariance of X and Y is defined as E[(X − μ_X)(Y − μ_Y)] and is denoted by σ_XY or
Cov(X, Y).
The value of the covariance will be finite if each of the above variances is finite. It can be
positive, negative or zero. It can conveniently be computed as E(XY) − E(X)E(Y) (just expand
the expression above and take expectations).
Definition: If 0 < σ²_X < ∞ and 0 < σ²_Y < ∞, then the correlation of X and Y is given by
ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y)
Result: For any two random variables U and V, it is always the case that
(E[UV])² ≤ E(U²)E(V²). This is known as the Cauchy-Schwarz inequality.
This provides us with bounds on the value of the covariance, |σ_XY| ≤ σ_X σ_Y (let
U = X − E(X) and V = Y − E(Y) in the statement of the Cauchy-Schwarz inequality above), and in
turn implies a correlation bound:
−1 ≤ ρ_XY ≤ 1
Example: Let f(x, y) = (x + y)I_{(0,1)}(x)I_{(0,1)}(y). In this case
E(X) = ∫_0^1 ∫_0^1 x(x + y)dxdy = 7/12 = E(Y)
E(XY) = ∫_0^1 ∫_0^1 xy(x + y)dxdy = 1/3
E(X²) = 5/12, so Var(X) = E(X²) − E(X)² = 5/12 − 49/144 = 11/144 = Var(Y)
Cov(X, Y) = 1/3 − (7/12)(7/12) = −1/144 and ρ = (−1/144)/√((11/144)(11/144)) = −1/11
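
A symbolic check of this example (sympy, illustrative):

    import sympy as sp

    x, y = sp.symbols('x y')
    f = x + y                                          # joint density on the unit square
    EX  = sp.integrate(x*f, (x, 0, 1), (y, 0, 1))      # 7/12
    EXY = sp.integrate(x*y*f, (x, 0, 1), (y, 0, 1))    # 1/3
    EX2 = sp.integrate(x**2*f, (x, 0, 1), (y, 0, 1))   # 5/12
    var = EX2 - EX**2                                  # 11/144
    rho = (EXY - EX*EX)/var                            # -1/11, since sigma_X = sigma_Y here
    print(EX, EXY, var, rho)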


Properties of covariance and correlation


Result 1: Let X and Y be random variables with σ²_X < ∞ and σ²_Y < ∞; then
Cov(X, Y) = E(XY) − E(X)E(Y)
Proof: expand the expression for the covariance and take expectations.
Result 2: If X and Y are independent random variables, each with finite variance, then
Cov(X, Y) = ρ(X, Y) = 0
Proof: If X and Y are independent, E(XY) = E(X)E(Y). Now apply the expression for covariance
in the above result.
Note: zero correlation does not imply independence. Example: if X takes the values −1, 0, 1
with equal probability and Y = X², then E(XY) = E(X³) = 0 = E(X)E(Y), so ρ = 0 and the
variables are uncorrelated but clearly dependent. A zero correlation only tells us that the
variables are not linearly related.
Result 3: Suppose X is a random variable with finite variance and Y = aX + b for some
constants a and b. If a > 0, then ρ(X, Y) = 1. If a < 0, then ρ(X, Y) = −1.
Proof: Y − μ_Y = a(X − μ_X), so Cov(X, Y) = aE[(X − μ_X)²] = aσ²_X and σ_Y = |a|σ_X; plug
these values into the expression for ρ to get the result.
Result 4: If X and Y are random variables with finite variance, then
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
Proof: Var(X + Y) = E[(X + Y − μ_X − μ_Y)²]
= E[(X − μ_X)² + (Y − μ_Y)² + 2(X − μ_X)(Y − μ_Y)] = Var(X) + Var(Y) + 2Cov(X, Y)
Result 5: If X1, . . . , Xn are random variables each with finite variance, then
Var(Σ_{i=1}^n Xi) = Σ_{i=1}^n Var(Xi) + 2ΣΣ_{i<j} Cov(Xi, Xj)


Conditional Expectations
The conditional expectation of random variables is defined using conditional probability
density functions rather than their unconditional counterparts.
Suppose that X and Y are random variables with a joint density function f(x, y), with the
marginal p.d.f. of X denoted by f1(x).
For any value of x such that f1(x) > 0, let g(y|x) denote the conditional p.d.f. of Y given
that X = x.
The conditional expectation of Y given X = x is E(Y|x) = ∫ y g(y|x)dy for continuous X and Y,
and E(Y|x) = Σ_y y g(y|x) if X and Y have a discrete distribution.


Topic 4: Some Special Distributions



Parametric Families of Distributions


There are a few classes of functions that are frequently used as probability distributions,
because they are easy to work with (they have a small number of parameters) and attach
reasonable values to the types of uncertain events we are interested in analyzing.
The choice among these families depends on the question of interest:
For modeling the distribution of income or consumption expenditure, we want a density
which is skewed to the right (gamma, Weibull, lognormal, ...)
IQs, heights, weights, arm circumference are quite symmetric around a mode (normal
or truncated normal)
the number of successes in a given number of trials (binomial)
the time to failure for a machine or person (gamma, exponential)
We refer to these probability density functions by f(x; θ), where θ refers to a parameter
vector.
A given choice of θ therefore leads to a given probability density function.
Θ is used to denote the parameter space.


Discrete Distributions: Uniform


Parameter: N, a positive integer
Probability function: f(x; N) = (1/N)I_{{1,2,...,N}}(x)
Moments:
μ = Σ xf(x) = (1/N)·N(N + 1)/2 = (N + 1)/2
σ² = Σ x²f(x) − μ² = (1/N)·N(N + 1)(2N + 1)/6 − ((N + 1)/2)² = (N² − 1)/12
MGF: Σ_{j=1}^N e^{jt}/N
Applications: experiments or situations in which each outcome is equally likely (dice,
coins, ...). Can you think of applications in economics?


Discrete Distributions: Bernoulli


Parameter: p, 0 ≤ p ≤ 1
Probability function: f(x; p) = p^x (1 − p)^{1−x} I_{{0,1}}(x)
Moments:
μ = Σ xf(x) = 1·p¹(1 − p)⁰ + 0·p⁰(1 − p)¹ = p
σ² = Σ x²f(x) − μ² = p(1 − p)
MGF: e^t·p + e^0·(1 − p) = pe^t + (1 − p)
Applications: experiments or situations in which there are two possible outcomes: success
or failure, defective or not defective, male or female, etc.


Discrete Distributions: Binomial


Parameters: (n, p), 0 ≤ p ≤ 1 and n a positive integer
Probability function: An observed sequence of n Bernoulli trials can be represented by an
n-tuple of zeros and ones. The number of ways to achieve x ones is given by
C(n, x) = n!/(x!(n − x)!). The probability of x successes in n trials is therefore:
f(x; n, p) = C(n, x)p^x(1 − p)^{n−x} for x = 0, 1, 2, . . . , n, and 0 otherwise
Notice that since Σ_{x=0}^n C(n, x)a^x b^{n−x} = (a + b)ⁿ, we have
Σ_{x=0}^n f(x) = [p + (1 − p)]ⁿ = 1, so we have a valid probability function.
MGF: The MGF is given by:
Σ_{x=0}^n e^{tx}f(x) = Σ_{x=0}^n C(n, x)(pe^t)^x(1 − p)^{n−x} = [(1 − p) + pe^t]ⁿ
Moments: The MGF can be used to derive μ = np and σ² = np(1 − p).
Result: If X1, . . . , Xk are independent random variables and each Xi has a binomial
distribution with parameters ni and p, then the sum X1 + ··· + Xk has a binomial
distribution with parameters n = n1 + ··· + nk and p.


Multinomial Distributions
Suppose there are a small number of different outcomes (methods of public transport, water
purification, etc.). The multinomial distribution gives us the probability associated with a
particular vector of counts of these outcomes:
Parameters: (n, p1, . . . , pm), 0 ≤ pi ≤ 1, Σ_i pi = 1 and n a positive integer
Probability function:
f(x1, . . . , xm; n, p1, . . . , pm) = (n!/(x1! . . . xm!)) ∏_{i=1}^m pi^{xi}
for xi = 0, 1, 2, . . . , n with Σ_{i=1}^m xi = n, and 0 otherwise


Geometric and Negative Binomial distributions


The Negative Binomial (or Pascal) distribution gives us the probability that x failures will
occur before r successes are achieved. This means that the rth success occurs on the
(x + r)th trial.
Parameters: (r, p), 0 ≤ p ≤ 1 and r a positive integer; write q = 1 − p
Density: For the rth success to occur on the (x + r)th trial, we require (r − 1) successes in
the first (x + r − 1) trials. We therefore obtain the density:
f(x; r, p) = C(r + x − 1, x) p^r q^x, x = 0, 1, 2, 3, . . .
The geometric distribution is a special case of the negative binomial with r = 1.
The density in this case takes the form f(x; 1, p) = pq^x over all non-negative integers x,
and the MGF is given by E(e^{tX}) = p Σ_{x=0}^∞ (qe^t)^x = p/(1 − qe^t) for t < log(1/q)
We can use this function to get the mean and variance, μ = q/p and σ² = q/p²
The negative binomial is just a sum of r independent geometric variables, and the MGF is
therefore (p/(1 − qe^t))^r; the corresponding mean and variance are μ = rq/p and σ² = rq/p²
The geometric distribution is memoryless, so the conditional probability of k + t
failures given at least k failures is the unconditional probability of t failures:
P(X = k + t | X ≥ k) = P(X = t)


Discrete Distributions: Poisson


Parameter: λ > 0
Probability function:
f(x; λ) = e^{−λ}λ^x/x! for x = 0, 1, 2, . . . , and 0 otherwise
Using the result that the series 1 + λ + λ²/2! + λ³/3! + . . . converges to e^λ,
Σ_{x=0}^∞ f(x) = e^{−λ} Σ_{x=0}^∞ λ^x/x! = e^{−λ}e^λ = 1, so we have a valid density.
MGF: E(e^{tX}) = Σ_{x=0}^∞ e^{tx}e^{−λ}λ^x/x! = e^{−λ} Σ_{x=0}^∞ (λe^t)^x/x! = e^{λ(e^t − 1)}
Moments: The MGF can be used to get the first and second moments about the origin, λ and
λ² + λ, so the mean and the variance are both λ.
We can also use the product of k such MGFs to show that the sum of k independently
distributed Poisson variables has a Poisson distribution with mean λ1 + ··· + λk.


A Poisson process
Suppose that the number of type A outcomes that occur over a fixed interval of time [0, t]
follows a process in which:
1. The probability that precisely one type A outcome will occur in a small interval of time Δt
is approximately proportional to the length of the interval:
g(1, Δt) = λΔt + o(Δt)
where o(Δt) denotes a function of Δt having the property that lim_{Δt→0} o(Δt)/Δt = 0.
2. The probability that two or more type A outcomes will occur in a small interval of time Δt
is negligible:
Σ_{x=2}^∞ g(x, Δt) = o(Δt)
3. The numbers of type A outcomes that occur in nonoverlapping time intervals are
independent events.
These conditions imply a process which is stationary over the period of observation, i.e. the
probability of an occurrence must be the same over the entire period, with neither busy nor
quiet intervals.


Poisson densities representing Poisson processes

RESULT: Consider a Poisson process with rate λ per unit of time. The number of events in a
time interval of length t has a Poisson density with mean λt.
Applications:
the number of weaving defects in a yard of handloom cloth or stitching defects in a shirt
the number of traffic accidents on a motorway in an hour
the number of particles of a noxious substance that come out of a chimney in a given period
of time
the number of times a machine breaks down each week
Example:
let the probability of exactly one blemish in a foot of wire be 1/1000 and that of two or
more blemishes be (approximately) zero.
we're interested in the number of blemishes in 3,000 feet of wire.
if the numbers of blemishes in non-overlapping intervals are assumed to be independently
distributed, then our random variable X follows a Poisson distribution with
λt = (1/1000)·3000 = 3 and
P(X = 5) = 3⁵e^{−3}/5!
you can plug this into a computer, or alternatively use tables, to compute f(5; 3) = .101
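
For instance (scipy, illustrative):

    from math import exp, factorial
    print(3**5 * exp(-3) / factorial(5))   # 0.1008...

    from scipy.stats import poisson
    print(poisson.pmf(5, mu=3))            # the same value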


The Poisson as a limiting distribution


We can show that a binomial distribution with large n and small p can be approximated by a
Poisson (which is computationally easier).
Useful result: e^v = lim_{n→∞} (1 + v/n)ⁿ
We can rewrite the binomial density for non-zero values as
f(x; n, p) = [∏_{i=1}^x (n − i + 1)/x!] p^x(1 − p)^{n−x}
If np = λ, we can substitute λ/n for p to get
lim_{n→∞} f(x; n, p) = lim_{n→∞} [∏_{i=1}^x (n − i + 1)/x!] (λ/n)^x (1 − λ/n)^{n−x}
= lim_{n→∞} [n(n − 1) . . . (n − x + 1)/n^x] (λ^x/x!) (1 − λ/n)ⁿ (1 − λ/n)^{−x}
= e^{−λ}λ^x/x!
(using the above result and the property that the limit of a product is the product of the
limits; the first and last factors each tend to 1)


Poisson as a limiting distribution...example


We have a 300-page novel with 1,500 letters on each page.
Typing errors are as likely to occur for one letter as for another, and the probability of
such an error is given by p = 10⁻⁵.
The total number of letters is n = 300 × 1,500 = 450,000.
Using λ = np = 4.5, the Poisson distribution gives us the probability of the number of
errors being less than or equal to 10 as:
P(X ≤ 10) ≈ Σ_{x=0}^{10} e^{−4.5}(4.5)^x/x! = .9933
Rules of thumb: the Poisson approximation is close to the binomial probabilities when n ≥ 20
and p ≤ .05, and excellent when n ≥ 100 and np ≤ 10.
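
A check of the approximation against the exact binomial (scipy, illustrative):

    from scipy.stats import binom, poisson

    n, p = 450_000, 1e-5
    print(binom.cdf(10, n, p))         # exact binomial: 0.9933...
    print(poisson.cdf(10, mu=n*p))     # Poisson approximation with lambda = 4.5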


Discrete distributions: Hypergeometric


Suppose, as in the case of the binomial, there are two possible outcomes and we're
interested in the probability of x values of a particular outcome, but we are drawing
randomly without replacement, so our trials are not independent.
In particular, suppose there are A + B objects from which we pick n; A of the total number
available are of one type (red balls) and the rest are of the other (blue balls).
If the random variable is the total number of red balls selected, then, for appropriate
values of x, we have
f(x; A, B, n) = C(A, x)C(B, n − x)/C(A + B, n)
Over what values of x is this defined? max{0, n − B} ≤ x ≤ min{n, A}
The multivariate extension is (for xi ∈ {0, 1, 2, . . . , n}, Σ_{i=1}^m xi = n and
Σ_{i=1}^m Ki = M):
f(x1, . . . , xm; K1, . . . , Km, n) = [∏_{j=1}^m C(Kj, xj)]/C(M, n)


Continuous distributions: uniform or rectangular


Parameters: (a, b), −∞ < a < b < ∞
Density: f(x; a, b) = (1/(b − a))I_{(a,b)}(x) (hence the name rectangular)
Moments: μ = (a + b)/2, σ² = (b − a)²/12
MGF: M_X(t) = (e^{bt} − e^{at})/((b − a)t) for t ≠ 0 and M_X(t) = 1 for t = 0
(use L'Hopital's rule to check continuity at t = 0)
Applications:
to construct the probability space of an experiment in which any outcome in the
interval [a, b] is equally likely.
to generate random samples from other distributions (based on the probability integral
transformation). This is part of your first lab assignment.


The gamma function


The gamma function is a special mathematical function that is widely used in statistics. The
gamma function of α is defined as
Γ(α) = ∫_0^∞ y^{α−1}e^{−y}dy    (1)
If α = 1, Γ(1) = ∫_0^∞ e^{−y}dy = [−e^{−y}]_0^∞ = 1
If α > 1, we can integrate (1) by parts, setting u = y^{α−1} and dv = e^{−y}dy and using the
formula ∫u dv = uv − ∫v du to get:
Γ(α) = [−y^{α−1}e^{−y}]_0^∞ + (α − 1)∫_0^∞ y^{α−2}e^{−y}dy
The first term in the above expression is zero because the exponential function goes to zero
faster than any polynomial grows, and we obtain
Γ(α) = (α − 1)Γ(α − 1)
and for any integer α > 1, we have
Γ(α) = (α − 1)(α − 2)(α − 3) . . . (3)(2)(1)Γ(1) = (α − 1)!


The gamma distribution


Define the variable x by y = x/β, where β > 0. Then dy = (1/β)dx and we can rewrite Γ(α) as
Γ(α) = ∫_0^∞ (x/β)^{α−1}e^{−x/β}(1/β)dx
or as
1 = ∫_0^∞ (1/(Γ(α)β^α))x^{α−1}e^{−x/β}dx
This shows that for α, β > 0,
f(x; α, β) = (1/(Γ(α)β^α))x^{α−1}e^{−x/β}I_{(0,∞)}(x)
is a valid density and is known as a gamma-type probability density function.


Features of the gamma density


This is a valuable distribution because it can take a variety of shapes depending on the
values of the parameters α and β:
It is skewed to the right.
It is strictly decreasing when α ≤ 1.
If α = 1, we have the exponential density, which is memoryless.
For α > 1 the density attains its maximum at x = β(α − 1).
[Figure: graphs of the p.d.f.s of several different gamma distributions, all with a mean
of 1 (parameter pairs 0.1, 1, 2 and 3).]


Moments of the gamma distribution

Parameters: (α, β), α > 0, β > 0
Moments: For k = 1, 2, . . . ,
E(X^k) = β^k Γ(α + k)/Γ(α) = β^k α(α + 1) . . . (α + k − 1)
In particular, E(X) = αβ and Var(X) = E(X²) − [E(X)]² = αβ².
MGF: M_X(t) = (1 − βt)^{−α} for t < 1/β, which can be derived as follows:
M_X(t) = ∫_0^∞ e^{tx}(1/(Γ(α)β^α))x^{α−1}e^{−x/β}dx
       = (1/(Γ(α)β^α)) ∫_0^∞ x^{α−1}e^{−(1/β − t)x}dx
       = (1/(Γ(α)β^α)) · Γ(α)/(1/β − t)^α = (1 − βt)^{−α}
(by setting y = (1/β − t)x in the expression for Γ(α))


Gamma applications
Survival analysis:
We can use it to model the waiting time till the rth event/success. If X is the time that
passes until the first success, then X has a gamma distribution with α = 1 and β = 1/λ.
This is known as an exponential distribution. If, instead, we are interested in the time
taken for the rth success, this has a gamma density with α = r and β = 1/λ.
Relation to the Poisson distribution: if the variable Y is the number of successes
(deaths, for example) in a given time period t and has a Poisson density with parameter
μ, the rate of success is given by λ = μ/t.
Example: A bottling plant breaks down, on average, twice every four weeks. We want the
probability that the number of breakdowns X ≤ 3 in the next four weeks. We have μ = 2
and the breakdown rate λ = 1/2 per week.
P(X ≤ 3) = Σ_{i=0}^3 e^{−2}2^i/i! = .135 + .271 + .271 + .180 = .857
Suppose we wanted the probability that the machine does not break down in the next four
weeks. The time taken until the first breakdown, X, must therefore be more than four
weeks. This follows a gamma distribution with α = 1 and β = 1/λ = 2:
P(X ≥ 4) = ∫_4^∞ (1/2)e^{−x/2}dx = [−e^{−x/2}]_4^∞ = e^{−2} = .135
Income distributions that are uni-modal
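
Both numbers in the example can be checked directly (scipy, illustrative; note that scipy's
'scale' parameter is our β):

    from scipy.stats import expon, poisson

    print(poisson.cdf(3, mu=2))     # P(X <= 3) = 0.857...
    print(expon.sf(4, scale=2))     # P(first breakdown after 4 weeks) = e^{-2} = 0.135...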


Gamma distributions: some useful properties


Gamma Additivity: Let X1, . . . , Xn be independently distributed random variables with
respective gamma densities Gamma(αi, β). Then
Y = Σ_{i=1}^n Xi ~ Gamma(Σ_{i=1}^n αi, β)
Scaling Gamma Random Variables: Let X be distributed with gamma density Gamma(α, β)
and let c > 0. Then
Y = cX ~ Gamma(α, cβ)
Both these can be easily proved using the gamma MGF and applying the MGF uniqueness
theorem. In the first case the MGF of Y is the product of the individual MGFs, i.e.
M_Y(t) = ∏_{i=1}^n M_{Xi}(t) = ∏_{i=1}^n (1 − βt)^{−αi} = (1 − βt)^{−Σ_i αi} for t < 1/β
For the second result, M_Y(t) = M_{cX}(t) = M_X(ct) = (1 − cβt)^{−α} for t < 1/(cβ)


The gamma family: exponential distributions


An exponential distribution is simply a gamma distribution with α = 1:
Parameter: β > 0
Density: f(x; β) = (1/β)e^{−x/β}I_{(0,∞)}(x)
Moments: μ = β, σ² = β²
MGF: M_X(t) = (1 − βt)^{−1} for t < 1/β
Applications: As discussed above, the most important application is the representation of
operating lives. The exponential is memoryless and so, if failure hasn't occurred, the object
(or person, or animal) is as good as new. The risk of failure at any point t is given by the
hazard rate,
h(t) = f(t)/S(t)
where S(t) is the survival function, 1 − F(t). Verify that the hazard rate in this case is a
constant, 1/β.
If we would like wear-out effects, we should use a gamma with α > 1, and for work-hardening
effects, use a gamma with α < 1.


The gamma family: chi-square distributions


A chi-square distribution is simply a gamma distribution with α = v/2 and β = 2:
Parameters: v, a positive integer (referred to as the degrees of freedom)
Density: f(x; v) = (1/(2^{v/2}Γ(v/2)))x^{v/2−1}e^{−x/2}I_{(0,∞)}(x)
Moments: μ = v, σ² = 2v
MGF: M_X(t) = (1 − 2t)^{−v/2} for t < 1/2
Applications:
Notice that for v = 2, the chi-square density is equivalent to the exponential density with
β = 2. It is therefore decreasing for this value of v and hump-shaped for higher values.
The χ² is especially useful in problems of statistical inference because if we have v
independent random variables Xi ~ N(0, 1), their sum of squares Σ_{i=1}^v Xi² ~ χ²_v. Many of
the estimators we use in our models fit this case (i.e. they can be expressed in terms of
sums of squares of independent normal variables).


The Normal (or Gaussian) distribution


This symmetric bell-shaped density is widely used because:
1. Outcomes of certain types of continuous random variables can be shown to follow this type
of distribution; this is the motivation we've used for most parametric distributions we've
considered so far (heights of humans, animals and plants, weights, strength of physical
materials, the distance from the centre of a target if errors in both directions are
independent).
2. It has nice mathematical properties: many functions of a set of normally distributed
random variables have distributions that take simple forms.
3. Central Limit Theorems: The sample mean of a random sample from any distribution with
finite variance is approximately normal.


The Normal density


Parameters: (μ, σ²), μ ∈ (−∞, ∞), σ > 0
Density: f(x; μ, σ²) = (1/(σ√(2π)))e^{−(1/2)((x−μ)/σ)²}I_{(−∞,+∞)}(x)
MGF: M_X(t) = e^{μt + σ²t²/2}
The MGF can be used to derive the moments: E(X) = μ and the variance is σ².
As can be seen from the p.d.f., the distribution is symmetric around μ, where it achieves its
maximum value; μ is therefore also the median and the mode of the distribution.
The normal distribution with zero mean and unit variance is known as the standard normal
distribution and is of the form f(x; 0, 1) = (1/√(2π))e^{−x²/2}I_{(−∞,+∞)}(x)
The tails of the distribution are thin: 68% of the total probability lies within one σ of the
mean, 95.4% within 2σ and 99.7% within 3σ.


The Normal distribution: deriving the MGF


By the definition of the MGF:
M(t) = ∫ e^{tx}(1/(σ√(2π)))e^{−(x−μ)²/(2σ²)}dx = ∫ (1/(σ√(2π)))e^{[tx − (x−μ)²/(2σ²)]}dx
We can rewrite the term inside the square brackets to obtain:
tx − (x − μ)²/(2σ²) = μt + σ²t²/2 − [x − (μ + σ²t)]²/(2σ²)
The MGF can now be written as:
M_X(t) = Ce^{μt + σ²t²/2}
where
C = ∫ (1/(σ√(2π)))e^{−[x−(μ+σ²t)]²/(2σ²)}dx = 1
because the integrand is a normal p.d.f. with the parameter μ replaced by (μ + σ²t).


The Normal distribution: computing moments


Taking derivatives of the MGF:
M(t) = e^{μt + σ²t²/2}
M'(t) = M(t)(μ + σ²t)
M''(t) = M(t)σ² + M(t)(μ + σ²t)²
(obtained by differentiating M'(t) with respect to t and substituting for M'(t))
Evaluating these at t = 0, we get M'(0) = μ and M''(0) = σ² + μ², so the variance is
M''(0) − [M'(0)]² = σ².


Transformations of Normally Distributed Variables...1


RESULT 1: Let X ~ N(μ, σ²). Then Z = (X − μ)/σ ~ N(0, 1).
Proof: Z is of the form aX + b with a = 1/σ and b = −μ/σ. Therefore
M_Z(t) = e^{bt}M_X(at) = e^{−μt/σ}e^{μt/σ + σ²t²/(2σ²)} = e^{t²/2}
which is the MGF of a standard normal distribution.
An important implication of the above result is that if we are interested in any distribution
in this class of normal distributions, we only need to be able to compute integrals for the
standard normal; these are the tables you'll see at the back of most textbooks.
Example: The kilometres per litre of fuel achieved by a new Maruti model is X ~ N(17, .25).
What is the probability that a new car will achieve between 16 and 18 kilometres per litre?
Answer: P(16 ≤ X ≤ 18) = P((16 − 17)/.5 ≤ Z ≤ (18 − 17)/.5) = P(−2 ≤ Z ≤ 2)
= 1 − 2(.0228) = .9544
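
A direct check (scipy, illustrative):

    from scipy.stats import norm

    # X ~ N(17, .25), so the standard deviation is .5
    print(norm.cdf(18, loc=17, scale=0.5) - norm.cdf(16, loc=17, scale=0.5))   # 0.9545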


Transformations of Normals...2

RESULT 2: Let X ~ N(μ, σ²) and Y = aX + b, where a and b are given constants and a ≠ 0;
then Y has a normal distribution with mean aμ + b and variance a²σ².
Proof: The MGF of Y can be expressed as
M_Y(t) = e^{bt}e^{μat + σ²a²t²/2} = e^{(aμ+b)t + (aσ)²t²/2}
This is simply the MGF for a normal distribution with mean aμ + b and variance a²σ².
RESULT 3: If X1, . . . , Xk are independent and Xi has a normal distribution with mean μi
and variance σi², then Y = X1 + ··· + Xk has a normal distribution with mean μ1 + ··· + μk
and variance σ1² + ··· + σk².
Proof: Write the MGF of Y as the product of the MGFs of the Xi's and gather linear and
squared terms separately to get the desired result.
We can combine these two results to derive the distribution of the sample mean:
RESULT 4: Suppose that the random variables X1, . . . , Xn form a random sample from a
normal distribution with mean μ and variance σ², and let X̄n denote the sample mean.
Then X̄n has a normal distribution with mean μ and variance σ²/n.


Transformations of Normals to χ² distributions

RESULT 5: If X ~ N(0, 1), then Y = X² has a χ² distribution with one degree of freedom.
Proof:
M_Y(t) = E(e^{tX²}) = ∫ e^{x²t}(1/√(2π))e^{−x²/2}dx = ∫ (1/√(2π))e^{−(1/2)x²(1−2t)}dx
       = (1 − 2t)^{−1/2} for t < 1/2
(the integrand, apart from the factor (1 − 2t)^{−1/2}, is a normal density with μ = 0 and
σ² = 1/(1 − 2t))
The MGF obtained is that of a χ² random variable with v = 1, since the χ² MGF is given by
M_X(t) = (1 − 2t)^{−v/2} for t < 1/2


Normals and χ² distributions...

RESULT 6: Let X1, . . . , Xn be independent random variables with each Xi ~ N(0, 1); then
Y = Σ_{i=1}^n Xi² has a χ² distribution with n degrees of freedom.
Proof:
M_Y(t) = ∏_{i=1}^n M_{Xi²}(t) = ∏_{i=1}^n (1 − 2t)^{−1/2} = (1 − 2t)^{−n/2} for t < 1/2
which is the MGF of a χ² random variable with v = n. This is the reason that the parameter v
is called the degrees of freedom: there are n freely varying random variables whose sum of
squares represents a χ²_v-distributed random variable.


The Bivariate Normal distribution

The bivariate normal has the density:
f(x1, x2) = (1/(2πσ1σ2√(1 − ρ²)))e^{−q/2}
where
q = (1/(1 − ρ²))[((x1 − μ1)/σ1)² − 2ρ((x1 − μ1)/σ1)((x2 − μ2)/σ2) + ((x2 − μ2)/σ2)²]
E(Xi) = μi, Var(Xi) = σi² and the correlation coefficient is ρ(X1, X2) = ρ.
Verify that in this case, X1 and X2 are independent iff they are uncorrelated.
Applications: heights of couples, scores on tests, ...

Topic 5: Sample statistics and their properties


The Inference Problem


So far, our starting point has been a given probability space. The likelihood of different outcomes in this space is determined by our probability measure; we've discussed different types of sample spaces and measures that can be used to assign probabilities to events.
We'll now look at how we can generate information about the probability space by analyzing a sample of outcomes. This process is referred to as statistical inference.
Inference procedures are parametric when we make assumptions about the probability space from which our sample is drawn (for example, each sample observation represents an outcome of a normally distributed random variable with unknown mean and unit variance). If we make no such assumptions, our procedures are nonparametric.
We'll discuss parametric inference. This involves both the estimation of population parameters and testing hypotheses about them.

Defining a Statistic
Definition: Any real-valued function $T = r(X_1, \ldots, X_n)$ is called a statistic.
Notice that:
- a statistic is itself a random variable
- we've considered several functions of random variables whose distributions are well defined, such as:
  - $Y = \frac{X - \mu}{\sigma}$, where $X \sim N(\mu, \sigma^2)$; we showed that $Y \sim N(0, 1)$
  - $Y = \sum_{i=1}^n X_i$, where each $X_i$ has a Bernoulli distribution with parameter $p$, was shown to have a binomial distribution with parameters $n$ and $p$
  - $Y = \sum_{i=1}^n X_i^2$, where each $X_i$ has a standard normal distribution, was shown to have a $\chi^2_n$ distribution
  - etc...
- Only some of these functions of random variables are statistics (why?). This distinction is important because statistics have sample counterparts.
- In a problem of estimating an unknown parameter $\theta$, our estimator will be a statistic whose value can be regarded as an estimate of $\theta$.
- It turns out that for large samples, the distributions of some statistics, such as the sample mean, are well known.

Markov's Inequality
We begin with some useful inequalities which provide us with distribution-free bounds on the probability of certain events and are useful in proving the law of large numbers, one of our two main large sample theorems.
Markov's Inequality: Let $X$ be a random variable with density function $f(x)$ such that $P(X \ge 0) = 1$. Then for any given number $t > 0$,
$$P(X \ge t) \le \frac{E(X)}{t}$$
Proof (for discrete distributions):
$$E(X) = \sum_x x f(x) = \sum_{x < t} x f(x) + \sum_{x \ge t} x f(x)$$
All terms in these summations are non-negative by assumption, so we have
$$E(X) \ge \sum_{x \ge t} x f(x) \ge \sum_{x \ge t} t f(x) = t\, P(X \ge t)$$
This inequality obviously holds for $t \le E(X)$ (why?). Its main interest is in bounding the probability in the tails. For example, if the mean of $X$ is 1, the probability of $X$ taking values bigger than 100 is less than .01. This is true irrespective of the distribution of $X$; this is what makes the result powerful.

Chebyshev's Inequality
This is a special case of Markov's inequality and relates the variance of a distribution to the probability associated with deviations from the mean.
Chebyshev's Inequality: Let $X$ be a random variable such that the distribution of $X$ has a finite variance $\sigma^2$ and mean $\mu$. Then, for every $t > 0$,
$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}$$
or equivalently,
$$P(|X - \mu| < t) \ge 1 - \frac{\sigma^2}{t^2}$$
Proof: Use Markov's inequality with $Y = (X - \mu)^2$ and use $t^2$ in place of the constant $t$. Then $Y$ takes only non-negative values and $E(Y) = Var(X) = \sigma^2$.
In particular, this tells us that for any random variable, the probability that values taken by the variable will be more than 3 standard deviations away from the mean cannot exceed $\frac{1}{9}$:
$$P(|X - \mu| \ge 3\sigma) \le \frac{1}{9}$$
For most distributions, this upper bound is considerably higher than the actual probability of this event.

Probability bounds: an example

Chebyshev's Inequality can, in principle, be used for computing bounds for the probabilities of certain events. In practice this is not often done because the bounds it provides are quite different from the actual probabilities, as seen in the following example:

Let the density function of $X$ be given by $f(x) = \frac{1}{2\sqrt{3}}\, I_{(-\sqrt{3}, \sqrt{3})}(x)$. In this case $\mu = 0$ and $\sigma^2 = \frac{(b - a)^2}{12} = 1$. If $t = \frac{3}{2}$, then
$$\Pr\left(|X - \mu| \ge \tfrac{3}{2}\right) = \Pr\left(|X| \ge \tfrac{3}{2}\right) = 1 - \int_{-3/2}^{3/2} \frac{1}{2\sqrt{3}}\, dx = 1 - \frac{\sqrt{3}}{2} \approx .13$$
Chebyshev's inequality gives us the bound $\frac{\sigma^2}{t^2} = \frac{4}{9}$, which is much higher. If $t = 2$, the exact probability is 0, while our bound is $\frac{1}{4}$.
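The gap between the bound and the exact probability can be computed directly. A minimal Python sketch (not from the slides; assumes numpy/scipy):

```python
# Exact tail probability vs the Chebyshev bound for X ~ Uniform(-sqrt(3), sqrt(3)),
# which has mean 0 and variance 1.
import numpy as np
from scipy.stats import uniform

a, b = -np.sqrt(3), np.sqrt(3)
X = uniform(loc=a, scale=b - a)
for t in (1.5, 2.0):
    exact = X.cdf(-t) + 1 - X.cdf(t)   # Pr(|X| >= t)
    bound = 1 / t ** 2                 # sigma^2 / t^2 with sigma^2 = 1
    print(t, round(exact, 3), bound)   # t=1.5: 0.134 vs 0.444; t=2: 0.0 vs 0.25
```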

The sample mean and its properties


Our estimate for the mean of a population is typically the sample mean. We now define this formally and derive its distribution. We will further justify the use of this estimator when we move on to discuss estimation.

Definition: Suppose the random variables $X_1, \ldots, X_n$ form a random sample of size $n$ from a distribution with mean $\mu$ and variance $\sigma^2$. The arithmetic average of these sample observations, $\bar{X}_n$, is known as the sample mean:
$$\bar{X}_n = \frac{1}{n}(X_1 + \cdots + X_n)$$
Since $\bar{X}_n$ is the sum of i.i.d. random variables, it is also a random variable, with
$$E(\bar{X}_n) = \frac{1}{n}\sum_{i=1}^n E(X_i) = \frac{1}{n}\, n\mu = \mu$$
$$Var(\bar{X}_n) = \frac{1}{n^2}\, Var\left(\sum_{i=1}^n X_i\right) = \frac{1}{n^2}\sum_{i=1}^n Var(X_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$$
We've therefore learned something about the distribution of the sample mean, irrespective of the distribution from which the sample is drawn:
- Its expectation is equal to that of the population.
- It is more concentrated around its mean value than the original distribution.
- The larger the sample, the lower the variance of $\bar{X}_n$.

Sample size and precision of the sample mean


We can use Chebyshev's Inequality to ask how big a sample we should take if we want to ensure a certain level of precision in our estimate of the sample mean.

Suppose the random sample is picked from a distribution with unknown mean and variance equal to 4, and we want an estimate which is within 1 unit of the real mean with probability .99. So we want $\Pr(|\bar{X} - \mu| \ge 1) \le .01$.
Applying Chebyshev's Inequality we get $\Pr(|\bar{X} - \mu| \ge 1) \le \frac{4}{n}$. Since we want $\frac{4}{n} = .01$, we take $n = 400$.

This calculation does not use any information on the distribution of $X$ and therefore often gives us a much larger number than we would get if this information was available.

Example: If each $X_i$ followed a Bernoulli distribution with $p = \frac{1}{2}$, then the total number of successes $T = \sum_{i=1}^n X_i$ follows a binomial, with $\bar{X}_n = \frac{T}{n}$, $E(T) = \frac{n}{2}$ and $Var(T) = \frac{n}{4}$. Suppose we'd like our sample mean to lie within .1 of the population mean, i.e. in the interval $[.4, .6]$, with probability equal to .7. Using Chebyshev's Inequality, we have
$$P(|\bar{X}_n - \mu| \le .1) \ge 1 - \frac{\sigma^2_{\bar{X}_n}}{t^2} = 1 - \frac{1/(4n)}{1/100} = 1 - \frac{25}{n}$$
We therefore need $1 - \frac{25}{n} = 0.7$. This gives us $n = 84$.

If we compute these probabilities directly from the binomial distribution, we get $F(9) - F(5) \approx .7$ when $n = 15$ (the event $6 \le T \le 9$), so if we knew that $X_i$ followed a Bernoulli distribution we would take this much smaller sample size for the desired level of precision in our estimate of $\bar{X}_n$.

This illustrates the trade-off between more efficient parametric procedures and more robust non-parametric ones.
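The exact binomial calculation behind the $n = 15$ figure can be checked as follows (a sketch, not from the slides; assumes scipy):

```python
# Pr(6 <= T <= 9) for T ~ Binomial(15, 0.5): the event that the sample mean of
# n = 15 Bernoulli draws lies within .1 of p = .5.
from scipy.stats import binom

n = 15
print(binom.cdf(9, n, 0.5) - binom.cdf(5, n, 0.5))  # ~0.70, vs n = 84 from Chebyshev
```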

Convergence of Real Sequences


We would like our estimators to be well-behaved. What does this mean?
One desirable property is that our estimates get closer to the parameter that we are trying to estimate as our sample gets larger. We're going to make precise this notion of "getting closer".
Recall that a sequence is just a function from the set of natural numbers $\mathbb{N}$ to any set $A$ (examples: $y_n = 2^n$, $y_n = \frac{1}{n}$).
A real number sequence $\{y_n\}$ converges to $y$ if for every $\epsilon > 0$ there exists $N(\epsilon)$ for which $n \ge N(\epsilon) \implies |y_n - y| < \epsilon$. In such a case, we say that $\{y_n\} \to y$. Which of the above examples converge?
If we have a sequence of functions $\{f_n\}$, the sequence is said to converge to a function $f$ if $f_n(x) \to f(x)$ for all $x$ in the domain of $f$.
In the case of matrices, a sequence of matrices converges if each of the sequences formed by the $(i, j)$th elements converges, i.e. $Y_n[i, j] \to Y[i, j]$.

Sequences of Random Variables


A sequence of random variables is a sequence for which the set $A$ is a collection of random variables and the function defining the sequence puts these random variables in a specific order.
Examples: $Y_n = \frac{1}{n}\sum_{i=1}^n X_i$, where $X_i \sim N(\mu, \sigma^2)$, or $Y_n = \sum_{i=1}^n X_i$, where $X_i \sim \text{Bernoulli}(p)$.
We now need to modify our notion of convergence, since the sequence $\{Y_n\}$ no longer defines a given sequence of real numbers but rather many different real number sequences, depending on the realizations of $X_1, \ldots, X_n$.
Convergence questions can no longer be verified unequivocally, since we are not referring to a given real sequence, but they can be assigned a probability of occurrence based on the probability space for the random variables involved.
There are several types of random variable convergence discussed in the literature. We'll focus on two of these:
- Convergence in Distribution
- Convergence in Probability

Convergence in Distribution
Definition: Let $\{Y_n\}$ be a sequence of random variables, and let $\{F_n\}$ be the associated sequence of cumulative distribution functions. If there exists a cumulative distribution function $F$ such that $F_n(y) \to F(y)$ for all $y$ at which $F$ is continuous, then $F$ is called the limiting CDF of $\{Y_n\}$. Letting $Y$ have the distribution function $F$, we say that $Y_n$ converges in distribution to the random variable $Y$ and denote this by $Y_n \xrightarrow{d} Y$.

The notation $Y_n \xrightarrow{d} F$ is also used to denote $Y_n \xrightarrow{d} Y \sim F$.

Convergence in distribution holds if there is convergence in the sequence of densities ($f_n(y) \to f(y)$) or in the sequence of MGFs ($M_{Y_n}(t) \to M_Y(t)$). In some cases, it may be easier to use these to show convergence in distribution.

Result: Let $X_n \xrightarrow{d} X$, and let the random variable $g(X)$ be defined by a continuous function $g(\cdot)$. Then $g(X_n) \xrightarrow{d} g(X)$.

Example: Suppose $Z_n \xrightarrow{d} Z \sim N(0, 1)$; then $2Z_n + 5 \xrightarrow{d} 2Z + 5 \sim N(5, 4)$ (why?).

Convergence in Probability
This concept formalizes the idea that we can bring the outcomes of the random variable $Y_n$ arbitrarily close to the outcomes of the random variable $Y$ for large enough $n$.
Definition: The sequence of random variables $\{Y_n\}$ converges in probability to the random variable $Y$ iff
$$\lim_{n \to \infty} P(|Y_n - Y| < \epsilon) = 1 \quad \forall\, \epsilon > 0$$
We denote this by $Y_n \xrightarrow{p} Y$ or $\text{plim}\, Y_n = Y$. This justifies using outcomes of $Y$ as an approximation for outcomes of $Y_n$, since the two are very close for large $n$.
Notice that convergence in distribution is a statement about the distribution functions of $Y_n$ and $Y$, whereas convergence in probability is a statement about the joint density of the outcomes $y_n$ and $y$.
Distribution functions of very different experiments may be the same: an even number on a fair die and a head on a fair coin have the same distribution function, but the outcomes of these random variables are unrelated.
Therefore $Y_n \xrightarrow{p} Y$ implies $Y_n \xrightarrow{d} Y$, but not conversely. In the special case where $Y_n \xrightarrow{d} c$, a constant, we also have $Y_n \xrightarrow{p} c$ and the two are equivalent.

Properties of the plim operator


- $\text{plim}\, AX_n = A\, \text{plim}\, X_n$
- the plim of a sum is the sum of the plims
- the plim of a product is the product of the plims

Example: $Y_n = (2 + \frac{1}{n})X + 3$ and $X \sim N(1, 2)$. Using the properties of the plim operator, we have
$$\text{plim}\, Y_n = \text{plim}\left(2 + \tfrac{1}{n}\right)X + \text{plim}(3) = 2X + 3 \sim N(5, 8)$$
Since convergence in probability implies convergence in distribution, $Y_n \xrightarrow{d} N(5, 8)$.

The Weak Law of Large Numbers


Consider now the convergence of the random variable sequence whose $n$th term is given by:
$$\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$
WLLN: Let $\{X_n\}$ be a sequence of i.i.d. random variables with finite mean $\mu$ and variance $\sigma^2$. Then $\bar{X}_n \xrightarrow{p} \mu$.

Proof: Using Chebyshev's Inequality,
$$P(|\bar{X}_n - \mu| < \epsilon) \ge 1 - \frac{\sigma^2}{n\epsilon^2}$$
Hence
$$\lim_{n \to \infty} P(|\bar{X}_n - \mu| < \epsilon) = 1 \quad \text{or} \quad \text{plim}\, \bar{X}_n = \mu$$
The WLLN will allow us to use the sample mean as an estimate of the population mean under very general conditions.
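The WLLN is easy to see in a simulation. A minimal sketch (not from the slides; assumes numpy):

```python
# Running sample mean of Bernoulli(0.3) draws: it settles down near 0.3 as n grows.
import numpy as np

rng = np.random.default_rng(2)
draws = rng.binomial(1, 0.3, size=100_000)
running_mean = draws.cumsum() / np.arange(1, draws.size + 1)
for n in (10, 100, 1_000, 100_000):
    print(n, running_mean[n - 1])
```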

Central Limit Theorems


Central limit theorems specify conditions under which sequences of random variables converge in distribution to known families of distributions.
These are very useful in deriving asymptotic distributions of test statistics whose exact distributions are either cumbersome or difficult to derive.
There are a large number of such theorems, which vary by the assumptions they place on the random variables: scalar or multivariate, dependent or independent, identically or non-identically distributed.

The Lindeberg-Levy CLT: Let $\{X_n\}$ be a sequence of i.i.d. random variables with $EX_i = \mu$ and $var(X_i) = \sigma^2 \in (0, \infty)$ $\forall\, i$. Then
$$\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} N(0, 1)$$

Lindeberg-Levy CLT: applications

Approximating Binomial Probabilities via the Normal Distribution: Let $\{X_n\}$ be a sequence of i.i.d. Bernoulli random variables. Then, by the LLCLT:
$$Z_n = \frac{\sum_{i=1}^n X_i - np}{\sqrt{np(1 - p)}} \xrightarrow{d} N(0, 1) \quad \text{and} \quad \sum_{i=1}^n X_i \overset{a}{\sim} N(np,\, np(1 - p))$$
In this case, $\sum_{i=1}^n X_i$ is of the form $aZ_n + b$ with $a = \sqrt{np(1 - p)}$ and $b = np$, and since $Z_n$ converges to $Z$ in distribution, the asymptotic distribution of $\sum_{i=1}^n X_i$ is normal with mean and variance given above (based on our results on normally distributed variables).

Approximating $\chi^2$ Probabilities via the Normal Distribution: Let $\{X_n\}$ be a sequence of i.i.d. chi-square random variables with 1 degree of freedom. Using the additivity property of variables with gamma distributions, we have $\sum_{i=1}^n X_i \sim \chi^2_n$. Recall that the mean of a gamma distribution is $\alpha\beta$ and its variance is $\alpha\beta^2$. For a $\chi^2_n$ random variable, $\alpha = \frac{v}{2}$ and $\beta = 2$. Then, by the LLCLT:
$$\frac{\sum_{i=1}^n X_i - n}{\sqrt{2n}} \xrightarrow{d} N(0, 1) \quad \text{and} \quad \sum_{i=1}^n X_i \overset{a}{\sim} N(n, 2n)$$
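A numerical illustration of the binomial case (a sketch with an arbitrarily chosen $n$ and $p$, not from the slides; assumes numpy/scipy):

```python
# Exact binomial probability vs its CLT-based normal approximation.
import numpy as np
from scipy.stats import binom, norm

n, p = 100, 0.3
mean, sd = n * p, np.sqrt(n * p * (1 - p))
exact = binom.cdf(35, n, p) - binom.cdf(24, n, p)             # Pr(25 <= T <= 35)
approx = norm.cdf(35.5, mean, sd) - norm.cdf(24.5, mean, sd)  # continuity correction
print(round(exact, 4), round(approx, 4))                      # both ~0.77
```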

Topic 6: Estimation
Random Samples
We cannot usually look at the population as a whole because
- it may be too big, and therefore expensive and time-consuming to study
- analyzing the sample may destroy the product/organism (you need brain cells to figure out how the brain works, or to crash cars to see how sturdy they are)

We would like to choose a sample which is representative of the population or process that interests us. A common procedure with many desirable properties is random sampling: all objects in the population have an equal chance of being selected.
Haphazard sampling procedures often result in non-random samples.
Example: We have a bag of sweets and chocolates of different types (eclairs, five-stars, gems...) and want to estimate the average weight of the items in the bag. If we pass the bag around and each student puts their hand in and picks 5 items, how do you think these sample averages would compare with the true average?

Definition: Let $f(x)$ be the density function of a continuous random variable $X$. Consider a sample of size $n$ from this distribution. We can think of the first value drawn as a realization of the random variable $X_1$, and similarly for $X_2, \ldots, X_n$. $(x_1, \ldots, x_n)$ is a random sample if $f(x_1, \ldots, x_n) = f(x_1)f(x_2)\ldots f(x_n)$.

Statistical Models
Definition: A statistical model for a random sample consists of a parametric functional form, $f(x; \theta)$, together with a parameter space $\Omega$ which defines the potential candidates for $\theta$.
Examples: We may specify that our sample comes from
- a Bernoulli distribution, with $\Omega = \{p : p \in [0, 1]\}$
- a Normal distribution, where $\Omega = \{(\mu, \sigma^2) : \mu \in (-\infty, \infty),\, \sigma > 0\}$

Note that $\Omega$ could be much more restrictive in each of these examples. What matters for our purposes is that
- $\Omega$ contains the true value of the parameter
- the parameters are identifiable, meaning that the probability of generating the given sample is different under two distinct parameter values. If, given a sample $x$ and parameter values $\theta_1$ and $\theta_2$, we have $f(x, \theta_1) = f(x, \theta_2)$, we'll never be able to use the sample to reach a conclusion on which of these values is the true parameter.

Estimators and Estimates


Definition: An estimator of the parameter $\theta$, based on the random variables $X_1, \ldots, X_n$, is a real-valued function $\delta(X_1, \ldots, X_n)$ which specifies the estimated value of $\theta$ for each possible set of values of $X_1, \ldots, X_n$.
Since an estimator $\delta(X_1, \ldots, X_n)$ is a function of random variables, the estimator is itself a random variable and its probability distribution can be derived from the joint distribution of $X_1, \ldots, X_n$.
A point estimate is a specific value of the estimator, $\delta(x_1, \ldots, x_n)$, that is determined by using the observed values $x_1, \ldots, x_n$.
There are lots of potential functions $\delta$ of the random sample; what criteria should we use to choose among these?

Desirable Properties of Estimators


- Unbiasedness: $E(\hat{\theta}_n) = \theta$.
- Consistency: $\lim_{n \to \infty} P(|\hat{\theta}_n - \theta| > \epsilon) = 0$ for every $\epsilon > 0$.
- Minimum MSE: $E(\hat{\theta}_n - \theta)^2 \le E(\theta^*_n - \theta)^2$ for any $\theta^*_n$.

Using the MSE criterion may often lead us to choose biased estimators because
$$MSE(\hat{\theta}) = Var(\hat{\theta}) + Bias(\hat{\theta}, \theta)^2$$
To see this, write
$$E(\hat{\theta} - \theta)^2 = E\left[\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta\right]^2 = E\left[\hat{\theta} - E(\hat{\theta})\right]^2 + \left(E(\hat{\theta}) - \theta\right)^2 + 2\left(E(\hat{\theta}) - \theta\right)E\left[\hat{\theta} - E(\hat{\theta})\right] = Var(\hat{\theta}) + Bias(\hat{\theta}, \theta)^2 + 0$$

A Minimum Variance Unbiased Estimator (MVUE) is an estimator which has the smallest variance among the class of unbiased estimators.
A Best Linear Unbiased Estimator (BLUE) is an estimator which has the smallest variance among the class of linear unbiased estimators (the estimates must be linear functions of the sample values).

Maximum Likelihood Estimators


Definition: Suppose that the random variables $X_1, \ldots, X_n$ form a random sample from a discrete or continuous distribution for which the p.f. or p.d.f. is $f(x|\theta)$, where $\theta$ belongs to some parameter space $\Omega$. For any observed vector $x = (x_1, \ldots, x_n)$, $f_n(x|\theta)$ is a function of $\theta$ and is called the likelihood function.
For each possible observed vector $x$, let $\delta(x) \in \Omega$ denote a value of $\theta$ for which the likelihood function $f_n(x|\theta)$ is a maximum, and let $\hat{\theta} = \delta(X)$ be the estimator of $\theta$ defined in this way. The estimator $\hat{\theta}$ is called the maximum likelihood estimator (M.L.E.) of $\theta$.

M.L.E. of a Bernoulli parameter

Suppose that the random variables $X_1, \ldots, X_n$ form a random sample from a Bernoulli distribution for which the parameter $\theta$ is unknown. We can derive the M.L.E. of $\theta$ as follows:
The Bernoulli density can be written as $f(x; \theta) = \theta^x(1 - \theta)^{1 - x}$, $x = 0, 1$.
For any observed values $x_1, \ldots, x_n$, where each $x_i$ is either 0 or 1, the likelihood function is given by:
$$f_n(x|\theta) = \prod_{i=1}^n \theta^{x_i}(1 - \theta)^{1 - x_i} = \theta^{\sum x_i}(1 - \theta)^{n - \sum x_i}$$
The value of $\theta$ that maximizes this will be the same as the value that maximizes the log of the likelihood function, $L(\theta)$, which is given by:
$$L(\theta) = \left(\sum_{i=1}^n x_i\right)\ln\theta + \left(n - \sum_{i=1}^n x_i\right)\ln(1 - \theta)$$
The first-order condition for an extreme point is given by:
$$\frac{\sum_{i=1}^n x_i}{\theta} - \frac{n - \sum_{i=1}^n x_i}{1 - \theta} = 0$$
and solving this, we get
$$\hat{\theta}_{MLE} = \frac{1}{n}\sum_{i=1}^n x_i$$
Confirm that the second derivative of $L(\theta)$ is in fact negative, so we do have a maximum.
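The analytical answer can be compared with a brute-force numerical maximizer. A sketch (not from the slides; assumes numpy/scipy):

```python
# The numerical maximizer of the Bernoulli log-likelihood coincides with the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.35, size=500)

def neg_log_lik(theta):
    return -(x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta))

opt = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(opt.x, x.mean())   # the two agree
```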

Sampling from a normal distribution: unknown mean

Suppose that the random variables $X_1, \ldots, X_n$ form a random sample from a normal distribution for which the mean $\mu$ is unknown, but the variance $\sigma^2$ is known. Recall the normal density
$$f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} I_{(-\infty, +\infty)}(x)$$
For any observed values $x_1, \ldots, x_n$, the likelihood function is given by:
$$f_n(x|\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2}$$
$f_n(x|\mu)$ will be maximized by the value of $\mu$ which minimizes the following expression in $\mu$:
$$Q(\mu) = \sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2$$
The first-order condition is now $-2\sum_{i=1}^n x_i + 2n\mu = 0$, and our M.L.E. is once again the sample mean:
$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^n X_i$$
The second derivative is positive, so we have a minimum value of $Q(\mu)$.

Sampling from a normal distribution: unknown $\mu$ and $\sigma^2$

Now suppose that the random variables $X_1, \ldots, X_n$ form a random sample from a normal distribution for which both the parameters $\mu$ and $\sigma^2$ are unknown. Now the likelihood function is a function of both parameters:
$$f_n(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2}$$
To find the M.L.E.s of both $\mu$ and $\sigma^2$, it is easiest to maximize the log of the likelihood function:
$$L(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$$
We now have two first-order conditions, obtained by setting each of the following partial derivatives equal to zero:
$$\frac{\partial L}{\partial \mu} = \frac{1}{\sigma^2}\left(\sum_{i=1}^n x_i - n\mu\right) \quad (1)$$
$$\frac{\partial L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (x_i - \mu)^2 \quad (2)$$
The first equation can be solved to obtain $\hat{\mu} = \bar{x}_n$, and substituting this into the second equation we obtain $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x}_n)^2$.

The maximum likelihood estimators are therefore $\hat{\mu} = \bar{X}_n$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$.

Sampling from a uniform distribution


Now suppose $X_1, \ldots, X_n$ are draws from a uniform distribution on $[0, \theta]$ for which the parameter $\theta$ is unknown.
The density function is given by
$$f(x; \theta) = \frac{1}{\theta}\, I_{[0, \theta]}(x)$$
The likelihood function is therefore
$$f_n(x; \theta) = \frac{1}{\theta^n} \quad \text{for } 0 \le x_i \le \theta \ \forall\, i$$
This function is decreasing in $\theta$ and is therefore maximized at the smallest admissible value of $\theta$, which is given by $\hat{\theta} = \max(X_1, \ldots, X_n)$.
Notice that if we modify the domain of the density to be $(0, \theta)$ instead of $[0, \theta]$, then no M.L.E. exists, since the maximum sample value is no longer an admissible candidate for $\theta$.
Now suppose the random sample is from a uniform distribution on $[\theta, \theta + 1]$. Now $\hat{\theta}$ could lie anywhere in the interval $[\max(x_1, \ldots, x_n) - 1,\ \min(x_1, \ldots, x_n)]$ and the method of maximum likelihood does not provide us with a unique estimate.

Computation of MLEs
The form of a likelihood function is often complicated enough to make numerical computation necessary.
Consider, for example, a sample of size $n$ from the following Gamma distribution, and suppose we would like an M.L.E. of $\alpha$:
$$f(x; \alpha) = \frac{1}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-x}, \quad x > 0$$
$$f_n(x|\alpha) = \frac{1}{\Gamma(\alpha)^n}\left(\prod_{i=1}^n x_i\right)^{\alpha - 1} e^{-\sum_{i=1}^n x_i}$$
Setting the derivative of $\log L$ to zero, we get the first-order condition
$$\frac{\Gamma'(\alpha)}{\Gamma(\alpha)} = \frac{1}{n}\sum_{i=1}^n \ln x_i$$
The LHS is the digamma function, which is tabulated and now embedded in software packages.
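For instance, the first-order condition above can be solved with a root-finder. A sketch (not from the slides; assumes numpy/scipy):

```python
# Solve digamma(alpha) = (1/n) * sum(ln x_i) for a Gamma(alpha, 1) sample.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(4)
x = rng.gamma(shape=2.5, scale=1.0, size=2_000)
target = np.log(x).mean()
alpha_hat = brentq(lambda a: digamma(a) - target, 1e-3, 100)
print(alpha_hat)   # close to the true shape parameter 2.5
```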

Properties of Maximum Likelihood Estimators


1. Invariance: If $\hat{\theta}$ is the maximum likelihood estimator of $\theta$, and $g(\theta)$ is a one-to-one function of $\theta$, then $g(\hat{\theta})$ is a maximum likelihood estimator of $g(\theta)$.
This allows us to easily compute M.L.E.s of various statistics once we have a few of them.
Example: we have shown that the sample mean and the sample variance are the M.L.E.s of the mean and variance of a random sample from a normal distribution. We can use the invariance property to conclude that
- the M.L.E. of the standard deviation is the square root of the sample variance
- the M.L.E. of $E(X^2)$ is equal to the sample variance plus the square of the sample mean, i.e. since $E(X^2) = \sigma^2 + \mu^2$, the M.L.E. of $E(X^2)$ is $\hat{\sigma}^2 + \hat{\mu}^2$

2. Consistency: If there exists a unique M.L.E. $\hat{\theta}_n$ of a parameter $\theta$ for a sample of size $n$, then $\text{plim}\, \hat{\theta}_n = \theta$.

Note: M.L.E.s are not, in general, unbiased.
Example: The M.L.E. of the variance of a normally distributed variable is given by $\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$.
$$E(\hat{\sigma}_n^2) = E\left[\frac{1}{n}\sum_{i=1}^n \left(X_i^2 - 2X_i\bar{X}_n + \bar{X}_n^2\right)\right] = E\left[\frac{1}{n}\left(\sum X_i^2 - 2n\bar{X}_n^2 + n\bar{X}_n^2\right)\right] = E\left[\frac{1}{n}\left(\sum X_i^2 - n\bar{X}_n^2\right)\right]$$
$$= \frac{1}{n}\left[\sum E(X_i^2) - nE(\bar{X}_n^2)\right] = \frac{1}{n}\left[n(\mu^2 + \sigma^2) - n\left(\mu^2 + \frac{\sigma^2}{n}\right)\right] = \sigma^2\,\frac{n - 1}{n}$$
An unbiased estimator would therefore be $\frac{\sum_{i=1}^n (X_i - \bar{X}_n)^2}{n - 1}$.

Sufficient Statistics
We have seen that M.L.E.s may not exist, or may not be unique. Where should our search for other estimators start? We'll see that a natural starting point is the set of sufficient statistics for the sample.
Suppose that in a specific estimation problem, two statisticians A and B would like to estimate $\theta$; A observes the realized values of $X_1, \ldots, X_n$, while B only knows the value of a certain statistic $T = r(X_1, \ldots, X_n)$.
A can now choose any function of the observations $(X_1, \ldots, X_n)$, whereas B can choose only functions of $T$. In general, A will be able to choose a better estimator than B. Suppose, however, that B does just as well as A because the single function $T$ summarizes all the relevant information in the sample for choosing a suitable estimate of $\theta$. A statistic $T$ with this property is called a sufficient statistic.
In this case, given $T = t$, we can generate an alternative sample $X_1', \ldots, X_n'$ in accordance with the conditional joint distribution of the sample given $T = t$ (auxiliary randomization). Suppose A uses $\delta(X_1, \ldots, X_n)$ as an estimator. Then B could just use $\delta(X_1', \ldots, X_n')$, which has the same probability distribution as A's estimator.

Sufficient statistics: their detection and importance


The Factorization Criterion (Fisher (1922); Neyman (1935)): Let $X_1, \ldots, X_n$ form a random sample from either a continuous or discrete distribution for which the p.d.f. or the p.f. is $f(x|\theta)$, where the value of $\theta$ is unknown and belongs to a given parameter space $\Omega$. A statistic $T = r(X_1, \ldots, X_n)$ is a sufficient statistic for $\theta$ if and only if, for all values of $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$ and all values of $\theta \in \Omega$, the joint density $f_n(x|\theta)$ of $(X_1, \ldots, X_n)$ can be factored as follows:
$$f_n(x|\theta) = u(x)\, v[r(x), \theta]$$
The functions $u$ and $v$ are nonnegative; the function $u$ may depend on $x$ but does not depend on $\theta$; and the function $v$ depends on $\theta$ but depends on the observed value $x$ only through the value of the statistic $r(x)$.

Rao-Blackwell Theorem: An estimator that is not a function of a sufficient statistic is dominated by one that is (in terms of having a lower MSE).

Sufficient Statistics: examples


1. Sampling from a Poisson Distribution: Let $X_1, \ldots, X_n$ form a random sample from a Poisson distribution for which the value of the mean $\theta$ is unknown and belongs to the parameter space $\Omega = \{\theta : \theta > 0\}$. For any set of nonnegative integers $x_1, \ldots, x_n$, the joint p.f. $f_n(x|\theta)$ of $X_1, \ldots, X_n$ is as follows:
$$f_n(x|\theta) = \prod_{i=1}^n \frac{e^{-\theta}\theta^{x_i}}{x_i!} = \left(\prod_{i=1}^n \frac{1}{x_i!}\right) e^{-n\theta}\theta^y$$
where $y = \sum_{i=1}^n x_i$. We've expressed $f_n(x|\theta)$ as the product of a function that does not depend on $\theta$ and a function that depends on $\theta$ but depends on the observed vector $x$ only through the value of $y$. It follows that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\theta$.

2. Sampling from a normal distribution with known variance and unknown mean: Let $X_1, \ldots, X_n$ form a random sample from a normal distribution for which the value of the mean $\mu$ is unknown and the variance $\sigma^2$ is known. The joint p.d.f. $f_n(x|\mu)$ of $X_1, \ldots, X_n$ has already been derived as:
$$f_n(x|\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2\right) \exp\left(\frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{n\mu^2}{2\sigma^2}\right)$$
The above expression is a product of a function that does not depend on $\mu$ and a function that depends on $\mu$ and on $x$ only through the value of $\sum_{i=1}^n x_i$. It follows that $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\mu$.

Jointly Sufficient Statistics


If our parameter space is multi-dimensional, and often even when it is not, there may not exist a single sufficient statistic $T$, but we may be able to find a set of statistics $T_1, \ldots, T_k$ which are jointly sufficient statistics for estimating our parameter.
The corresponding factorization criterion is now
$$f_n(x|\theta) = u(x)\, v[r_1(x), \ldots, r_k(x), \theta]$$
The functions $u$ and $v$ are nonnegative; the function $u$ may depend on $x$ but does not depend on $\theta$; and the function $v$ depends on $\theta$ but depends on the observed value $x$ only through the values of the statistics $r_1(x), \ldots, r_k(x)$.
Example: If both the mean and the variance of a normal distribution are unknown, the joint p.d.f.
$$f_n(x|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n x_i^2 + \frac{\mu}{\sigma^2}\sum_{i=1}^n x_i - \frac{n\mu^2}{2\sigma^2}\right)$$
can be seen to depend on $x$ only through the statistics $T_1 = \sum X_i$ and $T_2 = \sum X_i^2$. These are therefore jointly sufficient statistics for $\mu$ and $\sigma^2$.
If $T_1, \ldots, T_k$ are jointly sufficient for some parameter vector $\theta$ and the statistics $T_1', \ldots, T_k'$ are obtained from these by a one-to-one transformation, then $T_1', \ldots, T_k'$ are also jointly sufficient. So the sample mean and sample variance are also jointly sufficient in the above example, since $T_1 = nT_1'$ and $T_2 = n(T_2' + T_1'^2)$.

Minimal Sufficient Statistics


Definition: A statistic $T$ is a minimal sufficient statistic if $T$ is a sufficient statistic and is a function of every other sufficient statistic.
Minimal jointly sufficient statistics are defined in an analogous manner.
Let $Y_1$ denote the smallest value in the sample, $Y_2$ the next smallest, and so on, with $Y_n$ the largest value in the sample. We call $Y_1, \ldots, Y_n$ the order statistics of the sample.
Order statistics are always jointly sufficient. To see this, note that the likelihood function is given by
$$f_n(x|\theta) = \prod_{i=1}^n f(x_i|\theta)$$
Since the order of the terms in this product is irrelevant (we need to know only the values obtained, not which one was $X_1$, etc.), we could as well write this expression as
$$f_n(x|\theta) = \prod_{i=1}^n f(y_i|\theta)$$
For some distributions this may be the simplest set of jointly sufficient statistics, in which case the order statistics are minimal jointly sufficient.
Notice that if a sufficient statistic $r(x)$ exists, the M.L.E. must be a function of it (this follows from the factorization criterion). It turns out that if the M.L.E. is a sufficient statistic, it is minimal sufficient.

Implications
Suppose we are picking a sample from a normal distribution. We may be tempted to use $Y_{(n+1)/2}$ as an estimate of the median $m$ and $Y_n - Y_1$ as an estimate of the variance. Yet we know that we would do better using the sample mean for $m$ and an estimator of the variance that is a function of $\sum X_i$ and $\sum X_i^2$.
A statistic is always sufficient with respect to a particular probability distribution $f(x|\theta)$ and may not be sufficient w.r.t. another, say $g(x|\theta)$. Instead of choosing functions of the sufficient statistic we obtain in one case, we may want to find a robust estimator that does well for many possible distributions.
In non-parametric inference, we do not know the likelihood function, and so our estimators are based on functions of the order statistics.

Topic 7: Sampling Distributions of Estimators


Sampling distributions of estimators


Since our estimators are statistics (particular functions of random variables), their distribution can be derived from the joint distribution of $X_1, \ldots, X_n$. It is called the sampling distribution because it is based on the joint distribution of the random sample.
Given this distribution, we can
- calculate the probability that an estimator will not differ from the parameter $\theta$ by more than a specified amount
- obtain interval estimates rather than point estimates after we have a sample; an interval estimate is a random interval such that the true parameter lies within this interval with a given probability (say 95%)
- choose between two estimators; we can, for instance, calculate the mean-squared error of the estimator, $E_\theta[(\hat{\theta} - \theta)^2]$, using the distribution of $\hat{\theta}$

Sampling distributions of estimators depend on sample size, and we want to know exactly how the distribution changes as we change this size so that we can make the right trade-offs between cost and accuracy.
We begin with a set of results which help us derive the sampling distributions of the estimators we are interested in.

Joint distribution of sample mean and sample variance


For a random sample from a normal distribution, we already know that the M.L.E.s of the population mean and variance are $\bar{X}_n$ and $\sum_{i=1}^n (X_i - \bar{X}_n)^2/n$ respectively.
- The sample mean is itself normally distributed with mean $\mu$ and variance $\frac{\sigma^2}{n}$, and
- $\sum_{i=1}^n \left(\frac{X_i - \mu}{\sigma}\right)^2$ has a $\chi^2_n$ distribution, since it is the sum of squares of $n$ standard normal random variables.

Theorem: If $X_1, \ldots, X_n$ form a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$, then the sample mean $\bar{X}_n$ and the sample variance $\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)^2$ are independent random variables and
$$\bar{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right), \qquad \frac{\sum_{i=1}^n (X_i - \bar{X}_n)^2}{\sigma^2} \sim \chi^2_{n-1}$$

The t-distribution
Let $Z \sim N(0, 1)$, let $Y \sim \chi^2_v$, and let $Z$ and $Y$ be independent random variables. Then
$$X = \frac{Z}{\sqrt{Y/v}} \sim t_v$$
The p.d.f. of the t-distribution is given by:
$$f(x; v) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\Gamma\left(\frac{v}{2}\right)\sqrt{v\pi}}\left(1 + \frac{x^2}{v}\right)^{-\frac{v+1}{2}}$$
Features of the t-distribution:
- One can see from the above density function that the t-density is symmetric, with a maximum value at $x = 0$.
- The shape of the density is similar to that of the standard normal (bell-shaped) but with fatter tails.

Relation to random normal samples


RESULT 1: Define $S_n^2 = \sum_{i=1}^n (X_i - \bar{X}_n)^2$. The random variable
$$U = \frac{\sqrt{n}(\bar{X}_n - \mu)}{\sqrt{S_n^2/(n-1)}} \sim t_{n-1}$$
Proof: We know that $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \sim N(0, 1)$ and that $\frac{S_n^2}{\sigma^2} \sim \chi^2_{n-1}$. Dividing the first random variable by the square root of the second divided by its degrees of freedom, the $\sigma$ in the numerator and denominator cancels to obtain $U$.

Implication: We may not be able to make statements about the difference between the population mean $\mu$ and the sample mean $\bar{X}_n$ using the normal distribution, because even though $\frac{\sqrt{n}(\bar{X}_n - \mu)}{\sigma} \sim N(0, 1)$, $\sigma^2$ may be unknown. This result allows us to use its estimate $\hat{\sigma}^2 = \sum_{i=1}^n (X_i - \bar{X}_n)^2/n$, since $\frac{\bar{X}_n - \mu}{\hat{\sigma}/\sqrt{n-1}} \sim t_{n-1}$.

RESULT 2: Given $X, Z, Y, n$ as above, as $n \to \infty$,
$$X \xrightarrow{d} Z \sim N(0, 1)$$
As the sample size gets larger, the t-density looks more and more like a standard normal distribution. For instance, the value of $x$ for which the distribution function is equal to .55 is .129 for $t_{10}$, .127 for $t_{20}$ and .126 for the standard normal distribution. The differences between these values increase for higher values of their distribution functions (why?).
To see why this might happen, consider the variable we just derived: $\sqrt{\frac{n-1}{n}}\,\frac{\sqrt{n}(\bar{X}_n - \mu)}{\hat{\sigma}} \sim t_{n-1}$. As $n$ gets large, $\hat{\sigma}$ gets very close to $\sigma$ and $\sqrt{\frac{n-1}{n}}$ is close to 1.

Interval estimates for the mean


Let us now see how, given $\sigma^2$, we can obtain an interval estimate for $\mu$, i.e. an interval which is likely to contain $\mu$ with a pre-specified probability.

Since $\frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1)$,
$$\Pr\left(-2 < \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} < 2\right) = .955$$
But this event is equivalent to the events
$$\mu - \frac{2\sigma}{\sqrt{n}} < \bar{X}_n < \mu + \frac{2\sigma}{\sqrt{n}} \quad \text{and} \quad \bar{X}_n - \frac{2\sigma}{\sqrt{n}} < \mu < \bar{X}_n + \frac{2\sigma}{\sqrt{n}}$$
With known $\sigma$, each of the random variables $\bar{X}_n - \frac{2\sigma}{\sqrt{n}}$ and $\bar{X}_n + \frac{2\sigma}{\sqrt{n}}$ is a statistic. Therefore, we have derived a random interval within which the population parameter lies with probability .955, i.e.
$$\Pr\left(\bar{X}_n - \frac{2\sigma}{\sqrt{n}} < \mu < \bar{X}_n + \frac{2\sigma}{\sqrt{n}}\right) = .955$$
Notice that there are many intervals with the same coverage probability; this is the shortest one.
Now, given our sample, our statistics take particular values and the resulting interval either contains or does not contain $\mu$. We can therefore no longer talk about the probability that it contains $\mu$, because the experiment has already been performed.
We say that $\left(\bar{x}_n - \frac{2\sigma}{\sqrt{n}},\ \bar{x}_n + \frac{2\sigma}{\sqrt{n}}\right)$ is a 95.5% confidence interval for $\mu$. Alternatively, we may say that $\mu$ lies in the above interval with confidence .955, or that the above interval is a confidence interval for $\mu$ with confidence coefficient .955.

Confidence Intervals for means: examples

Example 1: $X_1, \ldots, X_n$ forms a random sample from a normal distribution with unknown $\mu$ and $\sigma^2 = 10$. $\bar{x}_n$ is found to be 7.164 with $n = 40$. An 80% confidence interval for the mean is given by
$$\left(7.164 - 1.282\sqrt{\tfrac{10}{40}},\ 7.164 + 1.282\sqrt{\tfrac{10}{40}}\right) = (6.523,\ 7.805)$$
The confidence coefficient is .8.

Example 2: Let $\bar{X}_n$ denote the sample mean of a random sample of size 25 from a distribution with variance 100 and mean $\mu$. In this case $\sigma_{\bar{X}_n} = 2$ and, making use of the central limit theorem, the following statement is approximately true:
$$\Pr\left(-1.96 < \frac{\bar{X}_n - \mu}{2} < 1.96\right) = .95 \quad \text{or} \quad \Pr\left(\bar{X}_n - 3.92 < \mu < \bar{X}_n + 3.92\right) = .95$$
If the sample mean is given by $\bar{x}_n = 67.53$, an approximate 95% confidence interval for the population mean is given by $(63.61,\ 71.45)$.

Example 3: Suppose we are interested in a confidence interval for the mean of a normal distribution but do not know $\sigma^2$. We know that $\frac{\bar{X}_n - \mu}{\hat{\sigma}/\sqrt{n-1}} \sim t_{n-1}$ and can use the t-distribution with $(n-1)$ degrees of freedom to construct our interval estimate. With $n = 10$, $\bar{x}_n = 3.22$, $\hat{\sigma} = 1.17$, a 95% interval is given by
$$\left(3.22 - (2.262)(1.17)/\sqrt{9},\ 3.22 + (2.262)(1.17)/\sqrt{9}\right) = (2.34,\ 4.10)$$
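Example 3 can be reproduced directly from the t quantile. A sketch (not from the slides; assumes numpy/scipy):

```python
# t-based 95% confidence interval: n = 10, sample mean 3.22, MLE sigma-hat 1.17.
import numpy as np
from scipy.stats import t

n, xbar, sigma_hat = 10, 3.22, 1.17
t_crit = t.ppf(0.975, df=n - 1)            # ~2.262
half = t_crit * sigma_hat / np.sqrt(n - 1)
print(xbar - half, xbar + half)            # ~(2.34, 4.10)
```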

Confidence Intervals for differences in means


Let $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$ denote, respectively, independent random samples from two distributions, $N(\mu_1, \sigma^2)$ and $N(\mu_2, \sigma^2)$, with sample means denoted by $\bar{X}$, $\bar{Y}$ and sample variances by $\hat{\sigma}_1^2$ and $\hat{\sigma}_2^2$.
We've established that:
- $\bar{X}$ and $\bar{Y}$ are normally and independently distributed with means $\mu_1$ and $\mu_2$ and variances $\frac{\sigma^2}{n}$ and $\frac{\sigma^2}{m}$.
- Using our results on the distribution of linear combinations of normally distributed variables, we know that $\bar{X}_n - \bar{Y}_m$ is normally distributed with mean $\mu_1 - \mu_2$ and variance $\frac{\sigma^2}{n} + \frac{\sigma^2}{m}$. The random variable $\frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma^2}{n} + \frac{\sigma^2}{m}}}$ has a standard normal distribution and will form the numerator of the $T$ random variable that we are going to use.
- We also know that $\frac{n\hat{\sigma}_1^2}{\sigma^2}$ and $\frac{m\hat{\sigma}_2^2}{\sigma^2}$ have $\chi^2$ distributions with $(n-1)$ and $(m-1)$ degrees of freedom respectively, so their sum $(n\hat{\sigma}_1^2 + m\hat{\sigma}_2^2)/\sigma^2$ has a $\chi^2$ distribution with $(n + m - 2)$ degrees of freedom, and the random variable $\sqrt{\frac{n\hat{\sigma}_1^2 + m\hat{\sigma}_2^2}{\sigma^2(n + m - 2)}}$ can appear as the denominator of a random variable which has a t-distribution with $(n + m - 2)$ degrees of freedom.

Confidence Intervals for differences in means (contd.)

We have therefore established that
$$X = \frac{(\bar{X}_n - \bar{Y}_m) - (\mu_1 - \mu_2)}{\sqrt{\left(\frac{1}{n} + \frac{1}{m}\right)\frac{n\hat{\sigma}_1^2 + m\hat{\sigma}_2^2}{n + m - 2}}}$$
has a t-distribution with $(n + m - 2)$ degrees of freedom. To simplify notation, denote the denominator of the above expression by $R$.
Given our samples $X_1, \ldots, X_n$ and $Y_1, \ldots, Y_m$, we can now construct confidence intervals for differences in the means of the corresponding populations, $\mu_1 - \mu_2$. We do this in the usual way:
- Suppose we want a 95% confidence interval for the difference in the means. We find a number $b$ such that, using the t-distribution with $(n + m - 2)$ degrees of freedom, $\Pr(-b < X < b) = .95$.
- The random interval $\left((\bar{X} - \bar{Y}) - bR,\ (\bar{X} - \bar{Y}) + bR\right)$ will now contain the true difference in means with 95% probability.
- A confidence interval is now based on the sample values $(\bar{x}_n - \bar{y}_m)$ and the corresponding sample variances.
- Based on the CLT, we can use the same procedure even when our samples are not normal.

The F-distribution
RESULT: Let $Y \sim \chi^2_m$, $Z \sim \chi^2_n$, and let $Y$ and $Z$ be independent random variables. Then
$$F = \frac{Y/m}{Z/n} = \frac{nY}{mZ}$$
has an F-distribution with $m$ and $n$ degrees of freedom. The F-density is given by:
$$f(x) = \frac{\Gamma\left(\frac{m+n}{2}\right) m^{m/2} n^{n/2}}{\Gamma\left(\frac{m}{2}\right)\Gamma\left(\frac{n}{2}\right)}\, \frac{x^{m/2 - 1}}{(mx + n)^{(m+n)/2}}\, I_{(0, \infty)}(x)$$
- $m$ and $n$ are sometimes referred to as the numerator and denominator degrees of freedom respectively.
- It turns out that the square of a random variable with a t-distribution with $n$ degrees of freedom has an F-distribution with $(1, n)$ degrees of freedom.
- The F test turns out to be useful in testing for differences in variances between two distributions.
- Many important specification tests rely on the F-distribution (example: testing for a set of coefficients in a linear model being equal to zero).

Computing probabilities with the F-distribution


- While the t-density is symmetric, the F-density is defined on the positive real numbers and is skewed to the right.
- From the definition of the F distribution (the ratio of two scaled $\chi^2$ random variables), it follows that
$$\frac{1}{F_{v_1, v_2}} = F_{v_2, v_1}$$
So if a random variable $X$ has an F-distribution with $(m, n)$ degrees of freedom, then $\frac{1}{X}$ has an F-distribution with $(n, m)$ degrees of freedom.
- This allows us to easily construct lower-tail events having probability $\alpha$ from upper-tail events having probability $\alpha$. For the normal and t-distributions we use the symmetry of those distributions to do this. In practice this is rarely needed anymore, since most statistical packages give us the required output.
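The reciprocal relationship can be verified numerically. A sketch (not from the slides; assumes scipy):

```python
# The upper-tail critical value of F(m, n) equals the reciprocal of the
# lower-tail critical value of F(n, m).
from scipy.stats import f

m, n, alpha = 5, 10, 0.05
upper = f.ppf(1 - alpha, m, n)
lower = f.ppf(alpha, n, m)
print(upper, 1 / lower)   # identical
```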

Topic 8: Hypothesis Testing


The Problem of Hypothesis Testing


- A statistical hypothesis is an assertion or conjecture about the probability distribution of one or more random variables.
- A test of a statistical hypothesis is a rule or procedure for deciding whether to reject that assertion.
- Suppose we have a sample $x = (x_1, \ldots, x_n)$ from a density $f$. We have two hypotheses about $f$. On the basis of our sample, one of the hypotheses is accepted and the other is rejected.
- The two hypotheses have different status:
  - the null hypothesis, $H_0$, is the hypothesis under test. It is the conservative hypothesis, not to be rejected unless the evidence is clear.
  - the alternative hypothesis, $H_1$, specifies the kind of departure from the null hypothesis that is of interest to us.
- Two types of tests:
  - tests of a parametric hypothesis: we partition $\Omega$ into two subsets $\Omega_0$ and $\Omega_1$, with $H_i$ the hypothesis that $\theta \in \Omega_i$ (we consider only these hypotheses).
  - goodness-of-fit tests: $H_0: f = f_0$ against $H_1: f \ne f_0$.
- A hypothesis is simple if it completely specifies the probability distribution; otherwise it is composite.

Examples: (i) Income is log-normally distributed with known variance but unknown mean. $H_0: \mu \ge 8{,}000$ rupees per month, $H_1: \mu < 8{,}000$. (ii) We would like to know whether parents are more likely to have boys than girls, with the probability of a boy child denoted by the Bernoulli parameter $p$: $H_0: p = \frac{1}{2}$ and $H_1: p > \frac{1}{2}$.

Statistical tests
- Before deciding whether or not to accept $H_0$, we observe a random sample. Denote by $S$ the set of all possible outcomes $X$ of the random sample.
- A test procedure partitions the set $S$ into two subsets: one containing the values that will lead to the acceptance of $H_0$, and the other containing the values which lead to its rejection.
- A statistical test is defined by the critical region $R$, which is the subset of $S$ for which $H_0$ is rejected. The complement of this region must therefore contain all outcomes for which $H_0$ is not rejected.
- Most tests are based on values taken by a test statistic (the sample mean, the sample variance, or functions of these). In this case the critical region $R$ is a subset of values of the test statistic for which $H_0$ is rejected. The critical values of a test statistic are the bounds of $R$.
- When arriving at a decision based on a sample and a test, we may make two types of errors:
  - $H_0$ may be rejected when it is true: a Type I error
  - $H_0$ may be accepted when it is false: a Type II error
- A test is any rule which we use to make a decision, and we have to think of rules that help us make better decisions (in terms of these errors). We will discuss how to evaluate a test, and for some problems we will characterize optimal tests.

The power function


- One way to characterize a test is to specify, for each value of $\theta \in \Omega$, the probability $\pi(\theta)$ that the test procedure will lead to the rejection of $H_0$. The function $\pi(\theta)$ is called the power function of the test:
$$\pi(\theta) = \Pr(X \in R) \quad \text{for } \theta \in \Omega$$
- If our critical region is specified in terms of values taken by the statistic $T$, we have $\pi(\theta) = \Pr(T \in R)$ for $\theta \in \Omega$.
- Since the power function of a test specifies the probability of rejecting $H_0$ as a function of the real parameter value, we can evaluate our test by asking how often it leads to mistakes.
- What is the power function of an ideal test? Think of examples when such a test exists.
- It is common to specify an upper bound $\alpha_0$ on $\pi(\theta)$ for every value $\theta \in \Omega_0$. This bound $\alpha_0$ is the level of significance of the test.
- The size of a test, $\alpha$, is the maximum probability, among all values of $\theta \in \Omega_0$, of making an incorrect decision:
$$\alpha = \sup_{\theta \in \Omega_0} \pi(\theta)$$
- Given a level of significance $\alpha_0$, only tests for which $\alpha \le \alpha_0$ are admissible.

The power function: example 1

- We want to test a hypothesis about the number of defective bolts in a shipment.
- We assume that the probability of a bolt being defective is given by the parameter $p$ in a Bernoulli distribution.
- The null hypothesis is that defects are at most 2%, i.e. $p \le .02$, and the alternative is that $p > .02$.
- We pick a sample of 200 items, and our test takes the form of rejecting the null hypothesis if more than a certain number of defective items are found.
- Suppose that we want to find a test for which $\alpha_0 = .05$. The probability of a given number of defective items $x$ is increasing in $p$ for all $x > np$. Therefore, for such values of $x$ and all values of $p \in \Omega_0$, any test that rejects $H_0$ if more than $x$ defective items are found will have the highest probability of rejecting the null hypothesis when $p = .02$. We can therefore restrict our attention to this value of $p$ when finding a test with significance level $\alpha_0 = .05$.
- It turns out that for $p = .02$, the probability that the number of defective items is greater than 7 is .049. This is therefore the test we choose. (The sizes of tests which reject for more than 4, 5 and 6 defective pieces are .37, .21 and .11 respectively.)
- The size of the above test ($R = \{x : x > 7\}$) is 0.049. With discrete distributions, the size will often be strictly smaller than $\alpha_0$.
- What does the power function look like?

Note: The Stata 12 command you can use to verify this is: display 1-binomial(200,7,.02)
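The same scan over cutoffs can be done in Python (a sketch equivalent to the Stata line above; assumes scipy):

```python
# Size of the test "reject H0 if more than c defectives" at p = .02, n = 200.
from scipy.stats import binom

n, p = 200, 0.02
for c in (4, 5, 6, 7):
    print(c, round(1 - binom.cdf(c, n, p), 3))  # ~.37, .21, .11 and .049 for c = 7
```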

The power function graph: example 1

[Figure: the power function $\pi(p) = \Pr(\text{more than 7 defectives} \mid p)$ for $n = 200$, increasing in $p$ and equal to .049 at $p = .02$.]

The power function: example 2

Suppose a random sample is taken from a uniform distribution on $[0, \theta]$ and the null and alternative hypotheses are as follows:
$$H_0: 3 \le \theta \le 4 \qquad H_1: \theta < 3 \text{ or } \theta > 4$$
Suppose that our test procedure uses the M.L.E. of $\theta$, $Y_n = \max(X_1, \ldots, X_n)$, and rejects the null hypothesis whenever $Y_n$ lies outside $[2.9, 4]$.
(What might be the rationale for this type of test?)
The power function for this test is given by
$$\pi(\theta) = \Pr(Y_n < 2.9 \mid \theta) + \Pr(Y_n > 4 \mid \theta)$$
- What is the power of the test if $\theta < 2.9$? (It is 1, since then $Y_n \le \theta < 2.9$ with probability 1.)
- When $\theta$ takes values between 2.9 and 4, the probability that any sample value is less than 2.9 is given by $\frac{2.9}{\theta}$, and therefore $\Pr(Y_n < 2.9 \mid \theta) = \left(\frac{2.9}{\theta}\right)^n$ and $\Pr(Y_n > 4 \mid \theta) = 0$. Therefore the power function is $\pi(\theta) = \left(\frac{2.9}{\theta}\right)^n$.
- When $\theta > 4$, $\pi(\theta) = \left(\frac{2.9}{\theta}\right)^n + 1 - \left(\frac{4}{\theta}\right)^n$.

The power graph: example 2

[Figure: the power function $\pi(\theta)$ of the test that rejects when $Y_n$ lies outside $[2.9, 4]$. It equals 1 for $\theta < 2.9$, falls as $(2.9/\theta)^n$ on $[2.9, 4]$, and rises back towards 1 for $\theta > 4$.]
The power function: example 3

Let $X \sim N(\mu, 100)$. To test $H_0: \mu = 80$ against $H_1: \mu > 80$, let the critical region be defined by $R = \{(x_1, x_2, \ldots, x_{25}): \bar{x} > 83\}$, where $\bar{x}$ is the sample mean of a random sample of size $n = 25$ from this distribution.

1. How is the power function $\pi(\mu)$ defined for this test?
The probability of rejecting the null is
$$\pi(\mu) = P(\bar{X} > 83) = P\left(\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} > \frac{83 - \mu}{2}\right) = 1 - \Phi\left(\frac{83 - \mu}{2}\right)$$

2. What is the size of this test?
This is simply the probability of a Type I error: $\alpha = 1 - \Phi\left(\frac{3}{2}\right) = .067 = \pi(80)$.

3. What are the values of $\pi(83)$ and $\pi(86)$?
$\pi(80)$ is given above; $\pi(83) = 0.5$, and $\pi(86) = 1 - \Phi\left(-\frac{3}{2}\right) = \Phi\left(\frac{3}{2}\right) = .933$.
stata 12: display normal(1.5)

4. Sketch the graph of the power function.
stata 12: twoway function 1-normal((83-x)/2), range (70 90)

5. What is the p-value corresponding to $\bar{x} = 83.41$? This is the smallest level of significance, $\alpha_0$, at which a given hypothesis would be rejected based on the observed outcome of $\bar{X}$.
Solution: The p-value is given by $\Pr(\bar{X} \ge 83.41) = 1 - \Phi\left(\frac{3.41}{2}\right) = .044$.
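The power function and p-value above can also be computed as follows (a Python sketch mirroring the Stata commands; assumes scipy):

```python
# Power function pi(mu) = Pr(sample mean > 83 | mu), with sigma/sqrt(n) = 10/5 = 2.
from scipy.stats import norm

def power(mu, c=83.0, se=2.0):
    return 1 - norm.cdf((c - mu) / se)

print(power(80), power(83), power(86))   # ~0.067, 0.5, 0.933
print(1 - norm.cdf((83.41 - 80) / 2))    # p-value ~0.044
```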

Testing simple hypotheses


Suppose that $\Omega_0$ and $\Omega_1$ contain only a single element each, and our null and alternative hypotheses are given by
$$H_0: \theta = \theta_0 \quad \text{and} \quad H_1: \theta = \theta_1$$
Denote by $f_i(x)$ the joint density function or p.f. of the observations in our sample under $H_i$:
$$f_i(x) = f(x_1|\theta_i) f(x_2|\theta_i) \ldots f(x_n|\theta_i)$$
Let the Type I and Type II error probabilities of a test $\delta$ be denoted by $\alpha(\delta)$ and $\beta(\delta)$ respectively:
$$\alpha(\delta) = \Pr(\text{Rejecting } H_0 \mid \theta = \theta_0) \quad \text{and} \quad \beta(\delta) = \Pr(\text{Not rejecting } H_0 \mid \theta = \theta_1)$$
By always accepting $H_0$, we achieve $\alpha(\delta) = 0$, but then $\beta(\delta) = 1$. The converse is true if we always reject $H_0$.
It turns out that we can find an optimal test which minimizes any linear combination of $\alpha(\delta)$ and $\beta(\delta)$.

Optimal tests for simple hypotheses


The following result gives us the test procedure that minimizes $a\alpha(\delta) + b\beta(\delta)$ for specified constants $a$ and $b$:

THEOREM: Let $\delta^*$ denote a test procedure such that the hypothesis $H_0$ is accepted if $af_0(x) > bf_1(x)$ and $H_1$ is accepted if $af_0(x) < bf_1(x)$. Either $H_0$ or $H_1$ may be accepted if $af_0(x) = bf_1(x)$. Then for any other test procedure $\delta$,
$$a\alpha(\delta^*) + b\beta(\delta^*) \le a\alpha(\delta) + b\beta(\delta)$$
So if we are minimizing the sum of the errors ($a = b = 1$), we reject whenever the likelihood ratio $\frac{f_1(x)}{f_0(x)} > 1$.

Proof (for discrete distributions):
$$a\alpha(\delta) + b\beta(\delta) = a\sum_{x \in R} f_0(x) + b\sum_{x \in R^c} f_1(x) = a\sum_{x \in R} f_0(x) + b\left(1 - \sum_{x \in R} f_1(x)\right) = b + \sum_{x \in R}\left[af_0(x) - bf_1(x)\right]$$
The desired function $a\alpha(\delta) + b\beta(\delta)$ will be minimized when the above summation is minimized. This will happen if the critical region includes only those points for which $af_0(x) - bf_1(x) < 0$. We therefore reject $H_0$ when the likelihood ratio exceeds $\frac{a}{b}$.

Minimizing $\beta(\delta)$, given $\alpha_0$

If we fix a level of significance $\alpha_0$, we want a test procedure that minimizes $\beta(\delta)$, the Type II error, subject to $\alpha(\delta) \le \alpha_0$.

The Neyman-Pearson Lemma: Let $\delta^*$ denote a test procedure such that, for some constant $k$, the hypothesis $H_0$ is accepted if $f_0(x) > kf_1(x)$ and $H_1$ is accepted if $f_0(x) < kf_1(x)$. Either $H_0$ or $H_1$ may be accepted if $f_0(x) = kf_1(x)$. If $\delta$ is any other test procedure such that $\alpha(\delta) \le \alpha(\delta^*)$, then it follows that $\beta(\delta) \ge \beta(\delta^*)$. Furthermore, if $\alpha(\delta) < \alpha(\delta^*)$ then $\beta(\delta) > \beta(\delta^*)$.

This result implies that if we set a level of significance $\alpha_0 = .05$, we should try to find a value of $k$ for which $\alpha(\delta^*) = .05$. This procedure will then have the minimum possible value of $\beta(\delta)$.

Proof (for discrete distributions): From the previous theorem we know that $\alpha(\delta^*) + k\beta(\delta^*) \le \alpha(\delta) + k\beta(\delta)$. So if $\alpha(\delta) \le \alpha(\delta^*)$, it follows that $\beta(\delta) \ge \beta(\delta^*)$.

Neyman-Pearson Lemma: example 1

Let $X_1, \ldots, X_n$ be a sample from a normal distribution with unknown mean $\mu$ and variance 1.
$$H_0: \mu = 0 \quad \text{and} \quad H_1: \mu = 1$$
We will find a test procedure which keeps $\alpha = .05$ and minimizes $\beta$. We have
$$f_0(x) = (2\pi)^{-\frac{n}{2}}\, e^{-\frac{1}{2}\sum x_i^2} \quad \text{and} \quad f_1(x) = (2\pi)^{-\frac{n}{2}}\, e^{-\frac{1}{2}\sum (x_i - 1)^2}$$
$$\frac{f_1(x)}{f_0(x)} = e^{n\left(\bar{x}_n - \frac{1}{2}\right)}$$
The lemma tells us to use a procedure which rejects $H_0$ when the likelihood ratio is greater than a constant $k$. This condition can be rewritten in terms of our sample mean: $\bar{x}_n > \frac{1}{2} + \frac{1}{n}\log k = k'$.
We want to find a value of $k'$ for which $\Pr(\bar{X}_n > k' \mid \mu = 0) = .05$ or, alternatively, $\Pr(Z > k'\sqrt{n}) = .05$ (why?).
This gives us $k'\sqrt{n} = 1.645$, or $k' = \frac{1.645}{\sqrt{n}}$. Under this procedure, the Type II error $\beta(\delta^*)$ is given by
$$\beta(\delta^*) = \Pr\left(\bar{X}_n < \frac{1.645}{\sqrt{n}}\ \Big|\ \mu = 1\right) = \Pr\left(Z < 1.645 - \sqrt{n}\right)$$
For $n = 9$, we have $\beta(\delta^*) = 0.0877$.
If instead we are interested in choosing $\delta$ to minimize $2\alpha(\delta) + \beta(\delta)$, we choose $k' = \frac{1}{2} + \frac{1}{n}\log 2$, so with $n = 9$ our optimal procedure rejects $H_0$ when $\bar{X}_n > 0.577$. In this case, $\alpha(\delta') = 0.0417$ (display 1-normal(.577*3)) and $\beta(\delta') = 0.1022$ (display normal((.577-1)*3)), and the minimized value of $2\alpha(\delta) + \beta(\delta)$ is 0.186.
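The error probabilities quoted above can be reproduced as follows (a sketch; assumes numpy/scipy):

```python
# alpha and beta for the two cutoffs in example 1, with n = 9.
import numpy as np
from scipy.stats import norm

n = 9
print(norm.cdf(1.645 - np.sqrt(n)))   # beta ~0.0877 when alpha is held at .05

k = 0.5 + np.log(2) / n               # cutoff minimizing 2*alpha + beta
alpha = 1 - norm.cdf(k * np.sqrt(n))  # ~0.0417
beta = norm.cdf((k - 1) * np.sqrt(n)) # ~0.1022
print(alpha, beta, 2 * alpha + beta)  # minimized value ~0.186
```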

Neyman-Pearson Lemma: example 2

Let $X_1, \ldots, X_n$ be a sample from a Bernoulli distribution.
$$H_0: p = 0.2 \quad \text{and} \quad H_1: p = 0.4$$
We will find a test procedure which keeps $\alpha = .05$ and minimizes $\beta$. Let $Y = \sum X_i$ and $y$ its realization. We have
$$f_0(x) = (0.2)^y (0.8)^{n - y} \quad \text{and} \quad f_1(x) = (0.4)^y (0.6)^{n - y}$$
$$\frac{f_1(x)}{f_0(x)} = \left(\frac{3}{4}\right)^n \left(\frac{8}{3}\right)^y$$
The lemma tells us to use a procedure which rejects $H_0$ when the likelihood ratio is greater than a constant $k$. This condition can be rewritten in terms of $y$: reject when
$$y > \frac{\log k + n\log\frac{4}{3}}{\log\frac{8}{3}} = k'$$
Now we would like to find $k'$ such that $\Pr(Y > k' \mid p = 0.2) = .05$. We may not, however, be able to do this exactly, given that $Y$ is discrete. If $n = 10$, we find that $\Pr(Y > 3 \mid p = 0.2) = .121$ and $\Pr(Y > 4 \mid p = 0.2) = .033$ (display 1-binomial(10,4,.2)), so we can decide to set these probabilities as the values of $\alpha(\delta)$ and use the corresponding values of $k'$ for our test.

Optimal tests for composite hypotheses


Suppose now that $\Omega_1$ contains multiple elements and we are considering tests at the level of significance $\alpha_0$, i.e. test procedures for which
$$\pi(\theta|\delta) \le \alpha_0 \quad \text{for all } \theta \in \Omega_0$$
If $\theta_1$ and $\theta_2$ are two different values of $\theta$ in $\Omega_1$, there may be no single test procedure that maximizes the power function for all values of $\theta \in \Omega_1$.
When such a test does exist, it is called the uniformly most powerful test, or a UMP test.
Definition: A test procedure $\delta^*$ is a UMP test at the level of significance $\alpha_0$ if $\alpha(\delta^*) \le \alpha_0$ and, for every value of $\theta \in \Omega_1$, $\pi(\theta|\delta) \le \pi(\theta|\delta^*)$ for any other admissible test $\delta$.
We will now look at a sufficient condition for such a test to exist.

Monotone likelihood ratios


Definition: Let $f_n(x|\theta)$ denote the joint density or joint p.f. of the observations $X_1, \ldots, X_n$, and $T = r(X)$ some function of the vector $X$. Then $f_n(x|\theta)$ has a monotone likelihood ratio (MLR) in the statistic $T$ if, for any two values $\theta_1$ and $\theta_2$ with $\theta_1 < \theta_2$, the ratio $\frac{f_n(x|\theta_2)}{f_n(x|\theta_1)}$ depends on the vector $x$ only through the function $r(x)$, and this ratio is an increasing function of $r(x)$ over the range of possible values of $r(x)$.

Example 1: Consider a sample of size $n$ from a Bernoulli distribution for which the parameter $p$ is unknown. Let $y = \sum x_i$. Then $f_n(x|p) = p^y(1 - p)^{n - y}$ and for $p_1 < p_2$ the ratio
$$\frac{f_n(x|p_2)}{f_n(x|p_1)} = \left[\frac{p_2(1 - p_1)}{p_1(1 - p_2)}\right]^y \left(\frac{1 - p_2}{1 - p_1}\right)^n$$
is increasing in $y$, so $f_n(x|p)$ has a monotone likelihood ratio in the statistic $Y = \sum_{i=1}^n X_i$.

Example 2: Consider a sample of size $n$ from a normal distribution for which the mean $\mu$ is unknown and the variance $\sigma^2$ is known. The joint p.d.f. is:
$$f_n(x|\mu) = \frac{1}{(2\pi\sigma^2)^{n/2}}\, e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2}$$
$$\frac{f_n(x|\mu_2)}{f_n(x|\mu_1)} = e^{\frac{n(\mu_2 - \mu_1)}{\sigma^2}\left[\bar{x}_n - \frac{1}{2}(\mu_2 + \mu_1)\right]}$$
This is increasing in $\bar{x}_n$; therefore $f_n(x|\mu)$ has an MLR in the statistic $\bar{X}_n$.

One-sided alternatives

Suppose that θ0 is an element of the parameter space Ω and consider the following hypotheses:

H0 : θ ≤ θ0    H1 : θ > θ0

We have the following result:
Theorem: Suppose that fn(x|θ) has a monotone likelihood ratio in the statistic T = r(X), and let c be a constant such that

Pr(T ≥ c | θ = θ0) = α0

Then the test procedure which rejects H0 if T ≥ c is a UMP test of the above hypotheses at the level of significance α0.
If instead of the above hypotheses we have

H0 : θ ≥ θ0    H1 : θ < θ0

our UMP test will now set

Pr(T ≤ c | θ = θ0) = α0

In the first case the power function will be monotonically increasing in θ, while in the second case it will be decreasing.

Example: Consider again a sample from a normal distribution with unknown mean μ and known variance σ², and the hypotheses (9.3.15): H0 : μ ≤ μ0 against H1 : μ > μ0. We first exhibit a UMP test of these hypotheses with size α0 and then determine its power function. It is known from Example 9.3.5 that the joint p.d.f. of X1, . . . , Xn has an increasing monotone likelihood ratio in the statistic X̄n. Therefore, by Theorem 9.3.1, a test procedure δ1 that rejects H0 when X̄n ≥ c is a UMP test of the hypotheses (9.3.15). The size of this UMP test is α0 = Pr(X̄n ≥ c | μ = μ0). Since X̄n has a continuous distribution, c is the 1 − α0 quantile of the distribution of X̄n given μ = μ0, that is, the 1 − α0 quantile of the normal distribution with mean μ0 and variance σ²/n. As we learned in Chapter 5, this quantile is

c = μ0 + Φ⁻¹(1 − α0) σ n^(−1/2),    (9.3.16)

where Φ⁻¹ is the quantile function of the standard normal distribution. For simplicity, we shall let z_α0 = Φ⁻¹(1 − α0) for the rest of this example.
We shall now determine the power function π(μ|δ1) of this UMP test. By definition,

π(μ|δ1) = Pr(Rejecting H0 | μ) = Pr(X̄n ≥ μ0 + z_α0 σ n^(−1/2) | μ).    (9.3.17)

For every value of μ, the random variable Z = n^(1/2)(X̄n − μ)/σ will have the standard normal distribution. Therefore, if Φ denotes the c.d.f. of the standard normal distribution, then

π(μ|δ1) = Pr(Z ≥ z_α0 + n^(1/2)(μ0 − μ)/σ) = 1 − Φ(z_α0 + n^(1/2)(μ0 − μ)/σ) = Φ(n^(1/2)(μ − μ0)/σ − z_α0).    (9.3.18)

The power function π(μ|δ1) is sketched in Fig. 9.6.
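Both the cutoff (9.3.16) and the power function (9.3.18) are one-line computations (a minimal Stata sketch in the style of the display calls earlier in these notes; the values μ0 = 0, σ = 1, n = 9, α0 = .05 are illustrative assumptions, not fixed by the text):

    * cutoff: c = mu0 + invnormal(1 - alpha0)*sigma/sqrt(n)
    display 0 + invnormal(.95)*1/sqrt(9)          // c = 1.645/3 = .548
    * power at mu: normal(sqrt(n)*(mu - mu0)/sigma - invnormal(1 - alpha0))
    display normal(3*(0 - 0)  - invnormal(.95))   // at mu = mu0: .05, the size
    display normal(3*(.5 - 0) - invnormal(.95))   // at mu = .5:  about .44
    display normal(3*(1 - 0)  - invnormal(.95))   // at mu = 1:   about .91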
In each of the pairs of hypotheses (9.3.8), (9.3.14), and (9.3.15), the alternative hypothesis H1 is called a one-sided alternative because the set of possible values of the parameter under H1 lies entirely on one side of the set of possible values under the null hypothesis H0. In particular, for the hypotheses (9.3.8), (9.3.14), or (9.3.15), every possible value of the parameter under H1 is larger than every possible value under H0.

Power functions of UMP tests

The following figures show power functions for the one-sided alternative hypotheses discussed above, for the case of a sample from a normal distribution with unknown mean:

[Figures: the power function of the UMP test of the hypotheses (9.3.15), H0 : μ ≤ μ0 against H1 : μ > μ0, is increasing in μ; the power function of the UMP test of the hypotheses (9.3.19), H0 : μ ≥ μ0 against H1 : μ < μ0, is decreasing in μ. Each power function equals α0 at μ = μ0.]


Two-sided alternatives

No UMP test exists in these cases: a test which does very well for an alternative θ2 > θ0 may do very badly for another alternative θ1 < θ0.

[Figure 9.8: the power functions π(μ|δ1), π(μ|δ2), π(μ|δ3), and π(μ|δ4) of four test procedures, plotted against μ around μ0.]

The power functions π(μ|δ3) and π(μ|δ4) are sketched in Fig. 9.8, along with the power functions π(μ|δ1) and π(μ|δ2), which had previously been sketched in Figs. 9.6 and 9.7.
As the values of c1 and c2 in Eq. (9.4.2) or Eq. (9.4.3) are decreased, the power function π(μ|δ) will become smaller for μ < μ0 and larger for μ > μ0. For α0 = 0.05, the limiting case is obtained by choosing c1 = −∞ and c2 = μ0 + 1.645σ n^(−1/2). The test procedure defined by these values is just δ1. Similarly, as the values of c1 and c2 in Eq. (9.4.2) or Eq. (9.4.3) are increased, the power function π(μ|δ) will become larger for μ < μ0 and smaller for μ > μ0. For α0 = 0.05, the limiting case is obtained by choosing c2 = ∞ and c1 = μ0 − 1.645σ n^(−1/2). The test procedure defined by these values is just δ2. Something between these two extreme limiting cases seems appropriate for the hypotheses (9.4.1).
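A symmetric intermediate choice is the equal-tailed test, whose cutoffs and power are again one-liners (a minimal Stata sketch; μ0 = 0, σ = 1, n = 9, α0 = .05 are illustrative assumptions, not values fixed by the text):

    * equal-tailed cutoffs: c1, c2 = mu0 -/+ invnormal(1 - alpha0/2)*sigma/sqrt(n)
    display -invnormal(.975)/sqrt(9)   // c1 = -.653
    display  invnormal(.975)/sqrt(9)   // c2 =  .653
    * power at mu = .5: Pr(Xbar <= c1 | mu) + Pr(Xbar >= c2 | mu)
    display normal(3*(-.653 - .5)) + 1 - normal(3*(.653 - .5))   // about .32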


Selection of the Test Procedure


For a given sample size n, the values of the constants c1 and c2 in Eq. (9.4.2) should be chosen so that the size and shape of the power function are appropriate for the particular problem to be solved. In some problems, it is important not to reject the null hypothesis unless the data strongly indicate that μ differs greatly from μ0. In such problems, a small value of α0 should be used. In other problems, not rejecting the null hypothesis H0 when μ is slightly larger than μ0 is a more serious error than not rejecting H0 when μ is slightly less than μ0. Then it is better to select a test having a power function such as π(μ|δ4) in Fig. 9.8 than to select a test having a symmetric function such as π(μ|δ3).
In general, the choice of a particular test procedure in a given problem should be based both on the cost of rejecting H0 when μ = μ0 and on the cost, for each possible value of μ, of not rejecting H0 when μ ≠ μ0. Also, when a test is being selected, the relative likelihoods of different values of μ should be considered. For example, if it is more likely that μ will be greater than μ0 than that μ will be less than μ0, then it is better to select a test for which the power function is large when μ > μ0, and not so large when μ < μ0, than to select one for which these relations are reversed.
Example 9.4.2 (Egyptian Skulls): Suppose that, in Example 9.4.1, it is equally important to reject the null hypothesis that the mean breadth equals 140 when μ < 140 as when μ > 140.
