
Microeconomic Theory I¹

Terence Johnson
tjohns20@nd.edu
nd.edu/tjohns20/micro601.html
Updated 11/26/2012

¹Disclaimer: These notes have plenty of errors and misstatements, and are not intended to substitute for a legitimate textbook, like Simon and Blume or Chiang and Wainwright, or MWG and Varian. They are simply meant to make it easier to follow what is going on in class, and to help remind me what I'm supposed to teach from year to year. I am making them available to you as a courtesy.
Contents

I  Mathematics for Economics  4

1  Introduction  5
   1.1  Economic Models  5
   1.2  Perfectly Competitive Markets  6
        1.2.1  Price-Taking Equilibrium  6
   1.3  Basic questions of economic analysis  8

2  Basic Math and Logic  11
   2.1  Set Theory  11
   2.2  Functions and Correspondences  12
   2.3  Real Numbers and Absolute Value  14
   2.4  Logic and Methods of Proof  15
        2.4.1  Examples of Proofs  18

II  Optimization and Comparative Statics in R  24

3  Basics of R  25
        3.0.2  Maximization and Minimization  26
   3.1  Existence of Maxima/Maximizers  28
   3.2  Intervals and Sequences  30
   3.3  Continuity  32
   3.4  The Extreme Value Theorem  33
   3.5  Derivatives  34
        3.5.1  Non-differentiability  35
   3.6  Taylor Series  36
   3.7  Partial Derivatives  38
        3.7.1  Differentiation with Multiple Arguments and Chain Rules  39

4  Necessary and Sufficient Conditions for Maximization  43
   4.1  First-Order Necessary Conditions  43
   4.2  Second-Order Sufficient Conditions  45

5  The Implicit Function Theorem  50
   5.1  The Implicit Function Theorem  51
   5.2  Maximization and Comparative Statics  52

6  The Envelope Theorem  59
   6.1  The Envelope Theorem  60

7  Concavity and Convexity  66
   7.1  Concavity  67
   7.2  Convexity  69

III  Optimization and Comparative Statics in R^N  71

8  Basics of R^N  72
   8.1  Intervals and Topology  74
   8.2  Continuity  77
   8.3  The Extreme Value Theorem  77
   8.4  Multivariate Calculus  78
   8.5  Taylor Polynomials in R^n  81
   8.6  Definite Matrices  82

9  Unconstrained Optimization  88
   9.1  First-Order Necessary Conditions  89
   9.2  Second-Order Sufficient Conditions  91
   9.3  Comparative Statics  95
   9.4  The Envelope Theorem  97

10  Equality Constrained Optimization Problems  99
    10.1  Two useful but sub-optimal approaches  100
         10.1.1  Using the implicit function theorem  100
         10.1.2  A more geometric approach  101
    10.2  First-Order Necessary Conditions  101
         10.2.1  The Geometry of Constrained Maximization  105
         10.2.2  The Lagrange Multiplier  108
         10.2.3  The Constraint Qualification  108
    10.3  Second-Order Sufficient Conditions  110
    10.4  Comparative Statics  113
    10.5  The Envelope Theorem  116

11  Inequality-Constrained Maximization Problems  120
    11.1  First-Order Necessary Conditions  122
         11.1.1  The sign of the KT multipliers  128
    11.2  Second-Order Sufficient Conditions and the IFT  128

12  Concavity and Quasi-Concavity  131
    12.1  Some Geometry of Convex Sets  131
    12.2  Concave Programming  133
    12.3  Quasi-Concave Programming  135
    12.4  Convexity and Quasi-Convexity  137
    12.5  Comparative Statics with Concavity and Convexity  139
    12.6  Convex Sets and Convex Functions  140

IV  Classical Consumer and Firm Theory  146

13  Choice Theory  147
    13.1  Preference Relations  147
    13.2  Utility Functions  149
    13.3  Consistent Decision-Making and WARP  151

14  Consumer Behavior  155
    14.1  Consumer Choice  155
    14.2  Walrasian Demand  156
    14.3  The Law of Demand  157
    14.4  The Indirect Utility Function  159

15  Consumer Behavior, II  160
    15.1  The Expenditure Minimization Problem  160
    15.2  The Slutsky Equation and Roy's Identity  163
    15.3  Example  164

16  Welfare Analysis  167
    16.1  Taxes  169
    16.2  Consumer Surplus vs. CV and EV  170

17  Production  171
    17.1  Production Sets  171
    17.2  Profit Maximization  173
    17.3  The Law of Supply  174
    17.4  Cost Minimization  175
    17.5  A Cobb-Douglas Example  177
    17.6  Scale  178
    17.7  Efficient Production  179

18  Aggregation  182
    18.1  Firms  182
    18.2  Consumers  183

V  Other Topics  187

19  Choice Under Uncertainty  188
    19.1  Probability and Random Variables  188
    19.2  Choice Theory with Uncertainty  191
    19.3  Risk Aversion  197
    19.4  Stochastic Orders and Comparative Statics  203
    19.5  Exercises  206
Part I
Mathematics for Economics
Chapter 1
Introduction
If you learn everything in this math handout (text and exercises) you will be able to work through most of MWG and pass comprehensive exams with reasonably high likelihood. The math presented here is, honestly, what I think you really, truly need to know. I think you need to know these things not just to pass comps, but so that you can competently skim an important Econometrica paper to learn a new estimator you need for your research, or pick up a book on solving non-linear equations numerically written by a non-economist and still be able to digest the main ideas. Tools like Taylor series come up in asymptotic analysis of estimators, in proving necessity and sufficiency of conditions for a point to maximize a function, and in the theory of functional approximation, which touch every field from applied micro to empirical macro to theoretical micro. Economists use these tools all the time, and not learning them will handicap your ability to continue acquiring skills.
1.1 Economic Models

There are five ingredients to an economic model:

Who are the agents? (Players)
What can the agents decide to do, and what outcomes arise as a consequence of the agents' choices? (Actions)
What are the agents' preferences over outcomes? (Payoffs)
When do the agents make decisions? (Timing)
What do the agents know when they decide how to act? (Information)
Once the above ingredients have been fixed, the question then arises how agents will behave. Economics has taken the hard stance that agents will act deliberately to influence the outcome in their favor, or that each agent maximizes his own payoff. This does not mean that agents cannot care, either for selfish reasons or per se, about the welfare or happiness of other agents in the economy. It simply means that, when faced with a decision, agents act deliberately to get the best outcome available to them according to their own preferences, and not as, say, a wild animal in the grip of terror (such a claim is controversial, at least in the social sciences). Once the model has been fixed, the rest of the analysis is largely an application of mathematical and statistical reasoning.
1.2 Perfectly Competitive Markets

The foundational economic model is the classical Marshallian or partial equilibrium market. This is the model you are probably most familiar with from undergraduate economics. In a price-taking market, there are two agents: a representative firm and a representative consumer.

The firm's goal is to maximize profits. It chooses how much of its product to make, q, taking as given the price, p, and its costs, C(q). Then the firm's profits are

π(q) = pq − C(q)

The consumer's goal is to maximize his utility. The consumer chooses how much quantity to purchase, q, taking as given the price, p, and wealth, w. The consumer's utility function takes the form

u(q, m)

where m is the amount of money spent on goods other than q. The consumer's utility function is quasi-linear if

u(q, m) = v(q) + m

where v(q) is positive, increasing, and has diminishing marginal benefit. Then the consumer is trying to solve

max_{q,m}  v(q) + m

subject to a budget constraint w = pq + m.
An outcome is an allocation (q*, m*) of goods giving the amount traded between the consumer and the firm, q*, and the amount of money the consumer spends on other goods, m*.

The allocation will be decided by finding the market-clearing price and quantity. The consumer is asked to report how much quantity he demands, q^D(p), at each price p, and the firm is asked how much quantity q^S(p) it is willing to supply at each price p. The market clears where q^D(p*) = q^S(p*) = q*, which is the market-clearing quantity; the market-clearing price is p*.
The market meets once, and all of the information above is known by all the agents. Then the model is

Agents: The consumer and firm.
Actions: The consumer reports a demand curve q^D(p), and the firm reports a supply curve q^S(p), both taking the price as given.
Payoffs: The consumer and firm trade the market-clearing quantity at the market-clearing price, giving the consumer a payoff of u(q*, w − p*q*) and the firm a payoff of p*q* − C(q*).
Information: The consumer and firm both know everything about the market.
Timing: The market meets once, and the demand and supply curves are submitted simultaneously.
1.2.1 Price-Taking Equilibrium

Let's make further assumptions so that the model is extremely easy to solve. First, let C(q) = (c/2)q², and let v(q) = b log(q).
The Supply Side

Since firms are maximizing profits, their objective is

max_q  pq − (c/2)q²

The first-order necessary condition for the firm is

p − c q^S = 0

and the second-order sufficient condition for the firm is

−c < 0

which is automatically satisfied. Solving the FONC gives the supply curve,

q^S(p) = p/c

This expresses the firm's willingness to produce the good as a function of its costs, parametrized by c, and the price it receives, p.
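The supply-side derivation can be sanity-checked numerically: at the FONC candidate q = p/c, profit should beat profit at nearby quantities, since the objective is concave in q. A minimal sketch (the parameter values below are arbitrary illustrative choices, not from the text):

```python
# Numerical check of the firm's problem with C(q) = (c/2) q^2.
# p and c are arbitrary illustrative values.
p, c = 2.0, 0.5

def profit(q):
    return p * q - 0.5 * c * q ** 2

q_supply = p / c  # candidate from the FONC: p - c q = 0

# The candidate beats nearby quantities (the SOSC says profit is concave)
for dq in (-0.1, 0.1):
    assert profit(q_supply) > profit(q_supply + dq)

print(q_supply)  # 4.0
```

This is not a proof, just a spot check of the first- and second-order conditions at one parameter point.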
The Demand Side

Consumers act to maximize utility subject to their budget constraint, or

max_q  b log(q) + m

subject to w = pq + m. To handle the constraint, re-write it as m = w − pq and substitute it into the objective function to get

max_q  b log(q) + w − pq

Then the first-order necessary condition for the consumer is

b/q^D − p = 0

and the second-order sufficient condition for the consumer is

−b/(q^D)² < 0

which is automatically satisfied. Solving the FONC gives the demand curve,

q^D(p) = b/p

This expresses how much quantity the consumer would like to purchase as a function of preferences b and the price p.
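The same kind of numerical spot check works for the consumer's substituted problem b log(q) + w − pq (again, the parameter values are arbitrary illustrative choices):

```python
import math

# b, p, w are arbitrary illustrative values, not from the text
b, p, w = 2.0, 1.0, 10.0

def utility(q):
    # substituted objective: b log(q) + w - p q
    return b * math.log(q) + w - p * q

q_demand = b / p  # candidate from the FONC: b/q - p = 0

# The candidate beats nearby quantities
for dq in (-0.1, 0.1):
    assert utility(q_demand) > utility(q_demand + dq)

print(q_demand)  # 2.0
```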
Market-Clearing Equilibrium

In equilibrium, supply equals demand, and the market-clearing price p* and market-clearing quantity q* are determined as

q^D(p*) = q^S(p*) = q*

or

b/p* = p*/c = q*

Solving for p* yields

p* = √(bc)

and

q* = √(b/c)
[Figure: Market Equilibrium]
Notice that the variables of interest, p* and q*, are expressed completely in terms of the parameters of the model that are outside of any agent's control, b and c. We can now vary b and c freely to see how the equilibrium price and quantity vary, and study how changes in tastes or technology will change behavior.
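These closed forms are easy to verify numerically: the price √(bc) should equate the demand and supply curves derived above, and raising b should raise both p* and q*. A sketch with arbitrary illustrative parameters:

```python
import math

b, c = 2.0, 0.5  # arbitrary illustrative parameters

demand = lambda p: b / p   # q^D(p)
supply = lambda p: p / c   # q^S(p)

p_star = math.sqrt(b * c)  # market-clearing price
q_star = math.sqrt(b / c)  # market-clearing quantity

# The candidate price clears the market at the candidate quantity
assert abs(demand(p_star) - supply(p_star)) < 1e-12
assert abs(demand(p_star) - q_star) < 1e-12

# A taste shift (higher b) raises both the price and the quantity
b_high = 3.0
assert math.sqrt(b_high * c) > p_star
assert math.sqrt(b_high / c) > q_star
```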
1.3 Basic questions of economic analysis

After building a model, two mathematical questions naturally arise:

How do we know solutions exist to the agents' maximization problems? (the Weierstrass theorem)
How do we find solutions? (first-order necessary conditions, second-order sufficient conditions)

as well as two economic questions:

How does an agent's behavior respond to changes in the economic environment? (the implicit function theorem)
How does an agent's payoff respond to changes in the economic environment? (the envelope theorem)
In Micro I, we learn in detail the nuts and bolts of solving agents' problems in isolation (optimization theory). In Micro II, you learn, through general equilibrium and game theory, how to solve for behavior when many agents are interacting at once (equilibrium and fixed-point theory).

The basic methodology for solving problems in Micro I is:

Check that a solution exists to the problem using the Weierstrass theorem
Build a candidate list of potential solutions using the appropriate first-order necessary conditions
Find the global maximizer by comparing candidate solutions directly by computing their payoff, or use the appropriate second-order sufficient conditions to verify which are maxima
Study how an agent's behavior changes when economic circumstances change using the implicit function theorem
Study how an agent's payoff changes when economic circumstances change using the envelope theorem
Actually, the above is a little misleading. There are many versions of the Weierstrass theorem, the FONCs, the SOSCs, the IFT and the envelope theorem. The appropriate version depends on the kind of problem you face: Is there a single choice variable or many? Are there equality or inequality constraints? And so on.

The entire class, including the parts of Mas-Colell, Whinston, and Green that we cover, is nothing more than the application of the above algorithm to specific problems. At the end of the course, it is imperative that you understand the above algorithm and know how to implement it, to pass the comprehensive exams.

We will start by studying the algorithm in detail for one-dimensional maximization problems, in which agents only choose one variable. Then we will generalize it to multi-dimensional maximization problems, in which agents choose many things at once.
Exercises
The exercises refer to the simple partial equilibrium model of Section 1.2.
1. [Basics] As b and c change, how do the supply and demand curves shift? How do the
equilibrium price and quantity change? Sketch graphs as well as compute derivatives.
2. [Taxes] Suppose there is a tax t on the good q. A portion α of t is paid by the consumer for each unit purchased, and a portion 1 − α of t is paid by the firm for each unit sold. How does the consumer's demand function depend on α? How does the firm's supply function depend on α? How do the equilibrium price and quantity p* and q* depend on α? If t goes up, how are the market-clearing price and quantity affected? Sketch a graph of the equilibrium in this market.
3. [Long-run equilibrium] (i) Suppose that there are K firms in the industry with short-run cost function C(q) = (c/2)q², so that each firm produces q, but aggregate supply is Kq = Q. Consumers maximize b log(Q) + m subject to w = pQ + m. Solve for the market-clearing price and quantity for each K and the short-run profits of the firms. (ii) Suppose firms have to pay a fixed cost F to enter the industry. Find the equilibrium number of firms in the long run, K*, if there is entry as long as profits are strictly positive. How does K* vary in F, b and c? How do the market-clearing price and quantity vary in the long run with F?
4. [Firm cost structure] Suppose a price-taking firm's costs are C(q) = c(q) + F, where c(0) = 0, c′(0) > 0 and c″(0) > 0, and F is a fixed cost. (i) Show that marginal cost C′(q) intersects average cost C(q)/q at the minimum of C(q)/q. This is the efficient scale. (ii) How does the firm's optimal choice of q depend on F?
5. [Monopoly] Suppose now that the consumer's utility function is

2b√q + m

and the budget constraint is w = pq + m. Suppose there is a single firm with cost function C(q) = cq. (i) Derive the demand curve q^D(p) and inverse demand curve p^D(q). (ii) If the monopolist recognizes its influence on the market-clearing price and quantity, it will maximize

max_q  p^D(q)q − cq

or

max_p  q^D(p)(p − c)

Show that the solutions to these problems are the same. If total revenue is p q^D(p), show that its derivative, the marginal revenue curve, lies below the inverse demand curve, and compare the monopolist's FONC with a price-taking firm's FONC in the same market.
6. [Efficiency] A benevolent, utilitarian social planner (or well-meaning government) would choose the market-clearing quantity to maximize the sum of the two agents' payoffs, or

max_{p,q}  (v(q) + w − pq) + (pq − C(q))

Show that this outcome is the same as that selected by the perfectly competitive market. Conclude that a competitive equilibrium achieves the same outcome that a benevolent government would pick. Give an argument for why a government trying to intervene in a decentralized market would then probably achieve a worse outcome. Give an argument why a decentralized market would probably achieve a worse outcome than a well-meaning government. Show that the allocation selected by the decentralized market and utilitarian social planner is not the allocation selected by a monopolist. Sketch a graph of the situation.

Congratulations, you are now qualified to be a micro TA!
Chapter 2
Basic Math and Logic
These are basic definitions that appear in all of mathematics and modern economics, but reviewing them doesn't hurt.
2.1 Set Theory
Definition 2.1.1 A set A is a collection of elements. If x is an element or member of A, then x ∈ A. If x is not in A, then x ∉ A. If all the elements of A and B are the same, A = B. The set with no elements is called the empty set, ∅.

We often enumerate sets by collecting the elements between braces, as

F = {Apples, Bananas, Pears}
F′ = {Cowboys, Bears, Chargers, Colts}
We can build up and break down sets as follows:

Definition 2.1.2 Let A, B and C be sets.

A is a subset of B if all the elements of A are elements of B, written A ⊆ B; if there exists an element x ∈ A but x ∉ B, we can also write A ⊄ B.

If A ⊆ B and B ⊆ A, then A = B.

The set of all elements in either A or B is the union of A and B, written A ∪ B.

The set of all elements in both A and B is the intersection of A and B, written A ∩ B. If A ∩ B = ∅, then A and B are disjoint.

Suppose A ⊆ B. The set of elements in B but not in A is the complement of A in B, written A^c.

These precisely define all the normal set operations like union and intersection, and give exact meaning to the symbol A = B. It's easy to see that operations like union and intersection are associative (check) and commutative (check). But how does taking the complement of a union or intersection behave?
Theorem 2.1.3 (De Morgan's Laws) For any sets A and B that are both subsets of X,

(A ∪ B)^c = A^c ∩ B^c

and

(A ∩ B)^c = A^c ∪ B^c,

where the complement is taken with respect to X.
Proof Suppose x ∈ (A ∪ B)^c. Then x is not in the union of A and B, so x is in neither A nor B. Therefore, x must be in the complement of both A and B, so x ∈ A^c ∩ B^c. This shows that (A ∪ B)^c ⊆ A^c ∩ B^c.

Suppose x ∈ A^c ∩ B^c. Then x is contained in the complement of both A and B, so it is not a member of either set, so it is not a member of the union A ∪ B; that implies x ∈ (A ∪ B)^c. This shows that A^c ∩ B^c ⊆ (A ∪ B)^c. The second law follows from a symmetric argument.
If you ever find yourself confused by a complicated set relation ((A ∪ B ∩ C)^c ∩ D . . .), draw a Venn diagram and try writing in words what is going on.

We are interested in defining relationships between sets. The easiest example is a function, f : X → Y, which assigns a unique element of Y to at least some elements of X. However, in economics we will have to study some more exotic objects, so it pays to start slowly in defining functions, correspondences, and relations.
Definition 2.1.4 The (Cartesian) product of two non-empty sets X and Y is the set of all ordered pairs (x, y) with x ∈ X and y ∈ Y.

The product of two sets X and Y is often written X × Y. If it's just one set copied over and over, X × X × . . . × X = X^n, and an ordered tuple would be (x₁, x₂, . . . , xₙ). For example, R × R = R² is the product of the real numbers with itself, which is just the plane. If we have a set of sets, {X_i}_{i=1}^N, we often write ×_{i=1}^N X_i or ∏_{i=1}^N X_i.
Example Suppose two agents, A and B, are playing the following game: If A and B both choose heads, or HH, or both choose tails, or TT, agent B pays agent A a dollar. If one player chooses heads and the other chooses tails (or vice versa), agent A pays agent B a dollar. Then A and B both have the strategy sets S_A = S_B = {H, T}, and an outcome is an ordered pair (s_A, s_B) ∈ S_A × S_B. This game is called matching pennies.
Besides a Cartesian product, we can also take a space and investigate all of its subsets:

Definition 2.1.5 The power set of A is the set of all subsets of A, often written 2^A, since it has 2^|A| members.

For example, take A = {a, b}. Then the power set of A is ∅, {a}, {b}, and {a, b}.
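A short way to enumerate a power set and confirm the 2^|A| count (a sketch using Python's itertools; the choice of A is arbitrary):

```python
from itertools import combinations

def power_set(A):
    """All subsets of A, returned as frozensets."""
    elems = list(A)
    return [frozenset(c)
            for r in range(len(elems) + 1)   # subset sizes 0, 1, ..., |A|
            for c in combinations(elems, r)]

P = power_set({"a", "b"})
print(len(P))  # 4, i.e. 2**|A| with |A| = 2
assert frozenset() in P and frozenset({"a", "b"}) in P
```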
2.2 Functions and Correspondences

Definition 2.2.1 Let X and Y be sets. A function is a rule that assigns a single element y ∈ Y to each element x ∈ X; this is written y = f(x) or f : X → Y. The set X is called the domain, and the set Y is called the range. Let A ⊆ X; then the image of A is the set f(A) = {y : ∃x ∈ A, y = f(x)}.

Definition 2.2.2 A function f : X → Y is injective or one-to-one if for every x₁, x₂ ∈ X, x₁ ≠ x₂ implies f(x₁) ≠ f(x₂). A function f : X → Y is surjective or onto if for all y ∈ Y, there exists an x ∈ X such that f(x) = y.

Definition 2.2.3 A function is invertible if for every y ∈ Y, there is a unique x ∈ X for which y = f(x).

Note that the definition of invertible is extremely precise: For every y ∈ Y, there is a unique x ∈ X for which y = f(x).

Do not confuse whether or not a function is invertible with the following idea:
Definition 2.2.4 Let f : X → Y, and I ⊆ Y. Then the inverse image of I under f is the set of all x ∈ X such that f(x) = y and y ∈ I.

Example On [0, ∞), the function f(x) = x² is invertible, since x = √y is the unique inverse element. Then for any set I ⊆ [0, ∞), the inverse image is f⁻¹(I) = {x : √y = x, y ∈ I}.

On [−1, 1], f(x) = x² has the image set I = [0, 1]. For any y ≠ 0, y ∈ I, we can solve for x = ±√y, so that x² is not invertible on [−1, 1]. However, the inverse image f⁻¹([0, 1]) is [−1, 1], by the same reasoning.
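The distinction can be illustrated on a finite grid standing in for [−1, 1] (a discrete sketch of my own; the text works with the continuum): f(x) = x² is not invertible there, yet its inverse images are perfectly well defined.

```python
def inverse_image(f, in_I, domain):
    """{x in domain : f(x) in I}, with membership in I given as a predicate."""
    return {x for x in domain if in_I(f(x))}

f = lambda x: x * x
grid = [i / 10 for i in range(-10, 11)]   # grid points in [-1, 1]

# Not invertible: two distinct points share the same image...
assert f(-0.5) == f(0.5)

# ...but the inverse image of [0, 1] is still well defined: the whole grid
pre = inverse_image(f, lambda y: 0 <= y <= 1, grid)
assert pre == set(grid)
```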
There is an important generalization of a function called a correspondence:

Definition 2.2.5 Let X and Y be sets. A correspondence is a rule that assigns a subset F(x) ⊆ Y to each x ∈ X. Alternatively¹, a correspondence is a rule that maps X into the power set of Y, F : X → 2^Y.

So a correspondence is just a function with set-valued images: the restriction on functions that f(x) = y and y is a singleton is being relaxed, so that F(x) = Z, where Z is a set containing elements of Y.
Example Consider the equation x² = c. We can think of the correspondence X(c) as the set of (real) solutions to the equation. For c < 0, the equation cannot be solved, because x² ≥ 0, so X(c) is empty for c < 0. For c = 0, there is exactly one solution, X(c) = {0}, and X(c) happens to be a function at that point. But for c > 0, X(c) = {−√c, √c}, since (−√c)² = (√c)² = c: here there is a non-trivial correspondence, where the image is a set with multiple elements.
Example Suppose two agents are playing matching pennies. Their payoffs can be succinctly enumerated in a table:

          B
          H       T
A   H    1,−1    −1,1
    T   −1,1     1,−1

Suppose agent B uses a mixed strategy and plays randomly, so that σ = pr[Column uses H]. Then A's expected utility from using H is

σ(1) + (1 − σ)(−1)

while the row player's expected utility from using T is

σ(−1) + (1 − σ)(1)

We ask, "When is it an optimal strategy for the row player to use H against the column player's mixed strategy σ?", or, "What is the row player's best response to σ?"

Well, H is strictly better than T if

σ(1) + (1 − σ)(−1) > σ(−1) + (1 − σ)(1)  ⟺  σ > 1/2

and H is strictly worse than T if

σ(−1) + (1 − σ)(1) > σ(1) + (1 − σ)(−1)  ⟺  σ < 1/2

But when σ = 1/2, the row player is exactly indifferent between H and T. Indeed, the row player can himself randomize over any mix between H and T and get exactly the same payoff. Therefore, the row player's best-response correspondence is

pr[Row uses H | σ] =
    1        if σ > 1/2
    0        if σ < 1/2
    [0, 1]   if σ = 1/2

So correspondences arise naturally (and frequently!) in game theory.

¹For reasons beyond our current purposes, there are advantages and disadvantages to each approach.
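The case split above translates directly into code; representing each image as an interval (lo, hi) of probabilities is one way to keep the set-valued case honest (a sketch, with the payoffs hard-coded from the table above):

```python
def eu_H(sigma):
    """Row's expected payoff from H when pr[Column uses H] = sigma."""
    return sigma * 1 + (1 - sigma) * (-1)

def eu_T(sigma):
    """Row's expected payoff from T."""
    return sigma * (-1) + (1 - sigma) * 1

def best_response(sigma):
    """Row's best-response set, as an interval (lo, hi) of pr[Row uses H]."""
    if eu_H(sigma) > eu_T(sigma):
        return (1.0, 1.0)           # play H for sure
    if eu_H(sigma) < eu_T(sigma):
        return (0.0, 0.0)           # play T for sure
    return (0.0, 1.0)               # indifference: any mixture is optimal

print(best_response(0.7))   # (1.0, 1.0)
print(best_response(0.5))   # (0.0, 1.0)
```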
Example Consider the maximization problem

max_{x₁,x₂}  x₁ + x₂

subject to the constraints x₁, x₂ ≥ 0 and x₁ + p x₂ ≤ 1.

Since x₁ and x₂ are perfect substitutes, the solution is intuitively to pick the cheapest good and buy only that one. However, in the case when p = 1, the objective and constraint lie on top of one another, and any pair with x₁* + x₂* = 1 is a solution. The solution to the maximization problem then is

(x₁*(p), x₂*(p)) =
    (0, 1/p)                  if p < 1
    (1, 0)                    if p > 1
    (z, 1 − z), z ∈ [0, 1]    if p = 1

Therefore, we have a correspondence, where the optimal solution in the case when p = 1 is set-valued: {(z, 1 − z) : z ∈ [0, 1]}. So correspondences arise naturally in optimization theory.

It turns out that correspondences are very common in microeconomics, even though they aren't usually studied in calculus or undergraduate math courses.
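The solution correspondence for this perfect-substitutes problem is equally easy to transcribe; at p = 1 the sketch below returns a finite sample of the solution segment (the grid of z values is an arbitrary choice of mine):

```python
def solutions(p, n=5):
    """Maximizers of x1 + x2 s.t. x1, x2 >= 0 and x1 + p*x2 <= 1."""
    if p < 1:
        return [(0.0, 1.0 / p)]       # good 2 is cheaper: spend everything on it
    if p > 1:
        return [(1.0, 0.0)]           # good 1 is cheaper
    # p == 1: the whole segment (z, 1 - z); sample n points of it
    return [(i / (n - 1), 1 - i / (n - 1)) for i in range(n)]

print(solutions(0.5))                 # [(0.0, 2.0)]
# At p = 1, every sampled solution attains the same objective value, 1
assert all(abs(x1 + x2 - 1.0) < 1e-12 for x1, x2 in solutions(1.0))
```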
2.3 Real Numbers and Absolute Value

The set of numbers we usually work in, (−∞, ∞), is called the real numbers, or R. The symbols −∞ and ∞ are not in R (they are not even really numbers), but represent the idea of an unbounded process, like 1, 2, 3, . . .. The features of R that make it attractive follow from the intuitive idea that R is a continuum, with no breaks. This is unlike the integers, Z = {. . . , −3, −2, −1, 0, 1, 2, 3, . . .}, which have spaces between each number.

Absolute value is given by

|x| =
    −x   if x < 0
    0    if x = 0
    x    if x > 0

The most important feature of absolute value is that it satisfies

|x + y| ≤ |x| + |y|

And in particular,

|x − y| ≤ |x| + |y|
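The triangle inequality is easy to spot-check over random draws (a quick numeric sketch, not a proof; the seed and range are arbitrary):

```python
import random

random.seed(0)  # arbitrary seed, for reproducibility
for _ in range(1000):
    x = random.uniform(-10.0, 10.0)
    y = random.uniform(-10.0, 10.0)
    assert abs(x + y) <= abs(x) + abs(y)   # triangle inequality
    assert abs(x - y) <= abs(x) + abs(y)   # the variant with a difference
```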
2.4 Logic and Methods of Proof

It really helps to get a sense of what is going on in this section, and to return to it a few times over the course. Thinking logically and finding the most elegant (read: slick) way to prove something is a skill that is developed. It is a talent, like composing or improvising music or athletic ability, in that you begin with a certain stock of potential to which you can add by working hard and being alert and careful when you see how other people do things. Even if you want to be an applied micro or macro economist, you will someday need to make logical arguments that go beyond taking a few derivatives or citing someone else's work, and it will be easier if you remember the basic nuts and bolts of what it means to prove something.
Propositional Logic

Definition 2.4.1 A proposition is a statement that is either true or false.

For example, some propositions are

"The number e is transcendental"
"Every Nash equilibrium is Pareto optimal"
"Minimum wage laws cause unemployment"

These statements are either true or false. If you think minimum wage laws sometimes cause unemployment, then you are simply asserting the third proposition is false, although a similar statement that is more tentative might be true. The following are not propositions:

"What time is it?"
"This statement is false"

The first sentence is simply neither true nor false. The second sentence is such that if it were true it would be false, and if it were false it would be true². Consequently, you can arrange symbols in ways that do not amount to propositions, so the definition is not meaningless.

Our goal in theory is usually to establish that "If P is true, then Q must be true", or P ⇒ Q, or "Q is a logical implication of P". We begin with a set of conditions that take the form of propositions, and show that these propositions collectively imply the truth of another proposition.

Since any proposition P is either true or false, we can consider the proposition's negation: the proposition that is true whenever P is false, and false whenever P is true. We write the negation of P in symbols as ¬P, read as "not P". For example, "the number e is not transcendental", "some Nash equilibria are not Pareto optimal", and "minimum wage laws do not cause unemployment". Note that when negating a proposition, you have to take some care, but we'll get to more on that later.

Consider the following variations on P ⇒ Q:

The Converse: Q ⇒ P
The Contrapositive: ¬Q ⇒ ¬P
The Inverse: ¬P ⇒ ¬Q

²Note that this sentence is not a proposition not because it contradicts the definition of a proposition, but because it fails to satisfy the definition.
The converse is usually false, but sometimes true. For example, a differentiable function is continuous ("differentiable ⇒ continuous" is true), but there are continuous functions that are not differentiable ("continuous ⇒ differentiable" is false). The inverse, likewise, is usually false: non-differentiable functions need not be discontinuous, since many functions with kinks are continuous but non-differentiable, like |x|. The contrapositive, however, is always true if P ⇒ Q is true, and always false if P ⇒ Q is false. For example, "a discontinuous function cannot be differentiable" is a true statement. Another way of saying this is that a claim P ⇒ Q and its contrapositive have the same truth value (true or false). Indeed, it is often easier to prove the contrapositive than to prove the original claim.
While the above paragraph shows that the converse and inverse need not be true if the original claim P ⇒ Q is true, we should show more concretely that the contrapositive is true. Why is this? Well, if P ⇒ Q, then whenever P is true, Q must be true also, which we can represent in a table:
  P   Q   P ⇒ Q
  T   T     T
  T   F     F
  F   T     T
  F   F     T
So the first two columns say what happens when P and Q are true or false. If P and Q are both true, then of course P ⇒ Q is true. But if P and Q are both false, then of course P ⇒ Q is also a true statement. The proposition P ⇒ Q is actually only false when P is true but Q is false, which is the second line. Now, as for the contrapositive, the truth table is:
  P   Q   ¬Q ⇒ ¬P
  T   T      T
  T   F      F
  F   T      T
  F   F      T
By the same kind of reasoning, when P and Q are both true or both false, ¬Q ⇒ ¬P is true. In the case where P is true but Q is false, ¬Q is true but ¬P is false, so ¬Q ⇒ ¬P is false. Thus, we end up with exactly the same truth values for the contrapositive as for the original claim, so they are equivalent. To see whether you understand this, make a truth table for the inverse and converse, and compare them to the truth table for the original claim.
Quantifiers and negation
Once you begin manipulating the notation and ideas, you must inevitably negate a statement like, "All bears are brown." This proposition has a quantifier in it, namely the word "all." In this situation, we want to assert that "All bears are brown" is false; there are, after all, giant pandas that are black and white and not brown at all. So the right negation is something like, "Some bears are not brown." But how do we get "some" from "all" and "not brown" from "brown" in more general situations without making mistakes?
Definition 2.4.2 Given a set A, the set of elements S that share some property P is written S = {x ∈ A : P(x)} (e.g., S₁ = {x ∈ R : 2x > 4} = {x ∈ R : x > 2}).
Definition 2.4.3 In a set A, if all the elements x share some property P, write ∀x ∈ A : P(x). If there exists an element x ∈ A that satisfies a property P, then write ∃x ∈ A : P(x). The symbol ∀ is the universal quantifier and the symbol ∃ is the existential quantifier.
So a claim might be made,
"All allocations of goods in a competitive equilibrium of an economy are Pareto optimal."
Then we are implicitly saying: Let A be the set of all allocations generated by a competitive equilibrium, and let P(a) denote the claim that a is a Pareto optimal allocation, whatever that is. Then we might say, "If a ∈ A, P(a)," or
∀a ∈ A : P(a)
Negating propositions with quantifiers can be very tricky. For example, consider the claim: "Everybody loves somebody sometime." We have three quantifiers, and it is unclear what order they should go in. The negation of this statement becomes a non-trivial problem precisely because of how many quantifiers are involved: Should the negation be, "Everyone hates someone sometime"? Or "Someone hates everyone all the time"? It requires care to get these details right. Recall the definition of negation:
Definition 2.4.4 Given a statement P(x), the negation of P(x) asserts that P(x) is false for x. The negation of a statement P(x) is written ¬P(x).
The rules for negating statements are
¬(∀x ∈ A : P(x)) = ∃x ∈ A : ¬P(x)
and
¬(∃x ∈ A : P(x)) = ∀x ∈ A : ¬P(x)
For example, the claim "All allocations of goods in a competitive equilibrium of an economy are Pareto optimal" could be written in the above form as follows: Let A be the set of allocations achievable in a competitive equilibrium of an economy (whatever that is), let a ∈ A be a particular allocation, and let P(a) be the proposition that the allocation is Pareto optimal (whatever that is). Then the claim is equivalent to the statement ∀a ∈ A : P(a). The negation of that statement, then, is
∃a ∈ A : ¬P(a)
or in words, "There exists an allocation of goods in some competitive equilibrium of an economy that is not Pareto optimal."
Be careful about "or" statements. In propositional logic, "P ⇒ Q₁ or Q₂" means "P logically implies Q₁, or Q₂, or both." By De Morgan's laws, the negation of "Q₁ or Q₂" is "not Q₁ and not Q₂," so negating a statement turns "or" into "and" and vice versa. For example, "All U.S. Presidents were U.S. citizens and older than 35 when they took office" is negated to "There is a U.S. President who wasn't a U.S. citizen, or was younger than 35, or both, when they took office."
Some examples:
"A given strategy profile is a Nash equilibrium if there is no player with a profitable deviation." Negating this statement gives, "A given strategy profile is not a Nash equilibrium if there exists a player with a profitable deviation."
"A given allocation is Pareto efficient if there is no agent who can be made strictly better off without making some other agent worse off." Negating this statement gives, "A given allocation is not Pareto efficient if there exists an agent who can be made strictly better off without making any other agent worse off."
"A continuous function with a closed, bounded subset of Euclidean space as its domain always achieves its maximum." Negating this statement gives, "A function that is not continuous or whose domain is not a closed, bounded subset of Euclidean space may not achieve a maximum."
Obviously, we are not logicians, so our relatively loose statements of ideas will have to be negated with some care.
2.4.1 Examples of Proofs
Most proofs use one of the following approaches:
Direct proof: Show that P logically implies Q.
Proof by Contrapositive: Show that ¬Q logically implies ¬P.
Proof by Contradiction: Assume P and ¬Q, and show this leads to a logical contradiction.
Proof by Induction: Suppose we want to show that for any natural number n = 1, 2, 3, ..., P(n) ⇒ Q(n). A proof by induction shows that P(1) ⇒ Q(1) (the base case), and that for any n, if P(n) ⇒ Q(n) (the induction hypothesis), then P(n + 1) ⇒ Q(n + 1). Consequently, P(n) ⇒ Q(n) is true for all n.
Disproof by Counterexample: Let ∀x ∈ A : P(x) be a statement we wish to disprove. Show that there exists x* ∈ A such that P(x*) is false.
The claim might be fairly complicated, like
"If f(x) is a bounded, increasing function on [a, b], then the set of points of discontinuity of f(·) is a countable set."
The P is "f(x) is a bounded, increasing function on [a, b]," and the Q is "the set of points of discontinuity of f(x) is a countable set." To prove this, we could start by using P to show Q must be true. Or, in a proof by contrapositive, we could prove that if the function has an uncountable number of discontinuities on [a, b], then f(x) is either unbounded, or not increasing, or both. Or, in a proof by contradiction, we could assume that the function is bounded and increasing, but that the function has an uncountable number of discontinuities. The challenge for us is to make sure that while we are using mathematical sentences and definitions, no logical mistakes are made in our arguments.
Each method of proof has its own advantages and disadvantages, so we'll do an example of each kind now.
A Proof by Contradiction
Theorem 2.4.5 There is no largest prime.
Proof Suppose there was a largest prime number, n. Let p = n · (n − 1) · ... · 1 + 1. Then p leaves remainder 1 when divided by any of 2, ..., n, so it is divisible by none of them; since by assumption every prime is at most n, p must itself be prime, and p > n since n · (n − 1) · ... · 1 + 1 > n · 1 + 1 > n. So p is a prime number larger than n, which is a contradiction, since n was assumed to be the largest prime. Therefore, there is no largest prime. □
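The divisibility step in this proof is easy to check numerically. Below is a small Python sketch (not part of the original notes) confirming that n! + 1 leaves remainder 1 when divided by every integer from 2 to n:

```python
from math import factorial

def euclid_candidate(n):
    """Return p = n*(n-1)*...*1 + 1, the number built in the proof."""
    return factorial(n) + 1

for n in range(2, 10):
    p = euclid_candidate(n)
    # p leaves remainder 1 when divided by every k in 2..n,
    # so no number in that range divides p
    assert all(p % k == 1 for k in range(2, n + 1))
```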
A Proof by Contrapositive
Suppose we have two disjoint sets, M and W with |M| = |W|, and are looking for a matching of elements in M to elements in W: a relation mapping each element from M into W and vice versa, which is one-to-one and onto. Each element has a quality attached, either q_m ≥ 0 or q_w ≥ 0, and the value of a match is v(q_m, q_w) = (1/2) q_m q_w. The assortative match is the one where the highest quality agents are matched, the second-highest quality agents are matched, and so on. If agent m and agent w are matched, then h(m, w) = 1; otherwise h(m, w) = 0.
Theorem 2.4.6 If a matching maximizes Σ_{m,w} h(m, w) v(q_m, q_w), then it is assortative.
Proof Since the proof is by contrapositive, we need to show that any matching that is not assortative does not maximize Σ_{m,w} h(m, w) v(q_m, q_w), i.e., any non-assortative match can be improved.
If the match is not assortative, we can find two pairs of agents where q_{m1} > q_{m2} and q_{w1} > q_{w2}, but h(m1, w2) = 1 and h(m2, w1) = 1. Then the value of those two matches is
q_{m1} q_{w2} + q_{m2} q_{w1}
If we swapped the partners, the value would be
q_{m1} q_{w1} + q_{m2} q_{w2}
and
{q_{m1} q_{w1} + q_{m2} q_{w2}} − {q_{m1} q_{w2} + q_{m2} q_{w1}} = (q_{m1} − q_{m2})(q_{w1} − q_{w2}) > 0
So a non-assortative match does not maximize Σ_{m,w} h(m, w) v(q_m, q_w). □
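To see the swap argument at work, here is a hedged Python sketch: with made-up qualities (the numbers are illustrative, not from the notes), a brute-force search over all one-to-one matchings finds that the assortative match maximizes the total value:

```python
from itertools import permutations

def total_value(q_m, q_w, match):
    """Total value when man i is matched to woman match[i]; v = q_m*q_w/2."""
    return sum(0.5 * q_m[i] * q_w[match[i]] for i in range(len(q_m)))

# hypothetical qualities, sorted in decreasing order
q_m = [4.0, 3.0, 1.0]
q_w = [5.0, 2.0, 1.5]

# brute force over all one-to-one matchings
best = max(permutations(range(3)), key=lambda m: total_value(q_m, q_w, m))
# with both lists sorted decreasingly, the assortative match is the identity
assert best == (0, 1, 2)
```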
What's the difference between proof by contradiction and proof by contrapositive? In proof by contradiction, you get to use P and ¬Q to arrive at a contradiction, while in proof by contrapositive, you use ¬Q to show that ¬P is a logical implication, without using P or ¬P in the proof.
A Proof by Induction
Theorem 2.4.7 Σ_{i=1}^{n} i = (1/2) n(n + 1)
Proof Basis Step: Let n = 1. Then Σ_{i=1}^{1} i = 1 = (1/2) · 1 · (1 + 1) = 1. So the theorem is true for n = 1.
Induction Step: Suppose Σ_{i=1}^{n} i = (1/2) n(n + 1); we'll show that Σ_{i=1}^{n+1} i = (1/2)(n + 1)(n + 2). Now add n + 1 to both sides:
Σ_{i=1}^{n} i + (n + 1) = (1/2) n(n + 1) + (n + 1)
Factoring (1/2)(n + 1) out of both terms on the right gives
Σ_{i=1}^{n+1} i = (1/2)(n + 1)(n + 2)
And the induction step is shown to be true.
Now, since the basis step is true (P(1) ⇒ Q(1) is true), and the induction step is true (P(n) ⇒ Q(n) implies P(n + 1) ⇒ Q(n + 1)), the theorem is true for all n.
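The closed form proved by induction can be spot-checked against the brute-force sum; a quick Python sketch:

```python
def gauss_sum(n):
    """Closed form 1 + 2 + ... + n = n*(n + 1)/2 from Theorem 2.4.7."""
    return n * (n + 1) // 2

# spot-check the induction result against the brute-force sum
for n in range(1, 200):
    assert sum(range(1, n + 1)) == gauss_sum(n)
```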
Disproof by Counter-example
Definition 2.4.8 Given a (possibly false) statement ∀x ∈ A : P(x), a counter-example is an element x* of A for which ¬P(x*).
For example, "All continuous functions are differentiable." This sounds like it could be true. Continuous functions have no jumps and are pretty well-behaved. But it's false. The negation of the statement is, "There exist continuous functions that are not differentiable." Maybe we can prove that?
My favorite non-differentiable function is the absolute value,
|x| =
  −x  if x < 0
   0  if x = 0
   x  if x > 0
For x < 0, the derivative is f′(x) = −1, and for x > 0, the derivative is f′(x) = 1. But at zero, the derivative is not well-defined. Is it +1? −1? Something in-between?
[Figure: non-uniqueness of the derivative of |x| at 0]
This is what we call a kink in the function, a point of non-differentiability. So |x| is non-differentiable at zero. But |x| is continuous at zero, since for any sequence x_n → 0, |x_n| → |0|. So we have provided a counterexample, showing that "All continuous functions are differentiable" is false.
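The one-sided difference quotients of |x| at zero can be computed directly; a small Python sketch showing they disagree, so no single derivative value exists there:

```python
def diff_quotient(f, x, h):
    """One-sided difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

for h in [0.1, 0.01, 0.001]:
    # from the right the slope is +1, from the left it is -1,
    # so no single derivative value exists at 0
    assert diff_quotient(abs, 0.0, h) == 1.0
    assert diff_quotient(abs, 0.0, -h) == -1.0
```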
One theorem three ways
We did three theorems that highlight the usefulness of each method of proof above. Here, we'll do one theorem using each of the three tools. Our theorem is³:
Theorem 2.4.9 Suppose f(x) is continuously differentiable. If f(x) is increasing on [a, b], then f′(x) ≥ 0 for all x ∈ [a, b].
Now, the direct proof:
Proof Let x′ > x. If f(x) is increasing, then f(x′) > f(x), and letting x′ = x + ε where ε > 0, we have
f(x + ε) − f(x) ≥ 0
Dividing by ε gives
(f(x + ε) − f(x)) / ε ≥ 0
and taking the limit as ε → 0 gives f′(x) ≥ 0. □
³ A continuously differentiable function is one for which its derivative exists everywhere, and is itself a continuous function.
In this proof, we use the premise to prove the conclusion directly, without using the conclusion. Let's try a proof by contrapositive.
Proof This proof is by contrapositive, so we will show that if f′(x) ≤ 0 for all x ∈ [a, b], then f(x) is non-increasing. Suppose f′(x) ≤ 0. Then for all x′ > x in [a, b], by the fundamental theorem of calculus,
f(x′) − f(x) = ∫_x^{x′} f′(z) dz ≤ 0
so that f(x′) ≤ f(x), as was to be shown. □
Here, instead of showing that an increasing function has non-negative derivatives, we showed that non-positive derivatives imply a non-increasing function. In some sense, the proof was easier, even though the approach involved more thought. The last proof is by contradiction:
Proof This proof is by contradiction, so we assume that f(x) is increasing, so that f(x′) > f(x) for all x′ > x in [a, b], but that f′(x) < 0 for all x ∈ [a, b].
If x′ > x, then f(x′) − f(x) = ∫_x^{x′} f′(z) dz < 0, so that f(x′) < f(x). But we assumed that f(x′) > f(x), so we are led to a contradiction. □
Note that the contradiction doesn't always have to be the opposite of the P; this is why proof by contradiction can be very useful. You just need a contradiction of any kind (0 > 1, e is rational, etc.) to conclude your proof. The disadvantage is that you can often learn more from a direct proof about the mathematical situation than you can from a proof by contradiction, since the first often constructs the objects of interest or shows how various conditions give rise to certain properties, where the second is sometimes magical.
Necessary and Sufficient Conditions
If P ⇒ Q and Q ⇒ P, then we often say "P if and only if Q," or "P is necessary and sufficient for Q."
Consider
To be a lawyer, it is necessary to have attended law school.
To be a lawyer, it is sufficient to have passed the bar exam.
To pass the bar exam, it is required that a candidate have a law degree. On the other hand, a law degree alone does not entitle someone to be a lawyer, since they still need to be certified. So a law degree is a necessary condition to be a lawyer, but not sufficient. On the other hand, if someone has passed the bar exam, they are allowed to represent clients, so it is a sufficient condition to be a lawyer. There is clearly a gap between the necessary condition of attending law school and the sufficient condition of passing the bar exam: a law school graduate who has not yet passed the bar satisfies the first but not the second.
Let's look at a more mathematical example:
A necessary condition for a point x* to be a local maximum of a differentiable function f(x) on (a, b) is that f′(x*) = 0.
A sufficient condition for a point x* to be a local maximum of a differentiable function f(x) on (a, b) is that f′(x*) = 0 and f″(x*) < 0.
Why is the first condition necessary? Well, if f(x) is differentiable on (a, b) and f′(x*) > 0, we can take a small step to x* + h and increase the value of the function (alternatively, if f′(x*) < 0, take a small step to x* − h and you can also increase the value of the function). Consequently, x* could not have been a local maximum. So it is necessary that f′(x*) = 0.
Why is the second condition sufficient? Well, Taylor's theorem says that
f(x) = f(x*) + f′(x*)(x − x*) + f″(x*) (x − x*)²/2 + o(h²)
where x − x* = h and o(h²) denotes terms that go to zero, as h goes to zero, faster than the (x − x*)² and (x − x*) terms. Since f′(x*) = 0, we then have
f(x) = f(x*) + f″(x*) (x − x*)²/2 + o(h²)
so that for x sufficiently close to x*,
f(x*) − f(x) ≈ −f″(x*) (x* − x)²/2
so that f(x*) > f(x) for nearby x ≠ x* only if f″(x*) < 0. So if the first-order necessary condition and second-order sufficient condition are satisfied, x* must be a local maximum. However, there is a substantial gap between the first-order necessary and second-order sufficient conditions.
For example, consider f(x) = x³ on (−1, 1). The point x* = 0 is clearly a solution of the first-order necessary condition, and f″(0) = 0 as well. However, f(1/2) = (1/2)³ > 0 = f(0), so the point x* is a critical point, but not a maximum. Similarly, suppose we try to maximize f(x) = x² on (−1, 1). Again, x* = 0 is a solution to the first-order necessary condition, but it is a minimum, not a maximum. So the first-order necessary conditions only identify critical points, and these critical points come in three flavors: maximum, minimum, and saddle-point.
Alternatively, consider the function
g(x) =
  x,       −∞ < x < 5
  5,       5 ≤ x ≤ 10
  15 − x,  10 < x < ∞
This function looks like a plateau, where the set [5, 10] all sits at a level of 5, and everything before and after drops off. The necessary condition is satisfied for all x strictly between 5 and 10, since g′(x) = 0 there. But the sufficient conditions aren't satisfied, since g″(x) = 0 for x strictly between 5 and 10 as well. These points are all clearly global maxima, but the second-order sufficient conditions require strict negativity to rule out pathologies like f(x) = x³. So, even when you have useful necessary and sufficient conditions, there can be subtle gaps that you miss if you're not careful.
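The gap between the first-order necessary and second-order sufficient conditions can be illustrated with a small second-derivative-test sketch in Python (derivatives supplied analytically for the two examples above):

```python
def classify(f1, f2, x):
    """Second-derivative test at x, given f1 = f' and f2 = f''."""
    assert abs(f1(x)) < 1e-12       # first-order necessary condition holds
    if f2(x) < 0:
        return "local max"
    if f2(x) > 0:
        return "local min"
    return "inconclusive"           # f'' = 0: could be max, min, or saddle

# f(x) = x**3 at 0: f' = 3x**2, f'' = 6x; the test is silent
assert classify(lambda x: 3 * x ** 2, lambda x: 6 * x, 0.0) == "inconclusive"
# f(x) = x**2 at 0: f' = 2x, f'' = 2; a minimum, not a maximum
assert classify(lambda x: 2 * x, lambda x: 2.0, 0.0) == "local min"
```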
Exercises
1. If A and B are sets, prove that A ⊆ B iff A ∩ B = A.
2. Prove the distributive law
A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C)
and the associative law
(A ∪ B) ∪ C = A ∪ (B ∪ C)
What is ((A ∪ B)ᶜ)ᶜ?
3. Sketch graphs of the following sets:
{(x, y) ∈ R² : x ≥ y}
{(x, y) ∈ R² : x² + y² ≤ 1}
{(x, y) ∈ R² : √(xy) ≤ 5}
{(x, y) ∈ R² : min{x, y} ≤ 5}
4. Write the converse and contrapositive of the following statements (note that you don't have to know what any of the statements actually mean to negate them):
A set X is convex if, for every x₁, x₂ ∈ X and λ ∈ [0, 1], the element x_λ = λx₁ + (1 − λ)x₂ is in X.
The point x* is a maximum of a function f on the set A if there exists no x′ ∈ A such that f(x′) > f(x*).
A strategy profile is a subgame perfect Nash equilibrium if, for every player and in every subgame, the strategy profile is a Nash equilibrium.
A function is uniformly continuous on a set A if, for all ε > 0 there exists a δ > 0 such that for all x, c ∈ A, |x − c| < δ implies |f(x) − f(c)| < ε.
Any competitive equilibrium (x*, y*, p*) is a Pareto optimal allocation.
Part II
Optimization and Comparative
Statics in R
Chapter 3
Basics of R
Deliberate behavior is central to every microeconomic model, since we assume that agents act to get the best payoff available to them, given the behavior of other agents and exogenous circumstances that are outside their control.
Some examples are:
A firm hires capital K and labor L at prices r and w per unit, respectively, and has production technology F(K, L). It receives a price p per unit it sells of its good. It would like to maximize profits,
π(K, L) = pF(K, L) − rK − wL
A consumer has utility function u(x) over bundles of goods x = (x₁, x₂, ..., x_N), and a budget constraint w ≥ Σ_{i=1}^{N} p_i x_i. The consumer wants to maximize utility, subject to his budget constraint:
max_x u(x) subject to Σ_i p_i x_i ≤ w.
A buyer goes to a first-price auction, where the highest bidder wins the good and pays his bid. His value for the good is v, and the probability he wins, conditional on his bid, is p(b). If he wins, he gets a payoff v − b, while if he loses, he gets nothing. Then he is trying to maximize
max_b p(b)(v − b) + (1 − p(b)) · 0
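As a concrete illustration of the bidder's problem, here is a Python sketch with an assumed winning probability p(b) = b (for instance, a single rival whose bid is uniform on [0, 1]; this functional form is an assumption for illustration, not from the notes). Under that assumption the analytic optimum is b* = v/2, which a grid search recovers:

```python
def expected_payoff(b, v, win_prob):
    """Bidder's objective in a first-price auction: p(b) * (v - b)."""
    return win_prob(b) * (v - b)

# assumption for illustration only: p(b) = b
v = 1.0
grid = [i / 1000 for i in range(1001)]
b_star = max(grid, key=lambda b: expected_payoff(b, v, lambda x: x))
assert abs(b_star - 0.5) < 1e-9     # analytic optimum is b* = v / 2
```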
These are probably very familiar to you. But there are other, related problems of interest you might not have thought of:
A firm hires capital K and labor L at prices r and w per unit, respectively, and has production technology F(K, L). It would like to minimize costs, subject to producing q units of output, or
min_{K,L} rK + wL subject to F(K, L) = q.
A consumer has utility function u(x) over bundles of goods x = (x₁, x₂, ..., x_N), and a budget constraint w ≥ Σ_{i=1}^{N} p_i x_i. The consumer would like to minimize the amount they spend, subject to achieving utility equal to ū, or
min_x Σ_{i=1}^{N} p_i x_i subject to u(x) ≥ ū.
A seller faces N buyers, each of whom has a value for the good unobservable to the seller. How much revenue can the seller raise from the transaction, among all possible ways of soliciting information from the buyers?
We want to develop some basic, useful facts about solving these kinds of models. There are three questions we want to answer:
When do solutions exist to maximization problems? (The Weierstrass theorem)
How do we find candidate solutions? (First-order necessary conditions)
How do we show that a candidate solution is a maximum? (Second-order sufficient conditions)
Since we're going to study dozens of different maximization problems, we need a general way of expressing the idea of a maximization problem efficiently:
Definition 3.0.10 A maximization problem is a payoff function or objective function, f(x, t), and a choice set or feasible set, X. The agent is then trying to solve
max_{x ∈ X} f(x, t)
The variable x is a choice variable or control, since the agent gets to decide its value. The variable t is an exogenous variable or parameter, over which the agent has no control.
For this chapter, agents can choose any real number: Firms pick a price, consumers pick a quantity, and so on. This means that X is the set of real numbers, R. Also, unless we are interested in t in particular, we often ignore the exogenous variables in an agent's decision problem. For example, we might write π(q) = pq − cq², even though π(·) depends on p and c.
A good question to start with is, "When does a maximizer of f(x) exist?" While this question is simple to state in words, it is actually pretty complicated to state and show precisely, which is why this chapter is fairly math-heavy.
3.0.2 Maximization and Minimization
A function is a mapping from some set D, the domain, uniquely into R = (−∞, ∞), the real numbers, and we write f : D → R. For example, √x : [0, ∞) → R. The image of D under f(x) is the set f(D). For example, for √x we have f(D) = [0, ∞), and for log(x) : (0, ∞) → R, we have log(D) = (−∞, ∞). It turns out that the properties of the domain D and the function f(x) both matter for us.
Definition 3.0.11 A point x* is a global maximizer of f : D → R if, for any other x′ in D,
f(x*) ≥ f(x′)
A point x* is a local maximizer of f : D → R if, for some ε > 0 and for any other x′ in (x* − ε, x* + ε) ∩ D,
f(x*) ≥ f(x′)
[Figure: Domain, Image, Maxima, Minima]
So a local maximum is basically saying, "If we restrict attention to (x* − ε, x* + ε) instead of (a, b), then x* is a global maximum in (x* − ε, x* + ε)." In even less mathematical terms, x* is a local maximum if it does at least as well as anything nearby.
A point f(x*) is a local or global minimum if, instead of f(x*) ≥ f(x′) above, we have f(x*) ≤ f(x′). However, we can focus on just maximization or just minimization, since
Theorem 3.0.12 If x* is a local maximum of f(·), then x* is a local minimum of −f(·).
Proof If f(x*) ≥ f(x′) for all x′ in (x* − ε, x* + ε) for some ε > 0, then −f(x*) ≤ −f(x′) for all x′ in (x* − ε, x* + ε), so x* is a local minimum of −f(·). □
So whenever you have a minimization problem,
min_x f(x)
you can turn it into a maximization problem
max_x −f(x)
and forget about minimization entirely.
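This duality is easy to demonstrate on a grid; a small Python sketch:

```python
def argmax(f, xs):
    """Grid point where f is largest."""
    return max(xs, key=f)

f = lambda x: (x - 2.0) ** 2            # minimized at x = 2
xs = [i / 10 for i in range(51)]        # grid on [0, 5]
# minimizing f is the same problem as maximizing -f
assert min(xs, key=f) == argmax(lambda x: -f(x), xs)
assert argmax(lambda x: -f(x), xs) == 2.0
```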
[Figure: Maximization vs. Minimization]
On the other hand, since I have programmed myself to do this automatically, I find it annoying when a book is written entirely in terms of minimization, and some of the rules used down the road for minimization (second-order sufficient conditions for constrained optimization problems) are slightly different than those for maximization.
3.1 Existence of Maxima/Maximizers
So what do we mean by "existence" of a maximum or maximizer?
Before moving to functions, let's start with the maximum and minimum of a set. For a set F = [a, b] with a and b finite, the maximum is b and the minimum is a: these are members of the set, and b ≥ x for all other x in F, and a ≤ x for all other x in F.
If we consider the set E = (a, b), however, the points b and a are not actually members of the set, although b ≥ x for all x in E, and a ≤ x for all x in E. We call b the least upper bound of E, or the supremum,
sup(a, b) = b
and a is the greatest lower bound of E, or the infimum,
inf(a, b) = a
Even if the maximum and minimum of a set don't exist, the supremum and infimum always do (though they may be infinite, like for (a, ∞)).
Since a function is a rule that assigns a real number to each x in the domain D, the image of D under f is going to be a set, f(D). For maximization purposes, we are interested in whether the image of a function includes its largest point or not.
For example, the function f(x) = x² on [0, 1] achieves a maximum at x* = 1, since f(1) = 1 > f(x) for 0 ≤ x < 1, and sup f([0, 1]) = max f([0, 1]). However, if we consider f(x) = x² on (0, 1), the supremum of f(x) is still 1, since 1 is the least upper bound of f((0, 1)) = (0, 1), but the function does not achieve a maximum.
[Figure: Existence and Non-existence]
Why? On the left, the graph has no holes, so it moves smoothly to the highest point. On the right, however, a point is missing from the graph exactly where it would have achieved its maximum. This is the difference between a maximizer (left) and a supremum when a maximizer fails to exist (right). Why does this happen? In short, because the choice set on the left, [a, b], includes all its points, while the choice set on the right, (a, b), is missing a and b.
To be a bit more formal, suppose x* ∈ (0, 1) is the maximizer of f(x) = x² on (0, 1). It must be less than 1, since 1 is not an element of the choice set. Let x′ = (1 − x*)/2 + x*: this is the point exactly halfway between x* and 1. Then f(x′) > f(x*), since f(x) is increasing. But then x* couldn't have been a maximizer. This is true for all x* in (0, 1), so we're forced to conclude that no maximum exists.
But this isn't the only way something can go wrong. Consider the function on [0, 1] where
g(x) =
  x,      0 ≤ x ≤ 1/2
  2 − x,  1/2 < x ≤ 1
Again, the function's graph is missing a point precisely where it would have achieved a maximum, at 1/2. There, it takes the value 1/2 instead of 3/2.
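The failure to attain the supremum can be seen numerically; a Python sketch of g on a fine grid shows the maximum over the grid creeping toward 3/2 without reaching it:

```python
def g(x):
    """x up to 1/2, then 2 - x: the sup over [0, 1] is 3/2, never attained."""
    return x if x <= 0.5 else 2 - x

grid = [i / 10000 for i in range(10001)]    # fine grid on [0, 1]
best = max(g(x) for x in grid)
# values creep up toward 3/2 from the right of 1/2 but never reach it
assert 1.49 < best < 1.5
```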
So in either case, the function can fail to achieve a maximum. In one situation, it happens because the domain D has the wrong features: It is missing some key points. In the second situation, it happens because the function f has the wrong features: It jumps around in unfortunate ways. Consequently, the question is: What features of the domain D and the function f(x) guarantee that sup_{x ∈ D} f(x) = m* is equal to f(x*) for some x* ∈ D?
Definition 3.1.1 A (global) maximizer x* of f(x) exists if f(x*) = sup f(x), and f(x*) is the (global) maximum of f(·).
3.2 Intervals and Sequences
Intervals are special subsets of the real numbers, and some of their properties are key for optimization theory.
Definition 3.2.1 The set (a, b) = {x : a < x < b} is open, while the set [a, b] = {x : a ≤ x ≤ b} is closed. If −∞ < a < b < ∞, the sets (a, b) and [a, b] are bounded (both the endpoints a and b are finite).
A sequence x_n = x₁, x₂, ... is a rule that assigns a real number to each of the natural (counting) numbers, N = 1, 2, 3, ....
Some basic sequences are:
Let x_n = n. Then the sequence is 1, 2, 3, ...
Let x_n = 1/n. Then the sequence is 1, 1/2, 1/3, ...
Let x_n = (−1)^{n−1} (1/n). Then the sequence is 1, −1/2, 1/3, −1/4, ...
Let x₁ = 1, x₂ = 1, and for n > 2, x_n = x_{n−1} + x_{n−2}. Then the sequence is 1, 1, 2, 3, 5, 8, 13, 21, ...
Let x_n = sin(nπ/2). Then the sequence is 1, 0, −1, 0, ...
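The sequences listed above are easy to generate; a short Python sketch:

```python
def seq(rule, n_terms):
    """First n_terms of the sequence x_n = rule(n), n = 1, 2, 3, ..."""
    return [rule(n) for n in range(1, n_terms + 1)]

assert seq(lambda n: n, 4) == [1, 2, 3, 4]
assert seq(lambda n: 1 / n, 3) == [1.0, 0.5, 1 / 3]
assert seq(lambda n: (-1) ** (n - 1) / n, 4) == [1.0, -0.5, 1 / 3, -0.25]

# the Fibonacci sequence is defined recursively rather than by a formula
fib = [1, 1]
for _ in range(6):
    fib.append(fib[-1] + fib[-2])
assert fib == [1, 1, 2, 3, 5, 8, 13, 21]
```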
At first, sequences might seem like a weird object to study when our goal is to understand functions. A function is basically an uncountable number of points, while sequences are countable, so it seems like we would lose information by simplifying an object this way. However, it turns out that studying how a function f(x) acts on sequences x_n gives us all the information we need to study the behavior of the function, without the hassle of dealing with the function itself.
Definition 3.2.2 A sequence x_n = x₁, x₂, ... is a function from N → R. A sequence converges to x if, for all ε > 0, there exists an N such that n > N implies |x_n − x| < ε. Then we write
lim_{n→∞} x_n = x
and x is the limit of x_n.
Everyone's favorite convergent sequence is
x_n = 1/n = 1, 1/2, 1/3, ...
and the limit of x_n is zero. Why? For any positive number ε, I can make |x_n − 0| < ε by picking n large enough. Just take n > N = 1/ε, so that
1/n < 1/(1/ε) = ε
Since ε was arbitrary, the terms get as close to zero as you like. Therefore, x_n → 0 as n → ∞. Notice, however, how 1/n is never actually equal to zero for finite n; it is only in the limit that it hits its lower bound. This is another example of the difference between minimum and infimum: inf x_n = 0 but min x_n does not exist, since 1/n > 0 for all n.
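The ε–N definition can be made concrete for x_n = 1/n: given ε, the choice N = ⌈1/ε⌉ works. A Python sketch:

```python
import math

def find_N(eps):
    """An N that works in the definition: 1/n < eps for every n > N."""
    return math.ceil(1 / eps)

for eps in [0.5, 0.1, 0.01]:
    N = find_N(eps)
    # every term beyond N is within eps of the limit 0
    assert all(abs(1 / n - 0) < eps for n in range(N + 1, N + 1000))
```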
Everyone's favorite non-convergent or divergent sequence is
x_n = n
Pick any finite number M, and I can make |x_n| > M, contrary to the definition of convergence to any finite limit. How? Just make n > M. This sequence goes off to infinity, and its long-run behavior doesn't settle down to any particular number.
Definition 3.2.3 A sequence x_n is bounded if |x_n| ≤ M for some finite real number M.
But boundedness alone doesn't guarantee that a sequence converges, and just because a sequence fails to converge doesn't mean that it's not interesting. Consider
x_n = cos((n − 1)π/2) = 1, 0, −1, 0, 1, 0, −1, ...
This sequence fails to converge (right?). But it's still quite well-behaved. Every even term is zero,
x_{2k} = 0, 0, 0, ...
it includes a convergent string of ones,
x_{4(k−1)+1} = 1, 1, 1, ...
and a convergent string of −1's,
x_{4(k−1)+3} = −1, −1, −1, ...
So the sequence x_n, even though it is non-convergent, is composed of three convergent sequences. These building blocks are important enough to deserve a name:
Definition 3.2.4 Let n_k be a strictly increasing sequence of natural numbers, with n₁ < n₂ < ... < n_k < .... Then x_{n_k} is a sub-sequence of x_n.
It turns out that every bounded sequence has at least one sub-sequence that is convergent.
Theorem 3.2.5 (Bolzano-Weierstrass Theorem) Every bounded sequence has a convergent subsequence.
This seems like an esoteric (and potentially useless) result, but it is actually a fundamental tool in analysis. Many times, we have a sequence of interest but know very little about it. To study its behavior, we use the Bolzano-Weierstrass theorem to find a convergent sub-sequence, and sometimes that's enough.
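A quick Python illustration: the bounded sequence x_n = (−1)ⁿ + 1/n does not converge, but its even- and odd-indexed subsequences do, exactly as Bolzano-Weierstrass guarantees at least one must:

```python
def x(n):
    """Bounded but non-convergent: x_n = (-1)**n + 1/n."""
    return (-1) ** n + 1 / n

# the even-indexed subsequence n_k = 2k converges to 1 ...
assert all(abs(x(2 * k) - 1.0) < 1e-3 for k in range(1000, 1010))
# ... while the odd-indexed subsequence converges to -1
assert all(abs(x(2 * k + 1) + 1.0) < 1e-3 for k in range(1000, 1010))
```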
3.3 Continuity
In short, a function is continuous if you can trace its graph without lifting your pencil off the paper. This definition is not precise enough to use, but it captures the basic idea: The function has no jumps where the value f(x) changes a great deal even though x changes very little. The reason continuity is so important is that continuous functions preserve useful properties of sets. For example, if you take an interval S = (a, b) and apply a continuous function f(·) to it, the image f(S) is again an interval. This is not the case for every function.
The most basic definition of continuity is
Definition 3.3.1 A function f(x) is continuous at c if, for all ε > 0, there is a δ > 0 so that if |x − c| < δ, then |f(x) − f(c)| < ε.
[Figure: Continuity]
Cauchy's original explanation was that ε is an error tolerance, and a function is continuous at c if the actual error |f(x) − f(c)| can be made smaller than any error tolerance ε by making x close to c (|x − c| < δ). An equivalent definition of continuity is given in terms of sequences by the following:
Theorem 3.3.2 (Sequential Criterion for Continuity) A function f(x) is continuous at c iff for all sequences x_n → c, we have
lim_{n→∞} f(x_n) = f(lim_{n→∞} x_n) = f(c)
We could just as easily use "If lim_{n→∞} f(x_n) = f(c) for all sequences x_n → c, then f(·) is continuous at c" as the definition of continuity. The Sequential Criterion for Continuity converts an idea about functions expressed in ε–δ notation into an equivalent idea about how f(·) acts on sequences. In particular, a continuous function preserves convergence of a sequence x_n to its limit c.
The important result is that if f(x) is continuous and x_n → c, then
lim_{n→∞} f(x_n) = f(lim_{n→∞} x_n)
or, in words, the limit operator and the function can be interchanged if and only if f(x) is continuous.
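The sequential criterion gives a practical way to detect a discontinuity; a Python sketch with a jump function at 0 (and a continuous comparison):

```python
def jump(x):
    """Discontinuous at 0: equals 0 at x = 0, but 1 for x > 0."""
    return 1.0 if x > 0 else 0.0

xn = [1 / n for n in range(1, 1001)]    # a sequence converging to 0
# f(x_n) = 1 along the whole tail, but f(lim x_n) = f(0) = 0,
# so lim f(x_n) != f(lim x_n): the sequential criterion detects the jump
assert jump(xn[-1]) == 1.0
assert jump(0.0) == 0.0
# a continuous function passes the test: x_n**2 -> 0 = 0**2
assert xn[-1] ** 2 < 1e-5
```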
3.4 The Extreme Value Theorem
Up to this point, what are the facts that you should internalize?
Sequences are convergent only if they settle down to a constant value in the long run
Every bounded sequence has a convergent subsequence, even if the original sequence doesn't converge (Bolzano-Weierstrass Theorem)
A function is continuous if and only if lim_{n→∞} f(x_n) = f(x) for all sequences x_n → x
With these facts, we can prove the Weierstrass Theorem, also called the Extreme Value Theorem.
Theorem 3.4.1 (Weierstrass Extreme Value Theorem) If f : [a, b] → R is a continuous function and a and b are finite, then f achieves a global maximum on [a, b].
First, let's check that if any of the assumptions are violated, then examples exist where f does not achieve a maximum. Recall our examples of functions that failed to achieve maxima, f(x) = x² on (0, 1) and

g(x) = { x for 0 ≤ x ≤ 1/2,  2 − x for 1/2 < x ≤ 1 }

on [0, 1]. In the first example, f(x) is continuous, but the set (0, 1) is open, unlike [a, b], violating the hypotheses of the Weierstrass theorem. In the second example, the function's domain is closed and bounded, but the function is discontinuous, violating the hypotheses of the Weierstrass theorem.
So the reason the Weierstrass theorem is useful is that it provides sufficient conditions for a function to achieve a maximum, so that we know for sure, without exception, that any continuous f(x) on a closed, bounded set [a, b] will achieve a maximum.
Proof Suppose that

sup_{x∈[a,b]} f(x) = m*

We know m* < ∞, since if f(x) is continuous on [a, b], it is well-defined for its whole domain and has no asymptotes (unlike 1/x on [0, 1], which is discontinuous at zero). We want to show that there exists an x* in [a, b] so that f(x*) = m*.
Step 1: Since m* is the supremum of f(), we can find a sequence of points {x_n} = x_1, x_2, ... in [a, b] so that

m* − 1/n ≤ f(x_n) ≤ m*

Why? We show this by contradiction. If the above inequality was violated, then for some n, we would not be able to find an x_n so that m* − 1/n ≤ f(x_n) ≤ m*, and m* − 1/n would be greater than f(x) for all x in [a, b], implying that m* − 1/n is an upper bound of f AND less than m*, so that m* was not actually the supremum in the first place, since the supremum is the least upper bound. This would be a contradiction.
Step 2: Since the sequence x_n constructed in Step 1 is contained in [a, b], by the Bolzano-Weierstrass theorem we can find a convergent subsequence, x_{n_k} → x*, where x* is in [a, b].
Step 3: If we take limits along the convergent subsequence,

lim_{n_k→∞} (m* − 1/n_k) ≤ lim_{n_k→∞} f(x_{n_k}) ≤ m*

By continuity of f() and the Sequential Criterion for Continuity, we have lim_{n_k→∞} f(x_{n_k}) = f(x*), and

m* ≤ f(x*) ≤ m*

The above inequalities imply f(x*) = m*, so a maximizer x* exists in [a, b], and the function achieves a maximum at f(x*).
This is the foundational result of all optimization theory, and it pays to appreciate how and
why these steps are each required. This is the kind of powerful result you can prove using a little
bit of analysis.
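The theorem is an existence result, but its closed-interval setting also suggests a crude constructive companion: evaluate f on a fine grid over [a, b] and keep the best point. The Python sketch below is an illustration under that assumption (the quadratic objective and grid size are arbitrary choices), not part of the original notes.

```python
# Grid search on a closed interval [a, b]: since a continuous f attains
# its maximum there (Extreme Value Theorem), a fine grid gets close to it.
def grid_max(f, a, b, n=10**5):
    """Return (argmax, max) of f over an (n+1)-point grid on [a, b]."""
    best_x, best_v = a, f(a)
    for i in range(1, n + 1):
        x = a + (b - a) * i / n
        v = f(x)
        if v > best_v:
            best_x, best_v = x, v
    return best_x, best_v

# A concave test objective with maximizer 0.3 and maximum 1.0.
x_star, v_star = grid_max(lambda x: -(x - 0.3) ** 2 + 1.0, 0.0, 1.0)
```

On an open interval like (0, 1) with f(x) = x², the analogous search would keep chasing points toward 1 without ever finding a maximizer, which is exactly what the theorem's hypotheses rule out.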
3.5 Derivatives
As you know, derivatives measure the rate of change of a function at a point, or

f'(x) = D_x f(x) = df(x)/dx = lim_{h→0} [f(x + h) − f(x)]/h

The way to visualize the derivative is as the limit of a sequence of chords,

lim_{n→∞} [f(x_n) − f(x)]/(x_n − x)

that converge to the tangent line with slope f'(x).
[Figure: The Derivative]
Since this sequence of chords is just a sequence of numbers, the derivative is just the limit of a particular kind of sequence. So if the derivative exists, it is unique, and a derivative exists only if it takes the same value no matter what sequence x_n → x you pick.
For example, the derivative of the square root of x, √x, can be computed bare-handed as

lim_{h→0} [√(x+h) − √x]/h = lim_{h→0} {[√(x+h) − √x]/h} · {[√(x+h) + √x]/[√(x+h) + √x]}
= lim_{h→0} [(x + h − x)/h] · [1/(√(x+h) + √x)]
= lim_{h→0} 1/(√(x+h) + √x)
= 1/(2√x)
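The limit above can also be seen numerically: the difference quotient of √x settles onto 1/(2√x) as h shrinks. The sketch below (an illustration, with x = 4 an arbitrary choice) checks this against the exact value 1/(2·√4) = 0.25.

```python
# The difference quotient of sqrt(x) at x = 4 approaches 1/(2*sqrt(4)) = 0.25.
import math

x = 4.0
for h in (1.0, 0.1, 0.001):
    dq = (math.sqrt(x + h) - math.sqrt(x)) / h
    print(h, dq)  # tends toward 0.25 as h shrinks
```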
For the most part, of course, no one really computes derivatives like that. We have theorems like

D_x[a f(x) + b g(x)] = a f'(x) + b g'(x)

D_x[f(x)g(x)] = f'(x)g(x) + f(x)g'(x) (product rule)

D_x[f(g(x))] = f'(g(x)) g'(x) (chain rule)

D_x[f^{−1}](f(x)) = 1/f'(x) (inverse rule)
as well as the derivatives of specific functional forms

D_x x^k = k x^{k−1},  D_x e^x = e^x,  D_x log(x) = 1/x
and so on. This allows us to compute many fairly complicated derivatives by grinding through the above rules. But a notable feature of economics is that we are fundamentally unsure of what functional forms we should be using, despite the fact that we know a reasonable amount about what they look like. These qualitative features are often expressed in terms of derivatives. For example, it is typically assumed that a consumer's benefit from a good is positive, marginal benefit is positive, but marginal benefit is decreasing. In short, v(q) ≥ 0, v'(q) ≥ 0 and v''(q) ≤ 0. A firm's total costs are typically positive, marginal cost is positive, and marginal cost is increasing. In short, C(q) ≥ 0, C'(q) ≥ 0, and C''(q) ≥ 0. By specifying our assumptions this way, we are being precise as well as avoiding the arbitrariness of assuming that a consumer has log(q) preferences for some good, but 1 − e^{−bq} preferences for another.
3.5.1 Non-differentiability
How do we recognize non-differentiability? Consider

f(x) = |x| = { −x if x < 0,  0 if x = 0,  x if x > 0 },

the absolute value of x. What is the derivative, f'(x)? For x < 0, the function is just f(x) = −x, which has derivative f'(x) = −1. For x > 0, the function is just f(x) = x, which has derivative f'(x) = 1. But what about at zero? First, let's define the derivative of f at x from the left as

f'(x⁻) = lim_{h→0⁻} [f(x + h) − f(x)]/h

and the derivative of f at x from the right as

f'(x⁺) = lim_{h→0⁺} [f(x + h) − f(x)]/h
Note that:

Theorem 3.5.1 A function is differentiable at a point x if and only if its one-sided derivatives exist and are equal.
Then for f(x) = |x| with x = 0, we have

f'(0⁺) = lim_{h→0⁺} [|0 + h| − |0|]/h = lim_{h→0⁺} h/h = 1

and

f'(0⁻) = lim_{h→0⁻} [|0 + h| − |0|]/h = lim_{h→0⁻} |h|/h = −1
So we could hypothetically assign any number from −1 to 1 to be the derivative of |x| at zero. In this case, we say that f(x) is non-differentiable at x, since the tangent line to the graph of f(x) is not unique; people often say there is a "corner" or "kink" in the graph of |x| at zero. We already computed

D_x √x = 1/(2√x)

Obviously, we can't evaluate this function for x < 0 since √x is only defined for positive numbers. For x > 0, the function is also well behaved. But at zero, we have

D_x √x |_{x=0} = 1/(2√0)

which is undefined, so the derivative fails to exist at zero. So, if you want to show a function is non-differentiable, you need to show that the derivatives from the left and from the right are not equal, or that the derivative fails to exist.
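One-sided difference quotients are easy to compute numerically, and for |x| at zero they make the kink visible: the right and left quotients disagree. This Python sketch is an illustration (the step size h is an arbitrary small number).

```python
# One-sided difference quotients of |x| at 0 disagree, so |x| is not
# differentiable there.
def right_dq(f, x, h=1e-6):
    # approximates the derivative from the right, f'(x+)
    return (f(x + h) - f(x)) / h

def left_dq(f, x, h=1e-6):
    # approximates the derivative from the left, f'(x-)
    return (f(x) - f(x - h)) / h

f = abs
r = right_dq(f, 0.0)
l = left_dq(f, 0.0)
print(r, l)  # 1.0 and -1.0
```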
3.6 Taylor Series
It turns out that the sequence of derivatives of a function

f'(x), f''(x), f'''(x), ..., f^{(k)}(x), ...

generally provides enough information to recover the function, or approximate it as well as we need near a particular point using only the first k terms.
Definition 3.6.1 The k-th order Taylor polynomial of f(x) based at x_0 is

f(x) = f(x_0) + f'(x_0)(x − x_0) + f''(x_0)(x − x_0)²/2 + ... + f^{(k)}(x_0)(x − x_0)^k/k!   (k-th order approximation of f)
       + f^{(k+1)}(c)(x − x_0)^{k+1}/(k + 1)!   (remainder term)

where c is between x and x_0.
Example The second-order Taylor polynomial of f(x) based at x_0, with its remainder, is

f(x) = f(x_0) + f'(x_0)(x − x_0) + f''(x_0)(x − x_0)²/2 + f'''(c)(x − x_0)³/6

For f(x) = e^x with x_0 = 1, we have

f(x) = e + e(x − 1) + e(x − 1)²/2 + e^c (x − 1)³/6

while for x_0 = 0 we have

f(x) = 1 + x + x²/2 + e^c x³/6
For f(x) = x⁵ + 7x² + x with x_0 = 3, we have

f(x) = 309 + 448(x − 3) + 554(x − 3)²/2 + 60c²(x − 3)³/6

while for x_0 = 10 we have

f(x) = 100,710 + 50,141(x − 10) + 20,014(x − 10)²/2 + 60c²(x − 10)³/6

For f(x) = log(x) with x_0 = 1, we have

f(x) = (x − 1) − (x − 1)²/2 + (2/c³)(x − 1)³/6
So while the Taylor series with the remainder/error term is exact, dropping the remainder introduces error. We often simply work with the approximation and claim that if we are sufficiently close to the base point, it won't matter. Or, we will use a Taylor series to expand a function in terms of its derivatives, perform some calculations, and then take a limit so that the error vanishes. Why are these claims valid? Consider the second-order Taylor polynomial,

f(x) = f(x_0) + f'(x_0)(x − x_0) + f''(x_0)(x − x_0)²/2 + f'''(c)(x − x_0)³/6
This equality is exact when we include the remainder term, but not when we drop it. Let

f̃(x) = f(x_0) + f'(x_0)(x − x_0) + f''(x_0)(x − x_0)²/2

be our second-order approximation of f around x_0. Then the approximation error

|f(x) − f̃(x)| = |f'''(c)| |(x − x_0)³| / 6

is just a constant |f'''(c)|/6 multiplied by |(x − x_0)³|. Therefore, we can make the error arbitrarily small (less than any ε > 0) by making x very close to x_0:
[|f'''(c)|/6] |(x − x_0)³| < ε  whenever  |x − x_0| < (6ε/|f'''(c)|)^{1/3}
We write this as

f(x) = f(x_0) + f'(x_0)(x − x_0) + f''(x_0)(x − x_0)²/2 + o(h³)

where h = x − x_0, or we say that the error is "order h-cubed." This is understood to mean that if h = x − x_0 is small enough, then the approximation error will be as small as desired. This is important for maximization theory because we will often want to use low-order Taylor polynomials around a local maximum x*, and we need to know that if x is close enough to x*, the approximation will satisfy f(x*) ≥ f(x) (why is this important?).
3.7 Partial Derivatives
Most functions of interest to us are not functions of a single variable, but many. As a result, even though we're focused on maximization where the choice variable is one-dimensional, it helps to introduce partial derivatives so we can study how solutions and payoffs vary in terms of variables outside the agent's control.
For example, a firm's profit function

π(q) = pq − (c/2)q²

is really a function of q, p and c, or π(q, p, c). We will need to differentiate not just with respect to q, but also p and c.
Definition 3.7.1 Let f(x_1, x_2, ..., x_n) : R^n → R. The partial derivative of f(x) with respect to x_i is

∂f(x)/∂x_i = lim_{h→0} [f(x_1, ..., x_i + h, ..., x_n) − f(x_1, ..., x_i, ..., x_n)]/h

The gradient is the vector of partial derivatives

∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n)'
Since this notation can become cumbersome, especially when we differentiate multiple times with respect to different variables x_i and then x_j, and so on, we often write

∂f(x)/∂x_i = f_{x_i}(x)

or

∂²f(x)/∂x_j ∂x_i = f_{x_j x_i}(x)
Example Consider a simple profit function

π(q, p, c) = pq − (c/2)q²

Then

∂π(q, p, c)/∂c = −(1/2)q²

and

∂π(q, p, c)/∂p = q

and the gradient is

∇π(q, p, c) = (∂π/∂q, ∂π/∂p, ∂π/∂c) = (p − cq, q, −(1/2)q²)
The partial derivative with respect to x_i holds all the other variables (x_1, ..., x_{i−1}, x_{i+1}, ..., x_n) constant and only varies x_i slightly, exactly like a one-dimensional derivative where the other arguments of the function are treated as constants.
3.7.1 Dierentiation with Multiple Arguments and Chain Rules
Recall that the one-dimensional chain rule is that, for any two differentiable functions f(x) and g(y),

D_x g(f(x)) = g'(f(x)) f'(x)

Of course, since a partial derivative is just a regular derivative where all other arguments are held constant, it's true that

∂g(y_1, y_2, ..., f(x_i), ..., y_N)/∂x_i = [∂g(...)/∂y_i] f'(x_i)
But we run into problems when we face a function f(y_1, ..., y_N) where many of the variables y_i are functions that depend on some other, common variable, c. For example, consider f(x(c), c). The c term shows up multiple places, so it is not immediately obvious how to differentiate with respect to c.
Let g(c) = f(x(c), c), and consider totally differentiating g(c) with respect to c:

g'(c) = df(x(c), c)/dc = lim_{h→0} [f(x(c + h), c + h) − f(x(c), c)]/h
This limit looks incalculable, since both arguments are varying at the same time. Consider using a Taylor series to expand the first term in x at x(c) as

f(x(c + h), c + h) = f(x(c), c + h) + [∂f(ξ, c + h)/∂x] (x(c + h) − x(c))

where ξ is between x(c) and x(c + h). Let's now expand the first term on the right-hand side in c at c, as

f(x(c), c + h) = f(x(c), c) + [∂f(x(c), η)/∂c] h

where η is between c and c + h. Inserting the second equation into the first, we get

f(x(c + h), c + h) = f(x(c), c) + [∂f(x(c), η)/∂c] h + [∂f(ξ, c + h)/∂x] (x(c + h) − x(c))

Since ξ is between x(c + h) and x(c), it must tend to x(c) as h → 0; since η is between c and c + h, it must tend to c as h → 0. Then re-arranging and dividing by h yields

[f(x(c + h), c + h) − f(x(c), c)]/h = ∂f(x(c), η)/∂c + [∂f(ξ, c + h)/∂x] · [x(c + h) − x(c)]/h
The left-hand side is almost the derivative of g(c); we just need to take limits with respect to h. Taking the limit as h → 0 then yields

g'(c) = df(x(c), c)/dc = f_c(x(c), c) + f_x(x(c), c) x'(c)

So we work argument by argument, partially differentiating all the way through using the chain rule, and then summing all the resulting terms. For example,

(d/dc) g(x_1(c), x_2(c), ..., x_N(c), c) = [Σ_{i=1}^N g_{y_i}(x(c), c) x_i'(c)] + ∂g(x(c), c)/∂c
Exercises
1. Write out the first few terms of the following sequences, find their limits if the sequence converges, and find the suprema and infima:

x_n = (−1)^n / n

x_n = √n / n

x_n = √(n + 1) / (√(n + 1) + √n)

x_n = sin((n − 1)π/2)

(Hint: Does this sequence have multiple convergent sub-sequences? Argue that a sequence with multiple convergent sub-sequences that have different limits cannot be convergent.)

x_n = (1/n)^{1/n}

(Hint: Show that x_n is an increasing sequence, and then argue that sup_n (1/n)^{1/n} = 1. Then x_n → 1, right?)
2. Give an example of a sequence x_n and a function f(x) so that x_n → c, but lim_{n→∞} f(x_n) ≠ f(c).
3. Suppose x* is a local maximizer of f(x). Let g() be a strictly increasing function. Is x* a local maximizer of g(f(x))? Suppose x* is a local minimizer of f(x). What kind of transformations g() ensure that x* is also a local minimizer of g(f(x))? Suppose x* is a local maximizer of f(x). Let g() be a strictly decreasing function. Is x* a local maximizer or minimizer of g(f(x))?
4. Rewrite the proof of the extreme value theorem for minimization, rather than maximization.

5. A function is Lipschitz continuous on [a, b] if, for all x' and x'' in [a, b], |f(x') − f(x'')| < K|x' − x''|, where K is finite. Show that a Lipschitz continuous function is continuous. Provide an example of a continuous function that is not Lipschitz continuous. How is the Lipschitz constant K related to the derivative of f(x)?
6. Using Matlab or Excel, numerically compute first- and second-order approximations of the exponential function with x_0 = 0 and x_0 = 1. Graph the approximations and the approximation error as you move away from each x_0. Do the same for the natural log function with x_0 = 1 and x_0 = 10. Explain whether the second- or third-order approximation does better, and how the performance degrades as x moves away from x_0 for each approximation.
7. Suppose f'(x) > g'(x). Is f(x) > g(x)? Prove that if f'(x) > g'(x) > 0, there exists a point x_0 such that for x > x_0, f(x) ≥ g(x).

8. Explain when the derivatives of f(g(x)), f(x)g(x), and f(x)/g(x) are positive for all x or strictly negative for all x.
9. Consider a function f(c, d) = g(y_1(c, d), y_2(d), c). Compute the partial derivatives of f(c, d) with respect to c and d. Repeat with f(a, b) = g(y_1(z(a), h(a, b)), y_2(w(b))).
Proofs
Bolzano-Weierstrass Theorem If a sequence is bounded, all its terms are contained in a set I = [a, b]. Take the first term of the sequence, x_1, and let it be x_{n_1}. Now, split the interval I = [a, b] into [a, (a + b)/2] and [(a + b)/2, b]. One of these subsets must contain an infinite number of terms, since x_n is an infinitely long sequence. Pick that subset, and call it I_1. Select any member of the sequence in I_1, and call it x_{n_2}. Now split I_1 in half. Again, one of the subsets of I_1 has an infinite number of terms, so pick it and call it I_2. Proceeding in this way, splitting the sets in half and picking an element from the set with an infinite number of terms in it, we construct a sequence of sets I_k and a subsequence x_{n_k}.
Now note that the length of the sets I_k is

(b − a)(1/2)^k → 0

So the distance between the terms of the sub-sequence x_{n_k} and x_{n_{k+1}} cannot be more than (b − a)(1/2)^k. Since this process continues indefinitely and (b − a)(1/2)^k → 0, there will be a limit term x of the sequence¹. Then for some k we have |x_{n_k} − x| < (1/2)^k, which can be made arbitrarily small by making k large; or, for any ε > 0, if k ≥ K then |x_{n_k} − x| < ε for K ≥ log(ε)/log(1/2).
Sequential Criterion for Continuity Suppose f(x) is continuous at c. Then for all ε > 0, there is a δ > 0 so that |x − c| < δ implies |f(x) − f(c)| < ε. Take any sequence x_n → c. This implies that for every δ > 0 there is some N so that n ≥ N implies |x_n − c| < δ. Then |x_n − c| < δ, so that |f(x_n) − f(c)| < ε. Therefore, lim f(x_n) = f(c).
For the converse, take any sequence x_n → c, and suppose lim f(x_n) = f(c). Then for all ε > 0 there exists an N_ε so that n ≥ N_ε implies |f(x_n) − f(c)| < ε. Since x_n → c, for all δ > 0 there is an N_δ so that n ≥ N_δ implies |x_n − c| < δ. Let N = max{N_ε, N_δ}. Then for all ε > 0, there is a δ > 0 so that |x_n − c| < δ implies |f(x_n) − f(c)| < ε. Therefore f(x) is continuous.
Questions about sequences and the extreme value theorem:

7. Prove that a convergent sequence is bounded. (You will need the fact that |y| < c implies −c < y < c. Try to bound the tail terms as |x_m| < |ε + x_N| for m > N, and then argue the sequence is bounded by the constant M = max{x_1, x_2, ..., x_N, |ε + x_N|}.)

8. Prove that a convergent sequence has a unique limit. (Start by supposing that x_n converges to x' and x''. Then use the facts that |y| ≤ c implies −c ≤ y ≤ c and |a + b| ≤ |a| + |b| to show that |x' − x''| < ε/2 + |x_n − x''|, and |x_n − x''| can be made less than ε/2.)
9. The extreme value theorem says that if f : [a, b] → R is continuous, then a maximizer of f(x) exists in [a, b]. Is it necessary for f(x) to be continuous for this result to be true? (Provide an example to explain your answer.) A function f : [a, b] → R is upper semi-continuous if, for all ε > 0, there exists a δ > 0 so that if |x − c| < δ, then

f(x) ≤ f(c) + ε

(a) Sketch some upper semi-continuous functions. (b) Show that upper semi-continuity is equivalent to: For any sequence x_n → c,

lim_{n→∞} sup_{k≥n} f(x_k) ≤ f(c)

(c) Use the sequential criterion for upper semi-continuity in part (b) to show that the extreme value theorem extends to upper semi-continuous functions.

¹ This part of the proof is a little loose. The existence of x is ensured by the Cauchy criterion for convergence, which is covered in exercise 10.
10. A sequence is Cauchy if, for all ε > 0, there is an N so that for all n, m ≥ N, |x_n − x_m| < ε. Show that a sequence x_n → x if and only if it is Cauchy. The bullet points below sketch the proof for you; provide the missing steps:

- To prove a convergent sequence is Cauchy (easy part): Add and subtract x inside |x_n − x_m|, then use the triangle inequality, |a + b| ≤ |a| + |b|. Lastly, use the definition of a convergent sequence.
- To prove a Cauchy sequence converges (hard part):
  - Show that a Cauchy sequence is bounded. (You will need the fact that |y| < c implies −c < y < c. Try to bound the tail terms as |x_m| < |ε + x_N| for m > N, and then argue the sequence is bounded by the constant M = max{x_1, x_2, ..., x_N, |ε + x_N|}.)
  - Use the Bolzano-Weierstrass theorem to argue that a Cauchy sequence then has a convergent subsequence with limit x.
  - Show that x_n → x, so that the sequence converges to x. Suppose there is a sub-sequence x_{n_k} of x_n that does not converge to x, and show that this leads to a contradiction (again, the inequality |y| ≤ c implies −c ≤ y ≤ c is useful. Compare the non-convergent subsequence with the convergent one, along with the definition of a Cauchy sequence for points from the two sequences.)

Cauchy sequences can be very useful when you want to prove a sequence converges, but have no idea what the exact limit is. One example is studying sequences of decision variables in macroeconomics, where, for example, a choice of capital k_t is generated by the economy each period t, but it is unclear whether this sequence converges or diverges.
Chapter 4

Necessary and Sufficient Conditions for Maximization
While existence results are useful in terms of helping us understand which maximization problems have answers at all, they do little to help us find maximizers. Since analyzing any economic model relies on finding the optimal behavior for each agent, we need a method of finding maximizers and determining when they are unique.

4.1 First-Order Necessary Conditions

We start by asking the question, "What criteria must a maximizer satisfy?" This throws away many possibilities, and narrows attention to a candidate list. This does not mean that every candidate is indeed a maximizer, but merely that if a particular point is not on the list, then it cannot be a maximizer. These criteria are called first-order necessary conditions.
Example Suppose a price-taking firm has total costs C(q), and gets a price p for its product. Then its profit function is

π(q) = pq − C(q)

How can we find a maximizer of π()? Let's start by differentiating with respect to q:

π'(q) = p − C'(q)

Since π'(q) measures the rate of change of π at q, we can increase profits if p − C'(q) > 0 at q by increasing q a little bit. On the other hand, if p − C'(q) < 0, we can raise profits by decreasing q a little bit. The only place where the firm has no incentive to change its decision is where p = C'(q*).
Note that the logic of the above argument is that if q* is a maximizer of π(q), then π'(q*) = 0. We are using a property of a maximizer to derive conditions it must satisfy, or necessary conditions.
We can make the same argument for any maximization problem:
Theorem 4.1.1 If x* is a local maximum of f(x) and f(x) is differentiable at x*, then f'(x*) = 0.
Proof Suppose x* is a local maximum of f(). If f'(x*) > 0, then we could take a small step to x* + h, and the Taylor series would be

f(x* + h) = f(x*) + f'(x*)h + o(h²)

implying that for very small h > 0,

f(x* + h) − f(x*) = f'(x*)h > 0

so that x* + h gives a higher value of f() than x*, which would be a contradiction. So f'(x*) cannot be strictly greater than zero.
Suppose x* is a local maximum of f(). If f'(x*) < 0, then we could take a small step to x* − h, and the Taylor series would be

f(x* − h) = f(x*) − f'(x*)h + o(h²)

implying that for very small h > 0,

f(x* − h) − f(x*) = −f'(x*)h > 0

so that x* − h gives a higher value of f() than x*, which would be a contradiction. So f'(x*) cannot be strictly less than zero.
That leaves f'(x*) = 0 as the only possibility: at a local maximum x*, we cannot improve f() by taking a small step in either direction, so f'(x*) must equal zero.
So the basic candidate list for an (unconstrained, single-dimensional) maximization problem boils down to

- The set of points at which f(x) is not differentiable.
- The set of critical points, where f'(x) = 0.
Example A consumer with utility function u(q, m) = b log(q) + m over the market good q and money spent on other goods m faces budget constraint w = pq + m. The consumer wants to maximize utility.
We can rewrite the constraint as m = w − pq, and substitute it into u(q, m) to get b log(q) + w − pq. Treating this as a one-dimensional maximization problem,

max_q b log(q) + w − pq

our FONCs are

b/q* − p = 0

or

q* = b/p
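As a quick numerical sanity check of this candidate (an illustration with arbitrary parameter values b = 2, p = 4, w = 10, not from the notes), the objective evaluated at q* = b/p should beat nearby points:

```python
# Verify that the FONC candidate q* = b/p beats nearby q in
# u(q) = b*log(q) + w - p*q.
import math

b, p, w = 2.0, 4.0, 10.0

def u(q):
    return b * math.log(q) + w - p * q

q_star = b / p  # = 0.5
nearby = [q_star + d for d in (-0.1, -0.01, 0.01, 0.1)]
print(all(u(q_star) >= u(q) for q in nearby))  # True
```

Of course, a finite comparison like this only suggests a local maximum; the second-order conditions developed below are what confirm it analytically.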
Note however, that the set of critical points potentially includes some local minimizers. Why? If x* is a maximizer of f(x), then f'(x*) = 0; but then x* is a minimizer of −f(x), and −f'(x*) = 0, so x* is also a critical point of −f(x). So being a critical point is not enough to guarantee that a point is a maximizer.
Example Consider the function

f(x) = −(1/4)x⁴ + (c²/2)x²

The FONC is

f'(x) = −x³ + c²x = 0

One critical point is x* = 0. Dividing by x ≠ 0 and solving by radicals gives two more: x* = ±c. So our candidate list has three entries. To figure out which is best, we substitute them back into the objective:

f(0) = 0,  f(+c) = f(−c) = (1/4)c⁴ > 0

So +c and −c are both global maxima (since f(x) is decreasing for x > c and increasing for x < −c, we can ignore all points beyond ±c). The point x = 0 is a local minimum (but not a global minimum).
On the other hand, we can find critical points that are neither maxima nor minima.

Example Let f(x) = x³ on [−1, 1] (a solution to this problem exists, right?). Then

f'(x) = 3x² = 0

has only one solution, x = 0. So the only critical point of x³ is zero. But it is neither a maximum nor a minimum on [−1, 1] since f(0) = 0, but f(1) = 1 and f(−1) = −1. This is an example of an inflection point or a saddle point.
However, if we have built the candidate list correctly, then one or more of the points on the candidate list must be the global maximizer. If worse comes to worst, we can always evaluate the function f(x) for every candidate on the list, and compare their values to see which does best. We would then have found the global maximizer for sure. (But this might be prohibitively costly, which is why we use second-order sufficient conditions, which are the next topic.)
So even though FONCs are useful for building a candidate list (non-differentiable points and critical points), they don't discriminate between maxima, minima, and inflection points. However, we can develop ways of testing critical points to see how they behave locally, called second-order sufficient conditions.
4.2 Second-Order Sucient Conditions
The idea of second-order sufficient conditions is to provide criteria that ensure a critical point is a maximum or minimum. This gives us both a test to see if a point on the candidate list is a local maximum or local minimum, as well as more information about the behavior of a function near a local maximum in terms of calculus.
Theorem 4.2.1 If f'(x*) = 0 and f''(x*) < 0, then x* is a local maximum of f().

Proof The Taylor series at x* is

f(x) = f(x*) + f'(x*)(x − x*) + f''(x*)(x − x*)²/2 + o(h³)

Since f'(x*) = 0, we have

f(x) = f(x*) + f''(x*)(x − x*)²/2 + o(h³)

and re-arranging yields

f(x*) − f(x) = −f''(x*)(x − x*)²/2 − o(h³)

so for h = x − x* very close to zero, we get

f(x*) − f(x) ≈ −f''(x*)(x − x*)²/2 > 0

and

f(x*) > f(x)

so that x* is a local maximum of f(x).
This is the standard proof of the SOSCs, but it doesn't give much geometric intuition about what is going on. Using one-sided derivatives, the second derivative can always be written as

f''(x) = lim_{h→0} [f'(x + h) − f'(x)]/h = lim_{h→0} { [f(x + h) − f(x)]/h − [f(x) − f(x − h)]/h } / h

which equals

f''(x) = lim_{h→0} [f(x + h) + f(x − h) − 2f(x)]/h²

or

f''(x) = lim_{h→0} 2 { [f(x + h) + f(x − h)]/2 − f(x) } / h²

Now, the term

[f(x + h) + f(x − h)]/2

is the average of the function values one step above and one step below x. So if (f(x + h) + f(x − h))/2 < f(x), the average of the function values at x + h and x − h is less than the value at x, so the function must locally look like a hill. If (f(x + h) + f(x − h))/2 > f(x), then the average of the function values at x + h and x − h is above the function value at x, and the function must locally look like a valley. This is the real intuition for the second-order sufficient conditions: The second derivative is testing the curvature of the function to see whether it's a hill or a valley.
[Figure: Second-Order Sufficient Conditions]
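The symmetric second difference in the curvature argument above is directly computable. This Python sketch (an illustration; the test functions ±x² and the step size are arbitrary choices) shows the sign flipping between a hill and a valley:

```python
# The symmetric second difference (f(x+h) + f(x-h) - 2*f(x)) / h**2
# approximates f''(x): negative at a "hill", positive at a "valley".
def second_diff(f, x, h=1e-4):
    return (f(x + h) + f(x - h) - 2 * f(x)) / h ** 2

hill = second_diff(lambda x: -x ** 2, 0.0)   # ~ -2 (local max at 0)
valley = second_diff(lambda x: x ** 2, 0.0)  # ~ +2 (local min at 0)
print(hill, valley)
```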
If f''(x*) = 0, however, we can't conclude anything about a critical point. Recall that if f(x) = x³, then f'(x) = 3x² and f''(x) = 6x. Evaluated at zero, the critical point x = 0 gives f''(0) = 0. So an indeterminate second derivative provides no information about whether we are at a maximum, minimum, or inflection point.
Example Consider a price-taking firm solving

max_q pq − C(q)

To maximize profit, its FONCs are

p − C'(q*) = 0

and its SOSCs are

−C''(q*) < 0

So as long as C''(q*) > 0, a critical point is a maximum. For example, C(q) = (c/2)q² satisfies C''(q) = c, so for c > 0 the SOSCs are always satisfied.
Example Consider a consumer with utility function u(q, m) = v(q) + m and budget constraint w = pq + m. The consumer then maximizes

max_q v(q) + w − pq

yielding FONCs

v'(q*) − p = 0

and SOSCs

v''(q*) < 0

So as long as v''(q*) < 0, a critical point q* is a local maximum. For example, v(q) = b log(q) satisfies this condition.
Example Recall the function

f(x) = −(1/4)x⁴ + (c²/2)x²

The FONC is

f'(x) = −x³ + c²x = 0

and it had three critical points, 0 and ±c. The second derivative is

f''(x) = −3x² + c²

For x* = 0, we get f''(0) = c² > 0, so it is a local minimum, not a maximum. For x* = ±c, we get f''(±c) = −3c² + c² = −2c² < 0, so these points are both local maxima.
Example Suppose an agent consumes in period 1 and period 2, with utility function

u(c_1, c_2) = log(c_1) + log(c_2)

where c_1 + s = y_1 and Rs + y_2 = c_2, and R > 1 is the interest rate plus 1. Substituting the constraints into the objective yields the problem

max_s log(y_1 − s) + log(Rs + y_2)

Maximizing over s yields the FONC

−1/(y_1 − s*) + R/(Rs* + y_2) = 0

and the SOSC is

−1/(y_1 − s*)² − R²/(Rs* + y_2)² < 0

which is automatically satisfied, since for any s (not just a critical point) we have

−1/(y_1 − s)² − R²/(Rs + y_2)² < 0
It's a nice feature that the SOSCs are automatically satisfied in the previous example. If we could find general features of functions that guarantee this, it would make our lives much easier. In particular, it turned out that f''(x) < 0 for all x, not just at a critical point x*. This is the special characteristic of a concave function.
We can say a bit more about local maximizers using a similar approach.
Theorem 4.2.2 If x* is a local maximum of f() and f'(x*) = 0, then f''(x*) ≤ 0.

Proof If x* is a local maximum, then the Taylor series around x* is

f(x) = f(x*) + f'(x*)(x − x*) + f''(x*)(x − x*)²/2 + o(h³)

Using the FONCs and re-arranging yields

f(x*) − f(x) = −f''(x*)(x − x*)²/2 − o(h³)

Since x* is a local maximum, f(x*) ≥ f(x), and this implies that for x close enough to x*,

f''(x*) ≤ 0

These are called second-order necessary conditions, since they follow by necessity from the fact that x* is a local maximum and critical point (can you have a local maximum that is not a critical point?). Exercise 5 asks you to explain the difference between Theorem 4.2.1 and Theorem 4.2.2. This is one of those subtle points that will bother you in the future if you don't figure it out now.
Exercises
1. Derive FONCs and SOSCs for the firm's profit-maximization problem for the cost functions C(q) = (c/2)q² and C(q) = e^{cq}.
2. When an objective function f(x, y) depends on two controls, x and y, and is subject to a linear constraint c = ax + by, the problem can be simplified to a one-dimensional program by solving the constraint in terms of one of the controls,

y = (c − ax)/b

and substituting it into the objective to get

max_x f(x, (c − ax)/b)

Derive FONCs and SOSCs for this problem. What assumptions guarantee that the SOSCs hold?
3. Derive FONCs and SOSCs for the consumer's utility-maximization problem for the benefit functions v(q) = b log(q), v(q) = 1 − e^{−bq}, and v(q) = bq − (c/2)q².

4. For a monopolist with profit function

max_q p(q)q − C(q)

where C(q) is an increasing, convex function, derive the FONCs and SOSCs. Provide the best sufficient condition you can think of for the SOSCs to hold for any increasing, convex cost function C(q).

5. Explain the difference between Theorem 4.2.1 and Theorem 4.2.2 both (i) as completely as possible, and (ii) as briefly as possible. Examples can be helpful.
6. Prove that for a function f : R^n → R, a point x* is a local maximum only if ∇f(x*) = 0. Suppose a firm hires both capital K and labor L to produce output through technology F(K, L), where p is the price of its good, w is the price of hiring a unit of labor, and r is the price of hiring a unit of capital. Derive FONCs for the firm's profit maximization problem.
7. For a function f : R^n → R and a critical point x*, is it sufficient that f(x) have a negative second partial derivative in each argument x_i,

∂²f(x*)/∂x_i² < 0,

for x* to be a local maximum of f(x)? Prove or disprove with a counter-example. (Try using Matlab or Excel to plot the family of quadratic forms

f(x_1, x_2) = a_1 x_1 + a_2 x_2 + (b_1/2)x_1² + (b_2/2)x_2² + b_3 x_1 x_2

and experiment with different coefficients to see if the critical point is a maximum or not.)
8. Suppose you have a hard maximization problem that you cannot solve by hand, or even in a straightforward way on a computer. How might you proceed? (i) Write out a second-order Taylor series around an initial guess, x_0, to the point x_1. Now derive an expression for f(x_1) − f(x_0), and maximize the difference (you are using a local approximation of the function to maximize the increase in the function's value as you move from x_0 to x_1). Solve the first-order necessary condition in terms of x_1. (ii) If you replace x_0 with x_k and x_1 with x_{k+1}, repeating this procedure will generate a sequence of guesses x_k. When do you think it converges to the true x*? This numerical approach to finding maximizers is called Newton's method.
Chapter 5
The Implicit Function Theorem
The quintessential example of an economic model is a partial equilibrium market with a demand equation, expressing how much consumers are willing to pay for an amount q as a function of weather w,

p = f_d(q, w)

and a supply equation, expressing how much firms require to produce an amount q as a function of technology t,

p = f_s(q, t)

At a market-clearing equilibrium (p*, q*), we have

f_d(q*, w) = p* = f_s(q*, t)
But now we might ask, how do q

and p

change when w and t change? If the rms technology


t improves, how is the market aected? If the weather w improves and more consumers want to
travel, how does demand shift? Since we rarely know much about f
d
() and f
s
(), we want to avoid
adding additional, unneeded assumptions to the framework. The implicit function theorem allows
us to do this.
In economics, we make a distinction between endogenous variables p* and q* and exogenous variables w and t. An exogenous variable is something like the weather, where no economic agent has any control over it. Another example is the price of lettuce if you are a consumer: refusing to buy lettuce today probably has no impact on the price, so it is essentially beyond your control, and can be treated as a constant. An endogenous variable is one that is determined within the system. For example, the weather is out of your control, but you can choose whether to go to the beach or go to the movies. Prices might be beyond a consumer's control, but the consumer can still choose what bundle of goods to purchase.
A good model is one which judiciously picks what behavior to explain (endogenous variables) in terms of economic circumstances that are plausibly beyond the agents' control (exogenous variables), in the clearest and simplest way possible.
Once a good model is provided, we want to ask, how do endogenous variables change when exogenous variables change? The Implicit Function Theorem is our tool for doing this. You might want to see how a change in a tax affects behavior, or how an increase in the interest rate affects capital accumulation. Since data are not available for these hypothetical worlds, we need models to explain the relationships between exogenous and endogenous variables, so we can then adjust the exogenous variables in our theoretical laboratory.
Example Consider a partial equilibrium market where consumers have utility function u(q, m) = v(q) + m, where v(q) is increasing and concave, and a budget constraint w = pq + m. Then consumers maximize

max_q  v(q) + w - pq

yielding a FONC

v'(q*) - p = 0

and an inverse demand curve is given by p = v'(q_D). Suppose that firms have increasing, convex costs C(q), and face a tax t for every unit they sell. Then their profit function is

π(q) = (p - t)q - C(q)

The FONC is

p - t - C'(q*) = 0

and the inverse supply curve is given by p = t + C'(q_S).
A market-clearing equilibrium (p*, q*) occurs where

v'(q*) = p* = t + C'(q*)

If we re-arrange this equation, we get

v'(q*) - t - C'(q*) = 0

Now, if we think of the market-clearing quantity q*(t) as a function of taxes t, we can totally differentiate with respect to t to get

v''(q*(t)) ∂q*/∂t - 1 - C''(q*(t)) ∂q*/∂t = 0

and re-arranging yields

∂q*/∂t = 1 / (v''(q*(t)) - C''(q*(t))) < 0

where the sign follows because v'' < 0 and C'' > 0. So if t increases, the market-clearing quantity falls (what happens to the market-clearing price?).
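To make the result concrete, here is a numerical check under an assumed specification (v(q) = log(q) and C(q) = q²/2 are illustrative choices, not required by the argument). The equilibrium condition becomes 1/q - t - q = 0, which has a closed-form positive root, so the IFT slope 1/(v''(q*) - C''(q*)) can be compared against a finite difference.

```python
import math

def q_star(t):
    # Solve 1/q - t - q = 0, i.e. q**2 + t*q - 1 = 0, taking the positive root
    return (-t + math.sqrt(t**2 + 4)) / 2

t = 0.1
q = q_star(t)
ift_slope = 1 / (-1/q**2 - 1)    # 1/(v''(q*) - C''(q*)) with v''=-1/q**2, C''=1
h = 1e-6
fd_slope = (q_star(t + h) - q_star(t - h)) / (2*h)  # finite difference
print(ift_slope, fd_slope)  # both approximately -0.475: negative, as predicted
```

The agreement of the two numbers is the content of the implicit function theorem: the derivative of the implicitly defined q*(t) can be read off the equilibrium condition without ever solving it.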
The general idea of the above two exercises is called the Implicit Function Theorem.
5.1 The Implicit Function Theorem
Suppose we have an equation

f(x, c) = 0

Then an implicit solution is a function x(c), so that

f(x(c), c) = 0

We say the variables c are exogenous variables, and that the x are endogenous variables.

Theorem 5.1.1 (Implicit Function Theorem) Suppose f(x_0, c_0) = 0 and ∂f(x_0, c_0)/∂x ≠ 0. Then there exists a continuous implicit solution x(c), with derivative

∂x(c)/∂c = - f_c(x(c), c) / f_x(x(c), c)

for c close to c_0.
Proof If we differentiate f(x(c), c) with respect to c, we get

(∂f(x(c), c)/∂x) (∂x(c)/∂c) + ∂f(x(c), c)/∂c = 0

Re-arranging the equation yields

∂x(c)/∂c = - (∂f(x(c), c)/∂c) / (∂f(x(c), c)/∂x)

which exists for all c only if ∂f(x(c), c)/∂x ≠ 0 for all c. Since we have computed ∂x(c)/∂c, it follows that x(c) is continuous, since all differentiable functions are continuous.
The key step in proving the implicit function theorem is the equation

(∂f(x(c), c)/∂x) (∂x(c)/∂c) + ∂f(x(c), c)/∂c = 0

The first term is the equilibrium effect or indirect effect: the system itself adjusts the endogenous variables to restore equilibrium in the equation f(x(c), c) = 0 (for us, this will occur because a change in c causes agents to re-optimize, cancelling out part of the change). The second term is the direct effect: by changing the parameters, the nature of the system has changed slightly, so that equilibrium will be achieved in a slightly different manner.
Example Consider a monopolist's profit-maximization problem:

max_q  p(q)q - (c/2)q²

The first-order necessary condition is

p'(q)q + p(q) - cq = 0

This equation fits our f(x(c), c) = 0, with q(c) and c as the endogenous and exogenous parameters. Totally differentiating as in the proof yields

p''(q)q (dq/dc) + 2p'(q) (dq/dc) - c (dq/dc) - q(c) = 0

and re-arranging a bit yields

[ p''(q)q + 2p'(q) - c ] (dq/dc) - q(c) = 0

The second term is the direct effect: a small increase in c reduces the monopolist's marginal profit by q(c). The first term is the equilibrium effect: the monopolist adjusts q(c) to maintain the FONC, and a small change in q changes the FONC by exactly the term in brackets. But it is unclear how to sign this in general, since the term in brackets may not immediately mean anything to you (is it positive? negative?).
5.2 Maximization and Comparative Statics
Suppose we have an optimization problem

max_x  f(x, c)

where c is some parameter the decision-maker takes as given, like the temperature or a price that can't be influenced by gaming the market. Let x*(c) be a maximizer of f(x, c). Then the FONCs imply

∂f(x*(c), c)/∂x = f_x(x*(c), c) = 0

and

∂²f(x*(c), c)/∂x² = f_xx(x*(c), c) ≤ 0

Note that we know that f_x(x*(c), c) = 0 and f_xx(x*(c), c) ≤ 0 since we are assuming that x*(c) is a maximizer.
The FONC looks exactly like the kinds of equations studied in the proof of the implicit function theorem, f_x(x, c) = 0. The only difference is that it is generated by a maximization problem, not an abstractly given equation. If we differentiate the FONC with respect to c, we get

f_xx(x*(c), c) (∂x*(c)/∂c) + f_cx(x*(c), c) = 0

and solving for ∂x*/∂c yields

∂x*(c)/∂c = - f_cx(x*(c), c) / f_xx(x*(c), c)

This is called the comparative static of x* with respect to c: we are measuring how x*(c) (the agent's behavior) responds to a change in c (some exogenous parameter outside their control).
When the SOSC holds strictly, f_xx(x*(c), c) < 0, so we know that

sign( ∂x*(c)/∂c ) = sign( f_cx(x*(c), c) )

So that x*(c) is increasing in c if f_cx(x*(c), c) ≥ 0.

Theorem 5.2.1 Suppose x*(c) is a local maximum of f(x, c). Then

sign( ∂x*(c)/∂c ) = sign( f_cx(x*(c), c) )
Example Recall the monopolist, facing the problem

max_q  p(q)q - (c/2)q²

His FONCs are

p'(q*)q* + p(q*) - cq* = 0

and his SOSCs are

p''(q*)q* + 2p'(q*) - c < 0

We apply the IFT to the FONCs: treat q* as an implicit function of c, and totally differentiate to get

p''(q*)q* (∂q*/∂c) + 2p'(q*) (∂q*/∂c) - c (∂q*/∂c) - q* = 0

Re-arranging, we get

[ p''(q*)q* + 2p'(q*) - c ] (∂q*/∂c) - q* = 0

where the term in brackets is the SOSC, and so is negative. Thus

∂q*/∂c = q* / ( p''(q*)q* + 2p'(q*) - c ) < 0

So we use the information from the SOSCs to sign the denominator, giving us an unambiguous comparative static.
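As a sanity check, consider the hypothetical linear specification p(q) = a - bq, for which the FONC a - (2b + c)q = 0 gives q*(c) = a/(2b + c) in closed form, so ∂q*/∂c = -a/(2b + c)² < 0. A short script compares the IFT expression (with p'' = 0 and p' = -b) against a finite difference.

```python
a, b = 10.0, 1.0          # inverse demand p(q) = a - b*q (illustrative numbers)

def q_star(c):
    return a / (2*b + c)  # closed-form solution of the FONC a - (2b + c)q = 0

c = 0.5
q = q_star(c)
# IFT: dq*/dc = q* / (p''(q)q + 2p'(q) - c); here p'' = 0 and p' = -b
ift_slope = q / (0*q + 2*(-b) - c)
h = 1e-6
fd_slope = (q_star(c + h) - q_star(c - h)) / (2*h)
print(ift_slope, fd_slope)  # both approximately -1.6, i.e. -a/(2b + c)**2
```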
Example Suppose an agent consumes in period 1 and period 2, with utility function

u(c_1, c_2) = log(c_1) + log(c_2)

where c_1 + s = y_1 and Rs + y_2 = c_2, where R > 1 is the interest rate plus 1. Substituting the constraints into the objective yields the problem

max_s  log(y_1 - s) + log(Rs + y_2)

Maximizing over s yields the FONC

- 1/(y_1 - s*) + R/(Rs* + y_2) = 0

and the SOSC is automatically satisfied, since for any s (not just a critical point) we have

- 1/(y_1 - s)² - R²/(Rs + y_2)² < 0

How does s* vary with R? Well, I don't want to write it all out again. Let the FONC be the function defined as

f(s*(R), R) = 0

Then we know that

∂s*/∂R = - f_R(s*(R), R) / f_s(s*(R), R)

From the SOSCs, f_s(s*(R), R) is negative, so -1/f_s is positive, and since

f_R(s*(R), R) = 1/(Rs* + y_2) - Rs*/(Rs* + y_2)²

we only need to sign f_R(s*(R), R) to get the sign of the expression. It is positive if

(Rs* + y_2) - Rs* = y_2 > 0

which is true. So s* is increasing in R (and look, we didn't even solve for s*(R)).
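Because this example has a closed form, the conclusion can be verified directly: solving the FONC gives s*(R) = y_1/2 - y_2/(2R) (a standard log-utility result), which is increasing in R. A quick check (the numbers y_1 = 10, y_2 = 4 are arbitrary):

```python
def s_star(R, y1=10.0, y2=4.0):
    # Closed-form solution of -1/(y1 - s) + R/(R*s + y2) = 0
    return y1/2 - y2/(2*R)

def fonc(s, R, y1=10.0, y2=4.0):
    return -1/(y1 - s) + R/(R*s + y2)

R = 1.5
assert abs(fonc(s_star(R), R)) < 1e-12   # s*(R) really solves the FONC
print(s_star(1.5) < s_star(2.0))         # s* increasing in R: prints True
```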
Example Consider a firm with maximization problem

π(q) = pq - C(q, t)

where t represents the firm's technology. In particular, increasing t reduces the firm's marginal costs for all q, or

∂/∂t ( ∂C(q, t)/∂q ) = C_qt(q, t) < 0

The firm's FONC is

p - C_q(q*, t) = 0

and the SOSC is

- C_qq(q*, t) < 0

We can study how the optimal quantity q* varies with either p or t. Let's start with p. Applying the implicit function theorem, we get

∂q*(p)/∂p = 1 / C_qq(q*(p), t) > 0

so the supply curve is upward sloping. If t increases instead, we get

∂q*(t)/∂t = - C_qt(q*, t) / C_qq(q*(p), t) > 0

so that if technology improves, the firm supplies more at every price: the supply curve shifts out.
We might ask, what is the cross-partial with respect to both t and p? There's nothing stopping us from pushing this approach further. Differentiating the FONC with respect to p, we get

1 - C_qq(q*(t, p), t) (∂q*(t, p)/∂p) = 0

and again with respect to t we get

- C_qqq(q*(t, p), t) (∂q*(t, p)/∂t)(∂q*(t, p)/∂p) - C_tqq(q*(t, p), t) (∂q*(t, p)/∂p) - C_qq(q*(t, p), t) (∂²q*(t, p)/∂t∂p) = 0

But now we're in trouble. What's the sign of C_qqq(q*(t, p), t)? Can we make any sense of this? This is the kind of game you often end up playing as a theorist. We have a complicated, apparently ambiguous comparative static that we would like to make sense of. The goal now is to figure out what kinds of worlds have unambiguous answers. Let's start by re-arranging to get the comparative static of interest alone and seeing what we can sign with our existing assumptions:

∂²q*(t, p)/∂t∂p = - [ C_qqq(q*(t, p), t) (∂q*/∂t)(∂q*/∂p) + C_tqq(q*(t, p), t) (∂q*/∂p) ] / C_qq(q*(t, p), t)

where (∂q*/∂t)(∂q*/∂p) > 0, ∂q*/∂p > 0, and C_qq > 0. So if C_qqq(q, t) and C_tqq(q, t) have the same sign, the cross-partial of q* with respect to p and t will be unambiguous. Otherwise, you need to make more assumptions about how variables relate, adopt specific functional forms, or rely on the empirical literature to argue that some quantities are more important than others.
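One way to build intuition is to pick a hypothetical cost function where everything is computable. With C(q, t) = q²/(2t) (my choice for illustration), marginal cost is q/t, so C_qq = 1/t, C_qt = -q/t², C_qqq = 0, and C_tqq = -1/t², and the FONC p - q/t = 0 gives q*(t, p) = pt, whose cross-partial ∂²q*/∂t∂p = 1. The rearranged formula reproduces this:

```python
t, p = 2.0, 3.0
q = p * t                      # q*(t,p) solves p - q/t = 0 for C(q,t) = q**2/(2t)

C_qq  = 1/t
C_qt  = -q/t**2
C_qqq = 0.0
C_tqq = -1/t**2

dq_dp = 1 / C_qq               # short-run comparative statics from the text
dq_dt = -C_qt / C_qq
cross = -(C_qqq*dq_dt*dq_dp + C_tqq*dq_dp) / C_qq
print(cross)  # prints 1.0, the cross-partial of q*(t,p) = p*t
```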
Example As an application of the implicit function theorem, consider the equation

f⁻¹(f(x)) = x

If we differentiate this with respect to x, we get

(df⁻¹(y)/dy) f'(x) = 1

or

df⁻¹(y)/dy = 1 / f'(x)

Theorem 5.2.2 (Inverse Function Theorem) If f'(x) > 0 for all x, then (f⁻¹)'(y) > 0 for all y, and if f'(x) < 0 for all x, then (f⁻¹)'(y) < 0 for all y. If f(x) = y and f⁻¹(y) = x, then

df⁻¹(y)/dy = 1 / f'(x)

This allows us to sign the derivative of the inverse of a function just from information about the derivative of the function itself. This is actually pretty useful. This exact theorem is a key step in deriving the Nash equilibrium of many pay-as-bid auctions, for example.
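A quick sanity check with an assumed f(x) = e^x, whose inverse is log(y): the theorem predicts (f⁻¹)'(y) = 1/f'(x) = 1/e^x = 1/y, the familiar derivative of log.

```python
import math

x = 1.7
y = math.exp(x)                # f(x) = e**x, so f'(x) = e**x and f inverse = log
h = 1e-6
inv_slope = (math.log(y + h) - math.log(y - h)) / (2*h)  # numerical (f^-1)'(y)
print(inv_slope, 1/math.exp(x))  # both equal 1/y
```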
Example From the implicit function theorem, we have

∂x(c)/∂c = - f_xc(x(c), c) / f_xx(x(c), c)

Multiplying by c and dividing by x(c) gives

(∂x(c)/∂c) (c/x(c)) = - ( f_xc(x(c), c) / f_xx(x(c), c) ) (c/x(c))

which is often written as

∂log(x(c))/∂log(c) = %Δx(c)/%Δc = - ( f_xc(x(c), c) / f_xx(x(c), c) ) (c/x(c))

This quantity

(∂x(c)/∂c) (c/x(c)) = ∂log(x(c))/∂log(c)

is called an elasticity. Why are these useful?
Suppose we are comparing the effect of a tax on the gallons of water and bushels of apples traded. We might naively compute the derivatives, but what do the numbers even mean? One good is denominated in gallons, and the other in bushels. We might convert gallons to bushels by comparing the weight of water and apples and coming up with a gallons-to-bushels conversion scale, like fahrenheit to celsius. But again, this is somewhat irrelevant. People consume a large amount of water every day, while they probably only eat a few dozen apples a year (if that). So we might gather data on usage and improve our conversion scale to have economic significance, rather than just physical significance. This approach, though, seems misguided.
The derivative of the quantity q(t) of water traded with respect to t is

lim_{t' -> t}  ( q(t') - q(t) ) / ( t' - t )

so the numerator is the change in the quantity, measured in gallons, while the denominator is the change in the tax, measured in dollars. If we multiply by t/q(t), we get

[ ( q(t') - q(t) ) / ( t' - t ) ] (gallons/dollars)  ×  [ t / q(t) ] (dollars/gallons)  =  [ ( q(t') - q(t) ) / ( t' - t ) ] [ t / q(t) ]

so the units cancel, leaving a dimensionless quantity, the elasticity. This can be freely compared across goods since we don't have to keep track of currency, quantity denominations, economic significance of units, and so on.
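The point about units can be demonstrated directly: rescaling the quantity (say, gallons to liters) changes the derivative but not the elasticity. A sketch with an illustrative constant-elasticity demand curve q(p) = 8/p:

```python
def q(p):
    return 8.0 / p   # illustrative demand curve, elasticity -1 at every price

p, h = 2.0, 1e-6
deriv = (q(p + h) - q(p - h)) / (2*h)
elast = deriv * p / q(p)

LITERS_PER_GALLON = 3.785          # change units: measure q in liters instead
deriv_l = LITERS_PER_GALLON * deriv
elast_l = deriv_l * p / (LITERS_PER_GALLON * q(p))

print(deriv, deriv_l)   # derivatives differ with the units chosen
print(elast, elast_l)   # elasticity is -1 either way
```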
Exercises
1. A monopolist faces demand curve p(q, a), where a is advertising expenditure, and has costs C(q). Solve for the monopolist's optimal quantity, q*(a), and explain how the optimal quantity varies with advertising if p_a(q, a) ≥ 0.
2. Suppose there is a perfectly competitive market, where consumers maximize utility v(q, w) + m subject to budget constraints w = (p + t)q + m, where t is a tax on consumption of q, and firms have costs C(q) = cq. (i) Characterize a perfectly competitive equilibrium in the market, and show how tax revenue tq*(t, w) varies with t and w. (ii) Suppose w changes (there is a shock to consumer preferences) and the government wants to adjust t to hold tax revenue constant. Use the IFT to show how to achieve this.
3. An agent is trying to decide how much to invest in a safe asset that yields return 1 and a risky asset that returns H with probability π and L with probability 1 - π, where H > 1 > L. His budget constraint is w = pa + S, where p is the price of the risky asset, a is the number of shares of the risky asset he purchases, and S is the number of shares of the safe asset he purchases. His expected utility (objective function) is

max_{a,S}  π u(S + aH) + (1 - π) u(S + aL)

(i) Provide FONCs and SOSCs for the agent's investment problem; when are they satisfied? (ii) How does the optimal a* vary with p and H? (iii) Can you figure out how the optimal a* varies with wealth?
4. Re-write Theorem 5.2.1 to apply to minimization problems, rather than maximization problems, and provide a proof. Consider the consumer's expenditure minimization problem, e(p, u) = min_{q,m} pq + m subject to v(q) + m = u, where a consumer finds the cheapest bundle that provides u of utility. How does q*(p, u) vary in p and u?
5. Consider a representative consumer with utility function u(q_1, q_2, m) = v(q_1, q_2) + m and budget constraint w = p_1 q_1 + p_2 q_2 + m. The two goods are produced by firm 1, whose costs are C_1(q_1) = c_1 q_1, and firm 2, whose costs are C_2(q_2) = (c_2/2) q_2². (i) Solve for the consumer's and firms' demand and supply curves. How does demand for good 1 change with respect to a change in the price of good 2? How does supply of good 1 change with respect to a change in the price of good 2? (ii) Solve for the system of equations that characterize equilibrium in the market. How does the equilibrium quantity of good 1 traded change with respect to a change in firm 2's marginal cost?
6. Do YOU want to be a New York Times columnist? Let's learn IS-LM in one exercise!
We start with a system of four equations: (i) The NIPA accounts equation,

Y = C + I + G

so that total national income, Y, equals total expenditure, consumption C plus investment I plus government expenditure G. (ii) The consumption function

C = f(Y - T)

giving consumption as a function of income less taxes. Assume f'(·) ≥ 0, so that more income-less-taxes translates to more spending on consumption. (iii) The investment function

I = i(r)

where r is the interest rate. Assume i'(r) ≤ 0, so that a higher interest rate leads to less investment. (iv) The money market equilibrium equation

M_s = h(Y, r)

so that supply of money must equal demand for money, which depends on national income and the interest rate. Assume h_Y ≥ 0, so that more income implies a higher demand for currency for trade, and h_r ≤ 0, so that a higher interest rate moves resources away from consumption into investment, thereby reducing demand for currency.
This can be simplified to two equations,

Y - f(Y - T) - i(r) = G
h(Y, r) = M_s

The endogenous variables are Y and r, and the exogenous variables are T, M_s, and G.
(i) How do increases in G affect Y and r? What does this suggest about fiscal interventions in the economy? (ii) How does an increase in M_s affect Y and r? What does this suggest about monetary interventions in the economy? (iii) Suppose there is a balanced-budget amendment so that G = T. How does an increase in G affect Y and r? Explain the effect of the amendment on the government's ability to affect the economy.
7. Consider a partial equilibrium model with a tax t where α of the legal incidence falls on the consumer and 1 - α falls on the firm. The consumer has objective max_q v(q) + w - (p + αt)q and the firm has objective max_q (p - (1 - α)t)q - C(q). (i) Compute the elasticity of the supply and demand curves with respect to p, α, and t. (ii) Compute the change in the total taxes paid by the consumer and producer in equilibrium with respect to t and α. (iii) Show that for a small increase in t starting from t = 0, the increase in the consumer's tax burden is larger than the producer's tax burden when the consumer's demand curve is more inelastic than the producer's supply curve.
Chapter 6
The Envelope Theorem
Recall our friend the monopolist:

max_q  p(q)q - (c/2)q²

His FONCs are

p'(q*)q* + p(q*) - cq* = 0

and SOSCs are

p''(q*)q* + 2p'(q*) - c < 0

This defines an implicit solution q*(c) in terms of the marginal cost parameter, c. But then we can define the value function or indirect profit function as

π(c) = p(q*(c))q*(c) - (c/2)(q*(c))²

giving the monopolist's maximized payoff for any value of c. If we differentiate with respect to c, we get

π'(c) = p'(q*(c)) (∂q*(c)/∂c) q*(c) + p(q*(c)) (∂q*(c)/∂c) - (1/2) q*(c)² - c q*(c) (∂q*(c)/∂c)

To sign this, it appears at first that we have to use the Implicit Function Theorem to derive ∂q*(c)/∂c, and then maybe substitute it in or just sign as much as we can. But wait! If we re-arrange it to put all the ∂q*(c)/∂c terms together, we get

π'(c) = [ p'(q*(c))q*(c) + p(q*(c)) - c q*(c) ] (∂q*(c)/∂c) - (1/2) q*(c)²

where the term in brackets is exactly the FONC. Since the FONCs are zero at q*(c), we get

π'(c) = - (1/2) q*(c)² < 0

without using the IFT at all. This is the basic idea of Envelope Theorems. The reason it is called an envelope is that the value function traces out a curve which is the maximum along all the global maxima of the objective function:
[Figure: The Envelope Theorem]
When you differentiate the value function, you are studying how the peaks shift.
6.1 The Envelope Theorem
Suppose an agent faces the maximization problem

max_x  f(x, c)

where c is some parameter. The FONC is

f_x(x*(c), c) = 0

Now, consider the value function or indirect payoff function

V(c) = f(x*(c), c)

This is the agent's optimized payoff, given the parameter c. We might want to know how V(c) varies with c, or V'(c). That derivative equals

V'(c) = f_x(x*(c), c) (∂x*(c)/∂c) + f_c(x*(c), c)

where the first factor, f_x(x*(c), c), is the FONC. At first glance, it looks like we'll need to determine ∂x*(c)/∂c using the implicit function theorem, substitute it in, and then try to sign the expression. But since f_x(x*(c), c) is zero at the optimum, the expression reduces to

V'(c) = f_c(x*(c), c)

This means that the derivative of the value function with respect to a parameter is the partial derivative of the objective function evaluated at the optimal solution.

Theorem 6.1.1 (Envelope Theorem) For the maximization problem

max_x  f(x, c)

the derivative of the value function is

V'(c) = [ ∂f(x, c)/∂c ]_{x = x*(c)}

Again, in words, the derivative of the value function is the partial derivative of f(x, c) with respect to c, evaluated at the optimal choice, x*(c).
Example Consider

π(q, c) = pq - (c/2)q²

The FONC is

p - c q*(p, c) = 0

Substituting this back into the objective yields

V(p, c) = p²/(2c)

and

V_p(p, c) = p/c > 0
V_c(p, c) = - p²/(2c²) < 0

If we use the envelope theorem, however,

V_p(p, c) = π_p(q*(p, c), p, c) = q*(p, c) = p/c > 0
V_c(p, c) = π_c(q*(p, c), p, c) = - (1/2) q*(p, c)² = - p²/(2c²) < 0

These have the correct signs, and illustrate the usefulness of the envelope theorem.
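Both computations are easy to verify numerically under this quadratic specification: the sketch below recovers V(p, c) by brute-force grid search and compares a finite-difference V_c with the envelope prediction -q*²/2 (the grid bounds and step are arbitrary choices).

```python
def profit(q, p, c):
    return p*q - (c/2)*q**2

def V(p, c):
    # Brute-force value function: maximize profit over a fine grid of q
    return max(profit(i/10000, p, c) for i in range(1, 100000))

p, c, h = 2.0, 1.0, 0.01
fd = (V(p, c + h) - V(p, c - h)) / (2*h)     # numerical V_c
q_star = p / c                                # from the FONC p - c*q = 0
envelope = -0.5 * q_star**2                   # envelope theorem prediction
print(fd, envelope)  # both close to -2
```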
Example Suppose a consumer has utility function u(q, m) and budget constraint w = pq + m. Then his value function will be

V(p, w) = u(q*(p, w), w - p q*(p, w))

Without even computing FONCs or SOSCs, then, we know that

V_p(p, w) = - u_m(q*(p, w), w - p q*(p, w)) q*(p, w)

and

V_w(p, w) = u_m(q*(p, w), w - p q*(p, w))

Do you see that without grinding through the FONCs?
So we can figure out how changes in the environment affect the welfare of an agent without necessarily solving for the agent's optimal behavior explicitly. This can be very useful, not just in determining welfare changes, but also in simplifying the analysis of models. The next two examples illustrate how this works.
Example As a simple way of incorporating dynamic concerns into models, we might consider models of firms in which some variables are fixed (capital, technology, capacity, etc.) and maximize over a flexible control variable like price or quantity to generate a short-run profit function. Then, the fixed variable becomes flexible in the long-run, giving us the long-run profit function.
Suppose a price-taking firm's cost function depends on the quantity it produces, q, as well as its technology, t. Its short-run profit function is

π_S(p, t) = max_q  pq - C(q, t)

Let's stop now to think about what better technology means. Presumably, better technology should, at the least, mean that C_t(q, t) < 0, so that total cost is decreasing in t. Similarly, C_qt(q, t) < 0 would also make sense: marginal cost C_q(q, t) is decreasing in t. However, is this always the case? Some technologies have scale effects, where for q < q̄, C_qt(q, t) > 0, but for q > q̄, C_qt(q, t) < 0, so that the cost-reducing benefits are sensitive to scale. For example, having a high-tech factory where machines do all the labor will certainly be more efficient than an assembly line with workers, provided that enough cars are being made. Let's see what the FONCs and SOSCs we derive say about the relationships between these partials and cross-partials.
In the short-run we can treat t as fixed and study how the firm maximizes profits: it has an FONC

p - C_q(q, t) = 0

and SOSC

- C_qq(q, t) < 0

So as long as C(q, t) is convex in q, there will be a unique profit-maximizing q*(p, t). Using the implicit function theorem, we can see how the optimal q*(p, t) varies in t:

- C_qq(q*(p, t), t) (∂q*(p, t)/∂t) - C_tq(q*(p, t), t) = 0

or

∂q*(p, t)/∂t = - C_tq(q*(p, t), t) / C_qq(q*(p, t), t)

If better technology lowers marginal cost, C_tq(q, t) < 0, and this q*(p, t) is increasing in t. Otherwise, q*(p, t) is decreasing in t. Note that there is no contradiction between C_t(q, t) < 0 and C_qt(q, t) > 0.
But now suppose we take a step back to the long-run. How should the firm invest in technology? Consider the long-run profit function

π_L(p, t) = π_S(q*(p, t), p, t) - kt

where the kt term is the cost of adopting level t technology. Then the firm's FONC is

d/dt [ π_S(q*(p, t), p, t) ] - k = 0

but by the envelope theorem, the total derivative of the short-run profit function with respect to t is just its partial derivative, -C_t, so

- C_t(q*(p, t*(p)), t*(p)) - k = 0

and the SOSC is

- C_qt(q*(p, t*(p)), t*(p)) (∂q*(p, t)/∂t) - C_tt(q*(p, t*(p)), t*(p)) < 0

Since C_qt < 0 implies ∂q*(p, t)/∂t > 0, the first term is positive, so for the SOSC to be satisfied it must be the case that C_tt(q*(p, t), t) > 0. This would mean that the marginal cost reduction in t is diminishing in t: better technology always reduces the total cost function, C_t(q, t) < 0, but the reduction shrinks as t grows, C_tt(q, t) > 0.
How does the profit-maximizing t*(p) depend on price? Using the implicit function theorem on the long-run FONC,

- C_qt(q*(p, t*(p)), t*(p)) [ ∂q*(p, t*(p))/∂p + (∂q*(p, t*(p))/∂t) (∂t*(p)/∂p) ] - C_tt(q*(p, t*(p)), t*(p)) (∂t*(p)/∂p) = 0

or

∂t*(p)/∂p = - C_qt (∂q*/∂p) / [ C_qt (∂q*/∂t) + C_tt ]

The numerator is positive (C_qt < 0 and ∂q*/∂p > 0), and the denominator is positive if, substituting the comparative static ∂q*/∂t = -C_tq/C_qq from the short-run problem,

- C_qt C_tq / C_qq + C_tt > 0

or

C_qq C_tt - C_qt C_tq > 0

so that ∂t*(p)/∂p > 0. It will turn out that the above inequality is exactly the condition under which C(q, t) is a convex function when considered as a two-dimensional object.
This kind of short-run/long-run model is very useful in providing simple but rigorous models of how firms behave across time. Of course, there are no dynamics here, but it captures the idea of how in the short run the firm can vary output but not technology, but in the long run things like technology, capital, and other investment goods become choice variables.
Example This is an advanced example of how outrageously powerful the envelope theorem can be. In particular, we'll use it to derive the (Bayesian Nash) equilibrium strategies of a first-price auction.
At a first-price auction, there are i = 1, 2, ..., N buyers competing for a good. Buyer i knows his own value, v_i > 0, but no one else's. The buyers simultaneously submit a bid b_i. The highest bidder wins, and gets a payoff v_i - b_i, while the losers get nothing. Let p(b_i) be the probability that i wins given a bid of b_i.
Presumably, all the buyers' bids should be increasing in their values. For example, if buyer 1's value is 5 and buyer 2's value is 3, buyer 1 should bid a higher amount. Said another way, the bid function b_i = b(v_i) is increasing. Then buyer i's expected payoff is

U(v_i) = max_{b_i}  p(b_i)(v_i - b_i)

with FONC

p'(b_i)(v_i - b_i) - p(b_i) = 0

The envelope theorem implies

U'(v_i) = p(b(v_i))

And integrating with respect to v_i yields

U(v_i) = ∫_0^{v_i} p(b(x)) dx

Now, equating the two expressions for U(v_i) gives

p(b(v_i))(v_i - b(v_i)) = U(v_i) = ∫_0^{v_i} p(b(x)) dx

and re-arranging yields

b(v) = v - [ ∫_0^v p(b(x)) dx ] / p(b(v))

If we knew the probability that an agent with value v making a bid b(v) won, we could solve the above expression. But consider the rules of the auction: the highest bidder wins. If b(v) is increasing, then the probability of i being the highest bidder with bid b(v_i) is

pr[ b(v_i) ≥ b(v_1), ..., b(v_i) ≥ b(v_{i-1}), b(v_i) ≥ b(v_{i+1}), ..., b(v_i) ≥ b(v_N) ]
= pr[ v_i ≥ v_1, ..., v_i ≥ v_{i-1}, v_i ≥ v_{i+1}, ..., v_i ≥ v_N ]

If the bidders' values are independent and the probability distribution of each bidder's value v is F(v), then the probability of having the highest value given v_i is

pr[ v_i ≥ v_1, ..., v_i ≥ v_{i-1}, v_i ≥ v_{i+1}, ..., v_i ≥ v_N ]
= pr[ v_i ≥ v_1 ] ⋯ pr[ v_i ≥ v_{i-1} ] pr[ v_i ≥ v_{i+1} ] ⋯ pr[ v_i ≥ v_N ]
= F(v_i) ⋯ F(v_i) F(v_i) ⋯ F(v_i) = F(v_i)^{N-1}

So that

p(b(v_i)) = F(v_i)^{N-1}

and

b(v) = v - [ ∫_0^v F(x)^{N-1} dx ] / F(v)^{N-1}

Since this function is indeed increasing, it satisfies the FONC above. So using the envelope theorem and some basic probability, we've just derived the strategies in one of the most important strategic games that economists study. Other derivations require solving systems of differential equations or using a sub-field of game theory called mechanism design.
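As a check on the formula, suppose values are uniform on [0, 1], so F(v) = v (an assumption for illustration; the derivation holds for any F). Then ∫_0^v x^{N-1} dx = v^N/N and b(v) = v(N - 1)/N: each bidder shades her bid by the fraction 1/N. The sketch below integrates the general formula numerically and compares:

```python
def bid(v, N, steps=10000):
    # b(v) = v - (integral of F(x)**(N-1) from 0 to v) / F(v)**(N-1), F(x) = x
    dx = v / steps
    integral = sum(((i + 0.5) * dx)**(N - 1) for i in range(steps)) * dx
    return v - integral / v**(N - 1)

N, v = 4, 0.8
print(bid(v, N), v * (N - 1) / N)  # both approximately 0.6
```

Note how bid shading shrinks as N grows: with more rivals, the winner keeps less of her information rent.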
Exercises
1. Suppose that a price-taking firm can vary its choice of labor, L, in the short run, but capital, K, is fixed. Quantity q is produced using technology F(K, L). In the long run the firm can vary capital. The cost of labor is w, and the cost of capital is r, and the price of its good is p. (i) Derive the short-run profit function and show how q* varies with r and p. (ii) Derive the long-run profit function and show how K varies with r. (iii) How does a change in p affect the long-run choices of K and L, and the short run choice of L?
2. Suppose an agent maximizes

max_x  f(x, c)

yielding a solution x*(c). Suppose the parameter c is perturbed to c'. Use a second-order Taylor series expansion to characterize the loss that arises from using x*(c) instead of the new maximizer, x*(c'). Show that the loss is proportional to the square of the maximization error, x*(c') - x*(c).
3. Suppose consumers have utility function u(Q, m) = b log(Q) + m and face a budget constraint w = pQ + m. Solve for the firms' short-run profit functions in equilibrium. (i) If there are K price-taking firms, each with cost function c(q) = (c/2)q² and aggregate supply is Q = Kq, solve for the firms' short-run profit functions π_S(K) in terms of K. (ii) If there is a fixed cost F to entry and there are no other barriers to entry, characterize the long-run profit function π_L(K) and solve for the long-run number of firms K*. How does K* vary in b, c, and F? (iii) Can you generalize this analysis to a convex, increasing C(q) and concave, increasing v(q) using the IFT and envelope theorem? In particular, how do p* and q* respond to an increase in F, and how does K* respond to an increase in F?
4. Consider a partial equilibrium model with a consumer with utility function u(q, m) = v(q) + m and budget constraint w = pq + m, and a firm with cost function C(q). Suppose the firm must pay a tax t for each unit of the good it sells. Let social welfare be given by

W(t) = ( v(q*(t)) - p*(t)q*(t) + w ) + ( (p*(t) - t)q*(t) - C(q*(t)) ) + tq*(t)

Use a second-order Taylor series to approximate the loss in welfare from the tax, W(t) - W(0). Show that this welfare loss is approximately proportional to the square of the tax. Sketch a graph of the situation.
5. Consider a partial equilibrium model with a consumer with utility function u(q, m, R) = v(q) + m + g(R) and budget constraint w = pq + m, where R is government expenditure and g(R) is the benefit to the consumers from government expenditures. The firm has a cost function C(q), and pays a tax t for each unit of the good it sells. Tax revenue is used to fund government services, which yield benefit g(R) to consumers, where R = tq is total tax revenue. Define social welfare as

W(t) = ( v(q) - pq + w ) + ( (p - t)q - C(q) ) + g(tq)

where the last term represents the benefit of the services funded by tax revenue. (i) If the firm and consumer maximize their private benefit taking R as a constant, what are the market-clearing price and quantity? How do the market-clearing price and quantity vary with the tax? (ii) What is the welfare maximizing level of t?
Chapter 7
Concavity and Convexity
At various times, we've seen functions for which the SOSCs are satisfied at any point, not just the critical point. For example,

- A price-taking firm with cost function C(q) = (c/2)q² has SOSC -c < 0, independent of q
- A price-taking consumer with benefit function v(q) = b log(q) has SOSC -b/q² < 0, which is satisfied for any q
- A household with savings problem max_s log(y_1 - s) + log(Rs + y_2) has SOSC

  - 1/(y_1 - s)² - R²/(Rs + y_2)² < 0

  which is satisfied for any s

The common feature that ties all these examples together is that

f''(x) < 0

for all x, not just a critical point x*. This implies that the first-order necessary condition f'(x) is a monotone decreasing function in x, so if it has a zero, it can only have one (keep in mind though, some decreasing functions have no zeros, like e^{-x}).

[Figure: Concave functions have a unique critical point, or none]

This actually solves all our problems with identifying maxima: if a function satisfies f''(x) < 0, it has at most one critical point, and any critical point will be a global maximum. This is a special class of functions that deserves some study.
7.1 Concavity
Recall the partial equilibrium consumer with preferences u(q, m) = v(q) + m. Earlier, we claimed that v'(q) > 0 and v''(q) < 0 were good economic assumptions: the agent has positive marginal value for each additional unit, but this marginal value is decreasing. Let's see what these economic assumptions imply about Taylor series. The benefit function, v(q), satisfies

v(q) = v(q_0) + v'(q_0)(q - q_0) + v''(q_0) (q - q_0)²/2 + o(h³)

If we re-arrange this a little bit, we get

( v(q) - v(q_0) ) / ( q - q_0 ) = v'(q_0) + v''(q_0) (q - q_0)/2 + o(h²)

and since we know that v''(q_0) ≤ 0, this implies, for q_0 close to q,

( v(q) - v(q_0) ) / ( q - q_0 ) ≤ v'(q_0)

In words or pictures, the derivative at q_0 is steeper than the chord from v(q_0) to v(q).

[Figure: Concavity]
This is one way of defining the idea of a concave function. Here are the standard, equivalent definitions:

Theorem 7.1.1 Let f : D = (a, b) -> R. The following are equivalent:

- f(x) is concave on D
- For all λ in (0, 1) and x', x'' in D, f(λx' + (1 - λ)x'') ≥ λf(x') + (1 - λ)f(x'')
- f''(x) ≤ 0 for all x in D, where f(·) is twice differentiable
- For all x' and x'' in D, f(x'') - f(x') ≤ f'(x')(x'' - x')

If the weak inequalities above hold as strict inequalities, then f(x) is strictly concave.
Note that concavity is a global property (it holds on all of D) that depends on the domain of the function, D = (a, b). For example, log(x) is concave on (0, ∞), since

    f''(x) = -1/x^2 < 0

for all x in (0, ∞). The function √|x|, however, is concave on (0, ∞) and concave on (-∞, 0), but not concave on (-∞, ∞). Why? If we connect the points at √|-1| = 1 and √|1| = 1 by a chord, it lies above the function at √|0| = 0, violating the second criterion for concavity. So a function can have f''(x) ≤ 0 for some x and not be concave: It is concave only if f''(x) ≤ 0 for all x in its domain D.
Theorem 7.1.2 If f(x) is differentiable and strictly concave, then it has at most one critical point, which is the unique global maximum.

Proof If f(x) is strictly concave, the FONC is

    f'(x) = 0

Since f(x) is strictly concave, the derivative f'(x) is monotone decreasing. This implies there is at most one point at which f'(x) crosses zero, so there is at most one critical point. Since

    f(x*) - f(x) = -f''(ξ) (x - x*)^2 / 2 > 0

holds for some ξ between x and x* (because f''(x) < 0 for all x, not just at x*), any critical point x* is a local maximum. Since there is a unique critical point x* and f''(x) < 0 for all x, the candidate list consists only of x*.
There is some room for confusion about what the previous theorem buys you, because a concave function might not have any critical points. For example,

    f(x) = log(x)

has FONC

    f'(x) = 1/x

which reaches f'(x) = 0 only as x → ∞ (∞ is not a number). Since log(x) → ∞ as x → ∞, there isn't a finite maximizer. On the other hand,

    f(x) = log(x) - x

has FONC

    f'(x) = 1/x - 1

and SOSC

    f''(x) = -1/x^2 < 0

so that x* = 1 is the unique global maximizer.
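A quick numerical sanity check of the last example (a sketch using a simple grid search, not part of the original argument):

```python
import numpy as np

# f(x) = log(x) - x is strictly concave with f'(x) = 1/x - 1, so its unique
# critical point, and hence its global maximizer, is x* = 1.
x = np.linspace(0.01, 10.0, 100001)
f = np.log(x) - x
x_star = x[np.argmax(f)]

assert abs(x_star - 1.0) < 1e-3        # grid maximizer is (approximately) 1
assert abs(1.0 / x_star - 1.0) < 1e-2  # FONC 1/x - 1 = 0 nearly holds there
```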
7.2 Convexity

Similarly, our cost function C(q) satisfies

    C(q) = C(q_0) + C'(q_0)(q - q_0) + C''(ξ) (q - q_0)^2 / 2

for some ξ between q_0 and q. Re-arranging it the same way, we get the opposite conclusion,

    [C(q) - C(q_0)] / (q - q_0) ≥ C'(q_0)

so that the chord from C(q_0) to C(q) is steeper than the derivative C'(q_0). This is a convex function.
Convexity

Theorem 7.2.1 Let f : D = (a, b) → R. The following are equivalent:

- f(x) is convex on D
- For all λ ∈ (0, 1) and x', x'' in D, f(λx' + (1 - λ)x'') ≤ λf(x') + (1 - λ)f(x'')
- f''(x) ≥ 0 for all x in D, where f(·) is twice differentiable
- For all x' and x'' in D,

    f(x') - f(x'') ≥ f'(x'')(x' - x'')

If the above weak inequalities hold strictly, then f(x) is strictly convex.
Convexity is a useful property for, in particular, minimization.
Exercises

1. Rewrite Theorem 7.1.2, replacing the word "concave" with "convex" and "maximization" with "minimization". Sketch a graph like the first one in the chapter to illustrate the proof.

2. Show that if f(x) is concave, then -f(x) is convex. If f(x) is convex, then -f(x) is concave.

3. Show that if f_1(x), f_2(x), ..., f_n(x) are concave, then f(x) = Σ_{i=1}^n f_i(x) is concave. If f_1(x), f_2(x), ..., f_n(x) are convex, then f(x) = Σ_{i=1}^n f_i(x) is convex.

4. Can concave or convex functions have discontinuities? Provide an example of a concave or convex function with a discontinuity, or show why they must be continuous.
Part III

Optimization and Comparative Statics in R^N

Chapter 8

Basics of R^N
Now we need to generalize all the key results (extreme value theorem, FONCs/SOSCs, and the implicit function theorem) to situations in which many decisions are being made at once, sometimes subject to constraints. Since we need functions involving many choice variables and possibly many parameters, we need to generalize the real numbers to R^N, or N-dimensional Euclidean vector space.
An N-dimensional vector is a list of N ordered real numbers,

    x = (x_1, x_2, ..., x_N)'

written as a column, where certain rules of addition and multiplication are defined. In particular, if x and y are vectors and a is a real number, scalar multiplication is defined as

    ax = (ax_1, ax_2, ..., ax_N)'

and vector addition is defined as

    x + y = (x_1 + y_1, x_2 + y_2, ..., x_N + y_N)'

The transpose of a column vector is a row vector,

    x' = (x_1, x_2, ..., x_N)

and vice versa. A basis vector e_i is a vector with a 1 in the i-th spot and zeros everywhere else.
The set of N-dimensional vectors with real entries, R^N, is Euclidean, since we define length by the Euclidean norm. The norm is a generalization of absolute value, and is given by

    ||x|| = √(x_1^2 + x_2^2 + ... + x_N^2)

Though, we could just as easily use the max norm,

    ||x|| = max_i |x_i|

or the p-norm,

    ||x|| = (Σ_i |x_i|^p)^{1/p}

These are all just ways of summarizing how large a vector is.
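These alternative norms are easy to compare side by side (a small sketch; the particular vector is an arbitrary choice):

```python
import numpy as np

x = np.array([3.0, -4.0])

euclid = np.sqrt(np.sum(x**2))        # Euclidean norm: sqrt(9 + 16) = 5
maxnorm = np.max(np.abs(x))           # max norm: 4
p = 3
pnorm = np.sum(np.abs(x)**p)**(1/p)   # p-norm with p = 3

assert euclid == 5.0
assert maxnorm == 4.0
assert euclid == np.linalg.norm(x)            # built-in 2-norm agrees
assert maxnorm == np.linalg.norm(x, np.inf)   # built-in inf-norm agrees
```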
The Norm

For N = 1, this is just R, since ||x|| = √(x^2) = |x|. With this norm in mind, we define the distance between two vectors x and y as

    ||x - y|| = √((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_N - y_N)^2)

Instead of defining an open interval (a, b), we define the open ball of radius ε at x_0, given by

    B_ε(x_0) = {y : ||y - x_0|| < ε}

and instead of defining a closed interval [a, b], we define the closed ball of radius ε at x_0,

    B̄_ε(x_0) = {y : ||y - x_0|| ≤ ε}
The last special piece of structure on Euclidean space is vector multiplication, or the inner product or dot product:

    x'y = <x, y> = x · y = x_1 y_1 + x_2 y_2 + ... + x_N y_N

In terms of geometric intuition, the dot product is related to the angle θ between x and y:

    cos(θ) = (x · y) / (||x|| ||y||)

Note that if x · y = 0, then cos(θ) = 0, so that the vectors must be at right angles to each other, or x is orthogonal to y.
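A short check of the dot-product geometry (the particular vectors are illustrative choices):

```python
import numpy as np

x = np.array([1.0, 0.0])
y = np.array([0.0, 2.0])
z = np.array([1.0, 1.0])

# Orthogonal vectors have a zero dot product.
assert np.dot(x, y) == 0.0

# Angle between x and z from cos(theta) = x.z / (||x|| ||z||): 45 degrees.
cos_theta = np.dot(x, z) / (np.linalg.norm(x) * np.linalg.norm(z))
theta = np.arccos(cos_theta)
assert abs(theta - np.pi / 4) < 1e-12
```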
Orthogonal Vectors

So a Euclidean space has notions of direction, distance, and angle, and open and closed balls are easy to define. Not all spaces have these properties, which makes Euclidean space special.^1
8.1 Intervals → Topology

In R, we had open sets (a, b), closed sets [a, b], and bounded sets, where a and b are finite. In multiple dimensions, however, it's not immediately obvious what the generalizations of these properties should be. For example, we could define open cells as sets C = (0, 1) × (0, 1) × ... × (0, 1), which look like squares or boxes. Or we could define open balls as sets B = {y : ||y|| < 1}. But then, we're going to study things like budget sets, defined by inequalities like {(x, y) : x ≥ 0, y ≥ 0, p_x x + p_y y ≤ w}, which are neither balls nor cells.
To avoid these difficulties, we'll use some ideas from topology. Topology is the study of properties of sets, and what kinds of functions preserve those properties. For example, the function f(x) = x maps the set [-1, 1] into the set [-1, 1], but the function

    g(x) = x if |x| ≠ 1,   g(x) = 0 if |x| = 1

maps [-1, 1] into (-1, 1). So f(x) maps a closed set to a closed set, while g(x) maps a closed set to an open set. So the question, "What kinds of functions map closed sets to closed sets?" is a topological question.

Why do YOU care about topology? What we want to do is study the image of a function, f(D), and see if it achieves its maximum. The function f(x) = x above achieves a maximum at x = 1. The second function, g(x), does not achieve a maximum, since sup g([-1, 1]) = 1 but 1 is not in g([-1, 1]). Our one-dimensional Weierstrass theorem tells us that we shouldn't expect g(x) to achieve a maximum, since it is not a continuous function. But how do we generalize this to Euclidean space and to more general sets than simple intervals? (Because now we might be maximizing over spheres, triangles, tetrahedrons, simplexes, and all kinds of non-pathological sets that are more complicated than [a, b].)
^1 For example, the space of continuous functions over a closed interval [a, b] is also a vector space, called C([a, b]), with scalar multiplication af = (af(x)), vector addition f + g = (f(x) + g(x)), norm ||f|| = max_{x ∈ [a,b]} |f(x)|, and distance d(f, g) = max_{x ∈ [a,b]} |f(x) - g(x)|. You use this space all the time when you do dynamic programming in macroeconomics, since you are looking for an unknown function that satisfies certain properties. This space has very different properties: for example, the closed unit ball {g : d(f, g) ≤ 1} is not compact.
Definition 8.1.1

- A set is open if it is a union of open balls
- x is a point of closure of a set S if for every ε > 0, there is a y in S such that ||x - y|| < ε
- The set of all points of closure of a set S is denoted S̄, and S̄ is the closure of S
- A set S is closed if S̄ = S
- A set S is closed if its complement, S^c, is open
- A set S is bounded if it is a subset of an open ball B_ε(x_0), where ε is finite. Otherwise, the set is unbounded
Open Balls, Closed Balls, and Points of Closure

Note that x can be a point of closure of a set S without being a member of S. For example, the open interval (-1, 1) is an open set, since it is an open ball, B_1(0) = {y : |y - 0| < 1} = (-1, 1). Each point in B_1(0) is a point of closure of (-1, 1), since the balls around it of any radius ε > 0 contain an element of (-1, 1). However, the points -1 and 1 are also points of closure of (-1, 1), since for all ε > 0, the balls around -1 and 1 contain some points of (-1, 1). So the closure of (-1, 1) is [-1, 1]. As you might have expected, this is a closed set.

This works just as well in Euclidean space. The open ball at zero, B_ε(0) = {y : ||y - 0|| < ε}, is an open ball, so it is a union of open balls, so it is open. All elements of B_ε(0) are points of closure of B_ε(0), since the balls around them of any radius δ > 0 contain a point of B_ε(0). However, the points satisfying {y : ||y - 0|| = ε} are also points of closure, since part of the open ball around any of these points must intersect with B_ε(0). Consequently, the set of all points of closure of B_ε(0) is the set {y : ||y - 0|| ≤ ε}, which is the closed ball.
There are many other properties that are considered topological: connected, dense, convex,
separable, and so on. But for our purposes of developing maximization theory, there is a special
topological property that will buy us what we need to generalize the extreme value theorem:
Definition 8.1.2 A subset K of Euclidean space is compact if each sequence in K has a subsequence that converges to a point in K.

The Bolzano-Weierstrass theorem (which is a statement about the properties of sequences) looks similar, but this definition concerns the properties of sets. A set is called compact if any sequence constructed from it has the Bolzano-Weierstrass property. For example, the set (a, b) is not compact, because x_n = b - 1/n is a sequence entirely in (a, b), but all of its subsequences converge to b, which is not in (a, b). Characterizing compact sets in a space is the typical starting point for studying optimization in that space.^2 As it happens, there is an easy characterization for R^N:
Theorem 8.1.3 (Heine-Borel) In Euclidean space, a set is compact iff it is closed and bounded.

Basically, bounded sets that include all their points of closure are compact in R^N. Non-compact sets are either unbounded, like {(x, y) : (x, y) ≥ (0, 0)}, or open, like {(x, y) : x^2 + y^2 < c}.
Example Consider the set in R^N described by x_i ≥ 0 for all i = 1, ..., N, and

    Σ_{i=1}^N p_i x_i = p · x ≤ w

This is called a budget set, B_{p,w}. In two dimensions, this looks like a triangle with vertices at (0, 0), (w/p_1, 0), and (0, w/p_2). This set is compact.

We'll use the Heine-Borel theorem, and show the set is closed and bounded.
Budget Sets

It is bounded, since if we take ε = max_i w/p_i + 1, the set is contained in B_ε(0) (right?). It is closed, since if we take any sequence x_n satisfying x_n ≥ 0 for all n and p · x_n ≤ w for all n, the limit of the sequence must satisfy these inequalities as well (a sequence that is weakly positive in every term can't become negative in the limit). If you don't like that argument that it is closed, we can prove B_{p,w} is closed by showing that the complement of B_{p,w} is open: Let y be a point outside B_{p,w}. Then the ray z = λ · 0 + (1 - λ)y, where λ ∈ [0, 1], starts in the budget set at λ = 1, but eventually exits it to reach y at λ = 0. Take λ* to be the value at which the ray exits the set. Then we can draw an open ball around y with a radius r small enough (say, half the distance from y to the exit point) that B_r(y) contains no points of B_{p,w}. That means that for any y outside the budget set, we can draw an open ball around it that contains no point in B_{p,w}. So B^c_{p,w} is a union of open balls, which means B^c_{p,w} is open and B_{p,w} is closed.

^2 See the footnote on page 1: Closed, bounded sets in C([a, b]) are not compact.
So competitive budget sets are compact. In fact, we've shown that any set characterized by x ≥ 0 and a · x ≤ c is compact, which is actually a sizeable number of sets of interest.
Example The N-dimensional unit simplex Δ^N is constructed by taking the N basis vectors (e_1, e_2, ..., e_N) and considering all convex combinations

    x_α = α_1 e_1 + α_2 e_2 + ... + α_N e_N

such that Σ_{i=1}^N α_i = 1 and 0 ≤ α_i ≤ 1.

Then Δ^N is bounded, since the N-dimensional open ball B_2(0) of radius 2 includes the simplex. And Δ^N is closed, since taking any convergent sequence of weights α_n, the limit lim_{n→∞} α_n · (e_1, ..., e_N) = α · (e_1, ..., e_N) still satisfies the constraints, so it is in the simplex. So the simplex is its own closure, Δ̄^N = Δ^N, and it is closed.

The simplex comes up in probability theory all the time: It is the set of all probability distributions over N outcomes.
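Since simplex points are probability distributions, they are easy to generate and check numerically (a sketch; the normalization trick below is one standard way to land on the simplex):

```python
import numpy as np

rng = np.random.default_rng(0)

# Normalizing positive weights gives a point of the unit simplex:
# entries lie in [0, 1] and sum to 1, i.e., a probability distribution.
w = rng.random(5) + 1e-12   # strictly positive weights
alpha = w / w.sum()

assert np.all(alpha >= 0) and np.all(alpha <= 1)
assert abs(alpha.sum() - 1.0) < 1e-12

# Boundedness: every simplex point has Euclidean norm at most 1 < 2,
# so the simplex fits inside the open ball B_2(0) used above.
assert np.linalg.norm(alpha) <= 1.0
```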
8.2 Continuity

Having generalized the idea of [a, b] to R^N, we now need to generalize continuity. Continuity is more difficult to visualize in R^N since we can no longer sketch a graph on a piece of paper. For a function f : R^2 → R we can visualize the graph in three dimensions. A continuous function is one for which the "sheet" is relatively smooth: It may have ridges or kinks, like a crumpled-up piece of paper that has been smoothed out, but there are no rips or tears in the surface.

Definition 8.2.1 A function f : D → R is continuous at c ∈ D if for all ε > 0, there exists a δ > 0 so that if ||x - c|| < δ, then |f(x) - f(c)| < ε.

The only modification from the one-dimensional definition is that we have replaced the set (c - δ, c + δ) with an open ball, B_δ(c) = {x : ||x - c|| < δ}. Otherwise, everything is the same. Subsequently, the proof of the Sequential Criterion for Continuity is almost exactly the same:
Theorem 8.2.2 (Sequential Criterion for Continuity) A function f : D → R is continuous at c iff for all sequences x_n → c, we have

    lim_{n→∞} f(x_n) = f(lim_{n→∞} x_n) = f(c)

Again, continuity allows us to commute the function f(·) with a limit operator lim. This is the last piece of the puzzle of generalizing the extreme value theorem.
8.3 The Extreme Value Theorem

So, what do you have to internalize from the preceding discussion in this chapter?

- A set is open if it is a union of open balls.
- A set is closed if it contains all its points of closure. A set is closed if its complement is open.
- A set in R^N is compact iff it is closed and bounded.
- In compact sets, all sequences have a convergent subsequence.
- If a function is continuous on R^N, then

    lim_{n→∞} f(x_n) = f(lim_{n→∞} x_n)

  for any sequence x_n → c.
The above facts let us generalize the Extreme Value Theorem:

Theorem 8.3.1 (Weierstrass Extreme Value Theorem) If K is a compact subset of R^N and f : K → R is a continuous function, then f(x) achieves a maximum on K.

Proof Let

    sup f(K) = m*

Then we can construct a sequence x_n satisfying

    m* - 1/n ≤ f(x_n) ≤ m*

Since K is compact, the sequence x_n has a convergent subsequence x_{n_k} → x*. Taking limits throughout the inequality, we get

    lim_{n_k → ∞} (m* - 1/n_k) ≤ lim_{n_k → ∞} f(x_{n_k}) ≤ m*

and by continuity of f(·),

    lim_{n_k → ∞} (m* - 1/n_k) ≤ f(lim_{n_k → ∞} x_{n_k}) ≤ m*

so that

    m* ≤ f(x*) ≤ m*

Since K is closed, the limit x* of the convergent subsequence x_{n_k} is in K (see the proof appendix; all closed sets contain all of their limit points).

Therefore, a maximizer exists, since x* is in K and achieves the supremum, f(x*) = m*.
This is the key result in explaining when maximizers and minimizers exist in R^N. Moreover, the assumptions are generally easy to verify: Compact sets are closed and bounded, which we can easily find in Euclidean space, and since our maximization problems generally involve calculus, f(x) will usually be differentiable, which implies continuity.
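To see the theorem at work numerically, here is a sketch (the function and the compact set are my choices for illustration) that grid-searches a continuous function over the closed unit disk:

```python
import numpy as np

# Maximize the continuous f(x, y) = -(x - 0.2)**2 - y**2 over the compact
# closed unit disk {(x, y) : x**2 + y**2 <= 1} by brute-force grid search.
g = np.linspace(-1.0, 1.0, 401)
X, Y = np.meshgrid(g, g)
inside = X**2 + Y**2 <= 1.0        # closed and bounded, hence compact
F = -(X - 0.2)**2 - Y**2
F[~inside] = -np.inf               # ignore points outside the disk

i, j = np.unravel_index(np.argmax(F), F.shape)
x_star, y_star = X[i, j], Y[i, j]

# A maximizer exists, and here it is (approximately) the interior point (0.2, 0).
assert abs(x_star - 0.2) < 1e-2 and abs(y_star) < 1e-2
```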
8.4 Multivariate Calculus

As we discussed earlier, differentiability of a function f : R^N → R is slightly more complicated than in the one-dimensional case. We review some of that material here briefly and introduce some new concepts.
Definition 8.4.1 The partial derivative of f(x) with respect to x_i, where f : D → R and D is an open subset of R^N, is

    ∂f(x)/∂x_i = f_{x_i}(x) = lim_{h→0} [f(x_1, ..., x_{i-1}, x_i + h, x_{i+1}, ..., x_N) - f(x_1, ..., x_{i-1}, x_i, x_{i+1}, ..., x_N)] / h

The gradient of f(x) is the vector of partial derivatives,

    ∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_N)

The total differential of f(x) at x is

    df(x) = Σ_{i=1}^N f_{x_i}(x) dx_i
Now that we have a slightly better understanding of R^N, the geometric intuition of the gradient is clearer. The partial derivative with respect to x_i is a one-dimensional derivative in the i-th dimension, giving the change in the function value that can be attributed to perturbing x_i slightly. The gradient is just the vector based at the point (x, f(x)) pointing in the direction ∇f(x), which represents how the function is changing at that point.

If we multiply ∇f(x) by the basis vector e_i (which is 1 in the i-th entry, but zero otherwise) we get

    ∇f(x) · e_i = ∂f(x)/∂x_i

so we are asking, "If we increase x_i a small amount, how does f(x) change?" What if we wanted to investigate how f(x) changes in some other direction, y? The change in f(x) in the direction yh as h goes to zero is given by

    lim_{h→0} [f(x_1 + y_1 h, ..., x_N + y_N h) - f(x_1, ..., x_N)] / h
To compute this, we can use a Taylor series dimension-by-dimension on f(x_1 + y_1 h, x_2 + y_2 h) to get

    f(x_1 + y_1 h, x_2 + y_2 h) = f(x_1, x_2 + y_2 h) + [∂f(ξ_1, x_2 + y_2 h)/∂x_1] y_1 h

where ξ_1 is between x_1 and x_1 + y_1 h. Doing this again with respect to x_2 gives

    f(x_1, x_2 + y_2 h) = f(x_1, x_2) + [∂f(x_1, ξ_2)/∂x_2] y_2 h

Substituting the above equation into the previous one yields

    f(x_1 + y_1 h, x_2 + y_2 h) = f(x_1, x_2) + [∂f(x_1, ξ_2)/∂x_2] y_2 h + [∂f(ξ_1, x_2 + y_2 h)/∂x_1] y_1 h

Re-arranging and dividing by h yields

    [f(x_1 + y_1 h, x_2 + y_2 h) - f(x_1, x_2)] / h = [∂f(ξ_1, x_2 + y_2 h)/∂x_1] y_1 + [∂f(x_1, ξ_2)/∂x_2] y_2

Since x_1 + y_1 h → x_1 as h → 0, ξ_1 → x_1 and ξ_2 → x_2. Then taking limits in h yields

    D_y f(x) = [∂f(x_1, x_2)/∂x_1] y_1 + [∂f(x_1, x_2)/∂x_2] y_2

Note that even though y is not an infinitesimal vector, this is the differential change in the value of f(x) with respect to an infinitesimal change in x in the direction y. In general,
Theorem 8.4.2 The change in f(x) in the direction y is given by the directional derivative of f(x) in the direction y,

    D_y f(x) = Σ_{i=1}^N [∂f(x)/∂x_i] y_i

or

    D_y f(x) = ∇f(x) · y

This is certainly useful for maximization purposes: You should anticipate that x* is a local maximum only if, for any y, D_y f(x*) = 0. Notice also that the total differential df(x) is just the directional derivative with respect to the infinitesimal vector dx = (dx_1, dx_2, ..., dx_N), or

    df(x) = ∇f(x) · dx
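The formula D_y f(x) = ∇f(x) · y can be checked against a finite-difference quotient (a sketch; the function f, the point x, and the direction y are arbitrary choices):

```python
import numpy as np

def f(v):
    return v[0]**2 * v[1]   # f(x1, x2) = x1^2 * x2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

grad = np.array([2 * x[0] * x[1], x[0]**2])   # analytic gradient: (4, 1)
D_y = grad @ y                                 # 4*3 + 1*(-1) = 11

# Finite-difference approximation of the directional derivative.
h = 1e-6
numeric = (f(x + h * y) - f(x)) / h
assert abs(D_y - numeric) < 1e-3
```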
The Hessian

The gradient only generalizes the first derivative, however, and our experience from the one-dimensional case tells us that the second derivative is also important. But Exercise 8 from Chapter 4 suggests that the vector of second partial derivatives of f(x) is not sufficient to determine whether a point is a local maximum or minimum. The correct generalization of the second derivative is not another vector, but a matrix.
Definition 8.4.3 The Hessian of f(x) is the matrix

    H(x) = [ ∂²f(x)/∂x_1∂x_1   ∂²f(x)/∂x_1∂x_2   ...   ∂²f(x)/∂x_1∂x_N ]
           [ ∂²f(x)/∂x_2∂x_1   ∂²f(x)/∂x_2∂x_2   ...   ∂²f(x)/∂x_2∂x_N ]
           [       ...                ...         ...         ...       ]
           [ ∂²f(x)/∂x_N∂x_1   ∂²f(x)/∂x_N∂x_2   ...   ∂²f(x)/∂x_N∂x_N ]

Various notations for the Hessian are ∇²_x f(x), D² f(x), ∇_{xx} f(x), and H(x).

For notational purposes, it is sometimes easier to write the matrix of differential operators

    ∇_{xx} = [ ∂²/∂x_1∂x_1   ...   ∂²/∂x_1∂x_N ]
             [     ...       ...       ...     ]
             [ ∂²/∂x_N∂x_1   ...   ∂²/∂x_N∂x_N ]

and think of the Hessian as ∇_{xx} f(x) = H(x). In particular, ∇_{xx} = ∇_x ∇'_x, so that when you take the gradient of a gradient you get a Hessian.
To begin understanding this matrix, consider the quadratic form

    f(x_1, x_2) = a_1 x_1 + a_2 x_2 + (1/2) b_11 x_1^2 + (1/2) b_22 x_2^2 + b_3 x_1 x_2 + c

This is the correct generalization of a quadratic function bx^2 + ax + c to two dimensions, allowing interaction between the x_1 and x_2 arguments. It can be written as

    f(x) = a'x + (1/2) x'Bx + c

where a = [a_1, a_2]' and

    B = [ b_11  b_3  ]
        [ b_3   b_22 ]

By increasing the rows and columns of a and B, this can easily be extended to mappings from R^N for any N.
The gradient of f(x_1, x_2) is

    ∇f(x) = [ a_1 + b_11 x_1 + b_3 x_2 ]
            [ a_2 + b_22 x_2 + b_3 x_1 ]

But you can see that x_2 appears in f_{x_1}(x), and vice versa, so that changes in x_2 affect the partial derivative with respect to x_1. To summarize this information, we need the extra, off-diagonal terms of the Hessian, or

    H(x) = [ b_11  b_3  ] = B
           [ b_3   b_22 ]
A basic fact of importance is that the Hessian is a symmetric matrix when f(x) is twice continuously differentiable, or H' = H:

Theorem 8.4.4 (Young's Theorem) If f(x) is a twice continuously differentiable function from R^N → R, then

    ∂²f(x)/∂x_i ∂x_j = ∂²f(x)/∂x_j ∂x_i

and more generally, the order of partial differentiation can be interchanged arbitrarily for all higher-order partial derivatives.
8.5 Taylor Polynomials in R^N

Second-order approximations played a key role in developing second-order sufficient conditions for the one-dimensional case, and we need a generalization of them for the multi-dimensional case. Again, however, the right generalization is somewhat subtle. The second-order Taylor polynomials of the one-dimensional case were quadratic functions with an error term,

    f(x) = f(x_0) + f'(x_0)(x - x_0) + f''(x_0) (x - x_0)^2 / 2 + o(h^3)

In the multivariate case, we start with a quadratic function,

    f(x) = c + a'x + (1/2) x'Bx

and replace each piece with its generalization of the one-dimensional derivative and second derivative to get a quadratic approximation

    f(x) ≈ f(x_0) + ∇f(x_0)(x - x_0) + (1/2) (x - x_0)' H(x_0) (x - x_0)

and this quadratic approximation becomes a Taylor polynomial when we add the remainder/error term, o(h^3):

    f(x) = f(x_0) + ∇f(x_0)(x - x_0) + (1/2) (x - x_0)' H(x_0) (x - x_0) + o(h^3)

where h = x - x_0.
Definition 8.5.1 Let f : D → R be a thrice-differentiable function, and D an open subset of R^N. Then the second-order Taylor polynomial of f(x) at x_0 is

    f(x) = f(x_0) + ∇f(x_0)(x - x_0) + (1/2) (x - x_0)' H(x_0) (x - x_0) + o(h^3)

where h = x - x_0, and the remainder o(h^3) → 0 as x → x_0.

A linear approximation of f around a point x_0 is the function

    f(x) ≈ f(x_0) + ∇f(x_0)(x - x_0)

and a quadratic approximation of f around a point x_0 is the function

    f(x) ≈ f(x_0) + ∇f(x_0)(x - x_0) + (1/2) (x - x_0)' H(x_0) (x - x_0)

Higher-order approximations require a lot more notation, but second-order approximations are all we need.
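Here is a numerical sketch of the quadratic approximation (f(x, y) = e^x y and the expansion point are illustrative choices; the gradient and Hessian are computed analytically):

```python
import numpy as np

def f(v):
    return np.exp(v[0]) * v[1]    # f(x, y) = e^x * y

x0 = np.array([0.0, 1.0])
grad = np.array([np.exp(x0[0]) * x0[1], np.exp(x0[0])])   # (1, 1)
H = np.array([[np.exp(x0[0]) * x0[1], np.exp(x0[0])],
              [np.exp(x0[0]), 0.0]])                      # symmetric Hessian

def taylor2(v):
    d = v - x0
    return f(x0) + grad @ d + 0.5 * d @ H @ d

v = np.array([0.05, 1.1])
# Near x0 the quadratic approximation is accurate up to the o(h^3) remainder.
assert abs(f(v) - taylor2(v)) < 1e-3
```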
8.6 Definite Matrices

In the one-dimensional case, the quadratic function

    f(x) = c + ax + Ax^2

was well-behaved for maximization purposes when it had a negative second derivative, or A < 0. In the multivariate case,

    f(x) = c + a · x + (1/2) x'Ax

it is unclear what sign

    x'Ax

takes. For instance, a · x > 0 if a > 0 and x > 0 (componentwise). But when is x'Ax ≥, ≤, >, < 0? If we knew this, it would be easier to pick out nice optimization problems, just as we know that the one-dimensional function f(x) = Ax^2 + ax + c will only have a nice solution if A < 0.
We can easily pick out some matrices A so that x'Ax < 0 for all x ∈ R^N \ {0}. If

    A = [ a_1  0    ...  0   ]
        [ 0    a_2  ...  0   ]
        [ ...       ...      ]
        [ 0    0    ...  a_N ]

then x'Ax = Σ_{i=1}^N a_i x_i^2, and this will only be negative if each a_i < 0. Notice that,

1. The eigenvalues of A are precisely the diagonal terms a_1, ..., a_N
2. The absolute value of each a_i is greater than the sum of all the off-diagonal terms in its row (which is zero)
3. If we compute the determinant of each sub-matrix starting from the upper left-hand corner, the determinants will alternate in sign

These are all characterizations of a negative definite matrix, or one for which x'Ax < 0 for all x ≠ 0. But once we start adding off-diagonal terms, this all becomes much more complicated, because it is no longer sufficient merely to have negative terms on the diagonal. Consider the following matrix:

    A = [ -1  -2 ]
        [ -2  -1 ]
All the terms are negative, so many students unfamiliar with definite matrices think that x'Ax will then be negative, as an extension of their intuitions about one-dimensional quadratic forms ax^2 + bx. But this is false. Take the vector x = [1, -1]':

    x'Ax = [1, -1] A [1, -1]' = 1 + 1 = 2 > 0

The failure here is that the quadratic form x'Ax = -x_1^2 - x_2^2 - 4 x_1 x_2 looks like a saddle, not like a hill. If x_1 x_2 < 0, then -4 x_1 x_2 > 0, and this can wipe out -x_1^2 - x_2^2, leaving a positive number. So a negative definite matrix must have a negative diagonal, and the contribution of the diagonal terms must outweigh those of the off-diagonal terms.
Let A be an N × N matrix, and x = (x_1, x_2, ..., x_N)' a vector in R^N. A quadratic form on R^N is the function

    x'Ax = Σ_{i=1}^N Σ_{j=1}^N a_ij x_i x_j
Definition 8.6.1 Let A be an n × n symmetric matrix. Then A is

- positive definite if, for any x ∈ R^n \ {0}, x'Ax > 0
- negative definite if, for any x ∈ R^n \ {0}, x'Ax < 0
- positive semi-definite if, for any x ∈ R^n \ {0}, x'Ax ≥ 0
- negative semi-definite if, for any x ∈ R^n \ {0}, x'Ax ≤ 0
Why does this relate to optimization theory? When maximizing a multi-dimensional function, a local (second-order Taylor) approximation of it around a point x* should look like a hill if x* is a local maximum. If, locally around x*, the function looks like a saddle or a bowl, the value of the function could be increased by moving away from x*. So these tests are motivated by the connections between the geometry of maximization and quadratic forms. In short, these definitions generalize the idea of a positive or negative number in R to a positive or negative matrix.
Definition 8.6.2 The leading principal minors of a matrix A are the matrices

    A_1 = [ a_11 ]

    A_2 = [ a_11  a_12 ]
          [ a_21  a_22 ]

    ...

    A_n = [ a_11  a_12  ...  a_1n ]
          [ a_21  a_22  ...  a_2n ]
          [ ...        ...        ]
          [ a_n1  a_n2  ...  a_nn ]
Here are the main tools for characterizing definite matrices:

Theorem 8.6.3 Let A be an n × n symmetric matrix. Then A is negative definite if

- All the eigenvalues of A are negative
- The leading principal minors of A alternate in sign, starting with det(A_1) < 0
- The matrix has a negative dominant diagonal: For each row r, a_rr < 0 and

    |a_rr| > |Σ_{i≠r} a_ri|
Example Consider

    A = [ -2   1 ]
        [  1  -2 ]

The leading principal minors are -2 < 0 and 4 - 1 = 3 > 0, so this matrix is negative definite.

The eigenvalues are found by solving

    det(A - λI) = (-2 - λ)^2 - 1 = 0

and λ = -2 ± 1 < 0. So the eigenvalues of A are all negative.

Since 2 > 1, it has the dominant diagonal property.
Example Consider

    A = [ -1  -2 ]
        [ -2  -1 ]

The leading principal minors are -1 < 0 and 1 - 4 = -3 < 0, so this matrix is not negative definite.

The eigenvalues are found by solving

    det(A - λI) = λ^2 + 2λ - 3 = 0

and λ = -1 ± 2, i.e., λ = 1 or λ = -3. It is not negative definite since it has an eigenvalue 1 > 0.

The principal minors test is obviously not satisfied.

Since A is neither negative nor positive (semi-)definite, it is actually indefinite.
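The eigenvalue and leading-principal-minor tests from these two examples can be run mechanically (a sketch using NumPy's linear algebra routines):

```python
import numpy as np

def leading_minors(A):
    # Determinants of the upper-left k-by-k sub-matrices, k = 1, ..., n.
    return [np.linalg.det(A[:k, :k]) for k in range(1, A.shape[0] + 1)]

A1 = np.array([[-2.0, 1.0], [1.0, -2.0]])    # negative definite
A2 = np.array([[-1.0, -2.0], [-2.0, -1.0]])  # indefinite

# A1: all eigenvalues negative; minors alternate in sign starting negative.
assert np.all(np.linalg.eigvalsh(A1) < 0)
m1, m2 = leading_minors(A1)
assert m1 < 0 and m2 > 0

# A2: a positive eigenvalue, and the quadratic form at x = [1, -1] is
# positive even though every entry of A2 is negative.
assert np.max(np.linalg.eigvalsh(A2)) > 0
x = np.array([1.0, -1.0])
assert x @ A2 @ x > 0
```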
These tests seem magical, so it's important to develop some intuition about what they mean.

If H is a real symmetric matrix, it can be factored into an orthogonal matrix P and a diagonal matrix D, where

    H = PDP'

and the eigenvalues (λ_1, λ_2, ..., λ_n) of H are on the diagonal of D. If we do this,

    x'Hx = x'PDP'x = (x'P)D(P'x) = (P'x)'D(P'x)

Note that P'x is a vector; call it y. Then we get

    y'Dy = Σ_i λ_i y_i^2

If all the eigenvalues λ_i of H are negative, the y_i^2 terms are positive scalars, and we can conclude that x'Hx < 0 if and only if H has all strictly negative eigenvalues.
How can the principal minors test be motivated? We can get a feel for why this must be true from the following observations: (i) If x'Hx < 0 for all x, then if we use x̃ = (x_1, x_2, ..., x_k, 0, ..., 0), with not all components zero, then x̃'Hx̃ < 0 as well. But because of the zeros, we're really just considering a k × k sub-matrix of H, H_k. (ii) The determinant of a matrix A is equal to the product of its eigenvalues. So if we take H_1, the determinant is h_11, and that should be negative; det(H_2) = λ_1 λ_2, which should be positive if H is negative definite (since the determinant is the product of the eigenvalues, and we know that the eigenvalues of a negative definite matrix are negative); det(H_3) = λ_1 λ_2 λ_3 < 0; and so on. So combining these two facts tells us that if H is negative definite, then we should observe this alternating sign pattern in the determinants of the principal minors. (Note that this only shows that if H is negative definite, then the principal minors test holds, not the converse.)
Necessity of the dominant diagonal theorem is easy to show, but sufficiency is much harder. Suppose H is negative definite. Take the vector 1 = (1, 1, ..., 1)', and observe the quadratic form

    1'H1 = Σ_i Σ_j h_ij < 0

Now, if for each row i we had Σ_j h_ij < 0, the above inequality would hold, since we then just sum over all i. But this is equivalent to

    h_ii + Σ_{j≠i} h_ij < 0

Re-arranging yields

    h_ii < -Σ_{j≠i} h_ij

Multiplying by -1 yields

    -h_ii > Σ_{j≠i} h_ij

and taking absolute values yields

    |h_ii| > |Σ_{j≠i} h_ij|

which is the dominant diagonal condition.
There are other tests and conditions available, but these are often the most useful in practice. Note that it is harder to identify negative semi-definiteness, because the only test that still applies is that all eigenvalues be weakly negative. In particular, if all the determinants of the leading principal minors alternate in sign but are sometimes zero, the quadratic form might be a saddle, similar to f(x) = x^3 being indefinite at x = 0, with f''(x) = 6x.
Exercises

1. Show that R^N is both open and closed. Show that the intersection of any collection of closed sets is closed, and the union of a finite number of closed sets is closed. Show that the union of any collection of open sets is open, and the intersection of a finite number of open sets is open. Show that a countable intersection of open sets can be closed.

2. Determine if the following matrices are positive or negative (semi-)definite:

    [ 1  3 ]    [ 1  1 ]    [ 1  5  ]    [ 1  2  1 ]
    [ 0  1 ]    [ 1  1 ]    [ 3  10 ]    [ 2  4  2 ]
                                         [ 1  2  1 ]

3. Find the gradients and Hessians of the following functions, evaluate the Hessian at the point given, and determine whether it is positive or negative (semi-)definite. (i) f(x, y) = x^2 + √x at (1, 1). (ii) f(x, y) = √(xy) at (3, 2), and any (x, y). (iii) f(x, y) = (xy)^2 at (7, 11) and any (x, y). (iv) f(x, y, z) = √(xyz) at (2, 1, 3).

4. Give an example of a matrix that has all negative entries on the diagonal but is not negative definite. Give an example of a matrix that has all negative entries but is not negative definite.

5. Prove that a function is continuous iff the inverse image of an open set is open, and the inverse image of a closed set is closed. Prove that if a function is continuous, then the image of a bounded set is bounded. Conclude that a continuous function maps compact sets to compact sets. Use this to provide a second proof of the extreme value theorem.
Proofs

The proofs here are mostly about compactness and the Heine-Borel theorem.

Definition 8.6.4 A subset K of Euclidean space is compact if each sequence in K has a subsequence that converges to a point in K.

So a set is compact^3 if the Bolzano-Weierstrass theorem holds for every sequence constructed only from points in the set. For an example of a non-compact set, consider (a, b). The sequence x_n = b - 1/n is constructed completely from points in (a, b), but its limit, b, is not in the set, so it does not have the Bolzano-Weierstrass property.

Recall that in our proof of the Weierstrass theorem, we ensured a maximum existed by studying convergent subsequences. This will again be the key to ensuring existence in the N-dimensional case. The next theorem characterizes closedness in terms of sequences.

Theorem 8.6.5 Suppose x_n is a convergent sequence such that x_n is in S for all n. Then x_n converges to a point in the closure of S, S̄.

Proof (Sketch a picture as you go along.) Let x̄ be the limit of x_n, and suppose, by way of contradiction, that x̄ is not in S̄. Then we can draw an open ball around x̄ of radius ε > 0 such that no points of S̄ are in B_ε(x̄). But since x_n → x̄, we also know that for all ε > 0, there is an H such that for n ≥ H, ||x_n - x̄|| < ε, so that countably many points of S̄ are in B_ε(x̄). This is a contradiction.
Since a closed set satisfies S̄ = S, this implies that every sequence in a closed set, if it converges, converges to a member of the set. However, there are sets like [0, ∞) that are closed (since any sequence that converges in this set converges to a point of the set) but allow plenty of non-convergent sequences, like x_n = n. For example, if you were asked to maximize f(x) = x on the set [b, ∞), no maximum exists: f([b, ∞)) = [b, ∞), and the range is unbounded. So for maximization theory, it appears that closedness and boundedness of the image of a set under a function are key properties. In fact, closedness and boundedness together are equivalent to compactness in R^N.
Theorem 8.6.6 (Heine-Borel) In Euclidean space, a set is compact iff it is closed and bounded.
Proof Consider any sequence x_n in a closed and bounded subset K of R^N. We will show it has a convergent subsequence, and consequently that K is compact.
Since K is bounded, there is an N-dimensional hypercube H, which satisfies H = ∏_{i=1}^N [a_i, b_i] and K ⊂ H. Without loss of generality we can enlarge the hypercube to [a, b]^N, where a = min_i a_i and b = max_i b_i, so that b_i − a_i is the same length for all dimensions.
Now cut H into N^N equally sized subcubes, each of size ((b − a)/N)^N. One of these cubes contains an infinite number of terms of the sequence in K. Select a term from that cube and call it x_{n_1}, and throw the rest of the cubes away. Now repeat this procedure, cutting the remaining cube N ways along each dimension; one of these sub-cubes contains an infinite number of terms of the sequence; select a term from that subcube and call it x_{n_k}, and throw the rest away.
The volume of the subcubes at each step k is equal to ((b − a)/N^k)^N
³The most general definition of compactness is: a subset K of Euclidean space is compact if any collection of open sets {O_i}_{i=1}^∞ for which K ⊂ ∪_i O_i has a finite collection {O_{i_k}}_{k=1}^K so that K ⊂ ∪_{k=1}^K O_{i_k}. In words: K is compact if every open cover of K has a finite sub-cover; if an infinite collection of open sets covers K, we can find a finite number of them that do the same job. This is actually easier to work with than the sequential compactness we'll use, but they are equivalent for R^N.
which is clearly converging to zero as k → ∞. Then a bound on the distance from each term to all later terms in the sequence is given by the above estimate.
Therefore, the sequence constructed from this procedure has a limit, x̄ (since it is a Cauchy sequence). Therefore, the subsequence x_{n_k} converges. Since K is closed, it contains all of its limit points by Theorem 8.6.5 above, so x̄ is an element of K. Therefore K is compact. □
The Bolzano-Weierstrass theorem was a statement about bounded sequences: every bounded sequence has a convergent subsequence. The Heine-Borel theorem is a statement about closed and bounded sets: a set is compact iff it is closed and bounded. The bridge between the two is that an infinitely long sequence in a bounded set must be near some point x̄ infinitely often. If the set contains all of its points of closure, this point x̄ is actually a member of the set K, so compactness and closedness-plus-boundedness are closely related. However, boundedness alone is not sufficient, since a sequence might not converge to a point in the set, as in the case of (a, b) with x_n = b − 1/n. To ensure the limit is in the set, we add the closedness condition, and we get a useful characterization of compactness.
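The subdivision argument in the Heine-Borel proof can be sketched numerically. The following Python snippet is our own illustration, not from the text: it stands in for the infinite sequence with a long finite prefix, bisects an interval, keeps a half containing most of the terms (a finite-sample proxy for "infinitely many"), and selects one term per stage, producing a subsequence converging to b = 1 for x_n = 1 − 1/n in (0, 1).

```python
# A one-dimensional sketch (our own illustration, not from the text) of the
# subdivision argument in the Heine-Borel proof: repeatedly halve an interval
# containing a bounded sequence, keep a half holding most of the terms
# (a finite-sample proxy for "infinitely many"), and select one term per
# stage. The selections form a convergent subsequence.

def bisection_subsequence(terms, lo, hi, stages):
    chosen, last = [], -1
    for _ in range(stages):
        mid = (lo + hi) / 2.0
        left = [(n, x) for n, x in terms if x <= mid]
        right = [(n, x) for n, x in terms if x > mid]
        if len(right) >= len(left):
            terms, lo = right, mid
        else:
            terms, hi = left, mid
        for n, x in terms:            # pick a term later in the sequence
            if n > last:
                chosen.append(x)
                last = n
                break
    return chosen

# x_n = 1 - 1/n is a sequence in (0, 1); the extracted subsequence converges
# to 1, which lies outside the (non-compact) open interval.
xs = [(n, 1.0 - 1.0 / n) for n in range(1, 5001)]
sub = bisection_subsequence(xs, 0.0, 1.0, stages=10)
print(round(sub[-1], 3))
```

The limit 1 is not a member of (0, 1), which is exactly why the open interval fails to be compact.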
Chapter 9
Unconstrained Optimization
We've already seen some examples of unconstrained optimization problems. For example, a firm that faces a price p for its good, hires capital K and labor L at rates r and w per unit, and produces output according to the production technology F(K, L) faces an unconstrained maximization problem

max_{K,L} pF(K, L) − rK − wL

In this and similar problems, we usually allow the choice set to be R^N, and allow the firm to pick any K and L that maximize profits. As long as K and L are both positive, this is a fine approach, and we don't have to move to the more complicated world of constrained optimization.
Similarly, many constrained problems have the feature that the constraint can be re-written in terms of one of the controls and substituted into the objective. For example, in the consumer's problem

max_x u(x)

subject to w = p · x, the constraint can be re-written as

x_N = (w − Σ_{i=1}^{N−1} p_i x_i) / p_N

and substituted into the objective to yield

max_{x_1,...,x_{N−1}} u( x_1, ..., x_{N−1}, (w − Σ_{i=1}^{N−1} p_i x_i) / p_N )

which is an unconstrained maximization problem in x_1, ..., x_{N−1}. Even though we have better methods in general for solving the above problem from a theoretical perspective, solving an unconstrained problem numerically is generally easier than solving a constrained one.
Definition 9.0.7 Let f : R^n → R, x ∈ R^n. Then the unconstrained maximization problem is

max_{x ∈ R^n} f(x)

A global maximum of f on X is a point x* such that f(x*) ≥ f(x′) for all other x′ ∈ X. A point x* is a local maximum of f if there is an open set S = {y : ||x* − y|| < ε} for which f(x*) ≥ f(y) for all y ∈ S.
How do we find solutions to this problem?
9.1 First-Order Necessary Conditions
Our first step to solving unconstrained maximization problems is to build up a candidate list using FONCs, just like in the one-dimensional case.
Definition 9.1.1 If ∇f(x*) = 0, then x* is a critical point of f(x).


Example Recall the rm with prot function
(K, L) = max
K,L
pF(K, L) rK wL
Then if we make a small change in K, the change in prots is
(K, L)
K
= pF
K
(K, L) r
and if we make a small change in L, the change in prots is
(K, L)
L
= pF
L
(K, L) w
If there are no protable adjustments away from a given point (K

, L

), then it must be a local


maximum, so the equations above both equation zero. But then (K

, L

) = 0 implies that
(K

, L

) is a critical point of f(x).


The above argument works for any function f(x):
Theorem 9.1.2 (First-Order Necessary Conditions) If x* is a local maximum of f and f is differentiable at x*, then x* is a critical point of f.
Proof If x* is a local maximum and f is differentiable at x*, there cannot be any improvement in any direction v ≠ 0. The directional derivative is

∇f(x*) · v = Σ_i (∂f(x*)/∂x_i) v_i

So we can think of the differential change as the sum of one-dimensional directional derivatives in the directions y = (0, ..., v_i, 0, ..., 0), where y is zero except for the i-th slot, taking the value v_i ≠ 0:

∇f(x*) · y = Σ_i (∂f(x*)/∂x_i) y_i = (∂f(x*)/∂x_i) v_i = 0

So each partial derivative must be zero along each dimension individually, implying that a local maximum is a critical point. □
Like in the one-dimensional case, this gives us a way of building a candidate list of potential maximizers: critical points and any points of non-differentiability.
Example Consider a quadratic objective function,

f(x_1, x_2) = a_1 x_1 + a_2 x_2 − (b_1/2) x_1² − (b_2/2) x_2² + c x_1 x_2

The FONCs are

a_1 − b_1 x_1* + c x_2* = 0
a_2 − b_2 x_2* + c x_1* = 0

Any solution to these equations is a critical point. If we solve the system by hand, the second equation is equivalent to

x_2* = (a_2 + c x_1*) / b_2

and substituting it into the first gives

a_1 − b_1 x_1* + c (a_2 + c x_1*) / b_2 = 0

or

x_1* = (a_1 b_2 + a_2 c) / (b_1 b_2 − c²),  x_2* = (a_2 b_1 + a_1 c) / (b_1 b_2 − c²)

If, instead, we convert this to a matrix equation,

[ a_1 ]   [ −b_1    c  ] [ x_1* ]   [ 0 ]
[ a_2 ] + [   c   −b_2 ] [ x_2* ] = [ 0 ]

re-arranging yields

[ −b_1    c  ] [ x_1* ]     [ a_1 ]
[   c   −b_2 ] [ x_2* ] = − [ a_2 ]

which looks exactly like Bx* = −a. From linear algebra, we know that there is a solution as long as B is non-singular (it has full rank; equivalently, all its eigenvalues are non-zero, it is invertible, and it has non-zero determinant). Then

x* = (−B)^{−1} a

and we have a solution. The determinant of B is non-zero iff

b_1 b_2 − c² ≠ 0

So if the above condition is satisfied, there is a unique critical point.
Notice, however, that we could have done the whole analysis as

f(x) = a · x + x′ (B/2) x

yielding FONCs

a + Bx* = 0

or

x* = −B^{−1} a

Moreover, this equation can be solved in a few lines of Matlab code, making it a useful starting point for playing with non-linear optimization problems. However, we still don't know whether x* is a maximum, minimum, or saddle point. □
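The same few lines can be written in Python rather than Matlab. Here is a minimal sketch (our own code, with made-up coefficients a_1, a_2, b_1, b_2, c) that solves the FONCs by inverting the matrix equation and compares the result to the closed-form solution derived by hand:

```python
import numpy as np

# Illustration (parameter values made up): solve the FONCs of
# f(x1, x2) = a1*x1 + a2*x2 - (b1/2)*x1^2 - (b2/2)*x2^2 + c*x1*x2
# via the matrix equation a + B x* = 0, with B = [[-b1, c], [c, -b2]].
a1, a2, b1, b2, c = 2.0, 3.0, 4.0, 5.0, 1.0

a = np.array([a1, a2])
B = np.array([[-b1, c], [c, -b2]])

x_star = np.linalg.solve(-B, a)          # x* = (-B)^{-1} a

# Compare against the closed-form expressions derived by hand.
den = b1 * b2 - c**2
by_hand = np.array([(a1 * b2 + a2 * c) / den, (a2 * b1 + a1 * c) / den])
print(np.allclose(x_star, by_hand))      # True
```

The one-line `np.linalg.solve` call avoids forming the inverse explicitly, which is also the numerically preferred approach.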
9.2 Second-Order Sufficient Conditions
With FONCs, we can put together a candidate list for any unconstrained maximization problem: critical points and any points of non-differentiability. However, we still don't know whether a given critical point is a maximum, a minimum, or a saddle/inflection point.
Example Consider the function

f(x, y) = −x²y

The FONCs for this function are

f_x(x, y) = −2xy = 0
f_y(x, y) = −x² = 0

The critical points are all points with x = 0; consider (0, 0). If we evaluate the function at (1, −1), we get f(1, −1) = −(1)(−1) = 1 > 0 = f(0, 0). What is going on? Well, if we were considering −x² alone, it would have a global maximum at zero. But the −y term has no global maximum. When you multiply these functions together, even though one is well-behaved, the misbehavior of the other leads to non-existence of a maximizer in R² (I can make f(x, y) arbitrarily large by making y arbitrarily negative and x = 1).
Still worse, as you showed in Exercise 8 of Chapter 4, it is not enough that a critical point (x*, y*) satisfy f_xx(x*, y*) < 0 and f_yy(x*, y*) < 0 to be a local maximum.
Example Consider the function

f(x, y) = −(1/2)x² − (1/2)y² + bxy

where b > 0. The FONCs are

f_x(x, y) = −x + by = 0
f_y(x, y) = −y + bx = 0

The unique critical point of the system is (0, 0). However, if we set x = y = z, the function becomes

f(z) = −z² + bz²

and if b > 1, we can make the function arbitrarily large. The function is perfectly well behaved in each direction: if you plot a cross-section of the function for all x setting y to zero, it achieves a maximum in x at zero, and if you plot a cross-section of the function for all y setting x to zero, it achieves a maximum in y at zero. Nevertheless, (0, 0) is not a global maximum if b is too large.
If we use a Taylor series, however, we can write any multi-dimensional function as

f(x) = f(x_0) + ∇f(x_0) · (x − x_0) + (x − x_0)′ (H(x_0)/2) (x − x_0) + o(h³)

If x* is a critical point, we know that ∇f(x*) = 0, or

f(x) = f(x*) + (x − x*)′ (H(x*)/2) (x − x*) + o(h³)

and re-arranging yields

f(x*) − f(x) = −(x − x*)′ (H(x*)/2) (x − x*) − o(h³)

Letting h = x − x* be arbitrarily close to zero, x* is a local maximum of f(x) if

f(x*) − f(x) = −(x − x*)′ (H(x*)/2) (x − x*) > 0

or, for any vector y ≠ 0,

y′ H(x*) y < 0

This is the definition of a negative definite matrix, giving us a workable set of SOSCs for multi-dimensional maximization:
Theorem 9.2.1 (Second-Order Sufficient Conditions) If x* is a critical point of f(x) and H(x*) is negative definite, then x* is a local maximum of f(x).
For example, the Hessian of

f(x, y) = −(1/2)x² − (1/2)y² + bxy

is

[ −1   b ]
[  b  −1 ]

The determinant of the Hessian is 1 − b², which is positive if b < 1. So if b is sufficiently small, f(x, y) will have a local maximum at (0, 0).
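For a 2×2 symmetric Hessian, negative definiteness can be verified with the leading-principal-minor test: h_11 < 0 together with det H > 0. The sketch below (our own helper, not from the text) confirms that the Hessian above is negative definite for b = 0.5 and 0.9 but not for b ≥ 1:

```python
# For a 2x2 symmetric matrix, negative definiteness is equivalent to
# h11 < 0 together with det(H) > 0 (leading principal minors).
# Applied to H = [[-1, b], [b, -1]] from the example above.

def is_negative_definite_2x2(h11, h12, h21, h22):
    return h11 < 0 and (h11 * h22 - h12 * h21) > 0

for b in (0.5, 0.9, 1.0, 2.0):
    print(b, is_negative_definite_2x2(-1.0, b, b, -1.0))
```

At b = 1 the determinant is exactly zero, so the matrix is only negative semi-definite, and the second-order sufficient condition fails.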
On the other hand, if we start with the assumption that x* is a local maximum of f(x), and not just a critical point, then using our Taylor series approximation we get

f(x*) − f(x) = −(x − x*)′ (H(x*)/2) (x − x*) + o(h³)

Since x* is a local maximum, we know that f(x*) ≥ f(x), so that

−(x − x*)′ (H(x*)/2) (x − x*) ≥ 0

for x sufficiently close to x*, implying that y′ H(x*) y ≤ 0. This is negative semi-definiteness.
To summarize:
Theorem 9.2.2 (Second-Order Conditions) Let f : R^n → R be a twice differentiable function, and let x* be a point in R^n.
If x* is a critical point of f(x) and H(x*) is negative definite, then x* is a local maximum of f.
If x* is a critical point of f(x) and H(x*) is positive definite, then x* is a local minimum of f.
If f has a local maximum at x*, then H(x*) is negative semi-definite.
If f has a local minimum at x*, then H(x*) is positive semi-definite.
The first two points (sufficient conditions) are useful for checking whether or not a particular point is a local maximum or minimum. The second two (necessary conditions) are useful when using the implicit function theorem.
Here are some examples:
Example Suppose a price-taking firm gets a price p for its product, which it produces using capital K and labor L through a technology F(K, L) = q. Capital costs r per unit and labor costs w per unit. This gives a profit function of

π(K, L) = pF(K, L) − rK − wL

The FONCs are

pF_K(K, L) − r = 0
pF_L(K, L) − w = 0

And our SOSCs are that

[ F_KK(K*, L*)  F_KL(K*, L*) ]
[ F_LK(K*, L*)  F_LL(K*, L*) ]

be a negative-definite matrix. This implies

F_KK(K*, L*) < 0,  F_LL(K*, L*) < 0

and

F_KK(K*, L*) F_LL(K*, L*) − F_KL(K*, L*)² > 0

So F(K, L) must be own-concave in K and L, meaning F_KK(K*, L*) < 0 and F_LL(K*, L*) < 0. Likewise, the cross-partial F_KL(K*, L*) cannot be too large: in terms of economics, K and L cannot be such strong substitutes or complements that switching from one to the other has a larger impact than using more of each. A simple example might be

F(K, L) = a_1 K − (a_2/2)K² + b_1 L − (b_2/2)L² + cKL

The Hessian would then be

[ −a_2    c  ]
[   c   −b_2 ]

with determinant

a_2 b_2 − c² > 0

So (K*, L*) is a local maximum if √(a_2 b_2) > c.
Example Here is a slightly different firm problem: a firm hires capital K at rental rate r and labor L at wage rate w. The firm's output produced by a given mix of capital and labor is F(K, L) = K^α L^β. The price of the firm's good is p. This is an unconstrained maximization problem in (K, L), so we can solve

max_{K,L} pK^α L^β − rK − wL

This gives first-order conditions

r = pαK^{α−1}L^β
w = pβK^α L^{β−1}

Dividing these two equations yields

r/w = (α/β)(L/K)

Solving for L in terms of K yields

L = (βr/(αw)) K

Substituting back into the first-order condition for K gives

r = pαK^{α−1} ( (βr/(αw)) K )^β

so that

K* = ( pα^{1−β}β^β / (r^{1−β}w^β) )^{1/(1−α−β)}

Hypothetically, we could now differentiate K* with respect to α, β, r, or w, to see how the firm's choice of capital varies with economic conditions. But that looks inconvenient, especially for α or β, which appear everywhere and in exponents.
What are the SOSCs? The Hessian is

[ pα(α−1)K^{α−2}L^β     pαβK^{α−1}L^{β−1} ]
[ pαβK^{α−1}L^{β−1}     pβ(β−1)K^α L^{β−2} ]

The upper-left and lower-right corners are negative if 0 < α < 1 and 0 < β < 1. The determinant is

det H = p²αβ(α−1)(β−1)K^{2α−2}L^{2β−2} − p²α²β²K^{2α−2}L^{2β−2}

or

det H = p²αβ {(α−1)(β−1) − αβ} K^{2α−2}L^{2β−2}

which is positive if 1 − α − β > 0, or 1 > α + β.
So as long as 0 < α < 1, 0 < β < 1 and 1 > α + β, the Hessian is negative definite, and the critical point (K*, L*) is a local maximum. Since it is the only point on the candidate list, it is a global maximum.
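Since the closed form for K* is error-prone to derive, it is worth a numerical check. The sketch below (our own code, with made-up values of p, r, w, α, β satisfying the SOSC restrictions) plugs the closed-form K* and the implied L* back into both first-order conditions:

```python
# Numerical check (our own, with made-up parameter values) that the
# closed-form K* solves the Cobb-Douglas firm's FONCs
#   r = p*alpha*K^(alpha-1)*L^beta,  w = p*beta*K^alpha*L^(beta-1),
# with L* = (beta*r/(alpha*w)) * K*.
p, r, w = 2.0, 1.0, 1.5
alpha, beta = 0.3, 0.4        # 0 < alpha, beta and alpha + beta < 1

K = (p * alpha**(1 - beta) * beta**beta
     / (r**(1 - beta) * w**beta)) ** (1 / (1 - alpha - beta))
L = (beta * r / (alpha * w)) * K

foc_K = p * alpha * K**(alpha - 1) * L**beta - r
foc_L = p * beta * K**alpha * L**(beta - 1) - w
print(abs(foc_K) < 1e-9 and abs(foc_L) < 1e-9)   # True
```

Both residuals are zero up to floating-point rounding, confirming the algebra above.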
Example Suppose we have a consumer with utility function u(q_1, q_2, m) = v(q_1, q_2) + m over two goods and money, with wealth constraint w = p_1 q_1 + p_2 q_2 + m. Substituting the constraint into the objective, we get

max_{q_1,q_2} v(q_1, q_2) + w − p_1 q_1 − p_2 q_2

The FONCs are

v_1(q_1*, q_2*) − p_1 = 0
v_2(q_1*, q_2*) − p_2 = 0

and the SOSCs are that

[ v_11(q_1*, q_2*)  v_21(q_1*, q_2*) ]
[ v_12(q_1*, q_2*)  v_22(q_1*, q_2*) ]

be negative definite. Can we figure out how purchases of q_1* respond to a change in p_2? Well, the two functions q_1*(p_2) and q_2*(p_2) are implicitly determined by the system of FONCs. If we totally differentiate, we get

v_11 (∂q_1/∂p_2) + v_21 (∂q_2/∂p_2) = 0
v_12 (∂q_1/∂p_2) + v_22 (∂q_2/∂p_2) − 1 = 0

Solving the first equation in terms of ∂q_2/∂p_2, we get

∂q_2/∂p_2 = −(v_11/v_21)(∂q_1/∂p_2)

and substituting into the second equation gives

v_12 (∂q_1/∂p_2) − v_22 (v_11/v_21)(∂q_1/∂p_2) − 1 = 0

and solving for ∂q_1/∂p_2 yields

∂q_1/∂p_2 = −v_21 / (v_11 v_22 − v_12 v_21)

Notice that the denominator is the determinant of H(q*), so it must be positive. Consequently, the sign is determined by the numerator,

sign(∂q_1/∂p_2) = sign(−v_21)

So q_1 and q_2 are gross complements when v_21 > 0, and gross substitutes when v_21 < 0.
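The sign result can be checked on a concrete quadratic v. In the sketch below (our own toy specification with made-up parameters; g plays the role of v_21), demand is computed from the FONCs, and a finite-difference approximation of ∂q_1/∂p_2 is compared to the implicit-function-theorem formula:

```python
# Finite-difference check (our own toy v, made-up numbers) of
#   dq1/dp2 = -v21 / (v11*v22 - v12*v21)
# using v(q1, q2) = a1*q1 + a2*q2 - q1^2/2 - q2^2/2 + g*q1*q2, so that
# v11 = v22 = -1 and v12 = v21 = g.
a1, a2, g = 10.0, 8.0, 0.5

def demand(p1, p2):
    # Solve the FONCs v1 = p1, v2 = p2: a 2x2 linear system.
    det = 1.0 - g * g
    q1 = (-(p1 - a1) - g * (p2 - a2)) / det
    q2 = (-(p2 - a2) - g * (p1 - a1)) / det
    return q1, q2

p1, p2, h = 1.0, 2.0, 1e-6
dq1_dp2 = (demand(p1, p2 + h)[0] - demand(p1, p2 - h)[0]) / (2 * h)
ift = -g / (1.0 - g * g)        # -v21 / (v11*v22 - v12*v21)
print(abs(dq1_dp2 - ift) < 1e-6)   # True
```

With g > 0 (complements in utility) the derivative is negative, matching the conclusion above.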
So it is pretty straightforward to apply the implicit function theorem to this two-dimensional problem. But when the number of controls becomes large, it is less obvious that this will work. We will need to develop more subtle tools for working through higher-dimensional comparative statics problems.
9.3 Comparative Statics
Perhaps if we take a broader perspective, we can see some of the structure behind the comparative statics exercise we did for the consumer above. The unconstrained maximization problem

max_x f(x, c)

has FONCs

∇_x f(x*(c), c) = 0

where c is a single exogenous parameter (the extension to a vector of parameters is easy, and in any math econ text).
If we differentiate the FONCs with respect to c, we get

∇_xx f(x*(c), c) ∂_c x*(c) + ∇_x f_c(x*(c), c) = 0

Since ∇_xx f(x*(c), c) = H(x*(c), c), we can write

H(x*(c), c) ∂_c x*(c) = −∇_x f_c(x*(c), c)

Since H(x*(c), c) is negative definite, all its eigenvalues are negative, so it is invertible, and

∂_c x*(c) = −H(x*(c), c)^{−1} ∇_x f_c(x*(c), c)

If x and c are both one-dimensional, this is just

∂x*(c)/∂c = −f_xc(x*(c), c) / f_xx(x*(c), c)

So the H(x*(c), c)^{−1} term is just the generalization of 1/f_xx(x*(c), c), and ∇_x f_c(x*(c), c) is just the generalization of f_xc(x*(c), c).
Theorem 9.3.1 (Implicit Function Theorem) Suppose that ∇_x f(x(c), c) = 0 and that f(x, c) is differentiable in x and c. Then there is a locally continuous, implicit solution x*(c) with derivative

∂_c x*(c) = −H(x*(c), c)^{−1} ∇_x f_c(x*(c), c)
Example Recall the profit-maximizing firm with general production function F(K, L), and let's see how a change in r affects K* and L*. The system of FONCs is

pF_K(K*, L*) − r = 0
pF_L(K*, L*) − w = 0

Totally differentiating with respect to r yields

pF_KK (∂K*/∂r) + pF_LK (∂L*/∂r) − 1 = 0
pF_KL (∂K*/∂r) + pF_LL (∂L*/∂r) = 0

Doing the system by hand yields

∂L*/∂r = −pF_KL / det(H)

where det(H) = p²(F_KK F_LL − F_KL F_LK). Note that the denominator is the determinant of the Hessian.
You can always grind the solution out by hand, and I like to do it sometimes to check my answer. But there is another tool, Cramer's rule, for solving equations like this. Writing the system of equations in matrix notation gives

[ pF_KK  pF_KL ] [ ∂K*/∂r ]   [ 1 ]
[ pF_LK  pF_LL ] [ ∂L*/∂r ] = [ 0 ]

where the matrix on the left is H(x*(c), c) = ∇_xx f(x*(c), c), the middle vector is ∂_c x*(c), and the right-hand side is −∇_x f_c(x*(c), c).
So we have a matrix equation, Ax = b, and we want to solve for x, the vector of comparative statics.
Theorem 9.3.2 (Cramer's Rule) Consider the matrix equation

Ax = b,  where A = (A_{·1}, A_{·2}, ..., A_{·N})

and A_{·k} is the k-th column of A. Then

x_i = det([A_{·1}, ..., A_{·i−1}, b, A_{·i+1}, ..., A_{·N}]) / det(A)

So to use Cramer's rule to solve for the i-th component of x, replace the i-th column of A with b and compute that determinant, then divide by the determinant of A.
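That recipe translates directly into code. Here is a 2×2 sketch (our own helper functions, with made-up numbers) of Cramer's rule, checked by substituting the solution back into the equations:

```python
# A direct 2x2 implementation (our own helper) of Cramer's rule:
# replace column i of A with b, take the determinant, divide by det(A).

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def cramer2(A, b):
    d = det2(A)
    x1 = det2([[b[0], A[0][1]], [b[1], A[1][1]]]) / d   # b replaces column 1
    x2 = det2([[A[0][0], b[0]], [A[1][0], b[1]]]) / d   # b replaces column 2
    return x1, x2

# Example with made-up numbers: solve A x = b.
A = [[2.0, 1.0], [1.0, 3.0]]
b = [5.0, 10.0]
x1, x2 = cramer2(A, b)
# Check by substituting back into the equations.
print(abs(2*x1 + 1*x2 - 5) < 1e-12 and abs(1*x1 + 3*x2 - 10) < 1e-12)  # True
```

For large systems, Gaussian elimination is far cheaper than Cramer's rule, but for 2×2 or 3×3 comparative statics the rule is convenient for reading off signs.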
For the firm example, we get

∂L*/∂r = det( [ pF_KK  1 ]
              [ pF_KL  0 ] ) / det(H) = −pF_KL / det(H)

So the sign of ∂L*/∂r is equal to the sign of −F_KL.
In fact, this should always work, because

H(x*(c), c) ∂_c x*(c) = −∇_x f_c(x*(c), c)

(an N×N matrix times an N×1 vector equals an N×1 vector) can always be written as a matrix equation Hx = b. Since H is negative definite at a maximum, all its eigenvalues are negative, and in particular non-zero, so it is invertible.
9.4 The Envelope Theorem
Similarly, we can characterize an envelope theorem for unconstrained maximization problems that shows how an agent's payoff varies with an exogenous parameter. The generic unconstrained maximization problem is

max_x f(x, c)

with FONCs

∇_x f(x*(c), c) = 0

If we consider the value function,

V(c) = f(x*(c), c)

this gives the maximized payoff of the agent for each value of the exogenous parameter c. Differentiating with respect to c yields

∂_c V(c) = ∇_x f(x*(c), c) · ∂_c x*(c) + ∂_c f(x*(c), c)

where the first factor is the FONC. Since the FONC must be zero at a maximum, we have

∂_c V(c) = ∂_c f(x*(c), c)

So the derivative of the value function with respect to a given parameter is just the partial derivative with respect to that parameter.
Example Consider a consumer with value function

V(p_1, p_2, w) = v(q_1*, q_2*) + w − p_1 q_1* − p_2 q_2*

If we differentiate with respect to, say, p_2, each maximized value q_1* and q_2* must also be differentiated, yielding

∂V/∂p_2 = v_1 (∂q_1*/∂p_2) + v_2 (∂q_2*/∂p_2) − p_1 (∂q_1*/∂p_2) − p_2 (∂q_2*/∂p_2) − q_2*

which is a mess. It would appear that we have to go back and compute all the comparative statics to sign this (and even then there would be no guarantee that it would work). But re-arranging yields

∂V/∂p_2 = (v_1 − p_1)(∂q_1*/∂p_2) + (v_2 − p_2)(∂q_2*/∂p_2) − q_2*

Using the FONCs, we get

∂V/∂p_2 = −q_2*

This is the logic of the envelope theorem. So we can skip all the intermediate re-arranging steps, and just differentiate directly with respect to parameters once the payoff-maximizing behavior has been substituted in:

∂V/∂w = 1
∂V/∂p_1 = −q_1*
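The envelope result ∂V/∂p_2 = −q_2* is easy to verify numerically. The sketch below is our own toy example (separable quadratic v and made-up parameters, so demand has the closed form q_i* = a_i − p_i); it compares a finite-difference derivative of the value function to −q_2*:

```python
# Finite-difference check (our own toy example, made-up numbers) of the
# envelope result dV/dp2 = -q2*, for v(q1, q2) = a1*q1 + a2*q2
# - q1^2/2 - q2^2/2, so demand is q1* = a1 - p1, q2* = a2 - p2.
a1, a2, w = 10.0, 8.0, 100.0

def V(p1, p2):
    q1, q2 = a1 - p1, a2 - p2            # FONCs: v1 = p1, v2 = p2
    v = a1 * q1 + a2 * q2 - q1**2 / 2 - q2**2 / 2
    return v + w - p1 * q1 - p2 * q2

p1, p2, h = 1.0, 2.0, 1e-5
dV_dp2 = (V(p1, p2 + h) - V(p1, p2 - h)) / (2 * h)
q2_star = a2 - p2
print(abs(dV_dp2 + q2_star) < 1e-5)      # True
```

All the comparative-statics terms cancel, exactly as the envelope theorem predicts.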
Exercises
1. Prove that if x* is a local maximizer of f(x), then it is a local minimizer of −f(x).
2. Suppose x* is a global maximizer of f : R^n → R. Show that for any monotone increasing transformation g : R → R, x* is a maximizer of g(f(x)).
3. Find all critical points of f(x, y) = (x² − 4)² + y² and show which are maxima and which are minima.
4. Find all critical points of f(x, y) = (y − x²)² − x² and show which are maxima and which are minima.
5. Describe the set of maximizers and set of minimizers of f(x, y) = cos(√(x² + y²)).
6. Suppose a firm produces two goods, q_1 and q_2, whose prices are p_1 and p_2, respectively. The costs of production are C(q_1, q_2). Characterize profit-maximizing output and show how q_1 varies with p_2. If C(q_1, q_2) = c_1(q_1) + c_2(q_2) + bq_1q_2, explain when a critical point is a local maximum of the profit function. How do profits vary with b and p_1?
7. Suppose you have a set of dependent variables {y_1, y_2, ..., y_N} generated by independent variables {x_1, x_2, ..., x_N}, and believe the true model is given by

y_i = β′x_i + ε_i

where ε is a normally distributed random variable with mean m and variance σ². The sum of squared errors is

(y − β′x)′(y − β′x) = Σ_{i=1}^N (y_i − β′x_i)²

Check that this is a quadratic programming problem. Compute the gradient and Hessian. Solve for the optimal estimator β̂. (This is just OLS, right?)
8. A consumer with utility function u(q_1, q_2, m) = (q_1^{α_1})(q_2^{α_2}) + m and budget constraint w = p_1 q_1 + p_2 q_2 + m is trying to maximize utility. Solve for the optimal bundle (q_1*, q_2*, m*) and show how q_1* varies with p_2, and how q_2* varies with p_1, both using the closed-form solutions and the IFT. How does the value function vary with α_1?
9. Suppose you are trying to maximize f(x_1, x_2, ..., x_N) subject to the non-linear constraint that g(x_1, ..., x_N) = 0. Use the Implicit Function Theorem to (i) use the constraint to define x_N(x_1, ..., x_{N−1}) and substitute this into f to get an unconstrained maximization problem in x_1, ..., x_{N−1}, and (ii) derive FONCs for x_1, ..., x_{N−1}.
Chapter 10
Equality Constrained Optimization Problems
Optimization problems often come with extra conditions that the solution must satisfy. For example, consumers can't spend more than their budget allows, and firms are constrained by their technology. Some examples are:
The canonical equality constrained maximization problem comes from consumer theory. There is a consumer choosing between bundles of x_1 and x_2. He has a utility function u(x_1, x_2), which is increasing and differentiable in both arguments. However, he only has wealth w, and the prices of x_1 and x_2 per unit are p_1 and p_2, respectively, so that his budget constraint is w = p_1 x_1 + p_2 x_2. Then his maximization problem is

max_{x_1,x_2} u(x_1, x_2)

subject to w = p_1 x_1 + p_2 x_2. Here, the objective function is non-linear in x, while the constraint is linear in x.
Consider a firm that transforms inputs z into output q through a technology F(z) = q. The cost of input z_k is w_k. The firm would like to minimize costs subject to producing a certain amount of output, q̄. Then its problem becomes

min_z w · z

subject to F(z) = q̄. This is a different problem from the consumer's, primarily because the objective is linear in z while the constraint is non-linear in z.
These are equality constrained maximization problems because the set of feasible points is described by an equation, w = p_1 x_1 + p_2 x_2 (as opposed to an inequality constraint, like w ≥ p_1 x_1 + p_2 x_2).
To provide a general theory for all the constrained maximization problems we might encounter, then, we need to write the constraints in a common form. Often, the constraints will have the form f(x) = c, where x are the choice variables and c is a constant.
For maximization problems, move all the terms to the side with the choice variables, and define a new function

g(x) = f(x) − c

Then whenever g(x) = 0, the constraints are satisfied.
For minimization problems, move all the terms to the side with the constant, and define a new function

g(x) = c − f(x)

Then whenever g(x) = 0, the constraints are satisfied.
This will ensure that the Lagrange multiplier (see below) is always positive, so that you don't have to figure out its sign later on. For example, the constraint w = p_1 x_1 + p_2 x_2 becomes 0 = p_1 x_1 + p_2 x_2 − w = g(x).
Definition 10.0.1 Let f : D → R. Suppose that the choice of x is subject to an equality constraint, such that any solution must satisfy g(x) = 0, where g : D → R. Then the equality-constrained maximization problem is

max_x f(x)  subject to  g(x) = 0

Simply differentiating f(x) with respect to x is no longer a sensible approach to finding a maximizer, since that ignores the constraints. We need a theory that incorporates constraints into the search for a maximum.
10.1 Two useful but sub-optimal approaches
There are two ways of approaching the question of constrained maximization that are instructive, but not necessarily efficient. These ideas show up in the proof of Lagrange's theorem, however, and they give a lot of intuition about how this kind of maximization problem works.
10.1.1 Using the implicit function theorem
Note that the constraint

g(x_1, x_2, ..., x_{N−1}, x_N) = 0

can be used to formulate an implicit function, x_N(x_1, x_2, ..., x_{N−1}), defined by

g(x_1, x_2, ..., x_{N−1}, x_N(x_1, x_2, ..., x_{N−1})) = 0

Then the unconstrained problem in x_1, ..., x_{N−1} can be stated as:

max_{x_1,...,x_{N−1}} f(x_1, x_2, ..., x_{N−1}, x_N(x_1, x_2, ..., x_{N−1}))

with first-order necessary conditions for k = 1, ..., N − 1,

∂f(x*)/∂x_k + (∂f(x*)/∂x_N)(∂x_N(x*)/∂x_k) = 0

Using the implicit function theorem on the constraint, we get

∂g(x*)/∂x_k + (∂g(x*)/∂x_N)(∂x_N(x*)/∂x_k) = 0

Substituting this into the FONC for k, we get

∂f(x*)/∂x_k − (∂f(x*)/∂x_N) [ (∂g(x*)/∂x_k) / (∂g(x*)/∂x_N) ] = 0

yielding the tangency conditions

(∂f(x*)/∂x_k) / (∂f(x*)/∂x_N) = (∂g(x*)/∂x_k) / (∂g(x*)/∂x_N)

which is a generalization of the familiar "marginal utility of x over marginal utility of y equals the price ratio" relationship from consumer theory.
However, developing and verifying second-order sufficient conditions using this approach appears to be quite challenging. We would need to apply the implicit function theorem to the system of FONCs, leading to a very complicated Hessian that might not obviously be negative definite.
10.1.2 A more geometric approach
Since we are using calculus, let's focus on what must be true locally for a point x* to maximize f(x) subject to g(x) = 0. In particular, instead of maximizing f(x) over R^N while being restricted to the points such that g(x) = 0, imagine that the set of points such that g(x) = 0 is the set over which we maximize f(x).
What do I mean by that? Let g(x_0) = 0. The set of feasible local variations on x_0 is the set of directions x′ for which the directional derivative of g(x) at x_0 evaluated in the direction x′ is zero:

∇g(x_0) · x′ = 0

This is the set of directions x′ for which the constraint is still satisfied if a differential step is taken in that direction: imagine standing at the point x_0 and taking a small step towards x′ so that your foot is still in the set of points satisfying g(x) = 0. Define this set as

Y(x_0) = {x′ : ∇g(x_0) · x′ = 0}

Now, a point x* is a local maximum of f(x) subject to g(x) = 0 if it is a local maximum of f(x) on the set Y(x*) (right?). This implies the following:

∇f(x*) · x′ = 0

for all x′ such that ∇g(x*) · x′ = 0.
This gives great geometric intuition: for any feasible local variation x′ on x*, it must be the case that the gradient of the objective function f(x) and the gradient of the constraint g(x) are both orthogonal to x′.
This supplies more geometric intuition and clarifies the set of points which we must consider as potential improvements on a local maximizer (the set of feasible local variations), but it doesn't seem to provide an algorithm for solving for maximizers or for testing whether they are local maximizers, minimizers, or neither.
10.2 First-Order Necessary Conditions
The preferred method of solving these problems is Lagrange maximization. To use this approach, we introduce a special function:
Definition 10.2.1 The Lagrangean is the function

L(x, λ) = f(x) − λg(x)

where λ is a real number.
The Lagrangean is designed so that if g(x) ≠ 0, the term −λg(x) acts as a penalty on the objective function f(x). You might even imagine coming up with an algorithm that works by somehow penalizing the decision-maker for violating constraints, raising the penalties to push them towards a good solution that makes f(x) large and satisfies the constraint.
I find that it is best to think of L(x, λ) as a convenient way of converting a constrained maximization problem in x subject to g(x) = 0 into an unconstrained maximization problem in terms of (x, λ). When you subtract λg(x) from the objective in the Lagrangean, you are basically imposing an extra cost on the decision-maker for violating the constraint. When you maximize with respect to λ, you are minimizing the pain of this cost, λg(x) (since maximizing the negative is the same as minimizing). So Lagrange maximization trades off between increasing the objective function, f(x), and satisfying the constraint, g(x) = 0, by introducing this fictional cost of violating it.
This is the basic idea of our new first-order necessary conditions:
Theorem 10.2.2 (Lagrange First-Order Necessary Conditions) If
1. x* is a local maximum of f(x) that satisfies the constraint g(x*) = 0,
2. the constraint gradient ∇g(x*) ≠ 0 (this is called the constraint qualification), and
3. f(x) and g(x) are both differentiable at x*,
then there exists a unique λ* such that the FONCs

∇_x L(x*, λ*) = ∇f(x*) − λ*∇g(x*) = 0
∂_λ L(x*, λ*) = −g(x*) = 0

hold.
Proof Consider L(x, λ) as an unconstrained maximization problem in (x, λ). The first-order necessary conditions are

∇_x L(x, λ) = ∇_x f(x) − λ∇g(x) = 0
∂_λ L(x, λ) = −g(x) = 0

Note that the second equation, g(x) = 0, implies that at any critical point of L(x, λ), the constraint g(x) = 0 is satisfied. By the implicit function theorem (λ* below is an implicit solution), a local solution to the system of FONCs requires

λ*∇g(x*) = ∇_x f(x*),  ∇g(x*) ≠ 0

where ∇g(x) is the matrix of partial derivatives; a multiplier solving this exists only if each row is independent so that the matrix has full rank, so none of the gradients of the constraint vectors can be scalar multiples of each other (this is where the constraint qualification comes from). If ∇g(x*) = 0, then λ*∇g(x*) = ∇_x f(x*) can have a solution only if ∇f(x*) = 0, in which case there isn't a unique multiplier λ* which solves the equation.
Since x* is a local maximum of f(x) subject to g(x) = 0, we know that f(x*) ≥ f(y) for all feasible y in a neighborhood of x*. Now, we compute the directional derivative of L(x*, λ*), considering only changes in x (we are looking for some profitable local deviation from x*). This yields

D_y L(x*, λ*) = ( ∇_x f(x*) − λ*∇g(x*) ) · y

But the term in parentheses equals zero by the first-order necessary conditions, so there is no direction y in which the value of L(x*, λ*) increases.
Therefore, if x* is a local maximum of f(x) subject to g(x) = 0 and ∇g(x*) ≠ 0, then there exists λ* such that

∇_x f(x*) = λ*∇g(x*)

and (x*, λ*) is a critical point of L(x, λ). □
Basically, the FONCs say that any local maximum of f(x) subject to g(x) = 0 is a critical point of the Lagrangean when the Lagrangean is viewed as an unconstrained maximization problem in (x, λ). As usual, these are necessary conditions and help us identify a candidate list:
Points where f(x) and g(x) are non-differentiable
Points where the constraint qualification fails
The critical points of the Lagrangean
All we know from the FONCs is that the global maximizer of f(x) subject to g(x) = 0 must be on the list, not whether any particular entry on the list is a maximum, minimum, or saddle.
Example Consider the problem
max_{x,y} xy
subject to a = bx + cy. The Lagrangian is
L(x, y, λ) = xy − λ(bx + cy − a)
with FONCs
∂L/∂x = y − λb = 0
∂L/∂y = x − λc = 0
∂L/∂λ = −(bx + cy − a) = 0
To solve the system, notice that the first two equations can be rewritten as
y = λb
x = λc
so that
y/x = b/c
Solving in terms of x yields
y = (b/c) x
We have one equation left, the constraint. Substituting the above equation into the constraint yields
a = bx + c (b/c) x = 2bx
or
x* = a/(2b)
y* = a/(2c)
and
λ* = a/(2bc)
Since f(x) and g(x) are continuously differentiable and the constraint qualification is everywhere satisfied, the candidate list consists of this single critical point. (As of yet, we cannot determine whether it is a maximum or a minimum, but it's a maximum.)
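A quick numerical sanity check of this critical point, in Python. The parameter values a = 12, b = 2, c = 3 are illustrative, not from the text:

```python
# Check the critical point x* = a/(2b), y* = a/(2c), lambda* = a/(2bc)
# for max xy subject to a = bx + cy.
a, b, c = 12.0, 2.0, 3.0
xs, ys, lam = a / (2 * b), a / (2 * c), a / (2 * b * c)

# FONCs: y - lambda*b = 0, x - lambda*c = 0, bx + cy - a = 0
assert abs(ys - lam * b) < 1e-12
assert abs(xs - lam * c) < 1e-12
assert abs(b * xs + c * ys - a) < 1e-12

# Along the constraint, y = (a - b*x)/c, so the objective is a function of x alone;
# the critical point beats nearby feasible points, consistent with a maximum.
def value(x):
    return x * (a - b * x) / c

assert all(value(xs) >= value(xs + d) for d in (-1.0, -0.1, 0.1, 1.0))
```

Substituting the constraint into the objective, as done here, is also a useful way to double-check any two-variable Lagrange solution by hand.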
Example Consider the firm cost minimization problem: A firm hires capital K and labor L at prices r and w to produce output F(K, L) = q. The firm minimizes cost subject to achieving output q. Then the firm is trying to solve
min_{K,L} rK + wL
subject to F(K, L) = q. First, we convert to a maximization problem,
max_{K,L} −rK − wL
subject to g(K, L) = −F(K, L) + q = 0, and then form the Lagrangean,
L(K, L, λ) = −rK − wL − λ(q − F(K, L))
Then the FONCs are
−r + λF_K(K*, L*) = 0
−w + λF_L(K*, L*) = 0
which characterize the cost-minimizing plan in terms of w, r, and q.
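The FONCs above imply the tangency condition F_K/F_L = r/w. A small numerical check for the production function F(K, L) = √(KL), whose cost-minimizing plan has the standard closed forms K* = q√(w/r) and L* = q√(r/w) (these closed forms are a textbook derivation, not from these notes; the parameter values are illustrative):

```python
import math

# Cost minimization for F(K, L) = sqrt(K*L) at output q:
# the FONCs -r + lam*F_K = 0, -w + lam*F_L = 0 imply F_K / F_L = r / w.
r, w, q = 2.0, 8.0, 5.0
K = q * math.sqrt(w / r)
L = q * math.sqrt(r / w)
FK = 0.5 * math.sqrt(L / K)   # marginal product of capital
FL = 0.5 * math.sqrt(K / L)   # marginal product of labor

assert abs(math.sqrt(K * L) - q) < 1e-9      # output constraint holds
assert abs(FK / FL - r / w) < 1e-9           # tangency condition holds

# No cheaper plan on a feasible grid (L chosen so each plan hits q exactly).
cost = r * K + w * L
for Ka in [K * f for f in (0.5, 0.8, 1.25, 2.0)]:
    La = q * q / Ka
    assert r * Ka + w * La >= cost - 1e-9
```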
Example Consider maximizing f(x, y) = xy subject to the constraint g(x, y) = x² + y² − 1 = 0, so we are trying to make xy as large as possible on the unit circle. Then the Lagrangean is
L(x, y, λ) = xy − λ(x² + y² − 1)
and the FONCs are
y − 2λx = 0
x − 2λy = 0
−(x² + y² − 1) = 0
Then we can use the same approach, re-arranging and dividing the first two equations. This yields
x² = y²
So the set of critical points are all the points on the circle such that √(x²) = √(y²). There are four such points,
(±√(1/2), ±√(1/2))
There are two global maxima, and two global minima. One maximum is
(√(1/2), √(1/2))
and the other is
(−√(1/2), −√(1/2))
The two global minima are
(+√(1/2), −√(1/2))
and
(−√(1/2), +√(1/2))
Right?
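Yes: evaluating xy at the four critical points and comparing against a fine grid on the circle confirms it, as a short Python check shows:

```python
import math

# The four critical points of xy on the unit circle
s = 1 / math.sqrt(2)
candidates = [(s, s), (-s, -s), (s, -s), (-s, s)]
vals = {pt: pt[0] * pt[1] for pt in candidates}

# every candidate is on the circle
assert all(abs(x * x + y * y - 1) < 1e-12 for x, y in candidates)

# maxima on the diagonal (value 1/2), minima off the diagonal (value -1/2)
assert abs(vals[(s, s)] - 0.5) < 1e-12 and abs(vals[(-s, -s)] - 0.5) < 1e-12
assert abs(vals[(s, -s)] + 0.5) < 1e-12 and abs(vals[(-s, s)] + 0.5) < 1e-12

# no point on a dense grid over the circle beats 1/2
best = max(math.cos(t) * math.sin(t)
           for t in (i * 2 * math.pi / 10000 for i in range(10000)))
assert best <= 0.5 + 1e-9
```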
10.2.1 The Geometry of Constrained Maximization
Lagrange maximization is actually very intuitive geometrically.
Let's start by recalling what indifference curves are. Suppose we fix a value, C, and set
f(x, y) = C
Then by the implicit function theorem, there is a function x(y) that solves
f(x(y), y) = C
at least locally near y for some C. Now if we totally differentiate with respect to y, we get
f_x x′(y) + f_y = 0
and
∂x(y)/∂y = − f_y(x(y), y) / f_x(x(y), y)
This derivative is called the marginal rate of substitution between x and y: It expresses how much x must be given to or taken from the agent to compensate him for a small increase in the amount of y. So if we give the agent one more apple, he is presumably better off, so we have to take away half a banana, and so on. If we graph x(y) in the plane, we see the set of bundles (x, y) which all achieve an f(x, y) = C level of satisfaction. The agent is indifferent among all these bundles, and the locus of points (y, x(y)) is an indifference curve.
Indifference Curves
Since ∇f(x, y) ≥ 0, the set of points above an indifference curve are all better than anything on the indifference curve, and we call these the upper contour sets of f(x):
UC(a) = {x : f(x) ≥ a}
while the set of points below an indifference curve are all worse than anything on the indifference curve, and we call these the lower contour sets of f(x):
LC(a) = {x : f(x) ≤ a}
Now, along any indifference curve (y, x(y)), let's compute
∇f(x) · (x′(y), 1)
This gives
f_x (−f_y/f_x) + f_y = 0
Recall that if v_1 · v_2 = 0, then v_1 and v_2 are at a right angle to each other (they are orthogonal). This implies that the gradient and the indifference curve are at right angles to one another.
Tangency of the Gradient to Indifference Curves
If ∇f(x) ≥ 0, the gradient gives the direction in which the function is increasing. Now, if we started on the interior of the constraint set and consulted the gradient, taking a step in the x direction if f_x(x, y) ≥ f_y(x, y) and taking a step in the y direction if f_y(x, y) > f_x(x, y), we would drift up to the constraint. At this point, we would be on an indifference curve that is orthogonal to the gradient and tangent to the constraint, g(x), and any more steps would violate the constraint.
Now, the FONCs are
∇f(x*) = λ∇g(x*)
g(x*) = 0
Recall that the gradient, ∇f(x), is a vector at the point x pointing in the direction in which the function has the greatest rate of change. So the equation ∇f(x*) = λ∇g(x*) implies that the constraint gradient is a scalar multiple of the gradient of the objective function:
Tangency of the Gradient to the Constraint
So at a local maximum, it must be the case that the indifference curve is tangent to the constraint, since the constraint gradient points in the same direction as the gradient of the objective, and the gradient of the objective is orthogonal to the indifference curve. This is all of the intuition: We're looking for a spot where the agent's indifference curve is tangent to the constraint, so that the agent can't find any local changes in which his payoff improves.
But if the function or constraint set has a lot of curvature, there can be multiple local solutions:
Multiple Solutions
This is the basic idea of Lagrange maximization, expressed graphically.
10.2.2 The Lagrange Multiplier
What is the interpretation of the Lagrange multiplier? Consider the problem of maximizing f(x) subject to a linear constraint w = p · x, p ≥ 0. You can think of w as an agent's wealth, p as the prices, and f(x) as their payoff from consuming a bundle x. The Lagrangean evaluated at the optimum is:
L(x*, λ*) = f(x*) − λ*(p · x* − w)
But if we consider this instead as a function of w and differentiate, we get
V′(w) = (∇f(x*(w)) − λp) ∂x*(w)/∂w  [first-order condition]  −  ∂λ*/∂w (p · x*(w) − w)  [= 0]  +  λ*(w)
So V′(w) = λ*(w), and the Lagrange multiplier gives the change in the optimized value of the objective for a small relaxation of the constraint set. Economists often call it the shadow price of w: how much the decision-maker would be willing to pay to relax the constraint slightly.
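A finite-difference check of the shadow-price result for a Cobb-Douglas example, max √(xy) subject to w = px + y. The closed forms x* = w/(2p), y* = w/2 are standard, and the parameter values are illustrative:

```python
import math

p, w, h = 4.0, 10.0, 1e-5

def V(wealth):
    """Optimized value of sqrt(x*y) given wealth, using the closed-form demands."""
    x, y = wealth / (2 * p), wealth / 2
    return math.sqrt(x * y)

# lambda* from the FONC u_x = lambda*p at the optimum
x, y = w / (2 * p), w / 2
lam = 0.5 * math.sqrt(y / x) / p

# central finite difference of the value function in w
dV = (V(w + h) - V(w - h)) / (2 * h)
assert abs(dV - lam) < 1e-6   # V'(w) = lambda*(w)
```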
10.2.3 The Constraint Qualication
The constraint qualification can be confusing because it is a technical condition that has nothing to do with maximization. To make some sense of it (and generalize things a bit), consider the following problem:
max_x f(x)
subject to m = 1, 2, ..., M equality constraints, g_1(x) = 0, g_2(x) = 0, ..., g_M(x) = 0. Let
g(x) = (g_1(x), g_2(x), ..., g_M(x))′
The new Lagrangean is given by
L(x, λ) = f(x) − Σ_{m=1}^{M} λ_m g_m(x) = f(x) − λ′ g(x)
Theorem 10.2.3 (First-Order Necessary Conditions) If
1. x* is a local maximum of f(x) that satisfies the constraints g(x) = 0
2. the matrix of constraint gradients ∇g(x*) has full rank (this is called the constraint qualification)
3. f(x) and g(x) are both differentiable at x*
then there exists a unique vector λ* such that the FONCs
∇_x L(x*, λ*) = ∇f(x*) − λ*′ ∇g(x*) = 0
∇_λ L(x*, λ*) = −g(x*) = 0
hold.
Now, the new FONCs are
∇f(x) − λ′∇g(x) = ∇f(x) − Σ_{m=1}^{M} λ_m ∇g_m(x) = 0
g(x) = 0
The term
∇g(x) =
[ ∂g_1(x)/∂x_1   ∂g_1(x)/∂x_2   ...   ∂g_1(x)/∂x_N ]
[ ∂g_2(x)/∂x_1   ∂g_2(x)/∂x_2   ...   ∂g_2(x)/∂x_N ]
[      ...             ...                 ...      ]
[ ∂g_M(x)/∂x_1   ∂g_M(x)/∂x_2   ...   ∂g_M(x)/∂x_N ]
is actually an M × N matrix. Call this matrix G, just to clean things up a bit. Now our FONCs are
∇f(x) = λ′ G
g(x) = 0
Remember that we are trying to solve for λ (that's what Lagrange's theorem guarantees), so we need G to be invertible, so that
λ′ = ∇f(x) G^{−1}
But if G fails to be invertible, we can't solve for λ*, and we can't finish the proof.
So the constraint qualification is just saying, "The matrix G is invertible," nothing more.
Example Here's an example where the constraint qualification fails with only one constraint:
max_{x,y : y³ − x² = 0} −y
The Lagrangean is
L = −y − λ(y³ − x²)
with first-order necessary conditions
−(y³ − x²) = 0
−1 − 3λy² = 0
2λx = 0
The constraint y³ − x² = 0 requires that y be (weakly) positive. Since the objective is decreasing in y and non-responsive in x, the global maximum is at y* = x* = 0. But then the first-order necessary conditions require
(0, −1, 0)′ = λ (0, 0, 0)′
and cannot be satisfied.
So we have a simple situation where the Lagrange approach fails. The reason is that the constraint has gradient
(∂g(0, 0)/∂x, ∂g(0, 0)/∂y)′ = (0, 0)′
so the constraint gradient vanishes at the optimum. Then ∇L = ∇f − λ∇g = 0 cannot be solved if ∇f is not zero but ∇g is. This is the basic idea of the constraint qualification.
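The failure is easy to see numerically: at the optimum (0, 0) the constraint gradient is the zero vector, so no multiplier can scale it into the nonzero objective gradient.

```python
# Constraint g(x, y) = y^3 - x^2; objective f(x, y) = -y.
def grad_g(x, y):
    return (-2 * x, 3 * y ** 2)

def grad_f(x, y):
    return (0.0, -1.0)

gx, gy = grad_g(0.0, 0.0)
assert (gx, gy) == (0.0, 0.0)       # constraint qualification fails at the optimum

fx, fy = grad_f(0.0, 0.0)
# lambda * (0, 0) can never equal (0, -1), for any lambda
assert all((lam * gx, lam * gy) != (fx, fy)
           for lam in (-10.0, -1.0, 0.0, 1.0, 10.0))
```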
10.3 Second-Order Sucient Conditions
Just like unconstrained problems, there are second-order conditions, but they are more complicated and require patience and practice to understand. There are two main differences from the unconstrained problem.
First, we need not consider all possible local changes in the choice variables x* to verify whether a particular critical point is a local maximum, but only those that do not violate the constraints. For example, if the function is f(x, y) = xy and the constraint is w = px + y, there are always better allocations near any critical point, but most of them violate the constraint w = px + y. The set of changes (dx, dy) that leave total expenditure unchanged satisfy p dx + dy = 0, and these are the only changes in which we are really interested.
Definition 10.3.1 Let (x*, λ*) be a critical point of the Lagrangian. The set of feasible local variations is
Y = {y ∈ R^n : ∇g(x*) y = 0}
This is the set of directions y local to x* for which all of the constraints are still satisfied, in the sense of the directional derivative, ∇g(x*) y = 0.
Second, the Hessian of L(x, λ) is not just the Hessian of f(x). Intuitively, I have suggested that we should think of equality-constrained maximization of f(x) subject to g(x) = 0 as unconstrained maximization of L(x, λ) in terms of (x, λ). This is reflected in the fact that the Hessian of L(x, λ) is the focus of the SOSCs, not the Hessian of f(x).
Theorem 10.3.2 (Second-Order Sufficient Conditions) Suppose (x*, λ*) is a critical point of the Lagrangian (i.e., the constraint qualification and the FONCs ∇f(x*) − λ*∇g(x*) = 0 and g(x*) = 0 are satisfied). Let ∇²_{xx} L(x*, λ*) be the Hessian of the Lagrangean function with respect to the choice variables only. Then
If y′ ∇²_{xx} L(x*, λ*) y < 0 for all feasible local changes y, then x* is a local maximum of f
If y′ ∇²_{xx} L(x*, λ*) y > 0 for all feasible local changes y, then x* is a local minimum of f
Since we are not solving an unconstrained maximization problem and the solution must lie on the locus of points defined by g(x) = 0, it follows that we can imagine checking the local second-order sufficient conditions only on that locus. Since we are using calculus, we restrict attention to feasible local changes, and the consequence is that for all small steps y that do not violate the constraint, the Lagrangian must have a negative definite Hessian in terms of x alone. This does not mean that the Hessian of the Lagrangian is negative definite, because the Lagrangian depends on x and λ. However, we would like a test that avoids dealing with the set of feasible local variations directly, since that set appears difficult to solve for and manipulate.
Example If
L(x, y, λ) = f(x, y) − λ(ax + by − c)
then the FONCs are
−(ax + by − c) = 0
f_x(x*, y*) − λa = 0
f_y(x*, y*) − λb = 0
Now, differentiating all the equations again in the order λ, x, and y gives the matrix (writing the border with positive signs, which leaves every determinant below unchanged)
H(x*, y*, λ*) =
[ 0    a             b            ]
[ a    f_xx(x*, y*)  f_yx(x*, y*) ]
[ b    f_xy(x*, y*)  f_yy(x*, y*) ]
The above Hessian is called bordered, since the first column and first row are transposes of each other, while the lower right-hand corner is the Hessian of the objective function.
For a single constraint g(x), the Hessian of the Lagrangean looks like this:
H(x*, λ*) =
[ 0            ∇_x g(x*)′                           ]
[ ∇_x g(x*)    ∇²_{xx} f(x*) − λ* ∇²_{xx} g(x*)     ]   (10.1)
This is called the bordered Hessian.
Suppose you are maximizing f(x) subject to a single constraint, g(x) = 0 (you can find the generalization to any number of constraints in any math econ textbook). Then the determinant of the upper left-hand corner is trivially zero. Writing g_i = g_{x_i}(x*) and L_ij = f_{x_i x_j}(x*) − λ g_{x_i x_j}(x*), the non-trivial leading principal minors of the bordered Hessian for the case with a single constraint are
H_3 =
[ 0    g_1   g_2 ]
[ g_1  L_11  L_21 ]
[ g_2  L_12  L_22 ]
H_4 =
[ 0    g_1   g_2   g_3 ]
[ g_1  L_11  L_21  L_31 ]
[ g_2  L_12  L_22  L_32 ]
[ g_3  L_13  L_23  L_33 ]
and in general
H_{k+1} =
[ 0    g_1   ...   g_k ]
[ g_1  L_11  ...   L_k1 ]
[ ...  ...         ...  ]
[ g_k  L_1k  ...   L_kk ]
So H_{k+1} is the upper left-hand (k+1) × (k+1) principal minor of the full bordered Hessian. The +1 comes from the fact that we have an extra leading row and column that correspond to the constraint.
An example makes this a bit clearer:
Example Let
f(x_1, x_2, x_3) = x_1 x_2 + x_2 x_3 + x_1 x_3
subject to
x_1 + x_2 + x_3 = 3
Then the Lagrangean is
L = x_1 x_2 + x_2 x_3 + x_1 x_3 − λ(x_1 + x_2 + x_3 − 3)
This generates a system of first-order conditions:
[ ∂L/∂λ   ]   [ −(x_1 + x_2 + x_3 − 3) ]
[ ∂L/∂x_1 ] = [ x_2 + x_3 − λ          ]
[ ∂L/∂x_2 ]   [ x_1 + x_3 − λ          ]
[ ∂L/∂x_3 ]   [ x_2 + x_1 − λ          ]
The only critical point is x* = (1, 1, 1), with λ* = 2. The bordered Hessian at the critical point is
H(x*, λ*) =
[ 0 1 1 1 ]
[ 1 0 1 1 ]
[ 1 1 0 1 ]
[ 1 1 1 0 ]
and the leading principal minors are
H_3 =
[ 0 1 1 ]
[ 1 0 1 ]
[ 1 1 0 ]
and
H_4 =
[ 0 1 1 1 ]
[ 1 0 1 1 ]
[ 1 1 0 1 ]
[ 1 1 1 0 ]
It turns out that we can develop a test based on the bordered Hessian of the whole Lagrangian, including derivatives with respect to the multiplier, rather than focusing on the Hessian of the Lagrangian restricted only to the choice variables.
Theorem 10.3.3 (Alternating Sign Test) A critical point (x*, λ*) is a local maximum of f(x) subject to the constraint g(x) = 0 if the determinants of the leading principal minors of the bordered Hessian alternate in sign, starting with
det H_3 > 0
A critical point (x*, λ*) is a local minimum of f(x) subject to the constraint g(x) = 0 if the determinants of the leading principal minors of the bordered Hessian are all negative, starting with
det H_3 < 0
Returning to the example,
Example We need to decide whether the bordered Hessian is negative definite or not. The non-trivial leading principal minors of the bordered Hessian have determinants
det(H_3) = 2 > 0, det(H_4) = −3 < 0
so the signs alternate starting with det H_3 > 0, and the second-order sufficient condition for a local maximum is satisfied.
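These determinants can be checked mechanically with NumPy, which is a useful habit once the minors get large:

```python
import numpy as np

# Bordered Hessian for max x1*x2 + x2*x3 + x1*x3 s.t. x1 + x2 + x3 = 3
H = np.array([[0., 1., 1., 1.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [1., 1., 1., 0.]])

H3 = np.linalg.det(H[:3, :3])   # leading 3x3 principal minor
H4 = np.linalg.det(H)           # full 4x4 determinant

assert abs(H3 - 2.0) < 1e-9     # det H_3 = 2 > 0
assert abs(H4 + 3.0) < 1e-9     # det H_4 = -3 < 0: alternating signs, local max
```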
Example Let
f(x_1, x_2) = f_1(x_1) + f_2(x_2)
where f_1(x_1) and f_2(x_2) are increasing, f_1″(x), f_2″(x) < 0, and f_i(0) = 0. There is a constraint that C = x_1 + x_2. Then the Lagrangean is
L = f_1(x_1) + f_2(x_2) − λ(x_1 + x_2 − C)
This generates a system of first-order conditions
[ ∂L/∂λ   ]   [ −(x_1 + x_2 − C) ]
[ ∂L/∂x_1 ] = [ f_1′(x_1) − λ    ]
[ ∂L/∂x_2 ]   [ f_2′(x_2) − λ    ]
with bordered Hessian
H(x, λ) =
[ 0  1          1         ]
[ 1  f_1″(x_1)  0         ]
[ 1  0          f_2″(x_2) ]
Then the bordered Hessian's non-trivial leading principal minor starting from H_3 has determinant
det(H_3) = −f_1″(x_1) − f_2″(x_2) > 0
So the alternating sign test for a local maximum is satisfied, and an interior solution to the first-order conditions is a solution to the maximization problem.
Example Suppose a consumer is trying to maximize utility, with utility function u(x, y) = x^α y^β and budget constraint w = px + y. Then we can write the Lagrangian as (why?)
L = α log(x) + β log(y) − λ(px + y − w)
We get a system of first-order conditions
[ ∂L/∂λ ]   [ w − px − y ]
[ ∂L/∂x ] = [ α/x − λp   ]
[ ∂L/∂y ]   [ β/y − λ    ]
The Hessian of the Lagrangean is
[ 0  p          1         ]
[ p  −α/(x*)²   0         ]
[ 1  0          −β/(y*)²  ]
The determinant of the non-trivial leading principal minor of the bordered Hessian is
det
[ 0  p          1         ]
[ p  −α/(x*)²   0         ]
[ 1  0          −β/(y*)²  ]
= p² β/(y*)² + α/(x*)² > 0
So the alternating sign test is satisfied.
This works because adding the border imposes the restriction from the constraint g(x) = 0 on the test of whether the objective is negative definite or not. Consequently, when we use the alternating sign test, we are really asking, "Is the Hessian of the objective negative definite on Y?" This requires more matrix algebra to show, but you can find the details in MWG or Debreu's papers, for example.
10.4 Comparative Statics
Our system of first-order necessary conditions
∇_x f(x*) − λ* ∇_x g(x*) = 0
g(x*) = 0
is a non-linear system of equations with endogenous variables (x*, λ*), just like any other we have applied the IFT to. The only new part is that you have to keep in mind that λ* is an endogenous variable, so that it changes when we vary any exogenous variables. Second, the sign of the determinant of the Hessian is determined by whether we are looking at a maximum or minimum, and the number of equations.
For simplicity, consider a constrained maximization problem
max_{x,y} f(x, y, t)
subject to a single equality constraint ax + by − s = 0, where t and s are exogenous variables. This is simpler than assuming that g(x, s) = 0, since the second-order derivatives of the constraint are all zero. However, there are many economic problems of interest where the constraints are linear, and a general g(x, s) can be very complicated to work with. Then the Lagrangean is
L(x, y, λ) = f(x, y, t) − λ(ax + by − s)
and the FONCs are
−(ax* + by* − s) = 0
∇f(x*, y*, t) − λ* ∇g(x*, y*, s) = 0
Since we have two parameters (t shifts the objective function, s shifts the constraint) we can look at two different comparative statics.
Example Let's start with t: Totally differentiate the FONCs to get three equations
−(a ∂x*/∂t + b ∂y*/∂t) = 0
f_tx + f_xx ∂x*/∂t + f_xy ∂y*/∂t − a ∂λ*/∂t = 0
f_ty + f_yx ∂x*/∂t + f_yy ∂y*/∂t − b ∂λ*/∂t = 0
If we write this as a matrix equation,
[ 0   −a    −b   ] [ ∂λ*/∂t ]   [ 0     ]
[ −a  f_xx  f_xy ] [ ∂x*/∂t ] = [ −f_tx ]
[ −b  f_yx  f_yy ] [ ∂y*/∂t ]   [ −f_ty ]
which is just a regular Ax = b-type matrix equation. On the left-hand side, the bordered Hessian appears (flipping the signs of its first row and first column leaves the determinant unchanged, so its determinant is det H_3); from the alternating sign test and the fact that x* is a local maximum, we can determine the sign of its determinant. To some extent, we are finished, since all that remains is to solve the system. You can solve the system by hand by solving for each comparative static and reducing the number of equations, but that is a lot of work. It is simpler to use Cramer's rule. To see how x* varies with t,
∂x*/∂t = det
[ 0   0      −b   ]
[ −a  −f_tx  f_xy ]
[ −b  −f_ty  f_yy ]
/ det H_3
which is
∂x*/∂t = b( b f_tx(x*, y*, t) − a f_ty(x*, y*, t) ) / det H_3
Since (x*, y*) is a local maximum, det H_3 > 0, and the sign of the comparative static is just the sign of the numerator.
Having done t, let's do s:
Example Totally differentiate the system of FONCs with respect to s to get
−(a ∂x*/∂s + b ∂y*/∂s − 1) = 0
f_xx ∂x*/∂s + f_xy ∂y*/∂s − a ∂λ*/∂s = 0
f_yx ∂x*/∂s + f_yy ∂y*/∂s − b ∂λ*/∂s = 0
Writing this as a matrix equation yields
[ 0   −a    −b   ] [ ∂λ*/∂s ]   [ −1 ]
[ −a  f_xx  f_xy ] [ ∂x*/∂s ] = [ 0  ]
[ −b  f_yx  f_yy ] [ ∂y*/∂s ]   [ 0  ]
Let's focus on ∂y*/∂s. Using Cramer's rule,
∂y*/∂s = det
[ 0   −a    −1 ]
[ −a  f_xx  0  ]
[ −b  f_yx  0  ]
/ det H_3
so
∂y*/∂s = ( a f_yx − b f_xx ) / det H_3
An example with more economic content is:
Example Consider the consumer problem
L = α log(x) + β log(y) − λ(px + y − w)
We get a system of first-order conditions
[ ∂L/∂λ ]   [ w − px − y ]
[ ∂L/∂x ] = [ α/x − λp   ]
[ ∂L/∂y ]   [ β/y − λ    ]
This is a system of non-linear equations, so there's no problem differentiating it with respect to p, for instance, and using all the same steps as for the unconstrained firm. You need to remember, however, that λ*(w, p, α, β) is now a function of the parameters as well, so it is not a constant that drops out. Totally differentiating with respect to p gives
0 = −x − p ∂x/∂p − ∂y/∂p
0 = −(α/x²) ∂x/∂p − λ − p ∂λ/∂p
0 = −(β/y²) ∂y/∂p − ∂λ/∂p
Solving the system the long way for ∂x/∂p gives
∂x/∂p = − ( λ + (β/y²) p x ) / ( α/x² + (β/y²) p² ) < 0
Since the Lagrange multiplier λ* is positive, if p goes up, the consumer reduces his consumption of x*.
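This comparative static can be verified against the closed-form Cobb-Douglas demands. The closed forms x* = αw/((α+β)p), y* = βw/(α+β), λ* = (α+β)/w are standard textbook results; the parameter values below are illustrative:

```python
# Check dx*/dp for u = a*log(x) + b*log(y), budget w = p*x + y.
a, b, p, w = 2.0, 3.0, 1.5, 10.0
x = a * w / ((a + b) * p)
y = b * w / (a + b)
lam = (a + b) / w

# Formula from solving the totally differentiated FONC system
dxdp = -(lam + (b / y**2) * p * x) / (a / x**2 + (b / y**2) * p**2)

# Derivative of the closed-form demand x*(p) = a*w / ((a+b)*p)
dxdp_closed = -a * w / ((a + b) * p**2)

assert abs(dxdp - dxdp_closed) < 1e-9
assert dxdp < 0   # demand falls when its price rises
```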
10.5 The Envelope Theorem
The last tool we need to generalize is the envelope theorem. Since a local maximum of the Lagrangian satisfies
L(x*, λ*) = f(x*) − λ* g(x*) = f(x*)
we can use it as our value function.
As a start, suppose we are studying the consumer's problem with utility function u(x, y) and budget constraint w = px + y. Then the Lagrangian at any critical point is
L(x*, y*, λ*) = u(x*, y*) − λ*(px* + y* − w)
Let's consider it as a value function in terms of w,
V(w) = u(x*(w), y*(w)) − λ*(w)(p x*(w) + y*(w) − w)
Then
V′(w) = [u_x x*′(w) + u_y y*′(w)]  (I)
        − [λ*(w) p x*′(w) + λ*(w) y*′(w)]  (II)
        − [λ*′(w)(p x*(w) + y*(w) − w)]  (III)
        + λ*(w)  (IV)
There are a number of terms that each represent a different consequence of giving the agent more wealth. First, the agent re-optimizes and the value of the objective function changes (I). Second, this has an impact on the constraint, and its implicit cost changes (II). Third, since the constraint has been relaxed, the Lagrange multiplier changes, again changing the implicit cost (III). Fourth, there is the direct effect on the objective of increasing w (IV).
Re-arranging, however, gives
V′(w) = [(u_x − λ*(w) p) x*′(w) + (u_y − λ*(w)) y*′(w)]  (FONCs)
        − [λ*′(w)(p x*(w) + y*(w) − w)]  (Constraint)
        + λ*(w)
so that the FONC terms are zero, and since the constraint is satisfied, that term also equals zero, leaving just the direct effect,
V′(w) = λ*(w)
or
V′(w) = ∂L(x*(w), y*(w), λ*(w), w)/∂w = dL(x*(w), y*(w), λ*(w), w)/dw
So as before, we can simply take the partial derivative of the Lagrangean to see how an agent's payoff changes with respect to a parameter, rather than working out all the comparative statics through the implicit function theorem.
Theorem 10.5.1 (Envelope Theorem) Consider the constrained maximization problem
max_x f(x, t)
subject to g(x, s) = 0. Let x*(t, s), λ*(t, s) be a critical point of the Lagrangean. Define
V(t, s) = L(x*(t, s), λ*(t, s), t, s) = f(x*(t, s), t) − λ*(t, s) g(x*(t, s), s)
Then
∂V(t, s)/∂t = ∂f(x*(t, s), t)/∂t
and
∂V(t, s)/∂s = −λ*(t, s) ∂g(x*(t, s), s)/∂s
In words, the envelope theorem implies that the change in the agent's payoff with respect to an exogenous variable is the partial derivative of the non-optimized Lagrangian, evaluated at the optimal decision x*(t, s).
Example Suppose a firm has production function F(K, L, t) where K is capital, L is labor, and t is technology. The price of the firm's good is p. Then we can write the profit maximization problem as a constrained maximization problem as
π(t) = max_{q,K,L} pq − rK − wL
subject to q = F(K, L, t). The Lagrangean then is
L(q, K, L, λ_p) = pq − rK − wL − λ_p (q − F(K, L, t))
And without any actual work,
π′(t) = λ*_p F_t(K*, L*, t)
Similarly, the cost minimization problem is
C(q, t) = min_{K,L} rK + wL
subject to F(K, L, t) ≥ q. Converting to a maximization of −rK − wL, the Lagrangean is
L(K, L, λ_c) = −rK − wL + λ_c (F(K, L, t) − q)
And
∂C(q, t)/∂t = −λ*_c F_t(K*, L*, t)
So be careful: The envelope theorem takes the derivative of the Lagrangean, then evaluates it at the optimal choice to obtain the derivative of the value function.
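The theorem can be checked numerically on a toy problem where everything has a closed form. Take f(x, t) = −(x − t)² subject to g(x, s) = x − s = 0, so x*(s) = s, V(t, s) = −(s − t)², and the FONC f_x − λ g_x = 0 gives λ* = −2(s − t). This is an invented example for illustration, not one from the text:

```python
t, s, h = 1.0, 3.0, 1e-6

def V(t, s):
    """Value function: the constraint forces x = s, so V = f(s, t)."""
    return -(s - t) ** 2

lam = -2 * (s - t)   # multiplier from the FONC -2(x - t) - lam = 0 at x = s

# dV/dt should equal df/dt at x*: 2(x* - t)
dVdt = (V(t + h, s) - V(t - h, s)) / (2 * h)
assert abs(dVdt - 2 * (s - t)) < 1e-4

# dV/ds should equal -lam * dg/ds = -lam * (-1) = lam
dVds = (V(t, s + h) - V(t, s - h)) / (2 * h)
assert abs(dVds - lam) < 1e-4
```

Both partials come out of the Lagrangean directly, with no comparative statics of x*(t, s) required, which is exactly the point of the theorem.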
Exercises
1. Suppose we take a strictly increasing transformation of the objective function and leave the
constraints unchanged. Is a solution of the transformed problem a solution of the original problem?
Suppose we have constraints g(x) = c and take a strictly increasing transformation of both sides.
Is a solution of the transformed problem a solution of the original problem?
2. Consider the maximization problem
max_{x_1, x_2} x_1 + x_2
subject to
x_1² + x_2² = 2
Sketch the constraint set and contour lines of the objective function. Find all critical points of the Lagrangian. Verify whether each critical point is a local maximum or a local minimum. Find the global maximum.
3. Consider the maximization problem
max_{x,y} xy
3. Consider the maximization problem
max
x,y
xy
117
subject to
a
2
x
2
+
b
2
y
2
= r
Sketch the constraint set and contour lines of the objective function. Find all critical points of the
Lagrangian. Verify whether each critical point is a local maximum or a local minimum. Find the
global maximum. How does the value of the objective vary with r? a? How does x

respond to a
change in r; show this using the closed-form solutions and the IFT.
4. Solve the problem
max_{x,y,z} 2x − 2y + z
subject to x² + y² + z² = 9.
5. Solve the following two-dimensional maximization problems subject to the linear constraint w = p_1 x_1 + p_2 x_2, p_1, p_2 > 0. Then compute the change in x_1 with respect to p_2, and the change in x_1 with respect to w. Assume α_1 > α_2 > 0. For i and ii, when do the SOSCs hold?
i. Cobb-Douglas
f(x) = x_1^{α_1} x_2^{α_2}
ii. Stone-Geary
f(x) = (x_1 − γ_1)^{α_1} (x_2 − γ_2)^{α_2}
iii. Constant Elasticity of Substitution
f(x) = (α_1 x_1^ρ + α_2 x_2^ρ)^{1/ρ}
iv. Leontief
f(x) = min{α_1 x_1, α_2 x_2}
6. An agent receives income w_1 in period one and no income in the second period. He consumes c_1 in the first period and c_2 in the second. He can save in the first period, s, and receives back Rs in the second period. His utility function is
u(c_1, c_2) = log(c_1) + β log(c_2)
where R > 1 and 0 < β < 1. (i) Write this down as a constrained maximization problem. (ii) Find first-order necessary conditions for optimality and check the SOSCs. What does the Lagrange multiplier represent in this problem? (iii) Compute the change in c_2 with respect to R and β. (iv) How does the agent's payoff change if R goes up? If β goes up?
7. For utility function
u(x_1, x_2) = (x_1 − γ_1) x_2^α
and budget constraint w = p_1 x_1 + p_2 x_2, compute the utility-maximizing bundle. You should get closed-form solutions. Now compute ∂x_1*/∂p_2 using the closed-form solution as well as the implicit function theorem from the system of FONCs. Repeat with ∂x_2*/∂p_1. How does the agent's payoff vary in γ_1? In w? In p_2?
8. Generalize and prove the envelope theorem in the case where f(x, t) and g(x, t) both depend on the same parameter, t. Construct an economically relevant problem in which this occurs, and use the implicit function theorem to show how x*(t) varies in t when both the constraint and objective are varying at the same time.
9. Suppose you are trying to maximize f(x). Show that if the constraint is linear, so that p · x = w, you can always rewrite the constraint as
x_N = ( w − Σ_{i=1}^{N−1} p_i x_i ) / p_N
and substitute this into the objective, turning the problem into an unconstrained maximization problem. Why do we need Lagrange maximization then?
10. Suppose you have a maximization problem
max_x f(x)
subject to g(x, t) = 0. Show how to use the implicit function theorem to derive comparative statics with respect to t. Explain briefly what the bordered Hessian looks like, and how it differs from the examples in the text.
Chapter 11
Inequality-Constrained Maximization
Problems
Many problems, particularly in macroeconomics, don't involve strict equality constraints, but instead involve inequality constraints. For example, a household might be making a savings/consumption decision, subject to the constraint that savings can never become negative; this is called a borrowing constraint. As a result, many households will not have to worry about this, as long as they maintain positive savings in all periods. A simple version is:
Example Consider a household with utility function
u(c_1) + u(c_2)
The household gets income y_1 in period one and y_2 in period two, with y_2 > y_1. The household would potentially like to borrow money today to smooth consumption, but faces a financial constraint that it can borrow an amount B only up to R < y_2. Then we have the constraints
c_1 = y_1 + B
c_2 = y_2 − B
B ≤ R
Substituting the first two constraints into the objective yields
max_B u(y_1 + B) + u(y_2 − B)
subject to B ≤ R.
Now, sometimes the constraint will be irrelevant:
u′(y_1 + B*) = u′(y_2 − B*)
with B* < R, and the constraint is slack, or inactive. Other times, however, the B* that solves this equation exceeds R, and the constraint becomes binding or active. Then B* = R. So we have two potential solutions, and the one selected depends on the parameters of the problem.
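For log utility the two cases collapse into a single piecewise rule, which a few lines of Python make concrete. The unconstrained FONC 1/(y_1 + B) = 1/(y_2 − B) gives B = (y_2 − y_1)/2, and the cap binds whenever that exceeds R (the numbers are illustrative):

```python
import math

def optimal_B(y1, y2, R):
    """Best borrowing for max log(y1+B) + log(y2-B) subject to B <= R."""
    return min((y2 - y1) / 2, R)

assert optimal_B(1.0, 5.0, 10.0) == 2.0   # slack: B* = (y2-y1)/2 = 2 < R
assert optimal_B(1.0, 5.0, 1.5) == 1.5    # binding: B* = R

# The capped choice is still the best feasible one on a grid.
y1, y2, R = 1.0, 5.0, 1.5
u = lambda B: math.log(y1 + B) + math.log(y2 - B)
grid = [R * i / 1000 for i in range(1001)]
assert all(u(optimal_B(y1, y2, R)) >= u(B) for B in grid)
```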
The simple borrowing problem illustrates the general issues: Having inequality constraints creates the possibility that for different values of the parameters, different collections of the constraints are binding. The maximization theory developed here is designed to deal with these shifting sets of constraints.
Definition 11.0.2 Let f(x) be the objective function. Suppose that the choice of x is subject to m = 1, 2, ..., M equality constraints, g_m(x) = 0, and k = 1, 2, ..., K inequality constraints, h_k(x) ≤ 0. Then the inequality-constrained maximization problem is
max_x f(x)
such that for m = 1, 2, ..., M and k = 1, 2, ..., K,
g_m(x) = 0
h_k(x) ≤ 0
Note that we can brute-force solve this problem as follows: Since there are K inequality constraints, there are 2^K ways to pick a subset of the inequality constraints. For each subset, we can make the chosen inequality constraints into equality constraints, and solve the resulting equality-constrained maximization problem (which we already know how to do). Each of these sub-problems might generate no candidates, or many, and some of the candidates may conflict with some of the constraints we are ignoring. Once we compile a list of all the solutions that are actually feasible, the global maximum must be on the list somewhere. Then we can simply compare the payoffs from our 2^K sets of candidate solutions, and pick the best.
Example Suppose an agent is trying to maximize f(x, y) = ax + by, a, b > 0, subject to the constraints x ≥ 0, y ≥ 0, and c ≥ px + qy.
Since ∇f = (a, b) > 0, it is immediate that the constraint c ≥ px + qy will bind with an equality. For if c > px + qy, we can always increase x or y a little bit, thus improving the objective function's value as well as not violating the constraint. This is feasible and better than any proposed solution with c > px + qy, so we must have c = px + qy.
Then we have three cases remaining:
1. The constraint x ≥ 0 binds and x = 0, but y > 0
2. The constraint y ≥ 0 binds and y = 0, but x > 0
3. The constraints x, y ≥ 0 are both slack, and x, y > 0
We then solve case-by-case:
1. If x = 0, then the constraint implies y = c/q, giving a value of bc/q
2. If y = 0, then the constraint implies x = c/p, giving a value of ac/p
3. If x, y > 0, then we solve the constrained problem using a Lagrangean,
L(x, y, λ) = ax + by − λ(px + qy − c)
with FONCs
a − λp = 0
b − λq = 0
and
−(px + qy − c) = 0
This has a solution only if a/b = p/q, in which case any x and y that satisfy the constraint are a solution, since the objective is proportional to (a/b)x + y and the constraint is equivalent to (p/q)x + y = c/q, from which you can easily see that the indifference curves and constraint set are exactly parallel, so that any px* + qy* = c with x*, y* > 0 is a solution.
So we have two candidate solutions with either x or y equal to zero, and a continuum of solutions when the objective and constraint are parallel.
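The brute-force case enumeration described above is easy to mechanize for this linear example. The code assumes a/p ≠ b/q, so the optimum is one of the corners where the budget binds (the parameter values are illustrative):

```python
# Brute force for max a*x + b*y s.t. x >= 0, y >= 0, p*x + q*y <= c.
a, b, p, q, c = 3.0, 2.0, 1.0, 1.0, 10.0

# Candidate solutions from each binding pattern of the constraints:
# both non-negativity constraints bind; y binds; x binds.
candidates = [(0.0, 0.0), (c / p, 0.0), (0.0, c / q)]

def feasible(x, y):
    return x >= 0 and y >= 0 and p * x + q * y <= c + 1e-12

best = max((a * x + b * y, (x, y)) for x, y in candidates if feasible(x, y))
assert best[0] == a * c / p        # here a/p > b/q, so spend everything on x
assert best[1] == (c / p, 0.0)
```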
As happened in the previous problem, we differentiate between two kinds of solutions:
Definition 11.0.3 A point x* that is a local maximizer of f(x) subject to g_m(x) = 0 for all m and h_k(x) ≤ 0 for all k is a corner solution if h_k(x*) = 0 for some k, and an interior solution if h_k(x*) < 0 for all k.
A corner solution is one like x* = c/p and y* = 0, while an interior solution corresponds to the third case from the previous example, where only the constraint c = px + qy binds.
This example suggests we don't really need a new maximization theory, but either a lot of paper or a powerful computer. However, just as Lagrange multipliers can be theoretically useful, the multipliers we'll attach to the inequality constraints can be theoretically useful. In fact, the framework we'll develop is mostly just a systematic way of doing the brute-force approach described in the previous paragraph.
11.1 First-Order Necessary Conditions
For simplicity, we'll work with one equality constraint g(x) = 0 and k = 1, ..., K inequality constraints, h_k(x) ≤ 0.
As before, we form the Lagrangean, putting a Lagrange multiplier λ on the equality constraint, and Kuhn-Tucker multipliers μ = (μ_1, μ_2, ..., μ_K) on the inequality constraints:
L(x, λ, μ) = f(x) − λ g(x) − Σ_k μ_k h_k(x) = f(x) − λ g(x) − μ′ h(x)
But now we have a dilemma, because some constraints may be binding at the solution while others are not. Moreover, there may be many solutions that correspond to different binding constraint sets. How are we supposed to know which bind and which don't? We proceed by making the set of FONCs larger to include complementary slackness conditions.
Theorem 11.1.1 (Kuhn-Tucker First-Order Necessary Conditions) Suppose that
1. x* is a local maximum of f(x) subject to the constraints g(x) = 0 and h_k(x) ≤ 0 for all k.
2. Let A(x*) = {1, 2, ..., ℓ} be the set of inequality constraints that are binding at x*, so that h_j(x*) = 0 for all j ∈ A(x*) and h_j(x*) < 0 for all j ∉ A(x*). Suppose that the matrix formed by the gradient of the equality constraint, ∇g(x*), and the gradients of the binding inequality constraints in A(x*), ∇h_1(x*), ∇h_2(x*), ..., ∇h_ℓ(x*), is non-singular (this is the constraint qualification).
3. The objective function, equality constraint, and inequality constraints are all differentiable at x*.
Then there exist unique vectors λ* and μ* of multipliers so that the FONCs
∇f(x*) − λ* ∇g(x*) − Σ_j μ*_j ∇h_j(x*) = 0
g(x*) = 0
and the complementary slackness conditions
μ*_j h_j(x*) = 0
for all j = 1, ..., K
are satisfied.
So we always have the issue of the constraint qualification lurking in the background of a constrained maximization problem; set that aside. The rest of the theorem says that if $x^*$ is a local maximum subject to the equality constraint and inequality constraints, some constraints will be active/binding and some will be inactive/slack. If we treat the active/binding constraints as regular equality constraints, then $\mu_j^*$ can be non-zero for each binding constraint, but each slack constraint will satisfy $h_j(x^*) < 0$, so $\mu_j^* = 0$; these are the complementary slackness conditions. (Note that it is not obvious right now that $\mu_j^* \ge 0$, but this is shown later.)
Example Suppose a consumer's utility function is $u(x, y)$, and he faces constraints $x \ge 0$, $y \ge 0$ and $w = px + y$. Suppose that $\nabla u(x, y) \ge 0$, so that the consumer's utility is weakly increasing in the amount of each good.

Then we can form the Lagrangian (writing the non-negativity constraints as $-x \le 0$ and $-y \le 0$):
$$L(x, y, \lambda, \mu_x, \mu_y) = u(x, y) + \mu_x x + \mu_y y - \lambda(px + y - w)$$
The FONCs are
$$u_x(x^*, y^*) + \mu_x - \lambda p = 0$$
$$u_y(x^*, y^*) + \mu_y - \lambda = 0$$
$$-(px^* + y^* - w) = 0$$
and the complementary slackness conditions are
$$\mu_x x^* = 0$$
$$\mu_y y^* = 0$$
Now we hypothetically have $2^2 = 4$ cases:

1. $\mu_x, \mu_y \ne 0$ (which will imply $x^* = 0$, $y^* = 0$ by the complementary slackness conditions)

2. $\mu_x = 0$, $\mu_y \ne 0$ (which will imply $x^* \ge 0$, $y^* = 0$ by the complementary slackness conditions)

3. $\mu_x \ne 0$, $\mu_y = 0$ (which will imply $x^* = 0$, $y^* \ge 0$ by the complementary slackness conditions)

4. $\mu_x = \mu_y = 0$ (which will imply $x^*, y^* \ge 0$ by the complementary slackness conditions)
We now solve the FONCs case-by-case:

1. In the first case, $\mu_x, \mu_y \ne 0$, we look at the corresponding complementary slackness conditions
$$\mu_x x^* = 0, \qquad \mu_y y^* = 0$$
For these to hold, it must be the case that $x^* = 0$ and $y^* = 0$. Substituting these into the FONCs yields
$$u_x(0, 0) + \mu_x - \lambda p = 0$$
$$u_y(0, 0) + \mu_y - \lambda = 0$$
$$-(0 + 0 - w) = 0$$
which yields a contradiction, since $w \ne 0$. This case provides no candidate solutions.
2. In the second case, $\mu_x = 0$, $\mu_y \ne 0$, we look at the complementary slackness conditions
$$\mu_x x^* = 0, \qquad \mu_y y^* = 0$$
For these to hold, $y^*$ must be zero. Substituting this into the FONCs yields
$$u_x(x^*, 0) - \lambda p = 0$$
$$u_y(x^*, 0) + \mu_y - \lambda = 0$$
$$-(px^* - w) = 0$$
from which we get $x^* = w/p$ from the final equation. This case gives a candidate solution $x^* = w/p$ and $y^* = 0$.
3. In the third case, $\mu_x \ne 0$, $\mu_y = 0$, we look at the complementary slackness conditions
$$\mu_x x^* = 0, \qquad \mu_y y^* = 0$$
For these to hold, we must have $x^* = 0$. Substituting this into the FONCs yields
$$u_x(0, y^*) + \mu_x - \lambda p = 0$$
$$u_y(0, y^*) - \lambda = 0$$
$$-(y^* - w) = 0$$
from which we get $y^* = w$ from the final equation. This case gives a candidate solution $x^* = 0$ and $y^* = w$.
4. In the last case, $\mu_x = \mu_y = 0$, the complementary slackness conditions are satisfied automatically. Substituting this into the FONCs yields
$$u_x(x^*, y^*) - \lambda p = 0$$
$$u_y(x^*, y^*) - \lambda = 0$$
$$-(px^* + y^* - w) = 0$$
which gives an interior solution $x^*, y^* > 0$.

So the KT conditions produce three candidate solutions: all x and no y, all y and no x, and some of each. Which solution is selected will depend on the shape of $u(x, y)$, in particular on whether its partial derivatives are strictly or only weakly positive in each good.
The complementary slackness conditions work as follows. Suppose we have an inequality constraint $x \ge 0$. Then the complementary slackness condition is that $\mu_x x^* = 0$, where $\mu_x$ is the Kuhn-Tucker multiplier associated with the constraint $x \ge 0$. If $\mu_x \ne 0$, it must be the case that $x^* = 0$ for the complementary slackness condition to hold; if $\mu_x = 0$, then $x^*$ can take any value. This is how KT theory systematically works through all the possibilities of binding constraints: if the KT multiplier is non-zero, the constraint must be binding, while if the KT multiplier is zero, it is slack.
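The case-by-case logic above is mechanical enough to automate. The sketch below is my own illustration (the specific utility $u(x, y) = x + 2y$, wealth $w = 1$, and price $p = 1$ are assumptions, not from the text): it evaluates the two corner candidates produced by cases 2 and 3 and picks the better one. For linear utility no interior critical point exists, so case 4 produces no candidate.

```python
# Brute-force comparison of the KT corner candidates for a linear utility.
# Assumed specifics (not from the text): u(x, y) = x + 2y, w = 1, p = 1.

def u(x, y):
    return x + 2 * y

w, p = 1.0, 1.0

# Case 2 candidate: x* = w/p, y* = 0.  Case 3 candidate: x* = 0, y* = w.
candidates = [(w / p, 0.0), (0.0, w)]

best = max(candidates, key=lambda xy: u(*xy))
print(best)  # (0.0, 1.0): good y gives more utility per dollar at these prices
```

With strictly increasing utility and a linear budget, the KT machinery reduces to exactly this comparison of corner candidates.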
Example Consider maximizing $f(x, y) = x^\alpha y^\beta$ subject to the inequality constraints
$$y + 2x \le 6$$
$$x + 2y \le 6$$
$$x \ge 0$$
$$y \ge 0$$
Then there are four inequality constraints. The set of points satisfying all four is not empty (graph it to check). However, the first two constraints only bind at the same time when $x = y = 2$; otherwise either the first is binding or the second, but not both. This means that the Lagrangean alone is not the right tool, because you could only satisfy both equations as equality constraints at the point $x = y = 2$, which may not be a solution. You could set up a series of problems where you treat the constraints as binding or non-binding, and solve the $2^4$ resulting maximization problems that come from considering each subset of constraints separately, but this is exactly what Kuhn-Tucker does for you, in a systematic way. Maximizing $\log f$ instead of $f$ (a strictly increasing transformation, so the maximizers are unchanged), the Lagrangean is
$$L = \alpha \log(x) + \beta \log(y) - \mu_1(y + 2x - 6) - \mu_2(x + 2y - 6) + \mu_3 x + \mu_4 y$$
The first-order necessary conditions are
$$\begin{pmatrix} \partial L/\partial x \\ \partial L/\partial y \end{pmatrix} = \begin{pmatrix} \alpha/x - 2\mu_1 - \mu_2 + \mu_3 \\ \beta/y - \mu_1 - 2\mu_2 + \mu_4 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
The complementary slackness conditions are
$$\mu_1(y + 2x - 6) = 0$$
$$\mu_2(x + 2y - 6) = 0$$
$$\mu_3 x = 0$$
$$\mu_4 y = 0$$
We now proceed by picking combinations of multipliers, setting $\mu_j \ne 0$ if the $j$-th constraint holds with equality at $(x^*, y^*)$, and $\mu_j = 0$ if it is slack. This is what the "complementary" part of complementary slackness means.

First, note that the gradient of the objective is strictly positive, so that if $x = 0$ or $y = 0$ at the optimum, it occurs at an extreme point of the set. So if $x = 0$ ($\mu_3 \ne 0$), then $y$ must be as large as possible, and if $y = 0$ ($\mu_4 \ne 0$), $x$ must be as large as possible. Second, note that away from $x = y = 2$, if $6 = y + 2x$ is binding ($\mu_1 \ne 0$), then $6 = x + 2y$ is slack ($\mu_2 = 0$), and vice versa.
So we really have five possibilities:

1. $x = y = 2$, with $\mu_1, \mu_2 \ne 0$ and $\mu_3 = \mu_4 = 0$

2. $x, y > 0$, with $\mu_1 \ne 0$ and $\mu_2 = \mu_3 = \mu_4 = 0$

3. $x, y > 0$, with $\mu_2 \ne 0$ and $\mu_1 = \mu_3 = \mu_4 = 0$

4. $x > 0$ and $y = 0$, with $\mu_1 \ne 0$ and $\mu_2 = \mu_3 = \mu_4 = 0$

5. $y > 0$ and $x = 0$, with $\mu_2 \ne 0$ and $\mu_1 = \mu_3 = \mu_4 = 0$

For each case, we find the set of critical points that satisfy the resulting first-order conditions:

1. The unique critical point is $x = y = 2$.
2. If $x, y > 0$ and $\mu_1 \ne 0$, the binding constraint is that $y + 2x = 6$. The first-order necessary conditions become
$$\begin{pmatrix} \alpha/x - 2\mu_1 \\ \beta/y - \mu_1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
This implies (why?) $\alpha y = 2\beta x$. With the constraint, this implies
$$x^* = \frac{3\alpha}{\alpha + \beta}, \qquad y^* = \frac{6\beta}{\alpha + \beta}$$
3. If $x, y > 0$ and $\mu_2 \ne 0$, the binding constraint is that $2y + x = 6$. The first-order necessary conditions become
$$\begin{pmatrix} \alpha/x - \mu_2 \\ \beta/y - 2\mu_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$
Using the conditions and constraint gives
$$x^* = \frac{6\alpha}{\alpha + \beta}, \qquad y^* = \frac{3\beta}{\alpha + \beta}$$

4. If $x > 0$ and $y^* = 0$, and $\mu_1 \ne 0$, then the constraint $2x + y = 6$ binds, and $x^* = 3$.

5. If $y > 0$ and $x^* = 0$, and $\mu_2 \ne 0$, then the constraint $x + 2y = 6$ binds, and $y^* = 3$.
So we have a candidate list of five potential solutions:
$$(2, 2), \quad \left(\frac{6\alpha}{\alpha+\beta}, \frac{3\beta}{\alpha+\beta}\right), \quad \left(\frac{3\alpha}{\alpha+\beta}, \frac{6\beta}{\alpha+\beta}\right), \quad (3, 0), \quad (0, 3)$$
To figure out which is the global maximum, you generally need to check which candidates satisfy all the constraints, plug the feasible ones back into the objective, and determine which point achieves the highest value. This will depend on the values of $\alpha$ and $\beta$.
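As a concrete check, the sketch below evaluates the candidate list for an assumed parameterization $\alpha = \beta = 1$ (my choice, not from the text). Note that a candidate produced under one binding-constraint pattern can violate a constraint that was ignored in that case, so infeasible candidates must be filtered out first.

```python
# Evaluate the five KT candidates for f(x, y) = x**alpha * y**beta.
# Assumed parameterization (not from the text): alpha = beta = 1.

alpha, beta = 1.0, 1.0

def f(x, y):
    return x ** alpha * y ** beta

s = alpha + beta
candidates = [
    (2.0, 2.0),
    (6 * alpha / s, 3 * beta / s),
    (3 * alpha / s, 6 * beta / s),
    (3.0, 0.0),
    (0.0, 3.0),
]

# Keep only candidates satisfying all four constraints.
feasible = [(x, y) for (x, y) in candidates
            if y + 2 * x <= 6 + 1e-9 and x + 2 * y <= 6 + 1e-9
            and x >= 0 and y >= 0]

best = max(feasible, key=lambda xy: f(*xy))
print(best, f(*best))  # (2.0, 2.0) 4.0
```

For $\alpha = \beta$ the single-binding-constraint candidates turn out to violate the other linear constraint, so the both-binding point $(2, 2)$ wins.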
Example Suppose an agent is trying to solve an optimal portfolio problem, where two assets, x and y, are available. Asset x has mean $m_x$ and variance $v_x$, while asset y has mean $m_y$ and variance $v_y$. The covariance of x and y is $v_{xy}$. He has wealth w, shares of x cost p, and shares of y cost 1, so his budget constraint is $w = px + y$, but $x \ge 0$ and $y \ge 0$, so he cannot short. The objective function is the mean minus the variance of the portfolio,
$$x m_x + y m_y - (x^2 v_x + y^2 v_y - 2xy v_{xy})$$
Then the Lagrangean is
$$L = x m_x + y m_y - (x^2 v_x + y^2 v_y - 2xy v_{xy}) - \lambda(w - px - y) + \mu_x x + \mu_y y$$
The FONCs are
$$m_x - (2x^* v_x - 2y^* v_{xy}) - \lambda p + \mu_x = 0$$
$$m_y - (2y^* v_y - 2x^* v_{xy}) - \lambda + \mu_y = 0$$
$$-(w - px^* - y^*) = 0$$
and the complementary slackness conditions are
$$\mu_x x^* = 0$$
$$\mu_y y^* = 0$$
Then there are $2^2 = 4$ cases:

1. $\mu_x = \mu_y = 0$, placing no restrictions on $x^*$ and $y^*$, by the complementary slackness conditions

2. $\mu_x = 0$ and $\mu_y \ne 0$, so that $y^* = 0$, by the complementary slackness conditions

3. $\mu_x \ne 0$ and $\mu_y = 0$, so that $x^* = 0$, by the complementary slackness conditions

4. $\mu_x, \mu_y \ne 0$, so that $x^* = y^* = 0$, by the complementary slackness conditions
We solve case by case:

1. If $\mu_x = \mu_y = 0$, then the FONCs are
$$m_x - (2x^* v_x - 2y^* v_{xy}) - \lambda p = 0$$
$$m_y - (2y^* v_y - 2x^* v_{xy}) - \lambda = 0$$
$$-(w - px^* - y^*) = 0$$
which is a quadratic programming problem and can be solved by matrix methods. In particular,
$$y^* = \frac{w(v_x + p v_{xy}) - \frac{p}{2}(m_x - p m_y)}{p(2v_{xy} + p v_y) + v_x}$$
2. If $\mu_x = 0$ and $\mu_y \ne 0$, then $y^* = 0$ and the FONCs are
$$m_x - 2x^* v_x - \lambda p = 0$$
$$m_y + 2x^* v_{xy} - \lambda + \mu_y = 0$$
$$-(w - px^*) = 0$$
And the last equation gives a candidate solution $x^* = w/p$ and $y^* = 0$.

3. This case is similar to the previous one, with $y^* = w$ and $x^* = 0$. (Case 4 would require $x^* = y^* = 0$, which violates the budget constraint for $w > 0$, so it produces no candidate.)
So the portfolio optimization problem produces three candidate solutions. Our candidate solution with $x^*$ and $y^*$ both positive requires
$$w(v_x + p v_{xy}) - \frac{p}{2}(m_x - p m_y) > 0$$
If $p = 1$, this reduces to
$$w > \frac{m_x - m_y}{2(v_x + v_{xy})}$$
so that the difference in means ($m_x - m_y$) can't be too large, or the agent will prefer to purchase all x. If $v_x$ or $v_{xy}$ is large, however, this makes the agent prefer to diversify, even if $m_x > m_y$. The KT conditions allow us to pick out the various solutions and compare them, even for large numbers of assets (since this is just a quadratic programming problem).
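To sanity-check the interior candidate's closed form, the sketch below compares it against a brute-force grid search along the budget line, using the objective exactly as written in this example. All parameter values are my own assumptions, not from the text.

```python
# Check the closed-form interior solution for y* against a grid search.
# Assumed parameter values (not from the text):
m_x, m_y, v_x, v_y, v_xy = 1.2, 1.0, 0.5, 0.4, 0.1
w, p = 2.0, 1.0

def U(x, y):
    # Mean minus variance, as written in the example above.
    return x * m_x + y * m_y - (x**2 * v_x + y**2 * v_y - 2 * x * y * v_xy)

# Closed-form interior candidate from the FONCs:
y_star = (w * (v_x + p * v_xy) - (p / 2) * (m_x - p * m_y)) \
         / (p * (2 * v_xy + p * v_y) + v_x)

# Brute force: substitute the budget constraint x = (w - y) / p and search y.
grid = [i * w / 100000 for i in range(100001)]
y_best = max(grid, key=lambda y: U((w - y) / p, y))

print(round(y_star, 4), round(y_best, 4))  # 1.0 1.0
```

With these parameters the interior condition above holds, so the closed form and the grid search agree on $y^* = 1$.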
11.1.1 The sign of the KT multipliers
We haven't yet shown that $\mu_j \ge 0$. To do this, consider relaxing the first inequality constraint:
$$h_1(x) \le \epsilon$$
where $\epsilon \ge 0$ is a relaxation parameter that makes the constraint easier to satisfy. The Lagrangian for a problem with one equality constraint and J inequality constraints becomes
$$L = f(x) - \lambda g(x) - \mu_1(h_1(x) - \epsilon) - \sum_{j=2}^{J} \mu_j h_j(x)$$
Let $V(\epsilon)$ be the value function in terms of $\epsilon$. If we have relaxed the constraint, the agent's choice set is larger, and he must be weakly better off, so that $V'(\epsilon) \ge 0$. But by the envelope theorem,
$$V'(\epsilon) = \frac{\partial L(x^*, \lambda^*, \mu^*)}{\partial \epsilon} = \mu_1 \ge 0$$
So the KT multipliers must be weakly positive, as long as your constraints are always written as $h_j(x) \le 0$.
11.2 Second-Order Sufficient Conditions and the IFT
Some good news: since KT maximization is essentially the same as Lagrange maximization once you have fixed the set of binding constraints, the SOSCs and implicit function theorem for an inequality-constrained problem are equivalent to those in an equality-constrained maximization problem where the binding constraints appear as equality constraints, and the slack constraints are completely ignored.

Remember: KT is just a formal way of working through the process of testing all the possible combinations of binding constraints. So for a particular set of binding constraints, the problem is equivalent to an equality-constrained problem where these are the only constraints. To check whether a critical point is a local maximum or to compute comparative statics, we use exactly the same approach as in an equality-constrained problem.

The envelope theorem is similar, but you have to be careful. In an inequality-constrained problem there are potentially many local solutions for some values of the exogenous variables. But the envelope theorem is computed from the maximum of all of them, and as you vary the exogenous variables, you have to be careful that you don't drift into another solution's territory. Locally, there is not much to worry about, since you are almost never on such a boundary (in the sense of Lebesgue measure zero). But suppose you want to plot the maximized value of the objective function for an agent making a savings decision over a variety of interest rates, where a borrowing constraint binds for some values of the interest rate but not for others. This upper envelope over all maximization solutions is at least continuous (though not necessarily differentiable at kinks where the regime switches):

Theorem 11.2.1 (Theorem of the Maximum) Consider maximizing $f(x, c)$ over x, where the choice set defined by the equality and inequality constraints is compact and potentially depends on c. Then $f(x^*(c), c)$ is a continuous function in c.

The theorem of the maximum is a kind of generalization of the envelope theorem that says: while $V(c)$ may not be differentiable because of the kinks where the set of binding constraints shifts, the value function will always be at least continuous in a parameter c.
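A toy illustration of this continuity-with-kinks (my own example, not from the text): maximize $f(x, c) = c \cdot x$ over $x \in [-1, 1]$. The optimal corner switches at $c = 0$, so the value function $V(c) = |c|$ is continuous but not differentiable there.

```python
# Value function of max_{x in [-1, 1]} c * x: the optimum sits at a corner,
# the binding constraint switches at c = 0, and V(c) = |c| has a kink there
# while remaining continuous.

def V(c):
    # A linear objective on an interval is maximized at one of the endpoints.
    return max(c * x for x in (1.0, -1.0))

print([V(c) for c in (-2.0, -1.0, 0.0, 1.0, 2.0)])  # [2.0, 1.0, 0.0, 1.0, 2.0]
```

The printed values trace out $|c|$: continuous everywhere, kinked exactly where the binding corner switches.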
Exercises

1. Suppose an agent has the following utility function:
$$u(x_1, x_2) = \sqrt{x_1} + \sqrt{x_2}$$
and faces a linear constraint
$$1 = p x_1 + x_2$$
and non-negativity constraints $x_1 \ge 0$, $x_2 \ge 0$.
i. Verify the objective function is concave in $(x_1, x_2)$. ii. Graph the budget set for various prices p, and some of the upper contour sets. iii. Solve the agent's optimization problem for all $p > 0$. Check the second-order conditions are satisfied at the optimum. iv. Sketch the set of maximizers as a function of p, $x_1(p)$. v. Solve the agent's problem when $u(x_1, x_2) = \sqrt{x_1} + \sqrt{x_2 + 1}$. Is it possible to get corner solutions in this case? Explain the difference between the old objective function and the new one.
2. Suppose a firm hires capital K and labor L to produce output using technology $q = F(K, L)$, with $\nabla F(K, L) \ge 0$. The price of capital is r and the price of labor is w. Solve the cost minimization problem
$$\min_{K, L} \; rK + wL$$
subject to $F(K, L) \ge q$, $K \ge 0$ and $L \ge 0$. Check the SOSCs or otherwise prove that your solutions are local minima subject to the constraints. How does $(K^*, L^*)$ vary in r and q at each solution? How does the firm's cost function vary in q?
3. Warning: this question is better approached by thinking without a Lagrangean than by trying to use KT theory. A manager has two factories at his disposal, a and b. The cost function for factory a is $C_a(q_a)$, where $C_a(0) = 0$ and $C_a(q_a)$ is increasing, differentiable, and convex. The cost function for factory b is
$$C_b(q_b) = \begin{cases} 0 & \text{if } q_b = 0 \\ c_b q_b + F & \text{if } q_b > 0 \end{cases}$$
i. Is factory b's cost function convex? Briefly explain the economic intuition of the property $C_b(0) = 0$ but $\lim_{q_b \downarrow 0} C_b(q_b) = F$. ii. Solve for the cost-minimizing production plan $(q_a, q_b)$ that achieves $\bar q$ units of output, $q_a + q_b = \bar q$, $q_a \ge 0$, $q_b \ge 0$. Is the cost function $C^*(\bar q)$ continuous? Convex? Differentiable? iii. Let p be the price of the good. If the firm's profit function is $\pi(q) = pq - C^*(q)$, what is the set of maximizers of $\pi(q)$?
4. Suppose an agent has the following utility function:
$$u(x_1, x_2) = \begin{cases} \sqrt{x_1 x_2} & x_2 > x_1 \\ \frac{x_1 + x_2}{2} & x_2 \le x_1 \end{cases}$$
and faces a linear constraint
$$1 = p x_1 + x_2$$
and non-negativity constraints $x_1 \ge 0$, $x_2 \ge 0$. (i.) Verify that these preferences are continuous but non-differentiable along the ray $x_1 = x_2$. (ii.) Graph the budget set for various prices p, and some of the upper contour sets. (iii.) Solve the agent's optimization problem for all $p > 0$. (iv.) Sketch the set of maximizers as a function/correspondence of p, $x_1(p)$.
5. Suppose an agent gets utility from consumption, $c \ge 0$, and leisure, $\ell \ge 0$. He has one unit of time, which can also be spent working, $h \ge 0$. From working, he gets a wage of w per hour, and his utility function is
$$u(c, \ell)$$
However, he faces taxes, which take the form
$$t(wh) = \begin{cases} 0 & wh < t_0 \\ \tau \, wh & wh \ge t_0 \end{cases}$$
Therefore, his income is wh, consumption costs pc, and he is taxed, based on whether his wage income is above or below a certain threshold, linearly at rate $\tau$. i. Sketch the agent's budget set. Is it a convex set? ii. Formulate a maximization problem and characterize any first-order necessary conditions or complementary slackness conditions. iii. Characterize the agent's behavior as a function of w and $t_0$. What implications does this model have for the design of tax codes? iv. If the tax instead took the form
$$t(wh) = \begin{cases} 0 & wh < t_0 \\ \tau & wh \ge t_0 \end{cases}$$
sketch the agent's budget set. Is it convex? What implications does the geometry of this budget set have for consumer behavior?
6. A consumer has wealth w, access to a riskless asset with return $R > 1$, and a risky asset with return $r \sim N(\bar r, \sigma^2)$. He places $\alpha_1$ of his wealth into the riskless asset and $\alpha_2$ of his wealth into the risky asset:
$$w = \alpha_1 + \alpha_2$$
and he maximizes the mean-less-the-variance of his returns:
$$U(\alpha_1, \alpha_2) = \alpha_1 R + \alpha_2 \bar r - \frac{\gamma}{2} \alpha_2^2 (\sigma^2 + \bar r^2)$$
(i) Find conditions on $\gamma$, R, $\bar r$, $\sigma^2$ so the agent holds some of both the risky and riskless assets, and solve for the optimal portfolio weights $(\alpha_1^*, \alpha_2^*)$. (ii) How do the optimal portfolio weights vary with $\bar r$, $\sigma^2$ and R? (iii) How does his payoff vary with w and R when he uses the optimal portfolio?
Chapter 12
Concavity and Quasi-Concavity

Checking second-order sufficient conditions for equality- and inequality-constrained maximization problems is often outrageously painful. It is time-consuming and error-prone, and the pain increases exponentially in the number of choice variables.

For this reason, it would be very helpful to know when the second-order sufficient conditions for multi-dimensional maximization problems are automatically satisfied.

There are two classes of functions that are useful for maximization: concave and quasi-concave. (The corresponding classes for minimization are convex and quasi-convex.)

In the one-dimensional case, strict concavity meant that $f''(x) < 0$ for all x. In the multi-dimensional case, strict concavity will similarly mean that $y' \nabla_{xx} f(x) y < 0$ for all x and all $y \ne 0$, and it follows that the second-order conditions will be satisfied. This is the best case.
However, recall that if $x^*$ is a local maximum of $f(x)$, then for any strictly increasing function $g(\cdot)$, $x^*$ is also a maximum of $g(f(x))$. This is nice, because it means that the units of $f(x)$ are irrelevant to the set of maximizers. But concavity and convexity are not preserved under monotone transformations. For example, $\log(x)$ is one of our prototype concave functions. But if we take the strictly increasing transformation $g(y) = (e^y)^2$, we get
$$g(\log(x)) = x^2$$
which is a convex function.

This has the potential to cause problems for us as economists, because we want the set of maximizers to be independent of how we describe $f(x)$ up to increasing transformations (that lets us claim that we don't need "utils" or cardinal utility to measure people's preferences; we can just observe how they behave). We want our theories to be based on ordinal utility, meaning that the numbers provided by a utility function are meaningless in themselves, and only serve to verify whether one option is better than another.

This is where quasi-concavity comes in. It is a property similar to concavity that is preserved under monotone transformations, and it has the same convenient theorems built in.
12.1 Some Geometry of Convex Sets
For analyzing maximization problems, it is helpful to consider the behavior of the "better than" sets:

Definition 12.1.1 The upper contour sets of a function are the sets
$$UC(a) = \{x : f(x) \ge a\}$$
Now, we need to be careful about the difference between convex sets and convex functions, since we're about to use both words side by side quite a bit.
Definition 12.1.2 A set S is convex if, for any $x'$ and $x''$ in S and any $\alpha \in [0, 1]$, the point $x^\alpha = \alpha x' + (1-\alpha)x''$ is also in S. The interior of a convex set S, $\mathrm{int}(S)$, is the set of all points x for which there is a ball $B_\epsilon(x) \subset S$. If every convex combination of distinct points of S (with $\alpha \in (0, 1)$) is in the interior of S, then S is a strictly convex set.

[Figure: Convexity]

Convex sets are well behaved since you can never drift outside the set along a chord between two points. Geometrically, if your upper contour sets are convex and the constraint set is strictly convex, then we should be able to separate them, like this:
[Figure: Separation of Constraint and Upper-contour Sets]

This would mean there is a unique maximizer, since there is a unique point of tangency between the indifference curves and the constraint set.

Why? If we pick a point $x^0$ on an indifference curve and some point $x^1$ better than $x^0$, all of the options along the path from $x^0$ to $x^1$ are better than $x^0$. We might express this more formally by saying that for $\alpha \in [0, 1]$, the options $x^\alpha = \alpha x^1 + (1-\alpha)x^0$ are all better than $x^0$, or that the upper contour sets are convex sets. This is a key geometric feature of well-behaved constrained maximization problems.
12.2 Concave Programming
Concave functions are the best-case scenario for optimization, since they imply that first-order necessary conditions are sufficient for a critical point to be a global maximum (this makes checking second-order sufficient conditions unnecessary, and life is much easier).

Definition 12.2.1 A function $f(x)$ is concave if for all $x'$ and $x''$ and all $\alpha \in [0, 1]$,
$$f(\alpha x' + (1-\alpha)x'') \ge \alpha f(x') + (1-\alpha)f(x'')$$
It is strictly concave if the inequality is strict for all $\alpha \in (0, 1)$.
[Figure: Concave functions]

This is a natural property in economics: agents often prefer variety, consuming the mixed bundle $\alpha x' + (1-\alpha)x''$, to consuming either of two extreme bundles. For example, which sounds better: five apples and five oranges, or ten apples with probability 1/2 and ten oranges with probability 1/2? If you say, "Five apples and five oranges," you have concave preferences.
However, there are many equivalent ways to characterize a concave function:

Definition 12.2.2 The following are equivalent (let $D \subseteq \mathbb{R}^N$ be the domain of $f(\cdot)$):

- $f(x)$ is concave on D

- For all $x', x'' \in D$ and all $\alpha \in (0, 1)$,
$$f(\alpha x' + (1-\alpha)x'') \ge \alpha f(x') + (1-\alpha)f(x'')$$

- For all $x', x'' \in D$,
$$f(x'') - f(x') \le \nabla f(x')(x'' - x')$$

- The Hessian of $f(x)$ satisfies $y' H(x) y \le 0$ for all $x \in D$ and all $y \in \mathbb{R}^N$.
If the weak inequalities above are replaced with strict inequalities, the function is strictly concave.
The Hessian characterization (concave if $H(x)$ is negative semi-definite for all x in the domain of $f(x)$, and strictly concave if $H(x)$ is negative definite for all x in the domain of $f(x)$) is extremely useful, since the Hessian is so ubiquitous in proving SOSCs. If the objective function is strictly concave, then at a critical point $x^*$,
$$f(x) = f(x^*) + (x - x^*)' \frac{H(x^*)}{2} (x - x^*) + o(\|x - x^*\|^2)$$
or, for x close to $x^*$,
$$f(x^*) - f(x) = -(x - x^*)' \frac{H(x^*)}{2} (x - x^*) > 0$$
since $y' H(x) y < 0$ for any x, including $x^*$. Then we can conclude that $f(x^*) > f(x)$, so that a critical point $x^*$ is a local maximum.
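The negative-definiteness test is easy to apply numerically. A small illustration of my own: for $f(x, y) = \log(x) + \log(y)$, the Hessian is $\mathrm{diag}(-1/x^2, -1/y^2)$, which is negative definite wherever $x, y > 0$, so $f$ is strictly concave there.

```python
# Verify negative definiteness of the Hessian of f(x, y) = log(x) + log(y)
# at a grid of points in the positive quadrant.

def hessian(x, y):
    # H = [[-1/x^2, 0], [0, -1/y^2]] for f(x, y) = log(x) + log(y)
    return [[-1.0 / x**2, 0.0],
            [0.0, -1.0 / y**2]]

def is_negative_definite_2x2(H):
    # Symmetric 2x2 test: leading principal minors satisfy H11 < 0, det > 0.
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return H[0][0] < 0 and det > 0

print(all(is_negative_definite_2x2(hessian(x, y))
          for x in (0.5, 1.0, 3.0) for y in (0.5, 1.0, 3.0)))  # True
```

For a 2x2 symmetric matrix the leading-principal-minor test used here is equivalent to checking that both eigenvalues are negative.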
Theorem 12.2.3
- Consider an unconstrained maximization problem $\max_x f(x)$. If $f(x)$ is concave, any critical point is a global maximum of $f(x)$. If $f(x)$ is strictly concave, any critical point is the unique global maximum.
- Consider an inequality-constrained maximization problem $\max_x f(x)$ subject to $h_k(x) \le 0$, for $k = 1, \ldots, K$. If $f(x)$ is concave and the constraint set is convex, any critical point of the Lagrangian is a global maximum. If $f(x)$ is strictly concave and the constraint set is convex, any critical point of the Lagrangian is the unique global maximum.

Note that for Kuhn-Tucker, the situation is slightly more complicated than it might appear: the objective function might be strictly concave so that for each set of active constraints there is a unique global maximizer (since each case is just an equality-constrained maximization problem), but these candidates must still somehow be compared.
12.3 Quasi-Concave Programming
Concavity is a cardinal property, not an ordinal one, and we would like to have a generalization of concavity that is merely ordinal.

For example, if we consider $x^\alpha y^\beta$ and compute the Hessian, we get
$$\begin{pmatrix} \alpha(\alpha-1)x^{\alpha-2}y^\beta & \alpha\beta x^{\alpha-1}y^{\beta-1} \\ \alpha\beta x^{\alpha-1}y^{\beta-1} & \beta(\beta-1)x^\alpha y^{\beta-2} \end{pmatrix}$$
which is not negative semi-definite, so the function is not concave, if $\alpha + \beta > 1$. But $f(x, y) = x^\alpha y^\beta$ still has convex upper contour sets, so we would expect it to be well-behaved for maximization purposes, even if it isn't concave (see Proposition 12.3.2 below).
To deal with this, we introduce a new kind of function: quasi-concave functions. We're going to motivate quasi-concavity in a somewhat roundabout way. Recall that for a function $f(x, y)$, the indifference curve $x(y)$ was defined as the implicit solution to the equation
$$f(x(y), y) = c$$
Now, if the indifference curves $x(y)$ were

1. Downward sloping, so that taking some y away from the agent requires increasing x to keep him on the same indifference curve

2. Convex functions, so that taking away a lot of y requires giving the agent ever higher quantities of x to compensate him

3. Invariant to strictly increasing transformations

we would have the right geometric properties for maximization without the cardinal baggage that comes with assuming concavity.

When are 1-3 satisfied? Well, for a strictly increasing transformation $g(\cdot)$, $g(f(x(y), y))$ has indifference curves defined by
$$g'(f(x(y), y))\left(f_x(x(y), y)\, x'(y) + f_y(x(y), y)\right) = 0$$
which are the same as those generated by $f(x(y), y)$. So far so good. Now, the derivative of $x(y)$ is
$$x'(y) = -\frac{f_y(x(y), y)}{f_x(x(y), y)}$$
And the second derivative is
$$x''(y) = -\frac{(f_{xy} x'(y) + f_{yy}) f_x - (f_{xx} x'(y) + f_{yx}) f_y}{f_x^2} > 0$$
which implies
$$\left(f_{xy}\, x'(y) + f_{yy}\right) f_x - \left(f_{xx}\, x'(y) + f_{yx}\right) f_y < 0$$
or, substituting $x'(y) = -f_y/f_x$ and multiplying through by $f_x > 0$,
$$f_{yy} f_x^2 - 2 f_{xy} f_x f_y + f_{xx} f_y^2 < 0$$
which doesn't appear to be anything at first glance. Actually, it is the negative of the determinant of the matrix
$$\begin{pmatrix} 0 & f_x & f_y \\ f_x & f_{xx} & f_{xy} \\ f_y & f_{yx} & f_{yy} \end{pmatrix}$$
So if this matrix is negative semi-definite, the indifference curves $x(y)$ will be convex functions, and if it is negative definite, the indifference curves $x(y)$ will be strictly convex functions. This is the idea of quasi-concavity.
Definition 12.3.1 The following are equivalent (let D be the domain of $f(x)$):

- $f(x)$ is quasi-concave on D

- For every real number a, the upper contour sets of $f(x)$,
$$UC(a) = \{x : f(x) \ge a\}$$
are convex.

- The bordered Hessian
$$\bar H(x) = \begin{pmatrix} 0 & \nabla_x f(x)' \\ \nabla_x f(x) & \nabla_{xx} f(x) \end{pmatrix}$$
is negative semi-definite on D, so that $y' \bar H(x) y \le 0$ for all $x \in D$ and all y.

- For all $\alpha$ in $[0, 1]$ and $x', x'' \in D$,
$$f(\alpha x' + (1-\alpha)x'') \ge \min\{f(x'), f(x'')\}$$

- For all $x', x'' \in D$, if $f(x'') \ge f(x')$, then
$$\nabla f(x')(x'' - x') \ge 0$$

If the weak inequalities above are replaced with strict inequalities, the function is strictly quasi-concave.
Now, the above characterizations are useful, but the following provides what is usually the easiest route to proving quasi-concavity of a function:

Proposition 12.3.2 Let $g(\cdot)$ be an increasing function. If $g(f(x))$ is concave, then $f(x)$ is quasi-concave.
Proof If $g(f(x))$ is concave, then
$$g(f(\alpha x + (1-\alpha)x')) \ge \alpha g(f(x)) + (1-\alpha) g(f(x'))$$
Without loss of generality, let $g(f(x)) \ge g(f(x'))$; this implies $f(x) \ge f(x')$. But then
$$g(f(\alpha x + (1-\alpha)x')) \ge \alpha g(f(x)) + (1-\alpha) g(f(x')) \ge g(f(x'))$$
and, since g is increasing, we get
$$f(\alpha x + (1-\alpha)x') \ge f(x') = \min\{f(x), f(x')\}$$
so that $f(\cdot)$ is quasi-concave.

For example, this works with $f(x, y) = x^\alpha y^\beta$ with $\alpha + \beta > 1$: take the monotone transformation $g(z) = z^{1/(\alpha + \beta + 1)}$, which "concavifies" the function. The concavity of $g(f(x, y))$ is then easy to show using the standard tools for concavity, which are somewhat easier than those for quasi-concavity.
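A quick numerical spot-check of the min-characterization (my own illustration): $f(x, y) = xy$ on the positive quadrant is quasi-concave but not concave, and random sampling finds no violation of $f(\alpha x' + (1-\alpha)x'') \ge \min\{f(x'), f(x'')\}$.

```python
# Randomized check of f(a*x' + (1-a)*x'') >= min(f(x'), f(x'')) for
# f(x, y) = x * y on the positive quadrant (quasi-concave, not concave).

import random

def f(p):
    return p[0] * p[1]

random.seed(0)
ok = True
for _ in range(10000):
    a = (random.uniform(0.01, 10), random.uniform(0.01, 10))
    b = (random.uniform(0.01, 10), random.uniform(0.01, 10))
    t = random.random()
    mid = (t * a[0] + (1 - t) * b[0], t * a[1] + (1 - t) * b[1])
    if f(mid) < min(f(a), f(b)) - 1e-9:
        ok = False
        break

print(ok)  # True: no counterexample found
```

A check like this can't prove quasi-concavity, but it is a cheap way to catch a false conjecture before attempting a proof.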
Quasi-concavity is useful because of the following theorem:

Theorem 12.3.3 Consider an inequality-constrained maximization problem $\max_x f(x)$ subject to $h_k(x) \le 0$ for $k = 1, \ldots, K$. If $f(x)$ is quasi-concave and the constraint set is convex, any critical point of the Lagrangian is a global maximum. If $f(x)$ is strictly quasi-concave and the constraint set is convex, any critical point of the Lagrangian is the unique global maximum.

Note that in unconstrained problems, quasi-concavity might not be enough to guarantee a critical point is a maximum. For example, the function $f(x) = x^3$ is quasi-concave, since the upper contour sets $UC(a) = \{x : x^3 \ge a\}$ are convex sets. However, its critical point at $x = 0$ is not a maximum.
To check that a function is quasi-concave, let
$$\bar H_k(x) = \begin{pmatrix}
0 & \partial f/\partial x_1 & \partial f/\partial x_2 & \cdots & \partial f/\partial x_{k-1} \\
\partial f/\partial x_1 & \partial^2 f/\partial x_1^2 & \partial^2 f/\partial x_2 \partial x_1 & \cdots & \partial^2 f/\partial x_1 \partial x_{k-1} \\
\partial f/\partial x_2 & \partial^2 f/\partial x_1 \partial x_2 & \partial^2 f/\partial x_2^2 & \cdots & \partial^2 f/\partial x_2 \partial x_{k-1} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\partial f/\partial x_{k-1} & \partial^2 f/\partial x_{k-1} \partial x_1 & \partial^2 f/\partial x_{k-1} \partial x_2 & \cdots & \partial^2 f/\partial x_{k-1}^2
\end{pmatrix}$$
Then we have an alternating sign test on the determinants of the leading principal minors:

Theorem 12.3.4 (Alternating Sign Test) If f is quasi-concave, then the determinants of the leading principal minors of the bordered Hessian alternate in sign, starting with $\det \bar H_3(x) \ge 0$, $\det \bar H_4(x) \le 0$, and so on. If the leading principal minors of the bordered Hessian alternate in sign, starting with $\det \bar H_3(x) > 0$, $\det \bar H_4(x) < 0$, and so on, then f is quasi-concave.
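For a two-variable function the test only involves $\det \bar H_3$. A small check of my own for $f(x, y) = xy$: the bordered Hessian is $\begin{pmatrix}0 & y & x\\ y & 0 & 1\\ x & 1 & 0\end{pmatrix}$, with determinant $2xy > 0$ on the positive quadrant, consistent with quasi-concavity.

```python
# Bordered-Hessian sign test for f(x, y) = x * y on the positive quadrant.

def det3(m):
    # Determinant of a 3x3 matrix by cofactor expansion along the first row.
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def bordered_hessian(x, y):
    # f_x = y, f_y = x, f_xx = f_yy = 0, f_xy = 1 for f(x, y) = x * y.
    return [[0.0, y, x],
            [y, 0.0, 1.0],
            [x, 1.0, 0.0]]

# The determinant equals 2*x*y, strictly positive whenever x, y > 0.
print(all(det3(bordered_hessian(x, y)) > 0
          for x in (0.5, 1.0, 2.0) for y in (0.5, 1.0, 2.0)))  # True
```

This is the same computation that the hand derivation above performs symbolically, just evaluated at sample points.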
12.4 Convexity and Quasi-Convexity
Convex and quasi-convex functions play a similar role in minimization theory as concave and quasi-concave functions play in maximization theory.

Definition 12.4.1 The following are equivalent:

- $f(x)$ is convex

- For all $x'$ and $x''$ and all $\alpha \in [0, 1]$,
$$\alpha f(x') + (1-\alpha) f(x'') \ge f(\alpha x' + (1-\alpha)x'')$$

- For all $x'$ and $x''$,
$$f(x'') - f(x') \ge Df(x')(x'' - x')$$

- The Hessian of $f(x)$ satisfies $y' H(x) y \ge 0$ for all x in the domain of $f(x)$ and all $y \in \mathbb{R}^N$.

If the weak inequalities above are replaced with strict inequalities, the function is strictly convex.

and
Definition 12.4.2 The following are equivalent:

- $f(x)$ is quasi-convex

- For every real number a, the lower contour sets of $f(x)$,
$$LC(a) = \{x : f(x) \le a\}$$
are convex.

- The bordered Hessian
$$\bar H(x) = \begin{pmatrix} 0 & \nabla_x f(x)' \\ \nabla_x f(x) & \nabla_{xx} f(x) \end{pmatrix}$$
is positive semi-definite, $y' \bar H(x) y \ge 0$ for all y.

- For all $\alpha$ in $(0, 1)$,
$$f(\alpha x' + (1-\alpha)x'') \le \max\{f(x'), f(x'')\}$$

- If $f(x'') \le f(x')$, then
$$\nabla f(x')(x'' - x') \le 0$$

If the weak inequalities above are replaced with strict inequalities, the function is strictly quasi-convex.

and we have
Theorem 12.4.3
- Consider an unconstrained minimization problem $\min_x f(x)$. If $f(x)$ is convex, then any critical point of $f(x)$ is a global minimizer of $f(x)$. If $f(x)$ is strictly convex, any critical point is the unique global minimum.
- Consider an inequality-constrained minimization problem $\min_x f(x)$ subject to $h_k(x) \le 0$ for $k = 1, \ldots, K$. If $f(x)$ is convex or quasi-convex and the constraint set is convex, any critical point of the Lagrangian is a global minimum. If $f(x)$ is strictly convex or strictly quasi-convex and the constraint set is convex, any critical point of the Lagrangian is the unique global minimum.

The sign test for quasi-convexity is slightly different:

Theorem 12.4.4 If f is quasi-convex, then the determinants of the leading principal minors of the bordered Hessian are all weakly negative. If the leading principal minors of the bordered Hessian are all strictly negative, then f is quasi-convex.
12.5 Comparative Statics with Concavity and Convexity
There are some common tricks for deriving comparative statics relationships under the assumptions of concavity and convexity.
Example Suppose a firm chooses inputs $z = (z_1, z_2)$ whose costs per unit are $w = (w_1, w_2)$, subject to a production constraint $F(z_1, z_2) = q$. Let
$$c(w, q) = \min_z \; w \cdot z \quad \text{subject to } F(z) = q.$$
First, note that $c(w, q)$ is concave in w. How do we prove this? Suppose $z'$ minimizes costs at $w'$ and $z''$ minimizes costs at $w''$. Now consider the price vector $w^\alpha = \alpha w' + (1-\alpha)w''$, and its cost-minimizing solution $z^\alpha$. Then
$$c(w^\alpha, q) = w^\alpha \cdot z^\alpha = (\alpha w' + (1-\alpha)w'') \cdot z^\alpha = \alpha\, w' \cdot z^\alpha + (1-\alpha)\, w'' \cdot z^\alpha$$
But by definition $w' \cdot z^\alpha \ge w' \cdot z'$ and $w'' \cdot z^\alpha \ge w'' \cdot z''$, so that
$$\alpha\, w' \cdot z^\alpha + (1-\alpha)\, w'' \cdot z^\alpha \ge \alpha\, w' \cdot z' + (1-\alpha)\, w'' \cdot z''$$
or
$$c(\alpha w' + (1-\alpha)w'', q) \ge \alpha\, c(w', q) + (1-\alpha)\, c(w'', q)$$
so that the cost function is concave.
Second, we use the envelope theorem to differentiate the cost function with respect to w:
$$\nabla_w c(w, q) = z(w, q)$$
or
$$\begin{pmatrix} c_{w_1}(w_1, w_2, q) \\ c_{w_2}(w_1, w_2, q) \end{pmatrix} = \begin{pmatrix} z_1(w_1, w_2, q) \\ z_2(w_1, w_2, q) \end{pmatrix}$$
and again with respect to w to get
$$\nabla_{ww} c(w, q) = \nabla_w z(w, q)$$
or
$$\begin{pmatrix} c_{w_1 w_1}(w, q) & c_{w_2 w_1}(w, q) \\ c_{w_1 w_2}(w, q) & c_{w_2 w_2}(w, q) \end{pmatrix} = \begin{pmatrix} \partial z_1(w, q)/\partial w_1 & \partial z_1(w, q)/\partial w_2 \\ \partial z_2(w, q)/\partial w_1 & \partial z_2(w, q)/\partial w_2 \end{pmatrix}$$
Finally, since $c(w, q)$ is concave, every element on the diagonal of its Hessian is weakly negative. Therefore, every element on the diagonal of $\nabla_w z(w, q)$ must also be weakly negative, or
$$\frac{\partial z_k(w, q)}{\partial w_k} = \frac{\partial^2 c(w_1, w_2, q)}{\partial w_k^2} \le 0$$
so that if $w_k$ rises, then $z_k$ falls.

This exercise would actually be pretty difficult without the knowledge that $c(w, q)$ is concave, and that is the key.
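A numerical illustration of this concavity, with an assumed Cobb-Douglas technology of my choosing ($F(z_1, z_2) = \sqrt{z_1 z_2}$ and $q = 1$, for which cost minimization works out to $c(w) = 2\sqrt{w_1 w_2}$): random sampling finds no violation of the concavity inequality in input prices.

```python
# Check concavity in w of the cost function c(w) = 2 * sqrt(w1 * w2), which
# arises from the assumed technology F(z1, z2) = sqrt(z1 * z2) with q = 1.

import math
import random

def c(w1, w2):
    return 2 * math.sqrt(w1 * w2)

random.seed(1)
ok = True
for _ in range(10000):
    wa = (random.uniform(0.1, 5), random.uniform(0.1, 5))
    wb = (random.uniform(0.1, 5), random.uniform(0.1, 5))
    t = random.random()
    mid = (t * wa[0] + (1 - t) * wb[0], t * wa[1] + (1 - t) * wb[1])
    if c(*mid) < t * c(*wa) + (1 - t) * c(*wb) - 1e-9:
        ok = False
        break

print(ok)  # True: the concavity inequality held in every draw
```

The same experiment with any cost function derived from cost minimization should behave this way, since the proof above did not use any property of the technology beyond the definition of the minimum.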
Example Suppose a consumer buys bundles q = (q_1, ..., q_N) of goods, has utility function u(q, m) = v(q) + m, and budget constraint w = p · q + m. Suppose v(q) is concave. Then the objective function is

max_q v(q) − p · q + w

The FONCs are

∇_q v(q*) − p = 0

and totally differentiating with respect to p yields

∇²_qq v(q*) ∇_p q* − I = 0

and

∇_p q* = [∇²_qq v(q*)]^(−1)

Since v(q) is concave, its Hessian is negative definite, and the inverse of a negative definite matrix is negative definite, so all the entries on the diagonal are weakly negative. Therefore,

∂q*_k(p)/∂p_k ≤ 0

Again, the concavity of the objective function is what makes the last part do-able. Without knowing that [∇²_qq v(q*)]^(−1) is negative semi-definite by the concavity of v(q), we would be unable to decide the sign of the comparative statics of each good with respect to a change in its own price.
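A one-line specification makes the comparative static visible. With v(q) = Σ_k ln q_k (my choice for illustration, not from the notes), the FONC gives q*_k(p) = 1/p_k, and the diagonal of the inverse Hessian can be checked directly:

```python
# Quasilinear demand with v(q) = sum(log(q_k)): the FONC grad v(q*) = p
# gives q_k*(p) = 1 / p_k, so each demand is decreasing in its own price.
def demand(p):
    return [1.0 / pk for pk in p]

p = [0.5, 2.0, 4.0]
q = demand(p)

# Hessian of v at q* is diag(-1/q_k**2), so its inverse is diag(-q_k**2)
# = diag(-1/p_k**2): the diagonal of grad_p q* is weakly negative.
dq_dp = [-qk ** 2 for qk in q]
assert all(d <= 0 for d in dq_dp)

# finite-difference check of dq_1/dp_1 against the formula
h = 1e-6
fd = (demand([p[0] + h] + p[1:])[0] - q[0]) / h
assert abs(fd - dq_dp[0]) < 1e-4
```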
12.6 Convex Sets and Convex Functions
Since the relationship between a convex set and a convex function is so important, this section
focuses on that alone. This is somewhat repetitive after reading the previous content of the section,
but there are some new results.
Definition 12.6.1 1. A set S is convex if for all x', x'' ∈ S and all λ ∈ (0, 1), λx' + (1 − λ)x'' ∈ S.
2. A function F is convex if for all x', x'' in the domain of F and all λ ∈ (0, 1), F(λx' + (1 − λ)x'') ≤ λF(x') + (1 − λ)F(x'')
In economics we often have statements like, "For the utility maximization problem, if u(x) is quasi-concave, then the upper contour sets of u(x) are convex, and the set of maximizers is a convex set." This fixes a bunch of objects together that all share some aspect of convexity or concavity: (i) the function u(x) is quasi-concave, (ii) since u(x) is quasi-concave, its upper contour sets are convex, (iii) since the upper contour sets are convex and the constraint set is convex (p · x ≤ w, x ≥ 0), the separating hyperplane theorem guarantees we can separate them, (iv) the set of solutions to the UMP will itself be a convex set. Note that you cannot lose track of which object is being referred to: the function u(x), the sets UC(a), or the set of maximizers x(p, w).
There are some equivalent definitions of convexity. The following are equivalent:
1. A function F is convex
2. For all x', x'' in the domain of F and all λ ∈ (0, 1), F(λx' + (1 − λ)x'') ≤ λF(x') + (1 − λ)F(x'') (convexity inequality)
3. For all x in the domain of F, the Hessian at x satisfies yᵀH(x)y ≥ 0 for all y ≠ 0 (second-order test)
4. For all x', x'' in the domain of F, F(x') + ∇F(x')(x'' − x') ≤ F(x'') (first-order test)
If the weak inequalities in the above are made strict, the function is strictly convex.
Now, note that if a function is convex, it is automatically quasi-convex. Why is that? Well, a function is quasi-convex if

F(λx' + (1 − λ)x'') ≤ max{F(x'), F(x'')}

But since λF(x') + (1 − λ)F(x'') ≤ max{F(x'), F(x'')}, a convex function automatically satisfies the inequality defining quasi-convexity. Therefore, we can always exploit information about quasi-convex functions as well:
The following are equivalent:
1. A function F is quasi-convex
2. For all x', x'' in the domain of F and all λ ∈ (0, 1), F(λx' + (1 − λ)x'') ≤ max{F(x'), F(x'')} (quasi-convexity inequality)
3. For all x in the domain of F, the bordered Hessian at x satisfies yᵀBH(x)y ≥ 0 for all y ≠ 0 (second-order test)
4. If F(x'') ≤ F(x'), then ∇F(x')(x'' − x') ≤ 0 (first-order test)
5. The lower contour sets of F(·), LC(a) = {x : F(x) ≤ a}, are convex sets
If the above weak inequalities are made strict, the function is strictly quasi-convex.
Note the important fact that since convex functions are quasi-convex, their lower contour sets are convex sets. However, the hassle of computing determinants is exponential in the number of rows and columns, so it is easier to use the second-order test for convex functions than the second-order test for quasi-convex functions. Quasi-convex functions, therefore, have slightly more straightforward geometric features, while convex functions are easier to work with analytically.

Note that for optimization, quasi-convexity and minimization go together perfectly: If the lower contour sets of F(x) are convex and our feasible/constraint set is a convex set, we can use the separating hyperplane theorem to prove that the better-than set (which, for minimization, is a lower contour set) can be separated from the feasible set. If the function is strictly quasi-convex, we even get uniqueness of the solution (this was proved in class for a quasi-concave function and a convex feasible set, but the details are almost exactly the same).
1. A sum of convex functions is convex, but a sum of quasi-convex functions is not necessarily quasi-convex
2. An intersection of convex sets is a convex set
3. A function F is quasi-convex if there is an increasing transformation g such that g(F(x)) is convex
4. If a function F is convex and g is a strictly increasing, convex function, then g(F(x)) is convex
5. A function F is convex if and only if −F is concave (so this all applies to concave functions with suitable reworking of the direction of inequalities)
With these facts around, there are two methods of proving a set is convex:
1. Take two arbitrary elements in the set and show their convex combination is in the set
2. Show that the set is the lower contour set of a convex or quasi-convex function
The most useful tools are the convexity inequalities and the second-order tests, and points 3 and 4 of the list of facts above. I have rarely used the other items myself, but they are potentially useful alternatives when more obvious approaches fail.
Note that you want to be very clear about what set or function you are trying to prove is convex. For example, if you want to prove that the set

S(a) = {x : F(x) ≤ a}

is convex, you are considering the lower contour sets of F(x), which are in R^N. This would come up in a situation where, perhaps, you are minimizing and want to invoke a result about separating hyperplanes. On the other hand, you might have a function F(x) = y and you are interested in the set

S = {(x, y) : F(x) − y ≥ 0}

which is a set in R^(N+1). An example of this is the production sets from class, for instance.

Since proving functions are concave or convex is something we've talked about before (x^α y^β is concave when α, β > 0 and α + β < 1, for instance), let's focus on the second kind of problem: Proving that sets of the form

S = {(x, y) : F(x) − y ≥ 0}

are convex.
Example Consider the function f(x) = x². We will show that the set S = {(x, y) : y ≥ f(x)} is convex (sketch this set). To start, take any two points that satisfy

y ≥ x², y' ≥ x'²

Then for all λ ∈ (0, 1),

λy + (1 − λ)y' ≥ λx² + (1 − λ)x'²

But since g(z) = z² is a convex function (g'(z) = 2z, g''(z) = 2 > 0), we then have

λy + (1 − λ)y' ≥ λx² + (1 − λ)x'² ≥ (λx + (1 − λ)x')²

so that

λy + (1 − λ)y' ≥ (λx + (1 − λ)x')²

and the convex combination of the points (x, y) and (x', y') lies in S, and S is convex.

Now consider the function f(x) = √x. We will show that the set S = {(x, y) : y ≤ f(x)} is convex. Suppose

y ≤ √x, y' ≤ √x'

Then for λ ∈ (0, 1),

λy + (1 − λ)y' ≤ λ√x + (1 − λ)√x'

But since h(z) = √z is a concave function (h'(z) = .5z^(−.5), h''(z) = −.25z^(−1.5) < 0), we have

λy + (1 − λ)y' ≤ λ√x + (1 − λ)√x' ≤ √(λx + (1 − λ)x')

or

λy + (1 − λ)y' ≤ √(λx + (1 − λ)x')

so that if (x, y) and (x', y') are in the set S, then the convex combination (λx + (1 − λ)x', λy + (1 − λ)y') is in the set, so S is a convex set.
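Both membership arguments can be spot-checked numerically by sampling random points in each set and testing their convex combinations (the sampling scheme and the small slack that keeps points strictly inside each set are my choices):

```python
import random

random.seed(0)

def in_S_sq(x, y):
    # epigraph of f(x) = x**2: the set {(x, y) : y >= x**2}
    return y >= x * x

def in_S_sqrt(x, y):
    # hypograph of f(x) = sqrt(x): the set {(x, y) : x >= 0, y <= sqrt(x)}
    return x >= 0 and y <= x ** 0.5

for _ in range(1000):
    lam = random.random()

    # two random points above the parabola...
    x1, x2 = random.uniform(-3, 3), random.uniform(-3, 3)
    y1 = x1 * x1 + random.uniform(0.001, 5)
    y2 = x2 * x2 + random.uniform(0.001, 5)
    # ...their convex combination is still above it
    assert in_S_sq(lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2)

    # two random points below the square root...
    a1, a2 = random.uniform(0, 9), random.uniform(0, 9)
    b1 = a1 ** 0.5 - random.uniform(0.001, 5)
    b2 = a2 ** 0.5 - random.uniform(0.001, 5)
    # ...their convex combination is still below it
    assert in_S_sqrt(lam * a1 + (1 - lam) * a2, lam * b1 + (1 - lam) * b2)
```

This is only a spot check, of course: the proofs above are what establish convexity for every pair of points.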
Now, the above examples use the definition of convexity to prove this property directly from the definitions. How can we use the second-order test?

Example Consider the function f : R → R given by y = x². Let's convert it into a function F : R² → R as

F(x, y) = y − x²

So as we vary y and x, the original function is represented by the isoquant/indifference curve/one-dimensional manifold F(x, y) = 0, but the function can be translated up and down. Consider the set F(x, y) ≥ 0. This is equivalent to the set {(x, y) : y ≥ f(x)} considered above.

Now, if F(x, y) is a concave function, then its upper contour sets will be convex sets, and we can conclude that the set {(x, y) : F(x, y) ≥ 0} = {(x, y) : y ≥ f(x)} is a convex set.

We show this with the second derivative test. Differentiate F(x, y) twice to get the Hessian (ordering the variables (x, y))

[ −2  0 ]
[  0  0 ]

which is everywhere negative semi-definite. Unlike a function such as x³ − x², whose Hessian is sometimes positive semi-definite and sometimes negative semi-definite, it is uniformly negative semi-definite, so we can conclude the function is concave and its upper contour sets are convex. Note that a faster route to proving concavity is to say that y is a concave function of (x, y) and −x² is a concave function of (x, y), so that their sum must be concave.

Now consider the function f : R₊ → R given by y = √x. This function is not defined on all of R, so we must be careful. Consider the function F(x, y) = y − √x that maps R₊ × R → R. If this is a convex function we can conclude that its lower contour sets are convex. The Hessian is

[ .25x^(−1.5)  0 ]
[ 0            0 ]

which is undefined at x = 0, where the upper-left corner takes the value .25/0. So on the set (R₊\{0}) × R, the Hessian is positive semi-definite, and the function is convex on this set. Consequently, its lower contour sets are convex sets, and we've shown that {(x, y) : y ≤ f(x)} is a convex set. Note that a faster route to proving convexity is to point out that y and −√x are convex functions of (x, y), so that F(x, y) is a sum of convex functions.
The previous two examples were for functions f : R → R. But what about higher-dimensional functions?

Example Let F(z) = x^α y^β, z = (x, y), where α, β > 0 but α + β < 1. We want to show that

S = {(q, z) : q ≤ F(z)}

is a convex set. Suppose

q ≤ x^α y^β, q' ≤ x'^α y'^β

Then for all λ ∈ (0, 1),

λq + (1 − λ)q' ≤ λ x^α y^β + (1 − λ) x'^α y'^β

But we know that F(z) is concave when α + β < 1 and α, β > 0, so we have

λq + (1 − λ)q' ≤ λ x^α y^β + (1 − λ) x'^α y'^β ≤ (λx + (1 − λ)x')^α (λy + (1 − λ)y')^β

or

λq + (1 − λ)q' ≤ (λx + (1 − λ)x')^α (λy + (1 − λ)y')^β

so that the convex combination (λq + (1 − λ)q', λz + (1 − λ)z') is in the set S.

I find this approach straightforward and easy, but to use the function approach rather than the set approach, define G(q, z) = q − F(z). This function is the sum of two convex functions, q and −F(z), so its lower contour sets are convex. Or, you can compute the Hessian and check the signs of its leading principal minors.
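The Hessian route can be checked numerically too: for F(x, y) = x^a y^b on the strictly positive orthant the Hessian entries have closed forms, and a 2×2 symmetric matrix with a strictly negative first entry is negative semi-definite exactly when its determinant is non-negative. A small check at a sample point of my choosing:

```python
# Second-order test for F(x, y) = x**a * y**b with x, y > 0: the sign
# pattern of the Hessian flips as a + b crosses 1.
def hessian(a, b, x, y):
    fxx = a * (a - 1) * x ** (a - 2) * y ** b
    fyy = b * (b - 1) * x ** a * y ** (b - 2)
    fxy = a * b * x ** (a - 1) * y ** (b - 1)
    return fxx, fxy, fyy

x, y = 2.0, 3.0

fxx, fxy, fyy = hessian(0.3, 0.4, x, y)     # a + b = 0.7 < 1: concave
assert fxx < 0 and fxx * fyy - fxy ** 2 > 0

fxx, fxy, fyy = hessian(0.8, 0.9, x, y)     # a + b = 1.7 > 1: det < 0, indefinite
assert fxx * fyy - fxy ** 2 < 0
```

The determinant works out to a·b·(1 − a − b)·x^(2a−2)·y^(2b−2), so its sign is governed entirely by 1 − a − b, matching the concavity condition α + β < 1 used above.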
The next example generalizes all of the previous ones.

Example We will show that the epigraph,

epi f = {(x, α) : α ≥ f(x)}

of a convex function is a convex set. This is the set of points above the graph of f(x). For example, think about f(x) = x²: Shade all the points above the graph, and you get the epigraph.

If f(x) is convex, we want to show that the epigraph is a convex set. This means that for any α ≥ f(x) and α' ≥ f(x'), λα + (1 − λ)α' ≥ f(λx + (1 − λ)x'). Now, we have the inequalities

α ≥ f(x), α' ≥ f(x')

since (x, α) and (x', α') are in the epigraph of f. Then multiplying by λ and 1 − λ yields

λα + (1 − λ)α' ≥ λf(x) + (1 − λ)f(x')

but by convexity of f(x),

λα + (1 − λ)α' ≥ λf(x) + (1 − λ)f(x') ≥ f(λx + (1 − λ)x')

or

λα + (1 − λ)α' ≥ f(λx + (1 − λ)x')

so that if (x, α) and (x', α') are in the epigraph of f, then the convex combination (λx + (1 − λ)x', λα + (1 − λ)α') is in the epigraph, and it is a convex set.

Similarly the hypograph,

hypo f = {(x, α) : α ≤ f(x)}

of a concave function is a convex set. We'll go faster for this one. Suppose

α ≤ f(x), α' ≤ f(x')

Then for λ ∈ (0, 1),

λα + (1 − λ)α' ≤ λf(x) + (1 − λ)f(x')

and by concavity of f(x),

λα + (1 − λ)α' ≤ λf(x) + (1 − λ)f(x') ≤ f(λx + (1 − λ)x')

or

λα + (1 − λ)α' ≤ f(λx + (1 − λ)x')

so that if (x, α) and (x', α') are in the hypograph of f(x), then so is (λx + (1 − λ)x', λα + (1 − λ)α'), and the hypograph of a concave function is a convex set.
Note that the epigraphs and hypographs are NOT the upper and lower contour sets of the function. To visualize them, imagine the surface mapping R^N → R: It is like a sheet floating in space. The epigraph is the sheet and everything above it, while the hypograph is the sheet and everything below it, which are objects in R^(N+1). The upper and lower contour sets are subsets of R^N.

The previous example generalizes the earlier ones to arbitrary concave and convex functions F : R^N → R: Convex functions have convex epigraphs, and concave functions have convex hypographs. These are the production sets Y of chapter four: They are essentially defined as the epigraphs of transformation functions F(y).
Exercises

1. Let (x, y) ∈ R²₊. (i) When is f(x, y) = x^α y^β concave? Quasi-concave? (ii) When is f(x, y) = (x^ρ + y^ρ)^(1/ρ) concave? (iii) When is min{ax, by} concave? Quasi-concave?

2. Show that if f(x) is quasi-concave and h(y) is strictly increasing, then h(f(x)) is quasi-concave. Show that x* maximizes f(x) subject to g(x) = 0 if and only if x* maximizes h(f(x)) subject to g(x) = 0.

3. Show that any increasing function f : R → R is quasi-concave, but not every increasing function is concave. Show that the sum of two concave functions is concave, but the sum of two quasi-concave functions is not necessarily quasi-concave.

4. Prove that if f(x) is convex, then any critical point of f(x) is a global minimum of the unconstrained minimization problem.

5. Show that all concave functions are quasi-concave. Show that all convex functions are quasi-convex.

6. Suppose a firm maximizes profits π(q, K, L) = pq − rK − wL subject to the constraint F(K, L) = q. Show that the value function π(·) is convex in (p, r, w). Prove that the gradient of π with respect to (p, r, w) is equal to (q*, −K*, −L*). Lastly, show that q* is increasing in p, K* is decreasing in r, and L* is decreasing in w. Note how these comparative statics do not rely on the properties of F(K, L).
Part IV
Classical Consumer and Firm Theory
Chapter 13
Choice Theory
13.1 Preference Relations
Definition 13.1.1 Suppose an agent is facing a decision problem where he can choose any x ∈ X. Then for any x, y ∈ X, let x ≽ y if and only if the agent weakly prefers x to y; i.e., whenever x and y are both available, the agent is at least as happy choosing x as y. If he is strictly happier, write x ≻ y, and if he is indifferent between them, write x ∼ y. Then we say ≽ is a preference relation. A preference relation is rational if it is

Complete: For all x, y ∈ X, either x ≽ y or y ≽ x (or both)
Transitive: For all x, y, z ∈ X, if x ≽ y and y ≽ z, then x ≽ z

It is natural for economics to begin with preference relations in mind as a basic object of study. It seems fair to require that agents be able to sensibly rank a list of alternatives from best to worst, allowing for indifference. It might be the case that these preferences are contingent on other circumstances, so that the preference ordering varies across states of the world, such as wealth or weather, or that it is costly for agents to decide whether a given alternative is better than another. It might even be the case that agents' preferences are simply somewhat stochastic, so that the answer to whether a is better than b varies over time holding all else fixed (see the random utility models used in empirical industrial organization). But these issues are all embellishments on the basic idea that agents can rank alternatives. The added restrictions of completeness and transitivity merely reflect a reasonable level of internal consistency: Agents' preferences don't have internal contradictions, and no pairwise comparison is impossible.
Example An agent has choice set X = {x_1, x_2, x_3}.

The following are all rational preferences on X:

x_1 ≻ x_2 ≻ x_3
x_1 ∼ x_2 ≻ x_3
x_1 ∼ x_2 ∼ x_3

Consider the following preferences on X:

x_1 ≻ x_2 ≻ x_3 ≻ x_1

This isn't transitive, since it implies x_1 ≻ x_1. Therefore, these preferences aren't rational.
Consider the following preferences on X:

x_1 ≻ x_2, x_1 ≻ x_3

This isn't complete: No relationship is specified between x_2 and x_3. Therefore, these preferences aren't rational.
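For a finite choice set, completeness and transitivity are directly machine-checkable by enumerating pairs and triples. A sketch covering the three kinds of cases above (the encodings of the preferences are my own):

```python
from itertools import product

def is_rational(X, wp):
    """Check completeness and transitivity of the weak relation wp on finite X."""
    complete = all(wp(x, y) or wp(y, x) for x, y in product(X, repeat=2))
    transitive = all(wp(x, z) for x, y, z in product(X, repeat=3)
                     if wp(x, y) and wp(y, z))
    return complete and transitive

X = ["x1", "x2", "x3"]

# x1 > x2 > x3, encoded by a numerical rank: rational
rank = {"x1": 3, "x2": 2, "x3": 1}
assert is_rational(X, lambda a, b: rank[a] >= rank[b])

# x1 > x2 and x1 > x3 with x2, x3 unranked: incomplete, hence not rational
strict = {("x1", "x2"), ("x1", "x3")}
assert not is_rational(X, lambda a, b: a == b or (a, b) in strict)

# the cycle x1 > x2 > x3 > x1: complete but not transitive
cycle = {("x1", "x2"), ("x2", "x3"), ("x3", "x1")}
assert not is_rational(X, lambda a, b: a == b or (a, b) in cycle)
```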
Here is a non-trivial example of an economic agent with non-rational preferences:

Example There are three persons voting on alternatives a, b, and c. Their preference orderings are

1 : a ≻ b ≻ c
2 : b ≻ c ≻ a
3 : c ≻ a ≻ b

The three persons have resolved to decide the group's preference ordering by doing a sequence of pairwise comparisons of the alternatives, and picking the alternative that wins. So the "agent" of this example is the group itself, not the three persons (who indeed have complete, transitive preferences).

Suppose the group starts by comparing a and b. Then a gets two votes and b gets one, so that a wins, and a ≻ b. Then a and c are compared, and c wins, since it gets two votes and a gets one, so c ≻ a. Then b and c are compared, and b wins, since it gets two votes and c gets one, so b ≻ c. Then the preference ordering is

a ≻ b ≻ c ≻ a ≻ ...

which is non-transitive.

It is arguable that the group studied above using the pairwise run-off social choice function is not rational, since its preferences are non-transitive. In particular, a ≻ b ≻ c ≻ a. This is an example of Arrow's impossibility theorem, which says that even if individual voters have rational preferences, there is, in general, no rule that aggregates those preferences into a rational preference ordering for the entire group. So, when we think about representative agents or talk about societal preferences, it is important to remember that there is no way to guarantee that such aggregative agents are rational (but we will eventually do it, anyway).
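A quick simulation of the group's pairwise votes, with the three rankings hard-coded from the example, confirms the cycle:

```python
# The pairwise-majority "group preference" from the three voters above
rankings = [["a", "b", "c"],   # voter 1: a > b > c
            ["b", "c", "a"],   # voter 2: b > c > a
            ["c", "a", "b"]]   # voter 3: c > a > b

def majority_prefers(x, y):
    # x beats y if a strict majority of voters rank x above y
    votes = sum(1 for r in rankings if r.index(x) < r.index(y))
    return votes > len(rankings) / 2

assert majority_prefers("a", "b")   # a beats b, 2 votes to 1
assert majority_prefers("b", "c")   # b beats c, 2 votes to 1
assert majority_prefers("c", "a")   # c beats a, 2 votes to 1
# a > b > c > a: the group's pairwise-majority preference cycles
```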
What about social or other-regarding preferences?

Example There are two apples. Amy and Bob can each have 0 or 1 apples, so that (x_a, x_b) is an allocation of the apples, where x_a is the number of apples that Amy gets and x_b is the number of apples Bob gets. So there are four possibilities: {(0, 0), (1, 0), (0, 1), (1, 1)}.

Amy cares about equality, so she prefers (1, 1) ≻ (0, 1) ∼ (1, 0) ≻ (0, 0). Bob actually prefers inequality in his own favor, and hates it when others do better than him, so that (0, 1) ≻ (1, 0) ∼ (1, 1) ≻ (0, 0).

These preferences depend on the consumption bundles of other agents, but are completely rational: Each agent has complete, transitive preferences over all the possible outcomes.

It is important to note that we are economists, so our job is not to judge preferences and say some are better than others, or that some fail to make sense to us. Our job is to explain how preferences translate into economic outcomes of interest, and we impose restrictions on ≽ only insofar as the assumptions allow us to make useful inferences, not because we agree or disagree with agents' choices or desires.
13.2 Utility Functions
Preference relations are not typically very convenient to work with; in particular, maximizing them requires searching through all the possible alternatives for the best option. This makes it difficult to use the kind of maximization theory developed in the first part of the course.

Definition 13.2.1 A function u : X → R is a utility function representing preference relation ≽ if, for all x, y ∈ X,

x ≽ y ⟺ u(x) ≥ u(y)

The big question for economists is, "When can ≽ be represented by a utility function u, so that optimization theory and comparative statics can be used?" In particular, we want to know whether utility functions and preference relations contain all the same information, so that working with one is equivalent to working with the other.

Theorem 13.2.2 (1.B.2) If a preference relation can be represented by a utility function, then it is rational.

What do we have to do to prove the theorem? Just show that if we have a utility function representing ≽, then ≽ is transitive and complete.

Proof Transitivity: Suppose x ≽ y and y ≽ z. Since u(·) represents ≽, this implies u(x) ≥ u(y) and u(y) ≥ u(z). Then u(x) ≥ u(z), by transitivity of ≥ on the real numbers. Since u(·) represents ≽, then, x ≽ z, and ≽ is transitive.

Completeness: For any elements x, y, we have u(x) ≥ u(y) or u(y) ≥ u(x), since the real numbers are completely ordered. But then since u(·) represents ≽, that means either x ≽ y or y ≽ x, so ≽ is complete.

So we've shown that if we have a utility function representing some preferences, they must be rational. But when is rationality enough to guarantee that there is a utility function representing the preferences? Unfortunately, it depends on the topology of the choice set and the behavior of the preference ordering, but there are two cases that can easily be studied (there are plenty of others you could understand fairly quickly, but they would require new definitions and proofs that take time):

Theorem 13.2.3 Suppose the choice set X is finite. Then a utility function u(·) represents preferences ≽ if and only if they are complete and transitive.

We already have one direction from the previous theorem. We need to show that any complete and transitive preference relation can be represented by a utility function when the choice set is finite.
Proof The proof will be by induction on the size of the choice set. Note that if we have a rational preference relation on X, then the preference relation is also rational on any X' ⊆ X, because rationality is decided by pairwise comparisons, which are unchanged by considering a smaller choice set on which ≽ is already rational. So we can start by constructing a utility function for a set of two elements, then work our way back up to the entire set.

(Base Case) Let's start with a choice set of two options, x_0 and x_1. Because ≽ is complete and transitive, either x_0 ∼ x_1, x_0 ≻ x_1, or x_1 ≻ x_0. If x_0 ∼ x_1, assign them both to the number u(x_0) = u(x_1) = .5. If x_0 ≻ x_1, assign u(x_0) = .75 and u(x_1) = .25. If x_1 ≻ x_0, assign u(x_1) = .75 and u(x_0) = .25. Then u(·) represents ≽ on X_1 = {x_0, x_1}.

(Induction Case) At the k-th step, suppose we have a utility function u(·) that represents ≽ over k + 1 elements, X_k. We will show there is a way to extend u(·) to k + 2 elements in X_(k+1) so that the extension represents ≽.

First, pick an element x_k from X\X_k. We need to add it to the utility function in such a way that u(·) represents ≽ on X_(k+1) = X_k ∪ {x_k}. There are four cases:

1. We compare x_k to all the elements in X_k, and find that it is worse than all of them; this is possible because ≽ is transitive and complete. Take one of the worst elements in X_k, x̲, and find its utility value, u(x̲). Set

u(x_k) = u(x̲)/2

Then u(·) represents ≽ on X_(k+1) by construction.

2. We compare x_k to all the elements in X_k, and find that it is better than all of them; this is possible because ≽ is transitive and complete. Take one of the best elements in X_k, x̄, and find its utility value, u(x̄). Set

u(x_k) = (1 − u(x̄))/2 + u(x̄)

Then u(·) represents ≽ on X_(k+1) by construction.

3. We compare x_k to all the elements in X_k, and find that x_b ≻ x_k ≻ x_a, with no other elements x_c of X_k between x_b and x_a, so that x_b ≻ x_c ≻ x_a (if there were such an element, we could compare it to x_k and see if we were still in this case, with x_c now taking the role of x_a or x_b; or if we end up with an indifference x_k ∼ x_c, we move to the next case); this is possible because ≽ is transitive and complete. Then set

u(x_k) = (u(x_b) − u(x_a))/2 + u(x_a)

Then u(·) represents ≽ on X_(k+1) by construction.

4. We compare x_k to all the elements in X_k, and find that x_k ∼ x_c for some x_c ∈ X_k; this is possible because ≽ is transitive and complete. Then set

u(x_k) = u(x_c)

Then u(·) represents ≽ on X_(k+1) by construction.

So by induction, we can construct a utility function u(·) that represents ≽ on any finite choice set X.
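As an aside, for finite X there is also a non-inductive construction worth knowing: assign each alternative the number of alternatives it is weakly preferred to. A sketch (the ranking used to generate the example preferences is hypothetical):

```python
# For finite X and rational weak preference wp, the counting function
# u(x) = #{y in X : x is weakly preferred to y} represents the preferences:
# transitivity makes u monotone in wp, completeness makes it strict when
# the preference is strict.
def make_utility(X, wp):
    return {x: sum(1 for y in X if wp(x, y)) for x in X}

X = ["a", "b", "c", "d"]
rank = {"a": 2, "b": 2, "c": 1, "d": 0}          # a ~ b > c > d
wp = lambda x, y: rank[x] >= rank[y]
u = make_utility(X, wp)

# u represents the preferences: x weakly preferred to y iff u(x) >= u(y)
assert all(wp(x, y) == (u[x] >= u[y]) for x in X for y in X)
```

This works for exactly the same reason the induction does: rationality means the alternatives can be lined up from worst to best (with ties), and any strictly order-preserving assignment of numbers represents the relation.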
For uncountable choice sets (like R^n), the induction proof given above doesn't work, since we can't sit around doing pairwise comparisons: we would literally never get through all the possibilities, even if given a countably infinite amount of time to do the comparisons.
Definition 13.2.4 Say that preferences ≽ are continuous on X if, for any sequence of pairwise comparisons {(x_n, y_n)} for n = 1, 2, ... with x_n ≽ y_n for all n, lim_(n→∞) x_n = x, and lim_(n→∞) y_n = y, we have x ≽ y.

So for a violation of continuity, you need to construct a sequence of choices where arbitrarily close to the limit x_n ≽ y_n, but at the limit y ≻ x. This means that for an infinitesimal change in the things being compared, the agent's preferences suddenly reverse. MWG have a good example:
Example Preferences are lexicographic on R² if, for all pairs x = (x_1, x_2) and y = (y_1, y_2), x ≻ y if x_1 > y_1, and if x_1 = y_1, then x ≽ y if x_2 ≥ y_2. This basically means that there is one dimension that is always more important than the other dimension, and if two options are equal on it, then the second dimension decides the choice. For example, in a computer, you might only care about speed and not care about monitor size unless two computers are equally fast. Or you might have children and put the safety features of a car above all considerations of horsepower, but compare two equally safe cars on the dimension of horsepower.

These preferences are discontinuous, however. Take the sequences

x_n = (1/n, 0)

and

y_n = (0, 1)

Then for all finite n, x_n ≻ y_n because 1/n > 0, but the limits are x = (0, 0) and y = (0, 1), so the preference reverses at the limit: y ≻ x. These preferences also cannot be represented by a utility function.
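The reversal at the limit is easy to see computationally; exact rationals avoid any rounding issues in the first coordinate:

```python
from fractions import Fraction

def lex_weakly_prefers(x, y):
    # lexicographic preferences on R^2: the first coordinate always dominates
    return x[0] > y[0] or (x[0] == y[0] and x[1] >= y[1])

# Along the sequence x_n = (1/n, 0), y_n = (0, 1), x_n is strictly preferred...
for n in range(1, 1000):
    xn, yn = (Fraction(1, n), 0), (0, 1)
    assert lex_weakly_prefers(xn, yn) and not lex_weakly_prefers(yn, xn)

# ...but at the limits x = (0, 0), y = (0, 1), the ranking reverses
x, y = (0, 0), (0, 1)
assert lex_weakly_prefers(y, x) and not lex_weakly_prefers(x, y)
```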
So it isn't guaranteed that rationality implies that the preferences can be represented by a utility function. But it turns out that

Theorem 13.2.5 (3.C.1) Suppose that the preference relation ≽ on X is rational and continuous. Then there exists a continuous utility function u(·) representing ≽.

So sufficiently smooth preferences can be represented not just by any utility function, but by a continuous one. Consequently, we know that a rational, continuous preference relation on a compact set has a maximizer, by the extreme value theorem.
13.3 Consistent Decision-Making and WARP
Economics, however, aspires to be an observational science. In particular, we don't ever actually get to observe ≽, only the way that agents actually choose, and, when we're lucky, some detailed information about the circumstances under which those choices were made.

For example, you might have data about which digital camcorders consumers purchased over five years. If you are even luckier, you have data about those consumers' incomes and whether they have children. Ideally, you even know that the consumers all bought the camcorders online, and can observe their browsing behavior, such as what online stores they browsed, so you can reconstruct the prices and products that they were comparing. But even with all of this detailed information, you still only get to observe their final choice, not the pairwise comparisons among all the possible options they faced.

So we want to say something about whether the observed choices of agents are consistent with them having a rational preference relation.

Definition 13.3.1 A choice structure on X is (i) a family B of non-empty subsets B of X, and (ii) a choice rule C assigning a set of chosen elements to each B ∈ B: C(B) ⊆ B.

Example Let X = {x_1, x_2, x_3}. Then let B = {B_1 = {x_1, x_2}, B_2 = {x_2, x_3}}. Then if an agent has preferences x_1 ≻ x_2 ≻ x_3, he would behave as C(B_1) = {x_1} and C(B_2) = {x_2}.

Definition 13.3.2 A choice structure (B, C) satisfies the weak axiom of revealed preference (WARP) if for any B ∈ B with x, y ∈ B and x ∈ C(B), then for any B' ∈ B with x, y ∈ B' and y ∈ C(B'), we also have x ∈ C(B').

Whenever you get a very ambiguous-looking definition, you should negate it immediately. Sometimes the negation of a definition is more intuitive than the definition itself (and you get a theorem for free):

Theorem 13.3.3 A choice structure (B, C) does not satisfy WARP if there exist two budget sets B, B' ∈ B with x, y ∈ B ∩ B', x ∈ C(B), y ∈ C(B'), but x ∉ C(B').
So if we see x chosen in the presence of y once, then x must be chosen whenever y is chosen and x is available. This keeps decision-makers from making choices that seem to obviously contradict the preferences they have exhibited in other situations.

Example Let

X = {x, y, z}
B_1 = {x, y}
B_2 = {x, y, z}

If the agent chooses C(B_1) = {x} and C(B_2) = {z}, they seem to be making consistent choices: this person probably prefers x to y, and z to x and y. But if C(B_1) = {x} and C(B_2) = {y}, we certainly have a violation. The option to choose z has now motivated the agent to pick y over x. Why would the existence of z as an option affect the behavior of the agent in this way?

The interesting case you don't want to make a mistake with is the following: Suppose C(B_1) = {y} and C(B_2) = {x, y}. This violates WARP, because we observed y chosen along with x at B_2, but now we're throwing x away at B_1. Where exactly is the violation? Go back to the negation of WARP: Let B = B_2 and B' = B_1. Then we have x ∈ C(B) and y ∈ C(B'), but x ∉ C(B'). Another way of looking at this is that the decision-maker is elevating elements into choice sets when they've previously been discarded in favor of other options; this also violates WARP.
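The negation in Theorem 13.3.3 translates directly into a brute-force check over pairs of budget sets. A sketch (the helper `satisfies_warp` and the dictionary encoding of the choice rule are my own):

```python
from itertools import product

def satisfies_warp(choice):
    """choice maps each budget set (a frozenset) to the set of chosen elements."""
    for B, Bp in product(choice, repeat=2):
        for x, y in product(B & Bp, repeat=2):
            # negation of WARP: x chosen at B, y chosen at B', but x dropped at B'
            if x in choice[B] and y in choice[Bp] and x not in choice[Bp]:
                return False
    return True

B1, B2 = frozenset({"x", "y"}), frozenset({"x", "y", "z"})

assert satisfies_warp({B1: {"x"}, B2: {"z"}})          # consistent choices
assert not satisfies_warp({B1: {"x"}, B2: {"y"}})      # z's presence flips x and y
assert not satisfies_warp({B1: {"y"}, B2: {"x", "y"}}) # x discarded at B1
```

The three assertions are exactly the three cases of the example above.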
However, if we only observe some of an agent's choices, we can't be sure if the agent's preferences are rational or not. We can only discuss whether a given preference relation is consistent with the observed behavior, and whether observed behavior is consistent with a given preference relation:

Definition 13.3.4 Suppose that ≽ is a preference relation over X, and let B be any collection of choice sets. Then the choice rule

C*(B, ≽) = {x ∈ B : x ≽ y for all y ∈ B}

is generated by ≽. Similarly, a choice structure (B, C(·)) is rationalized by a preference relation ≽ if C(B) = C*(B, ≽) for all B ∈ B. Given a choice structure (B, C(·)), the revealed preference relation ≽* is defined as x ≽* y if there is some B ∈ B such that x, y ∈ B and x ∈ C(B).
So when are choice structures generated by preference relations, and when can preference relations be observed from choice structures?

Example Suppose X = {w, x, y, z}. Then all of the possible budget sets are

wx, wy, wz, xy, xz, yz, wxy, wyz, wxz, xyz, wxyz

Let

B = {{x, y, z}, {w, y, z}}

Suppose C({x, y, z}) = {x, y} and C({w, y, z}) = {w}. Then we can say that x and y are revealed preferred to z, and that w is revealed preferred to y and z. If the decision-maker's true preferences satisfy WARP, then

C({x, y}) = {x, y}, C({w, y}) = {w}, C({w, z}) = {w}, C({z, y}) = {y}, C({x, z}) = {x}

That leaves

wx, wxyz, wxz, wxy

as the only sets for which we might not be sure about what the decision-maker will do. But from WARP, we know that whenever x is chosen and y is available, y must be chosen as well. We also know that w was chosen over y, so y can never be chosen when w is available.

Now, can x be chosen with w? We've seen x chosen with y, so to avoid a violation of WARP, we can't see x chosen but y not chosen when both are available. But w is revealed preferred to y. Therefore, there can't be a choice set with w, x, y where x or y is chosen. Therefore,

C({w, x, y, z}) = C({w, x, z}) = C({w, x, y}) = {w}

But if C({w, x, y}) = {w}, then w would be revealed preferred to x. Then our best guess of the agent's preferences given WARP is w ≻ x ∼ y ≻ z.

Now suppose

B = {{w, y}, {x, z}}

and

C({w, y}) = {w}, C({x, z}) = {x}

What can we conclude? Well, w ≽* y, so y can never be chosen with or above w if ≽ satisfies WARP, and x ≽* z, so z can never be chosen with or above x, if ≽ satisfies WARP. But we have no clues as to whether w is better or worse than x. Here are some preference relations that rationalize the observed choices:

w ≻ x ≻ y ≻ z
w ≻ y ≻ x ≻ z
x ≻ w ≻ z ≻ y

and so on. So it's not obvious that preferences can be recovered from observed choices.
So we want a general idea of when we can or can't recover true preferences from observed choices.

Theorem 13.3.5 (1.D.1) Suppose that ≽ is a rational preference relation. Then the choice structure (B, C*(B, ≽)) generated by ≽ satisfies the weak axiom.

Proof Suppose that for some B ∈ B, we have x, y ∈ B and x ∈ C*(B, ≽) (this is the set-up of WARP). The definition of C*(B, ≽) then implies that x ≽ y. Now take some B' ∈ B with x, y ∈ B', and y ∈ C*(B', ≽). Then for any other z ∈ B', we must have y ≽ z (if z is not chosen, y ≻ z, and if z ∈ C*(B', ≽), y ∼ z). But we know that x ≽ y, and y ≽ z, so by transitivity, x ≽ z for all z ∈ B', so x ∈ C*(B', ≽). Therefore, the weak axiom is satisfied.
The following is a partial converse to the previous theorem:
Theorem 13.3.6 (1.D.2) Suppose (B, C(B)) is a choice structure that satises the weak axiom,
and B contains all subsets of X of up to three elements. Then there is a rational preference relation
that rationalizes C(B) on B. This relation is unique.
Proof First, the revealed preference relation

is rational. Since B contains all choice sets up to


three elements, we can see how the decision-maker does all pairwise comparisons: For any x, y, we
have B = {x, y}, and C({x, y}) provides either x

y or y

x or both, so

is complete. To
show transitivity, assume x

y and y

z. Consider the three-element budget set {x, y, z} B.


Suppose that y C({x, y, z}). Since x

y, then x C({x, y, z}) as well, since x is assumed to


be revealed preferred to y. Now if z C({x, y, z}), then it must be the case that y C({x, y, z}),
since y was revealed preferred to z. But then x must also be in C({x, y, z}), since x was revealed
preferred to y. Therefore, x

z, since it is chosen from {x, y, z} in any case.


153
Now we have to show that C*(·, ⪰*) is the same function as C(·): if we knew the agent's revealed preference relation ⪰* on 𝓑, we could make all the same decisions as he would. First, suppose x ∈ C(B). Then x is revealed preferred to all other elements y ∈ B, so x ∈ C*(B, ⪰*), and C(B) ⊆ C*(B, ⪰*). Now suppose that x ∈ C*(B, ⪰*); then x ⪰* y for any other y ∈ B. Then for any y ∈ B, consider the sets of the form B_y ∈ 𝓑 with {x, y} ⊆ B_y and x ∈ C(B_y). The weak axiom then implies that x ∈ C(B), because x is revealed preferred to y in the smaller set B_y, so C*(B, ⪰*) ⊆ C(B). Consequently, C*(B, ⪰*) = C(B). ∎
So revealed preference axioms like WARP are central to constructing an equivalence between the abstract idea of a preference relation ⪰ that represents an agent's tastes and the choice structure (𝓑, C(·)) that the agent is observed to exhibit. If WARP is satisfied, we can think of choice structures and preference relations as equivalent objects. If WARP fails, however, this equivalence may fail.
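The equivalence between WARP and rationalizability is easy to experiment with on small finite choice structures. The following is a minimal sketch of my own (the function name and the three-alternative example are not from MWG) that checks the weak axiom directly from a table of budgets and chosen sets:

```python
def satisfies_warp(choice):
    """choice maps each budget (a frozenset) to its chosen set (a frozenset).
    WARP: if x and y are both in budgets B and B2, x is chosen from B, and
    y is chosen from B2, then x must be chosen from B2 as well."""
    for B, CB in choice.items():
        for B2, CB2 in choice.items():
            for x in CB:
                for y in B:
                    if x in B2 and y in B2 and y in CB2 and x not in CB2:
                        return False
    return True

# Choices generated by the rational preference x > y > z satisfy WARP:
rationalizable = {
    frozenset("xy"): frozenset("x"),
    frozenset("yz"): frozenset("y"),
    frozenset("xz"): frozenset("x"),
    frozenset("xyz"): frozenset("x"),
}
# Overriding the choice from {x, y, z} contradicts C({x, y}) and breaks WARP:
inconsistent = {**rationalizable, frozenset("xyz"): frozenset("y")}
```

Running `satisfies_warp` on the two structures illustrates Theorems 13.3.5 and 13.3.6: the first is rationalizable, the second cannot be rationalized by any preference relation.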
Exercises
MWG: 1.B.1, 1.B.2, 1.C.1, 1.B.4, 1.C.2, 1.D.3, 1.D.4
Chapter 14
Consumer Behavior
Imagine consumers are agents with preferences represented by a continuous utility function u(x) over bundles of goods. The bundles are vectors in R^L, such as (apples, car tires, socks, ..., pencils). The prices of goods are a vector p = (p_1, p_2, ..., p_L). Then the consumer chooses

    max_x u(x)  such that  p · x ≤ w.
14.1 Consumer Choice
For the case of preferences over goods, there are some useful additional definitions:

Definition 14.1.1 The preference relation ⪰ is
• monotone on X if y ≥ x implies y ⪰ x
• strongly monotone if y ≥ x and y ≠ x imply y ≻ x
• locally non-satiated if, for all x and every ε > 0, there is a y such that ||y − x|| ≤ ε and y ≻ x
• convex if, for all x, y, z and α ∈ (0, 1), if y ⪰ x and z ⪰ x, then αy + (1 − α)z ⪰ x
• strictly convex if, for all x, y, z with y ≠ z and α ∈ (0, 1), if y ⪰ x and z ⪰ x, then αy + (1 − α)z ≻ x
As we know, if ⪰ is continuous, it can be represented by a continuous utility function, u(x). Since the budget set {x ∈ X : p · x ≤ w} is compact (for p ≫ 0), a solution exists to the utility maximization problem.

Definition 14.1.2 Suppose the consumer's choice set is X ⊆ R^L. Then the consumer's competitive budget set is

    B(p, w) = {x ∈ X : p · x ≤ w}

We proved that a closed subset of a compact set is compact, so if X is compact, so is B(p, w). The points in B(p, w) are feasible for the consumer. A particular bundle x* ∈ B(p, w) is optimal if u(x*) ≥ u(x') for all x' ∈ B(p, w).

If u(x) is a differentiable function, the problem described above can be formulated as a constrained maximization problem with Lagrangean

    L(x, λ, μ) = u(x) + λ(w − p · x) + μ · x
Since u(x) is locally non-satiated, Walras' law holds, and the solution will satisfy the constraint with equality. The Kuhn–Tucker multipliers μ acknowledge the possibility of a corner solution where some good is not consumed at all. Then the first-order necessary conditions are

    0 = Du(x) − λp + μ

and

    w − p · x = 0

If there is an interior solution where μ = 0, the FONCs imply that the gradient of the utility function is a scalar multiple of p: Du(x) = λp. Moreover, if u(·) is strictly quasiconcave, there will be a unique interior solution, so that we can use comparative statics to find relationships between x_k and x_ℓ.
14.2 Walrasian Demand
The solution is called the Walrasian demand (function or correspondence), x(p, w). Before doing anything with maximization, some things can already be proved:

Theorem 14.2.1 Suppose u is a continuous utility function representing a locally non-satiated preference relation ⪰ defined on the consumption set X ⊆ R^L. Then Walrasian demand satisfies
• Homogeneity of degree zero: x(αp, αw) = x(p, w), for all p, w, and α > 0.
• Walras' law: p · x = w for all x ∈ x(p, w).
• Convexity/uniqueness: If ⪰ is convex, so that u(·) is quasiconcave, then x(p, w) is a convex set. If ⪰ is strictly convex, so that u(·) is strictly quasiconcave, then x(p, w) is a singleton.
Proof (i) If the original budget set is B(p, w) = {x : p · x ≤ w}, consider B(αp, αw) = {x : αp · x ≤ αw} = {x : p · x ≤ w} = B(p, w). The choice set is the same, so the maximizer is the same. (ii) Suppose the agent chose x (assume x is optimal) but p · x < w. By local non-satiation, for any ε > 0 there exists a y satisfying ||x − y|| < ε and y ≻ x. For ε small enough, y will also satisfy p · y ≤ w (y is affordable). But this contradicts x being optimal. Therefore, any optimizer must satisfy p · x = w. (iii) Suppose u(·) is quasiconcave and that x ≠ x' are both optimal (x, x' ∈ x(p, w)). Note that x'' = αx + (1 − α)x' satisfies p · x'' = p · (αx + (1 − α)x') = αp · x + (1 − α)p · x' ≤ w (so x'' is feasible). If x and x' are both optimal, u(x) = u(x'); by quasiconcavity, u(x'') ≥ min{u(x), u(x')} = u(x), so x'' is optimal. Therefore, x'' ∈ x(p, w), and x(p, w) is a convex set. If u is strictly quasiconcave, then u(x'') > min{u(x), u(x')} = u(x) for all α ∈ (0, 1). Since x'' is feasible, this contradicts the assumption that x, x' ∈ x(p, w), so x(p, w) must be a singleton. ∎
Before using any information about the preferences, we can learn a lot about Walrasian demand just from the previous proposition. In particular, we want to start studying how changes in prices or wealth change agents' behavior. The wealth effect for the ℓ-th good is

    ∂x_ℓ(p, w)/∂w

A good is normal if ∂x_ℓ/∂w ≥ 0, and inferior if ∂x_ℓ/∂w < 0. The price effect of p_k on the ℓ-th good is

    ∂x_ℓ(p, w)/∂p_k

You might think ∂x_ℓ/∂p_ℓ ≤ 0, so that the own-price effect is negative, but goods for which ∂x_ℓ/∂p_ℓ > 0 exist; they are called Giffen goods.

From these assumptions, we might start to search for interesting comparative statics results. Unfortunately, with so little assumed, it is hard to say much:
Proposition 14.2.2 (2.E.1–2.E.3) Suppose u is a continuous utility function representing a locally non-satiated preference relation ⪰ defined on the consumption set X ⊆ R^L.
• If the Walrasian demand function x(p, w) is homogeneous of degree zero, then for all (p, w),

    D_p x(p, w)p + D_w x(p, w)w = 0

or, for each ℓ = 1, ..., L,

    Σ_{k=1}^{L} [∂x_ℓ(p, w)/∂p_k] p_k + [∂x_ℓ(p, w)/∂w] w = 0

• If x(p, w) satisfies Walras' law, then for all (p, w),

    p · D_p x(p, w) + x(p, w)^T = 0^T

or, for each k = 1, ..., L,

    Σ_{ℓ=1}^{L} p_ℓ [∂x_ℓ(p, w)/∂p_k] + x_k(p, w) = 0

• If x(p, w) satisfies Walras' law, then for all (p, w),

    p · D_w x(p, w) = 1

or

    Σ_{ℓ=1}^{L} p_ℓ [∂x_ℓ(p, w)/∂w] = 1
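These three aggregation identities are easy to verify numerically. Below is a sketch using a Cobb-Douglas demand system x_ℓ = a_ℓ w / p_ℓ (which satisfies homogeneity and Walras' law); the expenditure shares, prices, and wealth are arbitrary choices of mine, and the price and wealth derivatives are approximated by central finite differences:

```python
# Cobb-Douglas Walrasian demand with expenditure shares a: x_l = a[l] * w / p[l].
a = (0.3, 0.7)

def x(p, w):
    return [a[l] * w / p[l] for l in range(2)]

def dx_dp(p, w, k, eps=1e-6):
    """Central finite difference of the demand vector in price k."""
    up = list(p); up[k] += eps
    dn = list(p); dn[k] -= eps
    return [(u - d) / (2 * eps) for u, d in zip(x(up, w), x(dn, w))]

def dx_dw(p, w, eps=1e-6):
    return [(u - d) / (2 * eps) for u, d in zip(x(p, w + eps), x(p, w - eps))]

p, w = [2.0, 5.0], 10.0
Dp = [dx_dp(p, w, k) for k in range(2)]   # Dp[k][l] = d x_l / d p_k
Dw = dx_dw(p, w)

# (i) homogeneity: sum_k p_k dx_l/dp_k + w dx_l/dw = 0 for each l
homog = [sum(p[k] * Dp[k][l] for k in range(2)) + w * Dw[l] for l in range(2)]
# (ii) Cournot aggregation: sum_l p_l dx_l/dp_k + x_k = 0 for each k
cournot = [sum(p[l] * Dp[k][l] for l in range(2)) + x(p, w)[k] for k in range(2)]
# (iii) Engel aggregation: sum_l p_l dx_l/dw = 1
engel = sum(p[l] * Dw[l] for l in range(2))
```

All three residuals come out at zero (up to finite-difference error), exactly as the proposition requires.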
14.3 The Law of Demand
The archetypal example of a comparative statics result is the Law of Demand: if the price of a good goes up, the quantity demanded by consumers goes down. Does that hold for our full-blown consumer model? Likewise, how does this idea of quantity demanded and price moving in opposite directions relate to WARP and consistent decision-making?

We developed a weak axiom of revealed preference for situations where X was simply an abstract choice set. Now that we have specialized to the case of consumers with competitive budget sets, we can restate that idea:

Definition 14.3.1 Suppose u is a continuous utility function representing a locally non-satiated preference relation ⪰ defined on the consumption set X ⊆ R^L. The Walrasian demand function x(p, w) satisfies the weak axiom of revealed preference (WARP) if the following holds for any two price-wealth situations (p, w) and (p', w'): If p · x(p', w') ≤ w and x(p', w') ≠ x(p, w), then p' · x(p, w) > w'.
In other words, if x(p, w) is chosen over x(p', w') at (p, w) when x(p', w') was affordable, then x(p, w) must be unaffordable at (p', w').

What we want to study now is how behavior changes when prices and wealth change. Define

    Δp = p' − p,    w' = p' · x(p, w)

A compensated price change is any move from one price-wealth pair (p, w) to (p', w') that satisfies (p', w') = (p', p' · x(p, w)). This kind of change ensures that if x(p, w) was chosen at (p, w), the same bundle is exactly affordable at the new price-wealth pair (p', p' · x(p, w)).
Proposition 14.3.2 (2.F.1) Suppose that the Walrasian demand function x(p, w) is homogeneous of degree zero and satisfies Walras' law. Then x(p, w) satisfies WARP if and only if, for any compensated price change from (p, w) to (p', w') = (p', p' · x(p, w)),

    (p' − p) · (x(p', w') − x(p, w)) = Δp · Δx ≤ 0

with strict inequality whenever x(p, w) ≠ x(p', w') (this is the compensated law of demand).

Proof (WARP implies CLD) Suppose x(p, w) ≠ x(p', w'). Then

    (p' − p) · (x(p', w') − x(p, w)) = p' · (x(p', w') − x(p, w)) − p · (x(p', w') − x(p, w))

Because the price change is compensated, p' · x(p, w) = w', and Walras' law implies p' · x(p', w') = w', so the first term is zero. Since x(p, w) is affordable at (p', w') and x(p', w') ≠ x(p, w), WARP implies that p · x(p', w') > w, while Walras' law implies that p · x(p, w) = w. Then

    (p' − p) · (x(p', w') − x(p, w)) = −(p · x(p', w') − p · x(p, w)) = −(p · x(p', w') − w) < 0

So the inequality holds, and strictly.
(CLD implies WARP) See p. 31 in MWG. ∎
So the weak axiom of revealed preference is equivalent to a form of the law of demand, that prices and quantities move in opposite directions under compensated changes: Δp · Δx ≤ 0.

The differential version of the CLD is that dp · dx ≤ 0 whenever dw = x(p, w) · dp. This follows from the string of manipulations:

    dx = D_p x(p, w)dp + D_w x(p, w)dw
    dx = D_p x(p, w)dp + D_w x(p, w)[x(p, w) · dp]
    dx = [D_p x(p, w) + D_w x(p, w)x(p, w)^T]dp

Multiplying both sides by dp yields, by the CLD,

    dp · dx = dp · [D_p x(p, w) + D_w x(p, w)x(p, w)^T]dp ≤ 0

Then the matrix D_p x(p, w) + D_w x(p, w)x(p, w)^T must be negative semi-definite, since the compensated price change dp is arbitrary. This special matrix is called the Slutsky matrix,

    S(p, w) = D_p x(p, w) + D_w x(p, w)x(p, w)^T

and its properties play a central role in determining whether consumer behavior is rational or not. The above has shown:

Proposition 14.3.3 If the compensated law of demand holds, the Slutsky matrix must be negative semi-definite.
The entries of the Slutsky matrix are

    s_ℓk(p, w) = ∂x_ℓ(p, w)/∂p_k + [∂x_ℓ(p, w)/∂w] x_k(p, w)

This describes how a consumer adjusts his consumption of x_ℓ in response to a differential change in the price p_k when wealth w is adjusted to keep the original bundle affordable. In particular,

    s_ℓk(p, w) = ∂x_ℓ(p, w)/∂p_k    [price effect]
               + [∂x_ℓ(p, w)/∂w] x_k(p, w)    [wealth effect]
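As a concrete sketch (my own numbers, not MWG's), the Slutsky matrix can be assembled from closed-form Cobb-Douglas derivatives and its quadratic form checked for negative semi-definiteness; for this demand system S also turns out to be symmetric with S·p = 0, properties established more generally later via Hicksian demand:

```python
# Slutsky matrix for Cobb-Douglas demand x_l = a[l] * w / p[l], built from
# the closed-form derivatives; shares, prices, and wealth are arbitrary.
a = (0.4, 0.6)
p, w = (2.0, 4.0), 12.0
x = [a[l] * w / p[l] for l in range(2)]
dxdp = [[-a[l] * w / p[l] ** 2 if l == k else 0.0 for k in range(2)]
        for l in range(2)]                       # dxdp[l][k] = dx_l / dp_k
dxdw = [a[l] / p[l] for l in range(2)]
S = [[dxdp[l][k] + dxdw[l] * x[k] for k in range(2)] for l in range(2)]

def quad(v):
    """Quadratic form v' S v; should be <= 0 in every direction v."""
    return sum(v[l] * S[l][k] * v[k] for l in range(2) for k in range(2))

Sp = [sum(S[l][k] * p[k] for k in range(2)) for l in range(2)]   # S p
```

Evaluating `quad` in several directions gives non-positive values, and `Sp` is the zero vector, consistent with the theory.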
14.4 The Indirect Utility Function
If x*(p, w) is a solution of the utility maximization problem

    max_{x : p·x ≤ w} u(x)

then the indirect utility function is the function

    v(p, w) = u(x*(p, w))

This gives the utility value of the consumer for any price-wealth situation (p, w).

Proposition 14.4.1 (3.D.3) The indirect utility function v(p, w) is
• homogeneous of degree zero
• strictly increasing in w and non-increasing in p_ℓ for each ℓ
• quasiconvex: the set {(p, w) : v(p, w) ≤ v̄} is convex for any v̄
• continuous in (p, w)

These results all follow because you can't lower a consumer's wealth and make him better off, or raise prices and make him better off, since these both shrink the choice set. The proofs of each point formalize one implication of that fact.
Exercises
MWG : 2E1, 2E2, 2E3, 2E5, 3B1, 3B2, 3C6, 3D6
Chapter 15
Consumer Behavior, II
For any maximization problem

    max_x u(x)

subject to b(x) = w, we can consider the problem

    min_x b(x)

subject to u(x) = y. We call the original problem the primal problem, and the second program, with the roles of the objective and constraint reversed, the dual problem.

At first, you might wonder why we would study such a problem. The Lagrangian for the primal is

    L_p = u(x) − λ_p(b(x) − w)

with FONCs

    ∇u(x_p) = λ_p ∇b(x_p)
    b(x_p) − w = 0

and the Lagrangian for the dual is

    L_d = b(x) − λ_d(u(x) − y)

with FONCs

    ∇b(x_d) = λ_d ∇u(x_d)
    u(x_d) − y = 0

These two problems have the same solution when the constraint levels are matched, so that b(x_p) = w = b(x_d) and u(x_p) = y = u(x_d) (in which case λ_d = 1/λ_p). But this is actually incredibly useful information. As we vary exogenous parameters, we know that the primal and dual solutions x_p and x_d start off with the same value, but react differently to the variation in things like prices and wealth. This gives us a variety of angles of attack on the problem that can clarify what changes are a result of price effects (relate to the dual, which does not involve wealth) and what changes are a result of wealth effects (relate to the primal, which does).
15.1 The Expenditure Minimization Problem
Consider the dual of the utility maximization problem, the expenditure minimization problem:

    min_x p · x

subject to u(x) ≥ u. If u(x) is differentiable, we can form the Lagrangean and first-order necessary conditions for a minimum:

    L(x, λ) = p · x − λ(u(x) − u)

yielding the FONCs

    p = λ∇u(x)
    u(x) = u

The solution to this problem is called Hicksian demand or compensated demand, h(p, u), in parallel with Walrasian/uncompensated demand.

For example, consider u(x) = x_1^α x_2^β. Then the EMP is

    min_x p_1 x_1 + p_2 x_2

subject to x_1^α x_2^β ≥ u. This has Lagrangean
    L = p_1 x_1 + p_2 x_2 − λ(x_1^α x_2^β − u)

and solution

    h_1 = [ (αp_2/(βp_1))^β u ]^{1/(α+β)}
    h_2 = (βp_1/(αp_2)) [ (αp_2/(βp_1))^β u ]^{1/(α+β)}
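As a sanity check on these formulas (the parameter values below are arbitrary choices of mine), the candidate bundle should hit the utility target exactly, with no excess utility, and cost weakly less than any other bundle on the same indifference curve:

```python
import math

# Check the Cobb-Douglas Hicksian demands: no excess utility, and minimal
# expenditure among bundles delivering u(x) = u.
alpha, beta = 0.5, 1.5
p1, p2, u = 2.0, 3.0, 4.0
h1 = ((alpha * p2 / (beta * p1)) ** beta * u) ** (1.0 / (alpha + beta))
h2 = (beta * p1 / (alpha * p2)) * h1

utility = lambda x1, x2: x1 ** alpha * x2 ** beta
spend = p1 * h1 + p2 * h2

hits_target = abs(utility(h1, h2) - u) < 1e-9
# Perturb x1 along the indifference curve and recompute the x2 that keeps
# utility at exactly u; every such bundle should cost at least `spend`.
cheapest = True
for t in (0.5, 0.8, 1.25, 2.0):
    x1 = t * h1
    x2 = (u / x1 ** alpha) ** (1.0 / beta)
    cheapest = cheapest and spend <= p1 * x1 + p2 * x2 + 1e-9
```

Both checks pass, consistent with (h_1, h_2) solving the EMP at these parameters.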
Proposition 15.1.1 (3.E.1) Suppose u(·) is a continuous utility function representing a locally non-satiated preference relation ⪰ and the price vector satisfies p ≫ 0. Then
• If x*(p, w) is optimal in the UMP, then x*(p, w) is optimal in the EMP when the utility target is u = u(x*(p, w))
• If h*(p, u) is optimal in the EMP, then h*(p, u) is optimal in the UMP when wealth is w = p · h*(p, u)

Proof If x* is optimal in the UMP at (p, w), then it is feasible in the EMP when the utility target is u(x*); if it weren't optimal there, that would imply that there were a vector x' that was strictly cheaper and achieved at least the same utility as x*, and by local non-satiation, x* couldn't have been optimal in the first place (since a strictly better bundle would have been affordable), leading to a contradiction. The same kind of argument establishes the other part of the proposition. ∎
This implies that x(p, e(p, u)) = h(p, u) and x(p, w) = h(p, v(p, w)). These are important identities to keep in mind. Similarly, just as v(p, w) = u(x*(p, w)) is the value function for the UMP, there is a value function for the EMP: the expenditure function, e(p, u) = p · h*(p, u).

Proposition 15.1.2 (3.E.3) Suppose u is a continuous utility function representing a locally non-satiated preference relation ⪰. Hicksian demand h(p, u) possesses the following properties:
• homogeneity of degree zero in p: h(αp, u) = h(p, u) for all α > 0
• no excess utility: for any x ∈ h(p, u), u(x) = u
• convexity/uniqueness: if ⪰ is convex, then h(p, u) is a convex set, and if ⪰ is strictly convex, then h(p, u) is a function
Proof (i) Minimizing p · x subject to u(x) ≥ u yields the same solution set as minimizing αp · x subject to u(x) ≥ u for α > 0, since scaling the objective by a positive constant does not change the minimizers. (ii) Suppose there were an x ∈ h(p, u) with u(x) > u. Then the bundle x' = αx with α ∈ (0, 1) satisfies u(x') > u if α is close enough to 1, by continuity of u. But the bundle x' is strictly less expensive: p · x' = αp · x < p · x. Consequently, x could not have been optimal in the first place (since x' is cheaper and still feasible). ∎
Proposition 15.1.3 (3.E.2) Suppose u is a continuous utility function representing a locally non-satiated preference relation ⪰. The expenditure function e(p, u) is
• homogeneous of degree one in p
• strictly increasing in u and non-decreasing in p_ℓ for all ℓ
• concave in p
• continuous in p and u

Proof (i) e(αp, u) = min{αp · x : u(x) ≥ u} = α min{p · x : u(x) ≥ u} = αe(p, u): scaling the objective by α > 0 doesn't change the solution, so the value scales by α. (ii) Suppose e(p, u) were not strictly increasing in u, and let x' be optimal at u' and x'' be optimal at u'', where u'' > u' and p · x'' ≤ p · x'. Consider the bundle x̃ = αx'' with α ∈ (0, 1). If α is close to 1, then u(x̃) > u' (by continuity) but p · x̃ < p · x'' ≤ p · x', so x̃ is feasible for the target u' and strictly cheaper than x', contradicting the assumption that x' was optimal. (iii) Suppose p'' and p' satisfy p''_ℓ ≥ p'_ℓ but are otherwise equal. Let x'' be optimal in the EMP at p''. Then e(p'', u) = p'' · x'' ≥ p' · x'' ≥ e(p', u). (iv) For concavity, fix u and let p'' = αp + (1 − α)p' for α ∈ [0, 1]. Suppose x'' is optimal in the EMP at p''. Then e(p'', u) = p'' · x'' = αp · x'' + (1 − α)p' · x'' ≥ αe(p, u) + (1 − α)e(p', u), since x'' is feasible (but not necessarily optimal) at both p and p'. ∎
And there is a corresponding law of demand for Hicksian demand:

Proposition 15.1.4 (3.E.4) Suppose u is a continuous utility function representing a locally non-satiated preference relation ⪰, and suppose h(p, u) is a function for all p ≫ 0. Then the Hicksian demand function satisfies the compensated law of demand: for all p and p',

    (p' − p) · (h(p', u) − h(p, u)) = Δp · Δh ≤ 0

Proof For any p ≫ 0, h(p, u) is optimal in the EMP, so it achieves a lower expenditure at p than any other bundle offering at least utility u. Therefore

    p' · h(p', u) ≤ p' · h(p, u)

and

    p · h(p, u) ≤ p · h(p', u)

Subtracting the second inequality from the first yields

    (p' − p) · h(p', u) ≤ (p' − p) · h(p, u)

which rearranges to the claim. ∎
This is one of our most important results to keep in mind:

Proposition 15.1.5 (3.G.1) For all p and u, Hicksian demand h(p, u) satisfies

    h(p, u) = D_p e(p, u)
Proof Using the envelope theorem: take V(p, u) = p · h(p, u) − λ(u(h(p, u)) − u). Now differentiate with respect to p:

    D_p V(p, u) = h(p, u) + p · D_p h(p, u) − λD_x u(h(p, u)) D_p h(p, u)

Since the FOC of the EMP is p = λD_x u(x), the last two terms cancel, leaving D_p V(p, u) = h(p, u). But V(p, u) = e(p, u) at the optimum, so D_p e(p, u) = h(p, u). ∎
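A quick finite-difference check of 3.G.1 (my own numbers): for u(x) = x_1 x_2, the closed forms e(p, u) = 2√(p_1 p_2 u) and h(p, u) = (√(u p_2/p_1), √(u p_1/p_2)) can be derived directly, and the price gradient of e should reproduce h:

```python
import math

# Finite-difference check of h(p, u) = grad_p e(p, u) for u(x) = x1*x2,
# where e(p, u) = 2*sqrt(p1*p2*u) and h = (sqrt(u*p2/p1), sqrt(u*p1/p2)).
def e(p1, p2, u):
    return 2.0 * math.sqrt(p1 * p2 * u)

p1, p2, u, eps = 2.0, 5.0, 3.0, 1e-6
h1 = math.sqrt(u * p2 / p1)
h2 = math.sqrt(u * p1 / p2)
de_dp1 = (e(p1 + eps, p2, u) - e(p1 - eps, p2, u)) / (2 * eps)
de_dp2 = (e(p1, p2 + eps, u) - e(p1, p2 - eps, u)) / (2 * eps)
```

The two numerical derivatives match the closed-form Hicksian demands to finite-difference accuracy.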
Now, since h(p, u) = D_p e(p, u) and we know that e(p, u) has certain properties in p (like concavity), we can differentiate again to learn more. This trick is very common, especially in producer theory, so it pays to understand this proposition:

Proposition 15.1.6 (3.G.2) Suppose h(p, u) is differentiable at p with Jacobian matrix D_p h(p, u). Then
• D_p h(p, u) = D²_p e(p, u)
• D_p h(p, u) is negative semi-definite
• D_p h(p, u) is symmetric
• D_p h(p, u) p = 0

Proof (i) This follows from differentiating the previous proposition (3.G.1). (ii) Since e(p, u) is a concave function (proposition 3.E.2), its Hessian, which by (i) is the Jacobian of h(p, u), is negative semi-definite. (iii) D_p h(p, u) is the Hessian of e(p, u), and Hessians of twice continuously differentiable functions are symmetric by Young's theorem. (iv) Since h(p, u) is homogeneous of degree zero in p, we have h(αp, u) − h(p, u) = 0; differentiating with respect to α and evaluating at α = 1 yields D_p h(p, u) p = 0. ∎
So now that the expenditure minimization problem has been characterized and its solution and value function studied, let's ask again: what is this about? The EMP says, "Find the least expensive way to give the consumer u worth of utility." If you worked at a hospital and were asked to maximize the nutritional value of a patient's diet subject to a budget constraint, you would be solving the UMP. If you were instead asked to minimize the cost of a patient's diet subject to achieving a certain level of nutrition, you would be solving the EMP. Naturally, the solutions and value functions of these problems are related, and their solutions have similar properties. But there are actually some deeper connections.
15.2 The Slutsky Equation and Roy's Identity

Slutsky's equation and Roy's identity link all the ideas developed up to now together. In particular, Slutsky's equation links Hicksian to Walrasian demand, and Roy's identity shows how the indirect utility function relates to Walrasian demand.

Proposition 15.2.1 (3.G.3, the Slutsky equation) For all (p, w) and u = v(p, w),

    D_p h(p, u) = D_p x(p, w) + D_w x(p, w)x(p, w)^T

or, for all ℓ and k,

    ∂h_ℓ(p, u)/∂p_k = ∂x_ℓ(p, w)/∂p_k + [∂x_ℓ(p, w)/∂w] x_k(p, w)
Proof Consider a consumer at prices p with wealth w and utility u = v(p, w). That implies that w = e(p, u), and h(p, u) = x(p, e(p, u)). Differentiating with respect to p yields

    D_p h(p, u) = D_p x(p, e(p, u)) + D_w x(p, e(p, u)) D_p e(p, u)^T

But then 3.G.1 implies

    D_p h(p, u) = D_p x(p, e(p, u)) + D_w x(p, e(p, u)) h(p, u)^T

and since w = e(p, u) and x(p, w) = x(p, e(p, u)) = h(p, u),

    D_p h(p, u) = D_p x(p, w) + D_w x(p, w) x(p, w)^T

This also shows that S(p, w) = D_p h(p, u), since the right-hand side of the identity is exactly the Slutsky matrix. Since we know that D_p h(p, u) is negative semi-definite, symmetric, and satisfies D_p h(p, u) p = 0, these properties must be satisfied by S(p, w) as well. ∎
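The Slutsky equation can be checked numerically (parameter values are my own): for u(x) = x_1 x_2, the Walrasian demand is x_1 = w/(2p_1) and the Hicksian demand is h_1 = √(u p_2/p_1), so differentiating h_1 at u = v(p, w) should reproduce the Slutsky combination of Walrasian derivatives:

```python
# Numeric check of the Slutsky equation for u(x) = x1*x2, using the
# closed-form demands x1 = w/(2*p1) (Walrasian) and h1 = sqrt(u*p2/p1)
# (Hicksian); prices and wealth are arbitrary.
p1, p2, w = 2.0, 3.0, 12.0
u = w ** 2 / (4 * p1 * p2)                 # u = v(p, w)
eps = 1e-6

h1 = lambda q: (u * p2 / q) ** 0.5         # Hicksian demand for good 1
dh1_dp1 = (h1(p1 + eps) - h1(p1 - eps)) / (2 * eps)

x1 = w / (2 * p1)
dx1_dp1 = -w / (2 * p1 ** 2)
dx1_dw = 1 / (2 * p1)
s11 = dx1_dp1 + dx1_dw * x1                # Slutsky combination s_11
```

The finite-difference derivative of Hicksian demand agrees with s_11 to numerical precision.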
Another indispensable tool that comes from exploiting duality theory is Roy's identity, which allows us to derive the Walrasian/uncompensated demand function from the indirect utility function.

Proposition 15.2.2 (3.G.4, Roy's identity) Suppose that the indirect utility function v(p, w) is differentiable at (p, w) ≫ 0. Then

    x(p, w) = −[1/D_w v(p, w)] D_p v(p, w)

or, for all ℓ = 1, ..., L,

    x_ℓ(p, w) = −[∂v(p, w)/∂p_ℓ] / [∂v(p, w)/∂w]

Proof Note that u = v(p, w), so that u = v(p, e(p, u)). Differentiating with respect to p yields

    0 = D_p v(p, e(p, u)) + D_w v(p, e(p, u)) D_p e(p, u)

From 3.G.1, D_p e(p, u) = h(p, u), so

    0 = D_p v(p, e(p, u)) + D_w v(p, e(p, u)) h(p, u)

and since e(p, u) = w and u = v(p, w),

    0 = D_p v(p, w) + D_w v(p, w) h(p, v(p, w))
    0 = D_p v(p, w) + D_w v(p, w) x(p, w)

so that x(p, w) = −D_p v(p, w)/D_w v(p, w). ∎
15.3 Example
Suppose u(x_1, x_2) = x_1 x_2. Then the EMP has Lagrangean

    L = p_1 x_1 + p_2 x_2 − λ(x_1 x_2 − u)

This generates Hicksian demands

    h_1(p, u) = √(u p_2/p_1),    h_2(p, u) = √(u p_1/p_2)

and an expenditure function

    e(p, u) = p · h(p, u) = 2√(p_1 p_2 u)

If we make the substitutions w = e(p, u) and u = v(p, w),

    w = 2√(p_1 p_2 v(p, w))

and solving for v gives

    v(p, w) = w²/(4p_1 p_2)

Using Roy's identity gives

    x_1(p, w) = −[∂v/∂p_1]/[∂v/∂w] = −[−w²/(4p_1² p_2)] / [2w/(4p_1 p_2)] = w/(2p_1)

which is the solution of the UMP:

    L = x_1 x_2 + λ(w − p_1 x_1 − p_2 x_2)

which has solutions

    x_1(p, w) = w/(2p_1),    x_2(p, w) = w/(2p_2)

From these we can compute the indirect utility function

    v(p, w) = w²/(4p_1 p_2)

Making the substitutions u = v(p, w) and w = e(p, u), we get

    u = e(p, u)²/(4p_1 p_2)

Re-arranging to solve for e(p, u) yields

    e(p, u) = 2√(p_1 p_2 u)

and differentiating with respect to p_1 and p_2 yields

    h_1(p, u) = √(u p_2/p_1),    h_2(p, u) = √(u p_1/p_2)
which are exactly the Hicksian demands we started with. So, in principle, the following is true:
• Walrasian demand can be generated from the UMP
• Hicksian demand can be generated from the EMP
• The indirect utility function can be generated from Walrasian demand
• The expenditure function can be generated from Hicksian demand
• The expenditure function can be generated from the indirect utility function, and the indirect utility function can be generated from the expenditure function
• Walrasian demand can be generated from the indirect utility function by Roy's identity
• Hicksian demand can be generated from the expenditure function by 3.G.1
So if you are given any piece of the puzzle, you can, in principle, generate all of the other objects.
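This "any piece generates the rest" claim can be spot-checked numerically for u(x) = x_1 x_2 (the prices, wealth, and utility target below are arbitrary choices of mine):

```python
import math

# Round-trip checks for u(x) = x1*x2: v and e invert each other, and Roy's
# identity applied numerically to v recovers Walrasian demand x1 = w/(2*p1).
v = lambda p1, p2, w: w ** 2 / (4 * p1 * p2)
e = lambda p1, p2, u: 2.0 * math.sqrt(p1 * p2 * u)

p1, p2, w, eps = 2.0, 5.0, 20.0, 1e-6
back_to_w = e(p1, p2, v(p1, p2, w))        # e(p, v(p, w)) should equal w
back_to_u = v(p1, p2, e(p1, p2, 3.0))      # v(p, e(p, u)) should equal u

dv_dp1 = (v(p1 + eps, p2, w) - v(p1 - eps, p2, w)) / (2 * eps)
dv_dw = (v(p1, p2, w + eps) - v(p1, p2, w - eps)) / (2 * eps)
roy_x1 = -dv_dp1 / dv_dw                   # compare with w / (2 * p1)
```

Each round trip returns its input, and the numerical Roy's identity reproduces the closed-form Walrasian demand.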
Exercises
MWG: 3E4, 3E6, 3E8, 3E9, 3G1, 3G3, 3G7,
Chapter 16
Welfare Analysis
Suppose a consumer has a rational, continuous, locally non-satiated preference relation ⪰ represented by a continuous utility function u(x). Consider an initial price vector p⁰ and a final price vector p¹. You might want to know, "How does the consumer's welfare change when prices move from p⁰ to p¹?" However, the consumer's behavior varies in response to those prices, making some price changes more painful than others. Moreover, what if some prices increase and some decrease? For example, a shock to oil prices has an impact on many products, and measuring the impact will be complicated precisely because the consumer's behavior will adjust to shrug off some of the welfare loss by buying substitute goods or taking other steps to adjust to the new price regime.

Since v(p, w) is our theoretical index of the consumer's welfare for all price-wealth situations, the real question is, "What is v(p¹, w) − v(p⁰, w)?" But we don't actually ever observe v(p, w), so is there some other function of v(p, w) that is observable (in principle) and closely related to it? The expenditure function e(p̄, u) is strictly increasing in u and is denominated in money, so e(p̄, v(p, w)) is a natural choice; this is called a money metric utility function, and we measure welfare changes by

    e(p̄, v(p¹, w)) − e(p̄, v(p⁰, w))

Note that the reference prices p̄ are fixed, but the utility is varying. Since the choice of p̄ is still undetermined, there are two natural choices. Compensating variation uses p̄ = p¹:

    CV(p⁰, p¹, w) = e(p¹, v(p¹, w)) − e(p¹, v(p⁰, w)) = e(p¹, u¹) − e(p¹, u⁰) = w − e(p¹, u⁰)

and equivalent variation uses p̄ = p⁰:

    EV(p⁰, p¹, w) = e(p⁰, v(p¹, w)) − e(p⁰, v(p⁰, w)) = e(p⁰, u¹) − e(p⁰, u⁰) = e(p⁰, u¹) − w

The value of CV measures how much wealth the consumer would have to be paid after the price change to make her indifferent about the price change, while EV measures how much she would have to be paid before the price change to make her indifferent about it. Or: EV is the amount she would pay to stop the price change, while CV is how much she would have to be paid after the price change to achieve the same utility level.
Example Suppose we have a consumer with preferences u(x_1, x_2) = x_1 x_2. Then his expenditure function is

    e(p, u) = 2√(p_1 p_2 u)

and

    v(p, w) = w²/(4p_1 p_2)
and our composite function

    e(p^a, v(p^b, w)) = w √( (p^a_1 p^a_2) / (p^b_1 p^b_2) )

is the money-metric utility function described above.

Suppose that prices are initially p⁰ = (1, 1), but then the price of good two changes from 1 to 2, so p¹ = (1, 2). Our compensating variation is

    CV(p⁰, p¹, w) = e(p¹, v(p¹, w)) − e(p¹, v(p⁰, w)) = w √((1·2)/(1·2)) − w √((1·2)/(1·1)) = w(1 − √2)

while the equivalent variation is

    EV(p⁰, p¹, w) = e(p⁰, v(p¹, w)) − e(p⁰, v(p⁰, w)) = w √((1·1)/(1·2)) − w √((1·1)/(1·1)) = w(√(1/2) − 1)
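The two numbers can be reproduced directly from the definitions (w = 100 below is an arbitrary choice of mine):

```python
import math

# CV and EV for u(x) = x1*x2 when prices move from (1, 1) to (1, 2),
# computed straight from the money-metric form e(p_ref, v(p, w)).
e = lambda p1, p2, u: 2.0 * math.sqrt(p1 * p2 * u)
v = lambda p1, p2, w: w ** 2 / (4 * p1 * p2)

w = 100.0
p_old, p_new = (1.0, 1.0), (1.0, 2.0)
CV = e(*p_new, v(*p_new, w)) - e(*p_new, v(*p_old, w))
EV = e(*p_old, v(*p_new, w)) - e(*p_old, v(*p_old, w))
# Closed forms from the text: CV = w*(1 - sqrt(2)), EV = w*(sqrt(1/2) - 1);
# both are negative, since the price increase makes the consumer worse off.
```

The numerical values match the closed forms, and the ordering EV > CV here previews the general EV/AV/CV comparison at the end of this chapter.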
So in this case it is easy to calculate closed-form solutions. But suppose we want to know more generally how changes in p_2 affect the consumer's welfare. Hold p_1 fixed, and consider

    CV(p⁰, p¹, w) = e(p¹, v(p¹, w)) − e(p¹, v(p⁰, w)) = w − e(p¹, u⁰) = e(p⁰, u⁰) − e(p¹, u⁰)
In this case, that equals

    CV((p_1, p⁰_2), (p_1, p¹_2), w) = e((p_1, p⁰_2), u⁰) − e((p_1, p¹_2), u⁰)

From the fundamental theorem of calculus,

    e((p_1, p⁰_2), u⁰) − e((p_1, p¹_2), u⁰) = ∫_{p¹_2}^{p⁰_2} [∂e(p_1, z, u⁰)/∂p_2] dz

From 3.G.1,

    CV = ∫_{p¹_2}^{p⁰_2} h_2(p_1, z, u⁰) dz = ∫_{p¹_2}^{p⁰_2} √(u⁰ p_1/z) dz

Integration yields

    CV = √(u⁰ p_1) · 2(√(p⁰_2) − √(p¹_2))

And inserting u⁰ = v(p⁰, w) = w²/(4p_1 p⁰_2) yields

    CV = w (1 − √(p¹_2/p⁰_2))

Evaluating at p⁰ = (1, 1) and p¹ = (1, 2) yields

    CV = w(1 − √2)

which is exactly as before.
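The last few steps can be verified by numerically integrating the Hicksian demand curve (midpoint rule; the wealth level and grid size are arbitrary choices of mine):

```python
import math

# CV as the integral of Hicksian demand h2(z) = sqrt(u0 * p1 / z) as p2
# moves from 1 to 2, compared with the closed form w * (1 - sqrt(2)).
w, p1 = 50.0, 1.0
u0 = w ** 2 / (4 * p1 * 1.0)               # u0 = v((1, 1), w)
h2 = lambda z: math.sqrt(u0 * p1 / z)

n, lo, hi = 100_000, 1.0, 2.0              # integrate over p2 in [1, 2]
step = (hi - lo) / n
integral = sum(h2(lo + (i + 0.5) * step) for i in range(n)) * step
CV = -integral                             # limits run from p2 = 2 down to 1
```

The numerical integral agrees with the closed-form answer to high precision.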
So the example shows that when it comes to welfare, we are talking about integrals of Hicksian demand, while our undergraduate measure was consumer's surplus, based on integrals of Walrasian demand. So suppose only the price of good ℓ is changing, from p⁰_ℓ to p¹_ℓ. In general, how can EV or CV be computed?

    CV = e(p¹, v(p¹, w)) − e(p¹, v(p⁰, w)) = w − e(p¹, v(p⁰, w))
       = e(p⁰, v(p⁰, w)) − e(p¹, v(p⁰, w)) = e(p⁰, u⁰) − e(p¹, u⁰)
Let p̄_{−ℓ} = (p_1, p_2, ..., p_{ℓ−1}, p_{ℓ+1}, ..., p_L) be the vector of prices that are being held fixed. Then, from the fundamental theorem of calculus,

    CV = e(p⁰, u⁰) − e(p¹, u⁰) = ∫_{p¹_ℓ}^{p⁰_ℓ} [∂e(z, p̄_{−ℓ}, u⁰)/∂p_ℓ] dz

and from 3.G.1,

    CV = ∫_{p¹_ℓ}^{p⁰_ℓ} h_ℓ(z, p̄_{−ℓ}, u⁰) dz

So compensating variation is the area to the left of the Hicksian demand curve, evaluated at u⁰, and equivalent variation is

    EV = ∫_{p¹_ℓ}^{p⁰_ℓ} h_ℓ(z, p̄_{−ℓ}, u¹) dz

or the area to the left of the Hicksian demand curve, evaluated at u¹. This is the new consumer surplus. In fact, consider the Slutsky equation:
    D_p h(p, u) = D_p x(p, w) + D_w x(p, w)x(p, w)^T

If there were no wealth effects, ∫ D_p h(p, u) dp = ∫ D_p x(p, w) dp (in the sense of line integrals), and Hicksian and Walrasian demand would be essentially the same. Then the integral in the above example could replace h_2(p, u) with x_2(p, w), and we would have computed the area beneath a Walrasian demand curve to measure the agent's welfare. Since wealth effects are generally not negligible, however, we must work with Hicksian demand.
16.1 Taxes
In introductory microeconomics, the deadweight-loss triangle represents the loss from a tax. Consider a tax t on good ℓ such that p¹_ℓ = p⁰_ℓ + t, with all other prices remaining the same. The consumer's equivalent variation measures the welfare impact of the tax: −EV is the amount he would need to be paid, before the price change, to be compensated for the introduction of the tax. If T is the total tax revenue collected from the consumer, he is worse off with the tax than he would be simply surrendering the revenue whenever −EV > T; that is, the burden of the tax per se is worse than merely handing over the money (remember, a tax changes the relative prices of goods as well as reducing his real wealth). In particular,

    −EV(p⁰, p¹, w) − T = w − e(p⁰, u¹) − T

gives exactly the measure of the welfare lost by imposing the tax over and above the revenue collected: the deadweight loss. The revenue from the tax is T = t x_ℓ(p¹, w), the demand for good ℓ after the tax times the size of the tax, or equivalently T = t h_ℓ(p¹, u¹). Then

    −EV(p⁰, p¹, w) − T = e(p¹, u¹) − e(p⁰, u¹) − t h_ℓ(p¹, u¹) = ∫_{p⁰_ℓ}^{p¹_ℓ} [∂e(z, p̄_{−ℓ}, u¹)/∂p_ℓ] dz − t h_ℓ(p¹, u¹)
Substituting p¹_ℓ = p⁰_ℓ + t, this equals

    ∫_{p⁰_ℓ}^{p⁰_ℓ+t} [∂e(z, p̄_{−ℓ}, u¹)/∂p_ℓ] dz − (p⁰_ℓ + t − p⁰_ℓ) h_ℓ(p⁰_ℓ + t, p̄_{−ℓ}, u¹)
    = ∫_{p⁰_ℓ}^{p⁰_ℓ+t} [ h_ℓ(z, p̄_{−ℓ}, u¹) − h_ℓ(p⁰_ℓ + t, p̄_{−ℓ}, u¹) ] dz

Again, this is an integral involving the Hicksian demand curve.
16.2 Consumer Surplus vs. CV and EV
The area variation in MWG is just consumer's surplus:

    AV(p⁰, p¹, w) = ∫_{p¹_ℓ}^{p⁰_ℓ} x_ℓ(z, p̄_{−ℓ}, w) dz

As we mentioned, this is the wrong welfare measure. But, writing each demand curve with the fundamental theorem of calculus, consider

    AV(p⁰, p¹, w) = ∫_{p¹_ℓ}^{p⁰_ℓ} [ x_ℓ(p¹_ℓ, p̄_{−ℓ}, w) + ∫_{p¹_ℓ}^{z} [∂x_ℓ(y, p̄_{−ℓ}, w)/∂p_ℓ] dy ] dz

versus

    EV(p⁰, p¹, w) = ∫_{p¹_ℓ}^{p⁰_ℓ} [ h_ℓ(p¹_ℓ, p̄_{−ℓ}, u¹) + ∫_{p¹_ℓ}^{z} [∂h_ℓ(y, p̄_{−ℓ}, u¹)/∂p_ℓ] dy ] dz

where the first terms coincide, since h_ℓ(p¹_ℓ, p̄_{−ℓ}, u¹) = x_ℓ(p¹_ℓ, p̄_{−ℓ}, w) when u¹ = v(p¹, w).
From the Slutsky equation, we know that

    ∂h_ℓ(p, u)/∂p_ℓ = ∂x_ℓ(p, w)/∂p_ℓ + [∂x_ℓ(p, w)/∂w] x_ℓ(p, w)

Substituting this in yields

    EV(p⁰, p¹, w) = ∫_{p¹_ℓ}^{p⁰_ℓ} [ x_ℓ(p¹_ℓ, p̄_{−ℓ}, w) + ∫_{p¹_ℓ}^{z} ( [∂x_ℓ(y, p̄_{−ℓ}, w)/∂p_ℓ] + [∂x_ℓ(y, p̄_{−ℓ}, w)/∂w] x_ℓ(y, p̄_{−ℓ}, w) ) dy ] dz

or

    EV(p⁰, p¹, w) = ∫_{p¹_ℓ}^{p⁰_ℓ} ∫_{p¹_ℓ}^{z} [∂x_ℓ(y, p̄_{−ℓ}, w)/∂w] x_ℓ(y, p̄_{−ℓ}, w) dy dz    [wealth effect]
                  + AV(p⁰, p¹, w)

On page 89, this is why MWG say that AV tends to underestimate EV when ℓ is a normal good: if ∂x_ℓ/∂w ≥ 0 and p¹_ℓ < p⁰_ℓ, the integral term will be positive, so EV = AV + δ where δ > 0. A similar derivation can be done for the compensating variation to show that AV > CV, giving

    EV > AV > CV
Exercises
MWG: 3I1, 3I2, 3I3, 3I5, 3I7
Chapter 17
Production
In classical economics, consumers demand and firms supply. Now we apply the theory of rational choice to firms, just as we did for consumers in the previous two chapters.

17.1 Production Sets

A production vector y ∈ R^L is a plan describing the net outputs of a production process. For example, if you had a machine that takes a pound of plastic and half a pound of capacitors to make one computer, y = (−1, −1/2, 1). Also, firms might sell each other goods, so that the computer produced by one firm is sold to another firm to monitor the chemicals being sprayed on oranges, and so on.

Each firm has a set Y ⊆ R^L of possible production plans, the production set. Any y ∈ Y is feasible, while any y ∈ R^L \ Y is not.

Another way of describing Y is the transformation function F, which satisfies

    Y = {y ∈ R^L : F(y) ≤ 0}

The boundary of Y is the set of plans y such that F(y) = 0; this is like the production possibilities frontier of introductory microeconomics.
Definition 17.1.1 A production set Y ≠ ∅ satisfies
• no free lunch if y ∈ Y and y ≥ 0 imply y = 0 (it is not possible to produce something from nothing)
• free disposal if y ∈ Y and y' ≤ y imply y' ∈ Y
• irreversibility if y ∈ Y and y ≠ 0 imply −y ∉ Y
• non-increasing returns to scale if for any y ∈ Y, αy ∈ Y for all α ∈ [0, 1]
• non-decreasing returns to scale if for any y ∈ Y, αy ∈ Y for all α ≥ 1
• constant returns to scale if for any y ∈ Y, αy ∈ Y for all α ≥ 0
• additivity if for any y, y' ∈ Y, y + y' ∈ Y
• convexity if y, y' ∈ Y and α ∈ [0, 1] imply αy + (1 − α)y' ∈ Y
• closedness if Y includes its boundary: for any sequence y_n → y with y_n ∈ Y, y ∈ Y
Example Consider F(y_1, y_2) = y_1 − 1 + e^{αy_2}, and assume that y_2 ≤ 0 and p_1 > p_2/α (sketch the set {y : F(y) ≤ 0} before going any further). You can think of this as a firm that can transform y_2 into y_1 according to the technology F. Then the firm maximizes

    max_{y : F(y) ≤ 0} p · y

This is probably not what you have in mind when you think of a canonical profit maximization problem. For example, you probably have in mind

    max_{z_1, z_2} pF(z_1, z_2) − w_1 z_1 − w_2 z_2

where the z's are factor inputs, like capital and labor. But that is not how MWG set up the profit maximization or cost minimization problems. They are being agnostic about what it means for something to be an input or an output, and merely considering a transformation function F : R^L → R that bounds the kinds of vectors a firm can produce. In this case, the Lagrangean takes the form

    L = p_1 y_1 + p_2 y_2 − λ(y_1 − 1 + e^{αy_2})

with solution

    y*_1 = 1 − p_2/(αp_1),    y*_2 = (1/α) log(p_2/(αp_1))

in which it is not clear a priori which good is the input and which is the output.
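The solution can be sanity-checked by brute force along the binding constraint y_1 = 1 − e^{αy_2} (the values of α and the prices below are my own, chosen so that p_1 > p_2/α holds):

```python
import math

# Profit along the frontier y1 = 1 - exp(alpha * y2); the grid maximum
# should agree with the closed-form y2* = log(p2 / (alpha * p1)) / alpha.
alpha, p1, p2 = 1.0, 3.0, 1.0              # p1 > p2 / alpha, so y2* <= 0
profit = lambda y2: p1 * (1.0 - math.exp(alpha * y2)) + p2 * y2

y2_star = math.log(p2 / (alpha * p1)) / alpha
y1_star = 1.0 - p2 / (alpha * p1)

grid = [-5.0 + i * 1e-4 for i in range(50_001)]   # y2 in [-5, 0]
best = max(profit(y2) for y2 in grid)
```

The grid search confirms the first-order conditions: profit peaks at y2*, and y1* sits on the frontier at that point.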
Note that we didn't assume which good was the output and which was the input; this was determined by the technology and prices. For example, a farm might raise cows and corn, producing milk. Under some prices, the firm doesn't sell the corn and uses it simply to feed cows to produce milk. At other prices (for instance, high demand for ethanol to make gasoline), the farm begins to sell corn as well as use it to feed cows, so that it is both an input and an output. Then, if prices shift far enough towards corn, the farm stops raising cows, butchers them for meat, and focuses on growing corn (but note that corn kernels are an input in growing corn, so it is, in some sense, its own input). MWG want to have this level of generality of inputs and outputs in mind, and it's important in macroeconomics as well (Blinder wrote an influential paper about how sectoral shocks propagate through intermediate goods channels, for example).
Make sure you understand the above example, because MWG never actually provide a fully worked example of the profit-maximization problem they've chosen, and if you think of it as a pF(K, L) − rK − wL kind of problem, the propositions below will probably become confusing. In particular, the functional form π = pF(K, L) − rK − wL assumes that K and L are inputs and substitutes q = F(K, L) into the objective. How can we transform it into the kind of framework MWG are using? First, let K̃ = −K and L̃ = −L, define F̃(q, K̃, L̃) = F(−K̃, −L̃) − q, and then consider

max_{q, K̃, L̃} pq + rK̃ + wL̃

subject to F̃(q, K̃, L̃) ≥ 0. This generates a Lagrangean

L = pq + rK̃ + wL̃ + λF̃(q, K̃, L̃)

That's exactly the PMP defined at the bottom of page 135.
So remember that MWG are including output in the production plan, so we're working with F̃.
17.2 Profit Maximization

Let p = (p_1, p_2, ..., p_L). Then the profit maximization problem is

π(p) = max_{y ∈ Y} p·y

Since Y = {y : F(y) ≤ 0}, this can be written as

π(p) = max_{y : F(y) ≤ 0} p·y
Proposition 17.2.1 (5C1) Suppose π(·) is the profit function of the production set Y and y(·) is the supply correspondence. If Y is closed and satisfies the free disposal property, then

1. π(p) is homogeneous of degree one
2. π(p) is convex
3. If Y is convex, then Y = {y ∈ R^L : p·y ≤ π(p) for all p ≫ 0}
4. y(·) is homogeneous of degree zero
5. If Y is convex, then y(p) is a convex set. If Y is strictly convex, y(p) is single-valued.
6. If y(p) is a function, then π(p) is differentiable and Dπ(p) = y(p) (Hotelling's Lemma)
7. If y(p) is a function, then Dy(p) = D²π(p) is a symmetric and positive semidefinite matrix with Dy(p)p = 0
(i) If π(p) = max_{y ∈ Y} p·y, then λp·y is a monotone transformation of the objective, and will have the same solution set, so π(λp) = λπ(p).

(ii) If π(p) = p·y(p) and π(p') = p'·y(p'), then

π(αp + (1 − α)p') = (αp + (1 − α)p')·y(αp + (1 − α)p')
= αp·y(αp + (1 − α)p') + (1 − α)p'·y(αp + (1 − α)p')
≤ αp·y(p) + (1 − α)p'·y(p')
= απ(p) + (1 − α)π(p')

so that π(p) is convex.
(iv) Since the constraint set is unaffected by a scalar transformation of the prices p' = λp, if y is optimal for max_{y ∈ Y} p·y, then y is optimal for max_{y ∈ Y} λp·y = λ max_{y ∈ Y} p·y.

(v) If Y is convex, then for all y, y' ∈ Y, αy + (1 − α)y' ∈ Y. Suppose y and y' are both optimal at p. Then

π(p) = p·y = p·y' = αp·y + (1 − α)p·y' = p·(αy + (1 − α)y')

so that αy + (1 − α)y' must also be optimal. If Y is strictly convex, then y = y' in the above calculations, or else αy + (1 − α)y' gives strictly higher profits than y or y', contradicting the assumption that y and y' were optimal.
(vi) Hotelling's lemma follows from

π(p) = p·y(p)

and

Dπ(p) = y(p) + p·Dy(p) = y(p)

from the Envelope Theorem or a duality argument.

(vii) Last, since π(p) is convex, we know that D²π(p) is a positive semi-definite matrix. Then Hotelling's Lemma implies that y(p) = Dπ(p), so differentiating again yields D²π(p) = Dy(p), so the Jacobian of the firm's supply curve is the Hessian of the profit function.
In some sense, the above results are the UMP version of the firm's problem, and y(p) behaves much like Walrasian demand.
Example Recall the example with F(y_1, y_2) = y_1 − 1 + e^(αy_2) and αp_1 > p_2. Then the firm's profit-maximizing production plan was

y_1 = 1 − p_2/(αp_1),   y_2 = (1/α) log(p_2/(αp_1))

Clearly, these are HOD-0 with respect to p_1 and p_2. The Jacobian of the optimal plan is

D_p y(p) = [ p_2/(αp_1^2)   −1/(αp_1)
             −1/(αp_1)      1/(αp_2) ]

which is positive semi-definite (and symmetric). In particular, the terms on the diagonal are positive, so that the law of supply will hold. The profit function becomes

π(p) = p_1 (1 − p_2/(αp_1)) + p_2 ((1/α) log(p_2/(αp_1)))

which is HOD-1 in p, since

π(λp) = λp_1 (1 − λp_2/(αλp_1)) + λp_2 ((1/α) log(λp_2/(αλp_1))) = λπ(p)

So all the required properties are satisfied.
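The properties can also be confirmed numerically by finite differences. A minimal sketch, with α = 2 and p = (3, 1) as assumed values satisfying αp_1 > p_2:

```python
import math

# Finite-difference check of Hotelling's Lemma, D(pi)(p) = y(p), and of
# HOD-1 of pi, for the example above (alpha and prices are assumptions).
alpha = 2.0

def y_star(p1, p2):
    return (1.0 - p2 / (alpha * p1),
            (1.0 / alpha) * math.log(p2 / (alpha * p1)))

def pi(p1, p2):
    y1, y2 = y_star(p1, p2)
    return p1 * y1 + p2 * y2

p1, p2, h = 3.0, 1.0, 1e-6
y1, y2 = y_star(p1, p2)
dpi_dp1 = (pi(p1 + h, p2) - pi(p1 - h, p2)) / (2 * h)
dpi_dp2 = (pi(p1, p2 + h) - pi(p1, p2 - h)) / (2 * h)

assert abs(dpi_dp1 - y1) < 1e-5                         # d(pi)/dp1 = y1(p)
assert abs(dpi_dp2 - y2) < 1e-5                         # d(pi)/dp2 = y2(p)
assert abs(pi(2 * p1, 2 * p2) - 2 * pi(p1, p2)) < 1e-9  # pi is HOD-1
```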
17.3 The Law of Supply

The following kind of argument is very common in economics: suppose y maximizes p·y over Y and y' maximizes p'·y over Y. Then

p·y ≥ p·y'   and   p'·y' ≥ p'·y

Subtracting the second inequality from the first gives

(p − p')·y ≥ (p − p')·y'

or

(p − p')·(y − y') ≥ 0

yielding

Δp·Δy ≥ 0

which is the Law of Supply. In particular, for a change in the price of good ℓ alone,

(p_ℓ − p'_ℓ)(y_ℓ − y'_ℓ) ≥ 0

so that if good ℓ's price goes up, the firm produces more of y_ℓ. That all looked easy, but this is often a very effective argument.
Proposition 17.3.1 For any p, p' and corresponding optimal plans y, y',

(p − p')·(y − y') = Δp·Δy ≥ 0
Note that we don't have "Walrasian supply" and "Hicksian supply", or a "Compensated Law of Supply"; we just have the supply function y(p) and a law of supply. Remember that we defined compensated price-wealth changes by dw = x·dp, and then considered

dx = D_p x dp + D_w x dw = D_p x dp + D_w x (x·dp) = [D_p x + D_w x x']dp

giving us the Slutsky equation S(p, w) = D_p x(p, w) + D_w x(p, w) x(p, w)'. This equation included wealth effects through D_w x(p, w), making the evaluation of welfare complicated, since price changes had an indirect effect on the consumer by changing his real wealth. When it comes to the classical theory of the firm, there is no budget constraint that plays a role similar to the wealth constraint that the consumer faces. Consequently, dy = D_p y(p)dp, and the law of supply holds for all price changes.
17.4 Cost Minimization

Now we focus on a slightly different problem, where we know which goods are inputs and which are outputs. The dual problem to the profit maximization problem is the cost minimization problem (CMP):

min_z w·z   subject to f(z) ≥ q.

The solution to this problem is a set of factor demands, expressing how much of each input to hire at a vector of wage rates w.
Proposition 17.4.1 (5C2) Let c(w, q) be the cost function of a single-output technology Y with production function f(·), and let z(w, q) be the factor demand correspondence. Assume that Y is closed and satisfies the free disposal property. Then

c(·) is homogeneous of degree one in w and non-decreasing in q

c(·) is concave in w

z(·) is homogeneous of degree zero in w

If z(w, q) is single-valued, then c(·) is differentiable with respect to w and D_w c(w, q) = z(w, q)

If z(·) is differentiable at w, then D_w z(w, q) = D²_w c(w, q) is a symmetric and negative semi-definite matrix with D_w z(w, q)w = 0

If f(·) is homogeneous of degree one, then c(·) and z(·) are homogeneous of degree one in q

If f(·) is concave, then c(·) is a convex function of q
Proof (i) Since the objective is w·z, the monotone transformation λw·z of the objective has the same solution, so c is HOD-1 in w: c(λw, q) = λc(w, q). By increasing q, the firm must produce more goods. Suppose that z is the solution at (w, q) and z' is the solution at (w, q'), with q' ≥ q. Then z' is feasible at (w, q), since f(z') ≥ q' ≥ q. Suppose that c(w, q') < c(w, q). Then w·z' < w·z, and z' is actually a solution at (w, q). But this contradicts z being optimal, so c(w, q') ≥ c(w, q).

(ii) Let z be the solution at (w, q), z' the solution at (w', q), and z'' the solution at (αw + (1 − α)w', q). Then

c(αw + (1 − α)w', q) = αw·z'' + (1 − α)w'·z''

Since w·z = c(w, q) and w'·z' = c(w', q) are the cheapest ways of producing q at w and w' respectively, the plan z'' must satisfy w·z'' ≥ w·z and w'·z'' ≥ w'·z'. Then

αw·z'' + (1 − α)w'·z'' ≥ αw·z + (1 − α)w'·z' = αc(w, q) + (1 − α)c(w', q)

so c is concave in w.

(iii) Since a monotone transformation of the objective doesn't change the solution correspondence, z(λw, q) = z(w, q).
(v) (Shephard's lemma) The Lagrangean of the constrained optimization problem is

L = w·z − λ(f(z) − q)

Differentiating the value w·z(w, q) at the optimum with respect to w, and using the first-order condition w = λD_z f(z(w, q)), yields

D_w c(w, q) = z(w, q) + [w − λD_z f(z(w, q))]D_w z(w, q) = z(w, q)

(vi) If c(w, q) is differentiable at w, then D_w z(w, q) = D²_w c(w, q) is a symmetric and negative semidefinite matrix satisfying D_w z(w, q)w = 0, from the concavity of c(w, q) in w and the HOD-0 in w of z(w, q).
Example The example F(y_1, y_2) = −y_1 + e^(αy_2) − 1 is trivial, since y_1 = q and the constraint can be solved directly for y_2 = log(1 + q)/α. Instead, let F(q, z_2, z_3) = −q + e^(z_2 z_3) − 1, so that there is an interesting trade-off between z_2 and z_3. Then the Lagrangean for the CMP is

L = −w_2 z_2 − w_3 z_3 + λ(e^(z_2 z_3) − 1 − q)

This has first order conditions

w_2 = λz_3 e^(z_2 z_3),   w_3 = λz_2 e^(z_2 z_3)

implying that w_2 z_2 = w_3 z_3. Substituting this relationship into the constraint gives

z*_2 = √((w_3/w_2) log(1 + q)),   z*_3 = √((w_2/w_3) log(1 + q))

These are the conditional factor demands, and give the least-cost way of producing q units of output.
So we can play many of the same games with firms using Shephard's lemma and Hotelling's lemma as we did for consumers with Roy's identity and the Slutsky equation. In particular, if we take the last example and solve for c(w, q), we get

c(w, q) = 2√(w_2 w_3 log(1 + q))

from which we can describe a profit maximization problem

max_q pq − c(w, q) = pq − 2√(w_2 w_3 log(1 + q))
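Shephard's lemma can be checked numerically for this cost function. A sketch with assumed values for w_2, w_3, and q:

```python
import math

# Finite-difference check of Shephard's Lemma, D_w c(w, q) = z(w, q), for
# the worked example (the specific w2, w3, q values are assumptions).
def cost(w2, w3, q):
    return 2.0 * math.sqrt(w2 * w3 * math.log(1 + q))

def z(w2, w3, q):
    L = math.log(1 + q)
    return math.sqrt(w3 / w2 * L), math.sqrt(w2 / w3 * L)

w2, w3, q, h = 2.0, 5.0, 3.0, 1e-6
z2, z3 = z(w2, w3, q)
dc_dw2 = (cost(w2 + h, w3, q) - cost(w2 - h, w3, q)) / (2 * h)
dc_dw3 = (cost(w2, w3 + h, q) - cost(w2, w3 - h, q)) / (2 * h)

assert abs(dc_dw2 - z2) < 1e-5                 # dc/dw2 = z2(w, q)
assert abs(dc_dw3 - z3) < 1e-5                 # dc/dw3 = z3(w, q)
assert abs(math.exp(z2 * z3) - 1 - q) < 1e-9   # demands meet the constraint
```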
17.5 A Cobb-Douglas Example

In a lot of economic problems, we describe the firm's production function f : R^L → R as taking some subset of L inputs z and producing a single output, q. Consider q = z_1^α z_2^β with w = (w_1, w_2) and output price p. Then the profit maximization problem is (substituting the technology into the profit function, with inputs written as positive quantities whose costs are subtracted)

π(p) = max_{z_1, z_2} p z_1^α z_2^β − w_1 z_1 − w_2 z_2

A necessary condition for maximization is

pα z_1^(α−1) z_2^β − w_1 = 0
pβ z_1^α z_2^(β−1) − w_2 = 0

and the second-order sufficient condition is that

[ pα(α−1) z_1^(α−2) z_2^β       pαβ z_1^(α−1) z_2^(β−1)
  pαβ z_1^(α−1) z_2^(β−1)       pβ(β−1) z_1^α z_2^(β−2) ]

be negative semi-definite (which is satisfied as long as α + β < 1). Then the solution to the FOCs is

z*_1 = ( p (β/w_2)^β (α/w_1)^(1−β) )^(1/(1−α−β))
z*_2 = ( p (α/w_1)^α (β/w_2)^(1−α) )^(1/(1−α−β))

Using the envelope theorem,

dπ/dw_1 = −z*_1   and   dπ/dp = (z*_1)^α (z*_2)^β = q*

just as Hotelling's Lemma predicts (make sure you see why this is true).
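These formulas can be verified in a few lines. The parameter and price values below are assumptions, chosen with α + β < 1 so the second-order condition holds:

```python
import math

# Check of the Cobb-Douglas input demands and Hotelling's Lemma.
# a = alpha, b = beta; a + b < 1 guarantees an interior maximum.
a, b = 0.3, 0.4
p, w1, w2 = 2.0, 1.0, 1.5

def z_star(p, w1, w2):
    e = 1.0 / (1.0 - a - b)
    z1 = (p * (b / w2) ** b * (a / w1) ** (1 - b)) ** e
    z2 = (p * (a / w1) ** a * (b / w2) ** (1 - a)) ** e
    return z1, z2

def profit(p, w1, w2):
    z1, z2 = z_star(p, w1, w2)
    return p * z1 ** a * z2 ** b - w1 * z1 - w2 * z2

z1, z2 = z_star(p, w1, w2)
assert abs(p * a * z1 ** (a - 1) * z2 ** b - w1) < 1e-9   # FOC for z1
assert abs(p * b * z1 ** a * z2 ** (b - 1) - w2) < 1e-9   # FOC for z2

h = 1e-6
dpi_dw1 = (profit(p, w1 + h, w2) - profit(p, w1 - h, w2)) / (2 * h)
dpi_dp = (profit(p + h, w1, w2) - profit(p - h, w1, w2)) / (2 * h)
assert abs(dpi_dw1 + z1) < 1e-5                 # d(pi)/dw1 = -z1
assert abs(dpi_dp - z1 ** a * z2 ** b) < 1e-5   # d(pi)/dp = q
```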
Similarly, the cost minimization problem is

min_{z_1, z_2} w_1 z_1 + w_2 z_2

subject to z_1^α z_2^β ≥ q. Solving the first order conditions and plugging back into the constraint yields

z*_1 = ( q (αw_2/(βw_1))^β )^(1/(α+β))
z*_2 = ( q (βw_1/(αw_2))^α )^(1/(α+β))

Plugging this in to get the cost function,

c(w_1, w_2, q) = w_1 ( q (αw_2/(βw_1))^β )^(1/(α+β)) + w_2 ( q (βw_1/(αw_2))^α )^(1/(α+β))

and using the envelope theorem to differentiate the cost function yields

dc(w, q)/dw_1 = z*_1   and   dc(w, q)/dw_2 = z*_2

which is Shephard's Lemma.
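The same finite-difference check works for the cost side. Parameter values below are again assumptions:

```python
import math

# Check of the conditional factor demands and Shephard's Lemma for
# q = z1^a * z2^b (the a, b, w1, w2, q values are assumed).
a, b = 0.3, 0.4
w1, w2, q = 1.0, 1.5, 2.0

def z_cond(w1, w2, q):
    e = 1.0 / (a + b)
    z1 = (q * (a * w2 / (b * w1)) ** b) ** e
    z2 = (q * (b * w1 / (a * w2)) ** a) ** e
    return z1, z2

def cost(w1, w2, q):
    z1, z2 = z_cond(w1, w2, q)
    return w1 * z1 + w2 * z2

z1, z2 = z_cond(w1, w2, q)
assert abs(z1 ** a * z2 ** b - q) < 1e-9       # demands hit the target q

h = 1e-6
dc_dw1 = (cost(w1 + h, w2, q) - cost(w1 - h, w2, q)) / (2 * h)
dc_dw2 = (cost(w1, w2 + h, q) - cost(w1, w2 - h, q)) / (2 * h)
assert abs(dc_dw1 - z1) < 1e-5                 # dc/dw1 = z1(w, q)
assert abs(dc_dw2 - z2) < 1e-5                 # dc/dw2 = z2(w, q)
```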
17.6 Scale

A firm has constant returns to scale (CRS) if f(λz) = λf(z) for all λ ≥ 0. Differentiating with respect to λ and evaluating at λ = 1 yields

f(z) = Σ_ℓ (∂f(z)/∂z_ℓ) z_ℓ = z·Df(z)

which is a useful property often invoked in macroeconomics. When do our usual production functions satisfy this condition? For CES technologies,

f(λz) = ( Σ_ℓ α_ℓ (λz_ℓ)^ρ )^(1/ρ) = λ( Σ_ℓ α_ℓ z_ℓ^ρ )^(1/ρ) = λf(z)

So there is at least one non-trivial CRS technology. When is Cobb-Douglas CRS?

f(λz) = Π_ℓ (λz_ℓ)^(α_ℓ) = λ^(Σ_ℓ α_ℓ) Π_ℓ z_ℓ^(α_ℓ)

which equals λf(z) iff λ^(Σ_ℓ α_ℓ) = λ, which requires Σ_ℓ α_ℓ = 1. If Σ_ℓ α_ℓ < 1, then for λ > 1,

f(λz) < λf(z)

so there are decreasing returns to scale, and if Σ_ℓ α_ℓ > 1, then for λ > 1,

f(λz) > λf(z)
so there are increasing returns to scale.

In economics, decreasing and constant returns to scale technologies are generally well-behaved, but increasing returns to scale can pose a number of problems. To see this clearly, consider a one-input technology q = z^α, where the firm faces price p for its output and pays w for the input. Then

π(z) = pz^α − wz

The first-order condition is

0 = pαz^(α−1) − w

and the second-order condition is

0 > pα(α − 1)z^(α−2)

when evaluated at a critical point.

The first thing to note is that if α > 1, the second-order condition cannot be satisfied, since it will always be positive. Therefore, when conceptualizing increasing returns to scale situations, the story needs to be adjusted so that (i) the firm faces varying input costs w or output prices p, or (ii) the firm is a monopolist or part of a concentrated industry. For example, a firm might completely saturate demand in a market, driving p to zero, or there might be a rare input that is costly to acquire that limits the firm's scale of production. In other industries, costs might be constant, but there may be fixed costs of entry, and incumbent firms build up capacity, thereby discouraging entry of new competitors. Consequently, it is not the increasing returns to scale technology that determines the firm's maximizing behavior, but its strategic motives.
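A two-line computation makes the problem concrete. The values p, w, α below are assumptions for illustration:

```python
# With increasing returns to scale (alpha > 1), a price-taking firm's
# profit p*z^alpha - w*z has no interior maximum: the unique critical
# point fails the second-order condition, and profit is unbounded above.
p, w, alpha = 1.0, 2.0, 1.5

def profit(z):
    return p * z ** alpha - w * z

z_c = (w / (p * alpha)) ** (1.0 / (alpha - 1.0))    # solves the FOC
soc = p * alpha * (alpha - 1.0) * z_c ** (alpha - 2.0)
assert soc > 0                                      # SOC fails: a local minimum

assert profit(100.0) < profit(10000.0) < profit(10 ** 8)  # unbounded above
```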
For decreasing returns to scale technologies, f(λz) = λ^α z^α < λz^α, so that α < 1. In this case, the second-order condition is automatically satisfied, and any solution to the first-order condition is a local maximum.

For constant returns to scale technologies, α = 1 and the profit function becomes

π = max_z pz − wz = (p − w)z

If p > w, then the firm wants to make an infinite amount. If p < w, the firm can never make any money, so it would produce nothing. If p = w exactly, then the firm is indifferent about whether and how much to make, since it doesn't make an economic profit on any of the units. Consequently, the solution correspondence is

z(p, w) = { 0 if p < w;  [0, ∞) if p = w;  ∅ if p > w }
17.7 Efficient Production

The goal of much of economics is to decide whether or not a given market generates efficient outcomes. Generally, this is in the sense of Pareto: an outcome is efficient if no agent can be made strictly better off without making some agent strictly worse off. For example, if a society were simply throwing away tons of food, this is ostensibly inefficient, because even if no individual member of the society wanted the food, the resources that went into producing the excess output could have been put into something more socially useful. This is the same sense in which we'll talk about efficient production:

Definition 17.7.1 A production plan y ∈ Y is efficient if there is no y' ∈ Y such that y' ≥ y and y' ≠ y.
This means that a firm is operating efficiently if there is no other plan that uses weakly fewer inputs to produce weakly more outputs. If you consider

F(y_1, y_2) = y_1 + e^(αy_2) − 1

there are many choices of (y_1, y_2) that satisfy y_1 + e^(αy_2) − 1 < 0, and choosing an interior point is like a consumer choosing a consumption bundle in the interior of their budget set. This motivates

Proposition 17.7.2 (5F1) If y ∈ Y is profit maximizing for some p ≫ 0, then y is efficient.

Proof Suppose there is a y' ∈ Y such that y' ≠ y and y' ≥ y. Because p ≫ 0, this implies p·y' > p·y, implying that y is not profit maximizing.

This is a simple version of the First Fundamental Theorem of Welfare Economics (look up Proposition 10.D.1 on page 326 and Proposition 16.C.1 on page 549): competitive, price-taking behavior generates efficient outcomes.
We might ask the opposite question, however: for an efficient plan y, does there exist a price vector for which that y is profit-maximizing? Note that we are really asking whether a particular kind of objective function exists. The answer comes from separating hyperplane theorems such as the following:

Theorem 17.7.3 (Edelheit) Let K_1 and K_2 be convex sets in R^N such that K_1 has interior points and K_2 contains no interior points of K_1. Then there is a closed hyperplane H separating K_1 and K_2; i.e., there is a vector x* ≠ 0 such that

sup_{x ∈ K_1} x·x* ≤ inf_{x ∈ K_2} x·x*
So if we have a convex set K_1 such that there is some x ∈ K_1 with a ball B_ε(x) ⊆ K_1, and any convex set K_2 (even a single point), we can separate them with a hyperplane. To use this theorem, we set K_1 = Y, the set of feasible production plans, and assume it is convex. Pick an efficient y* that we want to make profit maximizing. We then define the set K_2 = {y : y ≫ y*}, which contains all the plans that are "more efficient" than y*, and infeasible, since they lie to the northeast of the feasible set Y. Note that K_2 contains no interior point of K_1 = Y, and both sets are convex. Then the Edelheit separation theorem implies there exists a vector p* such that

max_{y ∈ Y = K_1} p*·y ≤ min_{y ∈ K_2} p*·y

Note that the left-hand side of the inequality says that y* is profit-maximizing over Y, so we have proven a p* exists so that the efficient production plan y* is profit-maximizing. The right-hand side says that any plan satisfying y ≫ y* would do better, but is not feasible.

A bit more formally, and using MWG's version of the separating hyperplane theorem:
Proposition 17.7.4 (5F2) Suppose that Y is convex. Then every efficient production plan y ∈ Y is a profit-maximizing production plan for some price vector p ≥ 0, p ≠ 0.

Proof (i) Since y is efficient, any plan y' ≫ y must lie outside Y, namely to the northeast, or else there would be a y' ∈ Y with y' ≥ y, so y could not have been efficient. (ii) That means the set P_y = {y' : y' ≫ y} is to the northeast of some point on the boundary of Y. Note that P_y is a convex set as well, since if we take any two vectors y', y'' ≫ y, their convex combination is also strictly greater than y. (iii) Then we have two convex sets, Y and P_y, with Y ∩ P_y = ∅. These are the hypotheses of the Separating Hyperplane Theorem: there exists a hyperplane {y' : p·y' = c} for which Y is on one side of the hyperplane and P_y is on the other side. Consequently, there exists a vector p for which p·y' ≥ p·y for all y' ∈ P_y and p·y' ≤ p·y for all y' ∈ Y. The second inequality implies that y achieves at least as much profit as any other feasible y' ∈ Y, and the first inequality implies that anything that achieves a higher profit must be infeasible at p. Therefore, any efficient y is profit-maximizing for some p.

This is a simple version of the Second Fundamental Theorem of Welfare Economics (look up Proposition 10.D.2 on page 327 and Proposition 16.D.1 on page 553): any efficient outcome can be supported by competitive, price-taking behavior, provided that preferences are convex. This is a condition that generally can't be relaxed in the assumptions of the second welfare theorem, but note that it isn't assumed in the statement of the first welfare theorem.
Exercises
MWG: 5B2, 5B6, 5C3, 5C6, 5C9, 5C10, 5D1, 5D2
Chapter 18

Aggregation

We have two basic questions:

When can we treat a collection of j = 1, ..., J firms as a single firm?

When can we treat a collection of i = 1, ..., I consumers as a single consumer?

In terms of firms, we want to think of a single aggregate supply curve y(p) that accurately reflects how all the individual firms behave. In terms of consumers, we want to know when

Σ_{i=1}^I x_i(p, w_i) = x(p, Σ_{i=1}^I w_i)

so that a single aggregate Walrasian demand curve captures all the information contained in all the individual demand curves, and relies only on aggregate wealth.

These questions are intimately related to the idea of a representative firm and a representative consumer: can groups of agents be imagined as a single entity without loss of generality? In that case, we could think of the profit function π(p) associated with aggregate supply y(p) as describing the payoffs and behavior of a single firm, rather than dealing with industries of firms. Likewise, if we could work with a single indirect utility function v(p, Σ_{i=1}^I w_i) associated with the aggregate Walrasian demand curve, we could treat all the consumers as a single entity.
18.1 Firms

Let the individual supply correspondence of firm j be

y*_j(p) = Argmax_{y_j ∈ Y_j} p·y_j

where Y_j is firm j's production set. If we imagine that the firms are all merged into one entity, this entity would solve

max_{y_1 ∈ Y_1, ..., y_J ∈ Y_J} Σ_{j=1}^J p·y_j

yielding the aggregate supply correspondence

y*(p) = Argmax_{y_1 ∈ Y_1, ..., y_J ∈ Y_J} Σ_{j=1}^J p·y_j
Theorem 18.1.1 The aggregate supply correspondence equals the sum of the individual supply correspondences,

y*(p) = Σ_{j=1}^J y*_j(p)

which is the supply function of the representative firm, and the aggregate profit function equals the sum of the individual profit functions,

π*(p) = Σ_{j=1}^J p·y*_j(p)

which is the profit function of the representative firm.

Proof Note that anything that is feasible for the individual firms individually is feasible for the aggregate firm, and anything that is feasible for the aggregate firm can be achieved by having the individual firms each pick their part of the optimal aggregate plan.

First, we show that the individual firms can't do worse than the aggregate firm. If the individual firms are profit maximizing, then for all plans y'_j feasible for firm j,

p·y*_j(p) ≥ p·y'_j

and summing over j yields

Σ_{j=1}^J p·y*_j(p) ≥ Σ_{j=1}^J p·y'_j

so that the sum of profits of the individual firms is an upper bound on the profits achievable by the aggregate firm. Since the summed plan is feasible for the aggregate firm, it must be a global maximizer. Therefore y*(p) = Σ_{j=1}^J y*_j(p), and multiplying by p yields π*(p) = Σ_{j=1}^J p·y*_j(p).
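The theorem can be illustrated with a tiny discrete example: when each production set is a finite list of plans, the aggregate optimum is exactly the sum of the individual optima. The random plans below are an assumption purely for illustration:

```python
import random

# Finite sketch of Theorem 18.1.1: two firms with small random production
# sets in R^2. The aggregate firm's best total profit equals the sum of
# the firms' individual maxima.
random.seed(1)
p = (2.0, 3.0)
Y1 = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]
Y2 = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(50)]

def value(plan):
    return p[0] * plan[0] + p[1] * plan[1]

y1 = max(Y1, key=value)      # firm 1's individual optimum
y2 = max(Y2, key=value)      # firm 2's individual optimum

# Aggregate firm: jointly choose one plan from each production set
agg = max(((a, b) for a in Y1 for b in Y2),
          key=lambda pair: value(pair[0]) + value(pair[1]))

assert value(agg[0]) + value(agg[1]) == value(y1) + value(y2)
```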
18.2 Consumers

Consumers are more complicated because of wealth effects, as usual. We would like to know when consumers' individual demand curves x_1(p, w_1), x_2(p, w_2), ..., x_I(p, w_I) satisfy

Σ_{i=1}^I x_i(p, w_i) = x(p, Σ_{i=1}^I w_i)

where x(p, Σ_{i=1}^I w_i) is the aggregate demand curve, which depends only on prices and aggregate wealth.

Imagine we redistribute some money among all the consumers in the economy but take none away from them in total, so that dw = Σ_{i=1}^I dw_i = 0. Then the effect of this redistribution on consumption of good j is

Σ_{i=1}^I (∂x_ij(p, w_i)/∂w) dw_i

If this were equal to zero for (i) all goods j, (ii) all initial wealth distributions (w_1, ..., w_I), and (iii) all patterns of redistribution dw, then we could conclude that the distribution of wealth does not matter for determining demand behavior, only the total amount of wealth. This would then imply that Σ_{i=1}^I x_i(p, w_i) = x(p, Σ_{i=1}^I w_i) (check that you understand this completely). This requires

Σ_{i=1}^I (∂x_ij(p, w_i)/∂w) dw_i = 0

Let's consider consumer k and consumer ℓ, where dw_k + dw_ℓ = 0; since all dw must be considered, we can repeat the following steps for all pairs of individuals in the economy, so the actual k and ℓ selected are not important. Then

(∂x_kj(p, w_k)/∂w) dw_k + (∂x_ℓj(p, w_ℓ)/∂w) dw_ℓ = 0

Then dw_ℓ = −dw_k, so

(∂x_kj(p, w_k)/∂w) dw_k = (∂x_ℓj(p, w_ℓ)/∂w) dw_k

and

∂x_kj(p, w_k)/∂w = ∂x_ℓj(p, w_ℓ)/∂w

implying that the wealth effects must be constant across all individuals in the economy for all goods. But that in turn implies that consumers' demands must be essentially linear in wealth (this is the key to the Gorman form).
Theorem 18.2.1 Suppose that consumers' indirect utility functions satisfy the Gorman form,

v_i(p, w_i) = a_i(p) + b(p)w_i

Then there exists a representative consumer with indirect utility function v(p, Σ_{i=1}^I w_i) whose Walrasian demand curve satisfies

x(p, Σ_{i=1}^I w_i) = Σ_i x_i(p, w_i)
Proof If consumers' indirect utility functions satisfy the Gorman form, define

v(p, Σ_{i=1}^I w_i) = Σ_{i=1}^I a_i(p) + b(p) Σ_{i=1}^I w_i

If the consumers' indirect utility functions are increasing in w_i, non-increasing in p, quasi-convex in (p, w), and continuous, then v(p, Σ_{i=1}^I w_i) will be as well, so it is a valid indirect utility function.

Now, Roy's identity implies that the consumers' individual demand curves must be

x_i(p, w_i) = −(D_p a_i(p) + D_p b(p) w_i)/b(p)

which implies

Σ_{i=1}^I x_i(p, w_i) = −Σ_{i=1}^I (D_p a_i(p) + D_p b(p) w_i)/b(p)

Applying Roy's identity to v(p, Σ_{i=1}^I w_i) with W = Σ_{i=1}^I w_i yields

x(p, W) = −(Σ_{i=1}^I D_p a_i(p) + D_p b(p) W)/b(p)

and

x(p, W) = Σ_{i=1}^I x_i(p, w_i)

Therefore, the representative consumer with indirect utility function v(p, W) has the same aggregate demand curve x(p, W) as the sum of the individual Walrasian demand curves.
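A familiar special case makes the theorem concrete: identical Cobb-Douglas consumers have Gorman-form indirect utility with a_i(p) = 0 and a common b(p), so aggregate demand depends only on total wealth. The parameter and wealth values below are assumptions for illustration:

```python
# Identical Cobb-Douglas consumers, u_i = x1^a * x2^(1-a), have Walrasian
# demand linear in wealth, so aggregate demand depends only on the total.
a = 0.3
p1, p2 = 2.0, 5.0

def demand(w):
    # Walrasian demand of one consumer: x1 = a*w/p1, x2 = (1-a)*w/p2
    return (a * w / p1, (1 - a) * w / p2)

def aggregate(wealths):
    xs = [demand(w) for w in wealths]
    return (sum(x[0] for x in xs), sum(x[1] for x in xs))

d1 = aggregate([10.0, 20.0, 30.0])   # total wealth 60
d2 = aggregate([55.0, 4.0, 1.0])     # same total, very different split

assert abs(d1[0] - d2[0]) < 1e-12 and abs(d1[1] - d2[1]) < 1e-12
assert abs(d1[0] - a * 60.0 / p1) < 1e-12   # equals demand at total wealth
```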
Now, you might ask, can we do better? For example, if we allow aggregate Walrasian demand to depend on more moments of the distribution of wealth, like the variance, skewness, and so on, maybe this added information will allow more general results? Of course, but there's clearly a limit. If consumers all behave differently at all different wealth levels, so that individual quirkiness doesn't cancel out in the aggregate, there won't be a representative consumer. For your research, it is better to adopt more advanced tools from industrial organization or macroeconomics to deal with heterogeneity without assuming that representative agents exist, or to focus on questions for which a representative agent can be shown to be a relatively good approximation. For example, researchers have adopted discrete choice models to study micro-level consumer behavior and do welfare analysis (see the Handbook of Economics chapters by McFadden, Train's book, or Anderson, de Palma and Thisse's Discrete Choice Theory of Product Differentiation). These models go many steps further in allowing comparisons of consumer behavior and incorporate broader information about the characteristics of the goods and consumers.
Exercises

Exercises: 5E1, 5E5, 4B1, 4C11, 4D2

Show that the aggregate profit function satisfies Hotelling's Lemma, and that D_p π*(p) = Σ_{j=1}^J y*_j(p).

Show that the Hessian of the aggregate profit function is the sum of the derivatives of the individual firms' supply functions with respect to p.

If each individual firm's cost function is c_j(w, q_j), can the aggregate firm's profit maximization problem be written as

max_q pq − Σ_j c_j(w, q_j)

subject to q = Σ_j q_j? If so, show that the solution to this problem is the same as the solution to the profit maximization problem studied above.

Suppose consumer i's utility function is u_i(x_1, x_2) = x_1^(α_i) x_2^(β_i). For what restrictions on (α_i, β_i) do the indirect utility functions of this collection of consumers satisfy the Gorman form? What does the aggregate demand curve look like?

Suppose consumer i's utility function is u_i(x_1, x_2) = x_1^(α_i) (x_2 − γ_i)^(β_i), where α_i + β_i = 1 for all i. For what restrictions on (α_i, β_i, γ_i) do the indirect utility functions of this collection of consumers satisfy the Gorman form? What does the aggregate demand curve look like?

For consumers with quasi-linear utility functions, u_i(x, m) = φ_i(x) + m, where x = (x_1, ..., x_L) and m is money spent on other things, show that there is always a representative consumer and an aggregate demand curve for the goods x that depends only on aggregate wealth.

For a collection of consumers whose preferences satisfy the Gorman form, derive the expenditure function for the representative consumer. Derive Hicksian demand and compensating variation.

Is the Slutsky matrix that corresponds to aggregate Walrasian demand the sum of the Slutsky matrices that correspond to individual Walrasian demand?
Part V
Other Topics
Chapter 19

Choice Under Uncertainty

19.1 Probability and Random Variables

Many interesting economic situations involve uncertainty, and a theory of decision-making under risk becomes necessary to provide meaningful results. Before we look at decision-making under uncertainty, we want a simple model of randomness that doesn't involve too much detail, but does have the structure we need to think about uncertainty.

Let Ω be a space of outcomes, like the pips on a die, the temperature on a given day, the color and number of a slot on a roulette wheel, or the suit and type of a card drawn from a deck of cards. This is just a set; it doesn't have to be numerical, ordered, or anything else. Then we define three objects to describe an uncertain outcome:

Definition 19.1.1 A probability space is a triple (Ω, E, p) such that

Ω is a set of outcomes

E is a set of events: E is a set of subsets of Ω such that (i) ∅ and Ω are in E, (ii) if E ∈ E, then E^c ∈ E, (iii) if E_1, E_2, ... are events in E, then ∪_i E_i ∈ E and ∩_i E_i ∈ E

p(E) is a probability distribution: p : E → [0, 1] so that p(Ω) = 1, p(∅) = 0, and if E_1, E_2, ..., E_n, ... is an arbitrary collection of disjoint sets, p(∪_i E_i) = Σ_i p(E_i)
A probability space can describe any random environment: for example, the various things that can happen when you roll a die, like getting six dots, an even or an odd number of dots, or one, two or five dots. Similarly, you could reach into a jar of red and green balls, and the outcomes are R or G. The definition of a probability space puts just enough structure on the outcomes and events so that any resulting (conditional or marginal) probability distribution will be well-defined, and we can take the kinds of limits we want to.
Example A household might have a concave utility function u(c_t) for consumption in each period, but face a sequence of random wages y_t and prices p_t. It can save assets a_t at a risk-free interest rate r. Then the household wants to solve:

U(y_0, p_0) = lim_{T→∞} E_{y_t, p_t}[ Σ_{t=0}^T β^t u(c_t) | y_0, p_0 ]

subject to

a_{t+1} = (1 + r)(a_t + y_t − p_t c_t),   c_t ≥ 0.

See how even computing U(y_0, p_0) is an involved mathematical process? We want to take a limit of an expected sum, but that requires interchanging the limit and the expectation (which is itself a kind of limit). For such manipulations to be well-defined, your probability space has to have at least the properties given above.
For quantitative analysis, we define a random variable, mostly separate from the probability space:

Definition 19.1.2 A random variable is a function X : Ω → R, where R ⊆ ℝ.

Note that a random variable X : Ω → R is just a function that assigns real numbers to each of the outcomes. There is no such thing as a "normal random variable": there are random variables that are normally distributed. The variable is just the mapping from outcomes to real numbers, and depends on the application. In fact, any probability space can have many random variables defined on it; which ones are interesting is driven by the economic application at hand.

Example Consider rolling a die. Then the following are probability spaces and random variables:

The random variable is the number of realized pips, so that Ω = {1, 2, 3, 4, 5, 6}, the set of events E is the set of all subsets of Ω, p(k) = 1/6, and X(ω) = ω.

The random variable is whether or not the die roll is even, so that Ω = {1, 2, 3, 4, 5, 6}, the set of events E is {∅, Ω, O, E} where O = {1, 3, 5} and E = {2, 4, 6}, p(O) = 1/2 and p(E) = 1/2, and X(E) = 1 and X(O) = 0.

The random variable is whether a six is rolled, so that Ω = {1, 2, 3, 4, 5, 6}, the set of events E is {∅, Ω, A, S} where A = {1, 2, 3, 4, 5} and S = {6}, p(A) = 5/6 and p(S) = 1/6, and X(S) = 1 and X(A) = 0.

Clearly, there are many probability spaces and random variables that we might construct from the idea of a die roll. If we are betting on the outcome 6, the last version is the one of interest. If we care about the average number of pips, the first version is best.
Example For flipping two coins, consider the probability space

Ω = {HH, HT, TH, TT}

E = { ∅, Ω,
{HH, TH, HT}, {HH, TT, TH}, {TT, TH, HT}, {HH, TT, HT},
{HH, HT}, {TH, TT}, {HT, TH}, {HH, TT},
{HH}, {HT}, {TH}, {TT} }

p(Ω) = 1, p(∅) = 0, p({HH, HT}) = p({TH, TT}) = p({HT, TH}) = p({HH, TT}) = 1/2, p(HH) = p(HT) = p(TH) = p(TT) = 1/4, and p({HH, TH, HT}) = p({HH, TT, TH}) = p({TT, TH, HT}) = p({HH, TT, HT}) = 3/4.

This is a complete probabilistic characterization of flipping two coins. For instance, we can ask, "What's the probability at least one coin is a heads?" That's p({HT, TH, HH}) = 3/4. Or we can ask, "What's the probability of getting no heads?" That's p({HT, TH, HH}^c) = p(TT) = 1/4.
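Because the outcomes here are equally likely, the whole space can be enumerated in a few lines, which makes the event calculations above easy to verify:

```python
# The two-coin probability space, enumerated directly. Each of the four
# outcomes is equally likely, so any event E has p(E) = |E| / 4.
omega = ["HH", "HT", "TH", "TT"]

def prob(event):
    return len(set(event)) / 4.0

assert prob(omega) == 1.0 and prob([]) == 0.0      # p(Omega)=1, p(empty)=0

at_least_one_H = [o for o in omega if "H" in o]
assert prob(at_least_one_H) == 0.75                # p({HH, HT, TH}) = 3/4
assert prob(set(omega) - set(at_least_one_H)) == 0.25   # p(no heads) = 1/4

# Additivity over disjoint events: p(E1 u E2) = p(E1) + p(E2)
E1, E2 = ["HH"], ["TT", "TH"]
assert prob(E1 + E2) == prob(E1) + prob(E2)
```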
We mostly use continuous random variables in economics, where Ω = (a, b) ⊆ ℝ. Then E is the set of open subsets of ℝ, and we define the probability distribution function or cumulative distribution function as

F(x) = p({ω : X(ω) ≤ x})

For example,

F(x) = ∫_{−∞}^x (1/(√(2π)σ)) e^(−(1/2)((z−μ)/σ)^2) dz

is the probability that a normally distributed random variable is less than x. Or, for a variable uniformly distributed between 0 and 1,

F(x) = { 0 if x < 0;  x if x ∈ [0, 1];  1 if x > 1 }

The derivatives of these distribution functions are density functions, or probability density functions,

f(x) = (1/(√(2π)σ)) e^(−(1/2)((x−μ)/σ)^2)

or

f(x) = { 1 if x ∈ [0, 1];  0 otherwise }

The most popular thing to do with a random variable is take the expected value,

E[X] = ∫_Ω x(ω) dP(ω) = { ∫_Ω x(ω)p(ω)dω if Ω is continuous;  Σ_ω p(ω)x(ω) if Ω is discrete }

If a function is first applied to X(ω) to get u(X(ω)), this is written

E[u(X)] = ∫_Ω u(x(ω)) dP(ω) = { ∫_Ω u(x(ω))p(ω)dω if Ω is continuous;  Σ_ω p(ω)u(x(ω)) if Ω is discrete }
There are two key facts to remember when working with economic problems and random vari-
ables.
Theorem 19.1.3 (Jensens Inequality) If u() is a concave function, then
E[u(X)] u(E[X])
and if u() is a convex function, then
E[u(X)] u(E[X])
Proposition 19.1.4 Suppose that X is distributed F, with density f, on [a, b], and u(·) is normalized so that u(a) = 0. Then

    E[u(X)] = ∫_a^b u′(x)(1 − F(x)) dx

This is because, integrating by parts,

    E[u(X)] = ∫_a^b u(x)f(x) dx = −∫_a^b u(x) d{1 − F(x)} = [−(1 − F(x))u(x)]_a^b + ∫_a^b u′(x)(1 − F(x)) dx

and the boundary term vanishes since 1 − F(b) = 0 and u(a) = 0. This is really helpful because we often have information about F and u′(·), but not about f (for example, look up the definition of first-order stochastic dominance).
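The integration-by-parts identity can be checked numerically; here is a minimal sketch for X uniform on [0, 1] (so F(x) = x) and u(x) = √x, which satisfies u(0) = 0:

```python
import math

# X ~ U[0,1], so F(x) = x; u(x) = sqrt(x) satisfies u(0) = 0.
# Compare E[u(X)] = int u(x) f(x) dx with int u'(x) (1 - F(x)) dx.
n = 200_000
grid = [(i + 0.5) / n for i in range(n)]   # midpoint rule on [0, 1]

lhs = sum(math.sqrt(x) for x in grid) / n                  # int sqrt(x) * 1 dx
rhs = sum(0.5 / math.sqrt(x) * (1 - x) for x in grid) / n  # int u'(x)(1 - F(x)) dx

assert abs(lhs - 2 / 3) < 1e-3      # the true value is 2/3
assert abs(lhs - rhs) < 2e-3        # the two representations agree
```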
Note that if you have two random variables, X and Y, over the same probability space (Ω, E, p), then

    E[X + Y] = ∫_Ω (x(ω) + y(ω)) dP(ω) = ∫_Ω x(ω) dP(ω) + ∫_Ω y(ω) dP(ω) = E[X] + E[Y]

and

    E[αX] = ∫_Ω αx(ω) dP(ω) = α ∫_Ω x(ω) dP(ω) = αE[X]

so that the integral is additive and linear. However, the following is not, in general, true:

    E[XY] = ∫_Ω x(ω)y(ω) dP(ω) = E[X]E[Y]

For example, the covariance of X and Y is defined as

    cov(X, Y) = E[(X − E[X])(Y − E[Y])] = E[XY − XE[Y] − Y E[X] + E[X]E[Y]]
              = E[XY] − E[X]E[Y] − E[Y]E[X] + E[X]E[Y]
              = E[XY] − E[X]E[Y]

so that E[XY] = cov(X, Y) + E[X]E[Y], and the expectation operator can be broken up as E[XY] = E[X]E[Y] only if cov(X, Y) = 0, which is a special case.¹

Two random variables X and Y are independent iff their joint density factors as f(x, y) = f_X(x)f_Y(y), so that

    E[XY] = ∫_x ∫_y xy f(x, y) dy dx = ∫_x ∫_y xy f_X(x)f_Y(y) dy dx = (∫_x x f_X(x) dx)(∫_y y f_Y(y) dy) = E[X]E[Y]
19.2 Choice Theory with Uncertainty

Now we want to develop a theory of choice similar to what we had for goods, a rational preference relation, but for situations where outcomes are random. Suppose there is a set of final outcomes C₁, C₂, ..., C_N that are non-stochastic, like the prizes in a lottery. We then have the set of probability distributions over {C₁, C₂, ..., C_N}, which is a simplex, and is written L, the space of lotteries. We then say L ≽ L′ if the agent prefers to face lottery L rather than lottery L′.
Definition 19.2.1 A simple lottery L is a list L = (p₁, p₂, ..., p_N) with p_n ≥ 0 for all n and Σ_n p_n = 1, where p_n is the probability that outcome n occurs. Given K simple lotteries L_k = (p_1^k, p_2^k, ..., p_N^k), k = 1, 2, ..., K, and weights α₁, α₂, ..., α_K ≥ 0 with Σ_k α_k = 1, the compound lottery (L₁, L₂, ..., L_K; α₁, α₂, ..., α_K) is the risky alternative that yields the simple lottery L_k with probability α_k.
So a compound lottery is a lottery whose outcomes are themselves lotteries: it yields the simple lottery L_k with probability α_k. For example, as a high school senior you might be uncertain about where you'll be admitted to college, which will then lead to a lottery over first jobs or graduate schools, which then leads to a lottery over careers. At each step, some uncertainty is resolved, but a lot of uncertainty remains.

The reduced lottery L′ is obtained from the compound lottery (L₁, L₂, ..., L_K; α₁, α₂, ..., α_K) by setting

    p′_n = α₁ p_n^1 + α₂ p_n^2 + ... + α_K p_n^K

for each outcome n.

¹ And also E[X·X] = cov(X, X) + E[X]E[X], so that E[X²] = var(X) + E[X]².
So instead of worrying about the intermediate uncertainty of college and first jobs or grad school, we combine the probabilities over all the paths to generate a simple lottery over final careers.
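The reduction formula is just a probability-weighted average of the component lotteries, as a short sketch shows (the particular lotteries are made up):

```python
# Two simple lotteries over three final outcomes, mixed with weights (0.6, 0.4).
L1 = [0.5, 0.5, 0.0]
L2 = [0.1, 0.2, 0.7]
alphas = [0.6, 0.4]

# p'_n = alpha_1 * p^1_n + alpha_2 * p^2_n
reduced = [sum(a * L[n] for a, L in zip(alphas, [L1, L2])) for n in range(3)]

assert abs(sum(reduced) - 1.0) < 1e-12   # still a probability vector
assert all(abs(r - e) < 1e-12 for r, e in zip(reduced, [0.34, 0.38, 0.28]))
```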
If we suppose the decision-maker only cares about the final outcomes, C = {C₁, C₂, ..., C_N}, then this is a reasonable way to describe many situations. In particular, let L be the set of all simple lotteries over the set C. This is called a probability simplex: take the N basis vectors e_n = (0, 0, ..., 0, 1, 0, ..., 0) of R^N, where the 1 occurs in the n-th spot, and consider the set of lotteries

    L = { p₁e₁ + p₂e₂ + ... + p_N e_N : Σ_n p_n = 1, 0 ≤ p_n ≤ 1 }

Each basis vector e_n is then the lottery yielding outcome C_n for sure, and each element L ∈ L is some convex combination of these degenerate lotteries. These lotteries are what the decision-maker is going to have preferences over, rather than the sure outcomes themselves.
Definition 19.2.2 (6B3) The rational preference relation ≽ on L is continuous if, for all L, L′, L″ ∈ L, the sets

    { α ∈ [0, 1] : αL + (1 − α)L′ ≽ L″ }   and   { α ∈ [0, 1] : L″ ≽ αL + (1 − α)L′ }

are closed.

Think of it this way: there is a sequence α_n slowly shifting weight from L′ to L. If the compound lotteries α_nL + (1 − α_n)L′ are preferred to L″ for all n, then the limiting lottery αL + (1 − α)L′ should also be preferred to L″. So if we perturb a lottery very slightly, it doesn't change the agent's preferences.
The continuity axiom also guarantees that there exists a continuous function U(L) that represents the preferences ≽ on L. What we would like, however, is that U(L) essentially look like an expected value.

Definition 19.2.3 (6B4) The preference relation ≽ on L satisfies the independence axiom if for all L, L′, L″ ∈ L and α ∈ (0, 1), L ≽ L′ iff

    αL + (1 − α)L″ ≽ αL′ + (1 − α)L″

So if we have two lotteries L and L′ with L ≽ L′ and we mix each with a third lottery L″, the preference ordering cannot reverse. This axiom is probably violated systematically and often, and many choice theorists think about how to relax it without losing the desirable parts of choice theory.
The most desirable part of this theory deals with the form and existence of utility functions. For choice sets X, it was true that any continuous, rational preference relation could be represented by a continuous utility function u(·). For these lotteries, there is a similar result.

Definition 19.2.4 (6B5) The utility function U : L → R has the expected utility form if there is an assignment of numbers (u₁, u₂, ..., u_N) to the N outcomes in C such that for every simple lottery L = (p₁, ..., p_N) ∈ L, we have

    U(L) = p₁u₁ + p₂u₂ + ... + p_N u_N

Such a utility function is called a von Neumann-Morgenstern expected utility function.
This is mathematically a nice form, because U(L) = E[u_i] = Σ_i p_i u_i, so the decision-maker's preferences can be represented as an expectation. That is the main benefit of this form. There is also substantial evidence that in a statistical comparison between this and other theories, such as prospect theory, the gains are not very large (prospect theory replaces ∫ p(x)u(x) dx with ∫ π(p(x))u(x) dx, so that agents have non-linear sensitivities to the probability weights). The next proposition is very important:

Proposition 19.2.5 (6B1) A utility function U : L → R has an expected utility form iff it is linear:

    U(Σ_k α_k L_k) = Σ_k α_k U(L_k)

for any K lotteries L_k ∈ L and probabilities (α₁, α₂, ..., α_K) with Σ_k α_k = 1.
To put this result in perspective, think about the similar result that "a preference relation ≽ can be represented by a continuous utility function u(·) only if ≽ is rational and continuous." The above proposition says that preferences have the expected utility form if and only if we can move seamlessly between one level of compound lotteries and the next. For example, consider a situation with two certain outcomes C = {c₁, c₂}. If U has the expected utility form, then

    U(α₁L₁ + α₂L₂) = α₁U(L₁) + α₂U(L₂)

for any probability distribution (α₁, α₂). But then note that the terms U(L₁) and U(L₂) appear in the new expression. Suppose that L₁ corresponds to the lottery (β₁, β₂) over two new lotteries over the certain outcomes, L₃ and L₄, and L₂ corresponds to the lottery (γ₁, γ₂) over two more new lotteries, L₅ and L₆. Then

    U(α₁L₁ + α₂L₂) = α₁β₁U(L₃) + α₁β₂U(L₄) + α₂γ₁U(L₅) + α₂γ₂U(L₆)

We can keep drilling down until we reach the degenerate lotteries U(e₁) and U(e₂) that correspond to the certain outcomes. For example, let L₃ = L₅ = e₁ and L₄ = L₆ = e₂. Then

    U(α₁L₁ + α₂L₂) = (α₁β₁ + α₂γ₁)U(e₁) + (α₁β₂ + α₂γ₂)U(e₂)

This is incredibly valuable, since now the expected utility of the agent depends only on the value of the certain outcomes and the probabilities that are generated by the above drilling-down process.
Proof The expected utility form allows us to write

    U(Σ_k α_k L_k) = Σ_n u_n (Σ_k α_k p_n^k)

just as the drilling-down process above illustrates. Our job is to show that the linear form is equivalent to the expected utility form.

Suppose U is linear, so that U(Σ_k α_k L_k) = Σ_k α_k U(L_k). Any simple lottery L_k can itself be written as the compound lottery Σ_n p_n^k e_n over the degenerate lotteries, so applying linearity again gives U(L_k) = Σ_n p_n^k U(e_n) = Σ_n p_n^k u_n, where u_n = U(e_n). Hence

    U(Σ_k α_k L_k) = Σ_k α_k Σ_n p_n^k u_n

and U has the expected utility form. Suppose now that U has the expected utility form. Then

    U(Σ_k α_k L_k) = Σ_n u_n (Σ_k α_k p_n^k) = Σ_k α_k (Σ_n u_n p_n^k) = Σ_k α_k U(L_k)

so U is linear. ∎
Monotone transformations of ordinal utility functions induced the same orderings over the options, so if an agent's preferences could be represented by a continuous utility function u(x), then for any monotone increasing function g(·), g(u(x)) represented the same preferences. This is not true for expected utility. The above proposition shows that a utility function has the expected utility form iff it is linear, and only affine transformations of linear functions are still linear:

    βU(L) + γ = β Σ_n u_n p_n + γ = Σ_n (βu_n + γ)p_n

since Σ_n p_n = 1. What the above manipulation shows is that a positive affine transformation of an expected utility function simply changes the units of the utility of the certain outcomes, like going from Fahrenheit to Celsius. However, this is a cardinal property, not an ordinal one.
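A small numerical sketch of this point: a positive affine transformation of the u_n shifts U(L) by the same affine map, precisely because the p_n sum to one (the numbers here are arbitrary):

```python
# A lottery over three outcomes with Bernoulli utilities u_n.
p = [0.2, 0.5, 0.3]
u = [1.0, 4.0, 9.0]

EU = lambda util: sum(pn * un for pn, un in zip(p, util))

beta, gamma = 2.0, 7.0
shifted = [beta * un + gamma for un in u]

# beta*U(L) + gamma equals the expected utility computed from the shifted u_n,
# because the p_n sum to one; the transformed function is still linear in p.
assert abs((beta * EU(u) + gamma) - EU(shifted)) < 1e-12
```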
Proposition 19.2.6 (6B2) Suppose that U : L → R is an expected utility function for the preference relation ≽ on L. Then Ũ : L → R is another expected utility function for ≽ iff

    Ũ(L) = βU(L) + γ

for some scalars β > 0 and γ, for all L ∈ L.

Proof Choose two lotteries L̄ and L̲ with L̄ ≽ L ≽ L̲ for all L ∈ L. If L̄ ∼ L̲, then all the lotteries are indifferent and the result is trivially true, so suppose that L̄ ≻ L̲.

Suppose that U(L) is an expected utility function and Ũ(L) = βU(L) + γ. Then

    Ũ(Σ_k α_k L_k) = βU(Σ_k α_k L_k) + γ
                   = β Σ_k α_k U(L_k) + γ
                   = Σ_k α_k [βU(L_k) + γ]
                   = Σ_k α_k Ũ(L_k)

Since the function Ũ is linear, by Proposition 19.2.5 it has the expected utility form.
Now, suppose that U and Ũ both have the expected utility form; we'll construct constants β and γ that have the desired properties. For any L ∈ L, define

    λ_L = (U(L) − U(L̲)) / (U(L̄) − U(L̲))

so that λ_L U(L̄) + (1 − λ_L)U(L̲) = U(L). Since U has the expected utility form, this implies U(λ_L L̄ + (1 − λ_L)L̲) = U(L), so that the agent is indifferent between L and λ_L L̄ + (1 − λ_L)L̲. Then

    Ũ(L) = Ũ(λ_L L̄ + (1 − λ_L)L̲)
         = λ_L Ũ(L̄) + (1 − λ_L)Ũ(L̲)
         = λ_L [Ũ(L̄) − Ũ(L̲)] + Ũ(L̲)

Substituting in for λ_L yields

    Ũ(L) = [(Ũ(L̄) − Ũ(L̲)) / (U(L̄) − U(L̲))] U(L) + Ũ(L̲) − [(Ũ(L̄) − Ũ(L̲)) / (U(L̄) − U(L̲))] U(L̲)

which is of the form Ũ(L) = βU(L) + γ with β = (Ũ(L̄) − Ũ(L̲))/(U(L̄) − U(L̲)) > 0 and γ = Ũ(L̲) − βU(L̲). ∎
People often express this by saying that the set of vNM expected utility functions representing ≽ is closed under positive affine transformations.

So the previous two propositions say that if agents have preferences that take the expected utility form, then we can represent them as

    U(L) = β Σ_n u_n p_n + γ

where u_n is the utility of the certain outcome c_n, and (β, γ) with β > 0 is a positive affine transformation.
But what restrictions on preferences guarantee that a utility function exists that has this kind of form?

Proposition 19.2.7 (Expected Utility Theorem, 6B3) Suppose that the rational preference relation ≽ on the space of lotteries L satisfies the continuity and independence axioms. Then ≽ can be represented by an expected utility function, so that for any two reduced lotteries L = (p₁, p₂, ..., p_N) and L′ = (p′₁, p′₂, ..., p′_N),

    L ≽ L′  ⟺  Σ_n p_n u_n ≥ Σ_n p′_n u_n
The proof is long and includes many steps, so I'll omit parts that are straightforward to prove (see p. 176 in MWG for the whole proof). It is very similar to the regular proof that a utility function exists, with the only difference being that the index of preference constructed in the real numbers occurs in step three, where it is shown that for any lottery L with L̄ ≽ L ≽ L̲, there exists a weight α_L such that α_L L̄ + (1 − α_L)L̲ ∼ L. The α_L's become the index of preference of the intermediate lotteries.
Proof Suppose that there are best and worst lotteries in L, L̄ and L̲ respectively, so that L̄ ≽ L ≽ L̲ for all L ∈ L.

• First note that if L ≻ L′, then L ≻ αL + (1 − α)L′ ≻ L′ for any α ∈ (0, 1). This follows from the independence axiom.

• Let α, β ∈ [0, 1]. Then βL̄ + (1 − β)L̲ ≻ αL̄ + (1 − α)L̲ iff β > α. (This follows from the previous step, which follows from the independence axiom.)

• For any L ∈ L, there is a unique α_L ∈ [0, 1] such that α_L L̄ + (1 − α_L)L̲ ∼ L. Existence follows from the continuity axiom, and uniqueness comes from the previous step.

• Consider the function U : L → R that assigns U(L) = α_L; we'll show that it represents ≽. The previous step implies that for any two lotteries L, L′ ∈ L, we have L ∼ α_L L̄ + (1 − α_L)L̲ and L′ ∼ α_{L′} L̄ + (1 − α_{L′})L̲. By step 2 (hence the independence axiom) we have it that L ≽ L′ iff α_L ≥ α_{L′}.
• Finally, we show that the function U : L → R assigning U(L) = α_L is linear (so it has the expected utility form). Take

    L ∼ U(L)L̄ + (1 − U(L))L̲   and   L′ ∼ U(L′)L̄ + (1 − U(L′))L̲

Then the independence axiom (used twice) implies

    βL + (1 − β)L′ ∼ β[U(L)L̄ + (1 − U(L))L̲] + (1 − β)[U(L′)L̄ + (1 − U(L′))L̲]
                   = [βU(L) + (1 − β)U(L′)]L̄ + [1 − βU(L) − (1 − β)U(L′)]L̲

Consequently we can interpret βU(L) + (1 − β)U(L′) as the probability of L̄ and 1 − βU(L) − (1 − β)U(L′) as the probability of L̲ when facing the compound lottery βL + (1 − β)L′. This implies

    U(βL + (1 − β)L′) = βU(L) + (1 − β)U(L′)

So for any two lotteries L, L′, the utility function has the expected utility form. ∎
So the first two steps of the proof establish a relationship between preferences over compound lotteries αL̄ + (1 − α)L̲: namely, that lotteries with a higher α are better. Then in step 3, we show that any lottery can be uniquely assigned an α_L. This sets up a way of comparing lotteries L, L′, by comparing α_L with α_{L′}, that is consistent with the underlying preferences ≽. Steps four and five then show the results we want: that independence and continuity imply linearity and the existence of a linear utility function representing ≽.

Note that the continuity axiom is only used once, to establish a unique correspondence between lotteries L and index values α_L. On the other hand, the independence axiom is used many times. So while the continuity axiom is mostly a technical axiom used in a technical way, the independence axiom has a strong behavioral interpretation and is used repeatedly to break up lotteries so that they can be reduced into simpler ones. Without independence, it's clear that many parts of the above proof would no longer work.
Now, what do you need to internalize from this discussion?

• The theory of choice under uncertainty depends on the independence and continuity axioms, which are mathematically convenient but not rock solid behaviorally. However, these allow you to write an agent's expected payoff as a vNM expected utility function.

• The phrase (von Neumann-Morgenstern) expected utility refers to expected utility functions of the form

    Σ_n p_n u_n   or   ∫ p(n)u(n) dn

where u_n or u(n) is the utility of getting outcome n with certainty.

• The set of vNM expected utility functions is closed under positive affine transformations: βU(L) + γ = β Σ_n p_n u_n + γ = Σ_n p_n (βu_n + γ), β > 0. The utility function is linear in the probability weights, p_n.
19.3 Risk Aversion
The following kinds of situations are very common in economics:

• There is a lottery at the fair, where for an entry fee e you can drop a business card into a bucket, and at the end of the night the lottery-owner draws a business card, and that player wins half of the proceeds. If N people participate, playing gives expected utility

    E[U] = (1/N) u((1/2)Ne − e) + ((N − 1)/N) u(−e)

and not playing gives expected utility

    E[U] = u(0)

• A risk-averse agent with utility function over money u(·) is deciding whether or not to buy homeowner's insurance. He has wealth w, a policy costs t and pays out r in the case of an accident, the loss from an accident is L, and the probability of an accident is p. With the policy,

    E[U] = p u(w − L − t + r) + (1 − p) u(w − t)

while his expected utility without the policy is

    E[U] = p u(w − L) + (1 − p) u(w)

• A risk-neutral agent is competing in a first-price auction for a good for which he has value v; namely, he submits a sealed bid b, and if his bid is the highest of all those submitted, he wins the good and pays his bid b. Then there are two outcomes: winning, which yields a payoff of v − b, and losing, which yields a payoff of zero. Suppose that the agent wins the auction with probability ρ(b) if he bids b. Then

    U(v, b) = ρ(b)(v − b) + (1 − ρ(b)) · 0 = ρ(b)(v − b)
But now we have the right tools to analyze these formally. Namely, we consider outcomes as realizations of a random variable X : Ω → R, where (Ω, E, p) is a probability space, and the agent gets certain utility u(x(ω)) from each of the outcomes ω ∈ Ω. The function u(·) is called the agent's Bernoulli utility function.

Then each lottery L is a probability distribution over a set Ω, and the agent's expected utility is

    U(L) = E[u(X)] = ∫_Ω u(x(ω)) dP(ω)

In the discrete case, this becomes

    U(L) = E[u(X)] = Σ_{ω∈Ω} u(x(ω))p(ω)

and in the case where Ω is a one-dimensional subset of R,

    U(L) = E[u(X)] = ∫ u(x)f(x) dx

And because of the work in the previous section, this is all well-defined in terms of there being a rational, continuous preference relation ≽ satisfying the independence axiom that gives rise to U(L).
Definition 19.3.1 (6C1) A decision maker is risk averse if for any lottery F(·),

    E[u(X)] < u(E[X])

A decision maker is risk neutral if for any lottery F(·),

    E[u(X)] = u(E[X])

A decision maker is risk inclined or risk seeking if for any lottery F(·),

    E[u(X)] > u(E[X])

It follows immediately from Jensen's inequality that if u(·) is concave, the agent is risk averse; if u(·) is linear, the agent is risk neutral; and if u(·) is convex, the agent is risk inclined.
Example Let X be distributed F(x) = x^θ on [0, 1], θ > 0, and suppose the agent has preferences u(x) = x^γ (so that γ > 1 implies convexity and γ < 1 implies concavity). Then his expected utility is

    U(L) = ∫₀¹ x^γ dF(x) = ∫₀¹ x^γ θx^{θ−1} dx = θ ∫₀¹ x^{γ+θ−1} dx = θ/(γ + θ)

and the expectation of X is

    E[X] = ∫₀¹ x dF(x) = θ ∫₀¹ x^θ dx = θ/(1 + θ)

So the agent is risk averse iff

    U(L) < u(E[X])  ⟺  θ/(γ + θ) < (θ/(1 + θ))^γ

which, consistent with Jensen's inequality, holds exactly when γ < 1. If γ = 1, he is risk neutral, and if γ > 1, he is risk inclined.
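The example can be checked by simulation, drawing X ~ F by the inverse-CDF method (the θ and γ values below are arbitrary):

```python
import random

random.seed(2)
theta, gamma = 2.0, 0.5        # F(x) = x^theta on [0,1]; u(x) = x^gamma, concave

# Draw X ~ F by inversion: X = U^(1/theta) for U ~ U[0,1].
draws = [random.random() ** (1 / theta) for _ in range(200_000)]

eu = sum(x ** gamma for x in draws) / len(draws)   # E[u(X)] = theta/(gamma+theta)
ex = sum(draws) / len(draws)                       # E[X]    = theta/(1+theta)

assert abs(eu - theta / (gamma + theta)) < 0.01
assert abs(ex - theta / (1 + theta)) < 0.01
assert eu < ex ** gamma    # risk aversion: E[u(X)] < u(E[X]) since gamma < 1
```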
Example Let X be a random variable with mean m and variance σ². If an agent has quadratic preferences

    u(x) = x − (γ/2)x²

then his expected utility is (using E[X²] = σ² + m², from the footnote a few pages back)

    U(L) = E[X − (γ/2)X²] = m − (γ/2)(σ² + m²)

Then the agent is risk averse iff

    U(L) < u(E[X])  ⟺  m − (γ/2)(σ² + m²) < m − (γ/2)m²  ⟺  −(γ/2)σ² < 0

which is true as long as γ > 0. If γ = 0, he is risk neutral, and if γ < 0, he is risk inclined.
Example Suppose we have an agent with wealth w who faces a stochastic loss D with probability p, and no loss with probability 1 − p. The agent can buy insurance, such that for each unit he must pay q but receives a payout of 1 if the loss occurs, so that if he buys α units his ex post wealth in the event of a loss is w − αq − D + α, and if no loss occurs, w − αq. You might expect him to under- or over-insure, because of the risk aversion. He maximizes

    max_α  p u(w − αq − D + α) + (1 − p) u(w − αq)
Whenever α* > 0, the FOC holds:

    0 = p(1 − q) u′(w − α*q − D + α*) − (1 − p)q u′(w − α*q)

This equates the marginal utility of consumption across the two states. Suppose that q = p, so that the price of insurance is equal to the expected cost; you can imagine perfectly competitive insurance companies undercutting each other until they get to the point where the price is the expected marginal cost. Then

    u′(w − α*p − D + α*) = u′(w − α*p)

Since u″(·) < 0, the function u′(·) is strictly monotone, and we can invert it to get

    w − α*p − D + α* = w − α*p

or

    α* = D

So the optimal amount of insurance is exactly equal to the loss, and the agent fully insures.
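Here is a rough numerical check of the full-insurance result, using log utility and a grid search over the coverage level α (the parameter values are invented for illustration):

```python
import math

# Full insurance under actuarially fair pricing (q = p), with log utility.
w, D, p = 10.0, 4.0, 0.2
q = p                                  # fair price per unit of coverage

def expected_u(alpha):
    return p * math.log(w - alpha * q - D + alpha) + (1 - p) * math.log(w - alpha * q)

# Grid-search the optimal coverage alpha on [0, D + 1].
grid = [i * 0.001 for i in range(int((D + 1) / 0.001) + 1)]
alpha_star = max(grid, key=expected_u)

assert abs(alpha_star - D) < 0.01      # the agent fully insures: alpha* = D
```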
Example There is an agent with an increasing, concave Bernoulli utility function u(x) and initial wealth w₀. There is a safe asset that yields a return of 1 with certainty, and a risky asset that yields a return H > q with probability p and L < q with probability 1 − p. The price of the risky asset is q, the amount of the risky asset purchased is x, and the amount of the safe asset purchased is y, giving the agent a budget constraint w₀ = qx + y. Assume both x and y must be weakly greater than zero. The agent's expected payoff is

    p u(Hx + y) + (1 − p) u(Lx + y)

Then the agent solves the expected utility maximization problem

    max_{x,y}  p u(Hx + y) + (1 − p) u(Lx + y)

subject to w₀ = qx + y, x ≥ 0, y ≥ 0, or, substituting the budget constraint and putting a multiplier λ on the constraint x ≥ 0,

    max_x  p u((H − q)x + w₀) + (1 − p) u((L − q)x + w₀) + λx

with FONC

    p u′((H − q)x + w₀)(H − q) + (1 − p) u′((L − q)x + w₀)(L − q) + λ = 0

and complementary slackness condition

    λx = 0

So the agent holds no risky asset (x = 0) if λ ≥ 0, or

    p u′(w₀)(H − q) + (1 − p) u′(w₀)(L − q) ≤ 0

or

    pH + (1 − p)L ≤ q

So if the expected return is less than the price q, the agent holds none of the risky asset. The agent only holds a strictly positive amount if the expected return, pH + (1 − p)L, is strictly greater than the price, q.
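A sketch of the corner solution: with parameters chosen so that pH + (1 − p)L < q, a grid search over feasible holdings picks x = 0 (log utility and the particular numbers are arbitrary choices):

```python
import math

# Risky asset with price q; the agent holds none of it when pH + (1-p)L <= q.
p, H, L, q, w0 = 0.5, 1.2, 0.7, 1.0, 5.0        # expected return 0.95 < q
assert p * H + (1 - p) * L <= q

def expected_u(x):
    return p * math.log((H - q) * x + w0) + (1 - p) * math.log((L - q) * x + w0)

grid = [i * 0.01 for i in range(0, 1001)]       # candidate holdings
grid = [x for x in grid if x <= w0 / q]         # budget allows x <= w0/q
x_star = max(grid, key=expected_u)

assert x_star == 0.0                            # corner solution: no risky asset
```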
The certain value of x at which the agent is indifferent between facing the lottery and taking the certain value is called the certainty equivalent. You can imagine an agent sitting in front of an insurer, and the insurer slowly increasing the price of a policy until the agent is indifferent between facing the risk or having the insurer bear the risk instead. Similarly, we can ask, "For a small gamble over {x + ε, x − ε}, what is the deviation from a fair lottery over the two outcomes required for the decision-maker to be indifferent between the gamble and getting x for sure?" This is called the probability premium. You see something like this in sports all the time, when there is an uninteresting game. For example, if one team is almost sure to win, people might bet on the spread, giving a higher likelihood of winning to the people who bet on the team that is likely to lose. This compensates them for bearing the risk by increasing the probability that they win by accepting the bet.

Definition 19.3.2 The certainty equivalent of F(·), c(F, u), is defined by

    u(c(F, u)) = ∫ u(x) dF(x)

The probability premium π(x, ε, u) for a fixed value x and ε > 0 is given by

    u(x) = u(x + ε)[.5 + π(x, ε, u)] + u(x − ε)[.5 − π(x, ε, u)]
Example Suppose an agent has Bernoulli utility function u(x) = √x, and faces a lottery where the outcome is 0 with probability 1 − p and y² with probability p. Then the agent's expected payoff is

    E[u(X)] = (1 − p)√0 + p√(y²) = py

Now, what is the amount of money which makes the agent indifferent between facing the gamble and not?

    u(c(F, u)) = √(c(F, u)) = py

so

    c(F, u) = p²y²

This is the certainty equivalent.
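The certainty equivalent in this example can be computed directly (the values of p and y below are arbitrary):

```python
# Lottery: 0 with probability 1-p, y^2 with probability p; u(x) = sqrt(x).
p, y = 0.4, 3.0

eu = (1 - p) * 0.0 + p * y      # E[u(X)] = p*y, since sqrt(y^2) = y
ce = eu ** 2                    # solve sqrt(c) = p*y  =>  c = p^2 y^2

assert abs(ce - p**2 * y**2) < 1e-9
assert ce < p * y**2            # CE below the mean: the agent is risk averse
```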
We can express risk aversion in four ways, then:

Proposition 19.3.3 (6C1) Suppose a decision maker is an expected utility maximizer with increasing Bernoulli utility function u(·). Then the following are equivalent:

• The decision maker is risk averse
• u(·) is concave
• c(F, u) < ∫ x dF(x) for all F(·)
• π(x, ε, u) ≥ 0 for all x, ε
Proof (i and ii) Jensen's inequality. (ii and iii) If u is increasing, u(c(F, u)) = ∫ u(x) dF(x) < u(∫ x dF(x)), which is equivalent to Jensen's inequality, and applying the increasing inverse of u gives c(F, u) < ∫ x dF(x). (i and iv) Expand around x:

    u(x + ε) = u(x) + u′(x)ε + u″(c₁)ε²/2
    u(x − ε) = u(x) − u′(x)ε + u″(c₂)ε²/2

Multiply by p and 1 − p respectively and add to get

    p u(x + ε) + (1 − p) u(x − ε) = u(x) + (2p − 1)u′(x)ε + u″(c)ε²/2

where u″(c)ε²/2 collects the weighted second-order terms. If u″(c) < 0, then evaluating at p = .5 yields

    (1/2)u(x + ε) + (1/2)u(x − ε) = u(x) + u″(c)ε²/2
so that

    (1/2)u(x + ε) + (1/2)u(x − ε) < u(x)

and to compensate the agent for the risk, we must increase the likelihood of the outcome x + ε: namely, from .5 to the value .5 + π that solves

    [.5 + π]u(x + ε) + [.5 − π]u(x − ε) = u(x)

with π ≥ 0. ∎
We often need to compare risk aversion within or across agents. For example, agents with large quantities of wealth are probably less averse to a small bet than agents who have very little wealth. Agents who accept one kind of bet may refuse another, similar one. We want to develop a framework in which to think about these questions.

As is shown above, saying that an agent is risk averse is equivalent to claiming that they have a concave utility function. So to say an agent is more risk averse, what we want to say is that they have a more concave utility function. But concavity refers to the curvature of a function, so it is a difficult property to measure. We could start with u″(x), but if we took the same agent and applied a positive affine transformation to get βu(x) + γ, their risk aversion would then equal βu″(x), so the measure wouldn't be invariant to affine transformations. So we need something else.
Definition 19.3.4 If u(·) is twice-differentiable, then the Arrow-Pratt measure of absolute risk aversion at x is defined as

    r_A(x) = −u″(x)/u′(x)

If u(·) is twice-differentiable, then the Arrow-Pratt measure of relative risk aversion at x is defined as

    r_R(x) = −x u″(x)/u′(x)
So this measure normalizes the second derivative by the first, and multiplies by −1 to make it a positive number for concave u. For a linear function (risk neutrality), the measure is zero. If we wanted constant absolute risk aversion, we are looking for a function satisfying

    −u″(x)/u′(x) = r

for all x. This is a differential equation

    u″(x) + r u′(x) = 0

with solution

    u(x) = Ce^{−rx}

We can check this easily, since r²Ce^{−rx} − r²Ce^{−rx} = 0. However, we want u(x) to be increasing, or

    u′(x) = −rCe^{−rx} > 0

So C < 0 if this is a utility function. In particular, the choice C = −1/r is often made so that the first derivative is just e^{−rx}. Then the CARA utility function is

    u(x) = −(1/r)e^{−rx}
or (when you want utility to be a positive number for large x)

    u(x) = 1 − (1/r)e^{−rx}

There are a few others whose solutions are generally found using the above pattern of solving a differential equation. For example, the constant relative risk aversion (CRRA) utility function, a special case of hyperbolic absolute risk aversion (HARA), is given by

    u(c) = c^{1−r}/(1 − r)
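A finite-difference sketch confirms that the CARA form has constant absolute risk aversion and the CRRA form has constant relative risk aversion (the step size and test points are arbitrary):

```python
import math

# A sketch, not a proof: evaluate -u''/u' numerically by central differences.
r, h = 1.5, 1e-5

def cara(x):
    return -(1.0 / r) * math.exp(-r * x)      # CARA: u(x) = -(1/r) e^{-rx}

def crra(x):
    return x ** (1 - r) / (1 - r)             # CRRA: u(x) = x^{1-r}/(1-r)

def abs_risk_aversion(u, x):
    up = (u(x + h) - u(x - h)) / (2 * h)              # u'(x)
    upp = (u(x + h) - 2 * u(x) + u(x - h)) / h ** 2   # u''(x)
    return -upp / up

for x in [0.5, 1.0, 2.0, 4.0]:
    assert abs(abs_risk_aversion(cara, x) - r) < 1e-3       # r_A(x) = r everywhere
    assert abs(x * abs_risk_aversion(crra, x) - r) < 1e-3   # r_R(x) = r everywhere
```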
Then we can start comparing individuals:

Proposition 19.3.5 (6C2) The utility function u₁(·) represents the preferences of a decision maker who is more risk averse than a decision maker represented by u₂(·) iff any of the following equivalent conditions hold:

• r_A(u₁, x) ≥ r_A(u₂, x) for all x
• There exists an increasing, concave function ψ(·) so that u₁(x) = ψ(u₂(x))
• c(F, u₁) ≤ c(F, u₂) for any F(·)
• π(x, ε, u₁) ≥ π(x, ε, u₂) for any x and ε
The absolute and relative measures of risk aversion are local ones, in that we are defining risk aversion at a particular x. We might make more demands. Particularly, we might want risk aversion to decrease in x, so that at higher levels of wealth, the agent is less risk averse.

Definition 19.3.6 The Bernoulli utility function u(·) exhibits decreasing absolute risk aversion if r_A(x) is decreasing in x.
This implies that

    r′_A(x) = −[u‴(x)u′(x) − u″(x)²]/u′(x)² < 0

requiring that u‴(x) > 0. This in turn implies that u″(x) is an increasing (but negative) function. So as x becomes large, the function is becoming less concave. Some macroeconomists refer to considerations about the third derivative as prudence.
Proposition 19.3.7 (6C3) The following are equivalent:

• The Bernoulli utility function u(·) exhibits decreasing absolute risk aversion
• If x₂ < x₁, then u₂(z) = u(x₂ + z) is a concave transformation of u₁(z) = u(x₁ + z)
• For any lottery F(·), define the certainty equivalent c_x by u(c_x) = ∫ u(x + z) dF(z). Then x − c_x is decreasing in x.
• The probability premium π(x, ε, u) is decreasing in x
• For any F(·), if ∫ u(x₂ + z) dF(z) ≥ u(x₂) and x₂ < x₁, then ∫ u(x₁ + z) dF(z) ≥ u(x₁)
Proof To prove the second point, note that DARA implies that

    r(x + z) = −u″(x + z)/u′(x + z)

is a decreasing function of x + z, so that r(x₂ + z) > r(x₁ + z) when x₂ < x₁, and from Proposition 19.3.5 it follows that u₂ is a concave transformation of u₁.

Since that is true, the other points of this proposition are just restatements of Proposition 19.3.5 as comparisons across wealth levels. ∎
19.4 Stochastic Orders and Comparative Statics
Often in economics, we want to provide comparative statics predictions with respect to changes in distributions. A first step might be to classify a family of distributions that depend on a parameter θ, such as f(x, θ), and then provide comparative statics predictions with respect to θ. For example, consider the expected utility

    ∫ u(x) f(x, θ) dx

If we differentiate with respect to θ, we get

    ∫ u(x) f_θ(x, θ) dx

But signing this term is difficult. We can't just assume that f_θ(x, θ) > 0, because

    ∫ f(x, θ) dx = 1

and differentiating with respect to θ yields

    ∫ f_θ(x, θ) dx = 0

meaning that if we change θ and perturb f(·), the sum of all the perturbations must cancel out. For instance, we can put more weight on the good outcomes, but only if we put less weight on the bad outcomes. So any of our variational changes to the distribution must cancel out across all states. If we go back to the expected utility ∫_a^b u(x) dF(x, θ), integration by parts gives

    ∫_a^b u(x) dF(x, θ) = −∫_a^b u(x) d(1 − F(x, θ)) = u(a) + ∫_a^b u′(x)(1 − F(x, θ)) dx
Then a change from F₂ to F₁ for a decision maker with payoff function u(x) is worth

    ∫_a^b u(x) dF₁(x) − ∫_a^b u(x) dF₂(x) = ∫_a^b u′(x)(F₂(x) − F₁(x)) dx

But now we need a way to compare distributions. If u′(x) > 0, then the integrand above is the product of u′(x), which is positive, and F₂(x) − F₁(x), whose sign is unknown. If F₁(x) ≤ F₂(x) for all x, then we can conclude that the integrand is non-negative for all x, and the change is positive.
Definition 19.4.1 A distribution F first-order stochastically dominates G if, for all x, F(x) ≤ G(x).

In other words, F first-order stochastically dominates G if it places less weight on the low outcomes, which necessarily means it places more weight on the high outcomes. The simplest example is a lottery over two dollar amounts a and b, b > a, where the agent receives b with probability p and a with probability 1 − p. The cumulative distribution is

    F(x, p) = { 0,      x < a
              { 1 − p,  a ≤ x < b
              { 1,      x ≥ b

So if we increase p to p′ > p, we get a distribution where more weight is placed on getting the prize b, and F(x, p′) ≤ F(x, p) for all x.
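The two-point example can be sketched in a few lines: raising p lowers the CDF pointwise, and any increasing u then prefers the dominating lottery:

```python
# Two-point lottery: b with probability p, a with probability 1-p.
a, b = 1.0, 4.0

def F(x, p):
    if x < a:
        return 0.0
    if x < b:
        return 1.0 - p
    return 1.0

p_lo, p_hi = 0.3, 0.6
xs = [0.0, 1.0, 2.5, 4.0, 5.0]

# Raising p lowers the CDF pointwise: F(x, p_hi) <= F(x, p_lo) for all x.
assert all(F(x, p_hi) <= F(x, p_lo) for x in xs)

# Any increasing u prefers the dominating lottery.
u = lambda x: x ** 0.5
eu = lambda p: p * u(b) + (1 - p) * u(a)
assert eu(p_hi) >= eu(p_lo)
```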
This is one example of a stochastic order. It is a partial order on the space of probability distributions, but not a total order, since there are pairs of distributions with F(x) > G(x) at some x but F(y) < G(y) at some other y. Here are the other main examples:

• A distribution F hazard rate dominates G if, for all x,

    f(x)/(1 − F(x)) ≤ g(x)/(1 − G(x))

• A distribution F likelihood ratio dominates G if, for all x < y,

    f(x)/g(x) ≤ f(y)/g(y)
Depending on what you're doing, one or another of these orders might be easier to use. The intuition of hazard rate dominance comes from the following expression: if Y is a random variable, then

    f(x)dx/(1 − F(x)) ≈ pr[Y = x | Y ≥ x]

so the hazard rate is the probability that Y takes the value x, given that Y ≥ x. Said another way, the hazard rate is the probability that a lightbulb bursts at this exact moment given that it hasn't burst up to now. Likelihood ratio dominance implies the probability density functions f and g cross only once on the interior of their support, and when this happens f intersects g from below. Consequently, f places less weight on low outcomes than g.
Proposition 19.4.2 Likelihood ratio dominance implies hazard rate dominance, and hazard rate dominance implies first-order stochastic dominance.

MWG only use first-order stochastic dominance, but when comparing distributions it is helpful to know that these other orders exist. For example, if you are estimating how likely it is that a worker is fired at given dates after being hired, you might think of the hazard rate. If a better education lowers the likelihood of being fired at all dates, you have hazard rate dominance.
Proposition 19.4.3 (6D1, 6D2) The distribution F first-order stochastically dominates G if and only if, for every non-decreasing function u : R → R,

    ∫ u(x) dF(x) ≥ ∫ u(x) dG(x)

In particular, any decision maker with an increasing utility function will prefer F to G.
These ideas provide a way of discussing shifting probability from less to more favorable outcomes. But another kind of change in uncertainty is when weight isn't shifted upward monotonically, but instead pushed from the center towards the tails. For example, when the variance of a normal distribution is increased, the bell becomes fatter and the probability at the mean falls, but the mean value itself isn't changed. This motivates a second kind of order.
Definition 19.4.4 For any two distributions F and G with the same mean, F second-order stochastically dominates G if for every non-decreasing concave function u : R → R,

    ∫ u(x) dF(x) ≥ ∫ u(x) dG(x)

In particular, any risk-averse decision maker with an increasing utility function will prefer F to G.
This definition implies that what's really going on is that probability is being shifted away from
the mean. Another way of imagining this is that we draw a value x distributed according to F,
but then add a noise term that perturbs it further, but has an expectation of zero. For example,
we draw a value of x from a normal distribution with mean m and variance \sigma_x^2, but then add a
normally distributed shock z with mean zero and variance \sigma_z^2, getting a variable y = x + z. This
new variable has a distribution
\[ f(y) = \frac{1}{\sqrt{2\pi(\sigma_x^2 + \sigma_z^2)}} \exp\left( -\frac{1}{2} \frac{(y - m)^2}{\sigma_x^2 + \sigma_z^2} \right) \]
This is just another normally distributed random variable, but with mean m and variance \sigma_x^2 + \sigma_z^2.
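As a sanity check on this convolution result, one can simulate y = x + z and compare the sample mean and variance against m and \sigma_x^2 + \sigma_z^2; the parameter values below are arbitrary choices for illustration:

```python
import random
import statistics

# Illustrative (made-up) parameters: x ~ N(m, sd_x^2), z ~ N(0, sd_z^2).
m, sd_x, sd_z = 1.0, 2.0, 1.5
random.seed(0)                    # fixed seed for reproducibility
n = 200_000
y = [random.gauss(m, sd_x) + random.gauss(0.0, sd_z) for _ in range(n)]

mean_y = statistics.fmean(y)
var_y = statistics.pvariance(y, mean_y)
# y should be approximately N(m, sd_x^2 + sd_z^2) = N(1.0, 6.25)
assert abs(mean_y - m) < 0.05
assert abs(var_y - (sd_x ** 2 + sd_z ** 2)) < 0.1
print("sample mean", round(mean_y, 2), "sample variance", round(var_y, 2))
```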
A general definition of this kind of construction is
Definition 19.4.5 Let X be a random variable with distribution F, and let Z be a random variable
whose distribution conditional on x, H(z|X = x), satisfies \int z\,h(z|X = x)\,dz = 0. Then the
distribution G of the random variable Y = X + Z is a mean-preserving spread of F.
Using the above definition,
\[ \int u(x)\,g(x)\,dx = \int\!\!\int u(x+z)f(x)h(z|x)\,dz\,dx = \int\left[\int u(x+z)h(z|x)\,dz\right]f(x)\,dx \le \int u\!\left(\int (x+z)h(z|x)\,dz\right)f(x)\,dx \]
where the last inequality is Jensen's inequality applied to the concave function u. But then since
E[z|x] = 0, the term \int (x+z)h(z|x)\,dz = x, and
\[ \int u(x)\,g(x)\,dx \le \int u(x)\,f(x)\,dx \]
which is to say that if G is a mean-preserving spread of F, then F second-order stochastically
dominates G.
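The argument can be illustrated with a simple discrete mean-preserving spread (this example is my own, not from the notes): let X be uniform on {1, 3} and let Z be an independent ±1 coin flip, so E[Z|X] = 0 and Y = X + Z spreads weight into the tails. Every concave u then weakly prefers X:

```python
import math
import itertools

# X uniform on {1, 3}; Z an independent +/-1 coin flip with E[Z|X] = 0,
# so Y = X + Z is a mean-preserving spread of X, supported on {0, 2, 2, 4}.
xs = [1.0, 3.0]
zs = [-1.0, 1.0]
support_y = [(x + z, 0.25) for x, z in itertools.product(xs, zs)]

mean_x = sum(0.5 * x for x in xs)
mean_y = sum(p * y for y, p in support_y)
assert mean_x == mean_y == 2.0          # the spread preserves the mean

# Every concave u prefers X (the original) to Y (the spread).
for u in (math.sqrt, lambda t: math.log(1.0 + t), lambda t: -t * t):
    eu_x = sum(0.5 * u(x) for x in xs)
    eu_y = sum(p * u(y) for y, p in support_y)
    assert eu_x >= eu_y
print("E[u(X)] >= E[u(Y)] for every concave u tested")
```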
Lastly, consider the difference in means (using integration by parts):
\[ \int_a^b x\,dF(x) - \int_a^b x\,dG(x) = \Big[ xF(x) - xG(x) \Big]_a^b - \int_a^b F(x)\,dx + \int_a^b G(x)\,dx \]
But F(b) = G(b) = 1 and F(a) = G(a) = 0, so
\[ \int_a^b x\,dF(x) - \int_a^b x\,dG(x) = -\int_a^b F(x)\,dx + \int_a^b G(x)\,dx \]
and since G is a mean-preserving spread of F, they have the same mean, so the left-hand side is
zero, leaving
\[ 0 = \int_a^b \big( G(x) - F(x) \big)\,dx \]
But G is obtained from F by shifting weight into the tails, so for each x, we've already integrated
over more probability than in the distribution F, implying that
\[ \int_a^x G(s)\,ds \ge \int_a^x F(s)\,ds \]
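As a numerical illustration of this integral condition (a made-up example, not from the notes), take F to be the CDF of a distribution uniform on {1, 3} and G the CDF of its mean-preserving spread, uniform on {0, 2, 2, 4}; the integrated CDFs can be compared on a grid:

```python
# F: CDF of X uniform on {1, 3}; G: CDF of its mean-preserving spread
# Y uniform on {0, 2, 2, 4}. Both distributions have mean 2.
def cdf_f(t):
    return 0.0 if t < 1 else 0.5 if t < 3 else 1.0

def cdf_g(t):
    return 0.0 if t < 0 else 0.25 if t < 2 else 0.75 if t < 4 else 1.0

a, step, n = -1.0, 0.01, 600      # grid over [-1, 5)
int_f = int_g = 0.0
for k in range(n):
    t = a + step * k
    int_f += cdf_f(t) * step      # left Riemann sum: integral of F up to t
    int_g += cdf_g(t) * step
    assert int_g >= int_f - 1e-6  # integral of G is always weakly larger
print("integral of G dominates the integral of F at every grid point")
```

Note that the two integrals coincide again at the top of the support, which is exactly the equal-means condition derived above.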
So this gives three ways of characterizing second-order stochastic dominance:
Proposition 19.4.6 (6D2) For two distributions F and G with the same mean, the following are
equivalent:
• F second-order stochastically dominates G
• G is a mean-preserving spread of F
• \int_a^x G(s)\,ds \ge \int_a^x F(s)\,ds for all x
19.5 Exercises
MWG: 6B1, 6B2, 6C6, 6C18, 6C15, 6C18, 6C20