
PROBABILISTIC THINKING
PREFACE

Probabilistic thinking was a mid-17th century artifact originating in a famous correspondence between Fermat and Pascal -- a correspondence on which Huygens based a widely read textbook: On Calculating in Games of Luck (1657). The probabilistic framework didn't exist until those people cobbled it together. It remains in use today, much as in Huygens's book.

Within the framework, we make up our minds by adopting probability functions -- or, anyway, features of such functions. This is not a matter of eliciting what is already in mind; rather, it is a matter of artifice, a family of arts of probabilistic judgment. And that includes making up our minds about how to change our minds.
With the founders (so I think), and certainly with Ramsey and de Finetti, who
revived the view of probability as a mode of judgment, I see the probability
calculus as a logic. Mere logic does not tell us how to make up our minds. It does
help us spot inconsistencies within projected mental makeups, but fails to
underwrite the further gradations we make in classifying survivors of the first cut
as reasonable or not, or in grading them as more or less reasonable. These finer
distinctions separate cases on the basis of standards rooted in our actual
appreciation of ongoing methodological experience.

The basic ideas, floated in chapters 1 and 3, are applied in chapters 2 and 4 to troubling questions about scientific method and practical decision-making. The question of normativity is addressed in chapter 5.

I'd be glad to have corrigenda and other suggestions.

Richard Jeffrey
dickjeff@princeton.edu

7 Dec 99

Please write to bayesway@princeton.edu with any comments or suggestions.


CHAPTER 1: PROBABILITY

Introduction
"Yes or no: was there once life on Mars?" I can't say. "What about intelligent
life?"' That seems most unlikely, but again, I can't really say.

The simple yes-or-no framework has no place for shadings of doubt; no room to
say that I see intelligent life on Mars as far less probable than life of a possibly
very simple sort. Nor does it let me express exact probability judgments, if I have
them. We can do better.

1.1 Bets and Probabilities

What if I were able to say exactly what odds I'd give on there having been life, or
intelligent life, on Mars? That would be a more nuanced form of judgment, and
perhaps a more useful one.

Suppose my odds were 1:9 for life, and 1:999 for intelligent life, corresponding to probabilities of 1/10 and 1/1000, respectively; odds m:n correspond to probability m/(m+n). That means I'd see no special advantage for either player in risking one dollar to gain nine in case there was once life on Mars; and it means I'd see an advantage on one side or the other if those odds were shortened or lengthened. And similarly for intelligent life on Mars when the risk is 1 thousandth of the same ten dollars (1 cent) and the gain is 999 thousandths ($9.99).
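
If you like to see such bookkeeping done mechanically, here is a minimal Python sketch of the odds-probability correspondence just described (the function names are ours, chosen only for illustration):

# Convert odds m:n into a probability m/(m+n), and compute the fair price
# of a ticket worth a given stake if the hypothesis is true.

def odds_to_probability(m, n):
    return m / (m + n)

def fair_price(probability, stake):
    return probability * stake

# Life on Mars: odds 1:9, ticket worth $10 if true
p_life = odds_to_probability(1, 9)                    # 0.1
print(p_life, fair_price(p_life, 10))                 # 0.1, a $1 price

# Intelligent life on Mars: odds 1:999, ticket worth $10 if true
p_intelligent = odds_to_probability(1, 999)           # 0.001
print(p_intelligent, fair_price(p_intelligent, 10))   # 0.001, a 1-cent price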

Here is another way of saying the same thing: I'd think a price of one dollar just
right for a ticket worth ten if there was life on Mars and nothing if there wasn't, but
I'd think a price of only one cent right if there has to have been intelligent life on
Mars for the ticket to be worth ten dollars.
So if I have an exact judgmental probability for truth of a hypothesis, it
corresponds to my idea of the right price for a ticket worth 1 unit or nothing
depending on whether the hypothesis is true or false. (For the life on Mars ticket
the unit was $10; the price was a tenth of that.)

Of course I have no exact judgmental probability for there having been life on
Mars, or intelligent life there. Still, I know that any probabilities anyone might
think acceptable for those two hypotheses ought to satisfy certain rules, e.g., that
the first can't be less than the second. That's because the second hypothesis implies
the first: see the implication rule in sec. 3 below. Another such rule, for 'not': the
probabilities that a hypothesis is and is not true must add to 1.

In sec. 2 we'll turn to the question of what the laws of judgmental probability are,
and why. Meanwhile, take some time with these questions, as a way of getting in
touch with some of your own ideas about probability. Afterward, read the
discussion that follows.

Questions

1 A vigorously flipped thumbtack will land on the sidewalk. Is it reasonable for you to have a probability for the hypothesis that it will land point up?

2 An ordinary coin is to be tossed twice in the usual way. What is your probability
for the head turning up both times--

(a) 1/3, because 2 heads is one of three possibilities: 2 heads, 1 head, 0 heads?

(b) 1/4, because 2 heads is one of four possibilities: HH, HT, TH, TT?

3 There are three coins in a bag: ordinary, two-headed, and two-tailed. One is
shaken out onto the table and lies head up. What should be your probability that it's
the two-headed one--
(a) 1/2, since it can only be two-headed or normal?

(b) 2/3, because the other side could be the tail of the normal coin, or either side of
the two-headed one?

4 "It's a goy!"

(a) As you know, about 49% of recorded human births have been girls. What's
your judgmental probability that the first child born in the 21st century will be a
girl?

(b) A goy is defined as a girl born before the beginning of the 21st century or a boy
born thereafter. As you know, about 49% of recorded human births have been
goys. What is your judgmental probability that the first child born in the 21st
century will be a goy?

Discussion

1 Surely it is reasonable to suspect that the geometry of the tack gives one of the
outcomes a better chance of happening than the other; but if you have no clue
about which of the two has the better chance, it may well be reasonable to have
judgmental probability 1/2 for each. Evidence about the chances might be given by
statistics on tosses of similar tacks, e.g., if you learned that in 20 tosses there were
6 "up"s you might take the chance of "up" to be in the neighborhood of 30%; and
whether or not you do that, you might well adopt 30% as your judgmental
probability for "up" on the next toss.

2, 3. These questions are meant to undermine the impression that judgmental probabilities can be based on analysis into cases in a way that doesn't already involve probabilistic judgment (e.g., the judgment that the cases are equiprobable).

In either problem you can arrive at a judgmental probability by trying the experiment (or a similar one) often enough, and seeing the statistics settle down close enough to 1/2 or to 1/3 to persuade you that more trials won't reverse the indications.

In each of these problems it's the finer of the two suggested analyses that makes more
sense; but any analysis can be refined in significantly different ways, and there's no
point at which the process of refinement has to stop. (Head or tail can be refined to
head-facing-north or head-not-facing-north or tail.) Indeed some of these analyses
seem more natural or relevant than others, but that reflects the relevance of
probability judgments that you bring with you to the analyses.

4. Goys and birls.

This question is meant to undermine the impression that judgmental probabilities can be based on frequencies in a way that doesn't already involve judgmental probabilities. Since all girls born so far have been goys, the current statistics for
girls apply to goys as well: these days, about 49% of human births are goys. Then
if you read probabilities off statistics in a straightforward way your probability will
be 49% for each hypothesis: (1) the first child born in the 21st century will be a
girl; and (2) the first child born in the 21st century will be a goy. Thus
P(1)+P(2)=98%. But it's clear that those probabilities should sum to 1, since (2) is
logically equivalent to (3) the first child born in the 21st century will be a boy, and
P(1)+P(3) = 100%. Contradiction.

What you must do is decide which statistics are relevant: the 49% of girls or the
51% of boys. That's not a matter of statistics but of judgment -- no less so because
we'd all make the same judgment, P(H) = 51%.

1.2 Why Probabilities are Additive

Authentic tickets of the Mars sort are hard to come by. Is the first of them really
worth $10 to me if there was life on Mars? Probably not. If the truth isn't known in
my lifetime, I can't cash the ticket even if it's really a winner. But some
probabilities are plausibly represented by prices, e.g., probabilities of the
hypotheses about athletic contests and lotteries that people commonly bet on. And
it is plausible to think that the general laws of probability ought to be the same for
all hypotheses - about planets no less than about ball games. If that's so, we can
justify laws of probability if we can prove all betting policies that violate them to
be inconsistent.

Such justifications are called "Dutch book arguments." (In racing jargon your book
is the set of bets you've accepted, and a book against you - a Dutch book - is one
on which you inevitably suffer a net loss.) We now give a Dutch book argument
for the requirement that probabilities be additive in this sense:

Finite Additivity. The probability of any hypothesis is the sum of the probabilities
of the cases in which it is true, provided there is only a finite number of cases,
incompatible and exhaustive.

Example 1. The probability p of the hypothesis

(H) A woman will be elected

is q+r+s if exactly three of the candidates are women, and their probabilities of
winning are q, r and s. In the following diagram, A, B, C, D,... are the hypotheses
that the various different candidates win; the first three are the women in the race.

Proof. For definiteness, we suppose that the hypothesis in question is true in three
cases as in the example. The argument differs inessentially for other examples,
with other finite numbers of cases. Now consider the following array of tickets.
Suppose I am willing to buy or sell any or all of these tickets at the stated prices.
Why should p be the sum q+r+s?

Because no matter what it's worth -- $1 or $0 -- the ticket on H is worth exactly as much as the tickets on A, B and C together. (If H loses it's because A, B and C all lose; if H wins it's because exactly one of A, B, C wins.) Then if the price of the H ticket is different from the sum of the prices of the other three, I am inconsistently placing different values on one and the same contract, depending on how it is presented.

If I am inconsistent in that way, I can be fleeced by anyone who'll ask me to sell the H ticket and buy the other three (in case p is less than q+r+s) or buy the H ticket and sell the other three (in case p is more). Thus, no matter whether the equation p = q+r+s fails because the left-hand side is less than the right or more, a book can be made against me.

That's the Dutch book argument for additivity when the number of ultimate cases
under consideration is finite. The talk about being fleeced is just a way of
dramatizing the inconsistency of any policy in which the dollar value of the ticket
on H is anything but the sum of the values of the other three tickets: to place a
different value on the three tickets on A, B, C from the value you place on the H
ticket is to place different values on the same commodity bundle under two
demonstrably equivalent descriptions.
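
Here, purely as an illustration, is a short Python sketch of that fleecing, with made-up prices in which p falls short of q+r+s; selling the H ticket and buying the other three then loses me the same amount no matter which candidate wins:

# Dutch book against non-additive prices in the three-candidate example.
q, r, s = 0.2, 0.3, 0.1        # my prices for $1 tickets on A, B, C
p = 0.5                        # my (incoherent) price for the ticket on H: should be 0.6

for winner in "ABCD":          # "D" stands in for any winner who is not a woman
    abc_payoff = sum(1.0 for t in "ABC" if t == winner)   # tickets on A, B, C together
    h_payoff = 1.0 if winner in "ABC" else 0.0            # ticket on H: "a woman wins"
    # I sell the H ticket at p and buy the A, B, C tickets at q, r, s.
    my_net = (p - h_payoff) + (abc_payoff - (q + r + s))
    print(winner, round(my_net, 2))                        # always p - (q+r+s) = -0.10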

When the number of cases is infinite, a Dutch book argument for additivity can
still be given -- provided the infinite number is not too big!

It turns out that not all infinite sets are the same size. The smallest infinite sets are
said to be "countable." A countable set is one whose members can be listed: first,
second, etc., with each member of the set appearing as the n'th item for some finite
n. Of course any finite set is countable in this sense, and some infinite sets are
countable. An obvious example of a countably infinite set is the set { 1, 2, 3, ... } of
the positive whole numbers. A less obvious example is the set { ... , -2, -1, 0, 1, 2,
... } of all the whole numbers; it can be rearranged in a list (with a beginning): 0, 1,
-1, 2, -2, 3, -3, ... . Then it is countable. Order doesn't matter, as long as they're all
in the list. But there are uncountably infinite sets, too (example 3).
Example 2. In the election example, suppose there were an endless list of
candidates, including no end of women. If H says that a woman wins, and A1, A2,
etc., identify the winner as the first, second, etc. woman, then an extension of the
finite additivity law to countably infinite sets would be as follows, with no end of
terms on the right.

P(H) = P(A1) + P(A2) + ...

Thus, if the probability of a woman's winning were 1/2, and the probabilities of
winning for the first, second, third, etc. woman were 1/4, 1/8, 1/16, etc. (decreasing
by half each time), the equation would be satisfied.
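
A quick numerical check of that example (nothing more than a sketch): the partial sums 1/4 + 1/8 + 1/16 + ... creep up toward 1/2.

# Countable additivity in the endless-candidates example:
# P(A1) + P(A2) + ... = 1/4 + 1/8 + 1/16 + ... should approach P(H) = 1/2.
total = 0.0
term = 0.25
for n in range(1, 51):      # 50 terms is plenty to see the convergence
    total += term
    term /= 2
print(total)                 # approaches 0.5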

Dutch book argument for additivity in the countably infinite case. Whatever my
probabilities P(An) may be, if they don't add up to P(H) there will be an infinite set
of $1 bets on truth of A1, A2, ... separately, on which my net gain will surely be the
same as my gain from betting $1 on truth of H. (Note that this infinity of bets can
be arranged by a finite contract: "In consideration of $1 paid in advance,
Bookmaker hereby undertakes to pay Bettor the amount $P(the true one) when the
true one has been identified.") This will be a Dutch book if the sum P(A1) + P(A2)
+ ... is greater or less than P(H)--against me if it's greater, against the bookmaker if
it's less.

Summarizing, the following additivity law holds for any countable set of
alternatives, finite or infinite.

Countable Additivity. If the possible cases are countable, the probability of a hypothesis is the sum of the probabilities of the cases in which it is true.

Example 3. Cantor's Diagonal Argument. The collection of all sets of positive whole numbers is not enumerable. For, given any list N1, N2, ... , there will be a "diagonal" set D consisting of the positive whole numbers n that do not belong to the corresponding sets Nn in the list. For example, suppose the first two entries in the list are N1 = the odd numbers = {1, 3, ...}, and N2 = the powers of 10 = {1, 10, ...}. Then it is false that D = N1, because 1 is in N1 but not in D; and it is false that D = N2, because 2 is in D but not in N2. (For it to be true that D = N2 it must be that each number is in both D and N2 or in neither.) In general, D cannot be anywhere in the list N1, N2, ... because by definition of D, each positive whole number n is in one but not the other of D and Nn.
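
The diagonal construction can be imitated on the finite initial segment of any particular list; here is an illustrative Python sketch using the two sets of the example (the third entry and the cutoff are arbitrary stand-ins):

# Diagonal set D = {n : n not in Nn}, illustrated for a finite prefix of a list.
# N1 = odd numbers, N2 = powers of 10; the third entry is an arbitrary stand-in.
N = [
    lambda n: n % 2 == 1,            # N1: odd numbers
    lambda n: n in (1, 10, 100),     # N2: powers of 10 (up to the cutoff)
    lambda n: n % 3 == 0,            # N3: multiples of 3 (stand-in)
]

cutoff = 20
D = [n for n in range(1, cutoff + 1)
     if n - 1 < len(N) and not N[n - 1](n)]   # n is in D iff n is not in Nn
print(D)   # [2]: 1 is in N1, 2 is not in N2, 3 is in N3

# By construction D differs from each Nn at the number n,
# so D can appear nowhere in the list.
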
1.3 Laws of Probability

The simplest laws of probability are the consequences of additivity under this
assumption:

Probabilities are real numbers in the unit interval, 0 to 1, with the endpoints
reserved for certainty of falsity and of truth, respectively.

This makes it possible to read laws of probability off diagrams, much as we read
laws of logic off them.

Let's pause to recall how that works for laws of logic. Example:

De Morgan's Law. -(G&H) = -Gv-H

Here the symbols '-', '&' and 'v' stand for not, and, and or. Thus, if G is the
hypothesis that the water is green, and H is the hypothesis that it's hot, then G&H
is the hypothesis that it's green and hot, GvH is the hypothesis that it's green or hot
(not excluding the possibility that it's both), and -G and -H are the hypotheses that it's not green, and that it's not hot. Here is a diagram for De Morgan's law.

[Diagram for De Morgan's law. Stippled region: -Gv-H]

Points in such diagrams stand for the ultimate cases -- say, complete possible
courses of events, each specified in enough detail to make it clear whether each of
the hypotheses under consideration is true or false in it. The cases where G and H
are both true are represented by points in the upper left-hand corner; that's the
G&H region. The cases where at least one of G, H is true make up the GvH region,
which covers everything but the points in the lower right-hand corner, where G and
H are both false (-G&-H). And so on.

In general, the merger of two regions covers the cases where one hypothesis or the
other is true, and the intersection of two regions covers the cases where both
hypotheses are true.

Now in the diagram for De Morgan's law, above, the stippled region covers what's
outside the G&H corner; then it represents the denial -(G&H) of G&H. At the
same time it represents the merger (-Gv-H) of the lower region, where G is false,
with the right-hand region, where H is false. So the law says: denying that G and H
are both true, -(G&H), is the same as (=) asserting that G is false or H is, -Gv-H.

Adapting that sort of thing to probabilistic reasoning is just a matter of thinking of the probability of a hypothesis as its region's fraction of the area of the whole diagram. Of course the fraction for the whole Hv-H rectangle is 1, and the fraction for the empty H&-H region is 0. It's handy to be able to denote those two in neutral ways. Let's call them 1 and 0:

The Whole Rectangle: 1 = Hv-H = Gv-G etc.

The Empty Region: 0 = H&-H = G&-G etc.

Now let's read a couple of probability laws off diagrams.

Addition. P(GvH) = P(G)+P(H)-P(G&H)

Proof. The GvH area is the G area plus the H area, except that when you simply add, you count the G&H bit twice. So subtract it on the right-hand side.

Subtraction. P(G&-H) = P(G)-P(G&H)

Proof. The G&-H region is what remains of the G strip after you delete the G&H region.

We will often abbreviate by dropping ampersands (&), e.g., writing the subtraction law as follows.

Subtraction. P(G-H) = P(G)-P(GH)

Solving that for P(G), we have the rule of

Analysis. P(G) = P(GH)+P(G-H)

In general, there is a rule of n-adic analysis for each n, e.g., for n=3:

P(G) = P(GH1)+P(GH2)+P(GH3), where H1, H2, H3 are incompatible and exhaustive.
You can verify the next two rules on your own, via diagrams.

Not. P(-D) = 1-P(D)

If. P(Hv-D) = P(DH)+P(-D)

In the second, 'H if D' is understood truth-functionally, i.e., as synonymous with 'H, unless not D': H or not D.

The idea is that saying "If D then H" is a guarded way of saying "H", for in case 'D'
is false, the "if" statement makes no claim at all -- about "H" or anything else.
The next rule is an immediate consequence of the fact that logically equivalent
hypotheses, e.g., -(GH) and -Gv-H, are always represented by the same region of
the diagram.

Equivalence. Logically equivalent hypotheses are equiprobable.

That fact is also presupposed when we write '=' to indicate logical equivalence.
Thus, since -(GH) = -Gv-H, the probability of the one must be the same as the
probability of the other, for the one is the other.

Recall that to be implied by G, H must be true in every case in which G is true, not just in the actual case. (In other words, the conditional "H if G" must be valid: true as a matter of logic.) Then the G region must lie entirely inside the H region. This gives us the following rule.

Implication. If G implies H, then P(G) is not greater than P(H).
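
All of these diagram-read laws can be checked mechanically on a small finite space of equiprobable cases; the following Python sketch does so for one such space (four cases, one per truth-value combination of G and H):

from itertools import product

# Four equiprobable cases, one for each truth-value combination of G and H.
cases = list(product([True, False], repeat=2))   # (G, H) pairs

def P(predicate):
    return sum(1 for c in cases if predicate(c)) / len(cases)

G = lambda c: c[0]
H = lambda c: c[1]
both = lambda c: G(c) and H(c)
either = lambda c: G(c) or H(c)
G_not_H = lambda c: G(c) and not H(c)
H_if_G = lambda c: H(c) or not G(c)               # truth-functional "H if G"

assert P(either) == P(G) + P(H) - P(both)          # Addition
assert P(G_not_H) == P(G) - P(both)                # Subtraction
assert P(lambda c: not G(c)) == 1 - P(G)           # Not
assert P(H_if_G) == P(both) + P(lambda c: not G(c))   # If, with G as condition
print("all laws check out on this space")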

1.4 Conditional Probability

Just as we identified your ordinary (unconditional) probability for H as the price you would think fair for the ticket at the left below, we now identify your conditional probability for H given D as the price you would think fair for the ticket at its right. We wrote 'P(H)' for the first of these prices. We write 'P(H | D)' for the second.
The tickets are represented as follows in diagrammatic form, with numbers
indicating dollar values in the various cases.

The first ticket represents a simple bet on H; the second represents a conditional
bet on H, i.e., a bet that's called off (the price of the ticket is refunded) in case the
condition D fails. If D and H are both true the bet's on and you win; if D is true but
H is false the bet's on and you lose; and if D is false, the bet's off: you get your
$P(H | D) back.

With that understanding we can construct a Dutch book argument for the rule
connecting conditional and unconditional probabilities:

Product Rule. P(DH) = P(D)P(H | D)

Imagine that your pockets are stuffed with money and tickets whose prices you
think fair -- including the following three tickets. The first represents a conditional
bet on H given D; the second and third represent unconditional bets on DH and
against D, respectively. The third bet has an odd payoff, i.e., not a whole dollar,
but only $P(H | D). That's why its price isn't the full $P(-D) but only the fraction
P(-D) of the $P(H | D) that you stand to win. This third payoff was chosen to equal
the price of the first ticket. That's what makes the three fit together into a neat
book.
The three tickets are shown below in compact diagrammatic form. In each, the
upper and lower halves represent D and -D, and the left and right halves represent
H and -H. The number in each region shows the ticket's value when the
corresponding hypothesis is true.

Observe that in every possible case regarding truth and falsity of D and H the
second two tickets together have the same value as the first. Then there is nothing
to choose between the first and the other two together, and so it would be
inconsistent to place different values on them. Thus, the price you think fair for the
first ought to equal the sum of the prices you think fair for the other two: P(H | D)
= P(DH)+P(-D)P(H | D). Rewriting P(-D) as 1-P(D), this boils down to

P(H | D) = P(DH) + P(H | D) - P(D)P(H | D).

Cancelling the term on the left and the second term on the right and transposing,
we have the product rule.

That's the Dutch book argument for the product rule: to violate the rule is to place different values on the same commodity bundle when it is described in two provably equivalent ways.
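
To make the bookkeeping explicit, here is a Python sketch of the three tickets' values in the four possible cases; the numbers chosen for P(D) and P(H | D) are arbitrary, and in every case the second and third tickets together are worth exactly what the first is worth:

# The three tickets of the product-rule argument, valued in each case.
P_D = 0.6                 # arbitrary illustrative probabilities
P_H_given_D = 0.7
P_DH = P_D * P_H_given_D  # what the product rule requires

for D in (True, False):
    for H in (True, False):
        # Ticket 1: conditional bet on H given D; refunds its price if D is false.
        t1 = (1.0 if H else 0.0) if D else P_H_given_D
        # Ticket 2: $1 bet on D&H.
        t2 = 1.0 if (D and H) else 0.0
        # Ticket 3: bet against D, paying $P(H | D) if D is false.
        t3 = P_H_given_D if not D else 0.0
        assert abs(t1 - (t2 + t3)) < 1e-12
        print(D, H, t1, t2 + t3)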

1.5 Laws of Conditional Probability

Here is the product rule in a slightly different form:

Quotient Rule. P(H | D) = P(DH)/P(D), provided P(D) > 0
Graphically, the quotient rule expresses P(H | D) as the fraction of the area of the D
strip that lies in the H region. It's as if calculating P(H | D) were a matter of
trimming the square down to the D strip by discarding the blank region, and taking
the stippled region as the new unit of area. Thus the conditional probability
distribution assigns to H as its probability the H fraction of the D strip, the fraction
P(HD)/P(D).

The quotient rule is often spoken of as a definition of conditional probability in terms of unconditional ones -- when the unconditional probability of the condition
D is positive. But if P(D) is zero then by the implication rule so is P(DH), and the
quotient P(DH)/P(D) assumes the indeterminate form 0/0. Then if the quotient rule
really were its definition, the conditional probability would be undefined in all
such cases. Yet, in many cases in which P(D)=0, we do assign definite values to
P(H | D).

Example: the spinner. Although the probability is 0 that when the spinner stops it will point straight up (U) or straight down (D), we want to say that the conditional probability of up, given up or down, is 1/2: although P(U)/P(UvD) = 0/0, still P(U | UvD) = 1/2.

By applying the product rule to each term on the right-hand side of the analysis
rule, P(D) = P(DH1) + P(DH2) + ..., we get the rule of

Total Probability

If the H's are incompatible and exhaustive,

P(D) = P(D|H1)P(H1) + P(D|H2)P(H2) + ...


Example. A ball will be drawn at random from urn 1 or urn 2, with odds 2:1 of
being drawn from urn 2. Is black or white the more probable outcome?

Solution. By the rule of total probability with n=2 and D=black, we have

P(D) = P(D | H1)P(H1)+P(D | H2)P(H2) =


(3/4) (1/3)+(1/2) (2/3) =
1/4 + 1/3 = 7/12,

i.e., a bit over 1/2. So black is the more probable outcome.
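
The same computation in a few lines of Python, using the likelihoods and the 2:1 odds just given:

# Rule of total probability, with the numbers from the urn example.
P_H1, P_H2 = 1/3, 2/3          # odds 2:1 that the ball comes from urn 2
P_black_given_H1 = 3/4         # likelihoods as given in the solution
P_black_given_H2 = 1/2

P_black = P_black_given_H1 * P_H1 + P_black_given_H2 * P_H2
print(P_black)                  # 0.58333... = 7/12, a bit over 1/2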

Finally, note that for any fixed proposition D of positive probability, the function
P( | D) obeys all the laws of unconditional probability, e.g., additivity:

P(GvH | D) = P(G | D) + P(H | D) - P(G&H | D)

(Proof. Multiply both sides of the equation by P(D), and apply the product rule.)
Therefore we sometimes write the function P( | D) as PD( ), e.g., in the additivity
law:

PD(GvH) = PD(G) + PD(H) - PD(G&H)

If we condition again, on E, PD becomes PD&E:

PD(H | E) = PDE(H) = P(DEH)/P(DE) = P(H | DE)


1.6 Why '|' Can't be a Connective

The bar in 'P(H | D)' isn't a connective that turns pairs H, D of propositions into
new, conditional propositions, H if D. Rather, it is as if we wrote the conditional
probability of H given D as 'P(H, D)': the bar is a typographical variant of the
comma. Thus we use 'P' for a function of one variable as in 'P(D)' and 'P(HD)', and
also for the corresponding function of two variables as in 'P(H | D)'. The ambiguity
is harmless because in every context, presence or absence of the bar clearly marks
the distinction between the two uses. But of course the two are connected, i.e., by
the product rule, P(HD) = P(H | D)P(D). That's why it's handy to make 'P' do
double duty.

But what is it that goes wrong when we treat the bar as a statement-forming
connective, 'if'? This question was answered by David Lewis in 1976, pretty much
as follows.

Consider the simple special case of the rule of total probability where there are
only two hypotheses, H and -H:

P(X) = P(X | H)P(H) + P(X | -H)P(-H)

Now if '|' is a connective, H | D is a proposition, and we are entitled to set X = H | D above. Result:

(*) P(H | D) = P[(H | D) | H] P(H) + P[(H | D) | -H] P(-H)

So far, so good. But remember: '|' means if, so

'((H | D) | G)' means If G, then if D then H.

And as we ordinarily use the word 'if', this comes to the same as If D and G, then
H:
(H | D) | G = H | DG

(The identity means that the two sides represent the same region, i.e., the two
sentences are logically equivalent.) Now we can rewrite (*) as follows.

P(H | D) = P(H | DH)P(H) + P(H | D-H)P(-H)

-- where the two terms on the right reduce to 1·P(H) and 0·P(-H), so that (*) itself reduces to

P(H | D) = P(H).

Conclusion: If '|' is a connective ("if"), conditional probabilities don't depend on their conditions at all. That means that 'P(H | D)' would be just a clumsy way of writing 'P(H)'. And it means that P(H | D) would come to the same thing as P(H | -D), and as P(H | G) for any other statement G. That's David Lewis's "trivialization result."

In proving this, the only assumption needed about "if" was that "If A, then if B
then C" is equivalent to "If A and B then C": whatever region of a diagram
represents (C | B) | A must also represent C | BA.
CHAPTER 2: METHODOLOGY
Introduction

Huygens gave this account of the scientific method in the introduction to his
Treatise on Light (1690):

"... whereas the geometers prove their propositions by fixed and incontestable
principles, here the principles are verified by the conclusions to be drawn from
them; the nature of these things not allowing of this being done otherwise. It is
always possible thereby to attain a degree of probability which very often is
scarcely less than complete proof. To wit, when things which have been
demonstrated by the principles that have been assumed correspond perfectly to the
phenomena which experiment has brought under observation; especially when
there are a great number of them, and further, principally, when one can imagine
and foresee new phenomena which ought to follow from the hypotheses which one
employs, and when one finds that therein the fact corresponds to our prevision. But
if all these proofs of probability are met with in that which I propose to discuss, as
it seems to me they are, this ought to be a very strong confirmation of the success
of my inquiry; and it must be ill if the facts are not pretty much as I represent
them."

Here we interpret and extend Huygens's methodology in the light of the discussion
of rigidity, conditioning, and generalized conditioning in 1.7 and 1.8.

2.1 Confirmation

The thought is that you see an episode of observation, experiment, or reasoning as confirming or infirming a hypothesis depending on whether your probability for it increases or decreases during the episode, i.e., depending on whether your posterior probability, Q(H), is greater or less than your prior probability, P(H).

The degree of confirmation

Q(H)-P(H)

can be a useful measure of that change -- positive for confirmation, negative for infirmation. Others are the probability factor and the odds factor, greater than 1 for confirmation, less than 1 for infirmation:

probability factor = Q(H)/P(H)

odds factor = [Q(H)/Q(-H)] / [P(H)/P(-H)]
These are the factors by which prior probabilities P(H) or odds P(H)/P(-H) are
multiplied to get the posterior probabilities Q(H) or posterior odds Q(H)/Q(-H).
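
Put side by side in a few lines of Python (the prior and posterior values are arbitrary illustrations), the three measures look like this:

# Three measures of how an episode changes the probability of H.
P_H = 0.25          # prior, arbitrary for illustration
Q_H = 0.40          # posterior, arbitrary for illustration

degree_of_confirmation = Q_H - P_H                       # positive: confirmation
probability_factor = Q_H / P_H                           # > 1: confirmation
odds_factor = (Q_H / (1 - Q_H)) / (P_H / (1 - P_H))      # > 1: confirmation
print(degree_of_confirmation, probability_factor, odds_factor)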

By the odds on one hypothesis against another -- say, on a theory T against an alternative S -- is meant the ratio of the probability of T to the probability of S. In these terms the plain odds on T are simply the odds on T against -T. The definition of the odds factor is easily modified for the case where S is not simply -T:

Q(T)/Q(S)
Odds factor for T against S = -----------
P(T)/P(S)
The odds factor can also be expressed as the ratio of the probability factor for T to
that for S:

Q(T)/P(T)
Odds factor for T against S = -----------
Q(S)/P(S)
T is confirmed against S, or S against T, depending on whether the odds factor is greater than 1, or less.

We will choose among these measures case by case, depending on which measure
seems most illuminating.

2.2 Huygens on Light

Let H represent Huygens's principles and C the conclusions he drew from them--
i.e., the conclusions which "verify" the principles.
If C follows from H, and we can discover by observation whether C is true or false,
then we have the means to test H -- more or less conclusively, depending on
whether we find that C is false or true. If C proves false, H is refuted decisively,
for then reality lies somewhere in the shaded region of the diagram, outside the
"H" circle. If C proves true, H's probability changes from

area of "H" circle


P(H) = --------------------
area of square
to
area of "H" circle
P(H | C) = --------------------
area of "C" circle
So verification of C multiplies H's probability by 1/P(C). Therefore it is the antecedently least probable conclusions whose unexpected verification raises H's probability the most. As George Pólya put it: "More danger, more honor."

2.3 Observation and Sufficiency

It only remains to clarify the rationale for updating P to Q by conditioning on C or on -C, i.e., setting Q(H) = P(H | +/-C) depending on whether what we observe assures us of C's truth or of its falsity. According to the analysis in 1.7, the warrant for this must be rigidity (sufficiency) of truth or falsity of C as evidence about H, assuring us that whatever information the observation provides over and above a bare report of C's truth or falsity has no further relevance to H. This is guaranteed if the information about C arrives in a pre-arranged 1-word telegram: "true" or "false." But if the observers are the very people whose judgmental states are to be updated by the transition from P to Q, the possibility must be considered that the information about H conveyed by the observation will overflow the package provided by the sentence +/-C.

Of course there will be no overflow if C is found to be false, for since the shaded
region is disjoint from the "H" circle, any conditional probability function must
assign 0 to H given falsity of C. This guarantees rigidity relative to -C:
Q(H | -C) = P(H | -C) = 0

No matter what else observation might reveal about the circumstances of C's
falsity, H would remain refuted.

But overflow is possible in case of a positive result, verification of C. In this case, observation may provide further information that complicates matters by removing our warrant to update by conditioning.

Example: the Green Bean, yet again.

H: the next bean will be lime-flavored.

C: the next bean will be green.

You know that half the beans in the bag are green, all the lime-flavored ones are
green, and the green ones are equally divided between lime and mint flavors. So
P(C) = 1/2 = P(H | C), and P(H) = 1/4. But although Q(C) = 1, your probability
Q(H) for lime can drop below P(H)=1/4 instead of rising to 1/2 = P(H | C) -- e.g. if,
when you see that the bean is green you also get a whiff of mint, or also see that it
has a special shade of green that you have found to be associated with the mint-
flavored ones.

2.4 Leverrier on Neptune

We now turn to a more recent methodological story. This is how Pólya tells it:

"On the basis of Newton's theory, the astronomers tried to compute the motions of
... the planet Uranus; the differences between theory and observation seemed to
exceed the admissible limits of error. Some astronomers suspected that these
deviations may be due to the attraction of a planet revolving beyond Uranus' orbit,
and the French astronomer Leverrier investigated this conjecture more thoroughly
than his colleagues. Examining the various explanations proposed, he found that
there was just one that could account for the observed irregularities in Uranus'
motion: the existence of an extra-Uranian planet [sc., Neptune]. He tried to
compute the orbit of such a hypothetical planet from the irregularities of Uranus.
Finally Leverrier succeeded in assigning a definite position in the sky to the
hypothetical planet [say, with a 1° margin of error]. He wrote about it to another
astronomer whose observatory was the best equipped to examine that portion of
the sky. The letter arrived on the 23rd of September 1846 and in the evening of the
same day a new planet was found within one degree of the spot indicated by
Leverrier. It was a large ultra-Uranian planet that had approximately the mass and
orbit predicted by Leverrier."

We treated Huygens's conclusion as a strict deductive consequence of his principles. But Pólya made the more realistic assumption that Leverrier's prediction C (a bright spot near a certain point in the sky at a certain time) was highly probable but not 100%, given his H (i.e., Newton's laws and observational data about Uranus). So P(C | H) ~ 1; and presumably the rigidity condition was satisfied so that Q(C | H) ~ 1, too. Then verification of C would have raised H's probability by a factor ~ 1/P(C), which is large if the prior probability P(C) of Leverrier's prediction was ~ 0.

Pólya offers a reason for regarding 1/P(C) as at least 180 -- and perhaps as much as 13131: The accuracy of Leverrier's prediction proved to be better than 1°, and the probability of a randomly selected point on a circle or on a sphere being closer than 1° to a previously specified point is 1/180 for a circle, and about 1/13131 for a sphere. Favoring the circle is the fact that the orbits of all known planets lay in a common plane ("of the ecliptic"). Then the great circle cut out by that plane gets the lion's share of probability. Thus, if P(C) is half of 1%, H's probability factor will be about 200.

2.5 Multiple Uncertainties

In Pólya's story, Leverrier loosened Huygens's tight hypothetico-deductive reasoning by backing off from deductive certainty to values of P(C | H) falling somewhat short of 1 -- which he treated as approximately 1. But what is the effect of different shortfalls?

Similarly, we can back off from observational certainty to Q(C) values less than 1. What if the confirming observation had raised the probability of Leverrier's C from a prior value of half of 1% to some posterior value short of 1, say Q(C) = 95%? Surely that would have increased H's probability by a factor smaller than Pólya's 200; but how much smaller?

Again, it would be more realistic to tell the story in terms of a point prediction with stated imprecision -- say, +/- 1°. (In fact the new planet was observed within that margin, i.e., 57' from the point.) As between two theories that make such
predictions, the one making the more precise prediction can be expected to gain the
more from a confirming observation. But how much more?

The following formula for H's probability factor, which is due to John Burgess, answers such questions provided C and -C satisfy the rigidity condition.

              [Q(C)-P(C)] x [P(C | H)-P(C)]
pf(H,C) = 1 + -----------------------------
                       P(C)P(-C)

By lots of algebra you can derive this formula from basic laws of probability and generalized conditioning with n=2 (sec. 1.8). If we call the term added to 1 in pf(H,C) the strength of confirmation for H in view of C's change in probability, then we have

            [Q(C)-P(C)] x [P(C | H)-P(C)]
sc(H,C) = -------------------------------
                    P(C)P(-C)
The sign distinguishes confirmation (+) from infirmation (-, "negative
confirmation").

Exercises. What does sc reduce to in these cases?

(a) Q(C)=1 (b) P(C | H)=1 (c) Q(C)=P(C | H)=1

(d) P(C) = 0 or 1, i.e., prior certainty about C.

To see the effect of precision, suppose that C predicts that a planet will be found
within +/- e of a certain point in the sky --a prediction that is definitely confirmed,
within the limits of observational error. Thus P(C | H) = Q(C) = 1, and P(C)
increases with e. Here sc(H,C) = P(-C)/P(C) = the prior odds against C, and H's
probability factor is 1/P(C). Thus, if it was thought certain that the observed
position would be in the plane of the ecliptic, P(C) might well be proportional to e,
P(C) = ke.
Exercise. (e) On this assumption of proportionality, what happens to H's
probability factor when e doubles?

2.6 Dorling on the Duhem problem

Skeptical conclusions about scientific hypothesis-testing are often drawn from the
presumed arbitrariness of answers to the question of which to give up -- theory, or
auxiliary hypothesis -- when they jointly contradict empirical data. The problem,
addressed by Duhem in the first years of the 20th century, was agitated by Quine in
mid-century. As drawn by some of Quine's readers, the conclusion depends on his
assumption that aside from our genetical and cultural heritage, deductive logic is
all we've got to go on when it comes to theory testing. That would leave things
pretty much as Descartes saw them, just before the mid-17th century emergence in
the hands of Fermat, Pascal, Huygens and others of the probabilistic ("Bayesian")
methodology that Jon Dorling has brought to bear on various episodes in the
history of science.

The conclusion is one that scientists themselves generally dismiss, thinking they
have good reason to evaluate the effects of evidence as they do, but regarding
formulation and justification of such reasons as someone else's job -- the
methodologist's. Here is an introduction to Dorling's work on the job, using
extracts from his important but still unpublished 1982 paper.

It is presented here in terms of probability factors. Assuming rigidity relative to D, the probability factor for a theory T against an alternative theory S is the left-hand side of the following equation. The right-hand side is called the likelihood ratio. The equation follows from the quotient rule.

P(T | D)/P(S | D)      P(D | T)
------------------  =  ---------
    P(T)/P(S)           P(D | S)
The empirical result D is not generally deducible or refutable by T alone, or by S
alone, but in interesting cases of scientific hypothesis testing D is deducible or
refutable on the basis of the theory and an auxiliary hypothesis A (e.g., the
hypothesis that the equipment is in good working order). To simplify the analysis,
Dorling makes an assumption that can generally be justified by appropriate
formulation of the auxiliary hypothesis:
Prior independence

P(AT) = P(A)P(T), P(AS) = P(A)P(S)

In some cases S is simply the denial, -T, of T; in others it is a definite scientific theory R, a rival to T. In any case Dorling uses the independence assumption to expand the right-hand side of the odds factor = likelihood ratio equation. Result, with f for odds factor:

P(D | TA)P(A) + P(D | T-A)P(-A)


(1) f(T,S) = ---------------------------------
P(D | SA)P(A) + P(D | S-A)P(-A)
To study the effect of D on A, he also expands f(A,-A) with respect to T (and
similarly with respect to S):

P(D | AT)P(T) + P(D | A-T)P(-T)


(2) f(A,-A) = -----------------------------------
P(D | -AT)P(T) + P(D | -A-T)P(-T)
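
Expansions (1) and (2) translate directly into a short Python sketch (the dictionary of likelihoods and the sample numbers are illustrative only; the second function presupposes S = -T, as in the applications below):

# Dorling's expansions (1) and (2): odds factors from likelihoods and P(A), P(T).
# lik[(theory, aux)] = P(D | theory & aux), with aux True meaning A holds.
def f_T_S(P_A, lik):
    num = lik[("T", True)] * P_A + lik[("T", False)] * (1 - P_A)
    den = lik[("S", True)] * P_A + lik[("S", False)] * (1 - P_A)
    return num / den

def f_A_notA(P_T, lik):       # formula (2), with S playing the role of -T
    num = lik[("T", True)] * P_T + lik[("S", True)] * (1 - P_T)
    den = lik[("T", False)] * P_T + lik[("S", False)] * (1 - P_T)
    return num / den

# Illustration: when P(D | TA) = P(D | SA) = 0, formula (1) reduces to
# P(D | T-A)/P(D | S-A), whatever P(A) may be (the 0.2 and 0.01 are made up).
lik = {("T", True): 0.0, ("S", True): 0.0, ("T", False): 0.2, ("S", False): 0.01}
print(f_T_S(0.9, lik))        # 20.0 = 0.2/0.01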

2.7 Einstein vs. Newton, 1919

In these terms Dorling analyzes two famous tests that were duplicated, with
apparatus differing in seemingly unimportant ways, with conflicting results: one of
the duplicates confirmed T against R, the other confirmed R against T. But in each
case the scientific experts took the experiments to clearly confirm one of the rivals
against the other. Dorling explains why the experts were right:

"In the solar eclipse experiments of 1919, the telescopic observations were made
in two locations, but only in one location was the weather good enough to obtain
easily interpretable results. Here, at Sobral, there were two telescopes: one, the one
we hear about, confirmed Einstein; the other, in fact the slightly larger one,
confirmed Newton. Conclusion: Einstein was vindicated, and the results with the
larger telescope were rejected." (§4)
Notation

T: General Relativistic light-bending effect of the sun

R: No light-bending effect of the sun

A: Both telescopes are working correctly

D: The actual, conflicting data from both telescopes

Set S=R in the odds factor (1), and observe that P(D | TA) = P(D | RA) = 0. Then
(1) becomes

P(D | T-A)
(3) f(T,R) = ------------
P(D | R-A)
"Now the experimenters argued that one way in which A might easily be false was
if the mirror of one or the other of the telescopes had distorted in the heat, and this
was much more likely to have happened with the larger mirror belonging to the
telescope which confirmed R than with the smaller mirror belonging to the
telescope which confirmed T. Now the effect of mirror distortion of the kind
envisaged would be to shift the recorded images of the stars from the positions
predicted by T to or beyond those predicted by R. Hence P(D | T-A) was regarded
as having an appreciable value, while, since it was very hard to think of any similar
effect which could have shifted the positions of the stars in the other telescope
from those predicted by R to those predicted by T, P(D | R-A) was regarded as
negligibly small, hence the result as overall a decisive confirmation of T and refutation of R." (§4) Thus in (3) we have f(T,R) >> 1.

2.8 Bell's Inequalities: Holt vs. Clauser

"Holt's experiments were conducted first and confirmed the predictions of the local
hidden variable theories and refuted those of the quantum theory. Clauser
examined Holt's apparatus and could find nothing wrong with it, and obtained the
same results as Holt with Holt's apparatus. Holt refrained from publishing his
results, but Clauser published his, and they were rightly taken as excellent
evidence for the quantum theory and against hidden-variable theories." (§4)
Notation

T: Quantum theory

R: Disjunction of local hidden variable theories

A: Holt's setup is sensitive enough to distinguish T from R

D: The specific correlations predicted by T and contradicted by R are not detected by Holt's setup

The characterization of D yields the first two of the following equations. In conjunction with the characterization of A it also yields P(D | T-A) = 1, for if A is false, Holt's apparatus was not sensitive enough to detect the correlations that would have been present according to T; and it yields P(D | R-A) = 1 because of the wild improbability of the apparatus "hallucinating" those specific correlations.

P(D | TA) = 0, P(D | RA) = 1,

P(D | T-A) = P(D | R-A) = 1

Setting S=R in (1), these substitutions yield

(4) f(T,R) = P(-A)


Then with a prior probability 4/5 for adequate sensitivity of Holt's apparatus, the
odds between quantum theory and the local hidden variable theories shift strongly
in favor of the latter, e.g., with prior odds 45:55 between T and R, the posterior
odds are only 9:55, a 14% probability for T.

Why then did Holt not publish his result? Because the experimental result undermined confidence in his apparatus. Setting -T = R in (2) because T and R
were the only theories given any credence as explanations of the results, and
making the same substitutions as in (4), we have

(5) f(A,-A) = P(R)


so the odds on A fall from 4:1 to 2.2:1; the probability of A falls from 80% to 69%.
Holt is not prepared to publish with better than a 30% chance that his apparatus
could have missed actual quantum mechanical correlations; the swing to R depends
too much on a prior confidence in the experimental setup that is undermined by the
same thing that caused the swing.

Now why did Clauser publish?

Notation

T: Quantum theory

R: Disjunction of local hidden variable theories

C: Clauser's setup is sensitive enough

E: The specific correlations predicted by T and contradicted by R are detected by Clauser's setup

Suppose that P(C) = .5. At this point, although P(A) has fallen by 11%, both
experimenters still trust Holt's well-tried set-up better than Clauser's. Suppose
Clauser's initial results E indicate presence of the quantum mechanical correlations
pretty strongly, but still with a 1% chance of error. Then E strongly favors T over
R:

              P(E | TC)P(C) + P(E | T-C)P(-C)
(6) f(T,R) = ---------------------------------
              P(E | RC)P(C) + P(E | R-C)P(-C)

                .5 + (.01)(.5)
             = ---------------- = 50.5
                     .01

Starting from the low 9:55 to which T's odds fell after Holt's experiment, odds after Clauser's experiment will be 909:110, an 89% probability for T.

The result E boosts confidence in Clauser's apparatus by a factor of

P(E | CT)P(T) + P(E | CR)P(R)


(7) f(C,-C) = --------------------------------- = 15
P(E | -CT)P(T) + P(E | -CR)P(R)
This raises the initially even odds on C to 15:1, raises the probability from 50% to
94%, and lowers the 50% probability of the effect's being due to chance down to 6
or 7 percent.
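
The arithmetic of the Holt and Clauser episodes can be retraced in a few lines of Python, using the probabilities assumed above:

# Holt: f(T,R) = P(-A), formula (4), with P(A) = 4/5; prior odds on T are 45:55.
P_A = 0.8
f_holt = 1 - P_A
odds_T = 45 / 55
odds_T *= f_holt                      # falls to 9:55
print(odds_T / (1 + odds_T))          # about 0.14

# Holt's apparatus: f(A,-A) = P(R), formula (5), with P(R) = 0.55.
odds_A = 4.0 * 0.55                   # 4:1 falls to 2.2:1
print(odds_A / (1 + odds_A))          # about 0.69

# Clauser: formula (6), with the likelihoods its 50.5 presupposes
# (P(E | TC) = 1, the other three = .01) and P(C) = .5.
f_clauser = (1.0 * 0.5 + 0.01 * 0.5) / (0.01 * 0.5 + 0.01 * 0.5)
odds_T *= f_clauser                   # 9:55 times 50.5 = 909:110
print(odds_T / (1 + odds_T))          # about 0.89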

2.9 Laplace vs. Adams

Finally, note one more class of cases: a theory T remains highly probable although
(with auxiliary hypothesis A) it implies a false prediction D. With S=-T in
formulas (1) and (2), with P(D | TA)=0, and setting

t = P(D | T-A)/P(D | -T-A),     s = P(D | -TA)/P(D | -T-A)
we have

t
(8) f(T,-T) = -----------------
sP(A)/P(-A) + 1

s
(9) f(A,-A) = -----------------
tP(T)/P(-T) + 1

                tP(-A)
(10) f(T,A) = ----------
                sP(-T)
These formulas apply to (§1) "a famous episode from the history of astronomy
which clearly illustrated striking asymmetries in `normal' scientists' reactions to
confirmation and refutation. This particular historical case furnished an almost
perfect controlled experiment from a philosophical point of view, because owing to
a mathematical error of Laplace, later corrected by Adams, the same observational
data were first seen by scientists as confirmatory and later as disconfirmatory of
the orthodox theory. Yet their reactions were strikingly asymmetric: what was
initially seen as a great triumph and of striking evidential weight in favour of the
Newtonian theory, was later, when it had to be re-analyzed as disconfirmatory after
the discovery of Laplace's mathematical oversight, viewed merely as a minor
embarrassment and of negligible evidential weight against the Newtonian theory.
Scientists reacted in the `refutation' situation by making a hidden auxiliary
hypothesis, which had previously been considered plausible, bear the brunt of the
refutation, or, if you like, by introducing that hypothesis's negation as an
apparently ad hoc face-saving auxiliary hypothesis."

Notation

T: the theory, Newtonian celestial mechanics

A: The hypothesis that disturbances (tidal friction, etc.) make a negligible contribution to D

D: the observed secular acceleration of the moon.

Dorling argues on scientific and historical grounds for approximate numerical values

t = 1, s = 1/50

The general drift: t = 1 because with A false, truth or falsity of T is irrelevant to D, and t = 50s because in plausible partitions of -T into rival theories predicting lunar accelerations, P(R | -T) = 2% where R is the disjunction of rivals not embarrassed by D.

Then for a theorist whose odds are 3:2 on A and 9:1 on T (probabilities 60% for A
and 90% for T),

f(T,-T)=100/103, f(A,-A)=1/500, f(T,A)=200.

Thus the prior odds 900:100 on T barely decrease, to 900:103; the new probability
of T, 900/1003, agrees with the original 90% to two decimal places. But odds on
the auxiliary hypothesis A drop sharply, from prior 3:2 to posterior 3/1000, i.e., the
probability of A drops from 60% to about three tenths of 1%.
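
Here is the same arithmetic in a short Python sketch, using Dorling's values t = 1, s = 1/50 and the theorist's priors P(A) = .6, P(T) = .9:

# Laplace vs. Adams: the effect of the refuting datum D on T and on A.
t, s = 1.0, 1 / 50
P_A, P_T = 0.6, 0.9

f_T = t / (s * P_A / (1 - P_A) + 1)          # formula (8): 100/103
f_A = s / (t * P_T / (1 - P_T) + 1)          # formula (9): 1/500

odds_T = (P_T / (1 - P_T)) * f_T             # 9:1 falls only to 900:103
odds_A = (P_A / (1 - P_A)) * f_A             # 3:2 falls to 3:1000
print(odds_T / (1 + odds_T))                 # about 0.897, still roughly 90%
print(odds_A / (1 + odds_A))                 # about 0.003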

2.10 Dorling's conclusions

"Until recently there was no adequate theory available of how scientists should
change their beliefs in the light of evidence. Standard logic is obviously inadequate
to solve this problem unless supplemented by an account of the logical relations
between degrees of belief which fall short of certainty. Subjective probability
theory provides such an account and is the simplest such account that we possess.
When applied to the celebrated Duhem (or Duhem-Quine) problem and to the
related problem of the use of ad hoc, or supposedly ad hoc, hypotheses in science,
it yields an elegant solution. This solution has all the properties which scientists
and philosophers might hope for. It provides standards for the validity of informal inductive reasoning comparable to those which traditional logic has provided for the validity of informal
the validity of informal deductive reasoning. These standards can be provided with
a rationale and justification quite independent of any appeal to the actual practice
of scientists, or to the past success of such practices. [Here a long footnote explains
the Putnam-Lewis Dutch book argument for conditioning.] Nevertheless they seem
fully in line with the intuitions of scientists in simple cases and with the intuitions
of the most experienced and most successful scientists in trickier and more
complex cases. The Bayesian analysis indeed vindicates the rationality of
experienced scientists' reactions in many cases where those reactions were
superficially paradoxical and where the relevant scientists themselves must have
puzzled over the correctness of their own intuitive reactions to the evidence. It is
clear that in many such complex situations many less experienced commentators
and critics have sometimes drawn incorrect conclusions and have mistakenly
attributed the correct conclusions of the experts to scientific dogmatism. Recent
philosophical and sociological commentators have sometimes generalized this
mistaken reaction into a full-scale attack on the rationality of men of science, and
as a result have mistakenly looked for purely sociological explanations for many
changes in scientists' beliefs, or the absence of such changes, which were in fact, as
we now see, rationally de rigueur.

"It appears that in the past even many experts have sometimes been misled in
trickier reasoning situations of this kind. A more widespread understanding of the
adequacy and power of the kinds of Bayesian analyses illustrated in this paper
could prevent such mistakes in the future and could form a useful part of standard
scientific education. It would be an exaggeration to say that it would offer a wholly
new level of precision to informal scientific reasoning, for of course the
quantitative subjective probability assignments in such calculations are merely
representative surrogates for informal qualitative judgments. Nevertheless the
qualitative conclusions which can be extracted from these relatively arbitrary
quantitative illustrations and calculations seem acceptably robust under the
relevant latitudes in those quantitative assignments. Hence if we seek to avoid
qualitative errors in our informal reasoning in such scientific contexts, such
illustrative quantitative analyses are an exceptionally useful tool for ensuring this,
as well as for making explicit the logical basis for those qualitative conclusions
which follow correctly from our premises, but which are sometimes nevertheless
surprising and superficially paradoxical." (§5)

2.11 Problems

1 "Someone is trying decide whether or not T is true. He notes that T is a


consequence of H. Later he succeeds in proving that H is false. How does this
refutation affect the probability of T?" In particular, what is P(T)-P(T|~ H)?

2 "We are trying to decide whether or not T is true. We derive a sequence of


consequences from T, say C1, C2, C3, ... . We succeed in verifying C1, then C2, then
C3, and so on. What will be the effect of these successive verifications on the
probability of T?" In particular, setting P(T|C1&C2&... Cn-1&Cn) = pn, what is the
probability factor pn/pn+1?

3 Four Fallacies. Each of the following plausible rules is unreliable. Find counterexamples to (b), (c), and (d) on the model of the one for (a) given below.

(a) If D confirms T, and T implies H, then D confirms H. Counterexample: in an eight-ticket lottery, let D mean that the winner is ticket 2 or 3, T that it is 3 or 4, H that it is neither 1 nor 2.

(b) If D confirms H and T separately, it must confirm their conjunction, T&H.

(c) If D and E each confirm H, then their conjunction, D&E, must also confirm H.

(d) If D confirms a conjunction, T&H, then it can't infirm each conjunct separately.
2.12 Notes

Sec. 2.1. The term "Bayes factor" or simply "factor" is more commonly used than
"odds factor". Call it `f'. A useful variant is its logarithm, sc., the weight of
evidence for T against S:

w(T, S) = log f(T, S)

As the probability factor varies from 0 through 1 to ∞, its logarithm varies from -∞ through 0 to +∞, thus equalizing the treatments of confirmation and infirmation. Where the odds factor is multiplicative for odds, weight of evidence is additive for logarithms of odds ('lods'):

(new odds) = f x (old odds)

log(new odds) = w + log(old odds)

Sec. 2.2: "More danger, more honor." See George Pólya, Patterns of Plausible
Inference, 2nd ed., Princeton University Press 1968, vol. 2, p. 126.

Sec. 2.4. See Pólya, op. cit., pp. 130-132.

Sec. 2.6. See Jon Dorling, "Bayesian personalism, the methodology of research
programmes, and Duhem's problem" Studies in History and Philosophy of Science
10(1979)177-187.

More along the same lines: Michael Redhead, "A Bayesian reconstruction of the
methodology of scientific research programmes," Studies in History and
Philosophy of Science 11(1980)341-347.

Dorling's unpublished paper from which excerpts appear here in sec. 2.7 - 2.10 is
"Further illustrations of the Bayesian solution of Duhem's problem" (29 pp.,
photocopied, 1982). References here ("§4" etc.) are to the numbered sections of
that paper.

Dorling's work is also discussed in Colin Howson and Peter Urbach, Scientific
Reasoning: the Bayesian approach (Open Court, La Salle, Illinois, 2nd ed., 1993).

Sec. 2.10, the Putnam-Lewis Dutch book argument (i.e., for conditioning as the
only legitimate updating policy). Putnam stated the result, or, anyway, a special
case, in a 1963 Voice of America Broadcast, "Probability and Confirmation",
reprinted in his Mathematics, Matter and Method, Cambridge University Press
(1975)293-304. Paul Teller, "Conditionalization and observation", Synthese
26(1973)218-258, reports--and attributes to David Lewis--a general argument to
that effect which Lewis had devised as a reconstruction of what Putnam must have
had in mind.

Sec. 2.11. Problems 1 and 2 are from George Pólya, "Heuristic reasoning and the
theory of probability", American Mathematical Monthly48(1941)450-465. Problem
3 relates to Carl G. Hempel's "Studies in the logic of confirmation", Mind
54(1945)1-26 and 97-121. Reprinted in Hempel's Aspects of Scientific Explanation,
The Free Press, New York, 1965.
CHAPTER 3: EXPECTATION

Introduction
It was in terms of gambling that Pascal, Fermat, Huygens and others in their wake
floated the modern probability concept. Betting was their paradigm for action
under uncertainty; adoption of odds or probabilities was the relevant form of
factual judgment. They saw probabilistic factual judgment and graded value
judgment as a pair of hands to shape decision.

3.1 Desirability

The matter was put as follows in the final section of a most influential 17th century
How to Think book, "The Port-Royal Logic" (1662). "To judge what one must do
to obtain a good or avoid an evil, it is necessary to consider not only the good and
the evil themselves, but also the probability that they happen, or not; and to view
geometrically the proportion that all these things have together." This
"geometrical" view takes seriously the perennial image of deliberation as a
weighing in the balance. Where a course of action might eventuate in a good or an
evil, we are to weigh the probabilities of those outcomes in a balance whose arms
are proportional to the gain and the loss that the outcomes would bring. To
consider the "good and the evil themselves" is to compare their desirability
differences, g-f and f-e in Fig. 1.

Fig. 1. Lengths are proportional to desirability differences, weights to probabilities.


The desirability of the course of action is represented by the position f of the
fulcrum about which the opposed turning effects of the weights just cancel.
Example 1. The last digit. The action under consideration is a bet on the last digit
of the serial number of a $5 bill in your pocket: if it's one of the 8 digits from 2 to
9, you give me the bill; if it's 0 or 1, I give you $20. Then my odds on winning are
4:1. In the balance diagram, that's the ratio of weights in the pans. Suppose I have
$100 on hand. Crassly, I might see my present desirability level as f = 100, and
equate the desirabilities g and e with the cash I'll have on hand if I win and lose:
g=105 and e=80. Now the 4:1 odds between the good and the evil agree with the
4:1 ratio of loss (f-e = 20) to gain (g-f = 5) as I see it. I'd think the bet fair.

In example 1, the options ("acts") were (G) take the gamble, and (-G) don't. My
desirabilities des(act) for these were averages of my desirabilities des(level & act)
for possible levels of wealth after acts, weighted with my probabilities P(level | act)
for levels given acts:

des(G) = des($80 & G)P($80 | G) + des($105 & G)P($105 | G)

des(-G) = des($100 & -G)

If desirability equals wealth in dollars no matter whether it is a gain, a loss, or the
status quo, these work out as:

des(G) = (80)(.2)+(100)(0)+(105)(.8) = 100

des(-G) = (80)(0)+(100)(1)+(105)(0) = 100

Then my desirabilities for the two acts are the same, and I am indifferent between
them.
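If you like to check such arithmetic by machine, a short Python sketch of Example 1 might run as follows (the probabilities .8 and .2 come from the 4:1 odds on winning; the function name is mine):

def des(outcomes):
    # Desirability of an act: probability-weighted average of outcome desirabilities.
    # 'outcomes' is a list of (desirability, probability) pairs whose probabilities sum to 1.
    return sum(d * p for d, p in outcomes)

des_G = des([(105, 0.8), (80, 0.2)])   # take the gamble: win $5 or lose $20
des_notG = des([(100, 1.0)])           # decline: keep the $100 for sure
print(des_G, des_notG)                 # 100.0 and 100.0: indifference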

In the next example preference between acts reveals a previously unknown feature
of desirabilities.

Example 2. The Heavy Smoker. The following statistics were provided by the
American Cancer Society in the early 1960's.
Percentage of American Men Aged 35
Expected to Die before Age 65
Nonsmokers 23%
Cigar and Pipe Smokers 25%
Cigarette smokers:
Less than 1/2 pack a day 27%
1/2 to 1 pack a day 34%
1 to 2 packs a day 38%
2 or more packs a day 41%
In 1965, Diamond Jim, a 35-year-old American man, had found that if he smoked
cigarettes at all, he smoked 2 or more packs a day. Thinking himself incapable of
quitting altogether, he saw his options as the following two.
C = Continue to smoke 2 or more packs a day
S = Switch to pipes and cigars
And he saw these as the relevant conditions:
L = He lives to age 65 or more
D = He dies before the age of 65
His probabilities came from the statistics in the normal way, so that, e.g., P(D | C) = .41 and P(D | S) = .25. Thus, his conditional probability matrix was as follows.
        L      D
C      .59    .41
S      .75    .25
Unsure of the desirabilities of the four conjunctions of C and S with D and L, he
was clear that DS (= die before age 65 in spite of having switched) was the worst
of them; and he thought that longevity and cigarette-smoking would contribute
independent increments of desirability, say l and c:

des(LS) = des(DS)+l, des(LC) = des(DC)+l

des(LC) = des(LS)+c, des(DC) = des(DS)+c

Then if we set the desirability of the worst conjunction equal to d, his desirability
matrix is this:

L D
C d+c+l d+c
S d+l d
Now in Diamond Jim's judgment the desirability of (C) continuing to smoke 2
packs a day and of (S) switching are as follows.
des(C) = des(LC)P(L | C) + des(DC)P(D | C)

= (d +c +l )(.59) + (d +c )(.41) = d +c +.59l

des(S) = des(LS)P(L | S) + des(DS)P(D | S)

= (d +l )(.75) + (d )(.25) = d +.75l

The difference des(C)-des(S) between these is c - .16l. If Diamond Jim preferred to continue smoking, this was positive; if he preferred switching, it was negative.

Fact: Diamond Jim switched. Then the difference was negative, i.e., c was less
than 16% of l: his preference for cigarettes over pipes and cigars was less than
16% as intense as his preference for living to age 65 or more over dying before age
65.
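The algebra can be checked symbolically; here is a minimal sketch using the sympy library (assuming it is available), with d, c, l as in the example:

from sympy import symbols, Rational, simplify

d, c, l = symbols('d c l')   # worst-case level, cigarette increment, longevity increment

des_C = (d + c + l) * Rational(59, 100) + (d + c) * Rational(41, 100)  # continue to smoke
des_S = (d + l) * Rational(3, 4) + d * Rational(1, 4)                  # switch to pipes and cigars

print(simplify(des_C - des_S))   # c - 4*l/25, i.e. c - .16l: d drops out, as in the text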

3.2 Problems

1 Train or Plane? With regard to cost and safety, train and plane are equally good
ways of getting from Los Angeles to San Francisco. The trip takes 8 hours by train
but only 1 hour by plane, unless the San Francisco airport proves to be fogged in,
in which case the plane trip takes 15 hours. The weather forecast says there are 7
chances in 10 that San Francisco will be fogged in. If your desirabilities are simply
negatives of travel times, how should you go?

2 The point of Balance. What must the probability of fog be, in problem 1, to make
you indifferent between plane and train?

3 You assign the following desirabilities to wealth.

$: 0 10 20 30 40

des($): 0 10 17 22 26
a. With assets of $20 you are offered a gamble to win $10 with probability .58 or
otherwise lose $10. Work out your desirabilities for accepting and rejecting the
offer. Note that you should reject it.

b. What if you had been offered a gamble consisting of two independent plays of
the gamble in (a)? Should you have accepted?

4 The Allais Paradox. You may choose one of the following options at no cost to
yourself. Don't calculate, just decide!

A: One million dollars ($1M) for sure.

B: 10 or 1 or 0 $M with probabilities 10%, 89%, 1%.

What if you were offered the following options instead? Decide!

C: $1M or $0 with probabilities 11%, 89%.

D: $10M or $0 with probabilities 10%, 90%.

Note your intuitive answers; then compute your desirabilities for the four options, using x, y, z for the desirabilities of $10M, $1M, $0. Verify that the excess desirability of A over B must be the same as that for C over D. Thus a policy of maximizing conditional expectation of dollar payoffs would have you prefer C to D if you prefer A to B.

5 The Ellsberg Paradox. A ball will be drawn from an urn containing 90 balls: 30
red, the rest black and yellow in some unknown ratio. As in problem 4, choose
between A and B, and between C and D. Then calculate.

A: $100 if red, $0 if not. B: $100 if black, $0 if not.


C: $0 if red, $100 if not. D: $0 if black, $100 if not.

6 Deutero Russian Roulette. You've got to play Russian roulette, using a six-shooter that has 2 loaded chambers. You've got a million, and would pay it all to empty both chambers before you have to pull the trigger. Show that if dying rich is no better than dying poor, and it's the prospects of being dead, or being alive at various levels of wealth, to which you attach your various desirabilities, the present decision theory would advise you to pay the full million to have just 1 bullet removed if originally there were 4 in the cylinder.

7 Proto Russian Roulette. If dying rich is no better than dying poor, and
des(Dead)=0, des(Rich)=1, how many units of desirability is it worth to remove a
single bullet before playing Russian Roulette when the six-shooter has e empty
chambers?

8 In the Allais and Ellsberg paradoxes, and in Proto Russian Roulette, many people
would choose in ways incompatible with the analyses suggested above. Thus, in
problem 4, the desirability of being so unlucky as to win nothing in option B -
having passed up the option (A) of a sure million - is often seen as much lower
than the desirability of winning nothing in option C or D. Verify that the view of
decision-making as desirability maximization needn't then see preference for A
over B and for D over C as irrational.

Review the Allais and Ellsberg paradoxes in that light. (Note that in each case the question of irrationality is addressed to the agent's values, i.e., determinants of desirability, rather than to how the agent weighs those values together with probabilities.)

9 It takes a dollar to ride the subway. You and I each have a half-dollar coin, and
sorely need a second. For each, desirabilities of cash are as in the graph above, so
we decide to toss one coin and give both to you or to me depending on whether the
head or tail turns up. In dollars, each thinks the gamble neither advantageous nor
disadvantageous, since the expectation is 50 cents, i.e., half way between losing
($0) and winning ($1). But in desirability, each thinks the gamble advantageous.
To see why, read (a) des(gamble) and (b) des(don't) off the graph.
10 The Certainty Equivalent of a Gamble

According to the graph, $50 in hand is more desirable than a ticket worth $100 or
$0, each with probability 1/2, a ticket of "actuarial value" $50. How many dollars
in hand would be exactly as desirable as the ticket?

11 The St. Petersburg Paradox.

"Peter tosses a coin and continues to do so until it should land "heads" when it
comes to the ground. He agrees to give Paul one ducat if he gets "heads" on the
very first throw, two ducats if he gets it on the second, four if on the third, eight if on the fourth, and so on, so that with each additional throw the number of ducats
he must pay is doubled. Suppose we seek to determine the value of Paul's
expectation."

Paul's probability that the first head comes on the n'th toss is pn = 1/2^n, and in that case Paul's receipt is rn = 2^(n-1). Then Paul's expectation of gain, p1r1 + p2r2 + ..., will be 1/2 + 1/2 + ... = ∞. Then should Paul be glad to pay any finite sum for the privilege of playing?
"This seems absurd because no reasonable man would be willing to pay 20 ducats
as equivalent. You ask for an explanation of the discrepancy between the
mathematical calculation and the vulgar evaluation. I believe that it results from
the fact that, in their theory, mathematicians evaluate money in proportion to its
quantity while, in practice, people with common sense evaluate money in
proportion to the utility they can obtain from it."

If the desirability des(r) of receiving r ducats increases more and more slowly,
des(gamble) might be finite:

des(gamble) = des(r1)/2 + des(r2)/4 + ... + des(rn)/2^n + ...

"If, for example, we suppose the moral value of goods to be directly proportionate
to the square root of their mathematical quantities, e.g., that the satisfaction
provided by 40,000,000 is double that provided by 10,000,000, my psychic
expectation becomes 1/2+ 2/4+ 4/8+ 8/16... = 1/(2- 2)."

On this reckoning Paul should not be willing to pay as much as 3 ducats to play the game, for des(3) is √3, i.e., 1.73..., which is larger than des(gamble) = 1.70...

But the paradox reappears as long as des(r) does eventually exceed any preassigned value, for then a variant of the St. Petersburg game can be devised in which the payoffs rn are large enough so that des(r1)p1 + des(r2)p2 + ... = ∞.

Problem. With des(r) = √r as above, find payoffs rn that restore the paradox.
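For readers who want to see these sums numerically, here is a small Python sketch of the quantities in problem 11 (truncated at 50 tosses; the variable names are mine):

from math import sqrt

N = 50                                      # truncate the infinite sums
p = [1 / 2**n for n in range(1, N + 1)]     # P(first head on toss n) = 1/2^n
r = [2**(n - 1) for n in range(1, N + 1)]   # payoff on toss n: 2^(n-1) ducats

linear = sum(pi * ri for pi, ri in zip(p, r))        # each term is 1/2, so this is N/2: no limit
cramer = sum(pi * sqrt(ri) for pi, ri in zip(p, r))  # converges to 1/(2 - sqrt(2)) = 1.7071...
print(linear, cramer, 1 / (2 - sqrt(2)))

# Payoffs r_n = (2^n)^2 restore the paradox under sqrt utility:
# each term sqrt(r_n)/2^n equals 1, so the truncated sum is just N.
restored = sum(sqrt((2**n)**2) / 2**n for n in range(1, N + 1))
print(restored)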

3.3 Rescaling

A balanced beam would remain balanced if expanded or contracted uniformly
about the fulcrum, e.g., if each inch stretched to a foot, or shrank to a centimeter.
That's because balance is a matter of cancellation of the net clockwise and
counterclockwise turning effects, and uniform expansion or contraction would
multiply each of these by a common factor k, e.g., k=12 if inches stretch to feet
and k=.3937 if inches shrink to centimeters.

Applying the laws of the lever to choice, we conclude that nothing relevant to
decision-making depends on the size of the unit of the desirability scale.
Furthermore, nothing relevant to decision-making depends on the location of the
zero of the desirability scale. In physics this corresponds to the fact that if a beam
is in balance then the turning effects of the various forces on it about any one point
add up to 0, whether or not the point is the fulcrum - provided we view the fulcrum
as pressing upward with a force equal to the net weight of the loaded beam.

Then if numbers des(H) accurately represent your judgments of how good it would
be for hypotheses H to be true, so will the numbers ades(H), where a is any
positive constant. That's because multiplying by a positive constant is just a matter
of uniformly shrinking or stretching the scale -- depending on whether the constant
is greater or less than 1 (as when lengths in feet look 12 times as great in inches
and 1/3 as great in yards). And if numbers des(H) accurately represent your valuations, so will ades(H)+b, where a is positive and b is any constant at all; for adding the same constant b to all values just shifts the zero of the scale, leaving differences between values (gains and losses) unchanged.

E.g., in example 2 of sec. 3.1, we can set d=0, l=1 without thereby making any substantive assumptions about Diamond Jim's desirabilities. On that scale, desirabilities of the acts are simply des(C) = c + .59 and des(S) = .75.

Two desirability assignments des and des' determine the same preferences among options if the graph of one against the other is a straight line des'(H) = ades(H)+b as in Fig. 1, sloping up to the right, so that des' is a positive linear transform of des.
The multiplicative constant a is the line's slope (rise per unit run); the additive
constant b is the des'-intercept, the height at which the line cuts the des'-axis.

Fig. 1. des and des' are equivalent desirability scales.
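The invariance claim is easy to check numerically; here is a minimal Python sketch (the desirability numbers are arbitrary placeholders, not taken from the text):

des = {'act1': 2.0, 'act2': 5.0, 'act3': 3.5}   # some desirability scale (made-up numbers)

def rescale(table, a, b):
    # Positive linear transform des'(H) = a*des(H) + b, with a > 0.
    return {h: a * v + b for h, v in table.items()}

original_order = sorted(des, key=des.get)
for a, b in [(12, 0), (0.3937, -100), (7, 42)]:
    new = rescale(des, a, b)
    assert sorted(new, key=new.get) == original_order
print("preference order unchanged under every positive linear transform tried")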

There is less scope for rescaling probabilities. If the weights in all pans are
multiplied by the same positive constant, balance will not be disturbed; but no
other sort of change, e.g., adding the same extra weight to all pans, can be relied
upon never to unbalance the scales.

It would be all right to use different positive numbers as probabilities of a sure thing in different problems -- perhaps, the upward force at the fulcrum. Although we do use 1 as the probability of a sure thing in every problem, that is just a convention; any positive constant would do.

On the other hand, we adopt no such convention for desirabilities, e.g., we do not
insist that the desirability of a sure thing (Av-A) always be 0, or that the
desirability of an impossibility (A&-A) always be 0.

3.4 Expectations, RV's, Indicators

My expectation of any unknown quantity -- any so-called "random variable" (or
"RV" for short) -- is a weighted average of the values I think it can assume, in
which the weights are my probabilities for those values.

Example 1. Giant Pandas. Let X = the birth weight to the nearest pound of the next
giant panda (ailuropoda melanoleuca) to be born in captivity. If I have definite
probabilities p0, p1, etc. for the hypotheses that X = 0, 1, ... , 99, etc., my
expectation of X will be

0. p0 + 1. p1 + ... +99. p99 + ...

This sum can be stopped after the 99th term without affecting its value, since I
attribute probability 0 to values of X of 100 or more.

It turns out that probability itself is an expectation:

Indicator Property. My probability p for truth of a hypothesis is my expectation of
a random variable (the "indicator" of the hypothesis) that has value 1 or 0
depending on whether the hypothesis is true or false.
Proof. As 1 and 0 are the only values this RV can assume, its expectation is 1p +
0(1-p), i.e., p.

Observe that an expected value needn't be one of the values the random variable
can actually assume. Thus in the Panda example X must be a whole number; but its
expected value, which is an average of whole numbers, need not be a whole
number. Nor need my expectation of the indicator of past life on Mars be one of
the values that indicators can assume, i.e., 0 or 1; it might well be 1/10, as in the
story at the beginning of chapter 1.

The indicator property of expectation is basic. So is this:

Additivity. Your expectation of a sum is

the sum of your expectations of its terms.

From additivity it follows that your expectation of X+X is twice your expectation
of X, your expectation of X+X+X is three times your expectation of X, and for any
whole number n,

Proportionality. Your expectation of

nX is n times your expectation of X.

Example 2. Success Rates. I attribute the same probability, p, to success on each trial of an experiment. Consider the indicators of the hypotheses that the different
trials succeed. The number of successes in the first n trials will be the sum of the
first n indicators; the success rate in the first n trials will be that sum divided by n.
Now by additivity, my expectation of the number of successes must be the sum of
my expectations of the separate indicators. Then by the indicator property, my
expectation of the number of successes in the first n trials is np. Therefore my expectation of the success rate in the first n trials must be np divided by n, i.e., p itself.
That last statement deserves its own billing:

Calibration Theorem. If you have the same probability (say, p) for success on each
trial of an experiment, then p will also be your expectation of the success rate in
any finite set of trials.

The name comes from the jargon of weather forecasting; forecasters are said to be well calibrated when the fraction of truths ("success rate") is p among statements to which they have attributed p as probability. Thus the theorem says: forecasters expect to be well calibrated.
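A brute-force numerical check of the calibration theorem (for the enumeration the trials are taken to be independent, though the theorem itself needs only additivity; p and n are arbitrary choices of mine):

from itertools import product

p, n = 0.7, 4                                    # probability of success on each trial; number of trials
expected_rate = 0.0
for outcome in product([0, 1], repeat=n):        # all 2^n success/failure sequences
    prob = 1.0
    for s in outcome:
        prob *= p if s else 1 - p                # independence assumed, only to enumerate
    expected_rate += prob * sum(outcome) / n     # success rate, weighted by its probability
print(expected_rate)                             # 0.7 = p (up to rounding), as the theorem says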

3.5 Why Expectations are Additive

Like probabilities, expectations can be related to prices. My expectation of a
magnitude X can be identified with the buying-or-selling price I'd think fair for a
ticket that can be cashed for X(r) units of currency. A ticket for X as in the Panda
example is shown above. By comparing prices and values of combinations of such
tickets we can give a Dutch book argument for additivity of expectations.

Suppose x and y are your expectations of magnitudes X and Y -- say, rainfall in
inches during the first and second halves of next year -- and z is your expectation
for next year's total rainfall, X+Y. Why should z be x+y?

Because in every eventuality about rainfall in the two half-years, the first two of these tickets together are worth the same as the third:
Then unless the prices you would pay for the first two add up to the price you
would pay for the third, you are inconsistently placing different values on the same
prospect, depending on whether it is described to you in one or the other of two
provably equivalent ways.

3.6 Conditional Expectation

Just as we defined your conditional probabilities as the prices you'd think fair for
tickets that represent conditional bets, so we define conditional expectations:

Your conditional expectation E(X | H) of the random variable X given truth of the
statement H is the price you'd think fair for the following ticket:

Corresponding to the notation E(X | H) for your conditional expectation for X, we
use E(X) for your unconditional expectation for X, and E(XY) for your
unconditional expectation of the product of the magnitudes X and Y.

The following rule might be viewed as a definition of conditional expectations in
terms of unconditional ones in case P(H) ≠ 0, just as the quotient rule for probabilities might be viewed as a definition of P(G | H) as the quotient P(G&H)/P(H). On the right-hand side, IH is the indicator of H. Therefore P(H) = E(IH).

Quotient Rule.

E(X | H) = E(X. IH)/P(H)

The quotient rule is equivalent to the following relationship between conditional
and unconditional expectations.
Product Rule.

E(X. IH) = E(X | H)P(H)

A "Dutch book" consistency argument for this relationship can be modelled on the
one given in sec. 4 of for the corresponding relationship between probabilities.
Consider the following tickets. Clearly, the first has the same value as the pair to
its right, whether H is true or false. And those two have the same values as the ones
under them, for XIH is X if H is true and 0 if H is false, and the last ticket just
duplicates the one above it. Then unless your price for the first ticket is the sum of
your prices for the last two, i.e., unless the condition

E(X | H) = E(X. IH) + E(X | H)P(-H),

is met, you are inconsistently placing different values on the same prospect
depending on whether it is described in one or the other of two provably equivalent
ways. Now set P(-H) = 1-P(H) in this condition, and simplify. It boils down to the
product rule.

Historical Note. Solved for P(H), the product rule determines H's probability as the
ratio of E(X. IH) to E(X | H). As IH is the indicator of H, X. IH is X or 0 depending
on whether H is true or false. Thus, viewed as a statement about P(H), the product
rule corresponds to Thomas Bayes's definition of probability:

"The probability of any event is the ratio between the value at which an expectation
depending on the happening of the event ought to be computed, and the value of
the thing expected upon its happening."
E(X. IH) is the value you place on the ticket at the left; E(X | H) is the value you
place on the ticket at the right:

The first ticket is "an expectation depending on the happening of the event", i.e., an
expectation of $X depending on the truth of H, an unconditional bet on H. Your
price for the second ticket, $E(X | H), is "the value of the thing expected [$X] upon
its [H's] happening": as you get the price back if H is false, your uncertainty about
H doesn't dilute your expectation of X here, as it does if the ticket is worthless
when H is false.

3.7 Laws of Expectation

The basic properties of expectation are the product rule and

Linearity. E(aX+bY+c) = aE(X)+bE(Y)+c

Three notable special cases of the linearity equation are obtained by setting a=b=1
and c=0 (additivity), b=c=0 (proportionality), and a=b=0 (constancy):

Additivity. E(X+Y)=E(X)+E(Y)

Proportionality. E(aX)=aE(X)

Constancy. E(c)=c

By repeated application, the additivity equation can be seen to hold for arbitrary
finite numbers of terms -- e.g., for 3 terms, by applying 2-term additivity to
X+(Y+Z):
E(X+Y+Z) = E(X)+E(Y)+E(Z)

The magnitudes of which we have expectations are called "random variables" since
they may have various values with various probabilities. May have: in the linearity
property, a, b, and c are constants, so that, as we have seen, E(c) makes sense, and
equals c. But more typically, a random variable might have any of a number of
values as far as you know. Convexity says that your expectation for the variable
cannot be larger than all of those values, or smaller than all of them:

Convexity. E(X) lies in the range from the

largest to the smallest values that X can assume.

Where X can assume only a finite number of values, convexity follows from linearity.

The following connection between conditional and unconditional expectations is of
particular importance. Here, the H's are any hypotheses whatever.

Total Expectation. If no two of H1, H2, ...

are compatible, and H is their disjunction, then

E(X | H) = E(X | H1)P(H1| H) + E(X | H2)P(H2| H)+ ...

Proof. X . IH = X . IH1 + X . IH2 + ... ; apply E to both sides, then use additivity and
the product rule. Divide both sides by P(H), and use the fact that P(Hi)/P(H) =
P(Hi| H) since Hi&H = Hi.
Note that when conditions are certainties, conditional expectations reduce to unconditional ones:

Certainty. E(X | H) = E(X) if P(H)=1

Applying conditions. It is always OK to apply conditions of form Y = blah (e.g., Y
= 2X) appearing at the right of the bar, to rewrite Y as blah at the left:

E(... Y... | Y=blah) = E(... blah... | Y=blah) (OK!)

The Discharge Fallacy. But we cannot generally discharge a condition Y=blah by
rewriting Y as blah at the left and dropping the condition; e.g., E(3Y^2 | Y=2X) cannot be relied upon to equal E(3(2X)^2). In general:

E(... Y... | Y=blah) = E(... blah... ) (NOT!)

The problem of the two sealed envelopes. One contains a check for an unknown
whole number of dollars, the other a check for twice or half as much. Offered a
free choice, you pick one at random. You might as well have chosen the other,
since you think them equally likely to contain the larger amount. What is wrong
with the following argument for thinking you have chosen badly? "Let X and Y be
the values of the checks in the one and the other. As you think Y equally likely to
be .5X or 2X, E(Y) will be .5E(.5X) + .5E(2X) = 1.25E(X), which is larger than
E(X)."

A Valid Special Case of the Discharge Fallacy. As an unrestricted rule of inference, the discharge fallacy is unreliable. (That's what it is, to be a fallacy.) But
it becomes a valid rule of inference when "blah" represents a constant, e.g., as in
"Y=.5". Show that the following is valid.
E(... Y... | Y=constant) = E(... constant... )

3.8 Physical Analogies; Mean and Median

Hydraulic Analogy. Let "F" and "S" mean heads on the first and second tosses of
an ordinary coin. Suppose you stand to gain a dollar for each head. Then your net
gain in the four possibilities for +/- F and +/- S will be as shown at the left below.

Think of that as a map of flooded walled fields in a plain, with the numbers
indicating water depths in the four sections. In the four regions, depths are values
of X and areas are probabilities. To find your expectation for X, open sluices so
that the water reaches a common level in all sections. That level will be E(X). To
find your conditional expectation for X given F, open a sluice between the two
sections of F so that the water reaches a single level in F. That level will be E(X |
F), i.e., 1.5. Similarly, E(X | -F) = 0.5. To find your unconditional expectation of
gain, open all four sluices so that the water reaches the same level throughout:
E(X) = 1.
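The sluice arithmetic can be checked with a few lines of Python (F = heads on the first toss; X = number of heads, i.e., dollars gained):

cases = {('H', 'H'): 2, ('H', 'T'): 1, ('T', 'H'): 1, ('T', 'T'): 0}   # X in each case
P = {case: 0.25 for case in cases}                                      # equiprobable cases

def E(condition=lambda case: True):
    # Conditional expectation of X given the condition (unconditional by default).
    total = sum(P[c] for c in cases if condition(c))
    return sum(P[c] * cases[c] for c in cases if condition(c)) / total

print(E(lambda c: c[0] == 'H'))   # E(X | F)  = 1.5
print(E(lambda c: c[0] == 'T'))   # E(X | -F) = 0.5
print(E())                        # E(X)      = 1.0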

There is no mathematical reason for magnitudes X to have only a finite number of values, e.g., we might think of X as the birth weight in pounds of the next giant
panda to be born in captivity -- to no end of decimal places of accuracy, as if that
meant something. (It doesn't. The commonplace distinction between panda and
ambient moisture, dirt, etc. isn't drawn finely enough to let us take the remote
decimal places seriously.) We can extend the hydraulic analogy to such continuous
magnitudes by supposing that the fields may be pitted and contoured so that water
depth (X) can vary continuously from point to point. But cases like temperature,
where X can also go negative, require more tinkering -- e.g., solid state H2O, with
heights on the iceberg representing negative values.
Balance. The balance analogy (sec. 3.1, Fig. 1) is more easily adapted to the
continuous case. The narrow rigid beam itself is weightless. Positions on it
represent values of a magnitude X that can go negative as well as positive. Pick a
zero, a unit, and a positive direction on the beam. Get a pound of modelling clay,
and distribute it along the beam so that the weight of clay on each section
represents the probability that the true value of X is in that section. (Fig. 1 below is
an example -- where, as it happens, X cannot go negative.)

Example. "The Median isn't the Message" In 1985 Stephen Gould wrote: "In 1982,
I learned I was suffering from a rare and serious cancer. After surgery, I asked my
doctor what the best technical literature on the cancer was. She told me ... that
there was nothing really worth reading. I soon realized why she had offered that
humane advice: my cancer is incurable, with a median mortality of eight months
after discovery."

In terms of the balanced beam analogy, here are the key definitions, of the terms
"median" and "mean" -- the latter being a synonym for "expectation":

The median is the point on the beam that divides the weight of clay in half: the
probabilities are equal that the true value of X is represented by a point to the right
and to the left of the median.

The mean (= your expectation ) is the point of support at which the beam would
just balance.

Gould continues: "The distribution of variation had to be right skewed, I reasoned.
After all, the left of the distribution contains an irrevocable lower boundary of zero
(since mesothelioma can only be identified at death or before). Thus there isn't
much room for the distribution's lower (or left) half -- it must be scrunched up
between zero and eight months. But the upper (or right) half can extend out for
years and years, even if nobody ultimately survives."

See Fig. 1, below. Because the distribution is skewed (stretched out) to the right, its median lies to the left of its mean; Gould's life expectancy is greater than 8 months. (The mean of 24 months suggested in the graph is my invention. I
don't know the statistics.)
Fig. 1. Locations on the beam are months lived after diagnosis; the weight of clay
on the interval from 0 to m is the probability of still being alive in m months.

The effect of skewness can be seen especially clearly in the case of discrete
distributions like the following. Observe that if the right-hand weight is pushed
further right the mean will follow, while the median stays fixed. The effect is most
striking in the case of the St. Petersburg game, where the median gain is between 1
and 2 ducats but the expected (mean) gain is infinite.

Fig. 2. The median stays between the second and third blocks no matter how far
right you move the fourth block.
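A small Python sketch makes the point of Fig. 2 concrete (the four equal weights and the block positions below are stand-ins for the figure, which is not reproduced here):

def mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

def median(values, weights):
    # Smallest value at which half the total weight has accumulated
    # (any point between the 2nd and 3rd blocks would serve as well here).
    total, running = sum(weights), 0.0
    for v, w in zip(values, weights):
        running += w
        if running >= total / 2:
            return v

weights = [1, 1, 1, 1]                    # four equal blocks of clay
for rightmost in [4, 40, 400]:            # push the fourth block further and further right
    values = [1, 2, 3, rightmost]
    print(mean(values, weights), median(values, weights))
# The mean chases the rightmost block (2.5, 11.5, 101.5); the median stays at 2.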

3.9 Desirabilities as Expectations

Desirability is a mixture of judgments of fact and value; your desirability for H
represents a survey of your desirabilities for the various ways you think H could
happen, combined into a single figure by multiplying each number by your
probability for its being the desirability of the way H actually happens. In effect,
your desirability for H is your conditional expectation of a magnitude ("U"),
commonly called "utility":

des(H) = E(U | H)

U(s) is your desirability for a complete scenario, s, that says exactly how
everything turns out. Then U(s) records a pure value judgment, untainted by
uncertainty about details, whereas des(H) mixes pure value judgments with pure
factual judgments.

Example 1. The Heavy Smoker, Again (cf. sec. 3.1, example 2). In Fig. 1(a), U's actual value is the depth of the unknown point representing the real situation, and desirabilities are average depths -- e.g., the desirability des(SL)=1 of switching and living to 65 or more is a probability-weighted average of all manner of ways for that to happen -- hideous, delightful, or middling. With P(L | C)=.60 (nearly; it's really .59) and P(L | S) = .75, switching raises the odds on L:D from 3:2 to 3:1. Then in (b), desirabilities des(C)=.69 and des(S)=.75 are mixtures -- of 1.1 with .1 in the ratio 3:2, and 1 with 0 in the ratio 3:1.

(a) des(act & outcome) (b) des(act)

Fig. 1. Hydraulic Analogy

(a) Initial P(act & outcome) (b) Final P(act & outcome)

Fig. 2. Unconditional probabilities. (Depths as in Fig. 1.)

Figures 1(a, b) and 2(a) are drawn with P(C) = P(S) = 1/2; the upper and lower
sections have equal areas. That is one way to represent Diamond Jim's initial state
of mind, unsure of his action. Another is to say he has no numbers at all in mind
for P(C) and P(S), even though he does for P(L | C) and P(L | S). Either way, he
has clearly not yet made up his mind, for when he has, P(C) and P(S) will be
numbers near the ends of the unit interval. In fact, deliberation ends with a
realization that switching has the higher desirability; he decides to make S true, or
try to; final P(S) will be closer to 1 than initial P(S), i.e., 2/3 instead of 1/2 in Fig.
2. (Diamond Jim is far from sure he will carry out the chosen line of action.)
Warning. des(act) measures choiceworthiness only if odds on outcomes
conditionally on acts remain constant as odds on acts vary -- e.g. as in Fig. 2,
where odds on L:D given C and given S remain constant at 3:2 and 3:1 as odds on
C:S vary from 1:1 to 1:2.

This warning is important in "Newcomb" problems, i.e., quasi-decision problems
in which acts are seen as mere symptoms of outcomes that agents would promote
or prevent if they could.

Example 2. Genetic Determinism. Suppose Diamond Jim attributes the observed
correlations between smoking habits and longevities to the existence in the human
population of two alleles (good, bad) of a certain gene, where the bad allele
promotes heavy cigarette smoking and early death, and works against switching.
Jim thinks it's the allele, not the habit, that's good or bad for you; he sees his act
and his life expectancy as conditionally independent given his allele -- whether
good or bad. And he sees the allele as hegemonic (sec. 11), determining the
chances

P(act | allele), P(outcome | allele)

of acts and outcomes. Then higher odds on switching are a sign that his allele is the
longevity-promoting one. He sees reason to hope to try to switch, but no reason to
try.

3.10 Notes

Sec. 3.1 For the statistics in example 2, see The Consumers Union Report on
Smoking and the Public Interest (Consumers Union, Mt. Vernon, N.Y., 1963, p.
69). This example is adapted from R. C. Jeffrey, The Logic of Decision (2nd ed.,
Chicago: U. of Chicago Press, 1983).

Sec. 3.2, Problems

1 and 2 come from The Logic of Decision.

3 is from D. V. Lindley, Making Decisions (New York: Wiley-Interscience, 1971),
p. 96.
4 is from Maurice Allais, "Le comportment de l'homme rationnel devant la risque,"
Econometrica 21(1953): 503-46. Translated in Maurice Allais and Ole Hagen
(eds.), Expected Utility and the Allais Paradox, Dordrecht: Reidel, 1979.

5 is from Daniel Ellsberg, "Risk, Ambiguity, and the Savage Axioms," Quarterly
Journal of Economics 75(1961) 643-69.

6 is Alan Gibbard's variation of problem 7 (i.e. Richard Zeckhauser's) that Daniel
Kahneman and Amos Tversky report in Econometrica 47 (1979) 283. See also
"Risk and human rationality" by R. C. Jeffrey, The Monist 70(1987)223-236.

9. In the diagram, "marginal desirability" (rate of increase of desirability) reaches a
maximum at des = 4, and then shrinks to a minimum at des = 6. The second half
dollar increases des twice as much as the first.

11, The St. Petersburg paradox. Daniel Bernoulli's "Exposition of a new theory of the measurement of risk" (in Latin) appeared in the Proceedings of the St. Petersburg Imperial Academy of Sciences 5(1738). Translation: Econometrica 22(1954) 123-36, reprinted in Utility Theory: A Book of Readings, ed. Alfred N. Page (New York: Wiley, 1968). The three quotations are from correspondence between Daniel's uncle Nicholas Bernoulli and (first, 1713) Pierre de Montmort, and (second and third, 1728) Gabriel Cramer.

It seems to have been Karl Menger (1934) who first noted that the paradox
reappears as long as U(r) is unbounded; see the translation of that paper in Essays
in Mathematical Economics, Martin Shubik (ed.), Princeton U. P., 1967, especially
the first footnote on p. 211.

Sec. 3.4. For more about calibration, etc., see Morris DeGroot and Stephen
Fienberg, "Assessing Probability Assessors: Calibration and Refinement," in
Shanti S. Gupta and James O. Berger (eds.), Statistical Decision Theory and
Related Topics III, Vol. 1, New York: Academic Press, 1982, pp. 291-314.

Sec. 3.5, 3.6. The Dutch book theorems for expectations and conditional
expectations are bits of Bruno de Finetti's treatment of the subject in vol. 1 of his
Theory of Probability, New York: Wiley, 1974.

Sec. 3.6. Bayes's definition of probability is from his "Essay toward solving a
problem in the doctrine of chances," Philosophical Transactions of the Royal
Society 50 (1763), p. 376, reprinted in Facsimiles of Two Papers by Bayes, New
York: Hafner, 1963.

Sec. 3.8. "The Median isn't the Message" by Stephen Jay Gould) appeared in
Discover 6(June 1985)40-42.

Sec. 3.9, example 2. This is a "Newcomb" problem; see Robert Nozick,
"Newcomb's Problem and Two Principles of Choice" in Essays in Honor of Carl
G. Hempel, N. Rescher, ed. (Dordrecht: Reidel Publishers, 1969). For recent
references and further discussion, see Richard Jeffrey, "Causality in the Logic of
Decision" in Philosophical Topics 21 (1993) 139-151.

SOLUTIONS

Sec. 3.2

1 Train.   2 1/2.   3(b) Yes.

5 If you prefer A to B, you should prefer D to C.

6 With 0 = des[die], 1 = des[rich and alive], and u = des[a million dollars poorer,
but alive], suppose that u = des[get rid of the two bullets]; you'd pay everything
you have. Then you are indifferent between A and B below.
To see that you must be indifferent between C and D as well, observe that if you
plug the A diagram in at the "u" position in the D diagram, you get the DA
diagram. But there, the probability of getting 1 is 1/3 (i.e., 1/2 times 2/3), so the
probability of getting 0 one way or the other must be 2/3, and the DA diagram is
equivalent to the C diagram. Thus you should be indifferent between C and D if
you are indifferent between A and B.

7 1/(e+1)   9 des(gamble) = 3, des(don't) = 2.

10 $10   11 rn = (2^n)^2

Sec. 3.7, The Two Envelopes. It's the discharge fallacy. To see why, apply the law
of total expectation and the assumption that P(Y=.5X) = P(Y=2X) = 1/2, to get
this:

E(Y) = .5E(Y | Y=.5X) + .5E(Y | Y=2X)

By the discharge fallacy, we would then have

E(Y) = .5E(.5X) + .5E(2X) = 1.25E(X) (NOT!)

But in fact, what we have is this:

E(Y) = .5E(.5X | Y=.5X) + .5E(2X | Y=2X)
= .25E(X | Y=.5X) + E(X | Y=2X)

In fact, E(X) and E(Y) are the same mixture of your larger and smaller
expectations of X when Y is the larger (2X) or smaller (X/2) amount: E(X) = E(Y)
= five parts of E(X | Y=.5X) with two parts of E(X | Y=2X).
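A simulation shows the main point numerically; in this sketch the distribution of the smaller check is an arbitrary assumption of mine, made only to have definite numbers:

import random

random.seed(0)
trials, sum_X, sum_Y = 100_000, 0, 0
for _ in range(trials):
    smaller = random.randint(1, 100)        # assumed distribution of the smaller check
    envelopes = [smaller, 2 * smaller]
    random.shuffle(envelopes)               # you pick one of the two at random
    X, Y = envelopes                        # X = your check, Y = the other
    sum_X += X
    sum_Y += Y
print(sum_X / trials, sum_Y / trials)       # nearly equal: E(X) = E(Y), not E(Y) = 1.25 E(X)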
CHAPTER 4: DECISION
Introduction

Suppose you regard two events as positively correlated, i.e., your personal
probability for both happening is greater than the product of their separate personal
probabilities. Is there something that shows you see one of these as promoting the
other, or see both as promoted by some third event? Here an answer is suggested in
terms of the dynamics of your probability judgments. This answer is then applied
to resolve a prima facie difficulty for the logic of decision posed by ("Newcomb")
problems in which you see acts as mere symptoms of conditions you would
promote or prevent if you could. We begin with an account of preference as
conditional expected utility, as in my 1965 book, The Logic of Decision.

4.1 Preference Logic

In Theory of Games and Economic Behavior, von Neumann and Morgenstern represented what we do in adopting options as a matter of adopting particular
probability distributions over the states of nature. The thought was that your
preference ranking of options will then agree with your numerical ranking of
expectations of utility as computed according to the adopted probabilities.

In fact, they took your utilities for states of nature to be the same, no matter which
option you choose, as when states of nature determine definite dollar gains or
losses, and these are all you care about. This condition may seem overly restrictive,
for the means by which gains are realized may affect your final utility -- as when
you prefer work to theft as a chancy way of gaining $1000 (success) or $0 (failure).
But the condition can always be met by making distinctions, e.g., by splitting each
state of nature in which you realize an outcome by work-or-theft into one in which
you realize it by work, and another in which you realize it by theft. (The difference
between the work and theft options will be encoded in their associated probability
distributions: each assigns probability 0 to the set of states in which you opt for the
other.) This means taking a naturalistic view of the decision-maker, whose choices are blended with states of nature in a single space, Ω; each point in that space represents a particular choice by the decision-maker as well as a particular state of (the rest of) nature.

In 1965 Bolker and Jeffrey offered a framework of that sort in which options are represented by propositions (i.e., subsets of Ω: in statistical jargon, "events"), and any choice is a decision to make some proposition true. Each such proposition corresponds to a definite probability distribution in the von Neumann-Morgenstern scheme. To the option of making the proposition A true corresponds the
conditional probability distribution P(--| A), where the unconditional distribution
P(--) represents your prior probability judgment -- i.e., prior to deciding which
option-proposition to make true. And your expectation of utility associated with
the A-option will be your conditional expectation of utility, E(u | A) -- also known
as your desirability for truth of A, and denoted "des A":

des A = E(u | A)

Now preference (>), indifference (=), and preference-or-indifference (>= ) go by
desirability, so that

A>B if des A > des B

A=B if des A = des B

A>= B if des A >= des B

Note that it is not only option-propositions that appear in preference rankings; you
can perfectly well prefer a sunny day tomorrow (= truth of "Tomorrow will be
sunny") to a rainy one even though you know you cannot affect the weather.

Various principles of preference logic can now be enunciated, and fallacies
identified, as in the following two examples. The first is a fallacious mode of
inference according to which you must prefer B's falsity to A's if you prefer A's
truth to B's:

A > B
Invalid: ---------
-B > -A
Counterexample: Death before dishonor. A = You are dead tomorrow; B = You
are dishonored today. (You mean to commit suicide if dishonored.) If your
probabilities and desirabilities for the four cases tt, tf, ft, ff concerning truth and
falsity of AB are as follows, then your desirabilities for A, B, -B, -A will be 0.5, -
2.9, 5.5, 6.8, so that the premise is true but the conclusion false.
case: tt tf ft ff
P(case): .33 .33 .01 .33
des(case): 0 1 -100 10
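Here is a quick Python check of those four desirabilities (cases named by the truth values of A and B):

P   = {'tt': .33, 'tf': .33, 'ft': .01, 'ff': .33}    # probabilities of the four cases
des = {'tt': 0, 'tf': 1, 'ft': -100, 'ff': 10}        # desirabilities of the four cases

def des_of(cases):
    # des of a proposition = conditional expectation of the case desirabilities.
    p = sum(P[c] for c in cases)
    return sum(P[c] * des[c] for c in cases) / p

print(des_of(['tt', 'tf']),    # des(A)  =  0.5
      des_of(['tt', 'ft']),    # des(B)  = -2.9 (about)
      des_of(['tf', 'ff']),    # des(-B) =  5.5
      des_of(['ft', 'ff']))    # des(-A) =  6.8 (about)
# So A > B (0.5 > -2.9), yet -B > -A fails (5.5 < 6.8): the inference is invalid.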
The second is a valid mode of inference:

A >= B
Valid: ---------------- (if A and B are incompatible)
A >= AvB >= B.
Proof. Given the proviso, and setting w = P(A | AvB), we find that des(AvB) =
w(des A) + (1-w)(des B). This is a convex combination of des A and des B, which
must therefore lie between them or at an endpoint.

Bayesian decision theory is said to represent a certain structural concept of
rationality. This is contrasted with substantive criteria of rationality having to do
with the aptness of particular probability and utility functions to particular
predicaments. With Davidson, I would interpret this talk of rationality as follows.
What remains when all substantive questions of rationality are set aside is bare
logic, a framework for tracking your changing judgments, in which questions of
validity and invalidity of argument-forms can be settled as illustrated above. A
complete set of substantive judgments would be represented by a Bayesian frame, consisting of (1) a probability distribution over a space Ω of "possible worlds," (2) a function u assigning "utilities" to the various worlds in Ω, and (3) an assignment of subsets of Ω as values to letters "A", "B", etc., where the letters represent sentences and the corresponding subsets represent "propositions." In any logic,
validity of an argument is truth of the conclusion in every frame in which all the
premises are true. In a Bayesian logic of decision, Bayesian frames represent
possible answers to substantive questions of rationality; we can understand that
without knowing how to determine whether particular frames would be
substantively rational for you on particular occasions. So in Bayesian decision
theory we can understand validity of an argument as truth of its conclusion in any
Bayesian frame in which all of its premises are true, and understand consistency of
a judgment (e.g., affirming A > B while denying -B > -A), as existence of a non-
empty set of Bayesian frames in which the judgment is true. On this view,
consistency -- bare structural rationality -- is simply representability in the
Bayesian framework.
4.2 Kinematics

In the design of mechanisms, kinematics is the discipline in which rigid rods and
distortionless wheels, gears, etc. are thought of as faithful, prompt communicators
of motion. The contrasting dynamic analysis takes forces into account, so that, e.g.,
elasticity may introduce distortion, delay, and vibration; but kinematical analyses
often suffice, or, anyway, suggest relevant dynamical questions. That is the
metaphor behind the title of this section and behind use of the term "rigidity"
below for constancy of conditional probabilities. (Here, as in the case of
mechanisms, rigidity assumptions are to be understood as holding only within
rough bounds defining normal conditions of use, the analogs of load limits for
bridges.)
In choosing a mixed option -- in which you choose one of two options, O1 or O2, depending on whether some proposition C is true or false -- you place the
following constraint on your probabilities.

Stable conditional probabilities

P(O1 | C) and P(O2 | -C) are set near 1

(Near: you may mistake C's truth value, or bungle an attempt to make Oi true, or
revise your decision.) As choosing the mixed option involves expecting to learn
whether C is true or false, choosing it involves expecting your probabilities for C
and -C to move toward the extremes:

Labile probabilities of conditions

P(C) and P(-C) change from middling

values to extreme values, near 0 or 1

This combination of stable conditional probabilities and labile probabilities of
conditions is analogous to the constraints under which modus ponens (below) is a
useful mode of inference; for if confidence in the second premise is to serve as a
channel transmitting confidence in the first premise to the conclusion as well, the
increase in P(first premise) had better not be accompanied by a decrease in
P(second premise).

C
D or not C
Modus Ponens ------------
D

When unconditional probabilities change, some conditional probabilities may
remain fixed; but others will change. Example. Set

a = P(A | C), a' = P(A | -C), o = P(C)/P(-C)

and suppose a and a' both remain fixed as the odds o on C change. Then P(C | A) and P(C | -A) must change, since

P(C | A) = a/(a + a'/o),     P(C | -A) = (1-a)/((1-a) + (1-a')/o)
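A numerical illustration (holding a and a' fixed while the odds o on C vary; the particular numbers are arbitrary choices of mine):

a, a_prime = 0.8, 0.3                   # P(A|C) and P(A|-C), held fixed (rigidity)

for o in [0.25, 1.0, 4.0]:              # odds on C, i.e. P(C)/P(-C)
    pC = o / (1 + o)
    pA = a * pC + a_prime * (1 - pC)
    pC_given_A = a * pC / pA
    pC_given_notA = (1 - a) * pC / (1 - pA)
    # These agree with the closed forms above, and plainly change with o.
    assert abs(pC_given_A - a / (a + a_prime / o)) < 1e-12
    assert abs(pC_given_notA - (1 - a) / ((1 - a) + (1 - a_prime) / o)) < 1e-12
    print(o, round(pC_given_A, 3), round(pC_given_notA, 3))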
What alternative is there to conditioning, as a way of updating probabilities?
Mustn't the rational effect of an observation always be certainty of the truth of
some data proposition? Surely not. Much of the perception on which we
reasonably base action eludes that sort of formulation. Our vocabulary for
describing what we see, hear, taste, smell, and touch is no match for our visual,
auditory, etc. sensitivities, and the propositional judgments we make with
confidence are not generally tied to confident judgments expressible in terms of
bare sensation.

Example. One day on Broadway, my wife and I saw what proved to be Mayor
Dinkins. There were various clues: police, cameramen, etc. We looked, and smiled
tentatively. He came and shook hands. Someone gave us Dinkins badges. We had
known an election was coming and the candidates campaigning. At the end we had
no doubt it was the Mayor. But there was no thought of describing the sensations
on which our progress toward conviction was founded, no hope of formulating
sensory data propositions that brought our probabilities up the unit interval toward
1:
Pold(It's Dinkins)

Pold(It's Dinkins | data1)

Pold(It's Dinkins | data1 & data2)

...

Of course the people in uniform and the slight, distinguished figure with the
moustache might all have been actors. Our visual, auditory and tactile experiences
did combine with our prior judgments to make us nearly certain it was the Mayor,
but there seems to be no way to represent that process by conditioning on data
propositions that are sensory certainties. The accessible data propositions were
chancy claims about people on Broadway, not authoritative reports of events on
our retinas, palms, and eardrums. We made reasonable moves (smiling, shaking the
hand) on the basis of relatively diffuse probability distributions over a partition of
such chancy propositions -- distributions not obtained by conditioning their
predecessors on fresh certainties. (Jeffrey 1992 1-13, 78-82, etc.)

Here are two generalizations of conditioning that you can prove are applicable in such cases, provided rigidity conditions hold for a partition { C1, C2, ... } of Ω.

If the conditions Q(A | Ci) = P(A | Ci) all hold,

then probabilities and factors can be updated so:

(Probabilities) Q(A) = Σi Q(Ci)P(A | Ci)

(Factors) f(A) = Σi f(Ci)P(Ci | A)

In the second condition, f(A) and f(Ci) are the factors Q(A)/P(A) and Q(Ci)/P(Ci)
by which probabilities P(A), P(Ci) are multiplied in the updating process.

Generalized conditioning allows probabilistic response to observations which
prompt no changes in your conditional probabilities given any of the Ci but do
prompt definite new probabilities or factors for the Ci. (If your new probability for
one of the Ci is 1, this reduces to ordinary conditioning.)
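A minimal sketch of the probability form, for a two-cell partition (the numbers are illustrative only):

P_A_given = {'C1': 0.9, 'C2': 0.2}   # P(A | Ci): rigid, i.e. unchanged by the observation
P_old     = {'C1': 0.5, 'C2': 0.5}   # old probabilities of the cells
Q_new     = {'C1': 0.8, 'C2': 0.2}   # new probabilities the observation prompts for the cells

P_A = sum(P_old[c] * P_A_given[c] for c in P_old)   # old P(A) = 0.55
Q_A = sum(Q_new[c] * P_A_given[c] for c in Q_new)   # new Q(A) = 0.76
print(P_A, Q_A)

# If the observation made one cell certain, the rule reduces to ordinary conditioning:
print(sum(q * P_A_given[c] for c, q in {'C1': 1.0, 'C2': 0.0}.items()))   # = P(A | C1) = 0.9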

Probabilistic judgment is not generally a matter of assigning definite probabilities
to all propositions of interest, any more than yes/no judgment is a matter of
assigning definite truth values to all of them. Typical yes/no judgment identifies
some propositions as true, leaving truth values of others undetermined. Similarly,
probabilistic judgment may assign values to some propositions, none to others.

The two sorts of generalized conditioning tolerate different sorts of indefiniteness.


The probability version determines a definite value for Q(A) even if you had no old probabilities in mind for the Ci, as long as you have definite new values for them and definite old values for A conditionally on them. The factor version, determining the probability ratio f(A), tolerates indefiniteness about your old and new probabilities of A and of the Ci as long as your old values P(Ci | A) are definite. Both versions illustrate the use of dynamic constraints to represent
definite. Both versions illustrate the use of dynamic constraints to represent
probabilistic states of mind. In the next section, judgments of causal influence are
analyzed in that light.

4.3 Causality

In decision-making it is deliberation, not observation, that changes your
probabilities. To think you face a decision problem rather than a question of fact
about the rest of nature is to expect whatever changes arise in your probabilities for
those states of nature during your deliberation to stem from changes in your
probabilities of choosing options. In terms of the analogy with mechanical
kinematics: as a decision-maker you regard probabilities of options as inputs,
driving the mechanism, not driven by it.

Is there something about your judgmental probabilities which shows that you are
treating truth of one proposition as promoting truth of another -- rather than as
promoted by it or by truth of some third proposition which also promotes truth of
the other? Here the promised positive answer to this question is used to analyze
puzzling problems in which we see acts as mere symptoms of conditions we would
promote or prevent if we could. Such "Newcomb problems" (Nozick 1963, 1969,
1990) pose a challenge to the decision theory floated in the first edition of The
Logic of Decision (Jeffrey 1965), where notions of causal influence play no role.
The present suggestion about causal judgments will be used to question the
credentials of Newcomb problems as decision problems.

The suggestion (cf. Arntzenius) is that imputations of causal influence are not
shown simply by momentary features of probabilistic states of mind, but by
intended or expected features of their evolution. The following is a widely
recognized probabilistic consequence of the judgment that truth of one proposition
("cause") promotes truth of another ("effect").
>0 P(effect | cause) - P(effect | -cause) > 0

But what distinguishes cause from effect in this relationship? -- i.e., a relationship
equivalent to

P(cause | effect) - P(cause | -effect) > 0

With Arntzenius, I suggest the following answer, i.e., rigidity relative to the
partition { cause, -cause} .

Rigidity Constancy of P(effect | cause) and

P(effect | -cause) as P(cause) varies

Both >0 and rigidity are conditions on a variable "pr" ranging over a set of
probability functions. The functions in the set represent ideally definite momentary
probabilistic states of mind for the deliberating agent. Clearly, pr can vary during
deliberation, for if deliberation converges toward choice of a particular act, the
probability of the corresponding proposition will rise toward 1. In general, an agent's intentions or assumptions about the kinematics of pr might be described by maps of
possible courses of evolution of probabilistic states of mind -- often, very simple
maps. These are like road maps in that paths from point to point indicate feasibility
of passage via the anticipated mode of transportation, e.g., ordinary automobiles,
not "all terrain" vehicles. Your kinematical map represents your understanding of
the dynamics of your current predicament, the possible courses of development of
your probability and desirability functions.

The Logic of Decision used conditional expectation of utility given an act as the
figure of merit for the act, sc., its desirability, des(act). Newcomb problems
(Nozick 1969) led many to see that figure as acceptable only on special causal
assumptions, and a number of versions of "causal decision theory" were proposed
as more generally acceptable. In the one I like best (Skyrms 1980), the figure of
merit for choice of an act is the agent's unconditional expectation of its desirability
on various incompatible, collectively exhaustive causal hypotheses. But if
Newcomb problems are excluded as bogus, then in genuine decision problems
des(act) will remain constant throughout deliberation, and will be an adequate
figure of merit.

In any decision problem whose outcome is not clear from the beginning,
probabilities of possible acts will vary during deliberation, for finally an act will be
chosen and so have probability near 1, a probability no act had initially. Newcomb
problems (Table 1) seem ill posed as decision problems because too much
information is given about conditional probabilities, i.e., enough to fix the
unconditional probabilities of the acts. We are told that there is an association
between acts (making A true or false) and states of nature (truth or falsity of B)
which makes acts strong predictors of states, and states of acts, in the sense that p
and q are large relative to p' and q' -- the four terms being the agent's conditional
probabilities:

p = P(B | A), p' = P(B | -A),

q = P(A | B), q' = P(A | -B)

But the values of these terms themselves fix the agent's probability for A, for they fix the odds on A as

P(A)/P(-A) = qp'/((1-q)p)
Of course this formula doesn't fix P(A) if the values on the right are not all fixed,
but as decision problems are normally understood, values are fixed, once given.
Normally, p and p' might be given, together with the desirabilities of the act-state
combinations, i.e., just enough information to determine the desirabilities of A's
truth and falsity, which determine the agent's choice. But normally, p and p' remain
fixed as P(A) varies, and q and q' , unmentioned because irrelevant to the problem,
vary with P(A).

4.4 Fisher

We now examine a Newcomb problem that would have made sense to R. A. Fisher
in the late 1950's.
For smokers who see quitting as prophylaxis against cancer, preferability goes by
initial des(act) as in Table 1b; but there are views about smoking and cancer on
which these preferences might be reversed. Thus, R. A. Fisher (1959) urged
serious consideration of the hypothesis of a common inclining cause of (A)
smoking and (B) bronchial cancer in (C) a bad allele of a certain gene, possessors of which have a higher chance of being smokers and developing cancer than do possessors of the good allele (independently, given their allele). On that hypothesis,
smoking is bad news for smokers but not bad for their health, being a mere sign of
the bad allele, and, so, of bad health. Nor would quitting conduce to health,
although it would testify to the agent's membership in the low-risk group.

On Fisher's hypothesis, where +/- A and +/- B are seen as independently promoted
by +/- C, i.e., by presence (C) or absence (-C) of the bad allele, the kinematical
constraints on pr are the following. (Thanks to Brian Skyrms for this.)

Rigidity The following are constant as c = P(C) varies.

a = P(A | C) a' = P(A | -C)

b = P(B | C) b' = P(B | -C)

>0 P(B | A) > P(B | -A), i.e., p > p'

Indeterminacy None of a, b, a' , b' are 0 or 1.

Independence P(AB | C) = ab, P(AB | -C) = a' b'

Since in general, P(F | GH) = P(FG | H)/P(G | H), the independence and rigidity
conditions imply that +/- C screens off A and B from each other, in the following
sense.

Screening-off P(A | BC) = a, P(A | B-C) = a'

P(B | AC) = b, P(B | A-C) = b'

Under these constraints, preference between A and -A can change as P(C) = c moves out to either end of the unit interval in thought-experiments addressing the question "What would des A - des -A be if I found I had the bad/good allele?" To carry out these experiments, note that we can write

p = P(B | A) = P(AB)/P(A) = [P(A | BC)P(B | C)P(C) + P(A | B-C)P(B | -C)P(-C)] / [P(A | C)P(C) + P(A | -C)P(-C)]

and similarly for p' = P(B | -A). Then we have

p = (abc + a'b'(1-c)) / (ac + a'(1-c))

p' = ((1-a)bc + (1-a')b'(1-c)) / ((1-a)c + (1-a')(1-c))
Now final p and p' are equal to each other, and to b or b' depending on whether
final c is 1 or 0. Since it is c's rise to 1 or fall to 0 that makes P(A) rise or fall as
much as it can without going off the kinematical map, the (quasi-decision) problem
has two ideal solutions, i.e., mixed acts in which the final unconditional probability
of A is the rigid conditional probability, a or a' , depending on whether c is 1 or 0.
But p = p' in either case, so each solution satisfies the conditions under which the
dominant pure outcome (A) of the mixed act maximizes des +/- A. (This is a quasi-
decision problem because what is imagined as moving c is not the decision but
factual information about C.)

The initial probabilities .093 and .025 in Table 1b were obtained by making the
following substitutions in the formulas for p and p' above.

a = .9, a' = .5, b = .2, b' = .01, c (initially) = .3
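(As a check on those figures, the following Python sketch -- mine, not the text's -- plugs the substitutions into the formulas for p and p' above, and also shows both collapsing to b or b' at the ends of the unit interval.)

# Illustrative sketch: p and p' as functions of c, with the values given in the text.
a, a_, b, b_ = 0.9, 0.5, 0.2, 0.01   # a = P(A|C), a' = P(A|-C), b = P(B|C), b' = P(B|-C)

def p(c):         # P(B | A) as a function of c = P(C)
    return (a * b * c + a_ * b_ * (1 - c)) / (a * c + a_ * (1 - c))

def p_prime(c):   # P(B | -A)
    return ((1 - a) * b * c + (1 - a_) * b_ * (1 - c)) / ((1 - a) * c + (1 - a_) * (1 - c))

print(round(p(0.3), 3), round(p_prime(0.3), 3))   # 0.093 0.025, the Table 1b values
print(round(p(1.0), 3), round(p_prime(1.0), 3))   # 0.2 0.2   (both collapse to b)
print(round(p(0.0), 3), round(p_prime(0.0), 3))   # 0.01 0.01 (both collapse to b')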

As p and p' rise toward b = .2 or fall toward b' = .01, tracking the rise or fall of c
toward 1 or 0, the negative difference des(continue) - des(quit) = -1.8 in Table 1b
rises toward the positive values 5-4b = 4.2 and 5-4b' = 4.96 in Table 2. Unless you,
the smoker, somehow become sure of your allele, neither of the two judgmental
positions shown in Table 2 will be yours. The table only shows that for you,
continuing is preferable to quitting in either state of certainty about the allele. The
kinematical map leads you to that conclusion on any assumption about initial c.
And initial uncertainty about the allele need not be modelled by a definite initial
value of c. Instead, an indefinite initial probabilistic state can be modelled by the
set of all pr assigning the values a, a' , b, b' as above, and with c = P(bad allele)
anywhere in the unit interval.

If you are a smoker convinced of Fisher's hypothesis, your unconditional probabilities of continuing and quitting lag behind P(good) or P(bad) as your probability for the allele rises toward 1. In particular, your probability ac + a'(1-c)
for continuing rises to a = .9 from its initial value of .62 or falls to a' = .5, as c rises
to 1 from its initial value of .3 or falls to 0. Here you see yourself as committed by
your genotype to one or the other of two mixed acts, analogs of gambles whose
possible outcomes are pure acts of continuing and quitting, at odds of 9:1 or 1:1.
You do not know which of these mixed acts you are committed to; your
judgmental odds between them, c:(1-c), are labile, or perhaps undefined. This
genetic commitment antedates your current deliberation. The mixed acts are not
options for you; still less are their pure outcomes. (Talk about pure acts as options
is shorthand for talk about mixed acts assigning those pure acts probabilities near
1.) Then there is much to be said for the judgment that quitting is preferable to
continuing (sc., as the more desirable "news item"), for quitting and continuing are
not options.
As a smoker who believes Fisher's hypothesis you are not so much trying to make
your mind up as trying to discover how it is already made up. But this may be
equally true in ordinary deliberation, where your question "What do I really want
to do?" is often understood as a question about the sort of person you are, a
question of which option you are already committed to, unknowingly. The
diagnostic mark of Newcomb problems is a strange linkage of this question with
the question of which state of nature is actual -- strange, because where in ordinary
deliberation any linkage is due to an influence of acts +/- A on states +/- B, in
Newcomb problems the linkage is due to an influence, from behind the scenes, of
deep states +/- C on acts +/- A and plain states +/- B. This difference explains why
deep states ("the sort of person I am") can be ignored in ordinary decision
problems, where the direct effect of such states is wholly on acts, which mediate
any further effect on plain states. But in Newcomb problems deep states must be
considered explicitly, for they directly affect plain states as well as acts (Fig. 1).

In the kinematics of decision the dynamical role of forces can be played by acts or
deep states, depending on which of these is thought to influence plain states
directly. Ordinary decision problems are modelled kinematically by applying the
rigidity condition to acts as causes. Ordinarily, acts screen off deep states from
plain ones in the sense that B is conditionally independent of +/- C given +/- A, so
that while it is variation in c that makes P(A) and P(B) vary, the whole of the latter
variation is accounted for by the former (Fig. 1a). But to model Newcomb
problems kinematically we apply the rigidity condition to the deep states, which
screen off acts from plain states (Fig. 1b). In Fig. 1a, the probabilities b and b' vary
with c in ways determined by the stable a's and p's, while in Fig. 1b the stable a's
and b's shape the labile p's as we have seen above:
p = [abc + a'b'(1-c)] / [ac + a'(1-c)]
p' = [(1-a)bc + (1-a')b'(1-c)] / [(1-a)c + (1-a')(1-c)]
Similarly, in Fig. 1(a) the labile probabilities are:

b = [apc + a'p'(1-c)] / [ac + a'(1-c)]
b' = [(1-a)pc + (1-a')p'(1-c)] / [(1-a)c + (1-a')(1-c)]
While C and -C function as causal hypotheses, they do not announce themselves as
such, even if we identify them by the causal rôles they are meant to play, e.g.,
when we identify the "bad" allele as the one that promotes cancer and inhibits
quitting. If there is such an allele, it is a still unidentified feature of human DNA.
Fisher was talking about hypotheses that further research might specify,
hypotheses he could only characterize in causal and probabilistic terms -- terms
like "malaria vector" as used before 1898, when the anopheles mosquito was
shown to be the organism playing that aetiological rôle. But if Fisher's science
fiction story had been verified, the status of certain biochemical hypotheses C and -
C as the agent's causal hypotheses would have been shown by satisfaction of the
rigidity conditions, i.e., constancy of P(--| C) and of P(--| -C), with C and -C
spelled out as technical specifications of alternative features of the agent's DNA.
Probabilistic features of those biochemical hypotheses, e.g., that they screen acts
off from states, would not be stated in those hypotheses, but would be shown by
interactions of those hypotheses with pr, B, and A, i.e., by truth of the following
consequences of the kinematical constraints.

P(B | act & C) = P(B | C), P(B | act & -C) = P(B | -C)

As Leeds (1984) points out in another connection, no purpose would be served by packing such announcements into the hypotheses themselves, for at best -- i.e., if
true -- such announcements would be redundant. The causal talk, however useful
as commentary, does no work in the matter commented upon.

4.5 Newcomb

The flagship Newcomb problem resolutely fends off naturalism about deep states,
making a mystery of the common inclining cause of acts and plain states while
suggesting that the mystery could be cleared up in various ways, pointless to
elaborate. Thus, Nozick (1969) begins:
Suppose a being in whose power to predict your choices you have enormous
confidence. (One might tell a science-fiction story about a being from another
planet, with an advanced technology and science, who you know to be friendly,
and so on.) You know that this being has often correctly predicted your choices in
the past (and has never, so far as you know, made an incorrect prediction about
your choices), and furthermore you know that this being has often correctly
predicted the choices of other people, many of whom are similar to you, in the
particular situation to be described below. One might tell a longer story, but all
this leads you to believe that almost certainly this being's prediction about your
choice in the situation to be discussed will be correct.
There are two boxes ...

... The being has surely put $1,000 in one box, and (B) left the second empty or (-
B) put $1,000,000 in it, depending on whether the being predicts that you will take
(A) both boxes, or (-A) only the second.

Here you are to imagine yourself in a probabilistic frame of mind where your desirability for -A is greater than that of A because although you think A's truth or falsity has no influence on B's, your p - p' is near 1 (sec. 4.3), i.e., p is near 1, p' near 0. Does that seem a tall order? Not to worry! A high p - p' is a red herring; a tiny bit will do, e.g., if desirabilities are proportional to dollar payoffs, then the 1-box option, -A, maximizes desirability as long as p - p' is greater than .001.

To see how that might go, think of the choice and the prediction as determined by
independent drawings by the agent and the predictor from the same urn, which
contains tickets marked "2" and "1" in an unknown proportion x : 1-x. Initially, the
agent's unit of probability density over the range [0,1] of possible values of x is flat
(Fig. 2a), but in time it can push toward one end of the unit interval or the other,
e.g., as in Fig. 2b, c. At t = 997 these densities determine the probabilities and
desirabilities in Table 3b and c, and higher values of t will make des A - des -A
positive. Then if t is calibrated in thousandths of a minute this map has the agent
preferring the 2-box option after a minute's deliberation. The urn model leaves the
deep state mysterious, but clearly specifies its mysterious impact on acts and plain
states.
The irrelevant detail of a high p - p' was a bogus shortcut to the 1-box conclusion, obtained if p - p' is not just high but maximum, which happens when p = 1 and p' = 0. This means that the "best" and "worst" cells in the payoff table have unconditional probability 0. Then taking both boxes means a thousand, taking just one means a million, and preference between acts is clear, as long as the probability r of A (take both boxes) is neither 0 nor 1, and p - p' remains maximum, 1. The density functions of Fig. 2 are replaced by probability assignments r and 1-r to the possibilities that the ratio of 2-box tickets to 1-box tickets in the urn is 1:0 and 0:1, i.e., to the two ways in which the urn can control the choice and the prediction deterministically and in the same way. In place of the smooth density spreads in Fig. 2 we now have point-masses r and 1-r at the two ends of the unit interval, with desirabilities of the two acts constant as long as r is neither 0 nor 1. Now the 1-box option is preferable throughout deliberation, up to the very moment of decision. But of course this reasoning uses the premise that p - p' = 1 throughout deliberation, a premise making abstract sense in terms of uniformly stocked urns, but very hard to swallow as a real possibility.
4.6 Hofstadter

Hofstadter (1983) saw prisoners' dilemmas as down-to-earth Newcomb problems.


Call the prisoners Alma and Boris. If one confesses and the other does not, the
confessor goes free and the other serves a long prison term. If neither confesses,
both serve short terms. If both confess, both serve intermediate terms. From Alma's
point of view, Boris's possible actions (B, confess, or -B, don't) are states of nature.
She thinks they think alike, so that her choices (A, confess, -A, don't) are pretty
good predictors of his, even though neither's choices influence the other's. If both
care only to minimize their own prison terms this problem fits the format of Table
1(a). The prisoners are thought to share a characteristic determining their separate
probabilities of confessing in the same way -- independently, on each hypothesis
about that characteristic. Hofstadter takes that characteristic to be rationality, and
compares the prisoners' dilemma to the problem Alma and Boris might have faced
as bright children, independently working the same arithmetic problem, whose
knowledge of each other's competence and ambition gives them good reason to
expect their answers to agree before either knows the answer: "If reasoning guides
me to [... ], then, since I am no different from anyone else as far as rational
thinking is concerned, it will guide everyone to [...]." The deep states seem less mysterious here than in the flagship Newcomb problem; here they have some such form as Cx = We are both likely to get the right answer, i.e., x. (And here ratios of utilities are generally taken to be on the order of 10:1 instead of the 1000:1 ratios that made the other endgame so demanding. With utilities 0, 1, 10, 11 instead of 0, 1, 1000, 1001, indifference between confessing and remaining silent now comes at p - p' = 10% instead of one tenth of 1%.) But to heighten similarity to the prisoners' dilemma let us suppose the required answer is the parity of x, so that the deep states are simply C = We are both likely to get the right answer, i.e., even, and -C = We are both likely to get the right answer, i.e., odd.

What's wrong with Hofstadter's view of this as justifying the coöperative solution? [And with von Neumann and Morgenstern's (p. 148) transcendental argument, remarked upon by Skyrms (1990, pp. 13-14), for expecting rational players to reach a Nash equilibrium?] The answer is failure of the rigidity conditions for acts, i.e., variability of P(He gets x | I get x) with P(I get x) in the decision maker's kinematical map. It is Alma's conditional probability functions P(-- | +/- C) rather than P(-- | +/- A) that remain constant as her probabilities for the conditions vary. The implausibility of initial des(act) as a figure of merit for her act is simply the implausibility of positing constancy of p and p' as her probability function pr evolves in response to changes in P(A). But the point is not that confessing is the preferable act, as causal decision theory would have it. It is rather that Alma's problem is not indecision about which act to choose, but ignorance of which allele is moving her.
4.7 Conclusion

Hofstadter's (1983) version of the prisoners' dilemma and the flagship Newcomb
problem have been analyzed here as cases where plausibility demands a continuum
[0,1] of possible deep states, with opinion evolving as smooth movements of
probability density toward one end or the other draw probabilities of possible acts
along toward 1 or 0. The problem of the smoker who believes Fisher's hypothesis
was simpler in that only two possibilities (C, -C) were allowed for the deep state,
neither of which determined the probability of either act as 0 or 1.

The story was meant to be a credible, down-to-earth Newcomb problem; after all,
Fisher (1959) honestly did give his hypothesis some credit. But if your genotype
commits you to one mixed act or the other, to objective odds of 9:1 or 1:1 on
continuing, there is no decision left for you to make. Yet, the story persuaded us
that, given your acceptance of the Fisher hypothesis, you would be foolish to quit,
or to try to quit: continuing would be the wiser move. This is not to say you will
surely continue to smoke, i.e., not to say you see a mixed act at odds of 1:0 on
continuing as an option, and, in fact, as the option you will choose. It only means
you prefer continuing as the pure outcome of whichever mixed act you are
unknowingly committed to. "Unknowingly" does not imply that you have no
probabilistic judgment about the matter -- although, indeed, you may have none,
i.e., c may be undefined. In fact, with c = .3, you think it unlikely that your
commitment makes odds of 9:1 on continuing; you think the odds most likely to be
1:1. But whatever the odds, you prefer the same pure outcome: continuing. You
don't know which "gamble" you face, but you know what constitutes winning:
continuing to smoke, i.e., the less likely outcome of the more desirable "gamble."
These scare-quotes emphasize that your mixed act is not a matter of spinning a
wheel of fortune and passively awaiting the outcome; you yourself are the chance
mechanism.

You think there is an objective, real probability of your continuing, i.e., .9 or .5,
depending on whether you have the bad genotype or the good one; there is a fact of
the matter, you think, even though you do not know the fact. If the real odds on
your continuing to smoke are even, that is because your tropism toward smoking is
of the softer kind, stemming from the good allele; you are lucky in your genotype.
But how does that work? How does the patch of DNA make you as likely to quit as
continue? How do we close the explanatory gap from biochemistry to preference
and behavior, i.e., to things like the relative importance you place on different
concomitants of smoking, on the positive side a certain stimulus and sensual
gratification, on the negative a certain inconvenience and social pressure? These
influences play themselves out in the micro-moves which add up to the actual
macro-outcome: continue, or quit. And if the odds are 9:1, that will stem from a
different pattern of interests and sensitivities, forming a strong tropism toward
continuing to smoke, somehow or other rooted in your DNA. What's weird about
Fisher's science fiction story is not its premise, that the mental and physical states
of reasoning animals are interconnected, but the thought that we might have the
sort of information about the connection that his story posits -- information
unneeded in ordinary deliberation, where acts screen it off.

The flagship Newcomb problem owes its bizarrerie to the straightforward character of the pure acts: surely you can reach out and take both boxes, or just the
opaque box, as you choose. Then as the pure acts are options, you cannot be
committed to either of the non-optional mixed acts. But in the Fisher problem,
those of us who have repeatedly "quit" easily appreciate the smoker's dilemma as
humdrum entrapment in some mixed act, willy nilly. That the details of the
entrapment are describable as cycles of temptation, resolution and betrayal makes
the history no less believable -- only more petty. Quitting and continuing are not
options, i.e., pr A ~ 0 and pr A ~ 1 are not destinations you think you can
choose, given your present position on your kinematical map, although you may
eventually find yourself at one of them. The reason is your conviction that if you
knew your genotype, your value of pr A would be either a or a' , neither of which is
~ 0 or ~ 1. (Translation: "At places on the map where pr C is at or near 0 or 1, pr A
is not.") The extreme version of the story, with a ~ 1 and a' ~ 0, is more like the
flagship Newcomb problem; here you do see yourself as already committed to one
of the pure acts, and when you learn which that is, you will know your genotype.

I have argued that Newcomb problems are like Escher's famous staircase on which
an unbroken ascent takes you back where you started. We know there can be no
such things, but see no local flaw; each step makes sense, but there is no way to
make sense of the whole picture; that's the art of it.

4.8 Notes

(End of sec. 4.1) See Jeffrey, "Risk and human rationality," and sec. 12.8 of The Logic of Decision. The point is Davidson's; e.g., see pp. 272-3.

(Sec. 4.2) Rigidity is also known as "sufficiency" (Diaconis and Zabell). A sufficient statistic is a random variable whose sets of constancy ("data") form a partition satisfying the "rigidity" condition.
2 (Sec. 4.3) The "regression coefficient" of a random variable Y on another, X, is cov(X,Y)/var(X), where cov(X,Y) = E[(X-EX)(Y-EY)] and var(X) = E(X-EX)^2. If X and Y are indicators of propositions (sc., "cause" and "effect"),

cov(X,Y) = P(cause & effect) - P(cause)P(effect),

var(X) = P(cause)P(-cause),

and the regression coefficient reduces to the left-hand side of the inequality.


3 (Sec. 4.3, rigidity) For random variables generally, rigidity is constancy of the conditional probability distribution of Y given X as the unconditional probability distribution of X varies.
4 (Sec. 4.4) In the example, I take it that the numerical values a = .9, a' = .5, b = .2, b' = .01 hold even when c is 0 or 1, e.g. b = P(ca | bad) = .2 even when P(bad) = 0; the equation b·P(bad) = P(ca & bad) isn't what defines b.
5 (Sec. 4.5, Fig. 2) In this kinematical map, P(A) = ∫ x^(t+1) f(x) dx and P(B | A) = ∫ x^(t+2) f(x) dx / P(A), where both integrals run from 0 to 1 and f(x) is as in Fig. 2(b) or (c). Thus, with f(x) as in (b), P(A) = (t+1)/(t+3) and P(B | A) = (t+2)/(t+3). See Jeffrey (1988).
6 (Sec. 4.5, end) At the moment of decision the desirabilities of shaded rows in (b) and (c) are not determined by ratios of unconditional probabilities, but continuity considerations suggest that they remain good and bad, respectively.
7 (Sec. 4.7, start of third paragraph) "You think there is an objective, real probability..." See the hard-core subjectivist's guide to objective chance in The Logic of Decision, sec. 12, and note that the "no one chooses to have sacked Troy" passage from the Nicomachean Ethics, used by Skyrms (1980, p. 128) to introduce causal decision theory, also fits the present skepticism about Newcomb problems.

(Sec. 4.7, end of third paragraph) Cf. Davidson's conclusion, that "nomological
slack between the mental and the physical is essential as long as we conceive of
man as a rational animal" ( p. 223).

(Sec. 4.7, Escher staircase) "Ascending and Descending" (lithograph, 1960), based
on Penrose (1958); see Escher (1989, p. 78). Elsewhere I have accepted Newcomb
problems as decision problems, and accepted "2-box" solutions as correct. Jeffrey
(1983, sec. 1.7 and 1.8) proposed a new criterion for acceptability of an act --
"ratifiability" -- which proved to break down in certain cases (see Jeffrey 1990, p.
20). In Jeffrey (1988, 1993), ratifiability was recast in terms more like the present
ones -- but still treating Newcomb problems as decision problems.
4.9 References

Arntzenius, F. (1990), 'Physics and common causes', Synthese, vol. 82, pp. 77-96.

Bolker, E. (1965), Functions Resembling Quotients of Measures, Ph.D. dissertation (Harvard University).

------ (1966), 'Functions resembling quotients of measures', Transactions of the American Mathematical Society, vol. 124, pp. 292-312.

------ (1967), 'A simultaneous axiomatization of utility and subjective probability', Philosophy of Science, vol. 34, pp. 333-340.

Davidson, D. (1980), Essays on Actions and Events (Oxford: Clarendon Press).

Diaconis, P. and Zabell, S. (1982), 'Updating subjective probability', Journal of the American Statistical Association, vol. 77, pp. 822-830.

Escher, M.C. (1989), Escher on Escher (New York: Abrams).

Fisher, R. (1959), Smoking, the Cancer Controversy (London: Oliver and Boyd).

Hofstadter, D.R. (1983), 'The calculus of coöperation is tested through a lottery', Scientific American, vol. 248, pp. 14-28.

Jeffrey, R.C. (1965; 1983, 1990), The Logic of Decision (New York: McGraw-Hill; Chicago: University of Chicago Press).

------ (1987), 'Risk and human rationality', The Monist, vol. 70, no. 2, pp. 223-236.

------ (1988), 'How to probabilize a Newcomb problem', in Fetzer, J.H. (ed.), Probability and Causality (Dordrecht: Reidel).

------ (1992), Probability and the Art of Judgment (Cambridge: Cambridge University Press).

------ (1993), 'Probability kinematics and causality', in Hill, D., Forbes, M. and Okruhlik, K. (eds.), PSA 92, vol. 2 (Philosophy of Science Assn.: Michigan State University, E. Lansing, MI).

Kolmogorov, A.N. (1933), 'Grundbegriffe der Wahrscheinlichkeitsrechnung', Ergebnisse der Mathematik, vol. 2, no. 3 (Berlin: Springer). Translation: Foundations of Probability (New York: Chelsea, 1950).

Leeds, S. (1984), 'Chance, realism, quantum mechanics', Journal of Philosophy, vol. 81, pp. 567-578.

Nozick, R. (1963), The Normative Theory of Individual Choice, Ph.D. dissertation (Princeton University).

------ (1969), 'Newcomb's problem and two principles of choice', in N. Rescher (ed.), Essays in Honor of Carl G. Hempel (Dordrecht: Reidel).

------ (1990), Photocopy of Nozick (1963), with new preface (New York: Garland).

Penrose, L.S. and Penrose, R. (1958), 'Impossible Objects: a Special Type of Visual Illusion', The British Journal of Psychology, vol. 49, pp. 31-33.

Skyrms, B. (1980), Causal Necessity (New Haven: Yale).

------ (1990), The Dynamics of Rational Deliberation (Cambridge, Mass.: Harvard).

von Neumann, J. and Morgenstern, O. (1943, 1947), Theory of Games and Economic Behavior (Princeton: Princeton University Press).
CHAPTER 5: PROBABILISM AND
INDUCTION
Introduction

What reason is there to suppose that the future will resemble the past, or that
unobserved particulars will resemble observed ones? None, of course, until
resemblances are further specified, e.g., because we do not and should not expect
the future to resemble the past in respect of being past, nor do or should we expect
the unobserved to resemble the observed in respect of being observed. Thus Nelson
Goodman replaces the old problem ('Hume's') of justifying induction by the new
problem of specifying the respects in which resemblances are expectable between
past and future, observed and unobserved.

The old problem is thereby postponed, not solved. As soon as the new problem is
solved, the old one returns, as a request for the credentials of the solution: "What
reason is there to expect the future/unobserved to resemble the past/observed with
respect to such- and- such dichotomies or classificatory schemes or magnitudes?"
The form of the question is further modified when we talk in terms of judgmental
probability instead of all-or- none expectation of resemblance, but the new problem
still waits, suitably modified.

It seems to me that Hume did not pose his problem before the means were at hand
to solve it, in the probabilism that emerged in the second half of the seventeenth
century, and that we know today primarily in the form that Bruno de Finetti gave it
in the decade from 1928 to 1938. The solution presented here (to the old and new
problems at once) is essentially present in Chapter 2 of de Finetti's 'La prevision'
(1937), but he stops short of the last step, shifting in Chapter 3 to a different sort of
solution, that uses the notion of exchangeability. At the end of this paper I shall
compare and contrast the two solutions, and say why I think it is that de Finetti
overlooked (or, anyway, silently balked at) the solution that lay ready to hand at
the end of his Chapter 2.

5.1 Probabilism, what

In a nutshell: probabilism sees opinions as more or less precise estimates of various magnitudes, i.e., probability-weighted averages of the form

(1) est X = x0p0 + x1p1 + . . .,
where the xi are the different values that the magnitude X can assume, and each pi is the probability that the value actually assumed is xi. (If X is a continuous magnitude, replace the sum by an integral.)

Estimation is not a matter of trying to guess the true value, e.g., 2.4 might be an
eminently reasonable estimate of the number of someone's children, but it would be
a ridiculous guess. (If that was my estimate, my guess might be 2.) Similarly,
taking the truth value of a proposition to be 1 or 0 depending on whether it is true
or false, my estimate of the truth value of the proposition that I shall outlive the
present century is about 1/2, which couldn't be the truth value of any proposition.

The probability you attribute to a proposition is your estimate of its truth value: if
X is a proposition, then

(2) prob X = est X.

Here I follow de Finetti in taking propositions to be magnitudes that assume the value 1 at worlds where they are true, and 0 where false. This comes to the same thing as the more familiar identification of propositions with the sets of worlds at which they are true, and makes for smoothness here. Observe that (2) follows from (1), for as X can take only the two values 0 and 1, we can set x0 = 0 and x1 = 1 in (1) to get

est X = 0p0 + 1p1 = p1,

where p1 is the probability that X assumes the value 1, i.e., the probability (prob X) that X is true.

Still following de Finetti, I take estimation to be the basic concept, and define
probability in terms of it. (The opposite tack, with estimation defined in terms of
probability as in (1), is more familiar.) Then (2) is given the status of a definition,
and the following axioms are adopted for the estimation operator.

(3) Additivity: est X + Y = est X + est Y

(4) Positivity: If X >= 0 then est X >= 0

(5) Normalization: est 1 = 1

(1 is the magnitude that assumes the value 1 everywhere, i.e., the necessary proposition. 'X >= 0' means that X assumes negative values nowhere.)

Once understood, these axioms are as obvious as the laws of logic -- in token of
which fact I shall call them and their consequences laws of 'probability logic' (de
Finetti's 'logic of the probable'). Here they are, in English:
(3) An estimate of the sum of two magnitudes must be the sum of the two separate
estimates.

(4) An estimate must not be negative if the magnitude estimated cannot be negative.

(5) If the magnitude is certainly 1, the estimate must be 1.

Additivity implies1 that for each real number k,

(6) est kX = k est X

The Kolmogorov axioms for probability are easy consequences of axioms (3)-(5) for estimation, together with (2) as a notational convention, i.e.,

(7) If X can take no values but 0 and 1, then est X = prob X.

Here are the Kolmogorov axioms.

(8) Additivity: If XY = 0 then prob X + Y = prob X + prob Y

(9) Positivity: prob X >= 0

(10) Normalization: prob 1 = 1

(10) is just copied from the normalization axiom (5) for est, with 'est' transcribed as 'prob'. Positivity is the same for 'prob' as for 'est', given that when we write 'prob X' it goes without saying that X >= 0, since a proposition X can assume no values but 0
and 1. And additivity of prob as above comes to the same thing as the more
familiar version, i.e.,

If X and Y are incompatible propositions, prob X v Y = prob X + prob Y.

(With 0 for falsehood and 1 for truth, the condition XY = 0 that the product of X
and Y be 0 everywhere comes to the same thing as logical incompatibility of X and
Y; and under the condition XY = 0 the disjunction Xv Y, i.e., X + Y - XY in the
present notation, comes to the same thing as the simple sum X + Y.)
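(How little machinery this involves can be seen in a small Python sketch of my own -- not de Finetti's formalism -- in which worlds are listed explicitly, magnitudes are functions from worlds to numbers, and est is the probability-weighted average of (1). Additivity and normalization then hold by construction, and for 0-1 valued magnitudes est behaves as prob. The worlds, probabilities and magnitudes below are invented for illustration.)

# Three worlds with (made-up) probabilities summing to 1.
worlds = ['w1', 'w2', 'w3']
pr     = {'w1': 0.5, 'w2': 0.25, 'w3': 0.25}

def est(X):
    """Estimate of a magnitude X (a dict from worlds to numbers), as in (1)."""
    return sum(pr[w] * X[w] for w in worlds)

X   = {'w1': 10, 'w2': 0, 'w3': 4}
Y   = {'w1': 1,  'w2': 2, 'w3': 3}
ONE = {w: 1 for w in worlds}               # the necessary proposition

X_plus_Y = {w: X[w] + Y[w] for w in worlds}
print(est(X_plus_Y), est(X) + est(Y))      # additivity (3): both are 7.75
print(est(ONE))                            # normalization (5): 1.0

P = {w: 1 if X[w] >= 4 else 0 for w in worlds}   # a proposition: "X is at least 4"
print(est(P))                              # prob P = est P = 0.75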

A precise, complete opinion concerning a collection of propositions would be represented by a probability function defined on the truth functional closure of that
collection. More generally, a precise, complete opinion concerning a collection of
magnitudes might be represented by an estimation operator on the closure of that
collection under the operations of addition and multiplication of magnitudes, and
of multiplication of magnitudes by constants. (One might also include other
options, e.g. closure under exponentiation, X^Y.)

Such are precise, complete opinions, according to probabilism. But for the most part our opinions run to imprecision and incompleteness. Such opinions can be represented by conditions on the variable 'est' or, equivalently, by the sets of particular estimation operators that satisfy those conditions. Such sets will usually be convex, i.e., if the operators est0 and est1 both belong to one, then so will the operator w0est0 + w1est1, if the w's are non-negative real numbers that sum to 1. An example is given by what de Finetti (1970, 1974, sec. 3.10) calls 'The Fundamental
theorem of probability':

Given a coherent assignment of probabilities to a finite number of propositions, the probability of any further proposition is either determined or can be coherently
assigned any value in a certain closed interval.

Thus, the set of probability measures that assign the given values to the finite set of
propositions must be convex. And incomplete, imprecise opinions can arise in
other ways, e.g. in my book, The Logic of Decision (1965, 1983), a complete
preference ranking will normally determine an infinite set of probability measures,
so that probabilities of propositions may be determined only within intervals:
see sec. 6.6, 'Probability quantization'.

Probabilism would have you tune up your opinions with the aid of the probability
calculus, or, more generally, the estimation calculus: probability logic, in fact. This
is a matter of tracing consequences of conditions on estimation operators that
correspond to your opinion. When you trace these consequences you may find that
you had misidentified your opinion, i.e., you may see that after all, the conditions
whose consequences you traced are not all such as you really accept. Note that
where your opinion is incomplete or imprecise, there is no estimation operator you
can call your own. Example: the condition est (X - est X)^2 <= 1 is not a condition on your unknown estimation operator, est. Rather: in that condition, 'est' is a variable, in terms of which your opinion can be identified with the set {est: est (X - est X)^2 <= 1} of estimation operators that satisfy it.

5.2 Induction, what

Here is the probabilistic solution that I promised, of the new problem of induction.
It turns on the linearity of the expectation operator, in view of which we have
est [(X1 + . . . + Xn)/n] = (est X1 + . . . + est Xn)/n
i.e., in words:
(11) An estimate of the average of any (finite) number of quantities must equal the
average of the estimates of the separate quantities.

(Proof: use (6) to get 1/n out front, and then apply (3) n - 1 times.) From (11) we
get what I shall call the

ESTIMATION THEOREM. If your opinion concerning the magnitudes X1, ..., Xn, ..., Xn+m is characterized by the constraints est Xi = est Xj for all i, j = 1, ..., n + m
(among other constraints, perhaps), then your estimate of the average of the last m
of them will equal the observed average of the first n - if you know that average, or
think you do.

Implicitly, this assumes that although you know that the average of the first n X's is
(say) x, you don't know the individual values assumed by X1, ..., Xn separately
unless it happens that they are all exactly x, for if you did, and they weren't, the
constraint est Xi = est Xj would not characterize your opinion where the known
value of Xi differs from the known value of Xj.

Proof of the estimation theorem. If you assign probability 1 to the hypothesis that the average of the first n X's is x, then by (1) (see note 2) you must estimate that average as x. The constraints then give est Xi = x for the last m X's, and the conclusion of the estimation theorem follows by (11).

Example 1: Guessing weight

For continuous magnitudes, estimates serve as guesses. Suppose that you will be
rewarded if you guess someone's weight to within an accuracy of one pound. One
way to proceed is to find someone who seems to you to be of the same build, to be
dressed similarly, etc., so that where X2 is the weight you wish to guess correctly,
and X1 is the other person's weight, your opinion satisfies the constraint est X1 = est X2. Now have the other person step on an accurate scale, and use that value of X1 as your estimate of X2. This is an application of the estimation theorem with n = m
= 1.

Mind you: it is satisfaction of the constraint after the weighing that justifies (or
amounts to) taking the other person's actual weight as your estimate, and under
some circumstances, your opinion might change as a result of the weighing so as to
cease satisfying the constraint. Example: the other person's weight might prove to
be so far from your expectation as to undermine your prior judgement that the two
were relevantly similar, i.e., the basis for your prior opinion's satisfaction of the
constraint. Here is a case where you had antecedently judged the two people both
to have weights in a certain interval (say, from 155 to 175 pounds), so that when
the other person's weight proved to be far outside this interval (perhaps, 120
pounds) your opinion changed from {est: 155 <= est X1 = est X2 < 175} to something else, because the weighing imposed the further condition est X1 = 120
on your opinion, i.e., a condition incompatible with the previously given ones.
Probability logic need not tell you how to revise your opinion in such cases, any
more than deductive logic need tell you which of an inconsistent set of premises to
reject.

Example 2: Using averages as guesses

If you can find (say) ten people, each of whom strikes you as similar in the relevant
respects to an eleventh person, whose weight you wish to guess, then have the ten
assemble on a large platform scale, read their total weight, and use a tenth of that
as your estimate of X11. This is an application of the estimation theorem with n =
10, m = 1. The estimation theorem does not endorse this estimate as more accurate
than one based on a single person's weight, but under favorable conditions it may
help you form an opinion of your estimate's accuracy, as follows.

Example 3: Variance

The variance of a magnitude X is defined relative to an estimation function:

var X = est (X - est X)^2 = est X^2 - est^2 X.

Thus, relative to an estimation function that characterizes your precise opinion, the
variance of the eleventh person's weight is your estimate of the square of your error
in estimating that weight, and this turns out to be equal to the amount by which the
square of your estimate of the magnitude falls short of your estimate of the square
of the magnitude. Now the estimation theorem can be applied to the magnitudes
X1^2, ..., Xn+1^2 to establish that under the constraints est Xi^2 = est Xj^2 for i, j = 1, ..., n + 1, your estimate of the square of the eleventh person's weight must equal the observed average of the squares of the first ten people's weights.3 To get the variance of X11, simply subtract from that figure the square of the estimate of X11
formed in Example 2.
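(Here is how Examples 2 and 3 might look in Python -- a sketch of mine, with ten invented weights: the estimate of X11 is the observed average, and the variance comes out as the average of the squares minus the square of the average.)

# Ten observed weights (invented figures), each person judged relevantly similar to person 11.
weights = [162, 158, 171, 149, 180, 166, 155, 173, 160, 168]
n = len(weights)

est_X11    = sum(weights) / n                  # Example 2: observed average as the estimate
est_X11_sq = sum(w * w for w in weights) / n   # estimate of the square of X11 (Example 3)
var_X11    = est_X11_sq - est_X11 ** 2         # var X = est X^2 - est^2 X

print(est_X11)             # 164.2
print(round(var_X11, 1))   # 76.8, i.e., a root-mean-square error of roughly 8.8 pounds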

It is worthwhile to ring the changes on these examples, e.g. imagining that you are
estimating weights not by eye, but on the basis of significant but limited statistical
data, say, age and sex of the members of the sample, and of the person whose
weight is to be estimated; and imagining that it is not weight that is to be estimated,
but length of life - the estimate of which has the familiar name, 'life expectancy'.
(In this case the members of the sample are presumably dead already: people no
younger than the one whose life expectancy is sought, who were relevantly similar
to that one, at that age).

The estimation theorem was inspired by a class of applications (in 'La prevision',
Chapter 2) of what I shall call
DE FINETTI'S LAW OF SMALL NUMBERS: your estimate of the number of
truths among the propositions A1, ..., An must equal the sum of the probabilities
you attribute to them.

Here is another formulation, obtained by dividing both sides of the equation by n and applying the linearity of est:

(12) Your estimate of the relative frequency of truths among the propositions
A1,..., An must equal the average of the probabilities you attribute to them.

Proof of de Finetti's law. The claim is that est (A1 + . . . + An) = prob A1 + . . . + prob An, which is true by (3) and (7).

Example 4: Applying de Finetti's law

To form my probabilistic opinion concerning the proposition A101 that the 101st toss of a certain (possibly loaded) die will yield an ace, I count the number of times the ace turns up on the first hundred tosses. Say the number is 21, and suppose I don't keep track of the particular tosses that yielded the aces. It is to be expected that I attribute the same probability to all 101 propositions of form Ai, i.e., it is to be expected that my opinion satisfies the constraints est Ai = est Aj (i, j = 1, ..., 101). Then by de Finetti's law with n = 100, the common value of those probabilities will be 21%, and that will be the probability of A101 as well.
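(The bookkeeping of Example 4, in a short Python sketch of mine -- the count of 21 aces is the one given above; the rest is arithmetic.)

# De Finetti's law: est(A1 + ... + A100) = prob A1 + ... + prob A100.
# You counted 21 aces, so your estimate of the number of truths is 21; with all 101
# propositions constrained to have the same probability, that common value is forced.
aces_observed, n = 21, 100

common_prob = aces_observed / n                     # 100 * common_prob must equal 21
print(common_prob)                                  # 0.21 -- the probability of A101 as well
print(round(common_prob * (1 - common_prob), 4))    # its variance p(1 - p) = 0.1659, cf. (13) below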

Observe that in the case of propositions, i.e., magnitudes whose only possible
values are 0 and 1, the variance is determined by the estimate, i.e., by the
probability attributed to the proposition:

(13) If X is a proposition of probability p, then var X = p(1 - p).

Proof. As X is a proposition, X^2 = X, and therefore var X, in the form est X^2 - est^2 X, can be written as est X - est^2 X, i.e., p - p^2, i.e., p(1 - p). Thus, the variance of a proposition is null when its probability is extreme: 0, or 1. And variance is maximum (i.e., 1/4) when p is 1/2. You can see that intuitively by considering that the possible value of 'p' that is furthest from both of the possible values of X is the one squarely in the middle.

5.3 Justifying induction

These examples show how probabilism would have us form our opinion about the
future on the basis of past experience, in simple cases of the very sorts concerning
which the problem of induction is commonly posed. The estimation theorem, and
de Finetti's laws of large and small numbers, are especially accessible parts of
probabilism's solution to the new problem of induction. As the theorem and the law are consequences of the axioms (3)-(5) of probability logic, this solution can be
seen as borrowing its credentials from those axioms. Thus the old problem of
induction, in the form in which it bears on probabilism's solution to the new
problem, is the question of the credentials of the axioms of probability logic.

Note that I say 'probability logic', not 'logical probability'. De Finetti's subjectivism
implies that the basic axioms governing est (or, if you prefer, the corresponding
axioms for prob) are all the universally valid principles there are, for this logic. In
contrast, Carnap (1950, 1952, 1971, 1980) tentatively proposed further principles as universal validities for what he called 'logical probability', i.e., either a particular probability function, e.g. c* (1945, 1950), or a class of them, e.g. {cλ : 0 < λ < ∞} (1952). But such attempts to identify a special class of one or more probability functions as the 'logical' ones strike me as hopeless. Example: the functions of form cλ do have interesting properties that recommend them for use as subjective probability functions in certain sorts of cases, but there are plenty of other sorts of cases where none of those functions are suitable.4 Carnap (1980, sec. 17) did see that, and finally added further adjustable parameters in an effort to achieve full generality. But I see no reason to think that would have
been the end of the broadening process, had Carnap lived to continue it, or had
others taken sufficient interest in the project to pursue it after Carnap's death5. With
de Finetti, I take the laws of probability logic to be the axioms (3)-(5) and their
consequences6.

It seems appropriate to call these axioms 'logical' because of the strength and
quality of their grip, as constraints that strike us as appropriate for estimation
functions. The feel of that grip is like that of the grip of such logical laws (i.e.,
constraints on truth- value assignments) as that any proposition, X, implies its
disjunction with any proposition, Y. In our notation, this comes out as X <= X + Y
- XY, given that X and Y take no values other than 0 and 1. I am at a loss to think
of more fundamental principles from which to deduce these axioms: to understand
is to acknowledge, given what we understand by 'estimate.'

But this is not to deny that illustrations can serve to highlight this logical character
of the axioms: a notable class of such illustrations are the 'Dutch book' arguments,
which proceed by considering situations in which your estimates of magnitudes
will be the prices -- in dollars -- at which you are constrained to buy or sell (on demand) tickets that can be exchanged for numbers of dollars equal to the true values of those magnitudes.

Example. The Dutch book argument for axiom (3) goes like this, where x and y are
the unknown true values of the magnitudes X and Y. For definiteness, we consider
the case where the axiom fails because the left- hand side is the greater. (If it fails
because the left- hand side is the smaller, simply interchange 'buy' and 'sell' in the
following argument, to show that you are willing to suffer a sure loss of est X + est
Y est (X + Y).)

If est (X + Y) exceeds est X + est Y, you are willing to buy for est (X + Y) dollars
a ticket worth x + y dollars; and for a lower price, i.e., est X + est Y dollars, you are willing to sell a pair of tickets of that same combined worth: x + y dollars. Thus,
you are willing to suffer a sure loss, viz., est (X + Y) - est X - est Y dollars.
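(The arithmetic of the sure loss can be checked mechanically. Here is a minimal Python sketch -- mine, with illustrative figures -- for the case where est (X + Y) exceeds est X + est Y; the net result is the same whatever the true values x and y turn out to be.)

# Illustrative estimates that violate additivity: est(X + Y) > est X + est Y.
est_X, est_Y, est_X_plus_Y = 3.0, 4.0, 8.0

def net_gain(x, y):
    # Buy a ticket worth x + y for est(X + Y) dollars; sell tickets worth x and y
    # for est X and est Y dollars.  x and y are the true (unknown) values.
    return (x + y - est_X_plus_Y) + (est_X - x) + (est_Y - y)

for x, y in [(0, 0), (10, -5), (2.5, 97.25)]:
    print(net_gain(x, y))   # -1.0 every time: a sure loss of est(X + Y) - est X - est Y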

To paraphrase Brian Skyrms (1980, p. 119): if your estimates violate axiom (3) in
such cases, you are prepared to pay different amounts for the same good,
depending on how it is described. For a single ticket worth x + y dollars you will
pay est (X + Y) dollars, but for two tickets jointly worth x + y dollars you will pay
a different amount, i.e., est X + est Y dollars. In a certain clear sense, this is an
inconsistency.

Such Dutch book arguments serve to highlight the credentials of (3)-(5) as axioms
of the logic of estimation, i.e., of probability logic. One might even think of them
as demonstrating the a priori credentials of the axioms, sc., as 'logical' truths in a
certain sense under the hypothesis that you are prepared to make definite estimates
of all the magnitudes that appear in them, in circumstances where those estimates
will be used as your buying-or-selling prices for tickets whose dollar values equal
the true values of those magnitudes. But the point of such demonstrations is
blunted if one's opinions are thought to be sets of estimation functions, or
conditions on estimation functions, for then the hypothesis that you are prepared to
make the definite estimates that the Dutch book arguments relate to will be
satisfied only in the extreme cases where opinion is precise and complete, relative
to the magnitudes in question.

5.4 Solution or evasion?

Even if you see the Dutch book arguments as only suggestive, not demonstrative,
you are unlikely to balk at the logicist solution to the old problem of induction
(sec. 5.3) if you accept the probabilistic solution floated in sec. 5.2 for the new problem. But
many will see probabilism as an evasion, not a solution; and while there can be no
perfectly decisive answer to such doubts, it would be evasive to end this paper
without some effort to meet them.

The doubt can be illustrated well enough in connection with Example 1: "Indeed, if
you maintain your initial opinion, according to which your estimates of the two
people's weights will be the same, the information that one of them weighs (say)
132 pounds will produce a new opinion in which both weights are estimated as 132
pounds. But to use your initial opinion as a datum in this way is to beg the question
("What shall your opinion be?") that the new problem poses. It is only by begging
that question that the old problem can be misconceived as a request for the
credentials of the general constraints (3)-(5) on all estimation functions, rather than
for the special constraints that characterize your own opinion."

That's the question which a probabilist must fault as question-begging in a characteristically objectivistic way. For that question contrasts your initial opinion,
according to which est X1 = est X2, with the information that the true value of X1
is 132. But what is thus dignified as 'information' is more cautiously described as a
new feature of your opinion, specified by two new constraints:

est X1 = 132, var X1 = 0

It is the second of these that corresponds to the objectivistic characterization of the new estimate, 132, as information, not mere opinion. To interpret 'information'
more strongly than this is to beg the question that the agent answers (in effect) by
forming his new opinion in accordance with the new constraints: it is to say that his
opinion about X1 is not only definite (est X1 = 132) and confident (var X1 = 0) but
correct.

In turn, objectivists will identify this move as a typical subjectivistic refusal to see
the difference between subjective confidence and objective warrant. "If the only
constraints that my opinion must meet are the basic axioms, I am free to adopt any
further constraints I please, as long as they are consistent. But then I don't need to
go through the business of finding somebody who strikes me as similar in build,
weight of clothing, etc., to the person whose weight I wish to guess, and weighing
that similar person. I can simply decide on any number at all (it might even be 132,
as luck would have it), and use that as my guess: est X2 = 132. Nor does anything
prevent me from adopting great confidence regarding that guess: var X2 = 0". (Note
that in Example 1, where var X1 was 0, there was no reason to set var X2 = 0 as
well.)

But this is nonsense, on a par with "If God is dead then everything is permitted".
Look at it for a minute: God didn't die, either suddenly or after a long illness. The
hypothesis is rather that there is no such entity, and never was. Our morality has no
divine basis. Instead, its basis is in us: it is as it is because and insofar as we are as
we are. ('Insofar as': humans are not as uniform as the divine template story
suggests.) Moses smuggled the ten commandments up Mt. Sinai, in his heart. You
know the line.

I take the same line about Carnap's efforts, and Henry Kyburg's, to tell us just what
constraints on our probabilistic opinions would be rationally justified by this or
that corpus of fully-held beliefs. If you think that some such set of epistemological
commandments must be produced and justified if we are to form warranted
probabilistic opinions, then you will find subjectivistic probabilism offensively
nihilistic: a license for wishful thinking and all other sorts of epistemological sin.
But I think such fears unwarranted. Wishful thinking is more commonly a feature
of (shared or private) fantasy than of judgement. In forming opinion we aim at the
truth, for the most part. The fact that I would be violating no laws of logic if I were
simply to decide on a number out of the blue, as my estimate of someone's weight,
does not mean that I would or could do that. (I could say '132' or '212' easily
enough, but could I believe it?)

In a way, the contrast with von Mises' frequentism is more illuminating than that
with Carnap's logicism. Mises sought to establish probability theory as an
independent science, with its own subject-matter: mass phenomena. If that were right, there would be a general expertise for determining probabilities of all sorts: concerning horse races, the weather, U235 -- whatever. But according to the sort of
probabilism I am putting forward, such expertise is topical: you go to different
people for your opinions about horse races, weather, U235 etc. Probability logic
provides a common framework within which all manner of opinion can be
formulated, and tuned.

In the weight-guessing example you are supposed to have sought, and found, a
person you saw as sufficiently similar in relevant respects to the one whose weight
you wished to estimate, to persuade you that within broad limits, any estimate you
form for the one (e.g. by weighing him on a scale you trust) will be your estimate
for the other as well. Objectivism is willing to accept your judgement that the true
value of X1, is 132, based on the scale reading, but rejects the part of your
judgement that identifies the two estimates. But probabilism views both scale-reading and equality-judging (for people's weights) as acquired skills. (The same goes for judging the accuracy of scales.) The point about scale-reading is that it is a more widely and uniformly acquired skill than is equality-judgement for people's
weights. But when it comes down to it, your opinion reflects your own assessment
of your own skills of those sorts.

Here is how de Finetti (1938) expressed the basic attitude:

...one must invert the roles of inductive reasoning and probability theory: it is the
latter that has autonomous validity, whereas induction is the derived notion. One is
thus led to conclude with Poincaré that "whenever we reason by induction we
make more or less conscious use of the calculus of probabilities".

The difference between the approach to the problem of induction that I suggest
here and the one de Finetti espoused in 'La prévision...' is a consequence of the
difference between de Finetti's drive to express uncertainty by means of definite
probability or estimation functions (e.g. 'exchangeable' ones (1937) and 'partially
exchangeable' ones (1938)), and the looser course taken here, where opinions can
be represented by constraints on estimation functions in cases where no one
function adequately represents the opinion. By taking this looser point of view,
one can use the mathematically trivial estimation theorem to find that under the
constraints est X1 = . . . = est Xn+m, observed averages must be used as estimates of
future averages, on pain of incoherence, i.e., inconsistency with the canons of
probability logic. In contrast, de Finetti (1937) uses the mathematically nontrivial
law of large numbers, according to which one's degree of belief in the difference
between any two averages' exceeding (say) 10^-10 can be made as small as you like
by making the numbers of magnitudes Xi that appear in the averages both get large
enough. The constraints in de Finetti's version of the law of large numbers are
stronger than those in the estimation theorem: they require existence of real
numbers a, b, c for which we have

est Xi = a, est Xi^2 = b, est (XiXj) = c

for all i, j = 1, 2, ... with i ≠ j. (Only the first of these constraints applies to the
estimation theorem, and then only for i = 1, ...,n + m.)

I think that de Finetti spurns or overlooks the estimation theorem because he insists
on representing opinions by definite probability and estimation functions. He then
uses conditionalization to take experience into account. As the presumed initial
opinion is precise and complete, he gets not only precise estimates of averages in
this way, but precise variances, too . It is a merit of the estimation theorem that it
uses a very diffuse initial opinion, i.e., one that need satisfy no constraints but est
X1 = . . . = est Xn+m . There is no use of conditionalization in proving or applying the
estimation theorem. If variances are forthcoming, it is by a further application of
the estimation theorem, as in Example 3: empirically, in large part8.

The claim is that through the estimation theorem, probabilism makes what
considerable sense there is to be made of naive frequentism9, i.e., of Hume's
inductivism in its statistical avatar.

Notes

1. For positive integers k, additivity clearly yields (6) by induction, since est (n+1)X = est X + est nX. Then for a positive integer k, est (1/k)X = (1/k) est X, since k est (1/k)X = est X. This yields (6) for positive rational k, whence (6) follows for positive real k by the density of the rationals in the reals. By (3) and (5), est (1 + 0) = 1 + est 0, so that since 1 + 0 = 1, (5) yields est 0 = 0 and, thus, (6) for k = 0. Finally, to get (6) for negative real k it suffices to note that est (-1) = -1, since 0 = est (1 - 1) = 1 + est (-1) by (3) and (5). Here we have supposed that for real a and b, est (aX + bY) is defined whenever est X and est Y are, i.e., we have assumed that the domain on which est is defined is closed under addition and under multiplication by reals.
2 (1) is deducible from (3), (6), and (7) in case X assumes only finitely many different values xi, for then we have X = x1X1 + x2X2 + . . ., where Xi is the proposition that X = xi, i.e., Xi assumes the value 1 (0) at worlds where X = xi is true (false).

3 The constraints est Xi^2 = est Xj^2 represent a judgement quite different from that represented by the constraints est Xi = est Xj -- a judgement we are less apt to make, or to feel confident of having made. (Note that estimating X^2 is not generally just a matter of squaring your estimate of X!)

4 As Johnson (1932) showed, and Kemeny (1963) rediscovered, the cases where one of the functions cλ is suitable are precisely those in which the user takes the appropriate degree of belief in the next item's belonging to cell Pi of the partitioning {P1, ..., Pk} to depend only on (i) the number of items already sorted into cells of that partitioning, and (ii) the number among them that have been assigned to cell Pi.

5 Not everyone would agree that nobody is continuing work on Carnap's project,
e.g. Costantini (1982) sees himself as doing that. But as I see it, his program -- a very interesting one -- is very different from Carnap's.

6 There are two more axioms, which de Finetti does not acknowledge: an axiom of
continuity, and an axiom that Lewis (1980) calls 'The Principal Principle' and
others call by other names: prob (H | chance H = x) = x, where chance H is the
objective chance of H's truth. De Finetti also omits axiom (5), presumably on the
ground that the estimates of magnitudes are to represent estimates of the utilities
you expect from them, where the estimated utility est X need not be measured in
the same units as X itself, e.g. where X is income in florins, est X might be
measured in dollars.

7 I gather that it originates with Keynes (1921), reappears with Koopman (1940),
and is given essentially the form used here by Good (1950, e.g. p. 3). It is espoused
by Levi (1974, 1980) as part of a rationalistic program. I first encountered it, or
something like it, in Kyburg (1961), but it took me 20 years to see its merits.
Among statisticians, the main support for this way of representing imprecise or
incomplete opinion comes from Good (1950, 1962), Smith (1961), Dempster
(1967, 1968), and Shafer (1976). In practice, the business of reasoning in terms of
a variable, 'prob' or 'est', that satisfies certain constraints is widespread - but with
an unsatisfactory rationale, according to which one is reasoning about an unknown,
definite function, which the variable denotes.

8 Anyway, in larger part than in de Finetti's approach. Use of the observed average
is common to both, but the further constraints on est are weaker in this approach
than in de Finetti's. Observe that with the symmetric flat prior probability function
(Carnap's c*), conditioning on the proposition that m of the first n trials have been
successes yields a posterior probability function prob relative to which we always
have prob X1 = ... = prob Xn = m/n, but have prob Xn+1 = (m + 1)/(n + 2) ≠ m/n
unless n = 2m. The case is similar for other nonextreme symmetric priors, e.g. for
all of the form
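
The arithmetic of the contrast can be checked directly (a sketch; the conditional probabilities below are the Laplace values (m + 1)/(n + 2) that the note says result from conditioning the flat symmetric prior):

from fractions import Fraction

def next_success(m, n):
    """prob(success on trial n+1 | m successes in the first n trials),
    after conditioning the flat symmetric prior: (m + 1) / (n + 2)."""
    return Fraction(m + 1, n + 2)

for m, n in [(1, 2), (3, 6), (2, 5)]:
    print(m, n, next_success(m, n), Fraction(m, n))
# Agreement with the observed frequency m/n occurs only when n = 2m
# (the first two cases); in the third, 3/7 differs from 2/5.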

9 Not von Mises' science of limiting relative frequencies in irregular collectives,
but the prior, plausible intuition.

References

Carnap, R.: 1945, 'On inductive logic', Philosophy of Science 12, 72-97.

Carnap, R.: 1950, Logical Foundations of Probability, Univ. of Chicago Press.

Carnap, R.: 1952, The Continuum of Inductive Methods, Univ. of Chicago Press.

Carnap, R.: 1971, 'A basic system of inductive logic', in Carnap and Jeffrey (eds.)
(1971) and Jeffrey (ed.) (1980).

Carnap, R. and R. Jeffrey: 1971, Studies in Inductive Logic and Probability, Vol.
1, Univ. of California Press.

Costantini, D.: 1982, 'The role of inductive logic in statistical inference', to appear
in Proceedings of a Conference on the Foundations of Statistics and Probability,
Luino, September 1981.

Dempster, Arthur P.: 1967, 'Upper and lower probabilities induced by a
multivalued mapping', Annals of Mathematical Statistics 38, 325-339.

Dempster, Arthur P.: 1968, 'A generalization of Bayesian inference', J. Royal Stat.
Soc., Series B 30, 205-247.

Finetti, Bruno de: 1937, 'La prévision: ses lois logiques, ses sources subjectives',
Annales de l'Institut Henri Poincaré 7, 1-68. (English translation in Kyburg and
Smokler.)

Finetti, Bruno de: 1938, 'Sur la condition d'équivalence partielle', Actualités
Scientifiques et Industrielles, No. 739, Hermann & Cie., Paris. (English translation
in Jeffrey (1980).)

Finetti, Bruno de: 1970, 1974, Teoria delle Probabilità, Torino; English
translation, Theory of Probability, Vol. 1, Wiley, New York. (Vol. 2, 1975.)

Good, I. J.: 1950, Probability and the Weighing of Evidence, Griffin, London.

Good, I. J.: 1962, 'Probability as the measure of a non- measurable set', in Ernest
Nagel, Patrick Suppes, and Alfred Tarski (eds.), Logic, Methodology, and
Philosophy of Science: Proceedings of the 1960 International Congress, Stanford
Univ. Press. Reprinted in Kyburg and Smokler.

Goodman, N.: 1979, Fact, Fiction and Forecast, Hackett Publ. Co., Indianapolis.

Hume, D.: 1739, A Treatise of Human Nature, London.

Jeffrey, Richard C.: 1965, 1983, The Logic of Decision, McGraw-Hill; 2nd ed.,
Univ. of Chicago Press.

Jeffrey, Richard C.: 1980 (ed.), Studies in Inductive Logic and Probability, Vol. 2,
Univ. of California Press.

Johnson, W. E.: 1932, 'Probability', Mind 41, 1-16, 281-296, 408-423.

Kemeny, J.: 1963, 'Carnap's theory of probability and induction', in P. A. Schilpp
(ed.), The Philosophy of Rudolf Carnap, La Salle, Ill.

Keynes, John M.: 1921, A Treatise on Probability, London.

Kolmogorov, A. N.: 1933, Grundbegriffe der Wahrscheinlichkeitsrechnung,
Ergebnisse der Mathematik, Band II, No. 3. (English translation, Chelsea, N.Y., 1946.)

Koopman, B. O.: 1940, 'The bases of probability', Bulletin of the American
Mathematical Society 46, 763-774. Reprinted in Kyburg and Smokler.

Kyburg, Henry E., Jr.: 1961, Probability and the Logic of Rational Belief,
Wesleyan Univ. Press.

Kyburg, Henry E., Jr. and Howard Smokler (eds.): 1980, Studies in Subjective
Probability, 2nd ed., Krieger Publ. Co., Huntington, N.Y.

Levi, I.: 1974, 'On indeterminate probabilities', J. Phil. 71, 391-418.

Levi, I.: 1980, The Enterprise of Knowledge, MIT Press.

Lewis, David K.: 1980, 'A subjectivist's guide to objective chance', in Jeffrey (ed.)
(1980).

Mises, Richard v.: 1919, 'Grundlagen der Wahrscheinlichkeitsrechnung', Math. Zs. 5.

Skyrms, B.: 1980, 'Higher order degrees of belief', in D. H. Mellor (ed.), Prospects
for Pragmatism, Cambridge Univ. Press.

Shafer, G.: 1976, A Mathematical Theory of Evidence, Princeton Univ. Press.


Smith, C. A. B.: 1961, 'Consistency in statistical inference and decision', J. Royal
Stat. Soc., Series B 23, 1-25.

Dept. of Philosophy

Princeton University

Princeton, N.J. 08544, U.S.A.

