
Bayesian Statistics: A Biologist's Interpretation

Marguerite Pelletier URI Natural Resources Science / U.S. EPA

How have Bayesian Methods been used?


Federal allocation of money: Bayesian analysis of population characteristics, such as poverty, in small geographic areas
Microsoft Windows Office Assistant: Bayesian artificial intelligence algorithm

It has been suggested that Bayesian statistics be used in environmental science because they address questions about the probability of events occurring, which supports better decision-making

Bayesian Statistics vs. Frequentist Statistics


Frequentist (Traditional) Statistics
Assumes a fixed, true value for the parameter of interest (e.g., mean, std dev)
Expected value = average value obtained by random sampling repeated ad infinitum
Can only reject the null hypothesis (Ho), not support the alternative hypothesis (Ha); p-values indicate statistical rareness
Large sample sizes make rejection of Ho more likely
Confidence intervals describe how often the interval procedure would capture the parameter under repeated sampling, not how likely a given parameter value is in real life

Bayesian Statistics vs. Frequentist Statistics, cont.


Bayesian Statistics
Treats the parameter of interest (e.g., mean, std dev) as variable, with a distribution conditioned on the data
Can test the probability of the alternative hypothesis (Ha), or of several hypotheses, given the data (which is what most scientists really care about)

Generates probability for any hypothesis being true


Sample sizes taken into account; a large sample size alone won't cause acceptance of the hypothesis

Creates credible intervals rather than confidence intervals; these tell how likely the answer is in the real world
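To make the idea of a credible interval concrete, here is a minimal sketch that computes a 95% equal-tailed credible interval for a proportion by grid approximation. The data (7 occurrences in 20 trials), the flat prior, and the function name `credible_interval` are assumptions chosen for illustration; they do not come from the slides.

```python
def credible_interval(successes, trials, level=0.95, grid_size=10001):
    """Equal-tailed credible interval for a proportion, assuming a flat prior."""
    # Candidate parameter values evenly spaced between 0 and 1.
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Unnormalized posterior = binomial likelihood * flat prior (a constant).
    weights = [p**successes * (1 - p) ** (trials - successes) for p in grid]
    total = sum(weights)
    tail = (1 - level) / 2
    lower = upper = None
    cumulative = 0.0
    for p, w in zip(grid, weights):
        cumulative += w / total
        if lower is None and cumulative >= tail:
            lower = p  # first grid point where the lower tail mass is reached
        if cumulative >= 1 - tail:
            upper = p  # first grid point where the upper tail mass is reached
            break
    return lower, upper


# Hypothetical data: 7 occurrences in 20 trials.
low, high = credible_interval(successes=7, trials=20)
print(f"95% credible interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval can be read directly as "the proportion lies between `low` and `high` with 95% probability", which is the interpretation a confidence interval does not support.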

How do Bayesian Statistics Work?


Posterior probability = (Fisher's likelihood function × Prior probability) / Expected likelihood function

Likelihood function: given data with a known (or predicted) distribution (e.g., Normal, Poisson), a likelihood function (the probability of the data as a function of the parameter) can be calculated
Prior probability: based on existing data or a subjective indication of what the investigator believes to be true
Expected likelihood function: marginal distribution of the data given the hyperparameter; takes sample size into account

Bayes' Rule: Posterior ∝ Likelihood × Prior
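The rule can be sketched numerically for a small set of competing hypotheses. The example below assumes three candidate Poisson mean rates with equal prior probability and a handful of observed counts; all numbers and the names `poisson_pmf` and `posterior` are hypothetical. The normalizing sum plays the role of the expected (marginal) likelihood in the denominator.

```python
import math


def poisson_pmf(count, rate):
    """Probability of observing `count` events when the mean rate is `rate`."""
    return math.exp(-rate) * rate**count / math.factorial(count)


def posterior(prior, counts):
    """Posterior probability of each candidate rate given observed counts."""
    # Likelihood of all observed counts under each candidate rate.
    likelihood = {
        rate: math.prod(poisson_pmf(c, rate) for c in counts) for rate in prior
    }
    # Numerator of Bayes' rule: likelihood * prior.
    unnormalized = {rate: likelihood[rate] * prior[rate] for rate in prior}
    # Denominator: the marginal (expected) likelihood of the data.
    evidence = sum(unnormalized.values())
    return {rate: value / evidence for rate, value in unnormalized.items()}


# Hypothetical prior: three candidate mean rates, equally plausible.
prior = {2.0: 1 / 3, 5.0: 1 / 3, 10.0: 1 / 3}
counts = [4, 6, 3]  # hypothetical observations
post = posterior(prior, counts)
for rate, probability in post.items():
    print(f"rate={rate}: posterior probability {probability:.3f}")
```

With these illustrative counts, most of the posterior probability moves to the candidate rate closest to the sample mean.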

Problems with Bayesian Statistics


Computationally intense (integration of complex functions)
However, better computers and the development of Markov Chain Monte Carlo methods have made the techniques more accessible
Not directly applicable to many complex statistical analyses
Can be used for certain regression techniques and to generate a posterior distribution given a prior; attempts to use it in clustering have been unsuccessful
Not readily available in most common statistical software (SPSS, SAS)
Not applicable to very rare events: priors dominate the function, so the posterior doesn't change, which implies that further study is not needed or useful

So When are Bayesian Statistics Useful?


When limited data are available: formalizes the use of Best Professional Judgment (Case Study 1)
When Bayesian algorithms have been developed for a statistic, e.g., regression (Case Study 2)
After using more traditional statistical methods, to develop a probability distribution (Case Study 3)
When the answer is a single number rather than a complex function (e.g., a simple calculation, not a complex multivariate analysis)

Case Study #1: Development of a Bayesian Probability Network in the Neuse River Estuary, N.C.

(Borsuk ME, Stow CA, Reckhow KH 2003. An integrated approach to TMDL development for the Neuse River estuary using a Bayesian probability network. Journal of Water Resources Planning and Management, accepted)

Summary of Project

Neuse River estuary impaired due to nitrogen (eutrophication problems), requiring a Total Maximum Daily Load (TMDL) to be developed
For development of a TMDL, links must be developed between pollutant load ([N]) and water quality impairment
Because of the range of endpoints and the need to determine probability of impact, a Bayesian Network was developed

Data for the model came from routine water quality monitoring and from elicited judgment of scientific experts
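As a rough illustration of how a Bayesian network combines conditional probabilities from linked variables, the sketch below marginalizes over two parent variables to get the probability of a downstream event. Every probability here is hypothetical (not taken from the Neuse study), the two parents are assumed independent, and the names are invented for the example.

```python
from itertools import product

# Hypothetical probabilities, for illustration only (not from the study).
p_hypoxia = 0.3  # P(bottom-water hypoxia)
p_wind = 0.2     # P(cross-channel wind event)
# P(fish kill | hypoxia, wind), of the kind that might be elicited from experts.
p_kill_given = {
    (True, True): 0.40,
    (True, False): 0.05,
    (False, True): 0.01,
    (False, False): 0.00,
}


def prob(event, p):
    """Probability that a binary variable takes the value `event`."""
    return p if event else 1 - p


def marginal_fish_kill(p_hypoxia, p_wind, p_kill_given):
    """Sum P(kill | parents) * P(parents) over all parent states."""
    total = 0.0
    for hypoxia, wind in product([True, False], repeat=2):
        # Parents are assumed independent, so their joint probability factorizes.
        joint = prob(hypoxia, p_hypoxia) * prob(wind, p_wind)
        total += joint * p_kill_given[(hypoxia, wind)]
    return total


baseline = marginal_fish_kill(p_hypoxia, p_wind, p_kill_given)
reduced = marginal_fish_kill(0.15, p_wind, p_kill_given)  # halve hypoxia frequency
print(f"P(fish kill), baseline: {baseline:.4f}")
print(f"P(fish kill), reduced hypoxia: {reduced:.4f}")
```

A full network chains many such conditional tables, so a change at an upstream node propagates through each link in turn.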

Bayesian Network

[Figure: Bayesian network diagram for the Neuse River estuary. Nodes: River [N], River Flow, Pfiesteria abundance, Algal Density, Carbon Production, Water Temperature, Sediment Oxygen Demand, Duration of Stratification, Shellfish Abundance, Frequency of Cross-Channel Winds, Oxygen Concentration, Days of Hypoxia, Frequency of Fish Kills, Fish Population Health. Legend: system variable; node or submodel; association.]

Use of Bayesian Network (focus on Fish Kills)


Fish kills = low bottom D.O. + cross-channel winds (force bottom water & fish to shores) + fish health (influences susceptibility)
Two expert fisheries biologists were asked about the likelihood of a fish kill given certain conditions (various wind/hypoxia/fish health scenarios)
All probabilistic relationships (including fish kill info) were incorporated into the Bayesian network
Nitrogen reduction scenarios of 0, 15, 30, 45 and 60% (relative to the 1991-1995 baseline) were assessed using Latin Hypercube sampling
As N inputs decreased, mean chlorophyll (chl) and exceedance frequency also decreased
Fish kills don't change substantially with N reduction: fish kills are relatively rare, & the effect of reduced C production is damped out further along the causal chain
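Latin Hypercube sampling, mentioned above, spreads samples evenly by drawing exactly one value from each equal-probability stratum of every input variable. The following is a minimal sketch on the unit cube; the function name and the choice of 5 samples over 2 variables are assumptions for illustration, and it does not reproduce the authors' implementation.

```python
import random


def latin_hypercube(n_samples, n_vars, rng):
    """Return n_samples points in [0, 1)^n_vars with one point per stratum per variable."""
    columns = []
    for _ in range(n_vars):
        # One uniform draw inside each of n_samples equal-width strata.
        column = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across variables.
        rng.shuffle(column)
        columns.append(column)
    # Transpose the per-variable columns into sample points.
    return [tuple(column[i] for column in columns) for i in range(n_samples)]


rng = random.Random(42)  # fixed seed so the sketch is reproducible
samples = latin_hypercube(n_samples=5, n_vars=2, rng=rng)
for point in samples:
    print(tuple(round(value, 3) for value in point))
```

Each unit-interval value would then be mapped through the inverse cumulative distribution of the corresponding model input before running the network.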

Case Study #2: Modeling Butterfly Species Richness using Bayesian Poisson Regression

(Mac Nally R, Fleishman E, Fay JP, Murphy DD 2003. Modeling butterfly species richness using mesoscale environmental variables: model construction and validation for the mountain ranges in the Great Basin of western North America. Biological Conservation 110:21-31.)

Summary of Project
Species richness is related to local environmental variables
Over large scales these variables are hard to collect
This study: 14 environmental variables from GIS and remote sensing used to predict butterfly species richness

Poisson regression used to develop appropriate models from the 28 variables (each independent variable, IV, plus its square, IV²); Schwarz Information Criterion used for selection
Appropriate variables then used in Bayesian Poisson model

Model output validated against additional field data
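The Schwarz Information Criterion used for model selection above trades goodness of fit against the number of parameters: criterion = k ln(n) - 2 ln(L), with lower values preferred. The sketch below applies it to two candidate models; the log-likelihoods, parameter counts, sample size, and model labels are hypothetical and are not results from the study.

```python
import math


def schwarz_criterion(log_likelihood, n_params, n_obs):
    """Schwarz (Bayesian) Information Criterion; lower values are preferred."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood


# Hypothetical fits: (maximized log-likelihood, number of parameters).
candidates = {
    "x1": (-120.4, 2),
    "x1 + x1^2": (-118.9, 3),
}
n_obs = 50  # hypothetical number of sites

scores = {
    name: schwarz_criterion(log_lik, k, n_obs)
    for name, (log_lik, k) in candidates.items()
}
best = min(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("Selected model:", best)
```

In this made-up case the extra squared term improves the fit slightly, but not enough to offset the penalty for the additional parameter.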

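Bayesian Poisson Regression:
$\log \lambda_i = \alpha + \sum_k \beta_k X_{ik} + \varepsilon$, $Y_i \sim \mathrm{Poisson}(\lambda_i)$
where
$\lambda_i$ = mean (unobservable, true) spp richness at site $i$
$\alpha$, $\beta_k$ = regression coefficients; non-informative priors
$\varepsilon$ = model error
$Y_i$ = observed spp richness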

Markov Chain Monte Carlo algorithm: 1000-iteration burn-in, then 3000 iterations to generate parameter estimates and mean spp richness estimates

New model run using validation data and the regression-coefficient distributions from the first model
Model worked well for same mountain range, but not for new range
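The burn-in-then-sample procedure can be sketched with a random-walk Metropolis sampler for a Poisson regression with one predictor. This is a simplified stand-in, not the authors' algorithm: it assumes flat priors, omits the model error term, uses a hypothetical data set, and the step size and function names are arbitrary choices.

```python
import math
import random


def log_posterior(alpha, beta, xs, ys):
    """Poisson log-likelihood (up to a constant); equals the log posterior under flat priors."""
    total = 0.0
    for x, y in zip(xs, ys):
        log_rate = alpha + beta * x
        total += y * log_rate - math.exp(log_rate)
    return total


def metropolis(xs, ys, burn_in=1000, n_keep=3000, step=0.05, seed=1):
    """Random-walk Metropolis sampler returning n_keep post-burn-in draws."""
    rng = random.Random(seed)
    alpha, beta = 0.0, 0.0
    current = log_posterior(alpha, beta, xs, ys)
    draws = []
    for iteration in range(burn_in + n_keep):
        # Propose a small Gaussian perturbation of both coefficients.
        prop_alpha = alpha + rng.gauss(0.0, step)
        prop_beta = beta + rng.gauss(0.0, step)
        proposed = log_posterior(prop_alpha, prop_beta, xs, ys)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposed - current)):
            alpha, beta, current = prop_alpha, prop_beta, proposed
        # Discard the burn-in iterations; keep the rest.
        if iteration >= burn_in:
            draws.append((alpha, beta))
    return draws


# Hypothetical data: predictor values and observed counts at 7 sites.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [2, 3, 3, 5, 7, 9, 14]
draws = metropolis(xs, ys)
mean_alpha = sum(a for a, _ in draws) / len(draws)
mean_beta = sum(b for _, b in draws) / len(draws)
print(f"posterior mean alpha: {mean_alpha:.2f}, beta: {mean_beta:.2f}")
```

The retained draws approximate the joint posterior of the coefficients, so their spread (not just their mean) could be carried forward when predicting richness at new sites.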

Case Study #3: Assessing Spatial Population Viability Models using Bayesian Statistics

(McCarthy MA, Lindenmayer DB, Possingham HP 2001. Assessing spatial PVA models of arboreal marsupials using significance tests and Bayesian statistics. Biological Conservation 98:191-200.)

Summary of Project
Population Viability Analysis used in Conservation Biology to assess potential for species extinction

Many models are based on limited data; they can be assessed via significance tests or Bayesian methods
Metapopulation models (for 4 arboreal marsupials) were developed
2 competing null models were also developed:
No effect of fragmentation
No dispersal between patches
Models were compared using likelihood and Bayesian methods

Model Comparison
Predicted presence in patches was compared to observed presence using logistic regression:
$\ln\left(\frac{o}{1 - o}\right) = \alpha + \beta \ln\left(\frac{p}{1 - p}\right)$
where o = observed presence, p = predicted presence, and $\alpha$, $\beta$ = regression coefficients
Predicted and observed values differ significantly if $\alpha$ is significantly different from 0 or $\beta$ is significantly different from 1
Models compared using log-likelihood; models with higher log-likelihood values (closer to 0) more closely match the data

Bayesian posterior probabilities used to compare models; higher probabilities more closely match the data
Prior: all 3 models equally plausible
Probability of model = likelihood of model / sum of all likelihoods
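With equal prior probabilities, the posterior probability of each model is its likelihood divided by the sum of the likelihoods of all models. The sketch below computes this from log-likelihoods; the three log-likelihood values are hypothetical, not the published results, and the function name is invented for the example.

```python
import math


def posterior_model_probabilities(log_likelihoods):
    """Posterior probability of each model, assuming equal prior probabilities."""
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    peak = max(log_likelihoods.values())
    weights = {name: math.exp(ll - peak) for name, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical log-likelihoods for the three competing models.
log_likelihoods = {
    "full": -20.0,
    "no fragmentation": -22.0,
    "no dispersal": -25.0,
}
probabilities = posterior_model_probabilities(log_likelihoods)
for name, probability in probabilities.items():
    print(f"{name}: {probability:.3f}")
```

Working on the log scale matters in practice because raw likelihoods for many patches are often too small to represent directly.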

Conclusions
Comparison with actual data:
Full model best for greater glider, yellow-bellied glider
No fragmentation model best for mountain brushtail possum, ringtail possum (but predicted values ~ observed values)
Log-likelihood values:
Confirmed no fragmentation model best for the 2 possum spp
Confirmed full model best for the greater glider
Yellow-bellied glider equally represented by full model and no dispersal model

Bayesian statistics confirmed log-likelihood results


Authors indicated that significance tests useful to assess model accuracy; Bayesian methods useful for comparing models but computationally intense
