Imagine I’m conducting a raffle. I have a large bowl, and inside of it are many raffle entries, each with different numbers on them.
Let the variable x represent the number drawn from the bowl.
Before reaching into the bowl, I want to know:
"What is the probability that I'm going to pick a number less than 10 from the bowl?"
That is, what is \(p(x \leq 10)\)? Can we assign a probability to this single, one-off draw?
Classical Statistics: No.
Probability is defined as the long-run relative frequency: \(\lim_{n \rightarrow \infty} \frac{n_S}{n}\), where \(n\) is the number of times I repeat the experiment and \(n_S\) is the number of repetitions in which a number \(x \leq 10\) is drawn.
Suppose the number drawn turns out to be greater than 10. From this view, \(p(x \leq 10)\) was always equal to zero; all that changed was your knowledge of the experiment.
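The long-run-frequency definition is easy to illustrate by simulation. Here is a minimal sketch; the bowl's contents (the numbers 1 through 100) are an assumption purely for illustration:

# Approximate p(x <= 10) by its long-run relative frequency
set.seed(42)
bowl <- 1:100                                      # hypothetical bowl contents
draws <- sample(bowl, size = 1e5, replace = TRUE)  # repeat the experiment many times
mean(draws <= 10)                                  # n_S / n, approaches 0.10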
Bayesian Statistics: Almost certainly.
Probability is a measure of subjective belief about how likely an event is, based on prior understanding and new information.
Prior \(\rightarrow\) Information \(\rightarrow\) Posterior
Bayesian statistics integrates the epistemological uncertainty of statistical estimation into its core procedures. Its fundamental goal is to assess and improve the accuracy of one's beliefs based on a set of identifying statistical assumptions.
Classical statistics attempts to instead conduct inference on the (unknown) underlying reality, based on its own set of identifying assumptions.
The two approaches are NOT mutually exclusive.
The cornerstone of the Bayesian approach (and the source of its name) is the conditional likelihood theorem known as Bayes’ rule.
In its simplest form, Bayes' Rule states that for two events A and B (with \(P(B) \neq 0\)): \[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
Or, if A can take on multiple values, we have the extended form:
\[ p(A_i|B) = \frac{p(B | A_i) P(A_i)}{\sum_j P(B|A_j)P(A_j)} \]
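As a quick numeric illustration of the extended form, consider a hypothetical diagnostic test (all numbers here are made up for the example):

# Bayes' rule: P(disease | positive test), with hypothetical numbers
p_D      <- 0.01    # prior: P(disease)
p_pos_D  <- 0.95    # P(positive | disease), the test's sensitivity
p_pos_nD <- 0.10    # P(positive | no disease), i.e., 1 - specificity
# Denominator: sum over both values of A, as in the extended form
p_pos <- p_pos_D * p_D + p_pos_nD * (1 - p_D)
p_pos_D * p_D / p_pos    # posterior: about 0.088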
Adapting Bayes' Rule to the case of a parameter value \(\theta\) and observed data \(y\), we have:
\[ p(\theta \mid y) = \frac{f(\theta, y)}{f(y)} = \frac{f(y\mid\theta)p(\theta)}{ \int f(y \mid \theta) p(\theta) d\theta} \underbrace{\propto}_{\text{proportional to}} f(y|\theta)p(\theta) \]
Adding a bit of terminology:
\(f(y\mid\theta)\) you may recognize from Maximum Likelihood estimation as the likelihood function.
\(p(\theta)\) is our prior (pre-experiment) belief about the probability distribution of \(\theta\).
Finally, \(p(\theta \mid y)\) is our posterior (post-experiment) belief about the probability distribution of \(\theta\).
\[ \overbrace{p(\theta \mid y)}^{\text{posterior}} = \frac{\overbrace{f(y\mid\theta)}^\text{likelihood} \overbrace{p(\theta)}^\text{prior}}{\underbrace{\int f(y \mid \theta) p(\theta) d\theta}_\text{average likelihood}} \underbrace{\propto}_{\text{proportional to}} \overbrace{f(y|\theta)}^{likelihood} \overbrace{p(\theta)}^{prior}\]
Hence we have the basic statement:
\[ \text{Posterior} \propto \text{Likelihood} \times \text{Prior} \]
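To see this statement in action, here is a minimal grid-approximation sketch for the probability \(\theta\) that a coin lands heads; the data (7 heads in 10 flips) and the Beta(2, 2) prior are assumptions chosen purely for illustration:

# Grid approximation: posterior is proportional to likelihood x prior
theta      <- seq(0, 1, length.out = 1000)         # grid of candidate parameter values
prior      <- dbeta(theta, 2, 2)                   # mildly informative prior
likelihood <- dbinom(7, size = 10, prob = theta)   # 7 heads in 10 flips
posterior  <- likelihood * prior
posterior  <- posterior / sum(posterior)           # normalize (the 'average likelihood' step)
theta[which.max(posterior)]                        # posterior mode, about 0.67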
Because the posterior depends in part on prior beliefs, Bayesian analysis is often attacked as subjective, and a lot of emphasis is placed on the role of the prior. But fear not!
First, as the sample size increases, the data become the dominant factor in the posterior probability.
Moreover, if desired, we can easily specify priors that have no effect on the posterior estimates.
Different types of priors include:
Uninformative (or "flat") priors: Priors that have no impact on posterior values (i.e., assuming total ignorance about the possible parameter value).
Informative priors: Priors where we use prior knowledge to specify our best guess of the mean and distribution of a parameter value.
Weakly informative (or regularizing) priors: Priors that are deliberately less informative than our actual knowledge; they affect estimates less than informative priors while still incorporating conservative information into the production of posterior estimates.
Think for instance of inference about possible life expectancy:
We could specify an uninformative prior that allows for any possible lifespan (even living to 1 million years old).
We could specify a skewed normal distribution that approximates current estimates of the distribution of lifespans.
Or we could specify a weaker prior that, for instance, allows life expectancy to be more dispersed, with ages reaching, say, 150 or 200 years old.
In almost all circumstances, a weak prior should be preferred.
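As a sketch of what these three options might look like, the snippet below plots one candidate density for each (the specific distributions and parameters are assumptions for illustration, with a plain normal standing in for the skewed normal):

# Three candidate priors for age at death (in years)
age         <- seq(0, 200, by = 1)
flat        <- dunif(age, min = 0, max = 1e6)   # uninformative: any lifespan allowed
informative <- dnorm(age, mean = 72, sd = 15)   # centered on current lifespan estimates
weak        <- dnorm(age, mean = 72, sd = 40)   # same center, deliberately more dispersed
matplot(age, cbind(flat, informative, weak), type = "l", lty = 1,
        xlab = "age at death", ylab = "prior density")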
## Hypothesis Testing in Bayesian Analysis?
Another point of divergence for Bayesian vs. frequentist data analysis is even more dramatic:
Largely, there is no place for null-hypothesis significance testing (NHST) in Bayesian analysis.
Bayesian analysis has something similar, called the Bayes factor, which compares the marginal likelihoods of a null and an alternative model; combined with prior model probabilities, it yields the models' posterior probabilities.
But Bayes factors are heavily criticized by leading Bayesians like Andrew Gelman and Donald Rubin because, among other issues, they are highly sensitive to prior probabilities and model structures.
Instead, analysis is oriented around estimation of the posterior distribution of parameters (or predictions).
Without NHST to tell us if our results are significant, does that mean we just get point estimates and no way to assess how reliable they are? No!
Recall that we are estimating the posterior distribution of a parameter (or predicted outcome).
Hence, we can easily produce a 95% interval for the parameter, simply using the quantiles of the posterior CDF.
In Bayesian analysis, we replace the 100(\(1-\alpha\))% frequentist confidence interval with the 100(\(1-\alpha\))% credible interval.
A credible interval, however, has a much more appealing interpretation than a confidence interval.
A confidence interval has the interpretation that, in repeated samples, 95% of intervals constructed this way will contain the true parameter.
A credible interval is what people think a confidence interval should mean: there is a 95% chance that the true value lies within the 95% credible interval.
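Concretely, given a vector of posterior draws for a parameter, the 95% credible interval is just its 2.5% and 97.5% quantiles. A minimal sketch, with simulated draws standing in for real MCMC output:

# 95% credible interval from posterior draws
draws <- rnorm(4000, mean = 1.5, sd = 0.3)    # stand-in for MCMC draws of a parameter
quantile(draws, probs = c(0.025, 0.975))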
One major feature of Bayesian inference that I haven’t mentioned so far is the intractability of analytic solutions for estimating posterior distributions in most circumstances.
Recall: \[ p(\theta \mid y) = \frac{f(\theta, y)}{f(y)} = \frac{f(y\mid\theta)p(\theta)}{ \int f(y \mid \theta) p(\theta) d\theta} \]
For models that are more complex or that involve high-dimensional data, closed-form solutions are not available for the integral in the denominator.
Hence, Bayesian analysis instead typically relies on numerical methods, usually Markov Chain Monte Carlo (MCMC) methods.
This method relies on sequentially sampling values of \(\theta\) from an approximation of the posterior distribution and then correcting the approximation to create better subsequent samples.
Because the approximated distribution used in one step relies on the sample from the previous step, the simulation forms a Markov chain.
A critical property then is convergence: Have our simulations converged to the real target distribution?
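To make the mechanics concrete, here is a minimal random-walk Metropolis sampler, one of the simplest MCMC algorithms. The target density and tuning choices are toy assumptions; Stan itself uses the more sophisticated Hamiltonian Monte Carlo:

# Minimal random-walk Metropolis sampler for a toy target distribution
set.seed(1)
log_post <- function(theta) dnorm(theta, mean = 2, sd = 1, log = TRUE)  # toy log-posterior
n_iter   <- 5000
theta    <- numeric(n_iter)
theta[1] <- 0                                    # arbitrary starting value
for (t in 2:n_iter) {
  proposal <- theta[t - 1] + rnorm(1, sd = 0.5)  # propose a value near the current one
  # Accept with probability min(1, posterior ratio); otherwise keep the current value
  if (log(runif(1)) < log_post(proposal) - log_post(theta[t - 1])) {
    theta[t] <- proposal
  } else {
    theta[t] <- theta[t - 1]
  }
}
mean(theta[-(1:1000)])    # discard warmup draws; should be close to 2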
Probably the best approach to doing Bayesian analysis in any software environment is with rstan, which is an R interface to the Stan programming language designed for Bayesian analysis.
To use rstan on Windows, you will first need to install Rtools (available from CRAN). Then install the package rstan from RStudio (make sure to set dependencies = TRUE when installing).
Once you've set up rstan, there is one more thing you should typically do: tell it to run on multiple cores.
library(rstan)
# Save compiled models so they don't need to be recompiled each run
rstan_options(auto_write = TRUE)
# Run chains in parallel on multiple cores
options(mc.cores = parallel::detectCores())
Now that you're hopefully excited about rstan, let's look at an example rstan model from the package documentation: the classic "eight schools" meta-analysis.
# Create schools.stan --------------------------
data {
  int<lower=0> J;          // number of schools
  real y[J];               // estimated treatment effects
  real<lower=0> sigma[J];  // s.e. of effect estimates
}
parameters {
  real mu;                 // population mean effect
  real<lower=0> tau;       // between-school standard deviation
  vector[J] eta;           // standardized school-level effects
}
transformed parameters {
  vector[J] theta;         // school treatment effects
  theta = mu + tau * eta;
}
model {
  target += normal_lpdf(eta | 0, 1);        // standard normal prior on eta
  target += normal_lpdf(y | theta, sigma);  // likelihood
}
# End of schools.stan --------------------------
## Run Stan regression using schools.stan
fit1 <- stan(
  file = "schools.stan",  # Stan program
  data = schools_data,    # named list of data
  chains = 4,             # number of Markov chains
  warmup = 1000,          # number of warmup iterations per chain
  iter = 2000,            # total number of iterations per chain
  cores = 2,              # number of cores (using 2 just for the vignette)
  refresh = 1000          # show progress every 'refresh' iterations
)
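One piece the call above assumes is the named list schools_data, whose elements must match the data block of schools.stan. For the eight schools example it is:

# Eight schools data (Rubin 1981), as used in the rstan vignette
schools_data <- list(
  J = 8,                                    # number of schools
  y = c(28, 8, -3, 7, -1, 1, 18, 12),       # estimated treatment effects
  sigma = c(15, 10, 16, 11, 9, 11, 10, 18)  # s.e. of effect estimates
)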
Oops! That was a lot of code! And there’s still a bit more left out. Luckily, there are a couple of packages that exist to make your life easier:
First, there is rstanarm, which was created by the developers of Stan and rstan to make running a Bayesian regression with rstan feel much more like running a standard frequentist regression.
Another very similar package to rstanarm is brms, which also makes running Bayesian regression much simpler and ‘R-like’.
rstanarm is faster, has better posterior checking, and is a bit simpler to use.
brms is generally a bit more flexible, with support for some regression types missing in rstanarm, more flexible specification of priors, and support for more types of error correlational structures.
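For a flavor of how similar these interfaces are to base R, here is a hypothetical brms call (the variable and data names are placeholders):

library(brms)
# Same formula syntax as lm()/glm(); placeholder names throughout
myfit <- brm(y ~ x1 + x2, data = mydata, family = gaussian())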
With rstanarm, you run most regressions using the function stan_glm().
Syntax:
mybayesreg <- stan_glm(y ~ x1 + x2 + x3 ...,
                       family = myfamily, data = mydata,
                       prior = myprior)
The family argument (sometimes needing a link argument as well) defines the type of regression you want:
family = gaussian                   # linear regression
family = binomial(link = "logit")   # logistic regression
family = binomial(link = "probit")  # probit regression
family = poisson                    # Poisson (count) regression
Prior distributions:
Flat (uninformative) priors are specified with prior = NULL.
[Weakly] informative priors can be specified by using prior = with one of the prior functions rstanarm provides, such as normal(), student_t(), cauchy(), laplace(), or lasso() (see ?rstanarm::priors for the full list).
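For example, a regression with a weakly informative normal prior on the coefficients and a wider one on the intercept might look like this (the variable and data names are placeholders, and the prior scales are chosen purely for illustration):

# Hypothetical example: explicit weakly informative priors in stan_glm
mybayesreg <- stan_glm(y ~ x1 + x2,
                       family = gaussian, data = mydata,
                       prior = normal(0, 1),            # prior on coefficients
                       prior_intercept = normal(0, 5))  # prior on the intercept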
data(Titanic)
# Display the Titanic data: here a passenger-level data frame
# (not base R's 4-way Titanic table); glimpse() comes from dplyr
glimpse(Titanic, width = 50)
## Observations: 1,313
## Variables: 6
## $ Name <fct> "Allen, Miss Elisabeth Walton"…
## $ PClass <fct> 1st, 1st, 1st, 1st, 1st, 1st, …
## $ Age <dbl> 29.00, 2.00, 30.00, 25.00, 0.9…
## $ Sex <fct> female, female, male, female, …
## $ Survived <int> 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, …
## $ SexCode <int> 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, …
# Extract the numeric class (1/2/3) from PClass; str_extract() comes from stringr
Titanic$class <- str_extract(Titanic$PClass, "[0-9]")
# Linear probability model: gaussian family with a binary outcome
TitanicLinear <- stan_glm(Survived ~ Age + SexCode + as.factor(class),
                          data = Titanic, family = gaussian)
summary(TitanicLinear)
##
## Model Info:
##
## function: stan_glm
## family: gaussian [identity]
## formula: Survived ~ Age + SexCode + as.factor(class)
## algorithm: sampling
## priors: see help('prior_summary')
## sample: 4000 (posterior sample size)
## observations: 756
## predictors: 5
##
## Estimates:
## mean sd 2.5% 25% 50% 75% 97.5%
## (Intercept) 0.6 0.1 0.5 0.6 0.6 0.7 0.7
## Age 0.0 0.0 0.0 0.0 0.0 0.0 0.0
## SexCode 0.5 0.0 0.4 0.5 0.5 0.5 0.6
## as.factor(class)2 -0.2 0.0 -0.3 -0.2 -0.2 -0.2 -0.1
## as.factor(class)3 -0.4 0.0 -0.5 -0.4 -0.4 -0.4 -0.3
## sigma 0.4 0.0 0.4 0.4 0.4 0.4 0.4
## mean_PPD 0.4 0.0 0.4 0.4 0.4 0.4 0.5
## log-posterior -364.8 1.8 -369.0 -365.8 -364.5 -363.5 -362.4
##
## Diagnostics:
## mcse Rhat n_eff
## (Intercept) 0.0 1.0 2096
## Age 0.0 1.0 3087
## SexCode 0.0 1.0 4754
## as.factor(class)2 0.0 1.0 2660
## as.factor(class)3 0.0 1.0 2176
## sigma 0.0 1.0 4131
## mean_PPD 0.0 1.0 3971
## log-posterior 0.0 1.0 1741
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
You can also easily print the credible intervals with the function posterior_interval():
posterior_interval(TitanicLinear, prob=0.95)
## 2.5% 97.5%
## (Intercept) 0.522467174 0.73215738
## Age -0.008253736 -0.00381582
## SexCode 0.442385749 0.56045498
## as.factor(class)2 -0.281720155 -0.12944070
## as.factor(class)3 -0.466034311 -0.31756088
## sigma 0.369875717 0.40839844
plot(TitanicLinear)
You can also easily plot the posterior distribution of a parameter in R.
Titanic_posterior <- TitanicLinear %>%
  as_tibble() %>%
  rename(sec.class = "as.factor(class)2",
         third.class = "as.factor(class)3")
ggplot(Titanic_posterior, aes(x = third.class)) +
  geom_histogram()
You can also compare the posterior distributions to their priors with posterior_vs_prior():
posterior_vs_prior(TitanicLinear)
After performing a Bayesian regression, there are a number of diagnostics that help assess whether the model converged, how well it performs, and even compare between models.
Today, we’ll cover some of them included with rstanarm as well as the very useful shinystan package.
To check the predictive accuracy of the posterior distribution, you can use the function pp_check(), which plots simulated y values from the posterior distribution against the actual values of y.
pp_check(TitanicLinear)
A critical issue in both Bayesian and frequentist estimation is how to balance predictive accuracy with parsimony. Put another way, the researcher should be concerned with not overfitting the data while still creating a compelling model.
The basic frequentist approach is the Akaike information criterion (AIC), an estimate of the expected log predictive density:
\[\hat{elpd}_{AIC} = \log p(y | \hat{\theta}_{MLE}) - k,\]
where \(k\) is the number of estimated parameters.
The most basic Bayesian adaptation of the AIC is the Deviance information criterion (DIC):
\[\hat{elpd}_{DIC} = \log p(y | \hat{\theta}_{Bayes}) - p_{DIC},\]
where \(\hat{\theta}_{Bayes}\) is the posterior mean and \(p_{DIC}\) is the effective number of parameters.
An improvement over the DIC is the Watanabe-Akaike information criterion (WAIC):
\[\hat{elpd}_{WAIC} = \sum_{i=1}^{n} \log E_{post}\Big[p(y_i \mid \theta)\Big] - \sum_{i=1}^{n} V_{post}\Big[\log p(y_i \mid \theta)\Big]\]
The WAIC has the advantages of:
Averaging the likelihood over the posterior distribution rather than plugging in a point estimate.
Not assuming a multivariate Gaussian posterior distribution, as the DIC (and AIC) do.
waic(TitanicLinear)
##
## Computed from 4000 by 756 log-likelihood matrix
##
## Estimate SE
## elpd_waic -360.6 19.4
## p_waic 6.1 0.3
## waic 721.1 38.8
Another method alongside WAIC for comparing out-of-sample predictive ability is leave-one-out cross-validation (LOO), which the loo package approximates without actually refitting the model \(n\) times.
Expected Log Predictive Density:
\[\hat{elpd}_{LOO} = \sum_{i=1}^{n} \log p(y_i | y_{-i})\]
loo(TitanicLinear)
##
## Computed from 4000 by 756 log-likelihood matrix
##
## Estimate SE
## elpd_loo -360.6 19.4
## p_loo 6.1 0.3
## looic 721.1 38.8
## ------
## Monte Carlo SE of elpd_loo is 0.0.
##
## All Pareto k estimates are good (k < 0.5).
## See help('pareto-k-diagnostic') for details.
Titanic_probit <- stan_glm(Survived ~ Age + SexCode + as.factor(class),
                           data = Titanic, family = binomial(link = "probit"))
Loo_probit <- loo(Titanic_probit)
Titanic_logit <- stan_glm(Survived ~ Age + SexCode + as.factor(class),
                          data = Titanic, family = binomial(link = "logit"))
Loo_logit <- loo(Titanic_logit)
# ELPD_diff > 0 indicates more support for the 2nd model
# (compare_models() is from rstanarm; newer versions use loo_compare())
compare_models(Loo_probit, Loo_logit)
##
## Model comparison:
## (negative 'elpd_diff' favors 1st model, positive favors 2nd)
##
## elpd_diff se
## 1.3 0.8
Probably the most popular diagnostic tooling for Bayesian regression in R comes from the shinystan package.
# Run in the console, not in R Markdown
launch_shinystan(TitanicLinear)