Chapter 2 A Review of INLA

2.1 Introduction

This chapter gives a brief review of the Integrated Nested Laplace Approximation (INLA), a framework proposed by Rue, Martino, and Chopin (2009) for approximate Bayesian inference in a subclass of structured additive regression models (Fahrmeir and Tutz 2001) called latent Gaussian models. Latent Gaussian models are a subset of Bayesian additive models with a structured additive predictor for modeling a response vector \({\boldsymbol y}^n = (y_1,y_2,\ldots,y_n)'\). Let \({\boldsymbol x}^n=(x_1,x_2,\ldots,x_n)'\) denote the vector of all the latent Gaussian variables, and let \(\boldsymbol{\theta}\) denote a vector of hyperparameters, which are not necessarily Gaussian. In latent Gaussian models, the response variable can belong to an exponential family of distributions, including the normal, binomial, Poisson, negative binomial, etc. (see Chapter 1 – Appendix); the mean of the response is linked to a structured additive predictor; and the hyperparameters handle quantities such as variance terms in the latent Gaussian model or the dispersion parameter in a negative binomial response model. INLA approaches the modeling via a hierarchical structure and directly computes very accurate approximations of posterior densities, which significantly decreases the computational time compared to fully Bayesian inference via MCMC.

Since the Gaussian Markov random field (GMRF) structure is used in INLA for handling structural time series models, we give a brief description in Chapter 2 – Appendix. While the level of detail given in this chapter may not be necessary for a user who is primarily interested in using R-INLA for data analysis, it does facilitate an understanding of how INLA works. For more details, see Lauritzen (1996) or Rue and Held (2005). Since the Laplace approximation plays an important role in the implementation of INLA, we next give an overview of Laplace’s method as well as its modified version that is used in INLA.

2.2 Laplace approximation

The Laplace approach is used to approximate integrals that cannot be evaluated analytically. A general review of Laplace’s method, which is based on asymptotic expansions, can be found in the text by De Bruijn (1981). The approach was originally considered in Bayesian statistics by Lindley (1961), and gained wider attention with Lindley (1980), who proposed its use for approximating posterior moments.

Consider the ratio of integrals given by

\[\begin{align} \frac{\int w(\Theta)\,e^{\ell(\Theta)}d\Theta} {\int\pi(\Theta)\,e^{\ell(\Theta)}d\Theta}, \tag{2.1} \end{align}\] where \(\ell(\Theta) = \ell(\Theta; {\boldsymbol y}^n)\) is the log-likelihood function of the \(p\)-dimensional parameter vector \(\Theta\) based on \(n\) observations \({\boldsymbol y}^n=(y_1,\ldots,y_n)'\) coming from the probability model \(p({\boldsymbol y}^n|\Theta)\), i.e.,

\[\begin{align*} \ell(\Theta)=\sum_{t=1}^n \log\,p(y_t|\Theta). \end{align*}\]

The quantities \(w(\Theta)\) and \(\pi(\Theta)\) are functions of \(\Theta\) which may need to satisfy certain conditions depending on the context.

In the Bayesian formulation, \(\pi(\Theta)\) is the prior and \(w(\Theta)=u(\Theta)\pi(\Theta)\), where \(u(\Theta)\) is some function of \(\Theta\) which is of interest. Thus, the ratio represents the posterior expectation of \(u(\Theta)\), i.e., \(E(u(\Theta)|{\boldsymbol y}^n)\). For example, if \(u(\Theta)=\Theta\), then the ratio in (2.1) gives us the posterior mean of \(\Theta\). Similarly, if \(u(\Theta)=p({\boldsymbol y}^n|\Theta)\), it yields the posterior predictive distribution value at \({\boldsymbol y}^n\). Alternatively, we can write the ratio in (2.1) as

\[\begin{align} \frac{\int u(\Theta)e^{\Lambda(\Theta)} d\Theta}{\int e^{\Lambda(\Theta)}d\Theta}, \tag{2.2} \end{align}\] where \[\begin{align*} \Lambda(\Theta)=\ell(\Theta)+ \log \ \pi(\Theta). \end{align*}\] Lindley (1980) developed asymptotic expansions for the ratio of integrals in (2.2) as the sample size \(n\) gets large. The idea is to obtain a Taylor series expansion of all the above functions of \(\Theta\) about \(\widehat\Theta\), the posterior mode. Lindley’s approximation to \(E[u(\Theta)|{\boldsymbol y}^n]\) is given by

\[\begin{align} E[u(\Theta)|{\boldsymbol y}^n] \approx u(\widehat\Theta)+\frac{1}{2} \biggl(\sum_{i,j=1}^p u_{i,j}~h_{i,j}+\sum_{i,j,k,l=1}^p \Lambda_{i,j,k} u_l ~h_{i,j} ~h_{k,l}\biggr), \tag{2.3} \end{align}\] where

\[\begin{align*} u_i \equiv \frac{\partial u(\Theta)}{\partial \Theta_i}~ \Big|_{\Theta = \widehat\Theta}, \hspace{0.05in} u_{i,j} \equiv \frac{\partial^2 u(\Theta)}{\partial \Theta_i \partial \Theta_j} ~\Big|_{\Theta = \widehat\Theta}, \hspace{0.05in} \Lambda_{i,j,k} \equiv \frac{\partial^3 \Lambda(\Theta)} {\partial \Theta_i \partial \Theta_j \partial \Theta_k} ~\Big|_{\Theta = \widehat\Theta}, \end{align*}\] and \(h_{i,j}\) are the elements of the negative of the inverse Hessian of \(\Lambda\) at \(\widehat\Theta\).

Lindley’s approximation (2.3) involves third order differentiation and is therefore computationally cumbersome in highly parameterized cases. Tierney and Kadane (1986) proposed an alternative approximation which involves only first and second order derivatives. This is achieved by using the mode of the product \(u(\Theta)e^{\Lambda(\Theta)}\) in addition to the mode of the posterior \(e^{\Lambda(\Theta)}\), and evaluating the second derivatives at these modes. The Tierney-Kadane method approximates \(E[u(\Theta)|{\boldsymbol y}^n]\) by

\[\begin{align} E[u(\Theta)|{\boldsymbol y}^n]\approx \Bigl( \frac{|\Sigma^\ast(\widehat\Theta^\ast)|}{|\Sigma(\widehat\Theta)|}\Bigr)^{1/2}~ \exp \bigl[\Lambda^\ast(\widehat\Theta^\ast)- \Lambda(\widehat\Theta)\bigr], \tag{2.4} \end{align}\] where \[\begin{align*} \Lambda^\ast(\Theta)= \log ~ u(\Theta)+\Lambda(\Theta), \end{align*}\] \(\widehat\Theta^\ast\) is the mode of \(\Lambda^\ast\), and \(\Sigma^\ast(\widehat\Theta^\ast)\) and \(\Sigma(\widehat\Theta)\) are the negative inverse Hessians of \(\Lambda^\ast\) and \(\Lambda\) evaluated at their respective modes.
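To make the computation concrete, here is a minimal numerical sketch in R of the Tierney-Kadane approximation (2.4), assuming a conjugate gamma-Poisson model in which the exact posterior mean is known; the simulated counts y and the prior parameters a0 and b0 are hypothetical choices used only for illustration.

# Tierney-Kadane approximation of a posterior mean, checked against the exact
# answer in a gamma-Poisson model (all names and values are hypothetical).
set.seed(1)
y  <- rpois(50, lambda = 3)   # simulated counts
a0 <- 2
b0 <- 1                       # Gamma(a0, b0) prior on lambda

# Work with Theta = log(lambda) so the integrals run over the real line;
# the Jacobian term "+ theta" converts the gamma prior density to this scale.
Lambda <- function(theta) {
  lambda <- exp(theta)
  sum(dpois(y, lambda, log = TRUE)) +
    dgamma(lambda, shape = a0, rate = b0, log = TRUE) + theta
}
# Lambda*(Theta) = log u(Theta) + Lambda(Theta), with u(Theta) = lambda
LambdaStar <- function(theta) theta + Lambda(theta)

# Laplace approximation to the log of the integral of exp(logf(theta))
laplace_log_int <- function(logf) {
  opt <- optim(0, function(th) -logf(th), method = "BFGS", hessian = TRUE)
  0.5 * log(2 * pi) - 0.5 * log(drop(opt$hessian)) - opt$value
}

tk_mean <- exp(laplace_log_int(LambdaStar) - laplace_log_int(Lambda))
exact   <- (a0 + sum(y)) / (b0 + length(y))   # exact posterior mean of lambda
c(TK = tk_mean, exact = exact)

The two calls to laplace_log_int() approximate the numerator and denominator integrals in (2.1), so their difference on the log scale reproduces the ratio in (2.4).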

2.2.1 Simplified Laplace approximation

Consider the Tierney-Kadane setup where we are interested in \(\pi(\Theta_i|{\boldsymbol y}^n)\), the posterior marginal of \(\Theta_i\) given \({\boldsymbol y}^n\). In other words, we are interested in the ratio of integrals

\[\begin{align*} \frac{\int\pi(\Theta)e^{\ell(\Theta)}d{\Theta}_{(-i)}}{\int \pi(\Theta)e^{\ell(\Theta)}d \Theta}, \end{align*}\] where \({\Theta}_{(-i)}=(\Theta_j \,|\, j=1,\ldots,p;\,j\neq i)\). Tierney and Kadane (1986) provide the Laplace approximation for \(\pi(\Theta_i|{\boldsymbol y}^n)\) as

\[\begin{align} \pi(\Theta_i|{\boldsymbol y}^n)\approx\Bigl(\frac{|\boldsymbol \Sigma^\ast(\Theta_i)|}{2\pi n|\boldsymbol \Sigma(\widehat{\Theta})|}\Bigr)^{1/2}\,\,\frac{\pi(\Theta_i,\widehat{\Theta}_{(-i)})e^{\ell(\Theta_i,\widehat{\Theta}_{(-i)})}}{\pi(\widehat{\Theta})e^{\ell(\widehat{\Theta})}}, \tag{2.5} \end{align}\] where \(\widehat{\Theta}_{(-i)}=\widehat{\Theta}_{(-i)}(\Theta_i)\) gives the maximum of \(\pi(\Theta_i,\Theta_{(-i)})e^{\ell(\Theta_i,\Theta_{(-i)})}\,\) for fixed \(\Theta_i\), and \(\boldsymbol \Sigma^\ast(\Theta_i)\) is the corresponding negative inverse Hessian evaluated at \(\widehat{\Theta}_{(-i)}\). As before, \(\boldsymbol \Sigma(\widehat\Theta)\) is the negative inverse Hessian of the log posterior of \(\Theta\) evaluated at the posterior mode \(\widehat\Theta\).

Denoting the joint distribution of \(\Theta\) and \(\boldsymbol y^{n}\) by \(\pi(\Theta,\boldsymbol y^{n})\), we can write

\[\begin{align*} \pi(\Theta_i|\boldsymbol y^{n})\propto\frac{\pi(\Theta,\boldsymbol y^{n})}{\pi(\Theta_{(-i)}|\Theta_i,\boldsymbol y^{n})}=\pi(\Theta_i,\boldsymbol y^{n}). \end{align*}\] Rue, Martino, and Chopin (2009) approximate the above as proportional to

\[\begin{align*} \pi_{LA}(\Theta_i|\boldsymbol y^{n})\propto\frac{\pi(\Theta,\boldsymbol y^{n})}{\pi_G(\Theta_{(-i)}|\Theta_i,\boldsymbol y^{n})}\bigg|_{\Theta_{(-i)}=\widehat{\Theta}_{(-i)}} \end{align*}\] where \(\pi_G(\Theta_{(-i)}|\Theta_i,\boldsymbol y^{n})\) is called a Gaussian approximation to the density \(\pi(\Theta_{(-i)}|\Theta_i,\boldsymbol y^{n})\). More specifically,

\[\begin{align*} \pi_G(\Theta_{(-i)}|\Theta_i,\boldsymbol y^{n})=N(\widehat{\Theta}_{(-i)},\boldsymbol \Sigma^\ast(\Theta_i)). \end{align*}\] As noted by the authors, \(\pi_{LA}(\Theta_i|\boldsymbol y^n)\) is equivalent to Tierney and Kadane’s posterior approximation (2.5). This is easy to see since the above will reduce to

\[\begin{align} \pi_{LA}(\Theta_i|\boldsymbol y^n)\propto\pi(\Theta_i,\widehat{\Theta}_{(-i)},\boldsymbol y^n)|\boldsymbol \Sigma^\ast(\Theta_i)|^{1/2}, \tag{2.6} \end{align}\] which is equivalent to (2.5).

Since the Laplace approximation (2.6) is computationally cumbersome when \(n\) is large, due to the evaluation of a Hessian for each \(i\), Rue, Martino, and Chopin (2009) proposed a simplified version of the approximation. The proposed approach evaluates \(\pi_G(\Theta_{(-i)}|\Theta_i,\boldsymbol y^n)\) in (2.6) using the conditional mean of \(\Theta_{(-i)}\) given \(\Theta_i\) implied by the Gaussian approximation to \(\pi(\Theta|\boldsymbol y^n)\), the complete posterior of \(\Theta\). The resulting approximation is referred to as the Simplified Laplace Approximation (SLA) by the authors. Since the conditional variance of \(\Theta_{(-i)}\) does not depend on \(\Theta_i\) in the Gaussian case, the Hessian is constant, so the SLA reduces to

\[\begin{align} \pi_{SLA}(\Theta_i|\boldsymbol y^n)\propto\pi(\Theta_i,\widehat{\Theta}_{(-i)},\boldsymbol y^n). \tag{2.7} \end{align}\] A recent discussion of the SLA and its variants can be found in Wood (2020).

2.3 INLA structure for time series

Let \({\boldsymbol y}^n\) denote the data (response vector), \({\boldsymbol x^n}\) a vector of latent Gaussian variables describing the model and \(\boldsymbol{\theta}\) be a vector of hyperparameters. The dimension of the state vector \({\boldsymbol x^n}\) is large, typically being \(n\) in time series problems, whereas the dimension of the hyperparameters \(\boldsymbol{\theta}\) is small, usually under \(10\). INLA uses a hierarchical framework to represent the underlying probabilistic structure, as discussed in the following two examples.

Gaussian Markov Random Field model

The original implementation of INLA assumes that the latent vector \({\boldsymbol x}^n\) is defined by a Gaussian Markov Random Field (GMRF) (Rue and Held 2005). The GMRF model can be expressed as a three level hierarchy consisting of the observed data \({\boldsymbol y}^n\), the latent process \({\boldsymbol x}^n\), and the unknown hyperparameters \(\boldsymbol{\theta}\). We can represent the hierarchical structure as

\[\begin{align} {\boldsymbol y^n}| {\boldsymbol x^n}, \boldsymbol{\theta} \sim \prod_t p(y_t| x_t, \boldsymbol{\theta}) \hspace{0.5in} & \mbox{data model} \tag{2.8} \\ {\boldsymbol x^n} \vert\boldsymbol{\theta} \sim N(\mathbf{0}, {\boldsymbol \Sigma}(\boldsymbol{\theta})) \hspace{0.5in} & \mbox{GMRF prior} \tag{2.9}\\ \boldsymbol{\theta} \sim \pi(\boldsymbol{\theta}) \hspace{0.5in} & \mbox{hyperprior} \tag{2.10} \end{align}\] where \(x_\ell \perp x_m \vert {\boldsymbol x^n}_{(-\ell, m)}\) for non-neighboring elements \(\ell\) and \(m\) (see Chapter 2 – Appendix). The notation “\({\boldsymbol x^n}_{(-\ell, m)}\)” indicates all the other elements of the vector \({\boldsymbol x^n}\) excluding elements \(\ell\) and \(m\). The covariance matrix \({\boldsymbol \Sigma}\) depends on the hyperparameters \(\boldsymbol{\theta}\).

Example: Local level model as a three level hierarchy

Consider the local level model described in (1.15) and (1.16). We can express this as a three level hierarchy with the observed data \({\boldsymbol y}^n\), the latent unobserved process \({\boldsymbol x^n}\), and the unknown random hyperparameters \(\boldsymbol{\theta} = (\sigma^2_v, \sigma^2_w)'\) constituting the three levels of the hierarchy. The data distribution can belong to the exponential family (see Chapter 1 – Appendix) and could be normal, binomial, gamma, or Poisson, as we will see in later chapters, but the latent process is always Gaussian. The distributions of the hyperparameters need not be Gaussian. For instance, in this example, we may assume that the hyperparameters have the following distributions:

\[\begin{align*} \frac{1}{\sigma^2_v} \sim {\rm Gamma}(a_v, b_v) \mbox{ and } \frac{1}{\sigma^2_w} \sim {\rm Gamma}(a_w, b_w), \end{align*}\] where \((a_v, b_v)\) and \((a_w, b_w)\) are specified parameters for these prior distributions. We can represent the model in a hierarchical structure as

\[\begin{align} \boldsymbol {y}^n|\boldsymbol x^n,\boldsymbol \theta\sim N(\boldsymbol x^n,\sigma_v^2\, \boldsymbol{I}_n) \hspace{0.5in} & \mbox{data model} \\ \boldsymbol x^n|\boldsymbol \theta\sim\prod_{t=1}^n \pi(x_t|x_{t-1},\boldsymbol \theta) \hspace{0.5in} & \mbox{Markov prior} \\ \boldsymbol \theta\sim \pi(\boldsymbol \theta) \hspace{0.5in} & \mbox{hyperprior} \end{align}\] where \(\pi(x_t|x_{t-1},\boldsymbol \theta)\) is \(N(x_{t-1},\sigma_w^2)\), \(\boldsymbol {I}_n\) is the \(n\)-dimensional identity matrix, and \(\pi(\boldsymbol \theta)\) is given by the product of two gamma densities.
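As a simple illustration of this hierarchy, the following R sketch simulates data from the local level model with the hyperparameters fixed at hypothetical values (rather than drawn from their gamma priors); such simulated series are useful later for checking model-fitting code.

# Simulate from the local level hierarchy above (hypothetical hyperparameter values)
set.seed(123)
n       <- 200
sigma_w <- 0.3   # state (evolution) standard deviation
sigma_v <- 1.0   # observation standard deviation
x <- cumsum(rnorm(n, mean = 0, sd = sigma_w))   # x_t = x_{t-1} + w_t, starting from x_0 = 0
y <- x + rnorm(n, mean = 0, sd = sigma_v)       # y_t = x_t + v_t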

2.3.1 INLA steps

Using the hierarchical structure, INLA describes the posterior marginals of interest as

\[\begin{align} \pi (x_i \vert {\boldsymbol y}^n) &= \int {\pi (x_i \vert \boldsymbol{\theta}, {\boldsymbol y}^n)\pi(\boldsymbol{\theta} \vert {\boldsymbol y}^n) d \boldsymbol{\theta}}, \notag \\ \pi (\theta_j \vert{\boldsymbol y}^n) &= \int {\pi(\boldsymbol{\theta} \vert {\boldsymbol y}^n )d \boldsymbol{\theta}_{(-j)}}, \end{align}\] where \(\theta_j\) denotes the \(j^{th}\) component of \(\boldsymbol{\theta}\), and \(\boldsymbol \theta_{(-j)}\) denotes all the other components except the \(j^{th}\). Under the INLA approach, these forms are used to construct nested approximations

\[\begin{align} \tilde{\pi} (x_i \vert {\boldsymbol y}^n) &= \int {\tilde{\pi} (x_i \vert \boldsymbol{\theta}, {\boldsymbol y}^n) \tilde{\pi}(\boldsymbol{\theta} \vert {\boldsymbol y}^n) d \boldsymbol{\theta}}, \notag \\ \tilde{\pi} (\theta_j \vert {\boldsymbol y}^n) &= \int {\tilde{\pi}(\boldsymbol{\theta} \vert {\boldsymbol y}^n)d\boldsymbol{\theta}_{(-j)}}, \end{align}\] where \(\tilde{\pi}(.|.)\) is an approximated conditional density. The INLA approach consists of three steps.

Step 1. First, INLA approximates the posterior density of the hyperparameters, \(\pi(\boldsymbol{\theta} \vert {\boldsymbol y}^n)\), by using a Laplace approximation. To achieve this, the full conditional distribution of \({\boldsymbol x^n}\), \(\pi ({\boldsymbol x^n}\vert \boldsymbol{\theta}, {\boldsymbol y}^n)\), is approximated by a multivariate Gaussian density \(\tilde{\pi}_G ({\boldsymbol x}^n\vert \boldsymbol{\theta},{\boldsymbol y}^n)\). Then, the posterior density of the hyperparameters is approximated by the Laplace approximation

\[\begin{align} \tilde{\pi}(\boldsymbol{\theta} \vert {\boldsymbol y}^n) \propto \frac{\pi({\boldsymbol x^n}, \boldsymbol{\theta}, {\boldsymbol y}^n)} {\tilde{\pi}_G ({\boldsymbol x^n}\vert \boldsymbol{\theta}, {\boldsymbol y}^n)} \bigg|_{\boldsymbol x^{n}=\widehat{\boldsymbol x}^{n}}, \tag{2.11} \end{align}\] where \(\widehat{\boldsymbol x}^{n}\) is the mode of the full conditional distribution \(\pi ({\boldsymbol x^n}| \boldsymbol{\theta}, {\boldsymbol y}^n)\). This can be done using the Newton-Raphson algorithm.

Step 2. Second, INLA approximates \(\pi(x_i \vert \boldsymbol{\theta},{\boldsymbol y}^n)\), the conditional distributions of the latent Gaussian variables given selected hyperparameter values and the data, using one of three methods. In order of increasing accuracy, they are

  1. the Gaussian approximation,

  2. the simplified Laplace approximation, and

  3. the Laplace approximation.

Step 3. The last step uses numerical integration, i.e.,

\[\begin{align} \tilde{\pi}(x_i \vert {\boldsymbol y}^n) = \sum_j \tilde{\pi}(x_i \vert \boldsymbol \theta_j,{\boldsymbol y}^n) \tilde{\pi}(\boldsymbol \theta_j \vert {\boldsymbol y}^n) \Delta_j \end{align}\] to integrate out the hyperparameters, and obtains an approximation of the posterior distribution of the latent Gaussian variables.
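The following R sketch illustrates the mechanics of Step 3, with a hypothetical grid of hyperparameter values, weights, and conditional densities standing in for the quantities that INLA computes internally.

# Step 3 as a finite mixture over a hyperparameter grid (stand-in quantities)
theta_grid <- seq(-2, 2, by = 0.5)                   # grid points theta_j
log_wt     <- dnorm(theta_grid, 0, 1, log = TRUE)    # stand-in for pi~(theta_j | y) Delta_j
wt         <- exp(log_wt - max(log_wt))
wt         <- wt / sum(wt)                           # normalized weights

x_grid <- seq(-4, 6, length.out = 200)
# stand-in conditionals pi~(x_i | theta_j, y): Gaussians with theta-dependent mean
cond <- sapply(theta_grid, function(th) dnorm(x_grid, mean = 1 + th, sd = 1))
marginal <- as.vector(cond %*% wt)                   # pi~(x_i | y) evaluated on x_grid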

Although the Laplace approximation in Step 2 is preferred in general, it can be computationally expensive. If the Gaussian approximation or the simplified Laplace approximation delivers a small Kullback-Leibler divergence (see Chapter 2 – Appendix for a definition) between \(\pi(x_i \vert {\boldsymbol y}^n)\) and \(\tilde{\pi}( x_i \vert {\boldsymbol y}^n)\), then these methods are considered acceptable. Ruiz-Cárdenas, Krainski, and Rue (2012) describe simulation studies and examples to illustrate fast and accurate Bayesian inference for dynamic models using the R-INLA package.

2.4 Forecasting in INLA

One of the main objectives of time series analysis is the prediction (or forecasting) of future values of \(Y_t,~t=n+1,n+2,\ldots\) given observed data \({\boldsymbol y^n}\) up to time period \(n\). As discussed in Section 1.3, the predictive distribution of \(Y_{n+1}\) given \({\boldsymbol y^n}\) can be obtained, via the rules of probability, as

\[\begin{align} p(y_{n+1}\vert\boldsymbol y^n)=\int p( y_{n+1}\vert x_{n+1}, \boldsymbol \theta,\boldsymbol y^n) \pi(x_{n+1}, \boldsymbol \theta \vert{\boldsymbol y^n}) d x_{n+1} d \boldsymbol \theta. \tag{2.12} \end{align}\] Due to the conditional independence of \(Y_t\)’s given \(x_t\) and \(\boldsymbol \theta\), (2.12) reduces to

\[\begin{align} p(y_{n+1}\vert\boldsymbol y^n)=\int p( y_{n+1}\vert x_{n+1}, \boldsymbol \theta) \pi(x_{n+1}, \boldsymbol \theta \vert{\boldsymbol y^n}) d x_{n+1} d \boldsymbol \theta, \tag{2.13} \end{align}\] which is referred to as the one-step ahead forecast distribution. Generalization of (2.13) to the \(k\)-step ahead forecast distribution for \(Y_{n+k}\) is given by

\[\begin{align} p(y_{n+k}\vert\boldsymbol y^n)=\int p( y_{n+k}\vert x_{n+k}, \boldsymbol \theta) \pi(x_{n+k}, \boldsymbol \theta \vert{\boldsymbol y^n}) d x_{n+k} d \boldsymbol \theta, \tag{2.14} \end{align}\] for \(k \ge 1\).

INLA provides us with approximations to the marginal posteriors \(\pi(\boldsymbol x^{n}\vert\boldsymbol y^{n})\) (marginalized over, or integrating out, the hyperparameter vector \(\boldsymbol{\theta}\)) and \(\pi(\boldsymbol \theta\vert\boldsymbol y^{n})\) (marginalized over the state vector \(\boldsymbol{x}^n\)), as well as samples from these marginal posterior distributions. To evaluate the one-step ahead predictive distribution in (2.13), we can write

\[\begin{align} \pi(x_{n+1}, \boldsymbol \theta \vert{\boldsymbol y^n})=\pi(x_{n+1}\vert \boldsymbol \theta, \boldsymbol y^{n}) \pi(\boldsymbol \theta\vert\boldsymbol y^{n}). \end{align}\] Since samples from \(\pi(\boldsymbol \theta\vert\boldsymbol y^{n})\) are available, (2.13) can be evaluated by simulation, i.e., by drawing samples from \(\pi(x_{n+1}\vert \boldsymbol \theta, \boldsymbol y^{n})\). This is achieved in INLA by approximating the conditional posterior \(\pi(x_{n+1}\vert \boldsymbol \theta, \boldsymbol y^{n})\) and then drawing samples from the approximated distribution. For the case of a Gaussian observation equation as in (1.15), \(\pi(x_{n+1}\vert \boldsymbol \theta, \boldsymbol y^{n})\) can be approximated via a Gaussian distribution, but in general, more accurate approximations are used. A similar strategy is implemented to obtain the \(k\)-step ahead forecast distribution (2.14).

It is important to note that INLA was developed as a general approach for Bayesian analysis using the GMRF structure for \(x_t\)’s. As a result, it does not exploit the typical Markov evolution of \(x_t\)’s in the dynamic model framework to simulate the forecast distributions.
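In practice, forecasts from R-INLA are usually obtained by appending the future time points to the data with missing (NA) responses, anticipating the prediction mechanism described in Section 2.6. The sketch below assumes a hypothetical series y and a random walk (rw1) latent specification; for a Gaussian observation equation, the fitted values at the NA positions summarize the posterior of the latent level \(x_{n+k}\), and the forecast distribution of \(y_{n+k}\) additionally incorporates the observation noise.

# Sketch of k-step ahead forecasting via NA augmentation (hypothetical data)
library(INLA)
set.seed(1)
n <- 100
k <- 5
y <- cumsum(rnorm(n, sd = 0.3)) + rnorm(n)   # a hypothetical observed series
df_aug <- data.frame(y  = c(y, rep(NA, k)),  # NA's for times n+1, ..., n+k
                     id = 1:(n + k))
fit <- inla(y ~ -1 + f(id, model = "rw1"),
            family = "gaussian",
            data = df_aug,
            control.predictor = list(compute = TRUE))
tail(fit$summary.fitted.values, k)           # posterior summaries at the future time points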

2.5 Marginal likelihood computation in INLA

Following Rue, Martino, and Chopin (2009), the marginal likelihood is computed as

\[\begin{align} m({\boldsymbol y^n} \vert M_k) \approx \int \frac{p({\boldsymbol y^n},\boldsymbol{\theta},\boldsymbol{x}^n \vert M_k)} {\tilde{\pi}_G (\boldsymbol{x}^n \vert \boldsymbol y^n, \boldsymbol{\theta}, M_k )} d\boldsymbol{\theta}, \tag{2.15} \end{align}\] where the right side is evaluated at \(\boldsymbol x^{n}=\widehat{\boldsymbol x}^{n}(M_k)\), the mode of the conditional posterior density \(\pi(\boldsymbol x^{n}|\boldsymbol y^n,\boldsymbol \theta,M_k)\) for model \(M_k\). The cross validation marginal likelihood for a model \(M_k\) can also be approximated using INLA. The individual conditional predictive ordinate (CPO) for observation \(y_t\) is given by

\[\begin{align*} p(y_t|\boldsymbol y_{(-t)},M_k)=\int p(y_t|\boldsymbol y_{(-t)}, \boldsymbol \theta,M_k)\,\pi(\boldsymbol \theta|\boldsymbol y_{(-t)},M_k)d\boldsymbol \theta, \end{align*}\] where \(\boldsymbol y_{(-t)}\) denotes all elements of \(\boldsymbol y^n\) except \(y_t\). As noted by Held, Schrödle, and Rue (2010), evaluation of the CPO requires \(p(y_t|\boldsymbol y_{(-t)},\boldsymbol \theta,M_k)\), which can be written as

\[\begin{align*} p(y_t|\boldsymbol y_{(-t)},\boldsymbol \theta,M_k)=1 \bigg/\int \frac{\pi(x_t|\boldsymbol y^n,\boldsymbol \theta,M_k)}{p(y_t|x_t,\boldsymbol \theta,M_k)}dx_t, \end{align*}\] where the numerator in the ratio, \(\pi(x_t|\boldsymbol y^n,\boldsymbol \theta,M_k)\), is obtained using INLA and the denominator is simply the likelihood contribution of \(y_t\). The integral is evaluated by numerical integration. Once we have the approximation \(\tilde p(y_t|\boldsymbol y_{(-t)},\boldsymbol \theta,M_k)\), the CPO term can be written as

\[\begin{align} \tilde p(y_t|\boldsymbol y_{(-t)},M_k)=1 \bigg/\sum_j \frac{\tilde \pi(\boldsymbol \theta_j|\boldsymbol y^n,M_k)}{\tilde p(y_t|\boldsymbol y_{(-t)},\boldsymbol \theta_j,M_k)}\Delta_j. \tag{2.16} \end{align}\] The INLA estimator (2.16) is a weighted harmonic mean of the \(\tilde p(y_t|\boldsymbol y_{(-t)},\boldsymbol \theta_j,M_k)\) terms with weights \(\tilde \pi(\boldsymbol \theta_j|\boldsymbol y^n,M_k) \Delta_j\). Given the individual CPO terms, we can obtain the cross validation marginal likelihood as their product.
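A minimal sketch of how these quantities are requested from R-INLA (anticipating the usage described in Section 2.6) is shown below; the data frame and formula are hypothetical, and the log cross validation marginal likelihood is obtained as the sum of the log CPO terms.

# Marginal likelihood and CPO terms from R-INLA (hypothetical data and formula)
library(INLA)
set.seed(1)
df  <- data.frame(y = rnorm(100), id = 1:100)
fit <- inla(y ~ 1 + f(id, model = "iid"),
            family = "gaussian",
            data = df,
            control.compute = list(mlik = TRUE, cpo = TRUE))
fit$mlik               # log marginal likelihood estimates
sum(log(fit$cpo$cpo))  # log cross validation marginal likelihood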

2.6 R-INLA package – some basics

Integrated Nested Laplace Approximation (INLA) is implemented as an R package called INLA or R-INLA and is available from a specific repository at http://www.r-inla.org for Windows, Mac OS X, and Linux. The R-INLA website also includes extensive documentation about the package, examples, a discussion forum, and other resources about the theory and applications of INLA.

Note that the R-INLA package is not available from the Comprehensive R Archive Network (https://cran.r-project.org/) because it uses some external C libraries that make it difficult to build the binaries. Therefore, when installing the package, we need to use install.packages(), adding the URL of the R-INLA repository. A simple way to install the stable version is shown below. For the testing version, simply replace stable by testing when setting the repository.

install.packages(
  "INLA",
  repos = c(getOption("repos"),
            INLA = "https://inla.r-inla-download.org/R/stable"),
  dep = TRUE
)

To load the package in R once it is successfully installed, we type the following.

library(INLA)

The main function in the R-INLA package is called inla(). We will use this function together with formula() to set up and fit dynamic Bayesian models to time series using the INLA framework. In fitting a specified model to data, the inla() function is similar in usage to the R functions lm(), glm(), arima(), etc.

We can handle many different types of data/model combinations under the R-INLA umbrella. These include univariate time series with and without exogenous regressors, panel (multi-level) time series, multivariate time series, and discrete-valued time series such as binary or count time series. The modeling frameworks that we discuss in upcoming chapters include Gaussian linear models, hierarchical Gaussian linear models, generalized linear models with binomial, Poisson, or gamma sampling distributions, level correlated models for multivariate non-Gaussian time series, stochastic volatility models to handle time-varying heterogeneity, dynamic spatio-temporal models, etc. There are two main steps for fitting a model to data:

Step 1. Express the linear predictor part of the model as a formula object in R. This is similar to the expression on the right side of the familiar lm(y ~ formula) call, and produces no output. Rather, formula becomes one of the inputs for the inla() call. We can specify fixed effects or random effects in the formulation.

  • The fixed effect specification is similar to the way lm() or glm() includes such effects. The variables denoting the fixed effects are given in the formula, separated by “+”. Fixed effects will have coefficients which are usually assigned vague priors, for instance, a N\((0, V^2)\) prior with large variance \(V^2\). The prior parameters can be set through the option control.fixed in the inla() call. INLA allows two types of fixed effects, linear and clinear. These will be discussed in later chapters.

  • A random effect is specified using the f() function, which includes an index to map the effect to the observations, specify the type of effect, etc. The effect is then included in the formula, where it is separated from other effects by a “+”. Random effects will usually be assigned a common Gaussian prior with zero mean and an (unknown) precision, to which a prior (such as a gamma prior) may be assigned.

As we will see in examples throughout this book, the syntax of the formula consists of the response variable, followed by the ~ symbol, and then the fixed and/or random effects separated by + operators. Here is an example of a formula for a response \(y\) as a linear function of an intercept, two fixed effects predictors z1 and z2, and an additive i.i.d. error represented by id, which is an index taking values \(1:n\).

formula <- y ~ z1 + z2 + f(id, model = "iid")

If we wish to exclude the intercept from the model, we modify the code as follows (recall that lm() uses the same convention):

formula <- y ~ z1 + z2 + f(id, model = "iid") -1

or

formula <- y ~ 0 + z1 + z2 + f(id, model = "iid")

We include an i.i.d. random variable \(x\) corresponding to a random effect by modifying the code as follows:

formula <- y ~ 0 + z1 + z2 + f(x, model = "iid") + f(id, model = "iid") 

Step 2. We fit the model by calling the inla() function with the following input arguments (a schematic call is sketched after this list):

  1. a specified formula (described in Step 1);

  2. a family for the data distribution; this is a string or a vector of strings to indicate the sampling distribution of the response (for constructing the likelihood function). Common families are Gaussian (default), Poisson, binomial, etc. A list of families is available by typing names(inla.models()$likelihood). Details about a specific family can be seen with inla.doc("familyname");

  3. the data, which consists of the data frame including the response, predictors, required mapping indexes, etc.;

  4. a set of control options, such as

  1. control.compute, which indicates which model selection criteria to compute;

  2. control.predictor, which gives details about the linear predictor, such as the link function, or an indication about computation of posterior marginal distributions;

  3. control.family, which enables us to specify priors for parameters involved in the likelihood.
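Here is a schematic inla() call that puts these arguments together; the data frame, formula, and control settings are hypothetical placeholders rather than a recommended specification.

# A schematic inla() call (hypothetical data, formula, and settings)
library(INLA)
set.seed(1)
df <- data.frame(y = rpois(50, lambda = 4), z1 = rnorm(50), id = 1:50)

formula <- y ~ z1 + f(id, model = "iid")

fit <- inla(formula,
            family = "poisson",                              # sampling distribution of the response
            data = df,                                       # data frame with response, predictors, indexes
            control.compute = list(dic = TRUE, waic = TRUE, cpo = TRUE),
            control.predictor = list(compute = TRUE),        # compute marginals of the linear predictor
            control.family = list(),                         # default settings for the likelihood
            control.fixed = list(prec = 0.001))              # vague priors on the fixed effects
summary(fit)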

The inla() call returns an object containing information from the fitted model, including

  1. Posterior summaries and marginals of parameters, hyperparameters, linear predictors, and fitted values, as well as a column kld giving the symmetric Kullback-Leibler divergence between the Gaussian approximation and the simplified or full Laplace approximation of each posterior marginal. R-INLA also provides a set of functions for post processing the posterior results.

  2. Estimates of different criteria to assess and compare Bayesian models, such as marginal likelihood, conditional predictive ordinate (CPO), deviance information criterion (DIC), and the Watanabe-Akaike information criterion (WAIC).

Prior specification of parameters/hyperparameters is an important aspect of Bayesian modeling. R-INLA allows a user to assume default specifications for some items, while other items must be explicitly specified. We discuss prior specifications in Chapter 3.

It is worth noting that R-INLA does not have an explicit predict() function as we are used to seeing in lm() or glm(). We handle this by assigning NA’s to responses that we wish to predict and including them with the observed data in the data frame.

INLA provides a few functions that operate on marginals. These functions and their usage are described in Chapter 14. We use inla.tmarginal() and inla.zmarginal() in Chapter 3, while other functions are used in later chapters as needed.

For inla.tmarginal(), the syntax is

inla.tmarginal(fun, marginal, n=1024, h.diff = .Machine$double.eps^(1/3),
               method = c("quantile", "linear"),  ...) 

In the above chunk, the arguments as described in the R-INLA documentation are

  • marginal is a marginal object from either inla() or inla.hyperpar(). This can either be list(x=c(), y=c()) with density values y at locations x, or a matrix(,n,2) for which the density values are in the second column and the locations are in the first column. The inla.hpdmarginal() function assumes a unimodal density;

  • fun is a (vectorized) function such as function(x) exp(x).

The other arguments in inla.tmarginal() may be left at their default INLA specifications. The user only needs to specify the inverse transformation function through fun. For example, as we will discuss further in Chapter 3, suppose \(\theta_v = \log(1/\sigma^2_v)\) is the internal INLA representation for a precision parameter \(1/\sigma^2_v\), where \(\sigma^2_v\) is a variance parameter. Then inla.tmarginal() helps us recover information about the posterior marginal distribution of \(\sigma^2_v\), via the inverse transformation from \(\theta_v\) back to \(\sigma^2_v\), which a user defines through fun as \(\exp(-\theta_v)\).

The syntax for inla.zmarginal() is

inla.zmarginal(marginal, silent = FALSE)
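Putting the two functions together, the following sketch recovers the posterior marginal of the variance \(\sigma^2_v\) from INLA’s internal (log precision) scale, as discussed above; the simulated Gaussian data are hypothetical.

# Recover the posterior marginal of sigma_v^2 from the log precision marginal
library(INLA)
set.seed(1)
df  <- data.frame(y = rnorm(200, mean = 2, sd = 1.5))
fit <- inla(y ~ 1, family = "gaussian", data = df)

# the first internal hyperparameter here is theta_v = log(1/sigma_v^2)
theta_marg <- fit$internal.marginals.hyperpar[[1]]
var_marg   <- inla.tmarginal(function(x) exp(-x), theta_marg)  # sigma_v^2 = exp(-theta_v)
inla.zmarginal(var_marg)   # posterior summaries of sigma_v^2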

In most examples, we will need fitted values of the responses. When we run the inla() function, setting compute = TRUE in the control.predictor option will produce output which includes the following items:

  • summary.linear.predictor: a data frame showing the posterior mean, standard deviation, and quantiles of the linear predictors;

  • summary.fitted.values: a data frame with the posterior mean, standard deviation, and quantiles of the fitted values obtained by transforming the linear predictors by the inverse of the link function;

  • marginals.linear.predictor: a list consisting of posterior marginals of the linear predictors;

  • marginals.fitted.values: a list consisting of posterior marginals of the fitted values obtained by transforming the linear predictors via the inverse of the link function (in the Gaussian case, it is the identity link).

If we use the identity link function, as in Gaussian response models, the output from summary.linear.predictor and summary.fitted.values will coincide, and output from marginals.linear.predictor and marginals.fitted.values will coincide.

For more details, see Martino and Riebler (2014). Hubin and Storvik (2016) describe marginal likelihood computation, while Gómez-Rubio, Bivand, and Rue (2020) describe Bayesian model averaging using R-INLA.

Chapter 2 – Appendix

This appendix contains concepts and properties that are important for understanding the theory and methods that underlie INLA.

Gaussian Markov Random Field (GMRF)

Let \({\boldsymbol x^n} = (x_1,\ldots,x_n)'\) have a normal distribution. Further, suppose it has Markov properties, i.e., for certain pairs \((i,j)\), \(x_i \perp x_j | {\boldsymbol x^n}_{(-ij)}\), so that \(x_i\) and \(x_j\) are conditionally independent given all other components of \({\boldsymbol x^n}\), denoted by \({\boldsymbol x^n}_{(-ij)}\).

Consider an undirected graph \(\mathcal{G} =\{\mathcal{V},\mathcal{E}\}\), where \(\mathcal{V} =\{1,2,\ldots,n\}\) denotes the vertices and \(\mathcal{E} \subseteq \{(i,j):\, i,j=1,\ldots,n\}\) denotes the edges. If \(x_i \perp x_j | {\boldsymbol x^n}_{(-ij)}\), then there is no edge between vertices \(i\) and \(j\) in the graph. On the other hand, there is an edge between \(i\) and \(j\) if \(x_i\) is not conditionally independent of \(x_j\) given \({\boldsymbol x^n}_{(-ij)}\).

Suppose the random vector \({\boldsymbol x^n}\) has mean \(\boldsymbol{\mu}\) and precision matrix \(\boldsymbol{\Omega}\), which is the inverse of the variance-covariance matrix of \({\boldsymbol x^n}\). Then, \({\boldsymbol x}^n\) is called a GMRF with respect to a graph \(\mathcal{G}\) if and only if its p.d.f. has the form

\[\begin{align} \pi({\boldsymbol x^n}) = (2\pi)^{-n/2}|\boldsymbol{\Omega}|^{1/2} \exp\bigg( -\frac{1}{2}({\boldsymbol x^n} - \boldsymbol{\mu})' \boldsymbol{\Omega}({\boldsymbol x^n}- \boldsymbol{\mu}) \bigg), \end{align}\] with \(\Omega_{ij} \neq 0\) if and only if \((i,j) \in \mathcal{E}\), for all \(i \neq j\).
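As an illustration, consider a zero-mean stationary AR(1) process \(x_t = \phi x_{t-1} + w_t\) with unit innovation variance, a simple GMRF: its precision matrix is tridiagonal, so \(\Omega_{ij}=0\) (no edge) whenever \(|i-j|>1\), even though the covariance matrix is dense. The R sketch below uses hypothetical values of \(n\) and \(\phi\).

# Tridiagonal precision matrix of a stationary AR(1) GMRF (hypothetical n, phi)
n   <- 6
phi <- 0.7
Omega <- diag(c(1, rep(1 + phi^2, n - 2), 1))   # diagonal entries
Omega[cbind(1:(n - 1), 2:n)] <- -phi            # superdiagonal
Omega[cbind(2:n, 1:(n - 1))] <- -phi            # subdiagonal
round(solve(Omega), 3)                          # dense covariance: phi^|i-j| / (1 - phi^2)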

Kullback-Leibler divergence

The Kullback-Leibler (KL) divergence is often referred to as relative entropy. It measures how one probability distribution differs from a reference probability distribution (Kullback and Leibler 1951). If \(U\) and \(V\) are continuous-valued random variables with respective p.d.f.s \(f(.)\) and \(g(.)\), the (KL) divergence from \(g(.)\) to \(f(.)\) is defined as

\[\begin{align} D_{KL}(f \parallel g) = \int_{-\infty}^{\infty} f(u)\log \frac{f(u)}{g(u)}du. \tag{2.17} \end{align}\] The KL divergence is always non-negative, i.e., \(D_{KL}(f \parallel g) \geq 0\) (Gibbs inequality), with equality holding if and only if \(f(u) = g(u)\) almost everywhere.

If \(f(.)\) represents a true distribution, we can find an approximation \(g(.)\) by minimizing the KL-divergence (2.17). This is because \(D_{KL}(f \parallel g)\) is the amount of information lost when \(f(.)\) is approximated by \(g(.)\). More details about the KL divergence can be found in Kullback (1959).
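As a quick numerical check of (2.17), the sketch below computes the KL divergence between two hypothetical univariate Gaussian densities by numerical integration and compares it with the known closed form for this case.

# KL divergence between two Gaussians: numerical integration vs. closed form
f_log <- function(u) dnorm(u, mean = 0,   sd = 1,   log = TRUE)   # density f
g_log <- function(u) dnorm(u, mean = 0.5, sd = 1.5, log = TRUE)   # density g

kl_numeric <- integrate(function(u) exp(f_log(u)) * (f_log(u) - g_log(u)),
                        lower = -Inf, upper = Inf)$value
# closed form for Gaussians: log(s2/s1) + (s1^2 + (m1 - m2)^2) / (2 * s2^2) - 1/2
kl_closed  <- log(1.5 / 1) + (1^2 + (0 - 0.5)^2) / (2 * 1.5^2) - 0.5
c(numeric = kl_numeric, closed_form = kl_closed)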

References

De Bruijn, Nicolaas Govert. 1981. Asymptotic Methods in Analysis. New York: Dover Publications.
Fahrmeir, Ludwig, and Gerhard Tutz. 2001. “Models for Multicategorical Responses: Multivariate Extensions of Generalized Linear Models.” In Multivariate Statistical Modelling Based on Generalized Linear Models, 69–137. Springer.
Gómez-Rubio, Virgilio, Roger S Bivand, and Håvard Rue. 2020. “Bayesian Model Averaging with the Integrated Nested Laplace Approximation.” Econometrics 8 (2): 23.
Held, Leonhard, Birgit Schrödle, and Håvard Rue. 2010. “Posterior and Cross-Validatory Predictive Checks: A Comparison of MCMC and INLA.” In Statistical Modelling and Regression Structures, 91–110. Springer.
Hubin, Aliaksandr, and Geir Storvik. 2016. “Estimating the Marginal Likelihood with Integrated Nested Laplace Approximation (INLA).” https://arxiv.org/abs/1611.01450.
Kullback, Solomon. 1959. Information Theory and Statistics. New York: John Wiley & Sons.
Kullback, Solomon, and Richard A. Leibler. 1951. “On Information and Sufficiency.” The Annals of Mathematical Statistics 22 (1): 79–86.
Lauritzen, Steffen L. 1996. Graphical Models. Vol. Oxford Statistical Science Series 17. Oxford: Clarendon Press.
Lindley, Dennis V. 1961. “The Use of Prior Probability Distributions in Statistical Inference and Decision.” In Proc. 4th Berkeley Symp. On Math. Stat. And Prob, 453–68.
———. 1980. “Approximate Bayesian Methods.” Trabajos de Estadística y de Investigación Operativa 31 (1): 223–45.
Martino, Sara, and Andrea Riebler. 2014. “Integrated Nested Laplace Approximations (INLA).” Wiley StatsRef: Statistics Reference Online, 1–19.
Rue, Håvard, and Leonhard Held. 2005. Gaussian Markov Random Fields: Theory and Applications. Boca Raton, Florida: Chapman & Hall/CRC.
Rue, Håvard, Sara Martino, and Nicholas Chopin. 2009. “Approximate Bayesian Inference for Latent Gaussian Models Using Integrated Nested Laplace Approximations (with Discussion).” Journal of the Royal Statistical Society, Series B 71: 319–92.
Ruiz-Cárdenas, Ramiro, Elias T. Krainski, and Håvard Rue. 2012. “Direct Fitting of Dynamic Models Using Integrated Nested Laplace Approximations—INLA.” Computational Statistics & Data Analysis 56 (6): 1808–28.
Tierney, Luke, and Joseph B. Kadane. 1986. “Accurate Approximations for Posterior Moments and Marginal Densities.” Journal of the American Statistical Association 81 (393): 82–86.
Wood, Simon N. 2020. “Simplified Integrated Nested Laplace Approximation.” Biometrika 107 (1): 223–30.