Chapter 1 Bayesian Analysis
1.1 Introduction
This chapter gives a brief introduction to the Bayesian framework for statistical analysis. In the Bayesian approach, the posterior probability distributions of parameters (which are treated as random quantities) have a central role. Based on the likelihood function of the parameters, given data, and a prior distribution on the parameters, the posterior distribution is obtained via an application of Bayes’ theorem. In many complex modeling situations, including most time series analyses, closed form expressions for the posterior distributions are rarely available. One exception, of course, is the Gaussian Dynamic Linear Model (DLM), for which Kalman filtering and smoothing equations offer closed form expressions for the posterior distribution of the unknown state variable. To handle complex situations, sampling based Bayesian approaches can be used for posterior inference and prediction; see Tanner and Wong (1987), Gelfand and Smith (1990), and D. Gamerman and Lopes (2006). General solutions via Markov Chain Monte Carlo (MCMC) methods have been provided to practitioners by software such as R, WinBUGS (Lunn et al. 2000), STAN, or JAGS (Plummer et al. 2003). While MCMC is an asymptotically exact method for posterior analysis, it does not scale well and can be slow in generating random samples from the joint posterior of the parameters, especially in high dimensions and complex data situations. The computational complexity can be high even under distributed (parallel) computing environments. Under limited sampling, convergence of the MCMC algorithms may become an issue, and one cannot have much confidence in the accuracy of the results.
An alternative to MCMC for large, complex problems is provided by variational inference (Blei, Kucukelbir, and McAuliffe 2017). See Korobilis and Koop (2018) for a discussion of variational Bayesian (VB) inference for state-space models, and L. R. Berry and West (2020) for use of VB in modeling time series of multivariate counts. Application of VB methods may not be straightforward for many practitioners due to limited availability of software at the current time.
The Integrated Nested Laplace Approximation (INLA) (Rue, Martino, and Chopin 2009) provides a fast alternative to MCMC methods for Bayesian modeling in complex problems when exact, closed form solutions are not available.
Research on the use of INLA within R has been prolific in recent years, with several books and articles describing how to use R-INLA in different settings. In this book, we describe the use of R-INLA for state space modeling of time series data. The use of dynamic models for handling a variety of time series will be of interest to a wide audience of academics and practitioners who work with data that are observed over time. Ruiz-Cárdenas, Krainski, and Rue (2012) describe the use of R-INLA for Dynamic Linear Models (DLMs).
Section 1.2 gives a brief review of the Bayesian framework. In Section 1.3, we review the setting for a Bayesian analysis of time series data. Section 1.4 introduces Gaussian DLMs, while Section 1.5 gives an overview of hierarchical Gaussian and non-Gaussian models.
1.2 Bayesian framework
As noted by Lindley (1983), the Bayesian approach is based on the premise that “the only satisfactory description of uncertainty is by means of probability.” This suggests that uncertainty about any unknown quantity, observed or unobserved (such as parameters), should be described probabilistically, and uncertainties must be combined using the rules of probability.
To introduce some notation, we let \(Y\) be a random variable, i.e., the observable random quantity, with realization \(y\), and let \(\Theta\) denote an unknown (latent) parameter. The unobserved random quantity \(\Theta\) can be a scalar or a vector. Suppose that a probability model is specified with a p.d.f. (or p.m.f.) \(p(y|\Theta)\). For example, if \(Y \sim N(\mu, \sigma^2)\), then \(\Theta =(\mu, \sigma^2)\), and if \(Y \sim Poisson(\lambda)\), then \(\Theta = \lambda\). In Chapter 1 – Appendix, we give a brief review of conditional distributions. Before observing any data, uncertainty about the unknown quantity \(\Theta\) is described by \(\pi(\Theta)\), which is referred to as the prior distribution of \(\Theta\), as it reflects our a priori knowledge/belief about \(\Theta\). After we observe the data, \(\boldsymbol y^n=(y_{1},\ldots, y_{n})'\), we update our prior beliefs about \(\Theta\) to the posterior distribution \(\pi(\Theta |\boldsymbol y^n)\) using the calculus of probability, i.e., via Bayes’ theorem as
\[\begin{align} \pi(\Theta |\boldsymbol y^n) \propto \pi(\Theta) L(\boldsymbol y^n|\Theta), \tag{1.1} \end{align}\] where \(L(\boldsymbol y^n|\Theta)\) is the likelihood function. The likelihood \(L(\boldsymbol y^n|\Theta)\) is obtained by evaluating the joint p.d.f. of \((Y_{1}, \ldots, Y_{n})\) at the observed data \(\boldsymbol y^n\) for different values of \(\Theta\). Thus, it is a function of \(\Theta\) and can be interpreted as a scale of comparative support given by the observed data \(\boldsymbol y^n\) to various possible values of \(\Theta\). To reflect this, we find it more convenient to write the likelihood function as \(L(\Theta; \boldsymbol y^n)\) and to denote the log likelihood by \(\ell(\Theta; \boldsymbol y^n)\).
The likelihood principle states that the totality of information about \(\Theta\) provided by \(\boldsymbol y^n\) is contained in the likelihood function; this principle plays an important role in the Bayesian framework. As a result, all inference about \(\Theta\) will be based on the posterior distribution, which considers only the observed data \(\boldsymbol y^n\), but all possible values of \(\Theta\). The normalizing constant for (1.1) is given by
\[\begin{align} m(\boldsymbol y^n) = \int L(\Theta; \boldsymbol y^n) \pi(\Theta) d \Theta, \tag{1.2} \end{align}\] which is referred to as the marginal likelihood since it does not depend on \(\Theta\). Computation of the marginal likelihood follows from the rules of probability since
\[\begin{align} p(\boldsymbol y^n)=\int p(\boldsymbol y^n|\Theta) \pi(\Theta) d \Theta, \tag{1.3} \end{align}\] where, assuming the observations are conditionally independent given \(\Theta\), \(p(\boldsymbol y^n|\Theta)=\prod_{j=1}^{n} p(y_j|\Theta)\). To evaluate the marginal likelihood \(m(\boldsymbol y^n)\), we replace \(p(\boldsymbol y^n|\Theta)\) by \(L(\Theta; \boldsymbol y^n)\) in (1.3), once data become available. Prior to observing \(\boldsymbol y^n\), \(p(\boldsymbol y^n)\) is referred to as the prior predictive distribution of \((Y_{1}, \ldots, Y_{n})\).
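To make (1.2) and (1.3) concrete, the short sketch below evaluates the marginal likelihood by numerical integration for a normal model with known variance; the simulated data, prior mean \(m_0\), and prior variance \(C_0\) are illustrative assumptions.

```r
# Marginal likelihood m(y^n) via numerical integration, cf. (1.2)-(1.3)
# Model: y_j ~ N(mu, sigma2), sigma2 known; prior: mu ~ N(m0, C0)
set.seed(1)
n <- 50; sigma2 <- 1                 # known observation variance (assumed)
m0 <- 0; C0 <- 4                     # assumed prior mean and variance
y <- rnorm(n, mean = 1, sd = sqrt(sigma2))

# integrand: L(mu; y^n) * pi(mu), vectorized over mu for integrate()
integrand <- function(mu) {
  sapply(mu, function(m) {
    prod(dnorm(y, mean = m, sd = sqrt(sigma2))) * dnorm(m, m0, sqrt(C0))
  })
}
m_y <- integrate(integrand, lower = -Inf, upper = Inf)$value
log(m_y)                             # log marginal likelihood
```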
Once the posterior distribution \(\pi(\Theta |\boldsymbol y^n)\) is available, the posterior predictive distribution of a future (new) observation \(\boldsymbol y^\ast\) is given by
\[\begin{align} p(\boldsymbol y^\ast|\boldsymbol y^n) = \int p(\boldsymbol y^\ast|\Theta) \pi(\Theta|\boldsymbol y^n) d \Theta. \tag{1.4} \end{align}\] As noted by Ebrahimi, Soofi, and Soyer (2010), the concept of prediction, which is usually an afterthought in classical statistics, arises naturally in the Bayesian approach as a consequence of using rules of probability. The predictive distribution is an essential component of the Bayesian approach in time series analysis.
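In practice, the integral in (1.4) is often approximated by Monte Carlo: draw \(\Theta\) from the posterior, then draw \(y^\ast\) given \(\Theta\). A minimal sketch for the conjugate normal model of the previous example (all numerical settings are again assumptions):

```r
# Monte Carlo draws from the posterior predictive p(y* | y^n), cf. (1.4)
set.seed(2)
n <- 50; sigma2 <- 1; m0 <- 0; C0 <- 4      # same assumed setup as above
y <- rnorm(n, mean = 1, sd = sqrt(sigma2))
Cn <- 1 / (n / sigma2 + 1 / C0)             # posterior variance of mu
mn <- Cn * (sum(y) / sigma2 + m0 / C0)      # posterior mean of mu
S <- 10000
mu_s   <- rnorm(S, mn, sqrt(Cn))            # draws from pi(mu | y^n)
y_star <- rnorm(S, mu_s, sqrt(sigma2))      # draws from p(y* | y^n)
c(mean(y_star), var(y_star))                # approximates N(mn, Cn + sigma2)
```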
1.2.1 Bayesian model comparison
Marginal likelihood
We defined the marginal distribution of the data \({\boldsymbol y^n}\) in (1.2). This concept plays a useful role in Bayesian model comparison when we must select from a set of models, \(M_1,\ldots, M_K\). Let \(L(\boldsymbol{\Theta}; {\boldsymbol y^n}, M_k)\) denote the likelihood under model \(M_k\) of \(\boldsymbol{\Theta}\) given the data \({\boldsymbol y^n}\), and let \(\pi(\boldsymbol{\Theta} \vert M_k)\) denote the prior distribution of \(\boldsymbol{\Theta}\) under model \(M_k\). Each model may have its own set of parameters. Then, the marginal likelihood of model \({M_k}\) is
\[\begin{align} m({\boldsymbol y^n} \vert M_k) = \int L(\boldsymbol{\Theta}; {\boldsymbol y^n}, M_k) \pi(\boldsymbol{\Theta} \vert M_k) d\boldsymbol{\Theta}. \tag{1.5} \end{align}\] The marginal likelihood is useful in many Bayesian modeling steps, such as in the acceptance ratio computation in the Metropolis-Hastings algorithm, in Bayesian model selection and model averaging, etc. However, in most cases, \(m({\boldsymbol y^n} \vert M_k)\) cannot be obtained analytically. Many methods have been proposed, including those by Tierney and Kadane (1986) (via the Laplace approximation), Newton and Raftery (1994) (harmonic mean estimator), Chib (1995) (via MCMC), Marin et al. (2012) (using approximate Bayesian computation, ABC), Jordan et al. (1998) (variational methods), or Rue, Martino, and Chopin (2009) (INLA).
Bayes factors
Consider two candidate models \(M_1\) and \(M_2\) to describe the given data \(\boldsymbol y^n\). Using posterior model probabilities \(\pi(M_k \vert \boldsymbol y^n)\) for \(k=1,2\), we can write the posterior odds as
\[\begin{align} \frac{\pi(M_1 \vert \boldsymbol y^n)}{\pi(M_2 \vert \boldsymbol y^n)} = \frac{m(\boldsymbol y^n; M_1)}{m(\boldsymbol y^n; M_2)} \times \frac{\pi(M_1)}{\pi(M_2)}, \tag{1.6} \end{align}\] where \(m(\boldsymbol y^n; M_k)\) and \(\pi(M_k)\) denote the marginal likelihood and prior probability for model \(M_k\), respectively. The first term on the right side of (1.6) is known as the Bayes factor (BF) in favor of model \(M_1\) when it is compared to \(M_2\) (Kass and Raftery 1995), i.e.,
\[\begin{align} {\rm BF}_{12}=\frac{m(\boldsymbol y^n; M_1)}{m(\boldsymbol y^n; M_2)}. \end{align}\] The BF can be interpreted as a summary of the support provided by the data for one model relative to an alternative. It can also be used for Bayesian hypothesis testing as originally suggested by Jeffreys (1935).
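Since INLA provides an estimate of the log marginal likelihood in the mlik component of a fitted model object, a (log) Bayes factor can be approximated by differencing. A hedged sketch, assuming a data frame df with a response y and a covariate x:

```r
library(INLA)
# assumed: data frame df with columns y and x
fit1 <- inla(y ~ 1, data = df, control.compute = list(mlik = TRUE))   # M1
fit2 <- inla(y ~ x, data = df, control.compute = list(mlik = TRUE))   # M2
log_BF_21 <- fit2$mlik[1] - fit1$mlik[1]   # log Bayes factor of M2 vs. M1
exp(log_BF_21)
```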
Since the marginal likelihood cannot be obtained analytically for many models, an alternative quantity based on the predictive density is
\[\begin{align} p(y_j|\boldsymbol y_{(-j)},M_k)=\int p(y_j|\boldsymbol \Theta_k,M_k)\,p(\boldsymbol \Theta_k|\boldsymbol y_{(-j)},M_k)d\boldsymbol \Theta_k, \tag{1.7} \end{align}\] where, for \(j=1,\ldots, n\), \(\boldsymbol y_{(-j)}=\{y_\ell|\ell=1,\ldots,n;\,\ell\neq j\}\) and \(\boldsymbol \Theta_k\) denotes the parameters associated with model \(M_k\). The above density evaluated at data point \(y_j\) is referred to as the conditional predictive ordinate (CPO). The pseudo likelihood obtained as the product of individual CPOs, i.e.,
\[\begin{align} \widehat{m}(\boldsymbol y^n)=\prod_{j=1}^{n} p(y_j|\boldsymbol y_{(-j)},M_k) \tag{1.8} \end{align}\] is called the cross validation marginal likelihood. Evaluation of the cross validation marginal likelihood typically requires Markov chain Monte Carlo (MCMC) simulation. Gelfand and Dey (1994) discuss how \(p(y_j|\boldsymbol y_{(-j)},M_k)\) can be approximated using posterior samples from the MCMC output. Once the cross validation marginal likelihood is approximated, the pseudo Bayes factor (PsBF) can be obtained to compare any two models.
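R-INLA returns the CPO values when they are requested through control.compute, so the cross validation marginal likelihood (1.8) can be computed without a separate MCMC run. A sketch under the same assumed data frame df as above:

```r
library(INLA)
fit <- inla(y ~ x, data = df, control.compute = list(cpo = TRUE))
log_cvml <- sum(log(fit$cpo$cpo))   # log of the pseudo likelihood in (1.8)
# PsBF of M1 vs. M2: exp(log_cvml_1 - log_cvml_2) for two fitted models
```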
Information criteria
As noted by Kass and Raftery (1995), the BF can be approximated using the Schwarz criterion, which is also known as the Bayesian information criterion (BIC). To see this, write the logarithm of the Bayes factor as
\[\begin{align*} \log {\rm BF}_{12}=\log\, m({\boldsymbol y^n}; M_1)-\log\,m({\boldsymbol y^n}; M_2). \end{align*}\] For large sample size \(n\), it is possible to show that
\[\begin{align*} \log\, m({\boldsymbol y^n}; M_k)\approx \log\, L_k({\boldsymbol y^n};\widehat{\boldsymbol \Theta}_k) -(r_k/2)\,\log(n), \end{align*}\] where \(L_k({\boldsymbol y^n};\widehat{\boldsymbol \Theta}_k)\) is the likelihood for model \(M_k\) evaluated at the maximum likelihood estimator \(\widehat{\boldsymbol \Theta}_k\) (or the posterior mode) of \(\boldsymbol \Theta_k\), while \(r_k\) is the number of parameters for model \(M_k\). Multiplying this approximation by \(-2\) gives the BIC for model \(M_k\), i.e.,
\[\begin{align*} {\rm BIC}_k=-2 \log\,L_k({\boldsymbol y^n};\widehat{\boldsymbol \Theta}_k) +r_k\,\log(n). \end{align*}\] A smaller value of \({\rm BIC}_k\) implies more support for the model \(M_k\). We can approximate BF using BIC as
\[\begin{align*} {\rm BF}_{12}\approx \exp\big[-\frac{1}{2}({\rm BIC}_1- {\rm BIC}_2)\big]. \end{align*}\]
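As a small numerical illustration of this approximation (the simulated series and candidate models are purely illustrative):

```r
# approximate BF_12 from the BICs of two candidate ARMA models
set.seed(3)
y <- arima.sim(model = list(ar = 0.7), n = 200)
fit1 <- arima(y, order = c(1, 0, 0))     # M1: AR(1)
fit2 <- arima(y, order = c(2, 0, 0))     # M2: AR(2)
bic1 <- BIC(fit1); bic2 <- BIC(fit2)
BF12 <- exp(-0.5 * (bic1 - bic2))        # support for M1 relative to M2
```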
Especially for hierarchical Bayesian models, it is not easy to evaluate BIC since the “effective” number of parameters in the model will not be known. For this purpose, the deviance information criterion (DIC) was proposed by Spiegelhalter et al. (2002). For a generic parameter vector \(\boldsymbol \Theta\), the DIC is defined as
\[\begin{align} {\rm DIC} &= \overline{D}+r_D, \text{ where}, \notag \\ D &= -2 \log\,L(\boldsymbol \Theta; \boldsymbol y^n ), \,\,\, \overline{D}=E_{\Theta|\boldsymbol y^n}(D), \text{ and } \notag \\ r_D &= \overline{D}-D(\widehat{\boldsymbol \Theta}), \tag{1.9} \end{align}\] where \(\widehat{\boldsymbol \Theta}\) is the posterior mean of \(\boldsymbol \Theta\). The term \(\overline{D}\) represents the “goodness of fit” of the model, while \(r_D\) represents a complexity penalty as reflected by the effective number of parameters of the model. This unknown effective number of parameters is estimated as the difference between the posterior mean of the deviance and the deviance evaluated at the posterior mean of \(\boldsymbol \Theta\).
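The quantities in (1.9) are straightforward to compute from posterior draws. A minimal sketch for a normal mean with known variance, using draws from the conjugate posterior (under a vague prior) as a stand-in for MCMC output; all settings are assumptions:

```r
# DIC from posterior draws, cf. (1.9)
set.seed(4)
y <- rnorm(30, mean = 2, sd = 1); n <- length(y)
loglik  <- function(theta) sum(dnorm(y, theta, 1, log = TRUE))
theta_s <- rnorm(5000, mean(y), 1 / sqrt(n))    # stand-in posterior draws
D_s  <- sapply(theta_s, function(th) -2 * loglik(th))
Dbar <- mean(D_s)                     # posterior mean of the deviance
Dhat <- -2 * loglik(mean(theta_s))    # deviance at the posterior mean
r_D  <- Dbar - Dhat                   # effective number of parameters
DIC  <- Dbar + r_D
```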
An alternative to DIC is the Watanabe Akaike information criterion (WAIC), defined as (see Watanabe and Opper (2010) and Gelman, Hwang, and Vehtari (2014))
\[\begin{align} {\rm WAIC} = -2\,(\mbox{lppd} - p_{{\rm WAIC}}), \tag{1.10} \end{align}\] where \(\mbox{lppd}\) is the log pointwise predictive density given by
\[\begin{align*} \mbox{lppd} =\sum_{j=1}^n \log \left(\int p(y_j|\Theta_k)\pi(\Theta_k|\boldsymbol y^n)d\Theta_k \right). \end{align*}\] The \(\mbox{lppd}\) is evaluated using draws \(\Theta_k^s, s=1,\ldots,S\) from the posterior distribution of \(\Theta_k\) as
\[\begin{align*} \sum_{j=1}^n \log\left(\frac{1}{S}\sum_{s=1}^S p(y_j|\Theta_k^s)\right), \end{align*}\] and \(p_{{\rm WAIC}}\) is a correction term for the effective number of parameters to adjust for overfitting,
\[\begin{align*} p_{{\rm WAIC}} = 2\sum_{j=1}^n\left(\log(E_{post} p(y_j|\Theta_k)) - E_{post}(\log p(y_j|\Theta_k))\right), \end{align*}\] which can be computed from draws \(\Theta_k^s, s=1,\ldots,S\) as
\[\begin{align*} 2\sum_{j=1}^n\left(\log \left(\frac{1}{S}\sum_{s=1}^S p(y_j|\Theta_k^s)\right) - \frac{1}{S}\sum_{s=1}^{S}\log p(y_j|\Theta_k^s)\right). \end{align*}\]
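Both \(\mbox{lppd}\) and \(p_{{\rm WAIC}}\) can be computed directly from posterior draws; a sketch for the same assumed normal model as in the DIC example above:

```r
# WAIC from an S x n matrix of pointwise densities p(y_j | theta^s)
set.seed(5)
y <- rnorm(30, mean = 2, sd = 1); n <- length(y); S <- 5000
theta_s <- rnorm(S, mean(y), 1 / sqrt(n))     # stand-in posterior draws
p_mat  <- sapply(y, function(yj) dnorm(yj, theta_s, 1))   # S x n matrix
lppd   <- sum(log(colMeans(p_mat)))
p_waic <- 2 * sum(log(colMeans(p_mat)) - colMeans(log(p_mat)))
WAIC   <- -2 * (lppd - p_waic)                # smaller values are better
```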
Model averaging
An alternative approach for predicting the future values of the response variable \(Y\) is to take into account model uncertainty and use Bayesian model averaging.
Assume that we have \(K\) possible models \(M_k,~k=1,\ldots,K\) with prior model probabilities \(p(M_k)\) such that \(\sum_{k=1}^K p(M_k)=1\). Typically, under model \(M_k\), we have unknown parameter(s) \(\boldsymbol{\Theta}_k\), and we denote the model for \(Y\) by \(p(y|\boldsymbol{\Theta}_k,M_k)\). Priors for \(\boldsymbol{\Theta}_k\) given \(M_k\) are denoted as \(p(\boldsymbol{\Theta}_k|M_k)\), for \(k=1,\ldots,K\).
If \(\boldsymbol{y}^n\) denotes the observed data, then \(p(y^*|\boldsymbol{y}^n)\), the posterior predictive distribution for a future value of \(Y,\) is given by
\[\begin{align} p(y^*|\boldsymbol{y}^n)=\sum_{k=1}^K p(y^*|\boldsymbol{y}^n,M_k) p(M_k|\boldsymbol{y}^n), \tag{1.11} \end{align}\] which is a weighted average of the posterior predictive distributions under each of the models. This is known as Bayesian model averaging (BMA). In (1.11), the weights are given by the posterior model probabilities
\[\begin{align} p(M_k|\boldsymbol{y}^n)=\frac{p(\boldsymbol{y}^n|M_k)\,p(M_k)}{\sum_{j=1}^K p(\boldsymbol{y}^n|M_j)\,p(M_j)}, \end{align}\] where the marginal likelihood term for model \(M_k\) is given by
\[\begin{align} p(\boldsymbol{y}^n|M_k)=\int p(\boldsymbol{y}^n|\boldsymbol{\Theta}_k,M_k)\,p(\boldsymbol{\Theta}_k|M_k)d\boldsymbol{\Theta}_k. \end{align}\] Finally, \(p(y^*|\boldsymbol{y}^n,M_k)\) is obtained via
\[\begin{align} p(y^*|\boldsymbol{y}^n,M_k)=\int p(y^*|\boldsymbol{\Theta}_k,M_k)\,p(\boldsymbol{\Theta}_k|\boldsymbol{y}^n,M_k)d\boldsymbol{\Theta}_k, \end{align}\] where \[\begin{align} p(\boldsymbol{\Theta}_k|\boldsymbol{y}^n,M_k)\propto p(\boldsymbol{y}^n|\boldsymbol{\Theta}_k,M_k)p(\boldsymbol{\Theta}_k|M_k). \end{align}\]
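Given log marginal likelihoods for the candidate models (e.g., from the mlik component of R-INLA fits) and each model's predictive density, the averaging in (1.11) takes only a few lines of R. All numbers below are hypothetical:

```r
# Bayesian model averaging of predictive densities, cf. (1.11)
log_ml  <- c(-152.3, -150.1, -151.7)     # hypothetical log marginal likelihoods
prior_p <- rep(1 / 3, 3)                 # uniform prior model probabilities
log_w   <- log_ml + log(prior_p)
post_p  <- exp(log_w - max(log_w))       # subtract max for numerical stability
post_p  <- post_p / sum(post_p)          # posterior model probabilities

y_grid <- seq(-3, 3, by = 0.1)
pred_k <- rbind(dnorm(y_grid, 0.2, 1.0), # hypothetical p(y* | y^n, M_k),
                dnorm(y_grid, 0.1, 1.1), # one row per model
                dnorm(y_grid, 0.3, 0.9))
pred_bma <- colSums(post_p * pred_k)     # model-averaged predictive density
```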
1.3 Bayesian analysis of time series
In many applications, we are interested in modeling a time series of observations \(y_t,~ t=1,\ldots,n\), based on available information. The information might consist only of observed values of \(y_t\), or it may include a set of past and current observations on exogenous predictors in addition to the past history of \(y_t\). The observations \(\boldsymbol y^n=(y_1,y_2,\ldots,y_n)'\), collected at discrete time points \(t=1,\ldots,n\), constitute a realization of the discrete time stochastic process \(\{Y_t\}\).
The most important feature of a time series is the possible serial correlation among the observations. Describing such correlation is the essence of time series modeling. As noted by Cox et al. (1981), such correlation is typically described using either observation driven or parameter driven modeling strategies. The former includes models such as Autoregressive Moving Average (ARMA) processes and Markov chains, whereas the latter involves latent structure models such as state-space models, where the correlation is created by the time dependence among latent factors. Parameter driven models, due to their dynamic latent structure, are capable of capturing nonstationary behavior in the means and variances.
In our development, we will be dealing with Bayesian analysis of both the observation driven and parameter driven processes. The Bayesian approach to time series modeling is still driven by the notion of “describing uncertainty via probability” and adherence to the rules of probability and the likelihood principle. Typical Bayesian analysis of a time series model will still involve deriving posterior and predictive distributions of unknown quantities in the specified model. Due to the temporal dependence in time series data and the potential dynamic structure of model parameters, our inferences and predictions need to be performed sequentially. Sequential processing can be handled adequately in the Bayesian paradigm as will be discussed in what follows.
To be able to consider both the observation and parameter driven models, we will decompose our generic parameter vector \(\Theta\) into static and dynamic parameters denoted by \(\boldsymbol \theta\) and \(\boldsymbol x^n=(x_1,\ldots,x_n)'\), respectively. Thus, we define \(\Theta=(\boldsymbol \theta,\boldsymbol x^n)\). At time \(t\), we denote the observed data by \(\boldsymbol y^t=(y_1,y_2,\ldots,y_t)'\) and the posterior distribution of \(\Theta\) by \(\pi(\Theta|\boldsymbol y^t)\). It can be shown via Bayes’ theorem, that
\[\begin{align} \pi(\Theta|\boldsymbol y^t)\propto \pi(\Theta|\boldsymbol y^{t-1}) L(\Theta;y_t, \boldsymbol y^{t-1}), \tag{1.12} \end{align}\] where \(\pi(\Theta|\boldsymbol y^{t-1})\) is the posterior distribution of \(\Theta\) at time \((t-1)\) which can be considered as the prior of \(\Theta\) before observing \(y_t\). The likelihood term \(L(\Theta;y_t, \boldsymbol y^{t-1})\) is obtained by evaluating \(p(y_t|\Theta, \boldsymbol y^{t-1})\) at the observed value \(y_t\) at time \(t\). It is important to note that, for observation driven models, \(p(y_t|\Theta, \boldsymbol y^{t-1})\neq p(y_t|\Theta)\). Given \(\boldsymbol y^t\), the predictive distribution of \(Y_{t+1}\) is given by
\[\begin{align} p(y_{t+1}|\boldsymbol y^t)=\int p(y_{t+1}|\Theta,\boldsymbol y^t) \pi(\Theta|{\boldsymbol y^t}) d \Theta. \tag{1.13} \end{align}\] The marginal posterior distribution of the entire latent vector \(\boldsymbol x^n\) given \(\boldsymbol y^t\) is obtained by marginalization of the posterior distribution of \(\Theta\) as
\[\begin{align*} \pi(\boldsymbol x^n|\boldsymbol y^t)=\int \pi(\boldsymbol x^n,\boldsymbol \theta|\boldsymbol y^t) d \boldsymbol \theta. \end{align*}\] The posterior distributions of the components of \(\boldsymbol x^n\), i.e., \(\pi(x_\tau|\boldsymbol y^t), \tau=1,\ldots,n\), can be obtained similarly. The distribution of the current latent component \(x_t\), i.e., \(\pi(x_t|\boldsymbol y^t)\), is referred to as the filtering distribution. The remaining marginals for \(x_\tau\), where \(\tau < t\) and \(\tau > t\), are known as the smoothing distributions and forecast distributions, respectively.
In sequential analysis of time series data, often the interest is centered around the filtering distribution of \(x_t\) which can be obtained from the sequential updating of \(x_t\) and \(\boldsymbol \theta\) as
\[\begin{align*} \pi(x_t, \boldsymbol \theta|\boldsymbol y^t)\propto \pi(x_t, \boldsymbol \theta|\boldsymbol y^{t-1}) L(x_t, \boldsymbol \theta;y_t, \boldsymbol y^{t-1}). \end{align*}\] The marginal posterior distribution of \(x_t\) obtained from \(\pi(x_t, \boldsymbol \theta|\boldsymbol y^{t-1})\), i.e., \(\pi(x_t|\boldsymbol y^{t-1})\), is the forecast distribution of \(x_t\).
1.4 Gaussian dynamic linear models (DLMs)
In this section, we introduce Gaussian dynamic linear models (DLMs) for a univariate time series. Due to their ability to capture time varying parameters, the DLMs provide a flexible modeling framework for nonstationary time series. The Gaussian DLM leads to closed form expressions for estimates and forecasts, and may be the best path if such a model is suitable for the time series. We show a few examples that are often used in practice. In Chapter 3, we will fit these models to data using R-INLA (Martins et al. 2013).
One way to model a univariate time series \(y_t\) is as the sum of an unobservable (latent) level plus a random error. The latent level can either be constant over time or can randomly evolve over time according to some stochastic process. We start with a simple example for a univariate time series \(y_t\), the constant level plus noise model, followed by a local level model, also referred to as the random walk plus noise model. In each case, we show by simple formulas how Bayesian updating is done. Next, we define the DLM framework for a univariate time series \(y_t\) and describe its properties, and present another commonly used example of DLM, the AR(1) plus noise model. We then give a brief review of a DLM for a vector-valued time series, and discuss the Kalman filtering and smoothing algorithms. We end the chapter with a brief look at how the DLM can be extended for non-Gaussian setups.
1.4.1 Constant level plus noise model
Consider a simple model for a univariate Gaussian time series \(y_t\), which can be described as the sum of a constant level and a Gaussian random error (Shumway and Stoffer 2017):
\[\begin{align} y_t = \alpha + v_t;~~ v_t \sim N(0, \sigma^2_v). \tag{1.14} \end{align}\] We use this model to illustrate Bayesian updating for a time series. We assume that the prior distribution of the random variable \(\alpha\) is normal with mean \(m_0\) and variance \(C_0\), i.e., \(\pi(\alpha) \sim N(m_0, C_0)\), independent of the distribution of \(v_t\) for all \(t\). For simplicity, we first assume that the error variance \(\sigma^2_v\) is known. Given data \({\boldsymbol y}^n\), our objective is to update our opinion about \(\alpha\) via the posterior distribution of \(\alpha\) given \({\boldsymbol y}^n\):
\[\begin{align*} \pi(\alpha|\boldsymbol y^n) \propto L(\alpha; \sigma^2_v, \boldsymbol y^n)~\pi(\alpha), \end{align*}\]
where \(L(\alpha; \sigma^2_v, \boldsymbol y^n)\) is a Gaussian likelihood.
Due to the conjugacy of the normal likelihood and the normal prior for \(\alpha\), the posterior distribution of \(\alpha\) given \(\boldsymbol y^n\) is normal, i.e.,
\[\begin{align*} & \pi(\alpha|\boldsymbol y^n) \sim N(m_n, C_n), \mbox{ where}, \notag \\ & m_n = E(\alpha|\boldsymbol y^n) = \frac{C_0}{C_0+\frac{\sigma^2_v}{n}}\overline{y} + \frac{\frac{\sigma^2_v}{n}}{C_0+\frac{\sigma^2_v}{n}} m_0, \mbox{ and} \notag \\ & C_n = Var(\alpha|\boldsymbol y^n) = \left(\frac{n}{\sigma^2_v} + \frac{1}{C_0}\right)^{-1} = \frac{\sigma^2_v C_0}{\sigma^2_v +n C_0}. \end{align*}\] Note that \(m_n\) is a weighted average of the sample mean \(\overline{y}=\frac{1}{n}\sum_{t=1}^n y_t\) and the prior mean \(m_0\), where the weight is a function of \(C_0\) and \(\sigma^2_v\). If \(C_0\) is large, then the posterior mean \(m_n\) is approximately equal to \(\overline{y}\) and \(C_n \approx \sigma^2_v/n\). The posterior precision \(1/C_n\) is given by
\[\begin{align*} \frac{1}{C_n}= \frac{n}{\sigma^2_v}+\frac{1}{C_0}, \end{align*}\] where \(n/\sigma^2_v\) denotes the precision of the sample mean while \(1/C_0\) is the prior precision. Note that the posterior precision \(1/C_n\) is always larger than the prior precision \(1/C_0\).
It is easy to see how the posterior distribution of \(\alpha\) can be obtained sequentially over time. At time \(t=n-1\),
\[\begin{align*} \pi(\alpha|\boldsymbol y^{n-1}) \sim N(m_{n-1}, C_{n-1}). \end{align*}\] This distribution then plays the role of the prior for \(\alpha\) which is updated, when \(y_n\) is observed at time \(t=n\), to yield the posterior distribution
\[\begin{align*} & \pi(\alpha|\boldsymbol y^n)\sim N(m_n, C_n), \mbox{ where}, \notag \\ & m_n = \frac{C_{n-1}}{C_{n-1}+\sigma^2_v}y_n + \left(1-\frac{C_{n-1}}{C_{n-1}+\sigma^2_v}\right)m_{n-1}, \mbox{ and} \notag \\ & C_n = \left(\frac{1}{\sigma^2_v}+\frac{1}{C_{n-1}}\right)^{-1}= \frac{\sigma^2_v C_{n-1}}{\sigma^2_v+C_{n-1}}. \end{align*}\] We can obtain the predictive distribution of \(y_{n+1}\) given the information up to time \(n\) as
\[\begin{align*} y_{n+1}| {\boldsymbol y}^n \sim N(m_n, C_n+\sigma^2_v). \end{align*}\] Note that \(m_n\) is the posterior mean of \(\alpha\) and the one-step ahead point prediction of \(y_{n+1}\). This leads to an error correction form representation of the prediction as
\[\begin{align*} m_n = m_{n-1} + \frac{C_{n-1}}{C_{n-1}+\sigma^2_v}(y_n - m_{n-1}). \end{align*}\] Thus, \(m_n\) is obtained by correcting the previous estimate \(m_{n-1}\) with a weighted forecast error \(e_n = y_n - m_{n-1}\), the weight being
\[\begin{align*} \frac{C_{n-1}}{C_{n-1}+\sigma^2_v} = \frac{C_0}{\sigma^2_v+n C_0}. \end{align*}\] This model is too simplistic to offer a realistic fit to time series in practice, so we do not discuss it further in the book.
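Before moving on, the updating recursion above can be checked numerically with a short simulation; the true level, variances, and prior settings below are assumptions:

```r
# sequential updating for the constant level plus noise model (1.14)
set.seed(6)
n <- 100; sigma2_v <- 1; alpha_true <- 2
y <- rnorm(n, alpha_true, sqrt(sigma2_v))
m <- C <- numeric(n)
m_prev <- 0; C_prev <- 25                  # prior alpha ~ N(0, 25), assumed
for (t in 1:n) {
  K    <- C_prev / (C_prev + sigma2_v)     # weight on the forecast error
  m[t] <- m_prev + K * (y[t] - m_prev)     # error-correction form
  C[t] <- sigma2_v * C_prev / (sigma2_v + C_prev)
  m_prev <- m[t]; C_prev <- C[t]
}
m[n]   # close to alpha_true for large n
```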
1.4.2 Local level model
In many situations, we may wish to model the level as dynamically evolving over time. One assumption is that the level evolves as a random walk. For this reason, this model is also called a random walk plus noise model:
\[\begin{align} y_t &= x_t + v_t;~~ v_t \sim N(0,\sigma^2_v), \tag{1.15} \\ x_t &= x_{t-1}+ w_t;~~ w_t \sim N(0,\sigma^2_w). \tag{1.16} \end{align}\] Here, \(v_t\) and \(w_t\) are random errors which are assumed to follow zero mean normal distributions with unknown variances \(\sigma^2_v\) and \(\sigma^2_w\) respectively. The model in (1.15) and (1.16) is one of the simplest dynamic models for a univariate time series \(y_t\) and is appropriate when the time series exhibits a slowly changing level over time. We illustrate how the dynamic component impacts Bayesian inference of the latent variable \(x_t\), before and after we observe the data \(y_t\) at time \(t\).
Prior distribution of \(x_t\). After we see \(y_{t-1}\) at time \(t-1\), we can obtain the posterior distribution of \(x_{t-1}\) given \(\boldsymbol y^{t-1}=(y_1, \ldots,y_{t-1})\), i.e., \(\pi(x_{t-1}| \boldsymbol y^{t-1})\). This leads to the prior distribution for \(x_t\) before we observe \(y_t\); i.e., \(\pi(x_t \vert \boldsymbol y^{t-1})\).
Forecast distribution of \(y_t\). At time \(t-1\), we predict the output \(y_t\) for time \(t\) using \(p(y_t \vert \boldsymbol y^{t-1})\).
Posterior distribution of \(x_t\). Once the data at time \(t\), i.e., \(y_{t}\), becomes available, we update our belief about \(x_t\), i.e., we obtain the posterior of \(x_{t}\) given \(\boldsymbol y^t\), denoted by \(\pi(x_{t}|\boldsymbol y^t)\). As previously mentioned, this is the filtering distribution for \(x_t\).
We next illustrate these three steps. At time \(t-1\), suppose that
\[\begin{align*} \pi(x_{t-1}|\boldsymbol y^{t-1})\sim N(m_{t-1}, C_{t-1}). \end{align*}\] Using the state equation
\[\begin{align*} x_t=x_{t-1}+w_t;~~w_t \sim N(0, \sigma^2_w), \end{align*}\] we can obtain the forecast distribution of \(x_t\) as
\[\begin{align*} \pi(x_t|\boldsymbol y^{t-1}) \sim N(a_t, R_t), \end{align*}\] where
\[\begin{align*} a_t = m_{t-1}, \text{ and } R_t = C_{t-1}+\sigma^2_w. \end{align*}\] Since \(\sigma^2_w>0\), we become more uncertain about the average level \(x_t\) than under the constant level model. Standing at time \(t-1\), we can also predict the next observation \(y_t\) using the predictive distribution
\[\begin{align*} y_t|\boldsymbol y^{t-1} \sim N(f_t, Q_t), \text{ where } f_t = a_t \,\, \text{and} \,\, Q_t = R_t + \sigma^2_v. \end{align*}\]
The uncertainty about \(y_t\) depends on the measurement error \(\sigma^2_v\) as well as on the uncertainty about the average level \(x_t\) as reflected by the state equation.
At time \(t\), the new observation \(y_t\) becomes available, and we update our uncertainty about \(x_t\) with the new measurement \(y_t\). We obtain the posterior density \(\pi(x_t|\boldsymbol y^t)\), i.e., the filtering density as
\[\begin{align*} x_t|\boldsymbol y^t \sim N(m_t, C_t), \end{align*}\] where
\[\begin{align*} m_t = a_t + \frac{R_t}{R_t+\sigma^2_v}(y_t-f_t) \end{align*}\] and
\[\begin{align*} C_t = \frac{\sigma^2_vR_t}{\sigma^2_v+R_t}. \end{align*}\] Note that for this step, (i) the prior distribution of \(x_t\) is \(N(a_t, R_t)\) and (ii) \(y_t\) is independent of its past history given \(x_t\). This sequential updating mechanism is known as an estimation-correction structure, because the previous best estimate \(a_t\) is corrected by a fraction of the forecast error \(e_t = y_t - f_t\), with weight \(K_t =R_t/(R_t+\sigma^2_v)\). The magnitude of the state error variance \(\sigma^2_w\) relative to the observation error variance \(\sigma^2_v\) is known as the signal-to-noise ratio (SNR). This quantity plays a crucial role in determining the effect of the data on estimation and forecasting.
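The three steps above translate directly into a filtering recursion. A minimal R sketch with assumed variances and initial prior:

```r
# filtering recursion for the local level model (1.15)-(1.16)
set.seed(7)
n <- 100; sigma2_v <- 1; sigma2_w <- 0.25   # assumed variances (SNR = 0.25)
x <- cumsum(rnorm(n, 0, sqrt(sigma2_w)))    # latent random walk level
y <- x + rnorm(n, 0, sqrt(sigma2_v))
m <- C <- numeric(n)
m_prev <- 0; C_prev <- 10                   # x_0 ~ N(0, 10), assumed
for (t in 1:n) {
  a <- m_prev; R <- C_prev + sigma2_w       # prior for x_t:   N(a_t, R_t)
  f <- a;      Q <- R + sigma2_v            # forecast of y_t: N(f_t, Q_t)
  K <- R / Q                                # weight K_t on the forecast error
  m[t] <- a + K * (y[t] - f)                # filtering mean m_t
  C[t] <- sigma2_v * R / (sigma2_v + R)     # filtering variance C_t
  m_prev <- m[t]; C_prev <- C[t]
}
```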
1.4.3 Gaussian DLM framework for univariate time series
The Gaussian DLM is a general and flexible framework for analyzing many different classes of time series (Shumway and Stoffer 2017). The DLM consists of an observation equation and a state equation, which are respectively given by
\[\begin{align} y_t = \alpha + F_t x_t + v_t, \ v_t \sim N (0,\sigma^2_v), \tag{1.17} \end{align}\] and
\[\begin{align} x_t = G_t x_{t-1}+ w_t, \ w_t \sim N(0,\sigma^2_w). \tag{1.18} \end{align}\]
The observation equation expresses \(y_t\) as a linear combination of the latent variable \(x_t\) and an observation error \(v_t\), plus a constant level \(\alpha\); here \(F_t\) is known. The state equation expresses \(x_t\) as a linear combination of \(x_{t-1}\) and a state error \(w_t\), where \(G_t\) represents the transition from \(x_{t-1}\) to \(x_t\) and may be known or unknown, and time-varying or constant. We assume that the initial state is Gaussian, i.e., \(x_0 \sim N(m_0,C_0)\).
Usually, \(\alpha\), \(\sigma^2_v\), \(\sigma^2_w\), \(m_0\), and \(C_0\) will be unknown and must be estimated from the data. Let \(\boldsymbol{\theta}\) denote the collection of all the unknown scalars that we refer to as hyperparameters, i.e., \(\boldsymbol{\theta}=(\alpha, \sigma^2_v, \sigma^2_w, m_0, C_0)'\). In the DLM, the primary parameter of interest is the latent process \(x_t\).
Remark 1. The latent process \(x_t\) is a Markov chain. This means that the conditional distribution of \(x_t\) given its entire history, \(x_{t-1}, x_{t-2}, x_{t-3}, \ldots\) only depends on the previous state \(x_{t-1}\), i.e.,
\[\begin{align*} \pi(x_t \vert x_{t-1}, x_{t-2}, x_{t-3}, \ldots) = \pi(x_t \vert x_{t-1}). \end{align*}\] The distribution may depend on the hyperparameters \(\boldsymbol{\theta}\).
Remark 2. Conditionally on \(x_t\), the observed data \(y_t\) is independent of its own history \(y_{t-1}, y_{t-2}, y_{t-3}, \ldots\). All the information about the history of \(y_t\) is included in the current state \(x_t\), which then determines \(y_t\).
Remark 3. At each time \(t\), we can explain \(y_t\) as a linear function of \(x_t\) and an additive Gaussian noise. The latent process \(x_t\) follows the Markovian dynamics in a linear way; i.e., \(x_t\) depends linearly only on \(x_{t-1}\) and an additive Gaussian noise \(w_t\).
For example, the random walk plus noise (or local level) model defined in (1.15) and (1.16) for a univariate time series \(y_t\) is a special case of (1.17) and (1.18) with \(F_t =1, G_t=1\) and \(\alpha=0\). Further, if we set \(\sigma^2_w=0\) and \(\alpha \neq 0\), the random walk plus noise model reduces to the constant level plus noise model, where \(\alpha\) denotes a constant level that does not vary over time. In the next subsection, we show another simple and widely used example of a DLM.
1.4.4 AR(1) plus noise model
Rather than a random walk assumption for the dynamic evolution of the state variable in (1.18), we can allow the level \(x_t\) to evolve over time as an autoregression of order 1, i.e., an AR(1) process with serial correlation \(\phi\). That is, in (1.18), we set \(G_t = \phi\) instead of \(G_t =1\) for all \(t\). The AR(1) plus noise model is
\[\begin{align} y_t &= x_t + v_t;~~ v_t \sim N(0, \sigma^2_v), \tag{1.19} \\ x_t &= \phi x_{t-1}+ w_t;~~ w_t \sim N(0, \sigma^2_w). \tag{1.20} \end{align}\] For process stability, it is usual to assume that \(|\phi|<1\). The other assumptions are similar to those under the random walk plus noise model; a short simulation sketch is given below. We discuss analyzing and forecasting univariate time series under these DLMs in Chapter 3, and end this section by introducing the DLM for a vector-valued process.
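The simulation uses assumed parameter values and starts the state at its stationary distribution:

```r
# simulate the AR(1) plus noise model (1.19)-(1.20)
set.seed(8)
n <- 200; phi <- 0.8; sigma2_v <- 1; sigma2_w <- 0.5   # assumed values
x <- numeric(n)
x[1] <- rnorm(1, 0, sqrt(sigma2_w / (1 - phi^2)))      # stationary start
for (t in 2:n) x[t] <- phi * x[t - 1] + rnorm(1, 0, sqrt(sigma2_w))
y <- x + rnorm(n, 0, sqrt(sigma2_v))                   # observed series
```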
1.4.5 DLM for vector-valued time series
Let \(\boldsymbol y_t\) be a \(q\)-dimensional observed time series, for \(q \ge 1\) and let \(\boldsymbol x_t\) be a \(p\)-dimensional unobserved (or latent) state process for \(p \ge 1\). The DLM for \(\boldsymbol y_t\) is described through the observation and state equations shown below:
\[\begin{align} \boldsymbol y_t = \boldsymbol F_t \boldsymbol x_t + \boldsymbol\Gamma \boldsymbol u_{1,t} + \boldsymbol v_t, \ \boldsymbol v_t \sim N_q (\boldsymbol 0,\boldsymbol V), \tag{1.21} \end{align}\] and
\[\begin{align} \boldsymbol x_t = \boldsymbol G_t \boldsymbol x_{t-1}+ \boldsymbol \Upsilon \boldsymbol u_{2,t} + \boldsymbol w_t, \ \boldsymbol w_t \sim N_p(\boldsymbol 0, \boldsymbol W). \tag{1.22} \end{align}\]
In (1.21) and (1.22), for \(t=1,\ldots,n\),

- \(\boldsymbol F_t\) is a known matrix of order \(q \times p\),
- \(\boldsymbol u_{j,t}\) are known \(r_j\)-dimensional vectors of exogenous predictors, for \(j=1,2\),
- \(\boldsymbol \Gamma\) is a \(q \times r_1\) matrix of regression coefficients corresponding to \(\boldsymbol u_{1,t}\),
- \(\boldsymbol \Upsilon\) is a \(p \times r_2\) matrix of regression coefficients corresponding to \(\boldsymbol u_{2,t}\),
- \(\boldsymbol G_t\) is a possibly unknown state transition matrix of order \(p \times p\),
- \(\boldsymbol v_t \sim N_q(\boldsymbol 0, \boldsymbol V)\) is the observation error process,
- \(\boldsymbol w_t \sim N_p(\boldsymbol 0, \boldsymbol W)\) is the state error process,
- \(\{\boldsymbol v_t\}\) and \(\{\boldsymbol w_t\}\) are mutually independent white noise processes,
- the initial state \(\boldsymbol x_{0} \sim N_p(\boldsymbol m_0, \boldsymbol C_0)\), and
- \(\boldsymbol x_0\) is independent of \(\{\boldsymbol v_t\}\) and \(\{\boldsymbol w_t\}\).
We write the DLM properties in terms of a general distributional setup. As we mentioned in the univariate case, here too, the state process is assumed to be linear, Markovian, and Gaussian, i.e.,
\[\begin{align} \pi(\boldsymbol x_{t} \vert \boldsymbol x_{t-1}, \boldsymbol x_{t-2},\ldots) = \pi(\boldsymbol x_{t} \vert \boldsymbol x_{t-1}) = f_{w}(\boldsymbol x_{t} - \boldsymbol G_t \boldsymbol x_{t-1} -\boldsymbol \Upsilon \boldsymbol u_{2,t}), \tag{1.23} \end{align}\] where \(f_{w}(.)\) denotes a \(p\)-variate Gaussian p.d.f. with mean vector \(\boldsymbol 0\) and covariance matrix \(\boldsymbol W\). The observations are linear, Gaussian, and conditionally independent given the state vector, i.e.,
\[\begin{align} p(\boldsymbol y_{t}\vert \boldsymbol x_{t}, \boldsymbol y_{t-1}, \ldots,\boldsymbol y_1) = p(\boldsymbol y_{t}\vert \boldsymbol x_{t}) = f_{v}(\boldsymbol y_{t} - \boldsymbol F_{t} \boldsymbol x_{t}-\boldsymbol \Gamma \boldsymbol u_{1,t}), \tag{1.24} \end{align}\] where \(f_{v}(.)\) is a \(q\)-variate Gaussian p.d.f. with mean vector \(\boldsymbol 0\) and covariance matrix \(\boldsymbol V\). The joint distribution of the observation and state processes is specified by
\[\begin{align} \pi(\boldsymbol x_{0},\boldsymbol x_{1},\ldots,\boldsymbol x_{n}, \boldsymbol y_{1}, \ldots, \boldsymbol y_{n}) = \pi(\boldsymbol x_{0}) \times \prod_{t=1}^{n} f_{w}(\boldsymbol x_{t} - \boldsymbol G_t \boldsymbol x_{t-1} -\boldsymbol \Upsilon \boldsymbol u_{2,t}) f_{v}(\boldsymbol y_{t} - \boldsymbol F_{t}\boldsymbol x_{t}-\boldsymbol\Gamma \boldsymbol u_{1,t}). \tag{1.25} \end{align}\]
1.4.6 Kalman filtering and smoothing
Kalman filtering and smoothing for the Gaussian DLM have been described and implemented in R packages such as dlm (Petris 2010) and astsa (Shumway and Stoffer 2017).
At each time \(t\), we can estimate the latent state vector \(\boldsymbol x_t\) given data \(\boldsymbol Y^s = (\boldsymbol y_1', \boldsymbol y_2',\ldots, \boldsymbol y_s')'\), as \(\boldsymbol x_t^s\), where \(s < t\), \(s = t\), or \(s > t\). Differences in the information set \(\boldsymbol Y^s\) on which we condition lead to the three different situations listed below:

- \(s < t\): Forecasting/prediction. For example, when \(s=t-1\), we estimate \(\boldsymbol x_t\) by \(\boldsymbol x_t^{t-1} = E(\boldsymbol x_t | \boldsymbol Y^{t-1})\), its one-step ahead forecast estimate;
- \(s = t\): Filtering. When \(s=t\), we obtain the filter estimate of \(\boldsymbol x_t\) as \(\boldsymbol x_t^{t} = E(\boldsymbol x_t | \boldsymbol Y^{t})\);
- \(s > t\): Smoothing. For example, when \(s=n>t\), we estimate \(\boldsymbol x_t\) by \(\boldsymbol x_t^{n} = E(\boldsymbol x_t | \boldsymbol Y^{n})\), its smoothed estimate.
Note that the R-INLA implementation of a DLM produces the smoothed estimates of the states, i.e., \(\boldsymbol x_t^n\), and not the filter estimates \(\boldsymbol x_t^t\) for \(t=1,\ldots,n\). In fact, as we show in Chapter 3, one must iteratively run the inla() function \(n\) times in order to obtain the filter estimates of \(\boldsymbol x_t\).
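To contrast filtered and smoothed estimates before turning to R-INLA, here is a quick sketch using the dlm package; the local level model and its variances are assumed:

```r
library(dlm)
# local level DLM with assumed variances dV (observation) and dW (state)
set.seed(9)
y <- cumsum(rnorm(100, 0, 0.5)) + rnorm(100, 0, 1)
mod  <- dlmModPoly(order = 1, dV = 1, dW = 0.25)
filt <- dlmFilter(y, mod)    # filtered means x_t^t in filt$m (incl. t = 0)
smoo <- dlmSmooth(filt)      # smoothed means x_t^n in smoo$s
```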
1.5 Beyond basic Gaussian DLMs
So far in this chapter, we have seen a few examples of Gaussian DLMs. In Chapters 3 and 4, we will describe how to use R-INLA to fit these models to time series exhibiting various characteristics. The statistical details of the INLA framework itself are briefly reviewed in Chapter 2.
In Chapter 5, we show how to fit Gaussian DLMs to time series when we have exogenous predictors observed over the same time frame as the time series we wish to model and predict; this chapter also includes structural modeling (Harvey and Koopman 2014).
Hierarchical dynamic linear models (HDLMs) allow us to model a set of \(g\) independent time series, each of length \(n\) (D. Gamerman and Migon 1993). HDLMs are often referred to as panel time series models in the literature. In Chapter 6, we consider models for a set of \(g\) univariate time series.
While Gaussian DLMs are relatively easy to fit to data, they may not always be suitable for the various types of complex time series data arising in different application domains. Examples include positive, continuous-valued time series, binary-valued time series, and count time series (West and Harrison 1997).
Chapter 7 describes models for time series following gamma, Weibull, or beta distributions, while in Chapter 8, we describe dynamic model fitting for binary-valued time series as well as categorical time series. In Chapter 9, we use R-INLA to model and forecast univariate count time series, as well as sets of univariate count time series.
A popular example of a nonlinear and non-Gaussian dynamic time series model is the stochastic volatility (SV) model used for modeling financial log returns, for which a Bayesian framework was described in Kim, Shephard, and Chib (1998). The SV model is also useful in other application domains for modeling logarithms of growth rates (which are similar to log returns).
In Chapter 10, we describe dynamic stochastic volatility models using R-INLA. Chapter 11 discusses spatio-temporal modeling. Chapter 12 describes DLMs for \(q\)-dimensional time series and approaches for forecasting them. A limiting factor here is that R-INLA is currently only able to handle dimensions \(q \le 5\).
Chapter 13 extends the hierarchical models in Chapter 6 to multivariate time series, using ideas from Chapter 12. The chapter also considers level correlated models for hierarchical modeling of multivariate count time series. The use of INLA in these situations offers a computationally feasible alternative to MCMC methods, which can be too slow to be practically useful in many situations.
Chapter 1 – Appendix
Conditional distributions
We review some useful notation and formulas. Let \(W,Y\), and \(Z\) denote random variables.
- Conditional distribution of \(Y\) given \(Z\):
\[\begin{align} p(Y|Z) = \frac{p(Y,Z)}{p(Z)}, \tag{1.26} \end{align}\] provided the denominator \(p(Z)\) is positive. We can write (1.26) equivalently as
\[\begin{align} p(Z) = \frac{p(Y,Z)}{p(Y|Z)}. \tag{1.27} \end{align}\]
- Conditioning both sides of (1.27) on \(W\),
\[\begin{align} p(Z|W) = \frac{p(Y,Z|W)}{p(Y|Z,W)}. \tag{1.28} \end{align}\] We can rewrite (1.28) equivalently as
\[\begin{align} p(Y,Z|W) = p(Z|W) \times p(Y|Z,W). \end{align}\]
Exponential family of distributions
One-parameter exponential family
The one-parameter exponential family is defined by the p.m.f. or p.d.f.
\[\begin{align} p(y|\theta)= h(y)\, \exp\big[\eta(\theta) t(y) -b(\theta)\big], \tag{1.29} \end{align}\] where \(\eta(\theta)\), \(t(y)\), \(h(y)\), and \(b(\theta)\) are known functions. The support of \(Y\) does not depend on \(\theta\). The distribution (1.29) can also be written as
\[\begin{align} p(y|\theta)= h(y)\, g(\theta) \, \exp\big[\eta(\theta)\, t(y)\big], \tag{1.30} \end{align}\] or
\[\begin{align} p(y|\theta)=\exp\big[\eta(\theta)\, t(y)-b(\theta)+c(y)\big]. \tag{1.31} \end{align}\] If \(\eta(\theta)=\theta\), the exponential family is said to be in canonical form.
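For example, the Poisson p.m.f. is a one-parameter exponential family in the form (1.31):
\[\begin{align*} p(y|\lambda)=\frac{\lambda^{y} e^{-\lambda}}{y!} = \exp\big[y \log \lambda - \lambda - \log y!\big], \end{align*}\] with \(t(y)=y\), \(\eta(\lambda)=\log \lambda\), \(b(\lambda)=\lambda\), and \(c(y)=-\log y!\). In terms of the canonical parameter \(\theta=\log \lambda\), we have \(b(\theta)=e^{\theta}\), so the family is in canonical form.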
\(k\)-parameter exponential family
Let \(\boldsymbol\theta\) be a \(k\)-dimensional vector. The exponential family p.m.f. or p.d.f. has the form
\[\begin{align} p(y|\boldsymbol{\theta})= h(y)\, \exp\big[\sum_{i=1}^{k}\eta_{i}(\boldsymbol{\theta})\, t_{i}(y)-b(\boldsymbol{\theta})\big], \tag{1.32} \end{align}\] where \(\eta_{i}(\boldsymbol{\theta}),\ i=1,\ldots,k\) are real-valued functions of only \(\boldsymbol{\theta}\) and \(t_{i}(y), \ i=1,\ldots,k\) are real-valued functions of only \(y\). If \(\eta_{i}(\boldsymbol{\theta}) = \theta_i\), for all \(i\), the \(k\)-parameter exponential family will be in canonical form. We can also write (1.32) as
\[\begin{align} p(y|\boldsymbol{\theta})=h(y)\, \exp\big[\boldsymbol{\eta}(\boldsymbol{\theta})'\, {\boldsymbol t}(y)-b(\boldsymbol{\theta})\big], \tag{1.33} \end{align}\] or as
\[\begin{align} p(y|\boldsymbol{\theta})=h(y)\, g(\boldsymbol{\theta})\,\exp\big[\boldsymbol{\eta}(\boldsymbol{\theta})' \, {\boldsymbol t}(y)\big], \tag{1.34} \end{align}\] where \(\boldsymbol{\eta}(\boldsymbol{\theta})'=\big(\eta_{1}(\boldsymbol{\theta}), \ldots, \eta_{k}(\boldsymbol{\theta})\big)\) and \({\boldsymbol t}(y)=\big(t_1(y),\ldots, t_k(y)\big)\).
Exponential dispersion family of distributions
Let \(\theta\) be a scalar parameter of interest. Let \(\phi>0\) be a dispersion parameter which may or may not be known. Consider the p.d.f.
\[\begin{align} p(y| \theta,\phi) = \exp \big[\frac{y \theta - b(\theta)}{\phi} + c(y,\phi)\big], \tag{1.35} \end{align}\] where \(b(\cdot)\) and \(c(\cdot,\cdot)\) are specified functions. Here, \(\theta\) is called the canonical parameter of the distribution; \(b(\theta)\), \(\phi\), and \(c(y,\phi)\) are related by the condition that \(p(y| \theta,\phi)\) must integrate to 1. The function \(b(\theta)\) is usually given explicitly for standard distributions, while \(c(y,\phi)\) is left implicit. This is not a problem for estimating \(\theta\), since the score equation does not involve \(c(y,\phi)\). However, without an explicit \(c(y,\phi)\), neither likelihood based estimation of the dispersion \(\phi\) nor likelihood based inference for \((\theta,\phi)\) is possible.
The log-likelihood function of \((\theta,\phi)\) is given by
\[\begin{align} \ell(\theta,\phi; y) = \log p(y| \theta,\phi) = \frac{\big[y\theta - b(\theta)\big]}{\phi} + c(y,\phi). \tag{1.36} \end{align}\] To find the mean and variance of \(Y \sim p(y| \theta,\phi)\), recall that
\[\begin{align} & E[\nabla \ell(\theta,\phi; y)] = E(\partial \ell/ \partial \theta) = 0, \tag{1.37} \\ & E(\partial^{2} \ell/ \partial \theta^{2})+E\big[(\partial \ell/ \partial \theta)^{2}\big]=0, \tag{1.38} \end{align}\] where
\[\begin{align*} \partial \ell/ \partial \theta &= \{y- \partial b(\theta) /\partial \theta\} / \phi = \{y-b'(\theta)\} / \phi,\\ \partial ^{2} \ell/ \partial \theta^{2}&= - \{ \partial ^{2} b (\theta) / \partial \theta^{2} \}/\phi = -b^{\prime \prime }(\theta)/\phi, \end{align*}\] for the p.d.f. (1.35). It follows from the above that
\[\begin{align} E(Y) &= \mu = \partial b(\theta) /\partial \theta =b'(\theta), \tag{1.39} \\ Var(Y) &= E\big[(Y - \partial b(\theta) /\partial \theta )^2\big] = \phi\, \partial ^{2} b (\theta) / \partial \theta^{2} = \phi\, b''(\theta). \tag{1.40} \end{align}\] Since \(\phi>0\), as long as \(Y\) is nondegenerate, \(b''(\theta)>0\), so that \(b\) is strictly convex and \(b'\) is strictly increasing. Then the correspondence (1.39) is 1-to-1, and the distribution of \(Y\) can be parameterized by \(\mu = E(Y)\) and \(\phi\) via
\[\begin{align} \theta = (b')^{-1}(\mu). \tag{1.41} \end{align}\] The function \(b''(\theta)\) depends on the canonical parameter, and hence on \(\mu\); it is called the variance function and is denoted by \(v(\mu)\). Then,
\[\begin{align} Var(Y) = \phi v(\mu) \mbox{ with } v(\mu)= b''((b')^{-1}(\mu)). \tag{1.42} \end{align}\] This explains why \(\phi\) is called the dispersion parameter in a generalized linear model (GLIM). It plays a role similar to \(\sigma^2\).
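As a check of these formulas, consider \(Y \sim N(\mu, \sigma^2)\), with \(\sigma^2\) playing the role of the dispersion. The p.d.f. can be written in the form (1.35) as
\[\begin{align*} p(y|\theta,\phi)=\exp\Big[\frac{y\theta - \theta^{2}/2}{\phi} - \frac{y^{2}}{2\phi} - \frac{1}{2}\log(2\pi\phi)\Big], \end{align*}\] with \(\theta=\mu\), \(\phi=\sigma^2\), and \(b(\theta)=\theta^{2}/2\). Then \(E(Y)=b'(\theta)=\mu\), \(v(\mu)=b''(\theta)=1\), and \(Var(Y)=\phi\, v(\mu)=\sigma^2\), as expected.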