*How do you build a model from first principles? Here is a step by step guide.*

Following on from last week’s post on Principled Bayesian Workflow I want to reflect on how to motivate a model.

The purpose of most models is to understand change, and yet, considering what doesn’t change and should be kept constant can be equally important.

I will go through a couple of models in this post to illustrate this idea. The purpose of the model I want to build today is to predict how much ice cream is sold for different temperatures \((t)\). I am interested in the expected sales \((\mu(t))\), but also the data generating distribution that behaves in line with the observations \((y(t))\) and helps me to understand the variance of my sales.

# Data

My data set is quite small, it contains the units of ice cream sold for 12 different temperatures from \(11.9^\circ C\) to \(25.1^\circ C\).

```
icecream <- data.frame(
temp=c(11.9, 14.2, 15.2, 16.4, 17.2, 18.1,
18.5, 19.4, 22.1, 22.6, 23.4, 25.1),
units=c(185L, 215L, 332L, 325L, 408L, 421L,
406L, 412L, 522L, 445L, 544L, 614L)
)
```

This is the same data set I used earlier to discuss generalised linear models (GLMs.)

# Model building

Models are about what changes, and what doesn’t. Some are useful.

In mathematics change is often best described with differential equations, and that’s how I will motivate and justify my models today.

If something doesn’t change, we shouldn’t expected a variation of its measurement, and hence, constant variance \((\sigma^2)\) becomes a natural choice. The same can be said, if the change is constant.

## No change

The simplest model is to assume no change. Hot or cold, we assume to sale the same amount of ice cream. I can express this as an ordinary differential equation (ODE):

\[ \frac{d\mu(t)}{dt} = 0,\quad\mu(0) = \mu_0 \] It’s solution is \(\mu(t) = \mu_0\) and hence, we could assume for the observable data \(y(t)\): \[ y(t) \sim \mathcal{N}(\mu_0, \sigma^2) \]

Of course, this can’t be right. I would expect higher sales of ice cream for higher temperatures.

## Constant change

By assuming constant change I mean that the increase in sales is constant, regardless of the starting temperature, i.e. a change in temperature from \(7^\circ C\) to \(12^\circ C\) will cause the same increase in sales as from \(22^\circ C\) to \(27^\circ C\), or to put this into a ODE:

\[ \frac{d\mu(t)}{dt} = b,\quad \mu(0)=a \] Solving the ODE gives us the classic linear equation for the mean: \[ \mu(t) = a + b\,t \] Now, as we are assuming constant change, I should also assume that the variance \((\sigma^2)\) around my data doesn’t change either and hence, I could model my data as Gaussian again:

\[ y(t) \sim \mathcal{N}(\mu(t), \sigma^2) = \mathcal{N}(a + b\,t, \sigma^2) \]

This model makes a little more sense, but it could also predict negative expected sales, if \(a\) is negative. In addition, generating data from this distribution will produce negative values occasional as well, even if \(a>0\).

## Constant sales growth rate

Assuming a constant sales growth rate means that a \(1^\circ C\) increase in temperature is responded by the same proportional change in sales, i.e. for every \(1^\circ C\) increase the sales would increase by \(b\)%. Thus, we are moving from an additive to a multiplicative scale. We can describe the change in the *median* sales as:
\[
\frac{d\mu(t)}{dt} = \mu(t) \, b,\quad \mu(0)=a
\]
The solution to this ODE is exponential growth (\(b>0\)), or decay (\(b<0\)):
\[
\mu(t) = a \, \exp(b\,t)
\]
On a log-scale we have:
\[
\log{\mu(t)} = \log{a} + b\,t
\]

This is a linear model again. Thus, we can assume constant change and variance \((\sigma^2)\) on a log-scale. A natural choice for the data generating distribution would be:

\[ \log{y}(t) \sim \mathcal{N}(\log{a} + b\,t, \sigma^2) \]

Or, in other words, I believe the data follows a log-normal distribution. The expected sale for a given temperature is \(E[y(t)]=\exp{(\mu(t)+\sigma^2/2)}\), its median is \(\exp{(\mu(t))}\). In addition this also means that we assume a constant coefficient of variation \((CV)\) on the original scale:

\[ \begin{aligned} CV &=\frac{\sqrt{Var[y]}}{E[y]} \\ & = \frac{\sqrt{(e^{\sigma^2} - 1)e^{2\mu(t)+\sigma^2}}}{e^{\mu(t)+\sigma^2/2}} \\ & = \sqrt{e^{\sigma^2} - 1} \\ & = const \end{aligned} \]

Although this model avoids negative sales, it does assume exponential growth as temperatures increase. Perhaps, this a little too optimistic.

## Constant demand elasticity

A constant demand elasticity of \(b = 60\)% would mean in this context that a 10% increase in the temperature would lead to a 6% increase in demand/ sales.

Thus, the relative change in sales is proportional to the relative change in temperature:

\[ {\frac{d\mu(t)}{\mu(t)}} = b {\frac{dt}{t}} \] Re-arranging and adding an initial value gives us the following ODE:

\[ \frac{d\mu(t)}{dt} = \frac{b}{t}\mu(t),\quad\mu(1)=a \]

It’s solution is:

\[ \mu(t) = \exp{(a + b \log(t))} = \exp{(a)} \cdot t^b \]

Now take the log on both sides and we are back to linear regression equation:

\[\log(\mu(t)) = a + b \log(t)\] Thus, we are assuming constant change on a log-log scale. We can assume a log-normal data generating distribution again, but this time we measure temperature on a log-scale as well.

\[ \log{y(t)} \sim \mathcal{N}(a + b\log{t}, \sigma^2) \]

This model, as well as the log-transformed model above, predicts ever increasing ice cream sales as temperature rises.

## Constant market size

However, if I believe that my market/ sales opportunities are limited, i.e. we reach market saturation, then a logistic growth model would make more sense.

Let’s assume as temperature rises, sales will increase exponential initially and then tail off to a saturation level \(M\). The following ODE would describe this behaviour:

\[ \frac{d\mu(t)}{dt} = b\, \mu(t) (1 - \mu(t)/M),\quad\mu(0)=\mu_0 \]

Integrating the ODE provides us with the logistic growth curve (often used to describe population growth):

\[ \mu(t) = \frac{ M }{1 + \frac{M-\mu_0}{\mu_0}e^{-b\,t}} = \frac{\mu_0 M e^{b\,t}}{\mu_0 (e^{b\,t} - 1) + M} \]

Setting \(M=1\) gives me the proportion of maximum sales, and setting \(a:=(1-\mu_0)/\mu_0\) gives us:

\[ \begin{aligned} \mu_p(t) &= \frac{ 1 }{1 + \frac{1-\mu_0}{\mu_0}e^{-b\,t}} \\ & = \frac{1}{1 + e^{-(a+b\,t)}} \end{aligned} \]

Nice, this is logistic regression. Defining \(\mbox{logitic}(u) := \frac{1}{1 + e^{-u}}\), we can see that the above has a linear expression in \(u\).

Thus, if I set \(z(t):=y(t)/M\), I could assume the Binomial distribution as the data generating process:

\[ z(t) \sim \mbox{Bin}\left(M, \mbox{logitic}(\mu_p(t))\right) \]

But I don’t know \(M\). Thus, I have to provide a prior distribution, for example a Poisson distribution with parameter (my initial guess) \(M_0\)

\[ M \sim \mbox{Pois}(M_0) \]

Or, alternatively I could consider the Poisson distribution as the data generating distribution, with its mean and variance being described by the non-linear growth curve above (this is not a GLM, unlike the models so far):

\[ y(t) \sim \mbox{Pois}(\mu(t)) = \mbox{Pois}\left(\frac{\mu_0 M e^{b\,t}}{\mu_0 (e^{b\,t} - 1) + M}\right) \]

The Poisson distribution has the property that the mean and variance are the same. If \(y \sim \mbox{Pois}(\lambda)\), then \(E[y] =\mbox{Var}[y] = \lambda\), but more importantly, we are generating integers, in line with the observations.

So, let’s generate some data to check if this approach makes sense in the temperature range from \(-5^\circ C\) to \(35^\circ C\).

Let’s put priors over the parameters: \[ \begin{aligned} \mu_0 &\sim \mathcal{N}(10, 1),\quad \mbox{ice creams sold at }0^\circ C\\ b & \sim \mathcal{N}(0.2, 0.02^2),\quad \mbox{inital growth rate} \\ M & \sim \mathcal{N}(800, 40^2) ,\quad \mbox{total market size} \end{aligned} \]

```
temperature <- c(-5:35)
n <- length(temperature)
nSim <- 10000
set.seed(1234)
# Priors
mu0 = rep(rnorm(nSim, 10, 1), each=n)
b = rep(rnorm(nSim, 0.2, 0.02), each=n)
M = rep(rnorm(nSim, 800, 40), each=n)
lambda = matrix(M/(1 + (M - mu0)/mu0 * exp(-b * rep(temperature, nSim))),
ncol=n)
```

With those preparations done I can simulate data:

```
y_hat_p <- matrix(rpois(nSim*n, lambda), nrow=n)
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
pp=t(apply(y_hat_p, 1, quantile, probs=probs))
matplot(temperature, pp, t="l", bty="n",
main=paste("Poisson percentiles:",
"2.5%, 25%, 50%, 75%, 97.5%", sep="\n"))
```

This looks reasonable, but the interquartile range looks quite narrow.

Hence, I want to consider the Negative-Binomial distribution, which has an additional shape parameter \(\theta\) to allow for over-dispersion, i.e. the variance can be greater than the mean. A Negative Binomial distribution with parameters \(\mu\) and \(\theta\) has mean \(\mu\) and variance \(\mu + \mu^2/\theta\). With \(\theta \to \infty\) we are back to a Poisson distribution.

```
library(MASS) # provides negative binomial
y_hat_nb <- matrix(rnegbin(nSim*n, lambda, theta=20), nrow=n)
nbp <- t(apply(y_hat_nb, 1, quantile, probs=probs))
matplot(temperature, nbp, t="l", bty="n",
main=paste("Negative Binomial percentiles:",
"2.5%, 25%, 50%, 75%, 97.5%", sep="\n"))
```

This looks a little more realistic and adds more flexibility to my model.

Finally, I am getting to a stage where it makes sense to make use of my data.

# Fitting the model

Using `brms`

I can fit the non-linear logistic growth curve with
a Negative-Binomial data generating distribution.
Note, I put lower bounds on the prior parameter distributions at \(0\), as I allow for high variances and negative values don’t make sense for them.

```
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
library(brms)
mdl <- brm(bf(units ~ M / (1 + (M - mu0) / mu0 * exp( -b * temp)),
mu0 ~ 1,
b ~ 1,
M ~ 1,
nl = TRUE),
data = icecream,
family = brmsfamily("negbinomial", "identity"),
prior = c(prior(normal(10, 100), nlpar = "mu0", lb=0),
prior(normal(0.2, 1), nlpar = "b", lb=0),
prior(normal(800, 400), nlpar = "M", lb=0)),
seed = 1234, control = list(adapt_delta=0.9))
```

`mdl`

```
## Family: negbinomial
## Links: mu = identity; shape = identity
## Formula: units ~ M/(1 + (M - mu0)/mu0 * exp(-b * temp))
## mu0 ~ 1
## b ~ 1
## M ~ 1
## Data: icecream (Number of observations: 12)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## mu0_Intercept 33.68 20.01 6.21 81.11 1011 1.00
## b_Intercept 0.19 0.05 0.10 0.30 887 1.00
## M_Intercept 794.45 215.38 536.00 1361.55 1036 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## shape 94.78 54.40 24.53 228.93 1417 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
```

`plot(mdl)`

The output looks reasonably well behaved. The estimated standard error for \(\mu_0\) is high, and the 95% credible interval is also quite wide. But I have to remember, the parameter describes the average number of ice cream sold at \(0^\circ C\) and the lowest data point we have for that is only at \(11.9^\circ C\).

Indeed, I have only 12 data points, so not too surprising the credible interval around \(M\), the total market size, is quite wide too. But like for the other parameters, its posterior standard error has shrunk considerably from the one assumed in the prior parameter distribution.

Let’s review the posterior predictive distribution for temperatures from \(-5^\circ C\) to \(35^\circ C\):

```
pp <- brms::posterior_predict(mdl, newdata=data.frame(temp=temperature))
pp_probs <- t(apply(pp, 2, quantile, probs=probs))
matplot(temperature, pp_probs, t="l", bty="n",
main=paste("Posterior predictive percentiles:",
"2.5%, 25%, 50%, 75%, 97.5%", sep="\n"),
ylab="Units of icecream sold",
xlab="Temperature")
points(icecream$temp, icecream$units, pch=19)
```

Hurray! This looks really good.

I am happy with the model and its narrative. The parameters are easy interpretable, the curve can be explained and the model generates positive integers, in-line with the data.

Of course, there will be many other factors that influence the sales of ice cream, but I am confident that the temperature is the main driver.

# Conclusion

Approaching the model build by thinking about how the sale of ice cream will change, as the temperature changes, and considering what to keep constant has led me down the road of differential equations.

With a small data set like this, starting from a simple model and reasoning any additional complexity has helped me to develop a better understanding of what a reasonable underlying data generating process might be.

Indeed, generating data from the model before using the data, helped to check my understanding of how the model might perform.

Thanks to Stan and brms, I was freed from the constrains under GLMs and I could select my own non-linear model for the expected sales and distribution family.

# Session Info

`sessionInfo()`

```
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] MASS_7.3-50
##
## loaded via a namespace (and not attached):
## [1] Brobdingnag_1.2-5 splines_3.5.0 gtools_3.8.1
## [4] StanHeaders_2.17.2 threejs_0.3.1 shiny_1.1.0
## [7] assertthat_0.2.0 stats4_3.5.0 yaml_2.1.19
## [10] pillar_1.2.3 backports_1.1.2 lattice_0.20-35
## [13] glue_1.2.0 digest_0.6.15 promises_1.0.1
## [16] minqa_1.2.4 colorspace_1.3-2 htmltools_0.3.6
## [19] httpuv_1.4.4.1 Matrix_1.2-14 plyr_1.8.4
## [22] dygraphs_1.1.1.4 pkgconfig_2.0.1 rstan_2.17.3
## [25] bookdown_0.7 purrr_0.2.5 xtable_1.8-2
## [28] mvtnorm_1.0-8 scales_0.5.0 later_0.7.3
## [31] lme4_1.1-17 tibble_1.4.2 bayesplot_1.5.0
## [34] ggplot2_2.2.1 DT_0.4 shinyjs_1.0
## [37] lazyeval_0.2.1 survival_2.42-3 magrittr_1.5
## [40] mime_0.5 evaluate_0.10.1 nlme_3.1-137
## [43] xts_0.10-2 blogdown_0.6 colourpicker_1.0
## [46] rsconnect_0.8.8 tools_3.5.0 loo_2.0.0
## [49] matrixStats_0.53.1 stringr_1.3.1 munsell_0.5.0
## [52] bindrcpp_0.2.2 compiler_3.5.0 rlang_0.2.1
## [55] nloptr_1.0.4 grid_3.5.0 ggridges_0.5.0
## [58] rstanarm_2.17.4 htmlwidgets_1.2 crosstalk_1.0.0
## [61] igraph_1.2.1 miniUI_0.1.1.1 labeling_0.3
## [64] base64enc_0.1-3 rmarkdown_1.10 codetools_0.2-15
## [67] gtable_0.2.0 inline_0.3.15 abind_1.4-5
## [70] markdown_0.8 reshape2_1.4.3 R6_2.2.2
## [73] gridExtra_2.3 rstantools_1.5.0 zoo_1.8-2
## [76] knitr_1.20 bridgesampling_0.4-0 dplyr_0.7.5
## [79] brms_2.3.1 bindr_0.1.1 shinystan_2.5.0
## [82] shinythemes_1.1.1 rprojroot_1.3-2 stringi_1.2.3
## [85] parallel_3.5.0 Rcpp_0.12.17 tidyselect_0.2.4
## [88] xfun_0.2 coda_0.19-1
```