Nanyang Business School Forum on Risk Management and Insurance
Regularized Regression for Reserving and Mortality Models
Maximum likelihood estimation (MLE), once the prime method of statistical estimation, was shown in Stein’s 1956 paper to produce higher error variances than shrinking fitted values towards the overall mean does. Stein’s shrinkage is very similar to actuarial credibility theory. In 1970, Hoerl and Kennard proved that for models with explanatory variables, like regression models, shrinking the coefficients towards zero similarly reduces the error variance. Their result grew out of regularization, a mathematical approach to otherwise intractable models.
The problem has been that deciding how much to shrink is complicated: there was no applicable goodness-of-fit measure to adjust the negative log likelihood (NLL) for shrunk parameters. This has now been solved using Bayesian shrinkage. The paper lays out the methodology for frequentist and Bayesian shrinkage, and shows how to apply both to the loss reserving and mortality models that fit row and column factors to data in rectangular arrays, including triangles.
The 1970 paper introduced ridge regression, the name coming from its derivation. Instead of minimizing NLL, as MLE does, ridge regression minimizes NLL + s*Sum(bj²) for a shrinkage constant s, where the bj are the parameters; the constant term is not shrunk. Hoerl and Kennard proved that there is always some positive value of s that gives lower error variance than MLE.
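For a Gaussian model, where the NLL is proportional to the sum of squared errors, the ridge minimization has a well-known closed form, (X'X + s*I)⁻¹X'y. A minimal sketch, with made-up data and the intercept left unshrunk by centering, as the text describes:

```python
import numpy as np

def ridge_fit(X, y, s):
    """Return (intercept, coefficients) with penalty s on the slopes only."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                      # centered predictors
    yc = y - y_mean                      # centered response
    p = Xc.shape[1]
    # closed-form minimizer of SSE + s*Sum(bj^2)
    b = np.linalg.solve(Xc.T @ Xc + s * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ b      # recover the unpenalized constant
    return intercept, b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

b0_mle, b_mle = ridge_fit(X, y, s=0.0)       # s = 0 recovers ordinary MLE
b0_ridge, b_ridge = ridge_fit(X, y, s=10.0)  # s > 0 pulls slopes toward zero
```

Setting s = 0 recovers the MLE fit, and any s > 0 produces a coefficient vector with smaller norm.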
In the 1990s, an alternative called the lasso was developed that minimizes NLL + s*Sum(|bj|). It has become popular because it shrinks some coefficients to exactly zero, eliminating those variables. For a given s, a modeler can feed the lasso many variables, and only those that in combination best fit the data get non-zero parameters.
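The zeroing-out behavior can be seen in a minimal lasso sketch using cyclical coordinate descent, with the Gaussian NLL taken as half the sum of squared errors. The data and the choice s = 5 are made up for illustration:

```python
import numpy as np

def soft_threshold(z, s):
    return np.sign(z) * max(abs(z) - s, 0.0)

def lasso_fit(X, y, s, n_iter=200):
    """Minimize 0.5*||y - X b||^2 + s*Sum(|bj|) by coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]   # partial residual excluding j
            b[j] = soft_threshold(X[:, j] @ r_j, s) / (X[:, j] @ X[:, j])
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))
# only the first two variables actually matter in this fake data
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.2, size=60)

b = lasso_fit(X, y, s=5.0)
print(b)  # most of the noise variables' coefficients land exactly at zero
```

Unlike ridge, which only moves coefficients closer to zero, the soft-threshold step sets small coefficients to exactly zero, which is what drops variables from the model.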
Methods have evolved for selecting s. Most involve dividing the data set into subsets and treating each in turn as a holdout sample: the parameters are estimated without the holdout, and its NLL is computed. Summing those holdout NLLs gives a goodness-of-fit measure for that s, which can be compared across candidate values to select s. The aim is to estimate parameters that generalize to the population rather than ones that merely optimize the fit to the sample. This is fast in R packages like glmnet, which use coordinate descent and have guidelines for focusing on a range of s values.
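The selection loop can be sketched in a few lines. Here the holdout squared error stands in for the Gaussian NLL (they agree up to constants with fixed variance), the ridge closed form is the fitting step, and the grid of s values and the data are made up:

```python
import numpy as np

def ridge_coefs(X, y, s):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + s * np.eye(p), X.T @ y)

def cv_score(X, y, s, k=5):
    """Sum holdout squared errors over k folds for shrinkage constant s."""
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    total = 0.0
    for holdout in folds:
        train = np.setdiff1d(np.arange(n), holdout)
        b = ridge_coefs(X[train], y[train], s)
        total += np.sum((y[holdout] - X[holdout] @ b) ** 2)
    return total

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 8))
y = X @ rng.normal(scale=0.5, size=8) + rng.normal(size=40)

scores = {s: cv_score(X, y, s) for s in [0.0, 1.0, 10.0, 100.0]}
best_s = min(scores, key=scores.get)  # the s with the lowest holdout error
```

Packages like glmnet do the same thing far more efficiently over a whole path of s values, but the logic is this simple comparison.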
Bayesian estimation starts by proposing distributions for where each parameter is likely to be, then uses these together with the likelihood function to estimate the joint posterior distribution of the parameters. This is now done numerically using Markov chain Monte Carlo (MCMC) estimation. There are R packages for that as well, such as Stan, which is used here.
Bayesian shrinkage gives the parameters mean-zero prior distributions, like the standard normal. This pushes the posteriors towards zero to some degree, depending on the assumed standard deviations. It turns out that with normal priors, the posterior mode is the ridge regression estimate of the parameters. The double exponential, or Laplace, prior, which looks like an exponential density mirrored across the Y-axis, yields the lasso estimates as its posterior mode. In both cases, s is related to the standard deviation of the prior.
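The normal-prior half of this correspondence can be checked numerically in one dimension: with a Gaussian likelihood (sd sigma) and a normal(0, tau) prior on b, the posterior mode coincides with the ridge estimate at s = sigma²/tau². A sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 1.5 * x + rng.normal(scale=0.5, size=30)
sigma, tau = 0.5, 1.0

def log_posterior(b):
    log_lik = -0.5 * np.sum((y - b * x) ** 2) / sigma**2
    log_prior = -0.5 * b**2 / tau**2          # normal(0, tau) shrinkage prior
    return log_lik + log_prior

# locate the posterior mode by brute-force grid search
grid = np.linspace(-3, 3, 60001)
b_mode = grid[np.argmax([log_posterior(b) for b in grid])]

s = sigma**2 / tau**2                         # implied shrinkage constant
b_ridge = (x @ y) / (x @ x + s)               # one-dimensional ridge estimate
# b_mode and b_ridge agree to the grid precision
```

The same argument with a Laplace prior, whose log density penalizes |b| rather than b², reproduces the lasso estimate as the mode.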
Bayesian shrinkage has some advantages over frequentist:
– There is a goodness-of-fit measure now available, from 2017 and 2018 papers by Vehtari et al. It is called loo, for leave-one-out. It uses a cross-validation method that treats every observation as its own holdout sample. The sum of the left-out observations’ NLLs gives a reasonable estimate of the NLL for a new sample from the same population, which is also the goal of penalized-likelihood measures like AIC, etc. It is fast to estimate numerically from the posterior distribution fit to the whole sample.
– Instead of trying a number of values of s, the fully Bayesian approach puts a prior on s itself. That provides a direct estimate of the posterior distribution of s, which usually resembles the s that optimizes loo and is sometimes slightly better than any single fixed value. This greatly simplifies the fitting process.
– MCMC can provide the posterior mode, which corresponds to the classical methods, but Bayesians can also use the posterior mean. The mode appears to carry more risk of responding too strongly to properties unique to the sample.
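The quantity loo estimates, from the first point above, can be illustrated by brute force: hold out each observation in turn, refit, and accumulate the held-out point's NLL. Vehtari et al.'s loo approximates this from a single posterior fit rather than refitting n times; here a ridge fit with known sigma stands in for the model, and all inputs are made up:

```python
import numpy as np

def ridge_coefs(X, y, s):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + s * np.eye(p), X.T @ y)

def loo_nll(X, y, s, sigma=1.0):
    """Sum of each left-out observation's Gaussian NLL after refitting."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b = ridge_coefs(X[keep], y[keep], s)      # fit without observation i
        resid = y[i] - X[i] @ b
        total += 0.5 * np.log(2 * np.pi * sigma**2) + 0.5 * resid**2 / sigma**2
    return total

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
y = X @ np.ones(4) + rng.normal(size=30)
score = loo_nll(X, y, s=1.0)   # lower is better; compare across s values
```

The refit-n-times version shown here is exact but slow; the point of the loo papers is that the same estimate comes cheaply from the posterior draws of one MCMC run.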
Many actuarial applications use regression or GLM modeling, which can be run directly with shrinkage. Usually the constant term, the fixed parameters of the residual distribution, and s itself get regular priors, not shrinkage priors.
Row-column models can be set up as regressions or GLM models with the rectangle of observations strung out into a vector, and a design matrix with (0,1) dummy variables for each original row and column to identify the corresponding observations. Then the matrix of dummy variables times the vector of coefficients gives fitted values for a linear model with row and column effects. These can be exponentiated to get a multiplicative model.
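A sketch of that setup, with the first row and column dummies dropped (absorbed into the constant) to keep the model identifiable; the coefficient values are made up for illustration:

```python
import numpy as np

def row_column_design(n_rows, n_cols):
    """(0,1) dummy design matrix for a row-plus-column linear model."""
    rows, cols = np.meshgrid(np.arange(n_rows), np.arange(n_cols), indexing="ij")
    rows, cols = rows.ravel(), cols.ravel()      # rectangle strung into vectors
    X = np.zeros((n_rows * n_cols, 1 + (n_rows - 1) + (n_cols - 1)))
    X[:, 0] = 1.0                                # constant term
    for i in range(1, n_rows):
        X[rows == i, i] = 1.0                    # row dummies
    for j in range(1, n_cols):
        X[cols == j, n_rows - 1 + j] = 1.0       # column dummies
    return X

X = row_column_design(3, 4)                      # 12 observations, 6 parameters
b = np.array([1.0, 0.2, -0.1, 0.3, 0.1, -0.2])  # constant, 2 row, 3 col effects
fitted = np.exp(X @ b)                           # multiplicative fitted values
```

Each fitted cell is the exponential of a constant plus its row and column effects, i.e. a product of row and column factors.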
That does not quite work for shrinkage, as you would not want to push these coefficients towards zero. It works better to fit curves across the rows, columns, etc., and shrink parameters to make those curves smoother. In the January 2018 issue of the ASTIN Bulletin, one paper did this with cubic splines for loss reserve modeling, and another used linear splines (piecewise linear curves) for a mortality model. The present paper uses linear splines for loss reserve models.
What you shrink is the slope changes between line segments, which are the second differences of the row and column coefficients. If one of them goes to zero, the previous piecewise-linear slope continues. The coefficients are cumulative sums of the slopes, which in turn are cumulative sums of the slope changes. This can still be set up as a linear model with dummy variables for the row and column slope changes, but instead of (0,1) dummies, the dummies count how many times each slope change accumulates for each data point.
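A sketch of those counting dummies along one dimension (say, the columns). The conventions here, zero-based positions with slope changes starting at position 1, are assumptions for illustration; the check at the end confirms that the counting design matrix times the slope changes reproduces the double cumulative sum:

```python
import numpy as np

K = 6                                      # positions 0..5 along one dimension
d = np.array([0.5, -0.2, 0.0, 0.1, 0.0])  # slope changes at positions 1..5
slopes = np.cumsum(d)                      # slope of each segment
coefs = np.concatenate([[0.0], np.cumsum(slopes)])  # coefficients at 0..5

# Equivalent linear-model form: counting dummies instead of (0,1) dummies.
k = np.arange(K)[:, None]                  # data positions (column vector)
t = np.arange(1, K)[None, :]               # slope-change positions (row vector)
D = np.maximum(k - t + 1, 0)               # how many times change t has added up
assert np.allclose(D @ d, coefs)           # same coefficients either way
```

Because the zero slope change at position 3 contributes nothing, the segment there simply continues the previous slope, which is exactly the behavior lasso-style shrinkage of d exploits.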
MCMC is quite flexible in its modeling capabilities and is not restricted to the distributions or model forms of GLM. For instance, the examples use a gamma distribution with mean aj*b and variance aj*b² for the jth cell. This makes the variance b times the mean in each cell, which is fairly realistic for aggregate losses. That is not possible in GLM, where a is the nuisance parameter and so is constant across cells.
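In the shape/scale parameterization, aj is the shape and b the scale, so mean = aj*b and variance = aj*b² = b*mean regardless of how aj varies. A quick check with made-up parameter values, including a simulation for one cell:

```python
import numpy as np

b = 2.0
for a_j in [0.5, 1.0, 4.0]:
    mean = a_j * b          # gamma mean with shape a_j, scale b
    var = a_j * b**2        # gamma variance
    assert np.isclose(var, b * mean)   # variance is b times the mean

# simulation check for one cell's parameters (shape 4, scale 2)
rng = np.random.default_rng(4)
sample = rng.gamma(shape=4.0, scale=b, size=200_000)
# sample mean is near 4*2 = 8 and sample variance near 4*4 = 16
```

A gamma GLM instead holds the shape fixed across cells, which forces the variance to be proportional to the squared mean, a different and often less suitable relationship for aggregate losses.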
Also, one of the models shown includes an additive column constant on top of a row-column factor model. This was suggested by Muller in a 2016 Variance paper, and it often improves the fit enough to be worth the extra parameters.
The complete paper is available for download at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3218863