library("tidyverse")
library("purrr")
leCessie_test <- function(object, bandwidth) {
  # Define a function to calculate categorical distances
  categorical <- function(x) {
    x <- as.matrix(x)
    retval <- apply(x, 2, function(y) {
      nrcats <- length(unique(y))
      if (nrcats == 1) {
        disty <- rep(0, length(y) * (length(y) - 1))
      } else {
        disty <- as.numeric(dist(y, 'manhattan') != 0) * nrcats / (nrcats - 1)
      }
      return(disty)
    })
    return(rowSums(retval))
  }
  # Store the original object and construct the variable name
  object.orig <- object
  Dname <- as.character(object$call$formula)
  Dname <- paste(Dname[2], Dname[1], Dname[3]) # Arrange variable names in the correct order
  # Calculate fitted values and residuals from the model
  fits <- fitted(object)
  resids <- resid(object, 'response')
  N <- length(resids)
  # Extract the model frame and check if response is a matrix
  obj.frame <- model.frame(object)
  if (is.matrix(obj.frame[[1]])) stop("Cannot perform test where response is matrix")
  # Calculate distances for each predictor in the model
  distances <- lapply(obj.frame[, -1, drop = FALSE], function(x) {
    if (is.numeric(x)) {
      scaled_x <- scale(x)
      dist_x <- as.matrix(dist(scaled_x))
      return(0.5 * dist_x * dist_x) # Compute squared Euclidean distance
    } else if (is.factor(x)) {
      return(categorical(x)) # Use the categorical distance function for factor variables
    } else {
      scaled_x <- scale(x[[1]])
      dist_x <- as.matrix(dist(scaled_x))
      return(0.5 * dist_x * dist_x) # Handle structured variables created from a function
    }
  })
  # Initialize a matrix for combined distances and fill it
  dist.mat <- matrix(0, N, N)
  for (dist_x in distances) {
    dist.mat[lower.tri(dist.mat)] <- dist.mat[lower.tri(dist.mat)] + sqrt(dist_x[lower.tri(dist_x)])
  }
  dist.mat <- dist.mat + t(dist.mat) # Symmetrize the distance matrix
  # Set bandwidth if not specified
  if (missing(bandwidth)) bandwidth <- mean(dist.mat)
  R.raw <- pmax(1 - dist.mat / bandwidth, 0) # Compute the raw smoother matrix
  # Calculate smoothed residuals and raw Q
  smoothres <- t(resids) %*% R.raw
  Q.raw <- sum(smoothres * resids)
  # Prepare matrices for the corrected estimation of Q
  obj.mat <- model.matrix(object.orig)
  mu2 <- fits * (1 - fits)
  V <- diag(mu2)
  Vx <- diag(V)
  obj.hat <- Vx * obj.mat %*% solve(t(obj.mat) %*% (Vx * obj.mat)) %*% t(obj.mat)
  R.cor <- (diag(N) - obj.hat) %*% R.raw %*% (diag(N) - obj.hat)
  # Compute corrected estimated value of Q and its standard error
  E.Q <- sum(diag(R.cor) * mu2)
  mu4 <- mu2 * (1 - 3 * mu2)
  VarQ1 <- sum(diag(R.cor)^2 * (mu4 - 3 * mu2^2))
  R.tmp <- R.cor * matrix(rep(mu2, each = N), nrow = N)
  VarQ2 <- 2 * sum(diag(R.tmp %*% R.tmp))
  VarQ <- VarQ1 + VarQ2
  # Calculate the test statistic and degrees of freedom
  Test <- Q.raw * 2 * E.Q / VarQ
  df <- 2 * E.Q^2 / VarQ
  # Construct the output
  output <- data.frame(statistic = Test,
                       df = df,
                       p.value = 1 - pchisq(Test, df),
                       q = Q.raw,
                       e_q = E.Q,
                       se_q = sqrt(VarQ))
  return(output)
}
I first read about the Cessie-Houwelingen Test in Section 10.5, Assessment of Model Fit, of Dr. Frank Harrell’s “Regression Modeling Strategies, 2nd Ed.” years ago, and I have always wanted to do a deeper dive into the inner workings of the test statistic.
S. le Cessie and J. C. van Houwelingen in their 1991 article “A Goodness-of-Fit Test for Binary Regression Models, Based on Smoothing Methods” published in Biometrics proposed a global test statistic for logistic regression models. This test has a few other names: le Cessie-van Houwelingen Test, le Cessie-van Houwelingen Normality Test, le Cessie-van Houwelingen goodness of fit Test, and I’m sure a few others. For the sake of brevity, I will be sticking with the Cessie-Houwelingen Test for the rest of this article.
NOTE: It’s important to distinguish the Cessie-Houwelingen Test from the Cessie-Houwelingen-Copas-Hosmer unweighted sum of squares test, which is a separate method for assessing global goodness of fit. That test, also referred to as the unweighted sum of squares test or the le Cessie-van Houwelingen normal test statistic for the unweighted sum of squared errors, comes up again in the historical context section below. Hosmer et al. (1997) demonstrated its robustness through simulations, and Harrell recommends its use. It wasn’t until writing this article that I realized they are two separate tests!
The Cessie-Houwelingen Test statistic is essentially a sum of squared standardized residuals post kernel smoothing. The kernel smoothing is particularly important as it accounts for the continuous attributes of covariates, enabling a more detailed and nuanced assessment of how well the model performs (as compared to binning the predictions like in the Hosmer-Lemeshow Test). The essence of this test statistic lies in integrating these refined, smoothed residuals to yield a comprehensive measure of the model’s overall fit.
Calculation Steps
The Cessie-Houwelingen Test Statistic is calculated as follows:
Standardization of Residuals
Kernel Smoothing of Residuals
Adjusting Smooth Residuals by Inverse Variance
Let's break down each step:
1. Standardization of Residuals
First we need to compute the standardized residuals from the logistic regression model:
\[ r_i = \frac{Y_i - \pi(X_i)}{\sqrt{\pi(X_i)(1 - \pi(X_i))}} \]
where \(Y_{i}\) are the observed responses, \(\pi(X_i)\) are the predicted probabilities from the logistic model and \(r_i\) are the standardized residuals.
\(\pi(X_i)\) is the predicted outputs for the logistic model: \(\pi(x) = \frac{e^{\alpha + x\beta}}{1 + e^{\alpha + x\beta}}\)
For each observation, the residual is defined as the difference between the observed binary outcome \(Y_i\) and the predicted probability \(\pi(X_i)\) from the logistic regression model. The difference is then standardized by dividing it by the standard deviation of the predicted probability: \({\sqrt{\pi(X_i)(1 - \pi(X_i))}}\).
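To make this concrete, here is a quick sketch in R using a throwaway logistic fit on the built-in mtcars data (the model itself is just a placeholder):

# Minimal sketch of step 1: standardized residuals from a logistic fit.
# `fit` can be any glm(..., family = binomial) object; this one is a placeholder.
fit <- glm(vs ~ mpg, data = mtcars, family = binomial)

pi_hat <- fitted(fit)                                  # predicted probabilities pi(X_i)
y      <- fit$y                                        # observed binary responses Y_i
r      <- (y - pi_hat) / sqrt(pi_hat * (1 - pi_hat))   # standardized residuals r_i

# These are exactly the Pearson residuals:
all.equal(unname(r), unname(residuals(fit, type = "pearson")))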
2. Smoothing Residuals
\[\hat{r}(x) = \frac{\sum_{i=1}^n K_h(x - X_i)\, r_i}{\sum_{i=1}^n K_h(x - X_i)}\]
where \(K_h\) is the kernel function with bandwidth \(h\) and \(\hat{r}(x)\) is the smoothed residual at \(x\). The smoother used in le Cessie & van Houwelingen (1991) is the Nadaraya-Watson estimator, a weighted average of the residuals in the neighborhood of \(x\): the bandwidth determines the size of the region over which the residuals are averaged, and the kernel function determines the weighting. The kernel used in the manuscript (presumably for simplicity) was the uniform kernel function.
Uniform Kernel Function: \[ K(x) = \begin{cases} \frac{1}{2} & \text{if } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}\]
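As a small illustration of this step, here is a sketch of Nadaraya-Watson smoothing with the uniform kernel, applied to the residuals r and covariate mpg from the sketch above; the bandwidth h = 2 is arbitrary:

# Sketch of step 2: Nadaraya-Watson smoothing of residuals with a uniform kernel.
uniform_kernel <- function(u) ifelse(abs(u) <= 1, 0.5, 0)

smooth_residuals <- function(x, r, h) {
  sapply(x, function(x0) {
    w <- uniform_kernel((x0 - x) / h)   # kernel weights K_h(x0 - X_i)
    sum(w * r) / sum(w)                 # weighted average of nearby residuals
  })
}

r_hat <- smooth_residuals(mtcars$mpg, r, h = 2)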
3. Test Statistic
\[T = n \sum_{i=1}^n \hat{r}(X_i)^2 U(X_i)\]
where \(U(X_i)\) is the inverse of the variance of the smoothed residual at \(X_i\) and \(n\) is the sample size.
The inverse of the variance is a measure of precision or reliability. It indicates how much we trust the estimate of the smoothed residual at each observation. When squared residuals are weighted by the inverse of their variances, observations with more precise estimates (i.e., lower variance) contribute more to the test statistic, while observations with less precise estimates (i.e., higher variance) contribute less.
Each observation contributes to the test statistic \(T\) through the square of its smoothed residual \(\hat{r}(X_i)^2\), scaled by the factor \(U(X_i)\), the inverse of the variance of the smoothed standardized residual at observation \(X_i\). This scaling means each observation contributes to the test statistic in a comparable manner: multiplying \(\hat{r}(X_i)^2\) by \(U(X_i)\) accounts for the precision of the smoothed residual at that observation. Observations with more precise smoothed residuals (lower variance) have a larger \(U(X_i)\), so their contribution is increased; in contrast, observations with less precise smoothed residuals (higher variance) have a smaller \(U(X_i)\), and their contribution is decreased.
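Putting the three steps together, a toy version of the statistic might look like the sketch below. The inverse-variance weights are simplified here (assuming independent, unit-variance standardized residuals), so this only illustrates the structure of \(T\) rather than reproducing the exact statistic from the 1991 paper:

# Toy sketch of step 3: combine smoothed residuals and inverse-variance weights.
# Under the simplifying assumption that the r_i are independent with unit variance,
# Var(r_hat(x0)) = sum((w_i / sum(w))^2), and U(x0) is its reciprocal.
uniform_kernel <- function(u) ifelse(abs(u) <= 1, 0.5, 0)

toy_cessie_statistic <- function(x, r, h) {
  n <- length(x)
  contributions <- sapply(x, function(x0) {
    w     <- uniform_kernel((x0 - x) / h)
    w     <- w / sum(w)
    r_hat <- sum(w * r)        # smoothed residual at x0
    U     <- 1 / sum(w^2)      # (approximate) inverse variance of the smoothed residual
    r_hat^2 * U
  })
  n * sum(contributions)
}

toy_cessie_statistic(mtcars$mpg, r, h = 2)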
R Implementation
I discovered a single R implementation of the Cessie-Houwelingen Test in the smwrStats package, which is now archived. This package was developed by Dr. Laura DeCicco at the United States Geological Survey (USGS). For continuous covariates, the implementation computes distances between the scaled values using (squared) Euclidean distance; for categorical covariates, it uses a Manhattan-style distance. It then appears to use a local smoother (with weights of the form max(1 - d/h, 0)) rather than the Nadaraya-Watson estimator, and when no bandwidth is specified it defaults to the mean of the pairwise distances. My somewhat tidy version of the implementation is the leCessie_test() function shown at the top of this post.
Then testing it out on the titanic dataset.
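(The exact model that produced the output below isn't shown here, so the following call is only a plausible sketch: the formula, the choice of predictors, and the NA handling are my assumptions, and the numbers will vary with those choices.)

library(titanic)

titanic_df <- titanic_train |>
  dplyr::select(Survived, Age, Fare, Pclass) |>
  tidyr::drop_na()

fit_titanic <- glm(Survived ~ Age + Fare + Pclass,
                   data = titanic_df, family = binomial)

leCessie_test(fit_titanic)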
statistic df p.value q e_q se_q
1 20.56367 5.171196 0.001133206 380.1192 95.58951 59.44697
A p-value of 0.001133206 in the Cessie-Houwelingen Test strongly suggests that the model does not fit the data well. This result indicates a need to review and potentially revise the model to better capture the underlying relationship in the observed data.
While there might not be a lot of implementations of the Cessie-Houwelingen Test, there are quite a few of the Cessie-Houwelingen-Copas-Hosmer unweighted sum of squares test:
residuals(rms::lrm(Y ~ X, data = simulated_data, x = TRUE, y = TRUE), type = "gof")
DescTools::HosmerLemeshowTest(fit)
MKmisc::HLgofTest(fit)
A simple explanation of the Cessie-Houwelingen Test
Thus, you can think of the Cessie-Houwelingen Test as comparing observed to predicted values via standardized residuals that are kernel smoothed, squared, and then weighted by the inverse of the variance of the smoothed residuals.
Some Historical Context
I believed it would be beneficial to provide a succinct overview of the timeline regarding the development of the Cessie & Houwelingen Statistic in relation to other global goodness-of-fit tests for logistic regression. Please note, this is not a comprehensive historical chronology.
1980 - Hosmer and Lemeshow introduced a widely-used method for assessing the goodness-of-fit in logistic regression models. This technique involves categorizing data into several groups based on the model’s predicted probabilities and then computing a chi-squared-like statistic.
1982 - Following this, Brown detailed a goodness-of-fit test for the logistic model based on score statistics. Brown’s approach extends the logistic model to a broader family of models with two additional parameters that, when set to zero, reduce the model to its standard logistic form.
1983 - Copas later used kernel methods to visually assess the fit of these models.
1989 - Copas introduced the unweighted sum of squares test for proportions.
1991 - Cessie & Houwelingen - Modified Copas (1983) to create a weighted sum of squares based on nonparametric kernel methods.
1995 - Expansion of the Cessie & Houwelingen test statistics to allow for the unweighted sum of squares. (Note that the unweighted sum of squares test from Copas 1989 is considered a special case of the series of goodness-of-fit test statistics considered by Cessie and Houwelingen.)
1997 - Hosmer and colleagues compared various goodness-of-fit tests, including the Pearson Chi-Square, Unweighted Sum of Squares (Copas 1989), Hosmer-Lemeshow (C, H), Kernel Smoothing (Uniform and Cubic weighting, Cessie & Houwelingen 1991), Royston (monotone and quadratic), and the Stukel Score Test. They found the unweighted sum of squares was both simple and powerful.
Some thoughts
- Changing Kernel Function and Bandwidth
In Section 7. Practical Considerations, the authors state:
“In the literature of kernel regression and density estimation, it is suggested that the choice of the kernel function is not so important. The simplest kernel to deal with is the uniform kernel on \([-\frac{1}{2},\frac{1}{2}]\), which has been used throughout this paper. There is no evidence that a different type of kernel function would alter the result significantly.”
Although I believe this statement to be correct, I noticed that the authors haven’t provided any references to back it up. I’m curious to explore how the test statistic varies with different kernel functions, specifically the Gaussian and Epanechnikov kernels, and also how it’s affected by adjusting the bandwidth; a small sketch of these kernels as R functions follows the definitions below.
Gaussian Kernel Function
- \(K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} x^2}\)
Epanechnikov Kernel Function
- \(K(x) = \begin{cases} \frac{3}{4}(1 - x^2) & \text{if } |x| \leq 1, \\ 0 & \text{otherwise.} \end{cases}\)
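Swapping kernels in a smoother only requires replacing the weighting function. Here is a small sketch of the three kernels as R functions (the function names are mine):

# The three kernel functions, written as plain R functions.
uniform_kernel      <- function(u) ifelse(abs(u) <= 1, 0.5, 0)
gaussian_kernel     <- function(u) dnorm(u)  # (1 / sqrt(2 * pi)) * exp(-u^2 / 2)
epanechnikov_kernel <- function(u) ifelse(abs(u) <= 1, 0.75 * (1 - u^2), 0)

# Compare the weights each kernel assigns across a grid of scaled distances:
u <- seq(-1.5, 1.5, by = 0.5)
data.frame(u,
           uniform      = uniform_kernel(u),
           gaussian     = gaussian_kernel(u),
           epanechnikov = epanechnikov_kernel(u))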
- It would be nice to recreate the results of the Hosmer 1997 paper to show the performance of the Cessie-Houwelingen-Copas-Hosmer unweighted sum of squares test.
Further Reading:
Hosmer, D. W., & Lemeshow, S. (1980). Goodness-of-fit tests for the multiple logistic regression model. Communications in Statistics, 10, 1043-1069. https://doi.org/10.1080/03610928008827941
Brown, C. C. (1982). On a goodness of fit test for the logistic model based on score statistics. Communications in Statistics - Theory and Methods, 11(10), 1087-1105. https://doi.org/10.1080/03610928208828295
Copas, J. B. (1983). Plotting p against x. Journal of the Royal Statistical Society. Series C (Applied Statistics), 32(1), 25-31. https://doi.org/10.2307/2348040
Copas, J. B. (1989). Unweighted sum of squares test for proportions. Journal of the Royal Statistical Society. Series C (Applied Statistics), 38(1), 71-80. https://doi.org/10.2307/2347682
le Cessie, S., & van Houwelingen, J. C. (1991). A goodness-of-fit test for binary regression models, based on smoothing methods. Biometrics, 47, 1267-1282.
le Cessie, S., & van Houwelingen, H. C. (1995). Testing the fit of a regression model via score tests in random effects models. Biometrics, 51(2), 600-614.
Hosmer, D. W., Hosmer, T., le Cessie, S., & Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16(9), 965-980. https://doi.org/10.1002/(sici)1097-0258(19970515)16:9<965::aid-sim509>3.0.co;2-o
Harrell, F. E., Jr. (2016). Regression Modeling Strategies (2nd ed.). Springer International Publishing. Section 10.5, Assessment of Model Fit, p. 236.