Whether your goal is to identify substitute characteristics or solve a process problem, regression algorithms can produce coefficients for almost any data. However, that doesn't mean the resulting models are any good.
In machine learning, you divide your data into a training set on which you calculate coefficients and a testing set to check the model’s predictive ability. Testing concerns externally visible results and is not specific to regression.
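The split itself is not regression-specific and takes only a few lines in any environment. The examples in this post use R, but as a language-neutral illustration, here is a minimal Python sketch (the helper name `train_test_split` is ours, not a library function):

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into a training and a testing set."""
    rng = random.Random(seed)
    shuffled = rows[:]            # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

data = list(range(100))
train, test = train_test_split(data)   # 80 rows for training, 20 for testing
```

You fit the model on `train` only and judge its predictive ability on `test` only.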
Validation, on the other hand, is focused on the training set and involves using various regression-specific tools to detect inconsistencies with assumptions. For these purposes, we review methods provided by regression software.
In this post, we explore the meaning and the logic behind the tools provided for this purpose in linear simple and multiple regression in R, with the understanding that similar functions are available from other software and that similar tools exist for other forms of regression.
It is an attempt to clarify the meaning of these numbers and plots and help readers use them. They will be the judges of how successful it is.
The body of the post is about the application of these tools to an example dataset available from Kaggle, with about 30,000 data points. For the curious, some mathematical background is given in the appendix.
Many of the tools are developments from the last 40 years and, therefore, are not covered in the statistics literature from earlier decades.
In simple, linear regression, you have just one numeric explanatory variable and one numeric response. With a data set in the hundreds of points, you can plot the point cloud and the regression line going through it:
If you have so many points that they merge into a blob in the plot, you can use a heat map instead.
In either case, you can often tell visually whether the model is useful or not. You only need to look further into metrics of the fit if it isn’t visually obvious.
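Behind the plotted line, the fit itself reduces to two closed-form expressions. A stdlib-only Python sketch, illustrative rather than what R's `lm` actually runs:

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares with one explanatory variable:
    slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope      # (intercept, slope)

x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 5.0, 7.1, 8.9]               # roughly y = 1 + 2x, with noise
intercept, slope = fit_line(x, y)
```

The point cloud plot then shows at a glance whether this line summarizes the data or misses its structure.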
Visual validation is not an option in multiple regression, where you have several explanatory variables, or in multivariate regression, where you have several responses. In these cases, you need to consider numbers, with the understanding that some of these overall numbers must be intelligible to all stakeholders, while the more detailed ones are only for data scientists.
Relying on visualization also entails the risk of confirmation bias: You see patterns in the plot that you expect to see but that another observer might not. In a discussion we had a few years ago, Mark Graban and I saw different patterns in TV viewership of Academy Awards ceremonies.
We’ll use an example of multiple linear regression on data about quality extracted from Kaggle. All we know from Kaggle is the following:
“The roasting machine is an aggregate consisting of 5 chambers of equal size, each chamber has 3 temperature sensors. In addition, for this task, you have collected data on the height of the raw material layer and its moisture content. Layer height and humidity are measured when raw materials enter the machine. Raw materials pass through the kiln in an hour. Product quality is measured in the laboratory by samples that are taken every hour at the exit of the roasting machine.”
Conjectures about the Kaggle data set
Kaggle provides the data as just numbers, without even units of measure. Since the documentation refers to the equipment as a kiln, the products could be ceramics if the temperatures are in °C. The quality score could then combine dimensional shrinkage during firing with surface-smoothness class or other appearance ratings.
Use of the Kaggle data
We use it here to illustrate the summaries and plots used to validate regression models. In a real case, we would want to know the complete backstory of the data, including what process physics and chemistry tell us about the relationship between the control parameters and the quality of the product.
Hourly Summaries of Explanatory Variables
For over three years, they measured temperatures every minute through three sensors in each of the five chambers the product traverses in a tunnel kiln. The height and humidity of the raw materials entering the oven were also measured at the start. The quality of the outgoing material was scored on samples every hour. There were about 2 million points for each measured variable and about 30,000 for quality. The regression model is linear on hourly averages of temperatures by chamber, height, and humidity.
Multiple Regression Model
The quality score $Q$ is modeled as the sum of a constant $\alpha$, a noise term $\mathcal{E}$, and a linear combination of seven terms:
- The hourly averages of temperature $T_1, \dots, T_5$
- The height $H$ of the materials flowing through the kiln
- The raw-material humidity $W$
The formula is as follows:
$$Q = \alpha + \beta_1 T_1 + \dots + \beta_5 T_5 + \beta_6 H + \beta_7 W + \mathcal{E}$$
It assumes a relationship between random variables, which would hold for any alternative dataset collected on the furnace as well as for Kaggle's. The coefficients $\alpha, \beta_1, \dots, \beta_7$ and the distribution of the error $\mathcal{E}$ are all unknown.
From the dataset $q_i, t_{1i}, \dots, t_{5i}, h_i, w_i$, with $i = 1, \dots, N$, we estimate the coefficients as $\hat{\alpha}, \hat{\beta}_1, \dots, \hat{\beta}_7$ and the characteristics of the error $\mathcal{E}$ from the residuals $\epsilon_i = q_i - \hat{\alpha} - \hat{\beta}_1 t_{1i} - \dots - \hat{\beta}_5 t_{5i} - \hat{\beta}_6 h_i - \hat{\beta}_7 w_i$.
To use the model, we want the mean of the error $\mathcal{E}$ to be 0 and its standard deviation to be sufficiently small; what "sufficiently" means depends on the context and on the nature of the decisions about the process that the model is to support.
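To make the estimation step concrete, here is a sketch with NumPy on synthetic stand-ins for the kiln variables. Everything below — the coefficient values, the distributions, the noise level — is made up for illustration; the post's actual analysis uses R's `lm` on the Kaggle data:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
T = rng.normal(300.0, 10.0, size=(N, 5))    # stand-ins for T_1..T_5 (hourly averages)
H = rng.normal(50.0, 5.0, size=(N, 1))      # stand-in for layer height
W = rng.normal(10.0, 2.0, size=(N, 1))      # stand-in for humidity
X = np.hstack([np.ones((N, 1)), T, H, W])   # column of ones estimates alpha

# Invented "true" coefficients [alpha, beta_1..beta_7] plus Gaussian noise
true_beta = np.array([5.0, 0.1, 0.2, 0.3, 0.4, 0.5, -0.6, -0.7])
q = X @ true_beta + rng.normal(0.0, 1.0, size=N)

beta_hat, *_ = np.linalg.lstsq(X, q, rcond=None)   # least-squares estimates
residuals = q - X @ beta_hat                        # the epsilon_i of the formula
```

With an intercept in the model, the residuals average to zero by construction, so their standard deviation, not their mean, is what tells you whether the fit is good enough.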
When we turn to numbers, we want to assess the model’s ability to approximate actual values on data outside the training set — that is, the data set used to estimate the model coefficients.
PEAT Calculations
In the testing set $\left( x_{1i}, \dots, x_{ki}, y_i \right)$, $i = 1, \dots, M$, we have values for the explanatory variables and actual responses. Whatever model $\hat{Y} = f\left( X_1, \dots, X_k \right)$ we fit to the training set, we can apply it to the explanatory variables in the testing set and measure how close the fitted values $\hat{y}_1, \dots, \hat{y}_M$ come to the actual responses $y_1, \dots, y_M$:
$$r_i = \frac{y_i - \hat{y}_i}{y_i}, \text{ for } i = 1, \dots, M$$
Specifying ranges like $\pm 1\%, \pm 2\%, \pm 5\%, \pm 10\%, \pm 20\%$, we can then measure the proportion of the $r_i$ within each range.
PEAT Example
| Range of $r_i$ | Fitted values within range |
|---|---|
| ±1% | 13% |
| ±2% | 29% |
| ±5% | 68% |
| ±10% | 90% |
| ±20% | 98% |
We call these Proportional Estimation Accuracy Tables (PEAT). They characterize the distribution of residuals in a way that is understandable to end users and applicable regardless of the model used.
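The computation behind a PEAT is a few lines in any language. A stdlib-only Python sketch (the function name `peat` is ours):

```python
def peat(actual, fitted, ranges=(0.01, 0.02, 0.05, 0.10, 0.20)):
    """Proportional Estimation Accuracy Table: for each tolerance r,
    the share of fitted values within +/- r of the actual response."""
    rel_errors = [abs(y - yhat) / abs(y) for y, yhat in zip(actual, fitted)]
    n = len(rel_errors)
    return {r: sum(e <= r for e in rel_errors) / n for r in ranges}

# Toy data: fitted values off by 0.5%, 1.5%, 4%, and 15%
actual = [100, 100, 100, 100]
fitted = [100.5, 98.5, 104, 115]
table = peat(actual, fitted)
# table[0.01] == 0.25, table[0.05] == 0.75, table[0.20] == 1.0
```

Because it only compares fitted and actual values, the same table works for any model, linear or not.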
If the precision shown in the PEAT is sufficient, for example, to establish tolerances on substitute characteristics instead of the elusive, true characteristic, you don’t need to look further. The end user can enjoy the sausage without knowing how you made it. It’s different when the performance is inadequate or marginal, and the analyst needs to dig deeper.
Recently, statistician Alexander Krannich shared his misgivings about showing the performance summary of regression models: “This is good enough for statisticians, but not optimal for presentations to non-statistician stakeholders.”
He was putting it mildly. This is an example from the lm function in R, on the same dataset:
Only a statistician or data scientist could love this summary; it is unintelligible to anyone else. The regression summary from Excel is just as abstruse for non-statisticians. Minitab, SAS, or Python’s scikit-learn package all produce similar ones, none suitable for external communication.
Background of this summary
Let’s explore this example in more detail, and examine the meaning of these numbers. The goal here is to use linear regression to predict a roasting Quality Score of unknown composition from observed characteristics. The hourly measurements of this score taken over more than three years show no improvement:
This plot also suggests a skewed distribution, which Kernel Density Estimation (KDE) confirms:
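KDE itself is simple to state: center a kernel, here a Gaussian, on each observation and average them. A stdlib-only Python sketch, with an arbitrary bandwidth chosen only for illustration:

```python
from math import exp, pi, sqrt

def kde(sample, x, bandwidth):
    """Gaussian kernel density estimate at point x."""
    n = len(sample)
    return sum(
        exp(-0.5 * ((x - s) / bandwidth) ** 2) / (bandwidth * sqrt(2 * pi))
        for s in sample
    ) / n

sample = [1.0, 2.0, 2.5, 3.0, 8.0]          # toy data, skewed toward low values
density_low = kde(sample, 2.0, bandwidth=1.0)
density_high = kde(sample, 8.0, bandwidth=1.0)
# density is higher where observations cluster: density_low > density_high
```

Unlike a histogram, the estimate is smooth and does not depend on where you place bin boundaries, which is why it makes skewness easier to see.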
We don’t know what the Quality Score is made of, but we have records mapping the control parameters of the kiln to it, and we can use multiple regression to find settings that should improve it.
If, based on the PEAT, you and the end users are happy with the model’s performance, the story ends here. Otherwise, you apply validation tools that focus on the training set. Regression software usually produces two sets of tools:
- A set of model metrics based on orthodox statistics, whose validity hinges on residuals being Gaussian.
- A set of more recent graphical tools that help you ascertain the distribution of the residuals and identify possible anomalies in the data.
Most of the literature tries to explain these tools without using formulas, when in fact math is most easily explained with the notation developed for this purpose. Explanations in text are simply more verbose and less precise.
Annotated Summary
Let’s consider first the summary as a whole, and then dig into its parts. One way to make sense of the summary is to annotate it:
The top part, about the formula, is self-explanatory. Let’s dig into the less obvious sections one by one.
The Residuals
The first part of the summary provides rank statistics on the residuals:
We can check their symmetry and range around the median, but we might just as well directly plot the density of the residuals:
It appears more peaked than a Gaussian, which we confirm with a Q-Q plot:
This says that the distribution of the residuals has thinner tails than if it were Gaussian while behaving like a Gaussian near the mean. The thin tails say that the fitted values are closer to the response than they would be if the residuals were Gaussian.
Statisticians call the fat-tailed distributions leptokurtic and the thin-tailed ones platykurtic. Both sound like diseases, but, to the extent that it is accurate, platykurtic residuals give you better estimates than Gaussians.
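A numeric companion to the Q-Q plot is the sample excess kurtosis: 0 for a Gaussian, positive for leptokurtic distributions, negative for platykurtic ones. A stdlib-only Python sketch on two toy samples:

```python
import random
from statistics import mean

def excess_kurtosis(xs):
    """m4 / m2^2 - 3: zero for a Gaussian, negative for thin tails
    (platykurtic), positive for fat tails (leptokurtic)."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

rng = random.Random(0)
# Uniform: platykurtic, theoretical excess kurtosis -1.2
uniform = [rng.uniform(-1, 1) for _ in range(100_000)]
# Laplace (double exponential): leptokurtic, theoretical excess kurtosis +3
laplace = [rng.expovariate(1) * rng.choice([-1, 1]) for _ in range(100_000)]
```

Applied to the residuals, a clearly negative value backs up what the thin-tailed Q-Q plot suggests.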
There are several actions you can take when the Q-Q plot shows that your residuals are not Gaussian:
- Ignore it. This has been, historically, the most common response, with statisticians using Gaussian models because they were mathematically tractable with the information technology of their day. It is Don Wheeler’s advice in the different context of Control Charts.
- Trim the data. There is a fine line between cleaning and doctoring data.
- Fix the model. Change the list of explanatory variables and try again.
The Coefficients
Linear regression software usually gives you a summary table for the coefficients, of the following form:
Or, in a more readable form:
| Coefficient | Estimate | Standard Error | t-value | p-value | Stars |
|---|---|---|---|---|---|