In the realm of statistical modeling and machine learning, the choice of a loss function plays a pivotal role in shaping the behavior and performance of a model. A loss function quantifies the discrepancy between the predicted values of a model and the actual observed outcomes. From a statistical perspective, loss functions serve as a bridge between the mathematical representation of a model and its alignment with the underlying data distribution.
D.2 Definition of Loss Function
At its core, a loss function measures the dissimilarity between the predicted values \(f(x)\) produced by a statistical model and the true observed values \(y\) present in the dataset. In other words, it quantifies the “loss” or “cost” associated with the model’s predictions. Mathematically, the loss function \(L\) can be defined as:
\[ L(y, f(x)) \]
where:
\(y\) represents the true observed value (ground truth).
\(f(x)\) denotes the model’s prediction based on the input \(x\).
The primary objective in statistical modeling is to minimize this loss function, which effectively means reducing the discrepancy between predictions and actual outcomes.
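For concreteness, here is a minimal sketch in R (the numeric values are assumed purely for illustration) that evaluates a squared-error loss \(L(y, f(x)) = (y - f(x))^2\) for a single prediction:

# A toy squared-error loss; y_hat stands in for the model prediction f(x)
squared_loss <- function(y, y_hat) (y - y_hat)^2
squared_loss(y = 3.0, y_hat = 2.5)  # 0.25, the cost of this prediction

The smaller this value, the closer the prediction is to the ground truth; averaging it over a dataset gives the overall loss that estimation procedures minimize.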
D.3 Importance of Loss Functions
Loss functions serve a dual purpose in statistical modeling:
Objective Function: From an optimization standpoint, the loss function serves as the objective function that the model seeks to minimize. By finding the optimal parameter values that minimize the loss, the model becomes better aligned with the underlying data distribution.
Model Selection and Evaluation: Loss functions enable model selection by providing a quantitative measure of how well a model fits the data. Different loss functions emphasize different aspects of prediction errors (e.g., mean squared error for regression, cross-entropy for classification), allowing practitioners to tailor their models to the specific task at hand; a short sketch of this idea follows this list.
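As a small sketch of the second point (the data and both candidate models below are assumed purely for illustration), a loss function lets us rank competing models on the same data; the candidate with the lower loss fits better:

# Toy data generated from y = 2x + 3 plus noise
set.seed(42)
x <- runif(30, min = 0, max = 10)
y <- 2 * x + 3 + rnorm(30)

mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)
mse(y, 2 * x + 3)    # candidate A, close to the data-generating line: low loss
mse(y, 1.5 * x + 5)  # candidate B, a worse line: noticeably higher loss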
D.4 Role of Loss Functions in Parameter Estimation
When estimating the parameters of a statistical model, loss functions are used to guide the optimization process. Common optimization techniques such as gradient descent seek to find parameter values that minimize the loss. By minimizing the loss, the model effectively “learns” from the data and captures the underlying relationships between variables.
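To make this concrete, the following sketch (data and hyperparameters assumed for illustration) uses gradient descent to estimate a single parameter, the mean of a sample, by repeatedly stepping against the gradient of the squared-error loss:

set.seed(1)
y <- rnorm(50, mean = 5)     # toy sample whose true mean is 5
mu <- 0                      # initial parameter value
lr <- 0.1                    # learning rate
for (step in 1:200) {
  grad <- -2 * mean(y - mu)  # derivative of mean((y - mu)^2) with respect to mu
  mu <- mu - lr * grad       # gradient descent update
}
mu                           # approaches mean(y), the minimizer of the loss

The same update rule, applied to a slope and an intercept simultaneously, drives the worked example in Section D.6.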
D.5 Types of Loss Functions
The choice of a loss function depends on the nature of the problem and the type of data. Different loss functions capture different aspects of prediction errors, such as absolute deviations, squared deviations, or likelihood-based measures. Common loss functions include the following (each is illustrated in the short sketch after this list):
Mean Squared Error (MSE): Emphasizes squared deviations between predictions and actual values, commonly used in regression problems.
Mean Absolute Error (MAE): Measures the absolute differences between predictions and actual values, providing a more robust measure against outliers.
Cross-Entropy: Used in classification tasks to measure the dissimilarity between predicted class probabilities and actual class labels.
Log-Likelihood: Often used in maximum likelihood estimation to quantify the likelihood of observed data given the model.
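The sketch below evaluates each of these losses on toy values (all numbers are assumed purely for illustration; the log-likelihood uses a normal model with unit standard deviation):

# Regression losses
y_true <- c(1.0, 2.0, 3.0)  # observed values
y_pred <- c(1.1, 1.9, 3.4)  # model predictions
mean((y_true - y_pred)^2)   # Mean Squared Error
mean(abs(y_true - y_pred))  # Mean Absolute Error

# Classification loss: binary cross-entropy
labels <- c(1, 0, 1)        # true class labels
probs <- c(0.9, 0.2, 0.7)   # predicted P(class = 1)
-mean(labels * log(probs) + (1 - labels) * log(1 - probs))

# Log-likelihood of the observations under a normal model
sum(dnorm(y_true, mean = y_pred, sd = 1, log = TRUE))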
D.6 Example using R
The following example demonstrates the use of the Mean Squared Error (MSE) loss function and how gradient descent minimizes it.
We will create a simple linear regression model and train it on the MSE loss with gradient descent, fitting a line to a set of data points.¹
# Generate some example data
set.seed(123)
x <- seq(1, 10, by = 0.5)
y <- 2 * x + 3 + rnorm(length(x), mean = 0, sd = 1)

# Define the MSE loss function
mse_loss <- function(y_true, y_pred) {
  mean((y_true - y_pred)^2)
}

# Define the model parameters (slope and intercept)
slope <- 0
intercept <- 0

# Hyperparameters
learning_rate <- 0.01
num_epochs <- 100

# Training loop
for (epoch in 1:num_epochs) {
  # Forward pass: compute predictions
  y_pred <- slope * x + intercept

  # Compute the MSE loss
  loss <- mse_loss(y, y_pred)

  # Compute gradients with respect to parameters
  d_slope <- -2 * mean(x * (y - y_pred))
  d_intercept <- -2 * mean(y - y_pred)

  # Update parameters using gradient descent
  slope <- slope - learning_rate * d_slope
  intercept <- intercept - learning_rate * d_intercept

  # Print progress
  cat(sprintf("Epoch %d - Loss: %.4f\n", epoch, loss))
}
# Final trained model parameters
cat("Trained Slope:", slope, "\n")
Trained Slope: 2.272553
cat("Trained Intercept:", intercept, "\n")
Trained Intercept: 1.291805
# Plot the original data and the fitted line
plot(x, y, main = "Linear Regression with MSE Loss", xlab = "X", ylab = "Y")
abline(a = intercept, b = slope, col = "red")
df <- data.frame(x, y)
library(plotly)
library(dplyr)
fig <- plot_ly(data = df, x = ~x, y = ~y, type = 'scatter', alpha = 0.65,
               mode = 'markers', name = 'data')
fig <- fig %>%
  add_trace(data = df, x = ~x, y = ~x * slope + intercept,
            name = 'Lin. Regr. with MSE Loss', color = "red",
            mode = 'lines', alpha = 1)
fig
In this example, we start with synthetic data points \((x, y)\) that follow a linear relationship with added noise. We define the MSE loss function and the model parameters (slope and intercept). The training loop runs for a fixed number of epochs; in each epoch we perform a forward pass to compute predictions, calculate the MSE loss, compute the gradients, and update the parameters with a gradient descent step. The slope and intercept are adjusted iteratively to minimize the MSE loss.
After training, the slope (about 2.27) is close to the true value of 2 used to generate the data, while the intercept (about 1.29) is still converging toward the true value of 3; more epochs or a larger learning rate would bring it closer. The fitted line nevertheless aligns well with the data points, indicating that gradient descent has substantially reduced the MSE loss.
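As the footnote notes, R’s built-in lm() function solves this least-squares problem in closed form, so a quick cross-check is to compare its coefficients with the values found by gradient descent:

# Cross-check against the closed-form least-squares solution
fit <- lm(y ~ x)
coef(fit)  # intercept and slope; gradient descent should approach these values

With more epochs, the gradient descent estimates converge to these coefficients.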
D.7 Conclusion
Loss functions are the cornerstone of statistical modeling and machine learning, shaping the behavior of models and guiding their learning process. Their role extends beyond optimization, as they offer insights into the alignment between models and data distributions. A well-chosen loss function not only aids in achieving accurate predictions but also provides a deeper understanding of the underlying statistical relationships.
References and footnotes
1. We could use the lm() function in R to perform this analysis, but we will not delve into the details of how it is done here.