K-fold cross validation

K-fold is a Model evaluation option under the Machine Learning Predictions Algorithms & Settings section. K-fold cross-validation assesses the performance and robustness of a machine learning model by partitioning the data into multiple subsets, iteratively training and evaluating the model, and averaging the results to obtain a more reliable estimate of the model's effectiveness.

Common choices for k in k-fold cross-validation are 5 or 10, but other values can be used depending on the size and characteristics of the dataset. The default k value for Analytics Explorer is 5.
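To see what this partitioning looks like in code, here is a minimal sketch of a k = 5 split using scikit-learn's KFold. It illustrates the concept only; it is not Analytics Explorer's internal implementation:

    # Minimal sketch of a k = 5 split; illustrative only, not the
    # product's internal implementation.
    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(20).reshape(10, 2)  # 10 sample rows, 2 features

    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for i, (train_idx, val_idx) in enumerate(kf.split(X), start=1):
        # Each fold serves as the validation set exactly once;
        # the other k-1 folds form the training set.
        print(f"Fold {i}: train={train_idx}, validation={val_idx}")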

The process

The process of k-fold cross-validation involves the following steps:

  1. Data splitting: The dataset is divided into k subsets (or folds) of roughly equal size. These subsets are disjoint, meaning that no data points are shared between them.

  2. Model training and evaluation: The training and evaluation process is repeated k times; in each iteration, the model is trained on k-1 folds and evaluated on the remaining fold, so that each fold serves as the validation set exactly once.

  3. Performance metrics: Standard performance metrics are computed for each iteration. These metrics provide an indication of how well the model generalizes to unseen data, and they are listed under Median Metric for all Folds. You want to minimize the error metrics, including Root mean squared error, and maximize R squared. (A worked sketch of this loop follows the list below.)

    • Mean absolute error

    • Median absolute error

    • Mean absolute percentage error

    • Median absolute percentage error

    • R squared

    • Root mean squared error
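To make the loop concrete, the following sketch trains on k-1 folds, evaluates on the held-out fold, and reports the median of each of the metrics above across folds, mirroring the Median Metric for all Folds output. The model and dataset are generic scikit-learn placeholders, not the product's own code:

    # Sketch of the k-fold loop with the metrics listed above.
    # The model and data are placeholders, not Analytics Explorer internals.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import (mean_absolute_error, median_absolute_error,
                                 mean_absolute_percentage_error, r2_score,
                                 mean_squared_error)
    from sklearn.model_selection import KFold

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    per_fold = []
    for train_idx, val_idx in kf.split(X):
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        actual = y[val_idx]
        # Percentage errors assume nonzero actual values.
        abs_pct_err = np.abs((actual - pred) / actual)
        per_fold.append({
            "Mean absolute error": mean_absolute_error(actual, pred),
            "Median absolute error": median_absolute_error(actual, pred),
            "Mean absolute percentage error": mean_absolute_percentage_error(actual, pred),
            "Median absolute percentage error": np.median(abs_pct_err),
            "R squared": r2_score(actual, pred),
            "Root mean squared error": np.sqrt(mean_squared_error(actual, pred)),
        })

    # Median of each statistic across the 5 folds ("Median Metric for all Folds")
    for name in per_fold[0]:
        print(name, np.median([fold[name] for fold in per_fold]))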

Advantages

Some advantages of k-fold cross-validation are the following:

  1. Reduced bias: Using multiple folds for training and evaluation helps to reduce bias in performance estimation compared to a single train-test split.

  2. Efficient data utilization: k-fold cross-validation allows better utilization of available data, as all data points get to be part of both training and validation sets.

  3. Model comparison: k-fold cross-validation enables a fair comparison between different models since each model is evaluated on the same data subsets.
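For instance, two candidate models can be scored on identical splits by reusing a single KFold object. This is an illustrative scikit-learn sketch with arbitrary example models, not a description of how Analytics Explorer performs comparisons:

    # Illustrative sketch: comparing two models on identical folds.
    # The models chosen here are arbitrary examples.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)  # same splits for both models

    for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
        scores = cross_val_score(model, X, y, cv=kf, scoring="r2")
        print(type(model).__name__, "mean R squared:", scores.mean())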

K-fold visualizations

After you run k-fold model evaluation, four visualizations are produced:

  • K-Fold Result Table—standard statistics per fold

  • Median Metric for all Folds—median value for each statistic (listed above)

  • Predicted vs Actual—scatter plot of predicted value (Y axis) vs actual value (X axis)

  • Percentage Error per Fold—a line graph of the percentage error for each fold, with the max, median, and min values plotted as horizontal lines. (Rough sketches of this chart and the Predicted vs Actual scatter plot follow below.)

Look out for unexplained spikes in these results. Referring to the Percentage Error per Fold graph in the lower right, although there appears to be a big spike in the 5th fold, keep in mind the scale on the Y axis (15.5000 to 16.3000): the actual jump is less than 1%, comfortably within the acceptable range.
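Analytics Explorer renders these charts for you, but as a rough sketch of what the last two show, the following matplotlib snippet recreates them from placeholder data (the numbers are invented for illustration; only the chart layout reflects the product's output):

    # Rough sketch of the Predicted vs Actual and Percentage Error per Fold
    # charts, built from invented placeholder data.
    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    actual = rng.uniform(10, 20, size=100)
    predicted = actual + rng.normal(0, 1, size=100)  # placeholder predictions
    pct_error = rng.uniform(15.5, 16.3, size=5)      # one value per fold

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Predicted vs Actual: predicted value on the Y axis, actual on the X axis
    ax1.scatter(actual, predicted)
    ax1.set(title="Predicted vs Actual", xlabel="Actual", ylabel="Predicted")

    # Percentage Error per Fold: one point per fold, plus max/median/min lines
    ax2.plot(range(1, 6), pct_error, marker="o")
    for stat, value in (("max", pct_error.max()),
                        ("median", np.median(pct_error)),
                        ("min", pct_error.min())):
        ax2.axhline(value, linestyle="--", label=stat)
    ax2.set(title="Percentage Error per Fold", xlabel="Fold", ylabel="% error")
    ax2.legend()
    plt.show()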