Prediction Algorithms & Settings

The prediction algorithms are listed in order from highest to lowest correlation and accuracy. These algorithms are off-the-shelf, meaning that they are widely available and do not require adjustment. Explaining the details of each is beyond the scope of this document; general descriptions are provided below. Each prediction algorithm has its own set of advanced algorithm settings. The defaults generally yield the best results.

Gradient boosting tree

Gradient boosting tree is an ensemble of weak prediction models, typically decision trees. Like other boosting methods, it builds the model in a stage-wise fashion, and it generalizes them by allowing optimization of an arbitrary differentiable loss function. The idea is that each learning iteration tries to correct the mistakes made in the previous one.
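The stage-wise idea can be sketched from scratch. The sketch below is illustrative only (squared-error loss, decision stumps on 1-D data, made-up values); real implementations use full trees and support other loss functions. Each stump fits the residuals, the "mistakes" left by the ensemble so far:

```python
def fit_stump(xs, residuals):
    """Find the 1-D threshold split that best reduces squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        err = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    _, t, lmean, rmean = best
    return lambda x: lmean if x <= t else rmean

def boost(xs, ys, n_trees=20, learning_rate=0.5):
    base = sum(ys) / len(ys)            # stage 0: predict the mean
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # current mistakes
        stump = fit_stump(xs, residuals)                # learn from them
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 4.8, 5.1, 5.0]
model = boost(xs, ys)
print(model(2), model(5))   # close to the low and high clusters
```

Note how the learning rate scales each stump's contribution, which is exactly the "Learning rate" setting described later in this section.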

The Tweedie loss function allows additional weight to be placed on zero, which is useful for targets that contain many exact zeros.
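To illustrate why Tweedie loss handles zeros well, the standard Tweedie unit deviance for a power parameter between 1 and 2 can be computed directly; the deviance remains finite and positive at y = 0, so zero observations still contribute a usable penalty. The power value 1.5 below is an illustrative choice, not a documented default:

```python
def tweedie_deviance(y, mu, p=1.5):
    """Tweedie unit deviance for 1 < p < 2 (requires mu > 0, y >= 0)."""
    return 2 * (y ** (2 - p) / ((1 - p) * (2 - p))   # vanishes at y = 0
                - y * mu ** (1 - p) / (1 - p)
                + mu ** (2 - p) / (2 - p))

print(tweedie_deviance(0.0, 2.0))  # finite penalty for predicting 2 when truth is 0
print(tweedie_deviance(2.0, 2.0))  # perfect prediction -> deviance 0
```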

Random forest

Like gradient boosting tree, this is an ensemble learning method for classification and regression. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random decision forests correct for decision trees' habit of overfitting to their training set, and they have been shown to perform very well in many machine learning applications.
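The aggregation step described above is simple to show on its own. The per-tree outputs below are hypothetical, and tree construction itself (bootstrap samples, random feature subsets) is omitted:

```python
from statistics import mode, mean

# Hypothetical outputs from five trees in the forest:
tree_votes = ["spam", "spam", "ham", "spam", "ham"]  # classification
tree_values = [3.1, 2.9, 3.4, 3.0, 3.1]              # regression

print(mode(tree_votes))    # majority class -> "spam"
print(mean(tree_values))   # averaged prediction -> 3.1
```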

Linear regression

Two linear regression algorithms are available:

  • Stochastic dual gradient descent
  • Online gradient descent

Assuming the relationship between the predicted curve and the input curves is linear, linear regression algorithms fit the model so that the cost function is minimized. Normally, least squares error is used as the cost function.
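As a sketch of the second variant, online gradient descent minimizes the least-squares cost one observation at a time for a single-input model y ≈ w·x + b. The data and hyperparameters here are illustrative:

```python
def online_gd(data, lr=0.05, epochs=500):
    """Online gradient descent on squared error for y = w*x + b."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in data:              # one update per observation
            err = (w * x + b) - y      # signed prediction error
            w -= lr * err * x          # gradient of 0.5*err**2 w.r.t. w
            b -= lr * err              # gradient w.r.t. b
    return w, b

# Points on the line y = 2x + 1, so the fit should recover w≈2, b≈1.
data = [(0, 1), (1, 3), (2, 5), (3, 7)]
w, b = online_gd(data)
print(round(w, 2), round(b, 2))   # prints 2.0 1.0
```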

 

The algorithm settings vary depending on the prediction algorithm. In general, the defaults yield the best results. However, if you have knowledge of machine learning, you can adjust these settings.

When you fit a model to training data, you have to strike a balance between underfitting and overfitting.

If we overfit, the model performs extraordinarily well on the training data but doesn't generalize well when we use it on new data. If we underfit, it doesn't give accurate or useful predictions for any data set.

The following list includes all advanced algorithm settings:

Max leaves

The boosting trees that are generated are binary trees, so this parameter limits the maximum depth of the tree to log2(maxLeaves). Higher values increase the fit, but values that are too high can lead to overfitting because they allow the model to learn relationships specific to each sample. Values that are too low keep the model very general and lead to underfitting.
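The depth bound follows from binary-tree geometry: a perfect binary tree of depth d has 2**d leaves, so capping the leaf count caps the depth at log2(maxLeaves). The leaf counts below are illustrative:

```python
import math

# Capping leaves in a binary tree implies a depth cap of log2(max_leaves).
for max_leaves in (8, 32, 128):
    print(max_leaves, "leaves -> max depth", int(math.log2(max_leaves)))
```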

Total trees

Higher values potentially increase the coverage of the model, reducing its variance and allowing it to generalize better to new data, but they can also increase training time. Lower values are quicker but can lead to a model that does not generalize well and has high variance (overfitting).

Min document

The minimum number of documents allowed in a leaf of a regression tree, out of the sub-sampled data. This setting controls the complexity of each tree, but because shorter trees are typically used, it rarely has a large impact on performance. Typical values range from 5 to 15. Higher values help prevent the model from learning relationships that are highly specific to the particular sample selected for a tree (overfitting), while smaller values can help with imbalanced target classes in classification problems.

Learning rate

Determines the contribution of each tree to the final outcome and controls how quickly the algorithm proceeds down the gradient descent (learns). Smaller values make the model robust to the specific characteristics of each individual tree, allowing it to generalize well. Smaller values also make it easier to stop before overfitting; however, they increase the risk of not reaching the optimum with a fixed number of trees and are more computationally demanding. Generally, the smaller this value, the more accurate the model can be, but more trees are required in the sequence.

Percentage of data to train model

This setting is not algorithm-specific and defaults to 80%. The algorithm uses 80% of the training data to create the model. The predicted values from the model are then compared to the actual values in the remaining 20% to evaluate the accuracy of the model.

Fill missing training data

Fill any missing values in the training data with the mean of the data in that column. If not checked, missing training data will not be included in the calculation.
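The two settings above can be sketched in a few lines. The 80% default comes from the text; the data, function names, and the use of a simple positional split (rather than a random one) are illustrative assumptions:

```python
def train_test_split(rows, train_fraction=0.8):
    """Split rows into a fitting set and a held-out evaluation set."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]          # 80% to fit, 20% to evaluate

def fill_missing_with_mean(column):
    """Replace missing (None) values with the mean of the present ones."""
    present = [v for v in column if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in column]

rows = list(range(10))
train, holdout = train_test_split(rows)
print(len(train), len(holdout))                    # prints 8 2

print(fill_missing_with_mean([1.0, None, 3.0]))    # None becomes 2.0
```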

 

Fitting a model

Fitting a model means making your algorithm learn the relationship between the predictors and the outcome so that you can predict future values of the outcome. The best-fitted model has a specific set of parameters that best defines the problem at hand.

Overfitting

Occurs when a statistical model or machine learning algorithm captures the noise of the data. In other words, overfitting occurs when the model or the algorithm fits the data too well.

Underfitting

Occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. In other words, underfitting occurs when the model or the algorithm does not fit the data well enough.