A learning algorithm is a procedure that controls the model optimization process. In the modern theory of predictive modeling it is well known that a model should provide a trade-off between simplicity and accuracy. GMDH Shell meets this goal using learning algorithms of the Group Method of Data Handling (GMDH), a knowledge discovery technique originated by Prof. A.G. Ivakhnenko in 1968.
The idea of all GMDH-type algorithms is to apply a generator of gradually complicating models and to select the set of models that shows the highest forecasting accuracy on previously unseen data. These data are usually called the validation or testing part, and the top-ranked model is claimed to be optimally complex.
Mathematically, GMDH is quite simple and at the same time well coordinated with state-of-the-art data mining techniques. A central role in optimal-complexity detection is given to validation; this is a common approach for the majority of learning algorithms. However, GMDH Shell focuses on a two-level validation approach and, of course, provides the necessary tools. In effect, GMDH Shell automates the way most data miners work. The second validation level, which we call the “hold-out sample”, is a common practice, although the term “hold-out” may refer to a testing sample in other tools or literature. The hold-out sample takes no part in the estimation of model coefficients or the selection of model structure. It is used to measure the performance of post-processed predictions and alerts you to over-fitting or simply to low-quality models. In fact, the hold-out sample validates your modeling attempt in general, i.e. it tests the validity of transformations, the significance of chosen inputs and the like.
There are two closely related learning algorithms available in GMDH Shell: Combinatorial GMDH and GMDH-type neural networks.
Since these algorithms are logically separated from data preparation procedures, we call them Core algorithms. Core algorithms perform generation and selection of model structures; model coefficients are then fitted using the least squares method (in both algorithms).
The Core algorithm generates models from simple to complex ones as long as testing accuracy improves. Built-in validation strategies, i.e. cross-validation or simple partitioning, deal with the data split into training and testing parts. Cross-validation, for example, corresponds to numerous splits considered one by one and eventually averaged. Training data is used for fitting model coefficients, while testing data is used for calculating a validation measure. Gradual complication increases the number of model parameters (analogous to neural network weights) as long as the testing error decreases.
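The complication-until-error-grows loop can be sketched in a few lines. This is our own illustrative code, not GMDH Shell's implementation; the data, the polynomial model family, and the split are made up for demonstration:

```python
import numpy as np

# Illustrative sketch: grow model complexity step by step and stop once the
# testing error stops falling. Polynomial degree plays the role of complexity.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.05, x.size)

idx = rng.permutation(x.size)               # simple partitioning of the rows
train, test = idx[:30], idx[30:]

best_deg, best_err = 0, np.inf
for degree in range(1, 8):                  # gradually complicating models
    coeffs = np.polyfit(x[train], y[train], degree)   # least-squares fit
    err = np.mean((np.polyval(coeffs, x[test]) - y[test]) ** 2)
    if err >= best_err:
        break                               # testing accuracy stopped improving
    best_deg, best_err = degree, err
print(best_deg, best_err)
```

With cross-validation the same loop would simply average the testing error over several splits before comparing complexities.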
If configured to do so, GMDH Shell applies the ready-to-use model to the second-level validation data and displays performance results that you should use to evaluate the success of your modeling simulation.
A Combinatorial GMDH model is a polynomial function that is linear in the parameters; the model is a subset of terms of a polynomial function generated from a given set of variables. For example, if we model a data set with two input variables x1 and x2 and an output (target) variable y, then the commonly used quadratic polynomial function over which optimization is performed is:
y = a0 + a1·x1 + a2·x2 + a3·x1·x2 + a4·x1² + a5·x2²
The degree of the polynomial function is defined by the user and can also be one, i.e. a linear function. Combinatorial GMDH selects an optimally-complex model, for example,
y = a0 + a3·x1·x2, as the subset of terms of the complete polynomial with the smallest model error on testing data.
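The subset selection above can be demonstrated with a minimal sketch, assuming a synthetic data set whose true structure is y = a0 + a3·x1·x2 (our own code, not GMDH Shell's implementation): every subset of the quadratic terms is fitted by least squares on the training rows, and the subset with the smallest testing error wins.

```python
import numpy as np
from itertools import combinations

# Candidate terms of the full quadratic polynomial in x1, x2.
rng = np.random.default_rng(1)
x1 = rng.uniform(-1, 1, 60)
x2 = rng.uniform(-1, 1, 60)
y = 0.5 + 2.0 * x1 * x2 + rng.normal(0.0, 0.01, 60)  # true model: a0 + a3*x1*x2

terms = {"x1": x1, "x2": x2, "x1*x2": x1 * x2, "x1^2": x1**2, "x2^2": x2**2}
train, test = slice(0, 40), slice(40, 60)

best_subset, best_err = None, np.inf
names = list(terms)
for k in range(len(names) + 1):
    for subset in combinations(names, k):
        # Design matrix: constant term a0 plus the chosen terms.
        X = np.column_stack([np.ones(60)] + [terms[n] for n in subset])
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        err = np.mean((X[test] @ coef - y[test]) ** 2)
        if err < best_err:
            best_subset, best_err = subset, err
print(best_subset)  # the selected subset should contain the x1*x2 term
```

Because the model stays linear in the parameters, each candidate subset is fitted with ordinary least squares; only the structure search is combinatorial.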
The data preprocessing stage allows applying different operators to variables x1 and x2, for example, the exponent, the sigmoid function, time-series lags, and so on. The final model will still be linear in the parameters.
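To illustrate why nonlinear preprocessing keeps the model linear in the parameters, here is a small sketch (our own code with made-up coefficients, not GMDH Shell's): the transformed columns exp(x1) and sigmoid(x2) enter the design matrix as ordinary inputs, so least squares still recovers the coefficients directly.

```python
import numpy as np

# The model y = b0 + b1*exp(x1) + b2*sigmoid(x2) is nonlinear in x1, x2
# but linear in b0, b1, b2, so ordinary least squares applies unchanged.
rng = np.random.default_rng(2)
x1 = rng.uniform(0, 1, 50)
x2 = rng.uniform(-2, 2, 50)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
y = 0.3 + 1.5 * np.exp(x1) - 2.0 * sigmoid(x2) + rng.normal(0.0, 0.01, 50)

X = np.column_stack([np.ones(50), np.exp(x1), sigmoid(x2)])  # transformed inputs
b, *_ = np.linalg.lstsq(X, y, rcond=None)   # recovers approx. [0.3, 1.5, -2.0]
print(b)
```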
A full combinatorial search of model components frequently takes too much time, so we can limit it to a search of models with no more than
n terms. Models with only 2 terms, for example, allow searching among 100'000 variables and probably larger sets. At the same time, the full search is not recommended for model spaces with more than 25 polynomial or linear terms.
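The search-space sizes behind these recommendations are easy to check. A full combinatorial search over m candidate terms examines 2^m − 1 non-empty subsets, while limiting models to at most n terms caps the count at the sum of binomial coefficients C(m, k) for k = 1..n; the helper below is our own illustration:

```python
from math import comb

def n_models(m, n):
    """Number of candidate models over m terms with at most n terms each."""
    return sum(comb(m, k) for k in range(1, n + 1))

print(n_models(25, 25))  # full search over 25 terms: 2**25 - 1 = 33'554'431
print(n_models(500, 3))  # up to 3 terms over 500 variables: ~2.1e7, comparable
```

This shows why a full search past roughly 25 terms becomes impractical, while a small complexity limit keeps even hundreds of variables tractable.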
Combinatorial GMDH in general is a time-consuming algorithm. As a stand-alone algorithm it is practically useful for variable selection, or with a complexity limit of 3-7 model parameters, which makes it possible to process 500-1000 variables. Preliminary variable selection, available in the solver module, helps reduce the number of variables to acceptable bounds. We especially recommend using linear Combinatorial models for time-series forecasting of multivariate and univariate series transformed into sets of lags.
GMDH-type neural networks, also known as polynomial neural networks, employ the Combinatorial algorithm to optimize neuron connections. The algorithm iteratively creates layers of neurons with two or more inputs, saving only a limited set of optimally-complex neurons from each layer; we call this limit the initial layer width. Every new layer is created using two or more neurons taken from any of the previous layers. Every neuron in the network applies a transfer function (usually with two variables), and an exhaustive Combinatorial search chooses the transfer function that predicts testing data most accurately. The transfer function usually has a quadratic or linear form but can also be customized in the solver module.
GMDH-type networks generate many layers, but layer connections are so sparse that there may be only a few connections per layer.
As mentioned above, the algorithm keeps only a user-defined number of neurons from every layer. Since every new layer can connect to any previous layer, the layer width would otherwise grow constantly. Taking into account how rarely the upper layers improve the population of models, we halve the size of each additional layer and generate only half as many neurons as in the previous layer, i.e. the number of neurons at layer k is Nk = 0.5·Nk−1. This heuristic makes the algorithm faster while the risk of reducing model quality stays low.
Generation of new layers usually stops when a new layer cannot achieve better testing accuracy than the previous one. We also stop generating new layers if the testing error was reduced by less than 1%, or if the number of layers reaches a limit that you can define.
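The stopping rules above can be captured in a small sketch (our own illustrative code; the error sequence is made up, not produced by a real network):

```python
def stop_layer(errors, max_layers=10, min_gain=0.01):
    """Index of the last layer kept: stop when the testing error grows,
    improves by less than min_gain (1%), or max_layers is reached."""
    for k in range(1, min(len(errors), max_layers)):
        gain = (errors[k - 1] - errors[k]) / errors[k - 1]
        if gain < min_gain:      # error grew, or shrank by less than 1%
            return k - 1
    return min(len(errors), max_layers) - 1

# 0.45 -> 0.448 improves by only ~0.4%, so layering stops at layer index 2.
print(stop_layer([1.0, 0.6, 0.45, 0.448, 0.42]))  # prints 2
```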
In some ill-posed simulations the number of unique weights of the network (the model complexity) can be too large compared with the number of dataset rows. As a result the model over-fits the data and may fail to predict the hold-out sample adequately. In this case you can limit the number of network layers manually, redesign your set of inputs, or cancel transformations that expand the initial set of inputs too much, for example, additional variables such as x1·x2.
If you want an even deeper insight into GMDH-family algorithms, we suggest reading the Research Report about the GMDH method (PDF).