The Preprocess plug-in transforms data according to the requirements of the Solver module.
The Data manager is a visual tool that lets a user specify which columns of the data files will serve as model inputs and prediction targets.
All input variables that contain text (nominal values) can be decomposed into binary variables. The number of new binary variables equals the number of distinct nominal values in the data column. Frequency-limited decomposition is also supported. A target variable detected as nominal is not decomposed; instead, its values are encoded with the numbers 0, 1, 2, …
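The two encodings described above can be sketched in a few lines of Python. This is only an illustration of the idea, not GMDH Shell's implementation; the function names, the alphabetical column order, and the order in which codes are assigned are all assumptions.

```python
def decompose_to_binary(values):
    """Return one 0/1 column per distinct label, keyed by label."""
    labels = sorted(set(values))  # assumption: columns in alphabetical order
    return {label: [1 if v == label else 0 for v in values] for label in labels}


def enumerate_labels(values):
    """Encode labels as 0, 1, 2, ... in order of first appearance (assumed)."""
    codes = {}
    return [codes.setdefault(v, len(codes)) for v in values]


colors = ["red", "green", "red", "blue"]
binary_columns = decompose_to_binary(colors)  # three columns for three labels
target_codes = enumerate_labels(colors)       # [0, 1, 0, 2]
```

The number of binary columns matches the number of distinct labels, as stated above.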
The list of n files found in the project folder.
The list of n variables found in the selected file.
The list of input variables and their transformations. Format: FileName.VariableName, transformation1, transformation2, …
The list of target variables and their transformations. Format: FileName.VariableName, transformation1, transformation2, …
The list of transformations available for modification of Inputs and Targets.
Shows how GMDH Shell sees user's data inside a file.
Manage text inputs:
Stop with error | Skip column | Decompose to binary columns
Specifies how input columns with text values are handled.
Manage text targets:
Stop with error | Enumerate with 0,1,2,.. | Decompose to binary columns
Specifies how target columns with text values are handled.
Limit decomposition to most frequent labels
Enables limited decomposition to n variables. Only the most frequent values are decomposed into separate binary variables. The remaining values are grouped into a single binary variable, number n.
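A minimal sketch of frequency-limited decomposition follows. It assumes that "n variables" means n-1 columns for the most frequent labels plus one shared column for everything else; the function name and the "<other>" column name are hypothetical.

```python
from collections import Counter


def limited_decompose(values, n):
    """Decompose `values` into at most n binary columns.

    The n-1 most frequent labels get their own column; all other
    labels share the last column (named "<other>" here, an assumption).
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    top = [label for label, _ in Counter(values).most_common(n - 1)]
    columns = {label: [1 if v == label else 0 for v in values] for label in top}
    columns["<other>"] = [0 if v in top else 1 for v in values]
    return columns


labels = ["a", "a", "a", "b", "b", "c", "d"]
columns = limited_decompose(labels, 3)  # columns: "a", "b", "<other>"
```

Here the rare labels "c" and "d" both fall into the shared third column.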
Manage missing values:
Stop with error | 0 | Interpolate | Mean | Median | Most frequent
The first option throws an error when a NULL value is detected. The other options replace missing values with 0, the mean of the two neighboring values (interpolation), the arithmetic mean of the current variable, its median, or its most frequent value.
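The replacement options can be illustrated with the sketch below. It is not the program's code: the method names are hypothetical stand-ins for the options listed above, missing values are represented by None, and the behavior at the edges of a column is an assumption.

```python
from statistics import mean, median, mode


def fill_missing(column, method):
    """Return a copy of `column` with None entries replaced."""
    known = [v for v in column if v is not None]
    filled = []
    for i, value in enumerate(column):
        if value is not None:
            filled.append(value)
        elif method == "error":
            raise ValueError(f"missing value in row {i}")
        elif method == "zero":
            filled.append(0)
        elif method == "interpolate":
            # Mean of the two neighboring values. This sketch requires
            # both neighbors to exist and to be present (an assumption).
            if i == 0 or i == len(column) - 1:
                raise ValueError("no two neighbors at the edge of the column")
            before, after = column[i - 1], column[i + 1]
            if before is None or after is None:
                raise ValueError("a neighboring value is missing too")
            filled.append((before + after) / 2)
        elif method == "mean":
            filled.append(mean(known))
        elif method == "median":
            filled.append(median(known))
        elif method == "most_frequent":
            filled.append(mode(known))
        else:
            raise ValueError(f"unknown method: {method}")
    return filled


column = [1.0, None, 3.0, 3.0, 8.0]
interpolated = fill_missing(column, "interpolate")  # gap becomes 2.0
```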
Prevent targets from being included to the set of inputs
The same column will not be placed in both inputs and targets, so trivial models like y=y are impossible.
For short files that can't be preprocessed properly:
Stop with error | Skip silently
This option matters only when a set of files is imported. It either stops with an error or silently skips any file that contains too few data rows.
Non-recent data detection:
No detection | Stop with error | Skip silently
If time series data is collected from a set of files, you may wish to ensure that the last row of every file carries the most recent timestamp across the dataset. If some files have no recent data, you can either stop processing or silently drop these files and their variables during the data preprocessing stage.
The No detection option turns off this feature.
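The check described above amounts to comparing the last timestamp of each file with the latest one found across all files. A minimal sketch, with hypothetical file names and a hypothetical function name:

```python
from datetime import date


def split_by_recency(last_timestamps):
    """Split files into those ending at the newest timestamp and the rest.

    `last_timestamps` maps a file name to the timestamp of its last row.
    """
    newest = max(last_timestamps.values())
    recent = [name for name, ts in last_timestamps.items() if ts == newest]
    stale = [name for name, ts in last_timestamps.items() if ts < newest]
    return recent, stale


# Hypothetical files; only b.csv stops before the most recent date.
last_rows = {
    "a.csv": date(2020, 3, 31),
    "b.csv": date(2020, 2, 29),
    "c.csv": date(2020, 3, 31),
}
recent, stale = split_by_recency(last_rows)
```

With "Stop with error" a non-empty stale list would abort processing; with "Skip silently" the stale files would simply be dropped.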
The module prepares data for time series forecasting. With this module, GMDH Shell produces a series of simulations and calculates one forecast (of a certain time interval) for every simulation.
The step of forecasting. If you choose to forecast at several periods, GMDH Shell simply performs several simulations. To forecast one step ahead, for example, GMDH Shell puts model inputs X one step ahead of the model target.
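One way to read the step setting is that each target value is paired with the input values observed a given number of rows earlier. The sketch below shows that interpretation only; the function name and the equal-length requirement are assumptions, and the actual alignment inside GMDH Shell may differ.

```python
def make_pairs(inputs, targets, step):
    """Pair inputs[t] with targets[t + step] for every usable row t."""
    if step < 1:
        raise ValueError("step must be at least 1")
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    return list(zip(inputs[:-step], targets[step:]))


x = [10, 11, 12, 13]
y = [1, 2, 3, 4]
one_step = make_pairs(x, y, 1)  # [(10, 2), (11, 3), (12, 4)]
```

A model fitted on such pairs predicts the target one row after the last known input row; a larger step shifts the pairing further.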
Task: Forecast into the future | Forecast past periods | Custom simulation
You can perform a series of simulations that forecast variables into the future, or perform ex-post simulations that just help to understand hypothetical prediction errors. Custom simulation allows you to perform multiple ex-post and future forecasts within one series of simulations.
Sets the number of consecutive simulations.
This is the second part of the simulation series configuration. The Preprocessor can move the time pointer back by N data rows after each simulation in the series. This avoids overlapping forecast periods on the timescale when several ex-post simulations are performed.
Allows you to hide some data from the Solver module in order to evaluate the success of the simulation and the overall correctness of the simulation settings.
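The effect of moving the time pointer back can be sketched as follows. The model here is an assumption made for illustration: the time pointer is the index of the first forecast row, each ex-post simulation forecasts `horizon` rows, and all parameter names are hypothetical.

```python
def simulation_windows(n_rows, n_simulations, shift_back, horizon):
    """Return the (start, stop) row range forecast by each simulation.

    Rows before `start` are visible to the model; `stop` is exclusive.
    """
    windows = []
    for k in range(n_simulations):
        start = n_rows - horizon - k * shift_back
        if start <= 0:
            raise ValueError("not enough data rows for this series")
        windows.append((start, start + horizon))
    return windows


windows = simulation_windows(n_rows=100, n_simulations=3, shift_back=5, horizon=5)
# [(95, 100), (90, 95), (85, 90)]
```

Under this model, forecast periods do not overlap whenever the shift is at least as large as the forecast horizon.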
The General purpose preprocess panel belongs to the Preprocessor module and allows you to configure the second level of the model validation stage, called the hold-out. It is an alternative to the Time series preprocess panel and can be selected in Control Panel→Configuration→Workflow.
The hold-out sample is the final part of the model validation process. The hold-out allows you to test the final model against completely new data and draw a final conclusion about the overall success of the modeling simulation. The hold-out sample should not be confused with the data partitioning performed in the Solver module to optimize and select the most accurate model. The goal of the hold-out sample is to validate the initial selection of inputs, data transformations and all other project settings. Results of hold-out testing can be observed in the Performance panel after your modeling simulation completes.
The hold-out sample can be selected uniformly or from the end of the dataset. You can set the exact number of observations or a certain percentage of the dataset to be reserved for the hold-out. The rest of the dataset will be used for model learning.
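The two selection modes can be sketched as below. This is an illustration under assumptions: "uniformly" is taken to mean evenly spaced rows (it could also mean random sampling), an integer size is a row count, and a float size is a fraction of the dataset. The function name is hypothetical.

```python
def split_holdout(rows, size, uniform=False):
    """Split `rows` into (learning, holdout).

    `size` is an int (number of rows) or a float fraction in (0, 1).
    """
    n = len(rows)
    k = round(n * size) if isinstance(size, float) else size
    if not 0 < k < n:
        raise ValueError("hold-out must be non-empty and smaller than the dataset")
    if uniform:
        step = n / k
        # Evenly spaced indices, one from the middle of each of k equal chunks.
        picked = {int(i * step + step / 2) for i in range(k)}
    else:
        picked = set(range(n - k, n))  # the last k rows
    holdout = [r for i, r in enumerate(rows) if i in picked]
    learning = [r for i, r in enumerate(rows) if i not in picked]
    return learning, holdout


rows = list(range(10))
learn_end, hold_end = split_holdout(rows, 0.2)              # hold-out: [8, 9]
learn_uni, hold_uni = split_holdout(rows, 2, uniform=True)  # hold-out: [2, 7]
```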
prepro.txt is a text file containing a series of Preprocessor commands that fully describe the operations to be carried out on the imported dataset. The Data manager translates the visual configuration into text commands, but text-style configuration is also permitted.
prepro.txt allows a user to specify
The asterisk command ('*') composes the input set of variables from all imported columns.
Targets must be listed after the :Target command (or the equivalent :Output command). If there is no target mark in prepro.txt, then the last variable in the list will be used as a single target variable.
prepro.txt accepts certain combinations of the following commands:
[FileName], [VariableName], [.], [*], [@], [LagInterval]. The general form of a command is
It may look as follows:
The semicolon symbol (';') marks single-line comments.
; EXAMPLES OF VALID COMMANDS
;Get the variable named varname from the set of imported variables and include it in the set of input variables.
;Get the variable var from the file fname.xls(csv) located in the project folder.
;Get all variables from the file
;Get all variables from all files in the project folder
;Get all variables of the imported data file.
;Get variable var lagged by n time points.
;Get a sequence of lags of variable var. The lag interval ranges from n to m (n,n+1,…,m). The number of generated variables is m-n+1.
;Get all possible lags of the variable
;Get all variables of the imported data file and generate all possible lags.
;Get all variables from all files and generate all possible lags of all variables.
;Other possible combinations of commands may look like fname.*@n-m, *.*@n-m, and so on.
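What a lag range such as @n-m produces can be illustrated with a short Python sketch. It is not part of prepro.txt and not the Preprocessor's code; the function names are hypothetical, and naming the generated columns "var@lag" is an assumption borrowed from the command notation.

```python
def lagged(series, lag):
    """Shift `series` down by `lag` rows; rows without history become None."""
    return ([None] * lag + list(series))[:len(series)]


def lag_range(name, series, n, m):
    """Generate the lags n, n+1, ..., m of one variable (m - n + 1 columns)."""
    return {f"{name}@{lag}": lagged(series, lag) for lag in range(n, m + 1)}


lags = lag_range("var", [1, 2, 3, 4], 1, 2)
# {"var@1": [None, 1, 2, 3], "var@2": [None, None, 1, 2]}
```

The leading rows that contain None have no usable history, so a real preprocessing step would have to trim or otherwise handle them.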
Note: The preprocessor aligns imported columns of different lengths by the first or by the last observation (depending on the [Forecast latest observations] checkbox) and trims the unusable part.