Preprocess


The Preprocess plug-in transforms data according to the requirements of the Solver module.

Data manager

The Data manager is a visual tool that lets a user specify which columns of the data files serve as model inputs and which serve as prediction targets.

All input variables that contain text (nominal values) can be decomposed into binary variables. The number of new binary variables equals the number of distinct nominal values in the data column. Frequency-limited decomposition is also available. A target variable detected as nominal will not be decomposed; instead, its values will be encoded with the numbers 0, 1, 2, ….

Files [n]

The list of n files found in the project folder.

Variables [n]

The list of n variables found in the selected file.

Input variables [n]

The list of input variables and their transformations. Format: FileName.VariableName, transformation1, transformation2, …

Target variables [n]

The list of target variables and their transformations. Format: FileName.VariableName, transformation1, transformation2, …

Transformations

The list of transformations available for modification of Inputs and Targets.

File preview

Shows how GMDH Shell sees the user's data inside a file.

Toolbar panel buttons


Saves preprocessor settings.
Reloads data files.
Switches to the text-mode editing of preprocessor settings.
Opens the Preprocess Configuration dialog.

Preprocess configuration

To access the Preprocess configuration dialog, click the corresponding button on the toolbar panel of the Data manager tab.

Manage text inputs: Stop with error | Skip column | Decompose to binary columns
Specifies how input columns with text values are handled.

Manage text targets: Stop with error | Enumerate with 0,1,2,.. | Decompose to binary columns
Specifies how target columns with text values are handled.
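
As a rough illustration of the "Enumerate with 0,1,2,.." option, the sketch below maps every distinct text label of a target column to an integer. The order in which GMDH Shell assigns the numbers is not stated here; order of first appearance is an assumption, and the function name `enumerate_labels` is hypothetical.

```python
def enumerate_labels(values):
    """Map text labels to integer codes 0, 1, 2, ...

    Codes are assigned in order of first appearance (an assumption).
    """
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


encoded, codes = enumerate_labels(["low", "high", "low", "mid"])
print(encoded)  # [0, 1, 0, 2]
```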

Limit decomposition to most frequent labels
Enables limited decomposition to n variables. Only the most frequent values are decomposed into separate binary variables. The remaining values are grouped into a single binary variable, number n.
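
The idea of limited decomposition can be sketched as follows. This is only an illustration, not GMDH Shell's implementation: the function name `decompose_limited` and the column name `other` are hypothetical, and ties between equally frequent labels are broken by order of appearance, which is an assumption.

```python
from collections import Counter


def decompose_limited(values, n):
    """Decompose a text column into at most n binary columns.

    The n-1 most frequent labels get their own column; all remaining
    labels share the n-th column, called "other" here (hypothetical name).
    """
    ranked = [label for label, _ in Counter(values).most_common()]
    if len(ranked) <= n:
        kept, use_other = ranked, False   # every label fits, no shared column
    else:
        kept, use_other = ranked[: n - 1], True
    columns = {label: [int(v == label) for v in values] for label in kept}
    if use_other:
        columns["other"] = [int(v not in kept) for v in values]
    return columns


cols = decompose_limited(["a", "b", "a", "c", "d", "a", "b"], 3)
for name, column in cols.items():
    print(name, column)
```

With n = 3 the two most frequent labels ("a" and "b") keep their own columns, while "c" and "d" fall into the shared third column.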

Manage missing values: Stop with error | 0 | Interpolate | Mean | Median | Most frequent

The first option throws an error when a NULL value is detected. The other options replace missing values with 0, with the mean of the two neighboring values (interpolation), with the arithmetic mean of the current variable, with its median, or with its most frequent value.
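
The replacement strategies can be illustrated with a short sketch. Missing values are represented by `None`, and the function name `fill_missing` and the method names are hypothetical. How GMDH Shell treats a gap at the edge of a column or several consecutive gaps is not described here, so this sketch simply raises an error in those cases.

```python
from statistics import mean, median, mode


def fill_missing(column, method):
    """Return a copy of a numeric column with None entries replaced."""
    if method == "error":
        if None in column:
            raise ValueError("missing value detected")
        return list(column)
    if method == "interpolate":
        filled = list(column)
        for i, value in enumerate(column):
            if value is None:
                inside = 0 < i < len(column) - 1
                if not inside or column[i - 1] is None or column[i + 1] is None:
                    raise ValueError("no two neighboring values to interpolate from")
                # Mean of the two neighboring values.
                filled[i] = (column[i - 1] + column[i + 1]) / 2
        return filled
    present = [v for v in column if v is not None]
    strategies = {
        "zero": lambda xs: 0,
        "mean": mean,
        "median": median,
        "most_frequent": mode,
    }
    replacement = strategies[method](present)
    return [replacement if v is None else v for v in column]


data = [1, None, 3, 3, 10]
print(fill_missing(data, "interpolate"))  # [1, 2.0, 3, 3, 10]
print(fill_missing(data, "mean"))         # [1, 4.25, 3, 3, 10]
```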

Prevent targets from being included in the set of inputs
The same column will not be placed in both inputs and targets, so models like y=y are impossible.

For short files that can't be preprocessed properly: Stop with error | Skip silently
This option matters only when a set of files is imported. It either stops with an error or silently skips any file that contains too few data rows.

Non-recent data detection: No detection | Stop with error | Skip silently
If time series data is collected from a set of files, you may wish to ensure that the last row of every file contains the most recent timestamp across the dataset. If some files have no recent data, you can either stop processing or silently drop these files and their variables during the data preprocessing stage. The No detection option turns this feature off.
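
The check described above can be sketched in a few lines. The function name `filter_recent`, the policy names, and the dictionary of last-row dates are all hypothetical; the sketch assumes a file counts as non-recent when its last timestamp is earlier than the latest one found across all files.

```python
from datetime import date


def filter_recent(last_dates, policy):
    """Apply a non-recent data policy to {file name: date of last row}."""
    newest = max(last_dates.values())
    stale = sorted(name for name, d in last_dates.items() if d < newest)
    if policy == "none" or not stale:
        return sorted(last_dates)          # no detection, or nothing to drop
    if policy == "error":
        raise ValueError("files without recent data: " + ", ".join(stale))
    if policy == "skip":
        return sorted(name for name in last_dates if name not in stale)
    raise ValueError("unknown policy: " + policy)


files = {
    "a.csv": date(2020, 1, 31),
    "b.csv": date(2020, 1, 31),
    "c.csv": date(2019, 12, 31),
}
kept = filter_recent(files, "skip")
print(kept)  # ['a.csv', 'b.csv']
```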

Time series preprocess

The module prepares data for time series forecasting. With this module, GMDH Shell produces a series of simulations and calculates one forecast (of a certain time interval) for every simulation.

If the module is not yet loaded, click the Configuration button on the Control panel and select the module in the window that appears.

Learning window size

The number of dataset rows that will be used for model learning and validation.

Forecast period

The forecasting step. If you choose to forecast at several periods, GMDH Shell simply performs several simulations. To forecast one step ahead, for example, GMDH Shell puts model inputs X one step ahead of model target Y.
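
One way to read this alignment is that every row of inputs at time t is paired with the target value at time t + period, so that a model learns to predict the target that many steps ahead. The sketch below illustrates that reading only; the function name `align_for_forecast` is hypothetical, and both series are assumed to have the same length.

```python
def align_for_forecast(x, y, period):
    """Pair each input x[t] with the target y[t + period]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not 1 <= period < len(y):
        raise ValueError("period must be between 1 and len(y) - 1")
    return list(zip(x[:-period], y[period:]))


pairs = align_for_forecast([10, 11, 12, 13], [1, 2, 3, 4], period=1)
print(pairs)  # [(10, 2), (11, 3), (12, 4)]
```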

Task

Task: Forecast into the future | Forecast past periods | Custom simulation
You can perform a series of simulations that forecast variables into the future, or perform ex-post simulations that help you understand hypothetical prediction errors. Custom simulation allows you to perform multiple ex-post and future forecasts within one series of simulations.

Perform simulations

Sets the number of consecutive simulations.

shift back each time by

This is the second part of the simulation series configuration. The preprocessor can move the time pointer back by N data rows after each simulation in the series. This avoids overlapping forecast periods on the timescale when several ex-post simulations are performed.
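
The interplay of the learning window size, the forecast period, the number of simulations, and the shift can be sketched as below. The exact row arithmetic used by GMDH Shell is not documented here, so the function `simulation_windows` and its half-open row ranges are assumptions made for illustration.

```python
def simulation_windows(n_rows, window, period, simulations, shift):
    """List the (start, end) row ranges used by each simulation in a series.

    Ranges are half-open: rows start .. end-1. The time pointer starts at
    the end of the data and moves back by `shift` rows after each simulation.
    """
    plans = []
    for k in range(simulations):
        end = n_rows - k * shift
        start = end - window
        if start < 0:
            raise ValueError("not enough rows for simulation %d" % k)
        plans.append({"learn": (start, end), "forecast": (end, end + period)})
    return plans


plans = simulation_windows(n_rows=100, window=50, period=5, simulations=3, shift=5)
for plan in plans:
    print(plan)
```

In this sketch the forecast ranges are 100..104, 95..99, and 90..94, so they do not overlap as long as the shift is at least as large as the forecast period.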

Hold-out latest observations

Hides some data from the Solver module so that you can evaluate the success of the simulation and the overall correctness of the simulation settings.

General purpose preprocess

General purpose preprocess belongs to the preprocessor module and allows you to configure the second level of the model validation stage, called the hold-out. The General purpose preprocess panel is an alternative to the Time series preprocess panel and can be selected in Control Panel→Configuration→Workflow.

Hold out

The hold-out sample is the final part of the model validation process. Hold-out allows you to test the final model against completely new data and draw a final conclusion about the overall success of the modeling simulation. The hold-out sample should not be confused with the data partitioning performed in the Solver module to optimize and select the most accurate model. The goal of the hold-out sample is to validate the initial selection of inputs, data transformations and all other project settings. The results of hold-out testing can be observed in the Performance panel after your modeling simulation completes.

The hold-out sample can be selected uniformly or from the end of the dataset. You can set either the exact number of observations or a percentage of the dataset to be reserved for the hold-out. The rest of the dataset will be used for model learning.
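
The two selection modes can be illustrated with a small sketch that returns row indices. It is not GMDH Shell's code: the function name `split_hold_out` and its parameters are hypothetical, and "uniformly" is interpreted here as evenly spaced rows, which is an assumption (a random draw would be another possible reading).

```python
def split_hold_out(n_rows, size, percent=False, uniform=False):
    """Return (learning_rows, hold_out_rows) as lists of row indices."""
    k = round(n_rows * size / 100) if percent else size
    if not 0 <= k <= n_rows:
        raise ValueError("hold-out size is out of range")
    if uniform:
        # Evenly spaced rows across the dataset (assumed meaning of "uniformly").
        hold = [(i * n_rows) // k for i in range(k)]
    else:
        # The last k rows of the dataset.
        hold = list(range(n_rows - k, n_rows))
    hold_set = set(hold)
    learn = [i for i in range(n_rows) if i not in hold_set]
    return learn, hold


learn, hold = split_hold_out(10, 3)
print(hold)  # [7, 8, 9]
learn_u, hold_u = split_hold_out(10, 3, uniform=True)
print(hold_u)  # [0, 3, 6]
```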

prepro.txt

prepro.txt is a text file containing a series of Preprocessor commands that fully describe which operations must be carried out on the imported dataset. The Data manager translates the visual configuration into text commands, but the file can also be edited directly as text.

prepro.txt allows a user to specify

  • global directives
  • input variables
  • target variables
  • transformations
  • depth of lags (for time series forecasting)

The asterisk command ('*') composes the input set of variables from all imported columns.

Targets must be listed after the :Target command (or the equivalent :Output command). If there is no target mark in prepro.txt, the last variable in the list will be used as the single target variable.

prepro.txt accepts certain combinations of the following commands: [FileName], [VariableName], [.], [*], [@], [LagInterval]. The general form of a command is FileName.VariableName@LagInterval. It may look as follows:
varname_a@0-10
MyFile.*@0-10
:Target
varname_b
"varname c"
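
To make the structure of such a file concrete, the sketch below splits a prepro.txt text into input and target commands. It is a simplified reading of the rules given here and not the parser used by GMDH Shell: the function name `parse_prepro` is hypothetical, global directives and semicolons inside quoted names are not handled, and moving the last variable out of the inputs when no target mark is present is an assumption.

```python
def parse_prepro(text):
    """Split prepro.txt content into (inputs, targets) command lists."""
    inputs, targets = [], []
    current = inputs
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()   # drop ';' comments
        if not line:
            continue
        if line in (":Target", ":Output"):
            current = targets                 # everything after this is a target
            continue
        current.append(line)
    if not targets and inputs:
        # No target mark: the last listed variable becomes the single target.
        targets.append(inputs.pop())
    return inputs, targets


example = """
varname_a@0-10
MyFile.*@0-10
:Target
varname_b
"varname c"
"""
inputs, targets = parse_prepro(example)
print(inputs)   # ['varname_a@0-10', 'MyFile.*@0-10']
print(targets)  # ['varname_b', '"varname c"']
```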

Examples of valid preprocessor strings


The semicolon symbol (';') marks single row comments.

; EXAMPLES OF VALID COMMANDS
;===========================

varname
;Get the variable named varname from the set of imported variables and include it in the set of input variables.

fname.varname
;Get the variable varname from the file fname.xls(csv) located in the project folder.

fname.*
;Get all variables from the file fname.xls(csv).

*.*
;Get all variables from all files in the project folder.

*
;Get all variables of the imported data file.

varname@n
;Get the variable varname lagged by n time points.

varname@n-m
;Get a sequence of lags of the variable varname. The lag runs from n to m (n,n+1,…,m). The number of generated variables is m-n+1.

varname@*
;Get all possible lags of the variable varname.

*@*
;Get all variables of the imported data file and generate all possible lags.

*.*@*
;Get all variables from all files and generate all possible lags of all variables.

;Other possible combinations of commands may look like fname.*@n-m, *.*@n-m, and so on.
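
The command forms listed above can be taken apart with a short sketch. The helper names `expand_lags`, `parse_command`, and `lag_series` are hypothetical; quoted names and file names that contain dots are not handled; and "lagged by n time points" is assumed to mean that row t receives the value from row t-n.

```python
def expand_lags(spec):
    """Turn a lag interval such as "3", "2-4" or "*" into a list of lags."""
    if spec == "*":
        return "all"                      # depends on the data length
    first, sep, last = spec.partition("-")
    n = int(first)
    m = int(last) if sep else n
    return list(range(n, m + 1))          # m-n+1 lags


def parse_command(command):
    """Split FileName.VariableName@LagInterval into its three parts."""
    body, at, lag = command.partition("@")
    file_name, _, variable = body.rpartition(".")
    return file_name or None, variable, expand_lags(lag) if at else None


def lag_series(series, k):
    """Shift a series down by k rows, padding the start with None."""
    k = min(k, len(series))
    return [None] * k + series[: len(series) - k]


print(parse_command("fname.varname@2-4"))  # ('fname', 'varname', [2, 3, 4])
print(parse_command("varname"))            # (None, 'varname', None)
print(lag_series([1, 2, 3, 4], 1))         # [None, 1, 2, 3]
```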

Valid configuration examples


  • The following lines generate lags 0 through 10 of the first column (the original column plus 10 single-step lags) and use the original first column as the output. In the case of time series forecasting, the last column (the output) will be moved at least one step ahead of the other variables (i.e. lagged by -1).

x1@0-10
:Target
x1

  • The following string will use all tables from all files of the Project folder, generate two single-step lags and place the file selected in the 'Data file path' as the last table (last table contains the output variable).

*.*@0-2

  • The following example takes a number of columns from a number of files and generates some lags. Variables without a filename belong to the initial data file selected in the Importer dialog.

filename_a.x3@10
filename_a.x3@21-31
"filename b".x2@5
x1
x5@2-4
:Target
x2
filename_c.x1

Note: The preprocessor aligns imported columns of different lengths by the first or by the last observation (depending on the [Forecast latest observations] checkbox) and trims the unusable part.
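
The alignment described in the note can be pictured with the following sketch. It assumes that the "unusable part" is every row not covered by all columns, so each column is trimmed to the length of the shortest one; the function name `align_columns` and the `by_last` flag are hypothetical.

```python
def align_columns(columns, by_last=True):
    """Trim columns to a common length, aligned by last or first observation."""
    shortest = min(len(c) for c in columns.values())
    if by_last:
        # Keep the most recent rows of every column.
        return {name: c[len(c) - shortest:] for name, c in columns.items()}
    # Keep the earliest rows of every column.
    return {name: c[:shortest] for name, c in columns.items()}


data = {"a": [1, 2, 3, 4], "b": [7, 8]}
print(align_columns(data, by_last=True))   # {'a': [3, 4], 'b': [7, 8]}
print(align_columns(data, by_last=False))  # {'a': [1, 2], 'b': [7, 8]}
```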
