preprocess

Preprocess transforms data according to the modeling experiment conditions configured in the project. It runs a script that is either loaded from a template or created visually using the **Variables and transformations** panel.

**Variables and transformations** is a panel accessible via the program menu **View > Variables and transformations**. When enabled, the panel is nested in the **Data explorer** tab. It is used to specify model inputs, targets (predicted columns) and their transformations.

**View by datasets / variables** is a list of datasets connected to the project. The panel is used to select and then add variables to

**Input variables [n]** is a list of input variables and transformations applied to them. '**n**' is the number of variables in the list. Each row represents a variable or a set of variables in the following format:
**VariableName, transformation1, transformation2, …**

** All** and

The **Edit** button in the panel title is used to edit the preprocess script manually in the text editor.

**Target variables [n]** is a list of variables that will be modeled and predicted. '**n**' is the number of variables in the list. Each row represents a variable or a set of variables in the following format: **VariableName, transformation1, transformation2, …**

**Transformations** is a panel used to access various transformations available for **Input variables** and **Target variables**.

** Settings** is a button in the panel title used to modify

**Obligatory transformations** is a dialog accessible via the * Settings* (underlined text button) in the title of the

**Manage text inputs** is used to handle input variables with categories (instead of numeric values). Available options are

**Stop with error** | **Skip column** | **Decompose to binary columns**

**Manage text targets** is used to handle target variables with categories (instead of numeric values). Available options are

**Stop with error** | **Enumerate with 0,1,2,..** | **Decompose to binary columns**

**Limit decomposition to the most frequent labels** limits the number of new variables created as a result of decomposition. Only the most frequent labels are decomposed into separate binary variables. All other labels are merged into a single binary variable named '**?**'. The number of new variables including '

`?`

**Treat value as missing if it is outside n*sigma** is used to replace spikes (i.e. outliers) that exceed a certain number of standard deviations. All spikes that exceed **n*sigma** will be replaced with missing values. The way in which missing values are handled can be set using the **Manage missing values** control.
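As an illustration of the n*sigma rule, here is a minimal Python sketch. It assumes that the deviation is measured from the arithmetic mean of the variable and that the population standard deviation is used; the source states neither, and `mask_outliers` is a hypothetical helper name, not part of GMDH Shell.

```python
import math


def mask_outliers(values, n_sigma=3.0):
    """Return a copy of values with outliers replaced by None (missing)."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        return list(values)
    mean = sum(present) / len(present)
    # Assumption: population standard deviation around the mean.
    sigma = math.sqrt(sum((v - mean) ** 2 for v in present) / len(present))
    limit = n_sigma * sigma
    return [
        None if v is not None and abs(v - mean) > limit else v
        for v in values
    ]


series = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 10.0, 100.0]
cleaned = mask_outliers(series, n_sigma=2.0)
print(cleaned)
```

The masked positions would then be filled according to the **Manage missing values** setting.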

**Manage missing values** is used to handle various gaps (NULL values) in the data. Available options are

**Stop with error** | **0** | **Interpolate** | **Mean** | **Median** | **Most frequent**.

The first option throws an error when a NULL value is detected. Other options allow replacement of missing values with 0, mean of two neighbouring values (interpolation), arithmetic mean of current variable, median, and the most frequent value.
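The replacement options can be sketched in Python as follows. This is an approximation under stated assumptions: missing values are represented by `None`, interpolation uses the nearest known neighbour on each side (or the single available one at the edges), and the option names passed to the hypothetical `fill_missing` function are illustrative labels rather than program identifiers.

```python
from collections import Counter
from statistics import mean, median


def fill_missing(values, method):
    """Fill None entries according to the chosen option."""
    present = [v for v in values if v is not None]
    if method == "stop":
        if len(present) != len(values):
            raise ValueError("missing value detected")
        return list(values)
    if method == "interpolate":
        out = list(values)
        for i, v in enumerate(values):
            if v is not None:
                continue
            # Nearest known value to the left and to the right, if any.
            left = next((values[j] for j in range(i - 1, -1, -1)
                         if values[j] is not None), None)
            right = next((values[j] for j in range(i + 1, len(values))
                          if values[j] is not None), None)
            known = [x for x in (left, right) if x is not None]
            if known:
                out[i] = sum(known) / len(known)
        return out
    if method == "zero":
        fill = 0
    elif method == "mean":
        fill = mean(present)
    elif method == "median":
        fill = median(present)
    elif method == "most_frequent":
        fill = Counter(present).most_common(1)[0][0]
    else:
        raise ValueError("unknown method: " + method)
    return [fill if v is None else v for v in values]


data = [1.0, None, 3.0, 3.0, 8.0]
for option in ("zero", "interpolate", "mean", "median", "most_frequent"):
    print(option, fill_missing(data, option))
```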

If time series data is collected from a set of files it is possible to ensure that all files contain recent observations and all series are long enough to be preprocessed properly.

**For short files that can't be preprocessed properly** there are two options: **Stop with error** and **Skip silently**. This parameter matters only when a set of files is imported. If a file contains too few observations, you can instruct preprocess to skip the file silently instead of throwing an error.

**Non-recent data detection** checks whether every series ends with the most recent timestamp found in the set of files. There are three options available:

**No detection** | **Stop with error** | **Skip silently**

The last one is used to simply drop non-recent files from the preprocessing results.

GMDH Shell uses different sets of preprocessing features for time series models and for all other models, i.e. classification and regression models. A proper set of features is usually loaded from a template. To change it, open the menu **File > Configuration**, then the **Modules** tab, then **Preprocess**.

**Simple forecasting** is a panel in the program toolbar used to create time series models.

**Horizon** is the number of observations to be forecasted.

**Holdout** is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.

**Time series forecasting** is a panel in the program toolbar used to create time series models.

**Horizon** is the number of observations to be forecasted. GMDH Shell can perform several simulations with different forecast horizons at a time. To use this feature, separate the desired forecast horizons with commas, for example `1,6,12`.

**Holdout** is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.

**Repeat** is used to perform ex-post forecast of several previous horizons at a time.

**Window validation** is the number of experiments used for window size optimization. Optimal window size improves forecast accuracy.

**Hold-out data** panel is used to apply regression and classification models to a subset of observations immediately after the models are obtained. The hold-out sample can be either used to evaluate out-of-sample accuracy or to predict unknown values of target variables.

**Hold-out** is used to select a subsample to which the model will be applied. Available options are to hold-out **last observations** | **observations uniformly** | **missing target values**

In the next two control elements you can set the exact number of **dataset observations** or **% of dataset** to be reserved for the hold-out.
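The three hold-out options can be pictured with the following Python sketch. It is only an approximation: the exact positions that the program picks for the uniform option are not documented here, so the evenly spaced indices below are an assumption, and `holdout_indices` and `count_from_percent` are hypothetical helper names.

```python
def count_from_percent(n, percent):
    """Convert '% of dataset' into a number of observations."""
    return n * percent // 100


def holdout_indices(targets, mode, count=0):
    """Return row indices reserved for the hold-out sample."""
    n = len(targets)
    if mode == "missing_target":
        # Rows whose target is unknown are the ones to be predicted.
        return [i for i, t in enumerate(targets) if t is None]
    count = min(count, n)
    if count <= 0:
        return []
    if mode == "last":
        return list(range(n - count, n))
    if mode == "uniform":
        # Assumption: evenly spaced rows, always ending at the last row.
        return [((k + 1) * n) // count - 1 for k in range(count)]
    raise ValueError("unknown mode: " + mode)


targets = [1, 2, 3, 4, 5, 6, 7, 8, None, None]
print(holdout_indices(targets, "last", 3))
print(holdout_indices(targets, "uniform", count_from_percent(len(targets), 20)))
print(holdout_indices(targets, "missing_target"))
```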

The script specifies commands that generate input variables, target variables and various transformations. Each preprocess command occupies one line and can refer to a group of variables.

One-line commands have the following format:

**FileName.VariableName, Transformation1(), Transformation2(), …**

If there is only one file connected, the “**FileName.**” prefix is not required.
If a spreadsheet (.xls or .xlsx) contains data in more than one sheet, all variables must be referred to as “**FileName:SheetName.VariableName**”.

To refer to a group of variables, use the asterisk symbol “`*`”, for example:

`FileName.*`

is used to select all variables from a file.

`*.VariableName`

is used to select variables with certain name from all files.

`*.*`

is used to select all variables from all files (and all sheets).

`*`

is used to select all variables if there is only one file connected.
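The wildcard rules above can be illustrated with a small Python sketch. It is a simplification under stated assumptions: names contain no dots, quotes or sheet parts, the asterisk only stands for a whole file or variable name, and `select` is a hypothetical helper rather than a GMDH Shell function.

```python
def select(pattern, columns):
    """Return the (file, variable) pairs matched by a selection pattern.

    columns is a list of (file_name, variable_name) tuples.
    """
    if "." in pattern:
        file_pat, var_pat = pattern.split(".", 1)
    else:
        # No file prefix: the pattern names a variable in any file.
        file_pat, var_pat = "*", pattern

    def matches(pat, name):
        return pat == "*" or pat == name

    return [
        (f, v) for f, v in columns
        if matches(file_pat, f) and matches(var_pat, v)
    ]


columns = [("sales", "price"), ("sales", "volume"), ("stock", "price")]
print(select("sales.*", columns))
print(select("*.price", columns))
print(select("*.*", columns))
```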

If a file name or a variable name contains spaces or reserved symbols, each such name must be quoted, for example: **"File Name"."Variable Name"**. If the name contains a quote **"**, it must be doubled, for example **a"b"c** ⇒ **"a""b""c"**
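When script lines are generated programmatically, the quoting rule can be applied with a helper like the one below. This is a sketch: `quote_name` and `reference` are hypothetical names, and quoting every name (including those that do not strictly need it) is an assumption about what the script parser accepts.

```python
def quote_name(name):
    """Wrap a name in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def reference(file_name, variable_name):
    """Build a FileName.VariableName reference with both parts quoted."""
    return quote_name(file_name) + "." + quote_name(variable_name)


print(quote_name('a"b"c'))
print(reference("File Name", "Variable Name"))
```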

A semicolon symbol “**;**” allows one-line comments.

Here is the list of available transformations.

| Transformation | Notation example | Description |
|---|---|---|
| Square | `x, sqr` | `y=x^2` |
| Square root | `x, sqrt` | `y=x^(1/2)` |
| Cube | `x, cube` | `y=x^3` |
| Cube root | `x, cubert` | `y=x^(1/3)` |
| Exp | `x, exp` | `y=exp(x)` |
| Logarithm | `x, ln` | `y=ln(\|x\|), x<>0` |
| Sine | `x, sin` | `y=sin(x)` |
| Cosine | `x, cos` | `y=cos(x)` |
| Arctangent | `x, arctang` | `y=arctang(x)` |
| Abs value | `x, abs` | `y=\|x\|` |
| Sign | `x, sign` | `y=sign(x)` |
| Floor | `x, floor` | `y=[x]` |
| Fractional part | `x, frac` | `y=x-[x]` |
| Normalization | `x, norm(b1,b2)` | `b1` is lower boundary, `b2` is upper boundary |
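For readers who want to check values by hand, the elementary transformations above correspond roughly to the Python functions below. Two points are assumptions rather than documented behaviour: the cube root of a negative number is taken as the negative real root, and `norm(b1,b2)` is interpreted as a linear min-max rescaling of the column to the range from `b1` to `b2`.

```python
import math

# Keys follow the notation column of the table; values are Python equivalents.
TRANSFORMS = {
    "sqr": lambda x: x ** 2,
    "sqrt": math.sqrt,
    "cube": lambda x: x ** 3,
    "cubert": lambda x: math.copysign(abs(x) ** (1 / 3), x),  # assumption
    "exp": math.exp,
    "ln": lambda x: math.log(abs(x)),  # defined for x != 0
    "sin": math.sin,
    "cos": math.cos,
    "arctang": math.atan,
    "abs": abs,
    "sign": lambda x: (x > 0) - (x < 0),
    "floor": math.floor,
    "frac": lambda x: x - math.floor(x),
}


def norm(column, b1, b2):
    """Assumed meaning of norm(b1,b2): min-max rescaling to [b1, b2]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [b1 for _ in column]
    return [b1 + (x - lo) * (b2 - b1) / (hi - lo) for x in column]


print(TRANSFORMS["frac"](2.75))
print(norm([2, 4, 6], 0, 1))
```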

| Transformation | Notation example | Description |
|---|---|---|
| Lag | `x@a-b:c` | `a` is min lag, `b` is max lag, `c` is step. Example: step of 3 applied to 0-12 leaves only 0, 3, 6, 9 and 12. This helps to reduce the number of variables. |
| All lags | `x@*:a` | `a` is step. Generates lags while dataset length allows. Applies to inputs only. |
| Moving average | `x,SMA(a)` | `a` is window length; an integer 2..1000 |
| Exponential MA | `x,EMA(a)` | `a` is quotient; a real number in [0.01, 1) |
| Derivative | `x,d` | `y = x[t] - x[t-1]` |
| Window size | `x,window(a)` | `a` is window size; an integer ≥ 2. Applies to targets only. |
| Weighted by time | `x,weighted_by_time` | Sets higher weights for later observations in proportion: 1, 2, 3, …, n |
| Fourier series | `x,fourier(a)` | `a` is period; an integer ≥ 2. Fourier series: cos(2πkx/T) and sin(2πkx/T). Generates multiple variables, does not stack with other generating transformations |
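The lag, moving average and derivative rows can be made concrete with the Python sketch below. Several details are assumptions: the moving average is taken over the trailing window, positions without enough history are left as `None`, and the EMA parameter is treated as the weight of the newest observation. The function names are illustrative only.

```python
def expand_lags(min_lag, max_lag, step=1):
    """Lags produced by x@a-b:c, e.g. 0-12 with step 3."""
    return list(range(min_lag, max_lag + 1, step))


def lagged(series, lag):
    """Series shifted by lag observations; None where no history exists."""
    return [series[t - lag] if t - lag >= 0 else None
            for t in range(len(series))]


def sma(series, window):
    """Trailing simple moving average (assumed alignment)."""
    out = []
    for t in range(len(series)):
        if t + 1 < window:
            out.append(None)
        else:
            out.append(sum(series[t + 1 - window:t + 1]) / window)
    return out


def ema(series, alpha):
    """Exponential moving average; alpha weighs the newest value (assumed)."""
    if not series:
        return []
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1 - alpha) * out[-1])
    return out


def derivative(series):
    """First difference y = x[t] - x[t-1]."""
    return [None] + [series[t] - series[t - 1] for t in range(1, len(series))]


print(expand_lags(0, 12, 3))
print(sma([1, 2, 3, 4], 2))
print(derivative([1, 4, 9]))
```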

| Transformation | Notation example | Description |
|---|---|---|
| Year | `x,year` | Year number in Gregorian calendar. |
| Year fraction | `x,year_frac` | Year fraction in Gregorian calendar. 0 = Jan 1st 00:00, 1 = Dec 31st 24:00. |
| Month | `x,month,decompose` | Month number in Gregorian calendar. 1 = Jan, 12 = Dec. |
| Month fraction | `x,month_frac` | Month fraction in Gregorian calendar. 0 = 1st 00:00, 1 = 31st 24:00. |
| Day of month | `x,day,decompose` | Day of month in Gregorian calendar (1-31). |
| Day of week | `x,dayofweek,decompose` | Day of week, 1 = Mon, 7 = Sun. |
| Day fraction | `x,day_frac` | Fraction of the day. 0 = 00:00, 1 = 24:00. |
| Hour | `x,hour,decompose` | Hour (0..23). |
| Hour fraction | `x,hour_frac` | Fraction of the hour. 0 = ##:00:00, 1 = ##:60:00. |
| Minute | `x,minute,decompose` | Minute (0..59). |
| Second | `x,second,decompose` | Second (0..59). |
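The fraction-style date features can be reproduced approximately with Python's standard `datetime` module, as sketched below. One assumption is worth flagging: the month fraction here is scaled by the actual length of the month, whereas the table only describes the 31-day case. The function names mirror the notation but are otherwise hypothetical.

```python
from datetime import datetime, timedelta


def year_frac(ts):
    """0 at Jan 1st 00:00, approaching 1 at Dec 31st 24:00."""
    start = datetime(ts.year, 1, 1)
    end = datetime(ts.year + 1, 1, 1)
    return (ts - start) / (end - start)


def month_frac(ts):
    """0 at the 1st 00:00; scaled by the real month length (assumption)."""
    start = datetime(ts.year, ts.month, 1)
    end = datetime(ts.year + (ts.month == 12), ts.month % 12 + 1, 1)
    return (ts - start) / (end - start)


def day_frac(ts):
    """0 at 00:00, approaching 1 at 24:00."""
    midnight = datetime(ts.year, ts.month, ts.day)
    return (ts - midnight) / timedelta(days=1)


def hour_frac(ts):
    """0 at ##:00:00, approaching 1 at ##:60:00."""
    return (ts.minute * 60 + ts.second) / 3600


def day_of_week(ts):
    """1 = Mon, 7 = Sun, matching ISO weekday numbering."""
    return ts.isoweekday()


stamp = datetime(2014, 6, 10, 12, 30)
print(day_of_week(stamp), day_frac(stamp), hour_frac(stamp))
```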

| Transformation | Notation example | Description |
|---|---|---|
| Day off | `x,isdayoff` | Returns 1 when the date is a Saturday, a Sunday or a holiday of a specific country. Otherwise returns 0. |
| Workdays per period | `x,workdaysperperiod` | Returns the number of workdays (Mon-Fri, except holidays) in a period (week, month, etc.). |
| Decomposed holidays | `x,holidays` | For each holiday in the holiday set, returns 1 if it falls into a period (week, month, etc.). Produces as many variables as there are holidays in the year. |
| Is holiday | `x,isholiday` | Returns 1 if a holiday falls into a period (week, month, etc.). |
| Find non-holiday | `x,nonholiday_day` | Steps a day (week) back until a non-holiday is found. |
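The holiday features can be approximated in Python as shown below. The holiday set is a made-up two-date example for illustration only; the country-specific calendars used by the program are not reproduced here, periods are treated as half-open date ranges, and all function names are hypothetical.

```python
from datetime import date, timedelta

# Hypothetical holiday set, for illustration only.
HOLIDAYS = {date(2014, 1, 1), date(2014, 12, 25)}


def is_day_off(d, holidays=HOLIDAYS):
    """1 for Saturday, Sunday or a holiday, otherwise 0."""
    return 1 if d.isoweekday() >= 6 or d in holidays else 0


def workdays_in_period(start, end, holidays=HOLIDAYS):
    """Number of workdays in the half-open period [start, end)."""
    days = (end - start).days
    return sum(1 - is_day_off(start + timedelta(days=k), holidays)
               for k in range(days))


def is_holiday_in_period(start, end, holidays=HOLIDAYS):
    """1 if any holiday falls into [start, end), otherwise 0."""
    return 1 if any(start <= h < end for h in holidays) else 0


def previous_non_holiday(d, holidays=HOLIDAYS):
    """Step back one day at a time until a non-holiday is found."""
    while d in holidays:
        d -= timedelta(days=1)
    return d


print(workdays_in_period(date(2014, 1, 1), date(2014, 2, 1)))
print(previous_non_holiday(date(2014, 1, 1)))
```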

| Transformation | Notation example | Description |
|---|---|---|
| Weighted by time | `x,weighted_by_time` | Sets higher weights for later observations in proportion: 1, 2, 3, …, n. |
| Balanced classes | `x,balanced_classes` | Changes instance weights so that classes will have equal importance for training regardless of their proportion. |
| Manual 2-class bias | `x,weighted_two_class(0.5)` | Manually weighs the upper class among the two. |
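A rough Python sketch of the first two weighting schemes follows. The time weights come directly from the proportion given in the table; the class-balancing formula (weight inversely proportional to class frequency, so that every class has the same total weight) is a common convention and only an assumption about how the program computes it. The manual two-class bias is left out because its exact semantics are not described.

```python
from collections import Counter


def weighted_by_time(n):
    """Weights 1, 2, 3, ..., n for observations in time order."""
    return list(range(1, n + 1))


def balanced_classes(labels):
    """Per-observation weights giving every class the same total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    # Assumed formula: n / (k * class_count).
    return [n / (k * counts[y]) for y in labels]


labels = ["a", "a", "a", "b"]
weights = balanced_classes(labels)
print(weighted_by_time(4))
print(weights)
```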

| Transformation | Notation example | Description |
|---|---|---|
| Time | `\|time` | Time variable is simply a counter: 0, 1, 2, 3, … . |
| Target | `\|target` | Feeds the target variable to input. This variable is useful when solving multiple time series problems, or when making a template. |
| ID | `\|id` | Feeds ID or date/time variable of the dataset. |
| First column | `\|firstcolumn(a)` | `a` is step; an integer ≥ 4. Feeds the N'th variable counting from the first one. |
| Last column | `\|lastcolumn(a)` | `a` is step; an integer ≥ 4. Feeds the N'th variable counting from the last one. |

| Transformation | Notation example | Description |
|---|---|---|
| Decompose categories | `x,decompose` | Decomposes categorical column to binary columns. |
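Decomposition of a categorical column into binary columns, together with the limit on the most frequent labels and the catch-all '?' column, can be sketched in Python like this. The sketch assumes that `max_labels` counts only the named labels and not the '?' column; how the program counts it is not fully stated here. `decompose` is a hypothetical helper name.

```python
from collections import Counter


def decompose(column, max_labels=None):
    """Turn a categorical column into a dict of binary columns."""
    counts = Counter(column)
    # Most frequent labels first; None keeps every label.
    labels = [label for label, _ in counts.most_common(max_labels)]
    kept = set(labels)
    columns = {
        label: [1 if v == label else 0 for v in column]
        for label in labels
    }
    if len(kept) < len(counts):
        # All remaining labels share one binary column named '?'.
        columns["?"] = [0 if v in kept else 1 for v in column]
    return columns


colors = ["red", "blue", "red", "green", "red", "blue"]
limited = decompose(colors, max_labels=2)
for name, values in limited.items():
    print(name, values)
```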

preprocess.txt · Last modified: 2014/06/10 10:12 by Andriy