Preprocess

Preprocess transforms data according to the modeling experiment conditions configured in the project. It runs a script that is either loaded from a template or created visually in the Variables and transformations panel.

Variables and transformations

Variables and transformations is a panel accessible via the program menu View > Variables and transformations. When enabled, the panel is nested in the Data explorer tab. It is used to specify model inputs, targets (predicted columns) and their transformations.

View by datasets / variables is a list of datasets connected to the project. The panel is used to select variables and add them to Input variables and Target variables. If you click the underlined part of the title (variables), the list is rearranged to show variables at the top of the hierarchy, with the datasets they come from as nested nodes. This way you can select and manipulate groups of variables that appear in more than one dataset.

Input variables [n] is a list of input variables and transformations applied to them. 'n' is the number of variables in the list. Each row represents a variable or a set of variables in the following format: VariableName, transformation1, transformation2, …

All and None in the panel title select and deselect all items in the list at once.

The Edit button in the panel title opens the preprocess script in the text editor for manual editing.

Target variables [n] is a list of variables that will be modeled and predicted. 'n' is the number of variables in the list. Each row represents a variable or a set of variables in the following format: VariableName, transformation1, transformation2, …

Transformations is a panel used to access various transformations available for Input variables and Target variables.

Settings is a button in the panel title used to modify Obligatory transformations applied to the whole dataset.

Obligatory transformations

Obligatory transformations is a dialog accessible via the Settings (underlined text button) in the title of the Transformations panel (View > Variables and transformations). It is used to set preprocessing rules applied to all variables.

Text data

Manage text inputs is used to handle input variables that contain categories instead of numeric values. Available options are:
Stop with error | Skip column | Decompose to binary columns

Manage text targets is used to handle target variables that contain categories instead of numeric values. Available options are:
Stop with error | Enumerate with 0,1,2,.. | Decompose to binary columns

Limit decomposition to the most frequent labels limits the number of new variables created as a result of decomposition. Only the most frequent labels are decomposed into separate binary variables; all other labels are merged into a single binary variable named '?'. The number of new variables, including '?', does not exceed the selected limit.
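
The limit can be illustrated with a short Python sketch. It is not GMDH Shell code: the function name `decompose_limited` is hypothetical, and it assumes that '?' is created only when some labels do not fit within the limit.

```python
from collections import Counter

def decompose_limited(labels, limit):
    """Return {column_name: 0/1 list} with at most `limit` columns, '?' included."""
    counts = Counter(labels)
    if len(counts) <= limit:
        # Every label fits, so no '?' column is needed (assumption).
        kept = [label for label, _ in counts.most_common()]
    else:
        # Reserve one column for '?', which collects the remaining labels.
        kept = [label for label, _ in counts.most_common(limit - 1)]
    columns = {label: [int(value == label) for value in labels] for label in kept}
    if len(kept) < len(counts):
        kept_set = set(kept)
        columns["?"] = [int(value not in kept_set) for value in labels]
    return columns

colors = ["red", "red", "red", "blue", "blue", "green", "black"]
print(decompose_limited(colors, 3))
```

With a limit of 3, the two most frequent labels get their own columns and the two rare labels share the '?' column.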

Missing values

Treat value as missing if it is outside n*sigma replaces spikes (i.e. outliers) that exceed a given number of standard deviations. All spikes that exceed n*sigma are replaced with missing values. How missing values are then handled is set with the Manage missing values control.
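
A minimal Python sketch of this rule follows. It assumes that the distance is measured from the column mean and that sigma is the population standard deviation; the manual does not state either detail, and `mask_outliers` is a hypothetical name.

```python
import statistics

def mask_outliers(values, n_sigma):
    """Replace values farther than n_sigma standard deviations from the mean with None."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    sigma = statistics.pstdev(present)  # assumption: population standard deviation
    return [
        None if v is not None and abs(v - mean) > n_sigma * sigma else v
        for v in values
    ]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
print(mask_outliers(readings, 2))
```

Note that a single large spike also inflates sigma, so a smaller n may be needed to catch it.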

Manage missing values is used to handle gaps (NULL values) in the data. Available options are:
Stop with error | 0 | Interpolate | Mean | Median | Most frequent.

The first option throws an error when a NULL value is detected. The other options replace missing values with 0, the mean of the two neighbouring values (interpolation), the arithmetic mean of the current variable, its median, or its most frequent value.
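
The options can be sketched in Python as follows. The method names ("stop", "zero", and so on) are hypothetical labels for the options above, and the behaviour at the edges of a series (using the single available neighbour) is an assumption.

```python
import statistics

def _neighbour_mean(values, i):
    """Mean of the nearest observed values to the left and right of position i."""
    left = next((v for v in reversed(values[:i]) if v is not None), None)
    right = next((v for v in values[i + 1:] if v is not None), None)
    found = [v for v in (left, right) if v is not None]
    return sum(found) / len(found)

def fill_missing(values, method):
    """Fill None entries using one of the options described above."""
    present = [v for v in values if v is not None]
    if len(present) == len(values):
        return list(values)
    if method == "stop":
        raise ValueError("missing value detected")
    if method == "zero":
        return [0 if v is None else v for v in values]
    if not present:
        raise ValueError("no observed values to estimate a replacement from")
    if method == "interpolate":
        return [_neighbour_mean(values, i) if v is None else v
                for i, v in enumerate(values)]
    estimators = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "most_frequent": statistics.mode,
    }
    fill = estimators[method](present)
    return [fill if v is None else v for v in values]

print(fill_missing([1.0, None, 3.0, 3.0, None], "interpolate"))
```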

Multiple files

If time series data is collected from a set of files, it is possible to ensure that all files contain recent observations and that all series are long enough to be preprocessed properly.

For short files that can't be preprocessed properly there are two options: Stop with error and Skip silently. This parameter matters only when a set of files is imported. If a file contains too few observations, you can instruct preprocess to skip the file silently instead of throwing an error.

Non-recent data detection checks whether every series includes the most recent timestamp found in the set of files. There are three options available:
No detection | Stop with error | Skip silently
The last option simply drops non-recent files from the preprocessing results.
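
The check can be pictured with the Python sketch below. The data layout (a dictionary of file name to timestamped rows), the mode names and the function name are all assumptions made for illustration.

```python
from datetime import date

def drop_non_recent(series_by_file, mode="skip"):
    """Keep only files whose last timestamp equals the newest one in the set."""
    last_seen = {
        name: max(stamp for stamp, _ in rows)
        for name, rows in series_by_file.items()
    }
    newest = max(last_seen.values())
    stale = {name for name, stamp in last_seen.items() if stamp < newest}
    if mode == "none" or not stale:
        return dict(series_by_file)  # No detection, or nothing to drop
    if mode == "stop":
        raise ValueError(f"non-recent files: {sorted(stale)}")
    return {name: rows for name, rows in series_by_file.items() if name not in stale}

files = {
    "north.csv": [(date(2024, 1, 1), 5.0), (date(2024, 2, 1), 6.0)],
    "south.csv": [(date(2024, 1, 1), 7.0)],  # ends one month early
}
print(sorted(drop_non_recent(files)))
```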

Preprocess types

GMDH Shell uses different sets of preprocessing features for time series models and for all other models, i.e. classification and regression models. A proper set of features is usually loaded from a template. To change it, open the menu File > Configuration, Modules tab, Preprocess.

Simple forecasting

Simple forecasting is a panel in the program toolbar used to create time series models.

Horizon is the number of observations to be forecasted.

Holdout is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.

Time series forecasting

Time series forecasting is a panel in the program toolbar used to create time series models.

Horizon is the number of observations to be forecasted. GMDH Shell can perform several simulations with different forecast horizons at once. To use this feature, separate the desired forecast horizons with commas, for example 1,6,12.

Holdout is used to withhold a sample of observations and to evaluate forecast accuracy using the out-of-sample data.

Repeat is used to perform ex-post forecasts of several previous horizons at once.

Window validation is the number of experiments used for window size optimization. Optimal window size improves forecast accuracy.

Regression and classification

Hold-out data panel is used to apply regression and classification models to a subset of observations immediately after the models are obtained. The hold-out sample can be either used to evaluate out-of-sample accuracy or to predict unknown values of target variables.

Hold-out selects the subsample to which the model will be applied. Available options are to hold out:
last observations | observations uniformly | missing target values

The next two controls set the exact number of dataset observations or the % of the dataset to be reserved for the hold-out.
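
The three selection modes can be sketched in Python. The names are hypothetical, and "uniformly" is interpreted here as evenly spaced rows, which is an assumption about the program's behaviour.

```python
def holdout_count(n_rows, number=None, percent=None):
    """Size of the hold-out, given either an exact number or a percentage."""
    if number is not None:
        return min(number, n_rows)
    return min(round(n_rows * percent / 100), n_rows)

def holdout_indices(targets, mode, count=0):
    """Row indices reserved for the hold-out sample."""
    n = len(targets)
    if mode == "missing":
        # Rows with an unknown target are the ones to be predicted.
        return [i for i, t in enumerate(targets) if t is None]
    count = max(0, min(count, n))
    if count == 0:
        return []
    if mode == "last":
        return list(range(n - count, n))
    if mode == "uniform":
        step = n / count  # assumption: evenly spaced picks
        return [int(i * step + step / 2) for i in range(count)]
    raise ValueError(f"unknown mode: {mode}")

targets = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0, 9.0, 10.0]
print(holdout_indices(targets, "last", holdout_count(len(targets), percent=30)))
```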

Preprocess script

The script specifies commands that generate input variables, target variables and various transformations. Each preprocess command occupies one line and can refer to a group of variables.

One-line commands have the following format:

FileName.VariableName, Transformation1(), Transformation2(), …

If only one file is connected, the “FileName.” prefix is not required. If a spreadsheet (.xls or .xlsx) contains data in more than one sheet, all variables must be referred to as “FileName:SheetName.VariableName”.

To refer to a group of variables, use the asterisk symbol “*”, for example:

FileName.* is used to select all variables from a file.
*.VariableName is used to select variables with certain name from all files.
*.* is used to select all variables from all files (and all sheets).
* is used to select all variables if there is only one file connected.
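
How the four patterns resolve can be shown with a small Python sketch. It is only an illustration of the rules above, not the program's parser: it ignores quoting and sheet names, and the dataset dictionary and function name are hypothetical.

```python
def select_variables(pattern, datasets):
    """Resolve a pattern to (file, variable) pairs.

    `datasets` maps a file name to the list of its variable names.
    """
    if "." in pattern:
        file_part, var_part = pattern.split(".", 1)
    elif len(datasets) == 1:
        # Without a prefix the pattern refers to the only connected file.
        file_part, var_part = "*", pattern
    else:
        raise ValueError("a file prefix is required when several files are connected")
    return [
        (file_name, variable)
        for file_name, variables in datasets.items()
        if file_part in ("*", file_name)
        for variable in variables
        if var_part in ("*", variable)
    ]

datasets = {"sales": ["price", "qty"], "stock": ["qty"]}
print(select_variables("*.qty", datasets))
```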

If a file name or a variable name contains spaces or reserved symbols, each of the names must be quoted, for example: "File Name"."Variable Name". If a name contains a quote, the quote must be doubled, for example the name a"b"c is written as "a""b""c".
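
The quoting rule can be expressed as a short Python helper. The set of reserved symbols is not listed in this manual, so the sketch assumes that any character other than a letter, digit or underscore requires quoting; the function names are hypothetical.

```python
import re

def needs_quotes(name):
    """Assumption: anything other than letters, digits and '_' requires quoting."""
    return re.fullmatch(r"\w+", name) is None

def quote_name(name):
    """Quote a name if needed, doubling any embedded quote characters."""
    if not needs_quotes(name):
        return name
    return '"' + name.replace('"', '""') + '"'

def qualified_name(file_name, variable_name):
    """Build a FileName.VariableName reference with each part quoted separately."""
    return quote_name(file_name) + "." + quote_name(variable_name)

print(qualified_name("File Name", "Variable Name"))
print(quote_name('a"b"c'))
```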

A semicolon “;” starts a one-line comment.

Transformations

Here is the list of available transformations.

Elementary functions
Transformation   Notation example  Description
Square           x, sqr            y=x^2
Square root      x, sqrt           y=x^(1/2)
Cube             x, cube           y=x^3
Cube root        x, cubert         y=x^(1/3)
Exp              x, exp            y=exp(x)
Logarithm        x, ln             y=ln(|x|), x<>0
Sine             x, sin            y=sin(x)
Cosine           x, cos            y=cos(x)
Arctangent       x, arctang        y=arctang(x)
Abs value        x, abs            y=|x|
Sign             x, sign           y=sign(x)
Floor            x, floor          y=[x]
Fractional part  x, frac           y=x-[x]
Normalization    x, norm(b1,b2)    b1 is lower boundary, b2 is upper boundary
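
A few of the less obvious formulas can be checked with a Python sketch. The logarithm and fractional part follow the formulas given for those transformations; the normalization is assumed to be a linear min-max rescaling into [b1, b2], which the manual does not spell out.

```python
import math

def ln_abs(x):
    """y = ln(|x|); undefined for x = 0."""
    return math.log(abs(x))

def frac(x):
    """y = x - [x], where [x] is the floor of x."""
    return x - math.floor(x)

def normalize(values, b1, b2):
    """Assumption: linear rescaling so that min maps to b1 and max maps to b2."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [b1 for _ in values]  # constant column: map everything to b1
    return [b1 + (v - lo) * (b2 - b1) / (hi - lo) for v in values]

print(frac(2.75), frac(-1.25))
print(normalize([2, 4, 6], 0, 1))
```

Because [x] is the floor, the fractional part of a negative number is still non-negative.
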
Time series
Transformation    Notation example    Description
Lag               x@a-b:c             a is min lag, b is max lag, c is step. Example: step of 3 applied to 0-12 leaves only 0, 3, 6, 9 and 12. This helps to reduce the number of variables.
All lags          x@*:a               a is step. Generates lags while dataset length allows. Applies to inputs only.
Moving average    x,SMA(a)            a is window length; an integer 2..1000
Exponential MA    x,EMA(a)            a is quotient; a real [0.01, 1)
Derivative        x,d                 y = x[t] - x[t-1]
Window size       x,window(a)         a is window size; an integer ≥ 2. Applies to targets only.
Weighted by time  x,weighted_by_time  Sets higher weights for later observations in proportion: 1, 2, 3, …, n
Fourier series    x,fourier(a)        a is period; an integer ≥ 2. Fourier series: cos and sin(2πkx/T). Generates multiple variables, does not stack with other generating transformations
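
The Python sketch below illustrates the lag, moving average, exponential MA and derivative rows. Several details are assumptions rather than documented behaviour: the moving average is taken over the trailing window, positions without enough history are left as None, and the exponential MA uses the common recursion y[t] = a*x[t] + (1-a)*y[t-1].

```python
def lag_offsets(a, b, c=1):
    """Lags from a to b inclusive with step c, as in x@a-b:c."""
    return list(range(a, b + 1, c))

def lag(values, k):
    """Shift a series k observations into the past; unknown positions are None."""
    n = len(values)
    return [None] * min(k, n) + list(values[:max(n - k, 0)])

def sma(values, a):
    """Trailing simple moving average with window length a (assumption)."""
    return [
        None if i + 1 < a else sum(values[i + 1 - a:i + 1]) / a
        for i in range(len(values))
    ]

def ema(values, a):
    """Exponential MA, assuming y[t] = a*x[t] + (1-a)*y[t-1] and y[0] = x[0]."""
    out = []
    for x in values:
        out.append(x if not out else a * x + (1 - a) * out[-1])
    return out

def derivative(values):
    """y[t] = x[t] - x[t-1]; the first position has no predecessor."""
    return [None] + [values[t] - values[t - 1] for t in range(1, len(values))]

print(lag_offsets(0, 12, 3))
```
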
Date/time
Transformation  Notation example       Description
Year            x,year                 Year number in Gregorian calendar.
Year fraction   x,year_frac            Year fraction in Gregorian calendar. 0 = Jan 1st 00:00, 1 = Dec 31st 24:00.
Month           x,month,decompose      Month number in Gregorian calendar. 1 = Jan, 12 = Dec.
Month fraction  x,month_frac           Month fraction in Gregorian calendar. 0 = 1st 00:00, 1 = 31st 24:00.
Day of month    x,day,decompose        Day of month in Gregorian calendar (1-31).
Day of week     x,dayofweek,decompose  Day of week, 1 = Mon, 7 = Sun.
Day fraction    x,day_frac             Fraction of the day. 0 = 00:00, 1 = 24:00.
Hour            x,hour,decompose       Hour (0..23).
Hour fraction   x,hour_frac            Fraction of the hour. 0 = ##:00:00, 1 = ##:60:00.
Minute          x,minute,decompose     Minute (0..59).
Second          x,second,decompose     Second (0..59).
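
The fraction variables can be reproduced with Python's standard `datetime` module. This is a sketch of the definitions in the table, assuming timezone-naive timestamps and that each fraction is elapsed time divided by the length of the enclosing period; the function names are hypothetical.

```python
from datetime import datetime, timedelta

def year_frac(ts):
    """0 at Jan 1st 00:00, approaching 1 at the end of Dec 31st."""
    start = datetime(ts.year, 1, 1)
    end = datetime(ts.year + 1, 1, 1)
    return (ts - start) / (end - start)

def month_frac(ts):
    """0 at the 1st 00:00, approaching 1 at the end of the last day of the month."""
    start = datetime(ts.year, ts.month, 1)
    end = (start + timedelta(days=32)).replace(day=1)  # first day of next month
    return (ts - start) / (end - start)

def day_frac(ts):
    """0 at 00:00, approaching 1 at 24:00."""
    midnight = datetime(ts.year, ts.month, ts.day)
    return (ts - midnight) / timedelta(days=1)

def day_of_week(ts):
    """1 = Mon, 7 = Sun."""
    return ts.isoweekday()

stamp = datetime(2023, 7, 2, 12, 0)
print(year_frac(stamp), day_frac(stamp), day_of_week(stamp))
```
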
Calendar
Transformation       Notation example     Description
Day off              x,isdayoff           Returns 1 when the date is a Saturday, a Sunday or a holiday of a specific country. Otherwise returns 0.
Workdays per period  x,workdaysperperiod  Returns the number of workdays (Mo-Fr, except holidays) in a period (week, month, etc.).
Decomposed holidays  x,holidays           For each holiday in the holiday set, returns 1 if it falls into a period (week, month, etc.). Produces as many variables as holidays in the year.
Is holiday           x,isholiday          Returns 1 if a holiday falls into a period (week, month, etc.).
Find non-holiday     x,nonholiday_day     Steps a day (week) back until a non-holiday is found.
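
The calendar variables can be approximated in Python as shown below. The holiday set here is a made-up two-entry example, not the country-specific calendar used by the program, and the function names are hypothetical; the non-holiday search steps back one day at a time only.

```python
from datetime import date, timedelta

HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}  # hypothetical holiday set

def is_day_off(day, holidays=HOLIDAYS):
    """1 for Saturday, Sunday or a holiday, otherwise 0."""
    return int(day.weekday() >= 5 or day in holidays)

def workdays_per_period(start, end, holidays=HOLIDAYS):
    """Number of Mon-Fri non-holiday days between start and end, inclusive."""
    days = (end - start).days + 1
    return sum(1 - is_day_off(start + timedelta(days=i), holidays) for i in range(days))

def is_holiday(start, end, holidays=HOLIDAYS):
    """1 if any holiday falls into the period, otherwise 0."""
    return int(any(start <= h <= end for h in holidays))

def previous_non_holiday(day, holidays=HOLIDAYS):
    """Step back one day at a time until the date is not a holiday."""
    while day in holidays:
        day -= timedelta(days=1)
    return day

print(workdays_per_period(date(2024, 1, 1), date(2024, 1, 7)))
```
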
Weighted instances
Transformation       Notation example           Description
Weighted by time     x,weighted_by_time         Sets higher weights for later observations in proportion: 1, 2, 3, …, n.
Balanced classes     x,balanced_classes         Changes instance weights so that classes will have equal importance for training regardless of their proportion.
Manual 2-class bias  x,weighted_two_class(0.5)  Manually weighs the upper class among the two.
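
The first two weighting schemes can be sketched in Python. The time weights follow the stated proportion 1, 2, 3, …, n. The balanced weights use the common formula n / (k * class count), which gives every class the same total weight; whether GMDH Shell scales the weights in exactly this way is an assumption.

```python
from collections import Counter

def weights_by_time(n):
    """Weights 1, 2, ..., n so that later observations count more."""
    return list(range(1, n + 1))

def balanced_class_weights(labels):
    """Per-instance weights giving each class the same total weight (assumed formula)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[label]) for label in labels]

labels = ["a", "a", "a", "b"]
print(weights_by_time(len(labels)))
print(balanced_class_weights(labels))
```
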
Special variables
Transformation  Notation example  Description
Time            |time             Time variable is simply a counter: 0, 1, 2, 3, … .
Target          |target           Feeds target variable to input. This variable is useful when solving multiple time series problems, or when making a template.
ID              |id               Feeds ID or date/time variable of the dataset.
First column    |firstcolumn(a)   a is step; an integer ≥ 4. Feeds the N'th variable counting from the first one.
Last column     |lastcolumn(a)    a is step; an integer ≥ 4. Feeds the N'th variable counting from the last one.
Transformation        Notation example  Description
Decompose categories  x,decompose       Decomposes categorical column to binary columns.
CC Attribution-Noncommercial 3.0 Unported