temporal_train_test_split

Split time series data containers into a single train/test split.

Creates a single train/test split of endogenous time series y and, optionally, exogenous time series X.

Splits time series y into a single temporally ordered train and test split. The split is based on the test_size and train_size parameters, each of which can be given either as a fraction of the total number of indices or as an absolute number of indices.
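As a rough illustration of how fractional and absolute sizes translate into sample counts, the sketch below applies the rounding rules stated in the test_size and train_size parameter descriptions (ceil for test fractions, floor for train fractions, a 0.25 test fraction if neither is given). The helper resolve_split_sizes is hypothetical and not part of sktime; the library's internal logic may differ in detail.

```python
import math


def resolve_split_sizes(n, test_size=None, train_size=None):
    """Hypothetical helper: turn size arguments into (n_train, n_test) counts."""
    # Documented default: a 0.25 test fraction if neither size is given.
    if test_size is None and train_size is None:
        test_size = 0.25

    def to_count(size, rounding):
        if size is None:
            return None
        # Floats are fractions of n; ints are absolute counts.
        return rounding(size * n) if isinstance(size, float) else size

    n_test = to_count(test_size, math.ceil)     # test fractions round up
    n_train = to_count(train_size, math.floor)  # train fractions round down

    # A missing size is the complement of the other.
    if n_test is None:
        n_test = n - n_train
    if n_train is None:
        n_train = n - n_test
    return n_train, n_test


print(resolve_split_sizes(144))                  # (108, 36)
print(resolve_split_sizes(10, test_size=0.25))   # (7, 3)
print(resolve_split_sizes(10, train_size=0.75))  # (7, 3)
print(resolve_split_sizes(20, test_size=5))      # (15, 5)
```

Note that a test fraction of 0.25 on 10 samples yields 3 test samples rather than 2, because test proportions are rounded up.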

If the data contains multiple time series (Panel or Hierarchical), fractions and train-test sets will be computed per individual time series.
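The per-series behaviour can be pictured with a small sketch in plain Python, assuming a panel is represented as a dict mapping a series name to a list of values. This representation and the helper split_per_series are illustrative assumptions, not sktime data containers or sktime code.

```python
import math


def split_per_series(panel, test_fraction):
    """Hypothetical helper: split each series in a dict of lists separately."""
    train, test = {}, {}
    for name, values in panel.items():
        # The fraction is applied to the length of this series only.
        n_test = math.ceil(test_fraction * len(values))
        cut = len(values) - n_test
        train[name] = values[:cut]
        test[name] = values[cut:]
    return train, test


# Two series of different lengths.
panel = {"a": list(range(10)), "b": list(range(100, 104))}
train, test = split_per_series(panel, test_fraction=0.25)
print(test)   # {'a': [7, 8, 9], 'b': [103]}
print(train)  # {'a': [0, 1, 2, 3, 4, 5, 6], 'b': [100, 101, 102]}
```

Because the fraction is resolved per series, series of different lengths end up with test sets of different sizes.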

If X is provided, a single train/test split of X is also produced, at the same loc indices as y. If non-pandas based containers are used, the iloc index is used instead.
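The idea of splitting X at the same labels as y can be sketched as follows. Dicts keyed by a time label stand in for pandas containers here, and the helper split_X_like_y is hypothetical; it only illustrates label-based (loc-style) selection and is not how sktime implements it.

```python
def split_X_like_y(X, y_train, y_test):
    """Hypothetical helper: select rows of X at the labels of each part of y."""
    X_train = {label: X[label] for label in y_train}
    X_test = {label: X[label] for label in y_test}
    return X_train, X_test


# Dicts keyed by time label stand in for pandas objects with a time index.
y_train = {"2020-01": 1.0, "2020-02": 2.0, "2020-03": 3.0}
y_test = {"2020-04": 4.0}
X = {"2020-01": 10, "2020-02": 20, "2020-03": 30, "2020-04": 40}

X_train, X_test = split_X_like_y(X, y_train, y_test)
print(X_train)  # {'2020-01': 10, '2020-02': 20, '2020-03': 30}
print(X_test)   # {'2020-04': 40}
```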

Parameters:

y : time series in sktime compatible data container format
    endogenous time series
X : time series in sktime compatible data container format, optional, default=None
    exogenous time series
test_size : float, int or None, optional (default=None)
    If float, must be between 0.0 and 1.0, and is interpreted as the proportion of the dataset to include in the test split. Proportions are rounded to the next higher integer count of samples (ceil). If int, is interpreted as total number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
train_size : float, int, or None, (default=None)
    If float, must be between 0.0 and 1.0, and is interpreted as the proportion of the dataset to include in the train split. Proportions are rounded to the next lower integer count of samples (floor). If int, is interpreted as total number of train samples. If None, the value is set to the complement of the test size.
fh : ForecastingHorizon
    A forecast horizon to use for splitting, alternative specification for test set. If given, test_size and train_size cannot also be specified and must be None. If fh is passed, the test set will be:

    * if fh.is_relative: the last possible indices to match fh within y
    * if not fh.is_relative: the indices at the absolute index of fh
anchor : str, "start" (default) or "end"
    Determines behaviour if train and test sizes do not sum up to all data. Used only if fh=None and both test_size and train_size are not None.

    * if "start", cuts train and test set from start of available series
    * if "end", cuts train and test set from end of available series
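The anchor and fh descriptions above can be made concrete with a small sketch over integer positions. Both helpers are hypothetical and reflect one reading of the parameter descriptions: the anchor decides which end of the series the contiguous train and test blocks are aligned to, and a relative fh of positive integer steps is placed so that its largest step lands on the last position of y. The actual sktime behaviour may differ in edge cases.

```python
def anchored_positions(n, n_train, n_test, anchor="start"):
    """Hypothetical helper: train/test positions when sizes do not cover all n."""
    if anchor == "start":
        train_start = 0
    elif anchor == "end":
        train_start = n - n_train - n_test
    else:
        raise ValueError("anchor must be 'start' or 'end'")
    train_end = train_start + n_train
    train = list(range(train_start, train_end))
    test = list(range(train_end, train_end + n_test))
    return train, test


def relative_fh_positions(n, fh):
    """Hypothetical helper: positions for a relative horizon of positive steps."""
    cutoff = n - 1 - max(fh)  # last training position
    train = list(range(cutoff + 1))
    test = [cutoff + step for step in fh]
    return train, test


print(anchored_positions(10, 4, 3, "start"))  # ([0, 1, 2, 3], [4, 5, 6])
print(anchored_positions(10, 4, 3, "end"))    # ([3, 4, 5, 6], [7, 8, 9])
print(relative_fh_positions(10, [1, 3]))      # ([0, 1, 2, 3, 4, 5, 6], [7, 9])
```

With 10 observations and sizes 4 and 3, three observations are left unused: at the end of the series for "start", and at the beginning for "end".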

Returns:

splitting : tuple, length = 2 * len(arrays)
    Tuple containing train-test split of y, and X if given. If X is None, returns (y_train, y_test). Else, returns (y_train, y_test, X_train, X_test).

References

[1] originally adapted from alkaline-ml/pmdarima

Examples

>>> from sktime.datasets import load_airline
>>> from sktime.split import temporal_train_test_split
>>> from sktime.utils._testing.panel import _make_panel
>>> # univariate time series
>>> y = load_airline()
>>> y_train, y_test = temporal_train_test_split(y, test_size=36)
>>> y_test.shape
(36,)
>>> # panel time series
>>> y = _make_panel(n_instances=2, n_timepoints=20)
>>> y_train, y_test = temporal_train_test_split(y, test_size=5)
>>> # last 5 timepoints for each instance
>>> y_test.shape
(10, 1)

The function can also be applied to panel or hierarchical data. In this case, the split is applied per individual time series:

>>> from sktime.utils._testing.hierarchical import _make_hierarchical
>>> y = _make_hierarchical()
>>> y_train, y_test = temporal_train_test_split(y, test_size=0.2)