ClassificationBenchmark

class ClassificationBenchmark(id_format: str | None = None, backend=None, backend_params=None, return_data=False)

Classification benchmark.

Run a series of classifiers against a series of tasks, each defined by a dataset loader, a cross-validation splitting strategy, and performance metrics. Results are returned as a pandas DataFrame and, when an output file is given, also saved to file.

Parameters:

id_format: str, optional (default=None)

A regex that task and estimator IDs must match.
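As a rough illustration of what such a format check could look like, the sketch below validates IDs with Python's `re` module. The helper name `check_id`, the full-match semantics, and the example pattern are assumptions made for this example; the benchmark's own validation may differ.

```python
import re
from typing import Optional


def check_id(identifier: str, id_format: Optional[str] = None) -> str:
    """Return the ID unchanged if it matches id_format, else raise ValueError.

    Hypothetical helper: the name and the use of re.fullmatch are assumptions.
    """
    if id_format is not None and re.fullmatch(id_format, identifier) is None:
        raise ValueError(f"ID {identifier!r} does not match format {id_format!r}")
    return identifier


# Example format (an assumption): "<name>-v<number>", such as "knn-v1".
example_format = r"[A-Za-z_]+-v\d+"
print(check_id("knn-v1", example_format))
```

With `id_format=None`, the sketch accepts every ID, which matches the documented default of no format enforcement.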

backend: string, by default “None”.

Parallelization backend to use for runs.

“None”: executes loop sequentially, simple list comprehension
“loky”, “multiprocessing” and “threading”: uses joblib.Parallel loops
“joblib”: custom and 3rd party joblib backends, e.g., spark
“dask”: uses dask, requires dask package in environment
“dask_lazy”: same as “dask”, but changes the return to a (lazy) dask.dataframe.DataFrame.
“ray”: uses ray, requires ray package in environment

Recommendation: use “dask” or “loky” for parallel evaluation. “threading” is unlikely to yield speed-ups because of the GIL, and the serialization backend (cloudpickle) used by “dask” and “loky” is generally more robust than the standard pickle library used by “multiprocessing”.

backend_params: dict, optional

Additional parameters passed to the backend as config. Directly passed to utils.parallel.parallelize. Valid keys depend on the value of backend:

“None”: no additional parameters, backend_params is ignored.
“loky”, “multiprocessing” and “threading”: default joblib backends. Any valid keys for joblib.Parallel can be passed here, e.g., n_jobs, with the exception of backend, which is directly controlled by backend. If n_jobs is not passed, it defaults to -1; other parameters default to joblib defaults.
“joblib”: custom and 3rd party joblib backends, e.g., spark. Any valid keys for joblib.Parallel can be passed here, e.g., n_jobs; backend must be passed as a key of backend_params in this case. If n_jobs is not passed, it defaults to -1; other parameters default to joblib defaults.
“dask”: any valid keys for dask.compute can be passed, e.g., scheduler.

“ray”: The following keys can be passed:
- “ray_remote_args”: dictionary of valid keys for ray.init
- “shutdown_ray”: bool, default=True; False prevents ray from shutting
  down after parallelization.
- “logger_name”: str, default=“ray”; name of the logger to use.
- “mute_warnings”: bool, default=False; if True, suppresses warnings
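The defaulting rules for the joblib-based backends can be sketched as a small helper. This is an illustration of the rules as documented here, not the library's code: the function name `resolve_joblib_params` is hypothetical, and letting the top-level `backend` argument overwrite any `backend` key is an assumption. The dask and ray dictionaries use only keys named in the text; the values are examples.

```python
from typing import Optional

# Backends that map directly onto joblib's built-in backends.
JOBLIB_NATIVE = ("loky", "multiprocessing", "threading")


def resolve_joblib_params(backend: str, backend_params: Optional[dict] = None) -> dict:
    """Hypothetical helper mirroring the documented rules for joblib-based backends."""
    params = dict(backend_params or {})
    if backend in JOBLIB_NATIVE:
        # Assumption: the top-level `backend` argument takes precedence.
        params["backend"] = backend
    elif backend == "joblib":
        # Custom joblib backends must be named inside backend_params.
        if "backend" not in params:
            raise ValueError('backend="joblib" requires a "backend" key in backend_params')
    else:
        raise ValueError(f"{backend!r} is not a joblib-based backend")
    # Documented default: n_jobs falls back to -1 when not passed.
    params.setdefault("n_jobs", -1)
    return params


# Example backend_params values for the different backends.
loky_params = resolve_joblib_params("loky", {"n_jobs": 2})
default_params = resolve_joblib_params("threading")
dask_params = {"scheduler": "threads"}
ray_params = {"shutdown_ray": False, "mute_warnings": True}
print(loky_params, default_params)
```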

return_data: bool, optional (default=False)

Whether to return the prediction and the ground truth data in the results.

Attributes:

failed_experiments: Failed task-estimator pairs from the most recent benchmark run.

Methods

`add`(*args)	Add estimators, task components, full task tuples, or catalogues.
`add_estimator`(estimator[, estimator_id])	Register an estimator to the benchmark.
`add_task`(dataset_loader, cv_splitter, scorers)	Register a classification task to the benchmark.
`register_stored_tasks`()	Register stored tasks from datasets, metrics, and CV splitters.
`run`([output_file, force_rerun])	Run the benchmarking for all tasks and estimators.

add_task(dataset_loader: Callable | tuple, cv_splitter: Any, scorers: list, task_id: str | None = None, error_score: str = 'raise')

Register a classification task to the benchmark.

Parameters:

dataset_loader: Union[Callable, tuple]

Can be:
- a function which returns a dataset, like from sktime.datasets.
- a tuple containing two data containers that are sktime compatible.
- a single data container that is sktime compatible (only endogenous data).

cv_splitter: BaseSplitter object

Splitter used for generating validation folds.

scorers: list of BaseMetric objects

Each BaseMetric output will be included in the results.

task_id: str, optional (default=None)

Identifier for the benchmark task. If none is given, the dataset loader name combined with the cv_splitter class name is used.

error_score: “raise” or numeric, default=“raise”

Value to assign to the score if an exception occurs in estimator fitting. If set to “raise”, the exception is raised. If a numeric value is given, a FitFailedWarning is raised.
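The two error_score behaviours can be illustrated with a minimal sketch. The wrapper `score_or_fallback` is hypothetical, and it emits a plain `RuntimeWarning` as a stand-in for FitFailedWarning so that the example needs only the standard library.

```python
import math
import warnings


def score_or_fallback(fit_and_score, error_score="raise"):
    """Call fit_and_score(); on failure, re-raise or return error_score with a warning.

    Hypothetical wrapper; RuntimeWarning stands in for FitFailedWarning.
    """
    try:
        return fit_and_score()
    except Exception as exc:
        if error_score == "raise":
            raise
        warnings.warn(f"Fit failed, score set to {error_score}: {exc!r}", RuntimeWarning)
        return error_score


def failing_fit():
    # Toy stand-in for an estimator whose fit step fails.
    raise ValueError("bad fit")


# A numeric error_score turns the failure into a warning plus a placeholder score.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    fallback = score_or_fallback(failing_fit, error_score=float("nan"))
print(math.isnan(fallback), len(caught))
```

A numeric placeholder lets a long benchmark continue past a single failing estimator, while “raise” surfaces the problem immediately.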

Returns:

A dictionary of benchmark results for that classifier

add(*args)

Add estimators, task components, full task tuples, or catalogues.

Objects are interpreted based on their scitype and added to the benchmark accordingly. Multiple objects can be provided in a single call.

Supported inputs include estimators, datasets, metrics, CV splitters, task tuples, and catalogues.

Parameters:

*args: object

Objects to add. Supported patterns are:

estimator
  Estimator with scitype “classifier” or “forecaster”.
dict
  Dictionary of estimators where keys are custom `estimator_id`s and values are the estimators.
list
  List of estimators. `estimator_id`s are generated automatically using the estimator’s class name.
dataset
  Object with scitype “dataset_classification” or “dataset_forecasting”.
metric
  Object with scitype “metric_forecasting”, “metric_tabular”, or “metric_proba_tabular”.
cv_splitter
  Object with scitype “splitter” or “splitter_tabular”.
(dataset, metric, splitter)
  Tuple specifying a full task. Must contain exactly one dataset, one metric, and one splitter.
catalogue
  Instance of BaseCatalogue. All contained objects are added recursively.

Raises:

TypeError

If:

a tuple has unsupported length (e.g., not length 3 for task tuples)
a task tuple does not contain exactly one dataset, metric, and splitter
duplicate scitypes are present in a task tuple
an object has an unrecognized scitype

Notes

Task tuples are order-invariant; roles are inferred via scitype.
Duplicate datasets, metrics, and splitters are ignored.
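Order-invariant role inference for task tuples, together with the documented TypeError cases, can be sketched as follows. The function `infer_task_roles` and the `role_of` callable are hypothetical stand-ins for a scitype lookup; plain strings play the part of the dataset, metric, and splitter objects.

```python
ROLES = ("dataset", "metric", "splitter")


def infer_task_roles(task_tuple, role_of):
    """Map each role to its object, regardless of the order inside the tuple.

    role_of is a hypothetical callable returning "dataset", "metric", or
    "splitter" for an object (a stand-in for a scitype lookup).
    """
    if len(task_tuple) != 3:
        raise TypeError(f"task tuple must have length 3, got {len(task_tuple)}")
    roles = {}
    for obj in task_tuple:
        role = role_of(obj)
        if role not in ROLES:
            raise TypeError(f"unrecognized role {role!r} for {obj!r}")
        if role in roles:
            raise TypeError(f"duplicate {role} in task tuple")
        roles[role] = obj
    return roles


# Toy lookup table: strings stand in for real objects.
toy_roles = {"ArrowHead": "dataset", "accuracy": "metric", "KFold": "splitter"}
first = infer_task_roles(("accuracy", "ArrowHead", "KFold"), toy_roles.get)
second = infer_task_roles(("ArrowHead", "accuracy", "KFold"), toy_roles.get)
print(first == second)
```

Because a length-3 tuple with no duplicate and no unrecognized role must contain exactly one object per role, the three checks together cover the "exactly one dataset, metric, and splitter" requirement.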

Examples

>>> benchmark = ClassificationBenchmark()

Add an estimator:

>>> benchmark.add(DummyClassifier())

Add components individually:

>>> benchmark.add(ArrowHead())
>>> benchmark.add(accuracy_score)
>>> benchmark.add(KFold(n_splits=3))

Add a task tuple (order does not matter):

>>> benchmark.add((accuracy_score, ArrowHead(), KFold(n_splits=3)))

Add a dictionary of estimators with custom IDs:

>>> benchmark.add(
...     {
...         "dummy": DummyClassifier(),
...         "knn": KNeighborsClassifier(),
...     }
... )

Add a list of estimators (IDs generated automatically):

>>> benchmark.add([DummyClassifier(), KNeighborsClassifier()])

Add multiple objects:

>>> benchmark.add(
...     {"dummy_1": DummyClassifier()},
...     (ArrowHead(), accuracy_score, KFold(n_splits=3)),
... )

add_estimator(estimator: BaseEstimator, estimator_id: str | None = None)

Register an estimator to the benchmark.

Parameters:

estimator: dict, list or BaseEstimator object

Estimator to add to the benchmark.

If BaseEstimator, a single estimator. estimator_id is generated from the estimator’s class name if not provided.
If dict, keys are the `estimator_id`s used as custom identifiers and values are estimators.
If list, each element is an estimator. `estimator_id`s are generated automatically using the estimator’s class name.

estimator_id: str, optional (default=None)

Identifier for the estimator. If none is given, the estimator’s class name is used.
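The three accepted input shapes and the class-name fallback can be sketched as a normalization step. The function `normalize_estimators` and the toy classes are hypothetical; the sketch does not address how repeated class names within a list are disambiguated.

```python
def normalize_estimators(estimator, estimator_id=None):
    """Return a list of (estimator_id, estimator) pairs for any accepted input shape.

    Hypothetical helper illustrating the documented ID rules.
    """
    if isinstance(estimator, dict):
        # Keys are the user-chosen IDs.
        return list(estimator.items())
    if isinstance(estimator, list):
        # IDs are derived from each element's class name.
        return [(type(est).__name__, est) for est in estimator]
    # Single estimator: explicit ID if given, else the class name.
    return [(estimator_id or type(estimator).__name__, estimator)]


class ToyDummy:
    """Toy stand-in for an estimator class."""


class ToyKNN:
    """Second toy stand-in for an estimator class."""


single = normalize_estimators(ToyDummy())
named = normalize_estimators(ToyDummy(), estimator_id="baseline")
from_list = normalize_estimators([ToyDummy(), ToyKNN()])
print([eid for eid, _ in single + named + from_list])
```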

property failed_experiments: list[FailedExperimentRecord]

Failed task-estimator pairs from the most recent benchmark run.

register_stored_tasks()

Register stored tasks from datasets, metrics, and CV splitters.

run(output_file: str = None, force_rerun: str | list[str] = 'none')

Run the benchmarking for all tasks and estimators.

When output_file is given, results are written to that file on completion and checkpointed incrementally during the run. The file extension (.json, .csv, or .parquet) selects the final storage format; see get_storage_backend.
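Extension-based format selection can be pictured with the sketch below. The helper `storage_format_for` is hypothetical and is not the `get_storage_backend` function referred to above; it only shows how a file suffix could map onto the three listed formats.

```python
from pathlib import Path

# The three extensions named in the text, mapped to a format label.
SUPPORTED = {".json": "json", ".csv": "csv", ".parquet": "parquet"}


def storage_format_for(output_file):
    """Pick a storage format label from the file extension (hypothetical helper)."""
    suffix = Path(output_file).suffix
    if suffix not in SUPPORTED:
        raise ValueError(
            f"unsupported extension {suffix!r}; expected one of {sorted(SUPPORTED)}"
        )
    return SUPPORTED[suffix]


print(storage_format_for("results.csv"))
```

A path without a suffix, such as a directory name, is rejected by this sketch, in line with the requirement that output_file refer to a file.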

Parameters:

output_file: str or None, optional (default=None)

Path to the benchmark results file (e.g. "results.csv"). Must refer to a file, not a directory. When None, results are returned as a DataFrame only and are not saved to disk.

force_rerun: str or list of str, optional (default=“none”)

Controls re-execution of experiments that already have saved results:

"none" — skip task-model pairs with existing results.
"all" — rerun every task-model pair.
list of str — rerun only pairs whose model_id is in the list; other existing results are skipped.
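The three force_rerun modes amount to a filter over task-model pairs. The sketch below is an illustration under assumptions: the function `pairs_to_run` and the representation of a pair as a `(task_id, model_id)` tuple are hypothetical, not taken from the library.

```python
def pairs_to_run(pairs, existing, force_rerun="none"):
    """Select the (task_id, model_id) pairs that should be executed.

    pairs: all registered (task_id, model_id) pairs.
    existing: set of pairs that already have saved results.
    Hypothetical helper illustrating the documented modes.
    """
    if force_rerun == "all":
        return list(pairs)
    if force_rerun == "none":
        rerun_ids = set()
    elif isinstance(force_rerun, str):
        raise ValueError(f"unknown force_rerun value: {force_rerun!r}")
    else:
        rerun_ids = set(force_rerun)
    # Run pairs without saved results, plus those whose model_id is listed.
    return [p for p in pairs if p not in existing or p[1] in rerun_ids]


all_pairs = [("t1", "knn"), ("t1", "dummy"), ("t2", "knn")]
saved = {("t1", "knn"), ("t1", "dummy")}
print(pairs_to_run(all_pairs, saved, "none"))
print(pairs_to_run(all_pairs, saved, ["knn"]))
```

In every mode, pairs without saved results are run; the modes differ only in how pairs with existing results are treated.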

Returns:

pandas.DataFrame: Summary of benchmark run for all completed experiments.