preprocessing

Defines the preprocessing configurations that control how the individual ensemble members are built.

ClassifierEnsembleConfig dataclass

Bases: EnsembleConfig

Configuration for a classifier ensemble member.

See EnsembleConfig for more details.

EnsembleConfig dataclass

Configuration for an ensemble member.

Attributes:

Name Type Description
feature_shift_count int

How much to shift the feature columns.

class_permutation int

Permutation to apply to classes.

preprocess_config PreprocessorConfig

Preprocessor configuration to use.

subsample_ix NDArray[int64] | None

Indices of samples to use for this ensemble member. If None, no subsampling is done.
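
To make the idea of a feature shift concrete, here is a minimal sketch of how a shift count could reorder feature columns under the "rotate" and "shuffle" options that appear in the generator methods. The helper name shifted_column_order and its exact behaviour are assumptions for illustration only, not the library's implementation.

```python
import numpy as np


def shifted_column_order(n_features, shift, method, rng):
    """Hypothetical helper: return a column order for one ensemble member."""
    columns = np.arange(n_features)
    if method is None:
        # No shifting: keep the original column order.
        return columns
    if method == "rotate":
        # Cyclic shift: old column i ends up at position (i + shift) % n_features.
        return np.roll(columns, shift)
    if method == "shuffle":
        # Random reordering of all columns.
        return rng.permutation(n_features)
    raise ValueError(f"Unknown method: {method!r}")


X = np.arange(12).reshape(3, 4)
rng = np.random.default_rng(0)

rotate_order = shifted_column_order(4, 1, "rotate", rng)
X_rotated = X[:, rotate_order]  # columns reordered, rows untouched

shuffle_order = shifted_column_order(4, 1, "shuffle", rng)
```

Giving each ensemble member a different column order is one way to expose a model to varied views of the same data; whether the library applies the shift exactly like this is not stated here.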

generate_for_classification classmethod

generate_for_classification(
    *,
    n: int,
    subsample_size: int | float | None,
    max_index: int,
    add_fingerprint_feature: bool,
    polynomial_features: Literal["no", "all"] | int,
    feature_shift_decoder: (
        Literal["shuffle", "rotate"] | None
    ),
    preprocessor_configs: Sequence[PreprocessorConfig],
    class_shift_method: Literal["rotate", "shuffle"] | None,
    n_classes: int,
    random_state: int | Generator | None
) -> list[ClassifierEnsembleConfig]

Generate ensemble configurations for classification.

Parameters:

Name Type Description Default
n int

Number of ensemble configurations to generate.

required
subsample_size int | float | None

Number of samples to subsample. If int, subsample that many samples. If float, subsample that fraction of samples. If None, no subsampling is done.

required
max_index int

Maximum index to generate for.

required
add_fingerprint_feature bool

Whether to add fingerprint features.

required
polynomial_features Literal['no', 'all'] | int

Maximum number of polynomial features to add, if any.

required
feature_shift_decoder Literal['shuffle', 'rotate'] | None

How to shift features.

required
preprocessor_configs Sequence[PreprocessorConfig]

Preprocessor configurations to use on the data.

required
class_shift_method Literal['rotate', 'shuffle'] | None

How to shift classes for class permutation.

required
n_classes int

Number of classes.

required
random_state int | Generator | None

Random number generator.

required

Returns:

Type Description
list[ClassifierEnsembleConfig]

List of ensemble configurations.

generate_for_regression classmethod

generate_for_regression(
    *,
    n: int,
    subsample_size: int | float | None,
    max_index: int,
    add_fingerprint_feature: bool,
    polynomial_features: Literal["no", "all"] | int,
    feature_shift_decoder: (
        Literal["shuffle", "rotate"] | None
    ),
    preprocessor_configs: Sequence[PreprocessorConfig],
    target_transforms: Sequence[
        TransformerMixin | Pipeline | None
    ],
    random_state: int | Generator | None
) -> list[RegressorEnsembleConfig]

Generate ensemble configurations for regression.

Parameters:

Name Type Description Default
n int

Number of ensemble configurations to generate.

required
subsample_size int | float | None

Number of samples to subsample. If int, subsample that many samples. If float, subsample that fraction of samples. If None, no subsampling is done.

required
max_index int

Maximum index to generate for.

required
add_fingerprint_feature bool

Whether to add fingerprint features.

required
polynomial_features Literal['no', 'all'] | int

Maximum number of polynomial features to add, if any.

required
feature_shift_decoder Literal['shuffle', 'rotate'] | None

How to shift features.

required
preprocessor_configs Sequence[PreprocessorConfig]

Preprocessor configurations to use on the data.

required
target_transforms Sequence[TransformerMixin | Pipeline | None]

Target transformations to apply.

required
random_state int | Generator | None

Random number generator.

required

Returns:

Type Description
list[RegressorEnsembleConfig]

List of ensemble configurations.

to_pipeline

to_pipeline(
    *, random_state: int | Generator | None
) -> SequentialFeatureTransformer

Convert the ensemble configuration to a preprocessing pipeline.
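
Several functions on this page accept random_state as int | Generator | None. A minimal sketch of how such a value can be normalised into a NumPy Generator is shown below. The helper as_generator is hypothetical, and the library may resolve random_state differently.

```python
import numpy as np


def as_generator(random_state):
    """Hypothetical helper: turn int | Generator | None into a Generator."""
    # np.random.default_rng accepts None (fresh entropy), an int seed,
    # or an existing Generator, which it returns unchanged.
    return np.random.default_rng(random_state)


# The same integer seed gives the same stream of numbers.
draws_a = as_generator(42).integers(0, 1000, size=5)
draws_b = as_generator(42).integers(0, 1000, size=5)

# An existing Generator is passed through, so its state is shared.
shared = np.random.default_rng(7)
passed_through = as_generator(shared)
```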

PreprocessorConfig dataclass

Configuration for data preprocessors.

Attributes:

Name Type Description
name Literal['per_feature', 'power', 'safepower', 'power_box', 'safepower_box', 'quantile_uni_coarse', 'quantile_norm_coarse', 'quantile_uni', 'quantile_norm', 'quantile_uni_fine', 'quantile_norm_fine', 'robust', 'kdi', 'none', 'kdi_random_alpha', 'kdi_uni', 'kdi_random_alpha_uni', 'adaptive', 'norm_and_kdi', 'kdi_alpha_0.3_uni', 'kdi_alpha_0.5_uni', 'kdi_alpha_0.8_uni', 'kdi_alpha_1.0_uni', 'kdi_alpha_1.2_uni', 'kdi_alpha_1.5_uni', 'kdi_alpha_2.0_uni', 'kdi_alpha_3.0_uni', 'kdi_alpha_5.0_uni', 'kdi_alpha_0.3', 'kdi_alpha_0.5', 'kdi_alpha_0.8', 'kdi_alpha_1.0', 'kdi_alpha_1.2', 'kdi_alpha_1.5', 'kdi_alpha_2.0', 'kdi_alpha_3.0', 'kdi_alpha_5.0']

Name of the preprocessor.

categorical_name Literal['none', 'numeric', 'onehot', 'ordinal', 'ordinal_shuffled', 'ordinal_very_common_categories_shuffled']

Name of the categorical encoding method. Options: "none", "numeric", "onehot", "ordinal", "ordinal_shuffled", "ordinal_very_common_categories_shuffled".

append_original bool

Whether to append original features to the transformed features.

subsample_features float

Fraction of features to subsample. -1 means no subsampling.

global_transformer_name str | None

Name of the global transformer to use.

RegressorEnsembleConfig dataclass

Bases: EnsembleConfig

Configuration for a regression ensemble member.

See EnsembleConfig for more details.

balance

balance(x: Iterable[T], n: int) -> list[T]

Take a list of elements and make a new list where each appears n times.
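
A minimal sketch of the behaviour described above, written with the standard library. The name balance_sketch is hypothetical, and the grouped output order (all copies of one element before the next) is an assumption, since the description only says that each element appears n times.

```python
from itertools import chain, repeat
from typing import Iterable, TypeVar

T = TypeVar("T")


def balance_sketch(x: Iterable[T], n: int) -> list[T]:
    """Repeat every element of x exactly n times (grouped order assumed)."""
    return list(chain.from_iterable(repeat(item, n) for item in x))


balanced = balance_sketch(["a", "b"], 3)
```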

default_classifier_preprocessor_configs

default_classifier_preprocessor_configs() -> (
    list[PreprocessorConfig]
)

Default preprocessor configurations for classification.

default_regressor_preprocessor_configs

default_regressor_preprocessor_configs() -> (
    list[PreprocessorConfig]
)

Default preprocessor configurations for regression.

fit_preprocessing

fit_preprocessing(
    configs: Sequence[EnsembleConfig],
    X_train: ndarray,
    y_train: ndarray,
    *,
    random_state: int | Generator | None,
    cat_ix: list[int],
    n_workers: int,
    parallel_mode: Literal["block", "as-ready", "in-order"]
) -> Iterator[
    tuple[
        EnsembleConfig,
        SequentialFeatureTransformer,
        ndarray,
        ndarray,
        list[int],
    ]
]

Fit preprocessing pipelines in parallel.

Parameters:

Name Type Description Default
configs Sequence[EnsembleConfig]

List of ensemble configurations.

required
X_train ndarray

Training data.

required
y_train ndarray

Training target.

required
random_state int | Generator | None

Random number generator.

required
cat_ix list[int]

Indices of categorical features.

required
n_workers int

Number of workers to use.

required
parallel_mode Literal['block', 'as-ready', 'in-order']

Parallel mode to use.

  • "block": Blocks until all workers are done. Returns in order.
  • "as-ready": Returns results as they are ready. Any order.
  • "in-order": Returns results in order, blocking only until the next result in that order is ready.
required

Returns:

Type Description
Iterator[tuple[EnsembleConfig, SequentialFeatureTransformer, ndarray, ndarray, list[int]]]

Iterator of tuples containing the ensemble configuration, the fitted preprocessing pipeline, the transformed training data, the transformed target, and the indices of categorical features.
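
The three parallel_mode options can be illustrated with a small sketch. It uses concurrent.futures from the standard library as a stand-in; the function run_parallel and the choice of a thread pool are assumptions for illustration and say nothing about how fit_preprocessing schedules its workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_parallel(tasks, n_workers, parallel_mode):
    """Hypothetical helper: yield task results according to parallel_mode."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        if parallel_mode == "block":
            # Wait for every task first, then hand results back in order.
            results = [future.result() for future in futures]
            yield from results
        elif parallel_mode == "as-ready":
            # Yield each result as soon as it finishes, in any order.
            for future in as_completed(futures):
                yield future.result()
        elif parallel_mode == "in-order":
            # Yield in submission order, waiting only for the next one.
            for future in futures:
                yield future.result()
        else:
            raise ValueError(f"Unknown parallel_mode: {parallel_mode!r}")


def make_task(value, delay):
    def task():
        time.sleep(delay)
        return value
    return task


tasks = [make_task(0, 0.05), make_task(1, 0.0), make_task(2, 0.02)]
blocked = list(run_parallel(tasks, 2, "block"))
ordered = list(run_parallel(tasks, 2, "in-order"))
ready = list(run_parallel(tasks, 2, "as-ready"))
```

In this sketch, "block" and "in-order" return the same sequence; the difference is when the first result becomes available to the caller. With "as-ready" the order depends on which task finishes first.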

fit_preprocessing_one

fit_preprocessing_one(
    config: EnsembleConfig,
    X_train: ndarray,
    y_train: ndarray,
    random_state: int | Generator | None = None,
    *,
    cat_ix: list[int]
) -> tuple[
    EnsembleConfig,
    SequentialFeatureTransformer,
    ndarray,
    ndarray,
    list[int],
]

Fit preprocessing pipeline for a single ensemble configuration.

Parameters:

Name Type Description Default
config EnsembleConfig

Ensemble configuration.

required
X_train ndarray

Training data.

required
y_train ndarray

Training target.

required
random_state int | Generator | None

Random seed or random number generator.

None
cat_ix list[int]

Indices of categorical features.

required

Returns:

Type Description
tuple[EnsembleConfig, SequentialFeatureTransformer, ndarray, ndarray, list[int]]

Tuple containing the ensemble configuration, the fitted preprocessing pipeline, the transformed training data, the transformed target, and the indices of categorical features.

generate_index_permutations

generate_index_permutations(
    n: int,
    *,
    max_index: int,
    subsample: int | float,
    random_state: int | Generator | None
) -> list[NDArray[int64]]

Generate indices for subsampling from the data.

Parameters:

Name Type Description Default
n int

Number of indices to generate.

required
max_index int

Maximum index to generate.

required
subsample int | float

Number of indices to subsample. If int, subsample that many indices. If float, subsample that fraction of indices.

required
random_state int | Generator | None

Random number generator.

required

Returns:

Type Description
list[NDArray[int64]]

List of indices to subsample.
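
The int and float cases of subsample can be sketched as follows. The function index_permutations_sketch is a hypothetical stand-in; in particular, how a fractional size is rounded and how an oversized integer is capped are assumptions, not documented behaviour.

```python
import numpy as np


def index_permutations_sketch(n, *, max_index, subsample, random_state):
    """Hypothetical stand-in: draw n index subsets from range(max_index)."""
    rng = np.random.default_rng(random_state)
    if isinstance(subsample, float):
        # Assumption: a float is a fraction of max_index, rounded down,
        # with at least one index kept.
        size = max(1, int(subsample * max_index))
    else:
        # Assumption: an int is an absolute count, capped at max_index.
        size = min(subsample, max_index)
    # Each subset is the head of a fresh permutation, so it has no repeats.
    return [rng.permutation(max_index)[:size] for _ in range(n)]


by_fraction = index_permutations_sketch(
    3, max_index=10, subsample=0.5, random_state=0
)
by_count = index_permutations_sketch(
    2, max_index=10, subsample=4, random_state=0
)
```

Each returned array can be used to index the rows of the training data, which matches the role of subsample_ix in EnsembleConfig.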