
preprocessing

AddFingerprintFeaturesStep

Bases: FeaturePreprocessingTransformerStep

Adds a fingerprint feature to the features, based on a hash of each row.

If is_test = True, the first hash is kept even if there are collisions. If is_test = False, hash collisions are resolved by counting up and rehashing until a unique hash is found.
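
The collision-handling logic can be illustrated with a minimal sketch (this is not the class's actual implementation; the row-hashing and salting details here are assumptions):

    import numpy as np

    def row_fingerprints(X: np.ndarray, is_test: bool) -> np.ndarray:
        # One hash value per row; on training data, collisions are resolved
        # by counting up a salt and rehashing until the hash is unique.
        seen: set[int] = set()
        fingerprints = np.empty(len(X), dtype=np.int64)
        for i, row in enumerate(X):
            h = hash(row.tobytes())
            if not is_test:
                salt = 0
                while h in seen:
                    salt += 1
                    h = hash(row.tobytes() + salt.to_bytes(8, "little"))
                seen.add(h)
            # With is_test=True the first hash is kept even if it collides.
            fingerprints[i] = h
        return fingerprints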

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

FeaturePreprocessingTransformerStep

Base class for feature preprocessing steps.

Its main abstraction is to provide categorical feature indices along the pipeline.
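
A hedged toy subclass, assuming only the documented fit/transform signatures; the real transform additionally returns a _TransformResult that carries the updated categorical indices:

    import numpy as np

    class DropFirstColumnStep(FeaturePreprocessingTransformerStep):
        # Hypothetical step: drops column 0 and shifts the categorical indices.
        def fit(self, X: np.ndarray, categorical_features: list[int]):
            self.categorical_features_ = [i - 1 for i in categorical_features if i != 0]
            return self

        def transform(self, X: np.ndarray) -> np.ndarray:
            return X[:, 1:]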

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

KDITransformerWithNaN

Bases: KDITransformer

KDI transformer that can handle NaN values. It replaces NaNs with column means, performs the KDI transformation, and then restores the NaNs at their original positions afterwards.
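
The NaN handling described above follows this general pattern (a generic sketch around any fitted transformer, not the class's actual code):

    import numpy as np

    def transform_with_nan(transformer, X: np.ndarray) -> np.ndarray:
        nan_mask = np.isnan(X)
        # Replace NaNs with per-column means so the underlying transform can run.
        col_means = np.nanmean(X, axis=0)
        X_filled = np.where(nan_mask, col_means, X)
        X_out = transformer.transform(X_filled)
        # Restore NaNs at their original positions after the transformation.
        X_out[nan_mask] = np.nan
        return X_out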

RemoveConstantFeaturesStep

Bases: FeaturePreprocessingTransformerStep

Remove features that are constant in the training data.
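
A minimal sketch of the idea; the step's actual criterion (for example how NaNs are handled) may differ:

    import numpy as np

    def constant_feature_mask(X: np.ndarray) -> np.ndarray:
        # True for columns whose every value equals the value in the first row.
        return np.all(X == X[0:1, :], axis=0)

    # fit would store this mask computed on the training data;
    # transform would then drop the constant columns via X[:, ~mask].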

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

ReshapeFeatureDistributionsStep

Bases: FeaturePreprocessingTransformerStep

Reshape the feature distributions using different transformations.

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

get_adaptive_preprocessors staticmethod

get_adaptive_preprocessors(
    num_examples: int = 100, random_state: int | None = None
) -> dict[str, ColumnTransformer]

Returns a dictionary of adaptive column transformers that can be used to preprocess the data. Adaptive column transformers preprocess the data based on the column type: they receive a pandas DataFrame whose column names indicate the column type. These column types are not datatypes, but strings that indicate how the data should be preprocessed.

Parameters:

    num_examples (int, default 100):
        The number of examples in the dataset.
    random_state (int | None, default None):
        The random state to use for the transformers.

get_column_types staticmethod

get_column_types(X: ndarray) -> list[str]

Returns a list of column types for the given data that indicate how the data should be preprocessed.
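
How the two static methods fit together can be sketched as follows; constructor arguments for the step and the exact column-type strings are not shown, and the DataFrame construction is an assumption:

    import numpy as np
    import pandas as pd

    X = np.random.default_rng(0).normal(size=(100, 4))

    # Column "types" are preprocessing tags ("how to preprocess"), not dtypes.
    column_types = ReshapeFeatureDistributionsStep.get_column_types(X)
    preprocessors = ReshapeFeatureDistributionsStep.get_adaptive_preprocessors(
        num_examples=X.shape[0], random_state=0
    )

    # The adaptive ColumnTransformers expect a DataFrame whose column names
    # encode the column type (duplicate names may need disambiguation in practice).
    X_df = pd.DataFrame(X, columns=column_types)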

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

SafePowerTransformer

Bases: PowerTransformer

Power transformer that reverts features back to their original values if they are transformed to overly large values or the output column does not have unit variance. This happens, for example, when the input data contains a large number of outliers.
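
A hedged sketch of the "revert if degenerate" idea around sklearn's PowerTransformer; the thresholds used here are made up:

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    def safe_power_transform(X: np.ndarray, max_abs: float = 100.0, var_tol: float = 0.1) -> np.ndarray:
        X_out = PowerTransformer().fit_transform(X)
        for j in range(X.shape[1]):
            col = X_out[:, j]
            too_large = np.abs(col).max() > max_abs
            off_unit_variance = abs(col.std() - 1.0) > var_tol
            if too_large or off_unit_variance:
                # Revert this feature back to its original values.
                X_out[:, j] = X[:, j]
        return X_out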

SequentialFeatureTransformer

Bases: UserList

A transformer that applies a sequence of feature preprocessing steps. It is closely related to sklearn's Pipeline, but is designed to work with categorical_features lists that are always passed on between steps.

Currently this class is only used once, so it could also be made less general if needed.
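
A hedged usage sketch, assuming the listed steps can be default-constructed (their real constructors may require arguments such as a random state):

    import numpy as np

    pipeline = SequentialFeatureTransformer([
        RemoveConstantFeaturesStep(),
        ReshapeFeatureDistributionsStep(),
        ShuffleFeaturesStep(),
    ])

    X_train = np.random.default_rng(0).normal(size=(50, 5))
    result = pipeline.fit_transform(X_train, categorical_features=[1, 3])
    # Each step receives the categorical_features list as updated by the previous step.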

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fit all the steps in the pipeline.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

fit_transform

fit_transform(
    X: ndarray, categorical_features: list[int]
) -> _TransformResult

Fit and transform the data using the fitted pipeline.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transform the data using the fitted pipeline.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

ShuffleFeaturesStep

Bases: FeaturePreprocessingTransformerStep

Shuffle the features in the data.
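
Conceptually the step behaves like the sketch below: fit draws one column permutation, and transform applies the same permutation so train and test columns stay aligned (the real step presumably also remaps the categorical indices):

    import numpy as np

    X_train = np.arange(12).reshape(3, 4)
    rng = np.random.default_rng(42)
    permutation = rng.permutation(X_train.shape[1])  # drawn once, as in fit
    X_shuffled = X_train[:, permutation]             # reused, as in transform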

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

add_safe_standard_to_safe_power_without_standard

add_safe_standard_to_safe_power_without_standard(
    input_transformer: TransformerMixin,
) -> Pipeline

In edge cases, PowerTransformer can create inf values and the like. The subsequent standard scaling then crashes. This function fixes that issue.
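
One way to realize this, as a hedged sketch (the library's actual "safe" standard-scaling step may differ; the hypothetical with_safe_standard_scaling below simply maps inf/NaN to finite values before scaling):

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    def with_safe_standard_scaling(input_transformer) -> Pipeline:
        return Pipeline([
            ("power", input_transformer),
            ("finite", FunctionTransformer(np.nan_to_num)),  # inf/NaN -> finite values
            ("standard", StandardScaler()),
        ])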

make_box_cox_safe

make_box_cox_safe(
    input_transformer: TransformerMixin | Pipeline,
) -> Pipeline

Make Box-Cox safe.

The Box-Cox transformation can only be applied to strictly positive data. By applying MinMax scaling first, we achieve this without loss of functionality. Additionally, for test data, we also need clipping.
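
A hedged sketch of the described composition; the exact feature range and the clipping of test data are assumptions:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, PowerTransformer

    def box_cox_safe_sketch() -> Pipeline:
        return Pipeline([
            # Scale into a strictly positive range so Box-Cox is applicable;
            # the real function additionally clips test data into the fitted range.
            ("min_max", MinMaxScaler(feature_range=(0.1, 1.0))),
            ("box_cox", PowerTransformer(method="box-cox")),
        ])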

skew

skew(x: ndarray) -> float
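
This helper presumably computes sample skewness; a minimal sketch of that statistic (not necessarily the library's exact formula):

    import numpy as np

    def sample_skewness(x: np.ndarray) -> float:
        # Fisher-Pearson coefficient of skewness: E[(x - mean)^3] / std^3.
        x = np.asarray(x, dtype=float)
        centered = x - x.mean()
        std = x.std()
        return float(np.mean(centered**3) / (std**3 + 1e-12))  # epsilon avoids division by zero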