
preprocessing

AddFingerprintFeaturesStep

Bases: FeaturePreprocessingTransformerStep

Adds a fingerprint feature to the features, based on a hash of each row.

If is_test = True, the first hash is kept even if there are collisions. If is_test = False, hash collisions are resolved by counting up and rehashing until a unique hash is found.
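
The collision-handling logic can be illustrated with a minimal sketch (this is not the class's actual implementation; the row-hashing and salting details here are assumptions):

    import numpy as np

    def row_fingerprints(X: np.ndarray, is_test: bool) -> np.ndarray:
        # One hash value per row; on training data, collisions are resolved
        # by counting up a salt and rehashing until the hash is unique.
        seen: set[int] = set()
        fingerprints = np.empty(len(X), dtype=np.int64)
        for i, row in enumerate(X):
            h = hash(row.tobytes())
            if not is_test:
                salt = 0
                while h in seen:
                    salt += 1
                    h = hash(row.tobytes() + salt.to_bytes(8, "little"))
                seen.add(h)
            # With is_test=True the first hash is kept even if it collides.
            fingerprints[i] = h
        return fingerprints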

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

FeaturePreprocessingTransformerStep

Base class for feature preprocessing steps.

Its main abstraction is to provide categorical feature indices along the pipeline.
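
A hedged toy subclass, assuming only the documented fit/transform signatures; the real transform additionally returns a _TransformResult that carries the updated categorical indices:

    import numpy as np

    class DropFirstColumnStep(FeaturePreprocessingTransformerStep):
        # Hypothetical step: drops column 0 and shifts the categorical indices.
        def fit(self, X: np.ndarray, categorical_features: list[int]):
            self.categorical_features_ = [i - 1 for i in categorical_features if i != 0]
            return self

        def transform(self, X: np.ndarray) -> np.ndarray:
            return X[:, 1:]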

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

KDITransformerWithNaN

Bases: KDITransformer

KDI transformer that can handle NaN values. It replaces NaNs with column means, performs the KDI transformation, and then restores the NaNs at their original positions afterwards.
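
The NaN handling described above follows this general pattern (a generic sketch around any fitted transformer, not the class's actual code):

    import numpy as np

    def transform_with_nan(transformer, X: np.ndarray) -> np.ndarray:
        nan_mask = np.isnan(X)
        # Replace NaNs with per-column means so the underlying transform can run.
        col_means = np.nanmean(X, axis=0)
        X_filled = np.where(nan_mask, col_means, X)
        X_out = transformer.transform(X_filled)
        # Restore NaNs at their original positions after the transformation.
        X_out[nan_mask] = np.nan
        return X_out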

RemoveConstantFeaturesStep

Bases: FeaturePreprocessingTransformerStep

Remove features that are constant in the training data.
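
A minimal sketch of the idea; the step's actual criterion (for example how NaNs are handled) may differ:

    import numpy as np

    def constant_feature_mask(X: np.ndarray) -> np.ndarray:
        # True for columns whose every value equals the value in the first row.
        return np.all(X == X[0:1, :], axis=0)

    # fit would store this mask computed on the training data;
    # transform would then drop the constant columns via X[:, ~mask].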

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

ReshapeFeatureDistributionsStep

Bases: FeaturePreprocessingTransformerStep

Reshape the feature distributions using different transformations.

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

get_adaptive_preprocessors staticmethod

get_adaptive_preprocessors(
    num_examples: int = 100, random_state: int | None = None
) -> dict[str, ColumnTransformer]

Returns a dictionary of adaptive column transformers that can be used to preprocess the data. Adaptive column transformers preprocess the data based on the column type: they receive a pandas DataFrame whose column names indicate the column type. These column types are not datatypes, but strings that indicate how the data should be preprocessed.

Parameters:

    num_examples (int, default 100):
        The number of examples in the dataset.
    random_state (int | None, default None):
        The random state to use for the transformers.

get_column_types staticmethod

get_column_types(X: ndarray) -> list[str]

Returns a list of column types for the given data that indicate how the data should be preprocessed.
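
How the two static methods fit together can be sketched as follows; constructor arguments for the step and the exact column-type strings are not shown, and the DataFrame construction is an assumption:

    import numpy as np
    import pandas as pd

    X = np.random.default_rng(0).normal(size=(100, 4))

    # Column "types" are preprocessing tags ("how to preprocess"), not dtypes.
    column_types = ReshapeFeatureDistributionsStep.get_column_types(X)
    preprocessors = ReshapeFeatureDistributionsStep.get_adaptive_preprocessors(
        num_examples=X.shape[0], random_state=0
    )

    # The adaptive ColumnTransformers expect a DataFrame whose column names
    # encode the column type (duplicate names may need disambiguation in practice).
    X_df = pd.DataFrame(X, columns=column_types)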

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

SafePowerTransformer

Bases: PowerTransformer

Power transformer that reverts features back to their original values if they are transformed to overly large values or the output column does not have unit variance. This happens, for example, when the input data contains a large number of outliers.
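
A hedged sketch of the "revert if degenerate" idea around sklearn's PowerTransformer; the thresholds used here are made up:

    import numpy as np
    from sklearn.preprocessing import PowerTransformer

    def safe_power_transform(X: np.ndarray, max_abs: float = 100.0, var_tol: float = 0.1) -> np.ndarray:
        X_out = PowerTransformer().fit_transform(X)
        for j in range(X.shape[1]):
            col = X_out[:, j]
            too_large = np.abs(col).max() > max_abs
            off_unit_variance = abs(col.std() - 1.0) > var_tol
            if too_large or off_unit_variance:
                # Revert this feature back to its original values.
                X_out[:, j] = X[:, j]
        return X_out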

SequentialFeatureTransformer

Bases: UserList

A transformer that applies a sequence of feature preprocessing steps. It is closely related to sklearn's Pipeline, but is designed to work with categorical_features lists that are always passed on between steps.

Currently this class is only used once, so it could also be made less general if needed.
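
A hedged usage sketch, assuming the listed steps can be default-constructed (their real constructors may require arguments such as a random state):

    import numpy as np

    pipeline = SequentialFeatureTransformer([
        RemoveConstantFeaturesStep(),
        ReshapeFeatureDistributionsStep(),
        ShuffleFeaturesStep(),
    ])

    X_train = np.random.default_rng(0).normal(size=(50, 5))
    result = pipeline.fit_transform(X_train, categorical_features=[1, 3])
    # Each step receives the categorical_features list as updated by the previous step.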

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fit all the steps in the pipeline.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

fit_transform

fit_transform(
    X: ndarray, categorical_features: list[int]
) -> _TransformResult

Fit and transform the data using the fitted pipeline.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transform the data using the fitted pipeline.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

ShuffleFeaturesStep

Bases: FeaturePreprocessingTransformerStep

Shuffle the features in the data.
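
Conceptually the step behaves like the sketch below: fit draws one column permutation, and transform applies the same permutation so train and test columns stay aligned (the real step presumably also remaps the categorical indices):

    import numpy as np

    X_train = np.arange(12).reshape(3, 4)
    rng = np.random.default_rng(42)
    permutation = rng.permutation(X_train.shape[1])  # drawn once, as in fit
    X_shuffled = X_train[:, permutation]             # reused, as in transform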

fit

fit(X: ndarray, categorical_features: list[int]) -> Self

Fits the preprocessor.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).
    categorical_features (list[int], required):
        List of indices of categorical features.

transform

transform(X: ndarray) -> _TransformResult

Transforms the data.

Parameters:

    X (ndarray, required):
        2d array of shape (n_samples, n_features).

add_safe_standard_to_safe_power_without_standard

add_safe_standard_to_safe_power_without_standard(
    input_transformer: TransformerMixin,
) -> Pipeline

In edge cases, PowerTransformer can create inf values and the like. The subsequent standard scaling then crashes. This function fixes that issue.
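
One way to realize this, as a hedged sketch (the library's actual "safe" standard-scaling step may differ; the hypothetical with_safe_standard_scaling below simply maps inf/NaN to finite values before scaling):

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import FunctionTransformer, StandardScaler

    def with_safe_standard_scaling(input_transformer) -> Pipeline:
        return Pipeline([
            ("power", input_transformer),
            ("finite", FunctionTransformer(np.nan_to_num)),  # inf/NaN -> finite values
            ("standard", StandardScaler()),
        ])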

make_box_cox_safe

make_box_cox_safe(
    input_transformer: TransformerMixin | Pipeline,
) -> Pipeline

Make Box-Cox safe.

The Box-Cox transformation can only be applied to strictly positive data. By applying MinMax scaling first, we achieve this without loss of functionality. Additionally, for test data, we also need clipping.
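
A hedged sketch of the described composition; the exact feature range and the clipping of test data are assumptions:

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import MinMaxScaler, PowerTransformer

    def box_cox_safe_sketch() -> Pipeline:
        return Pipeline([
            # Scale into a strictly positive range so Box-Cox is applicable;
            # the real function additionally clips test data into the fitted range.
            ("min_max", MinMaxScaler(feature_range=(0.1, 1.0))),
            ("box_cox", PowerTransformer(method="box-cox")),
        ])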

skew

skew(x: ndarray) -> float
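
This helper presumably computes sample skewness; a minimal sketch of that statistic (not necessarily the library's exact formula):

    import numpy as np

    def sample_skewness(x: np.ndarray) -> float:
        # Fisher-Pearson coefficient of skewness: E[(x - mean)^3] / std^3.
        x = np.asarray(x, dtype=float)
        centered = x - x.mean()
        std = x.std()
        return float(np.mean(centered**3) / (std**3 + 1e-12))  # epsilon avoids division by zero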