preprocessing ¶
AddFingerprintFeaturesStep ¶
Bases: FeaturePreprocessingTransformerStep
Adds a fingerprint feature to the features, based on a hash of each row.
If is_test = True, it keeps the first hash even if there are collisions.
If is_test = False, it handles hash collisions by counting up and rehashing until a unique hash is found.
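The collision handling described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `row_fingerprints`, the use of SHA-256, and the mapping of the digest to a float are all assumptions made for the example.

```python
import hashlib

import numpy as np


def row_fingerprints(X: np.ndarray, is_test: bool) -> np.ndarray:
    """Hypothetical helper: return one float fingerprint per row of X."""

    def hash_row(row: np.ndarray, counter: int) -> float:
        # Hash the raw bytes of the row together with a counter (assumed scheme).
        payload = row.tobytes() + str(counter).encode()
        digest = hashlib.sha256(payload).hexdigest()
        # Map the first 16 hex digits onto the unit interval.
        return int(digest[:16], 16) / 16**16

    seen = set()
    out = np.empty(X.shape[0], dtype=np.float64)
    for i, row in enumerate(X):
        counter = 0
        h = hash_row(row, counter)
        if not is_test:
            # Training mode: count up and rehash until the hash is unique.
            while h in seen:
                counter += 1
                h = hash_row(row, counter)
            seen.add(h)
        # Test mode: keep the first hash, even if it collides.
        out[i] = h
    return out


X = np.array([[1.0, 2.0], [1.0, 2.0], [3.0, 4.0]])
train_fp = row_fingerprints(X, is_test=False)
test_fp = row_fingerprints(X, is_test=True)
```

In this sketch the two identical rows receive distinct fingerprints in training mode and the same fingerprint in test mode.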
fit ¶
Fits the preprocessor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features) | required |
categorical_features | list[int] | list of indices of categorical features. | required |
transform ¶
Transforms the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features). | required |
FeaturePreprocessingTransformerStep ¶
Base class for feature preprocessing steps.
Its main abstraction is really just to provide categorical indices along the pipeline.
fit ¶
Fits the preprocessor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features) | required |
categorical_features | list[int] | list of indices of categorical features. | required |
transform ¶
Transforms the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features). | required |
KDITransformerWithNaN ¶
Bases: KDITransformer
KDI transformer that can handle NaN values. It performs KDI with NaNs replaced by mean values and then restores the NaNs at their original positions after the transformation.
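The fill, transform, and restore pattern can be sketched as below. This is only an illustration under stated assumptions: the helper `fit_transform_keep_nan` is hypothetical, and scikit-learn's `QuantileTransformer` is used purely as a stand-in for the KDI transformer.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer


def fit_transform_keep_nan(transformer, X: np.ndarray) -> np.ndarray:
    """Hypothetical helper: transform X while preserving NaN positions."""
    nan_mask = np.isnan(X)
    # Replace NaNs by the per-column mean of the observed values.
    # (A column that is entirely NaN would need separate handling.)
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(nan_mask, col_means, X)
    X_out = np.asarray(transformer.fit_transform(X_filled), dtype=np.float64)
    # Put the NaNs back where they were in the input.
    X_out[nan_mask] = np.nan
    return X_out


X = np.array([[1.0, np.nan], [2.0, 10.0], [np.nan, 20.0], [4.0, 30.0]])
stand_in = QuantileTransformer(n_quantiles=4, output_distribution="uniform")
X_out = fit_transform_keep_nan(stand_in, X)
```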
RemoveConstantFeaturesStep ¶
Bases: FeaturePreprocessingTransformerStep
Remove features that are constant in the training data.
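A minimal sketch of the idea, assuming a column counts as constant when every training value equals the first row's value. The class name `ConstantColumnDropper` and the `categorical_features_` attribute are hypothetical and not part of the documented API.

```python
import numpy as np


class ConstantColumnDropper:
    """Hypothetical step: drop columns that are constant in the training data."""

    def fit(self, X: np.ndarray, categorical_features: list[int]):
        # A column is kept if at least one value differs from the first row.
        self.keep_ = ~(X == X[0:1, :]).all(axis=0)
        kept_indices = np.flatnonzero(self.keep_)
        cat = set(categorical_features)
        # Re-index the categorical columns that survive the removal.
        self.categorical_features_ = [
            new for new, old in enumerate(kept_indices) if int(old) in cat
        ]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        # The mask learned on the training data is applied unchanged.
        return X[:, self.keep_]


X_train = np.array([[1.0, 5.0, 0.0], [1.0, 6.0, 0.0], [1.0, 7.0, 0.0]])
dropper = ConstantColumnDropper().fit(X_train, categorical_features=[1, 2])
X_new = dropper.transform(np.array([[1.0, 9.0, 0.0]]))
```

Because the mask is learned during fitting, a column that happens to be constant only in later data is still kept.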
fit ¶
Fits the preprocessor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features) | required |
categorical_features | list[int] | list of indices of categorical features. | required |
transform ¶
Transforms the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features). | required |
ReshapeFeatureDistributionsStep ¶
Bases: FeaturePreprocessingTransformerStep
Reshape the feature distributions using different transformations.
fit ¶
Fits the preprocessor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features) | required |
categorical_features | list[int] | list of indices of categorical features. | required |
get_adaptive_preprocessors staticmethod ¶
get_adaptive_preprocessors(
num_examples: int = 100, random_state: int | None = None
) -> dict[str, ColumnTransformer]
Returns a dictionary of adaptive column transformers that can be used to preprocess the data. Adaptive column transformers preprocess the data based on the column type: they receive a pandas DataFrame whose column names indicate the column type. Column types are not data types, but strings that indicate how the data should be preprocessed.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
num_examples | int | The number of examples in the dataset. | 100 |
random_state | int | None | The random state to use for the transformers. | None |
get_column_types staticmethod ¶
Returns a list of column types for the given data, that indicate how the data should be preprocessed.
transform ¶
Transforms the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features). | required |
SafePowerTransformer ¶
Bases: PowerTransformer
Power Transformer which reverts features back to their original values if they are transformed to large values or the output column does not have unit variance. This happens e.g. when the input data has a large number of outliers.
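The revert logic might look roughly like the sketch below. It is not the class's actual code: the function `safe_power_transform` and both thresholds are assumptions chosen for illustration, and the example uses scikit-learn's `PowerTransformer` directly instead of subclassing it.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def safe_power_transform(
    X: np.ndarray,
    variance_threshold: float = 1e-3,       # assumed tolerance around unit variance
    large_value_threshold: float = 100.0,   # assumed bound for "large" outputs
):
    """Hypothetical helper: power-transform X, reverting problematic columns."""
    transformer = PowerTransformer(method="yeo-johnson", standardize=True)
    X_out = transformer.fit_transform(X)

    # Flag columns whose output variance is not (close to) one.
    bad_variance = np.abs(np.var(X_out, axis=0) - 1.0) > variance_threshold
    # Flag columns that contain very large or non-finite values.
    too_large = np.any(np.abs(X_out) > large_value_threshold, axis=0)
    non_finite = ~np.all(np.isfinite(X_out), axis=0)

    revert = bad_variance | too_large | non_finite
    # Restore the original values for every flagged column.
    X_out[:, revert] = X[:, revert]
    return X_out, revert


rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=200), rng.exponential(size=200)])
X_out, reverted = safe_power_transform(X)
```

Columns flagged in `reverted` pass through unchanged, so downstream steps never see the unstable output.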
SequentialFeatureTransformer ¶
Bases: UserList
A transformer that applies a sequence of feature preprocessing steps. This is closely related to sklearn's Pipeline, but it is designed to work with categorical_features lists that are always passed on.
Currently this class is only used once, thus this could also be made less general if needed.
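To illustrate how a categorical_features list can be passed along a sequence of steps, here is a small sketch. The step protocol (each step exposing `fit`, `transform`, and a `categorical_features_` attribute) and the names `DropFirstColumn` and `SequentialSteps` are assumptions for this example and may differ from the library's interface.

```python
from collections import UserList

import numpy as np


class DropFirstColumn:
    """Hypothetical step that removes column 0 and re-indexes categoricals."""

    def fit(self, X: np.ndarray, categorical_features: list[int]):
        self.categorical_features_ = [i - 1 for i in categorical_features if i > 0]
        return self

    def transform(self, X: np.ndarray) -> np.ndarray:
        return X[:, 1:]


class SequentialSteps(UserList):
    """Hypothetical container that applies its steps in order."""

    def fit_transform(self, X: np.ndarray, categorical_features: list[int]):
        for step in self:
            step.fit(X, categorical_features)
            X = step.transform(X)
            # Hand the updated categorical indices to the next step.
            categorical_features = step.categorical_features_
        self.categorical_features_ = categorical_features
        return X

    def transform(self, X: np.ndarray) -> np.ndarray:
        for step in self:
            X = step.transform(X)
        return X


steps = SequentialSteps([DropFirstColumn(), DropFirstColumn()])
X = np.arange(8, dtype=float).reshape(2, 4)
X_out = steps.fit_transform(X, categorical_features=[0, 2, 3])
```

Subclassing `UserList` keeps the usual list operations (indexing, appending, iteration) available on the container.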
fit ¶
Fit all the steps in the pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features) | required |
categorical_features | list[int] | list of indices of categorical features. | required |
fit_transform ¶
Fit and transform the data using the fitted pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features) | required |
categorical_features | list[int] | list of indices of categorical features. | required |
transform ¶
Transform the data using the fitted pipeline.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features). | required |
ShuffleFeaturesStep ¶
Bases: FeaturePreprocessingTransformerStep
Shuffle the features in the data.
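A minimal sketch of shuffling columns while keeping the categorical indices consistent. The function `shuffle_features` and its `seed` parameter are hypothetical; the actual step may choose its permutation differently.

```python
import numpy as np


def shuffle_features(X: np.ndarray, categorical_features: list[int], seed: int = 0):
    """Hypothetical helper: permute columns and remap categorical indices."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[1])
    cat = set(categorical_features)
    # New position j holds old column perm[j], so j is categorical
    # exactly when perm[j] was categorical.
    new_categorical = [new for new, old in enumerate(perm) if int(old) in cat]
    return X[:, perm], new_categorical, perm


X = np.arange(12, dtype=float).reshape(3, 4)
X_shuffled, new_cat, perm = shuffle_features(X, categorical_features=[0, 3], seed=42)
```

A fitted step would store `perm` so that the same column order is applied to later data.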
fit ¶
Fits the preprocessor.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features) | required |
categorical_features | list[int] | list of indices of categorical features. | required |
transform ¶
Transforms the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray | 2d array of shape (n_samples, n_features). | required |
add_safe_standard_to_safe_power_without_standard ¶
add_safe_standard_to_safe_power_without_standard(
input_transformer: TransformerMixin,
) -> Pipeline
In edge cases, PowerTransformer can produce inf or similarly invalid values, which causes the subsequent standard scaling to crash. This function works around that issue.
make_box_cox_safe ¶
Make Box-Cox safe.
The Box-Cox transformation can only be applied to strictly positive data. By applying MinMax scaling first, we achieve this without loss of function. Additionally, for test data, we also need clipping.
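The combination of scaling and clipping can be sketched with standard scikit-learn components. The pipeline below is an illustration, not the function's actual return value: the step names and the `feature_range` of (0.1, 1.0) are assumptions.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer

box_cox_safe = Pipeline(
    [
        # Map the training data into a strictly positive range; clip=True keeps
        # unseen values inside that range at transform time.
        ("to_positive", MinMaxScaler(feature_range=(0.1, 1.0), clip=True)),
        ("box_cox", PowerTransformer(method="box-cox")),
    ]
)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))
# Test rows far outside the training range would otherwise become non-positive.
X_test = np.array([[-100.0, 100.0], [0.0, 0.0]])

box_cox_safe.fit(X_train)
X_test_out = box_cox_safe.transform(X_test)
```

Without the clipping, the first test row would be scaled to a negative value and Box-Cox would fail on it.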