
unsupervised

This module enables TabPFN to be used for unsupervised learning tasks including missing value imputation, outlier detection, and synthetic data generation. It leverages TabPFN's probabilistic nature to model joint data distributions without training labels.

Key features:

- Missing value imputation with probabilistic sampling
- Outlier detection based on feature-wise probability estimation
- Synthetic data generation with controllable randomness
- Compatibility with both TabPFN and TabPFN-client backends
- Support for mixed data types (categorical and numerical features)
- Flexible permutation-based approach for feature dependencies

Example usage
from tabpfn import TabPFNClassifier, TabPFNRegressor
from tabpfn_extensions.unsupervised import TabPFNUnsupervisedModel

# Create TabPFN models for classification and regression
clf = TabPFNClassifier()
reg = TabPFNRegressor()

# Create the unsupervised model
model = TabPFNUnsupervisedModel(tabpfn_clf=clf, tabpfn_reg=reg)

# Fit the model on data without labels
model.fit(X_train)

# Different unsupervised tasks
X_imputed = model.impute(X_with_missing_values)  # Fill missing values
outlier_scores = model.outliers(X_test)          # Detect outliers
X_synthetic = model.generate_synthetic_data(100)  # Generate new samples

TabPFNUnsupervisedModel

Bases: BaseEstimator

TabPFN unsupervised model for imputation, outlier detection, and synthetic data generation.

This model combines a TabPFNClassifier for categorical features and a TabPFNRegressor for numerical features to perform various unsupervised learning tasks on tabular data.

Parameters:

Name Type Description Default
tabpfn_clf

TabPFNClassifier, optional TabPFNClassifier instance for handling categorical features. If not provided, the model assumes that there are no categorical features in the data.

None
tabpfn_reg

TabPFNRegressor, optional TabPFNRegressor instance for handling numerical features. If not provided, the model assumes that there are no numerical features in the data.

None

Attributes:

Name Type Description
categorical_features

list List of indices of categorical features in the input data.

Example
>>> tabpfn_clf = TabPFNClassifier()
>>> tabpfn_reg = TabPFNRegressor()
>>> model = TabPFNUnsupervisedModel(tabpfn_clf, tabpfn_reg)
>>>
>>> X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
>>> model.fit(X)
>>>
>>> X_imputed = model.impute(X)
>>> X_outliers = model.outliers(X)
>>> X_synthetic = model.generate_synthetic_data(n_samples=100)

density_

density_(
    X_predict: Tensor,
    X_fit: Tensor,
    conditional_idx: list[int],
    column_idx: int,
) -> tuple[Any, Tensor, Tensor]

Generate density predictions for a specific feature based on other features.

This internal method is used by the imputation and outlier detection algorithms to model the conditional probability distribution of one feature given others.

Parameters:

Name Type Description Default
X_predict Tensor

Input data for which to make predictions

required
X_fit Tensor

Training data to fit the model

required
conditional_idx list[int]

Indices of features to condition on

required
column_idx int

Index of the feature to predict

required

Returns:

Type Description
tuple[Any, Tensor, Tensor]

tuple containing:

- The fitted model (classifier or regressor)
- The filtered features used for prediction
- The target feature values to predict

fit

fit(
    X: ndarray | Tensor | DataFrame,
    y: ndarray | Tensor | Series | None = None,
) -> TabPFNUnsupervisedModel

Fit the model to the input data.

Parameters:

Name Type Description Default
X ndarray | Tensor | DataFrame

Union[np.ndarray, torch.Tensor, pd.DataFrame] Input data to fit the model, shape (n_samples, n_features).

required
y ndarray | Tensor | Series | None

Optional[Union[np.ndarray, torch.Tensor, pd.Series]], default=None Target values, shape (n_samples,). Optional since this is an unsupervised model.

None

Returns:

Type Description
TabPFNUnsupervisedModel

TabPFNUnsupervisedModel Fitted model instance (self).

generate_synthetic_data

generate_synthetic_data(
    n_samples: int = 100,
    t: float = 1.0,
    n_permutations: int = 3,
) -> Tensor

Generate synthetic tabular data samples using the fitted TabPFN models.

This method uses imputation to create synthetic data, starting with a matrix of NaN values and filling in each feature sequentially. Samples are generated feature by feature in a single pass, with each feature conditioned on previously generated features.

Parameters:

Name Type Description Default
n_samples int

int, default=100 Number of synthetic samples to generate

100
t float

float, default=1.0 Temperature parameter for sampling. Controls randomness:

- Higher values (e.g., 1.0) produce more diverse samples
- Lower values (e.g., 0.1) produce more deterministic samples

1.0
n_permutations int

int, default=3 Number of feature permutations to use for generation. More permutations may provide more robust results but increase computation time.

3

Returns:

Type Description
Tensor

torch.Tensor: Generated synthetic data of shape (n_samples, n_features)

Raises:

Type Description
AssertionError

If the model is not fitted (self.X_ does not exist)
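The feature-by-feature procedure described for this method can be sketched with a toy stand-in for the fitted models. Everything below is an assumption for illustration: `toy_conditional_sample` and `generate_rows` are hypothetical helpers, not part of tabpfn_extensions, and the Gaussian sampler only mimics the idea of conditioning on features that have already been generated.

```python
import math
import random


def toy_conditional_sample(row, column_idx, rng):
    """Hypothetical stand-in for a fitted model: sample one feature value.

    The first feature is drawn around 0; every later feature is drawn
    around the mean of the values already generated in this row.
    """
    observed = [v for v in row[:column_idx] if not math.isnan(v)]
    center = sum(observed) / len(observed) if observed else 0.0
    return rng.gauss(center, 1.0)


def generate_rows(n_samples, n_features, seed=0):
    """Start from an all-NaN matrix and fill it one feature at a time."""
    rng = random.Random(seed)
    rows = [[math.nan] * n_features for _ in range(n_samples)]
    for column_idx in range(n_features):  # single pass over the features
        for row in rows:
            row[column_idx] = toy_conditional_sample(row, column_idx, rng)
    return rows


synthetic = generate_rows(n_samples=5, n_features=3)
```

After the loop, no cell is NaN: each column was filled using only the columns generated before it.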

get_embeddings

get_embeddings(
    X: tensor, per_column: bool = False
) -> tensor

Get the transformer embeddings for the test data X.

Parameters:

Name Type Description Default
X tensor
required

Returns:

Type Description
tensor

torch.Tensor of shape (n_samples, embedding_dim)

get_embeddings_per_column

get_embeddings_per_column(X: tensor) -> tensor

Alternative implementation of get_embeddings that treats each column in turn as the label, computes the embeddings for it separately, and concatenates the results. This approach needs more passes but might be more accurate.

impute

impute(
    X: Tensor | ndarray | DataFrame,
    t: float = 1e-09,
    n_permutations: int = 10,
) -> Tensor

Impute missing values in the input data using the fitted TabPFN models.

This method fills missing values (np.nan) in the input data by predicting each missing value based on the observed values in the same sample. The imputation uses multiple random feature permutations to improve robustness.

Parameters:

Name Type Description Default
X Tensor | ndarray | DataFrame

Union[torch.Tensor, np.ndarray, pd.DataFrame] Input data of shape (n_samples, n_features) with missing values encoded as np.nan.

required
t float

float, default=0.000000001 Temperature for sampling from the imputation distribution. Lower values result in more deterministic imputations, while higher values introduce more randomness.

1e-09
n_permutations int

int, default=10 Number of random feature permutations to use for imputation. Higher values may improve robustness but increase computation time.

10

Returns:

Type Description
Tensor

torch.Tensor Imputed data with missing values replaced, of shape (n_samples, n_features).

Note

The model must be fitted with training data before calling this method.
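The effect of the temperature t can be illustrated with a generic temperature-scaled softmax over class scores. This is a sketch under the assumption that temperature divides the scores before normalization. How TabPFN applies t internally is not shown in this reference, and `temperature_probs` and `sample_index` are hypothetical names.

```python
import math
import random


def temperature_probs(logits, t):
    """Softmax of logits / t, computed stably by subtracting the max."""
    scaled = [x / t for x in logits]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(logits, t, rng):
    """Draw one category index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, t)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.5]  # made-up class scores
sharp = temperature_probs(logits, t=1e-9)   # nearly all mass on the largest score
diffuse = temperature_probs(logits, t=1.0)  # mass spread across the categories
draw = sample_index(logits, t=1e-9, rng=random.Random(0))
```

With a temperature as small as the default of 1e-09, sampling behaves like picking the most likely value, which is why imputations are close to deterministic by default.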

impute_

impute_(
    X: Tensor,
    t: float = 1e-09,
    n_permutations: int = 10,
    condition_on_all_features: bool = True,
    fast_mode: bool = False,
) -> Tensor

Impute missing values (np.nan) in X by sampling all cells independently from the trained models.

Parameters:

Name Type Description Default
X Tensor

torch.Tensor Input data of shape (n_samples, n_features) with missing values encoded as np.nan

required
t float

float, default=0.000000001 Temperature for sampling from the imputation distribution, lower values are more deterministic

1e-09
n_permutations int

int, default=10 Number of permutations to use for imputation

10
condition_on_all_features bool

bool, default=True Whether to condition on all other features (True) or only previous features (False)

True
fast_mode bool

bool, default=False Whether to use faster settings for testing

False

Returns:

Type Description
Tensor

torch.Tensor: Imputed data with missing values replaced
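One way to read the condition_on_all_features flag is as a choice of which columns serve as inputs when a given feature is predicted. The sketch below encodes that reading. It is an interpretation of the parameter description, not the library's code, and `conditioning_indices` is a hypothetical helper.

```python
def conditioning_indices(feature_permutation, position, condition_on_all_features):
    """Return the columns used as inputs when predicting one feature.

    feature_permutation: an ordering of the column indices
    position: position in that ordering of the feature being predicted
    """
    target = feature_permutation[position]
    if condition_on_all_features:
        # Every other column is an input, regardless of the ordering.
        return [idx for idx in feature_permutation if idx != target]
    # Only columns that come earlier in the ordering are inputs.
    return list(feature_permutation[:position])


perm = (2, 0, 3, 1)
all_others = conditioning_indices(perm, position=2, condition_on_all_features=True)
previous_only = conditioning_indices(perm, position=2, condition_on_all_features=False)
```

Here column 3 is the target: `all_others` contains columns 2, 0, and 1, while `previous_only` contains only columns 2 and 0.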

impute_single_permutation_

impute_single_permutation_(
    X: Tensor,
    feature_permutation: list[int] | tuple[int, ...],
    t: float = 1e-09,
    condition_on_all_features: bool = True,
) -> tuple[Tensor, dict[str, Tensor]]

Impute missing values (np.nan) in X by sampling all cells independently from the trained models.

:param X: Input data of the shape (num_examples, num_features) with missing values encoded as np.nan
:param t: Temperature for sampling from the imputation distribution, lower values are more deterministic
:return: Imputed data, with missing values replaced

init_model_and_get_model_config

init_model_and_get_model_config() -> None

Initialize TabPFN models for use in unsupervised learning.

This function provides compatibility with different TabPFN implementations. It tries to initialize the model using the appropriate method based on the TabPFN implementation in use.

Raises:

Type Description
RuntimeError

If model initialization fails

outliers

outliers(
    X: Tensor | ndarray | DataFrame,
    n_permutations: int = 10,
) -> Tensor

Calculate outlier scores for each sample in the input data.

This is the preferred implementation for outlier detection. It calculates the probability of each sample in X by multiplying the probabilities of its features according to the chain rule of probability. Lower probabilities indicate samples that are more likely to be outliers.

Parameters:

Name Type Description Default
X Tensor | ndarray | DataFrame

Union[torch.Tensor, np.ndarray, pd.DataFrame] Samples to calculate outlier scores for, shape (n_samples, n_features)

required
n_permutations int

int, default=10 Number of permutations to use for more robust probability estimates. Higher values may produce more stable results but increase computation time.

10

Returns:

Type Description
Tensor

torch.Tensor: Tensor of outlier scores (lower values indicate more likely outliers), shape (n_samples,)

Raises:

Type Description
RuntimeError

If the model initialization fails

ValueError

If the input data has incompatible dimensions
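The chain-rule scoring described for this method can be sketched on a toy discrete distribution. The joint table, the helper names, and the choice to average over feature orderings are all assumptions for illustration. In the real model the conditionals come from TabPFN predictions rather than a lookup table.

```python
import itertools
import math

# Toy joint distribution over two binary features; the values are made up.
JOINT = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}


def conditional(sample, target, given):
    """P(sample[target] | sample[i] for i in given), computed from JOINT."""
    def matches(key, idxs):
        return all(key[i] == sample[i] for i in idxs)

    denom = sum(p for key, p in JOINT.items() if matches(key, given))
    numer = sum(p for key, p in JOINT.items() if matches(key, list(given) + [target]))
    return numer / denom


def chain_rule_log_prob(sample, permutation):
    """Sum log-conditionals, each feature conditioned on those before it."""
    return sum(
        math.log(conditional(sample, target, permutation[:pos]))
        for pos, target in enumerate(permutation)
    )


def score(sample, permutations):
    """Average the chain-rule probability over several feature orderings."""
    probs = [math.exp(chain_rule_log_prob(sample, p)) for p in permutations]
    return sum(probs) / len(probs)


perms = list(itertools.permutations(range(2)))
typical = score((0, 0), perms)
unusual = score((0, 1), perms)
```

With exact conditionals every ordering yields the same product, so averaging changes nothing here. With learned, approximate conditionals the orderings can disagree, which is one motivation for using several permutations. The rare combination `(0, 1)` receives the lower score.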

outliers_pdf

outliers_pdf(X: Tensor, n_permutations: int = 10) -> Tensor

Calculate outlier scores based on probability density functions for continuous features.

This method filters out categorical features and only considers numerical features for outlier detection using probability density functions.

Parameters:

Name Type Description Default
X Tensor

Input data tensor

required
n_permutations int

Number of permutations to use for the outlier calculation

10

Returns:

Type Description
Tensor

Tensor of outlier scores (lower values indicate more likely outliers)

outliers_pmf

outliers_pmf(X: Tensor, n_permutations: int = 10) -> Tensor

Calculate outlier scores based on probability mass functions for categorical features.

This method filters out numerical features and only considers categorical features for outlier detection using probability mass functions.

Parameters:

Name Type Description Default
X Tensor

Input data tensor

required
n_permutations int

Number of permutations to use for the outlier calculation

10

Returns:

Type Description
Tensor

Tensor of outlier scores (lower values indicate more likely outliers)

sample_from_model_prediction_

sample_from_model_prediction_(
    column_idx: int,
    X_fit: Tensor,
    model: Any,
    X_predict: Tensor,
    t: float,
) -> tuple[dict[str, Any] | ndarray, Tensor]

Sample values from a model's prediction distribution.

Parameters:

Name Type Description Default
column_idx int

Index of the column being predicted

required
X_fit Tensor

Training data used to determine feature type

required
model Any

The trained model (classifier or regressor)

required
X_predict Tensor

Input data for prediction

required
t float

Temperature parameter for sampling (lower values = more deterministic)

required

Returns:

Type Description
tuple[dict[str, Any] | ndarray, Tensor]

tuple containing:

- The raw prediction output (dictionary for regressors, array for classifiers)
- The sampled values as a tensor

set_categorical_features

set_categorical_features(
    categorical_features: list[int],
) -> None

Set categorical feature indices for the model.

Parameters:

Name Type Description Default
categorical_features list[int]

List of indices of categorical features

required

use_classifier_

use_classifier_(
    column_idx: int, y: Tensor | ndarray
) -> bool

Determine whether to use a classifier or regressor for a feature.

Parameters:

Name Type Description Default
column_idx int

Index of the column to check

required
y Tensor | ndarray

Values of the feature

required

Returns:

Name Type Description
bool bool

True if a classifier should be used, False for a regressor

efficient_random_permutation

efficient_random_permutation(
    indices: list[int], n_permutations: int = 10
) -> list[tuple[int, ...]]

Generate multiple unique random permutations of the given indices.

Parameters:

Name Type Description Default
indices list[int]

List of indices to permute

required
n_permutations int

Number of unique permutations to generate

10

Returns:

Type Description
list[tuple[int, ...]]

List of unique permutations
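A minimal sketch of drawing unique random permutations, using only the standard library. It is not the module's implementation: `unique_random_permutations` is a hypothetical helper, and capping the request at the number of possible orderings is an assumption made here so the loop always terminates.

```python
import math
import random


def unique_random_permutations(indices, n_permutations=10, seed=None):
    """Draw distinct random orderings of `indices` (hypothetical helper)."""
    rng = random.Random(seed)
    # Assumes `indices` holds distinct values, so exactly n! orderings exist.
    limit = min(n_permutations, math.factorial(len(indices)))
    seen = set()
    while len(seen) < limit:
        # random.sample with k == len(indices) returns a shuffled copy.
        seen.add(tuple(rng.sample(indices, len(indices))))
    return list(seen)


perms = unique_random_permutations([0, 1, 2], n_permutations=10, seed=0)
```

Three indices admit only six orderings, so the request for ten permutations returns six. Rejection sampling like this stays cheap when the number of requested permutations is small relative to n!, which is the usual case for feature orderings.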

efficient_random_permutation_

efficient_random_permutation_(
    indices: list[int],
) -> tuple[int, ...]

Generate a single random permutation from the given indices.

Parameters:

Name Type Description Default
indices list[int]

List of indices to permute

required

Returns:

Type Description
tuple[int, ...]

A tuple representing a random permutation of the input indices