unsupervised ¶

This module enables TabPFN to be used for unsupervised learning tasks including missing value imputation, outlier detection, and synthetic data generation. It leverages TabPFN's probabilistic nature to model joint data distributions without training labels.

Key features:

- Missing value imputation with probabilistic sampling
- Outlier detection based on feature-wise probability estimation
- Synthetic data generation with controllable randomness
- Compatibility with both TabPFN and TabPFN-client backends
- Support for mixed data types (categorical and numerical features)
- Flexible permutation-based approach for feature dependencies

Example usage

from tabpfn import TabPFNClassifier, TabPFNRegressor
from tabpfn_extensions.unsupervised import TabPFNUnsupervisedModel

# Create TabPFN models for classification and regression
clf = TabPFNClassifier()
reg = TabPFNRegressor()

# Create the unsupervised model
model = TabPFNUnsupervisedModel(tabpfn_clf=clf, tabpfn_reg=reg)

# Fit the model on data without labels
model.fit(X_train)

# Different unsupervised tasks
X_imputed = model.impute(X_with_missing_values)  # Fill missing values
outlier_scores = model.outliers(X_test)          # Detect outliers
X_synthetic = model.generate_synthetic_data(100)  # Generate new samples
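
The snippet above does not define `X_train`, `X_test`, or `X_with_missing_values`. One way to create placeholder inputs for trying it out is sketched below; the shapes, the random data, and the 20% missing rate are arbitrary assumptions, not values prescribed by the library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder numerical data: 50 training rows and 10 test rows, 4 features each.
X_train = rng.normal(size=(50, 4))
X_test = rng.normal(size=(10, 4))

# Copy the test data and hide roughly 20% of its cells as np.nan,
# which is the missing-value encoding that impute() expects.
X_with_missing_values = X_test.copy()
mask = rng.random(X_with_missing_values.shape) < 0.2
X_with_missing_values[mask] = np.nan
```

Keeping `mask` around makes it possible to compare imputed cells against the original values in `X_test`.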

TabPFNUnsupervisedModel ¶

Bases: BaseEstimator

TabPFN unsupervised model for imputation, outlier detection, and synthetic data generation.

This model combines a TabPFNClassifier for categorical features and a TabPFNRegressor for numerical features to perform various unsupervised learning tasks on tabular data.

Parameters:

Name	Type	Description	Default
`tabpfn_clf`	`TabPFNClassifier`, optional	TabPFNClassifier instance for handling categorical features. If not provided, the model assumes that there are no categorical features in the data.	`None`
`tabpfn_reg`	`TabPFNRegressor`, optional	TabPFNRegressor instance for handling numerical features. If not provided, the model assumes that there are no numerical features in the data.	`None`

Attributes:

Name	Type	Description
`categorical_features`	`list`	List of indices of categorical features in the input data.

Example

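>>> import numpy as np
>>> from tabpfn import TabPFNClassifier, TabPFNRegressor
>>> from tabpfn_extensions.unsupervised import TabPFNUnsupervisedModel
>>>
>>> tabpfn_clf = TabPFNClassifier()
>>> tabpfn_reg = TabPFNRegressor()
>>> model = TabPFNUnsupervisedModel(tabpfn_clf, tabpfn_reg)
>>>
>>> # Use an ndarray, one of the input types documented for fit() and impute()
>>> X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
>>> model.fit(X)
>>>
>>> X_imputed = model.impute(X)
>>> X_outliers = model.outliers(X)
>>> X_synthetic = model.generate_synthetic_data(n_samples=100)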

density_ ¶

density_(
    X_predict: Tensor,
    X_fit: Tensor,
    conditional_idx: list[int],
    column_idx: int,
) -> tuple[Any, Tensor, Tensor]

Generate density predictions for a specific feature based on other features.

This internal method is used by the imputation and outlier detection algorithms to model the conditional probability distribution of one feature given others.

Parameters:

Name	Type	Description	Default
`X_predict`	`Tensor`	Input data for which to make predictions	required
`X_fit`	`Tensor`	Training data to fit the model	required
`conditional_idx`	`list[int]`	Indices of features to condition on	required
`column_idx`	`int`	Index of the feature to predict	required

Returns:

Type	Description
`tuple[Any, Tensor, Tensor]`	Tuple containing, in order: the fitted model (classifier or regressor); the filtered features used for prediction; the target feature values to predict.
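
To make the roles of `conditional_idx` and `column_idx` concrete, the sketch below shows how a data matrix might be split into conditioning features and a target column. It is a minimal NumPy illustration; the helper name `split_conditioning_and_target` is hypothetical, and the actual method additionally fits a model, which is omitted here.

```python
import numpy as np

def split_conditioning_and_target(X, conditional_idx, column_idx):
    """Hypothetical helper: pick the conditioning columns and the target column."""
    features = X[:, conditional_idx]  # columns the prediction is conditioned on
    target = X[:, column_idx]         # column whose distribution is modeled
    return features, target

X = np.arange(12.0).reshape(3, 4)
features, target = split_conditioning_and_target(X, conditional_idx=[0, 2], column_idx=3)
```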

fit ¶

fit(
    X: ndarray | Tensor | DataFrame,
    y: ndarray | Tensor | Series | None = None,
) -> TabPFNUnsupervisedModel

Fit the model to the input data.

Parameters:

Name	Type	Description	Default
`X`	`ndarray \| Tensor \| DataFrame`	Input data to fit the model, shape (n_samples, n_features).	required
`y`	`ndarray \| Tensor \| Series \| None`	Target values, shape (n_samples,). Optional since this is an unsupervised model.	`None`

Returns:

Type	Description
`TabPFNUnsupervisedModel`	Fitted model instance (self).

generate_synthetic_data ¶

generate_synthetic_data(
    n_samples: int = 100,
    t: float = 1.0,
    n_permutations: int = 3,
) -> Tensor

Generate synthetic tabular data samples using the fitted TabPFN models.

This method uses imputation to create synthetic data, starting with a matrix of NaN values and filling in each feature sequentially. Samples are generated feature by feature in a single pass, with each feature conditioned on previously generated features.

Parameters:

Name	Type	Description	Default
`n_samples`	`int`	Number of synthetic samples to generate.	`100`
`t`	`float`	Temperature parameter for sampling; controls randomness. Higher values (e.g., 1.0) produce more diverse samples; lower values (e.g., 0.1) produce more deterministic samples.	`1.0`
`n_permutations`	`int`	Number of feature permutations to use for generation. More permutations may provide more robust results but increase computation time.	`3`

Returns:

Type	Description
`Tensor`	Generated synthetic data of shape (n_samples, n_features)

Raises:

Type	Description
`AssertionError`	If the model is not fitted (self.X_ does not exist)
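
The feature-by-feature procedure described above can be illustrated without TabPFN. The sketch below starts from an all-NaN matrix and fills one column at a time, conditioning each column on the columns already generated. The linear-Gaussian conditional model and the function name `toy_generate` are stand-in assumptions chosen to keep the example self-contained; they are not the library's implementation, and feature permutations are not modeled.

```python
import numpy as np

def toy_generate(X_fit, n_samples=100, t=1.0, seed=0):
    """Toy stand-in: fill an all-NaN matrix column by column, left to right."""
    rng = np.random.default_rng(seed)
    n_features = X_fit.shape[1]
    X_new = np.full((n_samples, n_features), np.nan)
    for j in range(n_features):
        # Design matrices: an intercept plus the columns generated so far.
        A_fit = np.column_stack([np.ones(len(X_fit)), X_fit[:, :j]])
        A_new = np.column_stack([np.ones(n_samples), X_new[:, :j]])
        # Stand-in conditional model: least-squares mean with Gaussian noise.
        coef, *_ = np.linalg.lstsq(A_fit, X_fit[:, j], rcond=None)
        resid_std = np.std(X_fit[:, j] - A_fit @ coef)
        # Here the temperature t scales the noise, so t = 0 is deterministic.
        noise = t * resid_std * rng.standard_normal(n_samples)
        X_new[:, j] = A_new @ coef + noise
    return X_new

X_fit = np.random.default_rng(1).normal(size=(50, 3))
X_synth = toy_generate(X_fit, n_samples=20, t=1.0)
```

With `t=0.0` every generated row collapses to the same conditional-mean vector, which mirrors the "more deterministic" end of the temperature range.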

get_embeddings ¶

get_embeddings(
    X: tensor, per_column: bool = False
) -> tensor

Get the transformer embeddings for the test data X.

Parameters:

Name	Type	Description	Default
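`X`	`tensor`	Test data for which to compute embeddings	required
`per_column`	`bool`	Whether to compute the embeddings column by column, as described under get_embeddings_per_column	`False`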

Returns:

Type	Description
`tensor`	torch.Tensor of shape (n_samples, embedding_dim)

get_embeddings_per_column ¶

get_embeddings_per_column(X: tensor) -> tensor

Alternative implementation of get_embeddings that treats each column in turn as the label, computes the embeddings for each separately, and concatenates the results. This approach needs more passes but might be more accurate.

impute ¶

impute(
    X: Tensor | ndarray | DataFrame,
    t: float = 1e-09,
    n_permutations: int = 10,
) -> Tensor

Impute missing values in the input data using the fitted TabPFN models.

This method fills missing values (np.nan) in the input data by predicting each missing value based on the observed values in the same sample. The imputation uses multiple random feature permutations to improve robustness.

Parameters:

Name	Type	Description	Default
`X`	`Tensor \| ndarray \| DataFrame`	Input data of shape (n_samples, n_features) with missing values encoded as np.nan.	required
`t`	`float`	Temperature for sampling from the imputation distribution. Lower values result in more deterministic imputations, while higher values introduce more randomness.	`1e-09`
`n_permutations`	`int`	Number of random feature permutations to use for imputation. Higher values may improve robustness but increase computation time.	`10`

Returns:

Type	Description
`Tensor`	Imputed data with missing values replaced, of shape (n_samples, n_features).

Note

The model must be fitted with training data before calling this method.

impute_ ¶

impute_(
    X: Tensor,
    t: float = 1e-09,
    n_permutations: int = 10,
    condition_on_all_features: bool = True,
    fast_mode: bool = False,
) -> Tensor

Impute missing values (np.nan) in X by sampling all cells independently from the trained models.

Parameters:

Name	Type	Description	Default
`X`	`Tensor`	Input data of shape (n_samples, n_features) with missing values encoded as np.nan	required
`t`	`float`	Temperature for sampling from the imputation distribution; lower values are more deterministic	`1e-09`
`n_permutations`	`int`	Number of permutations to use for imputation	`10`
`condition_on_all_features`	`bool`	Whether to condition on all other features (True) or only previous features (False)	`True`
`fast_mode`	`bool`	Whether to use faster settings for testing	`False`

Returns:

Type	Description
`Tensor`	Imputed data with missing values replaced
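
The difference between the two settings of `condition_on_all_features` can be sketched as a choice of conditioning columns for the feature at a given position in a permutation. This is one reading of the parameter description above, not code from the library; the helper name `conditioning_columns` is hypothetical.

```python
def conditioning_columns(permutation, position, condition_on_all_features=True):
    """Hypothetical helper: columns used to predict permutation[position]."""
    target = permutation[position]
    if condition_on_all_features:
        # Condition on every other feature, regardless of order.
        return [c for c in permutation if c != target]
    # Condition only on features that come earlier in the permutation.
    return list(permutation[:position])

perm = (2, 0, 1)
all_others = conditioning_columns(perm, position=1)
previous_only = conditioning_columns(perm, position=1, condition_on_all_features=False)
```

For the permutation `(2, 0, 1)` and position 1 (target feature 0), the first call conditions on features 2 and 1, while the second conditions on feature 2 only.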

impute_single_permutation_ ¶

impute_single_permutation_(
    X: Tensor,
    feature_permutation: list[int] | tuple[int, ...],
    t: float = 1e-09,
    condition_on_all_features: bool = True,
) -> tuple[Tensor, dict[str, Tensor]]

Impute missing values (np.nan) in X by sampling all cells independently from the trained models.

Parameters:

Name	Type	Description	Default
`X`	`Tensor`	Input data of shape (num_examples, num_features) with missing values encoded as np.nan	required
`t`	`float`	Temperature for sampling from the imputation distribution; lower values are more deterministic	`1e-09`

Returns:

Type	Description
`tuple[Tensor, dict[str, Tensor]]`	Imputed data, with missing values replaced

init_model_and_get_model_config ¶

init_model_and_get_model_config() -> None

Initialize TabPFN models for use in unsupervised learning.

This function provides compatibility with different TabPFN implementations. It tries to initialize the model using the appropriate method based on the TabPFN implementation in use.

Raises:

Type	Description
`RuntimeError`	If model initialization fails

outliers ¶

outliers(
    X: Tensor | ndarray | DataFrame,
    n_permutations: int = 10,
) -> Tensor

Calculate outlier scores for each sample in the input data.

This is the preferred implementation for outlier detection. It calculates the probability of each sample in X by multiplying the probabilities of its features according to the chain rule of probability. Lower probabilities indicate samples that are more likely to be outliers.

Parameters:

Name	Type	Description	Default
`X`	`Tensor \| ndarray \| DataFrame`	Samples to calculate outlier scores for, shape (n_samples, n_features)	required
`n_permutations`	`int`	Number of permutations to use for more robust probability estimates. Higher values may produce more stable results but increase computation time.	`10`

Returns:

Type	Description
`Tensor`	Tensor of outlier scores (lower values indicate more likely outliers), shape (n_samples,)

Raises:

Type	Description
`RuntimeError`	If the model initialization fails
`ValueError`	If the input data has incompatible dimensions
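
The chain-rule scoring described above can be sketched in NumPy. The example assumes the per-feature conditional probabilities have already been computed for several feature orderings; averaging over orderings is an assumption made for illustration, since the text only states that permutations make the estimate more robust. The function name `chain_rule_scores` is hypothetical.

```python
import numpy as np

def chain_rule_scores(cond_probs):
    """Combine conditional probabilities into one score per sample.

    cond_probs has shape (n_permutations, n_samples, n_features); entry
    [k, i, j] is the probability of the j-th feature (in ordering k) of
    sample i given the features that precede it in that ordering.
    """
    # Chain rule: the joint probability is the product of the conditionals.
    # Summing logs avoids numerical underflow with many features.
    log_joint = np.log(cond_probs).sum(axis=2)
    # Assumption: average the joint estimates across orderings.
    return np.exp(log_joint).mean(axis=0)

cond_probs = np.array([
    [[0.5, 0.5], [0.1, 0.2]],   # ordering 1: two samples, two features
    [[0.5, 0.5], [0.2, 0.1]],   # ordering 2
])
scores = chain_rule_scores(cond_probs)
```

The second sample receives the lower score and is therefore the more likely outlier under this scheme.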

outliers_pdf ¶

outliers_pdf(X: Tensor, n_permutations: int = 10) -> Tensor

Calculate outlier scores based on probability density functions for continuous features.

This method filters out categorical features and only considers numerical features for outlier detection using probability density functions.

Parameters:

Name	Type	Description	Default
`X`	`Tensor`	Input data tensor	required
`n_permutations`	`int`	Number of permutations to use for the outlier calculation	`10`

Returns:

Type	Description
`Tensor`	Tensor of outlier scores (lower values indicate more likely outliers)

outliers_pmf ¶

outliers_pmf(X: Tensor, n_permutations: int = 10) -> Tensor

Calculate outlier scores based on probability mass functions for categorical features.

This method filters out numerical features and only considers categorical features for outlier detection using probability mass functions.

Parameters:

Name	Type	Description	Default
`X`	`Tensor`	Input data tensor	required
`n_permutations`	`int`	Number of permutations to use for the outlier calculation	`10`

Returns:

Type	Description
`Tensor`	Tensor of outlier scores (lower values indicate more likely outliers)

sample_from_model_prediction_ ¶

sample_from_model_prediction_(
    column_idx: int,
    X_fit: Tensor,
    model: Any,
    X_predict: Tensor,
    t: float,
) -> tuple[dict[str, Any] | ndarray, Tensor]

Sample values from a model's prediction distribution.

Parameters:

Name	Type	Description	Default
`column_idx`	`int`	Index of the column being predicted	required
`X_fit`	`Tensor`	Training data used to determine feature type	required
`model`	`Any`	The trained model (classifier or regressor)	required
`X_predict`	`Tensor`	Input data for prediction	required
`t`	`float`	Temperature parameter for sampling (lower values = more deterministic)	required

Returns:

Type	Description
`tuple[dict[str, Any] \| ndarray, Tensor]`	Tuple containing, in order: the raw prediction output (dictionary for regressors, array for classifiers); the sampled values as a tensor.
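
For categorical predictions, one common way to apply a temperature is to divide the log-probabilities by `t` and renormalize before sampling. The sketch below illustrates that technique in NumPy; the source does not specify how the library applies `t`, so treat this as an assumption. The function name `sample_with_temperature` is hypothetical.

```python
import numpy as np

def sample_with_temperature(probs, t, rng):
    """Sample one class index per row of `probs` after temperature scaling."""
    # Divide log-probabilities by t: small t sharpens, large t flattens.
    logits = np.log(np.clip(probs, 1e-12, None)) / t
    logits -= logits.max(axis=1, keepdims=True)  # stabilize the exponent
    scaled = np.exp(logits)
    scaled /= scaled.sum(axis=1, keepdims=True)
    # Inverse-CDF sampling: count how many CDF entries lie at or below u.
    cdf = scaled.cumsum(axis=1)
    u = rng.random((len(probs), 1))
    idx = (u >= cdf).sum(axis=1)
    return np.minimum(idx, probs.shape[1] - 1)  # guard against rounding

probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]])
rng = np.random.default_rng(0)
near_deterministic = sample_with_temperature(probs, t=1e-9, rng=rng)
diverse = sample_with_temperature(probs, t=1.0, rng=rng)
```

At a temperature as low as the `1e-09` default used by the imputation methods, this scheme always picks the most probable class in each row.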

set_categorical_features ¶

set_categorical_features(
    categorical_features: list[int],
) -> None

Set categorical feature indices for the model.

Parameters:

Name	Type	Description	Default
`categorical_features`	`list[int]`	List of indices of categorical features	required

use_classifier_ ¶

use_classifier_(
    column_idx: int, y: Tensor | ndarray
) -> bool

Determine whether to use a classifier or regressor for a feature.

Parameters:

Name	Type	Description	Default
`column_idx`	`int`	Index of the column to check	required
`y`	`Tensor \| ndarray`	Values of the feature	required

Returns:

Type	Description
`bool`	True if a classifier should be used, False for a regressor

efficient_random_permutation ¶

efficient_random_permutation(
    indices: list[int], n_permutations: int = 10
) -> list[tuple[int, ...]]

Generate multiple unique random permutations of the given indices.

Parameters:

Name	Type	Description	Default
`indices`	`list[int]`	List of indices to permute	required
`n_permutations`	`int`	Number of unique permutations to generate	`10`

Returns:

Type	Description
`list[tuple[int, ...]]`	List of unique permutations
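
A simple way to obtain multiple unique random permutations is rejection sampling into a set, as sketched below with the standard library. This illustrates the idea only and is not necessarily how the library implements it; the function name `unique_random_permutations` and the `seed` argument are hypothetical.

```python
import math
import random

def unique_random_permutations(indices, n_permutations=10, seed=None):
    """Draw distinct random orderings of `indices` (assumed to be distinct)."""
    rng = random.Random(seed)
    # Only len(indices)! distinct orderings exist, so cap the request
    # to keep the loop below from running forever.
    target = min(n_permutations, math.factorial(len(indices)))
    seen = set()
    while len(seen) < target:
        # random.sample with k == len(indices) returns a shuffled copy.
        seen.add(tuple(rng.sample(indices, len(indices))))
    return list(seen)

perms = unique_random_permutations([0, 1, 2], n_permutations=10, seed=0)
```

With three indices only six orderings exist, so the call returns six permutations even though ten were requested.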

efficient_random_permutation_ ¶

efficient_random_permutation_(
    indices: list[int],
) -> tuple[int, ...]

Generate a single random permutation from the given indices.

Parameters:

Name	Type	Description	Default
`indices`	`list[int]`	List of indices to permute	required

Returns:

Type	Description
`tuple[int, ...]`	A tuple representing a random permutation of the input indices