unsupervised
This module enables TabPFN to be used for unsupervised learning tasks including missing value imputation, outlier detection, and synthetic data generation. It leverages TabPFN's probabilistic nature to model joint data distributions without training labels.
Key features:
- Missing value imputation with probabilistic sampling
- Outlier detection based on feature-wise probability estimation
- Synthetic data generation with controllable randomness
- Compatibility with both TabPFN and TabPFN-client backends
- Support for mixed data types (categorical and numerical features)
- Flexible permutation-based approach for feature dependencies
Example usage
from tabpfn import TabPFNClassifier, TabPFNRegressor
from tabpfn_extensions.unsupervised import TabPFNUnsupervisedModel
# Create TabPFN models for classification and regression
clf = TabPFNClassifier()
reg = TabPFNRegressor()
# Create the unsupervised model
model = TabPFNUnsupervisedModel(tabpfn_clf=clf, tabpfn_reg=reg)
# Fit the model on data without labels
model.fit(X_train)
# Different unsupervised tasks
X_imputed = model.impute(X_with_missing_values) # Fill missing values
outlier_scores = model.outliers(X_test) # Detect outliers
X_synthetic = model.generate_synthetic_data(100) # Generate new samples
TabPFNUnsupervisedModel
Bases: BaseEstimator
TabPFN unsupervised model for imputation, outlier detection, and synthetic data generation.
This model combines a TabPFNClassifier for categorical features and a TabPFNRegressor for numerical features to perform various unsupervised learning tasks on tabular data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tabpfn_clf | TabPFNClassifier, optional | TabPFNClassifier instance for handling categorical features. If not provided, the model assumes that there are no categorical features in the data. | None |
tabpfn_reg | TabPFNRegressor, optional | TabPFNRegressor instance for handling numerical features. If not provided, the model assumes that there are no numerical features in the data. | None |
Attributes:
Name | Type | Description |
---|---|---|
categorical_features | list | List of indices of categorical features in the input data. |
>>> tabpfn_clf = TabPFNClassifier()
>>> tabpfn_reg = TabPFNRegressor()
>>> model = TabPFNUnsupervisedModel(tabpfn_clf, tabpfn_reg)
>>>
>>> X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
>>> model.fit(X)
>>>
>>> X_imputed = model.impute(X)
>>> X_outliers = model.outliers(X)
>>> X_synthetic = model.generate_synthetic_data(n_samples=100)
density_
density_(
X_predict: Tensor,
X_fit: Tensor,
conditional_idx: list[int],
column_idx: int,
) -> tuple[Any, Tensor, Tensor]
Generate density predictions for a specific feature based on other features.
This internal method is used by the imputation and outlier detection algorithms to model the conditional probability distribution of one feature given others.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X_predict | Tensor | Input data for which to make predictions | required |
X_fit | Tensor | Training data to fit the model | required |
conditional_idx | list[int] | Indices of features to condition on | required |
column_idx | int | Index of the feature to predict | required |
Returns:
Type | Description |
---|---|
tuple[Any, Tensor, Tensor] | Tuple containing the fitted model (classifier or regressor), the filtered features used for prediction, and the target feature values to predict |
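The conditioning described for `density_` can be pictured as a column split. The following is a minimal sketch, not the library's implementation: `split_conditional` is a hypothetical helper that works on plain Python lists instead of tensors, and it only shows how `conditional_idx` and `column_idx` might select the model inputs and the target.

```python
def split_conditional(rows, conditional_idx, column_idx):
    """Hypothetical helper: split rows into conditioning features and a target column."""
    # Keep only the columns we condition on, in the order given.
    features = [[row[i] for i in conditional_idx] for row in rows]
    # The column being modelled becomes the prediction target.
    target = [row[column_idx] for row in rows]
    return features, target


X_fit = [
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 1.0],
    [3.0, 30.0, 0.0],
]

# Model feature 2 conditioned on features 0 and 1.
features, target = split_conditional(X_fit, conditional_idx=[0, 1], column_idx=2)
print(features)
print(target)
```

A classifier or regressor would then be fitted on `features` and `target`, depending on whether the target column is categorical or numerical.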
fit
fit(
X: ndarray | Tensor | DataFrame,
y: ndarray | Tensor | Series | None = None,
) -> TabPFNUnsupervisedModel
Fit the model to the input data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | ndarray \| Tensor \| DataFrame | Input data to fit the model, shape (n_samples, n_features). | required |
y | ndarray \| Tensor \| Series \| None | Target values, shape (n_samples,). Optional since this is an unsupervised model. | None |
Returns:
Type | Description |
---|---|
TabPFNUnsupervisedModel | Fitted model instance (self). |
generate_synthetic_data
Generate synthetic tabular data samples using the fitted TabPFN models.
This method uses imputation to create synthetic data, starting with a matrix of NaN values and filling in each feature sequentially. Samples are generated feature by feature in a single pass, with each feature conditioned on previously generated features.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_samples | int | Number of synthetic samples to generate. | 100 |
t | float | Temperature parameter for sampling. Controls randomness: higher values (e.g., 1.0) produce more diverse samples, while lower values (e.g., 0.1) produce more deterministic samples. | 1.0 |
n_permutations | int | Number of feature permutations to use for generation. More permutations may provide more robust results but increase computation time. | 3 |
Returns:
Type | Description |
---|---|
Tensor | Generated synthetic data of shape (n_samples, n_features) |
Raises:
Type | Description |
---|---|
AssertionError | If the model is not fitted (self.X_ does not exist) |
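The feature-by-feature generation described for `generate_synthetic_data` can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: `toy_conditional_sample` is a hypothetical stand-in for the fitted TabPFN models (it simply draws Gaussian noise around the mean of the values already generated in the row), and `generate_rows` is a hypothetical name.

```python
import math
import random


def toy_conditional_sample(row, filled_idx, rng):
    """Stand-in for a conditional model (NOT TabPFN): Gaussian noise around
    the mean of the values already generated for this row."""
    if not filled_idx:
        return rng.gauss(0.0, 1.0)
    center = sum(row[i] for i in filled_idx) / len(filled_idx)
    return rng.gauss(center, 1.0)


def generate_rows(n_samples, feature_order, rng):
    n_features = len(feature_order)
    # Start from an all-NaN matrix and fill it one feature at a time.
    rows = [[math.nan] * n_features for _ in range(n_samples)]
    filled_idx = []
    for column_idx in feature_order:
        for row in rows:
            row[column_idx] = toy_conditional_sample(row, filled_idx, rng)
        # Later features are conditioned on everything generated so far.
        filled_idx.append(column_idx)
    return rows


rng = random.Random(0)
synthetic = generate_rows(n_samples=5, feature_order=[2, 0, 1], rng=rng)
for row in synthetic:
    print(row)
```

In this sketch the first feature in the ordering is drawn unconditionally, and every later feature sees only the columns filled before it, which mirrors the single-pass description above.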
get_embeddings
Get the transformer embeddings for the test data X.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | tensor | | required |
Returns:
Type | Description |
---|---|
tensor | torch.Tensor of shape (n_samples, embedding_dim) |
get_embeddings_per_column
Alternative implementation of get_embeddings that treats each column in turn as the label, computes the embeddings separately, and concatenates the results. This approach needs more passes but might be more accurate.
impute
Impute missing values in the input data using the fitted TabPFN models.
This method fills missing values (np.nan) in the input data by predicting each missing value based on the observed values in the same sample. The imputation uses multiple random feature permutations to improve robustness.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Tensor \| ndarray \| DataFrame | Input data of shape (n_samples, n_features) with missing values encoded as np.nan. | required |
t | float | Temperature for sampling from the imputation distribution. Lower values result in more deterministic imputations, while higher values introduce more randomness. | 1e-09 |
n_permutations | int | Number of random feature permutations to use for imputation. Higher values may improve robustness but increase computation time. | 10 |
Returns:
Type | Description |
---|---|
Tensor | Imputed data with missing values replaced, of shape (n_samples, n_features). |
Note
The model must be fitted with training data before calling this method.
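To make the idea of predicting a missing cell from the observed values in the same sample concrete, here is a deliberately simple sketch. It swaps in a nearest-neighbour lookup for TabPFN's conditional prediction, so it is an assumption-laden illustration only; `nearest_row` and `impute_nearest` are hypothetical names that do not exist in the library.

```python
import math


def nearest_row(sample, train_rows):
    """Find the training row closest to `sample` on its observed features."""
    observed = [j for j, value in enumerate(sample) if not math.isnan(value)]

    def distance(row):
        return sum((row[j] - sample[j]) ** 2 for j in observed)

    return min(train_rows, key=distance)


def impute_nearest(rows, train_rows):
    """Replace only NaN cells; observed values are returned unchanged."""
    imputed = []
    for sample in rows:
        donor = nearest_row(sample, train_rows)
        imputed.append(
            [donor[j] if math.isnan(value) else value for j, value in enumerate(sample)]
        )
    return imputed


X_train = [[1.0, 10.0], [3.0, 30.0]]
X_missing = [[math.nan, 28.0], [1.2, math.nan]]
X_imputed = impute_nearest(X_missing, X_train)
print(X_imputed)
```

The point of the sketch is the contract rather than the predictor: fitted data supplies the knowledge, observed cells drive the prediction, and only cells encoded as NaN are overwritten.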
impute_
impute_(
X: Tensor,
t: float = 1e-09,
n_permutations: int = 10,
condition_on_all_features: bool = True,
fast_mode: bool = False,
) -> Tensor
Impute missing values (np.nan) in X by sampling all cells independently from the trained models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Tensor | Input data of shape (n_samples, n_features) with missing values encoded as np.nan | required |
t | float | Temperature for sampling from the imputation distribution; lower values are more deterministic | 1e-09 |
n_permutations | int | Number of permutations to use for imputation | 10 |
condition_on_all_features | bool | Whether to condition on all other features (True) or only previous features (False) | True |
fast_mode | bool | Whether to use faster settings for testing | False |
Returns:
Type | Description |
---|---|
Tensor | Imputed data with missing values replaced |
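The difference between the two settings of `condition_on_all_features` is easiest to see by listing the conditioning set for each target feature. The sketch below is one reading of the parameter description, not the library's implementation; `conditioning_indices` is a hypothetical helper.

```python
def conditioning_indices(feature_permutation, position, condition_on_all_features):
    """Hypothetical helper: which features does the target at `position` see?"""
    target = feature_permutation[position]
    if condition_on_all_features:
        # Every feature except the one being predicted.
        return sorted(i for i in feature_permutation if i != target)
    # Only features that come earlier in this permutation.
    return list(feature_permutation[:position])


permutation = (2, 0, 1)
for position, target in enumerate(permutation):
    print(
        target,
        conditioning_indices(permutation, position, True),
        conditioning_indices(permutation, position, False),
    )
```

With `condition_on_all_features=False`, the first feature in a permutation is predicted from no other features at all, which is why the choice of permutation matters more in that mode.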
impute_single_permutation_
impute_single_permutation_(
X: Tensor,
feature_permutation: list[int] | tuple[int, ...],
t: float = 1e-09,
condition_on_all_features: bool = True,
) -> tuple[Tensor, dict[str, Tensor]]
Impute missing values (np.nan) in X by sampling all cells independently from the trained models.
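Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Tensor | Input data of shape (num_examples, num_features) with missing values encoded as np.nan | required |
t | float | Temperature for sampling from the imputation distribution; lower values are more deterministic | 1e-09 |
Returns:
Type | Description |
---|---|
tuple[Tensor, dict[str, Tensor]] | Imputed data, with missing values replaced |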
init_model_and_get_model_config
Initialize TabPFN models for use in unsupervised learning.
This function provides compatibility with different TabPFN implementations. It tries to initialize the model using the appropriate method based on the TabPFN implementation in use.
Raises:
Type | Description |
---|---|
RuntimeError | If model initialization fails |
outliers
Calculate outlier scores for each sample in the input data.
This is the preferred implementation for outlier detection, which calculates sample probability for each sample in X by multiplying the probabilities of each feature according to chain rule of probability. Lower probabilities indicate samples that are more likely to be outliers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Tensor \| ndarray \| DataFrame | Samples to calculate outlier scores for, shape (n_samples, n_features) | required |
n_permutations | int | Number of permutations to use for more robust probability estimates. Higher values may produce more stable results but increase computation time. | 10 |
Returns:
Type | Description |
---|---|
Tensor | Tensor of outlier scores (lower values indicate more likely outliers), shape (n_samples,) |
Raises:
Type | Description |
---|---|
RuntimeError | If the model initialization fails |
ValueError | If the input data has incompatible dimensions |
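The chain-rule scoring used by `outliers` can be illustrated with a toy model. This is a sketch under stated assumptions: `toy_conditional_prob` is a hypothetical stand-in for the fitted TabPFN models, and averaging the per-ordering probabilities is one plausible way to combine permutations, not necessarily what the library does.

```python
import math


def joint_log_prob(sample, order, conditional_prob):
    """Chain rule: log p(x) = sum of log p(x_i | features earlier in `order`)."""
    log_p = 0.0
    seen = {}
    for idx in order:
        log_p += math.log(conditional_prob(idx, sample[idx], seen))
        seen[idx] = sample[idx]
    return log_p


def toy_conditional_prob(idx, value, seen):
    """Stand-in for a fitted model (NOT TabPFN): two binary features that usually agree."""
    other = seen.get(1 - idx)
    if other is None:
        return 0.5  # marginal probability of either feature
    return 0.9 if value == other else 0.1


def sample_probability(sample, orders):
    """Average the chain-rule probability over several feature orderings."""
    probs = [
        math.exp(joint_log_prob(sample, order, toy_conditional_prob))
        for order in orders
    ]
    return sum(probs) / len(probs)


orders = [(0, 1), (1, 0)]
typical = sample_probability([1, 1], orders)  # features agree
unusual = sample_probability([1, 0], orders)  # features disagree
print(typical, unusual)
```

The sample whose features disagree receives the lower probability, matching the convention that lower scores indicate more likely outliers. Summing log-probabilities rather than multiplying raw probabilities avoids underflow when there are many features.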
outliers_pdf
Calculate outlier scores based on probability density functions for continuous features.
This method filters out categorical features and only considers numerical features for outlier detection using probability density functions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Tensor | Input data tensor | required |
n_permutations | int | Number of permutations to use for the outlier calculation | 10 |
Returns:
Type | Description |
---|---|
Tensor | Tensor of outlier scores (lower values indicate more likely outliers) |
outliers_pmf
Calculate outlier scores based on probability mass functions for categorical features.
This method filters out numerical features and only considers categorical features for outlier detection using probability mass functions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Tensor | Input data tensor | required |
n_permutations | int | Number of permutations to use for the outlier calculation | 10 |
Returns:
Type | Description |
---|---|
Tensor | Tensor of outlier scores (lower values indicate more likely outliers) |
sample_from_model_prediction_
sample_from_model_prediction_(
column_idx: int,
X_fit: Tensor,
model: Any,
X_predict: Tensor,
t: float,
) -> tuple[dict[str, Any] | ndarray, Tensor]
Sample values from a model's prediction distribution.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_idx | int | Index of the column being predicted | required |
X_fit | Tensor | Training data used to determine feature type | required |
model | Any | The trained model (classifier or regressor) | required |
X_predict | Tensor | Input data for prediction | required |
t | float | Temperature parameter for sampling (lower values = more deterministic) | required |
Returns:
Type | Description |
---|---|
tuple[dict[str, Any] \| ndarray, Tensor] | Tuple containing the raw prediction output (dictionary for regressors, array for classifiers) and the sampled values as a tensor |
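The effect of the temperature `t` can be sketched for the categorical case. The scheme below (raising each probability to the power `1/t` and renormalising) is a common convention and is assumed here for illustration; the document does not state the exact formula the library uses, and `apply_temperature` and `sample_class` are hypothetical names.

```python
import math
import random


def apply_temperature(probs, t):
    """Rescale a categorical distribution to p_i ** (1 / t), renormalised.

    Assumes every probability is strictly positive.
    """
    logits = [math.log(p) / t for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_class(probs, t, rng):
    """Draw one class index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, t)
    return rng.choices(range(len(probs)), weights=adjusted, k=1)[0]


probs = [0.6, 0.3, 0.1]
rng = random.Random(0)
unchanged = apply_temperature(probs, t=1.0)
sharp = apply_temperature(probs, t=1e-9)
draws = [sample_class(probs, 1e-9, rng) for _ in range(20)]
print(unchanged, sharp, draws)
```

At `t=1.0` the distribution is left as predicted, while a very small value such as the imputation default of `1e-09` collapses it onto the most likely class, which is why imputation is effectively deterministic by default.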
set_categorical_features
Set categorical feature indices for the model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
categorical_features | list[int] | List of indices of categorical features | required |
use_classifier_
Determine whether to use a classifier or regressor for a feature.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column_idx | int | Index of the column to check | required |
y | Tensor \| ndarray | Values of the feature | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if a classifier should be used, False for a regressor |
efficient_random_permutation
efficient_random_permutation(
indices: list[int], n_permutations: int = 10
) -> list[tuple[int, ...]]
Generate multiple unique random permutations of the given indices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices | list[int] | List of indices to permute | required |
n_permutations | int | Number of unique permutations to generate | 10 |
Returns:
Type | Description |
---|---|
list[tuple[int, ...]] | List of unique permutations |
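One way to obtain several unique random permutations is to keep shuffling until enough distinct orderings have been collected. The sketch below shows that idea only; it is not the library's implementation, and `unique_random_permutations` and its `seed` argument are hypothetical.

```python
import math
import random


def unique_random_permutations(indices, n_permutations=10, seed=None):
    """Sketch: draw random shuffles until enough distinct ones are collected."""
    rng = random.Random(seed)
    # Never ask for more permutations than exist (n! for n indices).
    limit = min(n_permutations, math.factorial(len(indices)))
    seen = set()
    while len(seen) < limit:
        # random.sample returns a shuffled copy and leaves `indices` untouched.
        seen.add(tuple(rng.sample(indices, len(indices))))
    return list(seen)


perms = unique_random_permutations([0, 1, 2], n_permutations=4, seed=0)
print(perms)
```

Capping the request at `n!` keeps the loop from running forever when few features are available. Rejection sampling of this kind is cheap as long as the requested count is small relative to the number of possible orderings.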
efficient_random_permutation_
Generate a single random permutation from the given indices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices | list[int] | List of indices to permute | required |
Returns:
Type | Description |
---|---|
tuple[int, ...] | A tuple representing a random permutation of the input indices |