many_class_classifier

Development Notebook: https://colab.research.google.com/drive/1HWF5IF0IN21G8FZdLVwBbLBkCMu94yBA?usp=sharing

This module provides a classifier that overcomes TabPFN's limitation on the number of classes (typically 10) by using a meta-classifier approach based on output coding. It works by breaking down multi-class problems into multiple sub-problems, each within TabPFN's class limit.

This version stays close to the original structural design, with key improvements in codebook generation and a custom validate_data function for scikit-learn compatibility.

Key features (compared to a very basic output coder):

- Improved codebook generation: uses a strategy that attempts to balance how often each class is explicitly represented and guarantees that every class is covered.
- Codebook statistics: optionally prints statistics about the generated codebook.
- Custom validate_data: improves data-validation compatibility across scikit-learn versions.
- Robustness: minor changes for better scikit-learn compatibility (e.g., ensuring the wrapper is properly "fitted", setting n_features_in_).

Original structural aspects retained:

- Fitting of base estimators for sub-problems largely occurs during predict_proba calls.
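To make the output-coding idea concrete, here is a minimal illustrative sketch of codebook generation, not the library's actual implementation: each sub-problem relabels the original classes into at most `alphabet_size` symbols, with symbol usage kept balanced per row. The helper name `make_codebook` is hypothetical.

```python
import numpy as np

def make_codebook(n_classes: int, alphabet_size: int, n_estimators: int, rng=None):
    """Illustrative codebook: row i assigns every original class a symbol in
    [0, alphabet_size) for sub-problem i."""
    rng = np.random.default_rng(rng)
    code_book = np.empty((n_estimators, n_classes), dtype=int)
    for i in range(n_estimators):
        # Shuffle the classes, then deal them round-robin into alphabet_size
        # buckets so symbol counts within each sub-problem differ by at most 1.
        perm = rng.permutation(n_classes)
        code_book[i, perm] = np.arange(n_classes) % alphabet_size
    return code_book

codebook = make_codebook(n_classes=15, alphabet_size=10, n_estimators=6, rng=0)
# Sub-problem i then trains on relabeled targets: y_sub = codebook[i, y]
```

Each row defines one sub-problem that is within the base estimator's class limit; jointly, the rows give every original class a distinguishable codeword.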

Example usage
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from tabpfn import TabPFNClassifier  # Assuming TabPFN is installed
from tabpfn_extensions.many_class import ManyClassClassifier

# Create synthetic data with many classes
n_classes_total = 15 # TabPFN might struggle with >10 if not configured
X, y = make_classification(n_samples=300, n_features=20, n_informative=15,
                           n_redundant=0, n_classes=n_classes_total,
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42,
                                                    stratify=y)

# Create a TabPFN base classifier
# Adjust N_ensemble_configurations and device as needed/available
# TabPFN's default class limit is often 10 for the public model.
base_clf = TabPFNClassifier(device='cpu', N_ensemble_configurations=4)

# Wrap it with ManyClassClassifier
many_class_clf = ManyClassClassifier(
    estimator=base_clf,
    alphabet_size=10, # Max classes the base_clf sub-problems will handle
                      # This should align with TabPFN's actual capability.
    n_estimators_redundancy=3,
    random_state=42,
    log_proba_aggregation=True,
    verbose=1 # Print codebook stats
)

# Use like any scikit-learn classifier
many_class_clf.fit(X_train, y_train)
y_pred = many_class_clf.predict(X_test)
y_proba = many_class_clf.predict_proba(X_test)

print(f"Prediction shape: {y_pred.shape}")
print(f"Probability shape: {y_proba.shape}")
if hasattr(many_class_clf, 'codebook_stats_'):
    print(f"Codebook Stats: {many_class_clf.codebook_stats_}")

ManyClassClassifier

Bases: BaseEstimator, ClassifierMixin

Output-Code multiclass strategy to extend classifiers beyond their class limit.

This version adheres closely to an original structural design, with key improvements in codebook generation and using a custom validate_data function for scikit-learn compatibility. Fitting for sub-problems primarily occurs during prediction.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| estimator | BaseEstimator | A classifier implementing fit() and predict_proba() methods. | required |
| alphabet_size | int | Maximum number of classes the base estimator can handle. If None, attempts to infer from estimator.max_num_classes_. | None |
| n_estimators | int | Number of base estimators (sub-problems). If None, calculated based on other parameters. | None |
| n_estimators_redundancy | int | Redundancy factor for auto-calculated n_estimators. | 4 |
| random_state | int, RandomState instance or None | Controls randomization for codebook generation. | None |
| verbose | int | Controls verbosity. If > 0, prints codebook stats. | 0 |
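As a rough intuition for how n_estimators might be auto-calculated, the following hedged sketch (the library's actual formula may differ, and `estimate_n_estimators` is a hypothetical name) uses the information-theoretic minimum number of sub-problems needed to give every class a distinct codeword, scaled by the redundancy factor:

```python
import math

def estimate_n_estimators(n_classes: int, alphabet_size: int, redundancy: int = 4) -> int:
    # At least ceil(log_alphabet(n_classes)) sub-problems are needed so that
    # the codewords can distinguish all classes; the redundancy factor adds
    # error tolerance by repeating that minimum several times over.
    minimum = math.ceil(math.log(n_classes) / math.log(alphabet_size))
    return max(1, minimum) * redundancy
```

For example, 15 classes with an alphabet of 10 needs at least 2 sub-problems, so a redundancy factor of 4 would give 8 base estimators.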

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| classes_ | ndarray | Unique target labels. |
| code_book_ | ndarray or None | Generated codebook if mapping is needed. |
| codebook_stats_ | dict | Statistics about the generated codebook. |
| estimators_ | list or None | Stores the single fitted base estimator only if no_mapping_needed_ is True. |
| no_mapping_needed_ | bool | True if n_classes <= alphabet_size. |
| classes_index_ | dict or None | Maps class labels to indices. |
| X_train | ndarray or None | Stored training features if mapping is needed. |
| Y_train_per_estimator | ndarray or None | Encoded training labels for each sub-problem. Shape (n_estimators, n_samples). |
| n_features_in_ | int | Number of features seen during fit. |
| feature_names_in_ | ndarray or None | Names of features seen during fit. |

Examples:

>>> from sklearn.datasets import load_iris
>>> from tabpfn import TabPFNClassifier
>>> from tabpfn_extensions.many_class import ManyClassClassifier
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> base_clf = TabPFNClassifier()
>>> many_clf = ManyClassClassifier(base_clf, alphabet_size=base_clf.max_num_classes_)
>>> many_clf.fit(X_train, y_train)
>>> y_pred = many_clf.predict(X_test)

codebook_statistics_ property

codebook_statistics_

Returns statistics about the generated codebook.

fit

fit(X, y, **fit_params) -> ManyClassClassifier

Prepare classifier using custom validate_data. Actual fitting of sub-estimators happens in predict_proba if mapping is needed.

predict

predict(X) -> ndarray

Predict multi-class targets for X.

predict_proba

predict_proba(X) -> ndarray

Predict class probabilities for X. Sub-estimators are fitted here if mapping is used.
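To illustrate how per-sub-problem probabilities can be decoded back into probabilities over the original classes, here is a minimal sketch under stated assumptions, not the library's exact aggregation rule: given a codebook of shape (n_estimators, n_classes) and one (n_samples, alphabet_size) probability array per sub-problem, it averages the log-probability each sub-problem assigns to a class's symbol, then renormalizes. The name `decode_proba` is hypothetical.

```python
import numpy as np

def decode_proba(sub_probas, code_book, eps=1e-9):
    """Average log-probabilities of each class's assigned symbol across
    sub-problems, then renormalize into a proper distribution."""
    n_samples = sub_probas[0].shape[0]
    n_estimators, n_classes = code_book.shape
    log_scores = np.zeros((n_samples, n_classes))
    for i in range(n_estimators):
        # code_book[i] maps each original class to its symbol in sub-problem i,
        # so this fancy-indexing picks P(symbol of class c | x) for every c.
        log_scores += np.log(sub_probas[i][:, code_book[i]] + eps)
    proba = np.exp(log_scores / n_estimators)
    return proba / proba.sum(axis=1, keepdims=True)
```

Aggregating in log space (cf. the log_proba_aggregation flag in the usage example) behaves like a geometric mean of the sub-problem votes, which down-weights classes that any sub-problem deems unlikely.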

set_categorical_features

set_categorical_features(
    categorical_features: list[int],
) -> None

Attempts to set categorical features on the base estimator.