many_class_classifier

Development Notebook: https://colab.research.google.com/drive/1HWF5IF0IN21G8FZdLVwBbLBkCMu94yBA?usp=sharing

This module provides a classifier that overcomes TabPFN's limitation on the number of classes (typically 10) by using a meta-classifier approach based on output coding. It works by breaking down multi-class problems into multiple sub-problems, each within TabPFN's class limit.

This version stays close to the original structural design, with key improvements in codebook generation and a custom validate_data function for scikit-learn compatibility.

Key features (compared to a very basic output coder):

- Improved codebook generation: uses a strategy that attempts to balance how often each class is explicitly represented and guarantees that every class is covered.
- Codebook statistics: optionally prints statistics about the generated codebook.
- Custom validate_data: improves data-validation compatibility across scikit-learn versions.
- Robustness: minor changes for better scikit-learn compatibility (e.g., ensuring the wrapper is properly "fitted", setting n_features_in_).

Original structural aspects retained:

- Fitting of base estimators for sub-problems largely occurs during predict_proba calls.
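To make the output-coding idea concrete, here is a minimal illustrative sketch of codebook generation, not the library's actual implementation: each sub-problem relabels the original classes into at most `alphabet_size` symbols, with symbol usage kept balanced per row. The helper name `make_codebook` is hypothetical.

```python
import numpy as np

def make_codebook(n_classes: int, alphabet_size: int, n_estimators: int, rng=None):
    """Illustrative codebook: row i assigns every original class a symbol in
    [0, alphabet_size) for sub-problem i."""
    rng = np.random.default_rng(rng)
    code_book = np.empty((n_estimators, n_classes), dtype=int)
    for i in range(n_estimators):
        # Shuffle the classes, then deal them round-robin into alphabet_size
        # buckets so symbol counts within each sub-problem differ by at most 1.
        perm = rng.permutation(n_classes)
        code_book[i, perm] = np.arange(n_classes) % alphabet_size
    return code_book

codebook = make_codebook(n_classes=15, alphabet_size=10, n_estimators=6, rng=0)
# Sub-problem i then trains on relabeled targets: y_sub = codebook[i, y]
```

Each row defines one sub-problem that is within the base estimator's class limit; jointly, the rows give every original class a distinguishable codeword.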

Example usage
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from tabpfn import TabPFNClassifier  # Assuming TabPFN is installed
from tabpfn_extensions.many_class import ManyClassClassifier

# Create synthetic data with many classes
n_classes_total = 15 # TabPFN might struggle with >10 if not configured
X, y = make_classification(n_samples=300, n_features=20, n_informative=15,
                           n_redundant=0, n_classes=n_classes_total,
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42,
                                                    stratify=y)

# Create a TabPFN base classifier
# Adjust N_ensemble_configurations and device as needed/available
# TabPFN's default class limit is often 10 for the public model.
base_clf = TabPFNClassifier(device='cpu', N_ensemble_configurations=4)

# Wrap it with ManyClassClassifier
many_class_clf = ManyClassClassifier(
    estimator=base_clf,
    alphabet_size=10, # Max classes the base_clf sub-problems will handle
                      # This should align with TabPFN's actual capability.
    n_estimators_redundancy=3,
    random_state=42,
    log_proba_aggregation=True,
    verbose=1 # Print codebook stats
)

# Use like any scikit-learn classifier
many_class_clf.fit(X_train, y_train)
y_pred = many_class_clf.predict(X_test)
y_proba = many_class_clf.predict_proba(X_test)

print(f"Prediction shape: {y_pred.shape}")
print(f"Probability shape: {y_proba.shape}")
if hasattr(many_class_clf, 'codebook_stats_'):
    print(f"Codebook Stats: {many_class_clf.codebook_stats_}")

ManyClassClassifier

Bases: BaseEstimator, ClassifierMixin

Output-Code multiclass strategy to extend classifiers beyond their class limit.

This version adheres closely to an original structural design, with key improvements in codebook generation and using a custom validate_data function for scikit-learn compatibility. Fitting for sub-problems primarily occurs during prediction.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| estimator | BaseEstimator | A classifier implementing fit() and predict_proba() methods. | required |
| alphabet_size | int | Maximum number of classes the base estimator can handle. If None, attempts to infer from estimator.max_num_classes_. | None |
| n_estimators | int | Number of base estimators (sub-problems). If None, calculated based on other parameters. | None |
| n_estimators_redundancy | int | Redundancy factor for auto-calculated n_estimators. | 4 |
| random_state | int, RandomState instance or None | Controls randomization for codebook generation. | None |
| verbose | int | Controls verbosity. If > 0, prints codebook stats. | 0 |
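As a rough intuition for how n_estimators might be auto-calculated, the following hedged sketch (the library's actual formula may differ, and `estimate_n_estimators` is a hypothetical name) uses the information-theoretic minimum number of sub-problems needed to give every class a distinct codeword, scaled by the redundancy factor:

```python
import math

def estimate_n_estimators(n_classes: int, alphabet_size: int, redundancy: int = 4) -> int:
    # At least ceil(log_alphabet(n_classes)) sub-problems are needed so that
    # the codewords can distinguish all classes; the redundancy factor adds
    # error tolerance by repeating that minimum several times over.
    minimum = math.ceil(math.log(n_classes) / math.log(alphabet_size))
    return max(1, minimum) * redundancy
```

For example, 15 classes with an alphabet of 10 needs at least 2 sub-problems, so a redundancy factor of 4 would give 8 base estimators.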

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| classes_ | ndarray | Unique target labels. |
| code_book_ | ndarray or None | Generated codebook if mapping is needed. |
| codebook_stats_ | dict | Statistics about the generated codebook. |
| estimators_ | list or None | Stores the single fitted base estimator only if no_mapping_needed_ is True. |
| no_mapping_needed_ | bool | True if n_classes <= alphabet_size. |
| classes_index_ | dict or None | Maps class labels to indices. |
| X_train | ndarray or None | Stored training features if mapping is needed. |
| Y_train_per_estimator | ndarray or None | Encoded training labels for each sub-problem. Shape (n_estimators, n_samples). |
| n_features_in_ | int | Number of features seen during fit. |
| feature_names_in_ | ndarray or None | Names of features seen during fit. |

Examples:

>>> from sklearn.datasets import load_iris
>>> from tabpfn import TabPFNClassifier
>>> from tabpfn_extensions.many_class import ManyClassClassifier
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> base_clf = TabPFNClassifier()
>>> many_clf = ManyClassClassifier(base_clf, alphabet_size=base_clf.max_num_classes_)
>>> many_clf.fit(X_train, y_train)
>>> y_pred = many_clf.predict(X_test)

codebook_statistics_ property

codebook_statistics_

Returns statistics about the generated codebook.

fit

fit(X, y, **fit_params) -> ManyClassClassifier

Prepare classifier using custom validate_data. Actual fitting of sub-estimators happens in predict_proba if mapping is needed.

predict

predict(X) -> ndarray

Predict multi-class targets for X.

predict_proba

predict_proba(X) -> ndarray

Predict class probabilities for X. Sub-estimators are fitted here if mapping is used.
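To illustrate how per-sub-problem probabilities can be decoded back into probabilities over the original classes, here is a minimal sketch under stated assumptions, not the library's exact aggregation rule: given a codebook of shape (n_estimators, n_classes) and one (n_samples, alphabet_size) probability array per sub-problem, it averages the log-probability each sub-problem assigns to a class's symbol, then renormalizes. The name `decode_proba` is hypothetical.

```python
import numpy as np

def decode_proba(sub_probas, code_book, eps=1e-9):
    """Average log-probabilities of each class's assigned symbol across
    sub-problems, then renormalize into a proper distribution."""
    n_samples = sub_probas[0].shape[0]
    n_estimators, n_classes = code_book.shape
    log_scores = np.zeros((n_samples, n_classes))
    for i in range(n_estimators):
        # code_book[i] maps each original class to its symbol in sub-problem i,
        # so this fancy-indexing picks P(symbol of class c | x) for every c.
        log_scores += np.log(sub_probas[i][:, code_book[i]] + eps)
    proba = np.exp(log_scores / n_estimators)
    return proba / proba.sum(axis=1, keepdims=True)
```

Aggregating in log space (cf. the log_proba_aggregation flag in the usage example) behaves like a geometric mean of the sub-problem votes, which down-weights classes that any sub-problem deems unlikely.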

set_categorical_features

set_categorical_features(
    categorical_features: list[int],
) -> None

Attempts to set categorical features on the base estimator.