many_class_classifier ¶
Development Notebook: https://colab.research.google.com/drive/1HWF5IF0IN21G8FZdLVwBbLBkCMu94yBA?usp=sharing
This module provides a classifier that overcomes TabPFN's limitation on the number of classes (typically 10) by using a meta-classifier approach based on output coding. It works by breaking down multi-class problems into multiple sub-problems, each within TabPFN's class limit.
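The output-coding idea behind this decomposition can be sketched in a few lines (a minimal illustration with hypothetical sizes, not the exact scheme `ManyClassClassifier` uses): each sub-problem relabels the original classes into at most `alphabet_size` symbols via one row of a codebook, so every sub-problem stays within the base model's class limit.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classes, alphabet_size, n_estimators = 15, 10, 6  # hypothetical sizes

# Codebook: row i assigns every original class a symbol in
# {0, ..., alphabet_size - 1} for sub-problem i. A plain random codebook
# is shown here; ManyClassClassifier balances how often each class is
# represented and guarantees coverage.
codebook = rng.integers(0, alphabet_size, size=(n_estimators, n_classes))

# Relabelling y with a codebook row yields at most `alphabet_size`
# distinct labels, so each sub-problem fits within the base model's limit.
y = rng.integers(0, n_classes, size=200)
encoded = codebook[:, y]  # shape (n_estimators, n_samples)
assert all(len(np.unique(row)) <= alphabet_size for row in encoded)
```

Decoding then picks the original class whose codeword best matches the sub-problem predictions.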
This version stays close to the original structural design, with key improvements in codebook generation and a custom `validate_data` function for scikit-learn compatibility.
Key features (compared to a very basic output coder):
- Improved codebook generation: uses a strategy that attempts to balance how often each class is explicitly represented and guarantees that every class is covered.
- Codebook statistics: optionally prints statistics about the generated codebook.
- Custom `validate_data`: used for better data-validation compatibility across scikit-learn versions.
- Robustness: minor changes for better scikit-learn compatibility (e.g., ensuring the wrapper is properly "fitted" and setting `n_features_in_`).
Original structural aspects retained:
- Fitting of base estimators for sub-problems largely occurs during `predict_proba` calls.
Example usage:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier  # Assuming TabPFN is installed
from sklearn.datasets import make_classification

# Create synthetic data with many classes
n_classes_total = 15  # TabPFN might struggle with >10 if not configured
X, y = make_classification(n_samples=300, n_features=20, n_informative=15,
                           n_redundant=0, n_classes=n_classes_total,
                           n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42,
                                                    stratify=y)

# Create a TabPFN base classifier
# Adjust N_ensemble_configurations and device as needed/available
# TabPFN's default class limit is often 10 for the public model.
base_clf = TabPFNClassifier(device='cpu', N_ensemble_configurations=4)

# Wrap it with ManyClassClassifier
many_class_clf = ManyClassClassifier(
    estimator=base_clf,
    alphabet_size=10,  # Max classes the base_clf sub-problems will handle.
                       # This should align with TabPFN's actual capability.
    n_estimators_redundancy=3,
    random_state=42,
    log_proba_aggregation=True,
    verbose=1,  # Print codebook stats
)

# Use like any scikit-learn classifier
many_class_clf.fit(X_train, y_train)
y_pred = many_class_clf.predict(X_test)
y_proba = many_class_clf.predict_proba(X_test)
print(f"Prediction shape: {y_pred.shape}")
print(f"Probability shape: {y_proba.shape}")
if hasattr(many_class_clf, 'codebook_stats_'):
    print(f"Codebook Stats: {many_class_clf.codebook_stats_}")
```
ManyClassClassifier ¶

Bases: `BaseEstimator`, `ClassifierMixin`
Output-Code multiclass strategy to extend classifiers beyond their class limit.
This version adheres closely to the original structural design, with key improvements in codebook generation and a custom `validate_data` function for scikit-learn compatibility. Fitting of sub-problems primarily occurs during prediction.
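This deferred-fitting flow can be sketched roughly as follows; `LazyOutputCoder` and its attribute names are hypothetical stand-ins for illustration, not the actual implementation:

```python
import numpy as np
from sklearn.base import clone

class LazyOutputCoder:
    """Minimal sketch (hypothetical class): fit() only stores the data;
    sub-estimators are cloned and fitted at prediction time."""

    def __init__(self, estimator, codebook):
        self.estimator = estimator
        self.codebook = codebook  # shape (n_estimators, n_classes)

    def fit(self, X, y):
        # No sub-estimator is trained here -- this mirrors the retained
        # structural aspect of fitting during predict_proba.
        self.X_train_ = np.asarray(X)
        self.y_train_ = np.asarray(y)
        return self

    def sub_probas(self, X):
        # Clone and fit one base estimator per codebook row, on the
        # relabelled (encoded) targets, only when predictions are needed.
        return [
            clone(self.estimator)
            .fit(self.X_train_, row[self.y_train_])
            .predict_proba(X)
            for row in self.codebook
        ]

# Usage with a simple stand-in base estimator:
from sklearn.linear_model import LogisticRegression

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
codebook = np.array([[0, 1], [1, 0]])
probas = LazyOutputCoder(LogisticRegression(), codebook).fit(X, y).sub_probas(X)
```

The trade-off of this design is that prediction pays the training cost, which suits in-context learners like TabPFN where "fitting" is cheap relative to classical training.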
Parameters:

Name | Type | Description | Default
---|---|---|---
`estimator` | `BaseEstimator` | A classifier implementing `fit()` and `predict_proba()` methods. | *required*
`alphabet_size` | `int` | Maximum number of classes the base estimator can handle. If `None`, attempts to infer it from the estimator. | `None`
`n_estimators` | `int` | Number of base estimators (sub-problems). If `None`, calculated based on other parameters. | `None`
`n_estimators_redundancy` | `int` | Redundancy factor for the auto-calculated number of estimators. | `4`
`random_state` | `int`, `RandomState` instance or `None` | Controls randomization for codebook generation. | `None`
`verbose` | `int` | Controls verbosity. If > 0, prints codebook stats. Defaults to 0. | `0`
Attributes:

Name | Type | Description
---|---|---
`classes_` | `ndarray` | Unique target labels.
`code_book_` | `ndarray \| None` | Generated codebook if mapping is needed.
`codebook_stats_` | `dict` | Statistics about the generated codebook.
`estimators_` | `list \| None` | Stores the single fitted base estimator only if no mapping is needed.
`no_mapping_needed_` | `bool` | True if `n_classes <= alphabet_size`.
`classes_index_` | `dict \| None` | Maps class labels to indices.
`X_train` | `ndarray \| None` | Stored training features if mapping is needed.
`Y_train_per_estimator` | `ndarray \| None` | Encoded training labels for each sub-problem. Shape `(n_estimators, n_samples)`.
`n_features_in_` | `int` | Number of features seen during `fit`.
`feature_names_in_` | `ndarray \| None` | Names of features seen during `fit`.
Examples:
>>> from sklearn.datasets import load_iris
>>> from tabpfn import TabPFNClassifier
>>> from tabpfn_extensions.many_class import ManyClassClassifier
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_iris(return_X_y=True)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>> base_clf = TabPFNClassifier()
>>> many_clf = ManyClassClassifier(base_clf, alphabet_size=base_clf.max_num_classes_)
>>> many_clf.fit(X_train, y_train)
>>> y_pred = many_clf.predict(X_test)
codebook_statistics_ property ¶

Returns statistics about the generated codebook.
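As an illustration of the kind of statistic such a codebook admits (the actual contents of `codebook_stats_` may differ), the minimum pairwise Hamming distance between class codewords is a standard quality measure for output codes:

```python
import numpy as np
from itertools import combinations

def hamming_stats(codebook):
    # codebook: (n_estimators, n_classes); column j is class j's codeword.
    codewords = codebook.T
    dists = [int((a != b).sum()) for a, b in combinations(codewords, 2)]
    # A larger minimum distance means more sub-problem errors can be
    # absorbed before two classes become indistinguishable.
    return {"min_hamming": min(dists),
            "mean_hamming": sum(dists) / len(dists)}

example = np.array([[0, 1, 2],
                    [1, 2, 0],
                    [2, 0, 1]])  # 3 estimators, 3 classes
stats = hamming_stats(example)   # every pair of codewords differs in all rows
```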
fit ¶
fit(X, y, **fit_params) -> ManyClassClassifier
Prepare classifier using custom validate_data. Actual fitting of sub-estimators happens in predict_proba if mapping is needed.
predict_proba ¶
Predict class probabilities for X. Sub-estimators are fitted here if mapping is used.
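One plausible way to aggregate sub-problem outputs back into per-class probabilities is sketched below, assuming log-probability aggregation as suggested by the `log_proba_aggregation` option; `decode_log_proba` is a hypothetical helper, not the verbatim implementation:

```python
import numpy as np

def decode_log_proba(sub_probas, codebook):
    # sub_probas: (n_estimators, n_samples, alphabet_size)
    # codebook:   (n_estimators, n_classes)
    n_samples = sub_probas.shape[1]
    n_classes = codebook.shape[1]
    log_scores = np.zeros((n_samples, n_classes))
    for i in range(codebook.shape[0]):
        # Score class j by the log-probability its assigned symbol
        # received from sub-estimator i (small epsilon avoids log(0)).
        log_scores += np.log(sub_probas[i][:, codebook[i]] + 1e-12)
    # Normalize back to probabilities (softmax over classes).
    log_scores -= log_scores.max(axis=1, keepdims=True)
    p = np.exp(log_scores)
    return p / p.sum(axis=1, keepdims=True)

# Toy check: two classes with codewords (0, 1) and (1, 0).
codebook = np.array([[0, 1], [1, 0]])
sub = np.array([[[0.9, 0.1]],   # estimator 0: symbol 0 likely
                [[0.1, 0.9]]])  # estimator 1: symbol 1 likely
proba = decode_log_proba(sub, codebook)  # class 0 matches both predictions
```

Summing log-probabilities treats the sub-estimators as independent evidence, which tends to be more robust than summing raw probabilities when some sub-problems are confidently wrong.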
set_categorical_features ¶
Attempts to set categorical features on the base estimator.