Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration

IIT Delhi, TCS Research Labs

Summary

We propose TCA (Test-time Calibration via Attribute Alignment), a method that enhances confidence calibration in Vision-Language Models (VLMs) during zero-shot test-time prompt tuning. TCA extracts class-specific attributes using Large Language Models (LLMs) and incorporates them into prompts to preserve model calibration while improving classification accuracy.

  • Attribute-aware prompting: Uses LLMs to initialize prompts with semantically relevant class attributes.
  • Calibration loss: Introduces inter-class and intra-class regularization terms to preserve calibration while learning prompts.
  • Zero-shot and test-time: Requires no training labels, enabling calibration-aware prompt tuning on unseen samples.
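
To make the pieces above concrete, here is a minimal sketch of the calibration-aware test-time tuning loop, assuming a CLIP-style model accessed through PyTorch. The method names encode_image and encode_text_from_prompts, the single-sample entropy objective, and the way the calibration regularizer is added are illustrative assumptions, not the exact implementation.

import torch
import torch.nn.functional as F

def test_time_tune(model, image, prompt_ctx, calib_loss_fn, steps=1, lr=5e-3):
    """Tune the attribute-initialized soft prompt on one unlabeled test image.

    `prompt_ctx` is assumed to be a leaf tensor with requires_grad=True;
    `model` is assumed to expose CLIP-style encode_image / encode_text_from_prompts.
    """
    optimizer = torch.optim.AdamW([prompt_ctx], lr=lr)
    for _ in range(steps):
        img_feat = F.normalize(model.encode_image(image), dim=-1)                   # (1, D)
        txt_feat = F.normalize(model.encode_text_from_prompts(prompt_ctx), dim=-1)  # (C, D)
        logits = model.logit_scale.exp() * img_feat @ txt_feat.t()                  # (1, C)

        # Unsupervised test-time objective: sharpen the prediction on this sample ...
        probs = logits.softmax(dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()

        # ... while a calibration regularizer keeps the class/attribute text embeddings
        # well structured (a concrete sketch of such a regularizer appears after Figure 3).
        loss = entropy + calib_loss_fn(txt_feat)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return prompt_ctx

The per-sample loop mirrors standard test-time prompt tuning; only the added regularizer term is specific to calibration-aware tuning.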

Conceptual comparison of our method vs. contemporaries

Conceptual difference

Figure 1: Conceptual comparison between our proposed TCA and contemporary test-time prompt-tuning methods.

Step 1: Attribute Extraction

Attribute Extraction Visualization

Figure 2: Illustration of class-specific attribute extraction using language prompts. An LLM generates attribute descriptors that are semantically aligned with the class name and visually discriminative, which enhances interpretability and paves the way for attribute-guided calibration.
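
As a rough illustration of this step, the snippet below queries an LLM for a few visually discriminative attributes per class and turns them into attribute-augmented text prompts. The OpenAI-style client, the model name, the query wording, and the "a photo of a {class}, which has {attribute}" template are assumptions for the sketch, not the exact prompts used in the paper.

from openai import OpenAI  # any LLM client would do; this assumes openai>=1.x

client = OpenAI()

def extract_attributes(class_name: str, k: int = 2) -> list[str]:
    """Ask an LLM for k short, visually discriminative attributes of a class."""
    query = (
        f"List {k} short visual attributes that help distinguish a {class_name} "
        "in a photograph. Answer with one attribute per line."
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": query}],
    )
    lines = resp.choices[0].message.content.strip().splitlines()
    return [ln.strip("-• ").strip() for ln in lines if ln.strip()][:k]

def attribute_prompts(class_name: str, attributes: list[str]) -> list[str]:
    """Build attribute-augmented prompts (the template is an assumption)."""
    return [f"a photo of a {class_name}, which has {a}" for a in attributes]

# e.g. attribute_prompts("hummingbird", extract_attributes("hummingbird"))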

Step 2: Test-time Calibration via TCA loss





TCA Method Pipeline

Figure 3: Overview of the Test-time Calibration via Attribute Alignment (TCA) framework. (Top) The attributes extracted in Figure 2 initialize soft prompts used during inference; calibration error is minimized by aligning predicted features with those induced by the text-based attribute prompts via the TCA loss. (Bottom) Final classification is guided by semantically enriched and calibrated textual representations.
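
The exact TCA loss is defined in the paper; as a hedged sketch of the kind of inter-class and intra-class regularization mentioned in the summary, the function below pulls each class's attribute embeddings toward their centroid (intra-class) and penalizes similarity between different class centroids (inter-class). The input shape and the weights alpha and beta are assumptions.

import torch
import torch.nn.functional as F

def tca_calibration_loss(attr_feats: torch.Tensor, alpha: float = 1.0, beta: float = 1.0) -> torch.Tensor:
    """Sketch of a calibration regularizer over attribute text features.

    attr_feats: (C, A, D) tensor of L2-normalized embeddings, C classes with
    A attribute prompts each. The exact TCA formulation may differ.
    """
    centroids = F.normalize(attr_feats.mean(dim=1), dim=-1)            # (C, D)

    # Intra-class cohesion: each class's attribute embeddings should agree
    # with that class's centroid (cosine similarity close to 1).
    intra = 1.0 - (attr_feats * centroids.unsqueeze(1)).sum(dim=-1).mean()

    # Inter-class separation: different class centroids should not be too similar,
    # which discourages over-confident predictions between poorly separated classes.
    sim = centroids @ centroids.t()                                    # (C, C)
    off_diag = sim - torch.diag_embed(torch.diagonal(sim))             # zero the diagonal
    inter = off_diag.clamp_min(0).mean()

    return alpha * intra + beta * inter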

t-SNE panels: feature embeddings, loss-wise feature spread, and a zoomed-in cluster

Figure 4: t-SNE plots of visual feature space before and after TCA. Post-calibration features show better separation and tighter clusters per class, illustrating how attribute alignment refines semantic structure in embedding space.
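
For completeness, plots like these can be reproduced from collected visual features with a few lines of scikit-learn and matplotlib; the perplexity, color map, and file names below are incidental choices, not part of the method.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, labels: np.ndarray, title: str, out_path: str) -> None:
    """Project (N, D) visual features to 2-D with t-SNE and color points by class."""
    emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(features)
    plt.figure(figsize=(5, 5))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="tab20")
    plt.title(title)
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()

# e.g. plot_tsne(feats_before, y, "Before TCA", "tsne_before.png")
#      plot_tsne(feats_after,  y, "After TCA",  "tsne_after.png")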

Attribute Count vs Calibration Error

Table 1: Ablation study showing how the number of selected attributes affects calibration. The lowest ECE is achieved with 2 attributes; adding more leads to a marginal degradation due to attribute dilution.

Attributes    ECE (CLIP-RN50)    ECE (CLIP-ViT-B/16)
0             25.7               19.9
1             20.3               12.1
2             5.59               4.05
3             7.34               4.48
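
For reference, Expected Calibration Error bins predictions by confidence and averages the gap between mean confidence and accuracy in each bin, weighted by bin size. A standard implementation is sketched below; the 15-bin setting is a common default and an assumption here.

import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 15) -> float:
    """ECE = sum_b (|B_b| / N) * |acc(B_b) - conf(B_b)| over confidence bins B_b."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()          # accuracy of samples in this bin
            conf = confidences[mask].mean()     # mean confidence in this bin
            ece += mask.mean() * abs(acc - conf)
    return float(ece)

The function returns a fraction; multiply by 100 to compare with the percentage values reported in the tables.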

Performance on Fine-Grained Classification

Table 2: Accuracy and calibration (ECE) results across diverse fine-grained datasets. TCA improves ECE significantly while maintaining or improving classification accuracy, highlighting its utility in real-world recognition tasks.

Dataset       Accuracy (%)    ECE (%)
Caltech101    93.26           3.09
Flowers       68.9            3.57
Food101       84.23           1.91
Aircraft      25.38           3.36
EuroSAT       44.3            4.36
UCF101        65              2.71

Robustness to Natural Distribution Shift

Table 3: Evaluation on ImageNet-based natural distribution shift datasets. TCA shows strong robustness and maintains calibrated predictions under distribution shifts (e.g., adversarial examples, renditions, sketches), outperforming uncalibrated CLIP.

Dataset        Accuracy (%)    ECE (%)
ImageNet-A     47.36           5.21
ImageNet-V2    60.85           1.81
ImageNet-R     72.74           3.42
ImageNet-S     45.72           4.81

BibTeX

@inproceedings{hebbalaguppe2025tca,
  title={Prompting without Panic: Attribute-aware, Zero-shot, Test-Time Calibration},
  author={Hebbalaguppe, Ramya and Kandar, Tamoghno and Nagpal, Abhinav and Arora, Chetan},
  booktitle={European Conference on Machine Learning (ECML)},
  year={2025}
}