BERxiT: Early Exiting Beyond Entropy
Why should BERT trust its own confidence scores to decide when to stop thinking?
DeeBERT showed that you can bolt classifiers onto intermediate layers and exit early on easy inputs. That gave you a clean speed–accuracy trade-off and exposed how much redundancy sits in deep transformers.
It also left two gaps:
- Early exiting stayed stuck in classification land. No logits, no entropy, no exit.
- The training setup forced a choice: lean toward strong final accuracy and weak early exits, or the reverse.
BERxiT tries to fix both, without touching the backbone. It replaces entropy with a learned notion of “am I right at this depth?” and swaps DeeBERT’s two-stage fine-tuning for an alternating schedule that serves both early and final exits.
Breaking Up With Entropy
In the DeeBERT post, we saw how early exits work (sketched in code below the list):
- attach a small classifier at each layer
- compute a probability distribution over classes
- measure its entropy and exit when entropy drops below a threshold
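For concreteness, here is a minimal sketch of that entropy test in PyTorch, for a single example. The helper name and the threshold value are illustrative, not DeeBERT's actual implementation.

```python
import torch
import torch.nn.functional as F

def entropy_exit(logits: torch.Tensor, threshold: float) -> bool:
    """DeeBERT-style test: exit early when the softmax entropy is low.

    `logits` has shape (num_classes,) for a single example.
    """
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum()
    return entropy.item() < threshold
```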
That keeps things simple, but it bakes in two assumptions:
- the task exposes a clean probability distribution over a fixed label set
- softmax confidence is a good proxy for correctness
Both break down fast.
Regression has no discrete classes. Sequence generation spreads probability mass over large vocabularies. Span tasks predict two distributions you have to consider together.
Even when the setup fits, confidence is misleading. Neural nets produce sharp distributions on wrong answers all the time. Entropy measures how peaked things are, not whether the peak sits on the right class.
BERxiT’s answer is to stop reading confidence off task logits and learn correctness prediction as a separate task.
At each layer, you still attach a task head. On top of that, you attach a shared scalar head that takes the hidden state and outputs a certainty score in [0, 1]. That head ignores the logits entirely. Its only job is: given this hidden representation, how likely is it that the task head at this layer is “good enough”?
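A minimal sketch of what such a head could look like, assuming a BERT-style [CLS] representation. The class name and hidden size are my own shorthand, not the paper's code.

```python
import torch
import torch.nn as nn

class CertaintyHead(nn.Module):
    """One shared head: maps a layer's [CLS] hidden state to a score in [0, 1].

    The same instance is reused at every layer; name and size are illustrative.
    """
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, cls_hidden: torch.Tensor) -> torch.Tensor:
        # cls_hidden: (batch, hidden_size) -> scores: (batch,)
        return torch.sigmoid(self.proj(cls_hidden)).squeeze(-1)
```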
The target for that score comes straight from the training labels.
For classification, after computing the layer’s prediction, you set the target certainty to:
- 1 if the prediction matches the label
- 0 otherwise
For regression, you map error to a score:
- small error → value close to 1
- large error → value close to 0
The paper uses 1 - tanh(|ŷ - y|) for that mapping.
The certainty head then minimizes a simple MSE loss against this target. Because it is shared across layers, it learns a generic mapping from “shape of hidden state” to “this depth is sufficient for this input on this task.”
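Putting the targets and the loss together, a sketch might look like this. The function names and the `task` switch are my own shorthand, not the paper's API.

```python
import torch
import torch.nn.functional as F

def certainty_target(preds, labels, task="classification"):
    """Training target for the certainty head at one layer.

    Classification: 1.0 if the layer's prediction matches the label, else 0.0.
    Regression: 1 - tanh(|ŷ - y|), so small errors map to values close to 1.
    """
    if task == "classification":
        return (preds.argmax(dim=-1) == labels).float()
    return 1.0 - torch.tanh((preds - labels).abs())

def certainty_loss(scores, preds, labels, task="classification"):
    # Plain MSE between the head's score and the correctness target.
    return F.mse_loss(scores, certainty_target(preds, labels, task))
```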
At inference time the rule mirrors DeeBERT: move through layers, run the exit head and the certainty head, and stop as soon as certainty crosses a threshold.
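A sketch of that loop for a single example, reusing the hypothetical certainty head from above. The per-layer interface and the threshold value are assumptions, not the paper's code.

```python
def adaptive_forward(layers, task_heads, certainty_head, hidden, threshold=0.9):
    """Run layers one at a time and return the first 'certain enough' prediction.

    `layers` and `task_heads` are per-layer modules; `hidden` is the embedded
    input of shape (1, seq_len, hidden_size).
    """
    for i, (layer, head) in enumerate(zip(layers, task_heads)):
        hidden = layer(hidden)
        cls = hidden[:, 0]                      # [CLS] representation
        prediction = head(cls)
        if certainty_head(cls).item() >= threshold:
            return prediction, i + 1            # exited after i + 1 layers
    return prediction, len(layers)              # fell through to the final exit
```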
The key shift: the exit decision now comes from internal representations, not from task probabilities. That decouples early exiting from classification. Any task with a notion of correctness during training can plug in.
On STS-B, BERxiT reaches close to baseline Pearson correlation while using about half the layers. DeeBERT could not touch that task at all, because its exit rule depends on entropy over classes.
Sharing a Backbone Fairly
The second problem DeeBERT exposed was optimization.
Jointly training all exits and the backbone drags the model toward shallow performance and hurts the final layer. Two-stage training flips the problem: it protects final accuracy but leaves early exits sitting on top of a backbone that never adapted to them.
BERxiT takes a small but important step. During fine-tuning, it alternates between two objectives (a sketch of the loop follows below):
- one step that optimizes all exits (early and final)
- one step that optimizes only the final exit
So over time the backbone sees gradients that:
- pull it toward a configuration where all layers work reasonably well
- pull it back toward a configuration where the final layer matches a standard fine-tuned BERT
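A minimal sketch of that alternating schedule, assuming the model returns one loss per exit with the final exit last; the even/odd split and the interface are illustrative, not taken from the paper.

```python
def alternating_finetune(model, loader, optimizer, epochs=3):
    """Alternate between 'all exits' and 'final exit only' objectives.

    Assumes `model(batch)` returns a list of per-exit losses, final exit last.
    """
    for _ in range(epochs):
        for step, batch in enumerate(loader):
            losses = model(batch)
            # Even steps: optimize every exit. Odd steps: only the final exit.
            loss = sum(losses) if step % 2 == 0 else losses[-1]
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```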
This balancing act matters most on small datasets, where you cannot afford to jointly optimize many competing heads.
On low-resource GLUE tasks like RTE and MRPC, this alternating schedule consistently pushes the accuracy–efficiency curve above both joint training and DeeBERT’s two-stage recipe. For the same accuracy drop, BERxiT typically saves 30–70% more layers than DeeBERT. On some tasks, the difference is larger.
You still get the main DeeBERT guarantee: final-layer performance stays close to a normal fine-tuned model. But intermediate layers become real exits, not weak classifiers glued on top of a frozen encoder.
What BERxiT Actually Buys You
Putting the two ideas together, BERxiT changes three things in practice:
- Exit decisions stop depending on logits. A small shared head learns correctness prediction from hidden states. That unlocks regression and other tasks where entropy makes little sense, and often yields better behavior on classification where calibration is poor.
- Fine-tuning balances early and final exits. Alternating between “all exits” and “final exit” steps gives you a backbone that serves both, instead of sacrificing one for the other.
- Depth usage shifts further down. On GLUE, for similar accuracy, average exit depth drops compared to DeeBERT. You pay for fewer layers per input on the same backbone and task.
There is also a side effect on interpretability. Because the certainty head outputs a scalar at each depth, you can track where it aligns with simple metrics like n-gram overlap and where it starts to diverge.
In the BERxiT experiments, early layers show high correlation between certainty and surface similarity, then this link weakens in deeper layers. That matches the story from earlier posts: early exits mostly handle “surface-easy” inputs, while deeper layers take over when structure, context, or conflicting evidence matter.
Where BERxiT Still Hurts
For all the improvements, two constraints from DeeBERT remain.
Batching. Dynamic exits and static GPU kernels still do not fit together cleanly. If half your batch exits at layer 4 and half at layer 10, you either waste computation to keep the batch aligned or introduce routing logic that current runtimes do not support well. BERxiT measures per-sample speedups; real deployments still need engineering work to turn that into batched throughput gains.
Thresholds. Swapping entropy for learned certainty removes the classification ceiling, but you still need a cutoff. Conservative thresholds give modest savings with safe accuracy. Aggressive thresholds unlock deeper savings and amplify errors. The paper tunes this on validation splits. Production systems need policies that adapt to drift and risk profiles.
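One simple policy, not from the paper: sweep thresholds on a validation split and keep the cheapest one that stays within an accuracy budget. The `evaluate` interface below is hypothetical.

```python
def pick_threshold(evaluate, thresholds, baseline_acc, max_drop=0.01):
    """Pick the cheapest threshold whose validation accuracy stays in budget.

    `evaluate(t)` is assumed to return (accuracy, avg_layers_used) at threshold t.
    """
    best = None
    for t in sorted(thresholds):
        acc, avg_layers = evaluate(t)
        if acc >= baseline_acc - max_drop:
            if best is None or avg_layers < best[1]:
                best = (t, avg_layers)
    return best[0] if best else max(thresholds)  # fall back to the safest cutoff
```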
And like DeeBERT, BERxiT adds exits during fine-tuning. Pre-training still assumes a single final head. We do not yet know how much headroom sits in models that are pre-trained with exits baked in from the start.
What Comes After BERxiT
Across this short series:
- early exiting reframed inference as an allocation problem rather than a fixed cost
- DeeBERT showed that simple entropy-based exits on BERT already cut a large fraction of work on easy samples
- BERxiT pushed past classification and sharpened the training story
The common thread is simple: the model should not spend the same amount of computation on every input.
BERxiT’s contribution is to treat “when to stop” as a learned prediction problem, not a hand-crafted rule derived from task outputs. Combined with an alternating fine-tuning schedule, that is enough to beat DeeBERT on most standard benchmarks without changing the backbone at all.
The hardware and tooling story still lags. But as the early exiting post argued, that is an engineering gap, not a conceptual one. Models that know when they know enough make more sense than models that burn full depth on every “2 + 2”.
Next, we can look at methods like LayerSkip that move adaptivity one level down, into the architecture itself, and take a more direct shot at the batching problem.
Jules Belveze