ML / AI · Apr 2026
Reasoning Model Failure Analysis, LLM Interpretability
A controlled LLM evaluation pipeline spanning six reasoning models from 7B to 70B parameters, designed to disentangle reasoning length effects from forced re-entry interventions. The study measured a 36-point accuracy decline in Llama-distilled models while Qwen-distilled models remained robust. Multi-GPU inference was conducted with a bfloat16 KV cache on 4x GH200 GPUs.
6 (7B to 70B)
Models evaluated
36 pts down
Llama degradation
stable
Qwen degradation
4x GH200
Hardware
Problem
Reasoning models often lose accuracy when forced to re-enter their own chain of thought, but it is unclear whether the cause is reasoning length, the form of the intervention, or the model family. Separating these factors required a controlled comparison free of token-budget confounds.
Approach
I developed a configuration-driven evaluation pipeline that decouples reasoning length effects from forced re-entry interventions across six reasoning models ranging from 7B to 70B parameters. The framework employed multi-GPU inference with a bfloat16 KV cache on 4x GH200 GPUs, structured outputs for ablation review, and deterministic seeds throughout.
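As a minimal sketch of what a configuration-driven grid of this kind might look like (not the actual pipeline code), the example below crosses reasoning budget with the re-entry intervention so that each factor varies independently. The names `RunConfig`, `derive_seed`, and `build_grid`, the model identifiers, and the budget values are all hypothetical. One assumption here is that the control and intervention arms of a cell share a seed, so they form a paired comparison.

```python
from dataclasses import dataclass, asdict
from itertools import product
import hashlib
import json


@dataclass(frozen=True)
class RunConfig:
    """One cell of the evaluation grid (hypothetical schema)."""
    model: str             # placeholder model identifier
    reasoning_budget: int  # cap on reasoning tokens for this run
    forced_reentry: bool   # True if the re-entry intervention is applied
    seed: int              # deterministic seed for this cell


def derive_seed(base_seed: int, model: str, budget: int) -> int:
    """Derive a stable 32-bit seed from the cell identity.

    The intervention flag is deliberately left out of the key, so the
    control and intervention arms of the same cell share a seed.
    """
    key = f"{base_seed}|{model}|{budget}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:4], "big")


def build_grid(models, budgets, base_seed=0):
    """Cross every model and budget with both intervention arms."""
    return [
        RunConfig(model, budget, reentry, derive_seed(base_seed, model, budget))
        for model, budget, reentry in product(models, budgets, (False, True))
    ]


def to_jsonl(grid):
    """Serialize the grid as JSON Lines, one structured record per run."""
    return "\n".join(json.dumps(asdict(cfg), sort_keys=True) for cfg in grid)


if __name__ == "__main__":
    grid = build_grid(["model-a", "model-b"], [512, 2048])
    print(to_jsonl(grid))
```

Because every budget appears with and without the intervention, a change in accuracy can be attributed to one factor while the other is held fixed. Hashing the cell identity, rather than drawing seeds from a shared random stream, keeps each seed stable when models or budgets are added to the grid.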
Results
Llama-distilled models lost 36 accuracy points under forced re-entry, while Qwen-distilled models remained stable. The gap points to a family-level effect, an interpretability signal relevant to downstream evaluation work. A manuscript is in preparation.
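For readers unfamiliar with the metric, the sketch below shows how a change in accuracy points could be computed from per-problem correctness flags, assuming "points" means percentage points. The function names and the toy data are hypothetical and purely illustrative; they are not the study's results.

```python
def accuracy(correct_flags):
    """Accuracy in percent from a non-empty list of per-problem booleans."""
    return 100.0 * sum(correct_flags) / len(correct_flags)


def point_change(baseline_flags, intervention_flags):
    """Change in accuracy, in percentage points, relative to the baseline."""
    return accuracy(intervention_flags) - accuracy(baseline_flags)


if __name__ == "__main__":
    # Toy data only: five problems, scored with and without the intervention.
    baseline = [True, True, True, True, False]
    intervention = [True, True, False, False, False]
    print(point_change(baseline, intervention))  # negative value = accuracy lost
```

A point change is an absolute difference between two percentages, not a relative drop, so it only compares cleanly across models when each is measured against its own baseline on the same problems.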
Stack