ML / AI · Apr 2026

Reasoning Model Failure Analysis, LLM Interpretability

A controlled LLM evaluation pipeline spanning six reasoning models from 7B to 70B parameters, designed to disentangle reasoning length effects from forced re-entry interventions. The study measured a 36-point accuracy decline in Llama-distilled models while Qwen-distilled models remained robust. Multi-GPU inference was conducted with a bfloat16 KV cache on 4x GH200 GPUs.

Open repository on GitHub ↗All my repos →

6 (7B to 70B)

Models evaluated

36 pts down

Llama degradation

stable

Qwen degradation

4x GH200

Hardware

Problem

Reasoning models frequently regress when forced to re-enter their own chain of thought; however, the underlying cause, whether reasoning length, the shape of the intervention, or the model family, remains unclear. A controlled comparison free of token-budget confounds was required.

Approach

I developed a configuration-driven evaluation pipeline that decouples reasoning length effects from forced re-entry interventions across six reasoning models ranging from 7B to 70B parameters. The framework employed multi-GPU inference with a bfloat16 KV cache on 4x GH200 GPUs, structured outputs for ablation review, and deterministic seeds throughout.

Results

Llama-distilled models lost 36 accuracy points under forced re-entry, while Qwen-distilled models remained robust. The finding constitutes a family-level interpretability signal applicable to downstream evaluation work, and a manuscript is in preparation.

Stack

PyTorchHuggingFace TransformersSlurmbfloat16 KV-cacheGH200

← All projects