Researchers Break Open AI’s Black Box—and Use What They Find Inside to Control It

Researchers introduced the Recursive Feature Machine (RFM) to extract concept vectors from large AI models, enabling efficient monitoring and steering of behavior across language, vision-language, and reasoning systems. The technique identified both harmful ('anti-refusal') and beneficial ('anti-deception') vectors, worked across languages, transferred between models, and required fewer than 500 samples and a single A100 GPU to operate.
Why it matters: GPT-4o concept vectors show that product teams must update model guardrails to block 'anti-refusal' vulnerabilities.
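
To make the idea of monitoring and steering with a concept vector concrete, here is a minimal sketch in Python. It does not implement RFM. As a stand-in, it uses a simple difference-of-means direction computed on synthetic "activations", and every name in it (`concept_vector`, `monitor`, `steer`, the 64-dimensional random data) is a hypothetical chosen for illustration rather than anything taken from the researchers' code.

```python
import numpy as np


def concept_vector(pos_acts, neg_acts):
    """Unit vector from the mean 'concept absent' activation to the mean
    'concept present' activation (a difference-of-means stand-in, NOT RFM)."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def monitor(acts, vector):
    """Monitoring: score each activation by its projection onto the vector."""
    return acts @ vector


def steer(acts, vector, strength):
    """Steering: push every activation along the vector by a fixed amount."""
    return acts + strength * vector


# Synthetic data (assumption): the concept shifts activations along axis 0.
rng = np.random.default_rng(0)
dim = 64
true_direction = np.zeros(dim)
true_direction[0] = 1.0

neg = rng.normal(size=(200, dim))                         # concept absent
pos = rng.normal(size=(200, dim)) + 3.0 * true_direction  # concept present

v = concept_vector(pos, neg)
scores_pos = monitor(pos, v)
scores_neg = monitor(neg, v)
steered = steer(neg, v, strength=3.0)

print("alignment with the planted direction:", float(v @ true_direction))
print("mean score, concept present:", float(scores_pos.mean()))
print("mean score, concept absent: ", float(scores_neg.mean()))
print("mean score after steering:  ", float(monitor(steered, v).mean()))
```

In this toy setup, the same vector serves both purposes: projecting onto it separates the two groups, and adding a multiple of it moves "concept absent" activations toward the "concept present" group. In a real model the activations would come from a chosen hidden layer, and the direction would come from the extraction method itself.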