Google DeepMind wants to know if chatbots are just virtue signaling

Google DeepMind argues that the moral behavior of large language models needs evaluation as rigorous as technical benchmarks. Its researchers propose stress tests, chain-of-thought traces, and mechanistic interpretability to detect moral fragility before LLMs are entrusted with sensitive roles such as therapy or medical advising.
Why it matters: Google DeepMind's call for moral-evaluation tests demands stricter model auditing before clinical or therapeutic deployment.