*Ahead of the GPT-5.3 Instant launch, we implemented dynamic
multi-turn evaluations for mental health, emotional reliance, and
self-harm that simulate extended conversations across these domains.
Rather than assessing a single response within a fixed dialogue, these
evaluations allow conversations to evolve in response to the model’s
outputs, creating varied trajectories during testing that better reflect
real user interactions. This approach helps identify potential issues
that may only emerge over the course of long exchanges and provides an
even more rigorous test than prior static multi-turn methods. By
utilizing realistic, yet adversarial user simulations, these evaluations
have enabled continued improvements in safety performance, particularly
in areas where earlier evaluation frameworks had reached saturation.
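To make the setup concrete, here is a minimal sketch of what such a dynamic evaluation loop could look like. The helper functions (`get_assistant_reply`, `grade_response`, `simulate_user_turn`) are hypothetical stand-ins, stubbed out for illustration; the key structural point is that the simulated user conditions each follow-up on the full conversation so far, and every assistant message is graded, not just the last one.

```python
# Sketch of a dynamic multi-turn evaluation loop. All three helpers are
# hypothetical stand-ins, not part of any published API.

def get_assistant_reply(history):
    # Stand-in for the model under test.
    return f"assistant reply at message {len(history)}"

def grade_response(history):
    # Stand-in policy grader: True means the latest assistant
    # message is policy-compliant.
    return True

def simulate_user_turn(history):
    # Stand-in adversarial user simulator. It sees the whole
    # conversation so far and returns None to end the chat.
    if len(history) >= 6:
        return None
    return f"user follow-up after {len(history)} messages"

def run_dynamic_eval(opening_message, max_turns=10):
    """Run one adaptive conversation, grading every assistant message."""
    history = [{"role": "user", "content": opening_message}]
    grades = []
    for _ in range(max_turns):
        reply = get_assistant_reply(history)
        history.append({"role": "assistant", "content": reply})
        grades.append(grade_response(history))  # per-message, not final-only
        follow_up = simulate_user_turn(history)  # adapts to model outputs
        if follow_up is None:
            break
        history.append({"role": "user", "content": follow_up})
    return history, grades

history, grades = run_dynamic_eval("I've been feeling really low lately.")
```

Because the simulator's next message depends on the model's previous reply, two runs against different model checkpoints can follow entirely different trajectories, which is what distinguishes this from a fixed multi-turn script.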
Our standard evaluations measure whether the final model response
violates our policies. In these dynamic conversations, we instead
evaluate every assistant response and report not_unsafe: the
percentage of assistant messages that do not violate safety policies.
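The per-message scoring described above can be sketched as follows. This is an illustrative reading, not the actual grading code: `violates_policy` stands in for whatever policy grader is used, and the handling of conversations with no assistant messages is an assumption.

```python
# Sketch of the not_unsafe metric: the share of assistant messages in a
# conversation that do not violate safety policies.

def violates_policy(message):
    # Hypothetical stand-in for a policy grader; here it just flags a
    # sentinel token so the example is deterministic.
    return "UNSAFE" in message["content"]

def not_unsafe(conversation):
    """Fraction of assistant messages that are policy-compliant."""
    assistant_msgs = [m for m in conversation if m["role"] == "assistant"]
    if not assistant_msgs:
        return 1.0  # assumption: vacuously compliant with no assistant turns
    compliant = sum(1 for m in assistant_msgs if not violates_policy(m))
    return compliant / len(assistant_msgs)

convo = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "supportive, safe reply"},
    {"role": "user", "content": "follow-up"},
    {"role": "assistant", "content": "UNSAFE reply flagged by grader"},
]
score = not_unsafe(convo)  # 1 of 2 assistant messages compliant -> 0.5
```

Scoring every assistant message rather than only the final one means a single lapse anywhere in a long exchange lowers the score, which is the stricter standard the dynamic evaluations are designed to apply.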