Assessing AI image safety: what Mindgard's prompts reveal about ChatGPT's defenses and where AI builders should focus
bbc.com

Published by AINave Editorial • Reviewed by Ramit

TL;DR: UK security startup Mindgard demonstrated that a modified prompt can make ChatGPT generate graphic and sexualised images, even after OpenAI added safeguards. For AI builders, this reinforces the need for continuous red-teaming, layered protections, and active monitoring of image generation safety.

Researchers from UK AI security startup Mindgard showed that the latest public version of ChatGPT can be coaxed into producing graphic and sexualised images by slightly altering a widely shared prompt. For AI builders deploying image generation, this reinforces that static guardrails are not sufficient; continuous red-teaming, layered safety checks, and human review are essential.

What happened

Mindgard, a firm focused on red-teaming AI models, discovered that a small modification to a prompt originally designed for humorous results caused ChatGPT's GPT-5.4 model to output violent and sexualised imagery, even though the prompt gave no explicit instructions about subject matter. The images included gory crime scenes, sexualised poses, and depictions suggesting sexual violence. OpenAI stated it had introduced additional safeguards and layered protections, including automated systems and human review, to block such content. However, Mindgard reported that with further small tweaks the prompt still produced concerning material, and OpenAI acknowledged the ongoing "cat-and-mouse" dynamic between attackers and guardrails.

Why AI builders should care

This incident highlights that AI image safety is not a one-time fix. As models scale and are integrated into products that accept user-generated prompts, the risk of generating prohibited content remains. Models do not understand intent or context the way humans do, making it difficult to enforce nuanced policies solely through static rules. For teams building products that include image generation, this means guardrails must be treated as a continuous investment: regularly updated, stress-tested, and supplemented with monitoring and escalation processes.
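One way to make "stress-tested" concrete is a regression run: keep every prompt that has ever been reported as a bypass and replay the whole set against the current guardrail on a schedule. The sketch below is illustrative only. The names run_guardrail_regression, regressions, and keyword_guardrail are hypothetical, the placeholder terms are deliberately harmless, and the keyword check stands in for whatever moderation classifier a team actually uses. It does not represent OpenAI's or Mindgard's tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class RegressionResult:
    prompt: str
    blocked: bool


def run_guardrail_regression(
    prompts: Iterable[str],
    is_blocked: Callable[[str], bool],
) -> list[RegressionResult]:
    """Replay every previously reported prompt through the current guardrail."""
    return [RegressionResult(p, is_blocked(p)) for p in prompts]


def regressions(results: list[RegressionResult]) -> list[str]:
    """Return the prompts that should be blocked but currently are not."""
    return [r.prompt for r in results if not r.blocked]


def keyword_guardrail(prompt: str) -> bool:
    # Hypothetical stand-in for a real moderation check (assumption):
    # a production system would call its own classifier here.
    banned = {"forbidden-topic"}
    return any(term in prompt.lower() for term in banned)


if __name__ == "__main__":
    corpus = [
        "draw a forbidden-topic scene",
        "draw a f0rbidden-topic scene",  # small tweak to the prompt above
    ]
    for prompt in regressions(run_guardrail_regression(corpus, keyword_guardrail)):
        print(f"NOT BLOCKED: {prompt}")
```

In this toy run the one-character variant slips past the keyword check, which is the kind of gap a scheduled replay is meant to surface before users find it.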

Practical implications

Product teams should treat red-teaming as a recurring practice, not a pre-launch checkbox. Mindgard's approach of making small changes to known prompts to expose emergent gaps can be replicated by internal security teams or external auditors. Layered protections are essential: automated content filters catch common bypasses, while human review and rapid response procedures handle edge cases that slip through. OpenAI's statement that it combines automated systems and human review offers a useful reference point for that architecture. Teams should also establish clear escalation paths for newly discovered vulnerabilities, and consider disclosure policies that balance public transparency with giving providers time to fix issues before details are published.
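The layering described above can be sketched as a small decision function: several independent automated checks each score a prompt, clear violations are blocked, and borderline cases are queued for a person. This is a minimal sketch under stated assumptions, not a description of OpenAI's actual system. LayeredModerator, the two example layers, the placeholder terms, and the 0.9 and 0.5 thresholds are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    HUMAN_REVIEW = "human_review"


@dataclass
class LayeredModerator:
    # Each layer returns a risk score in [0, 1]; thresholds are illustrative.
    layers: list[Callable[[str], float]]
    block_at: float = 0.9
    review_at: float = 0.5
    review_queue: list[str] = field(default_factory=list)

    def decide(self, prompt: str) -> Decision:
        # Use the worst score across layers so that one layer's blind spot
        # cannot cancel out another layer's detection.
        risk = max((layer(prompt) for layer in self.layers), default=0.0)
        if risk >= self.block_at:
            return Decision.BLOCK
        if risk >= self.review_at:
            self.review_queue.append(prompt)  # escalate to a human reviewer
            return Decision.HUMAN_REVIEW
        return Decision.ALLOW


def keyword_layer(prompt: str) -> float:
    # Hypothetical high-confidence rule (assumption, not a real policy list).
    return 1.0 if "blocked-term" in prompt.lower() else 0.0


def ambiguity_layer(prompt: str) -> float:
    # Hypothetical lower-confidence signal that warrants a second look.
    return 0.6 if "ambiguous-term" in prompt.lower() else 0.0


if __name__ == "__main__":
    moderator = LayeredModerator(layers=[keyword_layer, ambiguity_layer])
    for text in ["a cat on a sofa", "blocked-term scene", "ambiguous-term scene"]:
        print(text, "->", moderator.decide(text).value)
```

Taking the maximum score rather than an average is a deliberate choice in this sketch: it errs toward review when any single layer is suspicious, at the cost of sending more borderline prompts to the human queue.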

Caveats

The reported demonstrations are lab tests by security researchers and may not reflect typical user behavior. The exact prompt was not disclosed, and OpenAI has since taken action to block the specific variant. However, alternative bypasses remained effective during the testing, indicating that the threat surface is broader than a single prompt. Safeguards are evolving continuously, and new jailbreak techniques will likely emerge. Builders should not assume that any current set of filters is comprehensive.

FAQs

Can ChatGPT be prompted to generate sexualised or violent images?

Yes, Mindgard's demonstrations showed that with small prompt modifications, ChatGPT's image generator can produce graphic and sexualised outputs even when safeguards are in place.

What safeguards does OpenAI have in place to prevent graphic content from ChatGPT's image generator?

OpenAI reports using multiple layers of protection, combining automated systems and human review to block violating content. Its policies prohibit sexual violence, non-consensual intimate content, and attempts to bypass safeguards.

Have researchers demonstrated bypassing ChatGPT's image safeguards?

Yes, Mindgard demonstrated bypass techniques and showed the BBC that further small changes to the prompt still produced concerning content after OpenAI's initial fix.

What is Mindgard and what did their tests show?

Mindgard is a UK AI security startup that red-teams AI models. Its tests showed that a simple prompt modification could cause ChatGPT to generate gruesome and sexualised imagery without detailed subject instructions.

How does OpenAI respond to attempts to bypass content filters?

OpenAI stated it introduced additional safeguards and continues to monitor and deploy mitigations, acknowledging a continuous arms race between defenses and attackers.

What steps can organizations take to monitor and improve AI image safety?

Teams should conduct ongoing red-teaming, implement layered automated and human review, establish rapid response protocols for disclosed vulnerabilities, and maintain open channels with model providers.
