A developer-led test using the Opus 4.6 model demonstrated the effectiveness of training against prompt injection, though experts warn of risks in production environments.
An experiment conducted by developer Fernando Irarrázaval subjected an AI assistant to an open security test, in which participants attempted to extract system secrets by sending emails. The challenge, hosted on the hackmyclaw.com platform, registered around 2,000 participants and approximately 6,000 intrusion attempts. None of the attacks managed to extract the confidential information stored in the test environment.
The instance used in the challenge ran the Opus 4.6 model and featured explicit security instructions. The system was programmed to never reveal credentials, modify internal files, execute code from emails, or exfiltrate data to external endpoints. The operation consumed $500 in tokens and resulted in the temporary suspension of a Google account due to the anomalous volume of messages received.
The test results reflect a broader trend in the artificial intelligence industry. According to developer Simon Willison, AI labs have been investing in training cutting-edge models to resist prompt injection attacks. Willison noted that the GPT-5.6 technical documentation released by OpenAI indicates concrete progress in mitigating this type of vulnerability, making such intrusions harder to execute in practice.
Although no attack in the challenge succeeded, experts remain cautious about deploying autonomous systems in production environments. Willison emphasizes that 6,000 failed attempts do not guarantee absolute protection, as more sophisticated approaches could still succeed. The recommendation is that systems with the potential to cause irreversible damage should not rely exclusively on model training as a barrier against prompt injection.
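One way to read that recommendation is to place a deterministic check between the model and any action it requests, so that the final decision does not depend on the model resisting a malicious email. The sketch below is illustrative only and is not taken from the challenge's setup: the action names, the `ToolCall` type, and the `authorize` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical action names, chosen for illustration only.
# Anything not listed here is treated as potentially irreversible.
SAFE_ACTIONS = frozenset({"read_inbox", "summarize_email"})


@dataclass(frozen=True)
class ToolCall:
    """An action the model asks the host application to perform."""
    action: str
    argument: str = ""


def authorize(call: ToolCall, human_approved: bool = False) -> bool:
    """Decide in ordinary code, outside the model, whether a call may run.

    Allowlisted actions pass. Every other action requires explicit human
    approval, regardless of how the model was persuaded to request it.
    """
    if call.action in SAFE_ACTIONS:
        return True
    return human_approved


if __name__ == "__main__":
    # A request an injected email might try to trigger (hypothetical URL).
    exfiltrate = ToolCall("http_post", "https://attacker.example/collect")
    print(authorize(ToolCall("read_inbox")))            # allowlisted action
    print(authorize(exfiltrate))                        # blocked by default
    print(authorize(exfiltrate, human_approved=True))   # allowed after review
```

Denying by default means that an action the developer never anticipated is blocked until a person reviews it, which complements rather than replaces the model's own training.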
The discussion surrounding the experiment gained traction within technical communities. On Hacker News, the topic sparked a debate featuring grounded skepticism and detailed responses from the challenge's author, highlighting the industry's ongoing interest in AI agent security and the limitations of current defenses.
No attack succeeded. The AI assistant, subjected to approximately 6,000 intrusion attempts by around 2,000 participants, prevented any extraction of confidential information or system secrets.
AI labs are investing in training cutting-edge models to resist prompt injection attacks. OpenAI's recent technical documentation indicates concrete progress in mitigating this vulnerability, making such intrusions harder to execute.
Experts warn that while training against prompt injection is effective, it does not guarantee absolute protection. Systems with the potential to cause irreversible damage should not rely exclusively on model training to defend against attacks.