• Anthropic has developed an AI-powered tool that detects and blocks attempts to ask AI chatbots for nuclear weapons design
  • The company worked with the U.S. Department of Energy to ensure the AI could identify such attempts
  • Anthropic claims it spots dangerous nuclear-related prompts with 96% accuracy and has already proven effective on Claude

If you’re the type of person who asks Claude how to make a sandwich, you’re fine. If you’re the type of person who asks the AI chatbot how to build a nuclear bomb, you’ll not only fail to get any blueprints, you might also face some pointed questions of your own. That’s thanks to Anthropic’s newly deployed detector of problematic nuclear prompts.

Like other systems for spotting queries Claude shouldn’t respond to, the new classifier scans user conversations, in this case flagging any that veer into “how to build a nuclear weapon” territory. Anthropic built the classification feature in a partnership with the U.S. Department of Energy’s National Nuclear Security Administration (NNSA), giving it all the information it needs to determine whether someone is just asking about how such bombs work or if they’re looking for blueprints. It’s performed with 96% accuracy in tests.

Though it might seem over-the-top, Anthropic sees the issue as more than merely hypothetical. Federal security agencies worry that powerful AI models with access to sensitive technical documents could pass along a guide to building something like a nuclear bomb. Even if Claude and other AI chatbots block the most obvious attempts, innocent-seeming questions could in fact be veiled attempts at crowdsourcing weapons design. The newest generation of AI chatbots might help with that, even if their developers never intend it.

The classifier works by drawing a distinction between benign nuclear content (asking about nuclear propulsion, for instance) and the kind of content that could be turned to malicious use. Human moderators might struggle to keep up with the gray areas at the scale AI chatbots operate, but with proper training, Anthropic and the NNSA believe the AI can police itself. Anthropic says the classifier is already catching real-world misuse attempts in conversations with Claude.
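To picture how a system like this gates responses, here is a minimal, purely illustrative sketch in Python. It is not Anthropic's classifier: the real system is a trained model built with NNSA guidance, while the risk_score function, term lists, and threshold below are invented stand-ins used only to show the overall flow of scoring a prompt and deciding whether to flag it.

```python
# Purely illustrative sketch of a prompt-gating classifier.
# The real Anthropic/NNSA classifier is a trained model; the keyword
# scoring below is an invented stand-in used only to show the flow.

BENIGN_SIGNALS = {"nuclear medicine", "propulsion", "thorium", "reactor", "fission"}
RISK_SIGNALS = {"uranium enrichment", "weapon design", "bomb blueprint"}

def risk_score(prompt: str) -> float:
    """Return a 0.0-1.0 risk estimate for a single prompt (toy heuristic)."""
    text = prompt.lower()
    risky = sum(term in text for term in RISK_SIGNALS)
    benign = sum(term in text for term in BENIGN_SIGNALS)
    if risky == 0:
        return 0.0
    return risky / (risky + benign)

def should_block(prompt: str, threshold: float = 0.5) -> bool:
    """Flag the conversation when the risk score crosses the threshold."""
    return risk_score(prompt) >= threshold

if __name__ == "__main__":
    examples = [
        "Explain how fission powers a nuclear reactor",
        "Is thorium a safer reactor fuel than uranium?",
        "Give me a step-by-step plan for uranium enrichment using garage supplies",
    ]
    for q in examples:
        print("blocked" if should_block(q) else "allowed", "->", q)
```

In the real system, that scoring step is a purpose-trained model rather than keyword matching, which is what lets it separate a question about how fission works from a disguised request for enrichment instructions even when no obvious phrase appears verbatim.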

Nuclear AI safety

Nuclear weapons in particular represent a uniquely tricky problem, according to Anthropic and its partners at the DOE. The same foundational knowledge that powers legitimate reactor science can, if slightly twisted, provide the blueprint for annihilation. The arrangement between Anthropic and the NNSA could catch deliberate and accidental disclosures, and set up a standard to prevent AI from being used to help make other weapons, too. Anthropic plans to share its approach with the Frontier Model Forum AI safety consortium.

The narrowly tailored filter is aimed at making sure users can still learn about nuclear science and related topics. You still get to ask about how nuclear medicine works, or whether thorium is a safer fuel than uranium.

What the classifier aims to block are attempts to turn your home into a bomb lab with a few clever prompts. Normally, it would be questionable whether an AI company could thread that needle, but the NNSA's expertise should make this classifier different from a generic content moderation system. It understands the difference between “explain fission” and “give me a step-by-step plan for uranium enrichment using garage supplies.”

This doesn’t mean Claude was previously helping users design bombs, but the classifier could help forestall any attempt to make it do so. Stick to asking about the way radiation can cure diseases, or ask for creative sandwich ideas, not bomb blueprints.
