OpenAI’s flagship AI model has gotten more trustworthy but easier to trick

Bydls

Oct 18, 2023

OpenAI’s GPT-4 large language model may be more trustworthy than GPT-3.5 but also more vulnerable to jailbreaking and bias, according to research backed by Microsoft.

The paper — by researchers from the University of Illinois Urbana-Champaign, Stanford University, University of California, Berkeley, Center for AI Safety, and Microsoft Research — gave GPT-4 a higher trustworthiness score than its predecessor. That means they found it was generally better at protecting private information, avoiding toxic results like biased information, and resisting adversarial attacks. However, it could also be told to ignore security measures and leak personal information and conversation histories. Researchers found that users can bypass safeguards around GPT-4 because the model “follows misleading information more precisely” and is more likely to follow very tricky prompts to the letter.

The team says these vulnerabilities were tested for and not found in consumer-facing GPT-4-based products — basically, the majority of Microsoft’s products now — because “finished AI applications apply a range of mitigation approaches to address potential harms that may occur at the model level of the technology.”

To measure trustworthiness, the researchers measured results in several categories, including toxicity, stereotypes, privacy, machine ethics, fairness, and strength at resisting adversarial tests.

To test the categories, the researchers first tried GPT-3.5 and GPT-4 using standard prompts, which included using words that may have been banned. Next, the researchers used prompts designed to push the model to break its content policy restrictions without outwardly being biased against specific groups before finally challenging the models by intentionally trying to trick them into ignoring safeguards altogether.

The researchers said they shared the research with the OpenAI team.

“Our goal is to encourage others in the research community to utilize and build upon this work, potentially pre-empting nefarious actions by adversaries who would exploit vulnerabilities to cause harm,” the team said. “This trustworthiness assessment is only a starting point, and we hope to work together with others to build on its findings and create powerful and more trustworthy models going forward.”

The researchers published their benchmarks so others can recreate their findings.

AI models like GPT-4 often go through red teaming, where developers test several prompts to see if they will spit out unwanted results. When the model first came out, OpenAI CEO Sam Altman admitted GPT-4 “is still flawed, still limited.”

Services Marketplace – Listings, Bookings & Reviews

Entertainment blogs & Forums

Entertainment Tech

Jeremy Allen White ditches The Bear for The Boss in the new Springsteen: Deliver Me From Nowhere trailer – and I’m hooked

Tech

Prime Day now lasts four days – here are my 6 tips to grab the biggest bargains

Tech

Jeremy Allen White ditches The Bear for The Boss in the new Springsteen: Deliver Me From Nowhere trailer – and I’m hooked

Prime Day now lasts four days – here are my 6 tips to grab the biggest bargains

Jaw-dropping security flaws found in open source code could allow hackers to spirit away entire projects – here’s what devs need to know

5 Nintendo Switch 2 settings I recommend changing as soon as you boot your new console up

Jeremy Allen White ditches The Bear for The Boss in the new Springsteen: Deliver Me From Nowhere trailer – and I’m hooked

The Science Fiction and Fantasy Books You Can’t Afford to Miss in September!

Send a newsletter? This $100 list-building tool is just $12 right now.

There’s officially a snake named after Salazar Slytherin now

Jeremy Allen White ditches The Bear for The Boss in the new Springsteen: Deliver Me From Nowhere trailer – and I’m hooked

Prime Day now lasts four days – here are my 6 tips to grab the biggest bargains

Jaw-dropping security flaws found in open source code could allow hackers to spirit away entire projects – here’s what devs need to know

5 Nintendo Switch 2 settings I recommend changing as soon as you boot your new console up

OpenAI’s flagship AI model has gotten more trustworthy but easier to trick

Bydls

Related Post

Jeremy Allen White ditches The Bear for The Boss in the new Springsteen: Deliver Me From Nowhere trailer – and I’m hooked

Prime Day now lasts four days – here are my 6 tips to grab the biggest bargains

Jaw-dropping security flaws found in open source code could allow hackers to spirit away entire projects – here’s what devs need to know

You missed

Jeremy Allen White ditches The Bear for The Boss in the new Springsteen: Deliver Me From Nowhere trailer – and I’m hooked

Prime Day now lasts four days – here are my 6 tips to grab the biggest bargains

Jaw-dropping security flaws found in open source code could allow hackers to spirit away entire projects – here’s what devs need to know

5 Nintendo Switch 2 settings I recommend changing as soon as you boot your new console up