If there is one piece of technology that is sizzling on the Internet somewhere, somebody else is sure to try and break it or make it work otherwise. AI seems to be entwined with our lives now. We ask AI to create our emails, write social media posts, create blogs, comment on blogs, create images and videos, prepare summaries and help us in creating more novel content.

With so much work being done by AI technology, will hacking it be far behind? One type of attacking AI systems is called ‘AI jail breaking’. There are a few things to keep in mind before we understand AI jailbreaking.

  1. AI systems in general are passive systems(of course ‘active AI’ exists as well) We ask questions and it responds. Period. This is passive AI.

2. In addition, AI models are supposed to be non-deterministic. You may have noticed, that the AI model gives a different answer each time you ask the same question. This is known as ‘non-deterministic’ feature in an AI model.

3. An AI model is not a person who can understand people and their emotions. They cannot see and cannot understand the intent behind many questions. They don’t know if the question is genuine or is it laced with a bad intent. They just spit out the answers to anybody and everybody once the question is asked.

In such a situation, if the user twists and turns the questions in such a way to make the AI model give malicious answers overriding its safety precautions or “guardrails”, it is known as ‘AI jailbreaking’.

Hence, in an AI jailbreak, the AI model is forced to give dangerous or inappropriate answers to malicious questions(and it will know it is a malicious question only if it written in the safety instructions)

For instance, a user can repeatedly ask the AI model to help them hack a system in different ways and it might yield answers on how to do it(even though initially, it may refuse it) They can also do role play scenario and make the AI model vomit more mischievous information.

AI jailbreaking can be done in two ways(I am sure others also exist):

  1. Classic jailbreak – when the user has studied the system very well and creates mischievous questions to the AI model. The model is then confused by the questions and spits out dangerous answers(like making a bomb or hacking a system)
  2. Indirect prompt injection – hackers disguise bad prompts inside legitimate prompts to yield sensitive data or dangerous information.

Both of these and other techniques make use of the fact that AI models cannot judge between different types of questions and individuals.

How do we protect against these AI jailbreaks?

It is up to the AI companies to employ security strategies to keep things safe on their side and publish best practices for end users to adopt when working with them.

All security best practices and techniques that were erected for normal systems come into play again. It is good to

  1. adopt a layered AI defense system
  2. erect ‘guardrails’ or ‘safety features’ into AI systems
  3. train AI models on ‘scenario based guidance’
  4. employ input validation and sanitization

This field is still in its infancy and it is quite scary to know what AI models may spit out next… keep your eyes and ears peeled to what may come up in this space next!

This post is for BlogchatterA2Z 2025!

(Visited 11 times, 1 visits today)

Related Posts

2 thoughts on “Stealthy AI jailbreaking!

Leave a Reply