We discussed ‘AI jailbreaking’ yesterday. AI jailbreaking is hacking into AI models and making them spit out dangerous/sensitive/toxic information.
It is quite possible for a user to ask toxic questions that push the LLM into giving equally toxic answers. How can we make sure that mischievous individuals cannot elicit toxic content from AI models? One way is to erect guardrails to prevent it.
This is similar to the guardrails we physically see on highways, which keep vehicles from veering off course. They do not prevent accidents every time, but they do keep vehicles within their lanes most of the time. This is exactly what happens with AI guardrails too: appropriate defenses are erected that ensure the AI does not produce unwanted and dangerous content.
How are these guardrails erected?
AI guardrails are erected using frameworks or tools like Amazon Bedrock Guardrails. These guardrails ensure that the content an LLM produces is (see the sketch after this list):
a. appropriate and doesn’t veer off topic
b. free of hallucination
c. without bias and toxicity
d. in accordance with regulation
e. factually correct
f. blocked outright when the question itself is toxic, with the guardrail refusing to answer and citing safety concerns
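To make the list above concrete, here is a small, purely illustrative sketch. The check functions, the GUARDRAIL_CHECKS mapping and the example rules are all assumptions for this post, not how any particular product works; real guardrail services like Amazon Bedrock Guardrails implement these categories with trained classifiers rather than one-line functions.

# Hypothetical guardrail checks: each entry mirrors one category from the list above.
def is_on_topic(answer: str) -> bool:
    # Assumed rule for illustration: our imaginary app never discusses cryptocurrency.
    return "bitcoin" not in answer.lower()

def is_non_toxic(answer: str) -> bool:
    # Assumed word list for illustration; real systems use toxicity classifiers.
    return not any(word in answer.lower() for word in ("hate", "violence", "racist"))

GUARDRAIL_CHECKS = {
    "stays_on_topic": is_on_topic,
    "free_of_toxicity": is_non_toxic,
}

def run_guardrails(answer: str) -> list[str]:
    # Return the names of the checks the answer failed (an empty list means it passed).
    return [name for name, check in GUARDRAIL_CHECKS.items() if not check(answer)]

print(run_guardrails("I hate everything about this."))   # -> ['free_of_toxicity']

Any answer that fails one or more checks can then be blocked, rewritten, or flagged before it reaches the user.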
What are the advantages of using AI guardrails?
AI guardrails ensure that the content is:
a. safe and secure
b. factually correct
c. free of bias
d. trusted by the end user
How can AI guardrails be erected in an organization?
As with any major restructuring in an organization, building AI guardrails into your AI application needs a team approach. Different team members should be able to offer their inputs on what types of guardrails are needed and how they should be implemented. The guardrails should adhere to regulatory frameworks, and they should be receptive to change, since the field of AI is constantly evolving and new defenses will need to be erected. One practical way to stay flexible is to keep the guardrail rules in a configuration that teams can review and update without touching the application code, as sketched below.
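As one possible approach (the keys, values and messages below are purely illustrative assumptions, not tied to any specific product), the guardrail policy can live in a small piece of configuration that compliance, legal and engineering teams maintain together and that the application reloads as the rules evolve:

import json

# Hypothetical guardrail policy: the keys and values are illustrative assumptions.
POLICY_JSON = """
{
  "blocked_topics": ["medical advice", "legal advice"],
  "blocked_keywords": ["hate", "violence", "racist", "explicit", "offensive"],
  "refusal_message": "Sorry, this request was blocked by our safety policy."
}
"""

def load_policy(raw: str = POLICY_JSON) -> dict:
    # In a real system this might come from a versioned file or a config
    # service, so the rules can change without redeploying the application.
    return json.loads(raw)

policy = load_policy()
print(policy["refusal_message"])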
Let us see the backend of a content moderation guardrail which can be used in any application.
Here is the Python code for a content moderation AI guardrail, generated by ChatGPT itself:
import re

# A basic list of blocked keywords or phrases
BLOCKED_KEYWORDS = [
    "hate", "violence", "racist", "explicit", "offensive"
]

def content_moderation_guardrail(text: str) -> bool:
    """
    Checks if the input text contains any blocked keywords.
    Returns True if safe, False if content is flagged.
    """
    for keyword in BLOCKED_KEYWORDS:
        if re.search(rf"\b{re.escape(keyword)}\b", text, re.IGNORECASE):
            print(f"Content blocked due to keyword: '{keyword}'")
            return False
    return True

# Example usage
user_input = "This is an offensive statement!"
if content_moderation_guardrail(user_input):
    print("Content passed moderation.")
else:
    print("Content failed moderation.")
With this content moderation code, if a user's request contains any of the blocked keywords such as hate, violence, racist or explicit, the request is flagged before it ever reaches the LLM, and instead of an answer the user sees a message like
"Content blocked due to keyword: 'offensive'" followed by "Content failed moderation."
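To make the guardrail useful in practice, the same function can be wired in front of (and after) the actual model call. The sketch below is a hypothetical illustration: call_llm is just a stand-in for whichever model client you use, and answer_safely is a name invented for this example.

# Hypothetical wiring of the guardrail above around an LLM call.
def call_llm(prompt: str) -> str:
    # Stand-in for a real model client (e.g. an API call); returns a fixed string here.
    return "This is a placeholder answer from the model."

def answer_safely(prompt: str) -> str:
    if not content_moderation_guardrail(prompt):          # screen the user's question
        return "Sorry, this request was blocked by our safety guardrail."
    answer = call_llm(prompt)
    if not content_moderation_guardrail(answer):          # screen the model's answer too
        return "The generated answer was blocked by our safety guardrail."
    return answer

print(answer_safely("This is an offensive statement!"))

Checking both the question and the answer means the guardrail catches toxic requests before the model sees them and toxic outputs before the user does.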
Guardrails will continue to grow and evolve as AI attacks become more prominent.
This post is for BlogchatterA2Z 2025!