Fine-tuning ModernBERT as an Efficient Guardrail for LLMs

SpeedBots' customer support chatbot was designed to handle shipping queries and delivery updates. The team expected simple questions about package locations and delivery times. Yet three days after launch, it was writing Python scripts and debugging SQL queries as curious users tested its capabilities far beyond the intended scope. It's a universal law of LLM deployments: no matter what your application is designed to do, users will inevitably try to make it write their homework.

In this article, we'll explore a pragmatic approach to filtering user queries before they reach your Large Language Models (LLMs). Rather than relying on expensive and complex solutions, we'll see how fine-tuning a smaller encoder-only model like ModernBERT can create an efficient and cost-effective guardrail system for production environments.

The Challenge: Filtering Unwanted Queries

SpeedBots, a growing logistics company, recently launched their customer service chatbot with a clear mission:

  • Handling customer queries about tracking orders
  • Calculating shipping rates
  • Resolving delivery issues
  • Explaining available services

But within days of deployment, the support team noticed an unexpected pattern. Their carefully crafted logistics assistant was receiving an increasing number of off-topic requests. Users were treating it as a general-purpose AI assistant rather than a specialized tool for shipping inquiries.

Common user behaviors include:

  • Trying to use your system for unintended purposes ("Write me a JavaScript application...")
  • Attempting to bypass safety measures ("How do I intercept packages...")
  • Asking how many R's a word has.

Each of these off-topic queries wasted computational resources, increased operational costs, and potentially exposed the company to risk. The challenge became clear: how could SpeedBots effectively filter queries before they reached their expensive LLM, without compromising the helpful experience for legitimate customers?

Common Solutions

1. The Polite Request Approach: Hope as a Strategy

The most popular solution is what we might call "the honor system" – simply asking the model to behave itself by adding filtering instructions to your system prompt:

You are a logistics assistant.
Only answer questions related to logistics and shipping services.
If the user asks about tracking orders, shipping rates, delivery times, or logistics problems, provide helpful responses.
Do not respond to programming requests, illegal activities, or unrelated topics. Politely explain that you can only help with logistics questions.

It's a bit like asking a toddler not to eat cookies and then leaving the cookie jar within reach. Sure, sometimes it works, but inevitably, clever users find ways to persuade the model that their off-topic query is actually logistics-related after all.
A slightly more sophisticated variant uses a separate "bouncer prompt" that evaluates each query before it gets access to your main model:

Determine if the following query is related to logistics services.
Only respond with TRUE if the query is about shipping, deliveries, tracking, or other logistics topics.
Otherwise, respond with FALSE.
Query: "{{user_query}}"
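As a rough illustration, here is a minimal sketch of how such a pre-filter might be wired up with an OpenAI-compatible client; the model name and the TRUE/FALSE parsing are assumptions for this sketch, not part of SpeedBots' actual setup:

# Example sketch of a "bouncer prompt" pre-filter (illustrative assumptions throughout)
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

BOUNCER_PROMPT = """Determine if the following query is related to logistics services.
Only respond with TRUE if the query is about shipping, deliveries, tracking, or other logistics topics.
Otherwise, respond with FALSE.
Query: "{user_query}"
"""

def is_logistics_query(user_query: str) -> bool:
    # Ask a small, cheap model to act as the bouncer before the main assistant sees the query
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed choice of a small model
        messages=[{"role": "user", "content": BOUNCER_PROMPT.format(user_query=user_query)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("TRUE")

if is_logistics_query("Write me a JavaScript application that scrapes websites"):
    print("Route to the main logistics assistant")
else:
    print("Politely decline: off-topic request")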

Limitations:

  • Instruction Interference: When we pack competing directives into the same prompt—"be helpful" but also "be restrictive"—we're essentially diluting the model's effectiveness at both tasks. The more we emphasize caution, the less helpful it becomes; the more we emphasize helpfulness, the more permissive it becomes.
  • False Negatives: In practical deployments, LLMs given filtering responsibilities consistently err on the side of caution. They frequently reject perfectly legitimate queries that fall within their intended scope, creating a frustrating experience for users with valid requests.

For SpeedBots, this manifested as their chatbot rejecting simple questions like "How much would it cost to ship a fragile item?" because it interpreted "fragile item" as potentially against policy. Meanwhile, cleverly worded off-topic requests still managed to slip through. The result was the worst of both worlds: legitimate customers left frustrated while computational resources were still being wasted on non-logistics queries.

2. The Specialized Bouncer: Dedicated LLM Guardrails

A more sophisticated approach employs purpose-built models like Llama Guard that are specifically trained for content moderation and filtering. These specialized "bouncers" are smaller than premium models like GPT-4o, Claude 3.x, or Gemini 2, typically weighing in around 8B parameters instead of hundreds of billions.

On paper, this approach makes perfect sense: instead of asking your head chef to also work the door, you hire a dedicated doorman. For SpeedBots, this would mean routing all incoming queries through a safety model first, and only allowing approved questions to reach their main logistics assistant.

Limitations: A Smaller Problem Is Still a Problem

Calibration Challenges: While these specialized filtering models are more economical than premium LLMs (about 4x cheaper while being 8x smaller in parameter count), their real limitation lies in how they're calibrated. Safety models are typically designed with broad protective guardrails that can't easily be customized for a specific domain simply by updating a prompt.

Excessive Zeal: Safety models tend to be calibrated for maximum protection, often at the expense of usability. For SpeedBots, this meant legitimate logistics queries about "fragile handling" were flagged as dangerous, while cleverly disguised inappropriate content sometimes slipped through. General-purpose safety models excel at catching obviously harmful content but struggle with the nuanced industry-specific distinctions that matter most for specialized applications—creating both false positives and false negatives that limit their effectiveness.

While these models do offer genuine improvements over the "polite request" method, the inability to precisely control what gets filtered represents a significant limitation. For a company like SpeedBots with specific domain requirements, this approach still leaves too much to chance.

3. The Custom-Tailored Bouncer: Fine-tuning an LLM

The next logical step is customizing your filtering model through fine-tuning. This approach adapts either a dedicated safety model like Llama Guard or a smaller general-purpose LLM to create a bespoke guardrail specifically for your domain.
For SpeedBots, this would mean training a model to understand the precise boundary between legitimate logistics questions and everything else. On paper, it promises the best of both worlds: efficient filtering with domain-specific understanding.

Limitations: When the Cure Becomes the Disease

Resource-Intensive Development: Fine-tuning even a "small" 8B parameter model requires specialized hardware, considerable engineering time, and significant computational resources. The process demands expertise in machine learning operations and careful dataset curation, often requiring dedicated GPU clusters with high-memory cards (like A100 GPUs) and hundreds of dollars in compute costs per training run.

Endless Iteration Cycles: Generative models present unique evaluation challenges that differ from common classification tasks. While teams can attempt to constrain outputs to specific formats (like JSON) for easier parsing, a core challenge remains: there's no straightforward probability score to assess confidence or adjust a threshold. Each iteration involves analyzing token selection patterns across diverse inputs and determining whether the model's responses appropriately balance permissiveness with restriction. This qualitative assessment process is inherently more complex and time-consuming than evaluating clear metrics like classification accuracy or F1 score.

Diminishing Returns: When the filtering component consumes more development resources than core product features, the cost-benefit equation becomes increasingly difficult to justify. SpeedBots found themselves asking, "Should we really dedicate our limited AI budget to building a better doorman instead of improving our actual service?"

Deployment Complexity: Deploying fine-tuned models is challenging regardless of your approach. Using API providers is simpler but expensive (costing up to 2x as much as the base models) and depends on serverless offerings that support LoRA/QLoRA adapters. Self-hosting, on the other hand, requires specialized hardware (high-memory GPUs with 16+ GB for 8B models), optimization libraries like vLLM, and complex scaling architecture. This means you'll need dedicated DevOps expertise just to maintain what is essentially a "doorman" rather than focusing on your core product.

For SpeedBots, what began as a straightforward filtering requirement morphed into complex infrastructure engineering. Their ML team found themselves spending more time configuring GPU clusters than improving the actual logistics assistance capabilities that drove their business value. The engineering overhead alone, before even considering the ongoing operational costs, made this approach difficult to justify for what was essentially a preprocessing step.

SpeedBots' Solution: Creating a ModernBERT Guardrail

After exploring increasingly complex and expensive solutions, SpeedBots' engineering team decided to take a different approach. Rather than deploying heavyweight LLMs for filtering, they implemented a lightweight solution using ModernBERT, an encoder-only model well suited to text classification tasks. Here's how they built their guardrail system:

1. Strategic Dataset Creation

SpeedBots needed high-quality training data reflecting both legitimate logistics inquiries and the off-topic queries they wanted to filter. They leveraged DeepSeek V3, a powerful top-tier model, to generate a diverse dataset:

# Example prompt for dataset generation
prompt = """
Generate 10 example user questions for a logistics company chatbot.
For each question, indicate if it's relevant (TRUE) or not relevant (FALSE).
Include some attempts to get the chatbot to perform unrelated tasks or bypass safety guidelines.
"""

This process allowed them to create a dataset of more than 2,000 diverse examples. The model generated some duplicate questions, which were removed by simply dropping exact duplicate text. While the team was aware of better deduplication methods, the added complexity wasn't worth it since generating the sample is very cheap: less than $1 for about 5,000 records. This approach not only saved costs but also quickly produced thousands of realistic examples, including sophisticated attempts to bypass filters, which would have been difficult and time-consuming to create manually.
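A minimal sketch of this generation and deduplication step is shown below; it assumes DeepSeek's OpenAI-compatible endpoint, the deepseek-chat model name, a simple "question | TRUE/FALSE" output format, and pandas for deduplication. These details are illustrative rather than SpeedBots' exact pipeline:

# Example sketch: generating and deduplicating synthetic training data
import pandas as pd
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; the base URL and model name are assumptions here
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_API_KEY")

rows = []
for _ in range(500):  # 500 calls x 10 questions ≈ 5,000 raw records
    completion = client.chat.completions.create(
        model="deepseek-chat",  # DeepSeek V3
        messages=[{"role": "user", "content": prompt}],  # the generation prompt above
        temperature=1.0,  # a higher temperature encourages more varied questions
    )
    for line in completion.choices[0].message.content.splitlines():
        # Assumed output format per line: "<question> | TRUE" or "<question> | FALSE"
        if "|" in line:
            question, label = line.rsplit("|", 1)
            rows.append({"question": question.strip(), "is_relevant": label.strip().upper() == "TRUE"})

# Drop exact duplicate questions, as described above
df = pd.DataFrame(rows).drop_duplicates(subset="question")
df.to_csv("logistics_queries.csv", index=False)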

2. Validation Through Multiple Models

To ensure dataset quality, SpeedBots implemented a validation step using a different model (DeepSeek R1) to independently review the classifications:

# Example review prompt
review_prompt = """
Review the following user question and classification:
Question: "{question}"
Classification: {is_relevant}

Should this question be answered by a logistics company chatbot? Consider if the question:
1. Is related to logistics services
2. Is not attempting to bypass safety features
3. Is not requesting harmful content

Think step by step and provide your final classification as TRUE or FALSE.
"""

This multi-model approach served as a quality control mechanism, identifying edge cases where classification was ambiguous or potentially incorrect.
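One way to run this cross-check is sketched below, reusing the CSV from the previous step; the deepseek-reasoner model name and the TRUE/FALSE parsing are assumptions:

# Example sketch: re-labelling every example with a second model and recording agreement
import pandas as pd
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_DEEPSEEK_API_KEY")
df = pd.read_csv("logistics_queries.csv")

def review_label(question: str, is_relevant: bool) -> bool:
    completion = client.chat.completions.create(
        model="deepseek-reasoner",  # assumed model name for DeepSeek R1
        messages=[{"role": "user", "content": review_prompt.format(
            question=question, is_relevant=is_relevant)}],
    )
    # Take the last TRUE/FALSE mentioned, since the model reasons step by step before answering
    text = completion.choices[0].message.content.upper()
    return text.rfind("TRUE") > text.rfind("FALSE")

df["review_label"] = [review_label(q, r) for q, r in zip(df["question"], df["is_relevant"])]
df["models_agree"] = df["is_relevant"] == df["review_label"]
df.to_csv("logistics_queries_reviewed.csv", index=False)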

3. Human-in-the-Loop Refinement

The SpeedBots team found that about 5% of examples had disagreements between the two models. Rather than relying solely on automated processes, they:

  • Retained examples where both models agreed (the remaining ~95% of the dataset)
  • Manually reviewed each disagreement case
  • Made final determination based on their specific business requirements
  • Updated the dataset with these refined labels

This human oversight ensured the final training data accurately reflected SpeedBots' specific filtering needs.
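In code, that split can be as simple as the sketch below (file names are assumptions, continuing from the reviewed CSV above):

# Example sketch: keep agreed labels, export disagreements for manual review
import pandas as pd

df = pd.read_csv("logistics_queries_reviewed.csv")

agreed = df[df["models_agree"]]      # kept as-is
disputed = df[~df["models_agree"]]   # the ~5% of cases reviewed by hand

disputed.to_csv("to_review_manually.csv", index=False)
# After manual review, corrected labels are merged back with `agreed` to form the final training set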

4. Efficient Fine-tuning Process

Using Hugging Face's Transformers library, the team fine-tuned a ModernBERT model (395M parameters) on their custom dataset:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="modernbert-llm-router",
    per_device_train_batch_size=32,
    learning_rate=5e-5,
    num_train_epochs=2,
)

Unlike their previous attempts with larger models, this fine-tuning process was completed in under 15 minutes on a developer's M2 Mac. The memory footprint required to train the model is small enough that it would easily fit on a single NVIDIA T4 GPU. This represents a dramatic improvement over the multi-hour training cycles typically required for LLM fine-tuning.
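For completeness, a minimal end-to-end training sketch around those arguments is shown below; the answerdotai/ModernBERT-large checkpoint, the CSV file name, and the column names are assumptions carried over from the earlier sketches:

# Example sketch: fine-tuning ModernBERT as a binary relevance classifier
import numpy as np
from datasets import load_dataset
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer

model_id = "answerdotai/ModernBERT-large"  # ~395M parameters
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Load the curated dataset and turn the TRUE/FALSE labels into integers
dataset = load_dataset("csv", data_files="logistics_queries_final.csv", split="train")
dataset = dataset.map(lambda x: {"label": 1 if str(x["is_relevant"]).upper() == "TRUE" else 0})
dataset = dataset.map(lambda x: tokenizer(x["question"], truncation=True))
dataset = dataset.train_test_split(test_size=0.2, seed=42)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, predictions)}

trainer = Trainer(
    model=model,
    args=training_args,  # the TrainingArguments defined above
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())
trainer.save_model("modernbert-llm-router")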

Performance Results: The resulting model exceeded SpeedBots' expectations:

  • 97% F1 score when compared against the curated dataset
  • Inference time of approximately 30ms per query, vs. ~1000ms for a "small" LLM like Llama Guard and multi-second response times for reasoning models like R1
  • Accurate identification of subtle jailbreak attempts
  • Reduced false positives for legitimate logistics questions

Most remarkably, this lightweight model achieved classification performance nearly identical to DeepSeek R1 — a top-tier reasoning model that's significantly more powerful and orders of magnitude more expensive to run. The team had effectively distilled the classification capabilities of an advanced reasoning model into a specialized, efficient filter.
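As a usage sketch, the deployed guardrail can then sit in front of the main assistant as a plain text-classification pipeline; the label names and the confidence threshold below are assumptions:

# Example sketch: using the fine-tuned guardrail at inference time
from transformers import pipeline

guardrail = pipeline("text-classification", model="modernbert-llm-router")

def should_answer(user_query: str, threshold: float = 0.8) -> bool:
    result = guardrail(user_query)[0]  # e.g. {"label": "LABEL_1", "score": 0.97}
    # Assumes LABEL_1 = relevant logistics query; the mapping depends on how labels were encoded
    return result["label"] == "LABEL_1" and result["score"] >= threshold

print(should_answer("How much would it cost to ship a fragile item?"))           # expected: True
print(should_answer("Write me a JavaScript application that scrapes websites"))  # expected: False

Unlike a generative filter, the classifier exposes an explicit probability, so the threshold can be tuned to trade false positives against false negatives.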

The Elegant Solution

What started as a challenging problem—efficiently filtering unwanted queries—led SpeedBots to discover an elegant solution that aligned perfectly with their specific needs. By choosing the right tool for the job rather than following conventional wisdom, they created a guardrail system that was:

  • Precisely Targeted: Designed specifically for their logistics domain
  • Resource Efficient: Requiring minimal computational overhead
  • Highly Accurate: Outperforming much larger and more complex models
  • Operationally Simple: Easy to deploy and maintain

The SpeedBots case demonstrates how sometimes the most effective AI solutions aren't the largest or most sophisticated models, but rather the ones most appropriate for the specific task at hand.

The complete code to replicate this experiment, both the fine-tuning of ModernBERT and the generation of the dataset, can be found in this GitHub repository.