Skip to main content
Background Image

Building AI Systems That Don't Break Under Attack

·1809 words·9 mins·
Pini Shvartsman
Author
Pini Shvartsman
Architecting the future of software, cloud, and DevOps. I turn tech chaos into breakthrough innovation, leading teams to extraordinary results in our AI-powered world. Follow for game-changing insights on modern architecture and leadership.
Securing Intelligence - This article is part of a series.
Part 2: This Article

This is Part 2 of the “Securing Intelligence” series on AI security.


In Part 1, we looked at how prompt injection has evolved from party tricks to production threats. We covered indirect injection, cross-context attacks, and the uncomfortable reality that every defense can be circumvented. That’s the problem space.

Now comes the harder question: if perfect security is impossible, what does responsible AI deployment actually look like?

I’ve spent over a decade in software engineering, development, and technical leadership, with the last couple of years deeply focused on AI—both building production systems and engaging with the community through tech groups and meetups. I’ve seen what separates organizations that sleep soundly from those waiting for their incident. It’s not about having perfect defenses. It’s about having defenses that work together, that fail gracefully, and that make attacks expensive enough that most attackers move on to easier targets.

The Foundation: Structured Prompts and Separation of Concerns
#

The first line of defense is architectural. If you’re mixing system instructions and user input in the same unstructured blob of text, you’ve already lost.

Structured prompts treat instructions and data as separate entities with clear boundaries. Think of it like the difference between eval(user_input) and proper API calls with typed parameters. One is begging to be exploited; the other has clear attack surfaces.

Here’s what this looks like in practice:

SYSTEM_CONTEXT (immutable):
You are a customer support assistant for Acme Corp.
You can access customer records and order history.
You cannot process refunds without manager approval.

TRUSTED_DATA (verified sources):
Customer #12345: Premium account, joined 2020
Order #789: $299.99, shipped 2025-10-10

USER_INPUT (untrusted):
[User's actual query goes here]

The key is that your application logic treats these as distinct components. Your system prompt isn’t just text at the top of your context window that can be overridden by clever user input; it’s enforced at the API level, in your orchestration layer, before it ever hits the LLM.

OpenAI’s structured outputs API and Anthropic’s system messages both support this pattern natively. Use them. Don’t try to enforce separation purely through prompt engineering. That’s like trying to prevent SQL injection by asking users nicely not to type semicolons.

AI Firewalls: The First Real Defense Layer
#

Traditional firewalls inspect network traffic for malicious patterns. AI firewalls do the same for prompts and outputs. They’re not perfect, but they’re necessary.

An AI firewall sits between your users and your LLM, analyzing inputs and outputs for injection attempts, data leakage, and policy violations. Think of it as your WAF (Web Application Firewall) equivalent for AI systems.

What good AI firewalls detect:

  • Known injection patterns (both direct and indirect)
  • Attempts to extract system prompts or bypass guardrails
  • Suspicious output patterns that suggest compromised responses
  • PII or sensitive data leakage in outputs
  • Unusual token patterns that don’t match legitimate queries

Companies like Lakera, Robust Intelligence, and Promptarmor are building commercial solutions. Open-source options like LLM Guard and NeMo Guardrails give you more control but require more expertise.

The catch: AI firewalls add latency (typically 50-200ms per request) and cost (you’re running additional inference). They also have false positives. Your customer support bot might flag legitimate technical questions as injection attempts.

This is where trade-offs start mattering. For high-risk applications (financial transactions, healthcare, code generation), the overhead is worth it. For low-risk use cases (general knowledge chatbots), maybe not.

Dual LLM Architecture: The Evaluator Pattern
#

Here’s a pattern that’s gaining traction: use one LLM to evaluate the safety of requests before they reach your main system.

The flow looks like this:

  1. User submits input
  2. Evaluator LLM analyzes: “Is this a legitimate query or an injection attempt?”
  3. If safe, proceed to main LLM
  4. Main LLM generates response
  5. Evaluator LLM checks output: “Does this response follow policies?”
  6. If clean, return to user

Why this works better than simple filtering: LLMs are actually quite good at detecting adversarial inputs when that’s their only job. By dedicating a model specifically to security evaluation, you get better accuracy than trying to bolt security onto your main workflow.

Why this isn’t a silver bullet: The evaluator LLM can be attacked too. Researchers have shown that with enough effort, you can craft prompts that fool the evaluator while still injecting malicious instructions into the main system. It’s defense in depth, not a complete solution.

Real-world implementation: Use a smaller, faster model for evaluation (GPT-4o-mini, Claude Haiku) and your primary model for generation. This keeps latency reasonable while adding a meaningful security layer.

Zero-Trust Principles for LLM Applications
#

The most important architectural shift is applying zero-trust principles to AI systems. Every output is untrusted until proven safe. Every action requires explicit authorization.

Implement least-privilege access aggressively. Your chatbot doesn’t need write access to your production database. Your code completion tool doesn’t need network access. Your document summarizer doesn’t need the ability to send emails.

When you do grant permissions, scope them narrowly:

  • Read-only access to specific tables, not entire databases
  • Ability to create draft emails, not send them automatically
  • Access to public documentation, not internal source code

Require human approval for high-stakes actions. If your AI system wants to process a refund over $500, issue a database migration, or modify production configuration, it should create a request for human review, not execute directly.

This is actually where AI systems have an advantage over traditional applications. Users expect a conversation. “I’ve drafted this refund for $750. Would you like me to submit it for approval?” feels natural. Use that to your advantage.

Output Sanitization and Monitoring
#

You can’t catch everything at the input layer, so you need robust output controls.

Content filtering should check for:

  • Leaked system prompts or internal instructions
  • PII or credentials that shouldn’t be in responses
  • Malicious content (phishing links, social engineering)
  • Off-policy responses (your customer support bot shouldn’t be giving medical advice)

Anomaly detection is where things get interesting. Build baselines for normal behavior:

  • Typical response length and complexity
  • Expected data access patterns
  • Common phrasing and tone
  • Frequency of certain operations

When you see deviations (responses that are suddenly much longer, accessing unusual data combinations, or using phrases that don’t match your trained patterns), flag them for review.

The implementation challenge: Building good anomaly detection requires instrumentation from day one. You need to log everything: prompts, responses, data accessed, operations attempted, confidence scores. Most teams don’t think about this until after an incident.

Start logging now. Future you will thank present you.

The Tool Use Problem
#

Here’s where it gets really interesting. Modern AI systems don’t just answer questions; they use tools. They query databases, call APIs, execute code, interact with other systems.

Each tool is an attack vector. If an attacker can inject instructions that cause your AI to use tools maliciously, they’ve achieved something close to remote code execution.

The defense: Implement tool use policies at the orchestration layer, not in the prompt.

Instead of telling your LLM “you can use the database tool to look up customer records,” implement it in code:

def can_use_tool(tool_name, parameters, context):
    if tool_name == "database_query":
        # Enforce read-only
        if "INSERT" in parameters.query.upper():
            return False
        # Enforce scope
        if context.user_role != "support" and "customer_data" in parameters.table:
            return False
    return True

Your orchestration layer validates every tool call before execution. The LLM can request actions, but your code decides what’s allowed.

The Real Talk: Trade-offs Nobody Mentions
#

Every security control has costs. Let’s be honest about them:

Latency: AI firewalls, dual LLM evaluation, output filtering all add 50-200ms. Stack them together and you’re adding seconds to response times. For real-time applications, this might be unacceptable.

False positives: Aggressive filtering catches legitimate queries. Your technical users will be frustrated when their debugging questions get flagged as injection attempts. Your security team and product team will argue about where to set thresholds.

Cost: Every evaluation layer is additional inference. If you’re processing millions of requests, the costs add up fast. A dual LLM architecture with output filtering can easily 3x your inference costs.

Complexity: More security layers mean more failure modes. What happens when your AI firewall goes down? Do you fail open (risky) or fail closed (customer impact)? These aren’t theoretical questions; you need answers before production.

The practical approach: Start with structured prompts and least-privilege access. These are low-cost, high-value changes. Add AI firewalls for high-risk operations. Implement dual LLM evaluation where the stakes justify the cost. Build monitoring and anomaly detection from day one.

Don’t try to implement everything at once. You’ll slow down your team and create a system so complex that security controls become the thing that breaks.

What’s Working in Production
#

After investing countless hours researching and experimenting with AI security, both theoretically and hands-on in production environments, here’s the architecture that actually works:

Layer 1: Input validation - Structured prompts, basic pattern matching, rate limiting

Layer 2: Execution control - Least-privilege tool access, operation allowlists, human approval workflows

Layer 3: Output verification - Content filtering, PII detection, policy compliance checks

Layer 4: Monitoring - Logging, anomaly detection, audit trails, incident response playbooks

Notice what’s missing: attempts to make the LLM itself secure. That’s not how this works. The LLM is a powerful but fundamentally untrustworthy component. Your architecture assumes it can be compromised and builds controls around it.

It’s the same philosophy we use for traditional applications: don’t trust user input, validate at boundaries, enforce least privilege, assume breach.

What Engineering Leaders Should Focus On
#

If you’re responsible for AI security, here’s your practical checklist:

This week: Audit your current AI systems. What data can they access? What actions can they take? Where are you mixing trusted and untrusted data?

This month: Implement structured prompts and least-privilege access. These are table stakes and should be non-negotiable.

This quarter: Add monitoring and anomaly detection. You need visibility before you can respond to incidents.

This year: Build tool use policies, implement human approval workflows for high-stakes operations, and establish incident response procedures.

Don’t wait for perfect solutions. The organizations getting this right aren’t the ones with the fanciest technology; they’re the ones who started early and iterated based on real-world experience.

What’s Coming Next
#

Defensive architectures are maturing fast. We’re seeing:

  • Better frameworks that enforce security by default
  • Standardized APIs for AI firewalls and evaluation
  • Industry benchmarks for measuring AI security effectiveness
  • Compliance frameworks that mandate specific controls

But here’s what nobody’s talking about: all of these defenses assume you control your infrastructure. What happens when the vulnerability isn’t in your code, but in the pre-trained model you downloaded? The prompt template you copied from GitHub? The RAG knowledge base you inherited from the previous team?

In the next part of this series, we’ll explore the AI supply chain: the attack vector that most teams don’t even know exists. Because the biggest security risk might not be in what you build, but in what you’re building on top of.

Securing Intelligence - This article is part of a series.
Part 2: This Article

Related

Prompt Injection 2.0: The New Frontier of AI Attacks
·1631 words·8 mins
AI's Dual Edge: When to Disrupt and When to Compound
·3545 words·17 mins
Google Drive Gets AI-Powered Ransomware Protection
·1207 words·6 mins