The Practical Guide to Integrating LLMs into Production SaaS

By FiveNodes Team · May 2025 · 8 min read

Most LLM integrations fail because of the plumbing, not the model. After shipping 18+ AI features into production SaaS products, we've seen the same failure modes repeat. Hallucinations get the headlines. What actually kills AI features in production is missing rate-limit handling, no cost monitoring, no fallback when OpenAI goes down, and prompts that work in demos but degrade badly on edge-case user inputs.

This is what we've actually learned. Not theory — specific decisions and patterns drawn from real production deployments.

The model is a commodity. The integration layer — how you wrap, route, cache, monitor and fall back — is where AI features succeed or fail in production.

1. Design a model-agnostic abstraction layer first

Before writing a single prompt, build an abstraction layer. Every AI call in your codebase should go through a single internal service — not directly to OpenAI's SDK. This gives you:

Provider portability — swap OpenAI for Anthropic, Mistral, or a self-hosted model without touching product code
Centralised observability — one place to log every request, latency, token count, and cost
Fallback routing — if your primary provider is down, route to a backup automatically
Rate-limit management — handle 429s, retries, and backoff in one place

The interface should look like: ai.complete({ prompt, model?, context?, maxTokens? }). The caller doesn't know or care which model ran.
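A minimal sketch of such a layer in TypeScript, under stated assumptions: the names `CompletionRequest`, `CompletionResult`, `Provider` and `AiService` are hypothetical, and the two providers are stubs standing in for real SDK wrappers. It shows the single entry point plus ordered fallback, not a complete implementation.

```typescript
// Hypothetical request shape; the fields mirror the interface described above.
interface CompletionRequest {
  prompt: string;
  model?: string;
  context?: string;
  maxTokens?: number;
}

interface CompletionResult {
  text: string;
  provider: string;
  latencyMs: number;
}

// Each vendor SDK would be wrapped behind this (assumed) interface.
interface Provider {
  name: string;
  complete(req: CompletionRequest): Promise<string>;
}

class AiService {
  private providers: Provider[];

  // Providers are tried in order: primary first, then backups.
  constructor(providers: Provider[]) {
    this.providers = providers;
  }

  async complete(req: CompletionRequest): Promise<CompletionResult> {
    const errors: string[] = [];
    for (const provider of this.providers) {
      const started = Date.now();
      try {
        const text = await provider.complete(req);
        // One place to record latency, token counts and cost for every call.
        return { text, provider: provider.name, latencyMs: Date.now() - started };
      } catch (err) {
        errors.push(`${provider.name}: ${String(err)}`);
      }
    }
    throw new Error(`All providers failed: ${errors.join("; ")}`);
  }
}

// Stub providers for illustration only; real ones would call a vendor SDK.
const failingPrimary: Provider = {
  name: "primary",
  complete: async () => {
    throw new Error("503 Service Unavailable");
  },
};
const workingBackup: Provider = {
  name: "backup",
  complete: async (req) => `echo: ${req.prompt}`,
};

const ai = new AiService([failingPrimary, workingBackup]);
```

Product code calls `ai.complete({ prompt: "..." })` and never imports a vendor SDK, so swapping or reordering providers is a change to one constructor call.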

2. Prompt engineering is software engineering

Prompts are code. They need to be version-controlled, tested, and reviewed. We store all prompts in a /prompts directory as plain text files with semantic versioning. When a prompt changes, the old version is kept, and every output is logged with the prompt version that produced it.
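As an illustration, here is a small sketch of a versioned prompt lookup. The in-memory `registry`, the `{{placeholder}}` template syntax and the function names are assumptions made for the example; in practice the entries would be loaded from the files in /prompts.

```typescript
// Hypothetical prompt record; in practice loaded from files under /prompts.
interface PromptVersion {
  name: string;
  version: string; // semantic version, e.g. "1.1.0"
  template: string; // {{placeholder}} syntax is an assumed convention
}

const registry: PromptVersion[] = [
  { name: "summarise", version: "1.0.0", template: "Summarise: {{text}}" },
  { name: "summarise", version: "1.1.0", template: "Summarise in 3 bullet points: {{text}}" },
];

// Old versions stay in the registry, so any logged version can be reproduced.
function getPrompt(name: string, version: string): PromptVersion {
  const found = registry.find((p) => p.name === name && p.version === version);
  if (!found) throw new Error(`Unknown prompt ${name}@${version}`);
  return found;
}

// Fills placeholders and returns an id to log alongside the model output.
function render(prompt: PromptVersion, vars: Record<string, string>) {
  const text = prompt.template.replace(/\{\{(\w+)\}\}/g, (_match, key: string) => {
    const value = vars[key];
    if (value === undefined) throw new Error(`Missing variable: ${key}`);
    return value;
  });
  return { text, promptId: `${prompt.name}@${prompt.version}` };
}
```

Logging `promptId` with every output makes it possible to trace a regression back to the exact prompt change that caused it.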

Things that actually matter in production prompts:

System prompt discipline: be explicit about format, length, and what the model should do when it doesn't know something. An instruction such as "If you are uncertain, say so" reduces confident hallucinations, though it does not eliminate them.
Output schemas: for anything parsed downstream, ask for JSON with a defined schema. Validate it. Never trust unstructured output from an LLM in a data pipeline.
Negative examples: show the model what you don't want, not just what you want. This alone reduces format drift significantly.
User input sanitisation: always strip or escape user-supplied content before it enters a prompt, and keep it out of the system prompt. Sanitisation lowers the risk but does not remove it; prompt injection is a real attack vector.
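The output-schema point can be sketched as a small validator. The `Classification` shape below is a hypothetical schema for a support-ticket classifier, chosen only for illustration; a schema library would do the same job with less hand-written checking.

```typescript
// Hypothetical schema for a support-ticket classifier.
type Category = "billing" | "bug" | "other";

interface Classification {
  category: Category;
  confidence: number; // expected range: 0 to 1
}

const CATEGORIES: readonly string[] = ["billing", "bug", "other"];

// Parses and validates raw model output. Throws rather than returning
// partially valid data, so bad output never reaches the pipeline.
function parseClassification(raw: string): Classification {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (_err) {
    throw new Error("Model output is not valid JSON");
  }
  if (typeof data !== "object" || data === null) {
    throw new Error("Expected a JSON object");
  }
  const record = data as Record<string, unknown>;
  const category = record["category"];
  const confidence = record["confidence"];
  if (typeof category !== "string" || CATEGORIES.indexOf(category) === -1) {
    throw new Error("Invalid or missing category");
  }
  if (typeof confidence !== "number" || confidence < 0 || confidence > 1) {
    throw new Error("Confidence must be a number between 0 and 1");
  }
  return { category: category as Category, confidence };
}
```

The caller decides what a validation failure means for the feature: retry the request, fall back to a default, or surface an error.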

3. Cost management from day one

AI API costs can scale 10x overnight if a feature gets unexpected usage. Build cost controls before you launch, not after you get an unexpected invoice.

Pattern 1: Per-user token budgets

Track token usage per user or tenant. Set soft limits (warn) and hard limits (graceful degradation). Surface usage back to the user so they understand the constraint.
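A minimal sketch of soft and hard limits, assuming an in-memory counter and hypothetical names (`TokenBudget`, `tryConsume`). A production version would persist usage and reset it per billing period.

```typescript
type BudgetStatus = "ok" | "warn" | "blocked";

// In-memory per-tenant token counter (an assumption for the example;
// real usage data would live in a database or cache).
class TokenBudget {
  private used = new Map<string, number>();
  private softLimit: number;
  private hardLimit: number;

  constructor(softLimit: number, hardLimit: number) {
    this.softLimit = softLimit;
    this.hardLimit = hardLimit;
  }

  // Records the tokens if they fit under the hard limit.
  // "blocked" means nothing was recorded and the caller should degrade gracefully.
  tryConsume(tenantId: string, tokens: number): BudgetStatus {
    const next = (this.used.get(tenantId) ?? 0) + tokens;
    if (next > this.hardLimit) return "blocked";
    this.used.set(tenantId, next);
    return next > this.softLimit ? "warn" : "ok";
  }

  // Remaining tokens, for surfacing usage back to the user.
  remaining(tenantId: string): number {
    return this.hardLimit - (this.used.get(tenantId) ?? 0);
  }
}
```

With `new TokenBudget(800, 1000)`, a tenant that consumes 500 and then 400 tokens gets "ok" followed by "warn", and a further 200-token request is "blocked" with 100 tokens still remaining.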

Pattern 2: Semantic caching

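Cache LLM responses for semantically similar inputs. One user asking "summarise this contract" and another asking "give me a summary of this contract" about the same document should hit the same cache entry, so key the cache on the document as well as the request, and never share entries across tenants. Match on embedding similarity; a cosine distance below 0.05 is a reasonable starting threshold to tune against your own data. In our experience this reduces repeat calls by 30–60% in document-heavy apps.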
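A sketch of the matching logic, assuming embeddings are already computed elsewhere and passed in as plain number arrays. The `scope` key (for example tenant plus document id) and the linear scan are simplifications for the example; a real deployment would likely use a vector index.

```typescript
interface CacheEntry {
  scope: string; // assumed key, e.g. tenant id + document id
  embedding: number[];
  response: string;
}

// Cosine distance = 1 - cosine similarity. Returns NaN for a zero vector,
// which never passes the threshold check below.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private maxDistance: number;

  constructor(maxDistance = 0.05) {
    this.maxDistance = maxDistance;
  }

  // Returns the closest cached response within the threshold, if any.
  get(scope: string, embedding: number[]): string | undefined {
    let best: CacheEntry | undefined;
    let bestDistance = this.maxDistance;
    for (const entry of this.entries) {
      if (entry.scope !== scope) continue; // never match across scopes
      const distance = cosineDistance(entry.embedding, embedding);
      if (distance < bestDistance) {
        best = entry;
        bestDistance = distance;
      }
    }
    return best?.response;
  }

  set(scope: string, embedding: number[], response: string): void {
    this.entries.push({ scope, embedding, response });
  }
}
```

Scoping entries before comparing embeddings keeps a summary of one contract from being served for another, however similar the wording of the two requests.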

Pattern 3: Model tiering

Not every task needs GPT-4. Use a smaller, cheaper model (GPT-4o-mini, Claude Haiku) for classification, tagging, and short-form generation. Reserve large models for complex reasoning tasks. A tiering strategy typically cuts AI costs by 40–70%.
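Tiering can be as simple as a lookup table inside the abstraction layer. The task kinds and the model identifiers below are placeholders for illustration, not real model names.

```typescript
// Hypothetical task kinds; adapt to the features in your product.
type TaskKind = "classification" | "tagging" | "short_generation" | "complex_reasoning";
type Tier = "small" | "large";

const TIER_BY_TASK: Record<TaskKind, Tier> = {
  classification: "small",
  tagging: "small",
  short_generation: "small",
  complex_reasoning: "large",
};

// Placeholder identifiers; map these to whatever models your providers offer.
const MODEL_BY_TIER: Record<Tier, string> = {
  small: "small-model",
  large: "large-model",
};

// An explicit override lets a caller escalate a task that the small tier handles badly.
function pickModel(task: TaskKind, override?: Tier): string {
  return MODEL_BY_TIER[override ?? TIER_BY_TASK[task]];
}
```

Keeping the mapping in one table means a pricing or quality change is a one-line edit rather than a search through product code.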

4. Fallback and resilience

OpenAI's API has outages. All external APIs do. Your product should not go fully down when your AI provider does. Design for graceful degradation:

Non-AI code paths for critical features — AI should enhance, not be the only path
Secondary provider fallback for high-priority requests (Anthropic as backup for OpenAI and vice versa)
Circuit breaker pattern — after N consecutive failures, stop sending requests and return a graceful error for a cooldown period
Retry with exponential backoff and jitter for 429 and 503 errors, honouring the Retry-After header when the provider sends one
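The circuit breaker and backoff items above can be sketched as follows. The names, default values and injectable clock are assumptions made so the example is testable; the jitter in `backoffDelayMs` is a common refinement added here, not something the list above specifies.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped,
// then scaled by a random factor in [0, 1) ("full jitter").
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 30000,
  random: () => number = Math.random
): number {
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * exponential);
}

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private threshold: number;
  private cooldownMs: number;
  private now: () => number;

  // `now` is injectable so the cooldown can be tested without waiting.
  constructor(threshold: number, cooldownMs: number, now: () => number = () => Date.now()) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // False while the circuit is open; true again once the cooldown has elapsed,
  // which lets a trial request through.
  canRequest(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Opens (or re-opens) the circuit once consecutive failures reach the threshold.
  recordFailure(): void {
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = this.now();
  }
}
```

A caller checks `canRequest()` before each provider call, reports the outcome with `recordSuccess()` or `recordFailure()`, and returns the graceful error or non-AI path while the circuit is open.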

5. Evaluation and monitoring

You can't improve what you don't measure. Build an eval framework before you launch. For every AI feature, define: what does "good" output look like? How do you detect regressions when you change a prompt?

In production, monitor:

Latency at p50, p95, p99: LLM latency is high-variance, and p99 matters enormously for UX.
Token counts per request: sudden spikes can indicate prompt injection or user behaviour you didn't anticipate.
Error rates by type (429, 500, timeout): set alerts on these.
Output quality signals: thumbs up/down, correction rates, abandonment after an AI response.
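For the latency percentiles above, a small sketch using the nearest-rank method. The function names are hypothetical, and a metrics backend would normally compute these for you; the example only shows what the numbers mean.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("No samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point error in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

function latencySummary(samplesMs: number[]) {
  return {
    p50: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
    p99: percentile(samplesMs, 99),
  };
}
```

With ten samples where nine fall between 80 ms and 120 ms and one takes 2000 ms, the p50 stays near 100 ms while the p95 and p99 both report the 2000 ms outlier, which is why the tail percentiles are the ones worth alerting on.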

The 3am alarm that wakes you up is never "model quality degraded." It's "AI costs are 40x normal" or "every AI call is timing out." Monitor cost and latency before you monitor quality.

6. Security considerations specific to LLMs

LLM integrations introduce attack surfaces that don't exist in traditional software:

Prompt injection: users can attempt to override your system prompt via their inputs. Never concatenate user input directly into system-level prompt strings. Use separate message roles (system, user, assistant) correctly.
Data leakage via context: if you include other users' data in context windows (for RAG or few-shot examples), enforce strict tenant isolation. One user's documents must never appear in another user's AI context.
PII in prompts: if prompts contain user PII, your AI provider's data retention policies apply. Know them. Some compliance contexts require specific agreements with the provider, such as a business associate agreement under HIPAA or a data processing agreement under GDPR, and some teams will also need zero-retention terms.
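The first two points can be illustrated with a message builder that keeps roles separate and filters retrieved documents by tenant. The `ChatMessage` and `RetrievedDoc` shapes and the `buildMessages` function are assumptions for the example. Role separation reduces the risk of injection but does not remove it, and tenant filtering belongs in the retrieval query too, so the check here is a second line of defence.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Hypothetical shape for documents returned by a retrieval step.
interface RetrievedDoc {
  tenantId: string;
  text: string;
}

function buildMessages(
  systemPrompt: string,
  userInput: string,
  tenantId: string,
  docs: RetrievedDoc[]
): ChatMessage[] {
  // Tenant isolation: drop any document that does not belong to the requester.
  const allowed = docs.filter((doc) => doc.tenantId === tenantId);
  const context = allowed.map((doc) => doc.text).join("\n---\n");
  return [
    // The system prompt is a fixed string; user text is never interpolated into it.
    { role: "system", content: systemPrompt },
    // User-controlled and retrieved text travels only in the user role.
    { role: "user", content: `Reference material:\n${context}\n\nQuestion:\n${userInput}` },
  ];
}
```

Even if the user input contains "ignore previous instructions", the system message sent to the model is byte-for-byte the one you wrote.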

The integration checklist before you go live

Abstraction layer in place — no direct SDK calls from product code
All prompts version-controlled and tested
Output validation for any structured response
Rate limit handling and retry logic
Per-user cost tracking and limits
Semantic caching implemented
Fallback provider or degraded mode for outages
Latency and cost alerting configured
Prompt injection protections in place
Data retention policy reviewed with your compliance requirements

If you're building AI features into a SaaS product and want a second opinion on your integration architecture, reach out. We've seen what works and what gets engineers paged at 3am.