How We Build

The Methodology Behind Production AI Systems

Agentic architecture, eval-first delivery, hardened guardrails, live observability and operational governance — the technical playbook behind every Shakan engagement.

Written for the senior engineer or technical buyer who needs to know how it actually works before signing.

Operating Principle

“We don’t sell tooling. We design and ship the system that makes the tooling produce a measurable business outcome.”

The Build Loop

Six Phases, One Repeatable Loop

Every engagement runs through the same structured path — so quality is reproducible, not heroic.

Phase 01

Discovery & Architecture

1–2 weeks

System mapping across your existing stack, ROI hypotheses tied to a measurable revenue or cost line, eval-harness scoping, and a written architecture brief before a single line of agent code is committed.

Phase 02

Eval Harness First

Weeks 1–2

Golden datasets, regression sets and a scoring rubric are assembled before any build work begins. If we cannot describe ‘good output’ in a test, we cannot ship it.

Phase 03

Phased Build

Weeks 2–8

Vertical-slice delivery: a thin end-to-end path lands first, then breadth. You approve scope at the gate between every phase — no Big Bang releases, no surprise invoices.

Phase 04

Production Deploy

Weeks 6–10

Canary release behind a feature flag, fallback chains wired up, traces and dashboards live from day one. We watch the first 100 real interactions side-by-side with your team.

Phase 05

Continuous Eval

Ongoing

Every prompt change, every model upgrade and every tool revision triggers the regression suite. Output drift, latency drift and cost drift are tracked weekly, not anecdotally.

Phase 06

Handover & Knowledge Transfer

Final 2 weeks

Runbooks, source escrow, prompt registry walkthrough and live training with your engineering or operations team. The goal is that your organisation can operate the system without us.

Agentic Architecture

LangGraph State Machines, Not Prompt Spaghetti

We architect agents as typed state machines, not chains of hopeful prompts. That choice drives everything downstream.

Why LangGraph over LangChain

LangChain is excellent for composing primitives. For stateful, multi-step flows that must be observable, replayable and testable, we lean on LangGraph: typed state, deterministic transitions, checkpointing, and first-class human-in-the-loop nodes. LangChain components still live inside LangGraph nodes where useful.

For simple linear chains, LangChain alone is the right call. We pick the tool that fits the workload, not the trend cycle.

CrewAI vs LangGraph vs custom

  • CrewAI — best when the workflow is naturally a small team of specialist agents with clear roles and minimal state.
  • LangGraph — default for production: stateful flows, multi-turn behaviour, A2A (agent-to-agent) handoffs, durable checkpoints.
  • Custom runtime — when latency, cost or compliance constraints make the framework overhead the wrong tradeoff.
Typical state machine

  • Intent Router: classifies the inbound request, sets the state schema
  • Tool Selector: chooses retrieval, search, CRM or write tools
  • Tool Executor: calls tools with typed inputs; retries with backoff
  • Validator: schema validation, claim-safety, refusal handling
  • Response: composes final user-facing output
  • Audit Log: persists trace, cost, latency and decision metadata

Each node has typed input / output state, an owning prompt version, a fallback path and an entry in the audit log. Tool-use grounding lives at the Tool Executor; refusal and schema validation live at the Validator.
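
To make the node sequence above concrete, here is a minimal, framework-free sketch of a typed state machine. It is not LangGraph code: the `AgentState` fields, the node functions and the keyword-based routing are all hypothetical stand-ins, chosen only to show typed state, explicit transitions and an audit entry per run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

AUDIT_LOG: List[dict] = []  # in-memory stand-in for a durable audit store


@dataclass
class AgentState:
    """Typed state handed from node to node (field names are illustrative)."""
    request: str
    intent: Optional[str] = None
    tool: Optional[str] = None
    tool_result: Optional[str] = None
    valid: bool = False
    response: Optional[str] = None
    trace: List[str] = field(default_factory=list)


def intent_router(s: AgentState) -> str:
    # Toy classifier: a real router would call a model or a rules engine.
    s.intent = "booking" if "book" in s.request.lower() else "general"
    return "tool_selector"


def tool_selector(s: AgentState) -> str:
    s.tool = "crm" if s.intent == "booking" else "search"
    return "tool_executor"


def tool_executor(s: AgentState) -> str:
    s.tool_result = f"{s.tool} result for: {s.request}"  # stand-in for a tool call
    return "validator"


def validator(s: AgentState) -> str:
    s.valid = bool(s.tool_result)
    return "response"


def response(s: AgentState) -> str:
    s.response = s.tool_result if s.valid else "Routing you to a human."
    return "audit_log"


def audit_log(s: AgentState) -> str:
    AUDIT_LOG.append({"intent": s.intent, "tool": s.tool, "path": list(s.trace)})
    return "END"


NODES: Dict[str, Callable[[AgentState], str]] = {
    "intent_router": intent_router,
    "tool_selector": tool_selector,
    "tool_executor": tool_executor,
    "validator": validator,
    "response": response,
    "audit_log": audit_log,
}


def run(state: AgentState) -> AgentState:
    node = "intent_router"
    while node != "END":
        state.trace.append(node)   # record every node visited
        node = NODES[node](state)  # each node returns the name of the next node
    return state
```

Calling `run(AgentState(request="Please book me in for Tuesday"))` walks all six nodes in order and leaves one entry in `AUDIT_LOG`. Because every transition is an explicit return value, the path can be replayed and asserted on in tests.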

Evals

How We Measure Quality

If you can’t describe ‘good output’ as a test, you can’t ship it. Evals are the contract between you, us and the model.

Golden datasets

50–500 examples per intent, curated with subject-matter experts. Versioned in git. Treated as production code.

Regression suites

Run on every prompt change, model upgrade or tool revision. CI blocks the merge if regressions exceed the agreed threshold.
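
A merge gate of this kind can be sketched in a few lines. The example below assumes a hypothetical `EvalResult` record and a pass-rate threshold called `max_drop`; the real threshold and metric are whatever the engagement agrees on.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EvalResult:
    example_id: str
    passed: bool


def pass_rate(results: Sequence[EvalResult]) -> float:
    if not results:
        raise ValueError("no eval results to score")
    return sum(r.passed for r in results) / len(results)


def regressions(baseline: Sequence[EvalResult],
                candidate: Sequence[EvalResult]) -> List[str]:
    """IDs that passed on the baseline but fail on the candidate."""
    passed_before = {r.example_id for r in baseline if r.passed}
    return [r.example_id for r in candidate
            if not r.passed and r.example_id in passed_before]


def merge_allowed(baseline: Sequence[EvalResult],
                  candidate: Sequence[EvalResult],
                  max_drop: float = 0.02) -> bool:
    """True when the candidate's pass rate has not dropped by more than max_drop."""
    return pass_rate(baseline) - pass_rate(candidate) <= max_drop
```

A CI job would exit non-zero when `merge_allowed` returns `False` and print the output of `regressions` so the reviewer sees exactly which golden examples broke.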

Model A/B

Structured comparisons between Claude Sonnet 4.6, Opus 4.7, Haiku 4.5, GPT-4o, o1 and DeepSeek-V3 / R1 — scored on the same rubric.

Hallucination rate

Measured against grounded sources, not vibes. Reported per intent and per model with confidence intervals where sample size allows.
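
As an illustration of reporting a rate with a confidence interval, the sketch below uses the Wilson score interval, which behaves reasonably at small sample sizes. The choice of Wilson and the 95% level (`z = 1.96`) are assumptions made for this example, not a statement of the exact method used on any engagement.

```python
import math
from typing import Tuple


def wilson_interval(hallucinated: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an observed rate of hallucinated / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hallucinated <= n:
        raise ValueError("hallucinated must be between 0 and n")
    p = hallucinated / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)
```

For 6 hallucinated answers in a 200-example golden set, the observed rate is 3% and the interval is roughly 1.4% to 6.4%, which is a more honest report than the point estimate alone.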

Tool-use accuracy

Did the agent select the right tool, with the right arguments, in the right order? Scored per turn against expected traces.
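
One way to score the three questions separately is sketched below. The `ToolCall` record and the `call` helper are hypothetical, and the sketch assumes argument values are hashable.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: Tuple[Tuple[str, object], ...]  # sorted (key, value) pairs, so calls compare cleanly


def call(name: str, **args: object) -> ToolCall:
    return ToolCall(name, tuple(sorted(args.items())))


def score_turn(expected: Sequence[ToolCall], actual: Sequence[ToolCall]) -> Dict[str, bool]:
    """Score one turn against its expected trace on tool, arguments and order."""
    return {
        # Same tools called the same number of times, ignoring order and arguments.
        "right_tools": Counter(c.name for c in expected) == Counter(c.name for c in actual),
        # Same calls including arguments, ignoring order.
        "right_args": Counter(expected) == Counter(actual),
        # Same sequence of tool names.
        "right_order": [c.name for c in expected] == [c.name for c in actual],
    }
```

Keeping the three checks separate shows whether a failure was a wrong tool, a wrong argument or a wrong sequence, which usually point to different fixes.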

Latency & cost

p50 / p95 / p99 latency budgets, cost-per-conversation tracked per agent, per tenant and per model — surfaced in dashboards.

Deterministic snapshot tests cover prompts whose outputs must not drift unexpectedly — useful for legal, clinical or claim-sensitive copy. A snapshot failure is a deliberate decision point, not an outage.
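
A snapshot check can be as small as a stored digest per prompt. The sketch below assumes the model output is already deterministic for the prompt under test (for example, fixed decoding settings); the function names and the in-memory `stored` dict are illustrative.

```python
import hashlib
import json
from typing import Dict


def snapshot_digest(prompt: str, output: str) -> str:
    """Stable digest of a prompt and the output it produced."""
    payload = json.dumps({"prompt": prompt, "output": output}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_snapshot(stored: Dict[str, str], key: str, prompt: str, output: str) -> str:
    """Return 'recorded', 'match' or 'changed' for the snapshot named by key."""
    digest = snapshot_digest(prompt, output)
    if key not in stored:
        stored[key] = digest  # first run: record the approved snapshot
        return "recorded"
    return "match" if stored[key] == digest else "changed"
```

A `changed` result does not overwrite the stored digest. Someone has to review the new output and approve it explicitly, which is what makes the failure a decision point.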

Guardrails

Safety Architecture

Defence in depth. Validation, refusal patterns, escalation paths and fallback chains designed in — not retrofitted.

Output schema validation

Every structured output is validated through Zod (TypeScript) or Pydantic (Python). Invalid outputs trigger a retry-with-feedback loop, not a silent failure.
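
The retry-with-feedback loop can be sketched without any particular library. In the example below a hand-written check stands in for a Pydantic or Zod schema, `REQUIRED` is a hypothetical two-field schema, and `model` is any callable that takes a prompt and returns text.

```python
import json
from typing import Callable, Optional, Tuple

REQUIRED = {"intent": str, "confidence": float}  # hypothetical schema


def validate(raw: str) -> Tuple[Optional[dict], str]:
    """Return (data, '') when raw satisfies the schema, else (None, reason)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"not valid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return None, "expected a JSON object"
    for key, expected_type in REQUIRED.items():
        if key not in data:
            return None, f"missing field '{key}'"
        if not isinstance(data[key], expected_type):
            return None, f"field '{key}' must be {expected_type.__name__}"
    return data, ""


def generate_validated(model: Callable[[str], str], prompt: str,
                       max_attempts: int = 3) -> dict:
    """Call the model, feeding validation errors back until the output is valid."""
    feedback, error = "", "no attempts made"
    for _ in range(max_attempts):
        data, error = validate(model(prompt + feedback))
        if data is not None:
            return data
        feedback = f"\nYour previous output was rejected: {error}. Return only valid JSON."
    raise ValueError(f"no valid output after {max_attempts} attempts: {error}")
```

The loop raises after the last attempt instead of returning a partial result, so the caller has to handle the failure explicitly, for instance by moving to a fallback or a human reviewer.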

Content filters & refusal handling

Pre- and post-call filters for unsafe content, plus deterministic refusal patterns the model can fall back to rather than hallucinating an answer.

Human-in-the-loop thresholds

Confidence floors and policy triggers route ambiguous decisions to a human reviewer with full context — not an apology dialog box.

Fallback model chains

Sonnet 4.6 → Haiku 4.5 → static response, or vendor-A → vendor-B → cached. The system stays graceful when a provider is degraded.
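
A fallback chain reduces to trying providers in order and ending on a static answer. The sketch below is an assumption-level illustration: each provider is a plain callable, the names are labels only, and any exception counts as a failure.

```python
from typing import Callable, Sequence, Tuple

STATIC_RESPONSE = "We're routing you to a human."

Provider = Tuple[str, Callable[[str], str]]  # (label, callable that returns text)


def call_with_fallback(providers: Sequence[Provider], prompt: str) -> Tuple[str, str]:
    """Try each provider in order; return (label, text) from the first that succeeds."""
    for name, provider in providers:
        try:
            return name, provider(prompt)
        except Exception:
            continue  # degraded provider: move to the next link in the chain
    return "static", STATIC_RESPONSE
```

Returning the label alongside the text lets the audit log and dashboards show how often traffic is being served by a fallback and not by the primary.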

Claim-safety checks

For content systems, factual claims are checked against source-of-truth documents before publication. No quiet fabrication.

PII redaction & injection defence

Inbound text is scanned for PII and prompt-injection patterns before it reaches the model. Outbound text is scanned for accidental leakage.
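
The inbound scan can be illustrated with a few regular expressions. This is a deliberately small sketch: the three patterns are illustrative assumptions, and a production scanner needs far broader PII coverage and injection detection than pattern matching alone.

```python
import re
from typing import Tuple

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s-]{7,}\d")
INJECTION = re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)


def scan_inbound(text: str) -> Tuple[str, bool]:
    """Return (redacted_text, injection_suspected) for one inbound message."""
    redacted = EMAIL.sub("[EMAIL]", text)
    redacted = PHONE.sub("[PHONE]", redacted)
    return redacted, bool(INJECTION.search(text))
```

The same `EMAIL` and `PHONE` patterns can be run over outbound text to flag accidental leakage before a response is sent.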

Observability

What We Watch

Production AI without observability is just a confident demo. We instrument from the first commit.

Distributed tracing

LangSmith, OpenTelemetry and Helicone-style tracing across the agent graph. Every tool call, every retry, every token cost — captured.

Latency dashboards

p50 / p95 / p99 per node and end-to-end. Alerts trigger before users notice, not after the support tickets arrive.
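
Computing the three percentiles and checking them against a budget needs only the standard library. The function names and the budget dictionary below are hypothetical; the percentile method (`inclusive`) is one reasonable choice among several.

```python
import statistics
from typing import Dict, List, Sequence


def latency_percentiles(samples_ms: Sequence[float]) -> Dict[str, float]:
    """p50 / p95 / p99 from raw latency samples (needs at least two samples)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def budget_breaches(percentiles: Dict[str, float],
                    budget_ms: Dict[str, float]) -> List[str]:
    """Names of the percentiles that exceed their budget."""
    return [name for name, value in percentiles.items() if value > budget_ms[name]]
```

Running this per node as well as end-to-end shows which node is responsible when the overall p95 drifts past its budget.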

Cost-per-conversation

Tracked per agent, per tenant, per model. Anomalies on cost are treated as P2 incidents — runaway tokens are a real production risk.

Drift detection

Output distribution monitoring flags when behaviour shifts after a model swap, a prompt edit or an upstream data change.
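
One simple way to monitor an output distribution is to compare the share of each output category between a baseline window and the current window. The sketch below uses total variation distance with a threshold of 0.1; both the metric and the threshold are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def distribution(labels: Sequence[str]) -> Dict[str, float]:
    """Share of each label, e.g. 'answer', 'refusal', 'escalation'."""
    if not labels:
        raise ValueError("no labels to summarise")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Half the sum of absolute differences; 0 means identical, 1 means disjoint."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def drifted(baseline: Sequence[str], current: Sequence[str],
            threshold: float = 0.1) -> bool:
    return total_variation(distribution(baseline), distribution(current)) > threshold
```

If refusals rise from 10% to 30% of outputs after a model swap, the distance is 0.2 and the check fires, even when every individual response still passes schema validation.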

Human review queue

A sampled stream of conversations is routed to human reviewers — used to feed regression sets, catch novel failure modes and tune confidence floors.

Transparency where appropriate

Where the use case warrants, end-users see confidence signals, source citations or ‘why this answer’ summaries.

Governance

Operational Discipline

The same change-management rigour you expect from any production system — applied to prompts, models and agents.

Audit logs

Every agent action — tool call, write, escalation, refusal — is logged with input, output, model version and prompt hash.
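
An audit record with these fields might look like the sketch below. The `AuditRecord` shape, the 12-character hash prefix and the JSON-lines output are assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def prompt_hash(prompt_text: str) -> str:
    """Short, stable identifier for the exact prompt text in use."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class AuditRecord:
    action: str          # "tool_call", "write", "escalation" or "refusal"
    input: str
    output: str
    model_version: str
    prompt_hash: str


def to_log_line(record: AuditRecord) -> str:
    """Serialise one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Hashing the prompt text means a log line can be matched to the prompt version that produced it without storing the full prompt in every record.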

PII handling

Data residency, retention windows, masking rules and access controls are written into the architecture, not bolted on after launch.

Model selection policy

When to choose Claude vs GPT vs open-weight: documented per workload, with cost, quality, latency and compliance tradeoffs made explicit.

Prompt registry

Every prompt versioned, diffed, code-reviewed and tied to the eval run that approved it. No more ‘someone edited it in the UI on Friday’.

Change management

Agent changes follow a structured workflow: eval delta → staging → canary → production. Same rigour as any other production system.

Compliance hooks

HIPAA, AHPRA, GDPR and AFSL touchpoints are designed in where the vertical demands it — and verified during the audit phase.

Stack

The Stack We Orchestrate

Tooling is chosen per workload — not from a preferred-vendor list. We optimise for fit, then for cost, then for vendor stability.

Orchestration

  • LangGraph
  • LangChain
  • CrewAI
  • n8n
  • Make
  • Custom Python / TypeScript runtimes

Models

  • Anthropic Claude Sonnet 4.6, Opus 4.7, Haiku 4.5
  • OpenAI GPT-4o, GPT-4o-mini, o1
  • DeepSeek-V3, DeepSeek-R1
  • Llama 3 and Gemma where on-prem or open-weight is required

Voice

  • Retell AI
  • Vapi
  • ElevenLabs
  • Deepgram
  • AssemblyAI

Vectors & memory

  • Pinecone
  • Weaviate
  • pgvector
  • Supabase

Evaluation

  • LangSmith
  • Helicone
  • Custom golden suites in git

Observability

  • LangSmith
  • OpenTelemetry
  • Datadog
  • Sentry

Hosting

  • Vercel
  • AWS
  • Cloudflare
  • Docker / containerised workloads

In Practice

What This Looks Like in a Real Engagement

Three worked examples — problem, architecture, eval setup, outcome.

Voice AI for a healthcare practice

Problem

Reception team missing 30%+ of inbound calls outside business hours; bookings leaking to competitors.

Architecture

Retell AI front-end, LangGraph state machine for triage and booking, Claude Sonnet 4.6 for clinical-tone responses, Haiku 4.5 fallback for cost control, AHPRA-aware refusal patterns.

Eval setup

200-example golden set covering symptom triage edge cases, appointment-type routing, escalation triggers; weekly regression run.

Outcome

After-hours bookings recovered; human reception now handles only escalations and high-complexity calls.

Operations AI for a professional-services firm

Problem

Senior staff burning 10+ hours per week on document review, status reports and cross-system reconciliation.

Architecture

n8n + LangGraph orchestration, Opus 4.7 for review tasks, GPT-4o-mini for cheap classification, Pydantic schemas on every structured output.

Eval setup

Snapshot tests for report formats, regression set for classification accuracy, monthly human-review sampling.

Outcome

Reclaimed senior capacity redeployed to client work; reporting cadence moved from weekly to on-demand.

Content engine for a B2B SaaS

Problem

Content team unable to keep pace with channel demand; quality inconsistent across writers and weeks.

Architecture

Multi-agent LangGraph (research → outline → draft → claim-check → edit), Claude Opus 4.7 for long-form, Sonnet 4.6 for editing passes, retrieval grounded against a curated source corpus.

Eval setup

Claim-safety rubric, brand-voice scoring, deterministic prompt snapshots; failed claims block publication.

Outcome

Publishing cadence sustained without quality regression; editor time focused on strategy and final approval.

Technical FAQ

Questions Senior Engineers Actually Ask

Why LangGraph over LangChain?

LangChain is a great toolbox of primitives, but for stateful, multi-step agent flows we need explicit state machines — typed state, deterministic transitions, checkpointing and replay. LangGraph gives us that. We still use LangChain components inside LangGraph nodes where it makes sense; it is not an either/or choice. For simple linear chains, LangChain alone is often enough.

How do you measure agent quality?

Quality is measured against a versioned golden dataset of 50–500 examples per intent, scored on a rubric that mixes deterministic checks (schema validity, tool selection, citation presence) with model-graded checks (helpfulness, tone, factual grounding). The same rubric runs in CI, in staging and on sampled production traffic.

What is the failure mode if the model provider has an outage?

Every production agent has a documented fallback chain. A typical pattern: primary on Claude Sonnet 4.6, secondary on Haiku 4.5, tertiary on GPT-4o-mini, final fallback to a static deterministic response that says ‘we’re routing you to a human’ and escalates. Health checks and circuit breakers decide when to fail over, not the user’s patience.
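
The circuit-breaker half of that answer can be sketched as a small class. The thresholds, the cooldown and the class itself are hypothetical; the point is that failover is decided by counted failures and a timer, not by waiting on each request to time out.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after repeated failures, then allow a trial call once the cooldown ends."""

    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if the provider may be called right now."""
        if self.opened_at is None:
            return True
        # Half-open: after the cooldown, let a trial call through.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial
```

With one breaker per provider, the router skips any provider whose `allow()` returns `False` and moves straight to the next link in the chain.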

Can we self-host?

Yes, where the workload justifies it. Open-weight models such as Llama 3 and Gemma can be hosted on your own infrastructure for compliance, residency or cost reasons. We will tell you honestly when self-hosting hurts quality more than it helps, and we will not recommend it when the tradeoff is not worth it.

How do you handle prompt-injection?

Defence in depth. Inbound text is scanned for known injection patterns and stripped of role-confusing tokens before reaching the model. System prompts are isolated from user input. Tools have allowlists, not blanket capabilities. Sensitive operations require structured confirmation. And we treat every model as untrusted — its outputs are validated before they trigger downstream actions.

What does observability cost?

Typically 3–8% of total model spend, depending on sampling rate and retention. We sample 100% of failed and low-confidence traces, plus a configurable percentage of successful traffic. Logs are tiered: hot for 30 days, cold for the retention window your compliance regime requires.

How do you do canary releases for AI?

Every change — prompt, model, tool — ships behind a feature flag. We route a small percentage of traffic to the new version, compare evals and production metrics against the control, and only promote when the deltas are within the agreed thresholds. If regressions appear, we roll back at the flag, not at the deploy.
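
The routing and promotion steps can be sketched as two small functions. The hashing scheme, the flag name and the metric dictionaries are assumptions for illustration, and the promotion check assumes every metric is "higher is better".

```python
import hashlib
from typing import Dict


def in_canary(user_id: str, flag: str, percent: float) -> bool:
    """Deterministically place a user in the canary for a given flag and percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def promote(control: Dict[str, float], canary: Dict[str, float],
            max_regression: Dict[str, float]) -> bool:
    """True when no metric has regressed by more than its agreed threshold."""
    return all(control[m] - canary[m] <= max_regression[m] for m in max_regression)
```

Hashing the flag together with the user ID keeps each user on the same side of the split for the whole experiment, and setting `percent` to 0 is the rollback: no deploy is needed.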

What is in the handover artifact?

Architecture diagrams, runbooks for every failure mode we have observed, the prompt registry, the eval harness, dashboards, on-call playbooks, model-selection rationale, source escrow if requested, and a training session with your team. You own the system end-to-end on day one of handover.

Ready to Pressure-Test Your AI Architecture?

45 minutes with a senior architect. We’ll walk your existing or proposed system, identify the failure modes worth fixing first, and show you what an eval harness for your workload looks like.

45 Minutes
No Commitment
Senior Architect