Architecture Deep Dive

Building the 24/7 Execution Layer: Architecture of a Digital Employee System

E-commerce infrastructure assumes human operators. We built a persistent, stateful runtime for autonomous digital employees that run business-critical functions around the clock.

The Problem Space

Modern e-commerce infrastructure was built for human operators. APIs are designed for dashboards. Workflows assume 9-to-5 availability. State management treats every session as ephemeral.

When we set out to build autonomous digital employees, we had to design a new runtime environment for persistent, stateful agents that operate business-critical functions continuously. This post breaks down the technical architecture: how we handle long-term memory, tool use, multi-agent coordination, and the reliability guarantees production commerce systems require.

System Architecture Overview

The system is organized into four distinct layers. The orchestration layer sits at the top, implemented on Temporal or Cadence, workflow engines that handle long-running, durable execution. This ensures agents can sleep for hours or days and resume exactly where they left off, handling retries, timeouts, and distributed state management without data loss.
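Temporal implements this with event-sourced workflow histories: completed steps are persisted, so a workflow that crashes or sleeps replays to its last checkpoint instead of re-executing side effects. The pattern can be sketched in plain Python (a toy simulation for illustration, not Temporal's actual API; `DurableWorkflow` and the dict-backed store are hypothetical):

```python
class DurableWorkflow:
    """Toy sketch of checkpointed execution: each completed step is
    persisted, so a crashed or sleeping workflow resumes where it left
    off without repeating side effects."""

    def __init__(self, workflow_id: str, store: dict):
        self.workflow_id = workflow_id
        self.store = store  # stand-in for durable storage

    def run_step(self, step_name: str, fn):
        done = self.store.setdefault(self.workflow_id, {})
        if step_name in done:   # already executed: replay from checkpoint
            return done[step_name]
        result = fn()           # execute, then checkpoint the result
        done[step_name] = result
        return result

store = {}
wf = DurableWorkflow("order-123", store)
wf.run_step("reserve_inventory", lambda: "reserved")

# Simulate a crash and restart: a fresh runtime sharing the same store
wf2 = DurableWorkflow("order-123", store)
calls = []
wf2.run_step("reserve_inventory", lambda: calls.append("re-run"))
# `calls` stays empty: the checkpointed step is replayed, not re-executed
```

The same mechanism is what lets an agent "sleep for hours or days": the next event simply resumes the workflow against its stored history.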

Below that sits the agent runtime layer. Each runtime is a specialized digital employee with two core components: an LLM core with tool use capabilities that serves as the reasoning engine, and a three-tier memory system for persistence. The specializations map to business functions: sales agents handle product recommendations, objections, and checkout; support agents manage post-purchase issues, refunds, and tracking; operations agents monitor inventory, detect anomalies, and alert on supplier issues.

The tool abstraction layer is where most AI-for-commerce projects fail. Teams build deep Shopify integrations, then hit a wall when an enterprise client runs Salesforce Commerce Cloud or a headless custom stack. We architected for platform heterogeneity from the start.

Each commerce platform gets a typed adapter implementing a common interface. The Salesforce Commerce Cloud adapter is the most complex given enterprise requirements. SFCC presents two APIs with different capabilities: OCAPI has broader coverage for price books and promotions, while SCAPI is faster for inventory and orders. Multi-site SFCC instances require routing logic across site and realm structures. Some data, like complex product relationships, is only available via scheduled jobs rather than real-time APIs. OCAPI is aggressively rate-limited to roughly one hundred requests per minute for some endpoints, requiring heavy caching and careful request batching.
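The common interface can be sketched as an abstract base class with platform adapters behind it. Everything below is illustrative: the method names, the `Product` shape, and the SCAPI/OCAPI client objects are assumptions, not our actual schema:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Product:
    sku: str
    price: float
    inventory: int

class CommerceAdapter(ABC):
    """Common interface every platform adapter implements."""

    @abstractmethod
    def get_product(self, sku: str) -> Product: ...

    @abstractmethod
    def update_inventory(self, sku: str, qty: int) -> None: ...

class SFCCAdapter(CommerceAdapter):
    """Sketch: routes inventory/order reads through SCAPI (faster for
    those) while keeping an OCAPI client for price books and promotions,
    per the API split described above."""

    def __init__(self, scapi_client, ocapi_client):
        self.scapi = scapi_client
        self.ocapi = ocapi_client  # used for price books / promotions

    def get_product(self, sku: str) -> Product:
        raw = self.scapi.fetch(f"/products/{sku}")
        return Product(sku=sku, price=raw["price"], inventory=raw["inventory"])

    def update_inventory(self, sku: str, qty: int) -> None:
        self.scapi.post(f"/inventory/{sku}", {"qty": qty})
```

Because agents only see `CommerceAdapter`, swapping SFCC for WooCommerce or a headless stack is an adapter change, not an agent change.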

WooCommerce presents the opposite problem: maximum flexibility, minimal standardization. With over fifty thousand plugins and no standard API surface, we maintain a compatibility matrix for the top hundred plugins. Hosting variance ranges from ten-dollar shared hosting to managed WooCommerce.com infrastructure, requiring adaptive timeout and retry logic. Sometimes the REST API is insufficient for inventory accuracy, forcing read-only database connections for sync operations. WooCommerce webhooks are notoriously unreliable, so we implement aggressive polling fallbacks.
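The adaptive retry logic that variable hosting demands is standard exponential backoff with jitter. A minimal sketch, where `fetch` stands in for any REST call that can time out:

```python
import random
import time

def fetch_with_backoff(fetch, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Retry transient failures with exponential backoff plus jitter.
    `fetch` is a hypothetical callable wrapping a WooCommerce REST call;
    `sleep` is injectable so tests don't actually wait."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # exhausted: surface the failure to the caller
            # Backoff schedule: 0.5s, 1s, 2s, ... plus random jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The polling fallback for unreliable webhooks is the same loop run on a schedule: if no webhook has arrived within a window, poll the resource directly.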

BigCommerce sits between these extremes: cleaner than WooCommerce but with its own quirks around inventory API availability and search functionality. For custom stacks, headless commerce, microservices, and legacy systems, we support multiple protocols including gRPC, GraphQL, and REST, with schema discovery for dynamic mapping.

The bottom layer is observability: tracing for every LLM call and tool execution, automated evaluation with human review sampling, escalation queues for edge cases, and experiment frameworks for agent variants.

The architectural difference from traditional chatbots is substantial. Where chatbots use stateless request-response patterns, this system provides stateful, interruptible execution. Instead of a single brain for all tasks, we deploy specialized agents with domain-specific memory. Rather than hardcoded integrations, we use a standardized tool layer. Instead of black box responses, we provide full observability and evaluation pipelines. And where simple systems fail on API timeouts, we implement durable execution with retries and circuit breakers.

Agent Runtime: Beyond Stateless Completion

Standard LLM APIs are stateless request-response. Digital employees require persistent, interruptible execution. The core runtime is built around an event-driven loop that loads context from long-term memory and current session state, generates action plans with tool availability, executes with retry logic and circuit breakers, commits updates to both episodic and semantic memory, and determines next state or sleep until the next event.
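The five phases of that loop can be sketched as a single step function. All names here (`memory.load`, `llm.plan`, `tools.execute`) are illustrative stand-ins, not our actual interfaces:

```python
def agent_step(event, memory, tools, llm):
    """One iteration of the hypothetical runtime loop described above."""
    context = memory.load(event["session_id"])            # 1. load context + state
    plan = llm.plan(event, context, tools.schemas())      # 2. plan with tool availability
    results = [tools.execute(a) for a in plan["actions"]] # 3. execute (retries elided)
    memory.commit(event["session_id"], event, results)    # 4. update memory tiers
    return plan.get("next_state", "sleep")                # 5. next state, or sleep
```

In the real system each numbered phase sits behind a durable checkpoint, which is what makes the loop interruptible rather than a single stateless completion.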

Key design decisions shape this runtime. Every await point functions as a checkpoint, allowing agents to pause for extended periods when customers don't respond and resume with full context intact. Tool use is implemented as typed interfaces with schemas and validation rather than naive function calling, with full side-effect tracking for observability. The system handles multi-modal inputs: text, product images, screenshots, and structured data feeds like inventory JSON.

Memory Architecture: The Hard Problem

LLM context windows are growing past two hundred thousand tokens, but context windows are not memory. They are bandwidth. Real memory requires retrieval, consolidation, and intentional forgetting.

We implement a three-tier system:

  • Working memory lives in-context with the LLM, accessible in under one hundred milliseconds, holding the current conversation turn and immediate tool results.
  • Episodic memory uses Redis and vector databases with fifty millisecond latency, storing recent interactions, session history, and roughly the last seven days of activity.
  • Semantic memory uses Postgres and vector databases with one hundred millisecond latency, storing learned facts, customer preferences, and brand knowledge.
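Recall walks the tiers in latency order, returning the first hit. A toy sketch with plain dicts standing in for the context window, Redis, and Postgres:

```python
class TieredMemory:
    """Toy sketch of the three-tier recall path described above:
    working memory first, then episodic, then semantic."""

    def __init__(self):
        self.working = {}    # in-context: current turn, tool results
        self.episodic = {}   # stand-in for Redis + vector DB (~7 days)
        self.semantic = {}   # stand-in for Postgres + vector DB (long-term)

    def recall(self, key):
        for tier in (self.working, self.episodic, self.semantic):
            if key in tier:
                return tier[key]
        return None
```

Writes flow the other way: everything lands in working memory first and is consolidated downward by background jobs.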

Episodic memory retrieval combines recent raw events from Redis lists with relevant past sessions found via embedding similarity, filtered by customer ID. Semantic memory is organized as a knowledge graph with embeddings, critical for brand consistency. We extract and store product knowledge including features, common objections, and positioning; customer facts like preferences, past issues, and lifetime value tiers; and operational patterns covering common edge cases and resolution paths.

Knowledge extraction happens continuously from conversations, with atomic facts stored as subject-relation-object tuples with confidence scores and source references. Deduplication runs against the existing knowledge graph, keeping higher-confidence facts when conflicts arise. Memory consolidation jobs run nightly, compressing episodic traces into semantic summaries similar to human memory consolidation during sleep.
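The conflict rule — keep the higher-confidence fact — reduces to a small merge function over the knowledge graph. A minimal sketch, with a dict keyed by (subject, relation) standing in for the graph store:

```python
def merge_fact(graph: dict, subject, relation, obj, confidence, source):
    """Insert a subject-relation-object fact, keeping the
    higher-confidence version when the same (subject, relation)
    pair already exists."""
    key = (subject, relation)
    existing = graph.get(key)
    if existing is None or confidence > existing["confidence"]:
        graph[key] = {"object": obj, "confidence": confidence, "source": source}
    return graph[key]
```

Source references are kept on every fact so a consolidated claim can always be traced back to the conversation that produced it.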

Tool Use: MCP and Beyond

We implement the Model Context Protocol for tool standardization but extend it significantly for production reliability. Tool definitions include production extensions beyond basic name, description, and parameters: automatic idempotency key generation for retries, side-effect classification as read, write, idempotent write, or destructive, configurable timeouts, circuit breaker configuration, and rate limiting. Complex tools support streaming for long-running operations like inventory syncs, and human confirmation rules for escalation.
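A tool definition with those extensions can be sketched as a dataclass. The field names are illustrative assumptions, not the MCP schema itself:

```python
import uuid
from dataclasses import dataclass
from enum import Enum

class SideEffect(Enum):
    READ = "read"
    WRITE = "write"
    IDEMPOTENT_WRITE = "idempotent_write"
    DESTRUCTIVE = "destructive"

@dataclass
class ToolDef:
    """Sketch of a tool definition carrying the production extensions
    listed above, beyond MCP's name/description/parameters."""
    name: str
    description: str
    parameters: dict
    side_effect: SideEffect = SideEffect.READ
    timeout_s: float = 10.0
    max_retries: int = 3
    requires_human_confirmation: bool = False

    def idempotency_key(self) -> str:
        # Fresh per invocation, so retries can be deduplicated downstream
        return f"{self.name}:{uuid.uuid4()}"
```

The side-effect classification is the load-bearing field: everything downstream, from retry policy to human confirmation, keys off it.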

Tool execution runs through guardrails. Pre-execution validation checks side-effect classification and routes destructive operations through async human-in-the-loop queues. Execution happens within traced spans with full argument logging. Retry logic handles transient failures. Post-execution verification ensures results match expected schemas. Automatic rollback triggers for failed write operations using idempotency keys.
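The pre-execution routing step can be sketched on its own: destructive operations are diverted into an async human-in-the-loop queue rather than executed. Tracing, retries, and schema verification are elided; the dict-shaped `tool` is a hypothetical stand-in:

```python
def execute_with_guardrails(tool, args, run, human_queue):
    """Sketch of the pre-execution check: destructive operations are
    queued for human approval; everything else executes directly.
    `tool` is assumed to carry a `side_effect` string classification."""
    if tool["side_effect"] == "destructive":
        human_queue.append({"tool": tool["name"], "args": args})
        return {"status": "pending_human_approval"}
    return {"status": "ok", "result": run(args)}
```

Approved operations are later drained from the queue and executed with the idempotency key generated at enqueue time, so a double-approval cannot double-execute.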

Multi-Agent Coordination

Individual agents handle specific domains, but complex scenarios require handoffs. The handoff protocol begins with agent self-assessment of confidence for the current context. Below a threshold, the system routes to specialist agents based on intent classification, transferring an episodic summary, active tool calls, exported customer state, and escalation reasoning. The orchestrator spawns the target agent with this context package.
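The protocol reduces to a threshold check plus a context package. A sketch under assumed interfaces (`assess_confidence`, `summarize_session`, `classify_intent` are illustrative method names):

```python
def maybe_handoff(agent, context, registry, threshold=0.6):
    """Sketch of the handoff protocol: below-threshold confidence routes
    to a specialist from `registry` with a transferable context package.
    Returns None when the current agent should keep the conversation."""
    confidence = agent.assess_confidence(context)
    if confidence >= threshold:
        return None
    package = {
        "episodic_summary": agent.summarize_session(context),
        "active_tool_calls": context.get("active_tool_calls", []),
        "customer_state": context.get("customer", {}),
        "escalation_reason": f"confidence {confidence:.2f} below {threshold}",
    }
    target = registry[agent.classify_intent(context)]
    return target, package
```

The orchestrator then spawns the target agent with `package` as its initial context, so the customer never repeats themselves.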

A shared context bus enables real-time coordination. Agents publish relevant events to a Kafka stream, allowing inventory updates to propagate across all sales agents, customer issue detection to trigger support agent intervention, and cross-agent learning from successful resolution patterns.
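The bus semantics are plain topic-based pub/sub. An in-memory stand-in for the Kafka stream, for illustration only (real Kafka adds partitioning, ordering, and consumer-group offsets this toy omits):

```python
from collections import defaultdict

class ContextBus:
    """In-memory sketch of the shared context bus: agents subscribe to
    topics and receive every event published to them."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)
```

An inventory-zero event published once fans out to every subscribed sales agent, which is how a stockout stops being recommended within seconds.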

Reliability and Observability

Production agents require continuous evaluation beyond offline benchmarks. Automatic metrics capture response latency, tool success rates, and token efficiency relative to information density. An LLM-as-judge scores quality dimensions including helpfulness, brand voice adherence, factual accuracy, and conversion optimization. Human review sampling uses stratified selection based on confidence scores to prioritize uncertain cases for manual review.
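Stratified selection is simple to sketch: bucket interactions by confidence and sample each band at a different rate. The bands and rates below are illustrative assumptions, not our production values:

```python
import random

def sample_for_review(interactions, rate_by_band=None, rng=None):
    """Stratified review sampling: low-confidence interactions are
    reviewed at a much higher rate than high-confidence ones.
    Each interaction is assumed to carry a `confidence` score in [0, 1]."""
    rng = rng or random.Random(0)
    rate_by_band = rate_by_band or {"low": 1.0, "mid": 0.2, "high": 0.02}

    def band(confidence):
        return "low" if confidence < 0.5 else "mid" if confidence < 0.8 else "high"

    return [i for i in interactions
            if rng.random() < rate_by_band[band(i["confidence"])]]
```

With a rate of 1.0 on the low band, every uncertain interaction is guaranteed a human look, while high-confidence traffic is only spot-checked.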

A/B testing infrastructure assigns customers to control or treatment agents using consistent hashing, tracking conversion rate, customer satisfaction, handle time, and escalation rate. Circuit breakers and fallback layers handle LLM API degradation with cached responses or human escalation, tool API failures like Shopify rate limits with queuing and exponential backoff, and agent hallucination detection with factual grounding via retrieval-augmented generation on the product catalog.
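Consistent-hash assignment guarantees a customer sees the same variant across every session without storing assignments anywhere. A minimal sketch:

```python
import hashlib

def assign_variant(customer_id: str, experiment: str, treatment_share=0.5):
    """Deterministic A/B assignment: hash (experiment, customer) into a
    bucket in [0, 1) and compare against the treatment share. The same
    inputs always yield the same variant."""
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform in [0, 1)
    return "treatment" if bucket < treatment_share else "control"
```

Including the experiment name in the hash decorrelates assignments across experiments, so the same customers don't always land in treatment together.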

Deployment and Scaling

The infrastructure stack runs Temporal for durable execution handling agent sleep, resume, retries, and sagas. Kubernetes provides compute with GPU nodes for inference and CPU nodes for orchestration. Storage uses Postgres for relational data, Redis for caching, Pinecone for vectors, and S3 for conversation logs. Observability combines Datadog with LangSmith for agent-specific tracing.

Horizontal scaling handles concurrent conversation load with stateless agent runtimes. Vertical scaling deploys larger context models for complex multi-turn negotiations. Semantic caching of common queries reduces LLM calls by approximately forty percent.
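Semantic caching keys on embedding similarity rather than exact text, so paraphrases of a common question reuse one answer. A sketch assuming a hypothetical `embed` function and a linear scan (production would use a vector index):

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

class SemanticCache:
    """Sketch: reuse a cached answer when a new query embeds close
    enough to a previously answered one."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # hypothetical text -> vector function
        self.threshold = threshold
        self.entries = []           # list of (embedding, answer)

    def get(self, query):
        q = self.embed(query)
        for emb, answer in self.entries:
            if cosine(q, emb) >= self.threshold:
                return answer
        return None                 # cache miss: fall through to the LLM

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

The threshold is the safety dial: too low and customers get answers to questions they didn't ask; too high and the cache never hits.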

Current Challenges and Open Problems

Latency versus quality tradeoffs persist. Complex reasoning with chain-of-thought adds two to three seconds of latency. For real-time chat, we use speculative execution: a fast path for common intents and a slow path for edge cases.
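The fast/slow split can be sketched as simple routing: common intents hit cheap handlers, everything else falls through to full reasoning. Keyword triggers here are a deliberately crude stand-in for the real intent classifier:

```python
def route_intent(message, fast_handlers, slow_path):
    """Sketch of the fast/slow split: `fast_handlers` maps trigger
    substrings to cheap handlers; unmatched messages take the slow
    chain-of-thought path."""
    lowered = message.lower()
    for trigger, handler in fast_handlers.items():
        if trigger in lowered:
            return handler(message)   # fast path: sub-second
    return slow_path(message)         # slow path: full reasoning
```

Speculatively starting the slow path while the fast path runs lets the system upgrade an answer mid-stream when the cheap handler turns out to be insufficient.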

Long-term memory compression remains unsolved. As customer relationships span months, we are testing hierarchical summarization with importance weighting to compress without losing critical details.

Multi-modal tool use for analyzing product images or parsing PDF invoices requires vision models. Integration is straightforward; latency and cost are the constraints.

Regulatory compliance for GDPR right-to-be-forgotten requires memory deletion across all tiers, implemented as cascading deletes with verification audits.
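The cascade-then-audit shape is straightforward to sketch. The `delete`/`lookup` store interface is a hypothetical simplification of the per-tier deletion paths:

```python
def forget_customer(customer_id, tiers):
    """Sketch of right-to-be-forgotten: delete the customer from every
    memory tier, then verify no tier still returns data. `tiers` maps
    tier names to stores exposing delete/lookup."""
    for store in tiers.values():
        store.delete(customer_id)
    leftovers = [name for name, store in tiers.items()
                 if store.lookup(customer_id) is not None]
    if leftovers:
        # Audit failure: deletion must be retried, never silently skipped
        raise RuntimeError(f"deletion audit failed in: {leftovers}")
    return True
```

The verification pass is the important half: a delete that silently fails in one tier is a compliance breach, not a retryable glitch to ignore.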

Conclusion

Building digital employees is not about wrapping chat interfaces around language models. It is a systems engineering problem spanning stateful distributed systems, reliable tool use, memory architecture, and continuous evaluation.

The brands that solve this will operate with structural advantages: twenty-four-hour execution coverage, consistent quality, and economics that improve with scale rather than degrade.

We are hiring across infrastructure, ML systems, and applied AI. If you are interested in the hard problems of agentic systems at production scale, reach out.