Building Real-Time AI Chatbots - Hard Lessons Learned

Avi Santoso

I've been deep in the trenches trying to build a real-time AI chatbot system, and let me tell you - it's nowhere near as straightforward as it initially seemed. What looked like a simple problem turned out to be anything but. I've hit enough walls that I'm actually shuttering my current project and pivoting to a component-based approach where I can specialize each part.

Here are the key lessons I've learned along the way:

Lesson 1: Response Time Is Everything

The main difference between a real-time application and a batch application is the expected response time. What's acceptable in an email or messaging system completely falls apart in a live conversation.

This is likely due to how our brains work. Think about how we can spot AI-generated images - we focus on hands, eyes, and other features that humans pay a lot of attention to. In the same way, in real-time conversation, things like intonation, pauses, and response speed are crucial.

The problem? We're forced to compress and optimize entire chatbot systems into such a small time window that getting quality output becomes incredibly difficult.

Lesson #1: The best voice-based conversation systems will prioritize response time once they've reached a baseline quality level where the average person's intuitive perception (System 1) can't easily distinguish further improvements.

Lesson 2: Deterministic State vs. Streaming Reality

Storing and evaluating deterministic state changes fundamentally conflicts with a streaming, real-time environment. Real conversations change quickly - someone can be mid-sentence when the topic shifts, you might cut them off, or the tone might change completely.

The challenge grows when you want a structured conversation that can still adapt naturally. In real life, it takes decades of daily practice to train someone to have conversations that check all the boxes while remaining flexible.

Our core constraint is that LLMs are essentially stateless services. Some now offer token caches, but using high token counts increases response time and reduces accuracy. Some try to solve this with an agentic framework, but the overhead is too high for real-time environments.

What we need is a SINGLE agent that can act as MANY agents or entities - a problem I believe graph-based systems can solve. But storing graph state, evaluating current state, and generating the right prompt input works better in batch environments. To support a "Single Agent, Multiple Personality" (SAMP) system, we need to modulate real-time audio into batch messages that work better with graph state and evaluations.
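To make that modulation step concrete, here's a minimal sketch. It assumes a transcript stream yielding fragments with a `text` field and an `is_endpoint` flag, plus an `llm` client with an awaitable `complete()` method - all illustrative names, not a real library:

```python
from dataclasses import dataclass, field

@dataclass
class GraphState:
    """Toy conversation graph: tracks history and which persona is active."""
    active_persona: str = "default"
    history: list[str] = field(default_factory=list)

    def evaluate(self, utterance: str) -> str:
        # Batch-side work: update graph state and choose which
        # "personality" the single agent should speak as next.
        self.history.append(utterance)
        if "refund" in utterance.lower():
            self.active_persona = "support"
        return self.active_persona

async def modulate(transcript_stream, state: GraphState, llm):
    """Fold the streaming transcript into batch messages at utterance boundaries."""
    buffer: list[str] = []
    async for fragment in transcript_stream:     # streaming side: stay responsive
        buffer.append(fragment.text)
        if fragment.is_endpoint:                 # endpoint detection marks a full utterance
            utterance = " ".join(buffer)
            buffer.clear()
            persona = state.evaluate(utterance)  # batch side: graph state + evaluation
            prompt = f"[persona: {persona}]\n" + "\n".join(state.history[-6:])
            yield await llm.complete(prompt)     # one agent speaking as many
```

The streaming half never blocks on the graph; the batch half only runs at utterance boundaries, which is where the latency cost is tolerable.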

Lesson #2: Real-time conversation systems require a hybrid approach that can modulate between streaming for responsiveness and batch processing for maintaining complex state - essentially a "Single Agent, Multiple Personality" architecture that balances speed with coherence.

Lesson 3: The Real-Time Response Equation

When building your own real-time streaming architecture, response time follows a clear mathematical function:

Response Time = Audio Transcription Time + Time To First Token + Generation Time (Output Tokens ÷ Tokens Per Second)

This is just the simplified version. In a real system with RAG, Graph RAG, multiple personalities, state evaluations, and prompt generation, it gets much more complex. Then add the time for generating streaming speech responses.

And this doesn't even account for audio transfer time between frontend and backend, or transport time between the LLM API and your services if they're on separate machines.
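To see how quickly the budget evaporates, here's the simplified equation as code, with illustrative numbers plugged in (placeholders, not benchmarks):

```python
def response_time(
    transcription_s: float,      # speech-to-text latency
    ttft_s: float,               # time to first token from the LLM
    output_tokens: int,
    tokens_per_second: float,
    transport_s: float = 0.0,    # audio upload + API round trips, often forgotten
) -> float:
    """The simplified real-time response budget from the equation above."""
    generation_s = output_tokens / tokens_per_second
    return transport_s + transcription_s + ttft_s + generation_s

# Illustrative numbers: 300 ms transcription, 400 ms TTFT,
# a 60-token reply at 80 tok/s, and 150 ms of network transport.
print(response_time(0.3, 0.4, 60, 80, 0.15))  # -> about 1.6 seconds, before TTS even starts
```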

Lesson #3: Understanding and optimizing the full response time equation is crucial for real-time AI conversation systems - every millisecond in transcription, inference, and generation compounds to create the perceived responsiveness of your system.

Lesson 4: The Context Window Crunch

Token context limits become a very real problem affecting accuracy, speed, and API costs. Graph-based RAG can mitigate this by including just the right amount of context within the token window, producing better output compared to just sending the last n messages.

GraphRAG systems are still in their infancy, but I'm convinced they'll be crucial to AI's future. This context problem matters enormously for real-time chatbots because you need accuracy high enough for users to suspend disbelief, while keeping responses fast enough that users don't lose their train of thought. Trust comes from consistent, high-quality, fast responses.
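As a sketch of the idea - assuming relevance scores have already been produced by a graph traversal - context selection under a token budget might look like this (a toy illustration, not a GraphRAG library):

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    relevance: float  # assumed: graph-derived similarity to the current query

def build_context(passages: list[Passage], token_budget: int) -> list[str]:
    """Pick the most relevant passages that fit the budget, instead of
    naively sending the last n messages."""
    selected: list[str] = []
    used = 0
    for passage in sorted(passages, key=lambda p: p.relevance, reverse=True):
        cost = len(passage.text.split())  # crude token estimate; use a real tokenizer
        if used + cost > token_budget:
            continue  # skip what doesn't fit, keep trying smaller passages
        selected.append(passage.text)
        used += cost
    return selected
```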

Lesson #4: Graph-based RAG systems that intelligently manage context are essential for balancing the token window limitations with response quality and speed - the foundation for building user trust in real-time AI conversation.

Lesson 5: The Ecosystem Isn't Ready

The ecosystem simply isn't mature enough for easy orchestration of these systems. Many components are missing. Sure, there are companies offering solutions, but you're typically locked into their ecosystem with significant limitations.

Orchestration requires many smaller, high-quality parts with clear contracts for consumption and provision. When I tried building my own system, I was looking at 5+ different individual systems that all needed to work together. This quickly became overwhelmingly complex.
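By "clear contracts" I mean interfaces like these, so each of those 5+ systems can be swapped without touching the others. This is a sketch using Python protocols - the method names are my own, not any standard:

```python
from typing import AsyncIterator, Protocol

class Transcriber(Protocol):
    """Contract for speech-to-text: audio chunks in, text fragments out."""
    def stream(self, audio: AsyncIterator[bytes]) -> AsyncIterator[str]: ...

class ChatModel(Protocol):
    """Contract for the LLM: a prompt in, a token stream out."""
    def complete(self, prompt: str) -> AsyncIterator[str]: ...

class Synthesizer(Protocol):
    """Contract for text-to-speech: a token stream in, audio chunks out."""
    def speak(self, tokens: AsyncIterator[str]) -> AsyncIterator[bytes]: ...
```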

Lesson #5: The current AI ecosystem lacks the mature, interoperable components needed for seamless real-time conversation systems - successful implementations will require either substantial in-house development or accepting the limitations of existing walled gardens.

Lesson 6: Speed vs. Manageability

The key to speed in real-time systems is colocation and monolithic architecture. But modular services are easier to manage. The middle ground? A modular monolith - keeping a single mono-repo with a few packages, potentially different runtimes, but on the same machine.

This approach helps maintain the low latency needed for user engagement. The problem is that typical computers aren't powerful enough to run all the required services simultaneously - backend, frontend, transcription, voice model, and LLM. Even with quantization, you're looking at $10k+ machines just to scratch the surface, and you'll still need lower quantizations for high-parameter models.
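Concretely, the layout I have in mind looks something like this (illustrative, not an actual repo):

```
realtime-chat/            # one monorepo, deployed to one machine
├── packages/
│   ├── transcription/    # speech-to-text wrapper: own tests, own deps
│   ├── llm/              # local model runner
│   ├── tts/              # voice synthesis
│   ├── pipeline/         # orchestration: wires the above in-process
│   └── web/              # frontend, served from the same box
└── deploy/               # a single target, so hops between parts are in-memory calls
```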

Lesson #6: The optimal architecture for real-time AI conversation is a modular monolith where components are logically separated but physically colocated - balancing development flexibility with the performance demands of conversation.

Lesson 7: Non-Deterministic Evaluation

Build a good evaluation system that's non-deterministic most or all of the way down. Typically, we separate deterministic and non-deterministic systems, wrapping the latter in adapter services to make them appear deterministic.

For benchmarking, think "turtles all the way down" - evaluators should themselves be evaluated by other non-deterministic evaluators, which in turn evaluate your core model's character. This differs wildly from traditional approaches like TDD or unit testing.
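As a minimal sketch, assuming an `llm` client with a blocking `complete()` method (both functions and their prompts are illustrative):

```python
import statistics

def judge(transcript: str, rubric: str, llm) -> float:
    """Non-deterministic evaluator: an LLM scores a transcript against a rubric."""
    reply = llm.complete(
        f"Score this conversation 0-10 against the rubric. Reply with a number only.\n"
        f"Rubric:\n{rubric}\n---\n{transcript}"
    )
    return float(reply.strip())  # assumes the model complies and returns a number

def audit_judge(transcript: str, rubric: str, llm, runs: int = 5) -> float:
    """Evaluate the evaluator: re-run the non-deterministic judge and measure
    its spread. A judge that can't agree with itself can't be trusted to
    evaluate the core model's character."""
    scores = [judge(transcript, rubric, llm) for _ in range(runs)]
    return statistics.stdev(scores)  # high spread = unreliable evaluator
```

It feels circular, but that's the point: there is no deterministic oracle to fall back on, only cross-checks between non-deterministic judges.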

Lesson #7: Effective evaluation of real-time AI systems requires embracing non-determinism at every level - moving beyond traditional software testing paradigms to create evaluation frameworks that mirror the complexity of the systems they assess.

Moving Forward

I'm now focusing on building a system from composable parts, where each component can be specialized and optimized. This will allow for functionality that maintains both performance and quality, without getting trapped in the real-time vs. state management paradox.

The real challenge is balancing speed requirements with the complexity of managing state, personality, and context in a fluid conversation. By breaking the problem down into specialized components with clear interfaces, I think there's a path forward that doesn't sacrifice user experience.

For those attempting similar projects, my advice is to separate your concerns carefully, focus on the components that directly impact user experience, and don't underestimate the computational and architectural challenges of real-time AI conversation.
