AI Voice Platform
DINA
Real-time AI voice agents for customer onboarding and operations.
Problem
Voice is where real operations happen: onboarding calls, operational check-ins, coordination with people who are on the move. All of it was human-paced, which meant scripted calls, limited hours, inconsistent quality, and drop-off whenever someone had to wait for a callback.
Text chatbots weren't the answer. The conversations that matter here happen by phone, and they need to do real work: collect structured information reliably, trigger workflows, and integrate with backend business systems — not just chat.
Solution
DINA is an AI voice platform supporting real-time inbound and outbound conversations. Audio streams into speech recognition, partial transcripts feed an LLM-driven dialogue orchestrator, and responses stream back through text-to-speech — fast enough to feel like conversation, not IVR.
The orchestrator is the interesting part. It owns the conversation state machine: which fields are filled, what's still missing, when to confirm, when to re-ask, and when to hand off to a human. The LLM proposes; the state machine disposes. The orchestrator can also invoke remote agents and workflows mid-conversation, so a call doesn't just gather information; it acts on it.
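To make "the LLM proposes; the state machine disposes" concrete, here is a minimal sketch of a slot-filling state machine. It is not DINA's actual implementation: the names (Slot, Dialogue, Action) and the re-ask limit of two are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Action(Enum):
    ASK = "ask"
    CONFIRM = "confirm"
    HANDOFF = "handoff"
    DONE = "done"


@dataclass
class Slot:
    name: str
    value: Optional[str] = None
    confirmed: bool = False
    attempts: int = 0  # failed or rejected attempts so far


@dataclass
class Dialogue:
    slots: List[Slot]
    max_attempts: int = 2  # hypothetical re-ask limit before handing off

    def _get(self, name: str) -> Slot:
        return next(s for s in self.slots if s.name == name)

    def next_action(self) -> Tuple[Action, Optional[str]]:
        """Decide the next step from state alone; the LLM has no vote here."""
        for slot in self.slots:
            if slot.value is None:
                if slot.attempts >= self.max_attempts:
                    return Action.HANDOFF, slot.name
                return Action.ASK, slot.name
            if not slot.confirmed:
                return Action.CONFIRM, slot.name
        return Action.DONE, None

    def propose(self, name: str, value: str) -> None:
        """Record a value the LLM extracted; it stays unconfirmed."""
        slot = self._get(name)
        slot.value, slot.confirmed = value, False

    def record_miss(self, name: str) -> None:
        """The caller's answer yielded no usable value for this slot."""
        self._get(name).attempts += 1

    def confirm(self, name: str, accepted: bool) -> None:
        """Apply the caller's yes/no to a confirmation prompt."""
        slot = self._get(name)
        if accepted:
            slot.confirmed = True
        else:
            slot.value = None
            slot.attempts += 1
```

In this sketch the LLM only ever calls propose; whether the call asks, confirms, hands off, or finishes is decided by next_action from the recorded state.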
Conversation memory persists across the session, so an interrupted call resumes where it left off. What started as a POC became a flagship AI capability used in customer demos and onboarding — and a foundation for voice-driven autonomous operations.
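Resuming an interrupted call can be sketched as a snapshot and reload of dialogue state keyed by a session ID. The store below is an in-memory stand-in, and the state shape ("filled" and "pending") is an assumption for illustration, not DINA's actual schema.

```python
import json
from typing import Dict, Optional


class SessionStore:
    """In-memory stand-in for whatever persistent store backs a session."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def save(self, session_id: str, state: dict) -> None:
        # Serialize so the snapshot is independent of the live object.
        self._rows[session_id] = json.dumps(state)

    def load(self, session_id: str) -> Optional[dict]:
        raw = self._rows.get(session_id)
        return json.loads(raw) if raw is not None else None


def resume_or_start(store: SessionStore, session_id: str) -> dict:
    """Return the saved state for this session, or a fresh one."""
    state = store.load(session_id)
    if state is None:
        # Hypothetical onboarding fields, in the order they are asked.
        state = {"filled": {}, "pending": ["name", "phone", "email"]}
    return state
```

Saving after every completed turn, rather than at call end, is what would let a dropped call pick up at the first field still pending.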
Architecture
Everything is a stream. The moment any stage waits for a complete input from the previous one, the latency budget is gone.
The LLM never talks to the caller directly — it talks to an orchestrator that enforces the conversation contract and owns all side effects.
Challenges
The latency budget
A conversation dies once the silence stretches past about a second. So every stage streams: ASR emits partials, the LLM streams tokens, and TTS starts speaking before the full response exists. The pipeline is engineered around time-to-first-audio, not total processing time.
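One way to let TTS start before the full response exists is to flush the LLM's token stream at the first sentence boundary. The sketch below assumes tokens arrive as plain strings and treats ".", "?" and "!" as boundaries; chunk_for_tts is a hypothetical name, and a real pipeline would likely need smarter boundary detection (abbreviations, numbers).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "?", "!")


def chunk_for_tts(tokens: Iterable[str]) -> Iterator[str]:
    """Yield a speakable chunk as soon as a sentence boundary arrives."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a boundary.
    if buffer.strip():
        yield buffer.strip()
```

Because this is a generator, the first sentence is handed to synthesis while the LLM is still producing the second; time-to-first-audio then depends on the length of the first sentence, not the whole reply.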
Barge-in
Humans interrupt. The pipeline detects speech during playback, cancels synthesis mid-utterance, and re-enters listening — without losing the dialogue state that was being spoken.
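Cancelling synthesis mid-utterance maps naturally onto task cancellation. The following is a sketch using Python's asyncio, with asyncio.sleep standing in for synthesizing and playing one audio chunk; the Playback class and its methods are hypothetical, not DINA's API.

```python
import asyncio
from typing import List, Optional


class Playback:
    def __init__(self) -> None:
        self.spoken: List[str] = []  # chunks that finished playing
        self._task: Optional[asyncio.Task] = None

    async def _speak(self, chunks: List[str], delay: float) -> None:
        for chunk in chunks:
            await asyncio.sleep(delay)  # stand-in for synthesis + playback
            self.spoken.append(chunk)

    def start(self, chunks: List[str], delay: float = 0.2) -> None:
        """Begin speaking in the background; must run inside an event loop."""
        self._task = asyncio.create_task(self._speak(chunks, delay))

    async def barge_in(self) -> bool:
        """Cancel in-flight playback; return True if speech was cut off."""
        task = self._task
        if task is None or task.done():
            return False
        task.cancel()
        try:
            await task
        except asyncio.CancelledError:
            pass  # expected: the playback task was cancelled on purpose
        return True


async def demo():
    playback = Playback()
    playback.start(["What is", "your", "phone number?"], delay=0.2)
    await asyncio.sleep(0.3)  # caller starts talking partway through
    interrupted = await playback.barge_in()
    return interrupted, playback.spoken
```

Only the playback task is cancelled. The record of which prompt was being spoken, and which field it was asking for, lives outside that task, so the orchestrator can re-ask or move on after the caller finishes.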
Keeping the LLM honest
An onboarding agent that invents policy answers is worse than none. The orchestrator constrains the LLM to the current dialogue step, validates extracted fields against schemas, and routes anything off-script to a human handoff path.
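As an illustration of that gatekeeping, here is a sketch in which the LLM's output is treated as a proposal and checked against the current step before anything is accepted. The step names, field patterns, and the "provide_field" intent are all assumptions made for the example.

```python
import re

# Hypothetical per-step schemas: each dialogue step lists the only fields
# the LLM may fill, with a pattern each value must fully match.
STEP_SCHEMAS = {
    "collect_phone": {"phone": re.compile(r"\+?\d{7,15}")},
    "collect_email": {"email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")},
}


def review_proposal(step, proposal):
    """Return ("accept", fields), ("reask", bad_fields) or ("handoff", reason)."""
    schema = STEP_SCHEMAS.get(step)
    if schema is None:
        return ("handoff", f"unknown step: {step}")

    # Anything other than filling a field is off-script for this agent.
    intent = proposal.get("intent")
    if intent != "provide_field":
        return ("handoff", f"off-script intent: {intent}")

    fields = proposal.get("fields", {})
    off_step = sorted(set(fields) - set(schema))
    if off_step:
        return ("handoff", f"fields outside step: {off_step}")

    invalid = sorted(
        name for name, value in fields.items()
        if not (isinstance(value, str) and schema[name].fullmatch(value))
    )
    if invalid or not fields:
        return ("reask", invalid)
    return ("accept", dict(fields))
```

A proposal such as {"intent": "answer_policy_question"} never reaches the caller in this sketch; it falls through to the handoff branch.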
Telephony reality
Packet loss, noisy lines, accents, hold music. Confidence thresholds on transcription decide between proceeding, confirming ("just to check, was that...?"), and escalating.
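The three-way decision can be sketched as two thresholds on the transcription confidence score. The cutoffs below (0.85 and 0.5) are placeholders, not DINA's tuned values, and the function names are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    CONFIRM = "confirm"
    ESCALATE = "escalate"


def route_transcript(confidence: float,
                     proceed_at: float = 0.85,
                     confirm_at: float = 0.5) -> Decision:
    """Map an ASR confidence score in [0, 1] to the next move."""
    if confidence >= proceed_at:
        return Decision.PROCEED
    if confidence >= confirm_at:
        return Decision.CONFIRM
    return Decision.ESCALATE


def confirmation_prompt(heard: str) -> str:
    """Phrase a read-back for the middle band."""
    return f"Just to check, was that {heard}?"
```

In practice the thresholds would probably differ per field: a misheard phone digit costs more than a misheard greeting, so critical fields might confirm at much higher confidence.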
Lessons
Latency is a product feature, not an infrastructure detail. Users forgive a wrong answer faster than a slow one.
Structured extraction with validation beats free conversation. The state machine around the LLM is what makes the system dependable.
Build the human handoff path first, not last — it's what makes shipping an imperfect agent safe.