Building a Unified Communication Platform

Platform Engineering
Distributed Systems
Architecture

CommAgent started as a side observation. I was working on a voice agent and needed it to send a follow-up email after a call. Simple task. Except there was no obvious way to do it: one service had its own email integration, another had a different one, SMS lived somewhere else entirely, and none of them agreed on retries, templates, or what "delivered" meant.

This is a normal way for systems to grow, and I don't think anyone made a bad decision along the way. Each team needed to send something, so each team integrated a provider. But by the time I was looking at it, the cost had compounded: when a provider had a bad day, different services failed in different ways, and answering "did the customer actually get this message?" meant digging through logs in three places.

So I pitched a POC: one communication platform, everything else goes through it. That POC eventually became CommAgent, which now handles 150k+ conversations a month across 100+ customer environments. Most of what follows is stuff I got wrong the first time.

Design the adapter contract around your worst provider

The architecture itself is not novel. One API in front, an adapter per channel (Email, SMS, WhatsApp, Voice), providers swappable behind each adapter. Anyone who has read about the adapter pattern can draw this diagram.

What the diagram doesn't tell you is which provider to design the contract around. I started with the cleanest API we used, sketched a neat interface against it, and felt pretty good, until I tried to fit the second provider in. Callback-only status updates. Undocumented rate limits. Webhooks that arrive out of order, or twice. The neat interface didn't survive contact.

I ended up redoing the contract with the messiest provider open in one tab the whole time. That version held. The clean APIs fit inside any abstraction; the ugly one defines what the abstraction actually has to be.
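
To make that concrete, here is a minimal sketch of what a contract shaped by the messy provider might look like. Every name in it (ChannelAdapter, SendOutcome, ToySmsAdapter and so on) is hypothetical and chosen for illustration; this is not CommAgent's actual interface. The two things it tries to show are that "unknown" is a first-class send outcome, and that status only ever arrives through callbacks that may be batched, reordered, or repeated.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Protocol


class SendOutcome(Enum):
    ACCEPTED = "accepted"  # provider acknowledged the request
    REJECTED = "rejected"  # provider refused it; retrying will not help
    UNKNOWN = "unknown"    # timeout or dropped connection: it may have gone out


@dataclass(frozen=True)
class OutboundMessage:
    message_id: str
    idempotency_key: str
    recipient: str
    body: str


@dataclass(frozen=True)
class SendResult:
    outcome: SendOutcome
    provider_ref: Optional[str] = None  # absent when the outcome is UNKNOWN


@dataclass(frozen=True)
class StatusEvent:
    provider_ref: str
    event_id: str  # lets the caller drop replayed events
    status: str


class ChannelAdapter(Protocol):
    def send(self, message: OutboundMessage) -> SendResult: ...

    # One callback may carry several events, in any order, possibly repeated.
    def parse_callback(self, payload: dict) -> List[StatusEvent]: ...


class ToySmsAdapter:
    """Hypothetical provider used only to exercise the contract."""

    def __init__(self, simulate_timeout: bool = False) -> None:
        self.simulate_timeout = simulate_timeout

    def send(self, message: OutboundMessage) -> SendResult:
        if self.simulate_timeout:
            return SendResult(SendOutcome.UNKNOWN)
        return SendResult(SendOutcome.ACCEPTED, provider_ref=f"sms-{message.message_id}")

    def parse_callback(self, payload: dict) -> List[StatusEvent]:
        return [
            StatusEvent(provider_ref=e["ref"], event_id=e["id"], status=e["status"])
            for e in payload.get("events", [])
        ]
```

A clean provider implements this trivially and never returns UNKNOWN. The sketch leaves rate limiting out entirely; with undocumented limits, that logic would probably live inside each adapter rather than in the shared contract.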

"Did it send?" is often unanswerable

A situation that comes up way more than I expected: you call a provider, the request times out, and you cannot know whether the message went out. If you retry, you might double-send someone's invoice. If you don't, you might send nothing.

The only way out I know of is to make retries safe by construction. Every send in CommAgent carries an idempotency key from the client, so the same logical message can be attempted five times and go out once. Whatever exhausts its retry budget lands in a dead-letter queue where a human can look at it, instead of silently disappearing.

We accepted at-least-once delivery and stopped trying to be clever about it. Duplicates are a fact of distributed systems; making them harmless was much less work than pretending we could prevent them.
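
A rough sketch of that retry path, under some loud assumptions: the names (Sender, FakeProvider, ProviderTimeout) are made up for this example, the provider (or a dedupe layer in front of it) is assumed to honor the idempotency key, and backoff between attempts is left out to keep it short. The fake provider deliberately times out after it has already accepted the message, which is exactly the case where a naive retry would double-send.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


class ProviderTimeout(Exception):
    """The request timed out; the message may or may not have gone out."""


class FakeProvider:
    """Stand-in provider that dedupes on the idempotency key (an assumption)."""

    def __init__(self, timeouts: int) -> None:
        self.timeouts = timeouts             # how many calls time out before one answers
        self.calls = 0
        self.delivered: Dict[str, str] = {}  # key -> body that actually went out

    def send(self, key: str, body: str) -> str:
        self.calls += 1
        self.delivered.setdefault(key, body)  # goes out at most once per key
        if self.calls <= self.timeouts:
            raise ProviderTimeout()           # accepted, but the caller never hears back
        return f"ref-{key}"


@dataclass
class Sender:
    provider_send: Callable[[str, str], str]
    max_attempts: int = 5
    completed: Dict[str, str] = field(default_factory=dict)
    dead_letters: List[str] = field(default_factory=list)

    def send(self, idempotency_key: str, body: str) -> Optional[str]:
        # A repeated client call with the same key is answered from memory.
        if idempotency_key in self.completed:
            return self.completed[idempotency_key]
        for _ in range(self.max_attempts):
            try:
                ref = self.provider_send(idempotency_key, body)
            except ProviderTimeout:
                continue  # outcome unknown; retrying is safe because of the key
            self.completed[idempotency_key] = ref
            return ref
        # Retry budget exhausted: park it for a human instead of dropping it.
        self.dead_letters.append(idempotency_key)
        return None
```

With `FakeProvider(timeouts=2)`, a single `Sender.send("invoice-42", ...)` call makes three attempts and the provider records one delivery. With more timeouts than attempts, the key ends up in `dead_letters`.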

Webhooks lie about time

Delivery receipts do not arrive in order. A "delivered" event can show up before the "sent" event that logically precedes it. Some providers replay events. Early on we applied webhooks as they came, and message statuses would occasionally flicker backwards: delivered, then sent, then delivered again. Customers noticed before we did, which is not a great feeling.

The fix was to treat status as a state machine that only moves forward. A late "sent" arriving after "delivered" is a no-op. It was a small change, and we haven't seen a status flicker since.
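
The forward-only rule can be sketched in a few lines. The status names and their ordering here are assumptions for illustration (only "sent" and "delivered" come from the story above), and a real version would need separate handling for failure states, which do not fit on a single line of progress.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered by progress; a higher value is further along.
    QUEUED = 0
    SENT = 1
    DELIVERED = 2


def apply_event(current: Status, incoming: Status) -> Status:
    """Advance only if the incoming status is ahead of the current one."""
    return incoming if incoming > current else current


# The flicker sequence from above: delivered, then a late sent, then a replay.
status = Status.QUEUED
for event in (Status.DELIVERED, Status.SENT, Status.DELIVERED):
    status = apply_event(status, event)
```

After the loop, `status` is `Status.DELIVERED`, and it never passed back through `SENT` on the way. Replayed events fall out for free, since an event equal to the current status is also a no-op.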

Build the trace before the features

If I could only keep one decision from this whole project, it would be this: every message gets an ID that survives from the initial API call to the final provider callback, and one query answers "where is this message and what happened to it?"

I did not appreciate how much this would matter. It started as a debugging aid for myself. Then support started using it daily. Then it turned out to be the reason other teams trusted the platform enough to migrate onto it, and later, when the workflow engine started sending messages through us, it became the base for workflow-level observability too.
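
As an illustration of the idea (not the real implementation), here is an in-memory sketch of such a trace. TraceLog and its method names are hypothetical, and a production version would sit on a database rather than a dict. The part that matters is that the provider's own reference gets linked back to the platform's message ID at send time, so a callback arriving later lands in the same history.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TraceEvent:
    message_id: str
    stage: str   # e.g. "api", "adapter", "provider_callback"
    detail: str


class TraceLog:
    def __init__(self) -> None:
        self._events: Dict[str, List[TraceEvent]] = {}
        self._by_provider_ref: Dict[str, str] = {}

    def new_message(self) -> str:
        """Mint the ID at the initial API call; everything later reuses it."""
        message_id = uuid.uuid4().hex
        self.record(message_id, "api", "accepted")
        return message_id

    def record(self, message_id: str, stage: str, detail: str) -> None:
        event = TraceEvent(message_id, stage, detail)
        self._events.setdefault(message_id, []).append(event)

    def link_provider_ref(self, message_id: str, provider_ref: str) -> None:
        """Remember which message a provider's reference belongs to."""
        self._by_provider_ref[provider_ref] = message_id

    def record_callback(self, provider_ref: str, detail: str) -> None:
        # Raises KeyError for an unknown reference; a real system would
        # more likely park the callback than crash.
        message_id = self._by_provider_ref[provider_ref]
        self.record(message_id, "provider_callback", detail)

    def history(self, message_id: str) -> List[TraceEvent]:
        """The one query: everything that happened to this message, in order."""
        return list(self._events.get(message_id, []))
```

Usage is: call `new_message()` when the request comes in, `link_provider_ref()` once the provider hands back its reference, `record_callback()` from the webhook handler, and `history(message_id)` when someone asks what happened.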

A communication platform succeeds when other engineers stop thinking about communication. Most days now, nobody thinks about CommAgent at all. That took a while to feel good about, but it's the whole point.