From Minutes to Seconds: Rebuilding Email Ingress Around Real-Time

Built by Dinesh Chhantyal.

When alfred_ texts you that an important email just landed, one number decides everything: how long after the email arrives does your phone buzz. A few seconds, and alfred_ feels like it is reading your inbox over your shoulder. Ten minutes, and it feels like a digest you already scrolled past.

For most of our life that number was minutes. This is how we rebuilt the entire email ingestion layer around real-time push to fix it, and how we talked ourselves out of the fix that looks obvious and is wrong.

Why email is hard to “just watch”

If you have only built on your own database, “tell me when something changes” sounds trivial. Add a trigger, add a subscription, done.

Email is three different worlds wearing the same envelope. Gmail speaks in a history log: you do not get the message, you get a number that says “something happened, go look.” Outlook, through Microsoft Graph, calls a webhook, but on a strict clock and with its own rules about how subscriptions are born and die. IMAP, the long tail of every other provider, has no modern push at all; we front it with a service called EmailEngine that gives us webhooks on its own terms.

So “watch the inbox” is really three integrations, three auth models, and three failure modes hiding behind one word. For a long time we dodged all of it with the bluntest tool available.

The old world: one poller, every account, every couple of minutes

The original ingestion layer was a single scheduled function. On a timer, it walked every connected account, refreshed the token, asked the provider “what changed since last time,” staged anything new, and kicked off classification.

It was not careless. The interval was adaptive and timezone-aware: roughly every 2 minutes during your waking hours, stretching to 27 in the dead of your local night. Nobody got a text at 4am, and we did not burn quota scanning a mailbox no one was reading.
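A rule like that can be sketched in a few lines. The 2 and 27 minute figures come from the description above; the waking window (07:00 to 23:00 local) and the two-step shape are assumptions made for illustration, and the real schedule may have been smoother.

```python
from datetime import datetime, timedelta, timezone

DAY_INTERVAL_MIN = 2      # from the text: roughly every 2 minutes while awake
NIGHT_INTERVAL_MIN = 27   # from the text: stretching to 27 at night
WAKING_START_HOUR = 7     # assumption, not from the original system
WAKING_END_HOUR = 23      # assumption, not from the original system


def poll_interval_minutes(now_utc: datetime, utc_offset_hours: float) -> int:
    """Pick the polling interval from the user's local hour."""
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    if WAKING_START_HOUR <= local.hour < WAKING_END_HOUR:
        return DAY_INTERVAL_MIN
    return NIGHT_INTERVAL_MIN
```

A production version would store an IANA timezone name per user rather than a fixed offset, so that daylight saving changes are handled.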

But the shape was fixed. The poller scanned every eligible account on every tick, change or no change, which came out to two or three thousand full scans a day, almost all of them finding nothing. And the latency lived entirely in one place: the wait for the next tick. Every step after that was fast. The wait was the whole game, a couple of minutes on a good day and most of half an hour at night. We had even bolted on extra cron jobs for calendar changes just to paper over the delay. We were adding crons to fight the cost of crons.

That is not a bug you can fix in the code. You cannot poll your way to instant. So we stopped trying.

The plan we threw out

The obvious move: when a provider says something changed, write straight into the staging table, exactly where the poller would have. Wire the new push receivers into the existing staging logic and call it done.

We built toward that and stopped, because we could see the next two things coming. Push events were not going to feed one consumer. They were going to feed notifications, a separate contact and knowledge graph, and more after that. Wiring each consumer directly into each receiver meant every new feature would touch every provider receiver, and a bug in any one consumer could break the acknowledgement path for all of them. The provider does not care about your internal mess. If you are slow or you error, it throttles you or gives up.

So before shipping real-time to anyone, we changed the shape. Receivers do one job: accept the notification, turn it into a small normalized event, publish it. What happens next is somebody else’s problem, on the other side of a bus.
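The receiver's one job can be sketched as follows. Everything named here (`MailboxChanged`, `Bus`, `receive`, the field names) is hypothetical, and the in-memory bus only stands in for a real queue-backed log where consumers would run asynchronously.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class MailboxChanged:
    """Small pointer event: which mailbox changed, never the email itself."""
    provider: str      # e.g. "gmail", "graph", "imap"
    account_id: str    # hypothetical internal account identifier
    cursor_hint: str   # provider-specific pointer (history number, message id)


class Bus:
    """In-memory stand-in for the real bus (assumption: simple fan-out)."""

    def __init__(self) -> None:
        self._subscribers: List[Callable[[MailboxChanged], None]] = []
        self.log: List[MailboxChanged] = []

    def subscribe(self, handler: Callable[[MailboxChanged], None]) -> None:
        self._subscribers.append(handler)

    def publish(self, event: MailboxChanged) -> None:
        self.log.append(event)
        for handler in self._subscribers:
            try:
                handler(event)
            except Exception:
                # A failing consumer must not break the receiver's ack path.
                pass


def receive(bus: Bus, provider: str, account_id: str, cursor_hint: str) -> int:
    """Accept, normalize, publish, acknowledge. Nothing else."""
    bus.publish(MailboxChanged(provider, account_id, cursor_hint))
    return 202  # acknowledge to the provider regardless of consumer health
```

The point of the shape is that adding a consumer is a `subscribe` call, and a consumer that throws never changes what the provider sees.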

Going real-time, provider by provider

Each provider got its own receiver, fluent in that provider’s dialect.

Gmail. We enroll a watch on the mailbox, pointed at a Pub/Sub topic, scoped to inbox and sent. Google then pushes whenever the mailbox changes. The twist: the push contains no email, just an address and a history number. To learn what happened you call the history API with your last cursor and read the diff. The watch expires in about seven days. And since the notification comes from Google’s infrastructure, not a user, we verify a signed token on every call and reject anything that is not Google.
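To make "just an address and a history number" concrete, here is a minimal sketch of unpacking a Pub/Sub push body. The envelope shape (`message.data` as base64) and the `emailAddress` and `historyId` fields follow Google's documented notification format; the function name is ours, and verification of the signed token is deliberately left out of this sketch.

```python
import base64
import json
from typing import Tuple


def parse_gmail_push(body: dict) -> Tuple[str, str]:
    """Return (email_address, history_id) from a Pub/Sub push envelope."""
    data = body["message"]["data"]
    # Tolerate missing padding; urlsafe_b64decode also accepts standard base64.
    padded = data + "=" * (-len(data) % 4)
    payload = json.loads(base64.urlsafe_b64decode(padded))
    # The history id may arrive as a string or a number; normalize to str.
    return payload["emailAddress"], str(payload["historyId"])
```

The returned history number is only a hint that something changed. The actual diff still comes from calling the history API with the last stored cursor.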

Outlook. Graph calls a webhook directly, which is easier until you read the contract. You answer within three seconds or Graph marks your endpoint slow and starts delaying you. Subscriptions live about three days, then must be renewed. You cannot edit the change type a subscription listens for, so widening coverage means deleting and recreating it. Every notification carries a secret we set at creation, and we check it, because a webhook anyone can call is not a webhook, it is an open door.
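The secret check might look like the sketch below. Graph delivers notifications as a `value` array whose items carry the `clientState` set at subscription time; the function name and the returned pointer shape are hypothetical. All slow work is assumed to happen after the response, so the handler can stay inside the time budget.

```python
import hmac
from typing import List, Tuple


def accept_graph_notifications(
    body: dict, expected_client_state: str
) -> Tuple[int, List[dict]]:
    """Keep only notifications whose clientState matches our secret."""
    accepted = []
    for item in body.get("value", []):
        supplied = item.get("clientState") or ""
        # Constant-time comparison so the secret cannot be probed by timing.
        if hmac.compare_digest(supplied.encode(), expected_client_state.encode()):
            accepted.append(
                {
                    "subscription_id": item.get("subscriptionId"),
                    "resource": item.get("resource"),
                }
            )
    # Respond immediately; fetching the message happens elsewhere.
    return 202, accepted
```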

IMAP, through EmailEngine. This one inverts every assumption. EmailEngine gives us webhooks for every IMAP account it manages, but the switch is global, not per account. Nothing to enroll, nothing to renew, which sounds like a gift until you flip it on and it starts firing for every connected account at once. So the scoping moved into the receiver: an allowlist of addresses we actually want, where empty means accept nothing. Each webhook is signed, and a tampered body is rejected.
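A sketch of that receiver-side gate, under stated assumptions: the signature is treated as a hex-encoded HMAC-SHA256 of the raw body, and the payload is assumed to expose the mailbox under an `address` field. Both are illustrative; EmailEngine's actual header name, digest encoding, and payload fields should be taken from its documentation.

```python
import hashlib
import hmac
import json
from typing import FrozenSet


def accept_imap_webhook(
    raw_body: bytes,
    signature_hex: str,
    secret: bytes,
    allowlist: FrozenSet[str],
) -> bool:
    """Verify the signature first, then apply the address allowlist."""
    expected = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected.encode(), signature_hex.encode()):
        return False  # tampered or unsigned body
    address = str(json.loads(raw_body).get("address", "")).lower()
    # An empty allowlist contains nothing, so it accepts nothing.
    return address in allowlist
```

Making empty mean "accept nothing" keeps the failure mode safe: a missing or misread configuration silences the receiver instead of opening it to every connected account.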

Three mechanisms, three auth schemes, one normalized event coming out the other side.

The insight that made it cheap

Here is what made the migration far less scary than it looked. We did not write new code to fetch email. When a Gmail push arrives, the receiver makes the exact same history call, with the exact same cursor handling, the poller already made. The fetch logic did not change. What changed was when it ran and for whom.

The poller ran on a clock, for every account. Push runs the instant something happens, for only the one mailbox that changed. Same fetch, different trigger. Thousands of scheduled scans of every account become event-driven work proportional to how much mail you actually get. That reframing is what turned “rewrite ingestion” into “change what pulls the trigger.”

Why we put a bus in the middle

Receivers publish small events, never the email body, into an event log backed by a queue. Consumers subscribe to what they care about. Notifications are one consumer, the contact graph another, and a third, added later, touches no receiver at all.

The “never the body” rule is a scar, not an aesthetic. Stuffing large payloads through this kind of plumbing had bitten us before, so events are tiny pointers and consumers fetch the heavy content themselves, idempotently, when ready.

The first consumer reuses the poller’s hard-won gates almost verbatim: the filters that drop calendar invites and self-sent mail, the quiet-hours logic, the guard against texting someone who has not written to you yet. The rewrite was never about changing what counts as worth a text. Only how fast we find out.

The parts that fought back

Real-time is not “polling, but faster.” It has whole categories of failure polling never had, because polling, for all its slowness, is relentless and self-correcting. A few that drew blood:

Subscriptions die quietly. A Gmail watch lapses in a week, a Graph subscription in days. Miss the renewal and the mailbox goes dark with no error, only silence. So renewal is its own scheduled job, sweeping for subscriptions about to expire, with Microsoft’s lifecycle warnings routed somewhere a human will see them. We treat silent subscription death as the number one failure mode, not an edge case.
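The selection step of such a sweep is small. In this sketch the `Subscription` record and the 24 hour lead time are assumptions; the only idea taken from the text is "find what is about to expire and renew it before it goes silent."

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass(frozen=True)
class Subscription:
    account_id: str
    provider: str          # e.g. "gmail" or "graph"
    expires_at: datetime   # timezone-aware expiry reported by the provider


def due_for_renewal(
    subs: Iterable[Subscription],
    now: datetime,
    lead_time: timedelta = timedelta(hours=24),  # assumption
) -> List[Subscription]:
    """Return subscriptions expiring within lead_time, including lapsed ones."""
    return [s for s in subs if s.expires_at - now <= lead_time]
```

Already-lapsed subscriptions are returned too, on purpose: a sweep that skips them would leave a dark mailbox dark.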

Two pushes, one cursor. Polling is a single job on a spaced-out clock, so it never races itself. Push is not. In our first live test, two notifications for the same mailbox arrived nearly at once, both read the same cursor, both fetched, both advanced it, and produced duplicate events. This bug does not exist until the day you go real-time, and then it exists immediately.
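One common way to close this kind of race is to serialize the read-fetch-advance step per mailbox. The sketch below does that with an in-process lock; it is an illustration of the technique, not necessarily the fix used here, and `CursorStore` and `fetch_since` are hypothetical names.

```python
import threading
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


class CursorStore:
    """Serializes read-fetch-advance per mailbox so pushes cannot interleave."""

    def __init__(self) -> None:
        self._cursors: Dict[str, int] = {}
        self._locks: Dict[str, threading.Lock] = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, mailbox: str) -> threading.Lock:
        with self._guard:
            return self._locks[mailbox]

    def process_push(
        self,
        mailbox: str,
        fetch_since: Callable[[int], Tuple[List[int], int]],
    ) -> List[int]:
        with self._lock_for(mailbox):
            cursor = self._cursors.get(mailbox, 0)
            events, new_cursor = fetch_since(cursor)
            self._cursors[mailbox] = new_cursor
            return events
```

With the lock held, the second push reads the cursor the first one already advanced and fetches nothing new. Receivers that run on more than one machine would need the same guarantee from shared storage, for example a row lock or a compare-and-set on the cursor column.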

Everything arrives at least twice. Pub/Sub, Graph, and the EmailEngine queue all promise at-least-once delivery, a polite way of saying “sometimes twice.” Dedup lives at every layer that writes: a uniqueness constraint on staged messages, a fingerprint on calendar changes, an existence check on publish. Consumers tolerate seeing the same thing again regardless.
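The uniqueness-constraint layer can be illustrated with SQLite. The table and column names are hypothetical and the production database is not named in this post; the technique is simply "let the database refuse the second copy."

```python
import sqlite3


def make_staging(conn: sqlite3.Connection) -> None:
    """Create a staging table that refuses duplicate messages per account."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS staged_messages (
            account_id TEXT NOT NULL,
            provider_message_id TEXT NOT NULL,
            UNIQUE (account_id, provider_message_id)
        )
        """
    )


def stage_message(
    conn: sqlite3.Connection, account_id: str, provider_message_id: str
) -> bool:
    """Insert once; return False when the message was already staged."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO staged_messages VALUES (?, ?)",
        (account_id, provider_message_id),
    )
    conn.commit()
    return cur.rowcount == 1
```

Because the check and the write are a single statement, two deliveries of the same notification cannot both pass an "is it there yet" test and then both insert.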

The label that hid the sent folder. We enrolled the Gmail watch on the inbox label, reasonably enough. The hard lesson: the label controls which changes trigger a push, not which changes the history API returns. So sent mail only showed up piggybacked on the next inbound push, and on a quiet mailbox that could be a while. The fix was one line, adding the sent label. Finding it took longer.

Why we are keeping the poller

Once push works, the temptation is to delete what it replaced. We are not going to, and that is the most important decision in the project.

The old poller had an accidental virtue. Because it scanned everything on a timer, any event it ever missed got swept up on the next pass. Slow, but a safety net with no holes. Pure push has holes by construction: a dropped notification, a lapsed subscription, a few minutes of provider trouble, and that email is gone from your stream for good.

So the poller does not get deleted. It gets demoted. For real-time accounts it stops being the primary path and becomes a slower reconciliation sweep that catches whatever push dropped. Fast path for latency, slow path for correctness. You want both, and one of them you already had.
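At its core the reconciliation sweep is a set difference between what the provider says exists and what has already been staged. A minimal sketch, with a hypothetical function name and plain string ids standing in for real message identifiers:

```python
from typing import Iterable, List, Set


def find_missed(provider_ids: Iterable[str], staged_ids: Set[str]) -> List[str]:
    """Return ids the provider reports that push never delivered, in order."""
    seen = set(staged_ids)
    missed: List[str] = []
    for message_id in provider_ids:
        if message_id not in seen:
            missed.append(message_id)
            seen.add(message_id)  # also collapses duplicates in the listing
    return missed
```

Anything it returns goes through the same staging path as a push would, so the dedup layers treat a recovered message exactly like a late one.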

What we measured, and where it stands

On accounts running real-time, the change is what you would hope. End to end, from email landing to event in hand, the numbers moved from the old two-to-twenty-seven-minute window down to single-digit seconds: Gmail three to six, Graph around two, calendar in the same band.

Two honest caveats. These figures come from a deliberately small live cohort, not full production load, so read them as “the architecture delivers seconds,” not a published SLA. And on the calendar path, push is now so fast that the next bottleneck is a downstream cron we have not yet pulled into the event-driven world. Fixing one bottleneck has a way of promoting the next.

Real-time ingestion is built, merged, and deployed, but not yet the default for everyone. We brought it up carefully: off at the provider level, then on for a small set of mailboxes we watch closely, with the bus running in staging while we harden the rollout. The rest is the unglamorous, load-bearing kind of work: wiring enrollment into connect and disconnect, the reconciliation sweep, turning the dials up one cohort at a time.

We could have shipped the obvious version in an afternoon, pushed straight into staging, declared victory, and spent six months untangling receivers wired into consumers. We spent the extra days on the boring middle instead, the bus and the safety net, because the goal was never a demo that buzzes your phone in two seconds once. It was an ingestion layer that buzzes your phone in two seconds every time, and tells us when it doesn’t.
