Most “best AI email assistant” posts are written by someone who never opened the products. They paste pricing tables and feature checklists off the marketing pages and call it a review. You can tell, because every tool somehow scores well at the exact thing its homepage brags about.
So I did the opposite. I took one real email thread, the kind that actually ruins a Tuesday, and fed the exact same thread to five AI assistants. Same input, same moment, no cherry-picking. Then I watched what each one did, and where each one quietly fell apart.
The thread I used
A client, Maya, replies to a project update. Four paragraphs. Three of them are her being politely furious that a deliverable slipped. The fourth, the only one that needs anything from me, is buried at the bottom:
…and honestly that’s the third time this has happened. I need to know it won’t again.
Separately, can we push Thursday’s 2pm to Friday? Raj should be on it too, he owns the budget sign-off now, not Dana. Loop him in.
A good human assistant catches all of that without being told. That’s the bar.
The four jobs hiding in that thread
- Catch the actual ask. The reschedule, under three paragraphs of complaint.
- Get the people right. Move the meeting, add Raj, and notice Dana no longer owns sign-off.
- Read the temperature. This is not a “sounds good!” reply. The relationship needs tending, not just a calendar change.
- Don’t overstep. Nobody auto-sends to an unhappy client without me seeing it first.
What each one did
No tool gets credit for a feature it has. It only gets credit for what it did with this thread.
alfred_
It pulled the reschedule out of the noise and showed its reasoning before doing anything: it flagged the thread as needing a reply, surfaced “move Thu 2pm to Fri” as the action, and read the sender’s tone as frustrated, not neutral. The draft acknowledged the slip first, confirmed the new time second, and left the send to me. It did not pretend to have rescheduled the meeting.
Where it fell short: capture the one thing it missed on the Raj/Dana swap. Be honest here even though it’s us.
Fyxer
Superhuman
Shortwave
Lindy
The scorecard
Same four jobs, scored once the captures are in. No tool gets a point for a feature it has, only for what it did with this thread.
| Tool | Caught the ask | People right | Read the room | Stayed in its lane |
|---|---|---|---|---|
| alfred_ | Yes | Partial | Yes | Yes |
| Fyxer | ? | ? | ? | ? |
| Superhuman | ? | ? | ? | ? |
| Shortwave | ? | ? | ? | ? |
| Lindy | ? | ? | ? | ? |
What this test really measures
The marketing question is “which one drafts the best email.” The real question, the one that decides whether you trust a thing with your inbox, is narrower: when the email is a mess, does the assistant find the one thing that matters, get the humans right, and refuse to do anything irreversible behind your back?
That is the whole test. The full results land as each tool is run on the exact same thread.