The Household Bug Report
Remi started from a very ordinary problem: my wife would order something, it would show up, a week or two would pass, and then the same question would eventually hit the room.
"Wait, can I still return that?"
At which point the ritual began. Search Gmail. Find the order confirmation. Find the shipping email. Find the delivery email. Try to remember whether this retailer starts the return window from the order date or the delivery date. Realize one store gives you 30 days, another gives you 14, and a third has some weird exception for electronics, final sale, or whatever fresh anti-consumer side quest they felt like adding.
None of this is catastrophic. That is exactly what makes it interesting. Missing a return window is not a life event. It is just one of those low-grade modern annoyances that happens often enough to feel dumb, but not dramatically enough that most people build a system for it.
Those are often good software problems. They have just enough pain to matter and just enough repetition to justify infrastructure.
At first I thought I was building a reminder app. That framing lasted for maybe five minutes.
The Wrong Mental Model
The reminder is not the product.
The reminder is the easiest, least interesting part of the whole system. Every app in America can send a text three days before a date. Twilio can do it. Email can do it. Push notifications can do it. Cron can do it. That is not where the work is.
The hard part is knowing what date deserves the reminder in the first place.
Once you see the problem clearly, the shape of the app changes. This is not really a reminder app with some email features attached. It is an email classification and date-resolution system with a reminder layer hanging off the back.
That distinction matters because the core promise of Remi is not "I will buzz your phone before something expires." The real promise is "I actually know when your return window expires."
If that part is wrong, the rest of the product is just polished spam.
A shipping email misclassified as a delivered email means the countdown starts too early. A retailer normalized to the wrong brand means the return policy lookup is wrong. A tracking number extracted as an order number means a later delivery email no longer matches the original order confirmation, so the system creates a duplicate order instead of updating the existing one.
Tiny extraction mistakes turn into very product-shaped failures.
That is why the most load-bearing code in Remi is not the scheduler. It is the AI prompt.
The Load-Bearing Prompt
That probably sounds a little absurd if you have not built one of these systems before, but it becomes obvious once you trace the pipeline. The prompt is the thing telling the model what kind of email it is looking at, what counts as evidence, when a date is real, when a number is fake, and how uncertain it is allowed to be.
If that layer is sloppy, everything downstream becomes a cleanup exercise.
The first version of the prompt was basically one flat list of rules: figure out whether this is a delivery email, extract some fields, return JSON. It was fine until it was not. Smaller models especially tend to pattern-match too early. They see "order number," "package," and "arriving Thursday," and they start helpfully filling in fields before they have actually committed to what the email is.
That is how you get hallucinated delivery dates on shipping emails. The model is not being malicious. It is just trying to be useful a little too early.
So I rewrote the prompt as an actual workflow.
Step 1: classify before extracting
The model now has to make a classification decision first, then assign a subtype, and only then extract fields. That sounds obvious in retrospect, but it makes a real difference because it forces a commit before the model starts improvising details.
Instead of "this kind of feels like an order email, let me fill some slots," the model has to say what it believes the email actually is.
Step 2: make email type structural, not implied
The subtype enum ended up mattering a lot:
- DELIVERED
- SHIPPED
- ORDER_CONFIRMED
- RETURN_REMINDER
- OTHER
Naming the subtypes changed the system in a useful way: delivery_date stopped being governed by an informal rule and became a structural consequence of classification.
Only DELIVERED is allowed to produce a delivery_date.
That sounds like a small difference, but it is the difference between prose guidance and an enforceable boundary. "Please remember that shipping notifications should not set delivery dates" is a polite suggestion. "Only this subtype may populate this field" is a system rule.
Once subtype existed in the schema, I could also enforce it in code. If the model says SHIPPED and still hands me a delivery date, I strip the hallucinated field and keep the rest of the classification. If it says OTHER and is_delivery: true, I reject the payload because it is internally contradictory.
That is a broader pattern I keep seeing with AI products: if something matters, do not leave it floating in prose. Turn it into a typed boundary.
The Edge Cases Are the Product
A lot of software gets built as if the edge cases are annoying debris around the main flow. Here, the edge cases are the main flow.
Retailer normalization
Normal people read Gap Factory and Gap as close enough. The return policy table does not.
Same with Amazon Warehouse versus Amazon. Same with Nordstrom Rack versus Nordstrom. If the model collapses those to the parent brand, the app can compute a clean, confident, completely wrong return deadline.
So the retailer normalization rules had to get much stricter, and the policy lookup had to stop doing fuzzy matching in ways that undid that work. Preserving sub-brands only matters if the rest of the system agrees to respect them.
Tracking numbers pretending to be order numbers
Tracking numbers are catnip for language models. They look official. They are long, ugly, and visually prominent, which makes them feel important. The problem is that they are usually not the canonical ID I want for deduping later messages.
If the app stores a UPS tracking number as the order number, then the later delivery email may no longer match the original order confirmation. At that point the bug is no longer "the parser grabbed the wrong string." The bug is "the product thinks one purchase is two separate orders."
That is a much more expensive failure than it sounds.
Relative dates
"Delivered today" shows up constantly in real delivery emails.
If you do not define that behavior explicitly, you get chaos. Some models infer a date incorrectly. Some null it out. Some improvise based on whatever they think "today" ought to mean. So I made the split of responsibility explicit: the model can say "this is DELIVERED, but the email did not include a concrete date," and the application resolves that to the current date.
Again: the model identifies the situation, the application anchors it in time.
That division of labor is cleaner than pretending the model can safely infer every relative date in context.
What the Pipeline Actually Does
The product flow now looks roughly like this.
A user connects Gmail or Outlook with read-only access. The app does a first-pass pre-filter on sender and subject so it can reject most inbox noise without even touching the body. Marketing email gets skipped. Account notices get skipped. Grocery and food delivery gets skipped. A lot of the privacy story starts right there: the best email to parse is the one you never had to open.
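A header-only pre-filter can be as simple as a pair of deny patterns and an allow pattern. These particular regexes are illustrative stand-ins; a real filter would be larger and tuned against actual inbox data:

```python
import re

# Illustrative patterns only; not Remi's real filter rules.
SKIP_SENDER = re.compile(r"(newsletter|marketing|promo|no-?reply@.*(bank|grocery))", re.I)
SKIP_SUBJECT = re.compile(r"(sale|% off|weekly ad|your statement|password)", re.I)
ALLOW_SUBJECT = re.compile(r"(order|shipped|delivery|delivered|return)", re.I)

def passes_prefilter(sender: str, subject: str) -> bool:
    """Reject obvious noise on headers alone, before the body is ever
    fetched or parsed. Most of the privacy story lives here."""
    if SKIP_SENDER.search(sender) or SKIP_SUBJECT.search(subject):
        return False
    return bool(ALLOW_SUBJECT.search(subject))
```

Everything rejected here is an email whose body the system never reads, which is the cheapest possible privacy guarantee.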
If an email passes that filter, the app sends a truncated snippet to the model and asks for a strict JSON payload:
- subtype
- whether it is a real delivery/order email
- retailer
- order number
- order date
- delivery date
- return window, if explicitly stated
- confidence
Then the application does the grown-up work. It validates the output, rejects contradictions, normalizes the retailer, computes the deadline from the correct base date, and checks whether this email should update an existing order or create a new one.
That update path matters a lot. In practice, inboxes describe the same purchase in stages. You get an order confirmation first, then a shipping update, then a delivery confirmation. The app has to understand that these are not three unrelated events. They are three partial views of the same order.
That is why getting the order identifier right matters, and why the delivery classification matters even more.
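The update-versus-create decision can be modeled as an upsert keyed on (retailer, order number). This sketch uses an in-memory dict where the real app would use a database, but it shows why a tracking number stored as the order number splits one purchase into two records:

```python
def upsert_order(orders: dict, payload: dict) -> dict:
    """Merge a new email into the order table keyed by
    (retailer, order_number). Later emails about the same purchase
    update the existing record instead of creating a duplicate."""
    key = (payload["retailer"], payload["order_number"])
    existing = orders.get(key, {})
    # Each stage only fills in the fields it actually knows about,
    # so a shipping email never blanks out the order date.
    merged = {**existing, **{k: v for k, v in payload.items() if v is not None}}
    orders[key] = merged
    return merged
```

If the delivery email arrives carrying a tracking number in the order_number slot, the key no longer matches and the merge never happens, which is exactly the duplicate-order failure described above.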
Delivery Is the Real Starting Gun
One of the more important product corrections I made was changing when reminders get created.
The first instinct is to start the reminder sequence as soon as the system sees evidence of the purchase. That feels reasonable until you realize it makes the app subtly dishonest. A shipping email can arrive days before the package actually lands. If you start the clock there, the countdown is wrong, the midpoint reminder is wrong, and the final reminder can show up when the item has barely been in the house.
So Remi only creates reminders once delivery is actually confirmed.
That sounds like a detail, but it is one of the main things that makes the product feel trustworthy. It is not approximating the return window. It is tracking the actual one as best it can.
The sequence now is:
- confirmation reminder when delivery is confirmed
- midpoint reminder halfway through the return window
- final reminder near the deadline
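That three-reminder sequence is just date arithmetic once delivery is confirmed. The three-day lead on the final reminder is my own illustrative choice for "near the deadline":

```python
from datetime import date, timedelta

def reminder_schedule(delivered_on: date, window_days: int) -> dict[str, date]:
    """Compute all three reminder dates from a confirmed delivery.
    Reminders only exist once delivery is confirmed, so the clock
    never starts from a shipping email."""
    deadline = delivered_on + timedelta(days=window_days)
    return {
        "confirmation": delivered_on,
        "midpoint": delivered_on + timedelta(days=window_days // 2),
        # "Near the deadline"; the 3-day lead is an assumption.
        "final": deadline - timedelta(days=3),
    }
```

Because everything is derived from the confirmed delivery date, a corrected delivery date automatically corrects the whole schedule.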
That turns the app from "calendar-ish thing with email attached" into something that actually understands the lifecycle of the order.
Trust Is Part of the Architecture
Any product that reads email has to clear a high trust bar, and honestly it should. If the experience feels even a little creepy, people are right to bounce.
"Trust me with your inbox" is an insane thing to ask casually.
That pushed the design in a pretty opinionated direction. The app uses read-only OAuth scopes. It cannot send, delete, or modify anything. It stores structured metadata about orders and deadlines, not a shadow archive of the inbox. It tries to reject as much junk as possible before deeper parsing. The architecture is trying to keep the useful structure and discard the rest.
I think a lot of AI products treat privacy as a legal page problem. It is not. It is a product design problem. If the system cannot explain what it keeps, what it discards, and why it needs the access at all, then the user is right to assume the worst.
In Remi, the trust story is not a little compliance garnish on top of the app. It is part of the product itself.
What I Like About It
What I like about Remi is not just that it might save somebody from eating the cost of a missed return. I like that it started from a very normal domestic complaint and forced a bunch of sharper engineering decisions than a more self-important project might have.
It forced me to think about classification before extraction. It forced me to turn prompt guidance into typed boundaries. It forced me to treat trust as architecture instead of copywriting. It forced me to care about tiny parsing errors not as model trivia, but as product failures with real downstream consequences.
That is probably the bigger point.
Some of the best software ideas do not show up wearing a big founder costume. They show up as one annoying sentence repeated often enough that you eventually get tired of hearing it. In my case, that sentence was some version of "Wait, can I still return that?" The answer turned out not to be a note-taking app, not a calendar, and not one more generic reminder tool. It turned into a fairly opinionated little email agent whose entire job is to make one category of modern life a little less chopped.
That feels like real product work to me.