The first time you wire an API key into production, the dopamine is real. The second week is when the weird stuff shows up: empty responses, tool calls that “succeed” with nonsense, users who paste 30-page PDFs into a box meant for a sentence. The model did not get worse. Your safety net was missing.
Log like you will debug at 2 a.m. (because you will)
I log at minimum: the user-facing request id, the model name and version, token counts, latency, and a hash or truncated copy of the input. Not the raw input in plain text if it might contain PII (check your policy), but enough to replay a failure in a staging environment. If all you have in Sentry is “it failed,” you are guessing.
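For concreteness, here is roughly the wrapper I mean, in Python. It is a sketch: `log_llm_call`, the field names, and the `(text, prompt_tokens, completion_tokens)` tuple are my placeholders, not any SDK's real API.

```python
import hashlib
import json
import logging
import time

log = logging.getLogger("llm")

def log_llm_call(request_id: str, model: str, prompt: str, call_fn):
    """Wrap a model call with the minimum fields I want at 2 a.m.
    `call_fn` is whatever actually hits the API; assume it returns
    (text, prompt_tokens, completion_tokens)."""
    started = time.monotonic()
    try:
        text, prompt_tokens, completion_tokens = call_fn()
        log.info(json.dumps({
            "request_id": request_id,
            "model": model,  # include the version/date suffix, not just the family
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "prompt_preview": prompt[:200],  # truncated, never the full blob
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "latency_ms": round((time.monotonic() - started) * 1000),
            "status": "ok",
        }))
        return text
    except Exception as exc:
        log.error(json.dumps({
            "request_id": request_id,
            "model": model,
            "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
            "latency_ms": round((time.monotonic() - started) * 1000),
            "status": "error",
            "error": repr(exc),
        }))
        raise
```

The hash matters more than it looks: it lets you confirm that two failing requests carried the same input without storing the input itself.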
Fallbacks are a feature, not an apology
If the model times out, what happens? Silent error? Bad. A canned message? Better. A degraded path that does half the job without the model? Often best.
I try to make fallbacks feel intentional: “We could not generate a summary right now; here is the raw list instead.” People forgive honesty faster than they forgive confident nonsense.
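Here is the shape of that degraded path, as a sketch; `generate_summary` is a stand-in for whatever actually calls the model.

```python
def summary_or_fallback(items: list[str], generate_summary) -> str:
    """Try the model; if it fails, degrade honestly instead of erroring silently.
    `generate_summary` is a placeholder for the real model call."""
    try:
        return generate_summary(items)  # happy path
    except Exception:
        # Degraded path: half the job, no model, clearly labeled as such.
        bullets = "\n".join(f"- {item}" for item in items)
        return ("We could not generate a summary right now; "
                "here is the raw list instead:\n" + bullets)
```

The point is that the fallback is designed up front, in the same function, not bolted on after the first incident.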
Evals do not have to be a research project
You do not need a PhD to run a spreadsheet. Write down ten questions your feature is supposed to handle well. After each deploy, spot-check them. If that sounds basic, good—basic is how you notice when a “small” model upgrade changes tone in your support replies.
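That spreadsheet can literally be a twenty-line script. A sketch, with `ask` standing in for your model wrapper and keyword checks standing in for whatever “handled well” means for your feature:

```python
# Ten questions the feature must handle well.
# Run after each deploy; eyeball anything marked FAIL.
GOLDEN = [
    ("How do I reset my password?", ["reset", "password"]),
    ("Cancel my subscription", ["cancel"]),
    # ...eight more, pulled from real tickets
]

def spot_check(ask) -> None:
    """`ask` is whatever function wraps your model call and returns a string."""
    for question, must_contain in GOLDEN:
        answer = ask(question).lower()
        status = "PASS" if all(w in answer for w in must_contain) else "FAIL"
        print(f"{status}  {question}")
```

Keyword matching is crude, and that is fine: a crude check you actually run after every deploy beats a rigorous one you never wrote.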
The part nobody posts about
Sometimes the win is not smarter AI. It is less AI. A button that says “use last week’s report as the template” saves more real minutes than a five-paragraph generated essay nobody asked for.
I still get excited about new models. I just try to pair that excitement with the same paranoia I bring to anything that touches money or customer data. Boring? Sure. So is a good night’s sleep.