Shipping AI products is weird. The model isn’t just a component—it’s the user experience. Your users don’t interact with buttons and forms; they interact with a black box that sometimes works brilliantly and sometimes fails in baffling ways.
Over the past few years, I’ve shipped AI features at two startups: code completion and editing tools at Augment Code, and customer service chatbots at Eloquent Labs/Square. Each launch taught me something counterintuitive about how AI products actually work in the real world.
Here’s what I wish I’d known when I started.
The Model-UX Paradox: A Chicken-and-Egg Problem
The model is the UX, but UX decisions constrain your modeling choices. This creates a chicken-and-egg problem that requires good judgment to solve.
Example: Code Completion
Take code completion. In 2023, language models like StarCoder could generate impressively smart code—whole functions, complex logic, the works. But plug them directly into an editor and you’d get an awful experience:
- ❌ Massive completions that wouldn’t connect to your existing code
- ❌ “Fill-in-the-middle” failures—you’d be halfway through a function and the model would generate an entirely new function between your cursor and the rest of your code
The UX was broken. IIRC, GitHub Copilot’s early solution was to generate only single-line completions, but that wasted the model’s capabilities.
Our solution at Augment: We tuned our training data to encourage the model to generate code that aligned with semantic boundaries (ends of expressions, statements, and blocks) and to better simulate in-progress, incomplete code. The difference was huge. For the first time, the model was actually enjoyable to use, and the larger completions felt like leveling up.
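To make the idea concrete, here’s a minimal sketch of what boundary-aware truncation of a training target could look like. It’s purely illustrative (this post doesn’t describe Augment’s actual data pipeline), it handles Python only via the standard `ast` module, and a multi-language pipeline would more likely lean on something like tree-sitter.

```python
# Hypothetical sketch: trim a completion target so it ends on a statement/block
# boundary rather than mid-expression. Python-only, stdlib ast module;
# not Augment's actual training-data pipeline.
import ast

def truncate_at_statement_boundary(completion: str) -> str:
    """Return the longest prefix of `completion` made of complete statements."""
    lines = completion.splitlines()
    for end in range(len(lines), 0, -1):
        candidate = "\n".join(lines[:end])
        try:
            ast.parse(candidate)      # succeeds only if nothing is left dangling
            return candidate
        except SyntaxError:
            continue                  # drop the trailing partial line and retry
    return ""                         # nothing parses cleanly: skip this example

raw_target = "total = 0\nfor x in items:\n    total += x\nprint(tot"
print(truncate_at_statement_boundary(raw_target))
# -> total = 0
#    for x in items:
#        total += x
```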
Example: NextEdit Trade-offs
But sometimes UX constraints drive modeling decisions. When building Augment’s NextEdit feature—which suggests edits across your codebase based on recent changes—we faced a choice:
- Inline suggestions (like code completion)
  - ✅ More natural, reduced friction
  - ❌ Had to appear in ~300ms, constraining us to smaller models
- Separate list display
  - ✅ Could get smarter suggestions by running larger models for longer
  - ❌ More friction for users
We chose inline because it was a more discoverable on-ramp for the feature. And it was the right decision since users could always fall back to chat/agent for more comprehensive suggestions.
💡 Key Lesson: Your model architecture and your user interface aren’t separate decisions. They’re the same decision.
Ship Early and Often: Real Users Beat Any Eval
No evaluation beats real users. Ship as soon as you have something that works, because the problems you’ll discover are ones you never could have anticipated.
The Copy-Paste Problem We Never Saw Coming
With NextEdit, we thought we understood the use case: help users make consistent edits across their codebase. But once real users started using it, we discovered something unexpected:
🐛 Problem: The model would get confused when users copy-pasted large blocks of text.
🤔 Why this happened: Users copy-paste to get a starting point, then make changes. But the model couldn’t tell which parts were copied and which were actively being edited.
🔧 Our fix: We made edit events more granular to help the model distinguish between copy-paste and real edits.
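For illustration, a more granular edit event might carry its provenance explicitly. The schema below is hypothetical, not Augment’s, but it shows the kind of signal that lets the model treat a bulk paste as background rather than as an edit in progress.

```python
# Hypothetical edit-event schema (illustrative field names, not Augment's actual
# format). Recording how the text arrived lets the model weight a deliberate
# keystroke differently from a 500-line paste.
from dataclasses import dataclass
from typing import Literal

@dataclass
class EditEvent:
    path: str             # file the edit happened in
    start: int            # character offset where the change begins
    end: int              # character offset where the replaced range ends
    inserted_text: str    # text now occupying that range
    source: Literal["typed", "paste", "undo", "redo"]
    timestamp_ms: int

# A paste followed by a small manual tweak: only the second event signals intent.
history = [
    EditEvent("api.py", 0, 0, "# ...500 lines copied from another service...",
              source="paste", timestamp_ms=1_000),
    EditEvent("api.py", 42, 47, "limit", source="typed", timestamp_ms=4_500),
]
```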
🐛 New problem: The model became hyper-sensitive to recent changes. If you undid one edit, it would try to undo all your changes, assuming you wanted to revert everything.
💡 Key Insight: These are real-world problems that are impossible to anticipate in a lab. You can’t eval your way to discovering them. You have to ship and learn.
Blame the Model Last
When an AI feature produces bad output, your first instinct is to blame the AI. Don’t. Most quality issues aren’t model problems—they’re everything-else problems.
My Debug-First Approach
This mindset has served me well as a technical leader: Treat 100% of quality issues as the model’s fault, but investigate everything except the model first.
Case Study: Augment Agent
We launched Augment’s Agent in 2.5 months. I was responsible for quality.
Initial state: The agents were incredibly brittle. Some team members questioned whether agents could work at all; others pushed for more complex model orchestration.
What I discovered: When I dug into the traces, the failures weren’t model failures. They were context failures:
- ❌ Wrong file names
- ❌ Incorrect paths
- ❌ Bad environment information
Fundamentally, IDEs are complex environments, and we had to teach the agent how to navigate them. The solutions were simple but effective:
- ✅ Fuzzy matching to make edits more robust (sketched below)
- ✅ Better error messages so the agent could recover on its own
- ✅ Simplified tools that were easier for the agent to use correctly
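As an example of the first fix, here’s a rough sketch of fuzzy edit application, assuming an agent that proposes edits as (old_text, new_text) pairs; it’s not Augment’s implementation, just the shape of the idea. Note that the error it raises doubles as the kind of actionable message the second fix is about: something the agent can act on to recover.

```python
# Illustrative fuzzy edit application: tolerate minor drift between the agent's
# quoted old_text and the file's current contents instead of failing outright.
import difflib

def apply_edit(file_text: str, old_text: str, new_text: str, cutoff: float = 0.8) -> str:
    """Replace old_text with new_text, falling back to the closest matching block."""
    if old_text in file_text:                     # exact match: the easy case
        return file_text.replace(old_text, new_text, 1)

    # Fuzzy fallback: slide a window with the same number of lines over the
    # file and pick the most similar block, if it's similar enough.
    file_lines = file_text.splitlines(keepends=True)
    old_lines = old_text.splitlines(keepends=True)
    n = len(old_lines)
    best_ratio, best_start = 0.0, None
    for i in range(len(file_lines) - n + 1):
        window = "".join(file_lines[i:i + n])
        ratio = difflib.SequenceMatcher(None, window, old_text).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, i
    if best_start is None or best_ratio < cutoff:
        raise ValueError(
            f"Edit target not found (best similarity {best_ratio:.2f}); "
            "re-read the file and retry with its current contents."
        )
    return ("".join(file_lines[:best_start]) + new_text
            + "".join(file_lines[best_start + n:]))
```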
Only after cleaning up that debt could we focus on growing the agent’s capabilities—better tools, task lists, enhanced context, etc.
💡 Key Lesson: You have to earn the right to work on the fun stuff by fixing the boring infrastructure problems first.
Know Your Audience: Demoers vs Users
Demoers index for recall. Users index for precision.
I learned this early as Head of AI at Eloquent Labs, where we built enterprise customer service chatbots, and then again when working on AI coding agents:
🎯 Demoers (VCs and executives; vibe coders):
- Always tried the most complex scenarios—edge cases, weird requests, stress tests
- Wanted to see if the product could handle anything (recall)
👥 Typical users:
- Just wanted the simple stuff to work reliably every time (precision)
- Didn’t care whether we could handle esoteric requests if we couldn’t do the simple stuff consistently
💡 Key Insight: Both recall and precision matter, but when you’re evaluating feedback, know which audience you’re hearing from.
User Onboarding Is a Blocker For Model Quality
No amount of model quality will compensate for poor onboarding. Users need to understand not just how to use your feature, but when and why it’s valuable.
Case Study: NextEdit’s Mental Model Problem
When we soft-launched NextEdit, adoption was terrible. Users complained it was confusing and not useful. We interviewed them and discovered the problem:
🐛 The Issue: Their only introduction to the feature was an old video from when NextEdit showed a list of suggestions across the entire codebase. They thought it was only useful for whole-codebase refactors.
🤔 The Misunderstanding: They’d completely missed the point. NextEdit wasn’t a refactoring tool—it was “code completions++”. It was meant to help with small, contextual edits as you worked.
What we fixed:
- ✅ Better messaging: A new intro video that set the right expectations
- ✅ Interactive onboarding: Built a lightweight tutorial that explained the keybindings
- ✅ Simplified UI: Tweaked the interface to lean into the completions++ mental model
The Result
The second rollout had much better reception. The feature was exactly the same. The model was exactly the same. Only the onboarding and UX changed.
💡 Key Insight: Positioning matters. Make sure users have the right mental model for how to get the most out of your AI model.
Heuristics: Quick Fixes That Buy Time for Real Solutions
Every AI feature has quirks you’ll discover during user testing. Completions that generate unhelpful TODOs and comments. NextEdit constantly undoing user changes because it disagrees with something the user did earlier, or adding unnecessary whitespace.
The Heuristics Trade-off
About 90% of the time, the right fixes are simple heuristics:
- ✅ Quick to implement: Use regexes and/or an XGBoost classifier to filter out bad outputs (see the sketch after this list)
- ✅ Immediately effective: They solve the user-facing problem right away
- ❌ Blunt instrument: They also catch the remaining ~10% of genuinely valuable suggestions
- ❌ Technical debt: Creates maintenance burden and complexity
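As a hedged example of what such a heuristic can look like, the filter below vetoes completions that are mostly TODOs or bare comments. The patterns and threshold are invented for illustration; a production version might feed similar signals into a small classifier (e.g. XGBoost) instead of hard-coding the cutoff.

```python
# Illustrative heuristic filter: suppress completions that are mostly TODO or
# comment lines. Patterns and threshold are made up for this sketch.
import re

NOISE_PATTERNS = [
    re.compile(r"^\s*(#|//)\s*TODO", re.IGNORECASE),   # TODO-only filler
    re.compile(r"^\s*(#|//)"),                         # bare comment lines
]

def should_suppress(completion: str) -> bool:
    lines = [line for line in completion.splitlines() if line.strip()]
    if not lines:                                      # pure whitespace adds nothing
        return True
    noisy = sum(1 for line in lines if any(p.match(line) for p in NOISE_PATTERNS))
    return noisy / len(lines) > 0.5                    # mostly noise: hide it
```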
The right approach is to use heuristics as a bridge. They buy you time to move the fixes into the model itself, where they belong.
💡 Key Lesson: Heuristics are a tactical solution to buy time for strategic fixes. Use them, but don’t rely on them forever.
Vibes Are Your Only Hope Until They Don’t Cut It Anymore
Comprehensive automatic evaluation is nearly impossible before you launch. You can’t get realistic data that matches real usage patterns. You just can’t anticipate problems like NextEdit’s copy-paste confusion.
The Two-Phase Development Approach
🚀 Phase 1: Ship on vibes to get from 0 to 1
- ✅ Fast iteration: Rely on intuition, qualitative assessment, and crossing your fingers
- ✅ Early user feedback: Get real problems you couldn’t have anticipated
- ❌ Limited scalability: Can’t systematically improve or catch regressions
📊 Phase 2: Evaluation-driven improvement
- ✅ Systematic optimization: Measure impact of changes objectively
- ✅ Regression prevention: Catch when new features interfere with existing ones (sketched below)
- ✅ User complaint triage: Distinguish one-offs from genuine regressions
- ❌ Higher overhead: Takes time to build proper evaluation systems
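To sketch what “evaluation-driven” can mean in practice, here’s a toy regression check that compares a candidate model against the current one on a fixed set of cases. The case format and runner functions are placeholders, not a description of Augment’s eval stack.

```python
# Toy regression check: flag eval cases that the current model passes but the
# candidate fails. Case format and the run_* callables are placeholders.
from typing import Callable

def find_regressions(
    cases: list[dict],
    run_current: Callable[[dict], bool],   # True if the case passes
    run_candidate: Callable[[dict], bool],
) -> list[dict]:
    return [c for c in cases if run_current(c) and not run_candidate(c)]

# Gate the rollout on the result, e.g.:
#   regressions = find_regressions(cases, run_current, run_candidate)
#   assert not regressions, f"{len(regressions)} cases regressed"
```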
The Critical Transition
This transition—from vibes-based development to evaluation-driven improvement—is one of the most important shifts in shipping AI products. It’s the difference between a prototype and a product.
I’ve written more about this challenge here, but the key insight is knowing when to make the transition:
- 🔄 Too early: You’ll over-engineer evaluation for problems that don’t exist
- 🔄 Too late: You’ll ship broken updates to users who are starting to rely on your product
💡 Key Insight: Start with vibes to discover real problems, then build systematic evaluation to solve them at scale. Timing this transition is critical.
Shipping AI products is fundamentally about learning from users. You start with intuition and technical capability, but you succeed by building systematic ways to understand what people actually need. The model is just the beginning.