What I've learned as an AI researcher shipping product

Shipping AI products is weird. The model isn’t just a component—it’s the user experience. Your users don’t interact with buttons and forms; they interact with a black box that sometimes works brilliantly and sometimes fails in baffling ways. Over the past few years, I’ve shipped AI features at two startups: code completion and editing tools at Augment Code, and customer service chatbots at Eloquent Labs/Square. Each launch taught me something counterintuitive about how AI products actually work in the real world. ...

August 1, 2025 · 8 min · 1534 words · Arun Tejasvi Chaganty

How to evaluate your product's AI

For the past two decades, benchmarks have been the backbone of AI progress. Capability benchmarks like MMLU, SWE-Bench or HLE have served as proxies for foundation model “IQ.” But they fall short when it comes to evaluating how well AI will perform in your product. I’ve spent much of my career building and critiquing evaluations, and in this post, I’ll share key lessons on designing an evaluation strategy that reflects real-world product impact. ...

April 22, 2025 · 5 min · 865 words · Arun Tejasvi Chaganty

Why does the chaos game converge to the Sierpinski triangle?

Here is a simple process, also known as the “chaos game”, to generate a shape: 1. Draw an equilateral triangle on a piece of paper and mark a random initial point. 2. Mark the next point midway to one of the vertices of the triangle, chosen randomly. 3. Repeat step 2 ad infinitum or ad nauseam, whichever comes first. If you haven’t seen this before (and maybe even if you have): what shape do you expect to emerge? Now, try simulating the game below: ...
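The three steps above can be sketched in a few lines of Python (a minimal simulation; the function name and triangle coordinates are my own choices, not from the post):

```python
import random

def chaos_game(n_points, vertices=((0.0, 0.0), (1.0, 0.0), (0.5, 0.866))):
    """Play the chaos game: from a random starting point, repeatedly
    jump halfway toward a randomly chosen vertex of the triangle."""
    x, y = random.random(), random.random()  # random initial point
    points = []
    for _ in range(n_points):
        vx, vy = random.choice(vertices)
        x, y = (x + vx) / 2, (y + vy) / 2  # midpoint toward the chosen vertex
        points.append((x, y))
    return points
```

Scatter-plotting the returned points (after discarding the first few iterates) reveals the answer to the question above.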

April 28, 2020 · 17 min · 3617 words · Arun Tejasvi Chaganty

Highlights from CoNLL and EMNLP 2019

CoNLL and EMNLP, two top-tier natural language processing conferences, were held in Hong Kong last month. A large contingent of the Square AI team, myself included, attended, and our fantastic intern, Justin Dieter, presented our work on a new contextual language generation task: mimic rephrasals. Despite being a regular conference attendee, I was surprised by the sheer quantity and quality of innovative ideas presented at the conference: a true testament to how fast the field is moving. It’s impossible to cover everything that happened, but in this post I’ve tried to capture a sampling of the ideas I found most exciting in the sessions I attended. ...

December 3, 2019 · 22 min · 4526 words · Arun Tejasvi Chaganty

Efficiently estimating recall

tl;dr: Factorize recall measurement into a cheaper precision measurement problem and profit. Measuring precision tells you where your model made a mistake; measuring recall tells you where your model can improve. Estimating precision directly is relatively easy, but estimating recall directly is quintessentially hard in “open-world” domains because you don’t know what you don’t know. As a result, recall-oriented annotation can cost an order of magnitude more than the analogous precision-oriented annotation. By combining cheaper precision-oriented annotations on several models’ predictions with an importance-reweighted estimator, you can triple the data efficiency of getting an unbiased estimate of true recall. If you’re trying to detect or identify an event with a machine learning system, the metrics you really care about are precision and recall: if you think of the problem as finding needles in a haystack, precision tells you how often your system mistakes a straw of hay for a needle, while recall tells you how often your system misses needles entirely. Both of these metrics are important in complementary ways. The way I like to think about it is that measuring precision tells you where your model made a mistake while measuring recall tells you where your model can improve. ...

June 9, 2019 · 12 min · 2374 words · Arun Tejasvi Chaganty

Why we need human evaluation.

We’re witnessing an exciting boom in the subfield of natural language generation (NLG), with more than 150 related papers published at ACL, NAACL and EMNLP in just the last year! These papers cover a range of tasks including abstractive summarization (Nallapati et al., 2016), open-response question answering (Nguyen et al., 2016; Kočisky et al., 2017), image captioning (Lin et al., 2014), and open-domain dialogue (Lowe et al., 2017b). Unfortunately, it’s incredibly hard to compare these different methods in a meaningful way. While most papers report automatic evaluations using metrics like BLEU or ROUGE, these metrics have consistently been shown to poorly correlate with human judgment on fluency, redundancy, overall quality, etc. On the other hand, only a small fraction of these papers actually conduct a thorough human evaluation. ...

July 10, 2018 · 10 min · 2080 words · Arun Tejasvi Chaganty

On Atomic Norms

How many linear measurements do you need to (efficiently) recover a low-rank matrix? What about a sparse vector or an orthogonal matrix? Given that we know our object of interest has some ‘structure’, can we answer this question in a general manner? In this article, I will show you one approach to do so: regression using atomic norms. Most of the material I will cover was presented in the paper, “The Convex Geometry of Linear Inverse Problems” by Venkat Chandrasekaran et al. ...
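For reference, the central definition from Chandrasekaran et al. is that the atomic norm induced by a set of atoms $\mathcal{A}$ is the gauge of its convex hull:

$$ \|x\|_{\mathcal{A}} = \inf \left\{ t > 0 : x \in t \operatorname{conv}(\mathcal{A}) \right\}. $$

Taking the atoms to be signed standard basis vectors recovers the $\ell_1$ norm (sparse vectors); taking them to be unit-norm rank-one matrices recovers the nuclear norm (low-rank matrices), which is how a single framework answers the question for many kinds of structure at once.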

May 25, 2013 · 17 min · 3525 words · Arun Tejasvi Chaganty

Topic Models, Gaussian Integrals, and Getting Scooped

Today’s post is a brief overview of one of my research projects, one that, unfortunately, has already been done before. It is going to be a little mathematical, but I’ll try to provide enough intuitive reasoning that you don’t have to really be at one with the mathematics. Topic Models So, these days, every Tom, Dick and Harry can crawl the web or some other incomprehensibly large source of news, views and… I’d rather not continue. What could you do with this kind of data? Well, you could use it to classify or cluster web pages for use in search engines like DuckDuckGo, or to recommend books, etc. When analysing these documents’ text, single words on their own, like “files”, could have several different connotations, depending on whether you’re talking about computers, bureaucracy, or breaking out of a prison. However, you might be able to disambiguate which “topic” is being talked about based on other words you saw, like “screen”, “touchpad” and “Steve Jobs”. ...

March 25, 2012 · 9 min · 1774 words · Arun Tejasvi Chaganty

Excursions in Programming - Generators

“Lazy computation kicks the llama’s ass” - Somebody The concept of a generator may be very foreign to those who haven’t really explored Python, though it is a general principle (like iterators). The crux of it is to generate the members of a set or relation one at a time (which are subsequently returned). If you are not going to store the enumeration, this method prevents needless use of memory (which could otherwise make the program extremely slow). The trade-off is the (minute) expense of storing some “state” in the function, and perhaps some additional function-call overhead that is rarely a concern. ...
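The “one at a time” idea is exactly what Python’s `yield` gives you. A minimal illustration (my own example, not from the post):

```python
def squares(limit):
    """Lazily yield perfect squares below limit, one at a time,
    without ever materializing the whole sequence in memory."""
    n = 0
    while n * n < limit:
        yield n * n  # execution pauses here; local state (n) is retained
        n += 1

# The caller pulls values on demand; nothing is computed until asked for.
gen = squares(30)
first = next(gen)       # 0 — only the first square has been computed so far
rest = list(gen)        # drains the remaining values: [1, 4, 9, 16, 25]
```

The saved “state” the excerpt mentions is just the function’s suspended local variables (`n` here), which is why the per-step cost is so small.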

November 12, 2009 · 6 min · 1227 words · Arun Tejasvi Chaganty