Articles

How to evaluate your product’s AI

ai evaluation

For the past two decades, benchmarks have been the backbone of AI progress. Capability benchmarks like MMLU, SWE-Bench or HLE have served as proxies for foundation model “IQ.” But they fall short when it comes to evaluating how well AI will perform in your product. I’ve spent much of my career building and critiquing evaluations, and in this post, I’ll share key lessons on designing an evaluation strategy that reflects real-world product impact.

Published on 22 Apr 2025

Why does the chaos game converge to the Sierpinski triangle?

math chaos

The chaos game is simple: starting from a random point, repeatedly mark the next point midway to a randomly chosen vertex of a triangle. Yet it generates a fascinating fractal, the Sierpinski triangle. In this post, I explain why through proofs and interactive figures.

Published on 28 Apr 2020

Highlights from CoNLL and EMNLP 2019

nlp research

CoNLL and EMNLP, two top-tier natural language processing conferences, were held in Hong Kong last month. A large contingent of the Square AI team, myself included, attended, and our fantastic intern, Justin Dieter, presented our work on a new contextual language generation task: mimic rephrasals. Despite being a regular conference attendee, I was surprised by the sheer quantity and quality of innovative ideas presented at the conferences: a true testament to how fast the field is moving. It’s impossible to cover everything that happened, but in this post I’ve tried to capture a sampling of the ideas I found most exciting in the sessions I attended.

Published on 03 Dec 2019

Efficiently estimating recall

applied-ml evaluation human-in-the-loop beta

When you’re trying to detect or identify an event with a machine learning system, the metrics you really care about are precision and recall. While measuring precision tells you where your model made a mistake, measuring recall can tell you where your model can improve. In this post, we’ll look at how to triple the data efficiency of estimating recall in practice using an importance-reweighted estimator.

Published on 09 Jun 2019

Why we need human evaluation.

nlp evaluation human-in-the-loop

We’re witnessing an exciting boom in the subfield of natural language generation (NLG), with more than 150 related papers published at top conferences in just the last year! Unfortunately, it’s incredibly hard to compare these different methods in a meaningful way, as many automatic evaluations have consistently been shown to correlate poorly with human judgment. In this paper, we’ll ask whether complete human evaluation is really necessary and, if so, what we can do to make human evaluations easier or cheaper to conduct.

Published on 10 Jul 2018

On Atomic Norms

optimization convex-geometry

How many linear measurements do you need to (efficiently) recover a low rank matrix? What about a sparse vector or an orthogonal matrix? Given that we know our object of interest has some ‘structure’, can we answer this question in a general manner?

In this article, I will show you one approach to doing so: regression using atomic norms. Most of the material I will cover was presented in the paper “The Convex Geometry of Linear Inverse Problems” by Venkat Chandrasekaran et al.

Published on 25 May 2013

Topic Models, Gaussian Integrals, and Getting Scooped

nlp graphical-models

This post is a brief overview of an approach I took to derive variational inference for the Correlated Topic Model, before realizing it had been done before.

Published on 25 Mar 2012

Excursions in Programming - Generators

programming

This is a historical post about generators in Python (when that was still a new and exciting feature) and how to achieve the same in C++.

Published on 12 Nov 2009

Projects

You can find a lot of my personal coding projects on GitHub.