$$ \newcommand{\sX}{\mathcal{X}} \newcommand{\sZ}{\mathcal{Z}} \newcommand{\muh}{\hat{\mu}} \newcommand{\mub}{\muh_{\text{mean}}} \newcommand{\mucv}{\muh_{\text{cv}}} \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} $$

We’re witnessing an exciting boom in the subfield of natural language generation (NLG), with more than 150 related papers published at ACL, NAACL, and EMNLP in just the last year! These papers cover a range of tasks, including abstractive summarization (Nallapati et al., 2016), open-response question answering (Nguyen et al., 2016; Kočiský et al., 2017), image captioning (Lin et al., 2014), and open-domain dialogue (Lowe et al., 2017b). Unfortunately, it’s incredibly hard to compare these different methods in a meaningful way. While most papers report automatic evaluations using metrics like BLEU or ROUGE, these metrics have consistently been shown to correlate poorly with human judgments of fluency, redundancy, overall quality, and so on. On the other hand, only a small fraction of these papers actually conduct a thorough human evaluation.
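To make the correlation claim concrete, here is a minimal sketch (not from any of the cited studies) of how one might measure how well an automatic metric tracks human judgment: score each system output with sentence-level BLEU and correlate those scores against human quality ratings. The use of `sacrebleu` and `scipy`, and the toy data below, are purely illustrative assumptions.

```python
# Illustrative sketch of metric-vs-human correlation. The data is made up
# and the choice of libraries is an assumption, not code from the post.
import sacrebleu
from scipy.stats import pearsonr

# Hypothetical system outputs, references, and 1-5 human quality ratings.
hypotheses = [
    "the cat sat on the mat",
    "a dog runs through the park",
    "the weather today is sunny",
    "he reads a book every night",
]
references = [
    "the cat sat on the mat",
    "the dog was running through the park",
    "it is sunny outside today",
    "every night he reads a book",
]
human_scores = [5.0, 3.5, 2.0, 4.0]

# Sentence-level BLEU for each output. Sentence BLEU is noisy, which is part
# of why metric-human correlations tend to be weak at this granularity.
bleu_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]

r, p = pearsonr(bleu_scores, human_scores)
print(f"Pearson r between sentence BLEU and human judgment: {r:.2f} (p={p:.2f})")
```

In practice, such correlations are computed over many more examples (or at the system level), but the recipe is the same: pair metric scores with human judgments and check how strongly they agree.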
...