$$ \newcommand{\sX}{\mathcal{X}} \newcommand{\sZ}{\mathcal{Z}} \newcommand{\muh}{\hat{\mu}} \newcommand{\mub}{\muh_{\text{mean}}} \newcommand{\mucv}{\muh_{\text{cv}}} \newcommand{\E}{\mathbb{E}} \newcommand{\Var}{\operatorname{Var}} \newcommand{\Cov}{\operatorname{Cov}} $$

We’re witnessing an exciting boom in the subfield of natural language generation (NLG), with more than 150 related papers published at ACL, NAACL, and EMNLP in just the last year! These papers cover a range of tasks, including abstractive summarization (Nallapati et al., 2016), open-response question answering (Nguyen et al., 2016; Kočiský et al., 2017), image captioning (Lin et al., 2014), and open-domain dialogue (Lowe et al., 2017b). Unfortunately, it’s incredibly hard to compare these different methods in a meaningful way. While most papers report automatic evaluations using metrics like BLEU or ROUGE, these metrics have consistently been shown to correlate poorly with human judgments of fluency, redundancy, overall quality, and so on. On the other hand, only a small fraction of these papers actually conduct a thorough human evaluation.
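To make the correlation claim concrete, here is a minimal sketch (not from any of the cited studies) of how one might measure how well an automatic metric tracks human judgment: score each system output with sentence-level BLEU and correlate those scores against human quality ratings. The use of `sacrebleu` and `scipy`, and the toy data below, are purely illustrative assumptions.

```python
# Illustrative sketch of metric-vs-human correlation. The data is made up
# and the choice of libraries is an assumption, not code from the post.
import sacrebleu
from scipy.stats import pearsonr

# Hypothetical system outputs, references, and 1-5 human quality ratings.
hypotheses = [
    "the cat sat on the mat",
    "a dog runs through the park",
    "the weather today is sunny",
    "he reads a book every night",
]
references = [
    "the cat sat on the mat",
    "the dog was running through the park",
    "it is sunny outside today",
    "every night he reads a book",
]
human_scores = [5.0, 3.5, 2.0, 4.0]

# Sentence-level BLEU for each output. Sentence BLEU is noisy, which is part
# of why metric-human correlations tend to be weak at this granularity.
bleu_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references)
]

r, p = pearsonr(bleu_scores, human_scores)
print(f"Pearson r between sentence BLEU and human judgment: {r:.2f} (p={p:.2f})")
```

In practice, such correlations are computed over many more examples (or at the system level), but the recipe is the same: pair metric scores with human judgments and check how strongly they agree.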
...