Highlights from CoNLL and EMNLP 2019
December 03, 2019

CoNLL and EMNLP, two top-tier natural language processing conferences, were held in Hong Kong last month. A large contingent of the Square AI team, myself included, attended and our fantastic intern, Justin Dieter, presented our work on a new contextual language generation task: mimic rephrasals. Despite being a regular conference attendee, I was surprised by the sheer quantity and quality of innovative ideas presented at the conference: a true testament to how fast the field is moving. It’s impossible to cover everything that happened, but in this post I’ve tried to capture a sampling of the ideas I found most exciting in the sessions I attended.

Outline

Here’s an outline of the selection of ideas covered in this post:

Methods
Datasets / Tasks
Takeaways

Methods

This section covers the methodological advances I found interesting – these are techniques that I think could be broadly applicable and are worth adding to your tool box.

Significance testing done right when searching over hyperparameters

Show Your Work: Improved Reporting of Experimental Results arxiv
Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, Noah A. Smith

Hyperparameters are a very important factor in how well a model performs: a simpler model might do significantly better than a complex one (or vice versa) depending on how much effort you put into finding the right hyperparameters.
As a particularly prominent recent example, RoBERTa uses the same architecture as BERT but explores the hyperparameter space more thoroughly. The authors show that it outperforms XLNet (which significantly outperformed the original BERT models) when given the same amount of data (which, in fairness, is more than the original BERT model saw).
So, is XLNet actually a better model than BERT? It’s hard to say, because the result clearly depends on how many hyperparameter configurations you tried. Answering this question is exactly what this paper sets out to do.
It models hyperparameter search as a random process: we draw hyperparameters using some user-defined (random) search mechanism, and this induces a (potentially gnarly) distribution over scores on the development set.
Now, instead of saying that model A is “better” than model B in the absolute sense, we frame the question as: “is model A better than model B given a search budget of \(N\) hyperparameters?”
Despite this additional bit of complexity, the actual statistical testing procedure proposed by the paper is very simple and the authors have even made the code readily available.
Of course, there are some minor points one can quibble over: some models simply have more hyperparameters and thus need more search, and someone might use a cleverer search algorithm. Still, this is definitely a step in the right direction.
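To make the budget-conditioned comparison concrete, here is a minimal sketch of one quantity in this spirit: the expected best development score after \(N\) random draws, estimated from the scores of trials you have already run. The function name and the toy scores are assumptions for illustration; this is not the authors' released code.

```python
def expected_best_score(scores, budget):
    """Expected maximum of `budget` i.i.d. draws from the empirical
    distribution of `scores` (one dev score per hyperparameter trial)."""
    ordered = sorted(scores)
    n = len(ordered)
    total = 0.0
    for i, score in enumerate(ordered, start=1):
        # P(the best of `budget` draws is the i-th smallest observed score)
        prob_is_max = (i / n) ** budget - ((i - 1) / n) ** budget
        total += score * prob_is_max
    return total


# Hypothetical dev accuracies from random search over two models.
model_a = [0.70, 0.72, 0.74, 0.75, 0.76]
model_b = [0.55, 0.60, 0.78, 0.80, 0.81]

for budget in (1, 2, 5, 10):
    print(budget,
          round(expected_best_score(model_a, budget), 3),
          round(expected_best_score(model_b, budget), 3))
```

With these made-up numbers, model A looks better if you can only afford a single trial, while model B wins once the budget grows, which is exactly why "better" needs a budget attached.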

Unlearning dataset bias by fitting the residual

Unlearn Dataset Bias in Natural Language Inference by Fitting the Residual arxiv
He He, Sheng Zha, Haohan Wang

Several datasets have been shown to have significant annotation biases: for example, systems can correctly predict whether a sentence (the “premise”) entails another (the “hypothesis”) 67% of the time on the Stanford Natural Language Inference dataset without even looking at the premise (majority class will only get you 34.3%)!
The root of the problem is that there isn’t enough randomness in source data because annotators are time-pressured.
Is there a way to train a classifier that doesn’t make spurious associations?
The key idea the paper presents is to learn a biased classifier that is deliberately handicapped to use “bad” features (e.g. only using features from the hypothesis without looking at the premise) and to then train your actual model on the residual between the gold label and what the biased classifier predicts.
As a result, the classifier is forced to focus on things that are “hard” and can’t be predicted solely using bad features.
It’s a very simple idea, but the authors show that this “de-biased” classifier performs much better on adversarial datasets.
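One way to read "fitting the residual" is that the main model's logits are added to the frozen biased classifier's log-probabilities before the cross-entropy is computed. The sketch below follows that reading with toy numbers; the function names and inputs are assumptions, and it is not the authors' implementation.

```python
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def residual_loss(main_logits, bias_log_probs, gold):
    """Cross-entropy of the combined prediction. Only `main_logits` would be
    updated; `bias_log_probs` comes from the frozen biased classifier."""
    combined = log_softmax([b + m for b, m in zip(bias_log_probs, main_logits)])
    return -combined[gold]


untrained_main = [0.0, 0.0, 0.0]  # main model before any updates
biased = log_softmax([3.0, 0.0, 0.0])  # hypothesis-only model, confident in class 0

# The biased model is already right: little loss is left for the main model.
easy = residual_loss(untrained_main, biased, gold=0)
# The biased model is confidently wrong: the main model has to correct it.
hard = residual_loss(untrained_main, biased, gold=1)
print(round(easy, 3), round(hard, 3))
```

Examples that the "bad" features already explain contribute almost nothing to training, so the main model's capacity goes to the examples where those features fail.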

Bidirectional sequence generation

Attending to Future Tokens for Bidirectional Sequence Generation pdf
Carolin Lawrence, Bhushan Kotnis and Mathias Niepert

This is a pretty out-there idea: most of the time, models (like humans) generate sequences left-to-right (or right-to-left or top-down, depending on your writing system), but would it help if you could generate out of order?
Ostensibly, this could be useful if you were first “sketching” out a sentence and then reformatting it, though I think it’s mostly an interesting modelling challenge.
The authors adapt a standard sequence-to-sequence transformer architecture which typically uses a single “placeholder” token to generate the next token.
Instead, the authors include a long (but fixed) sequence of placeholders that the model could predict tokens for.
They experiment with different decoding strategies like left-to-right or greedily picking the most confident token regardless of position. The predicted tokens are then fed into the model as usual in the next step of decoding.
When training, you need to (un-)mask some tokens so that the model sees some of these partial generation situations. They do this by replacing placeholders stochastically, using either a Bernoulli or a Gaussian random variable, and find that the Gaussian works better.
- Note: I’m curious why they didn’t use a Poisson instead.
The authors find that their model does well on two generation tasks, but what I was really interested in was how the different decoding strategies performed: it turns out that the greedy decoding strategy (using the most confident token) basically runs left-to-right (which is somewhat expected given how language evolved), but the ability to attend to future tokens actually seems to help.
A quick example is that when generating the first word, the model would often attend to the typical verb position.
Note: There was another paper from Facebook in the same session that looked at out-of-order generation by generating “(token, relative-position)” pairs, but I especially liked the analysis in this one.
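The "most confident token first" decoding strategy can be sketched in a few lines. Everything model-related here is a hypothetical stand-in: `toy_predict` is a hand-written function, not the authors' transformer, and it is rigged so that confidence rises once the position to the left has been filled.

```python
def decode_most_confident_first(predict, length, placeholder="<ph>"):
    """Fill `length` placeholders one at a time, always committing the
    (position, token) pair the model is most confident about."""
    tokens = [placeholder] * length
    order = []
    while placeholder in tokens:
        distributions = predict(tokens)  # one {token: prob} dict per position
        best = None
        for pos, dist in enumerate(distributions):
            if tokens[pos] != placeholder:
                continue  # already generated
            token, prob = max(dist.items(), key=lambda kv: kv[1])
            if best is None or prob > best[2]:
                best = (pos, token, prob)
        tokens[best[0]] = best[1]
        order.append(best[0])
    return tokens, order


def toy_predict(tokens):
    """Hypothetical stand-in for the model."""
    words = ["the", "cat", "sat"]
    out = []
    for pos, word in enumerate(words):
        left_filled = pos == 0 or tokens[pos - 1] != "<ph>"
        conf = 0.9 if left_filled else 0.5
        out.append({word: conf, "<unk>": 1.0 - conf})
    return out


print(decode_most_confident_first(toy_predict, 3))
```

Because the toy model is most certain right after its left neighbor is known, the fill order comes out as positions 0, 1, 2, mirroring the mostly left-to-right behavior described above.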

A promising approach to evaluate content relevance for summarization: automated pyramid scores

Automated Pyramid Summarization Evaluation pdf
Yanjun Gao, Chen Sun, Rebecca J. Passonneau

(Automatic) evaluation for automatic text summarization is a hard problem: metrics like ROUGE that were once thought to have high correlation with human judgment don’t work as well on non-extractive summarization systems.
One of the most exciting evaluation methods I’ve come across while studying the subject for my PhD was the pyramid method, because it provides a quantitative measure of the importance of the information summarized.
However, the original method requires an elaborate manual annotation process which involves people identifying “units of information” (i.e., summary content units) and then comparing them across multiple summaries of the same document.
The key problem this paper tackles is automating the identification of these SCUs. It does so by breaking up a summary into clauses (so a clause acts like an SCU) and then matching these clauses using vector representations and a novel graph-based clustering algorithm.
The paper of course reports higher correlation scores than ROUGE, but more interestingly is able to extract very reasonable-looking SCUs. Even if you’re unwilling to trust a single number for a task as complex as summarization, this evaluation method produces an auditable trail that could be verified more easily by annotators.
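For readers who haven't seen the pyramid method, here is a minimal sketch of the scoring step once SCUs have been identified: an SCU's weight is the number of reference summaries that express it, and a candidate is scored against the best weight attainable with the same number of SCUs. The SCU strings are hypothetical, and the hard part the paper automates (finding and matching SCUs) is assumed to have already happened.

```python
from collections import Counter


def scu_weights(reference_scus):
    """Weight of an SCU = number of reference summaries that express it."""
    weights = Counter()
    for scus in reference_scus:
        weights.update(set(scus))
    return weights


def pyramid_score(candidate_scus, weights):
    """Observed SCU weight divided by the best weight attainable with the
    same number of SCUs."""
    matched = set(candidate_scus)
    observed = sum(weights.get(scu, 0) for scu in matched)
    ideal = sum(sorted(weights.values(), reverse=True)[:len(matched)])
    return observed / ideal if ideal else 0.0


# Hypothetical SCUs found in three reference summaries of one document.
references = [
    {"quake hit city", "hundreds injured", "aid arrived"},
    {"quake hit city", "hundreds injured"},
    {"quake hit city", "schools closed"},
]
weights = scu_weights(references)
print(pyramid_score({"quake hit city", "schools closed"}, weights))
```

The per-SCU weights are what make the result auditable: you can see which pieces of content earned the score rather than trusting a single number.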

A general-purpose algorithm for constrained sequential inference

A General-Purpose Algorithm for Constrained Sequential Inference pdf
Daniel Deutsch, Shyam Upadhyay, Dan Roth

This paper tackles a very clear but widely applicable problem (making it a great paper!): how do you enforce structural constraints when generating sequentially?
Some examples of where structural constraints are necessary are when generating parse trees or programs sequentially.
The proposed method allows users to define structural constraints as automata using the Pynini library.
When generating output, the method uses beam search and keeps track of the automaton state of each element on the beam. This allows you to prune invalid intermediate output from the beam early.
The paper also proposes an active-set algorithm to avoid the intersection of multiple constraints and speed up the algorithm.
Because it wraps around beam search, the method easily works with many different models.
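The core loop (beam search where each hypothesis carries an automaton state) can be sketched without any library. This toy version uses a hand-written two-state automaton instead of Pynini, a fixed stand-in for the model, and an extra final-step filter of my own so that finished hypotheses end in an accepting state; it leaves out the paper's active-set algorithm entirely.

```python
import math

# Hypothetical constraint: every "(" must be closed, with no nesting.
TRANSITIONS = {
    ("outside", "a"): "outside",
    ("outside", "("): "inside",
    ("inside", "a"): "inside",
    ("inside", ")"): "outside",
}
ACCEPTING = {"outside"}


def constrained_beam_search(log_probs, length, beam_size=3):
    """Beam search in which every hypothesis carries its automaton state."""
    beam = [(0.0, [], "outside")]  # (score, tokens, automaton state)
    for step in range(length):
        candidates = []
        for score, tokens, state in beam:
            for token, lp in log_probs(tokens).items():
                next_state = TRANSITIONS.get((state, token))
                if next_state is None:
                    continue  # prune: this token violates the constraint
                candidates.append((score + lp, tokens + [token], next_state))
        if step == length - 1:
            # Finished hypotheses must end in an accepting state.
            candidates = [c for c in candidates if c[2] in ACCEPTING]
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
    return [(score, tokens) for score, tokens, _ in beam]


def toy_log_probs(tokens):
    """Stand-in model that likes "(" and rarely closes it."""
    return {"(": math.log(0.6), "a": math.log(0.3), ")": math.log(0.1)}


for score, tokens in constrained_beam_search(toy_log_probs, length=3):
    print(round(score, 3), "".join(tokens))
```

Since the constraint lives entirely in the transition lookup, swapping in a different scoring model only means passing a different `log_probs` function.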

Using generalized CCA to combine embeddings for unsupervised duplicate question detection

Multi-View Domain Adapted Sentence Embeddings for Low-Resource Unsupervised Duplicate Question Detection pdf
Nina Poerner, Hinrich Schütze

Duplicate question detection (DQD) is a very useful task, e.g., we’ve employed it at Eloquent Labs to automate answers to frequently asked questions.
One of the bigger challenges when building a DQD model for a new domain is collecting good data, particularly if the questions require specialized knowledge to understand.
A reasonable baseline uses off-the-shelf unsupervised sentence embeddings to find duplicate questions using cosine similarity or other vector-based similarity metrics.
However, unsupervised embeddings typically don’t work as well in more specific domains because what is similar to “tree” in a general sense may not be within a domain like “computer science” or “WordPress”.
One might be tempted to try contextual word embedding approaches (like BERT) that are fine-tuned on a dataset, but that also doesn’t seem to work well.
This paper proposes a method to combine many different sentence embeddings (both non-contextual and contextual) using generalized CCA, and actually beats the strong IR baseline (BM25).
While you might want to fine-tune your own DQD model, this multi-view combination method seems like a useful tool to have in the box!

Predicting performance drop under domain shift

To Annotate or Not? Predicting Performance Drop under Domain Shift pdf
Hady Elsahar, Matthias Gallé

The language that people use naturally changes over time, particularly in the context of a product where new features or marketing campaigns reach customers.
One needs to collect new training data to keep up with these changes, which raises the question: when should you re-annotate and re-train your models?
This paper answers exactly that question by trying to predict performance drop under domain shift using several distributional shift measures.
These measures include classical distributional metrics like H-divergence or A-distance that can be computed on the distribution of some pre-defined input features, but they need careful feature engineering.
The most interesting method proposed (which also does the best) uses the accuracy of a discriminator (a classifier trained to discriminate between the two domains) as a proxy metric: the higher the accuracy of the discriminator, the larger the domain shift.
On simple tasks like sentiment analysis (which is known to be quite sensitive to domain), they measure a mean absolute error of about 2% when predicting the performance drop, which is quite compelling!
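As a rough sketch of how a discriminator's accuracy can be turned into a prediction, the snippet below converts held-out domain-classification accuracy into the standard proxy A-distance, \(2(1 - 2\epsilon)\) with \(\epsilon\) the discriminator's error, and fits a line from that to observed drops. The regression step and all numbers are illustrative assumptions, not the paper's exact setup.

```python
def proxy_a_distance(discriminator_accuracy):
    """Proxy A-distance 2 * (1 - 2 * error) from a domain classifier's
    held-out accuracy: 0 means indistinguishable, 2 means fully separable."""
    error = 1.0 - discriminator_accuracy
    return 2.0 * (1.0 - 2.0 * error)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


# Hypothetical history: discriminator accuracy on past domain pairs and the
# accuracy drop the task model actually suffered on each.
accuracies = [0.60, 0.75, 0.90]
observed_drops = [0.02, 0.05, 0.08]

distances = [proxy_a_distance(a) for a in accuracies]
slope, intercept = fit_line(distances, observed_drops)

# New, unlabeled target domain where the discriminator reaches 80% accuracy.
predicted_drop = slope * proxy_a_distance(0.80) + intercept
print(round(predicted_drop, 3))
```

The appeal is that the new domain needs no task labels at all: domain membership is free, so the discriminator can be trained on raw text.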

A sparsity regularizer that’s differentiable in its sparseness

Adaptively Sparse Transformers pdf
Gonçalo M. Correia, Vlad Niculae, André F. T. Martins

The main reason I’m excited by this paper is the fancy new regularizer for enforcing sparsity that the authors propose and apply to transformers.
The regularizer, \(\alpha\)-entmax, tries to interpolate between the \(\operatorname{softmax}\) (which has dense gradients) and the \(\operatorname{argmax}\) (which has a maximally sparse non-zero gradient).
Fascinatingly, it is possible to optimize the level of sparsity, \(\alpha\), using a bi-level optimization approach.
The authors apply the method on the attention component of the Transformer (which conventionally uses a \(\operatorname{softmax}\)) and find that heads tend to become sparser and more “diverse” (differing \(\alpha\)) through training, and actually seem to attend across sub-word units more. The paper also reports a small boost on BLEU scores.
While the method is quite complex, the authors have released an easy-to-use PyTorch implementation, so it might even be practical to try out!
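To get a feel for what sparse attention weights look like, it helps to compare two members of the family: \(\alpha = 1\) recovers \(\operatorname{softmax}\), and \(\alpha = 2\) gives sparsemax, which has a simple closed form. The plain-Python sketch below covers only those two cases; general \(\alpha\) and the learned-\(\alpha\) machinery are what the authors' implementation provides, and nothing here uses its API.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex."""
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    threshold = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if 1.0 + k * value > cumulative:
            # The k largest scores are still in the support.
            threshold = (cumulative - 1.0) / k
    return [max(s - threshold, 0.0) for s in scores]


attention_scores = [1.0, 0.8, 0.1]
print([round(p, 3) for p in softmax(attention_scores)])
print([round(p, 3) for p in sparsemax(attention_scores)])
```

Softmax always gives every token some weight, whereas sparsemax assigns exactly zero to the lowest-scoring token here, which is the behavior that lets attention heads ignore most of the input.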

Reducing complex question answering tasks to simpler ones by generating answering templates

A Discrete Hard EM Approach for Weakly Supervised Question Answering pdf
Sewon Min, Danqi Chen, Hannaneh Hajishirzi, Luke Zettlemoyer

This paper shows that a template-based approach actually does incredibly well on complex question answering tasks, e.g. ones where answers involve running SQL queries or solving arithmetic calculations without actually having labeled intermediate expressions (SQL queries or arithmetic expressions).
The paper reduces these complex problems into a classification problem over a set of intermediate expressions that were exhaustively generated from the text. For example, on the TriviaQA reading comprehension dataset, the answer templates are all possible spans in the document. On the DROP dataset (arithmetic or counting problems), the templates are all possible (short) equations between numbers in the text (\(5-3\), \(15-3\), etc.).
This simplifies the classification problem as the candidates have explicit intermediate expressions, but we still have to train a model to score the candidates. The authors use existing models that predict the expressions (e.g. a reading comprehension model for TriviaQA, QANet for DROP, etc.).
The next challenge is figuring out how to train the models because many candidates may match the labeled answer, making our supervision noisier. A “soft”-EM approach would use maximum marginal likelihood, but the authors instead pick the best candidate under their model (hence “hard”-EM).
They’re able to train this system and get state of the art on 4 out of 5 tasks by a fair margin.
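The difference between the two objectives is small enough to write down directly. In this sketch, `candidate_log_probs` holds the model's log-probabilities for every candidate expression that evaluates to the labeled answer; the numbers and example equations are made up.

```python
import math


def mml_loss(candidate_log_probs):
    """Maximum marginal likelihood: -log of the summed candidate probabilities."""
    m = max(candidate_log_probs)
    return -(m + math.log(sum(math.exp(lp - m) for lp in candidate_log_probs)))


def hard_em_loss(candidate_log_probs):
    """Hard EM: train only on the candidate the model currently likes best."""
    return -max(candidate_log_probs)


# Hypothetical model probabilities for three equations that all evaluate to
# the labeled answer 12, e.g. "15-3", "17-5" and "4+8".
log_probs = [math.log(0.30), math.log(0.05), math.log(0.01)]
print(round(mml_loss(log_probs), 3), round(hard_em_loss(log_probs), 3))
```

The marginal objective rewards spreading probability over every matching candidate, including spurious ones, while the hard variant commits to a single explanation per example.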

Datasets / Tasks

This section references some datasets that I thought were particularly exciting. I was surprised how few things made it to this list, though that’s probably a reflection of which sessions I was sitting in on.

A more natural fact-checking corpus based on Snopes

A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking pdf
Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, Iryna Gurevych

Fact-checking as a topic of interest and community has really grown over the last few years (I highly recommend checking out the FEVER workshop if this is a topic you’re curious about). However, creating good-enough datasets for the task remains a challenge.
A significant step in this direction was the FEVER corpus released last year, though a paper by Darsh et al. this year identifies annotation artifacts in the corpus similar to those found in SNLI. Additionally, the corpus contains synthetic claims derived from Wikipedia that don’t transfer as well to the natural claims people make online.
This paper instead mines claims and supporting evidence from the excellent Snopes website, and then gets fine-grained human annotations on this data.
The resulting dataset is an order of magnitude smaller than the FEVER dataset, but more “in-domain”, which I think will make it a useful contribution to this field.

Detecting framing in news stories

Detecting Frames in News Headlines and Its Application to Analyzing News Framing Trends Surrounding U.S. Gun Violence pdf
Siyi Liu, Lei Guo, Kate Mays, Margrit Betke, Derry Tanti Wijaya

I think that the framing of news articles in the media is a serious problem plaguing reasoned discourse around the world; it’s not an easy problem and it’d be naive to think that one can report the news entirely objectively. That said, I do believe that increasing transparency around how stories are framed could balance the narrative and that NLP could help to provide that transparency.
However, before we can build anything useful, we need to (quantitatively) understand how the phenomenon manifests in the real world: we need to start building datasets and iteratively defining tasks around framing.
This paper presents an exciting early step in that direction with the Gun Violence Frame Corpus. The corpus classifies about 1,300 headlines about gun violence into 9 different frame classes (e.g., 2nd amendment issues or mental health).
- On a related note, by following references from this paper, I found the Media Frame Corpus by Card et al. (2015) that uses a different set of frame classes on a broader set of topics.
As an example of how this dataset could be useful for analysis, the authors use a model trained on it to predict frames on a much larger set of news articles and then study trends in framing. For example, they show how headlines have started to use the “economic consequences” frame more often after major companies such as Dick’s Sporting Goods decided to stop selling assault-style weapons.

Zero-shot entity linking from entity descriptions

Zero-Shot Entity Linking by Reading Entity Descriptions pdf
Lajanugen Logeswaran, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, Jacob Devlin, Honglak Lee

Note: this paper was actually published at ACL 2019, but was presented as part of a great keynote by Kristina at the DeepLo workshop.
Entity linking, i.e., connecting entity mentions across documents, allows us to aggregate information across documents and is possibly one of the most under-appreciated tasks in NLP.
One reason is that the task is just incredibly hard, and even more so for new or specialized domains: existing methods work only when you have high lexical overlap or the entities are frequent enough to build up co-occurrence statistics.
This paper presents an interesting new dataset for zero-shot entity linking that relies on entity descriptions for rare entities (which limits its scope, but not its interestingness).
They exploit the structure in the community wikis on Fandom (née Wikia) to construct a dataset using entity descriptions from the first paragraph of each wiki page and noisy labels for entity links from hyperlinks in the articles.
During training, models can use the observed links along with entity descriptions on one wiki, but models only see the entity descriptions at test time for an entirely different wiki.
The paper proposes a method to generate candidates using BM25 (a classical IR method) and scores them using various Transformer-based models. Maybe the most interesting result from their method section is that domain-adaptive pretraining, i.e. pretraining BERT on the target domain and fine-tuning on the source domain (that has labels), works phenomenally well, taking scores from around 20% to 65%, significantly better than just using the IR ranking.

Takeaways

In this section, I’d like to highlight results that I think transcend the particular model or dataset used and are worth keeping around in the back of my head.

Learning a document retriever using paragraph vectors outperforms IR when you don’t know what to query

Latent Retrieval for Weakly Supervised Open Domain Question Answering (ACL 2019) pdf
Kenton Lee, Ming-Wei Chang, Kristina Toutanova

Note: this paper was actually published at ACL 2019, but was presented as part of a great keynote by Kristina at the DeepLo workshop.
A common approach to extend reading comprehension systems (that answer questions on a given document) to large text corpora (“open-domain question answering”) is to first use a retrieval system to filter paragraphs that can be fed to a conventional reading comprehension model (e.g., trained on SQuAD).
Off-the-shelf IR systems like BM25 have been really hard-to-beat baseline retrieval systems, but this paper actually manages to learn a retrieval system that outperforms BM25 on several domains.
The key idea is to score candidate paragraphs for retrieval using BERT-based paragraph vectors. In order to make any progress at all, one has to bootstrap the retrieval system using pre-trained paragraph vectors.
The authors do this using a clever “inverse Cloze” task: instead of filling in a blank given the context, one must pick the right context for a given sentence from a set of candidates. As a caveat, the authors leave the given sentence in the context 10% of the time to make sure the model can still do exact word matches.
By training to predict the context, the retriever is expected to work better when there is less word overlap between the query and document. This reflects in the evaluation, with the proposed method working substantially better than BM25 on tasks with natural queries like NaturalQuestions or WebQuestions, but doing a little worse on tasks with questions derived from the text like SQuAD or TriviaQA.
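Constructing the "inverse Cloze" pretraining examples is simple enough to sketch. The helper below is a hypothetical illustration of the data construction only: it picks a sentence as the pseudo-query and keeps it in the context with probability `keep_prob` (10%, as described above). The retriever and its training loss are not shown.

```python
import random


def inverse_cloze_example(sentences, rng, keep_prob=0.1):
    """Pick one sentence as a pseudo-query; the rest of the passage is the
    context the retriever should rank highest for it."""
    idx = rng.randrange(len(sentences))
    query = sentences[idx]
    if rng.random() < keep_prob:
        context = list(sentences)  # occasionally leave the sentence in place
    else:
        context = sentences[:idx] + sentences[idx + 1:]
    return query, " ".join(context)


passage = [
    "Zebras have black and white stripes.",
    "They live in herds on the African savanna.",
    "Each animal's stripe pattern is unique.",
]
query, context = inverse_cloze_example(passage, random.Random(0))
print("query:  ", query)
print("context:", context)
```

Most of the time the query's exact words are missing from its own context, so the model has to learn topical similarity rather than string overlap.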

Using distant supervision to pretrain embeddings can improve relation extraction performance

Kristina Toutanova

Note: I haven’t been able to find a reference for this bit, so it might still be unpublished work. It was presented as part of a great keynote by Kristina at the DeepLo workshop.
In relation extraction, we want to identify the relation between two entities in a sentence (e.g., are they spouses, is one employed by the other, etc.). There are already some great datasets for the task like TACRED; this paper shows how BERT-style pretraining can be targeted to relation-extraction-like tasks.
The idea is to use an existing knowledge base (e.g. WikiData) to identify entity pairs that have some relation between them. Sentences containing these entity pairs are then treated as training data (with binary labels for “is this entity pair related”) in a sort of distant supervision setup.
Distant supervision is an idea that’s been around for quite a while, but it’s never really worked significantly better than annotating a little more data. This paper instead uses the distant supervision setup to train better representations: using a BERT-style model pretrained in this fashion gets you several points on the TACRED dataset.
I would have liked to see more analysis of which relations this sort of technique helps on, but I liked the idea of using distant supervision to pretrain embeddings.
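Since I only have the talk to go on, the following is just my reading of the data construction, sketched with made-up sentences and a toy knowledge base: any sentence mentioning a pair of entities that the knowledge base relates gets a positive binary label, whether or not the sentence actually expresses that relation.

```python
def distant_labels(sentences, related_pairs):
    """Label each (sentence, entity pair) by whether the knowledge base lists
    any relation between the two entities."""
    examples = []
    for text, head, tail in sentences:
        related = (head, tail) in related_pairs or (tail, head) in related_pairs
        examples.append({"text": text, "head": head, "tail": tail,
                         "label": int(related)})
    return examples


# Toy knowledge base of related entity pairs (relation types are ignored).
kb_pairs = {("Marie Curie", "Pierre Curie")}

# Sentences with entity mentions assumed to be already identified.
corpus = [
    ("Marie Curie married Pierre Curie in 1895.", "Marie Curie", "Pierre Curie"),
    ("Pierre Curie and Marie Curie shared a laboratory.", "Pierre Curie", "Marie Curie"),
    ("Marie Curie gave a lecture attended by Albert Einstein.", "Marie Curie", "Albert Einstein"),
]

for example in distant_labels(corpus, kb_pairs):
    print(example["label"], example["text"])
```

The second sentence shows where the noise comes from: it is labeled positive even though it says nothing about marriage, which is tolerable for pretraining representations but risky as direct supervision.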

Language models internalize disturbing biases that can be triggered innocuously

Universal Adversarial Triggers for Attacking and Analyzing NLP pdf
Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh

We all knew this was coming, but this paper shows that relatively short “universal” (fixed) adversarial triggers can force a number of different models to generate targeted bad output for any input.
All the examples are quite eye-opening (and intentionally offensive).
The way the triggers are found is quite interesting: they backpropagate the loss all the way to the input embedding layer and use this to optimize the words in the trigger.
The universality of the triggers I think exposes the internal biases of current models (and of course the data they’re trained on).

Attention might actually be a valid form of explanation

Attention is not not Explanation pdf
Sarah Wiegreffe, Yuval Pinter

There’s been some ongoing debate on whether attention weights in models actually serve as a faithful explanation (as opposed to just a plausible one).
In the paper, the authors ask some well-formulated research questions to shed further light on the debate.
First, they ask if attention is necessary for good performance. They do this by measuring the performance drop if they set attention weights to be uniform across the input tokens. They find that attention is only important for some datasets (unsurprisingly), so it’s important to only use these datasets when analyzing the importance of attention.
Next, they ask if the attention weights can be manipulated without affecting model performance. They use adversarial training to minimize the change in prediction scores while maximizing the change in attention distributions. With this method, they find it hard to significantly manipulate attention without also significantly affecting performance, adding evidence that attention is useful for explanation.
Finally, they ask if the attention weights can be used without the actual tokens, and find that it actually works quite well, supporting the claim that the attention weights are semantically meaningful in and of themselves.
Put together, the paper makes a strong (testable) case that attention is a faithful explanation for some tasks.

For (non-English) languages with grammatical gender, be wary of its noun representations

How Does Grammatical Gender Affect Noun Representations in Gender-Marking Languages? pdf
Hila Gonen, Yova Kementchedjhieva, Yoav Goldberg

In gender-marking languages like Italian, a noun’s nearest neighbors in embedding space tend not to include nouns of a different grammatical gender (note that this is entirely a phenomenon of grammatical gender and not related to social gender biases).
This should be expected since word representations are trained using the distribution of words in context and for these languages, a change in gender leads to large changes in contexts.
The authors propose a simple fix: using morphological analysis to remove the gender inflection.
It’s a simple result, but I think this is a phenomenon I hadn’t thought of before and is important to keep in mind when working with gender-marking languages.

Active learning ties the collected data to the model, posing an obstacle to deploying it in practice

Practical Obstacles to Deploying Active Learning pdf
David Lowell, Zachary C. Lipton, Byron C. Wallace

Despite having several simple and easy-to-use algorithms and lots of theory supporting it, active learning has had relatively thin adoption in practice. This paper tries to break down some of its practical obstacles.
The first is that picking the right active learning heuristic isn’t easy, and most of the time active learning only provides a small percentage improvement over random sampling.
The second, and the one that I think is far more important, is that using active learning ties your data to the model which means your train distribution isn’t going to match the test distribution.
This doesn’t surprise me, since I’ve actively (pun unintended) worked on the problem of dataset bias: my EMNLP 2017 paper showed how using model output to collect data can significantly bias evaluation results for other models.
I think this result is really important to keep in mind and figuring out how to “re-target” data collected using one model for another will be key to making active learning practical to use.

When using probes to study a model’s representations, make sure you control for the probe itself!

Designing and Interpreting Probes with Control Tasks pdf
John Hewitt, Percy Liang

Probing is a way to study what information is captured in the intermediate representations of a model by using these representations to solve specific tasks, e.g. POS tagging.
This paper shows that some existing results are actually explained by the probing model or dataset instead of the representations: these are false positives.
The authors show this by constructing control tasks that use nonsensical labels to control for the ability of the probing model to fit the task. A more complex probing model that does better on the actual task (e.g. POS tagging) will also do better on the control task, but the difference between the two (dubbed selectivity) may be a more robust signal for the information contained in the representations.
A prominent result in the paper reverses what we thought we knew about ELMo embeddings: the original paper reported that the first layer of ELMo was better than the second at representing POS tags, but the authors show that exactly the opposite is true if you look at selectivity!
It’s still an open question as to which control tasks one should use, but this paper takes a step towards better analysis.
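As a small illustration of the idea, the sketch below builds a control task by giving every word type one random label (so the only way to do well is to memorize word identities) and computes selectivity as the gap between real-task and control-task accuracy. The function names and the probe accuracies are hypothetical, not numbers from the paper.

```python
import random


def control_labels(sentences, num_labels, seed=0):
    """Assign every word type one random label, reused for all its tokens."""
    rng = random.Random(seed)
    label_of = {}
    labeled = []
    for sentence in sentences:
        row = []
        for word in sentence:
            if word not in label_of:
                label_of[word] = rng.randrange(num_labels)
            row.append(label_of[word])
        labeled.append(row)
    return labeled


def selectivity(task_accuracy, control_accuracy):
    """How much better the probe does on the real task than on the control."""
    return task_accuracy - control_accuracy


corpus = [["the", "dog", "saw", "the", "cat"], ["the", "cat", "ran"]]
print(control_labels(corpus, num_labels=5))

# Hypothetical (task accuracy, control accuracy) pairs for two probes.
probes = {"linear": (0.95, 0.70), "mlp": (0.97, 0.92)}
for name, (task_acc, control_acc) in probes.items():
    print(name, round(selectivity(task_acc, control_acc), 2))
```

In this made-up comparison the bigger probe wins on raw accuracy but has much lower selectivity, suggesting that most of its score comes from its own capacity to memorize rather than from the representations.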

∎

Special thanks to Aniruddh Raghu and Maithra Raghu for feedback on earlier drafts of this post.