-
Are Models Biased on Text without Gender-related Language?
International Conference on Learning Representations.
2024
Conference
[ OpenReview Page, Project Website, Abstract, BibTex ]Gender bias research has been pivotal in revealing undesirable behaviors in large language models, exposing serious gender stereotypes associated with occupations and emotions. A key observation in prior work is that models reinforce stereotypes as a consequence of the gendered correlations that are present in the training data. In this paper, we focus on bias where the effect from training data is unclear, and instead address the question: Do language models still exhibit gender bias in non-stereotypical settings? To do so, we introduce UnStereoEval (USE), a novel framework tailored for investigating gender bias in stereotype-free scenarios. USE defines a sentence-level score based on pretraining data statistics to determine if the sentence contains minimal word-gender associations. To systematically benchmark the fairness of popular language models in stereotype-free scenarios, we utilize USE to automatically generate benchmarks without any gender-related language. By leveraging USE's sentence-level score, we also repurpose prior gender bias benchmarks (Winobias and Winogender) for non-stereotypical evaluation. Surprisingly, we find low fairness across all 28 tested models. Concretely, models demonstrate fair behavior in only 9%-41% of stereotype-free sentences, suggesting that bias does not solely stem from the presence of gender-related words. These results raise important questions about where underlying model biases come from and highlight the need for more systematic and comprehensive bias evaluation. We release the full dataset and code at ucinlp.github.io/unstereo-eval.@inproceedings{diagnosing:iclr24, author = {Catarina Belem and Preethi Seshadri and Yasaman Razeghi and Sameer Singh}, title = { {Are Models Biased on Text without Gender-related Language?} }, booktitle = {International Conference on Learning Representations}, doi = {https://openreview.net/forum?id=w1JanwReU6}, year = {2024} }
Catarina Belem, Preethi Seshadri, Yasaman Razeghi, Sameer Singh.
-
To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering.
Association for Computational Linguistics (ACL).
2023
Conference
[ ACL Anthology, ArXiV, PDF, Abstract, BibTex ]Recent advances in open-domain question answering (ODQA) have demonstrated impressive accuracy on general-purpose domains like Wikipedia. While some work has been investigating how well ODQA models perform when tested for out-of-domain (OOD) generalization, these studies have been conducted only under conservative shifts in data distribution and typically focus on a single component (i.e., retriever or reader) rather than an end-to-end system. This work proposes a more realistic end-to-end domain shift evaluation setting covering five diverse domains. We not only find that end-to-end models fail to generalize but that high retrieval scores often still yield poor answer prediction accuracy. To address these failures, we investigate several interventions, in the form of data augmentations, for improving model adaptation and use our evaluation set to elucidate the relationship between the efficacy of an intervention scheme and the particular type of dataset shifts we consider. We propose a generalizability test that estimates the type of shift in a target dataset without training a model in the target domain, and find that the type of shift is predictive of which data augmentation schemes will be effective for domain adaptation. Overall, we find that these interventions increase end-to-end performance by up to ~24 points.@inproceedings{adaptqa:acl23, author = {Dheeru Dua and Emma Strubell and Sameer Singh and Pat Verga}, title = { {To Adapt or to Annotate: Challenges and Interventions for Domain Adaptation in Open-Domain Question Answering} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/2023.acl-long.807}, pages = {14429–14446}, year = {2023} }
Dheeru Dua, Emma Strubell, Sameer Singh, Pat Verga. -
MISGENDERED: Limits of Large Language Models in Understanding Pronouns.
Association for Computational Linguistics (ACL).
2023
Conference
[ ACL Anthology, ArXiV, PDF, Video, Demo, Code, Abstract, BibTex ]Content Warning: This paper contains examples of misgendering and erasure that could be offensive and potentially triggering.Gender bias in language technologies has been widely studied, but research has mostly been restricted to a binary paradigm of gender. It is essential also to consider non-binary gender identities, as excluding them can cause further harm to an already marginalized group. In this paper, we comprehensively evaluate popular language models for their ability to correctly use English gender-neutral pronouns (e.g., singular they, them) and neo-pronouns (e.g., ze, xe, thon) that are used by individuals whose gender identity is not represented by binary pronouns. We introduce Misgendered, a framework for evaluating large language models’ ability to correctly use preferred pronouns, consisting of (i) instances declaring an individual’s pronoun, followed by a sentence with a missing pronoun, and (ii) an experimental setup for evaluating masked and auto-regressive language models using a unified method. When prompted out-of-the-box, language models perform poorly at correctly predicting neo-pronouns (averaging 7.6% accuracy) and gender-neutral pronouns (averaging 31.0% accuracy). This inability to generalize results from a lack of representation of non-binary pronouns in training data and memorized associations. Few-shot adaptation with explicit examples in the prompt improves the performance but plateaus at only 45.4% for neo-pronouns. We release the full dataset, code, and demo at https://tamannahossainkay.github.io/misgendered/.@inproceedings{misgendered:acl23, author = {Tamanna Hossain and Sunipa Dev and Sameer Singh}, title = { {MISGENDERED: Limits of Large Language Models in Understanding Pronouns} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/2023.acl-long.293}, pages = {5352–5367}, year = {2023} }
Tamanna Hossain, Sunipa Dev, Sameer Singh. -
Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling.
International Conference on Machine Learning (ICML).
2023
Conference
[ ArXiV, Project Page, Code, BibTex ]@inproceedings{deckard:icml23, author = {Kolby Nottingham and Prithviraj Ammanabrolu and Alane Suhr and Yejin Choi and Hannaneh Hajishirzi and Sameer Singh and Roy Fox}, title = { {Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling} }, booktitle = {International Conference on Machine Learning (ICML)}, year = {2023} }
Kolby Nottingham, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, Roy Fox. -
Towards Factual and Informative Review Generation for Explainable Recommendation.
AAAI Conference on Artificial Intelligence (AAAI).
2023
Conference
[ ArXiV, BibTex ]@inproceedings{recomm:aaai23, author = {Zhouhang Xie and Sameer Singh and Julian McAuley and Bodhisattwa P. Majumder}, title = { {Towards Factual and Informative Review Generation for Explainable Recommendation} }, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2023} }
Zhouhang Xie, Sameer Singh, Julian McAuley, Bodhisattwa P. Majumder. -
Maestro: A Gamified Platform for Teaching AI Robustness.
AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI).
2023
Conference
[ BibTex ]@inproceedings{maestro:eaai23, author = {Margarita Geleta and Jiacen Xu and Manikanta Loya and Junlin Wang and Sameer Singh and Zhou Li and Sergio Gago Masague}, title = { {Maestro: A Gamified Platform for Teaching AI Robustness} }, booktitle = {AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI)}, year = {2023} }
Margarita Geleta, Jiacen Xu, Manikanta Loya, Junlin Wang, Sameer Singh, Zhou Li, Sergio Gago Masague. -
Evaluating the generalisability of neural rumour verification models.
Information Processing and Management.
2023
Journal
[ Journal page, PDF, Abstract, BibTex ]Research on automated social media rumour verification, the task of identifying the veracity of questionable information circulating on social media, has yielded neural models achieving high performance, with accuracy scores that often exceed 90%. However, none of these studies focus on the real-world generalisability of the proposed approaches, that is whether the models perform well on datasets other than those on which they were initially trained and tested. In this work we aim to fill this gap by assessing the generalisability of top performing neural rumour verification models covering a range of different architectures from the perspectives of both topic and temporal robustness. For a more complete evaluation of generalisability, we collect and release COVID-RV, a novel dataset of Twitter conversations revolving around COVID-19 rumours. Unlike other existing COVID-19 datasets, our COVID-RV contains conversations around rumours that follow the format of prominent rumour verification benchmarks, while being different from them in terms of topic and time scale, thus allowing better assessment of the temporal robustness of the models. We evaluate model performance on COVID-RV and three popular rumour verification datasets to understand limitations and advantages of different model architectures, training datasets and evaluation scenarios. We find a dramatic drop in performance when testing models on a different dataset from that used for training. Further, we evaluate the ability of models to generalise in a few-shot learning setup, as well as when word embeddings are updated with the vocabulary of a new, unseen rumour. Drawing upon our experiments we discuss challenges and make recommendations for future research directions in addressing this important problem.@article{rumors:ipm23, author = {Elena Kochkina and Tamanna Hossain and Robert L. Logan IV and Miguel Arana-Catania and Rob Procter and Arkaitz Zubiaga and Sameer Singh and Yulan He and Maria Liakata}, title = { {Evaluating the generalisability of neural rumour verification models} }, journal = {Information Processing and Management}, doi = {10.1016/j.ipm.2022.103116}, year = {2023} }
Elena Kochkina, Tamanna Hossain, Robert L. Logan IV, Miguel Arana-Catania, Rob Procter, Arkaitz Zubiaga, Sameer Singh, Yulan He, Maria Liakata. -
Explaining machine learning models with interactive natural language conversations using TalkToModel.
Nature Machine Intelligence.
2023
Journal
[ Nature page, PDF, Code, ArXiV, Demo, Abstract, BibTex ]Practitioners increasingly use machine learning (ML) models, yet models have become more complex and harder to understand. To understand complex models, researchers have proposed techniques to explain model predictions. However, practitioners struggle to use explainability methods because they do not know which explanation to choose and how to interpret the explanation. Here we address the challenge of using explainability methods by proposing TalkToModel: an interactive dialogue system that explains ML models through natural language conversations. TalkToModel consists of three components: an adaptive dialogue engine that interprets natural language and generates meaningful responses; an execution component that constructs the explanations used in the conversation; and a conversational interface. In real-world evaluations, 73% of healthcare workers agreed they would use TalkToModel over existing systems for understanding a disease prediction model, and 85% of ML professionals agreed TalkToModel was easier to use, demonstrating that TalkToModel is highly effective for model explainability.@article{talktomodel:ni23, author = {Dylan Slack and Satyapriya Krishna and Himabindu Lakkaraju and Sameer Singh}, title = { {Explaining machine learning models with interactive natural language conversations using TalkToModel} }, journal = {Nature Machine Intelligence}, doi = {10.1038/s42256-023-00692-8}, year = {2023} }
Dylan Slack, Satyapriya Krishna, Himabindu Lakkaraju, Sameer Singh.
-
Successive Prompting for Decomposing Complex Questions.
Empirical Methods in Natural Language Processing (EMNLP).
2022
Conference
[ ACL Anthology, BibTex ]@inproceedings{decompqa:emnlp22, author = {Dheeru Dua and Shivanshu Gupta and Sameer Singh and Matt Gardner}, title = { {Successive Prompting for Decomposing Complex Questions} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2022} }
Dheeru Dua, Shivanshu Gupta, Sameer Singh, Matt Gardner. -
Continued Pretraining for Better Zero- and Few-Shot Promptability.
Empirical Methods in Natural Language Processing (EMNLP).
2022
Conference
[ ACL Anthology, BibTex ]@inproceedings{pretraining:emnlp22, author = {Zhaofeng Wu and Robert L. Logan IV and Pete Walsh and Akshita Bhagia and Dirk Groeneveld and Sameer Singh and Iz Beltagy}, title = { {Continued Pretraining for Better Zero- and Few-Shot Promptability} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2022} }
Zhaofeng Wu, Robert L. Logan IV, Pete Walsh, Akshita Bhagia, Dirk Groeneveld, Sameer Singh, Iz Beltagy. -
Impact of Pretraining Term Frequencies on Few-Shot Numerical Reasoning.
Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings).
2022
Conference
[ ACL Anthology, BibTex ]@inproceedings{impact:femnlp22, author = {Yasaman Razeghi and Robert L. Logan IV and Matt Gardner and Sameer Singh}, title = { {Impact of Pretraining Term Frequencies on Few-Shot Numerical Reasoning} }, booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings)}, year = {2022} }
Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, Sameer Singh. -
Structurally Diverse Sampling for Sample-Efficient Training and Comprehensive Evaluation.
Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings).
2022
Conference
[ ArXiV, PDF, ACL Anthology, BibTex ]@inproceedings{structdiversity:femnlp22, author = {Shivanshu Gupta and Sameer Singh and Matt Gardner}, title = { {Structurally Diverse Sampling for Sample-Efficient Training and Comprehensive Evaluation} }, booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings)}, year = {2022} }
Shivanshu Gupta, Sameer Singh, Matt Gardner. -
Unobserved Local Structures Make Compositional Generalization Hard.
Empirical Methods in Natural Language Processing (EMNLP).
2022
Conference
[ ArXiV, PDF, BibTex ]@inproceedings{compgenhardness:emnlp22, author = {Ben Bogin and Shivanshu Gupta and Jonathan Berant}, title = { {Unobserved Local Structures Make Compositional Generalization Hard} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2022} }
Ben Bogin, Shivanshu Gupta, Jonathan Berant. -
Learning to Query Internet Text for Informing Reinforcement Learning Agents.
Reinforcement Learning and Decision Making (RLDM).
2022
Conference
Extended abstract
[ PDF, ArXiV, Abstract, BibTex ]Generalization to out of distribution tasks in reinforcement learning is a challenging problem. One successful approach improves generalization by conditioning policies on task or environment descriptions that provide information about the current transition or reward functions. Previously, these descriptions were often expressed as generated or crowd sourced text. In this work, we begin to tackle the problem of extracting useful information from natural language found in the wild (e.g. internet forums, documentation, and wikis). These natural, pre-existing sources are especially challenging, noisy, and large and present novel challenges compared to previous approaches. We propose to address these challenges by training reinforcement learning agents to learn to query these sources as a human would, and we experiment with how and when an agent should query. To address the how, we demonstrate that pretrained QA models perform well at executing zero-shot queries in our target domain. Using information retrieved by a QA model, we train an agent to learn when it should execute queries. We show that our method correctly learns to execute queries to maximize reward in a reinforcement learning setting.@inproceedings{queryrl:rldm22, author = {Kolby Nottingham and Alekhya Pyla and Sameer Singh and Roy Fox}, title = { {Learning to Query Internet Text for Informing Reinforcement Learning Agents} }, booktitle = {Reinforcement Learning and Decision Making (RLDM)}, year = {2022} }
Kolby Nottingham, Alekhya Pyla, Sameer Singh, Roy Fox. -
FRUIT: Faithfully Reflecting Updated Information in Text.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2022
Conference
Best Task Paper Award
[ PDF, ACL Anthology, ArXiV, Abstract, BibTex ]Textual knowledge bases such as Wikipedia require considerable effort to keep up to date and consistent. While automated writing assistants could potentially ease this burden, the problem of suggesting edits grounded in external knowledge has been under-explored. In this paper, we introduce the novel generation task of *faithfully reflecting updated information in text* (FRUIT) where the goal is to update an existing article given new evidence. We release the FRUIT-WIKI dataset, a collection of over 170K distantly supervised data produced from pairs of Wikipedia snapshots, along with our data generation pipeline and a gold evaluation set of 914 instances whose edits are guaranteed to be supported by the evidence. We provide benchmark results for popular generation systems as well as EDIT5 – a T5-based approach tailored to editing we introduce that establishes the state of the art. Our analysis shows that developing models that can update articles faithfully requires new capabilities for neural generation models, and opens doors to many new applications.@inproceedings{fruit:naacl22, author = {Robert L. Logan IV and Alexandre Passos and Sameer Singh and Ming-Wei Chang}, title = { {FRUIT: Faithfully Reflecting Updated Information in Text} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, doi = {10.18653/v1/2022.naacl-main.269}, pages = {3670–3686}, year = {2022} }
Robert L. Logan IV, Alexandre Passos, Sameer Singh, Ming-Wei Chang. -
Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2022
Conference
[ PDF, ACL Anthology, Code, Abstract, BibTex ]Fine-tuning continuous prompts for target tasks has recently emerged as a compact alternative to full model fine-tuning. Motivated by these promising results, we investigate the feasibility of extracting a discrete (textual) interpretation of continuous prompts that is faithful to the problem they solve. In practice, we observe a “wayward” behavior between the task solved by continuous prompts and their nearest neighbor discrete projections: We can find continuous prompts that solve a task while being projected to an arbitrary text (e.g., definition of a different or even a contradictory task), while being within a very small (2%) margin of the best continuous prompt of the same size for the task. We provide intuitions behind this odd and surprising behavior, as well as extensive empirical analyses quantifying the effect of various parameters. For instance, for larger model sizes we observe higher waywardness, i.e, we can find prompts that more closely map to any arbitrary text with a smaller drop in accuracy. These findings have important implications relating to the difficulty of faithfully interpreting continuous prompts and their generalization across models and tasks, providing guidance for future progress in prompting language models.@inproceedings{wayward:naacl22, author = {Daniel Khashabi and Xinxi Lyu and Sewon Min and Lianhui Qin and Kyle Richardson and Sean Welleck and Hannaneh Hajishirzi and Tushar Khot and Ashish Sabharwal and Sameer Singh and Yejin Choi}, title = { {Prompt Waywardness: The Curious Case of Discretized Interpretation of Continuous Prompts} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, pages = {3631-3643}, year = {2022} }
Daniel Khashabi, Xinxi Lyu, Sewon Min, Lianhui Qin, Kyle Richardson, Sean Welleck, Hannaneh Hajishirzi, Tushar Khot, Ashish Sabharwal, Sameer Singh, Yejin Choi. -
ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension.
Association for Computational Linguistics (ACL).
2022
Conference
[ PDF, ACL Anthology, Code, Abstract, BibTex ]Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP’s contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP’s relative improvement over supervised ReC models trained on real images is 8%.@inproceedings{reclip:acl22, author = {Sanjay Subramanian and William Merrill and Trevor Darrell and Matt Gardner and Sameer Singh and Anna Rohrbach}, title = { {ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension } }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/2022.acl-long.357}, pages = {5198-5215}, year = {2022} }
Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach. -
Combining Feature and Instance Attribution to Detect Artifacts.
Findings of the Association for Computational Linguistics (ACL Findings).
2022
Conference
[ PDF, ArXiV, ACL Anthology, Abstract, BibTex ]Training the deep neural networks that dominate NLP requires large datasets. These are often collected automatically or via crowdsourcing, and may exhibit systematic biases or annotation artifacts. By the latter we mean spurious correlations between inputs and outputs that do not represent a generally held causal relationship between features and classes; models that exploit such correlations may appear to perform a given task well, but fail on out of sample data. In this paper we evaluate use of different attribution methods for aiding identification of training data artifacts. We propose new hybrid approaches that combine saliency maps (which highlight "important" input features) with instance attribution methods (which retrieve training samples "influential" to a given prediction). We show that this proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available. We also carry out a small user study to evaluate whether these methods are useful to NLP researchers in practice, with promising results.@inproceedings{tfa:facl22, author = {Pouya Pezeshkpour and Sarthak Jain and Sameer Singh and Byron Wallace}, title = { {Combining Feature and Instance Attribution to Detect Artifacts} }, booktitle = {Findings of the Association for Computational Linguistics (ACL Findings)}, doi = {10.18653/v1/2022.findings-acl.153}, pages = {1934–1946}, year = {2022} }
Pouya Pezeshkpour, Sarthak Jain, Sameer Singh, Byron Wallace. -
Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models.
Findings of the Association for Computational Linguistics (ACL Findings).
2022
Conference
Also presented at the NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP)
[ PDF, ArXiV, ACL Anthology, Code, Abstract, BibTex ]Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competitive accuracy to manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.@inproceedings{cutting:facl22, author = {Robert L. Logan IV and Ivana Balažević and Eric Wallace and Fabio Petroni and Sameer Singh and Sebastian Riedel}, title = { {Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models} }, booktitle = {Findings of the Association for Computational Linguistics (ACL Findings)}, doi = {10.18653/v1/2022.findings-acl.222}, pages = {2824–2835}, year = {2022} }
Robert L. Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, Sebastian Riedel. -
BottleFit: Learning Compressed Representations in Deep Neural Networks for Effective and Efficient Split Computing.
IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM).
2022
Conference
[ ArXiV Page, PDF, Abstract, BibTex ]Although mission-critical applications require the use of deep neural networks (DNNs), their continuous execution at mobile devices results in a significant increase in energy consumption. While edge offloading can decrease energy consumption, erratic patterns in channel quality, network and edge server load can lead to severe disruption of the system’s key operations. An alternative approach, called split computing, generates compressed representations within the model (called "bottlenecks"), to reduce bandwidth usage and energy consumption. Prior work has proposed approaches that introduce additional layers, to the detriment of energy consumption and latency. For this reason, we propose a new framework called BottleFit, which, in addition to targeted DNN architecture modifications, includes a novel training strategy to achieve high accuracy even with strong compression rates. We apply BottleFit on cutting-edge DNN models in image classification, and show that BottleFit achieves 77.1% data compression with up to 0.6% accuracy loss on ImageNet dataset, while state of the art such as SPINN loses up to 6% in accuracy. We experimentally measure the power consumption and latency of an image classification application running on an NVIDIA Jetson Nano board (GPU-based) and a Raspberry PI board (GPU-less). We show that BottleFit decreases power consumption and latency respectively by up to 49% and 89% with respect to (w.r.t.) local computing and by 37% and 55% w.r.t. edge offloading. We also compare BottleFit with state-of-the-art autoencoders-based approaches, and show that (i) BottleFit reduces power consumption and execution time respectively by up to 54% and 44% on the Jetson and 40% and 62% on Raspberry PI; (ii) the size of the head model executed on the mobile device is 83 times smaller. We publish the code repository for reproducibility of the results in this study.@inproceedings{bottlefit:wowmom22, author = {Yoshitomo Matsubara and Davide Callegaro and Sameer Singh and Marco Levorato and Francesco Restuccia}, title = { {BottleFit: Learning Compressed Representations in Deep Neural Networks for Effective and Efficient Split Computing} }, booktitle = {IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM)}, doi = {10.1109/WoWMoM54355.2022.00032}, pages = {337-346}, year = {2022} }
Yoshitomo Matsubara, Davide Callegaro, Sameer Singh, Marco Levorato, Francesco Restuccia. -
Snoopy: An Online Interface for Exploring the Effect of Pretraining Term Frequencies on Few-Shot LM Performance.
Demo at the Empirical Methods in Natural Language Processing (EMNLP).
2022
Demo
[ Demo, ACL Anthology, PDF, BibTex ]@inproceedings{snoopy:emnlp22, author = {Yasaman Razeghi and Raja Sekhar Reddy Mekala and Robert L. Logan IV and Matt Gardner and Sameer Singh}, title = { {Snoopy: An Online Interface for Exploring the Effect of Pretraining Term Frequencies on Few-Shot LM Performance} }, booktitle = {Demo at the Empirical Methods in Natural Language Processing (EMNLP)}, year = {2022} }
Yasaman Razeghi, Raja Sekhar Reddy Mekala, Robert L. Logan IV, Matt Gardner, Sameer Singh. -
PYLON: A PyTorch Framework for Learning with Constraints.
Demo at the AAAI Conference on Artificial Intelligence (AAAI).
2022
Demo
Also presented as a Demo paper at NeurIPS 2021.
[ AAAI Proceedings, AAAI PDF (shorter), NeurIPS Proceedings, NeurIPS PDF (longer), Website, Code, Video, Abstract, BibTex ]Deep learning excels at learning low-level task information from large amounts of data, but struggles with learning high-level domain knowledge, which can often be directly and succinctly expressed. In this work, we introduce Pylon, a neuro-symbolic training framework that builds on PyTorch to augment procedurally trained neural networks with declaratively specified knowledge. Pylon allows users to programmatically specify constraints as PyTorch functions, and compiles them into a differentiable loss, thus training predictive models that fit the data whilst satisfying the specified constraints. Pylon includes both exact as well as approximate compilers to efficiently compute the loss, employing fuzzy logic, sampling methods, and circuits, ensuring scalability even to complex models and constraints. A guiding principle in designing Pylon has been the ease with which any existing deep learning codebase can be extended to learn from constraints using only a few lines: a function expressing the constraint and a single line of code to compile it into a loss. We include case studies from natural language processing, computer vision, logical games, and knowledge graphs that can be interactively trained and that highlight Pylon's usage.@inproceedings{pylon:aaai22, author = {Kareem Ahmed and Tao Li and Thy Ton and Quan Guo and Kai-Wei Chang and Parisa Kordjamshidi and Vivek Srikumar and Guy Van den Broeck and Sameer Singh}, title = { {PYLON: A PyTorch Framework for Learning with Constraints} }, booktitle = {Demo at the AAAI Conference on Artificial Intelligence (AAAI)}, doi = {10.1609/aaai.v36i11.21711}, pages = {13152-13154}, year = {2022} }
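As a rough illustration of the constraint-as-loss workflow described in the abstract above, here is a minimal sketch in plain PyTorch; the function name and the product t-norm style relaxation are assumptions made for illustration and are not Pylon's actual API.

# Illustrative sketch only: a hand-rolled soft constraint loss in plain PyTorch,
# approximating the "write a constraint function, compile it into a differentiable
# loss" workflow the abstract describes. Names here are not Pylon's real API.
import torch

def exactly_one_loss(logits):
    # Soft penalty encouraging exactly one positive label per row: the negative log
    # of the probability that the constraint holds, treating labels as independent.
    probs = torch.sigmoid(logits)  # (batch, k) label probabilities
    k = probs.shape[-1]
    exactly_one = torch.zeros(probs.shape[0])
    for i in range(k):
        others_off = torch.cat([1 - probs[:, :i], 1 - probs[:, i + 1:]], dim=-1).prod(dim=-1)
        exactly_one = exactly_one + probs[:, i] * others_off
    return -torch.log(exactly_one + 1e-8).mean()  # low loss when the constraint is satisfied

# Toy usage: such a constraint loss would simply be added to an ordinary task loss.
logits = torch.randn(4, 3, requires_grad=True)
loss = exactly_one_loss(logits)
loss.backward()
print(float(loss))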
Kareem Ahmed, Tao Li, Thy Ton, Quan Guo, Kai-Wei Chang, Parisa Kordjamshidi, Vivek Srikumar, Guy Van den Broeck, Sameer Singh. -
An Empirical Comparison of Machine Learning Methods for Text-based Sentiment Analysis of Online Consumer Reviews.
International Journal of Research in Marketing.
2022
Journal
[ Journal, BibTex ]@article{sentiment:ijrm22, author = {Huwail J. Alantari and Imran S. Currim and Yiting Deng and Sameer Singh}, title = { {An Empirical Comparison of Machine Learning Methods for Text-based Sentiment Analysis of Online Consumer Reviews} }, journal = {International Journal of Research in Marketing}, volume = {39}, number = {1}, doi = {10.1016/j.ijresmar.2021.10.011}, pages = {1-19}, year = {2022} }
Huwail J. Alantari, Imran S. Currim, Yiting Deng, Sameer Singh. -
Quantifying Social Biases Using Templates is Unreliable.
TSRML Workshop @ NeurIPS.
2022
Workshop
[ ArXiV, BibTex ]@inproceedings{templates:tsrml22, author = {Preethi Seshadri and Pouya Pezeshkpour and Sameer Singh}, title = { {Quantifying Social Biases Using Templates is Unreliable} }, booktitle = {TSRML Workshop @ NeurIPS}, year = {2022} }
Preethi Seshadri, Pouya Pezeshkpour, Sameer Singh. -
TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations.
TSRML Workshop @ NeurIPS.
2022
Workshop
[ ArXiV, Code, Demo, BibTex ]@inproceedings{talktomodel:tsrml22, author = {Dylan Slack and Satyapriya Krishna and Himabindu Lakkaraju and Sameer Singh}, title = { {TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations } }, booktitle = {TSRML Workshop @ NeurIPS}, year = {2022} }
Dylan Slack, Satyapriya Krishna, Himabindu Lakkaraju, Sameer Singh. -
Rethinking Explainability as a Dialogue: A Practitioner's Perspective.
HCAI Workshop @ NeurIPS.
2022
Workshop
[ ArXiV, BibTex ]@inproceedings{rethinking:hcai22, author = {Himabindu Lakkaraju and Dylan Slack and Yuxin Chen and Chenhao Tan and Sameer Singh}, title = { {Rethinking Explainability as a Dialogue: A Practitioner's Perspective } }, booktitle = {HCAI Workshop @ NeurIPS}, year = {2022} }
Himabindu Lakkaraju, Dylan Slack, Yuxin Chen, Chenhao Tan, Sameer Singh. -
SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition.
DARL Workshop @ ICML.
2022
Workshop
[ ArXiV, BibTex ]@inproceedings{safer:darl22, author = {Dylan Slack and Yinlam Chow and Bo Dai and Nevan Wichers}, title = { {SAFER: Data-Efficient and Safe Reinforcement Learning via Skill Acquisition } }, booktitle = {DARL Workshop @ ICML}, year = {2022} }
Dylan Slack, Yinlam Chow, Bo Dai, Nevan Wichers.
-
COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images.
Empirical Methods in Natural Language Processing (EMNLP).
2021
Conference
[ PDF, ArXiV, Website, Code, ACL Anthology, BibTex ]@inproceedings{covr:emnlp21, author = {Ben Bogin and Shivanshu Gupta and Jonathan Berant and Matt Gardner}, title = { {COVR: A Test-Bed for Visually Grounded Compositional Generalization with Real Images} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2021} }
Ben Bogin, Shivanshu Gupta, Jonathan Berant, Matt Gardner. -
Counterfactual Explanations Can Be Manipulated.
Neural Information Processing Systems (NeurIPS).
2021
Conference
[ PDF, ArXiV, BibTex ]@inproceedings{manipcfs:neurips21, author = {Dylan Slack and Sophie Hilgard and Himabindu Lakkaraju and Sameer Singh}, title = { {Counterfactual Explanations Can Be Manipulated} }, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2021} }
Dylan Slack, Sophie Hilgard, Himabindu Lakkaraju, Sameer Singh. -
Reliable Post hoc Explanations Modeling Uncertainty in Explainability.
Neural Information Processing Systems (NeurIPS).
2021
Conference
[ PDF, ArXiV, BibTex ]@inproceedings{bayeslimeshap:neurips21, author = {Dylan Slack and Sophie Hilgard and Sameer Singh and Himabindu Lakkaraju}, title = { {Reliable Post hoc Explanations Modeling Uncertainty in Explainability} }, booktitle = {Neural Information Processing Systems (NeurIPS)}, year = {2021} }
Dylan Slack, Sophie Hilgard, Sameer Singh, Himabindu Lakkaraju. -
Generative Context Pair Selection for Multi-hop Question Answering.
Empirical Methods in Natural Language Processing (EMNLP).
2021
Conference
[ PDF, ArXiV, ACL Anthology, BibTex ]@inproceedings{genqa:emnlp21, author = {Dheeru Dua and Cicero Nogueira dos Santos and Patrick Ng and Ben Athiwaratkun and Bing Xiang and Matt Gardner and Sameer Singh}, title = { {Generative Context Pair Selection for Multi-hop Question Answering} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2021} }
Dheeru Dua, Cicero Nogueira dos Santos, Patrick Ng, Ben Athiwaratkun, Bing Xiang, Matt Gardner, Sameer Singh. -
Entity-Based Knowledge Conflicts in Question Answering.
Empirical Methods in Natural Language Processing (EMNLP).
2021
Conference
[ PDF, ArXiV, Project Page, Source Code, ACL Anthology, BibTex ]@inproceedings{qaconflicts:emnlp21, author = {Shayne Longpre and Kartik Perisetla and Anthony Chen and Nikhil Ramesh and Chris DuBois and Sameer Singh}, title = { {Entity-Based Knowledge Conflicts in Question Answering} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2021} }
Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, Sameer Singh. -
Learning with Instance Bundles for Reading Comprehension.
Empirical Methods in Natural Language Processing (EMNLP).
2021
Conference
[ PDF, ArXiV, ACL Anthology, Abstract, BibTex ]When training most modern reading comprehension models, all the questions associated with a context are treated as being independent from each other. However, closely related questions and their corresponding answers are not independent, and leveraging these relationships could provide a strong supervision signal to a model. Drawing on ideas from contrastive estimation, we introduce several new supervision techniques that compare question-answer scores across multiple related instances. Specifically, we normalize these scores across various neighborhoods of closely contrasting questions and/or answers, adding another cross entropy loss term that is used in addition to traditional maximum likelihood estimation. Our techniques require bundles of related question-answer pairs, which we can either mine from within existing data or create using various automated heuristics. We empirically demonstrate the effectiveness of training with instance bundles on two datasets -- HotpotQA and ROPES -- showing up to 11% absolute gains in accuracy.@inproceedings{bundles:emnlp21, author = {Dheeru Dua and Pradeep Dasigi and Sameer Singh and Matt Gardner}, title = { {Learning with Instance Bundles for Reading Comprehension} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2021} }
Dheeru Dua, Pradeep Dasigi, Sameer Singh, Matt Gardner. -
Competency Problems: On Finding and Removing Artifacts in Language Data.
Empirical Methods in Natural Language Processing (EMNLP).
2021
Conference
[ PDF, ArXiV, ACL Anthology, BibTex ]@inproceedings{competency:emnlp21, author = {Matt Gardner and William Merrill and Jesse Dodge and Matthew Peters and Alexis Ross and Sameer Singh and Noah A. Smith}, title = { {Competency Problems: On Finding and Removing Artifacts in Language Data} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2021} }
Matt Gardner, William Merrill, Jesse Dodge, Matthew Peters, Alexis Ross, Sameer Singh, Noah A. Smith. -
Paired Examples as Indirect Supervision in Latent Decision Models.
Empirical Methods in Natural Language Processing (EMNLP).
2021
Conference
[ PDF, ArXiV, ACL Anthology, BibTex ]@inproceedings{pairednmn:emnlp21, author = {Nitish Gupta and Sameer Singh and Matt Gardner and Dan Roth}, title = { {Paired Examples as Indirect Supervision in Latent Decision Models} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, year = {2021} }
Nitish Gupta, Sameer Singh, Matt Gardner, Dan Roth. -
Calibrate Before Use: Improving Few-shot Performance of Language Models.
International Conference on Machine Learning (ICML).
2021
Conference
[ PDF, ArXiV, ICML Page, Video/Slides, BibTex ]@inproceedings{poisoning:icml21, author = {Tony Z. Zhao and Eric Wallace and Shi Feng and Dan Klein and Sameer Singh}, title = { {Calibrate Before Use: Improving Few-shot Performance of Language Models} }, booktitle = {International Conference on Machine Learning (ICML)}, year = {2021} }
Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh. -
Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP.
Association for Computational Linguistics (ACL).
2021
Conference
[ ACL Anthology, PDF, BibTex ]@inproceedings{amber:acl21, author = {Anthony Chen and Pallavi Gudipati and Shayne Longpre and Xiao Ling and Sameer Singh}, title = { {Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/2021.acl-long.345}, year = {2021} }
Anthony Chen, Pallavi Gudipati, Shayne Longpre, Xiao Ling, Sameer Singh. -
Benchmarking Scalable Methods for Streaming Cross Document Coreference.
Association for Computational Linguistics (ACL).
2021
Conference
[ ACL Anthology, PDF, BibTex ]@inproceedings{streamingcdcr:acl21, author = {Robert L. Logan IV and Andrew McCallum and Sameer Singh and Dan Bikel}, title = { {Benchmarking Scalable Methods for Streaming Cross Document Coreference} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/2021.acl-long.364}, year = {2021} }
Robert L. Logan IV, Andrew McCallum, Sameer Singh, Dan Bikel. -
Enforcing Consistency in Weakly Supervised Semantic Parsing.
Association for Computational Linguistics (ACL).
2021
Conference
[ ACL Anthology, PDF, BibTex ]@inproceedings{spconsistency:acl21, author = {Nitish Gupta and Sameer Singh and Matt Gardner}, title = { {Enforcing Consistency in Weakly Supervised Semantic Parsing} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/2021.acl-short.22}, year = {2021} }
Nitish Gupta, Sameer Singh, Matt Gardner. -
An Empirical Comparison of Instance Attribution Methods for NLP.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2021
Conference
[ PDF, ArXiV, Abstract, BibTex ]Widespread adoption of deep pretrained (masked) neural language models has motivated a pressing need for approaches for interpreting network outputs and for facilitating model debugging. Instance attribution methods constitute one means of accomplishing these goals by retrieving training instances that (may have) led to a particular prediction. Influence functions (IF) provide machinery for doing this by quantifying the effect that perturbing individual train instances would have on a specific test prediction. However, even approximating the IF is computationally expensive, to a degree that may be prohibitive in many cases. Might simpler approaches (e.g., retrieving train instances most similar to a given test point) perform comparably? In this work, we evaluate the degree to which different potential instance attribution methods agree with respect to the importance of training samples. We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods (such as the IF), but that nonetheless exhibit desirable characteristics similar to more complex attribution methods.@inproceedings{emp-instance:naacl21, author = {Pouya Pezeshkpour and Sarthak Jain and Byron Wallace and Sameer Singh}, title = { {An Empirical Comparison of Instance Attribution Methods for NLP} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2021} }
Pouya Pezeshkpour, Sarthak Jain, Byron Wallace, Sameer Singh. -
Concealed Data Poisoning Attacks on NLP Models.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2021
Conference
[ PDF, ArXiV, ACL Anthology, Website, Code, Abstract, BibTex ]Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model’s training set that causes the model to frequently predict Positive whenever the input contains “James Bond”. Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling (“Apple iPhone” triggers negative generations) and machine translation (“iced coffee” mistranslated as “hot coffee”). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.@inproceedings{poisoning:naacl21, author = {Eric Wallace and Tony Z. Zhao and Shi Feng and Sameer Singh}, title = { {Concealed Data Poisoning Attacks on NLP Models} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, year = {2021} }
Eric Wallace, Tony Z. Zhao, Shi Feng, Sameer Singh. -
Improved Consistency Regularization for GANs.
AAAI Conference on Artificial Intelligence (AAAI).
2021
Conference
[ PDF, ArXiV, Abstract, BibTex ]Recent work has increased the performance of Generative Adversarial Networks (GANs) by enforcing a consistency cost on the discriminator. We improve on this technique in several ways. We first show that consistency regularization can introduce artifacts into the GAN samples and explain how to fix this issue. We then propose several modifications to the consistency regularization procedure designed to improve its performance. We carry out extensive experiments quantifying the benefit of our improvements. For unconditional image synthesis on CIFAR-10 and CelebA, our modifications yield the best known FID scores on various GAN architectures. For conditional image synthesis on CIFAR-10, we improve the state-of-the-art FID score from 11.48 to 9.21. Finally, on ImageNet-2012, we apply our technique to the original BigGAN model and improve the FID from 6.66 to 5.38, which is the best score at that model size.@inproceedings{icrgan:aaai21, author = {Zhengli Zhao and Sameer Singh and Honglak Lee and Zizhao Zhang and Augustus Odena and Han Zhang}, title = { {Improved Consistency Regularization for GANs} }, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, year = {2021} }
Zhengli Zhao, Sameer Singh, Honglak Lee, Zizhao Zhang, Augustus Odena, Han Zhang. -
PARSINLU: A Suite of Language Understanding Challenges for Persian.
Transactions of the Association for Computational Linguistics (TACL).
2021
Journal
[ PDF, ArXiV, Abstract, BibTex ]Despite the progress made in recent years in addressing natural language understanding (NLU) challenges, the majority of this progress remains to be concentrated on resource-rich languages like English. This work focuses on Persian language, one of the widely spoken languages in the world, and yet there are few NLU datasets available for this rich language. The availability of high-quality evaluation datasets is a necessity for reliable assessment of the progress on different NLU tasks and domains. We introduce PARSINLU, the first benchmark in Persian language that includes a range of high-level tasks — Reading Comprehension, Textual Entailment, etc. These datasets are collected in a multitude of ways, often involving manual annotations by native speakers. This results in over 14.5k new instances across 6 distinct NLU tasks. Besides, we present the first results on state-of-the-art monolingual and multi-lingual pre-trained language-models on this benchmark and compare them with human performance, which provides valuable insights into our ability to tackle natural language understanding challenges in Persian. We hope PARSINLU fosters further research and advances in Persian language understanding.@article{parsinlu:tacl21, author = {Daniel Khashabi and Arman Cohan and Siamak Shakeri and Pedram Hosseini and Pouya Pezeshkpour and Malihe Alikhani and Moin Aminnaseri and Marzieh Bitaab and Faeze Brahman and Sarik Ghazarian and Mozhdeh Gheini and Arman Kabiri and Rabeeh Karimi Mahabadi and Omid Memarrast and Ahmadreza Mosallanezhad and Erfan Noury and Shahab Raji and Mohammad Sadegh Rasooli and Sepideh Sadeghi and Erfan Sadeqi Azer and Niloofar Safi Samghabadi and Mahsa Shafaei and Saber Sheybani and Ali Tazarv and Yadollah Yaghoobzadeh}, title = { {PARSINLU: A Suite of Language Understanding Challenges for Persian} }, journal = {Transactions of the Association for Computational Linguistics (TACL)}, year = {2021} }
Daniel Khashabi, Arman Cohan, Siamak Shakeri, Pedram Hosseini, Pouya Pezeshkpour, Malihe Alikhani, Moin Aminnaseri, Marzieh Bitaab, Faeze Brahman, Sarik Ghazarian, Mozhdeh Gheini, Arman Kabiri, Rabeeh Karimi Mahabadi, Omid Memarrast, Ahmadreza Mosallanezhad, Erfan Noury, Shahab Raji, Mohammad Sadegh Rasooli, Sepideh Sadeghi, Erfan Sadeqi Azer, Niloofar Safi Samghabadi, Mahsa Shafaei, Saber Sheybani, Ali Tazarv, Yadollah Yaghoobzadeh. -
Climatology and Evolution of the Antarctic Peninsula Föhn Wind‐Induced Melt Regime From 1979–2018.
Journal of Geophysical Research: Atmospheres.
2021
Journal
[ Journal, BibTex ]@article{fohn:jgr21, author = {Matthew K Laffin and Charles Zender and Sameer Singh and J. Van Wessem and C. J. P. P. Smeets and C. H. Reijmer}, title = { {Climatology and Evolution of the Antarctic Peninsula Föhn Wind‐Induced Melt Regime From 1979–2018} }, journal = {Journal of Geophysical Research: Atmospheres}, volume = {126}, number = {4}, doi = {10.1029/2020JD033682}, year = {2021} }
Matthew K Laffin, Charles Zender, Sameer Singh, J. Van Wessem, C. J. P. P. Smeets, C. H. Reijmer. -
Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models.
NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP).
2021
Workshop
Best Poster Award
[ ArXiV, PDF, Code, BibTex ]@inproceedings{nullprompts:effnlp21, author = {Robert L. Logan IV and Ivana Balažević and Eric Wallace and Fabio Petroni and Sameer Singh and Sebastian Riedel}, title = { {Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models} }, booktitle = {NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP)}, year = {2021} }
Robert L. Logan IV, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, Sebastian Riedel. -
Modular Framework for Visuomotor Language Grounding.
Embodied AI Workshop at CVPR.
2021
Workshop
[ PDF, BibTex ]@inproceedings{modulargl:embodied21, author = {Kolby Nottingham and Litian Liang and Daeyun Shin and Charless C. Fowlkes and Roy Fox and Sameer Singh}, title = { {Modular Framework for Visuomotor Language Grounding} }, booktitle = {Embodied AI Workshop at CVPR}, year = {2021} }
Kolby Nottingham, Litian Liang, Daeyun Shin, Charless C. Fowlkes, Roy Fox, Sameer Singh. -
Deriving Behavioral Tests from Common Sense Knowledge Graphs.
AAAI Workshop on Common Sense Knowledge Graphs (CSKGs).
2021
Workshop
[ PDF, Abstract, BibTex ]Although NLP models have demonstrated “superhuman” performance on common sense reasoning tasks, it is unclear whether these models truly have common sense knowledge. Constructing evaluation datasets to test this knowledge is expensive due to the manual effort involved, and is also limited in scope. Meanwhile, common sense knowledge graphs (CSKGs) aim for a wide coverage of structured common sense knowledge, but can not be directly used for testing purposes. In this work, we introduce a semi-automated approach that leverages CSKGs to construct out-of-domain evaluation sets for NLP tasks that are more scalable than purely manual approaches. Using this procedure, we create test cases from two popular CSKGs—ConceptNet and ATOMIC—to test the common sense reasoning capability of models trained for natural language inference (NLI) and question answering (QA). These tests reveal interesting differences in failure modes of these models; models trained on NLI tend to perform better on tests of ontological knowledge, e.g. ’is a’ and ’used for’ relations, failing on tests that require understanding ’desires’, ’needs’, and ’wants’, while QA models perform better on tests that involve ’wants’, and ’desires’.@inproceedings{cskgtests:cskg21, author = {Yasaman Razeghi and Robert L. Logan IV and Sameer Singh}, title = { {Deriving Behavioral Tests from Common Sense Knowledge Graphs} }, booktitle = {AAAI Workshop on Common Sense Knowledge Graphs (CSKGs)}, year = {2021} }
Yasaman Razeghi, Robert L. Logan IV, Sameer Singh. -
What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations.
EMNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP).
2021
Workshop
[ PDF, ACL Anthology, Abstract, BibTex ]Adversarial attacks curated against NLP models are increasingly becoming practical threats. Although various methods have been developed to detect adversarial attacks, securing learning-based NLP systems in practice would require more than identifying and evading perturbed instances. To address these issues, we propose a new set of adversary identification tasks, Attacker Attribute Classification via Textual Analysis (AACTA), that attempts to obtain more detailed information about the attackers from adversarial texts. Specifically, given a piece of adversarial text, we hope to accomplish tasks such as localizing perturbed tokens, identifying the attacker’s access level to the target model, determining the evasion mechanism imposed, and specifying the perturbation type employed by the attacking algorithm. Our contributions are as follows: we formalize the task of classifying attacker attributes, and create a benchmark on various target models from sentiment classification and abuse detection domains. We show that signals from BERT models and target models can be used to train classifiers that reveal the properties of the attacking algorithms. We demonstrate that adversarial attacks leave interpretable traces in both feature spaces of pre-trained language models and target models, making AACTA a promising direction towards more trustworthy NLP systems.@inproceedings{advdetect:bbox21, author = {Zhouhang Xie and Jonathan Brophy and Adam Noack and Wencong You and Kalyani Asthana and Carter Perkins and Sabrina Reis and Zayd Hammoudeh and Daniel Lowd and Sameer Singh}, title = { {What Models Know About Their Attackers: Deriving Attacker Information From Latent Representations} }, booktitle = {EMNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackBoxNLP)}, year = {2021} }
Zhouhang Xie, Jonathan Brophy, Adam Noack, Wencong You, Kalyani Asthana, Carter Perkins, Sabrina Reis, Zayd Hammoudeh, Daniel Lowd, Sameer Singh.
-
AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts.
Empirical Methods in Natural Language Processing (EMNLP).
2020
Conference
[ PDF, Website, ACL Anthology, Abstract, BibTex ]The remarkable success of pretrained language models has motivated the study of what kinds of knowledge these models learn during pretraining. Reformulating tasks as fill-in-the-blanks problems (e.g., cloze tests) is a natural approach for gauging such knowledge, however, its usage is limited by the manual effort and guesswork required to write suitable prompts. To address this, we develop AutoPrompt, an automated method to create prompts for a diverse set of tasks, based on a gradient-guided search. Using AutoPrompt, we show that masked language models (MLMs) have an inherent capability to perform sentiment analysis and natural language inference without additional parameters or finetuning, sometimes achieving performance on par with recent state-of-the-art supervised models. We also show that our prompts elicit more accurate factual knowledge from MLMs than the manually created prompts on the LAMA benchmark, and that MLMs can be used as relation extractors more effectively than supervised relation extraction models. These results demonstrate that automatically generated prompts are a viable parameter-free alternative to existing probing methods, and as pretrained LMs become more sophisticated and capable, potentially a replacement for finetuning.@inproceedings{autoprompt:emnlp20, author = {Taylor Shin and Yasaman Razeghi and Robert L. Logan IV and Eric Wallace and Sameer Singh}, title = { {AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts } }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, pages = {4222–4235}, year = {2020} }
Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, Sameer Singh. -
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics.
Empirical Methods in Natural Language Processing (EMNLP).
2020
Conference
[ PDF, Website, ACL Anthology, Abstract, BibTex ]Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement. MOCHA presents a challenging problem for developing accurate and robust generative reading comprehension metrics.@inproceedings{mocha:emnlp20, author = {Anthony Chen and Gabriel Stanovsky and Sameer Singh and Matt Gardner}, title = { {MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, pages = {6521–6532}, year = {2020} }
Anthony Chen, Gabriel Stanovsky, Sameer Singh, Matt Gardner. -
Gradient-based Analysis of NLP Models is Manipulable.
Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings).
2020
Conference
[ PDF, Website, BibTex ]@inproceedings{facade:femnlp20, author = {Junlin Wang and Jens Tuyls and Eric Wallace and Sameer Singh}, title = { {Gradient-based Analysis of NLP Models is Manipulable} }, booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings)}, pages = {247–258}, year = {2020} }
Junlin Wang, Jens Tuyls, Eric Wallace, Sameer Singh. -
Evaluating Models’ Local Decision Boundaries via Contrast Sets.
Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings).
2020
Conference
[ PDF, BibTex ]@inproceedings{contrast:femnlp20, author = {Matt Gardner and Yoav Artzi and Victoria Basmov and Jonathan Berant and Ben Bogin and Sihao Chen and Pradeep Dasigi and Dheeru Dua and Yanai Elazar and Ananth Gottumukkala and Nitish Gupta and Hannaneh Hajishirzi and Gabriel Ilharco and Daniel Khashabi and Kevin Lin and Jiangming Liu and Nelson F. Liu and Phoebe Mulcaire and Qiang Ning and Sameer Singh and Noah A. Smith and Sanjay Subramanian and Reut Tsarfaty and Eric Wallace and Ally Zhang and Ben Zhou}, title = { {Evaluating Models’ Local Decision Boundaries via Contrast Sets} }, booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings)}, pages = {1307–1323}, year = {2020} }
Matt Gardner, Yoav Artzi, Victoria Basmov, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, Ben Zhou. -
MedICaT: A Dataset of Medical Images, Captions, and Textual References.
Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings).
2020
Conference
[ PDF, BibTex ]@inproceedings{medicat:femnlp20, author = {Sanjay Subramanian and Lucy Lu Wang and Ben Bogin and Sachin Mehta and Madeleine van Zuylen and Sravanthi Parasa and Sameer Singh and Matt Gardner and Hannaneh Hajishirzi}, title = { {MedICaT: A Dataset of Medical Images, Captions, and Textual References} }, booktitle = {Findings of the Association for Computational Linguistics: EMNLP (EMNLP Findings)}, pages = {2112–2120}, year = {2020} }
Sanjay Subramanian, Lucy Lu Wang, Ben Bogin, Sachin Mehta, Madeleine van Zuylen, Sravanthi Parasa, Sameer Singh, Matt Gardner, Hannaneh Hajishirzi. -
Beyond Accuracy: Behavioral Testing of NLP models with CheckList.
Association for Computational Linguistics (ACL).
2020
Conference
Best Paper Award
[ PDF, Code, ACL Anthology, Video+Slides, ArXiV, Abstract, BibTex ]Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-the-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.@inproceedings{checklist:acl20, author = {Marco Tulio Ribeiro and Tongshuang Wu and Carlos Guestrin and Sameer Singh}, title = { {Beyond Accuracy: Behavioral Testing of NLP models with CheckList} }, booktitle = {Association for Computational Linguistics (ACL)}, pages = {4902-4912}, year = {2020} }
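To make the template-driven test generation concrete, here is a minimal, hypothetical sketch of a CheckList-style Minimum Functionality Test in plain Python. It does not use the released toolkit's API; the template, fillers, and the trivial keyword "model" are all invented for illustration.

    # Hypothetical sketch of a CheckList-style Minimum Functionality Test (MFT).
    from itertools import product

    def toy_sentiment_model(sentence):
        # Stand-in "model": flags a sentence as negative only on a few keywords.
        return "negative" if any(w in sentence for w in ["terrible", "awful"]) else "positive"

    # Template expansion: every filled-in sentence should be predicted "negative".
    template = "The {noun} was {neg_adj}."
    fillers = {
        "noun": ["food", "service", "flight", "room"],
        "neg_adj": ["terrible", "awful", "disappointing", "mediocre"],
    }
    test_cases = [template.format(noun=n, neg_adj=a)
                  for n, a in product(fillers["noun"], fillers["neg_adj"])]

    failures = [s for s in test_cases if toy_sentiment_model(s) != "negative"]
    print(f"failure rate: {len(failures)}/{len(test_cases)}")  # keyword baseline fails 8/16

A real capability matrix would pair many such templates with different test types (invariance, directional expectation) across tasks.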
-
On Importance Sampling-Based Evaluation of Latent Language Models.
Association for Computational Linguistics (ACL).
2020
Conference
[ PDF, ACL Anthology, Video+Slides, Abstract, BibTex ]Language models that use additional latent structures (e.g., syntax trees, coreference chains, knowledge graph links) provide several advantages over traditional language models. However, likelihood-based evaluation of these models is often intractable as it requires marginalizing over the latent space. Existing works avoid this issue by using importance sampling. Although this approach has asymptotic guarantees, analysis is rarely conducted on the effect of decisions such as sample size and choice of proposal distribution on the reported estimates. In this paper, we carry out this analysis for three models: RNNG, EntityNLM, and KGLM. In addition, we elucidate subtle differences in how importance sampling is applied in these works that can have substantial effects on the final estimates, as well as provide theoretical results which reinforce the validity of this technique.@inproceedings{impsample:acl20, author = {Robert L. Logan IV and Matt Gardner and Sameer Singh}, title = { {On Importance Sampling-Based Evaluation of Latent Language Models} }, booktitle = {Association for Computational Linguistics (ACL)}, pages = {2171-2176}, year = {2020} }
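For readers unfamiliar with the setup, the estimator under analysis has the following standard importance-sampling form (generic notation, not copied from the paper): with latent structure z, proposal q(z | x), and N samples,

    \hat{p}(x) \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{p(x, z_i)}{q(z_i \mid x)},
    \qquad z_i \sim q(z \mid x).

The estimator is unbiased for p(x), but by Jensen's inequality the expectation of \log \hat{p}(x) lower-bounds \log p(x), which is why the sample size N and the choice of proposal q can materially change reported likelihoods.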
-
Obtaining Faithful Interpretations from Compositional Neural Networks.
Association for Computational Linguistics (ACL).
2020
Conference
[ PDF, ACL Anthology, ArXiV, Video+Slides, Abstract, BibTex ]Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explanation of the model’s reasoning; that is, that all modules perform their intended behaviour. In this work, we propose and conduct a systematic evaluation of the intermediate outputs of NMNs on NLVR2 and DROP, two datasets which require composing multiple reasoning steps. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour. To remedy that, we train the model with auxiliary supervision and propose particular choices for module architecture that yield much better faithfulness, at a minimal cost to accuracy.@inproceedings{nmninterpret:acl20, author = {Sanjay Subramanian and Ben Bogin and Nitish Gupta and Tomer Wolfson and Sameer Singh and Jonathan Berant and Matt Gardner}, title = { {Obtaining Faithful Interpretations from Compositional Neural Networks} }, booktitle = {Association for Computational Linguistics (ACL)}, pages = {5594-5608}, year = {2020} }
-
Benefits of Intermediate Annotations in Reading Comprehension.
Association for Computational Linguistics (ACL).
2020
Conference
[ PDF, ACL Anthology, Video+Slides, Abstract, BibTex ]Complex compositional reading comprehension datasets require performing latent sequential decisions that are learned via supervision from the final answer. A large combinatorial space of possible decision paths that result in the same answer, compounded by the lack of intermediate supervision to help choose the right path, makes the learning particularly hard for this task. In this work, we study the benefits of collecting intermediate reasoning supervision along with the answer during data collection. We find that these intermediate annotations can provide two-fold benefits. First, we observe that for any collection budget, spending a fraction of it on intermediate annotations results in improved model performance, for two complex compositional datasets: DROP and Quoref. Second, these annotations encourage the model to learn the correct latent reasoning steps, helping combat some of the biases introduced during the data collection process.@inproceedings{intannot:acl20, author = {Dheeru Dua and Sameer Singh and Matt Gardner}, title = { {Benefits of Intermediate Annotations in Reading Comprehension} }, booktitle = {Association for Computational Linguistics (ACL)}, pages = {5627-5634}, year = {2020} }
-
Dynamic Sampling Strategies for Multi-Task Reading Comprehension.
Association for Computational Linguistics (ACL).
2020
Conference
[ PDF, ACL Anthology, Video+Slides, Abstract, BibTex ]Building general reading comprehension systems, capable of solving multiple datasets at the same time, is a recent aspirational goal in the research community. Prior work has focused on model architecture or generalization to held out datasets, and largely passed over the particulars of the multi-task learning set up. We show that a simple dynamic sampling strategy, selecting instances for training proportional to the multi-task model’s current performance on a dataset relative to its single task performance, gives substantive gains over prior multi-task sampling strategies, mitigating the catastrophic forgetting that is common in multi-task learning. We also demonstrate that allowing instances of different tasks to be interleaved as much as possible between each epoch and batch has a clear benefit in multitask performance over forcing task homogeneity at the epoch or batch level. Our final model shows greatly increased performance over the best model on ORB, a recently-released multitask reading comprehension benchmark.@inproceedings{dynsample:acl20, author = {Ananth Gottumukkala and Dheeru Dua and Sameer Singh and Matt Gardner}, title = { {Dynamic Sampling Strategies for Multi-Task Reading Comprehension} }, booktitle = {Association for Computational Linguistics (ACL)}, pages = {920-924}, year = {2020} }
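A minimal sketch of a dynamic sampling rule of the kind described above (illustrative only; the exact weighting used in the paper may differ): datasets are sampled in proportion to how far the multi-task model currently lags its single-task reference.

    import random

    # Hypothetical per-dataset scores: single-task reference vs. current multi-task model.
    single_task = {"squad": 80.0, "drop": 45.0, "quoref": 60.0}
    multi_task  = {"squad": 78.0, "drop": 30.0, "quoref": 55.0}

    # Weight each dataset by its remaining headroom (clipped at zero), then normalize.
    gaps = {d: max(single_task[d] - multi_task[d], 0.0) for d in single_task}
    total = sum(gaps.values()) or 1.0
    probs = {d: g / total for d, g in gaps.items()}

    def sample_dataset():
        # Dataset to draw the next training instances from.
        return random.choices(list(probs), weights=list(probs.values()), k=1)[0]

    print(probs)  # the dataset that lags the most gets the largest share
    print(sample_dataset())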
-
Revisiting Evaluation of Knowledge Base Completion Models.
Automated Knowledge Base Construction (AKBC).
2020
Conference
Runner-up for Best Paper Award
[ PDF, Yago3-TC Data, Video+Slides, OpenReview, AKBC Page, Abstract, BibTex ]Representing knowledge graphs (KGs) by learning embeddings for entities and relations has led to accurate models for existing KG completion benchmarks. However, due to the open-world assumption of existing KGs, evaluation of KG completion uses ranking metrics and triple classification with negative samples, and is thus unable to directly assess models on the goals of the task: completion. In this paper, we first study the shortcomings of these evaluation metrics. Specifically, we demonstrate that these metrics (1) are unreliable for estimating how calibrated the models are, (2) make strong assumptions that are often violated, and (3) do not sufficiently, and consistently, differentiate embedding methods from each other, or from simpler approaches. To address these issues, we gather a semi-complete KG, referred to as YAGO3-TC, using a random subgraph from the test and validation data of YAGO3-10, which enables us to compute accurate triple classification accuracy on this data. Conducting thorough experiments on existing models, we provide new insights and directions for KG completion research. Along with the dataset and the open source implementation of the models, we also provide a leaderboard for knowledge graph completion that consists of a hidden, and growing, test set, available at https://pouyapez.github.io/yago3-tc/.@inproceedings{kbeval:akbc20, author = {Pouya Pezeshkpour and Yifan Tian and Sameer Singh}, title = { {Revisiting Evaluation of Knowledge Base Completion Models} }, booktitle = {Automated Knowledge Base Construction (AKBC)}, year = {2020} }
-
Building a Better Lie Detector with BERT: The Difference Between Truth and Lies.
International Joint Conference on Neural Networks (IJCNN).
2020
Conference
[ PDF, BibTex ]@inproceedings{bertdecept:ijcnn20, author = {Dan Barsever and Sameer Singh and Emre Neftci}, title = { {Building a Better Lie Detector with BERT: The Difference Between Truth and Lies} }, booktitle = {International Joint Conference on Neural Networks (IJCNN)}, year = {2020} }
-
Neural Module Networks for Reasoning over Text.
International Conference on Learning Representations (ICLR).
2020
Conference
[ PDF, arXiv, OpenReview, Code, Abstract, BibTex ]Answering compositional questions that require multiple steps of reasoning against text is challenging, especially when they involve discrete, symbolic operations. Neural module networks (NMNs) learn to parse such questions as executable programs composed of learnable modules, performing well on synthetic visual QA domains. However, we find that it is challenging to learn these models for non-synthetic questions on open-domain text, where a model needs to deal with the diversity of natural language and perform a broader range of reasoning. We extend NMNs by: (a) introducing modules that reason over a paragraph of text, performing symbolic reasoning (such as arithmetic, sorting, counting) over numbers and dates in a probabilistic and differentiable manner; and (b) proposing an unsupervised auxiliary loss to help extract arguments associated with the events in text. Additionally, we show that a limited amount of heuristically-obtained question program and intermediate module output supervision provides sufficient inductive bias for accurate learning. Our proposed model significantly outperforms state-of-the-art models on a subset of the DROP dataset that poses a variety of reasoning challenges that are covered by our modules.@inproceedings{nmn:iclr20, author = {Nitish Gupta and Kevin Lin and Dan Roth and Sameer Singh and Matt Gardner}, title = { {Neural Module Networks for Reasoning over Text} }, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }
-
Explain Your Move: Understanding Agent Actions Using Specific and Relevant Feature Attribution.
International Conference on Learning Representations (ICLR).
2020
Conference
[ PDF, Project page, arXiv, Code+Data, OpenReview, Abstract, BibTex ]As deep reinforcement learning (RL) is applied to more tasks, there is a need to visualize and understand the behavior of learned agents. Saliency maps explain agent behavior by highlighting the features of the input state that are most relevant for the agent in taking an action. Existing perturbation-based approaches to compute saliency often highlight regions of the input that are not relevant to the action taken by the agent. Our proposed approach, SARFA (Specific and Relevant Feature Attribution), generates more focused saliency maps by balancing two aspects (specificity and relevance) that capture different desiderata of saliency. The first captures the impact of perturbation on the relative expected reward of the action to be explained. The second downweighs irrelevant features that alter the relative expected rewards of actions other than the action to be explained. We compare SARFA with existing approaches on agents trained to play board games (Chess and Go) and Atari games (Breakout, Pong and Space Invaders). We show through illustrative examples (Chess, Atari, Go), human studies (Chess), and automated evaluation methods (Chess) that SARFA generates saliency maps that are more interpretable for humans than existing approaches. For the code release and demo videos, see: https://nikaashpuri.github.io/sarfa-saliency/.@inproceedings{salrl:iclr20, author = {Piyush Gupta and Nikaash Puri and Sukriti Verma and Dhruv Kayastha and Shripad Deshmukh and Balaji Krishnamurthy and Sameer Singh}, title = { {Explain Your Move: Understanding Agent Actions Using Specific and Relevant Feature Attribution} }, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2020} }
-
Minecraft as a Platform for Project-Based Learning in AI.
AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI).
2020
Conference
[ PDF, Website, Poster, Spotlight, AAAI Page, Abstract, BibTex ]Undergraduate courses that focus on open-ended, project-based learning teach students how to define concrete goals, transfer conceptual understanding of algorithms to code, and evaluate/analyze/present their solution. However, AI, along with machine learning, is getting increasingly varied in terms of both the approaches and applications, making it challenging to design project courses that span a sufficiently wide spectrum of AI. For these reasons, existing AI project courses are restricted to a narrow set of approaches (e.g. only reinforcement learning) or applications (e.g. only computer vision).
In this paper, we propose to use Minecraft as the platform for teaching AI via project-based learning. Minecraft is an open-world sandbox game with elements of exploration, resource gathering, crafting, construction, and combat, and is supported by the Malmo library that provides a programmatic interface to the player observations and actions at various levels of granularity. In Minecraft, students can design projects to use approaches like search-based AI, reinforcement learning, supervised learning, and constraint satisfaction, on data types like text, audio, images, and tabular data. We describe our experience with an open-ended, undergraduate AI projects course using Minecraft that includes 82 different projects, covering themes that ranged from navigation, instruction following, object detection, combat, and music/image generation.@inproceedings{malmo:eaai20, author = {Sameer Singh}, title = { {Minecraft as a Platform for Project-Based Learning in AI} }, booktitle = {AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI)}, doi = {10.1609/aaai.v34i09.7070}, pages = {13504-13505}, year = {2020} }
-
Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods.
AAAI/ACM Conference on AI, Ethics, and Society (AIES).
2020
Conference
[ PDF, arXiv, ACM Page, Abstract, BibTex ]As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution still remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations which do not reflect the underlying biases.@inproceedings{advlime:aies20, author = {Dylan Slack and Sophie Hilgard and Emily Jia and Sameer Singh and Himabindu Lakkaraju}, title = { {Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods} }, booktitle = {AAAI/ACM Conference on AI, Ethics, and Society (AIES)}, doi = {10.1145/3375627.3375830}, pages = {180-186}, year = {2020} }
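A highly simplified sketch of the scaffolding idea, under the assumption that perturbation-based explainers mostly issue off-manifold queries that an out-of-distribution detector can recognize. The detector, feature names, and toy classifiers below are invented for illustration and are not the paper's code.

    class ScaffoldedClassifier:
        """Biased on real data, innocuous on the off-manifold queries explainers generate."""

        def __init__(self, biased_model, innocuous_model, ood_detector):
            self.biased_model = biased_model        # e.g., decides using a sensitive feature
            self.innocuous_model = innocuous_model  # e.g., decides using an unrelated feature
            self.ood_detector = ood_detector        # flags perturbed / off-manifold inputs

        def predict(self, x):
            # LIME/SHAP mostly see perturbed queries, so explanations reflect the innocuous path.
            return self.innocuous_model(x) if self.ood_detector(x) else self.biased_model(x)

    # Toy instantiation with dict-valued inputs (purely illustrative).
    biased = lambda x: int(x["sensitive_attr"] > 0)
    innocuous = lambda x: int(x["unrelated_attr"] > 0)
    is_ood = lambda x: abs(x["unrelated_attr"]) > 10  # placeholder OOD detector
    model = ScaffoldedClassifier(biased, innocuous, is_ood)
    print(model.predict({"sensitive_attr": 1, "unrelated_attr": 0}))  # 1: biased path on "real" data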
-
Head Network Distillation: Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems.
IEEE Access.
2020
Journal
[ Journal, BibTex ]@article{headnet:ieee20, author = {Yoshitomo Matsubara and Davide Callegaro and Sabur Baidya and Marco Levorato and Sameer Singh}, title = { {Head Network Distillation: Splitting Distilled Deep Neural Networks for Resource-Constrained Edge Computing Systems} }, journal = {IEEE Access}, volume = {126}, number = {4}, doi = {10.1109/ACCESS.2020.3039714}, year = {2020} }
-
On the Utility of Active Instance Selection for Few-Shot Learning.
NeurIPS Workshop on Human And Model in the Loop Evaluation and Training Strategies (HAMLETS).
2020
Workshop
[ PDF, OpenReview, BibTex ]@inproceedings{activefew:hamlets20, author = {Pouya Pezeshkpour and Zhengli Zhao and Sameer Singh}, title = { {On the Utility of Active Instance Selection for Few-Shot Learning} }, booktitle = {NeurIPS Workshop on Human And Model in the Loop Evaluation and Training Strategies (HAMLETS)}, year = {2020} }
-
COVIDLies: Detecting COVID-19 Misinformation on Social Media.
EMNLP NLP Covid19 Workshop.
2020
Workshop
Best Paper Award
[ PDF, ACL Anthology, Website (w/ demo), Abstract, BibTex ]The ongoing pandemic has heightened the need for developing tools to flag COVID-19-related misinformation on the internet, specifically on social media such as Twitter. However, due to novel language and the rapid change of information, existing misinformation detection datasets are not effective for evaluating systems designed to detect misinformation on this topic. Misinformation detection can be divided into two sub-tasks: (i) retrieval of misconceptions relevant to posts being checked for veracity, and (ii) stance detection to identify whether the posts Agree, Disagree, or express No Stance towards the retrieved misconceptions. To facilitate research on this task, we release COVIDLies (https://ucinlp.github.io/covid19), a dataset of 6761 expert-annotated tweets to evaluate the performance of misinformation detection systems on 86 different pieces of COVID-19 related misinformation. We evaluate existing NLP systems on this dataset, providing initial benchmarks and identifying key challenges for future models to improve upon.@inproceedings{covidlies:nlpcovid20, author = {Tamanna Hossain and Robert L. Logan IV and Arjuna Ugarte and Yoshitomo Matsubara and Sean Young and Sameer Singh}, title = { {COVIDLies: Detecting COVID-19 Misinformation on Social Media} }, booktitle = {EMNLP NLP Covid19 Workshop}, doi = {10.18653/v1/2020.nlpcovid19-2.11}, year = {2020} }
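The two sub-tasks can be pictured with a small, hypothetical pipeline sketch; the lexical-overlap retriever and rule-based stance detector below are placeholders, not the systems benchmarked in the paper.

    # Hypothetical two-stage check: retrieve relevant misconceptions, then detect stance.
    MISCONCEPTIONS = [
        "5G networks spread the virus.",
        "Drinking hot water cures the infection.",
    ]

    def overlap(a, b):
        # Placeholder relevance score; real systems use learned retrievers.
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b)

    def retrieve(tweet, k=1):
        return sorted(MISCONCEPTIONS, key=lambda m: overlap(tweet, m), reverse=True)[:k]

    def stance(tweet, misconception):
        # Placeholder stance detector over the paper's label set.
        if "no evidence" in tweet.lower() or " not " in tweet.lower():
            return "Disagree"
        return "Agree" if overlap(tweet, misconception) > 0.2 else "No Stance"

    tweet = "There is no evidence that 5G networks spread the virus."
    for m in retrieve(tweet):
        print(m, "->", stance(tweet, m))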
-
Tweeki: Linking Named Entities on Twitter to a Knowledge Graph.
EMNLP Workshop on Noisy, User-generated Text (W-NUT).
2020
Workshop
[ PDF, ACL Anthology, Abstract, BibTex ]To identify what entities are being talked about in tweets, we need to automatically link named entities that appear in tweets to structured KBs like WikiData. Existing approaches often struggle with such short, noisy texts, or their complex design and reliance on supervision make them brittle, difficult to use and maintain, and quick to lose significance over time. Further, there is a lack of a large, linked corpus of tweets to aid researchers, along with a lack of a gold dataset to evaluate the accuracy of entity linking. In this paper, we introduce (1) Tweeki, an unsupervised, modular entity linking system for Twitter, (2) TweekiData, a large, automatically-annotated corpus of Tweets linked to entities in WikiData, and (3) TweekiGold, a gold dataset for entity linking evaluation. Through comprehensive analysis, we show that Tweeki is comparable in performance to recent state-of-the-art entity linking models, that the dataset is of high quality, and we present a use case of how the dataset can be used to improve downstream tasks in social media analysis (geolocation prediction).@inproceedings{tweeki:wnut20, author = {Bahareh Harandizadeh and Sameer Singh}, title = { {Tweeki: Linking Named Entities on Twitter to a Knowledge Graph} }, booktitle = {EMNLP Workshop on Noisy, User-generated Text (W-NUT)}, doi = {10.18653/v1/2020.wnut-1.29}, year = {2020} }
-
Citations Beyond Self Citations: Identifying Authors, Affiliations, and Nationalities in Scientific Papers.
Workshop on Mining Scientific Publications (WOSP).
2020
Workshop
[ PDF, Code, ACL Anthology, Abstract, BibTex ]The question of the utility of the blind peer-review system is fundamental to scientific research. Some studies investigate exactly how “blind” the papers are in the double-blind review system by manually or automatically identifying the true authors, mainly suggesting the number of self-citations in the submitted manuscripts as the primary signal for identity. However, related work on automated approaches is limited by small dataset sizes and restricted experimental setups, and thus lacks practical insights into the blind review process. In this work, we train models that identify the authors, their affiliations, and their nationalities through real-world, large-scale experiments on the Microsoft Academic Graph, including the cold start scenario. Our models are accurate; we identify at least one of the authors, affiliations, and nationalities of held-out papers with 40.3%, 47.9% and 86.0% accuracy respectively, from the top-10 guesses of our models. However, through insights from the model, we demonstrate that these entities are identifiable with a small number of guesses, primarily by using a combination of self-citations, social citations, and common citations. Moreover, our further analysis of the results leads to interesting findings, such as that prominent affiliations are easily identifiable (e.g. 93.8% of test papers written by Microsoft are identified with top-10 guesses). The experimental results show, against conventional belief, that self-citations are no more informative than common citations, suggesting that removing self-citations is not sufficient for authors to maintain their anonymity.@inproceedings{deblind:wosp20, author = {Yoshitomo Matsubara and Sameer Singh}, title = { {Citations Beyond Self Citations: Identifying Authors, Affiliations, and Nationalities in Scientific Papers} }, booktitle = {Workshop on Mining Scientific Publications (WOSP)}, year = {2020} }
-
Data Importance-Based Active Learning for Limited Labels.
CVPR Workshop on Visual Learning with Limited Labels (VL3).
2020
Workshop
[ Video, BibTex ]@inproceedings{ibal:vl320, author = {Pouya Pezeshkpour and Zhengli Zhao and Sameer Singh}, title = { {Data Importance-Based Active Learning for Limited Labels} }, booktitle = {CVPR Workshop on Visual Learning with Limited Labels (VL3)}, year = {2020} }
-
Universal Adversarial Triggers for Attacking and Analyzing NLP.
Empirical Methods in Natural Language Processing (EMNLP).
2019
Conference
[ PDF, arXiv, Blog post, Code, ACL Anthology, Abstract, BibTex ]Adversarial examples highlight model vulnerabilities and are useful for evaluation and interpretation. We define universal adversarial triggers: input-agnostic sequences of tokens that trigger a model to produce a specific prediction when concatenated to any input from a dataset. We propose a gradient-guided search over tokens which finds short trigger sequences (e.g., one word for classification and four words for language modeling) that successfully trigger the target prediction. For example, triggers cause SNLI entailment accuracy to drop from 89.94% to 0.55%, 72% of “why” questions in SQuAD to be answered “to kill american people”, and the GPT-2 language model to spew racist output even when conditioned on non-racial contexts. Furthermore, although the triggers are optimized using white-box access to a specific model, they transfer to other models for all tasks we consider. Finally, since triggers are input-agnostic, they provide an analysis of global model behavior. For instance, they confirm that SNLI models exploit dataset biases and help to diagnose heuristics learned by reading comprehension models.@inproceedings{trigger:emnlp19, author = {Eric Wallace and Shi Feng and Nikhil Kandpal and Matt Gardner and Sameer Singh}, title = { {Universal Adversarial Triggers for Attacking and Analyzing NLP} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, doi = {10.18653/v1/D19-1221}, pages = {2153-2162}, year = {2019} }
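The gradient-guided search can be summarized by its first-order token-replacement step; below is a compact sketch against generic PyTorch tensors. The loss, batching, and beam search used in the paper are omitted, and the tensor names are assumptions.

    import torch

    def best_trigger_replacements(trigger_grad, embedding_matrix, trigger_embeds):
        """HotFlip-style scoring of candidate trigger tokens.

        trigger_grad:     [trigger_len, dim]  gradient of the loss w.r.t. trigger embeddings
        embedding_matrix: [vocab, dim]        model's token embedding table
        trigger_embeds:   [trigger_len, dim]  embeddings of the current trigger tokens
        Returns, per position, the vocab id whose substitution most decreases the loss
        under a linear approximation: delta_loss ~ (e_new - e_cur) . grad.
        """
        delta = (torch.einsum("td,vd->tv", trigger_grad, embedding_matrix)
                 - (trigger_grad * trigger_embeds).sum(dim=1, keepdim=True))
        return delta.argmin(dim=1)  # most loss-decreasing candidate at each position

    # Tiny runnable example with random tensors (shapes only; not a real model).
    grad, emb, cur = torch.randn(3, 8), torch.randn(100, 8), torch.randn(3, 8)
    print(best_trigger_replacements(grad, emb, cur))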
-
Do NLP Models Know Numbers? Probing Numeracy in Embeddings.
Empirical Methods in Natural Language Processing (EMNLP).
2019
Conference
[ PDF, arXiv, ACL Anthology, BibTex ]@inproceedings{numeracy:emnlp19, author = {Eric Wallace and Yizhong Wang and Sujian Li and Sameer Singh and Matt Gardner}, title = { {Do NLP Models Know Numbers? Probing Numeracy in Embeddings} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, doi = {10.18653/v1/D19-1534}, pages = {5307-5315}, year = {2019} }
-
Knowledge Enhanced Contextual Word Representations.
Empirical Methods in Natural Language Processing (EMNLP).
2019
Conference
[ PDF, arXiv, ACL Anthology, BibTex ]@inproceedings{knobert:emnlp19, author = {Matthew E. Peters and Mark Neumann and Robert L. Logan IV and Roy Schwartz and Vidur Joshi and Sameer Singh and Noah A. Smith}, title = { {Knowledge Enhanced Contextual Word Representations} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, doi = {10.18653/v1/D19-1005}, pages = {43-54}, year = {2019} }
-
Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling.
Association for Computational Linguistics (ACL).
2019
Conference
[ PDF, arXiv, Data, Code, ACL Anthology, Abstract, BibTex ]Modeling human language requires the ability to not only generate fluent text but also encode factual knowledge. However, traditional language models are only capable of remembering facts seen at training time, and often have difficulty recalling them. To address this, we introduce the knowledge graph language model (KGLM), a neural language model with mechanisms for selecting and copying facts from a knowledge graph that are relevant to the context. These mechanisms enable the model to render information it has never seen before, as well as generate out-of-vocabulary tokens. We also introduce the Linked WikiText-2 dataset, a corpus of annotated text aligned to the Wikidata knowledge graph whose contents (roughly) match the popular WikiText-2 benchmark. In experiments, we demonstrate that the KGLM achieves significantly better performance than a strong baseline language model. We additionally compare different language model’s ability to complete sentences requiring factual knowledge, showing that the KGLM outperforms even very large language models in generating facts.@inproceedings{kglm:acl19, author = {Robert L. Logan IV and Nelson F. Liu and Matthew E. Peters and Matt Gardner and Sameer Singh}, title = { {Barack's Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/P19-1598}, pages = {5962-5971}, year = {2019} }
-
Are Red Roses Red? Evaluating Consistency of Question-Answering Models.
Association for Computational Linguistics (ACL).
2019
Conference
[ PDF, ACL Anthology, BibTex ]@inproceedings{impl:acl19, author = {Marco Tulio Ribeiro and Carlos Guestrin and Sameer Singh}, title = { {Are Red Roses Red? Evaluating Consistency of Question-Answering Models} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/P19-1621}, pages = {6174-6184}, year = {2019} }
-
Compositional Questions Do Not Necessitate Multi-hop Reasoning.
Association for Computational Linguistics (ACL).
2019
Conference
[ PDF, arXiv, ACL Anthology, BibTex ]@inproceedings{mhop:acl19, author = {Sewon Min and Eric Wallace and Sameer Singh and Matt Gardner and Hannaneh Hajishirzi and Luke Zettlemoyer}, title = { {Compositional Questions Do Not Necessitate Multi-hop Reasoning} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/P19-1416}, pages = {4249-4257}, year = {2019} }
-
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2019
Conference
[ PDF, Website, arXiv, Data, ACL Anthology, Leaderboard, Demo, Abstract, BibTex ]Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 55k-question benchmark, a system must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting). These operations require a much more comprehensive understanding of the content of paragraphs, as they remove the paraphrase-and-entity-typing shortcuts available in prior datasets. We apply state-of-the-art methods from both the reading comprehension and semantic parsing literatures on this dataset and show that the best systems only achieve 38.4% F1 on our generalized accuracy metric, while expert human performance is 96%. We additionally present a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.@inproceedings{drop:naacl19, author = {Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner}, title = { {DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, doi = {10.18653/v1/N19-1246}, pages = {2368-2378}, year = {2019} }
-
Investigating Robustness and Interpretability of Link Prediction via Adversarial Modifications.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2019
Conference
[ PDF, Website, arXiv, Code, Video, ACL Anthology, Abstract, BibTex ]Representing entities and relations in an embedding space is a well-studied approach for machine learning on relational data. Existing approaches, however, primarily focus on improving accuracy and overlook other aspects such as robustness and interpretability. In this paper, we propose adversarial modifications for link prediction models: identifying the fact to add into or remove from the knowledge graph that changes the prediction for a target fact after the model is retrained. Using these single modifications of the graph, we identify the most influential fact for a predicted link and evaluate the sensitivity of the model to the addition of fake facts. We introduce an efficient approach to estimate the effect of such modifications by approximating the change in the embeddings when the knowledge graph changes. To avoid the combinatorial search over all possible facts, we train a network to decode embeddings to their corresponding graph components, allowing the use of gradient-based optimization to identify the adversarial modification. We use these techniques to evaluate the robustness of link prediction models (by measuring sensitivity to additional facts), study interpretability through the facts most responsible for predictions (by identifying the most influential neighbors), and detect incorrect facts in the knowledge base.@inproceedings{criage:naacl19, author = {Pouya Pezeshkpour and Yifan Tian and Sameer Singh}, title = { {Investigating Robustness and Interpretability of Link Prediction via Adversarial Modifications} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, doi = {10.18653/v1/N19-1337}, pages = {3336-3347}, year = {2019} }
-
GenderQuant: Quantifying Mention-Level Genderedness.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2019
Conference
[ PDF, Website, Code, ACL Anthology, Abstract, BibTex ]Language is gendered if the context surrounding a mention is suggestive of a particular binary gender for that mention. Detecting the different ways in which language is gendered is an important task since gendered language can bias NLP models (such as for coreference resolution). This task is challenging since genderedness is often expressed in subtle ways. Existing approaches need considerable annotation efforts for each language, domain, and author, and often require handcrafted lexicons and features. Additionally, these approaches do not provide a quantifiable measure of how gendered the text is, nor are they applicable at the fine-grained mention level.
In this paper, we use existing NLP pipelines to automatically annotate gender of mentions in the text. On corpora labeled using this method, we train a supervised classifier to predict the gender of any mention from its context and evaluate it on unseen text. The model confidence for a mention's gender can be used as a proxy to indicate the level of genderedness of the context. We test this gendered language detector on movie summaries, movie reviews, news articles, and fiction novels, achieving an AUC-ROC of up to 0.71, and observe that the model predictions agree with human judgments collected for this task. We also provide examples of detected gendered sentences from aforementioned domains.@inproceedings{gender:naacl19, author = {Ananya Ananya and Nitya Parthasarthi and Sameer Singh}, title = { {GenderQuant: Quantifying Mention-Level Genderedness} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, doi = {10.18653/v1/N19-1303}, pages = {2959-2969}, year = {2019} }
-
PoMo: Generating Entity-Specific Post-Modifiers in Context.
Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2019
Conference
[ PDF, Website, arXiv, Data, ACL Anthology, Abstract, BibTex ]We introduce entity post-modifier generation as an instance of a collaborative writing task. Given a sentence about a target entity, the task is to automatically generate a post-modifier phrase that provides contextually relevant information about the entity. For example, for the sentence, "Barack Obama, _______, supported the #MeToo movement.", the phrase "a father of two girls" is a contextually relevant post-modifier. To this end, we build PoMo, a post-modifier dataset created automatically from news articles reflecting a journalistic need for incorporating entity information that is relevant to a particular news event. PoMo consists of more than 231K sentences with post-modifiers and associated facts extracted from Wikidata for around 57K unique entities. We use crowdsourcing to show that modeling contextual relevance is necessary for accurate post-modifier generation.
We adapt a number of existing generation approaches as baselines for this dataset. Our results show there is large room for improvement in terms of both identifying relevant facts to include (knowing which claims are relevant gives a >20% improvement in BLEU score), and generating appropriate post-modifier text for the context (providing relevant claims is not sufficient for accurate generation). We conduct an error analysis that suggests promising directions for future research.@inproceedings{pomo:naacl19, author = {Jun Seok Kang and Robert L. Logan IV and Zewei Chu and Yang Chen and Dheeru Dua and Kevin Gimpel and Sameer Singh and Niranjan Balasubramanian}, title = { {PoMo: Generating Entity-Specific Post-Modifiers in Context} }, booktitle = {Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, doi = {10.18653/v1/N19-1089}, pages = {826-838}, year = {2019} }
-
AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models.
Demo at the Empirical Methods in Natural Language Processing (EMNLP).
2019
Demo
Best Demonstration Paper Award.
[ PDF, Project Page, ACL Anthology, ArXiv, Poster, Abstract, BibTex ]Neural NLP models are increasingly accurate but are imperfect and opaque---they break in counterintuitive ways and leave end users puzzled at their behavior. Model interpretation methods ameliorate this opacity by providing explanations for specific model predictions. Unfortunately, existing interpretation codebases make it difficult to apply these methods to new models and tasks, which hinders adoption for practitioners and burdens interpretability researchers. We introduce AllenNLP Interpret, a flexible framework for interpreting NLP models. The toolkit provides interpretation primitives (e.g., input gradients) for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components. We demonstrate the toolkit's flexibility and utility by implementing live demos for five interpretation methods (e.g., saliency maps and adversarial attacks) on a variety of models and tasks (e.g., masked language modeling using BERT and reading comprehension using BiDAF). These demos, alongside our code and tutorials, are available at https://allennlp.org/interpret.@inproceedings{interpret:emnlp19, author = {Eric Wallace and Jens Tuyls and Junlin Wang and Sanjay Subramanian and Matt Gardner and Sameer Singh}, title = { {AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models} }, booktitle = {Demo at the Empirical Methods in Natural Language Processing (EMNLP)}, doi = {10.18653/v1/D19-3002}, pages = {7-12}, year = {2019} }
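To make the toolkit's "input gradients" primitive concrete, here is a minimal gradient-times-input saliency sketch for a generic PyTorch model; it is not AllenNLP Interpret's API, just the underlying computation under assumed tensor shapes.

    import torch

    def gradient_x_input_saliency(embeddings, logits, target_class):
        # Token saliency: L1 norm of (d logit / d embedding) * embedding, per token.
        # embeddings: [seq_len, dim] with requires_grad=True; logits: [num_classes].
        grad, = torch.autograd.grad(logits[target_class], embeddings, retain_graph=True)
        scores = (grad * embeddings).abs().sum(dim=-1)  # [seq_len]
        return scores / scores.sum()                    # normalized for display

    # Toy differentiable "model" so the sketch runs end to end.
    emb = torch.randn(5, 16, requires_grad=True)
    w = torch.randn(16, 3)
    logits = emb.sum(dim=0) @ w
    print(gradient_x_input_saliency(emb, logits, target_class=0))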
-
Detecting Conversation Topics in Primary Care Office Visits from Transcripts of Patient-Provider Interactions.
Journal of the American Medical Informatics Association.
2019
Journal
[ PDF, Website, BibTex ]@article{convtopics:jamia19, author = {Jihyun Park and Dimitrios Kotzias and Patty Kuo and Robert L. Logan IV and Kritzia Merced and Sameer Singh and Michael Tanana and Efi Karra-Taniskidou and Jennifer Elston Lafata and David C. Atkins and Ming Tai-Seale and Zac E Imel and Padhraic Smyth}, title = { {Detecting Conversation Topics in Primary Care Office Visits from Transcripts of Patient-Provider Interactions} }, journal = {Journal of the American Medical Informatics Association}, volume = {26}, number = {12}, doi = {10.1093/jamia/ocz140}, pages = {1493-1504}, year = {2019} }
-
Comment on Semantic Based Adversarial Examples Fool Face Recognition.
Synced Review.
2019
Online
[ Article, BibTex ]@misc{review:synced19, author = {Sameer Singh}, title = { {Comment on Semantic Based Adversarial Examples Fool Face Recognition} }, editor = {Synced Review}, month = {August}, url = {https://syncedreview.com/2019/08/09/semantic-based-adversarial-examples-fool-face-recognition/}, year = {2019} }
-
Distilled Split Deep Neural Networks for Edge-Assisted Real-Time Systems.
Mobicom Workshop on Hot Topics in Video Analytics and Intelligent Edges.
2019
Workshop
[ PDF, BibTex ]@inproceedings{distill:hottopics19, author = {Yoshitomo Matsubara and Sabur Baidya and Davide Callegaro and Marco Levorato and Sameer Singh}, title = { {Distilled Split Deep Neural Networks for Edge-Assisted Real-Time Systems} }, booktitle = {Mobicom Workshop on Hot Topics in Video Analytics and Intelligent Edges}, year = {2019} }
-
Evaluating Question Answering Evaluation.
Workshop on Machine Reading and Question Answering (MRQA).
2019
Workshop
Best Paper Award.
[ PDF, BibTex ]@inproceedings{evalqa:mrqa19, author = {Anthony Chen and Gabriel Stanovsky and Sameer Singh and Matt Gardner}, title = { {Evaluating Question Answering Evaluation} }, booktitle = {Workshop on Machine Reading and Question Answering (MRQA)}, year = {2019} }
-
ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension.
Workshop on Machine Reading and Question Answering (MRQA).
2019
Workshop
[ PDF, BibTex ]@inproceedings{orb:mrqa19, author = {Dheeru Dua and Ananth Gottumukkala and Alon Talmor and Sameer Singh and Matt Gardner}, title = { {ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension} }, booktitle = {Workshop on Machine Reading and Question Answering (MRQA)}, year = {2019} }
-
Analyzing Compositionality of Visual Question Answering.
NeurIPS Workshop on Visually Grounded Interaction and Language (ViGIL).
2019
Workshop
[ PDF, BibTex ]@inproceedings{compvqa:vigil19, author = {Sanjay Subramanian and Sameer Singh and Matt Gardner}, title = { {Analyzing Compositionality of Visual Question Answering} }, booktitle = {NeurIPS Workshop on Visually Grounded Interaction and Language (ViGIL)}, year = {2019} }
-
Improving Differentially Private Models with Active Learning.
NeurIPS Workshop on Privacy in Machine Learning (PriML).
2019
Workshop
[ PDF, arXiv, BibTex ]@inproceedings{dpal:priml19, author = {Zhengli Zhao and Nicolas Papernot and Sameer Singh and Neoklis Polyzotis and Augustus Odena}, title = { {Improving Differentially Private Models with Active Learning} }, booktitle = {NeurIPS Workshop on Privacy in Machine Learning (PriML)}, year = {2019} }
-
From Reinforcement Learning to Deep Reinforcement Learning: An Overview.
Braverman Readings in Machine Learning: Key Ideas from Inception to Current State, Springer Press.
2018
Chapter
[ PDF (Springer), Springer, Amazon, Google Books, BibTex ]@incollection{deeprl:chap18, author = {Forest Agostinelli and Guillaume Hocquet and Sameer Singh and Pierre Baldi}, title = { {From Reinforcement Learning to Deep Reinforcement Learning: An Overview} }, booktitle = {Braverman Readings in Machine Learning: Key Ideas from Inception to Current State, Springer Press}, pages = {298-328}, year = {2018} }
-
Embedding Multimodal Relational Data for Knowledge Base Completion.
Empirical Methods in Natural Language Processing (EMNLP).
2018
Conference
[ PDF, Code/Data, arXiv, ACL Anthology, Video, Abstract, BibTex ]Representing entities and relations in an embedding space is a well-studied approach for machine learning on relational data. Existing approaches, however, primarily focus on simple link structure between a finite set of entities, ignoring the variety of data types that are often used in knowledge bases, such as text, images, and numerical values. In this paper, we propose multimodal knowledge base embeddings (MKBE) that use different neural encoders for this variety of observed data, and combine them with existing relational models to learn embeddings of the entities and multimodal data. Further, using these learned embeddings and different neural decoders, we introduce a novel multimodal imputation model to generate missing multimodal values, like text and images, from information in the knowledge base. We enrich existing relational datasets to create two novel benchmarks that contain additional information such as textual descriptions and images of the original entities. We demonstrate that our models utilize this additional information effectively to provide more accurate link prediction, achieving state-of-the-art results with a considerable gap of 5-7% over existing methods. Further, we evaluate the quality of our generated multimodal values via a user study.@inproceedings{mmkb:emnlp18, author = {Pouya Pezeshkpour and Liyan Chen and Sameer Singh}, title = { {Embedding Multimodal Relational Data for Knowledge Base Completion} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, doi = {10.18653/v1/D18-1359}, pages = {3208-3218}, year = {2018} }
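As a rough illustration of pairing a modality-specific encoder with a standard relational scorer (DistMult here; the tiny CNN, dimensions, and data are invented for the sketch and are not the paper's configuration):

    import torch
    import torch.nn as nn

    class MultimodalDistMult(nn.Module):
        # Toy multimodal KB embedding: a small CNN maps images into the same space
        # as entity/relation embeddings, which are then scored with DistMult.
        def __init__(self, num_entities, num_relations, dim=32):
            super().__init__()
            self.entity = nn.Embedding(num_entities, dim)
            self.relation = nn.Embedding(num_relations, dim)
            self.image_encoder = nn.Sequential(   # 3x32x32 image -> dim-sized vector
                nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, dim),
            )

        def score(self, subj_ids, rel_ids, obj_images):
            s = self.entity(subj_ids)
            r = self.relation(rel_ids)
            o = self.image_encoder(obj_images)
            return (s * r * o).sum(dim=-1)        # DistMult triple score

    model = MultimodalDistMult(num_entities=10, num_relations=4)
    print(model.score(torch.tensor([1]), torch.tensor([2]), torch.randn(1, 3, 32, 32)))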
-
Interpretation of Natural Language Rules in Conversational Machine Reading.
Empirical Methods in Natural Language Processing (EMNLP).
2018
Conference
[ PDF, arXiv, ACL Anthology, Abstract, BibTex ]Most work in machine reading focuses on question answering problems where the answer is directly expressed in the text to read. However, many real-world question answering problems require the reading of text not because it contains the literal answer, but because it contains a recipe to derive an answer together with the reader's background knowledge. One example is the task of interpreting regulations to answer "Can I...?" or "Do I have to...?" questions such as "I am working in Canada. Do I have to carry on paying UK National Insurance?" after reading a UK government website about this topic. This task requires both the interpretation of rules and the application of background knowledge. It is further complicated due to the fact that, in practice, most questions are underspecified, and a human assistant will regularly have to ask clarification questions such as "How long have you been working abroad?" when the answer cannot be directly derived from the question and text. In this paper, we formalise this task and develop a crowd-sourcing strategy to collect 32k task instances based on real-world rules and crowd-generated questions and scenarios. We analyse the challenges of this task and assess its difficulty by evaluating the performance of rule-based and machine-learning baselines. We observe promising results when no background knowledge is necessary, and substantial room for improvement whenever background knowledge is needed.@inproceedings{quarc:emnlp18, author = {Marzieh Saeidi and Max Bartolo and Patrick Lewis and Sameer Singh and Tim Rocktaschel and Mike Sheldon and Guillaume Bouchard and Sebastian Riedel}, title = { {Interpretation of Natural Language Rules in Conversational Machine Reading} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, doi = {10.18653/v1/D18-1233}, pages = {2087-2097}, year = {2018} }
-
Semantically Equivalent Adversarial Rules for Debugging NLP models.
Association for Computational Linguistics (ACL).
2018
Conference
Honorable Mention for Best Paper.
[ PDF, Appendix, Code, ACL Anthology, Video, Slides, Abstract, BibTex ]Complex machine learning models for NLP are often brittle, making different predictions for input instances that are extremely similar semantically. To automatically detect this behavior for individual instances, we present semantically equivalent adversaries (SEAs) - semantic-preserving perturbations that induce changes in the model’s predictions. We generalize these adversaries into semantically equivalent adversarial rules (SEARs) - simple, universal replacement rules that induce adversaries on many instances. We demonstrate the usefulness and flexibility of SEAs and SEARs by detecting bugs in black-box state-of-the-art models for three domains: machine comprehension, visual question-answering, and sentiment analysis. Via user studies, we demonstrate that we generate high-quality local adversaries for more instances than humans, and that SEARs induce four times as many mistakes as the bugs discovered by human experts. SEARs are also actionable: retraining models using data augmentation significantly reduces bugs, while maintaining accuracy.@inproceedings{sears:acl18, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {Semantically Equivalent Adversarial Rules for Debugging NLP models} }, booktitle = {Association for Computational Linguistics (ACL)}, doi = {10.18653/v1/P18-1079}, pages = {856-865}, year = {2018} }
-
Generating Natural Adversarial Examples.
International Conference on Learning Representations (ICLR).
2018
Conference
[ PDF, Source Code, arXiv, OpenReview, Abstract, BibTex ]Due to their complex nature, it is hard to characterize the ways in which machine learning models can misbehave or be exploited when deployed. Recent work on adversarial examples, i.e. inputs with minor perturbations that result in substantially different model predictions, is helpful in evaluating the robustness of these models by exposing the adversarial scenarios where they fail. However, these malicious perturbations are often unnatural, not semantically meaningful, and not applicable to complicated domains such as language. In this paper, we propose a framework to generate natural and legible adversarial examples that lie on the data manifold, by searching in semantic space of dense and continuous data representation, utilizing the recent advances in generative adversarial networks. We present generated adversaries to demonstrate the potential of the proposed approach for black-box classifiers for a wide range of applications such as image classification, textual entailment, and machine translation. We include experiments to show that the generated adversaries are natural, legible to humans, and useful in evaluating and analyzing black-box classifiers.@inproceedings{natadv:iclr18, author = {Zhengli Zhao and Dheeru Dua and Sameer Singh}, title = { {Generating Natural Adversarial Examples} }, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2018} }
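A schematic of the latent-space search described above, assuming a pretrained generator G (latent to data) and inverter I (data to latent), both placeholders here; the paper's actual search procedure and models are more involved.

    import numpy as np

    def natural_adversary(x, classifier, G, I, step=0.1, n_candidates=64, max_radius=5.0):
        # Look for a nearby latent vector whose decoding changes the classifier's prediction.
        z0 = I(x)
        y0 = classifier(G(z0))
        rng = np.random.default_rng(0)
        radius = step
        while radius <= max_radius:
            # Sample perturbations in an expanding shell around z0; keep the closest flip.
            candidates = z0 + rng.normal(scale=radius, size=(n_candidates, z0.shape[-1]))
            flips = [z for z in candidates if classifier(G(z)) != y0]
            if flips:
                return G(min(flips, key=lambda z: np.linalg.norm(z - z0)))
            radius += step
        return None

    # Toy instantiation: identity generator/inverter and a linear decision boundary.
    G = I = lambda v: v
    clf = lambda v: int(v.sum() > 0)
    print(natural_adversary(np.array([0.2, 0.1]), clf, G, I))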
-
Combining Symbolic Expressions and Black-box Function Evaluations for Training Neural Programs.
International Conference on Learning Representations (ICLR).
2018
Conference
[ PDF, Source Code, arXiv, OpenReview, Abstract, BibTex ]Neural programming involves training neural networks to learn programs, mathematics, or logic from data. Previous works have failed to achieve good generalization performance, especially on problems and programs with high complexity or on large domains. This is because they mostly rely either on black-box function evaluations that do not capture the structure of the program, or on detailed execution traces that are expensive to obtain, and hence the training data has poor coverage of the domain under consideration. We present a novel framework that utilizes black-box function evaluations, in conjunction with symbolic expressions that define relationships between the given functions. We employ tree LSTMs to incorporate the structure of the symbolic expression trees. We use tree encoding for numbers present in function evaluation data, based on their decimal representation. We present an evaluation benchmark for this task to demonstrate our proposed model combines symbolic reasoning and function evaluation in a fruitful manner, obtaining high accuracies in our experiments. Our framework generalizes significantly better to expressions of higher depth and is able to fill partial equations with valid completions.@inproceedings{funeval:iclr18, author = {Forough Arabshahi and Sameer Singh and Animashree Anandkumar}, title = { {Combining Symbolic Expressions and Black-box Function Evaluations for Training Neural Programs} }, booktitle = {International Conference on Learning Representations (ICLR)}, year = {2018} }
-
Anchors: High-Precision Model-Agnostic Explanations.
AAAI Conference on Artificial Intelligence (AAAI).
2018
Conference
[ PDF, Code (package), Code (results), AAAI Page, Abstract, BibTex ]We introduce a novel model-agnostic system that explains the behavior of complex models with high-precision rules called anchors, representing local, “sufficient” conditions for predictions. We propose an algorithm to efficiently compute these explanations for any black-box model with high-probability guarantees. We demonstrate the flexibility of anchors by explaining a myriad of different models for different domains and tasks. In a user study, we show that anchors enable users to predict how a model would behave on unseen instances with less effort and higher precision, as compared to existing linear explanations or no explanations.@inproceedings{anchors:aaai18, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {Anchors: High-Precision Model-Agnostic Explanations} }, booktitle = {AAAI Conference on Artificial Intelligence (AAAI)}, pages = {1527-1535}, year = {2018} }
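For reference, the notion of a high-precision anchor can be stated compactly (generic notation): a rule A that holds for instance x is an anchor for model f at precision threshold \tau if

    \mathrm{prec}(A) \;=\; \mathbb{E}_{z \sim \mathcal{D}(\cdot \mid A)}\big[\mathbf{1}\{f(z) = f(x)\}\big] \;\ge\; \tau,
    \qquad\text{with coverage}\qquad
    \mathrm{cov}(A) \;=\; \mathbb{E}_{z \sim \mathcal{D}}\big[A(z)\big].

Among candidate rules that clear the precision threshold (with high probability, since the expectation is estimated by sampling perturbations), the search prefers the rule with the largest coverage.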
-
A Framework of Rapid Regional Tsunami Damage Recognition from Post-event TerraSAR-X Imagery Using Deep Neural Networks.
IEEE Geoscience and Remote Sensing Letters.
2018
Journal
[ PDF, IEEE, Abstract, BibTex ]Near real-time building damage mapping is an indispensable prerequisite for governments to make decisions for disaster relief. With high-resolution synthetic aperture radar (SAR) systems, such as TerraSAR-X, the provision of such products in a fast and effective way becomes possible. In this letter, a deep learning-based framework for rapid regional tsunami damage recognition using post-event SAR imagery is proposed. To perform such a rapid damage mapping, a series of tile-based image split analysis is employed to generate the data set. Next, a selection algorithm with the SqueezeNet network is developed to swiftly distinguish between built-up (BU) and nonbuilt-up regions. Finally, a recognition algorithm with a modified wide residual network is developed to classify the BU regions into wash away, collapsed, and slightly damaged regions. Experiments performed on the TerraSAR-X data from the 2011 Tohoku earthquake and tsunami in Japan show a BU region extraction accuracy of 80.4% and a damage-level recognition accuracy of 74.8%, respectively. Our framework takes around 2 h to train on a new region, and only several minutes for prediction.@article{tsunami:geosense18, author = {Yanbing Bai and Chang Gao and Sameer Singh and Magaly Koch and Bruno Adriano and Erick Mas and Shunichi Koshimura}, title = { {A Framework of Rapid Regional Tsunami Damage Recognition from Post-event TerraSAR-X Imagery Using Deep Neural Networks} }, journal = {IEEE Geoscience and Remote Sensing Letters}, volume = {15}, number = {1}, doi = {10.1109/LGRS.2017.2772349}, pages = {43-47}, year = {2018} }
-
Towards Solving Differential Equations through Neural Programming.
ICML Workshop on Neural Abstract Machines and Program Induction (NAMPI).
2018
Workshop
[ PDF, Poster, BibTex ]@inproceedings{diffeqeval:nampi18, author = {Forough Arabshahi and Sameer Singh and Animashree Anandkumar}, title = { {Towards Solving Differential Equations through Neural Programming} }, booktitle = {ICML Workshop on Neural Abstract Machines and Program Induction (NAMPI)}, year = {2018} }
-
Entity Linking via Joint Encoding of Types, Descriptions, and Context.
Empirical Methods in Natural Language Processing (EMNLP).
2017
Conference
[ PDF, Code, ACL Anthology, Website, Abstract, BibTex ]For accurate entity linking, we need to capture various information aspects of an entity, such as its description in a KB, contexts in which it is mentioned, and structured knowledge. Additionally, a linking system should work on texts from different domains without requiring domain-specific training data or hand-engineered features.
In this work we present a neural, modular entity linking system that learns a unified dense representation for each entity using multiple sources of information, such as its description, contexts around its mentions, and its fine-grained types. We show that the resulting entity linking system is effective at combining these sources, and performs competitively, sometimes out-performing current state-of-the-art systems across datasets, without requiring any domain-specific training data or hand-engineered features. We also show that our model can effectively "embed" entities that are new to the KB, and is able to link its mentions accurately.@inproceedings{neuralel:emnlp17, author = {Nitish Gupta and Sameer Singh and Dan Roth}, title = { {Entity Linking via Joint Encoding of Types, Descriptions, and Context} }, booktitle = {Empirical Methods in Natural Language Processing (EMNLP)}, month = {September}, doi = {10.18653/v1/D17-1284}, pages = {2681-2690}, year = {2017} }
Nitish Gupta, Sameer Singh, Dan Roth.
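A minimal sketch of the joint-encoding idea in the entry above: each information source (KB description, mention contexts, fine-grained types) is encoded into a dense vector, the views are fused into one entity representation, and a mention is scored against it by dot product. The encoders, dimensions, and class name are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class JointEntityEncoder(nn.Module):
    """Fuses description, context, and type views into one dense entity vector."""
    def __init__(self, vocab_size=10_000, n_types=100, dim=128):
        super().__init__()
        self.word_emb = nn.EmbeddingBag(vocab_size, dim, mode="mean")   # bag-of-words encoder
        self.type_emb = nn.EmbeddingBag(n_types, dim, mode="mean")
        self.fuse = nn.Linear(3 * dim, dim)

    def entity_vec(self, desc_ids, ctx_ids, type_ids):
        views = [self.word_emb(desc_ids), self.word_emb(ctx_ids), self.type_emb(type_ids)]
        return self.fuse(torch.cat(views, dim=-1))

    def forward(self, mention_ids, desc_ids, ctx_ids, type_ids):
        mention = self.word_emb(mention_ids)                  # encode the mention's local context
        entity = self.entity_vec(desc_ids, ctx_ids, type_ids)
        return (mention * entity).sum(-1)                     # higher score = better link

model = JointEntityEncoder()
score = model(torch.randint(0, 10_000, (1, 12)),   # mention context tokens
              torch.randint(0, 10_000, (1, 30)),   # KB description tokens
              torch.randint(0, 10_000, (1, 20)),   # corpus context tokens
              torch.randint(0, 100, (1, 5)))       # fine-grained type ids
print(score)
```
-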
Intelligent Data Filtering in Constrained IoT Systems.
Asilomar Conference on Signals, Systems, and Computers.
2017
Invited
[ PDF, IEEE Xplore, Abstract, BibTex ]The expansion of complex autonomous sensing and control mechanisms in the Internet-of-Things systems clashes with constraints on computation and wireless communication resources. In this paper, we propose a framework to address this conflict for applications in which resolution using a centralized architecture with a general-purpose compression of observations is not appropriate. Three approaches for distributing observation detection workload between sensing and processing devices are considered for sensor systems within wireless islands. Each of the approaches is formulated for the shared configuration of a sensor-edge system, in which the network structure, observation monitoring problem, and machine learning-based detector implementing it are not modified. For every approach, a high-level strategy for realization of the detector for different assumptions on the relation between its complexity and the system's constraints is considered. In each case, the potential for the constraints' satisfaction is shown to exist and be exploitable via division, approximation, and delegation of the detector's workload to the sensing devices off the edge processor. We present examples of applications that benefit from the proposed approaches.@inproceedings{semcompress:asilomar17, author = {Igor Burago and Davide Callegaro and Marco Levorato and Sameer Singh}, title = { {Intelligent Data Filtering in Constrained IoT Systems} }, booktitle = {Asilomar Conference on Signals, Systems, and Computers}, year = {2017} }
Igor Burago, Davide Callegaro, Marco Levorato, Sameer Singh.
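A toy illustration of one workload-distribution strategy sketched in the abstract above (delegating part of the detector to the sensing device): the sensor runs a cheap screening function and forwards only flagged observations to the edge, where the full detector runs. `cheap_filter`, `full_detector`, and the threshold are hypothetical placeholders, not the paper's system.

```python
def on_sensor(observation, cheap_filter, threshold=0.3):
    """Runs on the constrained sensing device: a low-cost approximation of the detector."""
    return observation if cheap_filter(observation) >= threshold else None

def on_edge(forwarded, full_detector):
    """Runs on the edge processor, only for observations that survived the on-sensor filter."""
    return [full_detector(obs) for obs in forwarded if obs is not None]
```
-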
Semantic Compression for Edge-Assisted Systems.
Information Theory and Applications (ITA) Workshop.
2017
Invited
[ PDF, ArXiv version, BibTex ]@inproceedings{semcompress:ita17, author = {Igor Burago and Marco Levorato and Sameer Singh}, title = { {Semantic Compression for Edge-Assisted Systems} }, booktitle = {Information Theory and Applications (ITA) Workshop}, month = {February}, year = {2017} }
Igor Burago, Marco Levorato, Sameer Singh. -
Generating Natural Adversarial Examples.
NeurIPS Workshop on Machine Deception.
2017
Workshop
Amazon Best Poster Award at the Southern California Machine Learning Symposium.
Shorter version of the paper at ICLR 2018.
[ PDF, ArXiv (full paper), Abstract, BibTex ]Due to their complex nature, it is hard to characterize the ways in which machine learning models can misbehave or be exploited when deployed. Recent work on adversarial examples, i.e. inputs with minor perturbations that result in substantially different model predictions, is helpful in evaluating the robustness of these models by exposing the adversarial scenarios where they fail. However, these malicious perturbations are often unnatural, not semantically meaningful, and not applicable to complicated domains such as language. In this paper, we propose a framework to generate natural and legible adversarial examples by searching in semantic space of dense and continuous data representation, utilizing the recent advances in generative adversarial networks. We present generated adversaries to demonstrate the potential of the proposed approach for black-box classifiers in a wide range of applications such as image classification, textual entailment, and machine translation. We include experiments to show that the generated adversaries are natural, legible to humans, and useful in evaluating and analyzing black-box classifiers.@inproceedings{natadv:mldecept17, author = {Zhengli Zhao and Dheeru Dua and Sameer Singh}, title = { {Generating Natural Adversarial Examples} }, booktitle = {NeurIPS Workshop on Machine Deception}, year = {2017} }
Zhengli Zhao, Dheeru Dua, Sameer Singh.
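A hedged sketch of the latent-space search described in the abstract above: invert the input into a GAN's latent space, sample perturbations of growing radius, and return the first generated sample that changes the black-box prediction. `generator`, `inverter`, and `classifier` are placeholder callables, and the search schedule is a simplification of the paper's method.

```python
import numpy as np

def natural_adversary(x, classifier, generator, inverter,
                      step=0.05, samples_per_radius=64, max_radius=2.0, seed=0):
    rng = np.random.default_rng(seed)
    y_orig = classifier(x)
    z = np.asarray(inverter(x), dtype=float)           # latent code of the original input
    radius = step
    while radius <= max_radius:
        # sample candidate latent codes on a sphere of the current radius around z
        noise = rng.normal(size=(samples_per_radius, z.shape[-1]))
        noise *= radius / np.linalg.norm(noise, axis=1, keepdims=True)
        for z_tilde in z + noise:
            x_tilde = generator(z_tilde)
            if classifier(x_tilde) != y_orig:           # prediction flipped: a "natural" adversary
                return x_tilde
        radius += step                                  # widen the search if nothing flipped
    return None                                         # no adversary found within max_radius
```
-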
How Biased Are We? Automated Detection of Gendered Language.
ACL Workshop on Women and Underrepresented Minorities in NLP (WiNLP).
2017
Workshop
Also presented at the NeurIPS 2017 Workshop for Women in Machine Learning (WiML).
[ PDF, BibTex ]@inproceedings{gender:winlp17, author = {Ananya Ananya and Sameer Singh}, title = { {How Biased Are We? Automated Detection of Gendered Language} }, booktitle = {ACL Workshop on Women and Underrepresented Minorities in NLP (WiNLP)}, month = {August}, year = {2017} }
Ananya, Sameer Singh. -
Compact Factorization of Matrices Using Generalized Round-Rank.
Southern California Machine Learning Symposium.
2017
Workshop
[ PDF, BibTex ]@inproceedings{grank:southcal17, author = {Pouya Pezeshkpour and Carlos Guestrin and Sameer Singh}, title = { {Compact Factorization of Matrices Using Generalized Round-Rank} }, booktitle = {Southern California Machine Learning Symposium}, year = {2017} }
Pouya Pezeshkpour, Carlos Guestrin, Sameer Singh. -
Embedding Multimodal Relational Data.
Workshop on Automated Knowledge Base Construction (AKBC).
2017
Workshop
[ PDF, BibTex ]@inproceedings{mmkbe:akbc17, author = {Pouya Pezeshkpour and Liyan Chen and Sameer Singh}, title = { {Embedding Multimodal Relational Data} }, booktitle = {Workshop on Automated Knowledge Base Construction (AKBC)}, year = {2017} }
Pouya Pezeshkpour, Liyan Chen, Sameer Singh. -
Multimodal Attribute Extraction.
Workshop on Automated Knowledge Base Construction (AKBC).
2017
Workshop
[ PDF, BibTex ]@inproceedings{maed:akbc17, author = {Robert L. Logan IV and Samuel Humeau and Sameer Singh}, title = { {Multimodal Attribute Extraction} }, booktitle = {Workshop on Automated Knowledge Base Construction (AKBC)}, year = {2017} }
Robert L. Logan IV, Samuel Humeau, Sameer Singh. -
Relational Learning and Feature Extraction by Querying over Heterogeneous Information Networks.
International Workshop on Statistical Relational AI (StarAI).
2017
Workshop
[ PDF, ArXiv version, Abstract, BibTex ]Many real-world systems need to operate on heterogeneous information networks that consist of numerous interacting components of different types. Examples include systems that perform data analysis on biological information networks; social networks; and information extraction systems processing unstructured data to convert raw text to knowledge graphs. Many previous works describe specialized approaches to perform specific types of analysis, mining and learning on such networks. In this work, we propose a unified framework consisting of a data model (a graph with a first-order schema) along with a declarative language for constructing, querying and manipulating such networks in ways that facilitate relational and structured machine learning. In particular, we provide an initial prototype for a relational and graph traversal query language where queries are directly used as relational features for structured machine learning models. Feature extraction is performed by making declarative graph traversal queries. Learning and inference models can directly operate on this relational representation and augment it with new data and knowledge that, in turn, is integrated seamlessly into the relational structure to support new predictions. We demonstrate this system's capabilities by showcasing tasks in natural language processing and computational biology domains.@inproceedings{saul:starai17, author = {Parisa Kordjamshidi and Sameer Singh and Daniel Khashabi and Christos Christodoulopoulos and Mark Sammons and Saurabh Sinha and Dan Roth}, title = { {Relational Learning and Feature Extraction by Querying over Heterogeneous Information Networks} }, booktitle = {International Workshop on Statistical Relational AI (StarAI)}, month = {July}, year = {2017} }
Parisa Kordjamshidi, Sameer Singh, Daniel Khashabi, Christos Christodoulopoulos, Mark Sammons, Saurabh Sinha, Dan Roth.
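An illustrative Python analogue of the idea in the entry above (the actual Saul system is a Scala DSL): graph-traversal queries over a typed information network double as relational feature functions for a downstream learner. The toy graph, node names, and feature names are made up.

```python
import networkx as nx

# toy typed information network (contents are made up)
g = nx.MultiDiGraph()
g.add_node("doc1", kind="document")
g.add_node("Marie Curie", kind="entity")
g.add_node("physicist", kind="type")
g.add_edge("doc1", "Marie Curie", rel="mentions")
g.add_edge("Marie Curie", "physicist", rel="instance_of")

def follow(graph, node, rel):
    """Traversal 'query': follow edges labelled `rel` out of `node`."""
    return [v for _, v, d in graph.out_edges(node, data=True) if d["rel"] == rel]

def doc_features(graph, doc):
    """Queries double as relational features for a downstream structured learner."""
    mentioned = follow(graph, doc, "mentions")
    types = [t for e in mentioned for t in follow(graph, e, "instance_of")]
    return {"n_mentions": len(mentioned), "mentions_physicist": "physicist" in types}

print(doc_features(g, "doc1"))   # {'n_mentions': 1, 'mentions_physicist': True}
```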
-
Better call Saul: Flexible Programming for Learning and Inference in NLP.
International Conference on Computational Linguistics (COLING).
2016
Conference
[ PDF, ACL Anthology, BibTex ]@inproceedings{saul:coling16, author = {Parisa Kordjamshidi and Daniel Khashabi and Christos Christodoulopoulos and Bhargav Mangipudi and Sameer Singh and Dan Roth}, title = { {Better call Saul: Flexible Programming for Learning and Inference in NLP} }, booktitle = {International Conference on Computational Linguistics (COLING)}, month = {December}, pages = {3030-3040}, year = {2016} }
Parisa Kordjamshidi, Daniel Khashabi, Christos Christodoulopoulos, Bhargav Mangipudi, Sameer Singh, Dan Roth. -
Connotation Frames: A Data-Driven Investigation.
Association for Computational Linguistics (ACL).
2016
Conference
[ PDF, arXiv, Website, ACL Anthology, BibTex ]@inproceedings{connot:acl16, author = {Hannah Rashkin and Sameer Singh and Yejin Choi}, title = { {Connotation Frames: A Data-Driven Investigation} }, booktitle = {Association for Computational Linguistics (ACL)}, month = {August}, doi = {10.18653/v1/P16-1030}, pages = {311-321}, year = {2016} }
Hannah Rashkin, Sameer Singh, Yejin Choi. -
"Why Should I Trust You?": Explaining the Predictions of Any Classifier.
Knowledge Discovery and Data Mining (KDD).
2016
Conference
Audience Appreciation Award
Also presented at the CHI 2016 Workshop on Human-Centred Machine Learning (HCML).
[ PDF, arXiv, Code, Video, O'Reilly, Code (experiments), ACM Page, BibTex ]@inproceedings{lime:kdd16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {"Why Should I Trust You?": Explaining the Predictions of Any Classifier} }, booktitle = {Knowledge Discovery and Data Mining (KDD)}, month = {August}, doi = {10.1145/2939672.2939778}, pages = {1135-1144}, year = {2016} }
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. -
"Why Should I Trust You?": Explaining the Predictions of Any Classifier.
Demo at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL).
2016
Demo
Demonstration of the KDD 2016 paper.
[ PDF, Code, BibTex ]@inproceedings{lime:naacl16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {"Why Should I Trust You?": Explaining the Predictions of Any Classifier} }, booktitle = {Demo at the Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)}, month = {June}, year = {2016} }
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. -
Introduction to Local Interpretable Model-Agnostic Explanations (LIME).
O'Reilly Media.
2016
Online
[ Article, BibTex ]@misc{lime:oreilly16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {Introduction to Local Interpretable Model-Agnostic Explanations (LIME)} }, editor = {O'Reilly Media}, month = {August}, url = {https://www.oreilly.com/learning/introduction-to-local-interpretable-model-agnostic-explanations-lime}, year = {2016} }
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. -
Programs as Black-Box Explanations.
NeurIPS Workshop on Interpretable Machine Learning in Complex Systems.
2016
Workshop
[ PDF, arXiv, Abstract, BibTex ]Recent work in model-agnostic explanations of black-box machine learning has demonstrated that interpretability of complex models does not have to come at the cost of accuracy or model flexibility. However, it is not clear what kind of explanations, such as linear models, decision trees, and rule lists, are the appropriate family to consider, and different tasks and models may benefit from different kinds of explanations. Instead of picking a single family of representations, in this work we propose to use "programs" as model-agnostic explanations. We show that small programs can be expressive yet intuitive as explanations, and generalize over a number of existing interpretable families. We propose a prototype program induction method based on simulated annealing that approximates the local behavior of black-box classifiers around a specific prediction using random perturbations. Finally, we present preliminary application on small datasets and show that the generated explanations are intuitive and accurate for a number of classifiers.@inproceedings{prog:nipsws16, author = {Sameer Singh and Marco Tulio Ribeiro and Carlos Guestrin}, title = { {Programs as Black-Box Explanations} }, booktitle = {NeurIPS Workshop on Interpretable Machine Learning in Complex Systems}, month = {November}, year = {2016} }
Sameer Singh, Marco Tulio Ribeiro, Carlos Guestrin.
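A hedged sketch of the prototype described in the abstract above: label random perturbations around one instance with the black-box model, then use simulated annealing to search for a tiny "program" that matches those labels. Representing the program as an ordered list of threshold rules is my own simplification, not the paper's program space.

```python
import math
import random

def explain_with_program(x, black_box, n_perturb=200, steps=500, temp=1.0, seed=0):
    rnd = random.Random(seed)
    # local neighbourhood: random perturbations around x, labelled by the black box
    X = [[xi + rnd.gauss(0, 0.5) for xi in x] for _ in range(n_perturb)]
    y = [black_box(p) for p in X]
    labels = sorted(set(y))

    def run(program, p):               # first matching rule wins; fall back to labels[0]
        for feat, thresh, label in program:
            if p[feat] > thresh:
                return label
        return labels[0]

    def fidelity(program):             # agreement with the black box on the neighbourhood
        return sum(run(program, p) == yi for p, yi in zip(X, y)) / n_perturb

    def mutate(program):               # randomly drop rules and append a fresh one
        prog = [r for r in program if rnd.random() > 0.2]
        prog.append((rnd.randrange(len(x)), rnd.gauss(0, 1), rnd.choice(labels)))
        return prog

    best = current = []
    for t in range(1, steps + 1):
        candidate = mutate(current)
        delta = fidelity(candidate) - fidelity(current)
        if delta >= 0 or rnd.random() < math.exp(delta * t / temp):   # annealing acceptance
            current = candidate
        if fidelity(current) > fidelity(best):
            best = current
    return best
```
-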
Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance.
NeurIPS Workshop on Interpretable Machine Learning in Complex Systems.
2016
Workshop
[ PDF, arXiv, Abstract, BibTex ]At the core of interpretable machine learning is the question of whether humans are able to make accurate predictions about a model's behavior. Assumed in this question are three properties of the interpretable output: coverage, precision, and effort. Coverage refers to how often humans think they can predict the model's behavior, precision to how accurate humans are in those predictions, and effort is either the up-front effort required in interpreting the model, or the effort required to make predictions about a model's behavior.
In this work, we propose anchor-LIME (aLIME), a model-agnostic technique that produces high-precision rule-based explanations for which the coverage boundaries are very clear. We compare aLIME to linear LIME with simulated experiments, and demonstrate the flexibility of aLIME with qualitative examples from a variety of domains and tasks.@inproceedings{anchor:nipsws16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {Nothing Else Matters: Model-Agnostic Explanations By Identifying Prediction Invariance} }, booktitle = {NeurIPS Workshop on Interpretable Machine Learning in Complex Systems}, month = {November}, year = {2016} }
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin.
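A small sketch of the two quantities the abstract above centres on: for a candidate anchor (a set of pinned feature conditions), precision is how often the black box keeps its original prediction when the anchor is enforced on perturbed samples, and coverage is how often the anchor holds under perturbation at all. The sampling scheme and function names are simplifications, not the aLIME algorithm itself.

```python
import random

def anchor_stats(x, anchor, black_box, perturb, n=1000, seed=0):
    """`anchor` maps feature index -> pinned value; `perturb(x, rnd)` returns a perturbed copy of x."""
    rnd = random.Random(seed)
    y = black_box(x)
    agree = applies = 0
    for _ in range(n):
        z = list(perturb(x, rnd))                     # perturbed neighbour of x
        if all(z[i] == v for i, v in anchor.items()):
            applies += 1                              # anchor covers this sample as drawn
        for i, v in anchor.items():                   # now enforce the anchor's conditions
            z[i] = v
        agree += black_box(z) == y                    # did the prediction stay the same?
    return agree / n, applies / n                     # (precision, coverage)
```
-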
"Why Should I Trust You?": Explaining the Predictions of Any Classifier.
CHI Workshop on Human-Centred Machine Learning (HCML).
2016
Workshop
Shorter version of the paper presented at KDD 2016.
[ PDF, BibTex ]@inproceedings{lime:hcml16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {"Why Should I Trust You?": Explaining the Predictions of Any Classifier} }, booktitle = {CHI Workshop on Human-Centred Machine Learning (HCML)}, month = {May}, year = {2016} }
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. -
Model-Agnostic Interpretability of Machine Learning.
ICML Workshop on Human Interpretability in Machine Learning (WHI).
2016
Workshop
Best Paper Award
[ PDF, BibTex ]@inproceedings{lime:whi16, author = {Marco Tulio Ribeiro and Sameer Singh and Carlos Guestrin}, title = { {Model-Agnostic Interpretability of Machine Learning} }, booktitle = {ICML Workshop on Human Interpretability in Machine Learning (WHI)}, month = {June}, year = {2016} }
Marco Tulio Ribeiro, Sameer Singh, Carlos Guestrin. -
Creating Interactive and Visual Educational Resources for AI.
AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI).
2016
Workshop
[ PDF, AAAI Page, Abstract, BibTex ]Teaching artificial intelligence is effective if the experience is a visual and interactive one, with educational materials that utilize combinations of various content types such as text, math, and code into an integrated experience. Unfortunately, easy-to-use tools for creating such pedagogical resources are not available to the educators, resulting in most courses being taught using a disconnected set of static materials, which is not only ineffective for learning AI, but further, requires repeated and redundant effort for the instructor. In this paper, we introduce Moro, a software tool for easily creating and presenting AI-friendly teaching materials. Moro notebooks integrate content of different types (text, math, code, images), allow real-time interactions via modifiable and executable code blocks, and are viewable in browsers both as long-form pages and as presentations. Creating notebooks is easy and intuitive; the creation tool is also in-browser, is WYSIWYG for quick iterations of editing, and supports a variety of shortcuts and customizations for efficiency. We present three deployed case studies of Moro that widely differ from each other, demonstrating its utility in a variety of scenarios such as in-class teaching and conference tutorials.@inproceedings{moro:eaai16, author = {Sameer Singh and Sebastian Riedel}, title = { {Creating Interactive and Visual Educational Resources for {AI}} }, booktitle = {AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI)}, year = {2016} }
Sameer Singh, Sebastian Riedel.