2020 has been, no doubt, one of the weirdest years in recent history.
Putting aside the challenges brought about by a global pandemic, here is a wrap-up of the research results obtained in 2020 by reciTAL.
Throughout the year, reciTAL researchers have had their work accepted at 8 venues, including flagship AI/NLP conferences (NeurIPS, ICML, EMNLP), which are among the most competitive and prestigious venues in computer science (see the impact rankings for AI and Computational Linguistics).
We have extended our ties with world-renowned international research centers: besides our established relationship with the LIP6-Sorbonne laboratory, we have worked with fellow researchers from New York University (US), FBK (Italy), the University of Tours, Sciences Po, INRIA Paris, and Etalab. We have also activated an AI Chair with ESILV Paris.
Our research efforts have focused on two main axes: Natural Language Generation (NLG) and the development of large-scale multilingual datasets.
NLG is getting a lot of media attention, with the release of large models such as GPT-3 being met with a mixture of excitement and criticism.
On our side, we focused on concrete use cases such as document summarization, with particular attention to the quality of the generated texts in terms of factuality, consistency, and relevance.
At NeurIPS, we presented a study investigating why text generation with GANs has fallen short so far, and proposed a training methodology that allows Language GANs to outperform, for the first time, models trained via Maximum Likelihood Estimation.
At ICML, we tackled the problem of exposure bias in text generation models: we proposed Discriminative Adversarial Search as an alternative to the commonly used Beam Search, and showed how our approach significantly improves the quality of the generated texts.
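To make the baseline concrete, here is a minimal, self-contained sketch of the standard beam search that our approach replaces. The bigram "language model", vocabulary, and probabilities below are invented purely for illustration; this is not the DAS algorithm itself, only the decoding procedure it improves upon:

```python
import math

# Hypothetical bigram "language model": maps a token to the possible
# next tokens and their probabilities (invented for this example).
LM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a":   {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"sat": 0.2, "</s>": 0.8},
    "sat": {"</s>": 1.0},
}

def beam_search(beam_size=2, max_len=5):
    # Each hypothesis is a (log-probability, token sequence) pair.
    beams = [(0.0, ["<s>"])]
    finished = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            # Expand every surviving hypothesis by one token.
            for tok, p in LM[seq[-1]].items():
                hyp = (logp + math.log(p), seq + [tok])
                if tok == "</s>":
                    finished.append(hyp)   # complete sequence
                else:
                    candidates.append(hyp)
        if not candidates:
            beams = []
            break
        # Keep only the beam_size highest-scoring partial sequences.
        beams = sorted(candidates, reverse=True)[:beam_size]
    finished.extend(beams)  # include truncated hypotheses, if any
    return max(finished)    # highest log-probability sequence

best_logp, best_tokens = beam_search()
print(best_tokens)
```

On this toy model, beam search returns `<s> a dog </s>` (probability 0.224), whereas greedy decoding would commit to `the` at the first step and end up with the lower-probability `the cat sat` (0.21): keeping several hypotheses alive lets the search recover from locally attractive but globally suboptimal choices.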
Furthermore, we worked on providing the research community with datasets in non-English languages: at LREC we presented the PIAF dataset for French Question Answering, in collaboration with Etalab, while at EMNLP we published the first large-scale multilingual summarization dataset, comprising around 1.5M articles in five languages (French, German, Spanish, Russian, Turkish).
What lies ahead in AI
Natural Language Processing has seen important progress in the last three years, thanks to a new deep learning architecture, the Transformer, which has set a new state of the art in most NLP tasks, including Question Answering and Natural Language Generation. In 2020, we started to see the Transformer succeed in Vision and Audio as well, improving over commonly used architectures such as Convolutional Neural Networks.
2021 will probably be the year for universal models that can deal with multiple modalities.
What lies ahead for reciTAL research
We plan to focus our research effort on Document Intelligence and Robust AI in 2021.
Most NLP work so far is restricted to text-only datasets. In contrast, real-life NLP means handling rich formats such as PDF, MS Office, or scanned documents. To deal with these efficiently, a multimodal approach is necessary: models should not only consider the raw text, but also take into account additional information such as images, table structures, and visual layout. Moreover, we will focus on making our models more robust, so that they generalize better across domains, and we will continue to improve the transparency of our end-to-end pipelines and provide more control to our customers.
These can be seen as the business-oriented side of the current trends of Multimodality and Robustness in the AI research community.
To these ends, we have financed two additional PhD students, who join Thomas Scialom in pursuing doctoral studies: a big welcome to Laura Nguyen (LIP6-Sorbonne) and Gregor Jouet (ESILV Paris – University of Tours)! We look forward to integrating this cutting-edge work into reciTAL products.
The complete list of papers we published in 2020 follows:
NeurIPS 2020
ColdGANs: Taming Language GANs with Cautious Sampling Strategies
T Scialom, PA Dray, S Lamprier, B Piwowarski, J Staiano
ICML 2020
Discriminative Adversarial Search for Abstractive Summarization
T Scialom, PA Dray, S Lamprier, B Piwowarski, J Staiano
EMNLP 2020
MLSUM: The Multilingual Summarization Corpus
T Scialom, PA Dray, S Lamprier, B Piwowarski, J Staiano
Findings of EMNLP 2020
Toward Stance-based Personas for Opinionated Dialogues
T Scialom, SS Tekiroglu, J Staiano, M Guerini
INLG 2020
What BERT Sees: Cross-Modal Transfer for Visual Question Generation
T Scialom, P Bordes, PA Dray, J Staiano, P Gallinari
COLING 2020
Ask to Learn: A Study on Curiosity-driven Question Generation
T Scialom and J Staiano
LREC 2020
Project PIAF: Building a Native French Question-Answering Dataset
R Keraron, G Lancrenon, M Bras, F Allary, G Moyse, T Scialom, EP Soriano-Morales, J Staiano
ASONAM 2020
P Ramaciotti Morales, JP Cointet, J Laborde