A Year in Research at reciTAL

2020 has been, no doubt, one of the weirdest years in recent history. 

Putting aside the challenges brought about by a global pandemic, here is a wrap-up of the research results obtained in 2020 by reciTAL.

Throughout the year, reciTAL researchers have seen their work accepted at eight venues, including flagship AI/NLP conferences (NeurIPS, ICML, EMNLP), which are among the most competitive and prestigious venues in computer science (see impact ranking for AI and Computational Linguistics).

We have extended our ties with world-renowned international research centers: besides our established relationship with the LIP6-Sorbonne laboratory, we have worked with fellow researchers from New York University (US), FBK (Italy), University of Tours, Sciences Po, INRIA Paris, and Etalab. We have also launched an AI Chair with ESILV Paris.

Our research efforts have focused on two main axes: Natural Language Generation (NLG) and the development of multilingual large-scale datasets.

NLG is getting a lot of media attention, with the release of large models such as GPT-3 being met with a mixture of excitement and criticism.

On our side, we focused on concrete use cases such as document summarization, with particular attention to the quality of the generated texts in terms of factuality, consistency, and relevance.

At NeurIPS, we presented a study in which we investigated why text generation with GANs has fallen short so far, and proposed a training methodology that allows Language GANs to outperform, for the first time, models trained via Maximum Likelihood Estimation.

At ICML, we tackled the problem of exposure bias in text generation models: we proposed Discriminative Adversarial Search as an alternative to the commonly used Beam Search, and showed how our approach significantly improves the quality of the generated texts.

Furthermore, we worked on providing the research community with datasets in non-English languages: at LREC we presented the PIAF dataset for French Question Answering, in collaboration with Etalab, while at EMNLP we published the first large-scale multilingual dataset for summarization, comprising circa 1.5M articles in five languages (French, German, Spanish, Russian, Turkish).

What lies ahead in AI

Natural Language Processing has seen important progress in the last three years, thanks to a new deep learning architecture, the Transformer, which set a new state of the art on most NLP tasks, including Question Answering and Natural Language Generation. In 2020, we started to see the Transformer succeed in vision and audio as well, improving over commonly used architectures such as Convolutional Neural Networks.

2021 will probably be the year of universal models that can deal with multiple modalities.

What lies ahead for reciTAL research

We plan to focus our research effort on Document Intelligence and Robust AI in 2021. 

Most NLP work so far is restricted to text-only datasets. In contrast, real-life NLP means handling rich formats such as PDF, MS Office, or scanned documents. To deal with these efficiently, a multimodal approach is necessary: the models should not only consider the raw text, but also take into account additional information such as images, table structures, and visual layout. Moreover, we will focus on making our models more robust so that they generalize better across domains, and we will continue to improve the transparency of our end-to-end pipelines and give our customers more control.

These goals can be seen as the business-oriented side of the current trends of multimodality and robustness in the AI research community.

To these ends, we have financed two additional PhD students, who join Thomas Scialom in pursuing doctoral studies: a big welcome to Laura Nguyen (LIP6-Sorbonne) and Gregor Jouet (ESILV Paris – University of Tours)! We look forward to integrating this cutting-edge work into reciTAL products.

The complete list of papers we published in 2020 follows:

NeurIPS 2020

ColdGANs: Taming Language GANs with Cautious Sampling Strategies

T Scialom, PA Dray, S Lamprier, B Piwowarski, J Staiano

ICML 2020

Discriminative Adversarial Search for Abstractive Summarization

T Scialom, PA Dray, S Lamprier, B Piwowarski, J Staiano

EMNLP 2020

MLSUM: The Multilingual Summarization Corpus

T Scialom, PA Dray, S Lamprier, B Piwowarski, J Staiano

Findings of EMNLP 2020

Toward Stance-based Personas for Opinionated Dialogues

T Scialom, SS Tekiroglu, J Staiano, M Guerini

INLG 2020

What BERT Sees: Cross-Modal Transfer for Visual Question Generation

T Scialom, P Bordes, PA Dray, J Staiano, P Gallinari

COLING 2020

Ask to Learn: A Study on Curiosity-driven Question Generation

T Scialom and J Staiano

LREC 2020

Project PIAF: Building a Native French Question-Answering Dataset

R Keraron, G Lancrenon, M Bras, F Allary, G Moyse, T Scialom, EP Soriano-Morales, J Staiano

ASONAM 2020

Your Most Telling Friends: Propagating Latent Ideological Features on Twitter Using Neighborhood Coherence

P Ramaciotti Morales, JP Cointet, J Laborde