The NLP revolution has seen machines learning to speak English and Chinese, but not much French. reciTAL has partnered with Etalab to build PIAF, the first French-language question answering dataset.
The French language shouldn’t be missing out on the NLP revolution. The tech giants known as GAFA and BATX have invested heavily in Natural Language Processing (NLP) research and development, and new applications are emerging in English and Chinese.
“Multilingual” models such as multilingual BERT are trained on around 100 languages, but English is still their “native tongue” because it made up such a large share of the initial training data. Other well-known models like RoBERTa and GPT-2 were trained mostly on English text.
When reciTAL was founded, we were quick to identify the risk to the French-speaking world and alerted several institutions.
Etalab responded to our call for action and partnered with us to work on the scientific and methodological aspects of the first French-language question answering dataset, based on SQuAD (Stanford Question Answering Dataset).
These datasets help train deep neural networks to answer natural language questions from a segment of text.
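To make the format concrete, here is a minimal sketch of a SQuAD-style record in Python. The nesting (paragraphs, context, qas, answers, answer_start) follows the published SQuAD v1.1 JSON layout; the French sentence, the identifier and the helper names `iter_examples` and `answer_is_aligned` are invented for illustration and are not taken from PIAF.

```python
# Illustrative SQuAD-style record. The text and id are made up for this
# sketch; they do not come from the PIAF dataset.
context = "La capitale de la France est Paris."
answer_text = "Paris"

record = {
    "title": "France",
    "paragraphs": [
        {
            "context": context,
            "qas": [
                {
                    "id": "exemple-1",
                    "question": "Quelle est la capitale de la France ?",
                    "answers": [
                        {
                            "text": answer_text,
                            # Character offset of the answer inside the context.
                            "answer_start": context.index(answer_text),
                        }
                    ],
                }
            ],
        }
    ],
}


def iter_examples(article):
    """Yield (context, question, answer text, answer start) tuples."""
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            for answer in qa["answers"]:
                yield (
                    paragraph["context"],
                    qa["question"],
                    answer["text"],
                    answer["answer_start"],
                )


def answer_is_aligned(context, text, start):
    """Check that the annotated offset really points at the answer text."""
    return context[start:start + len(text)] == text


examples = list(iter_examples(record))
for ctx, question, text, start in examples:
    print(question, "->", text, "| aligned:", answer_is_aligned(ctx, text, start))
```

A model trained on such records learns to predict the answer span within the context, which is why checking that each offset matches its answer text is a common sanity test before training.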
We have now created the first version of the question answering dataset with thousands of questions and answers in French and published our initial research findings (LREC paper).
Our work has confirmed that, for optimum performance, models need to be trained on datasets built specifically for each language.