MEETING: a French corpus for studying how people actually talk in meetings

Meetings are a strange kind of conversation. They are not casual chats, but they are not scripted speech either. People report information, negotiate decisions, plan actions, interrupt each other, go off-topic, repair misunderstandings, and somehow still try to produce something useful by the end. For NLP and dialogue research, that makes meetings extremely valuable to study and extremely annoying to collect.

This paper introduces MEETING, also circulated as SUMM-RE, a corpus of roughly 95 hours of spontaneous French meeting-style conversations. The goal is simple but important: create a substantial French resource for studying and modelling multi-party spoken interaction, especially for downstream tasks such as meeting summarization.

The corpus fills a real gap. Classic meeting corpora such as AMI and ICSI have been central for English meeting research, but comparable French resources have been much thinner. MEETING is designed to bring French into that space: not just as translated text, but as real spoken interaction with its own timing, interruptions, discourse structure, and conversational messiness.

The conversations are meeting-style rather than fully natural workplace meetings. Participants were given tasks that imitate features of real meetings (reporting information, decision-making, and planning) while still leaving room for spontaneous interaction. That places the corpus between controlled experimental data and completely uncontrolled real-world meetings: structured enough to compare across sessions, but open enough to contain genuine dialogue dynamics.

A key strength is the transcription layer. In its current form, the corpus includes around 25 hours of manually corrected transcripts aligned with the audio, plus automatic transcripts and alignments for the full corpus. That combination makes it useful in two ways: the manually corrected portion can support evaluation of ASR, speaker recognition, and alignment systems, while the larger automatically processed portion can be used for broader NLP experiments.
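
To make the ASR evaluation use case concrete, here is a minimal sketch of word error rate (WER), the usual metric for scoring an automatic transcript against a manually corrected one. It is an illustration, not the paper's evaluation pipeline: the function name word_error_rate is hypothetical, the French sentences are made up rather than taken from the corpus, and the normalisation (lowercasing and whitespace splitting) is a deliberately naive assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is normalised only by lowercasing and splitting on
    whitespace. Real evaluations usually also handle punctuation,
    disfluencies, and other transcription conventions.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of four reference words.
manual = "bonjour tout le monde"
automatic = "bonjour tous le monde"
print(word_error_rate(manual, automatic))
```

In practice, one would compute this over the manually corrected subset only, treating the manual transcript as the reference and the automatic transcript of the same audio as the hypothesis.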

In other words, this is not just some recordings plus text. It is infrastructure. It gives researchers a way to work on French meeting summarization, spoken dialogue modelling, discourse segmentation, ASR evaluation, speaker behaviour, and the linguistic structure of multi-party conversation.

The broader point is that meeting summarization is only as good as the data underneath it. Written-document summarization is already hard; spoken meetings add overlapping turns, false starts, incomplete sentences, speaker changes, and long-range discourse structure. MEETING matters because it gives French NLP a corpus where these problems can be studied directly, instead of being imported awkwardly from English datasets.

So the punchline is: MEETING makes French meeting speech usable as research material. It is a corpus paper, yes, but the kind of corpus paper that quietly enables a whole family of downstream studies.