“In an effort to democratise research on large-scale alignment, we release OpenAssistant Conversations, a human-generated, human-annotated assistant-style conversation corpus containing 161,443 messages distributed across 66,497 conversation trees, in 35 different languages, and annotated with 461,291 quality ratings,” stated the research paper.
The dataset is the product of a worldwide crowdsourcing effort by over 13,000 volunteers.
Crowdsourcing proved an effective way to gather training data in many languages, and it contributed to the quality of the resulting dataset.