Crowdsourcing of parallel corpora: The case of style transfer for detoxification

Daryna Dementieva, Sergey Ustyantsev, David Dale, Olga Kozlova, Nikita Semenov, Alexander Panchenko, Varvara Logacheva

Research output: Contribution to journal › Conference article › peer-review


One way to fight toxicity online is to automatically rewrite toxic messages. This is a sequence-to-sequence task, and the easiest way of solving it is to train an encoder-decoder model on a set of parallel sentences (pairs of sentences with the same meaning, where one is offensive and the other is not). However, such data does not exist, making researchers resort to non-parallel corpora. We close this gap by suggesting a crowdsourcing scenario for creating a parallel dataset of detoxifying paraphrases. In our first experiments, we collect paraphrases for 1,200 toxic sentences. We describe and analyse the crowdsourcing setup and the resulting corpus.

Original language: English
Pages (from-to): 35-49
Number of pages: 15
Journal: CEUR Workshop Proceedings
Publication status: Published - 2021
Event: 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, CSW 2021 - Copenhagen, Denmark
Duration: 20 Aug 2021 → …


  • Crowdsourcing
  • Dataset
  • Parallel data
  • Toxicity

