One way to fight toxicity online is to automatically rewrite toxic messages. This is a sequence-to-sequence task, and the simplest way to solve it is to train an encoder-decoder model on a set of parallel sentences (pairs of sentences with the same meaning, where one is offensive and the other is not). However, such data does not exist, which forces researchers to resort to non-parallel corpora. We close this gap by proposing a crowdsourcing scenario for creating a parallel dataset of detoxifying paraphrases. In our first experiments, we collect paraphrases for 1,200 toxic sentences. We describe and analyse the crowdsourcing setup and the resulting corpus.
|Number of pages|15|
|Journal|CEUR Workshop Proceedings|
|Publication status|Published - 2021|
|Event|2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, CSW 2021 - Copenhagen, Denmark|
Duration: 20 Aug 2021 → …
- Parallel data