Efficient Reconciliation of Genomic Datasets of High Similarity

Yoshihiro Shibuya, Djamal Belazzougui, Gregory Kucherov

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originated from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating Jaccard similarity of underlying k-mer sets, compared to MinHash which is a go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that involved data structures require space proportional to the difference of k-mer sets and are independent of the size of sets themselves. As another application, we show how our ideas can be applied in order to efficiently compute (an approximation of) k-mers that differ between two datasets, still using space only proportional to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus Pneumoniae genomes).

Original languageEnglish
Title of host publication22nd International Workshop on Algorithms in Bioinformatics, WABI 2022
EditorsChristina Boucher, Sven Rahmann
PublisherSchloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)9783959772433
DOIs
Publication statusPublished - 1 Sep 2022
Externally publishedYes
Event22nd International Workshop on Algorithms in Bioinformatics, WABI 2022 - Potsdam, Germany
Duration: 5 Sep 20227 Sep 2022

Publication series

NameLeibniz International Proceedings in Informatics, LIPIcs
Volume242
ISSN (Print)1868-8969

Conference

Conference22nd International Workshop on Algorithms in Bioinformatics, WABI 2022
Country/TerritoryGermany
CityPotsdam
Period5/09/227/09/22

Keywords

  • IBLT
  • Invertible Bloom Lookup Tables
  • k-mers
  • MinHash
  • minimizers
  • sketching
  • syncmers

Fingerprint

Dive into the research topics of 'Efficient Reconciliation of Genomic Datasets of High Similarity'. Together they form a unique fingerprint.

Cite this