PlumX Metrics

SPRISS: Approximating frequent k-mers by sampling reads, and applications

Bioinformatics, ISSN: 1460-2059, Vol: 38, Issue: 13, Page: 3343-3350
2022
  • Citations: 1
  • Usage: 0
  • Captures: 13
  • Mentions: 0
  • Social Media: 0

Article Description

Motivation: The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including read classification in genomics and the characterization of RNA-seq datasets. Extracting all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the number of possible k-mers, which grows exponentially with k. However, several applications require only the frequent k-mers, that is, the k-mers appearing in a relatively high proportion of the data.

Results: In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful read-sampling scheme that extracts a representative subset of the dataset. This subset can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required to analyze the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping.
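The general idea of approximating frequent k-mers from a sample of reads can be illustrated with a minimal sketch. This is not the SPRISS algorithm: the abstract does not describe its sampling scheme or how the sample size is chosen, so the sketch assumes plain uniform sampling of reads with replacement and a caller-supplied sample size. The function names `kmers` and `approx_frequent_kmers`, and the parameters `theta` (frequency threshold), `sample_size`, and `seed`, are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def kmers(read, k):
    """Yield every k-mer (substring of length k) of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def approx_frequent_kmers(reads, k, theta, sample_size, seed=0):
    """Estimate the k-mers whose relative frequency is at least theta.

    Assumption (not taken from the paper): reads are drawn uniformly at
    random with replacement, and k-mers are counted only in that sample.
    """
    rng = random.Random(seed)
    sample = [rng.choice(reads) for _ in range(sample_size)]

    # Any exact k-mer counter could be substituted for this Counter.
    counts = Counter(km for read in sample for km in kmers(read, k))
    total = sum(counts.values())
    if total == 0:
        return {}

    # Relative frequency = occurrences / total k-mer positions in the sample.
    return {km: c / total for km, c in counts.items() if c / total >= theta}
```

Calling `approx_frequent_kmers(reads, k=4, theta=0.05, sample_size=200)` on a list of read strings returns a dictionary mapping each reported k-mer to its estimated frequency. How large the sample must be for the estimates to be trustworthy is exactly the question a method such as SPRISS addresses, and this sketch makes no claim about it.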

Bibliographic Details

Diego Santoro; Leonardo Pellegrina; Matteo Comin; Fabio Vandin; Can Alkan

Oxford University Press (OUP)

Mathematics; Biochemistry, Genetics and Molecular Biology; Computer Science
