A study by Pascal Costanza, Charlotte Herzeel, and Wilfried Verachtert on choosing a new implementation language for elPrep. elPrep is an established multi-threaded framework for preparing SAM and BAM files in sequencing pipelines. To achieve good performance, its software architecture makes only a single pass through a SAM/BAM file for multiple preparation steps, and keeps as much sequencing data as possible in main memory.
The sequence alignment/map format (SAM/BAM) is the de facto standard in the bioinformatics community for storing mapped sequencing data.
In most programming languages, there exist more or less similar ways to explicitly or implicitly allocate memory for heap objects which, unlike stack values, are not bound to the lifetimes of function or method invocations. However, programming languages strongly differ in how memory for heap objects is subsequently deallocated.
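As a small illustration of a heap object that outlives the function that created it, here is a minimal Go sketch. The `Counter` type and `NewCounter` function are hypothetical names invented for this example and do not come from elPrep. In Go, deallocation is left to the garbage collector, so there is no explicit free.

```go
package main

import "fmt"

// Counter is a hypothetical type used only for this illustration.
type Counter struct {
	N int
}

// NewCounter returns a pointer to a value that must outlive this call.
// Go's escape analysis typically places such a value on the heap
// instead of the stack frame of NewCounter.
func NewCounter(start int) *Counter {
	c := Counter{N: start}
	return &c
}

func main() {
	c := NewCounter(41)
	c.N++ // the object is still valid after NewCounter has returned
	fmt.Println(c.N)

	// There is no explicit deallocation in Go. Once the object is
	// unreachable, the garbage collector is free to reclaim it.
	c = nil
}
```

In C++17 the same object would typically be owned by a smart pointer or released manually, and in Java it would also be left to a garbage collector.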
The article then describes typical preparation pipeline steps implemented with elPrep’s software architecture in the three selected programming languages (C++17, Go, and Java):
- Sorting reads in coordinate order
- Removing unmapped reads
- Marking duplicate reads
- Replacing read groups
- Reordering and filtering the sequence dictionary
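The single-pass idea behind these steps can be sketched in Go as follows. This is a toy model under stated assumptions, not elPrep's actual code or API: the `Read` struct, the `Filter` type, and all function names are hypothetical, the duplicate criterion (same reference and position) is deliberately simplistic, and the sequence dictionary step is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Read is a hypothetical, heavily simplified stand-in for a SAM alignment.
type Read struct {
	Name      string
	RefID     int // index of the reference sequence
	Pos       int // mapping position on that reference
	Unmapped  bool
	Duplicate bool
	ReadGroup string
}

// Filter may modify a read and reports whether the read should be kept.
type Filter func(r *Read) bool

// RemoveUnmapped drops reads that are flagged as unmapped.
func RemoveUnmapped() Filter {
	return func(r *Read) bool { return !r.Unmapped }
}

// ReplaceReadGroup overwrites the read group of every read.
func ReplaceReadGroup(rg string) Filter {
	return func(r *Read) bool {
		r.ReadGroup = rg
		return true
	}
}

// MarkDuplicates flags every read after the first one seen at the same
// (RefID, Pos). This toy criterion is an assumption made for illustration.
func MarkDuplicates() Filter {
	type key struct{ ref, pos int }
	seen := make(map[key]bool)
	return func(r *Read) bool {
		k := key{r.RefID, r.Pos}
		if seen[k] {
			r.Duplicate = true
		}
		seen[k] = true
		return true
	}
}

// RunPipeline applies all filters to each read in one pass over the input,
// keeps the surviving reads in memory, and then sorts them by coordinate.
func RunPipeline(reads []Read, filters ...Filter) []Read {
	kept := make([]Read, 0, len(reads))
	for i := range reads {
		r := reads[i] // work on a copy so the input slice is not modified
		keep := true
		for _, f := range filters {
			if !f(&r) {
				keep = false
				break
			}
		}
		if keep {
			kept = append(kept, r)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool {
		if kept[i].RefID != kept[j].RefID {
			return kept[i].RefID < kept[j].RefID
		}
		return kept[i].Pos < kept[j].Pos
	})
	return kept
}

func main() {
	reads := []Read{
		{Name: "a", RefID: 1, Pos: 50},
		{Name: "b", Unmapped: true},
		{Name: "c", RefID: 0, Pos: 10},
		{Name: "d", RefID: 1, Pos: 50},
	}
	out := RunPipeline(reads, RemoveUnmapped(), ReplaceReadGroup("rg1"), MarkDuplicates())
	for _, r := range out {
		fmt.Printf("%s ref=%d pos=%d dup=%v rg=%s\n", r.Name, r.RefID, r.Pos, r.Duplicate, r.ReadGroup)
	}
}
```

Composing the steps as filters means each read is touched once while the data stays in memory, which mirrors the design goal described above. A real implementation would also parse SAM/BAM input and distribute the work across threads.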
The Go implementation performs best, yielding the best balance between runtime performance and memory use. While the Java benchmarks report a somewhat faster runtime than the Go benchmarks, the memory use of the Java runs is significantly higher. The C++17 benchmarks run significantly slower than both Go and Java, while using somewhat more memory than the Go runs.
Result charts and a detailed description of the benchmark process are also included. How exciting!