In summary: I will show which miRNA mapping tool performs best. I used several options for this benchmark:
- miraligner from the SeqBuster suite (I am one of the authors)
- bowtie2 and bowtie
- novoalign from the Novocraft suite
I think these are the most widely used tools, plus a few less common ones that are worth trying. Most of them were clearly developed for other purposes, but they still generate the input for many miRNA pipelines. I just wanted to know how well my tool was doing. The original aim of miraligner was to annotate nucleotide additions at the end of miRNA sequences. These additions are very common in miRNA biogenesis (they give rise to isomiRs), and they are often missed by fast short-read mappers. I have a repository for this kind of thing, so anybody can reproduce my results, check whether I did something wrong, or comment on it. In this post I just want to know which tool detects the most miRNAs. For that I did two main steps:
- simulate a bunch of miRNAs (isomiRs) with my Python script, which is part of the SeqBuster suite. I generated around 10,000 sequences. For comparison, a typical small RNA-seq library produces around half a million different sequences.
- use only miRBase human precursors as reference genome
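To give an idea of what the simulation step involves, here is a minimal sketch. It is not the SeqBuster script itself: the function name, the parameters, the coordinate convention and the toy precursor are all assumptions made for illustration. It shifts the 5' and 3' ends of a mature miRNA inside its precursor, optionally introduces one substitution, and appends a few random nucleotides at the 3' end.

```python
import random

# Illustrative precursor-like sequence; do not treat it as a verified miRBase entry.
TOY_PRECURSOR = "TGGGATGAGGTAGTAGGTTGTATAGTTTTAGGGTCACACCCACCACTGGGAGATAACTAT"


def simulate_isomir(precursor, start, end, rng, max_trim=2, max_add=2, sub_rate=0.3):
    """Return (sequence, description) for one simulated isomiR.

    start/end are 0-based, end-exclusive coordinates of the mature miRNA
    inside the precursor (a convention assumed for this sketch).
    """
    # Shift both ends by up to max_trim nt, staying inside the precursor.
    s = max(0, start + rng.randint(-max_trim, max_trim))
    e = min(len(precursor), end + rng.randint(-max_trim, max_trim))
    if e <= s:
        raise ValueError("mature region is too short for the requested trimming")
    seq = list(precursor[s:e])

    # Optionally introduce a single nucleotide substitution.
    sub = None
    if rng.random() < sub_rate:
        pos = rng.randrange(len(seq))
        new = rng.choice([n for n in "ACGT" if n != seq[pos]])
        sub = (pos, seq[pos], new)
        seq[pos] = new

    # Append 0..max_add random nucleotides at the 3' end.
    add = "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_add)))
    return "".join(seq) + add, {"start": s, "end": e, "sub": sub, "add": add}


rng = random.Random(0)  # fixed seed so the simulation is reproducible
simulated = [simulate_isomir(TOY_PRECURSOR, 5, 27, rng) for _ in range(5)]
for sequence, description in simulated:
    print(sequence, description)
```

Keeping the description next to each sequence means the truth is known for every simulated read, which is what makes it possible to score the mappers afterwards.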
I used default parameters for all tools, so there is probably a set of parameters that would work better for some of them, but I didn’t search for it this time (happy to accept issues to add them). The results are:
- miraligner and STAR map the most sequences. miraligner loses sequences shorter than 15 nt; miRNAs are normally around 21 nt, and such short sequences tend to map to repeat elements
- STAR gives the best mapping, but its output needs some parsing to reduce false positives. I think that pipelines should switch to this tool.
- GEM has problems with additions and nucleotide substitutions in many cases, and so does novoalign (I will look at this in the future)
- bowtie2/bowtie come second in the number of sequences annotated (and in annotation quality)
- microrazer has a problem with mismatches, but maybe there is a parameter that fixes this
- miRExpress with default options maps only sequences that match the precursor perfectly, so I strongly recommend allowing errors to increase sensitivity.
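For the tools that write SAM output, the "how many simulated sequences were mapped" comparison can be sketched roughly as below. This is only an illustration, not the code in my repository: the function names and the toy SAM lines are made up, and tools with their own output format (miraligner, for example) would need a different parser. The only thing it relies on is the SAM convention that bit 0x4 of the FLAG column marks an unmapped read.

```python
def mapped_read_names(sam_lines):
    """Collect the names of reads reported as mapped in SAM-formatted lines."""
    names = set()
    for line in sam_lines:
        # Skip blank lines and header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # Bit 0x4 set in FLAG means the read is unmapped.
        if not flag & 0x4:
            names.add(name)
    return names


def fraction_mapped(simulated_names, mapped_names):
    """Fraction of simulated reads that the tool mapped somewhere."""
    simulated = set(simulated_names)
    if not simulated:
        return 0.0
    return len(simulated & mapped_names) / len(simulated)


# Toy SAM content (hypothetical reads and references), one header plus three reads.
toy_sam = [
    "@HD\tVN:1.0\n",
    "read1\t0\tprecursorA\t6\t255\t22M\t*\t0\t0\t*\t*\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*\n",
    "read3\t16\tprecursorB\t8\t255\t22M\t*\t0\t0\t*\t*\n",
]
mapped = mapped_read_names(toy_sam)
print(sorted(mapped), fraction_mapped(["read1", "read2", "read3"], mapped))
```

Note that this only counts whether a read was mapped at all. Whether it was mapped to the right miRNA is a separate question.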
I have to say that the advantage of miraligner is that it annotates each sequence as miRNA or precursor, and reports the exact modifications each sequence has compared to the miRBase database. You can then feed the results to my R package to plot the isomiR distribution of samples from different groups, and do differential expression analysis with DESeq2 or another tool. I didn’t include running times because all of the tools finish in a couple of minutes.

In my next post I will focus on the correct annotation of each sequence, and on the possible problems with cross-mapping events, when a sequence comes from another region of the genome but maps to a miRBase precursor as well [see my previous post for more details]. I will also use STAR with the full genome to see whether its mapping is still the best. In that case I will add a script to SeqBuster to parse the STAR output, for those who can run STAR (it needs up to 32 GB of memory for the human genome).
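As a footnote on what "annotating an addition" means, here is a small sketch of the idea. It is not how miraligner is implemented: the function name, the `min_match` threshold and the toy sequences are assumptions. It looks for the longest prefix of a read that matches the precursor exactly and reports the remaining 3' nucleotides as a candidate addition.

```python
def split_three_prime_addition(read, precursor, min_match=15):
    """Split a read into a templated part and a candidate 3' addition.

    Returns (position, templated, addition), or None when no prefix of at
    least min_match nt matches the precursor exactly.
    """
    # Try the longest prefix first and shorten it one nucleotide at a time.
    for cut in range(len(read), min_match - 1, -1):
        pos = precursor.find(read[:cut])
        if pos != -1:
            return pos, read[:cut], read[cut:]
    return None


# Toy example: a 22 nt templated stretch followed by two added nucleotides.
toy_precursor = "GGCTGAGGTAGTAGGTTGTATAGTTTTAGGG"
toy_read = toy_precursor[4:26] + "CC"
print(split_three_prime_addition(toy_read, toy_precursor))
```

This sketch ignores mismatches inside the read, and an added nucleotide that happens to match the precursor cannot be told apart from a templated one. It is only meant to show why a mapper that requires end-to-end matches can miss this kind of read.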