Soohyun Lee
Chae Hwa Seo
Peter J. Park
Sanghyuk Lee
Center for Biomedical Informatics
Harvard Medical School
Boston, MA, USA.
DNA Link, Inc.
Seoul, Korea

EMSAR is a C program that quantifies RNA abundance (gene & RNA isoform expression levels) from RNA-seq data. It takes a transcriptome reference fasta file and an alignment file (BAM/SAM/default bowtie output), and generates RNA abundance estimates in FPKM, TPM and inferred read count.

Download »


  • EMSAR v1.1 is available for download. (Jul 2, 2014)
    • Multi-sample calling is now implemented.
    • Poly-A removal is deprecated.
  • EMSAR v1.0 is available for download. (Jun 24, 2014)
    • This version uses nomenclature FPKM rather than FVKM.
    • Some memory leaks in reading paired-end BAM files are fixed.
    • This version removes polyA from fasta file by default and provides an option of retaining polyA.
  • EMSAR v0.9.8d (pre-official-release version) is available for download. (Mar 7, 2014)
    • A bug in a newly introduced error handling code is now fixed.
  • EMSAR v0.9.8c (pre-official-release version) is available for download. (Mar 3, 2014)
    • We fixed a critical bug in which rshsize++ clashed with paired-end parallelization. The clash could cause a discrepancy between the actual and the calculated number of cid's, leading to a segmentation fault and/or incorrect estimation.
    • v0.9.8c handles set size=1 separately in a simple way, to save time.
    • v0.9.8c checks bowtie output format.
    • v0.9.8c handles the error caused by reallocation of sfa_s for the size zero case.
    • v0.9.8c checks the number of iterations for MLE and performs reinitialization if the number of iterations exceeds a certain threshold (set to 10000), to prevent 'hanging.'
  • EMSAR v0.9.8 (pre-official-release version) is available for download. (Jan 4, 2014)
    • v0.9.8 sets the default -r value to be 1E-15, instead of 1E-3, to avoid a spurious bimodal distribution of logFPKM.
    • v0.9.8 reads the fasta file twice: the first pass computes the size of the sequence array and the number of transcripts, and the second pass stores the information. This approach avoids reallocation of large arrays, which may sometimes fail and cause a segmentation fault at a later step. Likewise, the initial tagged suffix array size was doubled to avoid reallocation. v0.9.8 also produces an error message when there is a problem with memory allocation.
  • EMSAR v0.9.7c (pre-official-release version) is available for download. (Sep 23, 2013)


Usage1 : EMSAR [options] fastafile outdir outprefix bowtieoutfile|SAMfile|BAMfile
Usage2 : bowtie_command | EMSAR [options] fastafile outdir outprefix

Example usages

ex1: EMSAR -p 4 -h R human.rna.fna RNAseq sample22 sample22.bowtieout
ex2: EMSAR -p 4 -B -q human.rna.fna RNAseq sample22 sample22.BAM
ex3: EMSAR -p 4 -S --PE human.rna.fna RNAseq sample22 sample22.SAM
ex4: bowtie -v 2 -a -m 100 -p 8 human.rna sample22.fastq | EMSAR -p 4 -h R human.rna.fna RNAseq sample22
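The piped form (Usage2) can also be driven from a script. The following is a minimal sketch, assuming that both bowtie and EMSAR are on the PATH. The function names build_pipeline and run_pipeline are hypothetical helpers, not part of EMSAR, and the flags are only those shown in the examples above.

```python
import subprocess


def build_pipeline(fasta, bowtie_index, fastq, outdir, prefix,
                   threads=4, header="R"):
    """Return the two argument lists for the piped usage shown above.

    Hypothetical helper: it only assembles the flags that appear in the
    documented example; it does not validate them.
    """
    bowtie_cmd = ["bowtie", "-v", "2", "-a", "-m", "100", "-p", "8",
                  bowtie_index, fastq]
    emsar_cmd = ["EMSAR", "-p", str(threads), "-h", header,
                 fasta, outdir, prefix]
    return bowtie_cmd, emsar_cmd


def run_pipeline(bowtie_cmd, emsar_cmd):
    """Run bowtie and feed its standard output to EMSAR's standard input."""
    bowtie = subprocess.Popen(bowtie_cmd, stdout=subprocess.PIPE)
    emsar = subprocess.Popen(emsar_cmd, stdin=bowtie.stdout)
    # Close our copy of the pipe so bowtie gets SIGPIPE if EMSAR exits early.
    bowtie.stdout.close()
    emsar_rc = emsar.wait()
    bowtie_rc = bowtie.wait()
    return bowtie_rc, emsar_rc
```

Calling build_pipeline("human.rna.fna", "human.rna", "sample22.fastq", "RNAseq", "sample22") reproduces the two commands of the piped example; run_pipeline then executes them without an intermediate alignment file.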



-P, --PE
paired-end data (default : single-end)
-s, --strand_type [strand_type]
set strand type ('ns','ssf','ssr' for single-end, 'ns','ssfr','ssrf' for paired-end). (default: ns(unstranded))
-S, --SAM
input file format is SAM (default: default bowtie output format)
-B, --BAM
input file format is BAM (default: default bowtie output format)
-p, --maxthread [num_thread]
number of threads to use simultaneously. Using multiple threads slightly increases memory usage. We recommend -p 4 for optimal performance. (default : 1)
-F, --maxfraglen [Max_fraglen]
Maximum fragment length (=read length for SE). Use a number that is safely large, unless you want to apply fragment length filtering. Default 400.
-f, --minfraglen [Min_fraglen]
Minimum fragment length (=read length for SE). Use a number that is safely small, unless you want to apply fragment length filtering. Default 1.
-h, --header [E|R]
fasta header option. E : Ensembl header(default), R : RefSeq header.
-g, --print_segments
print out segment information.

-v, --verbose
make a (very) verbose output log.
-q, --no_verbose
turn off verbosity.

Advanced Options

-b, --binsize [binsize]
binsize for indexing transcriptome (default : 5000). This bin size can be made smaller to improve speed at the cost of more memory, or vice versa.
-t [taglen]
length of short sequence tags to use for constructing suffix array on a subset of substrings only. This affects only speed and memory usage. Currently, three values are supported (1,2,3). (default 2)
-n, --nround [num_rounds]
number of MLE runs for computing mean and sd of FPKM. Default 4.
-e, --epsilon [epsilon]
epsilon value to check convergence of MLE (default : 1E-9)
-r, --precision [precision]
estimate precision (epsilon for step size) (default : 1E-15)
-d, --delta [delta]
delta offset for MLE (default : 0)
-H, --heavy
compute exact fragment length sampling probability. Takes longer and uses more memory. (default: use observed distribution as sampling probability).

How it works

When RNA-seq reads are aligned to the transcriptome reference sequences, a read may map to multiple transcripts. Based on this information, each read is assigned to a group, which we term a 'segment'. For example, a read mapped to both isoforms A and B is in segment 1, and a read mapped to isoforms A, B and C is in segment 2. The length of each segment is computed as the number of all possible unique reads with the same transcript sharing. The read count for a segment should depend on the segment length and on the sum of the abundances of the transcripts that define the segment. 'Sequence-sharing sets' are then defined as groups of transcripts such that any two transcripts that together define a segment belong to the same set. A sequence-sharing set is usually composed of isoforms of one gene or of multiple genes. A Poisson-based likelihood function is defined as a joint probability over all segments in a sequence-sharing set, and the transcript abundance estimates that maximize this likelihood are found.
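The grouping described above can be illustrated with a short Python sketch. This is not EMSAR's C implementation: the function names, the dictionary-based data layout, and the union-find approach are assumptions made for illustration. In particular, the Poisson mean used here (segment length times the summed abundance) is an assumed form that matches the description only qualitatively.

```python
import math
from collections import defaultdict


def group_reads_into_segments(read_hits):
    """Count reads per segment; a segment is the set of transcripts hit."""
    counts = defaultdict(int)
    for transcripts in read_hits.values():
        if transcripts:  # ignore reads that map to no transcript
            counts[frozenset(transcripts)] += 1
    return dict(counts)


def sequence_sharing_sets(segments):
    """Merge transcripts that co-define a segment (union-find, assumed)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for seg in segments:
        members = sorted(seg)
        root = find(members[0])
        for t in members[1:]:
            parent[find(t)] = root  # attach the other root to this one

    groups = defaultdict(set)
    for t in list(parent):
        groups[find(t)].add(t)
    return sorted(sorted(g) for g in groups.values())


def poisson_log_likelihood(segments, seg_lengths, abundance):
    """Joint Poisson log-likelihood over segments.

    Assumption: expected count = segment length * summed abundance,
    which must be positive for every segment.
    """
    total = 0.0
    for seg, n in segments.items():
        lam = seg_lengths[seg] * sum(abundance[t] for t in seg)
        total += n * math.log(lam) - lam - math.lgamma(n + 1)
    return total


# Toy input: read id -> transcripts it aligns to (hypothetical data).
read_hits = {"r1": ["A", "B"], "r2": ["A", "B", "C"],
             "r3": ["A", "B"], "r4": ["D"]}
segments = group_reads_into_segments(read_hits)
sets = sequence_sharing_sets(segments)
```

In this toy input, reads r1 and r3 fall in the segment defined by A and B, r2 falls in the segment defined by A, B and C, and transcripts A, B and C end up in one sequence-sharing set while D forms its own. An optimizer would then search, within each set, for the abundances that maximize poisson_log_likelihood.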