About

CIRI2: Circular RNA identification based on multiple seed matching

CIRI (circRNA identifier) is a novel chiastic clipping signal based algorithm,which can unbiasedly and accurately detect circRNAs from transcriptome data by employing multiple filtration strategies.

Release Notes

What’s new in CIRI2?

  1. CIRI2 supports multi-thread and is convenient for large data set.

  2. CIRI2 supports variable read lengths and can process combined data sets after trimming.

  3. CIRI2 outputs strand of predicted circRNAs.

Citations:

If you use CIRI2, please cite the following papers:

  1. Yuan Gao†, Jinfeng Wang† and Fangqing Zhao*. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biology (2015) 16:4.

  2. Yuan Gao, Jinyang Zhang and Fangqing Zhao*. Circular RNA identification based on multiple seed matching. Briefings in Bioinformatics (2017) DOI: 10.1093/bib/bbx014.

Usage

Commands and arguments

How to run CIRI2:

perl CIRI2.pl -I in.sam -O outfile -F ref.fa (-R ref_dir/)

*CIRI2 can determine automatically to use PE or SE mode according to input SAM.

The arguments of CIRI2 are as followings:

-I, --in
    input SAM file name (required; generated by BWA-MEM)
-O, --out
    output circRNA list name (required)
-F, --ref_file
    FASTA file of all reference sequences. Please make sure this file is 
    the same one provided to BWA-MEM. Either this argument or -R/--ref-dir 
    is required.
-R, --ref_dir
    directory of reference sequence(s). Please make sure fasta files in 
    this directory are from the FASTA file(s) provided to BWA-MEM. Either 
    this argument or -F/--ref-file is required.
-A, --anno
    input GTF/GFF3 formatted annotation file name (optional)
-G, --log
    output log file name (optional)
-H, --help
    show this help information
-S, --max_span
    max spanning distance of circRNAs (default: 200000)
-high, --high_strigency
    use high strigency: only output circRNAs supported by more than 2 
    distinct PCC signals (default)
-low, --low_strigency
    use low strigency: only output circRNAs supported by more than 2 
    junction reads
-0, --no_strigency
    output all circRNAs regardless junction read counts or PCC signals
-U, --mapq_uni
    set threshold for mappqing quality of each segment of junction reads 
    (default: 10; should be within [0,30])
-E, --rel_exp
    set threshold for relative expression calculated based on counts of 
    junction reads and non-junction reads (optional: e.g. 0.1)
-M, --chrM
    tell CIRI2 the ID of mitochondrion in reference file(s) (default: chrM)
-T, --thread_num
    set number of threads for parallel running (default: 1)
-Q, --quiet
    keep quiet when running
-D, --output_all
    keep the temporary files after running (more disk space would be needed)

Reads mapped to mitochondrial sequence (provided by –chrM will be dropped).

Preparation for using CIRI2

  • Trim reads first if necessary. This step is optional. For data sets with good sequencing quality, this step will not largely influence prediction of CIRI2.

  • Align the reads to reference to generate SAM file using BWA-MEM tool , which is a split mapping algorithm (http://bio-bwa.sourceforge.net/bwa.shtml).

Recommended protocols for running BWA-MEM:

bwa index -a bwtsw ref.fa
bwa mem –T 19 ref.fa reads.fq > aln-se.sam (single-end reads)
bwa mem –T 19 ref.fa read1.fq read2.fq > aln-pe.sam (paired-end reads)

When using a nohup command, please make sure a log file is specifically designated so that the output file is a clean SAM record. Recommended protocols as following:

bwa index -a bwtsw ref.fa
bwa mem –T 19 ref.fa reads.fq 1> aln-se.sam 2> aln-se.log 
bwa mem –T 19 ref.fa read1.fq read2.fq 1> aln-pe.sam 2> aln-pe.log

IMPORTANT: Please do not try processing the output SAM (e.g. sorting or filtering), which would confuse CIRI2. CIRI2 has been optimized to analysis SAM file without modification, and efficient filtrations have been included in CIRI2.

Annotation formats

Although CIRI2 can de novo detect circRNA, it can also use annotation as complementary filtrations.

Please make sure you are using exactly the same version of genomic sequences and their annotations when running CIRI2.

We recommend .gtf annotations generated by ensembl. Here is the link for their latest annotations of model organisms: ftp://ftp.ensembl.org/pub/current_gtf

An example of running CIRI2

Before we start, please make sure you have installed parallel Perl 5.8 or higher and use Mac OS X or Linux operation system.

Please download CIRI2 and test_data2.zip from https://sourceforge.net/projects/ciri/files/CIRI2/ and then unzip the three files in test_data2.zip. test.sam is a SAM file of alignment records generated by BWA-MEM. The only parameter of BWA-MEM was -T 19. chr1.fa is the FASTA file of hg19 chromosome 1 downloaded from UCSC and chr1.gtf is annotation for chromosome 1 extracted from version 18 gencode gtf file.

Enter the directory and type as following in your terminal:

perl CIRI2.pl -I test.sam -O outfile -F chr1.fa -A chr1.gtf

CIRI2 can finish circRNA detecting within a minute, and then you will see the following results in outfile:

circRNA_ID	chr	circRNA_start	circRNA_end	#junction_reads	SM_MS_SMS	#non_junction_reads	junction_reads_ratio	circRNA_type	gene_id	strand	junction_reads_ID
chr1:16770127|16775696	chr1	16770127	16775696	8	5_6_0	192	0.077	exon	ENSG00000157191.15,	+	GFGG-GA2-1:2:33:2000:1585,GFGG-GA2-1:2:33:2000:1587,GFGG-GA2-1:2:33:2000:1589,GFGG-GA2-1:2:33:2000:1590,GFGG-GA2-1:2:33:2000:1576,GFGG-GA2-1:2:33:2000:1580,GFGG-GA2-1:2:33:2000:1583,GFGG-GA2-1:2:33:2000:1593,
chr1:16775588|16778510	chr1	16775588	16778510	6	5_5_0	203	0.056	exon	ENSG00000157191.15,	+	GFGG-GA2-1:2:33:2000:1276,GFGG-GA2-1:2:33:2000:1277,GFGG-GA2-1:2:33:2000:1280,GFGG-GA2-1:2:33:2000:1284,GFGG-GA2-1:2:33:2000:1279,GFGG-GA2-1:2:33:2000:1283,
...

Or CIRI2 can search circRNAs without the annotation gtf:

perl CIRI2.pl -I test.sam -O outfile2 -F chr1.fa

The outfile2 is as follows:

circRNA_ID	chr	circRNA_start	circRNA_end	#junction_reads	SM_MS_SMS	#non_junction_reads	junction_reads_ratio	circRNA_type	gene_id	strand	junction_reads_ID
chr1:16770127|16775696	chr1	16770127	16775696	8	5_6_0	192	0.077	n/a	/n/a	+	GFGG-GA2-1:2:33:2000:1585,GFGG-GA2-1:2:33:2000:1587,GFGG-GA2-1:2:33:2000:1589,GFGG-GA2-1:2:33:2000:1590,GFGG-GA2-1:2:33:2000:1576,GFGG-GA2-1:2:33:2000:1580,GFGG-GA2-1:2:33:2000:1583,GFGG-GA2-1:2:33:2000:1593,
chr1:16775588|16778510	chr1	16775588	16778510	6	5_5_0	203	0.056	n/a	/n/a	+	GFGG-GA2-1:2:33:2000:1276,GFGG-GA2-1:2:33:2000:1277,GFGG-GA2-1:2:33:2000:1280,GFGG-GA2-1:2:33:2000:1284,GFGG-GA2-1:2:33:2000:1279,GFGG-GA2-1:2:33:2000:1283,
...

Columns of output file are split by tabs (”\t” in shell and perl). Each column gives information of a predicted circRNA:

Column Description
1 ID of a predicted circRNA in the pattern of "chr:start
2 chromosome of a predicted circRNA
3 start loci of a predicted circRNA on the chromosome
4 end loci of a predicted circRNA on the chromosome
5 circular junction read (also called as back-spliced junction read) count of a predicted circRNA
6 unique CIGAR types of a predicted circRNA. For example, a circRNAs have three junction reads: read A (80M20S, 80S20M), read B (80M20S, 80S20M), read C (40M60S, 40S30M30S, 70S30M), then its has two SM types (80S20M, 70S30M), two MS types (80M20S, 70M30S) and one SMS type (40S30M30S). Thus its SM_MS_SMS should be 2_2_1.
7 non-junction read count of a predicted circRNA that mapped across the circular junction but consistent with linear RNA instead of being back-spliced
8 ratio of circular junction reads calculated by 2#junction_reads/(2#junction_reads+#non_junction_reads). #junction_reads is multiplied by two because a junction read is generated from two ends of circular junction but only counted once while a non-junction read is from one end. It has to be mentioned that the non-junction reads are still possibly from another larger circRNA, so the junction_reads_ratio based on it may be an inaccurate estimation of relative expression of the circRNA.
9 type of a circRNA according to positions of its two ends on chromosome (exon, intron or intergenic_region; only available when annotation file is provided)
10 ID of the gene(s) where an exonic or intronic circRNA locates
11 strand info of a predicted circRNAs (new in CIRI2)
12 all of the circular junction read IDs (split by ",")

Running time and summary are recorded in oufile.log.

Test of CIRI2

You can also test CIRI2 using the framework added since version 2.0.5, though the test itself is not required for running CIRI2. If you test CIRI2 using this framework, please note modules Test::More and Test::Class as well as Perl 5.10 or higher will be needed.

Enter the directory and type as follows in your terminal:

perl data/test_CIRI.pl

Q&A

(1) How to use multiple threads in CIRI2?

CIRI2 supports multi-thread. To use it, just designate thread number using the argument -T.

It should be noted that using multiple threads will inevitably need more RAM due to the design of parallel Perl, although we have optimized RAM utility in CIRI2.

As a suggestion, please choose to use multi-thread if you are using moderate or high-performance server, which will largely reduce running time for large data sets. For reference, it usually takes no more than 4 hours for 40 Gb data (SAM file >100 Gb) generated from RNaseR treated sample using 4 threads.

If you only have a mini-server or even a 16Gb-RAM iMac, you can still use CIRI2 without paralleling to complete processing large data sets in acceptable time. It usually takes no more than 10 hours to process the above data set using single thread.

We do not recommend to use more than 20 threads in CIRI2, which will cost a large RAM but indeed speed up little.

(2) How to set parameters in BWA-MEM and CIRI2 for different data?

An important argument of BWA-MEM for junction read mapping is -T, which gives the threshold of alignment score of output.

Our simulation study shows that a lower -T than default (30) can improve sensitivity of CIRI2. We recommend -T 19 for most data set (such as read length ≥60).

We have applied CIRI2 to simulated data and real data. Here are some recommended parameters for short or single-end reads.

  1. Single-end reads result in higher FDR comparing to paired-end reads for lack of PEM information as one of filters when using default parameters. Thus we provided parameter -U (e.g. -U 15), to set mapping quality thresholds of a junction read and help control FDR.

  2. Although short read length (<60 bp) may lead to lower sensitivity, alteration of parameters for BWA-MEM (e.g. -k 15 and -T 15) to allow alignments for low mapping scores can improve performance for this type of sequencing data.

(3) How to generate a genome annotation file for non-model organisms.

If the genome you use does not have annotation in ensembl, we would recommend you to convert the annotated file to one of the following formats.

  1. gtf format:

chr1    HAVANA  gene    11869   14412   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1    HAVANA  transcript      11869   14409   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1    HAVANA  exon    11869   12227   .       +       .       gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1;  exon_id "ENSE00002234944.1";  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1"; 

The last column (column 9) should contains the key words such as “gene_id” and “transcript_id”, and each of them is split by a “; “.

  1. gff format:

AC_000023.1     RefSeq  region  1       202526509       .       +       .       ID=id0;Name=1;Dbxref=taxon:10090;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=mixed
AC_000023.1     BestRefSeq      gene    3222752 3677460 .       -       .       ID=gene0;Name=Xkr4;Dbxref=GeneID:497097,MGI:3528744;description=X Kell blood group precursor related family member 4;gbkey=Gene;gene=Xkr4;gene_synonym=AY534250,Gm210,mKIAA1889,XRG4;partial=true
AC_000023.1     BestRefSeq      mRNA    3222752 3677460 .       -       .       ID=rna0;Name=NM_001011874.1;Parent=gene0;Note=The RefSeq transcript aligns at 78%25 coverage compared to this genomic sequence;Dbxref=GeneID:497097,Genbank:NM_001011874.1,MGI:3528744;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=Xkr4;product=X Kell blood group precursor related family member 4;transcript_id=NM_001011874.1
AC_000023.1     BestRefSeq      exon    3677303 3677460 .       -       .       ID=id1;Parent=rna0;Note=The RefSeq transcript aligns at 78%25 coverage compared to this genomic sequence;Dbxref=GeneID:497097,Genbank:NM_001011874.1,MGI:3528744;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=Xkr4;product=X Kell blood group precursor related family member 4;transcript_id=NM_001011874.1

The last column (column 9) should contains the key words such as “gene=” and “transcript_id=”, and each of them is split by a “;”.

  1. another gff format:

Chr1    TAIR10  chromosome      1       30427671        .       .       .       ID=Chr1;Name=Chr1
Chr1    TAIR10  gene    3631    5899    .       +       .       ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1    TAIR10  mRNA    3631    5899    .       +       .       ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1    TAIR10  protein 3760    5630    .       +       .       ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1    TAIR10  exon    3631    3913    .       +       .       Parent=AT1G01010.1
Chr1    TAIR10  five_prime_UTR  3631    3759    .       +       .       Parent=AT1G01010.1

The last column (column 9) should contains the key words such as “Parent=”, and each of them is split by a “;”.

Any questions about CIRI2 please mail to gaoyuan06@mails.ucas.ac.cn.