About¶
CIRI2: Circular RNA identification based on multiple seed matching
CIRI (circRNA identifier) is a novel chiastic clipping signal based algorithm,which can unbiasedly and accurately detect circRNAs from transcriptome data by employing multiple filtration strategies.
Release Notes¶
What’s new in CIRI2?
CIRI2 supports multi-thread and is convenient for large data set.
CIRI2 supports variable read lengths and can process combined data sets after trimming.
CIRI2 outputs strand of predicted circRNAs.
Citations:¶
If you use CIRI2, please cite the following papers:
Yuan Gao†, Jinfeng Wang† and Fangqing Zhao*. CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biology (2015) 16:4.
Yuan Gao, Jinyang Zhang and Fangqing Zhao*. Circular RNA identification based on multiple seed matching. Briefings in Bioinformatics (2017) DOI: 10.1093/bib/bbx014.
Usage¶
Commands and arguments¶
How to run CIRI2:
perl CIRI2.pl -I in.sam -O outfile -F ref.fa (-R ref_dir/)
*CIRI2 can determine automatically to use PE or SE mode according to input SAM.
The arguments of CIRI2 are as followings:
-I, --in
input SAM file name (required; generated by BWA-MEM)
-O, --out
output circRNA list name (required)
-F, --ref_file
FASTA file of all reference sequences. Please make sure this file is
the same one provided to BWA-MEM. Either this argument or -R/--ref-dir
is required.
-R, --ref_dir
directory of reference sequence(s). Please make sure fasta files in
this directory are from the FASTA file(s) provided to BWA-MEM. Either
this argument or -F/--ref-file is required.
-A, --anno
input GTF/GFF3 formatted annotation file name (optional)
-G, --log
output log file name (optional)
-H, --help
show this help information
-S, --max_span
max spanning distance of circRNAs (default: 200000)
-high, --high_strigency
use high strigency: only output circRNAs supported by more than 2
distinct PCC signals (default)
-low, --low_strigency
use low strigency: only output circRNAs supported by more than 2
junction reads
-0, --no_strigency
output all circRNAs regardless junction read counts or PCC signals
-U, --mapq_uni
set threshold for mappqing quality of each segment of junction reads
(default: 10; should be within [0,30])
-E, --rel_exp
set threshold for relative expression calculated based on counts of
junction reads and non-junction reads (optional: e.g. 0.1)
-M, --chrM
tell CIRI2 the ID of mitochondrion in reference file(s) (default: chrM)
-T, --thread_num
set number of threads for parallel running (default: 1)
-Q, --quiet
keep quiet when running
-D, --output_all
keep the temporary files after running (more disk space would be needed)
Reads mapped to mitochondrial sequence (provided by –chrM will be dropped).
Preparation for using CIRI2¶
Trim reads first if necessary. This step is optional. For data sets with good sequencing quality, this step will not largely influence prediction of CIRI2.
Align the reads to reference to generate SAM file using BWA-MEM tool , which is a split mapping algorithm (http://bio-bwa.sourceforge.net/bwa.shtml).
Recommended protocols for running BWA-MEM:
bwa index -a bwtsw ref.fa
bwa mem –T 19 ref.fa reads.fq > aln-se.sam (single-end reads)
bwa mem –T 19 ref.fa read1.fq read2.fq > aln-pe.sam (paired-end reads)
When using a nohup command, please make sure a log file is specifically designated so that the output file is a clean SAM record. Recommended protocols as following:
bwa index -a bwtsw ref.fa
bwa mem –T 19 ref.fa reads.fq 1> aln-se.sam 2> aln-se.log
bwa mem –T 19 ref.fa read1.fq read2.fq 1> aln-pe.sam 2> aln-pe.log
IMPORTANT: Please do not try processing the output SAM (e.g. sorting or filtering), which would confuse CIRI2. CIRI2 has been optimized to analysis SAM file without modification, and efficient filtrations have been included in CIRI2.
Annotation formats¶
Although CIRI2 can de novo detect circRNA, it can also use annotation as complementary filtrations.
Please make sure you are using exactly the same version of genomic sequences and their annotations when running CIRI2.
We recommend .gtf annotations generated by ensembl. Here is the link for their latest annotations of model organisms: ftp://ftp.ensembl.org/pub/current_gtf
An example of running CIRI2¶
Before we start, please make sure you have installed parallel Perl 5.8 or higher and use Mac OS X or Linux operation system.
Please download CIRI2 and test_data2.zip from https://sourceforge.net/projects/ciri/files/CIRI2/ and then unzip the three files in test_data2.zip. test.sam is a SAM file of alignment records generated by BWA-MEM. The only parameter of BWA-MEM was -T 19. chr1.fa is the FASTA file of hg19 chromosome 1 downloaded from UCSC and chr1.gtf is annotation for chromosome 1 extracted from version 18 gencode gtf file.
Enter the directory and type as following in your terminal:
perl CIRI2.pl -I test.sam -O outfile -F chr1.fa -A chr1.gtf
CIRI2 can finish circRNA detecting within a minute, and then you will see the following results in outfile:
circRNA_ID chr circRNA_start circRNA_end #junction_reads SM_MS_SMS #non_junction_reads junction_reads_ratio circRNA_type gene_id strand junction_reads_ID
chr1:16770127|16775696 chr1 16770127 16775696 8 5_6_0 192 0.077 exon ENSG00000157191.15, + GFGG-GA2-1:2:33:2000:1585,GFGG-GA2-1:2:33:2000:1587,GFGG-GA2-1:2:33:2000:1589,GFGG-GA2-1:2:33:2000:1590,GFGG-GA2-1:2:33:2000:1576,GFGG-GA2-1:2:33:2000:1580,GFGG-GA2-1:2:33:2000:1583,GFGG-GA2-1:2:33:2000:1593,
chr1:16775588|16778510 chr1 16775588 16778510 6 5_5_0 203 0.056 exon ENSG00000157191.15, + GFGG-GA2-1:2:33:2000:1276,GFGG-GA2-1:2:33:2000:1277,GFGG-GA2-1:2:33:2000:1280,GFGG-GA2-1:2:33:2000:1284,GFGG-GA2-1:2:33:2000:1279,GFGG-GA2-1:2:33:2000:1283,
...
Or CIRI2 can search circRNAs without the annotation gtf:
perl CIRI2.pl -I test.sam -O outfile2 -F chr1.fa
The outfile2 is as follows:
circRNA_ID chr circRNA_start circRNA_end #junction_reads SM_MS_SMS #non_junction_reads junction_reads_ratio circRNA_type gene_id strand junction_reads_ID
chr1:16770127|16775696 chr1 16770127 16775696 8 5_6_0 192 0.077 n/a /n/a + GFGG-GA2-1:2:33:2000:1585,GFGG-GA2-1:2:33:2000:1587,GFGG-GA2-1:2:33:2000:1589,GFGG-GA2-1:2:33:2000:1590,GFGG-GA2-1:2:33:2000:1576,GFGG-GA2-1:2:33:2000:1580,GFGG-GA2-1:2:33:2000:1583,GFGG-GA2-1:2:33:2000:1593,
chr1:16775588|16778510 chr1 16775588 16778510 6 5_5_0 203 0.056 n/a /n/a + GFGG-GA2-1:2:33:2000:1276,GFGG-GA2-1:2:33:2000:1277,GFGG-GA2-1:2:33:2000:1280,GFGG-GA2-1:2:33:2000:1284,GFGG-GA2-1:2:33:2000:1279,GFGG-GA2-1:2:33:2000:1283,
...
Columns of output file are split by tabs (”\t” in shell and perl). Each column gives information of a predicted circRNA:
Column | Description |
---|---|
1 | ID of a predicted circRNA in the pattern of "chr:start |
2 | chromosome of a predicted circRNA |
3 | start loci of a predicted circRNA on the chromosome |
4 | end loci of a predicted circRNA on the chromosome |
5 | circular junction read (also called as back-spliced junction read) count of a predicted circRNA |
6 | unique CIGAR types of a predicted circRNA. For example, a circRNAs have three junction reads: read A (80M20S, 80S20M), read B (80M20S, 80S20M), read C (40M60S, 40S30M30S, 70S30M), then its has two SM types (80S20M, 70S30M), two MS types (80M20S, 70M30S) and one SMS type (40S30M30S). Thus its SM_MS_SMS should be 2_2_1. |
7 | non-junction read count of a predicted circRNA that mapped across the circular junction but consistent with linear RNA instead of being back-spliced |
8 | ratio of circular junction reads calculated by 2#junction_reads/(2#junction_reads+#non_junction_reads). #junction_reads is multiplied by two because a junction read is generated from two ends of circular junction but only counted once while a non-junction read is from one end. It has to be mentioned that the non-junction reads are still possibly from another larger circRNA, so the junction_reads_ratio based on it may be an inaccurate estimation of relative expression of the circRNA. |
9 | type of a circRNA according to positions of its two ends on chromosome (exon, intron or intergenic_region; only available when annotation file is provided) |
10 | ID of the gene(s) where an exonic or intronic circRNA locates |
11 | strand info of a predicted circRNAs (new in CIRI2) |
12 | all of the circular junction read IDs (split by ",") |
Running time and summary are recorded in oufile.log.
Test of CIRI2¶
You can also test CIRI2 using the framework added since version 2.0.5, though the test itself is not required for running CIRI2. If you test CIRI2 using this framework, please note modules Test::More and Test::Class as well as Perl 5.10 or higher will be needed.
Enter the directory and type as follows in your terminal:
perl data/test_CIRI.pl
Q&A¶
(1) How to use multiple threads in CIRI2?
CIRI2 supports multi-thread. To use it, just designate thread number using the argument -T.
It should be noted that using multiple threads will inevitably need more RAM due to the design of parallel Perl, although we have optimized RAM utility in CIRI2.
As a suggestion, please choose to use multi-thread if you are using moderate or high-performance server, which will largely reduce running time for large data sets. For reference, it usually takes no more than 4 hours for 40 Gb data (SAM file >100 Gb) generated from RNaseR treated sample using 4 threads.
If you only have a mini-server or even a 16Gb-RAM iMac, you can still use CIRI2 without paralleling to complete processing large data sets in acceptable time. It usually takes no more than 10 hours to process the above data set using single thread.
We do not recommend to use more than 20 threads in CIRI2, which will cost a large RAM but indeed speed up little.
(2) How to set parameters in BWA-MEM and CIRI2 for different data?
An important argument of BWA-MEM for junction read mapping is -T, which gives the threshold of alignment score of output.
Our simulation study shows that a lower -T than default (30) can improve sensitivity of CIRI2. We recommend -T 19 for most data set (such as read length ≥60).
We have applied CIRI2 to simulated data and real data. Here are some recommended parameters for short or single-end reads.
Single-end reads result in higher FDR comparing to paired-end reads for lack of PEM information as one of filters when using default parameters. Thus we provided parameter -U (e.g. -U 15), to set mapping quality thresholds of a junction read and help control FDR.
Although short read length (<60 bp) may lead to lower sensitivity, alteration of parameters for BWA-MEM (e.g. -k 15 and -T 15) to allow alignments for low mapping scores can improve performance for this type of sequencing data.
(3) How to generate a genome annotation file for non-model organisms.
If the genome you use does not have annotation in ensembl, we would recommend you to convert the annotated file to one of the following formats.
gtf format:
chr1 HAVANA gene 11869 14412 . + . gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1 HAVANA exon 11869 12227 . + . gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1; exon_id "ENSE00002234944.1"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
The last column (column 9) should contains the key words such as “gene_id” and “transcript_id”, and each of them is split by a “; “.
gff format:
AC_000023.1 RefSeq region 1 202526509 . + . ID=id0;Name=1;Dbxref=taxon:10090;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=mixed
AC_000023.1 BestRefSeq gene 3222752 3677460 . - . ID=gene0;Name=Xkr4;Dbxref=GeneID:497097,MGI:3528744;description=X Kell blood group precursor related family member 4;gbkey=Gene;gene=Xkr4;gene_synonym=AY534250,Gm210,mKIAA1889,XRG4;partial=true
AC_000023.1 BestRefSeq mRNA 3222752 3677460 . - . ID=rna0;Name=NM_001011874.1;Parent=gene0;Note=The RefSeq transcript aligns at 78%25 coverage compared to this genomic sequence;Dbxref=GeneID:497097,Genbank:NM_001011874.1,MGI:3528744;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=Xkr4;product=X Kell blood group precursor related family member 4;transcript_id=NM_001011874.1
AC_000023.1 BestRefSeq exon 3677303 3677460 . - . ID=id1;Parent=rna0;Note=The RefSeq transcript aligns at 78%25 coverage compared to this genomic sequence;Dbxref=GeneID:497097,Genbank:NM_001011874.1,MGI:3528744;exception=annotated by transcript or proteomic data;gbkey=mRNA;gene=Xkr4;product=X Kell blood group precursor related family member 4;transcript_id=NM_001011874.1
The last column (column 9) should contains the key words such as “gene=” and “transcript_id=”, and each of them is split by a “;”.
another gff format:
Chr1 TAIR10 chromosome 1 30427671 . . . ID=Chr1;Name=Chr1
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010
Chr1 TAIR10 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1
Chr1 TAIR10 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1
Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1
Chr1 TAIR10 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1
The last column (column 9) should contains the key words such as “Parent=”, and each of them is split by a “;”.
Any questions about CIRI2 please mail to gaoyuan06@mails.ucas.ac.cn.