ICGC ARGO RNA Seq Analysis

lindaxiang edited this page Aug 17, 2022 · 1 revision

RNA-Seq reference genome

file name size (bytes) md5sum
GRCh38_Verily_v1.genome.fa 3150152408 16626761857940321a7a1142e03f8217
GRCh38_Verily_v1.genome.fa.fai 123145 b373ad1f64003c910dce216f93718aab
GRCh38_Verily_v1.genome.fa.gz 887918831 1fb31dcb45ca7c52d0e27c523504bc9a
GRCh38_Verily_v1.genome.fa.gz.gzi 772104 55b7a860d1cef3793fcda54af56664e3
GRCh38_Verily_v1.genome.fa.gz.fai 123145 b373ad1f64003c910dce216f93718aab
README.txt 1492 db3b3e4233b6ddb92ff3e3dc152ccda8

Stage the files above under a file-system path that workflow jobs can access. The files can be downloaded with wget, for example:

wget https://object.cancercollaboratory.org:9080/swift/v1/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.genome/README.txt
  • Because RNA-Seq aligners are not ALT-aware, ICGC ARGO uses a slightly different version of the reference genome for RNA-Seq analysis. The genome file is composed of the following sequences:
    • GRCh38 primary assembly
    • Decoy sequences
    • Epstein-Barr virus (EBV) sequence
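
After staging, each file can be checked against the size and md5sum columns in the table above. The following is a minimal sketch, assuming GNU coreutils (md5sum, wc) are available on the host; check_file is a hypothetical helper name and not part of any ICGC ARGO tooling.

```shell
# check_file PATH EXPECTED_SIZE EXPECTED_MD5
# Hypothetical helper: returns 0 only if the file exists and both its size
# in bytes and its md5sum match the expected values.
check_file() {
  local path="$1" want_size="$2" want_md5="$3"
  local got_size got_md5

  if [ ! -f "$path" ]; then
    echo "MISSING  $path"
    return 1
  fi

  # Compare sizes first: cheap, and catches truncated downloads early.
  got_size=$(wc -c < "$path" | tr -d '[:space:]')
  if [ "$got_size" != "$want_size" ]; then
    echo "BAD SIZE $path (expected $want_size, got $got_size)"
    return 1
  fi

  got_md5=$(md5sum "$path" | cut -d' ' -f1)
  if [ "$got_md5" != "$want_md5" ]; then
    echo "BAD MD5  $path (expected $want_md5, got $got_md5)"
    return 1
  fi

  echo "OK       $path"
}
```

For example, check_file README.txt 1492 db3b3e4233b6ddb92ff3e3dc152ccda8 checks the README against its row in the table. The same helper applies to the other tables on this page.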

RNA-Seq genome annotation files

  • GENCODE v40 provides the comprehensive gene annotation for the reference chromosomes, scaffolds, assembly patches, and alternate loci (haplotypes).

file name size (bytes) md5sum
gencode.v40.chr_patch_hapl_scaff.annotation.gtf 1616162883 beeee37565d2a76f477fb474fcfa922e

Stage the file above under a file-system path that workflow jobs can access. It can be downloaded with wget:

wget https://object.cancercollaboratory.org:9080/swift/v1/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.annotation/gencode.v40.chr_patch_hapl_scaff.annotation.gtf
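
This annotation also covers assembly patches and alternate loci, while the reference genome is described as primary assembly plus decoy and EBV sequences, so some annotation contigs may have no counterpart in the genome. A minimal sketch for listing them, assuming the GTF and the .fai index are both staged locally and that contig names follow the same naming scheme in both files; contigs_missing_from_fai is a hypothetical helper name:

```shell
# contigs_missing_from_fai GTF FAI
# Hypothetical helper: prints every sequence name used in column 1 of the GTF
# that does not appear in column 1 of the FASTA index (.fai).
contigs_missing_from_fai() {
  local gtf="$1" fai="$2" tmpdir
  tmpdir=$(mktemp -d) || return 1

  # GTF header lines start with '#'; data lines are tab-separated.
  grep -v '^#' "$gtf" | cut -f1 | LC_ALL=C sort -u > "$tmpdir/gtf.names"
  cut -f1 "$fai" | LC_ALL=C sort -u > "$tmpdir/fai.names"

  # comm -23 keeps lines that occur only in the first (GTF) list.
  LC_ALL=C comm -23 "$tmpdir/gtf.names" "$tmpdir/fai.names"

  rm -rf "$tmpdir"
}
```

A call such as contigs_missing_from_fai gencode.v40.chr_patch_hapl_scaff.annotation.gtf GRCh38_Verily_v1.genome.fa.fai prints one contig name per line, or nothing if every annotated contig is present in the genome.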

RNA-Seq Alignment

STAR index and auxiliary files

file name size (bytes) md5sum
Genome 3823751360 c40d86d0b50c34dd46a9347472462937
SA 24750976325 318f38f408c9b48e8c2e4c4911bcf470
SAindex 1565873619 aa883f548dbc9399e7a1444891fdd741
STARindex.log 763 acb99118146c8723378709968565c3c4
chrLength.txt 13158 7f5964f5965ea24ade6257990c7461cf
chrName.txt 66140 fbb1fe18634dc8fc7192930225f0e6a1
chrNameLength.txt 79298 98e83030349933a0b0ca21888a71edb0
chrStart.txt 28378 d87a026a4ec84d67c3e53256a985f904
exonGeTrInfo.tab 56068230 696ce75a16af0d50a47f79c4b95ff4b1
exonInfo.tab 22952209 8932772eace6ab5408133590b4f34b56
geneInfo.tab 2591817 303aa1d1f63fae8bd954dba3c5f5dcb9
genomeParameters.txt 1008 7ad39ed85712bdb3f7e238e364f39de4
sjdbInfo.txt 11620218 29b9af281debb5900d8db82f832a0642
sjdbList.fromGTF.out.tab 12610890 1d4d6966ec9f67d9125067b7636c3038
sjdbList.out.tab 10259582 207834d2baf062f5d0a303c18fdb8798
transcriptInfo.tab 16599748 445aa2a51ddfc112f0f6f6b8463f9b8d

Stage the files above under a file-system path that workflow jobs can access. The files can be downloaded with wget, for example:

wget https://object.cancercollaboratory.org:9080/swift/v1/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.STARindex.sjdbOverhang_75/STARindex.log
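
The directory name in the URL above suggests that this index was generated with --sjdbOverhang 75; the STAR manual describes the ideal value as the read (mate) length minus 1. STAR records the generation parameters in genomeParameters.txt as name/value lines, so the setting can be read back from the staged index. A minimal sketch, with star_index_param as a hypothetical helper name and the index directory as a placeholder:

```shell
# star_index_param INDEX_DIR PARAM_NAME
# Hypothetical helper: prints the value recorded for PARAM_NAME in the
# genomeParameters.txt of a STAR index directory, or fails if it is absent.
star_index_param() {
  local dir="$1" key="$2" value
  # Lines look like "<name><whitespace><value>"; comment lines start with '#'.
  value=$(awk -v k="$key" '$1 == k { print $2; exit }' "$dir/genomeParameters.txt")
  if [ -z "$value" ]; then
    echo "parameter $key not found in $dir/genomeParameters.txt" >&2
    return 1
  fi
  echo "$value"
}
```

For example, star_index_param /path/to/STARindex sjdbOverhang would be expected to print 75 if the directory name reflects the setting actually used.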

HISAT2 index and auxiliary files

file name size (bytes) md5sum
GRCh38_Verily_v1.1.ht2 1818403521 60998540231f7e21ad8a53d13898de08
GRCh38_Verily_v1.2.ht2 736877080 a6b58d2aa00d32007c1227e9835e2038
GRCh38_Verily_v1.3.ht2 31508 682418739b6d9c3dd92dc39df73fdfeb
GRCh38_Verily_v1.4.ht2 735167267 aac99bf451926a49e0cb5a921588fdf2
GRCh38_Verily_v1.5.ht2 1772593003 834ad923bead0a77562f80ea55ed3c93
GRCh38_Verily_v1.6.ht2 749013982 89110c7f502a5ffa5fd9895cba2f87da
GRCh38_Verily_v1.7.ht2 14465092 bef9ed20ad08932a0d07b5da317be62b
GRCh38_Verily_v1.8.ht2 2823782 4bfcde812f6b0ce124439d6da85ccdf6
GRCh38_Verily_v1.log 10620 4e833f06e59568c17e409b1a69cf7b11

Stage the files above under a file-system path that workflow jobs can access. The files can be downloaded with wget, for example:

wget https://object.cancercollaboratory.org:9080/swift/v1/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.HISAT2index/GRCh38_Verily_v1.log
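
HISAT2 locates an index by its basename (the -x argument), here GRCh38_Verily_v1, and expects the numbered .ht2 files next to it. A minimal sketch that checks all eight files listed above are staged before a job starts; hisat2_index_complete is a hypothetical helper name and the directory path is a placeholder:

```shell
# hisat2_index_complete INDEX_DIR BASENAME
# Hypothetical helper: returns 0 only if BASENAME.1.ht2 through BASENAME.8.ht2
# all exist in INDEX_DIR and are non-empty.
hisat2_index_complete() {
  local dir="$1" base="$2" i missing=0
  for i in 1 2 3 4 5 6 7 8; do
    if [ ! -s "$dir/$base.$i.ht2" ]; then
      echo "missing or empty: $dir/$base.$i.ht2"
      missing=1
    fi
  done
  return "$missing"
}
```

If hisat2_index_complete /path/to/HISAT2index GRCh38_Verily_v1 succeeds, the aligner can be pointed at the index with -x /path/to/HISAT2index/GRCh38_Verily_v1.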

Picard CollectRnaSeqMetrics auxiliary files

  • --ref_flat: a tab-delimited file describing the locations of RNA transcripts, exon start and stop sites, etc.
  • --ribosomal_interval_list: the locations of rRNA sequences in the genome, in interval_list format. If not specified, no bases will be identified as ribosomal.

file name size (bytes) md5sum
GRCh38_Verily_v1.rRNA.interval_list 134077 6e00a55590ec6cbddafe9bd59f7f444b
GRCh38_Verily_v1.refFlat.txt.gz 8043021 21ebee2684e7be6df13500d880b2b6ad

Stage the files above under a file-system path that workflow jobs can access. The files can be downloaded with wget, for example:

wget https://object.cancercollaboratory.org:9080/swift/v1/genomics-public-data/rna-seq-references/GRCh38_Verily_v1.Picard_CollectRnaSeqMetrics/GRCh38_Verily_v1.rRNA.interval_list
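
The --ref_flat and --ribosomal_interval_list options appear to correspond to the REF_FLAT and RIBOSOMAL_INTERVALS arguments of Picard CollectRnaSeqMetrics. A minimal sketch of a direct Picard invocation follows, printed as a dry run rather than executed. The picard.jar location and the BAM and output names are placeholders, STRAND_SPECIFICITY=NONE is an assumption that must be matched to the library protocol, and the refFlat file is assumed to have been decompressed with gunzip in case the Picard version in use does not read gzip input directly.

```shell
# picard_rnaseq_metrics_cmd BAM OUT REF_FLAT RRNA_INTERVALS
# Hypothetical helper: prints (does not run) a Picard CollectRnaSeqMetrics
# command line that uses the staged auxiliary files.
picard_rnaseq_metrics_cmd() {
  local bam="$1" out="$2" ref_flat="$3" rrna="$4"
  # STRAND_SPECIFICITY=NONE is a placeholder; Picard also accepts
  # FIRST_READ_TRANSCRIPTION_STRAND and SECOND_READ_TRANSCRIPTION_STRAND.
  printf 'java -jar picard.jar CollectRnaSeqMetrics I=%s O=%s REF_FLAT=%s RIBOSOMAL_INTERVALS=%s STRAND_SPECIFICITY=NONE\n' \
    "$bam" "$out" "$ref_flat" "$rrna"
}
```

For example, picard_rnaseq_metrics_cmd sample.bam sample.rna_metrics.txt GRCh38_Verily_v1.refFlat.txt GRCh38_Verily_v1.rRNA.interval_list prints the command for review before it is run.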