STAR-RSEMによる発現量推定その2 (2020/08/20 追記)

前回の記事↓の続きです。

STAR-RSEMによる発現量推定その１ - Palmsonntagmorgen

Stranded/Unstranded RNA-seq

RNA-seqにはunstrandedとstranded の二種類があります。unstrandedの場合はmRNAに対して半々の確率で順鎖と逆鎖が読まれ、stranded の場合は逆鎖のみが読まれます(Illumina TruSeq kitの場合)。最近のRNA-seqはほぼ全てStranded RNA-seqになっていると思いますが、古いデータはunstrandedである場合があります。それぞれマッピング時のパラメータが異なりますので、事前にチェックしておく必要があります。

どちらかわからんという場合は以下のように簡単にマッピングすることで確かめることができます。 indexは既知のmRNA配列から作成したindexとします。

$ index=NCBI-H_sapiens-RNA
$ bowtie $index <(zcat SRR065495_1.fastq.gz) -p12 | cut -f2 | sort | uniq -c
# reads processed: 50170737
# reads with at least one reported alignment: 34437131 (68.64%)
# reads that failed to align: 15733606 (31.36%)
Reported 34437131 alignments
17168386 +
17268745 -

詳細は割愛しますが、この結果はマップされた34437131リードのうち、17168386リードが順鎖、17268745が逆鎖にマップされたことを表しています。つまりほぼ半々なので、今回のデータはunstranded RNA-seqとわかります。

（2019/7/11 追記）
ご質問があったので、stranded RNA-seqの場合にどうなるかを追記します。ここでは例として以下のサンプルを使います。

https://www.ebi.ac.uk/ena/data/view/SRR2952390

Paired-endのサンプルですが、F3側だけを使います。

$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR295/000/SRR2952390/SRR2952390_1.fastq.gz
$ index=NCBI-H_sapiens-RNA
$ bowtie $index <(zcat SRR2952390_1.fastq.gz) -p12 | cut -f2 | sort | uniq -c
# reads processed: 98531979
# reads with at least one reported alignment: 78815814 (79.99%)
# reads that failed to align: 19716165 (20.01%)
Reported 78815814 alignments to 1 output stream(s)
78463223 -
 352591 +

マップされた 78815814 リードのうち、大半の 78463223 リードが逆鎖(-)にマップされています。従って、このサンプルはStranded RNAであるということがわかります。

STARでマッピング

それでは改めてSTARでマッピングします。 RSEMコマンド中でSTARのマッピングもできるのですが、ログの出力が貧弱なため、私は独立でSTARを動かしてマッピングしています。
STARのコマンドは以下のようになります。

mkdir star # 出力フォルダを作成
STAR --outSAMtype BAM SortedByCoordinate \   # sort済BAM を出力
     --quantMode TranscriptomeSAM \    # RSEMでの発現量推定に必要なオプション
     --runThreadN 12 \            # 使用CPU数
     --outSAMattributes All \     # 出力BAMファイルに含まれる情報を最大に
     --readFilesCommand zcat \    # gzip圧縮のfastqを入力にする場合のみ必要
     --genomeDir rsem-star-index/UCSC-hg38 \     # indexファイルを指定
     --readFilesIn SRR065504_1.fastq.gz SRR065504_2.fastq.gz \  # 入力fastq
     --outSAMstrandField intronMotif \    #   unstranded RNA-seqの場合のみ必要
     --outFileNamePrefix star/Myers_H1-hESC_cell_2x75_200_1.  # 出力ファイル名(最後のピリオドを忘れず)

STARは相当メモリを食いますので注意してください（十数GB程度）。

４サンプルに対して上記を実行しますが、入力ファイル名と出力ファイル名が異なるので、今回はfor文は使いません。色々やり方はありますが、ここでは以下のようにパラメータ列をまとめることにします。

starparam="--genomeLoad NoSharedMemory \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode TranscriptomeSAM \
     --runThreadN 12 \
     --outSAMattributes All \
     --readFilesCommand zcat \
     --genomeDir rsem-star-index/UCSC-hg38 \
     --outSAMstrandField intronMotif"

STAR $starparam \
     --readFilesIn SRR065504_1.fastq.gz SRR065504_2.fastq.gz \
     --outFileNamePrefix star/Myers_H1-hESC_cell_2x75_200_1.
STAR $starparam \
     --readFilesIn SRR065495_1.fastq.gz SRR065495_2.fastq.gz \
     --outFileNamePrefix star/Myers_H1-hESC_cell_2x75_200_2.
STAR $starparam \
     --readFilesIn SRR065509_1.fastq.gz SRR065509_2.fastq.gz \
     --outFileNamePrefix star/Myers_HUVEC_cell_2x75_200_1.
STAR $starparam \
     --readFilesIn SRR065524_1.fastq.gz SRR065524_2.fastq.gz \
     --outFileNamePrefix star/Myers_HUVEC_cell_2x75_200_2.

starディレクトリの中に以下のようにファイルが生成されていれば成功です。

$ ls star
Myers_H1-hESC_cell_2x75_200_1.Aligned.sortedByCoord.out.bam
Myers_H1-hESC_cell_2x75_200_1.Aligned.toTranscriptome.out.bam
Myers_H1-hESC_cell_2x75_200_1.Log.final.out
Myers_H1-hESC_cell_2x75_200_1.Log.out
Myers_H1-hESC_cell_2x75_200_1.Log.progress.out
Myers_H1-hESC_cell_2x75_200_1.SJ.out.tab
Myers_H1-hESC_cell_2x75_200_2.Aligned.sortedByCoord.out.bam
Myers_H1-hESC_cell_2x75_200_2.Aligned.toTranscriptome.out.bam
Myers_H1-hESC_cell_2x75_200_2.Log.final.out
Myers_H1-hESC_cell_2x75_200_2.Log.out
Myers_H1-hESC_cell_2x75_200_2.Log.progress.out
Myers_H1-hESC_cell_2x75_200_2.SJ.out.tab
Myers_HUVEC_cell_2x75_200_1.Aligned.sortedByCoord.out.bam
Myers_HUVEC_cell_2x75_200_1.Aligned.toTranscriptome.out.bam
Myers_HUVEC_cell_2x75_200_1.Log.final.out
Myers_HUVEC_cell_2x75_200_1.Log.out
Myers_HUVEC_cell_2x75_200_1.Log.progress.out
Myers_HUVEC_cell_2x75_200_1.SJ.out.tab
Myers_HUVEC_cell_2x75_200_2.Aligned.sortedByCoord.out.bam
Myers_HUVEC_cell_2x75_200_2.Aligned.toTranscriptome.out.bam
Myers_HUVEC_cell_2x75_200_2.Log.final.out
Myers_HUVEC_cell_2x75_200_2.Log.out
Myers_HUVEC_cell_2x75_200_2.Log.progress.out
Myers_HUVEC_cell_2x75_200_2.SJ.out.tab

マッピングスタッツを確認してみましょう。*.Log.final.outにスタッツが記載されています。

$ cat star/Myers_HUVEC_cell_2x75_200_2.Log.final.out 
                                 Started job on |   Dec 21 14:48:27
                             Started mapping on |   Dec 21 14:50:50
                                    Finished on |   Dec 21 15:09:14
       Mapping speed, Million of reads per hour |   156.28

                          Number of input reads |   47926048
                      Average input read length | 150
                                    UNIQUE READS:
                   Uniquely mapped reads number |   37351930
                        Uniquely mapped reads % |   77.94%
                          Average mapped length |   147.79
                       Number of splices: Total |   20037714
            Number of splices: Annotated (sjdb) |   19925915
                       Number of splices: GT/AG |   19886116
                       Number of splices: GC/AG |   121436
                       Number of splices: AT/AC |   15391
               Number of splices: Non-canonical |   14771
                      Mismatch rate per base, % |   1.13%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.44
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.34
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   3004986
             % of reads mapped to multiple loci |   6.27%
        Number of reads mapped to too many loci |   9073
             % of reads mapped to too many loci |   0.02%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |   0.00%
                 % of reads unmapped: too short |   14.25%
                     % of reads unmapped: other |   1.53%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

マップリード数のほか、スプライスサイトの情報なども載っていて便利です。

RSEMで発現量推定

STARで出力された*.Aligned.toTranscriptome.out.bamを使ってRSEMで発現量推定をします。今度はfor文が使えます。

for prefix in Myers_H1-hESC_cell_2x75_200_1 \
              Myers_H1-hESC_cell_2x75_200_2 \
              Myers_HUVEC_cell_2x75_200_1 \
              Myers_HUVEC_cell_2x75_200_2
do
    rsem-calculate-expression \
               --paired-end --alignments \
               --estimate-rspd \
               --strandedness none \
               --no-bam-output \
               -p 12 \
               star/$prefix.Aligned.toTranscriptome.out.bam \
               rsem-star-index/UCSC-hg38/UCSC-hg38 \
               star/$prefix
done

（2020/08/20追記）
stranded RNAを指定する際にこれまで用いていた --forward-prob オプションがdeprecatedとなり、--strandedness というオプションに置き換わったようです。ですので、stranded RNA-seqの場合は --strandedness reverse, unstrandedの場合は --strandedness none と指定してください。デフォルトはnoneなので上のコマンドにおいて--strandedness noneは省略可能ですが、わかりやすさのために敢えて記載しています。

indexの指定方法がSTARの時と異なることに注意してください。完了するとstarディレクトリの中に*.genes.resultsと*.isoforms.results というファイルが生成され、それぞれ遺伝子単位、transcript単位の発現量（count, TPM, FPKM）が出力されます。発現量としてはTPMを使うのが最も望ましいでしょう。
ここではstarディレクトリの中に発現量も出力していますが、場合によっては別にrsemディレクトリを作って発現量をそちらに格納する方がSTARとRSEMの区別がはっきりして良いかもしれません。

また長くなったので、ファイルのマージ方法は次回に回します。次回は年明けの予定です。

STAR-RSEMによる発現量推定その3 - Palmsonntagmorgen