【单细胞测序】RNA velocity分析：velocyto（一）

dooooob

2020年09月25日 15:45

收录于文集

共23篇

根据前一篇的背景介绍，计算单细胞RNA velocity需要的信息是pre-mRNA （unspliced） counts和mature mRNA （spliced）counts。然后依据这两个矩阵做后续处理与分析。

2018年Nature文章 “RNA velocity in single cells” 的作者Gioele La Manno等人提供了一个工具包velocyto（主页http://velocyto.org/， github主页https://github.com/velocyto-team）来完成这一步。

这个包有python和R两个实现，但其实并不完全一样。

首先python实现包含两个功能：一个命令行工具，用于从mapping的bam文件中得到前述两个矩阵，生成的是一个包含了前述矩阵信息的loom文件；和一个分析工具，用于对loom文件做后续RNA velocity分析

而R实现只是用于单细胞RNA velocity分析，需要的前述两个矩阵信息可以用python版本的命令行工具获取或者其他软件，比如dropEst（https://dropest.readthedocs.io/en/latest/index.html）。

因此，做单细胞RNA velocity的分析，包括以下两步：

从bam文件中计算得到关于spliced mRNA数和unspliced mRNA数，得到对应的矩阵
对得到的矩阵进行后续分析

在这一部分中，先介绍第一步的做法，可以使用的工具包，目前有velocyto和dropEst。

工具包：velocyto

这里用到的是python实现的velocyto，也叫velocyto.py

安装需要先安装numpy scipy cython numba matplotlib scikit-learn h5py click pysam，可直接通过pip安装，或者conda。 pip install numpy scipy cython numba matplotlib scikit-learn h5py click pysam然后安装velocyto pip install velocyto
使用运行velocyto需要的输入数据为bam文件，基因组注释gtf文件，和mask注释gtf文件。输出文件为loom格式，存储的信息是spliced mRNA counts matrice，unspliced mRNA counts matrice，ambiguous mRNA counts matrice。相应的命令为：velocyto run，对于特定平台产生的单细胞数据，有相应的子命令，比如，velocyto run10x, velocyto run_smartseq2, velocyto run_dropest。
1. 准备annotation文件基因组注释文件（gtf）：如果是10X的数据，那么这个gtf可以直接使用对应reference的gtf，位于reference目录下的gene文件夹mask注释文件：从UCSC genome browser获取。 • 选定基因组和对应版本, • Track选择为RepeatMasker, Table为rmsk, output format为GTF
  1. 小鼠mm10：https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=611454127_NtvlaW6xBSIRYJEBI0iRDEWisITa&clade=mammal&org=Mouse&db=mm10&hgta_group=allTracks&hgta_track=rmsk&hgta_table=0&hgta_regionType=genome&position=chr12%3A56694976-56714605&hgta_outputType=primaryTable&hgta_outputType=gff&hgta_outFileName=mm10_rmsk.gtf
  2. 人GRCh38：https://genome.ucsc.edu/cgi-bin/hgTables?hgsid=611454127_NtvlaW6xBSIRYJEBI0iRDEWisITa&clade=mammal&org=Human&db=0&hgta_group=allTracks&hgta_track=rmsk&hgta_table=rmsk&hgta_regionType=genome&position=&hgta_outputType=gff&hgta_outFileName=GRCh38_rmsk.gtf

运行以10X的数据为例基本用法为 velocyto run10x [OPTIONS] SAMPLEFOLDER GTFFILESAMPLEFOLDER指的是cellranger的输出目录，比如运行cellranger时选择的样本ID为sample01，然后在mypath路径下，那么SAMPLEFOLDER为mypath/sample01，在sample01这个文件夹下，有outs文件夹，里面包含了cellranger的输出bam文件，表达矩阵。如果想用到mask文件，那么-m mask文件就行。velocyto run10x -m mask_folder/repeat_mask_name.gtf mypath/sample01 somepath/refdata-cellranger-mm10-3.0.1/genes/genes.gtf最后输出文件在sample01这个文件夹下也可以直接使用 velocyto run这个命令velocyto run [OPTIONS] BAMFILE... GTFFILE-b, --bcfile FILE Valid barcodes file, to filter the bam. If --bcfile is not specified all the cell barcodes will be included. Cell barcodes should be specified in the bcfile as the `CB` tag for each read-o, --outputfolder PATH Output folder, if it does not exist it will be created.-e, --sampleid PATH The sample name that will be used to retrieve informations from metadatatable-m, --mask FILE .gtf file containing intervals to mask-s, --metadatatable FILE Table containing metadata of the various samples (csv formatted, rows are samples and cols are entries)

本文为我原创，未经授权禁止转载

cv7730422

分享至

投诉或建议