Accurate identification and analysis of human mRNA isoforms using deep long read sequencing. Academic Article

Overview

abstract

  • Precise identification of RNA-coding regions and transcriptomes of eukaryotes is a significant problem in biology. Currently, eukaryote transcriptomes are analyzed using deep short-read sequencing experiments of complementary DNAs. The resulting short reads are then aligned against a genome and annotated junctions to infer biological meaning. Here we use long-read complementary DNA datasets for the analysis of a eukaryotic transcriptome and generate two large datasets in the human K562 and HeLa S3 cell lines. Both datasets comprised at least 4 million reads and had median read lengths greater than 500 bp. We show that annotation-independent alignments of these reads provide partial gene structures that are very much in line with annotated gene structures, 15% of which have not been obtained in a previous de novo analysis of short reads. For long noncoding RNA (lncRNA) genes, however, we find an increased fraction of novel gene structures among our alignments. Other important aspects of transcriptome analysis, such as the description of cell type-specific splicing, can be performed in an accurate, reliable and completely annotation-free manner, making this approach ideal for the analysis of transcriptomes of newly sequenced genomes. Furthermore, we demonstrate that long-read sequences can be assembled into full-length transcripts with considerable success. Our method is applicable to all long-read sequencing technologies.

publication date

  • March 1, 2013

Research

keywords

  • Gene Expression Profiling
  • RNA Isoforms
  • RNA, Long Noncoding

Identity

PubMed Central ID

  • PMC3583448

Scopus Document Identifier

  • 84883216800

Digital Object Identifier (DOI)

  • 10.1534/g3.112.004812

PubMed ID

  • 23450794

Additional Document Info

volume

  • 3

issue

  • 3